From 69d5adaee3cb1e2595c586a9f984a481de239a1c Mon Sep 17 00:00:00 2001
From: Kang
Date: Sun, 25 Jun 2023 19:13:41 +0800
Subject: [PATCH] [Improvement](doc) improve ngram and inverted index documents #21091
---
 .../docs/data-table/index/inverted-index.md   | 12 +++++-----
 .../index/ngram-bloomfilter-index.md          |  2 +-
 .../docs/data-table/index/inverted-index.md   | 24 +++++++++----------
 .../index/ngram-bloomfilter-index.md          |  2 +-
 4 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/docs/en/docs/data-table/index/inverted-index.md b/docs/en/docs/data-table/index/inverted-index.md
index 57216d8ad4..331fa90491 100644
--- a/docs/en/docs/data-table/index/inverted-index.md
+++ b/docs/en/docs/data-table/index/inverted-index.md
@@ -74,15 +74,15 @@ The features for inverted index is as follows:
     - missing stands for no parser, the whole field is considered to be a term
     - "english" stands for english parser
     - "chinese" stands for chinese parser
-    - "unicode" stands for mixed-type word segmentation suitable for situations with a mix of Chinese and English. It can segment email prefixes and suffixes, IP addresses, and mixed characters and numbers, and can also segment Chinese characters into 1-gram.
+    - "unicode" stands for multi-language mixed word segmentation suitable for situations with a mix of Chinese and English. It can segment email prefixes and suffixes, IP addresses, and mixed characters and numbers, and can also segment Chinese characters one by one.
   - "parser_mode" is utilized to set the tokenizer/parser type for Chinese word segmentation.
-    - in "fine_grained" mode, the system will meticulously tokenize each possible segment.
-    - in "coarse_grained" mode, the system follows the maximization principle, performing accurate and comprehensive tokenization.
+    - in "fine_grained" mode, the system tends to generate short words, e.g. 6 words '武汉' '武汉市' '市长' '长江' '长江大桥' '大桥' for '武汉长江大桥'.
+    - in "coarse_grained" mode, the system tends to generate long words, e.g. 2 words '武汉市' '长江大桥' for '武汉长江大桥'.
     - default mode is "coarse_grained".
-  - "support_phrase" is utilized to specify if the index requires support for phrase mode.
-    - "true" indicates that support is needed.
-    - "false" indicates that support is not needed.
+  - "support_phrase" is utilized to specify if the index requires support for the phrase query MATCH_PHRASE.
+    - "true" indicates that support is needed, at the cost of more index storage.
+    - "false" indicates that support is not needed, which uses less index storage. MATCH_ALL can be used to match multiple words regardless of order.
     - default mode is "false".
 - COMMENT is optional
diff --git a/docs/en/docs/data-table/index/ngram-bloomfilter-index.md b/docs/en/docs/data-table/index/ngram-bloomfilter-index.md
index e3e04eb315..d804c28b7e 100644
--- a/docs/en/docs/data-table/index/ngram-bloomfilter-index.md
+++ b/docs/en/docs/data-table/index/ngram-bloomfilter-index.md
@@ -29,7 +29,7 @@ under the License.
-In order to improve the like query performance, the NGram BloomFilter index was implemented, which referenced to the ClickHouse's ngrambf skip indices;
+In order to improve the like query performance, the NGram BloomFilter index was implemented.
 ## Create Column With NGram BloomFilter Index
diff --git a/docs/zh-CN/docs/data-table/index/inverted-index.md b/docs/zh-CN/docs/data-table/index/inverted-index.md
index 3ac4992519..15f7485d8e 100644
--- a/docs/zh-CN/docs/data-table/index/inverted-index.md
+++ b/docs/zh-CN/docs/data-table/index/inverted-index.md
@@ -52,7 +52,7 @@ Doris倒排索引的功能简要介绍如下:
 - 增加了字符串类型的全文检索
   - 支持字符串全文检索,包括同时匹配多个关键字MATCH_ALL、匹配任意一个关键字MATCH_ANY、匹配短语词组MATCH_PHRASE
   - 支持字符串数组类型的全文检索
-  - 支持英文、中文以及混合类型分词
+  - 支持英文、中文以及Unicode多语言分词
 - 加速普通等值、范围查询,覆盖bitmap索引的功能,未来会代替bitmap索引
   - 支持字符串、数值、日期时间类型的 =, !=, >, >=, <, <= 快速过滤
   - 支持字符串、数字、日期时间数组类型的 =, !=, >, >=, <, <=
@@ -72,16 +72,16 @@ Doris倒排索引的功能简要介绍如下:
   - parser指定分词器
     - 默认不指定代表不分词
     - english是英文分词,适合被索引列是英文的情况,用空格和标点符号分词,性能高
-    - chinese是中文分词,适合被索引列有中文或者中英文混合的情况,性能比english分词低
-    - unicode是混合类型分词,适用于中英文混合的情况。它能够对邮箱前缀和后缀、IP地址以及字符数字混合进行分词,并且可以对中文字符进行1-gram分词。
-  - parser_mode用于指定中文分词的模式
-    - fine_grained模式,系统将对可以进行分词的部分都进行详尽的分词处理
-    - coarse_grained模式,系统则依据最大化原则,执行精确且全面的分词操作
-    - 默认coarse_grained模式
-  - support_phrase用于指定索引是否需要支持短语模式
-    - true为需要
-    - false为不需要
-    - 默认false不需要
+    - chinese是中文分词,适合被索引列主要是中文的情况,性能比english分词低
+    - unicode是多语言混合类型分词,适用于中英文混合、多语言混合的情况。它能够对邮箱前缀和后缀、IP地址以及字符数字混合进行分词,并且可以对中文按字符分词。
+  - parser_mode用于指定分词的模式,目前parser = chinese时支持如下几种模式:
+    - fine_grained:细粒度模式,倾向于分出比较短的词,比如 '武汉长江大桥' 会分成 '武汉', '武汉市', '市长', '长江', '长江大桥', '大桥' 6个词
+    - coarse_grained:粗粒度模式,倾向于分出比较长的词,比如 '武汉长江大桥' 会分成 '武汉市' '长江大桥' 2个词
+    - 默认coarse_grained
+  - support_phrase用于指定索引是否支持MATCH_PHRASE短语查询加速
+    - true为支持,但是索引需要更多的存储空间
+    - false为不支持,更省存储空间,可以用MATCH_ALL查询多个关键字
+    - 默认false
 - COMMENT 是可选的,用于指定注释
 ```sql
@@ -150,7 +150,7 @@ USE test_inverted_index;
 -- 创建表的同时创建了comment的倒排索引idx_comment
 -- USING INVERTED 指定索引类型是倒排索引
--- PROPERTIES("parser" = "english") 指定采用english分词,还支持"chinese"中文分词和"unicode"中英文混合分词,如果不指定"parser"参数表示不分词
+-- PROPERTIES("parser" = "english") 指定采用english分词,还支持"chinese"中文分词和"unicode"中英文多语言混合分词,如果不指定"parser"参数表示不分词
 CREATE TABLE hackernews_1m
 (
     `id` BIGINT,
diff --git a/docs/zh-CN/docs/data-table/index/ngram-bloomfilter-index.md b/docs/zh-CN/docs/data-table/index/ngram-bloomfilter-index.md
index ea6304253a..27e2b23592 100644
--- a/docs/zh-CN/docs/data-table/index/ngram-bloomfilter-index.md
+++ b/docs/zh-CN/docs/data-table/index/ngram-bloomfilter-index.md
@@ -29,7 +29,7 @@ under the License.
-为了提升like的查询性能,增加了NGram BloomFilter索引,其实现主要参照了ClickHouse的ngrambf。
+为了提升like的查询性能,增加了NGram BloomFilter索引。
 ## NGram BloomFilter创建
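
Note on the options documented in this patch: the following is a minimal sketch of how "parser", "parser_mode" and "support_phrase" might be combined on one index, and how MATCH_PHRASE and MATCH_ALL differ in use. The table `demo_docs`, the column `body`, the index name `idx_body`, and the bucket and replication settings are hypothetical and chosen only for illustration; they do not come from the patch.

```sql
-- Hypothetical table, for illustration only (names are assumptions).
CREATE TABLE demo_docs
(
    `id` BIGINT,
    `body` STRING,
    -- Chinese parser, fine-grained mode (tends toward shorter words),
    -- with phrase support enabled at the cost of extra index storage.
    INDEX idx_body (`body`) USING INVERTED
        PROPERTIES("parser" = "chinese", "parser_mode" = "fine_grained", "support_phrase" = "true")
        COMMENT 'inverted index with phrase support'
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1
PROPERTIES ("replication_num" = "1");

-- Phrase query: this is the case "support_phrase" = "true" is meant for.
SELECT * FROM demo_docs WHERE `body` MATCH_PHRASE '长江大桥';

-- Multi-keyword match without regard to order: usable even when
-- "support_phrase" is left at its default of "false".
SELECT * FROM demo_docs WHERE `body` MATCH_ALL '武汉 大桥';
```

If phrase queries are not needed, leaving "support_phrase" at its default and relying on MATCH_ALL should keep the index smaller, as the updated text describes.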