[Improvement](doc) improve ngram and inverted index documents #21091

2023-06-25 19:13:41 +08:00
parent ee2492dd78
commit 69d5adaee3
4 changed files with 20 additions and 20 deletions
--- a/docs/en/docs/data-table/index/inverted-index.md
+++ b/docs/en/docs/data-table/index/inverted-index.md
@@ -74,15 +74,15 @@ The features for inverted index is as follows:
      - missing stands for no parser, the whole field is considered to be a term
      - "english" stands for english parser
      - "chinese" stands for chinese parser
-      - "unicode" stands for mixed-type word segmentation suitable for situations with a mix of Chinese and English. It can segment email prefixes and suffixes, IP addresses, and mixed characters and numbers, and can also segment Chinese characters into 1-gram.
+      - "unicode" stands for multi-language mixed word segmentation, suitable for text that mixes Chinese and English. It can segment email prefixes and suffixes, IP addresses, and mixed letters and digits, and it segments Chinese text character by character.

    - "parser_mode" is utilized to set the tokenizer/parser type for Chinese word segmentation.
-      - in "fine_grained" mode, the system will meticulously tokenize each possible segment.
-      - in "coarse_grained" mode, the system follows the maximization principle, performing accurate and comprehensive tokenization.
+      - in "fine_grained" mode, the system tends to generate short words, e.g. 6 words '武汉' '武汉市' '市长' '长江' '长江大桥' '大桥' for '武汉市长江大桥'.
+      - in "coarse_grained" mode, the system tends to generate long words, e.g. 2 words '武汉市' '长江大桥' for '武汉市长江大桥'.
      - default mode is "coarse_grained".
-    - "support_phrase" is utilized to specify if the index requires support for phrase mode. 
-      - "true" indicates that support is needed.
-      - "false" indicates that support is not needed.
+    - "support_phrase" specifies whether the index must support the phrase query MATCH_PHRASE.
+      - "true" indicates that support is needed; the index then requires more storage.
+      - "false" indicates that support is not needed; the index requires less storage. MATCH_ALL can still be used to match multiple words regardless of order.
      - default mode is "false".
  - COMMENT is optional
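
The difference between the two `parser_mode` values can be illustrated with a toy dictionary-based segmenter. This is only a sketch under stated assumptions: the dictionary, the function names `fine_grained` and `coarse_grained`, and the matching strategies (enumerate all dictionary words versus greedy longest match) are hypothetical and are not taken from the Doris implementation. The input is '武汉市长江大桥', the string that the listed tokens imply.

```python
# Toy dictionary: an assumption for illustration, not Doris's real lexicon.
DICTIONARY = {"武汉", "武汉市", "市长", "长江", "长江大桥", "大桥"}


def fine_grained(text, dictionary=DICTIONARY):
    """Emit every dictionary word found at any position (overlaps allowed)."""
    tokens = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
    return tokens


def coarse_grained(text, dictionary=DICTIONARY):
    """Greedy forward maximum match: take the longest word at each position."""
    tokens = []
    max_len = max(len(word) for word in dictionary)
    start = 0
    while start < len(text):
        for length in range(min(max_len, len(text) - start), 0, -1):
            piece = text[start:start + length]
            if piece in dictionary:
                tokens.append(piece)
                start += length
                break
        else:
            # No dictionary word starts here; skip one character.
            start += 1
    return tokens


print(fine_grained("武汉市长江大桥"))    # 6 short, overlapping words
print(coarse_grained("武汉市长江大桥"))  # 2 long, non-overlapping words
```

More tokens per value means a larger index but more ways for a query term to hit, which is the trade-off the two modes expose.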

--- a/docs/en/docs/data-table/index/ngram-bloomfilter-index.md
+++ b/docs/en/docs/data-table/index/ngram-bloomfilter-index.md
@@ -29,7 +29,7 @@ under the License.
 <version since="2.0.0">
 </version>

-In order to improve the like query performance, the NGram BloomFilter index was implemented, which referenced to the ClickHouse's ngrambf skip indices;
+The NGram BloomFilter index was implemented to improve the performance of LIKE queries.
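
To show why an n-gram Bloom filter can speed up LIKE-style substring queries, here is a minimal sketch. Everything in it is an assumption made for illustration: the class `NGramBloomFilter`, its parameters `gram_size`, `num_bits`, and `num_hashes`, and the SHA-256 based hashing are hypothetical and do not describe how Doris stores or hashes its index.

```python
import hashlib


def ngrams(text, n):
    """Return all contiguous substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NGramBloomFilter:
    """Toy Bloom filter over the n-grams of the values in one data block."""

    def __init__(self, gram_size=3, num_bits=256, num_hashes=2):
        self.gram_size = gram_size
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a Python integer

    def _bit_positions(self, gram):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for gram in ngrams(value, self.gram_size):
            for pos in self._bit_positions(gram):
                self.bits |= 1 << pos

    def might_contain(self, pattern):
        """False means no added value contains the pattern; True means maybe."""
        grams = ngrams(pattern, self.gram_size)
        if not grams:
            # Pattern shorter than gram_size: the filter cannot rule anything out.
            return True
        return all(
            (self.bits >> pos) & 1
            for gram in grams
            for pos in self._bit_positions(gram)
        )


block_filter = NGramBloomFilter()
block_filter.add("hello world")
print(block_filter.might_contain("lo wo"))  # True: every 3-gram was added
```

A `False` answer lets a scan skip the whole block without reading it, while a `True` answer still requires checking the rows, because Bloom filters can return false positives but never false negatives.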

 ## Create Column With NGram BloomFilter Index

--- a/docs/zh-CN/docs/data-table/index/inverted-index.md
+++ b/docs/zh-CN/docs/data-table/index/inverted-index.md
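@@ -52,7 +52,7 @@ A brief introduction to the features of the Doris inverted index:
 - Added full-text search on string types
  - Supports full-text search on strings, including matching all of several keywords (MATCH_ALL), matching any one keyword (MATCH_ANY), and matching a phrase (MATCH_PHRASE)
  - Supports full-text search on string array types
-  - Supports English, Chinese, and mixed-type word segmentation
+  - Supports English, Chinese, and Unicode multi-language word segmentation
 - Accelerates ordinary equality and range queries, covering the functionality of the bitmap index, which it will replace in the future
  - Supports fast filtering with =, !=, >, >=, <, <= on string, numeric, and datetime types
  - Supports =, !=, >, >=, <, <= on string, numeric, and datetime array types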
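@@ -72,16 +72,16 @@ A brief introduction to the features of the Doris inverted index:
    - parser specifies the tokenizer
      - if not specified, no tokenization is performed
      - english is English word segmentation, suitable when the indexed column is English; it splits on spaces and punctuation and has high performance
-      - chinese is Chinese word segmentation, suitable when the indexed column contains Chinese or a mix of Chinese and English; its performance is lower than english segmentation
-      - unicode is mixed-type word segmentation, suitable for a mix of Chinese and English. It can segment email prefixes and suffixes, IP addresses, and mixed characters and numbers, and can also segment Chinese characters into 1-grams.
-    - parser_mode specifies the mode for Chinese word segmentation
-      - fine_grained mode: the system exhaustively segments every part that can be segmented
-      - coarse_grained mode: the system follows the maximization principle, performing accurate and comprehensive segmentation
-      - the default is coarse_grained mode
-    - support_phrase specifies whether the index needs to support phrase mode
-      - true means it is needed
-      - false means it is not needed
-      - the default is false (not needed)
+      - chinese is Chinese word segmentation, suitable when the indexed column is mainly Chinese; its performance is lower than english segmentation
+      - unicode is multi-language mixed word segmentation, suitable for mixed Chinese and English or mixed multi-language text. It can segment email prefixes and suffixes, IP addresses, and mixed characters and numbers, and it segments Chinese text character by character.
+    - parser_mode specifies the segmentation mode; currently the following modes are supported when parser = chinese:
+      - fine_grained: fine-grained mode, which tends to produce shorter words; for example, '武汉市长江大桥' is split into the 6 words '武汉', '武汉市', '市长', '长江', '长江大桥', '大桥'
+      - coarse_grained: coarse-grained mode, which tends to produce longer words; for example, '武汉市长江大桥' is split into the 2 words '武汉市' '长江大桥'
+      - the default is coarse_grained
+    - support_phrase specifies whether the index supports accelerating MATCH_PHRASE phrase queries
+      - true means supported, but the index needs more storage space
+      - false means not supported, which saves storage space; MATCH_ALL can be used to query multiple keywords
+      - the default is false
  - COMMENT is optional and specifies a comment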
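
The storage trade-off behind `support_phrase` can be sketched with a toy inverted index: phrase matching needs the position of every term occurrence, while unordered matching only needs to know which documents contain each term. The functions `build_index`, `match_all`, and `match_phrase` below are hypothetical helpers written for this illustration; they split on whitespace and are not the Doris implementation of MATCH_ALL or MATCH_PHRASE.

```python
from collections import defaultdict


def build_index(docs, with_positions):
    """Map term -> {doc_id: list of positions, or None when positions are off}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            if with_positions:
                index[term].setdefault(doc_id, []).append(pos)
            else:
                index[term][doc_id] = None
    return index


def match_all(index, query):
    """Documents containing every query term, in any order."""
    terms = query.lower().split()
    doc_sets = [set(index.get(term, {})) for term in terms]
    return set.intersection(*doc_sets) if doc_sets else set()


def match_phrase(index, query):
    """Documents containing the query terms consecutively and in order.

    Requires an index built with with_positions=True.
    """
    terms = query.lower().split()
    hits = set()
    for doc_id in match_all(index, query):
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[term][doc_id] for i, term in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


docs = {1: "quick brown fox", 2: "brown quick fox"}
positional = build_index(docs, with_positions=True)
print(match_all(positional, "quick brown"))     # {1, 2}
print(match_phrase(positional, "quick brown"))  # {1}
```

Built with `with_positions=False`, the same index still answers `match_all` but stores no position lists, which mirrors why disabling phrase support saves space.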

 ```sql
@@ -150,7 +150,7 @@ USE test_inverted_index;

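 -- Create the inverted index idx_comment on comment together with the table
 --   USING INVERTED specifies that the index type is an inverted index
---   PROPERTIES("parser" = "english") specifies english word segmentation; "chinese" Chinese segmentation and "unicode" mixed Chinese-English segmentation are also supported; if the "parser" parameter is not specified, no segmentation is performed
+--   PROPERTIES("parser" = "english") specifies english word segmentation; "chinese" Chinese segmentation and "unicode" mixed Chinese-English multi-language segmentation are also supported; if the "parser" parameter is not specified, no segmentation is performed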
 CREATE TABLE hackernews_1m
 (
    `id` BIGINT,
--- a/docs/zh-CN/docs/data-table/index/ngram-bloomfilter-index.md
+++ b/docs/zh-CN/docs/data-table/index/ngram-bloomfilter-index.md
@@ -29,7 +29,7 @@ under the License.
 <version since="2.0.0">
 </version>

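-To improve the performance of like queries, the NGram BloomFilter index was added; its implementation mainly follows ClickHouse's ngrambf.
+To improve the performance of like queries, the NGram BloomFilter index was added.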

 ## Creating an NGram BloomFilter