[Improvement](doc) improve ngram and inverted index documents #21091

Kang
2023-06-25 19:13:41 +08:00
committed by GitHub
parent ee2492dd78
commit 69d5adaee3
4 changed files with 20 additions and 20 deletions

@@ -74,15 +74,15 @@ The features for inverted index is as follows:
- missing stands for no parser, the whole field is considered to be a term
- "english" stands for english parser
- "chinese" stands for chinese parser
- "unicode" stands for mixed-type word segmentation suitable for situations with a mix of Chinese and English. It can segment email prefixes and suffixes, IP addresses, and mixed characters and numbers, and can also segment Chinese characters into 1-gram.
- "unicode" stands for multi-language mixed word segmentation, suitable for text that mixes Chinese and English. It can segment email prefixes and suffixes, IP addresses, and mixed characters and numbers, and it segments Chinese characters one by one.
- "parser_mode" is utilized to set the tokenizer/parser type for Chinese word segmentation.
- in "fine_grained" mode, the system will meticulously tokenize each possible segment.
- in "coarse_grained" mode, the system follows the maximization principle, performing accurate and comprehensive tokenization.
- in "fine_grained" mode, the system tends to generate short words, e.g. 6 words '武汉' '武汉市' '市长' '长江' '长江大桥' '大桥' for '武汉市长江大桥'.
- in "coarse_grained" mode, the system tends to generate long words, e.g. 2 words '武汉市' '长江大桥' for '武汉市长江大桥'.
- default mode is "coarse_grained".
- "support_phrase" is utilized to specify if the index requires support for phrase mode.
- "true" indicates that support is needed.
- "false" indicates that support is not needed.
- "support_phrase" specifies whether the index needs to support the phrase query MATCH_PHRASE.
- "true" indicates that support is needed, at the cost of more index storage.
- "false" indicates that support is not needed, so the index uses less storage. MATCH_ALL can still be used to match multiple words regardless of order.
- default mode is "false".
- COMMENT is optional
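
A minimal sketch of how the properties above might be combined, assuming a hypothetical table `demo_docs` with a `content` column and an index named `idx_content` (none of these names come from the original document, and the key, distribution, and replication clauses are placeholders that may need adjusting for a real cluster):

```sql
-- Hypothetical table; all names are chosen for illustration only.
CREATE TABLE demo_docs
(
    `id` BIGINT,
    `content` STRING,
    -- Chinese parser, short-word segmentation, phrase queries enabled
    INDEX idx_content (`content`) USING INVERTED
        PROPERTIES("parser" = "chinese", "parser_mode" = "fine_grained", "support_phrase" = "true")
        COMMENT 'inverted index on content'
)
DUPLICATE KEY(`id`)
DISTRIBUTED BY HASH(`id`) BUCKETS 1
PROPERTIES ("replication_num" = "1");

-- Phrase query: intended for indexes created with "support_phrase" = "true"
SELECT * FROM demo_docs WHERE content MATCH_PHRASE '长江大桥';

-- Match several words regardless of order: usable even with "support_phrase" = "false"
SELECT * FROM demo_docs WHERE content MATCH_ALL '武汉 大桥';
```

Setting "support_phrase" to "false" in this sketch would trade MATCH_PHRASE support for a smaller index, as the list above describes.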

@@ -29,7 +29,7 @@ under the License.
<version since="2.0.0">
</version>
To improve LIKE query performance, the NGram BloomFilter index was implemented, modeled on ClickHouse's ngrambf skip indices;
To improve LIKE query performance, the NGram BloomFilter index was implemented.
## Create Column With NGram BloomFilter Index

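@@ -52,7 +52,7 @@ A brief overview of the Doris inverted index features:
- Adds full-text search on string types
- Supports full-text search on strings, including matching all of several keywords (MATCH_ALL), matching any one keyword (MATCH_ANY), and matching a phrase (MATCH_PHRASE)
- Supports full-text search on string array types
- Supports English, Chinese, and mixed-type word segmentation
- Supports English, Chinese, and Unicode multi-language word segmentation
- Accelerates ordinary equality and range queries, covering the functionality of the bitmap index, which it will replace in the future
- Supports fast filtering with =, !=, >, >=, <, <= on string, numeric, and datetime types
- Supports =, !=, >, >=, <, <= on arrays of string, numeric, and datetime types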
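@@ -72,16 +72,16 @@ A brief overview of the Doris inverted index features:
- parser specifies the tokenizer
- If it is not specified, no tokenization is performed
- english is English tokenization, suitable when the indexed column is English; it splits on spaces and punctuation and has high performance
- chinese is Chinese tokenization, suitable when the indexed column contains Chinese or mixed Chinese and English; its performance is lower than that of the english tokenizer
- unicode is mixed-type tokenization, suitable for mixed Chinese and English text. It can tokenize email prefixes and suffixes, IP addresses, and mixed characters and numbers, and it can tokenize Chinese characters into 1-grams.
- parser_mode specifies the mode for Chinese tokenization
- fine_grained mode: the system exhaustively tokenizes every part that can be tokenized
- coarse_grained mode: the system follows the maximization principle, performing accurate and comprehensive tokenization
- The default is coarse_grained mode
- support_phrase specifies whether the index needs to support phrase mode
- true means it is needed
- false means it is not needed
- The default is false (not needed)
- chinese is Chinese tokenization, suitable when the indexed column is mainly Chinese; its performance is lower than that of the english tokenizer
- unicode is multi-language mixed tokenization, suitable for mixed Chinese and English or mixed multi-language text. It can tokenize email prefixes and suffixes, IP addresses, and mixed characters and numbers, and it can tokenize Chinese characters.
- parser_mode specifies the tokenization mode; currently the following modes are supported when parser = chinese:
- fine_grained: fine-grained mode, which tends to produce shorter words; for example, '武汉市长江大桥' is split into 6 words: '武汉', '武汉市', '市长', '长江', '长江大桥', '大桥'
- coarse_grained: coarse-grained mode, which tends to produce longer words; for example, '武汉市长江大桥' is split into 2 words: '武汉市', '长江大桥'
- The default is coarse_grained
- support_phrase specifies whether the index supports acceleration of MATCH_PHRASE phrase queries
- true means supported, but the index requires more storage space
- false means not supported, which saves storage space; MATCH_ALL can be used to query multiple keywords
- The default is false
- COMMENT is optional and specifies a comment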
```sql
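@@ -150,7 +150,7 @@ USE test_inverted_index;
-- The inverted index idx_comment on comment is created together with the table
-- USING INVERTED specifies that the index type is an inverted index
-- PROPERTIES("parser" = "english") selects the english tokenizer; the "chinese" (Chinese) and "unicode" (mixed Chinese and English) tokenizers are also supported; if "parser" is not specified, no tokenization is performed
-- PROPERTIES("parser" = "english") selects the english tokenizer; the "chinese" (Chinese) and "unicode" (mixed Chinese, English, and multi-language) tokenizers are also supported; if "parser" is not specified, no tokenization is performed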
CREATE TABLE hackernews_1m
(
`id` BIGINT,

@@ -29,7 +29,7 @@ under the License.
<version since="2.0.0">
</version>
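To improve LIKE query performance, the NGram BloomFilter index was added; its implementation mainly follows ClickHouse's ngrambf
To improve LIKE query performance, the NGram BloomFilter index was added.
## Creating an NGram BloomFilter Index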