diff --git a/docs/en/docs/sql-manual/sql-functions/bitmap-functions/bitmap_hash.md b/docs/en/docs/sql-manual/sql-functions/bitmap-functions/bitmap_hash.md
index 1b19f5a07c..20a7324778 100644
--- a/docs/en/docs/sql-manual/sql-functions/bitmap-functions/bitmap_hash.md
+++ b/docs/en/docs/sql-manual/sql-functions/bitmap-functions/bitmap_hash.md
@@ -25,28 +25,90 @@ under the License.
 -->
 
 ## bitmap_hash
-### description
+
+### Name
+
+BITMAP_HASH
+
+### Description
+
+Computes the 32-bit hash value of the input, which may be of any type, and returns a BITMAP containing that hash value. The function uses MurMur3 because it is fast and has a low collision rate. More importantly, the MurMur3 output distribution is close to random, as verified by the chi-square test. Note that different hardware platforms and different seeds may change the result of MurMur3. For more information about its performance, see [Smhasher](http://rurban.github.io/smhasher/).
+
 #### Syntax
 
-`BITMAP BITMAP_HASH(expr)`
-
-Compute the 32-bits hash value of a expr of any type, then return a bitmap containing that hash value. Mainly be used to load non-integer value into bitmap column, e.g.,
-
 ```
-cat data | curl --location-trusted -u user:passwd -T - -H "columns: dt,page,device_id, device_id=bitmap_hash(device_id)" http://host:8410/api/test/testDb/_stream_load
+BITMAP BITMAP_HASH()
 ```
 
-### example
+#### Arguments
 
-```
-mysql> select bitmap_count(bitmap_hash('hello'));
-+------------------------------------+
-| bitmap_count(bitmap_hash('hello')) |
-+------------------------------------+
-|                                  1 |
-+------------------------------------+
+``
+any value or expression.
+
+#### Return Type
+
+BITMAP
+
+#### Remarks
+
+Generally, the 32-bit MurMur hash works well on random, short strings, with a collision rate of about one in a billion. Longer strings, such as file system paths, collide more often: if you hash every path on your system, you will find many collisions.
+
+The following two strings have the same MurMur3 hash value.
+
+```sql
+SELECT bitmap_to_string(bitmap_hash('/System/Volumes/Data/Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk/System/Library/Frameworks/KernelManagement.framework/KernelManagement.tbd')) AS a,
+       bitmap_to_string(bitmap_hash('/System/Library/PrivateFrameworks/Install.framework/Versions/Current/Resources/es_419.lproj/Architectures.strings')) AS b;
 ```
 
-### keywords
+Here is the result.
+
+```text
++-----------+-----------+
+| a         | b         |
++-----------+-----------+
+| 282251871 | 282251871 |
++-----------+-----------+
+```
+
+### Example
+
+To calculate the MurMur3 hash of a value:
+
+```sql
+select bitmap_to_array(bitmap_hash('hello'))[1];
+```
+
+Here is the result.
+
+```text
++-------------------------------------------------------------+
+| %element_extract%(bitmap_to_array(bitmap_hash('hello')), 1) |
++-------------------------------------------------------------+
+|                                                  1321743225 |
++-------------------------------------------------------------+
+```
+
+To count the distinct values in a column, using a bitmap can perform better than `count distinct` in some scenarios.
+
+```sql
+select bitmap_count(bitmap_union(bitmap_hash(`word`))) from `words`;
+```
+
+Here is the result.
+
+```text
++-------------------------------------------------+
+| bitmap_count(bitmap_union(bitmap_hash(`word`))) |
++-------------------------------------------------+
+|                                        33263478 |
++-------------------------------------------------+
+```
+
+### Keywords
 
 BITMAP_HASH,BITMAP
+
+### Best Practice
+
+For more information, see also:
+- [BITMAP_HASH64](./bitmap_hash64.md)
diff --git a/docs/zh-CN/docs/sql-manual/sql-functions/bitmap-functions/bitmap_hash.md b/docs/zh-CN/docs/sql-manual/sql-functions/bitmap-functions/bitmap_hash.md
index 37496a45ed..c9d64f7ca3 100644
--- a/docs/zh-CN/docs/sql-manual/sql-functions/bitmap-functions/bitmap_hash.md
+++ b/docs/zh-CN/docs/sql-manual/sql-functions/bitmap-functions/bitmap_hash.md
@@ -25,28 +25,90 @@ under the License.
 -->
 
 ## bitmap_hash
-### description
+
+### Name
+
+BITMAP_HASH
+
+### Description
+
+对任意类型的输入,计算其 32 位的哈希值,并返回包含该哈希值的 bitmap。该函数使用的哈希算法为 MurMur3。MurMur3 算法是一种高性能的、低碰撞率的散列算法,其计算出来的值接近于随机分布,并且能通过卡方分布测试。需要注意的是,不同硬件平台、不同 Seed 值计算出来的散列值可能不同。关于此算法的性能可以参考 [Smhasher](http://rurban.github.io/smhasher/) 排行榜。
+
 #### Syntax
 
-`BITMAP BITMAP_HASH(expr)`
-
-对任意类型的输入计算32位的哈希值,返回包含该哈希值的bitmap。主要用于stream load任务将非整型字段导入Doris表的bitmap字段。例如
-
 ```
-cat data | curl --location-trusted -u user:passwd -T - -H "columns: dt,page,device_id, device_id=bitmap_hash(device_id)" http://host:8410/api/test/testDb/_stream_load
+BITMAP BITMAP_HASH()
 ```
 
-### example
+#### Arguments
 
-```
-mysql> select bitmap_count(bitmap_hash('hello'));
-+------------------------------------+
-| bitmap_count(bitmap_hash('hello')) |
-+------------------------------------+
-|                                  1 |
-+------------------------------------+
+``
+任何值或字段表达式。
+
+#### Return Type
+
+BITMAP
+
+#### Remarks
+
+一般来说,MurMur 32 位算法对于完全随机的、较短的字符串的散列效果较好,碰撞率能达到几十亿分之一,但对于较长的字符串,比如你的操作系统路径,碰撞率会比较高。如果你扫描你系统里的路径,就会发现碰撞率仅仅只能达到百万分之一甚至是十万分之一。
+
+下面两个字符串的 MurMur3 散列值是一样的:
+
+```sql
+SELECT bitmap_to_string(bitmap_hash('/System/Volumes/Data/Library/Developer/CommandLineTools/SDKs/MacOSX12.3.sdk/System/Library/Frameworks/KernelManagement.framework/KernelManagement.tbd')) AS a,
+       bitmap_to_string(bitmap_hash('/System/Library/PrivateFrameworks/Install.framework/Versions/Current/Resources/es_419.lproj/Architectures.strings')) AS b;
 ```
 
-### keywords
+结果如下:
+
+```text
++-----------+-----------+
+| a         | b         |
++-----------+-----------+
+| 282251871 | 282251871 |
++-----------+-----------+
+```
+
+### Example
+
+如果你想计算某个值的 MurMur3,你可以:
+
+```sql
+select bitmap_to_array(bitmap_hash('hello'))[1];
+```
+
+结果如下:
+
+```text
++-------------------------------------------------------------+
+| %element_extract%(bitmap_to_array(bitmap_hash('hello')), 1) |
++-------------------------------------------------------------+
+|                                                  1321743225 |
++-------------------------------------------------------------+
+```
+
+如果你想统计某一列去重后的个数,可以使用位图的方式,某些场景下性能比 `count distinct` 好很多:
+
+```sql
+select bitmap_count(bitmap_union(bitmap_hash(`word`))) from `words`;
+```
+
+结果如下:
+
+```text
++-------------------------------------------------+
+| bitmap_count(bitmap_union(bitmap_hash(`word`))) |
++-------------------------------------------------+
+|                                        33263478 |
++-------------------------------------------------+
+```
+
+### Keywords
 
 BITMAP_HASH,BITMAP
+
+### Best Practice
+
+还可参见
+- [BITMAP_HASH64](./bitmap_hash64.md)