Fixes #2771
Main changes in this CL:
* RoaringBitmap is renamed to BitmapValue and moved into bitmap_value.h
* Roaring64Map is leveraged to support unsigned BIGINT for the BITMAP type
* Two new formats (SINGLE64 and BITMAP64) are introduced for the BITMAP type
So far we have three storage formats for the BITMAP type:
```
EMPTY := TypeCode(0x00)
SINGLE32 := TypeCode(0x01), UInt32LittleEndian
BITMAP32 := TypeCode(0x02), RoaringBitmap(defined by https://github.com/RoaringBitmap/RoaringFormatSpec/)
```
To support BIGINT elements while keeping backward compatibility, this CL introduces two new formats:
```
SINGLE64 := TypeCode(0x03), UInt64LittleEndian
BITMAP64 := TypeCode(0x04), CustomRoaringBitmap64
```
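As a rough illustration of the byte layout listed above, the following C++ sketch encodes the two single-element formats. The helper names (`append_le`, `encode_single32`, `encode_single64`) are hypothetical and are not taken from bitmap_value.h; only the type codes and the little-endian payload come from the format description.

```cpp
#include <cstdint>
#include <vector>

// Type codes as listed in the format description.
enum BitmapTypeCode : uint8_t {
    EMPTY = 0x00,
    SINGLE32 = 0x01,
    BITMAP32 = 0x02,
    SINGLE64 = 0x03,
    BITMAP64 = 0x04,
};

// Hypothetical helper: append the low `width` bytes of `value`, least significant first.
inline void append_le(std::vector<uint8_t>& out, uint64_t value, int width) {
    for (int i = 0; i < width; ++i) {
        out.push_back(static_cast<uint8_t>(value >> (8 * i)));
    }
}

// Hypothetical helper: SINGLE32 := TypeCode(0x01), UInt32LittleEndian (5 bytes total).
inline std::vector<uint8_t> encode_single32(uint32_t value) {
    std::vector<uint8_t> out;
    out.push_back(SINGLE32);
    append_le(out, value, 4);
    return out;
}

// Hypothetical helper: SINGLE64 := TypeCode(0x03), UInt64LittleEndian (9 bytes total).
inline std::vector<uint8_t> encode_single64(uint64_t value) {
    std::vector<uint8_t> out;
    out.push_back(SINGLE64);
    append_le(out, value, 8);
    return out;
}
```

Under these assumptions a single 32-bit element costs 5 bytes and a single 64-bit element costs 9 bytes, which is why keeping SINGLE32 around is worthwhile.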
Please note that SINGLE64/BITMAP64 don't replace SINGLE32/BITMAP32. Doris automatically chooses the smaller (in terms of space) format during serialization. For example, BITMAP32 is preferred over BITMAP64 when the maximum element is <= UINT32_MAX. This also makes BE rollback possible as long as the user hasn't written any element larger than UINT32_MAX into a bitmap column.
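The selection rule can be sketched as below. This is an illustration only: `choose_format` is a hypothetical function, and the assumption that a one-element bitmap is written as SINGLE32/SINGLE64 and an empty one as EMPTY is inferred from the format names rather than stated in this description.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: pick the on-disk format from the bitmap's
// cardinality and its largest element.
inline std::string choose_format(std::size_t cardinality, uint64_t max_element) {
    if (cardinality == 0) {
        return "EMPTY";
    }
    // The 32-bit formats are usable only if every element fits in 32 bits,
    // which is equivalent to the maximum element fitting.
    const bool fits32 = max_element <= UINT32_MAX;
    if (cardinality == 1) {
        return fits32 ? "SINGLE32" : "SINGLE64";
    }
    return fits32 ? "BITMAP32" : "BITMAP64";
}
```

With this rule, data that never exceeds UINT32_MAX is always written with type codes 0x00 to 0x02, which an older BE already understands.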
Another important design decision is that we fork and maintain our own version of Roaring64Map instead of using the one in "roaring/roaring64map.hh". The reasons are:
1. RoaringBitmap doesn't define a standard binary format for 64-bit bitmaps. As a result, different implementations of Roaring64Map use different formats. For example, the [C++ version](https://github.com/RoaringBitmap/CRoaring/blob/v0.2.60/cpp/roaring64map.hh#L545) is different from the [Java version](35104c564e/src/main/java/org/roaringbitmap/longlong/Roaring64NavigableMap.java (L1097)). Even for CRoaring, the format may change in future releases. However, Doris requires the serialized format to be stable across versions. Forking is a safe way to achieve this.
2. We may want to change the Roaring64Map code according to our needs. For example, in order to use the BITMAP32 format when the maximum element can be represented in 32 bits, we may need to access the private members of Roaring64Map. Another example is that we want to further customize and optimize the format for the BITMAP64 case, such as using vint64 instead of uint64 for the map size.
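To show why a variable-length map size saves space, here is a minimal sketch of a base-128 varint codec. It assumes "vint64" refers to the common encoding with 7 payload bits per byte and a continuation bit; the function names `put_varint64` and `get_varint64` are hypothetical and not taken from the Doris code base.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Append `value` as a base-128 varint: 7 payload bits per byte, low bits first,
// with the high bit set on every byte except the last one.
inline void put_varint64(std::vector<uint8_t>& out, uint64_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

// Decode a varint starting at in[pos] and advance pos past it.
// Returns false if the input ends early or the varint exceeds 10 bytes.
inline bool get_varint64(const std::vector<uint8_t>& in, std::size_t& pos, uint64_t& value) {
    value = 0;
    for (int shift = 0; shift <= 63 && pos < in.size(); shift += 7) {
        const uint8_t byte = in[pos++];
        value |= static_cast<uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) {
            return true;
        }
    }
    return false;
}
```

A map size below 128 then takes 1 byte instead of the 8 bytes of a fixed uint64, at the cost of up to 10 bytes for the very largest values.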
BITMAP
Create table
When creating the table, use the aggregate model: the column type is bitmap and the aggregate function is bitmap_union
CREATE TABLE `pv_bitmap` (
`dt` int(11) NULL COMMENT "",
`page` varchar(10) NULL COMMENT "",
`user_id` bitmap BITMAP_UNION NULL COMMENT ""
) ENGINE=OLAP
AGGREGATE KEY(`dt`, `page`)
COMMENT "OLAP"
DISTRIBUTED BY HASH(`dt`) BUCKETS 2;
Note: when the data volume is large, it is best to create a corresponding rollup table for high-frequency bitmap_union queries
ALTER TABLE pv_bitmap ADD ROLLUP pv (page, user_id);
Data Load
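TO_BITMAP(expr): converts an unsigned bigint in the range 0 ~ 18446744073709551615 to a bitmap
BITMAP_EMPTY(): generates an empty bitmap column, used to fill in a default value during insert or load
BITMAP_HASH(expr): converts a column of any type to a bitmap by hashing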
Stream Load
cat data | curl --location-trusted -u user:passwd -T - -H "columns: dt,page,user_id, user_id=to_bitmap(user_id)" http://host:8410/api/test/testDb/_stream_load
cat data | curl --location-trusted -u user:passwd -T - -H "columns: dt,page,user_id, user_id=bitmap_hash(user_id)" http://host:8410/api/test/testDb/_stream_load
cat data | curl --location-trusted -u user:passwd -T - -H "columns: dt,page,user_id, user_id=bitmap_empty()" http://host:8410/api/test/testDb/_stream_load
Insert Into
The column type of id2 is bitmap
insert into bitmap_table1 select id, id2 from bitmap_table2;
The column type of id2 is bitmap
INSERT INTO bitmap_table1 (id, id2) VALUES (1001, to_bitmap(1000)), (1001, to_bitmap(2000));
The column type of id2 is bitmap
insert into bitmap_table1 select id, bitmap_union(id2) from bitmap_table2 group by id;
The column type of id2 is int
insert into bitmap_table1 select id, to_bitmap(id2) from table;
The column type of id2 is String
insert into bitmap_table1 select id, bitmap_hash(id_string) from table;
Data Query
Syntax
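BITMAP_UNION(expr): computes the union of the input bitmaps and returns a new bitmap
BITMAP_UNION_COUNT(expr): computes the union of the input bitmaps and returns its cardinality; equivalent to BITMAP_COUNT(BITMAP_UNION(expr)). BITMAP_UNION_COUNT is currently the recommended choice because it performs better than BITMAP_COUNT(BITMAP_UNION(expr))
BITMAP_UNION_INT(expr): computes the number of distinct values in a column of type TINYINT, SMALLINT or INT; the return value is the same as COUNT(DISTINCT expr)
INTERSECT_COUNT(bitmap_column_to_count, filter_column, filter_values ...): computes the cardinality of the intersection of the bitmaps that match the filter_column conditions.
bitmap_column_to_count is a column of type bitmap, filter_column is the dimension column that varies, and filter_values is the list of dimension values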
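The behaviour of INTERSECT_COUNT can be modelled with ordinary sets. The sketch below is only a reading aid and not the Doris implementation: `std::set<uint64_t>` stands in for a bitmap, and `intersect_count_model`, `Bitmap` and `Row` are hypothetical names. It assumes that the bitmaps of all rows sharing one filter value are unioned first and that the per-value results are then intersected.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Bitmap = std::set<uint64_t>;           // stand-in for a real bitmap
using Row = std::pair<std::string, Bitmap>;  // (filter_column value, bitmap column)

// Model of INTERSECT_COUNT(bitmap_column, filter_column, filter_values...).
inline std::size_t intersect_count_model(const std::vector<Row>& rows,
                                         const std::vector<std::string>& filter_values) {
    if (filter_values.empty()) {
        return 0;
    }
    Bitmap result;
    for (std::size_t i = 0; i < filter_values.size(); ++i) {
        // Union every bitmap whose filter_column equals this filter value.
        Bitmap merged;
        for (const Row& row : rows) {
            if (row.first == filter_values[i]) {
                merged.insert(row.second.begin(), row.second.end());
            }
        }
        if (i == 0) {
            result = merged;
        } else {
            // Keep only the elements that also appear for this filter value.
            Bitmap kept;
            for (uint64_t v : result) {
                if (merged.count(v) != 0) {
                    kept.insert(v);
                }
            }
            result = kept;
        }
    }
    return result.size();
}
```

With a single filter value the model reduces to the cardinality of the union for that value; with several values it counts the elements common to all of them.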
Example
The following SQL uses the pv_bitmap table above as an example:
Compute the distinct count of user_id:
select bitmap_union_count(user_id) from pv_bitmap;
select bitmap_count(bitmap_union(user_id)) from pv_bitmap;
Compute the distinct count of id:
select bitmap_union_int(id) from pv_bitmap;
Compute the retention of user_id:
select intersect_count(user_id, page, 'meituan') as meituan_uv,
intersect_count(user_id, page, 'waimai') as waimai_uv,
intersect_count(user_id, page, 'meituan', 'waimai') as retention -- number of users who appear on both the 'meituan' and 'waimai' pages
from pv_bitmap
where page in ('meituan', 'waimai');
keyword
BITMAP,BITMAP_COUNT,BITMAP_EMPTY,BITMAP_UNION,BITMAP_UNION_INT,TO_BITMAP,BITMAP_UNION_COUNT,INTERSECT_COUNT