[feature] Support pre-aggregation for quantile type (#8234)

Add a new column-type to speed up the approximation of quantiles.
1. The new column-type is named `quantile_state`, with the fixed aggregation function `quantile_union`. It stores the intermediate results of pre-aggregated approximate quantile calculations.
2. Support pre-aggregation of the new column-type and the `quantile_state`-related functions.
spaces-x
2022-03-24 09:11:34 +08:00
committed by GitHub
parent 36c85d2f06
commit bea9a7ba4f
67 changed files with 1498 additions and 153 deletions

@ -88,6 +88,11 @@ Syntax:
This type can only be queried by hll_union_agg, hll_cardinality, hll_hash functions.
BITMAP
BITMAP type, No need to specify length. Represent a set of unsigned bigint numbers, the largest element could be 2^64 - 1
QUANTILE_STATE
QUANTILE_STATE type, No need to specify length. Represents the quantile pre-aggregation result. Currently, only numeric raw data types such as `int`, `float`, and `double` are supported.
If the number of elements is less than 2048, the explicit data is stored.
If the number of elements is greater than 2048, the intermediate result of the TDigest pre-aggregation is stored.
```
agg_type: Aggregation type. If not specified, the column is key column. Otherwise, the column is value column.
@ -95,8 +100,14 @@ Syntax:
* HLL_UNION: Only for HLL type
* REPLACE_IF_NOT_NULL: The meaning of this aggregation type is that substitution will occur if and only if the newly imported data is a non-null value. If the newly imported data is null, Doris will still retain the original value. Note: if NOT NULL is specified in the REPLACE_IF_NOT_NULL column when the user creates the table, Doris will convert it to NULL and will not report an error to the user. Users can leverage this aggregate type to achieve importing some of columns .**It should be noted here that the default value should be NULL, not an empty string. If it is an empty string, you should replace it with an empty string**.
* BITMAP_UNION: Only for BITMAP type
* QUANTILE_UNION: Only for QUANTILE_STATE type
Allow NULL: Default is NOT NULL. NULL value should be represented as `\N` in load source file.
Notice: The origin value of BITMAP_UNION column should be TINYINT, SMALLINT, INT, BIGINT.
Notice:
The origin value of BITMAP_UNION column should be TINYINT, SMALLINT, INT, BIGINT.
The origin value of QUANTILE_UNION column should be a numeric type such as TINYINT, INT, FLOAT, DOUBLE, DECIMAL, etc.
2. index_definition
Syntax:
`INDEX index_name (col_name[, col_name, ...]) [USING BITMAP] COMMENT 'xxxxxx'`
@ -125,6 +136,7 @@ Syntax:
table_name in CREATE TABLE stmt is table is Doris. They can be different or same.
MySQL table created in Doris is for accessing data in MySQL database.
Doris does not maintain and store any data from MySQL table.
2) For broker, properties should include:
```
@ -633,8 +645,21 @@ Syntax:
AGGREGATE KEY(k1, k2)
DISTRIBUTED BY HASH(k1) BUCKETS 32;
```
9. Create 2 colocate join table.
9. Create a table with QUANTILE_UNION column (the origin value of **v1** and **v2** columns must be **numeric** types)
```
CREATE TABLE example_db.example_table
(
k1 TINYINT,
k2 DECIMAL(10, 2) DEFAULT "10.5",
v1 QUANTILE_STATE QUANTILE_UNION,
v2 QUANTILE_STATE QUANTILE_UNION
)
ENGINE=olap
AGGREGATE KEY(k1, k2)
DISTRIBUTED BY HASH(k1) BUCKETS 32;
```
10. Create 2 colocate join table.
```
CREATE TABLE `t1` (
@ -657,7 +682,7 @@ Syntax:
);
```
10. Create a broker table, with file on BOS.
11. Create a broker table, with file on BOS.
```
CREATE EXTERNAL TABLE example_db.table_broker (
@ -675,7 +700,7 @@ Syntax:
);
```
11. Create a table with a bitmap index
12. Create a table with a bitmap index
```
CREATE TABLE example_db.table_hash
@ -692,7 +717,7 @@ Syntax:
DISTRIBUTED BY HASH(k1) BUCKETS 32;
```
12. Create a dynamic partitioning table (dynamic partitioning needs to be enabled in FE configuration), which creates partitions 3 days in advance every day. For example, if today is '2020-01-08', partitions named 'p20200108', 'p20200109', 'p20200110', 'p20200111' will be created.
13. Create a dynamic partitioning table (dynamic partitioning needs to be enabled in FE configuration), which creates partitions 3 days in advance every day. For example, if today is '2020-01-08', partitions named 'p20200108', 'p20200109', 'p20200110', 'p20200111' will be created.
```
[types: [DATE]; keys: [2020-01-08]; ‥types: [DATE]; keys: [2020-01-09]; )
@ -722,7 +747,7 @@ Syntax:
"dynamic_partition.buckets" = "32"
);
```
13. Create a table with rollup index
14. Create a table with rollup index
```
CREATE TABLE example_db.rolup_index_table
(
@ -742,7 +767,7 @@ Syntax:
PROPERTIES("replication_num" = "3");
```
14. Create an in-memory table:
15. Create an in-memory table:
```
CREATE TABLE example_db.table_hash
@ -760,7 +785,7 @@ Syntax:
PROPERTIES ("in_memory"="true");
```
15. Create a hive external table
16. Create a hive external table
```
CREATE TABLE example_db.table_hive
(
@ -777,7 +802,7 @@ Syntax:
);
```
16. Specify the replica distribution of the table through replication_allocation
17. Specify the replica distribution of the table through replication_allocation
```
CREATE TABLE example_db.table_hash

@ -232,7 +232,11 @@ Where url is the url given by ErrorURL.
```curl --location-trusted -u root -H "columns: k1, k2, v1=to_bitmap(k1), v2=bitmap_empty()" -T testData http://host:port/api/testDb/testTbl/_stream_load```
10. a simple load json
10. Load a table with QUANTILE_STATE columns. The source can be columns in the table or a column in the data that is used to generate the QUANTILE_STATE columns. You can also use TO_QUANTILE_STATE to convert numerical data to QUANTILE_STATE. The second argument (2048 in this example) is optional and sets the precision of the TDigest algorithm. The valid range is [2048, 10000]; the larger the value, the higher the precision. The default is 2048.
```curl --location-trusted -u root -H "columns: k1, k2, v1, v2, v1=to_quantile_state(v1, 2048)" -T testData http://host:port/api/testDb/testTbl/_stream_load```
11. a simple load json
table schema:
`category` varchar(512) NULL COMMENT "",
`author` varchar(512) NULL COMMENT "",
@ -247,7 +251,7 @@ Where url is the url given by ErrorURL.
{"category":"Java","author":"avc","title":"Effective Java","price":95}
{"category":"Linux","author":"avc","title":"Linux kernel","price":195}
11. Matched load json by jsonpaths
12. Matched load json by jsonpaths
For example json data:
[
{"category":"xuxb111","author":"1avc","title":"SayingsoftheCentury","price":895},
@ -260,7 +264,7 @@ Where url is the url given by ErrorURL.
1)If the json data starts as an array and each object in the array is a record, you need to set the strip_outer_array to true to represent the flat array.
2)If the json data starts with an array, and each object in the array is a record, our ROOT node is actually an object in the array when we set jsonpath.
12. User specifies the json_root node
13. User specifies the json_root node
For example json data:
{
"RECORDS":[
@ -272,9 +276,9 @@ Where url is the url given by ErrorURL.
Matched imports are made by specifying jsonpath parameter, such as `category`, `author`, and `price`, for example:
curl --location-trusted -u root -H "columns: category, price, author" -H "label:123" -H "format: json" -H "jsonpaths: [\"$.category\",\"$.price\",\"$.author\"]" -H "strip_outer_array: true" -H "json_root: $.RECORDS" -T testData http://host:port/api/testDb/testTbl/_stream_load
13. delete all data which key columns match the load data
14. delete all data which key columns match the load data
curl --location-trusted -u root -H "merge_type: DELETE" -T testData http://host:port/api/testDb/testTbl/_stream_load
14. delete all data which key columns match the load data where flag is true, others append
15. delete all data which key columns match the load data where flag is true, others append
curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv, flag" -H "merge_type: MERGE" -H "delete: flag=1" -T testData http://host:port/api/testDb/testTbl/_stream_load
## keyword

@ -0,0 +1,62 @@
---
{
"title": "QUANTILE_STATE",
"language": "zh-CN"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# QUANTILE_STATE
## description
QUANTILE_STATE
A QUANTILE_STATE column cannot be used as a key column, and its aggregation type must be QUANTILE_UNION when the table is created.
The user does not need to specify a length or a default value. The length is controlled by the system according to the degree of data aggregation.
A QUANTILE_STATE column can only be queried or used through the supporting QUANTILE_PERCENT, QUANTILE_UNION and TO_QUANTILE_STATE functions.
QUANTILE_STATE is a type for calculating approximate quantile values. During loading, different values with the same key are pre-aggregated. When the number of aggregated values does not exceed 2048, all values are recorded in detail. When the number of aggregated values is greater than 2048, the [TDigest](https://github.com/tdunning/t-digest/blob/main/docs/t-digest-paper/histo.pdf) algorithm is used to aggregate (cluster) the data, and the centroid points after clustering are saved.
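The switch between the two storage modes described above can be sketched in Python. This is only an illustration, not the Doris implementation: the class name `QuantileStateSketch`, the `max_centroids` size, and the equal-sized chunk clustering are all assumptions of this sketch, and the clustering is a much simpler stand-in for TDigest. Only the 2048 threshold comes from the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

EXPLICIT_LIMIT = 2048  # threshold quoted in the description above


@dataclass
class QuantileStateSketch:
    """Toy stand-in for QUANTILE_STATE (hypothetical, not Doris code)."""

    # (mean, weight) pairs; while values are stored explicitly every weight is 1.
    centroids: List[Tuple[float, int]] = field(default_factory=list)
    max_centroids: int = 100  # hypothetical size of the compact summary

    @property
    def count(self) -> int:
        """Total number of raw values represented by this state."""
        return sum(weight for _, weight in self.centroids)

    @property
    def is_explicit(self) -> bool:
        """True while every raw value is still stored individually."""
        return self.count <= EXPLICIT_LIMIT

    def add(self, value: float) -> None:
        self.centroids.append((float(value), 1))
        self._maybe_compress()

    def union(self, other: "QuantileStateSketch") -> None:
        """Merge another state into this one (the role of QUANTILE_UNION)."""
        self.centroids.extend(other.centroids)
        self._maybe_compress()

    def _maybe_compress(self) -> None:
        # Below the threshold nothing is summarized.
        if self.count <= EXPLICIT_LIMIT:
            return
        points = sorted(self.centroids)
        if len(points) <= self.max_centroids:
            return
        # Simplified clustering: merge neighbouring points in equal-sized
        # chunks. Real TDigest sizes clusters by their quantile position.
        size = -(-len(points) // self.max_centroids)  # ceiling division
        merged = []
        for start in range(0, len(points), size):
            chunk = points[start:start + size]
            weight = sum(w for _, w in chunk)
            mean = sum(m * w for m, w in chunk) / weight
            merged.append((mean, weight))
        self.centroids = merged
```

In this sketch the total weight is preserved across compression, so merging two states gives the same count as loading all raw values into one state.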
related functions:
QUANTILE_UNION(QUANTILE_STATE):
This function is an aggregate function that merges the intermediate results of different quantile calculations. The result returned by this function is still a QUANTILE_STATE.
TO_QUANTILE_STATE(INT/FLOAT/DOUBLE raw_data [,FLOAT compression]):
This function converts a numeric value to the QUANTILE_STATE type.
The compression parameter is optional and can be set in the range [2048, 10000].
The larger the value, the higher the precision of the quantile approximation, the greater the memory consumption, and the longer the calculation time.
If the compression parameter is not specified, or is set to a value outside the range [2048, 10000], the default value of 2048 is used.
QUANTILE_PERCENT(QUANTILE_STATE):
This function converts the intermediate result (QUANTILE_STATE) of the quantile calculation into a specific quantile value.
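
The fallback rule for the compression parameter of TO_QUANTILE_STATE can be written out as a small Python sketch. The function name `effective_compression` and the constant names are hypothetical; only the range [2048, 10000] and the default of 2048 come from the description above.

```python
from typing import Optional

MIN_COMPRESSION = 2048      # lower bound of the documented range
MAX_COMPRESSION = 10000     # upper bound of the documented range
DEFAULT_COMPRESSION = 2048  # documented default


def effective_compression(requested: Optional[float] = None) -> float:
    """Return the compression value that would actually be used.

    Hypothetical helper: an unspecified or out-of-range value falls back
    to the default instead of raising an error.
    """
    if requested is None:
        return DEFAULT_COMPRESSION
    if MIN_COMPRESSION <= requested <= MAX_COMPRESSION:
        return requested
    return DEFAULT_COMPRESSION
```

Under this reading, a request such as 20000 is not rejected and not clamped to 10000; it runs with 2048, so a higher precision has to be requested with a value inside the range.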
## example
select QUANTILE_PERCENT(QUANTILE_UNION(v1)) from test_table group by k1, k2, k3;
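
What a query of this shape computes can be sketched in Python, assuming every group is still small enough to hold its values explicitly. The row layout, the helper names, and the fraction argument `q` are assumptions of this sketch; the text above does not show how the requested quantile is passed to QUANTILE_PERCENT.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[int, int, float]  # (k1, k2, raw value): hypothetical layout


def pre_aggregate(rows: Iterable[Row]) -> Dict[Tuple[int, int], List[float]]:
    """Collect the values of each key, playing the role of QUANTILE_UNION."""
    states: Dict[Tuple[int, int], List[float]] = defaultdict(list)
    for k1, k2, value in rows:
        states[(k1, k2)].append(value)
    return states


def exact_quantile(values: List[float], q: float) -> float:
    """Nearest-rank quantile over explicitly stored values, 0 <= q <= 1."""
    if not values:
        raise ValueError("cannot take a quantile of an empty state")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


rows = [(1, 1, 10.0), (1, 1, 20.0), (1, 1, 30.0), (2, 1, 5.0)]
states = pre_aggregate(rows)
medians = {key: exact_quantile(vals, 0.5) for key, vals in states.items()}
```

Here `medians` maps each key to the median of its values. Once a group grows past the explicit limit, the stored state is a clustered summary and the result becomes an approximation.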
## keyword
QUANTILE_STATE, QUANTILE_UNION, TO_QUANTILE_STATE, QUANTILE_PERCENT