[feature] Support pre-aggregation for quantile type (#8234)

Add a new column type to speed up approximate quantile calculations.
1. The new column type is named `quantile_state` and has the fixed aggregation function `quantile_union`. It stores the intermediate results of pre-aggregated approximate quantile calculations.
2. Support pre-aggregation of the new column type and add the `quantile_state`-related functions.
2022-03-24 09:11:34 +08:00
parent 36c85d2f06
commit bea9a7ba4f
67 changed files with 1498 additions and 153 deletions
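
To make the pre-aggregation idea in the commit message concrete, here is a minimal Python sketch. It is not Doris code: it models only the case where a state keeps every raw value, whereas the documentation below says larger states switch to a TDigest summary. The names `toy_to_state`, `toy_union`, `toy_quantile`, and `pre_aggregate`, the `percent` argument, and the linear interpolation are all assumptions made for illustration.

```python
from collections import defaultdict


def toy_to_state(value):
    """Wrap one raw numeric value in a single-element state (hypothetical helper)."""
    return [float(value)]


def toy_union(states):
    """Merge several states into one sorted state (hypothetical helper)."""
    merged = []
    for state in states:
        merged.extend(state)
    return sorted(merged)


def toy_quantile(state, percent):
    """Read a quantile (0 <= percent <= 1) from a sorted state.

    Linear interpolation between neighbouring values is an assumption of
    this sketch, not a statement about how Doris computes the result.
    """
    if not state:
        raise ValueError("empty state")
    if not 0.0 <= percent <= 1.0:
        raise ValueError("percent must be in [0, 1]")
    position = percent * (len(state) - 1)
    lower = int(position)
    upper = min(lower + 1, len(state) - 1)
    fraction = position - lower
    return state[lower] + (state[upper] - state[lower]) * fraction


def pre_aggregate(rows):
    """Group (key, value) rows by key and keep one merged state per key."""
    grouped = defaultdict(list)
    for key, value in rows:
        grouped[key].append(toy_to_state(value))
    return {key: toy_union(states) for key, states in grouped.items()}


rows = [("a", 1), ("a", 3), ("a", 2), ("b", 10), ("b", 20)]
table = pre_aggregate(rows)
median_a = toy_quantile(table["a"], 0.5)
median_b = toy_quantile(table["b"], 0.5)
```

The point of the design is that each key holds one mergeable state, so a later query only combines states instead of rescanning the raw rows.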
--- a/docs/en/sql-reference/sql-statements/Data
+++ b/docs/en/sql-reference/sql-statements/Data
@@ -88,6 +88,11 @@ Syntax:
            This type can only be queried by hll_union_agg, hll_cardinality, hll_hash functions.
        BITMAP
            BITMAP type. No need to specify a length. Represents a set of unsigned bigint numbers; the largest element can be 2^64 - 1.
+        QUANTILE_STATE
+            QUANTILE_STATE type. No need to specify a length. Represents the pre-aggregated result of a quantile calculation. Currently, only numeric raw data types such as `int`, `float`, and `double` are supported.
+            If the number of elements is less than 2048, the explicit data is stored.
+            If the number of elements is greater than 2048, the intermediate result of the TDigest pre-aggregation is stored.
+            
    ```
    agg_type: Aggregation type. If not specified, the column is a key column. Otherwise, the column is a value column.

@@ -95,8 +100,14 @@ Syntax:
       * HLL_UNION: Only for HLL type
       * REPLACE_IF_NOT_NULL: With this aggregation type, a value is replaced if and only if the newly imported value is non-null. If the newly imported value is null, Doris retains the original value. Note: if NOT NULL is specified on a REPLACE_IF_NOT_NULL column when the table is created, Doris converts it to NULL and does not report an error. This aggregation type can be used to import only some of the columns. **It should be noted here that the default value should be NULL, not an empty string. If it is an empty string, you should replace it with an empty string**.
       * BITMAP_UNION: Only for BITMAP type
+       * QUANTILE_UNION: Only for QUANTILE_STATE type
    Allow NULL: Default is NOT NULL. NULL value should be represented as `\N` in load source file.
-    Notice: The origin value of BITMAP_UNION column should be TINYINT, SMALLINT, INT, BIGINT.
+    
+    Notice: 
+    
+        The original value of a BITMAP_UNION column must be TINYINT, SMALLINT, INT, or BIGINT.
+        
+        The original value of a QUANTILE_UNION column must be a numeric type such as TINYINT, INT, FLOAT, DOUBLE, or DECIMAL.
 2. index_definition
    Syntax:
        `INDEX index_name (col_name[, col_name, ...]) [USING BITMAP] COMMENT 'xxxxxx'`
@@ -125,6 +136,7 @@ Syntax:
        table_name in the CREATE TABLE statement is the name of the table in Doris. They can be the same or different.
        A MySQL table created in Doris is used to access data in a MySQL database.
        Doris does not maintain or store any data from the MySQL table.
+
    2) For broker, properties should include:

        ```
@@ -633,8 +645,21 @@ Syntax:
    AGGREGATE KEY(k1, k2)
    DISTRIBUTED BY HASH(k1) BUCKETS 32;
    ```
-
-9. Create two colocate join tables.
+9. Create a table with QUANTILE_UNION columns (the original values of the **v1** and **v2** columns must be **numeric** types)
+    
+    ```
+    CREATE TABLE example_db.example_table
+    (
+    k1 TINYINT,
+    k2 DECIMAL(10, 2) DEFAULT "10.5",
+    v1 QUANTILE_STATE QUANTILE_UNION,
+    v2 QUANTILE_STATE QUANTILE_UNION
+    )
+    ENGINE=olap
+    AGGREGATE KEY(k1, k2)
+    DISTRIBUTED BY HASH(k1) BUCKETS 32;
+    ```
+10. Create two colocate join tables.

    ```
    CREATE TABLE `t1` (
@@ -657,7 +682,7 @@ Syntax:
    );
    ```

-10. Create a broker table, with file on BOS.
+11. Create a broker table, with file on BOS.

    ```
    CREATE EXTERNAL TABLE example_db.table_broker (
@@ -675,7 +700,7 @@ Syntax:
    );
    ```

-11. Create a table with a bitmap index 
+12. Create a table with a bitmap index 

    ```
    CREATE TABLE example_db.table_hash
@@ -692,7 +717,7 @@ Syntax:
    DISTRIBUTED BY HASH(k1) BUCKETS 32;
    ```
    
-12. Create a dynamic partitioning table (dynamic partitioning needs to be enabled in FE configuration), which creates partitions 3 days in advance every day. For example, if today is '2020-01-08', partitions named 'p20200108', 'p20200109', 'p20200110', 'p20200111' will be created.
+13. Create a dynamic partitioning table (dynamic partitioning needs to be enabled in FE configuration), which creates partitions 3 days in advance every day. For example, if today is '2020-01-08', partitions named 'p20200108', 'p20200109', 'p20200110', 'p20200111' will be created.

    ```
    [types: [DATE]; keys: [2020-01-08]; ..types: [DATE]; keys: [2020-01-09]; )
@@ -722,7 +747,7 @@ Syntax:
        "dynamic_partition.buckets" = "32"
         );
     ```
-13. Create a table with rollup index
+14. Create a table with rollup index
 ```
    CREATE TABLE example_db.rolup_index_table
    (
@@ -742,7 +767,7 @@ Syntax:
    PROPERTIES("replication_num" = "3");
 ```

-14. Create an in-memory table:
+15. Create an in-memory table:

 ```
    CREATE TABLE example_db.table_hash
@@ -760,7 +785,7 @@ Syntax:
    PROPERTIES ("in_memory"="true");
 ```

-15. Create a hive external table
+16. Create a hive external table
 ```
    CREATE TABLE example_db.table_hive
    (
@@ -777,7 +802,7 @@ Syntax:
    );
 ```

-16. Specify the replica distribution of the table through replication_allocation
+17. Specify the replica distribution of the table through replication_allocation

 ```	
    CREATE TABLE example_db.table_hash
--- a/docs/en/sql-reference/sql-statements/Data
+++ b/docs/en/sql-reference/sql-statements/Data
@@ -232,7 +232,11 @@ Where url is the url given by ErrorURL.

    ```curl --location-trusted -u root -H "columns: k1, k2, v1=to_bitmap(k1), v2=bitmap_empty()" -T testData http://host:port/api/testDb/testTbl/_stream_load```

-10. a simple load json
+10. Load data into a table with QUANTILE_STATE columns. The source can be columns in the table or a column in the data used to generate the QUANTILE_STATE columns; use TO_QUANTILE_STATE to convert numerical data to QUANTILE_STATE. The second argument (2048 in this example) is optional and sets the precision of the TDigest algorithm. The valid range is [2048, 10000]; the larger the value, the higher the precision. The default is 2048.
+    
+    ```curl --location-trusted -u root -H "columns: k1, k2, v1, v2, v1=to_quantile_state(v1, 2048)" -T testData http://host:port/api/testDb/testTbl/_stream_load```
+
+11. a simple load json
       table schema:
           `category` varchar(512) NULL COMMENT "",
           `author` varchar(512) NULL COMMENT "",
@@ -247,7 +251,7 @@ Where url is the url given by ErrorURL.
            {"category":"Java","author":"avc","title":"Effective Java","price":95}
            {"category":"Linux","author":"avc","title":"Linux kernel","price":195}
            
-11. Matched load json by jsonpaths
+12. Matched load json by jsonpaths
       For example json data:
           [
           {"category":"xuxb111","author":"1avc","title":"SayingsoftheCentury","price":895},
@@ -260,7 +264,7 @@ Where url is the url given by ErrorURL.
        1) If the JSON data starts as an array and each object in the array is a record, set strip_outer_array to true to flatten the array.
        2) If the JSON data starts with an array and each object in the array is a record, the ROOT node is actually an object in the array when jsonpath is set.

-12. User specifies the json_root node
+13. User specifies the json_root node
       For example json data:
            {
            "RECORDS":[
@@ -272,9 +276,9 @@ Where url is the url given by ErrorURL.
       Matched imports are made by specifying jsonpath parameter, such as `category`, `author`, and `price`, for example:
         curl --location-trusted -u root  -H "columns: category, price, author" -H "label:123" -H "format: json" -H "jsonpaths: [\"$.category\",\"$.price\",\"$.author\"]" -H "strip_outer_array: true" -H "json_root: $.RECORDS" -T testData http://host:port/api/testDb/testTbl/_stream_load

-13. delete all data which key columns match the load data 
+14. delete all data which key columns match the load data 
    curl --location-trusted -u root -H "merge_type: DELETE" -T testData http://host:port/api/testDb/testTbl/_stream_load
-14. delete all data which key columns match the load data where flag is true, others append
+15. delete all data which key columns match the load data where flag is true, others append
    curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv, flag" -H "merge_type: MERGE" -H "delete: flag=1"  -T testData http://host:port/api/testDb/testTbl/_stream_load

 ## keyword
--- a/docs/en/sql-reference/sql-statements/Data
+++ b/docs/en/sql-reference/sql-statements/Data
@@ -0,0 +1,62 @@
+---
+{
+    "title": "QUANTILE_STATE",
+    "language": "en"
+}
+---
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# QUANTILE_STATE
+## description
+
+QUANTILE_STATE
+
+    QUANTILE_STATE cannot be used as a key column, and its aggregation type must be QUANTILE_UNION when the table is created.
+    The user does not need to specify a length or a default value. The length is managed internally according to the degree of data aggregation.
+    A QUANTILE_STATE column can only be queried or used through the supporting QUANTILE_PERCENT, QUANTILE_UNION, and TO_QUANTILE_STATE functions.
+    QUANTILE_STATE is a type for calculating approximate quantile values. Different values with the same key are pre-aggregated during the loading process. When the number of aggregated values does not exceed 2048, all values are recorded explicitly. When the number of aggregated values is greater than 2048, the [TDigest](https://github.com/tdunning/t-digest/blob/main/docs/t-digest-paper/histo.pdf) algorithm is used to aggregate (cluster) the data, and the centroids produced by the clustering are saved.
+
+related functions:
+    
+    QUANTILE_UNION(QUANTILE_STATE):
+      
+      This is an aggregation function that merges the intermediate results of different quantile calculations. It still returns a QUANTILE_STATE value.
+
+    
+    TO_QUANTILE_STATE(INT/FLOAT/DOUBLE raw_data [,FLOAT compression]):
+       
+       This function converts a numeric value to the QUANTILE_STATE type.
+       The compression parameter is optional and can be set in the range [2048, 10000].
+       The larger the value, the higher the precision of the approximate quantile calculation, the greater the memory consumption, and the longer the calculation time.
+       If the compression parameter is not specified, or its value is outside the range [2048, 10000], the default value of 2048 is used.
+
+    QUANTILE_PERCENT(QUANTILE_STATE):
+       This function converts the intermediate result (QUANTILE_STATE) of a quantile calculation into a concrete quantile value.
+
+    
+
+## example
+    select QUANTILE_PERCENT(QUANTILE_UNION(v1)) from test_table group by k1, k2, k3;
+    
+
+## keyword
+
+    QUANTILE_STATE, QUANTILE_UNION, TO_QUANTILE_STATE, QUANTILE_PERCENT
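
The fallback rule that the QUANTILE_STATE page above gives for the `compression` parameter of TO_QUANTILE_STATE can be sketched in a few lines of Python. This only restates the documented rule (range [2048, 10000], default 2048). The function name `effective_compression` and the constant names are hypothetical, and the sketch makes no claim about how Doris implements the check.

```python
DEFAULT_COMPRESSION = 2048  # documented default, also the lower bound of the valid range
MAX_COMPRESSION = 10000     # documented upper bound of the valid range


def effective_compression(compression=None):
    """Return the compression value that would be used under the documented rule.

    An unspecified value, or one outside [2048, 10000], falls back to 2048.
    """
    if compression is None:
        return DEFAULT_COMPRESSION
    if not DEFAULT_COMPRESSION <= compression <= MAX_COMPRESSION:
        return DEFAULT_COMPRESSION
    return compression


unspecified = effective_compression()
in_range = effective_compression(5000)
too_small = effective_compression(100)
too_large = effective_compression(20000)
```

Under this rule an out-of-range value is silently replaced rather than rejected, so a load that passes, say, 100 behaves the same as one that passes nothing.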