doris

Files

ElvinWei 76ad599fd7 [enhancement](histogram) optimise aggregate function histogram (#15317 )

This PR optimizes the histogram aggregation function (👉🏻 https://github.com/apache/doris/pull/14910). It includes the following changes:
1. Support the input parameters `sample_rate` and `max_bucket_num`
2. Add unit tests (UT) and regression tests
3. Add documentation
4. Optimize the function's implementation logic

Parameter description:
- `sample_rate`: Optional. The proportion of the data that is sampled to generate the histogram. The default is 0.2.
- `max_bucket_num`: Optional. The maximum number of histogram buckets. The default is 128.

---

Example:

```
MySQL [test]> SELECT histogram(c_float) FROM histogram_test;
+-------------------------------------------------------------------------------------------------------------------------------------+
| histogram(`c_float`)                                                                                                                |
+-------------------------------------------------------------------------------------------------------------------------------------+
| {"sample_rate":0.2,"max_bucket_num":128,"bucket_num":3,"buckets":[{"lower":"0.1","upper":"0.1","count":1,"pre_sum":0,"ndv":1},...]} |
+-------------------------------------------------------------------------------------------------------------------------------------+

MySQL [test]> SELECT histogram(c_string, 0.5, 2) FROM histogram_test;
+-------------------------------------------------------------------------------------------------------------------------------------+
| histogram(`c_string`)                                                                                                               |
+-------------------------------------------------------------------------------------------------------------------------------------+
| {"sample_rate":0.5,"max_bucket_num":2,"bucket_num":2,"buckets":[{"lower":"str1","upper":"str7","count":4,"pre_sum":0,"ndv":3},...]} |
+-------------------------------------------------------------------------------------------------------------------------------------+
```

Query result description:

```
{
    "sample_rate": 0.2, 
    "max_bucket_num": 128, 
    "bucket_num": 3, 
    "buckets": [
        {
            "lower": "0.1", 
            "upper": "0.2", 
            "count": 2, 
            "pre_sum": 0, 
            "ndv": 2
        }, 
        {
            "lower": "0.8", 
            "upper": "0.9", 
            "count": 2, 
            "pre_sum": 2, 
            "ndv": 2
        }, 
        {
            "lower": "1.0", 
            "upper": "1.0", 
            "count": 2, 
            "pre_sum": 4, 
            "ndv": 1
        }
    ]
}
```

Field description:
- sample_rate: The sampling rate
- max_bucket_num: The maximum number of buckets allowed
- bucket_num: The actual number of buckets
- buckets: All buckets
    - lower: Lower bound of the bucket
    - upper: Upper bound of the bucket
    - count: The number of elements contained in the bucket
    - pre_sum: The total number of elements in all preceding buckets
    - ndv: The number of distinct values in the bucket

> Total number of histogram elements = number of elements in the last bucket (`count`) + total number of elements in all buckets before it (the last bucket's `pre_sum`).
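
As a rough illustration of how the fields relate, the sketch below parses the sample result shown above and applies the total-elements rule. It is written in Python for brevity and is not part of the Doris code base. The helper names `total_elements` and `pre_sums_consistent` are hypothetical, and the consistency check assumes that each bucket's `pre_sum` equals the sum of `count` over the buckets before it, as the sample output suggests.

```python
import json

# Sample result copied from the "Query result description" above.
RESULT = """
{
    "sample_rate": 0.2,
    "max_bucket_num": 128,
    "bucket_num": 3,
    "buckets": [
        {"lower": "0.1", "upper": "0.2", "count": 2, "pre_sum": 0, "ndv": 2},
        {"lower": "0.8", "upper": "0.9", "count": 2, "pre_sum": 2, "ndv": 2},
        {"lower": "1.0", "upper": "1.0", "count": 2, "pre_sum": 4, "ndv": 1}
    ]
}
"""


def total_elements(hist):
    """Total = count of the last bucket + pre_sum of the last bucket."""
    buckets = hist["buckets"]
    if not buckets:
        return 0
    last = buckets[-1]
    return last["count"] + last["pre_sum"]


def pre_sums_consistent(hist):
    """Check that each pre_sum equals the running sum of earlier counts.

    This is an assumption inferred from the sample output, not a documented guarantee.
    """
    running = 0
    for bucket in hist["buckets"]:
        if bucket["pre_sum"] != running:
            return False
        running += bucket["count"]
    return True


hist = json.loads(RESULT)
print(total_elements(hist))       # 2 + 4 = 6
print(pre_sums_consistent(hist))  # True for the sample above
```

For the sample, the last bucket has `count` 2 and `pre_sum` 4, so the histogram covers 6 sampled elements in total.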

2023-01-07 00:50:32 +08:00

agg_collect_test.cpp

[chore](github) Add a workflow to check BE UT on macOS (#14506 )

2022-11-23 08:38:28 +08:00

agg_histogram_test.cpp

[enhancement](histogram) optimise aggregate function histogram (#15317 )

2023-01-07 00:50:32 +08:00

agg_min_max_by_test.cpp

[Feature](Retention) support retention function (#13056 )

2022-10-17 11:00:47 +08:00

agg_min_max_test.cpp

[fix](agg) Crashing caused by serialization in streaming aggregation (#12027 )