This PR mainly optimizes the histogram aggregation function (👉🏻https://github.com/apache/doris/pull/14910). It includes the following:
1. Support the input parameters `sample_rate` and `max_bucket_num`
2. Add UT and regression tests
3. Add documentation
4. Optimize the function implementation logic
Parameter description:
- `sample_rate`: Optional. The proportion of the data that is sampled to generate the histogram. The default is 0.2.
- `max_bucket_num`: Optional. The maximum number of histogram buckets. The default is 128.
---
Example:
```
MySQL [test]> SELECT histogram(c_float) FROM histogram_test;
+-------------------------------------------------------------------------------------------------------------------------------------+
| histogram(`c_float`) |
+-------------------------------------------------------------------------------------------------------------------------------------+
| {"sample_rate":0.2,"max_bucket_num":128,"bucket_num":3,"buckets":[{"lower":"0.1","upper":"0.1","count":1,"pre_sum":0,"ndv":1},...]} |
+-------------------------------------------------------------------------------------------------------------------------------------+
MySQL [test]> SELECT histogram(c_string, 0.5, 2) FROM histogram_test;
+-------------------------------------------------------------------------------------------------------------------------------------+
| histogram(`c_string`) |
+-------------------------------------------------------------------------------------------------------------------------------------+
| {"sample_rate":0.5,"max_bucket_num":2,"bucket_num":2,"buckets":[{"lower":"str1","upper":"str7","count":4,"pre_sum":0,"ndv":3},...]} |
+-------------------------------------------------------------------------------------------------------------------------------------+
```
Query result description:
```
{
"sample_rate": 0.2,
"max_bucket_num": 128,
"bucket_num": 3,
"buckets": [
{
"lower": "0.1",
"upper": "0.2",
"count": 2,
"pre_sum": 0,
"ndv": 2
},
{
"lower": "0.8",
"upper": "0.9",
"count": 2,
"pre_sum": 2,
"ndv": 2
},
{
"lower": "1.0",
"upper": "1.0",
"count": 2,
"pre_sum": 4,
"ndv": 1
}
]
}
```
Field description:
- sample_rate: The sampling rate
- max_bucket_num: The maximum number of buckets allowed
- bucket_num: The actual number of buckets
- buckets: All buckets
- lower: Lower bound of the bucket
- upper: Upper bound of the bucket
- count: The number of elements contained in the bucket
- pre_sum: The total number of elements in all preceding buckets
- ndv: The number of distinct values in the bucket
> Total number of histogram elements = number of elements in the last bucket (count) + total number of elements in the preceding buckets (pre_sum).
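The relationship between `count` and `pre_sum` can be checked against the sample result above. The following is a minimal sketch; the helper names `total_elements` and `pre_sums_consistent` are hypothetical and are not part of Doris.

```python
import json

# Histogram JSON copied from the query result description above.
histogram_json = """
{
  "sample_rate": 0.2,
  "max_bucket_num": 128,
  "bucket_num": 3,
  "buckets": [
    {"lower": "0.1", "upper": "0.2", "count": 2, "pre_sum": 0, "ndv": 2},
    {"lower": "0.8", "upper": "0.9", "count": 2, "pre_sum": 2, "ndv": 2},
    {"lower": "1.0", "upper": "1.0", "count": 2, "pre_sum": 4, "ndv": 1}
  ]
}
"""


def total_elements(hist):
    """Hypothetical helper: count + pre_sum of the last bucket."""
    if not hist["buckets"]:
        return 0
    last = hist["buckets"][-1]
    return last["count"] + last["pre_sum"]


def pre_sums_consistent(hist):
    """Hypothetical helper: each pre_sum equals the sum of earlier counts."""
    running = 0
    for bucket in hist["buckets"]:
        if bucket["pre_sum"] != running:
            return False
        running += bucket["count"]
    return True


hist = json.loads(histogram_json)
print(total_elements(hist))       # 6
print(pre_sums_consistent(hist))  # True
```

For the sample result, the last bucket has `count` 2 and `pre_sum` 4, so the histogram covers 6 elements.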
This PR implements a new bloom filter index, the NGram bloom filter index, which was proposed in #10733.
The new index can greatly improve the performance of `LIKE` queries; in some of our test cases it gave an order-of-magnitude improvement.
For usage, see the docs in this PR. The index depends on `enable_function_pushdown`;
you need to set it to `true` for the index to take effect on `LIKE` queries.
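To illustrate the general idea behind an n-gram bloom filter (not the implementation in this PR), here is a minimal sketch. It assumes the filter stores every n-gram of the indexed values and that a `LIKE '%literal%'` query can skip a data block when any n-gram of the literal is missing from the filter. `TinyBloomFilter`, `build_filter`, and `may_match_like` are hypothetical names, and the filter size and hash scheme are arbitrary choices for the example.

```python
import hashlib


class TinyBloomFilter:
    """Toy bloom filter backed by a Python int used as a bitset."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def ngrams(text, n):
    """All contiguous substrings of length n (empty if text is shorter)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_filter(values, n):
    """Index every n-gram of every value in one data block."""
    bloom = TinyBloomFilter()
    for value in values:
        for gram in ngrams(value, n):
            bloom.add(gram)
    return bloom


def may_match_like(bloom, literal, n):
    """Return False only if no value in the block can contain `literal`."""
    grams = ngrams(literal, n)
    if not grams:
        # Literal shorter than n: the filter cannot prune anything.
        return True
    return all(bloom.might_contain(gram) for gram in grams)


block = ["apache doris", "bloom filter"]
bloom = build_filter(block, 3)
print(ngrams("doris", 3))                 # ['dor', 'ori', 'ris']
print(may_match_like(bloom, "doris", 3))  # True
```

A bloom filter never produces false negatives, so a block that contains the literal is never skipped; false positives only cost an unnecessary scan of that block.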
Add a new config `jdbc_drivers_dir` for both FE and BE.
Users can put JDBC driver jar files in this directory and specify only the file name in the `driver_url` property
when creating a JDBC resource.
Doris will then look for the jar file in this directory.
Also modify the logic so that when a JDBC resource is modified, the corresponding JDBC tables
pick up the latest properties.
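The lookup described above can be pictured roughly as follows. This is a sketch under assumptions, not the logic in this PR: the function name `resolve_driver_url`, the rule that a value containing `://` is treated as a full URL, the `file://` prefix, and the example directory and jar name are all hypothetical.

```python
import posixpath


def resolve_driver_url(driver_url, jdbc_drivers_dir):
    """Hypothetical resolution of the `driver_url` property."""
    # Assumption: a value that already carries a scheme is a full URL.
    if "://" in driver_url:
        return driver_url
    # Assumption: a bare file name is looked up under jdbc_drivers_dir.
    return "file://" + posixpath.join(jdbc_drivers_dir, driver_url)


# Example directory and jar name are made up for illustration.
print(resolve_driver_url("mysql-connector-java-8.0.25.jar", "/opt/doris/jdbc_drivers"))
print(resolve_driver_url("https://example.com/drivers/pg.jar", "/opt/doris/jdbc_drivers"))
```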
Currently, `outfile` does not support the `use_path_style` parameter and uses virtual-host style by default;
however, some object storage services may only support the path-style access mode.
This PR adds the `use_path_style` parameter for S3 outfile, so that different object storage services can use different access modes.
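The difference between the two access modes is where the bucket name appears in the request URL. The sketch below only illustrates the two URL shapes; the function `object_url` and the endpoint, bucket, and key values are hypothetical and do not come from the Doris code.

```python
def object_url(endpoint, bucket, key, use_path_style):
    """Build an object URL in path style or virtual-host style."""
    if use_path_style:
        # Path style: the bucket is the first path segment.
        return f"https://{endpoint}/{bucket}/{key}"
    # Virtual-host style: the bucket is a subdomain of the endpoint.
    return f"https://{bucket}.{endpoint}/{key}"


# Endpoint, bucket, and key are made-up example values.
print(object_url("s3.example.com", "my-bucket", "export/result_0.csv", True))
print(object_url("s3.example.com", "my-bucket", "export/result_0.csv", False))
```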