bug: some chinese word not sort by pinyin in GBK coding
CREATE TABLE `test_convert` (
`a` varchar(100) NULL
) ENGINE=OLAP
DUPLICATE KEY(`a`)
DISTRIBUTED BY HASH(`a`) BUCKETS 3
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
insert into test_convert values("b"), ("a"), ("c"), ("睿"), ("多"), ("丝");
Query OK, 6 rows affected (0.03 sec)
{'label':'insert_ca73a6acc2194d5b_888218a3949355a6', 'status':'VISIBLE', 'txnId':'18068'}
mysql [test]>select * from test_convert;
+------+
| a |
+------+
| a |
| c |
| 丝 |
| b |
| 多 |
| 睿 |
+------+
6 rows in set (0.01 sec)
mysql [test]>select * from test_convert order by convert(a using gbk);
+------+
| a |
+------+
| a |
| b |
| c |
| 多 |
| 丝 |
| 睿 |
+------+
6 rows in set (0.01 sec)
Enhance aggregate function `collect_set` and `collect_list` to support optional `max_size` param,
which enables to limit the number of elements in result array.
1. Fixed a problem with histogram statistics collection parameters.
2. Solved the problem that it takes a long time to collect histogram statistics.
TODO: Optimize histogram statistics sampling method and make the sampling parameters effective.
The problem is that the histogram function works as expected in the single-node test, but doesn't work in the multi-node test. In addition, the performance of the current support sampling to collect histogram is low, resulting in a large time consumption when collecting histogram information.
Fixed the parameter issue and temporarily removed support for sampling to speed up the collection of histogram statistics.
Will next support sampling to collect histogram information.
Introduced a new function non_nullable to BE, which can extract concrete data column from a nullable column. If the input argument is already not a nullable column, raise an error.
When the argument of truncate function is float type, it can match both truncate(DECIMALV3) and truncate(DOUBLE), if the match is truncate(DECIMALV3), the precision is lost when converting float to DECIMALV3(38, 0).
Here I modify it to match truncate(DOUBLE) for now, maybe we still need to solve the problem of losing precision when converting float to DECIMALV3.
This pr mainly to optimize the histogram(👉🏻https://github.com/apache/doris/pull/14910) aggregation function. Including the following:
1. Support input parameters `sample_rate` and `max_bucket_num`
2. Add UT and regression test
3. Add documentation
4. Optimize function implementation logic
Parameter description:
- `sample_rate`:Optional. The proportion of sample data used to generate the histogram. The default is 0.2.
- `max_bucket_num`:Optional. Limit the number of histogram buckets. The default value is 128.
---
Example:
```
MySQL [test]> SELECT histogram(c_float) FROM histogram_test;
+-------------------------------------------------------------------------------------------------------------------------------------+
| histogram(`c_float`) |
+-------------------------------------------------------------------------------------------------------------------------------------+
| {"sample_rate":0.2,"max_bucket_num":128,"bucket_num":3,"buckets":[{"lower":"0.1","upper":"0.1","count":1,"pre_sum":0,"ndv":1},...]} |
+-------------------------------------------------------------------------------------------------------------------------------------+
MySQL [test]> SELECT histogram(c_string, 0.5, 2) FROM histogram_test;
+-------------------------------------------------------------------------------------------------------------------------------------+
| histogram(`c_string`) |
+-------------------------------------------------------------------------------------------------------------------------------------+
| {"sample_rate":0.5,"max_bucket_num":2,"bucket_num":2,"buckets":[{"lower":"str1","upper":"str7","count":4,"pre_sum":0,"ndv":3},...]} |
+-------------------------------------------------------------------------------------------------------------------------------------+
```
Query result description:
```
{
"sample_rate": 0.2,
"max_bucket_num": 128,
"bucket_num": 3,
"buckets": [
{
"lower": "0.1",
"upper": "0.2",
"count": 2,
"pre_sum": 0,
"ndv": 2
},
{
"lower": "0.8",
"upper": "0.9",
"count": 2,
"pre_sum": 2,
"ndv": 2
},
{
"lower": "1.0",
"upper": "1.0",
"count": 2,
"pre_sum": 4,
"ndv": 1
}
]
}
```
Field description:
- sample_rate:Rate of sampling
- max_bucket_num:Limit the maximum number of buckets
- bucket_num:The actual number of buckets
- buckets:All buckets
- lower:Upper bound of the bucket
- upper:Lower bound of the bucket
- count:The number of elements contained in the bucket
- pre_sum:The total number of elements in the front bucket
- ndv:The number of different values in the bucket
> Total number of histogram elements = number of elements in the last bucket(count) + total number of elements in the previous bucket(pre_sum).