Fix three bugs when read iceberg v2 tables:
1. The `delete position` in `delete file` represents the position of delete row in the entire file, but the `read range` in
`RowGroupReader` represents the position in current row group. Therefore, we need to subtract the position of first
row of current row group from `delete position`.
2. When only reading the partition columns, `RowGroupReader` skips processing the `delete position`.
3. If the `delete position` has delete all rows in a row group, the `read range` is empty, but we read the whole row
group in such case.
Optimize four performance issues:
1. We change `delete position` to `delete range`, and then merge `delete range` and `read range` into the final read
ranges. This process is too tedious and time-consuming. . we can merge `delete position` and `read range` directly.
2. `delete position` is ordered in a `delete file`, so we can use merge-sort, instead of ordered-set.
3. Initialize `RowGroupReader` when reading, instead of initialize all row groups when opening a `ParquetReader`, to
save memory usage, and the same as `IcebergReader`.
4. Change the recursive call of `_do_lazy_read` to loop logic.
The former logic inside aggregate_function_window.cpp would shutdown BE once encountering agg function with complex type like BITMAP. This pr makes it don't crash and would return one more concrete error message which tells the unsupported function signature to user.
Optimize for key topn query like `SELECT * FROM store_sales ORDER BY ss_sold_date_sk, ss_sold_time_sk LIMIT 100`
(ss_sold_date_sk, ss_sold_time_sk is prefix of table sort key).
Check per scanner limit and set eof true to reduce the data need to be read.
**Histogram statistics**
Currently doris collects statistics, but no histogram data, and by default the optimizer assumes that the different values of the columns are evenly distributed. This calculation can be problematic when the data distribution is skewed. So this pr implements the collection of histogram statistics.
For columns containing data skew columns (columns with unevenly distributed data in the column), histogram statistics enable the optimizer to generate more accurate estimates of cardinality for filtering or join predicates involving these columns, resulting in a more precise execution plan.
The optimization of the execution plan by histogram is mainly in two aspects: the selection of where condition and the selection of join order. The selection principle of the where condition is relatively simple: the histogram is used to calculate the selection rate of each predicate, and the filter with higher selection rate is preferred.
The selection of join order is based on the estimation of the number of rows in the join result. In the case of uneven data distribution in the join condition columns, histogram can greatly improve the accuracy of the prediction of the number of rows in the join result. At the same time, if the number of rows of a bucket in one of the columns is 0, you can mark it and directly skip the bucket in the subsequent join process to improve efficiency.
---
Histogram statistics are mainly collected by the histogram aggregation function, which is used as follows:
**Syntax**
```SQL
histogram(expr)
```
> The histogram function is used to describe the distribution of the data. It uses an "equal height" bucking strategy, and divides the data into buckets according to the value of the data. It describes each bucket with some simple data, such as the number of values that fall in the bucket. It is mainly used by the optimizer to estimate the range query.
**example**
```
MySQL [test]> select histogram(login_time) from dev_table;
+------------------------------------------------------------------------------------------------------------------------------+
| histogram(`login_time`) |
+------------------------------------------------------------------------------------------------------------------------------+
| {"bucket_size":5,"buckets":[{"lower":"2022-09-21 17:30:29","upper":"2022-09-21 22:30:29","count":9,"pre_sum":0,"ndv":1},...]}|
+------------------------------------------------------------------------------------------------------------------------------+
```
**description**
```JSON
{
"bucket_size": 5,
"buckets": [
{
"lower": "2022-09-21 17:30:29",
"upper": "2022-09-21 22:30:29",
"count": 9,
"pre_sum": 0,
"ndv": 1
},
{
"lower": "2022-09-22 17:30:29",
"upper": "2022-09-22 22:30:29",
"count": 10,
"pre_sum": 9,
"ndv": 1
},
{
"lower": "2022-09-23 17:30:29",
"upper": "2022-09-23 22:30:29",
"count": 9,
"pre_sum": 19,
"ndv": 1
},
{
"lower": "2022-09-24 17:30:29",
"upper": "2022-09-24 22:30:29",
"count": 9,
"pre_sum": 28,
"ndv": 1
},
{
"lower": "2022-09-25 17:30:29",
"upper": "2022-09-25 22:30:29",
"count": 9,
"pre_sum": 37,
"ndv": 1
}
]
}
```
TODO:
- histogram func supports parameter and sample statistics (It's got another pr)
- use histogram statistics
- add p0 regression
It is very difficult to investigate the data inconsistency of multiple replicas.
When loading data, the number of rows between replicas is checked to avoid some data inconsistency problems.
The segment group is useless in current codebase, remove all the related code inside Doris. As for the related protobuf code, use reserved flag to prevent any future user from using that field.
when show databases/tables/table status where xxx, it will change a selectStmt to select result from
information_schema, it need catalog info to scan schema table, otherwise may get many
database or table info from multi catalog.
for example
mysql> show databases where schema_name='test';
+----------+
| Database |
+----------+
| test |
| test |
+----------+
MySQL [internal.test]> show tables from test where table_name='test_dc';
+----------------+
| Tables_in_test |
+----------------+
| test_dc |
| test_dc |
+----------------+
Fix three bugs:
1. DataTypeFactory::create_data_type is missing the conversion of binary type, and OrcReader will failed
2. ScalarType#createType is missing the conversion of binary type, and ExternalFileTableValuedFunction will failed
3. fmt::format can't generate right format string, and will be failed
The result of load should be failed when all tablets delta writer failed to close on single node.
But the result returned to client is success.
The reason is that the committed tablets and error tablets are both empty, so publish will be success.
We should add it to error tablets when delta writer failed to close, then the transaction will be failed.