In my scenario, we need to exclude certain databases from being synchronized to Doris,
e.g., databases that store temporary tables.
Since #17803 introduced `specified_database_list` to specify 'include databases',
this PR introduces a new config `exclude_database_list` to specify 'exclude databases',
and renames `specified_database_list` to `include_database_list` for naming symmetry.
Note that when `include_database_list` and `exclude_database_list` specify overlapping databases, `exclude_database_list` takes precedence over `include_database_list`.
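As a hedged illustration (the exact placement of these properties depends on the catalog/job type, and the catalog name here is just an example), the two lists could be combined like this:

```SQL
-- Hypothetical example: synchronize everything except the database that
-- stores temporary tables. Other required connection properties omitted.
CREATE CATALOG my_catalog PROPERTIES (
    "include_database_list" = "db1,db2,db_tmp",
    "exclude_database_list" = "db_tmp"  -- excluded even though it is also listed above
);
```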
Sometimes the dictionary is not yet initialized when the comparison predicate runs here; for example, when an entire page is null, the reader skips reading it, so the dictionary is never initialized. The cached codes are wrong in this case, because a following page may be non-null, and the dictionary may gain items later.
As a result, queries on dictionary-encoded string columns can return wrong results when the column contains many null values.
I also added regression tests for equal, greater-than, and less-than queries on dictionary columns.
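A sketch of the added regression coverage, assuming a dictionary-encoded string column `k` that is mostly NULL (table and column names are illustrative):

```SQL
-- With many NULLs, the first pages read can be entirely null, leaving the
-- dictionary uninitialized; later non-null pages must still compare correctly.
SELECT count(*) FROM t WHERE k = 'abc';  -- equal
SELECT count(*) FROM t WHERE k > 'abc';  -- greater than
SELECT count(*) FROM t WHERE k < 'abc';  -- less than
```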
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
1. Added a p0 test for sampling-based statistics collection
2. Modified the unique keys of the `analysis_jobs` table so that rows can be deleted based on the relevant conditions
3. Made the incremental statistics p0 test more stable
This PR mainly optimizes the following items:
- Statistics collection: clean up invalid historical statistics before collecting, so they do not affect the final table statistics.
- Incremental statistics collection: in the incremental case, only the corresponding partition statistics need to be collected (see the example below).
TODO: support incremental collection of materialized view statistics.
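For example (using the ANALYZE syntax documented later in these notes; the table name is a placeholder), an incremental run only collects the statistics of the affected partitions:

```SQL
-- Illustrative: incremental collection touches only the corresponding
-- partition statistics instead of rescanning the whole table.
ANALYZE TABLE db1.tbl WITH INCREMENTAL;
```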
Dynamic mode is used for the array type when serializing it to the MySQL row buffer. When the binary row format is combined with dynamic mode, something goes wrong and produces an invalid binary row format.
Since we cannot derive stats and estimate cost very well for aggregates,
this PR removes some aggregate patterns that are usually not good (a query shape that can produce such plans is sketched below):
1. One-stage aggregation after an exchange; this pattern is good only when processing very few rows.
2. Three-stage distinct aggregation with a gather middle merge.
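As a hedged illustration (names are hypothetical), a grouped distinct aggregate is the kind of query that could previously be planned with these patterns:

```SQL
-- Depending on stats, the planner could choose a multi-stage distinct plan
-- here, e.g. three-stage distinct aggregation with a gather middle merge.
SELECT c1, count(DISTINCT c2)
FROM t
GROUP BY c1;
```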
Hive supports creating a partition with a specific location. In this case, the file path of the created partition may not contain the partition name and value, which causes Doris to fail to query the Hive partition.
This PR fixes this bug.
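For reference, standard Hive DDL allows a partition to point at an arbitrary directory, so the path need not contain `partition_col=value`:

```SQL
-- The partition data lives under a custom location whose path does not
-- embed the partition name and value.
ALTER TABLE hive_tbl ADD PARTITION (dt = '2023-01-01')
LOCATION '/warehouse/custom/some_dir';
```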
segcompaction_p1 contains fairly large load jobs, which exceed the
memory limit or time out in the pipeline engine under such heavy loads.
Signed-off-by: freemandealer <freeman.zhang1992@gmail.com>
Fix a bug when reading the array type in Parquet files:
```
ERROR 1105 (HY000): errCode = 2, detailMessage = [INTERNAL_ERROR]Read parquet file xxx failed,
reason = [IO_ERROR]Decode too many values in current page
```
When reading normal columns, `ScalarColumnReader::_read_values` calls `ColumnSelectVector::set_run_length_null_map` to initialize the select vector, but `ScalarColumnReader::_read_nested_column` doesn't do this, making the number of values wrong.
The situation where this error occurs is particularly extreme: the column pages still have remaining values to be read,
but all of them are null at the ancestor level, so there is no actual read operation, just skipping the null values at the ancestor level.
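A hypothetical reproduction shape (table and column names are illustrative): an array column read from Parquet where long runs of rows are NULL at the array (ancestor) level, so whole pages are skipped rather than read:

```SQL
-- Illustrative: arr is an ARRAY column in a Parquet file; rows where the
-- whole array is NULL exercise the skip path in the nested column reader.
SELECT arr FROM parquet_tbl;
```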
1. Remove the unnecessary project node above the scan node.
2. Fix a bug where an IN subquery may be recognized as a scalar subquery.
3. Fix a bug in the return type of some Quantile-related functions.
Fix broker load p2 test case errors:
1. Moved test data from the COS Hong Kong region to the Beijing region.
2. Moved the broker load tests to the p2 group.
3. Fixed an error message mismatch.
1. Support collecting statistics by sampling
2. Improved the syntax for collecting statistics
3. Support specifying the number of buckets for histogram collection
4. Tweaked some code structure
---
The syntax supports both WITH clauses and PROPERTIES, consistent with the previous syntax.
Column Statistics Collection Syntax:
```SQL
ANALYZE [ SYNC ] TABLE table_name
[ (column_name [, ...]) ]
[ [WITH SYNC] | [WITH INCREMENTAL] | [WITH SAMPLE PERCENT | ROWS ] ]
[ PROPERTIES ('key' = 'value', ...) ];
```
Column Histogram Collection Syntax:
```SQL
ANALYZE [ SYNC ] TABLE table_name
[ (column_name [, ...]) ]
UPDATE HISTOGRAM
[ [ WITH SYNC ][ WITH INCREMENTAL ][ WITH SAMPLE PERCENT | ROWS ][ WITH BUCKETS ] ]
[ PROPERTIES ('key' = 'value', ...) ];
```
Explanation:
- sync: Collect statistics synchronously; the statement returns after collection finishes.
- incremental: Collect statistics incrementally. Incremental collection of histogram statistics is not supported.
- sample percent | rows: Collect statistics by sampling, either a percentage of the data or a number of rows.
- buckets: Specifies the maximum number of buckets generated when collecting histogram statistics.
- table_name: The target table for collecting statistics. It can be of the form `db_name.table_name`.
- column_name: The specified target columns, which must exist in `table_name`; multiple column names are separated by commas.
- properties: Properties used to configure the statistics job. Currently only the following configurations are supported (equivalent to the WITH clauses; see the examples after this list):
  - 'sync' = 'true'
  - 'incremental' = 'true'
  - 'sample.percent' = '50'
  - 'sample.rows' = '1000'
  - 'num.buckets' = '10'
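For example, combining the clauses above (table and column names are placeholders):

```SQL
-- Sample 50 percent of the data and collect column statistics synchronously.
ANALYZE SYNC TABLE db1.tbl (c1, c2) WITH SAMPLE PERCENT 50;

-- Collect a histogram for c1 with at most 128 buckets.
ANALYZE TABLE db1.tbl (c1) UPDATE HISTOGRAM WITH BUCKETS 128;
```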
---
TODO:
- Add the complete p0 tests
- For `incremental` statistics, see #18653
Fix decimal v3 precision loss issues in the multi-catalog module.
Decimal v3 is now used to represent the decimal type in the multi-catalog module.
Regression test: `test_load_with_decimal.groovy`
1. Evict dropped stats from the cache
2. Remove the code for partition-level stats collection
3. Disable analyzing a whole database directly
4. Fix a potential infinite loop in the stats cleaner
5. Sleep in each loop iteration when scanning the stats table, to avoid excessive IO usage by this task