This pr is mainly supplement statistics regression test. include the following:
analyze stats p0 tests:
1. Universal analysis
analyze stats p1 tests:
1. Universal analysis
2. Sampled analysis
3. Incremental analysis
4. Automatic analysis
5. Periodic analysis
manage stats p0 tests:
1. Alter table stats
2. Show table stats
3. Alter column stats
4. Show column stats and histogram
5. Drop column stats
6. Drop expired stats
TODO:
1. Supplement related documents
2. Optimize for unstable cases encountered during testing
3. Add other cases
For pr related to statistics, should ensure that all of these cases pass!
Add table level statistics, support SHOW TABLE STATS statement to show table level statistics.
Implement automatically analyze statistics, support ANALYZE... WITH AUTO ... statement to automatically analyze statistics.
TODO:
collate relevant p0 tests
Supplement the design description to README.md
Issue Number: close #xxx
This PR enables periodic collection of statistics and is a precursor to automatic statistics collection. It mainly includes the following contents:
support periodic collection of statistics.
Change the type of Date in statistics p0 to DateV2(see [Enhancement](data-type) add FE config to prohibit create date and decimalv2 type #19077) for test locally. complement cases(remove Chinese characters, optimize code, etc) , improve stability.
Supports setting whether to keep records of statistics synchronization job info, convenient for use in p0 testing.
The statistics job table was modified, and some auxiliary judgments were added to avoid the user perceiving the modification. This function was removed when the table schema is stable.
1. Added test p0 for sampling collection statistics
2. Modify the uniqueKeys of table analysis_jobs for deletion based on relevant conditions
3. Solve the problem that incremental statistics p0 is less stable
This pr mainly optimizes the following items:
- the collection of statistics: clear up invalid historical statistics before collecting them, so as not to affect the final table statistics.
- the incremental collection of statistics: in the case of incremental collection, only the corresponding partition statistics need to be collected.
TODO: Supports incremental collection of materialized view statistics.
1. Supports sampling to collect statistics
2. Improved syntax for collecting statistics
3. Support histogram specifies the number of buckets
4. Tweaked some code structure
---
The syntax supports WITH and PROPERTIES, using the same syntax as before.
Column Statistics Collection Syntax:
```SQL
ANALYZE [ SYNC ] TABLE table_name
[ (column_name [, ...]) ]
[ [WITH SYNC] | [WITH INCREMENTAL] | [WITH SAMPLE PERCENT | ROWS ] ]
[ PROPERTIES ('key' = 'value', ...) ];
```
Column histogram collection syntax:
```SQL
ANALYZE [ SYNC ] TABLE table_name
[ (column_name [, ...]) ]
UPDATE HISTOGRAM
[ [ WITH SYNC ][ WITH INCREMENTAL ][ WITH SAMPLE PERCENT | ROWS ][ WITH BUCKETS ] ]
[ PROPERTIES ('key' = 'value', ...) ];
```
Illustrate:
- sync:Collect statistics synchronously. Return after collecting.
- incremental:Collect statistics incrementally. Incremental collection of histogram statistics is not supported.
- sample percent | rows:Collect statistics by sampling. Scale and number of rows can be sampled.
- buckets:Specifies the maximum number of buckets generated when collecting histogram statistics.
- table_name: The purpose table for collecting statistics. Can be of the form `db_name.table_name`.
- column_name: The specified destination column must be a column that exists in `table_name`, and multiple column names are separated by commas.
- properties:Properties used to set statistics tasks. Currently only the following configurations are supported (equivalent to the with statement)
- 'sync' = 'true'
- 'incremental' = 'true'
- 'sample.percent' = '50'
- 'sample.rows' = '1000'
- 'num.buckets' = 10
---
TODO:
- Supplement the complete p0 test
- `Incremental` statistics see #18653
1. Evict the dropped stats from cache
2. Remove codes for the partition level stats collection
3. Disable analyze whole database directly
4. Fix the potential death loop in the stats cleaner
5. Sleep thread in each loop when scanning stats table to avoid excessive IO usage by this task.
Support to delete expired stats periodically and manually.
default cleaner running interval is 2 days
Manually clean syntax is
```sql
DROP EXPIRED STATS
```
TODO:
1. process external catalog's stats
2. run drop at the appointed time
3. sleep a short time after drop one batch
1. Reduce the configuration options for statistics framework, and add comment for those rest.
2. Move the logic of creation of analysis job to the `StatisticsRepository` which defined all the functions used to interact with internal statistics table
3. Move AnalysisJobScheduler to the statistics package
4. Support display and injections manually for statistics