[minor](stats) Add doc for stats framework (#19311)

AKIRA
2023-05-06 14:30:55 +09:00
committed by GitHub
parent 1223f81228
commit bd23db762d


@ -0,0 +1,116 @@
- [Requirements](#requirements)
  - [Basic](#basic)
  - [Advanced (Not finished yet)](#advanced-not-finished-yet)
  - [Specification](#specification)
  - [Compatibility](#compatibility)
    - [Function compatibility](#function-compatibility)
    - [Version compatibility](#version-compatibility)
- [Implementation](#implementation)
  - [Main class](#main-class)
  - [Analyze execution flow](#analyze-execution-flow)
  - [Load execution flow](#load-execution-flow)
- [Configure options](#configure-options)
- [User interface](#user-interface)
- [Test](#test)
# Requirements
## Basic
Provide the data the optimizer needs to estimate and compare candidate plans: count, ndv (number of distinct values), null_count, min, max, data_size, and a histogram for each column, plus the number of rows in the table.
## Advanced (Not finished yet)
Support incremental collection and automatic collection.
## Specification
## Compatibility
### Function compatibility
No conflicts with any other function.
### Version compatibility
Compatibility issues may arise if the schema of the stats table changes in the future.
# Implementation
## Main class
|Class name|Function|
|---|---|
|AnalyzeStmt|Constructed by parsing user-input SQL. Each AnalyzeStmt corresponds to a Job; a Job can have multiple Tasks, and each Task collects statistics for one column.|
|AnalysisManager|Manages Analyze Jobs/Tasks, including creation, execution, cancellation, and status updates.|
|StatisticCache|The collected statistical information is cached here on demand.|
|StatisticCacheLoader|When `StatsCalculator#computeScan` fails to find the corresponding stats for a column in the cache, the load logic will be triggered, which is implemented in this class.|
|AnalysisTaskExecutor|Used to execute AnalyzeJob.|
|AnalysisTaskWrapper|This class encapsulates an `AnalysisTask` and extends `FutureTask`. It overrides some methods for state updates.|
|AnalysisTaskScheduler|AnalysisTaskExecutor retrieves jobs from here for execution. Manually submitted jobs always have higher priority than automatically triggered ones.|
|StatisticsCleaner|Responsible for cleaning up expired statistics and job information.|
|StatisticsRepository|Most of the related SQL is defined here.|
|StatisticsUtil|Mainly consists of helper methods, such as checking the status of stats-related tables.|
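
The two behaviours described for `AnalysisTaskWrapper` and `AnalysisTaskScheduler` (a `FutureTask` subclass that updates state, and manual jobs taking priority over automatic ones) can be sketched roughly as follows. This is a minimal illustration, not the project's code: `AnalysisSchedulingSketch`, `ColumnTask`, `TaskWrapper`, `Trigger`, `State`, and `drainInPriorityOrder` are hypothetical names, and the choice of `done()` as the overridden method and of a priority queue as the ordering mechanism are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative only; none of these names are taken from the real code base. */
class AnalysisSchedulingSketch {
    // Declaration order matters: MANUAL compares lower, so it is polled first.
    enum Trigger { MANUAL, AUTOMATIC }
    enum State { PENDING, RUNNING, FINISHED, FAILED }

    /** Hypothetical task that collects statistics for a single column. */
    static final class ColumnTask implements Comparable<ColumnTask> {
        private static final AtomicLong SEQUENCE = new AtomicLong();
        final String column;
        final Trigger trigger;
        final long sequence = SEQUENCE.getAndIncrement(); // FIFO tie-breaker
        volatile State state = State.PENDING;

        ColumnTask(String column, Trigger trigger) {
            this.column = column;
            this.trigger = trigger;
        }

        @Override
        public int compareTo(ColumnTask other) {
            int byTrigger = trigger.compareTo(other.trigger);
            return byTrigger != 0 ? byTrigger : Long.compare(sequence, other.sequence);
        }
    }

    /** Wraps a task in a FutureTask and records its state transitions. */
    static final class TaskWrapper extends FutureTask<Void> {
        private final ColumnTask task;

        TaskWrapper(ColumnTask task, Runnable body) {
            super(() -> {
                task.state = State.RUNNING;
                body.run();
                return null;
            });
            this.task = task;
        }

        @Override
        protected void done() {
            // Invoked once the task completes normally, fails, or is cancelled.
            if (isCancelled()) {
                task.state = State.FAILED;
                return;
            }
            try {
                get(); // returns immediately because the task is already done
                task.state = State.FINISHED;
            } catch (ExecutionException e) {
                task.state = State.FAILED;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                task.state = State.FAILED;
            }
        }
    }

    /** Returns column names in the order a priority-based scheduler would hand them out. */
    static List<String> drainInPriorityOrder(ColumnTask... tasks) {
        PriorityBlockingQueue<ColumnTask> pending =
                new PriorityBlockingQueue<>(Arrays.asList(tasks));
        List<String> order = new ArrayList<>();
        for (ColumnTask t = pending.poll(); t != null; t = pending.poll()) {
            order.add(t.column);
        }
        return order;
    }
}
```

Keeping the state update inside `done()` means the same code path runs whether the task succeeds, throws, or is cancelled, which is one plausible reason for extending `FutureTask` rather than wrapping a plain `Runnable`.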
## Analyze execution flow
```mermaid
sequenceDiagram
    DdlExecutor->>AnalysisManager: createAnalysisJob
    AnalysisManager->>AnalysisManager: validateAndGetPartitions
    AnalysisManager->>AnalysisManager: createTaskForEachColumns
    AnalysisManager->>AnalysisManager: createTaskForMVIdx
    alt is sync task
        AnalysisManager->>AnalysisManager: syncExecute
    else is async task
        AnalysisManager->>StatisticsRepository: persist
        StatisticsRepository->>BE: write
        AnalysisManager->>AnalysisTaskScheduler: schedule
        AnalysisTaskScheduler->>AnalysisTaskExecutor: notify
        AnalysisTaskExecutor->>AnalysisTaskScheduler: getPendingTasks
        AnalysisTaskExecutor->>ThreadPoolExecutor: submit(AnalysisTaskWrapper)
        ThreadPoolExecutor->>AnalysisTaskWrapper: run
        AnalysisTaskWrapper->>BE: collect && write
        AnalysisTaskWrapper->>StatisticCache: refresh
        AnalysisTaskWrapper->>AnalysisManager: updateTaskStatus
        alt is all task finished
            AnalysisManager->>StatisticsUtil: execUpdate mark job finished
            StatisticsUtil->>BE: update job status
        end
    end
```
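
The core of the flow above (one task per column, a sync or async branch, and marking the job finished once the last task reports back) can be sketched as below. This is a simplified illustration under stated assumptions: `AnalyzeFlowSketch`, `Job`, and `awaitFinished` are hypothetical names, the persist, collect, and cache-refresh steps are reduced to a comment, and a plain `ExecutorService` stands in for the scheduler and executor pair.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only; names and structure are assumptions, not the real classes. */
class AnalyzeFlowSketch {
    /** Hypothetical job that tracks how many per-column tasks are still outstanding. */
    static final class Job {
        private final AtomicInteger remainingTasks;
        private final CountDownLatch finished = new CountDownLatch(1);

        Job(int taskCount) {
            remainingTasks = new AtomicInteger(taskCount);
            if (taskCount == 0) {
                finished.countDown(); // nothing to do, so the job is finished at once
            }
        }

        /** Called by each task when it completes; the last one marks the job finished. */
        void updateTaskStatus() {
            if (remainingTasks.decrementAndGet() == 0) {
                finished.countDown();
            }
        }

        boolean isFinished() {
            return finished.getCount() == 0;
        }

        /** Waits up to timeoutMillis; returns false on timeout or interruption. */
        boolean awaitFinished(long timeoutMillis) {
            try {
                return finished.await(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    /** Creates one task per column and runs it inline (sync) or on the pool (async). */
    static Job createAnalysisJob(List<String> columns, boolean sync, ExecutorService pool) {
        Job job = new Job(columns.size());
        for (String column : columns) {
            // A real task would collect and write statistics for `column`
            // and refresh the cache before reporting its status.
            Runnable task = job::updateTaskStatus;
            if (sync) {
                task.run();
            } else {
                pool.submit(task);
            }
        }
        return job;
    }
}
```

In the sync branch the job is already finished when `createAnalysisJob` returns, while in the async branch the caller gets the job back immediately and completion is observed later.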
## Load execution flow
```mermaid
sequenceDiagram
    StatsCalculator->>StatisticCache: get
    alt is cached
        StatisticCache->>StatsCalculator: return cached stats
    else not cached
        StatisticCache->>StatsCalculator: return UNKNOWN stats
        StatisticCache->>ThreadPoolExecutor: submit load task
        ThreadPoolExecutor->>AsyncTask: get
        AsyncTask->>StatisticsUtil: execStatisticQuery
        alt exception occurred:
            AsyncTask->>StatisticCache: return UNKNOWN stats
            StatisticCache->>StatisticCache: cache UNKNOWN for the column
        else no exception:
            StatisticsUtil->>AsyncTask: Return results rows
            AsyncTask->>StatisticsUtil: deserializeToColumnStatistics(result rows)
            alt exception occurred:
                AsyncTask->>StatisticCache: return UNKNOWN stats
                StatisticCache->>StatisticCache: cache UNKNOWN for the column
            else no exception:
                StatisticCache->>StatisticCache: cache normal stats
            end
        end
    end
```
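
The load flow above amounts to a non-blocking cache lookup: a miss returns UNKNOWN right away and triggers a background load, and any failure during the load caches UNKNOWN for that column. A minimal sketch, assuming hypothetical names (`StatsCacheSketch`, `ColumnStats`) and a `Function` loader that stands in for the query and deserialization steps, with only a subset of the statistics fields:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.function.Function;

/** Illustrative only; the real cache and loader classes are not reproduced here. */
class StatsCacheSketch {
    /** Hypothetical value type holding a subset of the per-column statistics. */
    static final class ColumnStats {
        static final ColumnStats UNKNOWN = new ColumnStats(-1, -1, -1);
        final long count;
        final long ndv;
        final long nullCount;

        ColumnStats(long count, long ndv, long nullCount) {
            this.count = count;
            this.ndv = ndv;
            this.nullCount = nullCount;
        }
    }

    private final Map<String, ColumnStats> cache = new ConcurrentHashMap<>();
    private final Function<String, ColumnStats> loader; // stands in for query + deserialize
    private final Executor executor;

    StatsCacheSketch(Function<String, ColumnStats> loader, Executor executor) {
        this.loader = loader;
        this.executor = executor;
    }

    /** Never blocks on a load: a miss returns UNKNOWN and schedules the load. */
    ColumnStats get(String column) {
        ColumnStats cached = cache.get(column);
        if (cached != null) {
            return cached;
        }
        executor.execute(() -> {
            ColumnStats loaded;
            try {
                loaded = loader.apply(column);
            } catch (RuntimeException e) {
                // Query or deserialization failed: remember UNKNOWN for this column.
                loaded = ColumnStats.UNKNOWN;
            }
            cache.put(column, loaded);
        });
        return ColumnStats.UNKNOWN;
    }
}
```

Returning UNKNOWN instead of waiting keeps the planner from blocking on a statistics query; the cost is that the first plan for a column is produced without its statistics.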
# Configure options
# User interface
# Test