diff --git a/docs/en/docs/query-acceleration/statistics.md b/docs/en/docs/query-acceleration/statistics.md
index 4bb3d941b0..069c25fb1a 100644
--- a/docs/en/docs/query-acceleration/statistics.md
+++ b/docs/en/docs/query-acceleration/statistics.md
@@ -75,7 +75,7 @@ This feature collects statistics only for tables and columns that either have no
 
 For tables with a large amount of data (default is 5GiB), Doris will automatically use sampling to collect statistics, reducing the impact on the system and completing the collection job as quickly as possible. Users can adjust this behavior by setting the `huge_table_lower_bound_size_in_bytes` FE parameter. If you want to collect statistics for all tables in full, you can set the `enable_auto_sample` FE parameter to false. For tables with data size greater than `huge_table_lower_bound_size_in_bytes`, Doris ensures that the collection interval is not less than 12 hours (this time can be controlled using the `huge_table_auto_analyze_interval_in_millis` FE parameter).
 
-The default sample size for automatic sampling is 200,000 rows, but the actual sample size may be larger due to implementation reasons. If you want to sample more rows to obtain more accurate data distribution information, you can configure the `auto_analyze_job_record_count` FE parameter.
+The default sample size for automatic sampling is 4,194,304 (2^22) rows, but the actual sample size may be larger due to implementation reasons. If you want to sample more rows to obtain more accurate data distribution information, you can configure the `huge_table_default_sample_rows` FE parameter.
 
 ### Task Management
 
@@ -234,8 +234,6 @@ Automatic collection tasks do not support viewing the completion status and fail
 | statistics_simultaneously_running_task_num | After submitting asynchronous jobs using `ANALYZE TABLE[DATABASE]`, this parameter limits the number of columns that can be analyzed simultaneously. All asynchronous tasks are collectively constrained by this parameter. | 5 |
 | analyze_task_timeout_in_minutes | Timeout for AnalyzeTask execution. | 12 hours |
 | stats_cache_size| The actual memory usage of statistics cache depends heavily on the characteristics of the data because the average size of maximum/minimum values and the number of buckets in histograms can vary greatly in different datasets and scenarios. Additionally, factors like JVM versions can also affect it. Below is the memory size occupied by statistics cache with 100,000 items. The average length of maximum/minimum values per item is 32, the average length of column names is 16, and the statistics cache occupies a total of 61.2777404785MiB of memory. It is strongly discouraged to analyze columns with very large string values as this may lead to FE memory overflow. | 100000 |
-|full_auto_analyze_start_time|Start time for automatic statistics collection|00:00:00|
-|full_auto_analyze_end_time|End time for automatic statistics collection|02:00:00|
 |enable_auto_sample|Enable automatic sampling for large tables. When enabled, statistics will be automatically collected through sampling for tables larger than the `huge_table_lower_bound_size_in_bytes` threshold.| false|
 |auto_analyze_job_record_count|Controls the persistence of records for automatically triggered statistics collection jobs.|20000|
 |huge_table_default_sample_rows|Defines the number of sample rows for large tables when automatic sampling is enabled.|4194304|
@@ -249,7 +247,7 @@ Automatic collection tasks do not support viewing the completion status and fail
 |full_auto_analyze_end_time|End time for automatic statistics collection|02:00:00|
 |enable_full_auto_analyze|Enable automatic collection functionality|true|
 
-Please note that when both FE configuration and global session variables are configured for the same parameter, the value of the global session variable takes precedence.
+Note: the session variables listed above must be set globally using `SET GLOBAL`.
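As an illustration of the `SET GLOBAL` requirement in the added line above, the automatic-collection session variables can be set from any MySQL-protocol client connected to FE (variable names and defaults are the ones listed in the session-variable table; this is a sketch, not part of the patch):

```sql
-- Must be set globally: a plain SET only affects the current session,
-- which is why the docs change above requires SET GLOBAL.
SET GLOBAL enable_full_auto_analyze = true;
SET GLOBAL full_auto_analyze_start_time = "00:00:00";
SET GLOBAL full_auto_analyze_end_time = "02:00:00";
```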
 
 ## Usage Recommendations
 
@@ -273,6 +271,8 @@ The SQL execution time is controlled by the `query_timeout` session variable, wh
 
 When ANALYZE is executed, statistics data is written to the internal table `__internal_schema.column_statistics`. FE checks the tablet status of this table before executing ANALYZE. If there are unavailable tablets, the task is rejected. Please check the BE cluster status if this error occurs.
 
+Users can use `SHOW BACKENDS\G` to verify the BE (Backend) status. If the BE status is normal, you can use the command `ADMIN SHOW REPLICA STATUS FROM __internal_schema.[tbl_in_this_db]` to check the tablet status within this database and confirm that all tablets are healthy.
+
 ### Failure of ANALYZE on Large Tables
 
 Due to resource limitations, ANALYZE on some large tables may timeout or exceed BE memory limits. In such cases, it is recommended to use `ANALYZE ... WITH SAMPLE...`.
diff --git a/docs/zh-CN/docs/query-acceleration/statistics.md b/docs/zh-CN/docs/query-acceleration/statistics.md
index fd3066995e..d9aac9b678 100644
--- a/docs/zh-CN/docs/query-acceleration/statistics.md
+++ b/docs/zh-CN/docs/query-acceleration/statistics.md
@@ -75,7 +75,7 @@ ANALYZE < TABLE | DATABASE table_name | db_name >
 
 对于数据量较大(默认为5GiB)的表,Doris会自动采取采样的方式去收集,以尽可能降低对系统造成的负担并尽快完成收集作业,用户可通过设置FE参数`huge_table_lower_bound_size_in_bytes`来调节此行为。如果希望对所有的表都采取全量收集,可配置FE参数`enable_auto_sample`为false。同时对于数据量大于`huge_table_lower_bound_size_in_bytes`的表,Doris保证其收集时间间隔不小于12小时(该时间可通过FE参数`huge_table_auto_analyze_interval_in_millis`控制)。
 
-自动采样默认采样200000行,但由于实现方式的原因实际采样数可能大于该值。如果希望采样更多的行以获得更准确的数据分布信息,可通过FE参数`auto_analyze_job_record_count`配置。
+自动采样默认采样4194304(2^22)行,但由于实现方式的原因实际采样数可能大于该值。如果希望采样更多的行以获得更准确的数据分布信息,可通过FE参数`huge_table_default_sample_rows`配置。
 
 ### 任务管理
 
@@ -246,8 +246,6 @@ SHOW AUTO ANALYZE [ptable_name]
 | statistics_simultaneously_running_task_num | 通过`ANALYZE TABLE[DATABASE]`提交异步作业后,可同时analyze的列的数量,所有异步任务共同受到该参数约束 | 5 |
 | analyze_task_timeout_in_minutes | AnalyzeTask执行超时时间 | 12 hours |
 |stats_cache_size| 统计信息缓存的实际内存占用大小高度依赖于数据的特性,因为在不同的数据集和场景中,最大/最小值的平均大小和直方图的桶数量会有很大的差异。此外,JVM版本等因素也会对其产生影响。下面给出统计信息缓存在包含100000个项目时所占用的内存大小。每个项目的最大/最小值的平均长度为32,列名的平均长度为16,统计信息缓存总共占用了61.2777404785MiB的内存。强烈不建议分析具有非常大字符串值的列,因为这可能导致FE内存溢出。 | 100000 |
-|full_auto_analyze_start_time|自动统计信息收集开始时间|00:00:00|
-|full_auto_analyze_end_time|自动统计信息收集结束时间|02:00:00|
 |enable_auto_sample|是否开启大表自动sample,开启后对于大小超过huge_table_lower_bound_size_in_bytes会自动通过采样收集| false|
 |auto_analyze_job_record_count|控制统计信息的自动触发作业执行记录的持久化行数|20000|
 |huge_table_default_sample_rows|定义开启开启大表自动sample后,对大表的采样行数|4194304|
@@ -261,7 +259,7 @@ SHOW AUTO ANALYZE [ptable_name]
 |full_auto_analyze_end_time|自动统计信息收集结束时间|02:00:00|
 |enable_full_auto_analyze|开启自动收集功能|true|
 
-注意,对于fe配置和全局会话变量中均可配置的参数都设置的情况下,优先使用全局会话变量参数值。
+注意:上面列出的会话变量必须通过`SET GLOBAL`全局设置。
 
 ## 使用建议
 
@@ -285,6 +283,8 @@ SQL执行时间受`query_timeout`会话变量控制,该变量默认值为300
 
 执行ANALYZE时统计数据会被写入到内部表`__internal_schema.column_statistics`中,FE会在执行ANALYZE前检查该表tablet状态,如果存在不可用的tablet则拒绝执行任务。出现该报错请检查BE集群状态。
 
+用户可通过`SHOW BACKENDS\G`,确定BE状态是否正常。如果BE状态正常,可使用命令`ADMIN SHOW REPLICA STATUS FROM __internal_schema.[tbl_in_this_db]`,检查该库下tablet状态,确保tablet状态正常。
+
 ### 大表ANALYZE失败
 
 由于ANALYZE能够使用的资源受到比较严格的限制,对一些大表的ANALYZE操作有可能超时或者超出BE内存限制。这些情况下,建议使用 `ANALYZE ... WITH SAMPLE...`。
diff --git a/fe/fe-core/src/main/java/org/apache/doris/statistics/BaseAnalysisTask.java b/fe/fe-core/src/main/java/org/apache/doris/statistics/BaseAnalysisTask.java
index 2eac86bd91..ad74266a7c 100644
--- a/fe/fe-core/src/main/java/org/apache/doris/statistics/BaseAnalysisTask.java
+++ b/fe/fe-core/src/main/java/org/apache/doris/statistics/BaseAnalysisTask.java
@@ -146,11 +146,11 @@ public abstract class BaseAnalysisTask {
     }
 
     protected void init(AnalysisInfo info) {
-        tableSample = getTableSample();
         DBObjects dbObjects = StatisticsUtil.convertIdToObjects(info.catalogId, info.dbId, info.tblId);
         catalog = dbObjects.catalog;
         db = dbObjects.db;
         tbl = dbObjects.table;
+        tableSample = getTableSample();
         // External Table level task doesn't contain a column. Don't need to do the column related analyze.
         if (info.externalTableLevelTask) {
            return;
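The docs changes above recommend `ANALYZE ... WITH SAMPLE...` when a full collection on a large table times out or exceeds BE memory. A sketch of such a statement (the database and table names are hypothetical; the row count simply mirrors the `huge_table_default_sample_rows` default cited in this patch):

```sql
-- Hypothetical db/table; sample a fixed number of rows instead of
-- scanning the whole table, matching the 4194304-row default above.
ANALYZE TABLE example_db.example_tbl WITH SAMPLE ROWS 4194304;
```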