From 19196ae1cfcf529a99a91f34f85e248fa1082a8b Mon Sep 17 00:00:00 2001 From: seawinde <149132972+seawinde@users.noreply.github.com> Date: Tue, 27 Feb 2024 10:06:18 +0800 Subject: [PATCH] [doc](nereids) Optimize query rewerite by materialzied view doc (#31420) * [doc](nereids) Optimize the query rewrite by materialized view doc * [doc](nereids) add more --- .../query-async-materialized-view.md | 113 +++++++++++++----- .../Create/CREATE-ASYNC-MATERIALIZED-VIEW.md | 12 ++ .../query-async-materialized-view.md | 54 ++++++--- .../Create/CREATE-ASYNC-MATERIALIZED-VIEW.md | 10 ++ 4 files changed, 144 insertions(+), 45 deletions(-) diff --git a/docs/en/docs/query-acceleration/async-materialized-view/query-async-materialized-view.md b/docs/en/docs/query-acceleration/async-materialized-view/query-async-materialized-view.md index 8cd13b4e43..0e29d3b415 100644 --- a/docs/en/docs/query-acceleration/async-materialized-view/query-async-materialized-view.md +++ b/docs/en/docs/query-acceleration/async-materialized-view/query-async-materialized-view.md @@ -26,10 +26,13 @@ under the License. ## Overview -Doris's asynchronous materialized views employ a structure based on the SPJG (SELECT-PROJECT-JOIN-GROUP-BY) pattern -for transparent rewriting algorithms. Doris can analyze the structural information of the query SQL, -automatically identify suitable materialized views, and attempt transparent rewriting by expressing the -query SQL using the materialized views. By utilizing precomputed materialized view results, +Doris's asynchronous materialized views employ an algorithm based on the SPJG (SELECT-PROJECT-JOIN-GROUP-BY) pattern +structure information for transparent rewriting. + +Doris can analyze the structural information of query SQL, automatically search for suitable materialized views, +and attempt transparent rewriting, utilizing the optimal materialized view to express the query SQL. + +By utilizing precomputed materialized view results, significant improvements in query performance and a reduction in computational costs can be achieved. Using the three tables: lineitem, orders, and partsupp from TPC-H, let's describe the capability of directly querying @@ -127,11 +130,13 @@ WHERE l_linenumber > 1 and o_orderdate = '2023-12-31'; ## Transparent Rewriting Capability ### Join rewriting -JOIN rewriting refers to the ability to transparently rewrite a query when the tables used in the query and -the materialized view are the same. This rewriting can occur either by joining the materialized view -and the query inside the JOIN clause or by placing conditions in the WHERE clause outside of the JOIN. -Additionally, under certain conditions, when the types of JOINs in the query and the materialized view do not match, -rewriting can still take place. + +Join rewriting refers to when the tables used in the query and the materialization are the same. +In this case, the optimizer will attempt transparent rewriting by either joining the input of the materialized +view with the query or placing the join in the outer layer of the query's WHERE clause. + +This pattern of rewriting is supported for multi-table joins and supports inner and left join types. +Support for other types is continually expanding. **Case 1:** @@ -160,9 +165,10 @@ WHERE l_linenumber > 1 and o_orderdate = '2023-12-31'; **Case 2:** -JOIN Derivation (Coming soon) -When the types of JOINs in the query and the materialized view do not match, but the materialized view can provide -all the data required for the query, transparent rewriting can also occur by compensating predicates above the JOIN. +JOIN Derivation occurs when the join type between the query and the materialized view does not match. +In cases where the materialization can provide all the necessary data for the query, transparent rewriting can +still be achieved by compensating predicates outside the join through predicate push down. + For example: Materialized view definition: @@ -201,6 +207,11 @@ o_orderdate; ``` ### Aggregate rewriting +In the definitions of both the query and the materialized view, the aggregated dimensions can either be consistent or inconsistent. +Filtering of results can be achieved by using fields from the dimensions in the WHERE clause. + +The dimensions used in the materialized view need to encompass those used in the query, +and the metrics utilized in the query can be expressed using the metrics of the materialized view. **Case 1** @@ -245,8 +256,10 @@ o_comment; **Case 2** -The following query can undergo transparent rewriting. The query and the materialized view use inconsistent -dimensions for aggregation, where the dimensions used by the materialized view include those used by the query. +The following query can be transparently rewritten: the query and the materialization use aggregated dimensions +that are inconsistent, but the dimensions used in the materialized view encompass those used in the query. +The query can filter results using fields from the dimensions. + The query will attempt to roll up using the functions after SELECT, such as the materialized view's bitmap_union will eventually roll up into bitmap_union_count, maintaining consistency with the semantics of the count(distinct) in the query. @@ -377,17 +390,53 @@ WHERE o_orderkey > 5 AND o_orderkey <= 10; ## Auxiliary Functions **Data Consistency Issues After Transparent Rewriting** + +The unit of `grace_period` is seconds, referring to the permissible time for inconsistency between the materialized +view and the data in the underlying base tables. + +For example, setting `grace_period` to 0 means requiring the materialized view to be consistent with the base +table data before it can be used for transparent rewriting. As for external tables, +since changes in data cannot be perceived, the materialized view is used with them. +Regardless of whether the data in the external table is up-to-date or not, this materialized view can be used for +transparent rewriting. If the external table is configured with an HMS metadata source, +it becomes capable of perceiving data changes. Configuring the metadata source and enabling data change +perception functionality will be supported in subsequent iterations. + +Setting `grace_period` to 10 means allowing a 10-second delay between the data in the materialized view and +the data in the base tables. If there is a delay of up to 10 seconds between the data in the materialized +view and the data in the base tables, the materialized view can still be used for transparent rewriting within +that time frame. + For internal tables in the materialized view, you can control the maximum delay allowed for the data used by -the transparent rewriting by setting the grace_period property. +the transparent rewriting by setting the `grace_period` property. Refer to [CREATE-ASYNC-MATERIALIZED-VIEW](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md) **Viewing and Debugging Transparent Rewrite Hit Information** -You can use the following statements to view the hit information of transparent rewriting for a materialized view. It will display a concise overview of the transparent rewriting process. +You can use the following statements to view the hit information of transparent rewriting for a materialized view. +It will display a concise overview of the transparent rewriting process. -`explain ` +`explain ` The information returned is as follows, with the relevant information pertaining to materialized views extracted: +```text +| MaterializedView | +| MaterializedViewRewriteFail: | +| MaterializedViewRewriteSuccessButNotChose: | +| Names: | +| MaterializedViewRewriteSuccessAndChose: | +| Names: mv1 +``` -If you want to know the detailed information about materialized view candidates, rewriting, and the final selection process, you can execute the following statement. It will provide a detailed breakdown of the transparent rewriting process. +**MaterializedViewRewriteFail**: Lists transparent rewrite failures and summarizes the reasons. + +**MaterializedViewRewriteSuccessButNotChose**: Transparent rewrite succeeded, but the final CBO did not choose the +materialized view names list. + +**MaterializedViewRewriteSuccessAndChose**: Transparent rewrite succeeded, and the materialized view names list +chosen by the CBO. + + +If you want to know the detailed information about materialized view candidates, rewriting, and the final selection process, +you can execute the following statement. It will provide a detailed breakdown of the transparent rewriting process. `explain memo plan ` @@ -401,14 +450,20 @@ If you want to know the detailed information about materialized view candidates, ## Limitations -- The materialized view definition statement only allows SELECT, FROM, WHERE, JOIN, and GROUP BY statements, and -the input to JOIN cannot contain GROUP BY. Only INNER and LEFT OUTER JOIN types are currently supported; other -types of JOIN operations will be supported gradually. -- Materialized views based on External Tables do not guarantee strong consistency of query results. -- No support for rewriting non-deterministic functions, including rand, now, current_time, current_date, random, uuid, etc. -- No support for rewriting window functions. -- The definition of materialized views currently cannot use views and other materialized views. -- Currently, WHERE condition compensation supports cases where the materialized view has no WHERE clause, and -the query has a WHERE clause; or the materialized view has a WHERE clause, and the query's WHERE condition is a -superset of the materialized view's. Currently, range condition compensation is not yet supported, -such as the materialized view definition being a > 5, and the query being a > 10. +- The materialized view definition statement only allows SELECT, FROM, WHERE, JOIN, and GROUP BY clauses. +The input for JOIN can include simple GROUP BY (aggregation on a single table). +Supported types of JOIN operations include INNER and LEFT OUTER JOIN. +Support for other types of JOIN operations will be gradually added. + +- Materialized views based on External Tables do not guarantee strong consistency in query results. + +- The use of non-deterministic functions to build materialized views is not supported, +including rand, now, current_time, current_date, random, uuid, etc. + +- Transparent rewriting does not support window functions and LIMIT. + +- Currently, materialized view definitions cannot utilize views or other materialized views. + +- Currently, WHERE clause compensation supports scenarios where the materialized view does not have a WHERE clause, +but the query does, or where the materialized view has a WHERE clause and the query's WHERE clause is a superset +of the materialized view's. Range condition compensation is not yet supported but will be added gradually. diff --git a/docs/en/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md b/docs/en/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md index 91263a6ff5..36152eef73 100644 --- a/docs/en/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md +++ b/docs/en/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md @@ -165,6 +165,18 @@ There are two types of partitioning methods for materialized views. If no partit For example, if the base table is a range partition with a partition field of `create_time` and partitioning by day, and `partition by(ct) as select create_time as ct from t1` is specified when creating a materialized view, then the materialized view will also be a range partition with a partition field of 'ct' and partitioning by day +The selection of partition fields and the definition of materialized views must meet the following constraints to be successfully created; +otherwise, an error "Unable to find a suitable base table for partitioning" will occur: + +- At least one of the base tables used by the materialized view must be a partitioned table. +- Partitioned tables used by the materialized view must employ list or range partitioning strategies. +- The top-level partition column in the materialized view can only have one partition field. +- The SQL of the materialized view needs to use partition columns from the base table. +- If GROUP BY is used, the partition column fields must be after the GROUP BY. +- If window functions are used, the partition column fields must be after the PARTITION BY. +- Data changes should occur on partitioned tables. If they occur on non-partitioned tables, the materialized view needs to be fully rebuilt. +- Using the fields that generate nulls in the JOIN as partition fields in the materialized view prohibits partition incremental updates. + #### property The materialized view can specify both the properties of the table and the properties unique to the materialized view. diff --git a/docs/zh-CN/docs/query-acceleration/async-materialized-view/query-async-materialized-view.md b/docs/zh-CN/docs/query-acceleration/async-materialized-view/query-async-materialized-view.md index 1921498f85..ab6a7692c0 100644 --- a/docs/zh-CN/docs/query-acceleration/async-materialized-view/query-async-materialized-view.md +++ b/docs/zh-CN/docs/query-acceleration/async-materialized-view/query-async-materialized-view.md @@ -25,9 +25,9 @@ under the License. --> ## 概述 -Doris 的异步物化视图采用了基于 SPJG(SELECT-PROJECT-JOIN-GROUP-BY)模式的结构信息来进行透明改写的算法。 +Doris 的异步物化视图采用了基于 SPJG(SELECT-PROJECT-JOIN-GROUP-BY)模式结构信息来进行透明改写的算法。 -Doris 可以分析查询 SQL 的结构信息,自动寻找满足要求的物化视图,并尝试进行透明改写,使用物化视图来表达查询SQL。 +Doris 可以分析查询 SQL 的结构信息,自动寻找满足要求的物化视图,并尝试进行透明改写,使用最优的物化视图来表达查询SQL。 通过使用预计算的物化视图结果,可以大幅提高查询性能,减少计算成本。 @@ -126,9 +126,9 @@ WHERE l_linenumber > 1 and o_orderdate = '2023-12-31'; ## 透明改写能力 ### JOIN 改写 -JOIN 改写指的是查询和物化使用的表相同,可以在物化视图和查询 JOIN 的内部输入或者 JOIN 的外部写 WHERE,可以进行改写。 +Join 改写指的是查询和物化使用的表相同,可以在物化视图和查询 Join 的输入或者 Join 的外层写 where,优化器对此 pattern 的查询会尝试进行透明改写。 -当查询和物化视图的 Join 的类型不同时,满足一定条件时,也可以进行改写。 +支持多表 Join,支持 Join 的类型为 inner,left。其他类型在不断拓展中。 **用例1:** @@ -155,9 +155,9 @@ WHERE l_linenumber > 1 and o_orderdate = '2023-12-31'; **用例2:** -JOIN衍生(Coming soon) -当查询和物化视图的 JOIN 的类型不一致时,但物化可以提供查询所需的所有数据时,通过在 JOIN 的外部补偿谓词,也可以进行透明改写, -举例如下,待支持。 +JOIN衍生,当查询和物化视图的 JOIN 的类型不一致时,如果物化可以提供查询所需的所有数据时,通过在 JOIN 的外部补偿谓词,也可以进行透明改写, + +举例如下 mv 定义: ```sql @@ -195,6 +195,9 @@ o_orderdate; ``` ### 聚合改写 +查询和物化视图定义中,聚合的维度可以一致或者不一致,可以使用维度中的字段写 WHERE 对结果进行过滤。 + +物化视图使用的维度需要包含查询的维度,并且查询使用的指标可以使用物化视图的指标来表示。 **用例1** @@ -236,11 +239,10 @@ o_comment; **用例2** -如下查询可以进行透明改写,查询和物化使用聚合的维度不一致,物化视图使用的维度包含查询的维度。 可以使用维度中的字段进行过滤结果, +如下查询可以进行透明改写,查询和物化使用聚合的维度不一致,物化视图使用的维度包含查询的维度。 查询可以使用维度中的字段对结果进行过滤, 查询会尝试使用物化视图 SELECT 后的函数进行上卷,如物化视图的 `bitmap_union` 最后会上卷成 `bitmap_union_count`,和查询中 - -`count(distinct)` 的语义 保持一致。 +`count(distinct)` 的语义保持一致。 mv 定义: ```sql @@ -360,14 +362,34 @@ WHERE o_orderkey > 5 AND o_orderkey <= 10; ## 辅助功能 **透明改写后数据一致性问题** -对于物化视图中的内表,可以通过设定 `grace_period`属性来控制透明改写使用的物化视图所允许数据最大的延迟时间。 +`grace_period` 的单位是秒,指的是容许物化视图和所用基表数据不一致的时间。 +比如 `grace_period` 设置成0,意味要求物化视图和基表数据保持一致,此物化视图才可用于透明改写;对于外表,因为无法感知数据变更,所以物化视图使用了外表, + +无论外表的数据是不是最新的,都可以使用此物化视图用于透明改写,如果外表配置了 HMS 元数据源,是可以感知数据变更的,配置数据源和感知数据变更的功能会在后面迭代支持。 + +如果设置成10,意味物化视图和基表数据允许10s的延迟,如果物化视图的数据和基表的数据有延迟,如果在10s内,此物化视图都可以用于透明改写。 + +对于物化视图中的内表,可以通过设定 `grace_period` 属性来控制透明改写使用的物化视图所允许数据最大的延迟时间。 可查看 [CREATE-ASYNC-MATERIALIZED-VIEW](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md) **查询透明改写命中情况查看和调试** 可通过如下语句查看物化视图的透明改写命中情况,会展示查询透明改写简要过程信息。 -`explain ` +`explain ` 返回的信息如下,截取了物化视图相关的信息 +```text +| MaterializedView | +| MaterializedViewRewriteFail: | +| MaterializedViewRewriteSuccessButNotChose: | +| Names: | +| MaterializedViewRewriteSuccessAndChose: | +| Names: mv1 +``` +**MaterializedViewRewriteFail**:列举透明改写失败及原因摘要。 + +**MaterializedViewRewriteSuccessButNotChose**:透明改写成功,但是最终CBO没有选择的物化视图名称列表。 + +**MaterializedViewRewriteSuccessAndChose**:透明改写成功,并且CBO选择的物化视图名称列表。 如果想知道物化视图候选,改写和最终选择情况的过程详细信息,可以执行如下语句,会展示透明改写过程详细的信息。 @@ -383,11 +405,11 @@ WHERE o_orderkey > 5 AND o_orderkey <= 10; ## 限制 -- 物化视图定义语句中只允许包含 SELECT、FROM、WHERE、JOIN、GROUP BY 语句,并且 JOIN 的输入不能包含 GROUP BY,其中JOIN的支持的类型为 +- 物化视图定义语句中只允许包含 SELECT、FROM、WHERE、JOIN、GROUP BY 语句,JOIN 的输入可以包含简单的 GROUP BY(单表聚合),其中JOIN的支持的类型为 INNER 和 LEFT OUTER JOIN 其他类型的 JOIN 操作逐步支持。 - 基于 External Table 的物化视图不保证查询结果强一致。 -- 不支持非确定性函数的改写,包括 rand、now、current_time、current_date、random、uuid等。 -- 不支持窗口函数的改写。 +- 不支持使用非确定性函数来构建物化视图,包括 rand、now、current_time、current_date、random、uuid等。 +- 不支持窗口函数和 LIMIT 的透明改写。 - 物化视图的定义暂时不能使用视图和物化视图。 - 目前 WHERE 条件补偿,支持物化视图没有 WHERE,查询有 WHERE情况的条件补偿;或者物化视图有 WHERE 且查询的 WHERE 条件是物化视图的超集。 -目前暂时还不支持,范围的条件补偿,比如物化视图定义是 a > 5,查询是 a > 10。 \ No newline at end of file +目前暂时还不支持范围的条件补偿,比如物化视图定义是 a > 5,查询是 a > 10,逐步支持。 \ No newline at end of file diff --git a/docs/zh-CN/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md b/docs/zh-CN/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md index 52050763c2..7a616d3848 100644 --- a/docs/zh-CN/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md +++ b/docs/zh-CN/docs/sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-ASYNC-MATERIALIZED-VIEW.md @@ -165,6 +165,16 @@ KEY(k1,k2) 例如:基表是range分区,分区字段为`create_time`并按天分区,创建物化视图时指定`partition by(ct) as select create_time as ct from t1` 那么物化视图也会是range分区,分区字段为`ct`,并且按天分区 +分区字段的选择和物化视图的定义需要满足如下约束才可以创建成功,否则会报错 `Unable to find a suitable base table for partitioning` +- 物化视图使用的 base table 中至少有一个是分区表。 +- 物化视图使用的分区表,必须使用 list 或者 range 分区策略。 +- 物化视图最顶层的分区列只能有一个分区字段。 +- 物化视图的 SQL 需要使用了 base table 中的分区列。 +- 如果使用了group by,分区列的字段一定要在 group by 后。 +- 如果使用了 window 函数,分区列的字段一定要在partition by后。 +- 数据变更应发生在分区表上,如果发生在非分区表,物化视图需要全量构建。 +- 物化视图使用 Join 的 null 产生端的字段作为分区字段,不能分区增量更新。 + #### property 物化视图既可以指定table的property,也可以指定物化视图特有的property。