Rewrite func(para) over (partition by unique_keys). Each partition holds exactly one row, so:
1. if func() is count(non-null) or rank/dense_rank/row_number, the result is 1
2. if func(para) is min/max/sum/avg/first_value/last_value, the result is para
e.g.
select max(c1) over(partition by pk) from t1;
-> select c1 from t1;
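
The equivalence behind this rewrite can be checked on a small table. The following is only an illustrative sketch that uses Python's built-in `sqlite3` module (assuming a bundled SQLite of 3.25 or later, which is needed for window functions) rather than Doris itself; the sample rows are made up.

```python
import sqlite3

# In-memory table whose primary key makes every partition a single row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (pk INTEGER PRIMARY KEY, c1 INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?, ?)", [(1, 10), (2, None), (3, 30)])


def column(sql):
    """Return the first column of the result set as a list."""
    return [row[0] for row in conn.execute(sql)]


# Rule 2: max(c1) over a one-row partition is just c1 (NULL stays NULL).
windowed_max = column("SELECT max(c1) OVER (PARTITION BY pk) FROM t1 ORDER BY pk")
plain_c1 = column("SELECT c1 FROM t1 ORDER BY pk")

# Rule 1: row_number() and count(<non-null column>) are always 1.
windowed_row_number = column("SELECT row_number() OVER (PARTITION BY pk) FROM t1 ORDER BY pk")
windowed_count_pk = column("SELECT count(pk) OVER (PARTITION BY pk) FROM t1 ORDER BY pk")

# count() over a nullable column is NOT always 1, hence the non-null condition.
windowed_count_c1 = column("SELECT count(c1) OVER (PARTITION BY pk) FROM t1 ORDER BY pk")
```

The last query shows why rule 1 is restricted to a non-null argument: for the row where `c1` is NULL the count is 0, not 1.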
Nereids supports the ALTER VIEW statement.
e.g. ALTER VIEW example_db.example_view
(
c1 COMMENT "column 1",
c2 COMMENT "column 2",
c3 COMMENT "column 3"
)
AS SELECT k1, k2, SUM(v1) FROM example_table
GROUP BY k1, k2
Currently, if writing the OP_TIMESTAMP operation fails, the next journal id is
reset and the call returns. However, BDBJE replicates committed txns (those
persisted only in the local BDB log and not yet replicated to other members) to
the FOLLOWERs once the connection resumes. Resetting the next journal id and
returning directly therefore causes a subsequent txn written to the same journal
id not to be replayed by the FOLLOWERs. So for the OP_TIMESTAMP operation, keep
retrying the write until it succeeds.
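
The "write until it succeeds" behaviour can be pictured as a plain retry loop. This is a minimal sketch, not the Doris implementation: `write_until_success`, `write_once` and the flaky writer below are hypothetical names invented for illustration.

```python
import time


def write_until_success(write_once, retry_interval_seconds=0.0, sleep=time.sleep):
    """Call write_once() until it stops raising OSError; return the attempt count.

    Hypothetical helper: the journal id is never reset between attempts,
    so the same logical write is simply repeated.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            write_once()
            return attempts
        except OSError:
            # The write failed (e.g. connection lost); wait and try again.
            sleep(retry_interval_seconds)


# Made-up writer that fails twice before succeeding.
failures_left = [2]


def flaky_timestamp_write():
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise OSError("simulated write failure")


attempts_needed = write_until_success(flaky_timestamp_write)
```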
The order of the partition keys recorded in the PARTITION_KEYS table is defined by the INTEGER_IDX field, so we need to add an 'order by' clause on that field to ensure that the obtained partition names are ordered.
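
A small sketch of the effect, using Python's built-in `sqlite3` as a stand-in for the metastore database. The table is reduced to three columns and the rows are made up; only the ORDER BY on INTEGER_IDX is the point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the metastore table; real schemas have more columns.
conn.execute(
    "CREATE TABLE PARTITION_KEYS (TBL_ID INTEGER, PKEY_NAME TEXT, INTEGER_IDX INTEGER)"
)
# Rows are deliberately inserted out of order.
conn.executemany(
    "INSERT INTO PARTITION_KEYS VALUES (?, ?, ?)",
    [(1, "day", 2), (1, "year", 0), (1, "month", 1)],
)

# Without ORDER BY the row order is unspecified; with it the names follow INTEGER_IDX.
partition_names = [
    row[0]
    for row in conn.execute(
        "SELECT PKEY_NAME FROM PARTITION_KEYS WHERE TBL_ID = ? ORDER BY INTEGER_IDX",
        (1,),
    )
]
```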
Many domestic cloud vendors are compatible with the s3 protocol. However, early versions of the s3 client only generate path style http requests (https://github.com/aws/aws-sdk-java-v2/pull/763) when they encounter endpoints that do not start with s3, while some cloud vendors only support virtual host style http requests.
Therefore, Doris used `forceVirtualHosted` in `S3URI` to convert the uri into a virtual hosted path, and implemented this through path style.
For example:
For s3 uri `s3://my-bucket/data/file.txt`, it will eventually be parsed into:
- virtualBucket: my-bucket
- Bucket: data (bucket must be set, otherwise the s3 client will report an error). This step is particularly tricky because of the limitations of the s3 client.
- Key: file.txt
The path style mode is then used to generate an http request that looks like a virtual host style request, by setting the endpoint to virtualBucket + original endpoint and setting the bucket and key as above.
**The bucket and key here are inconsistent with the original concepts of s3; the aws client just happens to be able to generate an http request similar to the virtual host style through the path style mode.**
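
The trick described above can be sketched as follows. This is an illustration only, not the removed Doris code: the function names and the endpoint `cos.example.com` are hypothetical, and joining virtualBucket and the endpoint with a dot is an assumption.

```python
from urllib.parse import urlsplit


def legacy_virtual_bucket_parse(location):
    """Split an s3:// uri the way the old virtual bucket mechanism is described."""
    parts = urlsplit(location)
    virtual_bucket = parts.netloc  # the real bucket name
    # The first path segment is (mis)used as the "bucket", the rest as the key.
    bucket, _, key = parts.path.lstrip("/").partition("/")
    return virtual_bucket, bucket, key


def path_style_request_url(endpoint, virtual_bucket, bucket, key):
    """Path style request against an endpoint prefixed with the virtual bucket."""
    return f"https://{virtual_bucket}.{endpoint}/{bucket}/{key}"


virtual_bucket, bucket, key = legacy_virtual_bucket_parse("s3://my-bucket/data/file.txt")
request_url = path_style_request_url("cos.example.com", virtual_bucket, bucket, key)
```

The resulting URL has the real bucket in the host name, so it is indistinguishable from a virtual host style request even though the client believes it is issuing a path style request for bucket `data`.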
However, #30799 upgraded the aws sdk version from 2.17.257 to 2.20.131. The current aws s3 client can already generate virtual host style http requests for third-party endpoints by default. So #31111 had to set the path style option to let the s3 client keep working with Doris' virtual bucket mechanism.
**Finally, the virtual bucket mechanism is too confusing and tricky, and we no longer need it with the new version of s3 client.**
### Resolution:
Rewrite `S3URI` to remove the tricky virtual bucket mechanism and support different uri styles via flags.
This class represents a fully qualified location in S3 for input/output operations, expressed as a URI.
#### For AWS S3, URI common styles:
- AWS Client Style(Hadoop S3 Style): `s3://my-bucket/path/to/file?versionId=abc123&partNumber=77&partNumber=88`
- Virtual Host Style: `https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
- Path Style: `https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
For the common styles above, we can use <code>isPathStyle</code> to control whether path style
or virtual host style is used.
"Virtual host style" is currently the mainstream and recommended approach, so the default value of
<code>isPathStyle</code> is false.
#### Other Styles:
- Virtual Host AWS Client (Hadoop S3) Mixed Style:
`s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
- Path AWS Client (Hadoop S3) Mixed Style:
`s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
For these two styles, we can use <code>isPathStyle</code> and <code>forceParsingByStandardUri</code>
to select which one is used.
Virtual Host AWS Client (Hadoop S3) Mixed Style: <code>isPathStyle = false && forceParsingByStandardUri = true</code>
Path AWS Client (Hadoop S3) Mixed Style: <code>isPathStyle = true && forceParsingByStandardUri = true</code>
When the incoming location is url encoded, the encoded string is returned as is:
<code>getKey()</code> and <code>getQueryParams()</code> return the encoded strings.
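
The flag combinations above can be sketched as a small parser. This is a hedged Python illustration, not the Java `S3URI` class: the function name and the returned dict are hypothetical, and treating every http(s) location as a standard uri is an assumption.

```python
from urllib.parse import urlsplit


def parse_s3_location(location, is_path_style=False, force_parsing_by_standard_uri=False):
    """Split a location into bucket, endpoint, key and raw query parameters."""
    parts = urlsplit(location)
    path = parts.path.lstrip("/")  # urlsplit does not decode, so encoding is kept
    standard = force_parsing_by_standard_uri or parts.scheme in ("http", "https")
    if not standard:
        # AWS client (Hadoop S3) style: s3://bucket/key
        bucket, endpoint, key = parts.netloc, None, path
    elif is_path_style:
        # Path style: host is the endpoint, first path segment is the bucket.
        endpoint = parts.netloc
        bucket, _, key = path.partition("/")
    else:
        # Virtual host style: first host label is the bucket.
        bucket, _, endpoint = parts.netloc.partition(".")
        key = path
    query = {}
    for pair in filter(None, parts.query.split("&")):
        name, _, value = pair.partition("=")
        query.setdefault(name, []).append(value)  # repeated names keep all values
    return {"bucket": bucket, "endpoint": endpoint, "key": key, "query": query}


QUERY = "?versionId=abc123&partNumber=77&partNumber=88"
aws_client = parse_s3_location("s3://my-bucket/path/to/file" + QUERY)
virtual_host = parse_s3_location(
    "https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt" + QUERY
)
path_style = parse_s3_location(
    "https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt" + QUERY,
    is_path_style=True,
)
mixed_virtual_host = parse_s3_location(
    "s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt" + QUERY,
    force_parsing_by_standard_uri=True,
)
mixed_path = parse_s3_location(
    "s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt" + QUERY,
    is_path_style=True,
    force_parsing_by_standard_uri=True,
)
encoded = parse_s3_location("s3://my-bucket/dir/a%20b.txt")
```

In this sketch the key and query values are never decoded, which mirrors the statement that an encoded location yields encoded strings.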
run: select TABLE_SCHEMA as a, sum(TABLE_ROWS) as b from tables group by TABLE_SCHEMA limit 2;
old output:
TABLE_SCHEMA Nullable(Int64)_1
0 regression_test_mv_p0_sum_count 9
1 regression_test_query_p0_sql_functions_string_functions 70414
new output:
a b
0 regression_test_mv_p0_sum_count 9
1 regression_test_query_p0_sql_functions_string_functions 70414
Support query rewriting by nested materialized views.
For example, `inner_mv` is defined as follows:
> select
> l_linenumber,
> o_custkey,
> o_orderkey,
> o_orderstatus,
> l_partkey,
> l_suppkey,
> l_orderkey
> from lineitem
> inner join orders on lineitem.l_orderkey = orders.o_orderkey;
and `mv1_0` is defined as follows:
> select
> l_linenumber,
> o_custkey,
> o_orderkey,
> o_orderstatus,
> l_partkey,
> l_suppkey,
> l_orderkey,
> ps_availqty
> from inner_mv
> inner join partsupp on l_partkey = ps_partkey AND l_suppkey = ps_suppkey;
For the following query, rewriting by materialized view succeeds with both `inner_mv` and `mv1_0`, and the CBO finally chooses `mv1_0`.
> select lineitem.l_linenumber
> from lineitem
> inner join orders on l_orderkey = o_orderkey
> inner join partsupp on l_partkey = ps_partkey AND l_suppkey = ps_suppkey
> where o_orderstatus = 'o' AND l_linenumber in (1, 2, 3, 4, 5)
* [enhancement](Nereids) Enable parse sql from sql cache (#33262)
Before this PR, a query had to pass through the parser, analyzer, rewriter, optimizer and translator before we could check whether it can use the sql cache. If the query is very long or joins many tables, the plan time is usually >= 500ms.
This PR reduces that time by skipping the regular planning path, because the previous physical plan and query result can be reused if nothing has changed. In some cases we should not parse sql from the sql cache, e.g. when the table structure, data, user policies, privileges or user variables have changed, or when the query contains non-deterministic functions.
In my test case, which queries a view with lots of joins and unions over tables with empty partitions, the query latency is about 3ms. Without parsing sql from the sql cache, the plan time is about 550ms.
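
The idea of looking up the cache by sql text before planning, and falling back when something has changed, can be sketched like this. It is a simplified illustration and not the Nereids implementation: the class and method names are hypothetical, and only a data version check per table is modeled out of the invalidation conditions listed above.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    result: list
    table_versions: dict  # table name -> data version at caching time


class SqlCache:
    """Hypothetical frontend-side cache keyed by (user, sql text)."""

    def __init__(self):
        self._entries = {}

    def put(self, user, sql, result, table_versions):
        self._entries[(user, sql)] = CacheEntry(result, dict(table_versions))

    def try_get(self, user, sql, current_versions):
        """Return the cached result, or None if the query must be planned normally."""
        entry = self._entries.get((user, sql))
        if entry is None:
            return None
        for table, version in entry.table_versions.items():
            if current_versions.get(table) != version:
                # Data changed since caching: drop the entry and plan from scratch.
                del self._entries[(user, sql)]
                return None
        return entry.result


cache = SqlCache()
cache.put("root", "select * from test.t", [(1, 2), (-2, -2), (None, 30)], {"test.t": 7})
hit = cache.try_get("root", "select * from test.t", {"test.t": 7})
other_user = cache.try_get("guest", "select * from test.t", {"test.t": 7})
after_load = cache.try_get("root", "select * from test.t", {"test.t": 8})
```

Only the first lookup is served from the cache; the lookup by another user and the lookup after a data change both fall through to normal planning.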
## Features
1. Use Config.sql_cache_manage_num to control how many sql cache entries can be reused on one FE.
2. If the explain output contains `LogicalSqlCache` or `PhysicalSqlCache`, the query can use the sql cache, like this:
```sql
mysql> set enable_sql_cache=true;
Query OK, 0 rows affected (0.00 sec)
mysql> explain physical plan select * from test.t;
+----------------------------------------------------------------------------------+
| Explain String(Nereids Planner) |
+----------------------------------------------------------------------------------+
| cost = 3.135 |
| PhysicalResultSink[53] ( outputExprs=[c1#0, c2#1] ) |
| +--PhysicalDistribute[50]@0 ( stats=3, distributionSpec=DistributionSpecGather ) |
| +--PhysicalOlapScan[t]@0 ( stats=3 ) |
+----------------------------------------------------------------------------------+
4 rows in set (0.02 sec)
mysql> select * from test.t;
+------+------+
| c1 | c2 |
+------+------+
| 1 | 2 |
| -2 | -2 |
| NULL | 30 |
+------+------+
3 rows in set (0.05 sec)
mysql> explain physical plan select * from test.t;
+-------------------------------------------------------------------------------------------+
| Explain String(Nereids Planner) |
+-------------------------------------------------------------------------------------------+
| cost = 0.0 |
| PhysicalSqlCache[2] ( queryId=78511f515cda466b-95385d892d6c68d0, backend=127.0.0.1:9050 ) |
| +--PhysicalResultSink[52] ( outputExprs=[c1#0, c2#1] ) |
| +--PhysicalDistribute[49]@0 ( stats=3, distributionSpec=DistributionSpecGather ) |
| +--PhysicalOlapScan[t]@0 ( stats=3 ) |
+-------------------------------------------------------------------------------------------+
5 rows in set (0.01 sec)
```
(cherry picked from commit 03bd2a337d4a56ea9c91673b3bd4ae518ed10f20)
* fix
* [fix](Nereids) fix some sql cache consistence bug between multiple frontends (#33722)
Fix some sql cache consistency bugs between multiple frontends, introduced by [enhancement](Nereids) Enable parse sql from sql cache #33262, by using the row policy as part of the sql cache key.
Support dynamically updating the number of sql cache keys managed by the FE.
(cherry picked from commit 90abd76f71e73702e49794d375ace4f27f834a30)
* [fix](Nereids) fix bug of dry run query with sql cache (#33799)
1. dry run query should not use sql cache
2. fix test sql cache in cloud mode
3. enable caching OneRowRelation and EmptyRelation in the frontend to skip sql parsing
(cherry picked from commit dc80ecf7f33da7b8c04832dee88abd09f7db9ffe)
* remove cloud mode
* remove @NotNull