problem:
1. create a iceberg_type catalog:
2. use iceberg catalog to specify verison
```
mysql> show catalog iceberg;
+----------------------+--------------------------+
| Key | Value |
+----------------------+--------------------------+
| type | iceberg |
| iceberg.catalog.type | hms |
| hive.metastore.uris | thrift://127.0.0.1:9083 |
| hadoop.username | hadoop |
| create_time | 2023-07-25 16:51:00.522 |
+----------------------+--------------------------+
5 rows in set (0.02 sec)
mysql> select * from iceberg.iceberg_db.tb1 FOR VERSION AS OF 8783036402036752909;
ERROR 5090 (42000): errCode = 2, detailMessage = Only iceberg/hudi external table supports time travel in current version
```
change:
Add `ICEBERG_EXTERNAL_TABLE` type for specify the version and time
Currently, the new optimizer don't consider anything about partial update.
This PR add the ability to convert a delete statement to a partial update insert statement
for merge-on-write unique table
First of all, mysql does not have a boolean type, its boolean type is actually tinyint(1), in the previous logic, We force tinyint(1) to be a boolean by passing tinyInt1isBit=true, which causes an error if tinyint(1) is not a 0 or 1, Therefore, we need to match tinyint(1) according to tinyint instead of boolean, and this change will not affect the correctness of where k = 1 or where k = true queries
In this PR, we introduce TOKENIZE function for inverted index, it is used as following:
```
SELECT TOKENIZE('I love my country', 'english');
```
It has two arguments, first is text which has to be tokenized, the second is parser type which can be **english**, **chinese** or **unicode**.
It also can be used with existing table, like this:
```
mysql> SELECT TOKENIZE(c,"chinese") FROM chinese_analyzer_test;
+---------------------------------------+
| tokenize(`c`, 'chinese') |
+---------------------------------------+
| ["来到", "北京", "清华大学"] |
| ["我爱你", "中国"] |
| ["人民", "得到", "更", "实惠"] |
+---------------------------------------+
```
the global limit will create a gather action, and all the data will be calculated in one instance. If we push down the global limit, the node run after the limit node will run slowly.
We fix it by push down only local limit.
a join plan tree before fixing:
```
LogicalLimit(global)
LogicalLimit(local)
Plan()
LogicalLimit(global)
LogicalLimit(local)
LogicalJoin
LogicalLimit(global)
LogicalLimit(local)
Plan()
LogicalLimit(global)
LogicalLimit(local)
Plan()
after fixing:
LogicalLimit(global)
LogicalLimit(local)
Plan()
LogicalLimit(local)
LogicalJoin
LogicalLimit(local)
Plan()
LogicalLimit(local)
Plan()
```
New aggregation function: map_agg.
This function requires two arguments: a key and a value, which are used to build a map.
select map_agg(column1, column2) from t group by column3;
### Issue
Dictionary filtering is a mechanism that directly reads the dictionary encoding of a single string column filter condition for filter comparison. But dictionary filtered single string columns may be included in other multi-column filter conditions. This can cause problems.
For example:
`select * from multi_catalog.lineitem_string_date_orc where l_commitdate < l_receiptdate and l_receiptdate = '1995-01-01' order by l_orderkey, l_partkey, l_suppkey, l_linenumber limit 10;`
`l_receiptdate` is string filter column,it is included by multi-column filter condition `l_commitdate < l_receiptdate`.
### Solution
Resolve it by separating the multi-column filter conditions and executing it after the dictionary filter column is converted to string.
we convert input parameters to double for function ceil, floor and round,
because DecimalV2 could not do these operation. Since we intro DecimalV3,
we should convert all parameters to DecimalV3 to get correct result.
For example, when we use double as parameters, we get wrong result:
```sql
select round(341/20000,4),341/20000,round(0.01705,4);
+-------------------------+---------------+-------------------+
| round((341 / 20000), 4) | (341 / 20000) | round(0.01705, 4) |
+-------------------------+---------------+-------------------+
| 0.017 | 0.01705 | 0.0171 |
+-------------------------+---------------+-------------------+
```
DecimalV3 could get correct result
```sql
select round(341/20000,4),341/20000,round(0.01705,4);
+-------------------------+---------------+-------------------+
| round((341 / 20000), 4) | (341 / 20000) | round(0.01705, 4) |
+-------------------------+---------------+-------------------+
| 0.0171 | 0.01705 | 0.0171 |
+-------------------------+---------------+-------------------+
```
According the implementation in execution engine, all order keys
in SortNode will be output. We must normalize LogicalSort follow
by it.
We push down all non-slot order key in sort to materialize them
behind sort. So, all order key will be slot and do not need do
projection by SortNode itself.
This will simplify translation of SortNode by avoid to generate
resolvedTupleExprs and sortTupleDesc.
columnStatistics.minExpr and maxExpr is useful when we derive stats for cast function.
This pr
1. maintains the min/max expr during stats derive in filter condition: col<literal, col>literal and col=literal
2. adjust column stats range for cast function (now only support cast from string to other types)
ds9 is changed, but no performance issue: on tpcds_sf100_rf exe time is 1.5~1.6sec, the same as master
JdbcScanNode need to use the conjuncts to generate sql in finalize function. But the conjuncts have not passed to JdbcScanNode yet while calling finalize. This pr is to pass the conjuncts to scan node before using it to avoid scan the whole table.
With bellow json path
`["$.data","$.data.datatimestamp"]`
After `array_obj->PushBack` the `data` field owner will be taken from array_obj, and lead to null values for json path `$.data.datatimestamp`
Rapidjson doc:
```
//! Append a GenericValue at the end of the array.
\note The ownership of \c value will be transferred to this array on success.
*/
GenericValue& PushBack(GenericValue& value, Allocator& allocator);
```
The getTable function in CascadesContext only handles the internal catalog case (try to find table only in internal
catalog and dbs). However, it should take all the external catalogs into consideration, otherwise, it will failed to find a
table or get the wrong table while querying external table. This pr is to fix this bug.
We should not remove any limit from uncorrelated subquery. For Example
```sql
-- should return nothing, but return all tuple of t if we remove limit from exists
SELECT * FROM t WHERE EXISTS (SELECT * FROM t limit 0);
-- should return the tuple with smallest c1 in t,
-- but report error if we remove limit from scalar subquery
SELECT * FROM t WHERE c1 = (SELECT * FROM t ORDER BY c1 LIMIT 1);
```
consider the window function:
```sql
substr(
ref_1.cp_type,
sum(CASE WHEN ref_1.cp_type = 0 THEN 3 ELSE 2 END) OVER (),
1)
```
Before the pr, only "CASE WHEN ref_1.cp_type = 0 THEN 3 ELSE 2 END" is pushed down.
But both "ref_1.cp_type" and "CASE WHEN ref_1.cp_type = 0 THEN 3 ELSE 2 END"
should be pushed down.
This pr fix it
In some cases, it is necessary to unescape the original value, such as when converting a string to JSONB.
If not unescape, then later jsonb parse will be failed
REFACTOR:
1. Generate CTEAnchor, CTEProducer, CTEConsumer when analyze.
For example, statement `WITH cte1 AS (SELECT * FROM t) SELECT * FROM cte1`.
Before this PR, we got analyzed plan like this:
```
logicalCTE(LogicalSubQueryAlias(cte1))
+-- logicalProject()
+-- logicalCteConsumer()
```
we only have LogicalCteConsumer on the plan, but not LogicalCteProducer.
This is not a valid plan, and should not as the final result of analyze.
After this PR, we got analyzed plan like this:
```
logicalCteAnchor()
|-- logicalCteProducer()
+-- logicalProject()
+-- logicalCteConsumer()
```
This is a valid plan with LogicalCteProducer and LogicalCteConsumer
2. Replace re-analyze unbound plan with deepCopy plan when do CTEInline
Because we generate LogicalCteAnchor and LogicalCteProducer when analyze.
So, we could not do re-analyze to gnerate CTE inline plan anymore.
The another reason is, we reuse relation id between unbound and bound relation.
So, if we do re-analyze on unresloved CTE plan, we will get two relation
with same RelationId. This is wrong, because we use RelationId to distinguish
two different relations.
This PR implement two helper class to deep copy a new plan from CTEProducer.
`LogicalPlanDeepCopier` and `ExpressionDeepCopier`
3. New rewrite framework to ensure do CTEInline in right way.
Before this PR, we do CTEInline before apply any rewrite rule.
But sometimes, some CteConsumer could be eliminated after rewrite.
After this PR, we do CTEInline after the plans relaying on CTEProducer have
been rewritten. So we could do CTEInline if some the count of CTEConsumer
decrease under the threshold of CTEInline.
4. add relation id to all relation plan node
5. let all relation generated from table implement trait CatalogRelation
6. reuse relation id between unbound relation and relation after bind
ENHANCEMENT:
1. Pull up CTEAnchor before RBO to avoid break other rules' pattern
Before this PR, we will generate CTEAnchor and LogicalCTE in the middle of plan.
So all rules should process LogicalCTEAnchor, otherwise will generate unexpected plan.
For example, push down filter and push down project should add pattern like:
```
logicalProject(logicalCTE)
...
logicalFilter(logicalCteAnchor)
...
```
project and filter must be push through these virtual plan node to ensure all project
and filter could be merged togather and get right order of them. for Example:
```
logicalProject
+-- logicalFilter
+-- logicalCteAnchor
+-- logicalProject
+-- logicalFilter
+-- logicalOlapScan
```
upper plan will lead to translation error. because we could not do twice filter and
project on bottom logicalOlapScan.
BUGFIX:
1. Recursive analyze LogicalCTE to avoid bind outer relation on inner CTE
For example
```sql
SELECT * FROM (WITH cte1 AS (SELECT * FROM t1) SELECT * FROM cte1)v1, cte1 v2;
```
Before this PR, we will use nested cte name to bind outer plan.
So the outer cte1 with alias v2 will bound on the inner cte1.
After this PR, the sql will throw Table not exists exception when binding.
2. Use right way do withChildren in CTEProducer and remove projects in it
Before this PR, we add an attr named projects in CTEProducer to represent the output
of it. This is because we cannot get right output of it by call `getOutput` method on it.
The root reason of that is the wrong implementation of computeOutput of LogicalCteProducer.
This PR fix this problem and remove projects attr of CTEProducer.
3. Adjust nullable rule update CTEConsumer's output by CTEProducer's output
This PR process nullable on LogicalCteConsumer to ensure CteConsumer's output with right
nullable info, if the CteProducer's output nullable has been adjusted.
4. Bind set operation expression should not change children's output's nullable
This PR use fix a problem introduced by prvious PR #21168. The nullable info of
SetOperation's children should not changed after binding SetOperation.