1. the agg function without distinct keyword should be a "merge" funcion in threePhaseAggregateWithDistinct
2. use aggregateParam.aggMode.consumeAggregateBuffer instead of aggregateParam.aggPhase.isGlobal() to indicate if a agg function is a "merge" function
3. add an AvgDistinctToSumDivCount rule to support avg(distinct xxx) in some case
4. AggregateExpression's nullable method should call inner function's nullable method.
5. add a bind slot rule to bind pattern "logicalSort(logicalHaving(logicalProject()))"
6. don't remove project node in PhysicalPlanTranslator
7. add a cast to bigint expr when count( distinct datelike type )
8. fallback to old optimizer if bitmap runtime filter is enabled.
9. fix exchange node mem leak
* [feature-wip](inverted index)inverted index api: reader
* [feature-wip](inverted index) Fulltext query syntax with MATCH/MATCH_ALL/MATCH_ALL
* [feature-wip](inverted index) Adapt to index meta
* [enhance] add more metrics
* [enhance] add fulltext match query check for column type and index parser
* [feature-wip](inverted index) Support apply inverted index in compound predicate which except leaf node of and node
**Optimize**
PR #14470 has used `Expr` to filter delete rows to match current data file,
but the rows in the delete file are [sorted by file_path then position](https://iceberg.apache.org/spec/#position-delete-files)
to optimize filtering rows while scanning, so this PR remove `Expr` and use binary search to filter delete rows.
In addition, delete files are likely to be encoded in dictionary, it's time-consuming to decode `file_path`
columns into `ColumnString`, so this PR use `ColumnDictionary` to read `file_path` column.
After testing, the performance of iceberg v2's MOR is improved by 30%+.
**Fix Bug**
Lazy-read-block may not have the filter column, if the whole group is filtered by `Expr`
and the batch_eof is generated from next batch.
this PR remove typeCoercion on expected expr in ExpressionRewriteTestHelper. Because we should not rewrite expected expr at all. It will change the expected expr unexpectedly.
Cooldown time is wrong for data in SSD, because cooldown time for all `table/partitionis`
is only calculated once when class `DataProperty` loaded and that cannot be updated later.
This patch is to ensure that cooldown time for each table/partition can be calculated in real time
when table/partition is created.
Co-authored-by: weizuo <weizuo@xiaomi.com>
1. Add IntegralDivide operator to support `DIV` semantics
2. Add more operator rewriter to keep expression type consistent between operators
3. Support the convertion between float type and decimal type.
After this PR, below cases could be executed normaly like the legacy optimizer:
use test_query_db;
select k1, k5,100000*k5 from test order by k1, k2, k3, k4;
select avg(k9) as a from test group by k1 having a < 100.0 order by a;
currently, show create function is designed for native function, it has some non suitable points.
this pr bring several improvements, and make result of show create function can be used to create function.
1. add type property.
2. add ALWAYS_NULLABLE perperty for java_udf
3. use file property rather than object_file for java_udf, follow usage of create java_udf
4. remove md5 property, coz file may vary when create function again.
5. remove INIT_FN,UPDATE_FN,MERGE_FN,SERIALIZE_FN etc properties for java_udf, cos java_udf does not need these properties.
This PR implement the new bloom filter index: NGram bloom filter index, which was proposed in #10733.
The new index can improve the like query performance greatly, from our some test case , can get order of magnitude improve.
For how to use it you can check the docs in this PR, and the index based on the ```enable_function_pushdown```,
you need set it to ```true```, to make the index work for like query.
Original: group by is bound to the outputExpression of the current node.
Problem: When the name of the new reference of outputExpression is the same as the child's output column, the child's output column should be used for group by, but at this time, the new reference of the node's outputExpression will be used for group by, resulting in an error
Now: Give priority to the child's output for group by binding. If the child does not have a corresponding column, use the outputExpression of this node for binding