1. the agg function without distinct keyword should be a "merge" funcion in threePhaseAggregateWithDistinct
2. use aggregateParam.aggMode.consumeAggregateBuffer instead of aggregateParam.aggPhase.isGlobal() to indicate if a agg function is a "merge" function
3. add an AvgDistinctToSumDivCount rule to support avg(distinct xxx) in some case
4. AggregateExpression's nullable method should call inner function's nullable method.
5. add a bind slot rule to bind pattern "logicalSort(logicalHaving(logicalProject()))"
6. don't remove project node in PhysicalPlanTranslator
7. add a cast to bigint expr when count( distinct datelike type )
8. fallback to old optimizer if bitmap runtime filter is enabled.
9. fix exchange node mem leak
* [feature-wip](inverted index)inverted index api: reader
* [feature-wip](inverted index) Fulltext query syntax with MATCH/MATCH_ALL/MATCH_ALL
* [feature-wip](inverted index) Adapt to index meta
* [enhance] add more metrics
* [enhance] add fulltext match query check for column type and index parser
* [feature-wip](inverted index) Support apply inverted index in compound predicate which except leaf node of and node
**Optimize**
PR #14470 has used `Expr` to filter delete rows to match current data file,
but the rows in the delete file are [sorted by file_path then position](https://iceberg.apache.org/spec/#position-delete-files)
to optimize filtering rows while scanning, so this PR remove `Expr` and use binary search to filter delete rows.
In addition, delete files are likely to be encoded in dictionary, it's time-consuming to decode `file_path`
columns into `ColumnString`, so this PR use `ColumnDictionary` to read `file_path` column.
After testing, the performance of iceberg v2's MOR is improved by 30%+.
**Fix Bug**
Lazy-read-block may not have the filter column, if the whole group is filtered by `Expr`
and the batch_eof is generated from next batch.
Refactor the usage of file cache
### Motivation
There may be many kinds of file cache for different scenarios.
So the logic of the file cache should be hidden inside the file reader,
so that for the upper-layer caller, the change of the file cache does not need to
modify the upper-layer calling logic.
### Details
1. Add `FileReaderOptions` param for `fs->open_file()`, and in `FileReaderOptions`
1. `CachePathPolicy`
Determine the cache file path for a given file path.
We can implement different `CachePathPolicy` for different file cache.
2. `FileCacheType`
Specified file cache type: SUB_FILE_CACHE, WHOLE_FILE_CACHE, FILE_BLOCK_SIZE, etc.
2. Hide the cache logic inside the file reader.
The `RemoteFileSystem` will handle the `CacheOptions` and determine whether to
return a `CachedFileReader` or a `RemoteFileReader`.
And the file cache is managed by `CachedFileReader`
Current column default value is used only for load task. But in the case of Iceberg schema change,
query task is also possible to read the default value for columns not exist in old schema.
This pr is to support default value for query task.
Manually tested the broker load and external emr regression cases.
test data: https://data.gharchive.org/2020-11-13-18.json.gz, 2GB, 197696 lines
before: String 13s vs. JSONB 28s
after: String 13s vs. JSONB 16s
**NOTICE: simdjson need to be patched since BOOL is conflicted with a macro BOOL defined in odbc sqltypes.h**
Current bitmap index only can apply pushed down predicates which in AND conditions. When predicates in OR conditions and other complex compound conditions, it will not be pushed down to the storage layer, this leads to read more data.
Based on that situation, this pr will do:
1. this pr in order to support bitmap index apply compound predicates, query sql like:
select * from tb where a > 'hello' or b < 100;
select * from tb where a > 'hello' or b < 100 or c > 'ok';
select * from tb where (a > 'hello' or b <100) and (a < 'world' or b > 200);
select * from tb where (not a> 'hello') or b < 100;
...
above sql,column a and b and c has created bitmap_index.
2. this optimization can reduce reading data by index
3. set config enable_index_apply_compound_predicates to use this optimization
This PR implement the new bloom filter index: NGram bloom filter index, which was proposed in #10733.
The new index can improve the like query performance greatly, from our some test case , can get order of magnitude improve.
For how to use it you can check the docs in this PR, and the index based on the ```enable_function_pushdown```,
you need set it to ```true```, to make the index work for like query.
Support return bitmap data in select statement in vectorization mode
In the scenario of using Bitmap to circle people, users need to return the Bitmap results to the upper layer, which is parsing the contents of the Bitmap to deal with high QPS query scenarios
1. Fix 1 bug:
Throw null pointer exception when reading data after the reader reaches the end of file, so should return directly when `_do_lazy_read` read no data.
2. Optimize code:
Remove unused parameters.
3. Fix regression test