Backport #36045 and turn batch split back on; it was turned off in #36109.
Generate and fetch split batches concurrently.
Remove the synchronization from `SplitSource.getNextBatch` so that each caller fetches its splits concurrently, and make `SplitAssignment` generate splits asynchronously.
The file list is obtained from the external meta cache, so a file may already
have been removed from storage.
We should ignore files that are not found and let the query continue.
Cached blocks may be empty when `VFileScanner` returns NOT_FOUND. This feature was introduced by https://github.com/apache/doris/pull/15226. Move this function into `VFileScanner`.
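A minimal sketch of the "skip missing files" behavior described above. `OpenStatus`, `scan_files` and the `open_file` callback are hypothetical names used only for illustration, not the actual `VFileScanner` code. The one assumption taken from the description is that a NOT_FOUND file is skipped while any other error still fails the scan.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical status type for this sketch; not the Doris Status class.
enum class OpenStatus { kOk, kNotFound, kError };

// Walks the (possibly stale) file list. Files that no longer exist are
// skipped; any other failure aborts the scan. Names of the files that were
// actually opened are appended to `scanned`.
bool scan_files(const std::vector<std::string>& files,
                const std::function<OpenStatus(const std::string&)>& open_file,
                std::vector<std::string>* scanned) {
    for (const auto& file : files) {
        const OpenStatus st = open_file(file);
        if (st == OpenStatus::kNotFound) {
            continue;  // listed in the meta cache but already removed from storage
        }
        if (st == OpenStatus::kError) {
            return false;  // real errors are still reported
        }
        scanned->push_back(file);
    }
    return true;
}
```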
The file meta cache on BE caches metadata of external table files, such as parquet footers.
This cache is limited by entry count, not by memory consumption.
So if the cached objects are big (e.g., a large parquet footer), the total memory consumption of this cache
will be large and may cause OOM.
This PR mainly changes:
1. Add a new method `exceed_prune_limit()` to `CachePolicy`
For `ObjLRUCache`, it always returns true, so that a minor or full GC on BE prunes the cache every time.
2. Reduce the default capacity of the file meta cache from 20000 to 1000
Also reduce the default capacity of the hdfs file handle cache from 20000 to 1000
3. Change how we decide whether to enable the file meta cache for a query
If the number of files to read is larger than 1/3 of the file meta cache's capacity, the file meta cache
is disabled for this query, because the cache is useless when there are too many files.
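The 1/3 rule above can be sketched as follows. `should_enable_file_meta_cache` is a hypothetical helper written for this description, not the function in the BE code, and the exact rounding at the boundary is an assumption.

```cpp
#include <cstddef>

// Hypothetical helper: decide whether the file meta cache is worth using for
// a query that reads `num_files` files, given the cache capacity (in entries).
bool should_enable_file_meta_cache(std::size_t num_files, std::size_t cache_capacity) {
    // Disabled when num_files > capacity / 3. Written as a multiplication so
    // that integer division does not round the threshold down.
    return num_files * 3 <= cache_capacity;
}
```

With the new default capacity of 1000, a query reading 333 files would still use the cache under this sketch, while one reading 334 files would not.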
Previously, the counters in `profile` could be updated when closing the file reader,
and the file reader may be closed when the object is destructed.
But at that time the `profile` object may already have been deleted, causing an NPE and a BE crash.
This PR tries to fix this issue:
1. Remove the "profile counter update" logic from all `close()` methods.
2. Add a new interface `ProfileCollector`
It has 2 methods:
- `collect_profile_at_runtime()`
It can be called at runtime, e.g., in every `get_next_block()` call,
so that the counters in the profile are updated at runtime.
- `collect_profile_before_close()`
It should be called before the object calls `close()`, and it will only be called once.
3. Derive from `ProfileCollector`
All classes that may update profile counters in their `close()` method, such as `GenericReader`, should extend
`ProfileCollector` and implement `collect_profile_before_close()`.
`collect_profile_before_close()` is called in `scanner->mark_to_need_to_close()`.
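A simplified sketch of the interface described above, to show the "collect before close, at most once" idea. Only the names `ProfileCollector`, `collect_profile_at_runtime()` and `collect_profile_before_close()` come from this PR; `Counter`, `DemoReader`, the protected hook names and the once-only flag are assumptions for illustration, and the sketch is not thread-safe.

```cpp
#include <cstdint>

// Illustrative stand-in for a profile counter; not the RuntimeProfile API.
struct Counter {
    int64_t value = 0;
};

class ProfileCollector {
public:
    virtual ~ProfileCollector() = default;

    // May be called repeatedly while the scan is running.
    void collect_profile_at_runtime() { _collect_profile_at_runtime(); }

    // Runs the hook at most once, no matter how often it is called.
    void collect_profile_before_close() {
        if (_collected_before_close) {
            return;
        }
        _collected_before_close = true;
        _collect_profile_before_close();
    }

protected:
    virtual void _collect_profile_at_runtime() {}
    virtual void _collect_profile_before_close() {}

private:
    bool _collected_before_close = false;
};

// Hypothetical reader that reports bytes read into a profile counter.
class DemoReader : public ProfileCollector {
public:
    explicit DemoReader(Counter* bytes_read_counter) : _counter(bytes_read_counter) {}

    void read(int64_t bytes) { _bytes_read += bytes; }

    // close() only releases resources; it never touches the profile.
    void close() {}

protected:
    void _collect_profile_before_close() override { _counter->value += _bytes_read; }

private:
    Counter* _counter;
    int64_t _bytes_read = 0;
};
```

Because `close()` no longer touches the counter, destroying the reader after the profile has been deleted is harmless in this sketch.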
`VScanNode::get_next` checks whether the ScanNode has reached its limit condition and sends eos to TaskScheduler, which then tries to close the ScanNode.
However, the ScanNode must wait for all running scanners to finish, so even if it has reached the limit condition, it cannot be closed immediately.
This PR tries to interrupt the running readers so that the ScanNode ends as soon as possible.
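One common way to interrupt a running reader is a shared stop flag that the reader checks between batches. The sketch below shows that pattern only; `DemoReader` and the flag are hypothetical, and the actual mechanism in this PR may differ.

```cpp
#include <atomic>

// Hypothetical reader that produces `total_batches` batches unless it is
// asked to stop earlier through a shared flag.
class DemoReader {
public:
    DemoReader(const std::atomic<bool>* should_stop, int total_batches)
            : _should_stop(should_stop), _total_batches(total_batches) {}

    // Returns false at end of data or as soon as the stop flag is set.
    bool next_batch() {
        if (_should_stop->load(std::memory_order_acquire)) {
            return false;  // limit reached elsewhere: stop reading right away
        }
        if (_batches_read >= _total_batches) {
            return false;  // normal end of data
        }
        ++_batches_read;
        return true;
    }

    int batches_read() const { return _batches_read; }

private:
    const std::atomic<bool>* _should_stop;
    int _total_batches;
    int _batches_read = 0;
};
```

The scan node would set the flag once the limit is reached, so each reader returns after its current batch instead of reading the rest of its file.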
1. Max compute partition pruning.
Previously we only supported filtering mc partitions by '=', which can select just one partition.
To support multi-partition filters and range operators ('>', '<', '>=', etc.), partition pruning is needed.
2. Add max compute row count cache and partitionValues cache
3. Add max compute regression case
Fix two bugs:
1. Missing column detection is case sensitive. Convert the column names to lower case in FE for hive/iceberg/hudi.
2. Iceberg uses a custom method to encode special characters in column names. Decode the column name to match the right column in the parquet reader.
`VFileScanner` tries to append late-arriving runtime filters in each loop of `ScannerScheduler::_scanner_scan`. However, `VFileScanner::_get_next_reader` only generates `_push_down_conjuncts` in the first loop, so the late-arriving runtime filters are ignored.
1. Refactor the decoding logic for reading parquet. The parquet reader first reads the data according to the parquet physical type, and then performs a type conversion.
2. Support hive alter table.
When executing broker load in ASAN mode, BE may crash with error:
```
F20231010 18:18:17.044978 185490 block.cpp:694] Check failed: d.column->use_count() == 1 (3 vs. 1)
*** Check failure stack trace: ***
@ 0x55e9d94c4e46 google::LogMessage::SendToLog()
@ 0x55e9d94c1410 google::LogMessage::Flush()
@ 0x55e9d94c5689 google::LogMessageFatal::~LogMessageFatal()
@ 0x55e9c509f80d doris::vectorized::Block::clear_column_data()
@ 0x55e9b6c170b3 doris::PlanFragmentExecutor::get_vectorized_internal()
@ 0x55e9b6c147e6 doris::PlanFragmentExecutor::open_vectorized_internal()
@ 0x55e9b6c12d9a doris::PlanFragmentExecutor::open()
@ 0x55e9b6c18426 doris::PlanFragmentExecutor::execute()
@ 0x55e9b6945cca doris::FragmentMgr::_exec_actual()
@ 0x55e9b696456c doris::FragmentMgr::exec_plan_fragment()::$_0::operator()()
```
It may happen when there is a column mapping like:
```
(k1,v2,v3,v4,v5,v6,v7,v8)
set (k2=v4,k3=v4,k4=v4)
```
in the load statement.
Case is covered by Baidu test cases
1. Do not split compressed data files
Some data files in hive are compressed with gzip, deflate, etc.
These kinds of files cannot be split.
2. Support lz4 block codec
For hive scan node, use the lz4 block codec instead of the lz4 frame codec
3. Support snappy block codec
For hadoop snappy
4. Optimize the `count(*)` query on csv files
For queries like `select count(*) from tbl`, we only need to split lines, not columns.
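To illustrate the `count(*)` optimization: counting rows only requires finding line delimiters, never column delimiters. `count_csv_rows` is a hypothetical helper, and it assumes a single-character line delimiter, no header line, and no quoted fields that contain the delimiter.

```cpp
#include <cstddef>
#include <string>

// Counts rows in a CSV buffer by looking only at the line delimiter.
// A final line without a trailing delimiter still counts as a row.
std::size_t count_csv_rows(const std::string& data, char line_delimiter = '\n') {
    std::size_t rows = 0;
    for (char c : data) {
        if (c == line_delimiter) {
            ++rows;
        }
    }
    if (!data.empty() && data.back() != line_delimiter) {
        ++rows;  // last line has no trailing delimiter
    }
    return rows;
}
```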
Need to pick to branch-2.0 after this PR: #22304
Iceberg has its own metadata, which includes count statistics for table data. If the table does not contain equality deletes, we can get the row count of the current table directly from the count statistics.
This PR fixes two issues:
1. When using the s3 TVF to query files in AVRO format, due to the change of `TFileType`, the original `FILE_S3` becomes `FILE_LOCAL`, causing the query to fail.
2. Both parameters `s3.virtual.key` and `s3.virtual.bucket` are removed. A new `S3Utils` in jni-avro parses the bucket and key of s3.
The main purpose of this change is to unify the s3 parameters.
Add the session variable `truncate_char_or_varchar_columns`: when it is set, truncate char or varchar columns if their size is smaller than that of the file columns, or if they are not found in the file column schema.
For a load request, there are 2 tuples on the scan node: the input tuple and the output tuple.
The input tuple is used for reading the file, and it is converted to the output tuple based on the user-specified column mappings.
Broker load supports different column mappings in different data descriptions for the same table (or partition).
So for each scanner, the output tuple is the same but the input tuple can be different.
The previous implementation saved the input tuple at the scan node level, so different scanners used the same input tuple,
which is incorrect.
This PR removes the input tuple from the scan node and saves it in each scanner.
Check whether there are complex types in the parquet/orc reader in broker/stream load. Broker/stream load casts every type to string, so complex types are cast incorrectly. This is a temporary measure and will be replaced by tvf.
Optimize `select count(*) from table` statements by pushing the "count" type down to BE.
Supported file types: parquet and orc in hive.
1. 4k files, 60kw (600 million) rows
before: 1 min 37.70 sec
after: 50.18 sec
2. 50 files, 60kw (600 million) rows
before: 1.12 sec
after: 0.82 sec
### Issue
When a table has null partition values, the query throws the error
`Failed to fill partition column: t_int=null`
### Resolution
- Fix the null partition error above in iceberg tables by replacing null partition values with '\N'.
- Add regression test for hive null partition.
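A tiny sketch of the replacement described in the resolution. `partition_value_or_null_marker` is a hypothetical helper, and representing a null partition value as a null pointer is an assumption made only for this example.

```cpp
#include <string>

// Hypothetical helper: turn a possibly-null partition value into the text
// that is sent along with the split. A null value becomes the marker \N.
std::string partition_value_or_null_marker(const std::string* value) {
    static const std::string kNullMarker = "\\N";  // two characters: backslash, N
    return value == nullptr ? kNullMarker : *value;
}
```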
### 1
In the previous implementation, each FileSplit had its own `TFileScanRange`, and each `TFileScanRange`
contained a list of `TFileRangeDesc` and a `TFileScanRangeParams`.
So with thousands of FileSplits there were thousands of `TFileScanRange`s, which made the thrift
data sent to BE too large, resulting in:
1. the rpc that sends the fragment may fail due to timeout
2. FE may OOM
For a given query request, the `TFileScanRangeParams` is the common part and is the same for all `TFileScanRange`s.
So I moved it to `TExecPlanFragmentParams`.
After that, each FileSplit only carries a list of `TFileRangeDesc`.
In my test, querying a hive table with 100000 partitions, the size of the thrift data dropped from 151MB to 15MB,
and the above 2 issues are gone.
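The size reduction follows from sending the shared params once instead of once per split. A back-of-the-envelope sketch, where both function names and all byte sizes are made up for illustration and are not measured thrift sizes:

```cpp
#include <cstdint>

// Old layout: every split carries its own copy of the shared params.
std::uint64_t thrift_bytes_params_per_range(std::uint64_t num_splits,
                                            std::uint64_t range_desc_bytes,
                                            std::uint64_t params_bytes) {
    return num_splits * (range_desc_bytes + params_bytes);
}

// New layout: the params are sent once per fragment, and each split only
// carries its range descriptions.
std::uint64_t thrift_bytes_shared_params(std::uint64_t num_splits,
                                         std::uint64_t range_desc_bytes,
                                         std::uint64_t params_bytes) {
    return num_splits * range_desc_bytes + params_bytes;
}
```

For example, with 1000 splits, 100 bytes per range description and 1000 bytes of params, the first layout needs 1,100,000 bytes and the second 101,000 bytes.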
### 2
Support setting `max_external_file_meta_cache_num` <= 0, in which case the file meta cache for parquet footers is
not used.
I found that for some wide tables the footer is very large (1MB in compact form, and much more after
being deserialized to thrift), so it consumes too much BE memory when there are many files.
This will be optimized later; for now I just support disabling this cache.