Check whether there are complex types in parquet/orc reader in broker/stream load. Broker/stream load will cast any type as string type, and complex types will be casted wrong. This is a temporary method, and will be replaced by tvf.
Optimization "select count(*) from table" stmtement , push down "count" type to BE.
support file type : parquet ,orc in hive .
1. 4kfiles , 60kwline num
before: 1 min 37.70 sec
after: 50.18 sec
2. 50files , 60kwline num
before: 1.12 sec
after: 0.82 sec
select c_name from customer union select c_name from customer
this sql used agg node to get distinct row of c_name,
so it's no need to wait for inserted all data to hash map,
could output the data which it's inserted into hash map successed.
Fix tow bugs:
1. Unexpected null values in array column. If 65535 consecutive values are not null in nullable array column, this error will be triggered. The reason is that the array parser did not handle boundary conditions.
2. The number of rows of key filed, and that of value field in map column are not equal. Similarly, the number of rows among fields in struct column are not the same. This would be triggered when the number of rows are not equal among parquet pages of different columns in a row group.
### Issue
when partition has null partitions, it throws error
`Failed to fill partition column: t_int=null`
### Resolution
- Fix the following null partitions error in iceberg tables by replacing null partition to '\N'.
- Add regression test for hive null partition.
1. If only read the partition columns, the `JniConnector` will produce empty required fields, so `HudiJniScanner` should read the "_hoodie_record_key" field at least to know how many rows in current hoodie split. Even if the `JniConnector` doesn't read this field, the call of `releaseTable` in `JniConnector` will reclaim the resource.
2. To prevent BE failure and exit, `JniConnector` should call release methods after `HudiJniScanner` is initialized. It should be noted that `VectorTable` is created lazily in `JniScanner`, so we don't need to reclaim the resource when `HudiJniScanner` is failed to initialize.
## Remaining works
Other jni readers like `paimon` and `maxcompute` may encounter the same problems, the jni reader need to handle this abnormal situation on its own, and currently this fix can only ensure that BE will not exit.
First of all, mysql does not have a boolean type, its boolean type is actually tinyint(1), in the previous logic, We force tinyint(1) to be a boolean by passing tinyInt1isBit=true, which causes an error if tinyint(1) is not a 0 or 1, Therefore, we need to match tinyint(1) according to tinyint instead of boolean, and this change will not affect the correctness of where k = 1 or where k = true queries
### Issue
Dictionary filtering is a mechanism that directly reads the dictionary encoding of a single string column filter condition for filter comparison. But dictionary filtered single string columns may be included in other multi-column filter conditions. This can cause problems.
For example:
`select * from multi_catalog.lineitem_string_date_orc where l_commitdate < l_receiptdate and l_receiptdate = '1995-01-01' order by l_orderkey, l_partkey, l_suppkey, l_linenumber limit 10;`
`l_receiptdate` is string filter column,it is included by multi-column filter condition `l_commitdate < l_receiptdate`.
### Solution
Resolve it by separating the multi-column filter conditions and executing it after the dictionary filter column is converted to string.
1.support filesystem metastore
2.support predicate and project when split
3.fix partition table query error
todo: Now you need to manually put paimon-s3-0.4.0-incubating.jar in be/lib/java_extensions when use s3 filesystem
doc pr: #21966
runtime filter is shared among multi instances.
in the past, we cached pushdown expr(runtime filter generated)
every scannode[runtime filter consumer] will try to call prepare expr
but the expr may generated with different fn_context_id
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
Co-authored-by: yiguolei <yiguolei@gmail.com>
If the scanner is failed during init or open, then not need update counters because the query is fail and the counter is useless.
And it may core during update counters. For example, update counters depend on scanner's tablet, but the tablet == null when init failed.
In some cases, it is necessary to unescape the original value, such as when converting a string to JSONB.
If not unescape, then later jsonb parse will be failed
### 1
In previous implementation, for each FileSplit, there will be a `TFileScanRange`, and each `TFileScanRange`
contains a list of `TFileRangeDesc` and a `TFileScanRangeParams`.
So if there are thousands of FileSplit, there will be thousands of `TFileScanRange`, which cause the thrift
data send to BE too large, resulting in:
1. the rpc of sending fragment may fail due to timeout
2. FE will OOM
For a certain query request, the `TFileScanRangeParams` is the common part and is same of all `TFileScanRange`.
So I move this to the `TExecPlanFragmentParams`.
After that, for each FileSplit, there is only a list of `TFileRangeDesc`.
In my test, to query a hive table with 100000 partitions, the size of thrift data reduced from 151MB to 15MB,
and the above 2 issues are gone.
### 2
Support when setting `max_external_file_meta_cache_num` <=0, the file meta cache for parquet footer will
not be used.
Because I found that for some wide table, the footer is too large(1MB after compact, and much more after
deserialized to thrift), it will consuming too much memory of BE when there are many files.
This will be optimized later, here I just support to disable this cache.
In parquet, min and max statistics may not be able to handle UTF8 correctly.
Current processing method is using min_value and max_value statistics introduced by PARQUET-1025 if they are used.
If not, current processing method is temporarily ignored. A better way is try to read min and max statistics if it contains
only ASCII characters. I will improve it in the future PR.
* [Fix](rowset) When a rowset is cooled down, it is directly deleted. This can result in data query misses in the second phase of a two-phase query.
related pr #20732
There are two reasons for moving the logic of delayed deletion from the Tablet to the StorageEngine. The first reason is to consolidate the logic and unify the delayed operations. The second reason is that delayed garbage collection during queries can cause rowsets to remain in the "stale rowsets" state, preventing the timely deletion of rowset metadata, It may cause rowset metadata too large.
* not use unused rowsets