Iceberg maintains its own table metadata, which includes row count statistics for table data. If the table contains no equality deletes, we can get the row count of the current table directly from these count statistics instead of scanning the data.
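A minimal sketch of that decision, assuming a hypothetical `SnapshotSummary` struct and `scan_count` fallback; the real planner code and the Iceberg API it calls are not shown here.

```cpp
#include <cstdint>
#include <iostream>

// Illustrative snapshot-level statistics as Iceberg records them in its metadata;
// the field names mirror Iceberg's summary keys, but the struct itself is hypothetical.
struct SnapshotSummary {
    int64_t total_records = 0;          // "total-records"
    int64_t total_equality_deletes = 0; // "total-equality-deletes"
};

// Fallback: count by actually scanning the table (expensive).
int64_t scan_count(/* table */) { return 0; }

int64_t table_row_count(const SnapshotSummary& summary) {
    // No equality deletes: the metadata count is exact, so skip the scan entirely.
    if (summary.total_equality_deletes == 0) {
        return summary.total_records;
    }
    // Equality deletes may remove an unknown number of rows; fall back to scanning.
    return scan_count();
}

int main() {
    SnapshotSummary s{1000000, 0};
    std::cout << table_row_count(s) << "\n";  // 1000000, read from metadata
    return 0;
}
```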
[Fix](orc-reader) Fix filling partition or missing columns with an incorrect row count.
`_row_reader->nextBatch` returns the number of rows read. When ORC lazy materialization is enabled, that number includes filtered rows, so the caller must use `numElements` from the row batch to determine how many rows survived filtering and should be filled into the block.
Previously, partition and missing columns were filled with the incorrect row count, which caused a crash on `filter.size() != offsets.size()` in the filter-column step.
When ORC lazy materialization is disabled, also call `_convert_dict_cols_to_string_cols(block, nullptr)` when `block->rows() == 0`.
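A minimal sketch of the corrected fill logic, assuming hypothetical `RowBatch`/`Block` types and illustrative helper names (`fill_partition_columns`, `fill_missing_columns`); only the choice of row count mirrors the fix described above.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical stand-ins for the reader's batch and the destination block.
struct RowBatch {
    uint64_t numElements = 0;  // rows that survived lazy-materialization filtering
};
struct Block {
    uint64_t rows = 0;
};

// Illustrative helpers: both must be sized by the number of surviving rows.
void fill_partition_columns(Block* block, uint64_t rows) { block->rows = rows; }
void fill_missing_columns(Block* block, uint64_t rows) { block->rows = rows; }

uint64_t fill_block(uint64_t rows_read,       // what nextBatch() returned
                    const RowBatch& batch,    // carries numElements
                    Block* block,
                    bool lazy_materialization) {
    // With lazy materialization on, rows_read still counts rows the pushed-down
    // filter removed; only batch.numElements rows actually go into the block.
    uint64_t rows_to_fill = lazy_materialization ? batch.numElements : rows_read;

    // Sizing these by rows_read instead would later trip
    // filter.size() != offsets.size() in the filter-column step.
    fill_partition_columns(block, rows_to_fill);
    fill_missing_columns(block, rows_to_fill);
    return rows_to_fill;
}

int main() {
    RowBatch batch{800};  // 1000 rows read, 200 of them filtered out lazily
    Block block;
    std::cout << fill_block(1000, batch, &block, /*lazy_materialization=*/true) << "\n";  // 800
    return 0;
}
```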
1. The effective value of the BE config `file_cache_max_file_reader_cache_size` is now 1/3 of the process's max open file limit.
2. Use a thread pool to create and initialize the file cache directories concurrently.
This solves the issue that BE startup is very slow when the file cache directories contain many files, because previously all file cache directories were traversed sequentially.
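A minimal sketch of both ideas, assuming a simple `std::async`-based pool and an illustrative `init_cache_dir` helper; the BE's actual thread pool and config plumbing differ.

```cpp
#include <sys/resource.h>
#include <cstdint>
#include <future>
#include <iostream>
#include <string>
#include <vector>

// Illustrative stand-in for traversing and registering one file cache directory.
void init_cache_dir(const std::string& path) {
    // ... scan `path` and register cached blocks ...
}

int main() {
    // 1. Cap the file reader cache at 1/3 of the process's open-file limit.
    rlimit lim{};
    getrlimit(RLIMIT_NOFILE, &lim);
    uint64_t file_reader_cache_size = lim.rlim_cur / 3;
    std::cout << "file_cache_max_file_reader_cache_size=" << file_reader_cache_size << "\n";

    // 2. Initialize all cache directories concurrently instead of sequentially.
    std::vector<std::string> cache_dirs = {"/data1/file_cache", "/data2/file_cache"};
    std::vector<std::future<void>> tasks;
    for (const auto& dir : cache_dirs) {
        tasks.emplace_back(std::async(std::launch::async, init_cache_dir, dir));
    }
    for (auto& t : tasks) t.get();  // wait for all directories to finish
    return 0;
}
```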
Previously, delete statements with conditions on value columns were only supported on duplicate tables. After the delete-sign mechanism was introduced for batch delete, a delete statement with conditions on value columns on a unique table is transformed into the corresponding `insert into ..., __DELETE_SIGN__ select ...` statement. However, for unique tables with merge-on-write enabled, the overhead of inserting these rows can be eliminated, so this PR adds support for delete predicates on value columns on merge-on-write unique tables.
The DCHECK may not always hold in the case of vertical compaction; remove it so that DEBUG builds can run.
Signed-off-by: freemandealer <freeman.zhang1992@gmail.com>
This issue was introduced by #22765; if #22765 is picked into 2.0, this PR also needs to be picked.
When the shuffle type is BUCKET_SHFFULE_HASH_PARTITIONED, data from multiple buckets may be sent to the same channel, so sending eos too early can cause data loss.
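A minimal sketch of the idea, assuming a hypothetical per-channel counter of unfinished buckets; the actual sink and channel classes are not shown.

```cpp
#include <iostream>

// Hypothetical channel that must receive eos only after every bucket mapped to it has finished.
struct Channel {
    int remaining_buckets = 0;
    void send_eos() { std::cout << "eos sent\n"; }
};

// Called when one bucket finishes; eos is deferred until the channel's last bucket is done.
void on_bucket_finished(Channel& channel) {
    if (--channel.remaining_buckets == 0) {
        channel.send_eos();
    }
}

int main() {
    // Two buckets hash to the same channel: eos must not be sent after only the first one.
    Channel ch;
    ch.remaining_buckets = 2;
    on_bucket_finished(ch);  // too early for eos, nothing happens
    on_bucket_finished(ch);  // now safe: prints "eos sent"
    return 0;
}
```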
Currently, we only return an ambiguous "INTERNAL ERROR" to the user when a load fails. This commit no longer hides the root cause.
Signed-off-by: freemandealer <freeman.zhang1992@gmail.com>
Previously, the Parquet writer exported decimals as byte-binary, but those fields could not be imported into Hive.
Now decimals are exported as fixed-len-byte-array so they can be imported into Hive directly.
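A minimal sketch of the fixed-len-byte-array layout (big-endian two's complement), assuming an int64 unscaled value and an illustrative `encode_decimal_fixed` helper; the writer's actual code and type mapping are not shown.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Encode the unscaled decimal value as a big-endian two's-complement byte array
// of exactly `fixed_len` bytes, the layout Parquet uses for DECIMAL stored as
// fixed-len-byte-array.
std::vector<uint8_t> encode_decimal_fixed(int64_t unscaled, size_t fixed_len) {
    std::vector<uint8_t> out(fixed_len);
    // Sign-extend: negative values are padded with 0xFF, non-negative with 0x00.
    uint8_t pad = unscaled < 0 ? 0xFF : 0x00;
    for (size_t i = 0; i < fixed_len; ++i) out[i] = pad;
    // Write the value bytes big-endian into the tail of the buffer.
    for (size_t i = 0; i < 8 && i < fixed_len; ++i) {
        out[fixed_len - 1 - i] = static_cast<uint8_t>(unscaled >> (8 * i));
    }
    return out;
}

int main() {
    // Example: a DECIMAL(18, 2) value 12345.67 -> unscaled 1234567, stored in 8 bytes.
    auto bytes = encode_decimal_fixed(1234567, 8);
    for (uint8_t b : bytes) std::printf("%02x ", b);
    std::printf("\n");  // 00 00 00 00 00 12 d6 87
    return 0;
}
```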