When reading to the end of the segment file, clearing the block did not release the memory, leading to high memory usage during compaction.
When reading through segment file for columns that are dictionary encoded, the column iterator in the segment iterator will hold the dictionary. Release the segment iterator to free up the dictionary.
The previous logic was to read jsonbvalue while parsing the json path. For complex json paths, there will be a lot of repeated parsing work. The optimization idea is to separate the analysis and value of jsonpath
runtime filter is shared among multi instances.
in the past, we cached pushdown expr(runtime filter generated)
every scannode[runtime filter consumer] will try to call prepare expr
but the expr may generated with different fn_context_id
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
Co-authored-by: yiguolei <yiguolei@gmail.com>
If the scanner is failed during init or open, then not need update counters because the query is fail and the counter is useless.
And it may core during update counters. For example, update counters depend on scanner's tablet, but the tablet == null when init failed.
In some cases, it is necessary to unescape the original value, such as when converting a string to JSONB.
If not unescape, then later jsonb parse will be failed
mysql >select sec_to_time(time_to_sec(cast('16:32:18' as time)));
+----------------------------------------------------+
| sec_to_time(time_to_sec(CAST('16:32:18' AS TIME))) |
+----------------------------------------------------+
| 16:32:18 |
+----------------------------------------------------+
1 row in set (0.53 sec)
mysql [test]>select sec_to_time(59538);
+--------------------+
| sec_to_time(59538) |
+--------------------+
| 16:32:18 |
+--------------------+
1 row in set (0.03 sec)
When enable two level hash table , if there is zero value in the existing one level hash table, it will cause dead loop when converting to two level hash table, because the PartitionedHashTable::_is_partitioned flag is not set correctly when doing the converting.
### 1
In previous implementation, for each FileSplit, there will be a `TFileScanRange`, and each `TFileScanRange`
contains a list of `TFileRangeDesc` and a `TFileScanRangeParams`.
So if there are thousands of FileSplit, there will be thousands of `TFileScanRange`, which cause the thrift
data send to BE too large, resulting in:
1. the rpc of sending fragment may fail due to timeout
2. FE will OOM
For a certain query request, the `TFileScanRangeParams` is the common part and is same of all `TFileScanRange`.
So I move this to the `TExecPlanFragmentParams`.
After that, for each FileSplit, there is only a list of `TFileRangeDesc`.
In my test, to query a hive table with 100000 partitions, the size of thrift data reduced from 151MB to 15MB,
and the above 2 issues are gone.
### 2
Support when setting `max_external_file_meta_cache_num` <=0, the file meta cache for parquet footer will
not be used.
Because I found that for some wide table, the footer is too large(1MB after compact, and much more after
deserialized to thrift), it will consuming too much memory of BE when there are many files.
This will be optimized later, here I just support to disable this cache.
arrow is not support key column has null element , but doris default map key column is nullable , so need to deal with if doris map row if key column has null element , we put null to arrow
In parquet, min and max statistics may not be able to handle UTF8 correctly.
Current processing method is using min_value and max_value statistics introduced by PARQUET-1025 if they are used.
If not, current processing method is temporarily ignored. A better way is try to read min and max statistics if it contains
only ASCII characters. I will improve it in the future PR.
* [Fix](rowset) When a rowset is cooled down, it is directly deleted. This can result in data query misses in the second phase of a two-phase query.
related pr #20732
There are two reasons for moving the logic of delayed deletion from the Tablet to the StorageEngine. The first reason is to consolidate the logic and unify the delayed operations. The second reason is that delayed garbage collection during queries can cause rowsets to remain in the "stale rowsets" state, preventing the timely deletion of rowset metadata, It may cause rowset metadata too large.
* not use unused rowsets
After the last time to call scan_task.scan_func(),the should be ended, this means PipelineFragmentContext could be released.
Then after PipelineFragmentContext is released, visiting its field such as query_ctx or _state may cause core dump.
But it can only explain core 2
void ScannerScheduler::_task_group_scanner_scan(ScannerScheduler* scheduler,
taskgroup::ScanTaskTaskGroupQueue* scan_queue) {
while (!_is_closed) {
taskgroup::ScanTask scan_task;
auto success = scan_queue->take(&scan_task);
if (success) {
int64_t time_spent = 0;
{
SCOPED_RAW_TIMER(&time_spent);
scan_task.scan_func();
}
scan_queue->update_statistics(scan_task, time_spent);
}
}
}
Issue Number: close #xxx
when cal array hash, elem size is not need to seed hash
hash = HashUtil::zlib_crc_hash(reinterpret_cast<const char*>(&elem_size),
sizeof(elem_size), hash);
but we need to be care [[], [1]] vs [[1], []], when array nested array , and nested array is empty, we should make hash seed to
make difference
2. use range for one hash value to avoid virtual function call in loop.
which double the performance. I make it in ut
column: array[int64]
50 rows , and single array has 10w elements