Currently, for broadcast shuffle, we serialize a block once and then send it by RPC through multiple channel. After this, we will serialize next block in the same memory for consideration of memory reuse. However, since the RPC is asynchronized, maybe the next block serialization will happen before sending the previous block.
So, in this PR, I use a ref count to identify if the serialized block can be reuse in broadcast shuffle.
`CachedRemoteFileReader` has used fixed segment size(file_cache_max_file_segment_size=4M) to cache remote file blocks. However, the column size in a rowgroup/strip maybe smaller than 10K if a parquet/orc file has many columns, resulting in particularly serious read amplification. For example:
Q1 in clickbench: select count(*) from hits
```
- FileCache: 0ns
- IOHitCacheNum: 552
- IOTotalNum: 835
- ReadFromFileCacheBytes: 19.98 MB
- ReadFromWriteCacheBytes: 0.00
- ReadTotalBytes: 29.52 MB
- SkipCacheBytes: 0.00
- WriteInFileCacheBytes: 915.77 MB
- WriteInFileCacheNum: 283
```
Only 30MB of data is needed, but 900MB+ of data is read from hdfs. The query time of Q1(single scan thread) increased from **5.17s** to **24.45s** when enable file cache.
Therefore, this PR introduce dynamic segment size which is based on the `read_size` of the data. In order to prevent too small or too large IO, the segment size is limited in [4096, file_cache_max_file_segment_size].
Q1 in clickbench is **5.66s** when enable file cache. The performance is almost the same as if the cache is disabled, and the data size read from hdfs is reduced to 45MB.
```
- FileCache: 0ns
- IOHitCacheNum: 297
- IOTotalNum: 835
- ReadFromFileCacheBytes: 8.73 MB
- ReadFromWriteCacheBytes: 0.00
- ReadTotalBytes: 29.52 MB
- SkipCacheBytes: 0.00
- WriteInFileCacheBytes: 45.66 MB
- WriteInFileCacheNum: 544
```
## Remaining Problems
Small queries may result in a large number of small files(4KB at least), and the `BE` saves too much meta information of cached segments.
## Fix bug
`FileCachePolicy` in `FileReaderOptions` is a constant reference, but the parameter passed in `FileFactory::create_file_reader` is a temporary variable, resulting in segmentation fault.
make rows_read correct so that the scheduler could using this correctly.
use single scanner if has limit clause. Move it from fragment context to scannode.
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
In vertical base compaction, same rows will be filtered in vertical_merge_iterator,
we should skip these filtered rows when set agg flag of delete sign.
For example, schema is a,b,delete_sign, and data is
1,1,1
1,1,0
1,1,0
2,2,1
2,2
and Block we get in VerticalBlockReader is
1,1,1
2,2,1
and we should set agg flag idex 0,4 to true when handle delete sign, so
we add a function continuous_agg_count to skip same rows filtered in
VerticalMergeIterator.
If alter task in queue, compaction is not enabled and may cause too much version.
Keep last 10 version in new tablet so that base tablet's max version will
not be merged and than we can copy data from base tablet to new tablet.
both update status and open_vectorized_internal will call send_report and stop report thread. move update_status code to open method and remove unnecessary send_report and stop_report_thread.
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
Tow improvements:
1. Refactor rowgroup&page filtering in `ParquetReader`, and use the operator overloading of Doris native c++ type to process comparison.
2. Support decimal/decimal v3/date/datev2/datetime/datetimev2
call pthread condition wait may block brpc thread.
no need wait for fragment because two phase exec fragment already guarantee that the fragment instance exits when runtime filter comes. So that I remove the condition wait code.
Co-authored-by: yiguolei <yiguolei@gmail.com>
MetricRegistry::trigger_all_hooks holds the metrics lock and is stuck in get_je_metrics, to_prometheus is waiting for MetricRegistry::trigger_all_hooks to release the lock, so get_je_metrics is no longer called in MetricRegistry::trigger_all_hooks.
Add DATA_TYPE in information schema for types: datev2, datatimev2, decimal, jsonb. It was 'unknown' for these types and cause problem for tools such as BI using information schema.
In disk balancer, if a tablet is in highly concurrent load,
new rowset creation time(which use current time) may be same as the
newest rowset, and when add tablet, there has a creation time check
that new_time must bigger than old time, so disk balancer will failed
many times and makes this tablet lose many verisons as migration will
block writes.
convert_nullable_flags does not contain nullable info for RowID column, but valid_column_ids contain RowID column, nullable falg will be undefined for RowID column