Currently, for broadcast shuffle, we serialize a block once and then send it by RPC through multiple channel. After this, we will serialize next block in the same memory for consideration of memory reuse. However, since the RPC is asynchronized, maybe the next block serialization will happen before sending the previous block.
So, in this PR, I use a ref count to identify if the serialized block can be reuse in broadcast shuffle.
`CachedRemoteFileReader` has used fixed segment size(file_cache_max_file_segment_size=4M) to cache remote file blocks. However, the column size in a rowgroup/strip maybe smaller than 10K if a parquet/orc file has many columns, resulting in particularly serious read amplification. For example:
Q1 in clickbench: select count(*) from hits
```
- FileCache: 0ns
- IOHitCacheNum: 552
- IOTotalNum: 835
- ReadFromFileCacheBytes: 19.98 MB
- ReadFromWriteCacheBytes: 0.00
- ReadTotalBytes: 29.52 MB
- SkipCacheBytes: 0.00
- WriteInFileCacheBytes: 915.77 MB
- WriteInFileCacheNum: 283
```
Only 30MB of data is needed, but 900MB+ of data is read from hdfs. The query time of Q1(single scan thread) increased from **5.17s** to **24.45s** when enable file cache.
Therefore, this PR introduce dynamic segment size which is based on the `read_size` of the data. In order to prevent too small or too large IO, the segment size is limited in [4096, file_cache_max_file_segment_size].
Q1 in clickbench is **5.66s** when enable file cache. The performance is almost the same as if the cache is disabled, and the data size read from hdfs is reduced to 45MB.
```
- FileCache: 0ns
- IOHitCacheNum: 297
- IOTotalNum: 835
- ReadFromFileCacheBytes: 8.73 MB
- ReadFromWriteCacheBytes: 0.00
- ReadTotalBytes: 29.52 MB
- SkipCacheBytes: 0.00
- WriteInFileCacheBytes: 45.66 MB
- WriteInFileCacheNum: 544
```
## Remaining Problems
Small queries may result in a large number of small files(4KB at least), and the `BE` saves too much meta information of cached segments.
## Fix bug
`FileCachePolicy` in `FileReaderOptions` is a constant reference, but the parameter passed in `FileFactory::create_file_reader` is a temporary variable, resulting in segmentation fault.
There is something wrong with the `test_broker_load` suite(s3 auth problem).
So I ignore this case temporarily.
cc @wsjz , please help to solve it and add it back
When JDBC client enable server side prepared statement, it will cache OlapScanNode and reuse it for performance, but each
time call `addScanRangeLocations` will add new item to `bucketSeq2locations`, so the `bucketSeq2locations` lead to a memleak if OlapScanNode
cached in memory
make rows_read correct so that the scheduler could using this correctly.
use single scanner if has limit clause. Move it from fragment context to scannode.
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
In vertical base compaction, same rows will be filtered in vertical_merge_iterator,
we should skip these filtered rows when set agg flag of delete sign.
For example, schema is a,b,delete_sign, and data is
1,1,1
1,1,0
1,1,0
2,2,1
2,2
and Block we get in VerticalBlockReader is
1,1,1
2,2,1
and we should set agg flag idex 0,4 to true when handle delete sign, so
we add a function continuous_agg_count to skip same rows filtered in
VerticalMergeIterator.
before this PR, Doris cannot process sql like that
```sql
CREATE TABLE `test_sq_dj1` (
`c1` int(11) NULL,
`c2` int(11) NULL,
`c3` int(11) NULL
) ENGINE=OLAP
DUPLICATE KEY(`c1`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`c1`) BUCKETS 3
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"in_memory" = "false",
"storage_format" = "V2",
"disable_auto_compaction" = "false"
);
CREATE TABLE `test_sq_dj2` (
`c1` int(11) NULL,
`c2` int(11) NULL,
`c3` int(11) NULL
) ENGINE=OLAP
DUPLICATE KEY(`c1`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`c1`) BUCKETS 3
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"in_memory" = "false",
"storage_format" = "V2",
"disable_auto_compaction" = "false"
);
insert into test_sq_dj1 values(1, 2, 3), (10, 20, 30), (100, 200, 300);
insert into test_sq_dj2 values(10, 20, 30);
-- core
SELECT * FROM test_sq_dj1 WHERE c1 IN (SELECT c1 FROM test_sq_dj2) OR c1 IN (SELECT c1 FROM test_sq_dj2) OR c1 < 10;
-- invalid slot
SELECT * FROM test_sq_dj1 WHERE c1 IN (SELECT c1 FROM test_sq_dj2) OR c1 IN (SELECT c2 FROM test_sq_dj2) OR c1 < 10;
```
there are two problems:
1. we should remove redundant sub-query in one conjuncts to avoid generate useless join node
2. when we have more than one sub-query in one disjunct. we should put the conjunct contains the disjunct at the top node of the set of mark join nodes. And pop up the mark slot to the top node.
RewriteBinaryPredicatesRule rewrite expression like
`cast(A decimal) > decimal` to `A > some_other_bigint`
in order to:
1. push down the rewrite predicate
2. avoid convert column A to decimal
We get the datatype of `A` by `expr0.getSrcSlotRef().getColumn().getType()`.
However, when A is result of a function from sub-query, this rule is not applicable.
For example:
```
select *
from (
select TIMESTAMPDIFF(MINUTE,startTime,endTime) AS timediff
from CNC_SliceSate) T
where timediff > 5.0;
```
we cannot push predicate down to OlapScan(CNC_SliceSate) to save effort.
adjust nullable on empty set should apply after unnested sub-query
some function should propagate nullable when args are datev2 or datetimev2
add back tpch sf0.1 nereids regression test
When the argument of truncate function is float type, it can match both truncate(DECIMALV3) and truncate(DOUBLE), if the match is truncate(DECIMALV3), the precision is lost when converting float to DECIMALV3(38, 0).
Here I modify it to match truncate(DOUBLE) for now, maybe we still need to solve the problem of losing precision when converting float to DECIMALV3.
If alter task in queue, compaction is not enabled and may cause too much version.
Keep last 10 version in new tablet so that base tablet's max version will
not be merged and than we can copy data from base tablet to new tablet.
```
create table tbl1 (k1 varchar(100), k2 string) distributed by hash(k1) buckets 1 properties("replication_num" = "1");
insert into tbl1 values(1, "alice");
select cast(k1 as INT) as id from tbl1 order by id limit 2;
```
The above query could pass `checkEnableTwoPhaseRead` since the order by element is SlotRef but actually it's an function call expr
both update status and open_vectorized_internal will call send_report and stop report thread. move update_status code to open method and remove unnecessary send_report and stop_report_thread.
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
This pr added support for the pre-aggregation hint. Users could use /*+PREAGGOPEN*/ to enable pre-preaggregation for OLAP table.
For example:
Let's say we have an aggregate-keys table t (k1 int, k2 int, v1 int sum, v2 int sum). Pre-aggregation could be enabled by query with a hint: select k1, v1 from t /*+PREAGGOPEN*/.