With the default config of 90%, BE may hit OOM when the load pressure is high.
When set to 80%, BE works well under the same load pressure in my cluster.
The core dump is caused by a DCHECK failure:
F0513 22:48:56.059758 3996895 tablet.cpp:2690] Check failed: num_to_read == num_read
Finally, we found that the DCHECK failure is caused by the page cache:
1. At first we have 20 segments, with ids 0-19.
2. For a MoW table, the memtable flush process calculates the delete bitmap. During this procedure, the index pages and data pages of the PrimaryKeyIndex are loaded into the page cache.
3. Segment compaction compacts all these segments into 2 segments and renames them to ids 0 and 1.
4. Finally, before the load commits, we calculate the delete bitmap between the segments in the current rowset. This procedure needs to iterate the primary key index of each segment, but when we access data of the newly compacted segments, we actually read data of the old segments from the page cache.
To fix this issue, there are two obvious policies:
1. Add a CRC32 or last-modified time to the CacheKey.
2. Invalidate the related cache keys after segment compaction.
For policy 1, we don't have a CRC32 in the segment footer, and getting the last-modified time requires one additional disk IO.
For policy 2, we would need to add extra page cache invalidation methods, which may make the page cache less stable.
So I think we can simply add the file size to the cache key to detect that the file has changed.
In an LSM-tree, every modification generates new files, so this kind of file-name reuse is not the normal case (as far as I know, only segment compaction does it); the file size is enough to identify that the file has changed.
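A minimal sketch of the idea, using illustrative names rather than the actual Doris page cache types:

```cpp
#include <cstdint>
#include <string>

// Sketch only. Before the fix, a key built from (file name, offset) collides
// when segment compaction reuses a segment file name, so lookups can return
// pages that belong to the old, replaced file.
struct PageCacheKey {
    std::string fname;  // segment file path
    int64_t offset;     // page offset within the file
};

// After the fix, the file size participates in the key. Since every
// modification in an LSM-tree writes a new file, a reused file name with a
// different size is enough to distinguish new pages from stale ones.
struct PageCacheKeyWithSize {
    std::string fname;
    int64_t fsize;      // size of the file the page was read from
    int64_t offset;
};
```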
Fix some wrong downcasts found by UBSan.
```cpp
doris/be/src/olap/bloom_filter_predicate.h:43:32: runtime error: downcast of address 0x7f8ec2b691a0 which does not point to an object of type 'doris::BloomFilterColumnPredicate<doris::TYPE_DATE>::SpecificFilter' (aka 'BloomFilterFunc<(doris::PrimitiveType)11U>')
0x7f8ec2b691a0: note: object is of type 'doris::BloomFilterFunc<(doris::PrimitiveType)12>'
e5 55 00 00 10 74 58 42 e5 55 00 00 00 00 10 00 8e 7f 00 00 20 07 6f cc 8e 7f 00 00 80 fe 68 cc
^~~~~~~~~~~~~~~~~~~~~~~
vptr for 'doris::BloomFilterFunc<(doris::PrimitiveType)12>'
```
1. TYPE_DATE/TYPE_DATETIME have the same data format, so I changed the bloom filter cast to a reinterpret_cast.
```cpp
doris/be/src/vec/exec/format/orc/vorc_reader.h:281:17: runtime error: downcast of address 0x7f562f4c3180 which does not point to an object of type 'ColumnVector<int>'
0x7f562f4c3180: note: object is of type 'doris::vectorized::ColumnDecimal<doris::vectorized::Decimal<int> >'
74 65 00 00 20 91 70 f5 ca 55 00 00 02 00 00 00 00 00 00 00 f0 d4 4c 2f 56 7f 00 00 f0 d4 4c 2f
^~~~~~~~~~~~~~~~~~~~~~~
vptr for 'doris::vectorized::ColumnDecimal<doris::vectorized::Decimal<int> >'
```
2. Doris uses ColumnDecimal to store decimal elements, so the ORC reader should not downcast such a column to ColumnVector<int>.
1. Fix an inconsistent definition of `Retention`: this function returns TINYINT on `FE` but UINT8 on `BE`.
2. Make assert_cast support casting to a derived type (see the sketch after this list).
3. Change some static_cast calls to assert_cast.
4. Support sum(bool)/avg(bool).
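Item 2 amounts to something like the sketch below (not the actual Doris implementation): in debug builds the downcast is verified with `dynamic_cast`, so a wrong cast such as the ColumnVector/ColumnDecimal case above fails fast instead of reading through the wrong vptr, while release builds degrade to a plain `static_cast`.

```cpp
#include <cassert>

// Sketch only: To is the target pointer type (possibly a class derived from
// From), and From must be polymorphic for dynamic_cast to work.
template <typename To, typename From>
To assert_cast(From* from) {
#ifndef NDEBUG
    assert(from == nullptr || dynamic_cast<To>(from) != nullptr);
#endif
    return static_cast<To>(from);
}
```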
For performance reasons, rowsets belonging to a cold-heat separation table are made to use the file block cache regardless of what config the user has set.
I've tested the config using cold_heat_seperation_case_p2 and it works well.
1. Support exporting the `LARGEINT` data type to the Parquet/ORC file formats.
2. Export the Doris `DATE/DATETIME` types to the `Date/Timestamp` logical types of the Parquet file format.
3. Fix incorrect data when DATE type data is exported to ORC.
If we use Clang-16 to build the third-party libraries and then build doris_be_test against them, doris_be_test cannot run successfully; some errors related to BRPC occur.
I tested this on Linux (x86_64) and macOS (x86_64/arm64), and these errors were always raised.
Support users manually injecting table-level statistics.
Table stats types:
- row_count
Modify table or partition statistics:
```SQL
ALTER TABLE table_name SET STATS ('k1' = 'v1', ...)
```
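For example, to inject a row count (the table name and value here are only illustrative): `ALTER TABLE example_tbl SET STATS ('row_count' = '100000')`.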
TODO:
- support other table stats types if necessary
- update statistics cache if necessary
1. Refactor aggregate normalization to avoid data amplification before the aggregate.
2. Remove useless aggregate processing in ExtractAndNormalizeWindowExpression.
3. Only push down the children of distinct aggregate functions.
TODO:
1. Push down redundant expressions in aggregate functions.
2. Refactor the normalize-repeat rule.
3. Move expression normalization and optimization after plan normalization to avoid unexpected expression optimization.
When the user sets default_storage_medium to true, the storage medium of all partitions should be SSD,
and the cooldown time should be 9999-12-31 23:59:59, so that it won't change to HDD.
But it looks like sometimes it still changes to HDD.
So I changed the debug log to info to observe it.
Unfortunately, BthreadCountDownEvent does not work as a sync primitive for this scenario, where all workers are pthreads. BthreadCountDownEvent::time_wait is meant for bthreads, so using it here results in confusing synchronization problems such as heap-buffer use-after-free.
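A pthread-friendly countdown latch built only on std::mutex/std::condition_variable is a safer fit for this case; the sketch below is illustrative (the class name is mine, not from the code base):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Illustrative pthread-side countdown latch: workers call count_down(), the
// waiter blocks on a std::condition_variable, so no bthread-specific wakeup
// machinery is involved.
class PThreadCountDownLatch {
public:
    explicit PThreadCountDownLatch(int count) : _count(count) {}

    void count_down() {
        std::lock_guard<std::mutex> lock(_mutex);
        if (_count > 0 && --_count == 0) {
            _cv.notify_all();
        }
    }

    // Returns true if the count reached zero before the timeout expired.
    bool wait_for(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(_mutex);
        return _cv.wait_for(lock, timeout, [this] { return _count == 0; });
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    int _count;
};
```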
1. Some encrypt and decrypt functions have the wrong blockEncryptionMode.
2. The topN node should compare tuples from intermediate_row_desc with first_sort_slot.tuple_id.
3. We must keep the limit if it's an uncorrelated IN-subquery with a limit on the sort, like `select a from t1 where a in (select b from t2 order by xx limit yy)`.