- Add 2 new BE configs: `s3_read_base_wait_time_ms` and `s3_read_max_wait_time_ms`.
  When an S3 "get" request hits a 429 (Too Many Requests) error, it sleeps for
  `s3_read_base_wait_time_ms` (multiplied by 1, 2, 3, 4, ... on each retry) ms and then tries again.
  The sleep time is capped by `s3_read_max_wait_time_ms` and the number of retries is capped by
  `max_s3_client_retry`; a sketch of this backoff is shown after this list.
- Add more metrics for the S3 file reader:
  - `s3_file_reader_too_many_request`: counter of 429 errors.
  - `s3_file_reader_s3_get_request`: QPS of S3 GET requests.
  - `TotalGetRequest`: GET request counter in the query profile.
  - `TooManyRequestErr`: 429 error counter in the query profile.
  - `TooManyRequestSleepTime`: total sleep time after 429 errors, in the query profile.
  - `TotalBytesRead`: total bytes read from S3, in the query profile.
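A minimal sketch of the backoff described above, assuming the three configs behave as stated; the helper names and the shape of the retry loop are illustrative, not the actual BE implementation.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical stand-ins for the BE configs described above.
static int64_t s3_read_base_wait_time_ms = 100;
static int64_t s3_read_max_wait_time_ms = 800;
static int64_t max_s3_client_retry = 10;

// Sketch of the backoff: on a 429 the GET sleeps base * retry_count ms (capped by the
// max wait time) and retries, up to max_s3_client_retry attempts.
// `do_get` / `is_too_many_request` are placeholder callables.
template <typename GetFn, typename Is429Fn>
auto get_with_backoff(GetFn do_get, Is429Fn is_too_many_request) {
    auto resp = do_get();
    for (int64_t retry = 1; retry <= max_s3_client_retry && is_too_many_request(resp); ++retry) {
        int64_t sleep_ms = std::min(s3_read_base_wait_time_ms * retry, s3_read_max_wait_time_ms);
        std::this_thread::sleep_for(std::chrono::milliseconds(sleep_ms));
        resp = do_get();  // try the GET again after sleeping
    }
    return resp;
}
```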
Do not process the runtime filter (rf) in `HashJoinBuildSinkLocalState::close`; otherwise the query can crash with the following stack trace:
```
*** Query id: ee97f0c64a76436b-babc251c7d6702fb ***
*** is nereids: 1 ***
*** tablet id: 0 ***
*** Aborted at 1716780426 (unix time) try "date -d @1716780426" if you are using GNU date ***
*** Current BE git commitID: 813074b ***
*** SIGSEGV address not mapped to object (@0x0) received by PID 12924 (TID 15847 OR 0x7efbe5aa5700) from PID 0; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /root/doris/be/src/common/signal_handler.h:421
1# PosixSignals::chained_handler(int, siginfo_t*, void*) [clone .part.0] in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
2# JVM_handle_linux_signal in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
3# 0x00007F064FF1C090 in /lib/x86_64-linux-gnu/libc.so.6
4# doris::BloomFilterFuncBase::merge(doris::BloomFilterFuncBase*) at /root/doris/be/src/exprs/bloom_filter_func.h:169
5# doris::RuntimePredicateWrapper::merge(doris::RuntimePredicateWrapper const*) at /root/doris/be/src/exprs/runtime_filter.cpp:507
6# doris::IRuntimeFilter::merge_from(doris::RuntimePredicateWrapper const*) at /root/doris/be/src/exprs/runtime_filter.cpp:1497
7# doris::IRuntimeFilter::publish(bool)::$_2::operator()() const in /home/work/unlimit_teamcity/TeamCity/Agents/20240527104837agent_172.16.0.93_1/work/60183217f6ee2a9c/output/be/lib/doris_be
8# doris::IRuntimeFilter::publish(bool) at /root/doris/be/src/exprs/runtime_filter.cpp:1015
9# doris::VRuntimeFilterSlots::publish(bool) at /root/doris/be/src/exprs/runtime_filter_slots.h:137
10# doris::pipeline::HashJoinBuildSinkLocalState::close(doris::RuntimeState*, doris::Status) in /home/work/unlimit_teamcity/TeamCity/Agents/20240527104837agent_172.16.0.93_1/work/60183217f6ee2a9c/output/be/lib/doris_be
11# doris::pipeline::DataSinkOperatorXBase::close(doris::RuntimeState*, doris::Status) at /root/doris/be/src/pipeline/exec/operator.h:491
12# doris::pipeline::PipelineTask::close(doris::Status) at /root/doris/be/src/pipeline/pipeline_task.cpp:436
13# doris::pipeline::_close_task(doris::pipeline::PipelineTask*, doris::Status) at /root/doris/be/src/pipeline/task_scheduler.cpp:88
14# doris::pipeline::TaskScheduler::_do_work(unsigned long) in /home/work/unlimit_teamcity/TeamCity/Agents/20240527104837agent_172.16.0.93_1/work/60183217f6ee2a9c/output/be/lib/doris_be
15# doris::ThreadPool::dispatch_thread() at /root/doris/be/src/util/threadpool.cpp:551
16# doris::Thread::supervise_thread(void*) at /root/doris/be/src/util/thread.cpp:499
17# start_thread at /build/glibc-SzIz7B/glibc-2.31/nptl/pthread_create.c:478
18# __clone at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
```
The file list is obtained from the external meta cache, and the files may already
have been removed from storage.
We should ignore files that are not found and let the query continue, as sketched below.
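A hedged sketch of the intended behavior, using illustrative types rather than the real `doris::Status` and file-reader APIs: files that the meta cache still lists but storage no longer has are skipped, while any other error still fails the query.

```cpp
#include <functional>
#include <string>
#include <vector>

// Illustrative status type; not the actual doris::Status.
enum class ErrCode { OK, NOT_FOUND, IO_ERROR };
struct Status {
    ErrCode code = ErrCode::OK;
    bool ok() const { return code == ErrCode::OK; }
    bool is_not_found() const { return code == ErrCode::NOT_FOUND; }
};

// Skip files that the external meta cache still lists but storage has already removed;
// any other error still fails the scan.
Status scan_files(const std::vector<std::string>& cached_file_list,
                  const std::function<Status(const std::string&)>& open_file) {
    for (const auto& path : cached_file_list) {
        Status st = open_file(path);
        if (st.is_not_found()) continue;  // file was deleted after the cache was built; ignore it
        if (!st.ok()) return st;          // real errors still abort the query
        // ... read the file ...
    }
    return Status{};
}
```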
## Proposed changes
Error message before this change:
```
Failed to submit scanner to scanner pool
```
Error message after this change:
```
Failed to submit scanner to scanner pool reason:Scan thread pool had shutdown|type 1
```
## Proposed changes
backport #35347
pick from master #35445
1. Previously, if the error code was not OK and the status was fetched afterwards, the status
could still be OK, so some DCHECKs could fail.
This PR uses a std::mutex to make this behavior consistent; a minimal sketch of the idea follows.
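A minimal sketch of the idea, with made-up names: keep the error code and the status behind the same `std::mutex`, so a reader can never observe a non-OK code together with an OK status.

```cpp
#include <mutex>
#include <string>
#include <utility>

// Illustrative holder: the error code and the status message are always written and
// read under the same mutex, so "code is not OK but status looks OK" can no longer
// be observed, and the DCHECKs on that invariant hold.
class QueryStatusHolder {
public:
    void set_error(int code, std::string msg) {
        std::lock_guard<std::mutex> lock(_mu);
        if (_code != 0) return;  // keep the first error
        _code = code;
        _msg = std::move(msg);
    }

    // Return both fields from the same critical section.
    std::pair<int, std::string> get() const {
        std::lock_guard<std::mutex> lock(_mu);
        return {_code, _msg};
    }

private:
    mutable std::mutex _mu;
    int _code = 0;  // 0 means OK
    std::string _msg;
};
```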
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
## Proposed changes
Some operators have a limit condition; the source operator should notify
the sink operator when the limit is reached.
Although the FE has its own limit logic, it does not always send the limit. A sketch of the notification idea is shown below.
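A hedged sketch of that notification with invented names (the real operators and shared state differ): the source side flips a shared flag once its limit is reached, and the sink side checks the flag to stop early.

```cpp
#include <atomic>
#include <cstdint>
#include <memory>

// Shared state between a source operator and its sink: when the source has produced
// `limit` rows it flips the flag, and the sink stops feeding data. Names are invented.
struct SharedLimitState {
    std::atomic<bool> reached_limit{false};
};

class SourceOperatorSketch {
public:
    SourceOperatorSketch(std::shared_ptr<SharedLimitState> state, int64_t limit)
            : _state(std::move(state)), _limit(limit) {}

    void on_rows_returned(int64_t rows) {
        _returned += rows;
        if (_limit >= 0 && _returned >= _limit) {
            _state->reached_limit.store(true);  // notify the sink side
        }
    }

private:
    std::shared_ptr<SharedLimitState> _state;
    int64_t _limit;
    int64_t _returned = 0;
};

class SinkOperatorSketch {
public:
    explicit SinkOperatorSketch(std::shared_ptr<SharedLimitState> state) : _state(std::move(state)) {}
    bool should_stop() const { return _state->reached_limit.load(); }

private:
    std::shared_ptr<SharedLimitState> _state;
};
```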
When the dictionary column is wrapped in null-related functions, as in the following SQL statements, the results will be incorrect.
```
select * from ( select IF(o_orderpriority IS NULL, 'null', o_orderpriority) AS o_orderpriority from test_string_dict_filter_orc ) as A where o_orderpriority = 'null';
```
```
select * from ( select IFNULL(o_orderpriority, 'null') AS o_orderpriority from test_string_dict_filter_parquet ) as A where o_orderpriority = 'null'
```
```
select * from ( select COALESCE(o_orderpriority, 'null') AS o_orderpriority from test_string_dict_filter_parquet ) as A where o_orderpriority = 'null';
```
1. `memory.usage_in_bytes ~= free.used + free.(buff/cache) - (buff)`; the cached part can be reclaimed,
   so change the calculation to `cgroup_memory_usage = memory.usage_in_bytes - memory.meminfo["Cached"]`
   (a rough sketch follows this list).
2. If the system is not configured with cgroups, looking up the cgroup file path fails; refactor the
   cgroup memory info refresh so it tolerates that failure.
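A rough sketch of the adjusted calculation, assuming the cgroup v1 file layout; treating `memory.meminfo["Cached"]` as the `cache` field of `memory.stat` is my assumption, and the fallback for the "cgroup not configured" case is only hinted at via the optional return values.

```cpp
#include <cstdint>
#include <fstream>
#include <optional>
#include <string>

// Read memory.usage_in_bytes (a single integer) from the cgroup v1 memory controller.
std::optional<int64_t> read_usage_in_bytes(const std::string& cgroup_dir) {
    std::ifstream in(cgroup_dir + "/memory.usage_in_bytes");
    int64_t v = 0;
    if (in >> v) return v;
    return std::nullopt;  // e.g. cgroup not configured: caller must fall back gracefully
}

// Read the "cache" field from memory.stat (page cache charged to the cgroup).
std::optional<int64_t> read_cached_bytes(const std::string& cgroup_dir) {
    std::ifstream in(cgroup_dir + "/memory.stat");
    std::string key;
    int64_t v = 0;
    while (in >> key >> v) {
        if (key == "cache") return v;
    }
    return std::nullopt;
}

// cgroup_memory_usage = usage_in_bytes - cached, since the cached pages can be reclaimed.
std::optional<int64_t> cgroup_memory_usage(const std::string& cgroup_dir) {
    auto usage = read_usage_in_bytes(cgroup_dir);
    auto cached = read_cached_bytes(cgroup_dir);
    if (!usage || !cached) return std::nullopt;
    return *usage - *cached;
}
```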
* [fix](compaction test) show single replica compaction status and fix test (#33076)
* [improve](http action) add http interface to calculate the crc of all files in tablet (#34915)
* [fix](compression) handle exception to reuse compression context
Otherwise there is a memory leak: a new context is allocated each time, and the resulting TLB
flushes consume a lot of system CPU. A sketch of the pattern follows.
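A hedged sketch of the pattern with illustrative names (not the actual codec code): the compression context is returned to a reuse pool even when compression throws, so failures no longer leak contexts or force a fresh allocation every time.

```cpp
#include <memory>
#include <mutex>
#include <vector>

struct CompressionContext { /* ... codec state ... */ };

// Illustrative reuse pool for compression contexts.
class ContextPool {
public:
    std::unique_ptr<CompressionContext> acquire() {
        std::lock_guard<std::mutex> l(_mu);
        if (_pool.empty()) return std::make_unique<CompressionContext>();
        auto ctx = std::move(_pool.back());
        _pool.pop_back();
        return ctx;
    }
    void release(std::unique_ptr<CompressionContext> ctx) {
        std::lock_guard<std::mutex> l(_mu);
        _pool.push_back(std::move(ctx));
    }

private:
    std::mutex _mu;
    std::vector<std::unique_ptr<CompressionContext>> _pool;
};

// Run compression and hand the context back to the pool even if compression throws.
template <typename CompressFn>
void compress_with_reuse(ContextPool& pool, CompressFn compress) {
    auto ctx = pool.acquire();
    try {
        compress(*ctx);
    } catch (...) {
        pool.release(std::move(ctx));  // return the context even on failure
        throw;
    }
    pool.release(std::move(ctx));
}
```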
Otherwise it results in a crash:
```
*** SIGSEGV address not mapped to object (@0x0) received by PID 4149909 (TID 4152328 OR 0x7efefc60d700) from PID 0; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_master/doris/be/src/common/signal_handler.h:421
1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
2# JVM_handle_linux_signal in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
3# 0x00007F031AD0E090 in /lib/x86_64-linux-gnu/libc.so.6
4# doris::Status doris::vectorized::MutableBlock::merge_impl<doris::vectorized::Block const&>(doris::vectorized::Block const&) at /home/zcp/repo_center/doris_master/doris/be/src/vec/core/block.h:586
5# doris::Status doris::vectorized::MutableBlock::merge<doris::vectorized::Block const&>(doris::vectorized::Block const&) at /home/zcp/repo_center/doris_master/doris/be/src/vec/core/block.h:521
```
* Issue: Doris occasionally encounters an issue where memory usage becomes exceptionally high and does not decrease. The leaked memory is occupied by Bloom filters stored in memory.
Reason: The segment cache stores segment objects read from files in memory. It is an LRU cache whose eviction strategy is: when the number of cached segments exceeds the maximum count, or the total memory size of the cached segment objects exceeds the maximum usage, the older segments are evicted.
However, one code path first reads a segment object into memory (say it occupies size A) and puts it into the cache, at which point the cache records the segment's size as A. It then reads the segment's Bloom filter from the file and assigns it to the segment's Bloom filter member (say the Bloom filter occupies size B), so the segment object now actually occupies A + B. The cache never updates the recorded size, so the real size of the cached segment (A + B) is larger than what the cache accounts for (A).
As the number of cached segment objects grows, the memory actually in use surges, but the cache never sees its eviction limit being reached and therefore never evicts, which shows up as a memory leak.
Solution: Since each segment object reads its Bloom filter only once, the issue can be fixed by changing the order from "read the segment, put it into the cache, then read the Bloom filter" to "read the segment, read the Bloom filter, then put it into the cache"; a sketch follows.
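A hedged sketch of that reordering; the cache and segment types are stand-ins for the real BE classes. The point is that the Bloom filter is loaded before the segment is handed to the LRU cache, so the size the cache charges already includes it.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Illustrative stand-ins; the real classes are the BE's Segment and the segment LRU cache.
struct SegmentSketch {
    size_t base_size = 0;          // size A: the segment object right after loading
    size_t bloom_filter_size = 0;  // size B: the Bloom filter read from the file
    size_t mem_size() const { return base_size + bloom_filter_size; }
    void load_bloom_filter() { bloom_filter_size = 4096; }  // stub: pretend we read B bytes
};

struct LruCacheSketch {
    // The cache charges `charge` bytes against its capacity at insert time.
    void insert(const std::string& /*key*/, std::shared_ptr<SegmentSketch> /*seg*/,
                size_t /*charge*/) { /* ... LRU bookkeeping ... */ }
};

// Before the fix: load segment -> insert (charged only A) -> load Bloom filter (B never charged).
// After the fix:  load segment -> load Bloom filter -> insert, so the cache is charged A + B.
void load_and_cache(LruCacheSketch& cache, const std::string& key,
                    std::shared_ptr<SegmentSketch> seg) {
    seg->load_bloom_filter();                 // read the Bloom filter first
    cache.insert(key, seg, seg->mem_size());  // the charged size now includes the Bloom filter
}
```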