Suppose three queries are executed in a resource group with a memory_limit of 8G, and they consume memory of query_a = 3G, query_b = 3G, and query_c = 3G. The total memory used is counted as 9G when the resource group GC is executed, which exceeds the resource group limit and cancels query_a.
When the resource group is next GC, the memory of query_a may not be freed yet, and it will be counted again in the total memory consumed by that resource group, which again exceeds the resource group limit and cancels query_b.
From the user's perspective, it is fine to execute query_a and query_b at the same time, but executing query_ a, query_b and query_c will be cancelled for two queries, which is not as expected.
This pr skips the queries that are cancelled when counting the memory used by the resource group. If this causes the process memory to grow, the process gc will handle it.
After the outer catch exception, faststring resize reserve build may throw a memory alloc failure exception from the Allocator.
Currently page body compress will catch memory alloc failure exception
Currently, there are some useless includes in the codebase. We can use a tool named include-what-you-use to optimize these includes. By using a strict include-what-you-use policy, we can get lots of benefits from it.
No check mem tracker limit and no cancel task in mem hook, only in Allocator. This helps in clearer analysis of memory issues and reduces performance loss.
PODArray/hash table/arena memory allocation will use Allocator.
Optimize mem limit exceeded log printing
Optimize compilation time
Remove page cache regular clear
Now the page cache is turned off by default. If the user manually opens the page cache, it can be considered that the user can accept the memory usage of the page cache, and then can consider adding a manual clear command to the cache.
fix memory gc cancel top memory query
jemalloc prof is not enabled by default
After the memory exceeds the limit, the previous query waited for memory free in the mem hook, and changed it to wait in the Allocator.
more controllable and safe
Problem:
1. FE will split the parquet file into split. So a file can have several splits.
2. BE will scan each split, read the footer of the parquet file.
3. If 2 splits belongs to a same parquet file, the footer of this file will be read twice.
This PR mainly changes:
1. Use kv cache to cache the footer of parquet file.
2. The kv cache is belong to a scan node, so all parquet reader belong to this scan node will share same kv cache.
3. In cache, the key is "meta_file_path", the value is parsed thrift footer.
The KV Cache is sharded into mutlti sub cache.
So that different file can use different sub cache, avoid blocking each other
In my test, a query with 26 splits can reduce the footer parse time from 4s -> 1s
Add a special counter for memtracker, faster, but relaxed ordering and not accurate in real time
Track thread create and destroy memory, which was previously removed due to performance loss and added back
Fix Redhat 4.x OS /proc/meminfo has no MemAvailable, disable MemAvailable to control memory.
vm_rss_str and mem_available_str recorded when gc is triggered, to avoid memory changes during gc and cause inaccurate logs.
join probe catch bad_alloc, this may alloc 64G memory at a time, avoid OOM.
Modify document doris_be_all_segments_num and doris_be_all_rowsets_num names.
MetricRegistry::trigger_all_hooks holds the metrics lock and is stuck in get_je_metrics, to_prometheus is waiting for MetricRegistry::trigger_all_hooks to release the lock, so get_je_metrics is no longer called in MetricRegistry::trigger_all_hooks.
1. When the process memory is insufficient, print the process memory statistics in a more timely and detailed manner.
2. Support regular GC cache, currently only page cache and chunk allocator are included, because many people reported that the memory does not drop after the query ends.
3. Reduce system available memory warning water mark to reduce memory waste
4. Optimize soft mem limit logging
Overcommit memory means that when the memory is sufficient, it no longer checks whether the memory of query/load exceeds the exec mem limit.
Instead, when the process memory exceeds the limit or the available system memory is insufficient, cancel the top overcommit query in minor gc, and cancel top memory query in full gc.
Previously only query supported overcommit memory, this pr supports load, including insert into and stream load.
Detailed explanation, I will update the memory document in these two days~
15bd56cd43/docs/zh-CN/docs/admin-manual/maint-monitor/memory-management/memory-limit-exceeded-analysis.md
Add conf enable_query_memroy_overcommit
If true, when the process does not exceed the soft mem limit, the query memory will not be limited; when the process memory exceeds the soft mem limit, the query with the largest ratio between the currently used memory and the exec_mem_limit will be canceled.
If false, cancel query when the memory used exceeds exec_mem_limit, same as before.
When the system MemAvailable is less than the warning water mark, or the memory used by the BE process exceeds the mem soft limit, run minor gc and try to release cache.
When the MemAvailable of the system is less than the low water mark, or the memory used by the BE process exceeds the mem limit, run fucc gc, try to release the cache, and start canceling from the query with the largest memory usage until the memory of mem_limit * 20% is released.
boost::stacktrace::stacktrace() has memory leak, so use glog internal func to print stacktrace.
The reason for the memory leak of boost::stacktrace is that a state is saved in the thread local of each thread but not actively released. The test found that each thread leaked about 100M after calling boost::stacktrace.
refer to:
boostorg/stacktrace#118boostorg/stacktrace#111
mem tracker can be logically divided into 4 layers: 1)process 2)type 3)query/load/compation task etc. 4)exec node etc.
type includes
enum Type {
GLOBAL = 0, // Life cycle is the same as the process, e.g. Cache and default Orphan
QUERY = 1, // Count the memory consumption of all Query tasks.
LOAD = 2, // Count the memory consumption of all Load tasks.
COMPACTION = 3, // Count the memory consumption of all Base and Cumulative tasks.
SCHEMA_CHANGE = 4, // Count the memory consumption of all SchemaChange tasks.
CLONE = 5, // Count the memory consumption of all EngineCloneTask. Note: Memory that does not contain make/release snapshots.
BATCHLOAD = 6, // Count the memory consumption of all EngineBatchLoadTask.
CONSISTENCY = 7 // Count the memory consumption of all EngineChecksumTask.
}
Object pointers are no longer saved between each layer, and the values of process and each type are periodically aggregated.
other fix:
In [fix](memtracker) Fix transmit_tracker null pointer because phamp is not thread safe #13528, I tried to separate the memory that was manually abandoned in the query from the orphan mem tracker. But in the actual test, the accuracy of this part of the memory cannot be guaranteed, so put it back to the orphan mem tracker again.
# Proposed changes
This PR fixed lots of issues when building from source on macOS with Apple M1 chip.
## ATTENTION
The job for supporting macOS with Apple M1 chip is too big and there are lots of unresolved issues during runtime:
1. Some errors with memory tracker occur when BE (RELEASE) starts.
2. Some UT cases fail.
...
Temporarily, the following changes are made on macOS to start BE successfully.
1. Disable memory tracker.
2. Use tcmalloc instead of jemalloc.
This PR kicks off the job. Guys who are interested in this job can continue to fix these runtime issues.
## Use case
```shell
./build.sh -j 8 --be --clean
cd output/be/bin
ulimit -n 60000
./start_be.sh --daemon
```
## Something else
It takes around _**10+**_ minutes to build BE (with prebuilt third-parties) on macOS with M1 chip. We will improve the development experience on macOS greatly when we finish the adaptation job.