Manually track query/load/compaction/etc. memory in Allocator instead of mem hook.
Can still use Mem Hook when cannot manually track memory code segments and find memory locations during debugging.
This will cause memory tracking loss for Query, loss less than 10% compared to the past, but this is expected to be more controllable.
Similarly, Mem Hook will no longer track unowned memory to the orphan mem tracker by default, so the total memory of all MemTrackers will be less than before.
Not need to get memory size from jemalloc in Mem Hook each memory alloc and free, which would lose performance in the past.
Not require caching bthread local in pthread local for memory hook, in the past this has caused core dumps inside bthread, seems to be a bug in bthread.
ThreadContext life cycle to manual control
In the past, ThreadContext was automatically created when it was used for the first time (this was usually in the Jemalloc Hook when the first malloc memory), and was automatically destroyed when the thread exited.
Now instead of manually controlling the create and destroy of ThreadContext, it is mainly created manually when the task thread start and destroyed before the task thread end.
Run 43 clickbench query tests.
Use MemHook in the past:
In some cloud native deployment scenario, BE(especially the Compute Node BE) will be add to cluster and remove from cluster very frequently. User's query will fail if there is a fragment is running on the shutting down BE. Users could use stop_be.sh --grace, then BE will wait all running queries to stop to avoiding running query failure, but if the waiting time exceed the limit, then be will exit directly. During this period, FE will not send any queries to BE and waiting for all running queries to stop
Current initialization dependency:
Daemon ───┬──► StorageEngine ──► ExecEnv ──► Disk/Mem/CpuInfo
│
│
BackendService ─┘
However, original code incorrectly initialize Daemon before StorageEngine.
This PR also stop and join threads of daemon services in their dtor, to ensure Daemon services release resources in reverse order of initialization via RAII.
1. the real value of BE config `file_cache_max_file_reader_cache_size` will be the 1/3 of process's max open file number.
2. use thread pool to create or init the file cache concurrently.
To solve the issue that when there are lots of files in file cache dir, the starting time of BE will be very slow because
it will traverse all file cache dirs sequentially.
Jemalloc heap profile follows libgcc's way of backtracing by default.
rewrites dl_iterate_phdr will cause Jemalloc to fail to run after enable profile.
TODO, two solutions:
- Jemalloc specifies GNU libunwind as the prof backtracing way, but my test failed,
--enable-prof-libunwind not work: --enable-prof-libunwind not work jemalloc/jemalloc#2504
- ClickHouse/libunwind solves Jemalloc profile backtracing, but the branch of ClickHouse/libunwind
has been out of touch with GNU libunwind and LLVM libunwind, which will leave the fate to others.
Introduce libunwind get stack trace, cost is negligible and has line numbers.
use StackTraceCache, PHDRCache speed up, is customizable and has some optimizations.
Other stack trace tools remain: glog, boost, glibc, in case for need.
TODO:
currently support linux __x86_64__, __arm__, __powerpc__, not supported __FreeBSD__, APPLE
Note: __arm__, __powerpc__ not been verified
Support signal handle
libunwid support unw_backtrace for jemalloc
Use of undefined compile option USE_MUSL for later
1.
Fix bug that the field of s3_file_write_bufferpool is not initialized, causing undefined behavior.
2.
add fs_s3 benchmark tool,Reference to the usage of tools https://github.com/apache/doris/pull/20770
And opt the output:
`sh bin/run-fs-benchmark.sh --conf=conf/s3.conf --fs_type=s3 --operation=single_read --threads=1 --iterations=1`
```
------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
------------------------------------------------------------------------------------------------------------------------------
S3ReadBenchmark/iterations:1/repeats:3/manual_time/threads:1 7366 ms 123 ms 1 ReadRate(B/S)=12.1823M/s ReadTime(S)=7.36572 ReadTotal(B)=89.7314M
S3ReadBenchmark/iterations:1/repeats:3/manual_time/threads:1 6163 ms 116 ms 1 ReadRate(B/S)=14.5597M/s ReadTime(S)=6.16299 ReadTotal(B)=89.7314M
S3ReadBenchmark/iterations:1/repeats:3/manual_time/threads:1 6048 ms 110 ms 1 ReadRate(B/S)=14.8366M/s ReadTime(S)=6.04796 ReadTotal(B)=89.7314M
S3ReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_mean 6526 ms 116 ms 3 ReadRate(B/S)=13.8596M/s ReadTime(S)=6.52556 ReadTotal(B)=89.7314M
S3ReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_median 6163 ms 116 ms 3 ReadRate(B/S)=14.5597M/s ReadTime(S)=6.16299 ReadTotal(B)=89.7314M
S3ReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_stddev 730 ms 6.68 ms 3 ReadRate(B/S)=1.45914M/s ReadTime(S)=0.729876 ReadTotal(B)=0
S3ReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_cv 11.18 % 5.75 % 3 ReadRate(B/S)=10.53% ReadTime(S)=11.18% ReadTotal(B)=0.00%
S3ReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_max 7366 ms 123 ms 3 ReadRate(B/S)=14.8366M/s ReadTime(S)=7.36572 ReadTotal(B)=89.7314M
S3ReadBenchmark/iterations:1/repeats:3/manual_time/threads:1_min 6048 ms 110 ms 3 ReadRate(B/S)=12.1823M/s ReadTime(S)=6.04796 ReadTotal(B)=89.7314M
```
1. Refactor file cache. Before refactor, the file cache config format is "[{"path":"/path/to/file_cache","normal":21474836480,"persistent":10737418240,"query_limit":10737418240}]" and now change to "[{"path":"/mnt/disk3/selectdb_cloud/file_cache","total_size":21474836480,"query_limit":10737418240}]". It will be simpler than before.
2. Support more strategy. Support file cache priority. The file cache will have three queue, name as 'index'/'normal'/'disposable'. We can avoid that the higher priority data is eliminate by the lower priority data.
Currently, there are some useless includes in the codebase. We can use a tool named include-what-you-use to optimize these includes. By using a strict include-what-you-use policy, we can get lots of benefits from it.
Currently, there are some useless includes in the codebase. We can use a tool named include-what-you-use to optimize these includes. By using a strict include-what-you-use policy, we can get lots of benefits from it.
Follow #17586.
This PR mainly changes:
Remove env/
Remove FileUtils/FilesystemUtils
Some methods are moved to LocalFileSystem
Remove olap/file_cache
Add s3 client cache for s3 file system
In my test, the time of open s3 file can be reduced significantly
Fix cold/hot separation bug for s3 fs.
This is the last PR of #17764.
After this, all IO operation should be in io/fs.
Except for tests in #17586, I also tested some case related to fs io:
clone
concurrency query on local/s3/hdfs
load error log create and clean
disk metrics
Daemon threads in doris_main.cpp will upload tablet metrics periodically, which will use StorageEngine::instance(). However loading file cache is a process in main thread, when it takes a lot of time to load file cache, StorageEngine::instance() will be a null pointer in daemon threads.
The main purpose of this pr is to import `fileCache` for lakehouse reading remote files.
Use the local disk as the cache for reading remote file, so the next time this file is read,
the data can be obtained directly from the local disk.
In addition, this pr includes a few other minor changes
Import File Cache:
1. The imported `fileCache` is called `block_file_cache`, which uses lru replacement policy.
2. Implement a new FileRereader `CachedRemoteFilereader`, so that the logic of `file cache` is hidden under `CachedRemoteFilereader`.
Other changes:
1. Add a new interface `fs()` for `FileReader`.
2. `IOContext` adds some statistical information to count the situation of `FileCache`
Co-authored-by: Lightman <31928846+Lchangliang@users.noreply.github.com>
boost::stacktrace::stacktrace() has memory leak, so use glog internal func to print stacktrace.
The reason for the memory leak of boost::stacktrace is that a state is saved in the thread local of each thread but not actively released. The test found that each thread leaked about 100M after calling boost::stacktrace.
refer to:
boostorg/stacktrace#118boostorg/stacktrace#111
mem tracker can be logically divided into 4 layers: 1)process 2)type 3)query/load/compation task etc. 4)exec node etc.
type includes
enum Type {
GLOBAL = 0, // Life cycle is the same as the process, e.g. Cache and default Orphan
QUERY = 1, // Count the memory consumption of all Query tasks.
LOAD = 2, // Count the memory consumption of all Load tasks.
COMPACTION = 3, // Count the memory consumption of all Base and Cumulative tasks.
SCHEMA_CHANGE = 4, // Count the memory consumption of all SchemaChange tasks.
CLONE = 5, // Count the memory consumption of all EngineCloneTask. Note: Memory that does not contain make/release snapshots.
BATCHLOAD = 6, // Count the memory consumption of all EngineBatchLoadTask.
CONSISTENCY = 7 // Count the memory consumption of all EngineChecksumTask.
}
Object pointers are no longer saved between each layer, and the values of process and each type are periodically aggregated.
other fix:
In [fix](memtracker) Fix transmit_tracker null pointer because phamp is not thread safe #13528, I tried to separate the memory that was manually abandoned in the query from the orphan mem tracker. But in the actual test, the accuracy of this part of the memory cannot be guaranteed, so put it back to the orphan mem tracker again.
There are several configs related to tcmalloc, users do know how to config them. Actually users just want two modes, performance or compact, in performance mode, users want doris run query and load quickly while in compact mode, users want doris run with less memory usage.
If we want to config tcmalloc individually, we can use env variables which are supported by tcmalloc.
# Proposed changes
This PR fixed lots of issues when building from source on macOS with Apple M1 chip.
## ATTENTION
The job for supporting macOS with Apple M1 chip is too big and there are lots of unresolved issues during runtime:
1. Some errors with memory tracker occur when BE (RELEASE) starts.
2. Some UT cases fail.
...
Temporarily, the following changes are made on macOS to start BE successfully.
1. Disable memory tracker.
2. Use tcmalloc instead of jemalloc.
This PR kicks off the job. Guys who are interested in this job can continue to fix these runtime issues.
## Use case
```shell
./build.sh -j 8 --be --clean
cd output/be/bin
ulimit -n 60000
./start_be.sh --daemon
```
## Something else
It takes around _**10+**_ minutes to build BE (with prebuilt third-parties) on macOS with M1 chip. We will improve the development experience on macOS greatly when we finish the adaptation job.
When the physical memory of the process reaches 90% of the mem limit, trigger the load channel mgr to brush down
The default value of be.conf mem_limit is changed from 90% to 80%, and stability is the priority.
Fix deadlock in arena_locks in BufferPool::BufferAllocator::ScavengeBuffers and _lock in DebugString
The mem hook consumes the orphan tracker by default. If the thread does not attach other trackers, by default all consumption will be passed to the process tracker through the orphan tracker.
In real time, consumption of all other trackers + orphan tracker consumption = process tracker consumption.
Ideally, all threads are expected to attach to the specified tracker, so that "all memory has its own ownership", and the consumption of the orphan mem tracker is close to 0, but greater than 0.