Currently, there are some useless includes in the codebase. We can use a tool named include-what-you-use to optimize these includes. By using a strict include-what-you-use policy, we can get lots of benefits from it.
The origin scan pools are in exec_env.
But after enable new_load_scan_node by default, the scan pool in exec_env is no longer used.
All scan task will be submitted to the scan pool in scanner_scheduler.
BTW, reorganize the scan pool into 3 kinds:
local scan pool
For olap scan node
remote scan pool
For file scan node
limited scan pool
For query which set cpu resource limit or with small limit clause
TODO:
Use bthread to unify all IO task.
Some trivial issues:
fix bug that the memtable flush size printed in log is not right
Add RuntimeProfile param in VScanner
The mem hook record tracker cannot guarantee that the final consumption is 0, nor can it guarantee that the memory alloc and free are recorded in a one-to-one correspondence.
In the life cycle of a memtable from insert to flush, the memory free of hook is more than that of alloc, resulting in tracker consumption less than 0.
In order to avoid the cumulative error of the upper load channel tracker, the memtable tracker consumption is reset to zero on destructor.
The memory value automatically tracked by the tcmalloc hook in the DeltaWriter is smaller than the value recorded manually in the memtable, because the first 4096-byte Chunk requested by each MemPool when the memtable is initialized is not tracked to the DeltaWriter by the hook.
The values of the two are not equal, causing the mem_consumption() == _mem_table->memory_usage branch judgment to fail.
1. Added memory leak detection for `DeltaWriter` and `MemTable` mem tracker
2. Modify memtable mem tracker to virtual to avoid frequent recursive consumption of parent tracker.
3. Disable memtable flush thread attach memtable tracker, ensure that memtable mem tracker is completely accurate.
4. Modify `memory_verbose_track=false`. At present, there is a performance problem in the frequent switch thread mem tracker.
- Because the mem tracker exists as a shared_ptr in the thread local. Each time it is switched, the atomic variable use_count in the shared_ptr of the current tracker will be -1, and the tracker to be replaced use_count +1, multi-threading Frequent changes to the same tracker shared_ptr are slow.
- TODO: 1. Reduce unnecessary thread mem tracker switch, 2. Consider using raw pointers for mem tracker in thread local.
Currently, there are 2 status code in BE, one is common/Status.h,
and the other is olap/olap_define.h called OLAPStatus.
OLAPStatus is just an enum type, it is very simple and could not save many informations,
I will unify these code to common/Status.
If an load task has a relatively short timeout, then we need to ensure that
each RPC of this task does not get blocked for a long time.
And an RPC is usually blocked for two reasons.
1. handling "memory exceeds limit" in the RPC
If the system finds that the memory occupied by the load exceeds the threshold,
it will select the load channel that occupies the most memory and flush the memtable in it.
this operation is done in the RPC, which may be more time consuming.
2. close the load channel
When the load channel receives the last batch, it will end the task.
It will wait for all memtables flushes to finish synchronously. This process is also time consuming.
Therefore, this PR solves this problem by.
1. Use timeout to determine whether it is a high-priority load task
If the timeout of an load task is relatively short, then we mark it as a high-priority task.
2. not processing "memory exceeds limit" for high priority tasks
3. use a separate flush thread to flush memtable for high priority tasks.
In the previous implementation, in an load job,
multiple memtables of the same tablet are written to disk sequentially.
In fact, multiple memtables can be written out of order in parallel,
only need to ensure that each memtable uses a different segment writer.
For a tablet, there may be multiple memtables, which will be
flushed to disk one by one in the order of generation.
If a memtable flush fails, then the load job will definitely
fail, but the previous implementation will overwrite `_flush_status`,
which may make the error can not be detected, leads to an error
load job to be success.
This patch also have two other changes:
1. Use `std::bind` to replace `boost::bind`;
2. Removes some unneeded headers.
1. MemTableFlushExecutor maintain a ThreadPool to receive FlushTask.
2. FlushToken is used to seperate different tasks from different tablets.
Every DeltaWriter of tablet constructs a FlushToken,
task in FlushToken are handle serially, task between FlushToken are
handle concurrently.
3. I have remove thread limit on data_dir, because of I/O is not the main
timer consumer of Flush thread. Much of time is consumed in CPU decoding
and compress.
Mainly contains the following modifications:
1. Use `std::unique_ptr` to replace some naked pointers
2. Modify some methods from member-method to local-static-function
3. Modify some methods do not need to be public to private
4. Some formatting changes: such as wrapping lines that are too long
5. Remove some useless variables
6. Add or modify some comments for easier understanding
No functional changes in this patch.
This CL fixes the following problems
1. check whether TabletsChannel has been closed/cancelled in `reduce_mem_usage` to avoid using a closed DeltaWriter
2. make `FlushHandle.wait` wait for all submitted tasks to finish so that memtable is deallocated before its delta writer
3. make `~MemTracker()` release its consumption bytes to accommodate situations in aggregate_func.h that bitmap and hll call `MemTracker::consume` without corresponding `MemTracker::release`, which cause the consumption of root tracker never drops to zero
The current load process is:
Tablet Sink -> Tablet Channel Mgr -> Tablets Channel -> Delta Writer -> MemTable -> Flush to disk
In the path of Tablets Channel -> DeltaWriter -> MemTable -> Flush to disk, the following operations are performed:
Insert tuple into different memtables according to tablet ID
When the memtable size reaches the threshold, it is written to disk.
The above operations are equivalent to single thread execution for a single load task.
In fact, the insertion of memtable and the flush of memtable can be executed synchronously.
Perform these operation in single thread prevents the insertion of memtable from being delayed due to slow disk writing.
In the new implementation, I added a MemTableFlushExecutor class with a set of flush queues and corresponding worker threads.
By default, each data directory uses two worker threads for flush, which can be modified by the parameter flush_thread_num_per_store of BE.
DeltaWriter will push the full memtable to MemTableFlushExecutor for flush operation and generate a new memtable for receiving new data.
This design can improve the performance of load large files.
In single host testing, the time to load a 1GB text file is reduced from 48 seconds to 29 seconds.