1. Fix the problem of BE crash caused by destruct sequence. (close#8058)
2. Add a new BE config `compaction_task_num_per_fast_disk`
This config specify the max concurrent compaction task num on fast disk(typically .SSD).
So that for high speed disk, we can execute more compaction task at same time,
to compact the data as soon as possible
3. Avoid frequent selection of unqualified tablet to perform compaction.
4. Modify some log level to reduce the log size of BE.
5. Modify some clone logic to handle error correctly.
Currently, if we encounter a problem with a replica of a tablet during the load process,
such as a write error, rpc error, -235, etc., it will cause the entire load job to fail,
which results in a significant reduction in Doris' fault tolerance.
This PR mainly changes:
1. refined the judgment of failed replicas in the load process, so that the failure of a few replicas will not affect the normal completion of the load job.
2. fix a bug introduced from #7754 that may cause BE coredump
If an load task has a relatively short timeout, then we need to ensure that
each RPC of this task does not get blocked for a long time.
And an RPC is usually blocked for two reasons.
1. handling "memory exceeds limit" in the RPC
If the system finds that the memory occupied by the load exceeds the threshold,
it will select the load channel that occupies the most memory and flush the memtable in it.
this operation is done in the RPC, which may be more time consuming.
2. close the load channel
When the load channel receives the last batch, it will end the task.
It will wait for all memtables flushes to finish synchronously. This process is also time consuming.
Therefore, this PR solves this problem by.
1. Use timeout to determine whether it is a high-priority load task
If the timeout of an load task is relatively short, then we mark it as a high-priority task.
2. not processing "memory exceeds limit" for high priority tasks
3. use a separate flush thread to flush memtable for high priority tasks.
To avoid showing too many memtracker on BE web pages.
The MemTracker level now has 3 levels: OVERVIEW, TASK and VERBOSE.
OVERVIEW Mainly used for main memory consumption module such as Query/Load/Metadata.
TASK is mainly used to record the memory overhead of a single task such as a single query, load, and compaction task.
VERBOSE is used for other more detailed memtrackers.
MemTracker can provide memory consumption for us to find out which
module consume more memory, but it's just a current value, this patch
add metrics for some large memory consumers, then we can find out
which module consume more memory in timeline, it would be useful to
troubleshoot OOM problems and optimize configs.
* [Enhance] Make MemTracker more accurate (#5515)
This PR main about:
1. Improve the readability of MemTrackers' name
2. Add the MemTracker of:
* Load
* Compaction
* SchemaChange
* StoragePageCache
* TabletManager
3. Change SchemaChange to a Singleon
* revise some code for Code Review
* change the name of mem_tracker
* keep reader_context have the same lifetime of rowset_reader in schema change.
* change vlog notice to log(warning) in schema change
In the previous broker load, multiple OlapTableSinks would send data to the same LoadChannel,
and because of the lock granularity problem, LoadChannel could only process these requests serially,
which made it impossible to make full use of cluster resources.
This CL modifies the related locks so that LoadChannel can process these requests in parallel.
In the test, with a size of 20G, the load speed of 334 million rows of data in 3 nodes has been
increased from 9min to 5min, and after enabling 2 concurrency, it can be increased to 3min.
Also modify the profile of load job.
At present, the application of vlog in the code is quite confusing.
It is inherited from impala VLOG_XX format, and there is also VLOG(number) format.
VLOG(number) format does not have a unified specification, so this pr standardizes the use of VLOG
BE can not graceful exit because some threads are running in endless
loop. This patch do the following optimization:
- Use the well encapsulated Thread and ThreadPool instead of std::thread
and std::vector<std::thread>
- Use CountDownLatch in thread's loop condition to avoid endless loop
- Introduce a new class Daemon for daemon works, like tcmalloc_gc,
memory_maintenance and calculate_metrics
- Decouple statistics type TaskWorkerPool and StorageEngine notification
by submit tasks to TaskWorkerPool's queue
- Reorder objects' stop and deconstruct in main(), i.e. stop network
services at first, then internal services
- Use libevent in pthreads mode, by calling evthread_use_pthreads(),
then EvHttpServer can exit gracefully in multi-threads
- Call brpc::Server's Stop() and ClearServices() explicitly
Redesign metrics to 3 layers:
MetricRegistry - MetricEntity - Metrics
MetricRegistry : the register center
MetricEntity : the entity registered on MetricRegistry. Generally a MetricRegistry can be registered on several
MetricEntities, each of MetricEntity is an independent entity, such as server, disk_devices, data_directories, thrift
clients and servers, and so on.
Metric : metrics of an entity. Such as fragment_requests_total on server entity, disk_bytes_read on a disk_device entity,
thrift_opened_clients on a thrift_client entity.
MetricPrototype: the type of a metric. MetricPrototype is a global variable, can be shared by the same metrics across
different MetricEntities.
We make all MemTrackers shared, in order to show MemTracker real-time consumptions on the web.
As follows:
1. nearly all MemTracker raw ptr -> shared_ptr
2. Use CreateTracker() to create new MemTracker(in order to add itself to its parent)
3. RowBatch & MemPool still use raw ptrs of MemTracker, it's easy to ensure RowBatch & MemPool destructor exec
before MemTracker's destructor. So we don't change these code.
4. MemTracker can use RuntimeProfile's counter to calc consumption. So RuntimeProfile's counter need to be shared
too. We add a shared counter pool to store the shared counter, don't change other counters of RuntimeProfile.
Note that, this PR doesn't change the MemTracker tree structure. So there still have some orphan trackers, e.g. RowBlockV2's MemTracker. If you find some shared MemTrackers are little memory consumption & too time-consuming, you could make them be the orphan, then it's fine to use the raw ptr.
We can observe the workload of BE, and also it's a way to check
whether there is any problem in BE, like some container increase
too large and lead to OOM.
This patch add the following metrics:
```
Name Description
rowset_count_generated_and_in_use The total count of rowset id generated and in use since BE last start
unused_rowsets_count The total count of unused rowset waiting to be GC
broker_count The total count of brokers in management
data_stream_receiver_count The total count of data stream receivers in management
fragment_endpoint_count The total count of fragment endpoints of data stream in management, should always equal to data_stream_receiver_count
active_scan_context_count The total count of active scan contexts
plan_fragment_count The total count of plan fragments in executing
load_channel_count The total count of load channels in management
result_buffer_block_count The total count of result buffer blocks for queries, each block has a limited queue size (default 1024)
result_block_queue_count The total count of queues for fragments, each queue has a limited size (default 20, by config::max_memory_sink_batch_count)
routine_load_task_count The total count of routine load tasks in executing
small_file_cache_count The total count of cached small files' digest info
stream_load_pipe_count The total count of stream load pipes, each pipe has a limited buffer size (default 1M)
tablet_writer_count The total count of tablet writers
brpc_endpoint_stub_count The total count of brpc endpoints
```
In the past, when we want to modify some BE configs, we have to modify be.conf and then restart BE.
This patch provides a way to modify configs in the type of 'threshold', 'interval', 'enable flag'
when BE is running without restarting it.
You can update a single config once by BE's http API: `be_host:be_http_port/api/update_config?config_name=new_value`
Mainly contains the following modifications:
1. Use `std::unique_ptr` to replace some naked pointers
2. Modify some methods from member-method to local-static-function
3. Modify some methods do not need to be public to private
4. Some formatting changes: such as wrapping lines that are too long
5. Remove some useless variables
6. Add or modify some comments for easier understanding
No functional changes in this patch.
This CL fixes the following problems
1. check whether TabletsChannel has been closed/cancelled in `reduce_mem_usage` to avoid using a closed DeltaWriter
2. make `FlushHandle.wait` wait for all submitted tasks to finish so that memtable is deallocated before its delta writer
3. make `~MemTracker()` release its consumption bytes to accommodate situations in aggregate_func.h that bitmap and hll call `MemTracker::consume` without corresponding `MemTracker::release`, which cause the consumption of root tracker never drops to zero
[Load]
When performing a long-time load job, the following errors may occur. Causes the load to fail.
load channel manager add batch with unknown load id: xxx
There is a case of this error because Doris opened an unrelated channel during the load
process. This channel will not receive any data during the entire load process. Therefore,
after a fixed timeout, the channel will be released.
And after the entire load job is completed, it will try to close all open channels. When it try to
close this channel, it will find that the channel no longer exists and an error is reported.
This CL will pass the timeout of load job to the load channel, so that the timeout of load channels
will be same as load job's.