We can observe the workload of BE with these metrics, and they are also a way to check whether there is any problem in BE, such as some container growing too large and leading to OOM.
This patch adds the following metrics:
| Name | Description |
| --- | --- |
| rowset_count_generated_and_in_use | The total count of rowset ids generated and in use since BE last started |
| unused_rowsets_count | The total count of unused rowsets waiting to be GCed |
| broker_count | The total count of brokers under management |
| data_stream_receiver_count | The total count of data stream receivers under management |
| fragment_endpoint_count | The total count of fragment endpoints of data streams under management; should always equal data_stream_receiver_count |
| active_scan_context_count | The total count of active scan contexts |
| plan_fragment_count | The total count of plan fragments being executed |
| load_channel_count | The total count of load channels under management |
| result_buffer_block_count | The total count of result buffer blocks for queries; each block has a limited queue size (default 1024) |
| result_block_queue_count | The total count of queues for fragments; each queue has a limited size (default 20, by config::max_memory_sink_batch_count) |
| routine_load_task_count | The total count of routine load tasks being executed |
| small_file_cache_count | The total count of cached small files' digest info |
| stream_load_pipe_count | The total count of stream load pipes; each pipe has a limited buffer size (default 1M) |
| tablet_writer_count | The total count of tablet writers |
| brpc_endpoint_stub_count | The total count of brpc endpoint stubs |
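As a minimal sketch (not the actual Doris metrics API) of how such a count is exposed, a gauge can simply read a container's size under its lock on each scrape, so operators can spot a container growing toward OOM; all names below are illustrative:
```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>

class LoadChannelMgrSketch {
public:
    // Value scraped by the metrics endpoint on each collection.
    size_t load_channel_count() {
        std::lock_guard<std::mutex> lock(_lock);
        return _channels.size();
    }

private:
    std::mutex _lock;
    std::map<int64_t, int> _channels;  // load id -> placeholder channel state
};
```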
This PR enhances the performance of transaction management. When there are many transactions in BE, the single txn_map_lock plus the additional _txn_locks may cause poor performance, so we now remove the additional _txn_locks and split txn_map_lock into many small locks.
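A minimal sketch of the lock-striping idea, assuming a fixed shard count; the class and member names below are illustrative, not the actual Doris code:
```cpp
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

class ShardedTxnMap {
public:
    static constexpr int kShardCount = 128;

    void put(int64_t txn_id, int64_t value) {
        Shard& shard = _shard(txn_id);
        std::unique_lock<std::shared_mutex> lock(shard.mutex);
        shard.map[txn_id] = value;
    }

    bool get(int64_t txn_id, int64_t* value) {
        Shard& shard = _shard(txn_id);
        std::shared_lock<std::shared_mutex> lock(shard.mutex);
        auto it = shard.map.find(txn_id);
        if (it == shard.map.end()) return false;
        *value = it->second;
        return true;
    }

private:
    struct Shard {
        std::shared_mutex mutex;
        std::unordered_map<int64_t, int64_t> map;
    };

    // Each txn hashes to one shard, so contention is roughly 1/kShardCount
    // of what a single global txn_map_lock would see.
    Shard& _shard(int64_t txn_id) {
        return _shards[static_cast<uint64_t>(txn_id) % kShardCount];
    }

    Shard _shards[kShardCount];
};
```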
Earlier we introduced `BlockManager` to separate data access logic from
underlying file read and write logic.
This CL further unifies all `SegmentV2` data access under the `BlockManager`,
removes the previous `FileManager` class, and moves the file cache to the `FileBlockManager`.
There are no logic changes in this CL.
After this CL, all user table data is read and written through the `ReadableBlock` and `WritableBlock`
returned by the `BlockManager`, and no file operations are performed directly.
While a `block` is in use, some methods of the block manager will be referenced,
so `file_block_mgr` should be a resident and globally unique object.
I put it in `StorageEngine`.
TODO: the `BlockManager` and `Env` need to be reorganized.
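To illustrate the resulting access pattern, here is a hedged sketch; the interfaces below are minimal stand-ins defined for this example only, and the real `BlockManager` / `WritableBlock` / `ReadableBlock` signatures may differ:
```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>

// Minimal stand-in interfaces, not the actual Doris classes.
struct WritableBlockIface {
    virtual ~WritableBlockIface() = default;
    virtual void append(const std::string& data) = 0;  // buffered write
    virtual void close() = 0;                          // flush and seal the block
};

struct ReadableBlockIface {
    virtual ~ReadableBlockIface() = default;
    virtual std::string read(uint64_t offset, size_t length) = 0;
};

struct BlockManagerIface {
    virtual ~BlockManagerIface() = default;
    virtual std::unique_ptr<WritableBlockIface> create_block(const std::string& path) = 0;
    virtual std::unique_ptr<ReadableBlockIface> open_block(const std::string& path) = 0;
};

// Segment data now flows through the manager, never through raw file APIs.
void write_segment(BlockManagerIface* mgr, const std::string& path,
                   const std::string& payload) {
    std::unique_ptr<WritableBlockIface> block = mgr->create_block(path);
    block->append(payload);
    block->close();
}
```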
This CL tries to fix a potential bug described in issue #3097, but I'm not sure this is the root cause.
It also removes a lot of verbose logging and fixes a memory leak.
Currently the disks_total_capacity metric is a user-specified capacity, while
disks_avail_capacity is the disk's actual available capacity, so
disks_total_capacity may be less than disks_avail_capacity. Since UsedPct on FE
is presumably derived as (total - avail) / total, it may turn negative as a result.
We'd better use the disk's actual capacity for the disks_total_capacity metric.
Currently, the report from BE to FE is completed in the background
threads of `AgentServer` (`report_tablet_thread` and
`report_disk_stat_thread`). These two threads sleep in a standby state
after each report; if a report is needed immediately, they are notified
and wake up right away to report.
For example, when the background thread (`disk_monitor_thread`) in
`StorageEngine` finds that some tablets were deleted, it notifies
`AgentServer` to trigger a report immediately.
In the current implementation, in order to report ASAP, a local variable
(`_is_drop_tables`) and two other flags are used to record whether
reporting is needed, and `StorageEngine::disk_monitor_thread` checks
the value of this variable every time it runs to determine whether a
report needs to be triggered. This is actually superfluous, and it
may result in untimely notifications, as shown below:
```
(thread_1) (thread_2)
disk-monitor disk-stat-reporter
| |
| reporting
| |
notify_1 |
| |
| wait_for_notify(will wait until timeout or next notification)
| |
V V
```
When `report_tablet_thread` has not yet started waiting and
`StorageEngine::disk_monitor_thread` triggers a notification, that
notification will not be received by `report_tablet_thread`, so the BE
will not report to the FE until the wait times out or the next round of
`disk_monitor_thread` detection.
This change restructures the triggering implementation and solves the above problem.
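For example, a condition variable paired with a pending counter avoids the lost wakeup; this is a minimal sketch of the pattern, not the actual implementation in this change:
```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

class ReportTrigger {
public:
    // Called by e.g. disk_monitor_thread to request an immediate report.
    void notify() {
        std::lock_guard<std::mutex> lock(_mutex);
        ++_pending;  // remembered even if no reporter is waiting yet
        _cv.notify_all();
    }

    // Called by a reporter thread between reports; returns when a report
    // was requested or the regular interval elapsed.
    void wait(std::chrono::seconds interval) {
        std::unique_lock<std::mutex> lock(_mutex);
        _cv.wait_for(lock, interval, [this] { return _pending > 0; });
        _pending = 0;  // consume all pending requests at once
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    int _pending = 0;
};
```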
This change also makes some methods (that do not need to be public) private.
In StorageEngine, the variable _min_percentage_of_error_disk was not
initialized (so it defaulted to 0), which caused the process to exit
whenever a single disk failed.
What we expect is to exit the process only when the number of
failed disks reaches a certain percentage.
Also, this variable should mean the maximum percentage of
error disks allowed, not the minimum, so this change renames the
configuration to max_percentage_of_error_disk.
Compaction tasks may sometimes consume a lot of memory and result in OOM.
Currently, there is no good way to predict the memory consumption of
a compaction task, so I add a new BE config, max_compaction_concurrency,
to manually limit the max concurrency of running compaction tasks.
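A minimal sketch of how such a limit can be enforced; the helper names are hypothetical, and only the config name max_compaction_concurrency comes from this patch:
```cpp
#include <atomic>

// Illustrative only: guard compaction submission with a global counter.
static std::atomic<int> g_running_compactions{0};

bool try_start_compaction(int max_concurrency) {
    int current = g_running_compactions.load();
    while (current < max_concurrency) {
        // On failure, compare_exchange_weak reloads `current` and we retry.
        if (g_running_compactions.compare_exchange_weak(current, current + 1)) {
            return true;  // slot acquired; caller must call finish_compaction()
        }
    }
    return false;  // limit reached; skip this compaction round
}

void finish_compaction() {
    g_running_compactions.fetch_sub(1);
}
```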
* Unify the names of methods in `TabletManager` that do not acquire locks themselves
Currently, there are several naming patterns in the `TabletManager` class
for methods (mainly private methods) that need to be executed while a lock is held:
1. **`xxx_with_no_lock()`**:
   The "with_no_lock" suffix has two possible meanings: either no lock
   is needed, or the lock has already been acquired externally;
2. **`xxx_unlock()`**:
   "unlock" is a verb and may be mistaken for meaning that this method
   needs to unlock a mutex.
3. **`xxx_unlocked()`**:
   Note that "unlocked" is an adjective, meaning that the operation
   in this method does not take a lock.
4. **`xxx_locked()`**:
   "locked" is also an adjective, meaning that the method runs locked.
   This is also more likely to be misunderstood: it could mean the lock
   is already held externally, or that the method locks internally.
   What we really want is `xxx_already_locked`, but that name is
   a little long.
5. No indication in the method name:
   the reader cannot intuitively tell whether the method requires the lock to be held.
This patch unifies all the above patterns to `xxx_unlocked()`, and adjusts
some indentation in code style.
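A self-contained sketch of the resulting convention (the class, members, and lock type are illustrative, not the actual `TabletManager` code):
```cpp
#include <cstdint>
#include <map>
#include <mutex>

class TabletManagerSketch {
public:
    // Public method: acquires the lock, then delegates to the
    // _unlocked variant.
    int get_tablet(int64_t tablet_id) {
        std::lock_guard<std::mutex> lock(_tablet_map_lock);
        return _get_tablet_unlocked(tablet_id);
    }

private:
    // "_unlocked": this method takes no lock itself; the caller must
    // already hold _tablet_map_lock.
    int _get_tablet_unlocked(int64_t tablet_id) {
        auto it = _tablet_map.find(tablet_id);
        return it == _tablet_map.end() ? -1 : it->second;
    }

    std::mutex _tablet_map_lock;
    std::map<int64_t, int> _tablet_map;
};
```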
Additionally, this patch removes an unused `add_tablet()` method, because
a newer version is already in use.
This patch doesn't contain any functional modifications.
The current compaction selection strategy and cumulative point update logic
can cause cumulative compaction to stop working, so that all compaction tasks
are completed only by base compaction. This can cause a large number
of data versions to pile up.
In the current cumulative point update logic, when a cumulative compaction
cannot select enough rowsets, it directly increases the cumulative point.
Therefore, when data versions are generated at the same speed as the
cumulative compaction polling, the cumulative point will continuously increase
without ever triggering a cumulative compaction.
The new strategy mainly modifies the update logic of the cumulative point to
ensure that the above problem does not occur. At the same time, the new strategy
also takes into account the problem that compaction cannot be performed if the
cumulative point stagnates for a long time: the cumulative point will be forced
to increase through threshold settings to ensure that compaction has a chance to execute.
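A heavily hedged sketch of the update rule as I understand it from the description above; the names, the sortedness assumption, and the stall counter are all illustrative, not the actual strategy code:
```cpp
#include <cstdint>
#include <vector>

struct RowsetSketch {
    int64_t end_version;  // rowsets are assumed sorted by version
    bool compacted;       // true if produced by a cumulative compaction
};

// Advance the point only past compacted rowsets; if it has stagnated for
// `stall_threshold` rounds, force it forward so compaction gets a chance.
int64_t update_cumulative_point(const std::vector<RowsetSketch>& rowsets,
                                int64_t current_point, int* stall_rounds,
                                int stall_threshold) {
    int64_t new_point = current_point;
    for (const RowsetSketch& rs : rowsets) {
        if (rs.end_version < new_point) continue;  // already behind the point
        if (!rs.compacted) break;                  // stop at the first raw rowset
        new_point = rs.end_version + 1;            // move past compacted output
    }
    if (new_point != current_point) {
        *stall_rounds = 0;
    } else if (++*stall_rounds >= stall_threshold) {
        *stall_rounds = 0;
        new_point = current_point + 1;  // forced advance to break the stall
    }
    return new_point;
}
```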
Also add a new HTTP API to view the compaction status of a specified tablet.
See `compaction-action.md` for details.
Add a flag in RowsetMeta to record whether it has been deleted from the rowset meta.
Before this PR, 37156 rowsets cost 1642 s; with this PR, 37319 rowsets cost just 1 s.
The control framework is implemented through heartbeat messages, using a uint64_t as bit flags to control different functions.
This change adds a flag to set the default rowset type to beta.
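A minimal sketch of the bit-flag idea; the flag name and bit position below are illustrative, not the actual protocol constants:
```cpp
#include <cstdint>

// Illustrative flag constant; the real bit assignment may differ.
constexpr uint64_t kFlagSetDefaultRowsetTypeToBeta = 1ULL << 0;

// Each heartbeat carries a uint64_t; every set bit toggles one function.
void handle_heartbeat_flags(uint64_t flags) {
    if (flags & kFlagSetDefaultRowsetTypeToBeta) {
        // Switch newly created rowsets to the beta format.
    }
}
```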
[Bug][BetaRowset] Fix beta rowset reading slowly with limit
Beta rowset does not update raw_rows_read in its statistics and will read all
data in the tablet when a query carries a limit, which leads to long query times.
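To show why the missing statistic matters, here is an illustrative sketch (not the actual reader code): if raw_rows_read is never updated, the caller's limit check never fires and the scan covers the whole tablet.
```cpp
#include <cstdint>

struct ReaderStatsSketch {
    int64_t raw_rows_read = 0;
};

// Returns true while the caller should keep reading. Without the
// raw_rows_read update (the bug), the limit check below never triggers
// and the reader scans the entire tablet.
bool read_batch(ReaderStatsSketch* stats, int64_t batch_rows, int64_t limit) {
    stats->raw_rows_read += batch_rows;
    return stats->raw_rows_read < limit;
}
```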
The current load process is:
Tablet Sink -> Tablet Channel Mgr -> Tablets Channel -> Delta Writer -> MemTable -> Flush to disk
In the path Tablets Channel -> DeltaWriter -> MemTable -> Flush to disk, the following operations are performed:
1. Insert tuples into different memtables according to tablet ID.
2. When a memtable's size reaches the threshold, write it to disk.
The above operations are executed in a single thread for a single load task.
In fact, the insertion into a memtable and the flush of a memtable can be executed concurrently; performing them in a single thread allows slow disk writes to delay the insertion into the memtable.
In the new implementation, I added a MemTableFlushExecutor class with a set of flush queues and corresponding worker threads.
By default, each data directory uses two worker threads for flushing, which can be modified by the BE parameter flush_thread_num_per_store.
DeltaWriter pushes the full memtable to MemTableFlushExecutor for the flush operation and creates a new memtable for receiving new data.
This design can improve the performance of loading large files.
In single host testing, the time to load a 1GB text file is reduced from 48 seconds to 29 seconds.
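A minimal, self-contained sketch of the queue-plus-workers idea; only the names MemTableFlushExecutor, DeltaWriter, and flush_thread_num_per_store come from this patch, and the code below is illustrative rather than the actual implementation:
```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class FlushExecutorSketch {
public:
    explicit FlushExecutorSketch(int num_threads) {
        for (int i = 0; i < num_threads; ++i) {
            _workers.emplace_back([this] { _work_loop(); });
        }
    }

    ~FlushExecutorSketch() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _stopped = true;
        }
        _cv.notify_all();
        for (auto& t : _workers) t.join();
    }

    // The writer pushes a full memtable's flush job and keeps ingesting
    // into a fresh memtable without waiting for the disk write.
    void submit(std::function<void()> flush_job) {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _queue.push(std::move(flush_job));
        }
        _cv.notify_one();
    }

private:
    void _work_loop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(_mutex);
                _cv.wait(lock, [this] { return _stopped || !_queue.empty(); });
                if (_stopped && _queue.empty()) return;
                job = std::move(_queue.front());
                _queue.pop();
            }
            job();  // perform the actual memtable-to-disk flush
        }
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    std::queue<std::function<void()>> _queue;
    std::vector<std::thread> _workers;
    bool _stopped = false;
};
```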
1. Calculate the cumulative point when loading a tablet for the first time.
2. Simplify the rowset-picking logic for delete predicates.
3. Save meta and modify rowsets only once after cumulative compaction.
NOTE: This patch will modify all Backends' data, which will make BE take
a very long time to restart. So if you don't want to disturb your
production environment, you should upgrade the backends one by one.
1. Refactor BE to clarify the code structure.
2. Use a unique id to identify a rowset (see the sketch after this list).
   Naming a rowset with tablet_id and version leads to many conflicts
   among compaction, clone, and restore.
3. Extract a rowset interface to encapsulate rowsets with different formats.
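An illustrative sketch of point 2, assuming a process-local monotonic counter; the actual Doris id scheme may differ (a real scheme would also need ids to survive restarts):
```cpp
#include <atomic>
#include <cstdint>

// Illustrative only: ids are unique within this BE process, so rowsets
// produced by compaction, clone, and restore can never collide by name
// the way tablet_id+version naming does.
class RowsetIdGeneratorSketch {
public:
    int64_t next_id() { return _next.fetch_add(1); }

private:
    std::atomic<int64_t> _next{1};
};
```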