1. `StorageEngine::_delete_tablets_on_unused_root_path` will try to obtain each tablet shard's write lock in `TabletManager`:
```
StorageEngine::_delete_tablets_on_unused_root_path
TabletManager::drop_tablets_on_error_root_path
obtain each tablet shard's write lock
```
2. `TabletManager::build_all_report_tablets_info` and other methods obtain tablet shard read locks frequently.
As a result, `StorageEngine::_delete_tablets_on_unused_root_path` can hold `_store_lock` for a long time.
This makes it difficult for other threads, such as `StorageEngine::get_stores_for_create_tablet`, to acquire the `_store_lock` write lock.
Dropping tablets on an error root path is a low-probability event, so `TabletManager::drop_tablets_on_error_root_path` should return early when its parameter `tablet_info_vec` is empty, as sketched below.
Before dropping a tablet, it tries to find the tablet in the tablet map.
But the tablet may no longer exist.
Therefore, it is better to print the error message and error status.
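A minimal sketch of the early-return guard, with stand-in types since the real `TabletManager` carries much more state:
```
#include <cstdint>
#include <shared_mutex>
#include <vector>

// Stand-ins for the real Doris types (illustrative only).
struct TabletInfo { int64_t tablet_id; int32_t schema_hash; };
enum OLAPStatus { OLAP_SUCCESS };

class TabletManager {
public:
    OLAPStatus drop_tablets_on_error_root_path(
            const std::vector<TabletInfo>& tablet_info_vec) {
        // Dropping tablets on an error root path is rare; return before
        // taking any tablet shard write lock when there is nothing to drop.
        if (tablet_info_vec.empty()) {
            return OLAP_SUCCESS;
        }
        std::unique_lock<std::shared_mutex> wrlock(_tablet_map_lock);
        // ... look up each tablet; if it is already gone, log the error
        // message and status instead of failing silently.
        return OLAP_SUCCESS;
    }

private:
    std::shared_mutex _tablet_map_lock;
};
```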
* [Metrics] Add metrics to monitor BE's agent task queue size
Sometimes a user's DDL or background task lasts a long time,
and it is not easy to find out which procedure has the problem.
This patch adds metrics to monitor BE's agent task queue size,
which is helpful for troubleshooting.
The raw metrics on BE look like:
doris_be_agent_task_queue_size{type="REPORT_OLAP_TABLE"} 0
doris_be_agent_task_queue_size{type="REPORT_DISK_STATE"} 0
doris_be_agent_task_queue_size{type="REPORT_TASK"} 0
doris_be_agent_task_queue_size{type="CHECK_CONSISTENCY"} 0
doris_be_agent_task_queue_size{type="DELETE"} 0
doris_be_agent_task_queue_size{type="CLEAR_TRANSACTION_TASK"} 0
doris_be_agent_task_queue_size{type="PUBLISH_VERSION"} 0
doris_be_agent_task_queue_size{type="UPLOAD"} 0
doris_be_agent_task_queue_size{type="DROP_TABLE"} 0
doris_be_agent_task_queue_size{type="CREATE_TABLE"} 39
doris_be_agent_task_queue_size{type="RELEASE_SNAPSHOT"} 0
doris_be_agent_task_queue_size{type="STORAGE_MEDIUM_MIGRATE"} 245
doris_be_agent_task_queue_size{type="CLONE"} 0
doris_be_agent_task_queue_size{type="MOVE"} 0
doris_be_agent_task_queue_size{type="ALTER_TABLE"} 0
doris_be_agent_task_queue_size{type="DOWNLOAD"} 0
doris_be_agent_task_queue_size{type="PUSH"} 0
doris_be_agent_task_queue_size{type="UPDATE_TABLET_META_INFO"} 0
doris_be_agent_task_queue_size{type="MAKE_SNAPSHOT"} 0
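A rough sketch of the counting discipline behind such a gauge; the real BE wires this through its metrics framework, so the names and types below are only stand-ins:
```
#include <atomic>
#include <cstdio>
#include <map>
#include <string>

// One gauge per agent task type; std::atomic stands in for the real gauge.
// Entries are created at startup; afterwards updates are thread-safe.
static std::map<std::string, std::atomic<long>> g_agent_task_queue_size;

void on_task_submitted(const std::string& type) { ++g_agent_task_queue_size[type]; }
void on_task_finished(const std::string& type) { --g_agent_task_queue_size[type]; }

// Render the metrics in the text format shown above.
void dump_metrics() {
    for (auto& [type, size] : g_agent_task_queue_size) {
        std::printf("doris_be_agent_task_queue_size{type=\"%s\"} %ld\n",
                    type.c_str(), size.load());
    }
}
```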
* fix typo
There is some redundant code for the report task, disk, and tablet reporting in BE, and when FE returns an error report message, there is no warning log showing that the report failed.
Co-authored-by: caiconghui [蔡聪辉] <caiconghui@xiaomi.com>
* [doris-1008] Support backup and restore directly to cloud storage via the AWS S3 protocol
* [Internal][S3DirectAccess] Support backup, restore, load, and export connecting directly to S3
1. Support loading and exporting data from/to S3 directly.
2. Add a config to automatically convert broker access to S3 access when available.
Change-Id: Iac96d4b3670776708bc96a119ff491db8cb4cde7
(cherry picked from commit 2f03832ca52221cc7436069b96c45c48c4bc7201)
* [Internal][S3DirectAccess] File path glob compatible with broker
Change-Id: Ie55e07a547aa22c6fa8d432ca926216c10384e68
(cherry picked from commit d4fb25544c0dc06d23e1ada571ec3f8edd4ba56f)
* [internal] [doris-1008] fix log4j class not found
Change-Id: I468176aca0d821383c74ee658d461aba9e7d5be3
(cherry picked from commit 029adaa9d6ded8503acbd6644c1519456f3db232)
* add poms
Co-authored-by: yangzhengguo01 <yangzhengguo01@baidu.com>
In version 0.13, we introduced a more efficient compaction logic.
This logic maintains multiple version paths of the tablet.
This can avoid -230 errors and can also support incremental clone.
But the previous incremental clone used the incremental rowset metas recorded in `incr_rs_meta`.
At present, the incremental rowset metas recorded in `incr_rs_meta` duplicate the records
in `stale_rs_meta`, and the current clone logic does not adapt to the
new multi-version path, so in many cases incremental clone is not triggered.
This CL mainly:
1. Removed the `incr_rs_meta` metadata.
2. Modified the clone logic: for an incremental clone, it tries to read the rowsets in `stale_rs_meta`.
3. Deleted a lot of code that was previously used for version compatibility.
At present, the use of VLOG in the code is quite confusing.
Part of it is inherited from Impala's VLOG_XX format, and there is also the VLOG(number) format.
The VLOG(number) format has no unified specification, so this PR standardizes the use of VLOG, as sketched below.
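One way such a standardization could look; the concrete level names and numbers here are hypothetical, only the glog `VLOG(n)` mechanism underneath is real:
```
#include <cstdint>
#include <glog/logging.h>

// Hypothetical named levels: one agreed-on mapping from meaning to verbosity,
// instead of bare VLOG(3) / inherited VLOG_XX calls scattered in the code.
#define VLOG_NOTICE VLOG(3)  // infrequent, coarse-grained events
#define VLOG_DEBUG VLOG(7)   // detailed debugging output
#define VLOG_ROW VLOG(10)    // per-row tracing on very hot paths

void compact_example(int64_t tablet_id) {
    VLOG_NOTICE << "start compaction for tablet " << tablet_id;
    VLOG_ROW << "merging one row of tablet " << tablet_id;
}
```
Levels are then enabled at runtime with glog's `--v` flag, e.g. `--v=7` enables VLOG_NOTICE and VLOG_DEBUG but not VLOG_ROW.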
Add trace for create tablet tasks; it's a useful tool for admins to find
the bottleneck when creating tablets times out.
For example, an admin could enlarge 'tablet_map_shard_size' when the
'got tablets shard lock' step is found to cost too much time.
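A minimal illustration of the idea with a scoped timer; the real patch uses the engine's trace utility, so everything below is a stand-in:
```
#include <chrono>
#include <iostream>
#include <string>

// Illustrative scoped trace: logs how long each named step of the create
// tablet path takes, so an admin can spot a step like "got tablets shard
// lock" that dominates latency.
class ScopedTrace {
public:
    explicit ScopedTrace(std::string step)
            : _step(std::move(step)), _start(std::chrono::steady_clock::now()) {}
    ~ScopedTrace() {
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                          std::chrono::steady_clock::now() - _start)
                          .count();
        std::cerr << _step << " cost " << us << " us\n";
    }

private:
    std::string _step;
    std::chrono::steady_clock::time_point _start;
};

void create_tablet_example() {
    { ScopedTrace t("got tablets shard lock"); /* lock the tablet map shard */ }
    { ScopedTrace t("save tablet meta"); /* persist the new tablet's meta */ }
}
```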
'_task_worker_type' is not yet initialized when it is used to initialize '_name',
so '_name' is always 'TaskWorkerPool.CREATE_TABLE'; this patch fixes
the bug.
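This is the classic C++ member-initialization-order pitfall: members are initialized in declaration order, not initializer-list order. A sketch with hypothetical declarations (the real class differs):
```
#include <string>

// If _name is declared before _task_worker_type and built from that member,
// it reads a not-yet-initialized value, so every pool gets the default name.
enum TaskWorkerType { CREATE_TABLE, DROP_TABLE };

class TaskWorkerPool {
public:
    explicit TaskWorkerPool(TaskWorkerType type)
            : _name("TaskWorkerPool." + type_to_string(type)),  // fixed: build from the
              _task_worker_type(type) {}                        // parameter, not the member

private:
    static std::string type_to_string(TaskWorkerType t) {
        return t == CREATE_TABLE ? "CREATE_TABLE" : "DROP_TABLE";
    }

    std::string _name;                 // declared first, so initialized first
    TaskWorkerType _task_worker_type;  // reading this in _name's initializer is UB
};
```
Initializing `_name` from the constructor parameter instead of the member sidesteps the ordering entirely.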
This CL refactors the storage medium migration task process in BE.
I did not modify the execution logic, just extracted part of the logic
from the migration task and put it in task_work_pool.
In this way, the migration task is only used to process the migration
of a specified tablet to a specified data dir.
Later, we can use this task to migrate tablets between different disks. #4476
The tablet and disk information reporting threads need to report to the FE periodically.
At the same time, these two reporting threads are also triggered by certain events.
The modification in PR #4440 caused these two threads to be triggered only by events,
so they could no longer report periodically.
BE cannot exit gracefully because some threads run in endless
loops. This patch does the following optimizations:
- Use the well encapsulated Thread and ThreadPool instead of std::thread
and std::vector<std::thread>
- Use CountDownLatch in thread's loop condition to avoid endless loop
- Introduce a new class Daemon for daemon works, like tcmalloc_gc,
memory_maintenance and calculate_metrics
- Decouple statistics-type TaskWorkerPool and StorageEngine notification
by submitting tasks to TaskWorkerPool's queue
- Reorder objects' stop and destruction in main(), i.e. stop network
services first, then internal services
- Use libevent in pthreads mode, by calling evthread_use_pthreads(),
so that EvHttpServer can exit gracefully with multiple threads
- Call brpc::Server's Stop() and ClearServices() explicitly
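A minimal sketch of the CountDownLatch-style loop condition, assuming a simplified latch (Doris's Thread and CountDownLatch classes carry more than this):
```
#include <chrono>
#include <condition_variable>
#include <mutex>

// wait_for() returns true once stop() is called, so the worker both sleeps
// between iterations and wakes immediately on shutdown, instead of spinning
// in an endless loop that blocks graceful exit.
class StopLatch {
public:
    void stop() {
        std::lock_guard<std::mutex> l(_mu);
        _stopped = true;
        _cv.notify_all();
    }
    // Returns true if stopped, false if the timeout elapsed.
    bool wait_for(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> l(_mu);
        return _cv.wait_for(l, timeout, [this] { return _stopped; });
    }

private:
    std::mutex _mu;
    std::condition_variable _cv;
    bool _stopped = false;
};

void memory_maintenance_loop(StopLatch& latch) {
    while (!latch.wait_for(std::chrono::seconds(10))) {
        // ... run one round of tcmalloc_gc / memory maintenance ...
    }
}
```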
Sometimes we want to detect the hotspots of a cluster, for example, hot scanned tablets or hot written tablets,
but we have no insight into the tablets in the cluster.
This patch introduces tablet-level metrics to help achieve this; it now supports 4 metrics on tablets: `query_scan_bytes`, `query_scan_rows`, `flush_bytes`, `flush_count`.
However, one BE may hold hundreds of thousands of tablets, so I added a parameter to the metrics HTTP request,
and tablet-level metrics are not returned by default.
During the transformation of historical data for materialized views, the transformation may fail due to data quality.
Add an error status code, OLAP_ERR_DATE_QUALITY_ERR, to determine whether a data problem caused the failure.
#3344
In some very special circumstances, such as code bugs or human misoperation,
all replicas of some tablets may be lost. In this case, the data has substantially been lost.
However, in some scenarios, the business still hopes to ensure that queries will not
report errors even if there is data loss, and to reduce the impact perceived by the user layer.
At this point, we can use a blank tablet to fill in for the missing replica to ensure that queries can be executed normally.
Add a new FE config `recover_with_empty_tablet`, default false; true means to use an empty tablet to fill in for the missing one.
Also fix a bug in #4274.
Redesign metrics into 3 layers:
MetricRegistry - MetricEntity - Metric
MetricRegistry: the registration center.
MetricEntity: an entity registered on the MetricRegistry. Generally, several MetricEntities can be registered
on one MetricRegistry; each MetricEntity is an independent entity, such as the server, disk_devices, data_directories, thrift
clients and servers, and so on.
Metric: a metric of an entity, such as fragment_requests_total on the server entity, disk_bytes_read on a disk_device entity,
or thrift_opened_clients on a thrift_client entity.
MetricPrototype: the type of a metric. A MetricPrototype is a global variable and can be shared by the same metric across
different MetricEntities.
Fixes #3893
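A much-simplified sketch of the layering; the real classes carry metric types, units, labels, and locking, so everything here is illustrative:
```
#include <map>
#include <memory>
#include <string>
#include <vector>

// MetricPrototype is global and shared; each entity owns its own Metric
// instances hooked to a prototype.
struct MetricPrototype { std::string name; };  // e.g. "disk_bytes_read"
struct Metric { const MetricPrototype* proto; long value; };

class MetricEntity {  // e.g. the server, or one disk device
public:
    explicit MetricEntity(std::string name) : _name(std::move(name)) {}
    Metric* register_metric(const MetricPrototype* proto) {
        _metrics.push_back(std::make_unique<Metric>(Metric{proto, 0}));
        return _metrics.back().get();
    }

private:
    std::string _name;
    std::vector<std::unique_ptr<Metric>> _metrics;
};

class MetricRegistry {  // the registration center
public:
    MetricEntity* register_entity(const std::string& name) {
        auto& entity = _entities[name];
        if (!entity) entity = std::make_unique<MetricEntity>(name);
        return entity.get();
    }

private:
    std::map<std::string, std::unique_ptr<MetricEntity>> _entities;
};

// One prototype, shared by the disk_bytes_read metric of every disk entity:
static MetricPrototype g_disk_bytes_read{"disk_bytes_read"};
```
Usage would then look like `registry.register_entity("disk_sda")->register_metric(&g_disk_bytes_read)`.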
In a cluster with frequent load activity, FE ignores most tablet reports from BE,
because currently it only handles reports whose version >= BE's latest report version
(which is increased each time a transaction is published). This can be observed in FE's log,
with many entries like `out of date report version 15919277405765 from backend[177969252].
current report version[15919277405766]` in it.
However, many system functionalities rely on TabletReport processing to work properly. For example:
1. bad or version-missing replicas are detected and repaired during TabletReport
2. storage medium migration decisions and actions are made based on TabletReport
3. BE's old transactions are cleared/republished during TabletReport
In fact, it is not necessary to update the report version after the publish task,
because this is actually a problem left over from history. In the current version's reporting logic,
we no longer decrease the version information of the replica in the FE metadata according to the report.
So even if we receive a report with a stale version, it does not matter.
This CL mainly contains two changes:
1. Do not increase the report version for publish tasks.
2. Populate `tabletWithoutPartitionId` outside the read lock of TabletInvertedIndex (see the sketch below).
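The FE code in question is Java; the C++ sketch below, with hypothetical types, only shows the lock-scope pattern of change 2: copy what you need under the shared (read) lock, then do the expensive population work after releasing it.
```
#include <cstdint>
#include <shared_mutex>
#include <vector>

struct TabletMeta { int64_t tablet_id; int64_t partition_id; };

std::vector<int64_t> collect_tablets_without_partition_id(
        std::shared_mutex& index_lock, const std::vector<TabletMeta>& index) {
    std::vector<TabletMeta> snapshot;
    {
        std::shared_lock<std::shared_mutex> rlock(index_lock);
        snapshot = index;  // only a cheap copy happens under the read lock
    }
    std::vector<int64_t> result;  // heavy filtering runs outside the lock
    for (const auto& t : snapshot) {
        if (t.partition_id <= 0) result.push_back(t.tablet_id);
    }
    return result;
}
```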
1. Add a PushBrokerReader in push_handler.cpp.
2. PushBrokerReader wraps the ParquetScanner to support reading data from Parquet-format files through a broker.
This bug occurred when BE made a snapshot: the version required by FE had already been merged into a cumulative version, so the snapshot task could not complete even if it retried. To solve this problem, the BackupJob can be set to CANCELLED, and the user can continue to retry the job.
Fix #3057
Now the disks_total_capacity metric is a user-specified capacity, but
disks_avail_capacity is the disk's actual available capacity, so
disks_total_capacity may be less than disks_avail_capacity, and
UsedPct on FE may be negative as a result.
We'd better use the disk's actual capacity for the disks_total_capacity metric.
1. MonoTime/MonoDelta
MonoTime: represents a particular point in time, relative to some fixed but unspecified reference point.
MonoDelta: represents an elapsed duration of time, the delta between two MonoTime instances.
2. CountDownLatch
This is a C++ implementation of the Java CountDownLatch.
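A usage sketch, assuming the Kudu-style API these utility classes derive from (the header paths and exact signatures are assumptions):
```
#include "util/countdown_latch.h"  // assumed paths in the BE tree
#include "util/monotime.h"

void example() {
    MonoTime start = MonoTime::Now();
    // ... do some work ...
    MonoDelta elapsed = MonoTime::Now() - start;  // delta between two points
    double secs = elapsed.ToSeconds();

    // A latch initialized to 1 acts as a stop flag with a timed wait:
    // another thread calls latch.CountDown() to release every waiter.
    CountDownLatch latch(1);
    bool stopped = latch.WaitFor(MonoDelta::FromSeconds(1));
    (void)secs;
    (void)stopped;
}
```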
The `TResourceInfo` was used to help `cgroups` isolate resources,
but it is no longer used.
In fact, the `TResourceInfo` information is no longer carried in
the requests from FE to BE.
[Metric] Add tablet compaction score metrics
Backend:
Add metric "tablet_max_compaction_score" to monitor the current max compaction
score of tablets on this Backend. This metric is updated each time
the compaction thread picks tablets to compact.
Frontend:
Add metric "tablet_max_compaction_score" for each Backend. These metrics will
be updated when backends report tablet.
And also add a calculated metric "max_tablet_compaction_core" to monitor the
max compaction core of tablets on all Backends.
When there are lots of expired transactions on a BE with a large
number of tablets, the report thread may become slow, because it
has to iterate the whole transaction map for each tablet.
But this is unnecessary. We should first build an expired-transaction
map keyed by tablet id. Then, for each tablet, we only need to seek
the expired-transaction map once by tablet id, instead of traversing
the whole transaction map, as sketched below.
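A sketch of the reworked lookup with simplified types (the real structures hold richer transaction state):
```
#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

using TabletId = int64_t;
using TransactionId = int64_t;

// One pass over all transactions builds a tablet_id -> expired transaction ids
// index, turning the per-tablet cost from a full-map scan into one hash seek.
std::unordered_map<TabletId, std::vector<TransactionId>> build_expired_index(
        const std::map<TransactionId, std::vector<TabletId>>& expired_txn_map) {
    std::unordered_map<TabletId, std::vector<TransactionId>> index;
    for (const auto& [txn_id, tablets] : expired_txn_map) {
        for (TabletId tablet_id : tablets) {
            index[tablet_id].push_back(txn_id);
        }
    }
    return index;
}
```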
Now Env unifies all environment operations, such as file operations.
However, some of our old functions don't leverage it. This change unifies
FileUtils::scan_dir to use Env's functions.
FE uses partition_id to publish versions. BE should check whether all tablets related to this partition have the version, but Tablet in BE does not have the partition id in its metadata, so BE cannot check it.
This patch adds the partition id to the tablet meta during the report task,
syncing at most 10k tablets per round when setting tablet meta.
NOTE: This patch will modify all Backends' data,
which will make restarting a BE take a very long time.
So if you don't want to disturb your production environment,
you should upgrade the Backends one by one.
1. Refactor BE to clarify the structure of the code.
2. Use a unique id to identify a rowset.
Naming a rowset with tablet_id and version leads to
many conflicts among compaction, clone, and restore (see the sketch after this list).
3. Extract a rowset interface to encapsulate rowsets
with different formats.
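Illustrative only: one way to mint a globally unique rowset id, so a rowset's identity no longer collides when compaction, clone, and restore all produce rowsets for the same (tablet_id, version). Doris's actual id generator differs.
```
#include <random>
#include <sstream>
#include <string>

// Mint a unique rowset id from 128 random bits rendered as hex.
std::string new_rowset_id() {
    static thread_local std::mt19937_64 rng{std::random_device{}()};
    std::ostringstream oss;
    oss << std::hex << rng() << rng();
    return oss.str();
}
```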