Commit Graph

38 Commits

SHA1 Message Date
64a06ea9d4 [UT] Fix some BE unit tests (#3110)
Also support graceful exit for StorageEngine to avoid hanging for too
long in unit tests.
2020-03-16 13:31:44 +08:00
42931d22cb [Bug] tablet meta is not updated correctly after compaction (#3098)
This CL tries to fix a potential bug described in issue #3097, though I'm not sure this is the root cause.

It also removes a lot of verbose logging and fixes a memory leak.
2020-03-14 23:39:11 +08:00
a1f5b57011 Support sharding tablet_map_lock into multiple smaller map locks to improve the performance of tablet management tasks (#3051)
2020-03-09 16:29:56 +08:00
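
A minimal sketch of the lock-sharding idea described above, assuming a hash-by-tablet-id sharding scheme; `ShardedTabletMap`, `Tablet`, and the shard count are illustrative placeholders rather than the actual Doris types:

```
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Illustrative sketch of sharding one big map+lock into N smaller ones.
struct Tablet {};

class ShardedTabletMap {
public:
    explicit ShardedTabletMap(size_t shard_count = 16) : _shards(shard_count) {}

    void add_tablet(int64_t tablet_id, std::shared_ptr<Tablet> tablet) {
        Shard& shard = _get_shard(tablet_id);
        std::lock_guard<std::mutex> guard(shard.lock);  // only this shard is locked
        shard.tablets[tablet_id] = std::move(tablet);
    }

    std::shared_ptr<Tablet> get_tablet(int64_t tablet_id) {
        Shard& shard = _get_shard(tablet_id);
        std::lock_guard<std::mutex> guard(shard.lock);
        auto it = shard.tablets.find(tablet_id);
        return it == shard.tablets.end() ? nullptr : it->second;
    }

private:
    struct Shard {
        std::mutex lock;
        std::map<int64_t, std::shared_ptr<Tablet>> tablets;
    };

    Shard& _get_shard(int64_t tablet_id) {
        return _shards[static_cast<size_t>(tablet_id) % _shards.size()];
    }

    std::vector<Shard> _shards;
};
```

Operations on tablets that hash to different shards no longer contend on a single global lock, which is the performance gain the commit is after.
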
aa58cd99d9 Fix disks_total_capacity metric bug (#2988)
Currently the disks_total_capacity metric is a user-specified capacity, while
disks_avail_capacity is the disk's actual available capacity, so
disks_total_capacity may be less than disks_avail_capacity, and as a result
UsedPct on the FE may be a negative number.
We should use the disk's actual capacity for the disks_total_capacity metric.
2020-03-02 19:09:50 +08:00
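
A small sketch of reading the disk's actual capacity with C++17 `std::filesystem::space`, which guarantees total >= available so the derived used percentage cannot go negative; the path and printed names are illustrative, not the actual metric code:

```
#include <filesystem>
#include <iostream>

int main() {
    // Query the real capacity/available space of the filesystem backing a path.
    std::error_code ec;
    std::filesystem::space_info si = std::filesystem::space("/home/disk1", ec);
    if (ec) {
        std::cerr << "space() failed: " << ec.message() << '\n';
        return 1;
    }

    // With the real capacity, total >= available always holds, so the derived
    // used percentage can never become negative.
    double used_pct = 100.0 * static_cast<double>(si.capacity - si.available) /
                      static_cast<double>(si.capacity);
    std::cout << "total=" << si.capacity << " avail=" << si.available
              << " used_pct=" << used_pct << "%\n";
    return 0;
}
```
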
5440e19d01 Improve the triggering strategy of BE report (#2881)
Currently, the report from BE to FE is completed in the background
threads of `AgentServer` (`report_tablet_thread` and
`report_disk_stat_thread`). These two threads sleep in a standby state
after each report; if a report is needed immediately, they are notified
and wake up right away to report.

For example, when the background thread (`disk_monitor_thread`) in
`StorageEngine` finds that some tablets were deleted, it will notify
`AgentServer` to trigger a report immediately.

In the current implementation, in order to report ASAP, a local variable
(`_is_drop_tables`) and two other flags are used to record whether
reporting is needed, and `StorageEngine::disk_monitor_thread` then checks
the value of this variable every time it runs to determine whether
reporting needs to be triggered. This is actually superfluous, and it
may result in untimely notifications, as shown below:

```
(thread_1)        (thread_2)
disk-monitor     disk-stat-reporter
    |                  |
    |               reporting
    |                  |
  notify_1             |
    |                  |
    |                wait_for_notify(will wait until timeout or next notification)
    |                  |
    V                  V
```

If `StorageEngine::disk_monitor_thread` triggers a notification before
`report_tablet_thread` has started waiting, the notification will not be
received by `report_tablet_thread`, and the BE will not report to the FE
until the wait times out or the next round of `disk_monitor_thread`
detection.

This change restructures the triggering implementation, and solves the above problem.

This change also makes some methods private, since they do not need to be public.
2020-02-11 20:38:44 +08:00
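
The lost notification described above is the classic condition-variable pitfall: a notify that fires while nobody is waiting is simply dropped. A minimal sketch of the usual remedy, pairing the condition variable with a pending flag checked under the mutex; the class and method names are illustrative, not the actual AgentServer code:

```
#include <chrono>
#include <condition_variable>
#include <mutex>

// Illustrative notifier: a notification that arrives before the reporter
// starts waiting is remembered in _pending instead of being lost.
class ReportNotifier {
public:
    void notify() {
        std::lock_guard<std::mutex> guard(_mutex);
        _pending = true;
        _cv.notify_one();
    }

    // Returns true if a report was requested, false on timeout.
    bool wait_for_notify(std::chrono::seconds timeout) {
        std::unique_lock<std::mutex> lock(_mutex);
        bool notified = _cv.wait_for(lock, timeout, [this] { return _pending; });
        _pending = false;  // consume the request before the next round
        return notified;
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _pending = false;
};
```

Because the flag is set under the same mutex that the waiter checks, a notify issued before the wait begins is still observed on the next `wait_for_notify` call.
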
c89d0a090c Fix bug that _min_percentage_of_error_disk was not initialized (#2867)
In StorageEngine, the variable _min_percentage_of_error_disk was not
initialized (so it defaulted to 0), which caused the process to exit
whenever a single disk failed.
What we expect is that the process exits only when the number of
failed disks reaches a certain percentage.
Also, this variable should mean the maximum percentage of
error disks allowed, not the minimum, so the configuration is renamed
to max_percentage_of_error_disk.
2020-02-10 16:58:24 +08:00
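
A hedged sketch of the intended behavior, assuming a simplified config struct: the member is explicitly initialized, and the process exits only when the share of failed disks exceeds the configured maximum. Names are placeholders, not the real StorageEngine code:

```
#include <cstdlib>
#include <iostream>

// Illustrative only: explicit initialization avoids the uninitialized-member
// trap that made a single failed disk exceed the threshold immediately.
struct Config {
    int max_percentage_of_error_disk = 0;  // e.g. loaded from be.conf
};

class StorageEngineSketch {
public:
    explicit StorageEngineSketch(const Config& conf)
            : _max_percentage_of_error_disk(conf.max_percentage_of_error_disk) {}

    void check_disks(int total_disks, int error_disks) const {
        if (total_disks <= 0) return;
        int error_pct = error_disks * 100 / total_disks;
        if (error_pct > _max_percentage_of_error_disk) {
            std::cerr << error_pct << "% of disks failed, exiting\n";
            std::exit(1);
        }
    }

private:
    int _max_percentage_of_error_disk;  // always initialized in the constructor
};
```
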
3690f3e917 Add rowset state (#2691)
1. Add a rowset state to rowset.
2. Add a close API to rowset to release resources.
issue: #2665
2020-01-10 14:17:57 +08:00
6cab929d6d [Compaction] Limit the max concurrency of running compaction tasks (#2635)
A compaction task may sometimes consume a lot of memory and result in OOM.
Currently, there is no good way to predict the memory consumption of
a compaction task, so this adds a new BE config, max_compaction_concurrency,
to manually limit the maximum number of concurrently running compaction tasks.
2020-01-02 14:47:54 +08:00
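
A minimal sketch of such a concurrency cap, using a counter guarded by a mutex and condition variable; `CompactionPermits` and the usage comments are illustrative, not the actual implementation behind max_compaction_concurrency:

```
#include <condition_variable>
#include <mutex>

// Illustrative limiter: at most `limit` compaction tasks run at once;
// additional tasks block in acquire() until a slot is released.
class CompactionPermits {
public:
    explicit CompactionPermits(int limit) : _available(limit) {}

    void acquire() {
        std::unique_lock<std::mutex> lock(_mutex);
        _cv.wait(lock, [this] { return _available > 0; });
        --_available;
    }

    void release() {
        std::lock_guard<std::mutex> guard(_mutex);
        ++_available;
        _cv.notify_one();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    int _available;
};

// Hypothetical usage inside a compaction worker:
//   permits.acquire();
//   run_compaction(tablet);   // placeholder for the real compaction call
//   permits.release();
```
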
379619dfbd Unify the names of methods in TabletManager which do not require locks (#2525)
* Unify the names of methods in `TabletManager` which do not require locks

Currently, there are several naming patterns in the `TabletManager` class
for methods (mainly private methods) that need to be executed while holding the lock:

  1. **`xxx_with_no_lock()`**:
     The "with_no_lock" suffix is ambiguous: it could mean that no lock
     is needed, or that a lock has already been acquired externally;
  2. **`xxx_unlock()`**:
     "unlock" is a verb and may be mistaken as meaning that this method
     needs to unlock a mutex;
  3. **`xxx_unlocked()`**:
     Note that "unlocked" is an adjective, meaning that this method does
     not take the lock itself;
  4. **`xxx_locked()`**:
     "locked" is also an adjective, meaning that the method is locked.
     This is also easy to misunderstand: it could mean the lock is already
     held externally, or that the method takes the lock internally.
     What we really want is `xxx_already_locked`, but that makes
     the name a little longer;
  5. No indication in the method name at all:
     the reader cannot intuitively tell whether the method needs to be
     called with the lock held.

This patch unifies all the above patterns to `xxx_unlocked()`, and adjusts
some indentation in the code style.

Additionally, this patch also removes an unused `add_tablet()` method, because
a newer version is already in use.

This patch doesn't contain any functional modifications.
2019-12-27 02:34:35 -06:00
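
A short sketch of the convention this patch settles on, with illustrative names rather than the real `TabletManager` methods: the public method takes the lock and then delegates to a private `_unlocked` variant that assumes the caller already holds it.

```
#include <cstdint>
#include <map>
#include <mutex>

// Illustrative only: the `_unlocked` suffix marks methods that must be
// called with _tablet_map_lock already held by the caller.
class TabletManagerSketch {
public:
    bool drop_tablet(int64_t tablet_id) {
        std::lock_guard<std::mutex> guard(_tablet_map_lock);
        return _drop_tablet_unlocked(tablet_id);
    }

private:
    // Precondition: caller holds _tablet_map_lock.
    bool _drop_tablet_unlocked(int64_t tablet_id) {
        return _tablet_map.erase(tablet_id) > 0;
    }

    std::mutex _tablet_map_lock;
    std::map<int64_t, int> _tablet_map;  // placeholder for the real tablet map
};
```
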
e1ba0efbc7 Optimize compaction strategy of tablet on BE (#2473)
The current compaction selection strategy and cumulative point update logic
can cause cumulative compaction to stop working, so that all compaction tasks
are completed only by base compaction. This can cause a large number
of data versions to pile up.

In the current cumulative point update logic, when a cumulative compaction cannot
select enough rowsets, it directly increases the cumulative point.
Therefore, when data versions are generated at the same speed as the cumulative
compaction polling, the cumulative point keeps increasing
without ever triggering a cumulative compaction.

The new strategy mainly modifies the update logic of the cumulative point to ensure
that the above problems do not occur. At the same time, the new strategy also
handles the case where compaction cannot be performed because the cumulative
point stagnates for a long time: the cumulative point is forced to increase
through threshold settings to ensure that compaction has a chance to execute.

Also add a new HTTP API to view the compaction status of a specified tablet.
See `compaction-action.md` for details.
2019-12-17 10:30:43 +08:00
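
A very rough sketch of the described policy, under the assumption that the cumulative point only advances after a successful cumulative compaction and is otherwise forced forward after a stagnation threshold; all fields, functions, and thresholds are hypothetical simplifications, not the actual strategy code:

```
#include <cstdint>

// Hypothetical state; not the real Doris compaction strategy types.
struct CompactionState {
    int64_t cumulative_point = 0;      // versions below this belong to base compaction
    int64_t last_increase_time_s = 0;  // when the point last moved forward
};

// Advance the point only after a successful cumulative compaction over
// [cumulative_point, compacted_up_to]; if not enough rowsets were selected,
// the point stays put so the candidates can be picked next round.
void on_cumulative_compaction_finished(CompactionState* st, int64_t compacted_up_to,
                                       int64_t now_s) {
    st->cumulative_point = compacted_up_to + 1;
    st->last_increase_time_s = now_s;
}

// If the point has stagnated for longer than the threshold, force it forward
// so compaction still gets a chance to execute on the accumulated versions.
void maybe_force_increase(CompactionState* st, int64_t newest_version, int64_t now_s,
                          int64_t stagnation_threshold_s) {
    if (now_s - st->last_increase_time_s > stagnation_threshold_s &&
        st->cumulative_point <= newest_version) {
        st->cumulative_point = newest_version + 1;
        st->last_increase_time_s = now_s;
    }
}
```
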
d00c5e3066 Fix base_compaction minor log error (#2461) 2019-12-16 13:45:19 +08:00
4d958ec7a1 Fix BE do_tablet_meta_checkpoint retaining _meta_lock for a long time (#2430)
Add a flag in RowsetMeta to record whether it has been deleted from rowset meta.
Before this PR, checkpointing 37156 rowsets cost 1642 s.
With this PR, checkpointing 37319 rowsets costs just 1 s.
2019-12-12 23:21:43 +08:00
c07f37d78c [Segment V2] Add a control framework between FE and BE through heartbeat #2247 (#2364)
The control framework is implemented through the heartbeat message, using a uint64_t as bit flags to control different functions.
A flag is now added to set the default rowset type to beta.
2019-12-12 12:18:32 +08:00
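
A sketch of the bit-flag idea: each controllable function is assigned one bit of the uint64_t carried in the heartbeat, and the BE tests the bit it cares about. The flag names and values here are illustrative assumptions, not the real protocol constants:

```
#include <cstdint>

// Illustrative flag bits carried in the heartbeat; the actual names and
// values in the FE/BE protocol may differ.
constexpr uint64_t HEARTBEAT_FLAG_SET_DEFAULT_ROWSET_TO_BETA = 1ULL << 0;
// Further functions would each claim their own bit, e.g. 1ULL << 1, 1ULL << 2, ...

bool should_use_beta_rowset_by_default(uint64_t heartbeat_flags) {
    return (heartbeat_flags & HEARTBEAT_FLAG_SET_DEFAULT_ROWSET_TO_BETA) != 0;
}
```
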
333aee9610 Fix segmentation fault bug (#2391) 2019-12-05 21:20:30 +08:00
f716fd2b0b Ignore non-existent tablet in clear_transaction_task() (#2296)
This commit also removes some duplicated logs, which were printed
both inside and outside the function.
2019-11-26 08:17:56 -06:00
fda46654a2 Support setting properties for storage_root_path (#2235)
We can specify properties for each storage_root_path entry, using ':' between a property and its value and ',' to separate properties.
e.g.
storage_root_path = /home/disk1/palo,medium:ssd,capacity:50
2019-11-22 18:12:26 +08:00
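
A small sketch of parsing one such entry under the format shown above; the property defaults and the assumption that capacity is given in GB are illustrative, not taken from the actual BE config parser:

```
#include <iostream>
#include <sstream>
#include <string>

// Parsed form of one storage_root_path entry such as
// "/home/disk1/palo,medium:ssd,capacity:50" (capacity unit assumed to be GB).
struct StorePath {
    std::string path;
    std::string medium = "hdd";  // assumed default
    long capacity_gb = -1;       // -1: fall back to the disk's real capacity
};

StorePath parse_store_path(const std::string& spec) {
    StorePath result;
    std::istringstream in(spec);
    std::string token;
    bool first = true;
    while (std::getline(in, token, ',')) {
        if (first) {  // the first token is the path itself
            result.path = token;
            first = false;
            continue;
        }
        auto pos = token.find(':');
        if (pos == std::string::npos) continue;  // ignore malformed properties
        std::string key = token.substr(0, pos);
        std::string value = token.substr(pos + 1);
        if (key == "medium") {
            result.medium = value;
        } else if (key == "capacity") {
            result.capacity_gb = std::stol(value);
        }
    }
    return result;
}

int main() {
    StorePath p = parse_store_path("/home/disk1/palo,medium:ssd,capacity:50");
    std::cout << p.path << " medium=" << p.medium << " capacity=" << p.capacity_gb << "\n";
}
```
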
11872d5cf6 Sending clear txn task explicitly after transaction being aborted (#2182) 2019-11-13 11:22:45 +08:00
d0316d158d Refactor and reorganize the file utils (#2089) 2019-11-11 20:25:41 +08:00
0bcfddab92 Remove clear_alter_task (#2056)
The alter task has been refactored and clear_alter_task is no longer necessary.
2019-10-24 18:57:14 +08:00
e6bd1855e2 Fix default compaction rowset type bug (#2042) 2019-10-23 11:08:14 +08:00
6634051359 Make default rowset type to config (#2020) 2019-10-21 21:44:00 +08:00
3bca253fb3 Fix beta rowset read slow (#1994)
[Bug][BetaRowset] Fix beta rowset reading slowly with limit

Beta rowset does not update raw_rows_read in statistics and will read all
data in the tablet when querying with a limit, which leads to long query times.
2019-10-17 19:19:46 +08:00
7370b44ab2 Tablet report does not set version miss (#1961) 2019-10-12 14:36:49 +08:00
71731b25f4 Ignore some compaction errors to reduce logs (#1955) 2019-10-11 19:58:38 +08:00
0e4b3755a2 Refactor txn manager methods (#1950) 2019-10-11 17:16:13 +08:00
b72a4a4bc6 Add tablet meta checkpoint mechanism (#1936) 2019-10-10 09:39:02 +08:00
c643cbd30c Optimize the load performance for large files (#1798)
The current load process is:

Tablet Sink -> Tablet Channel Mgr -> Tablets Channel -> Delta Writer -> MemTable -> Flush to disk

In the path of Tablets Channel -> DeltaWriter -> MemTable -> Flush to disk, the following operations are performed:

1. Insert tuples into different memtables according to tablet ID.
2. When a memtable's size reaches the threshold, it is written to disk.

These operations are effectively executed in a single thread for a single load task.
In fact, memtable insertion and memtable flushing can be executed concurrently;
performing them in a single thread means memtable insertion can be delayed by slow disk writes.

In the new implementation, I added a MemTableFlushExecutor class with a set of flush queues and corresponding worker threads.
By default, each data directory uses two worker threads for flush, which can be modified by the parameter flush_thread_num_per_store of BE.
DeltaWriter pushes a full memtable to MemTableFlushExecutor for the flush operation and generates a new memtable for receiving new data.

This design can improve the performance of loading large files.
In single-host testing, the time to load a 1 GB text file is reduced from 48 seconds to 29 seconds.
2019-09-25 13:49:32 +08:00
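
A condensed sketch of the producer/consumer idea behind this change: writers push flush work onto a queue and keep ingesting, while dedicated worker threads perform the slow disk writes. The class below is a simplified stand-in, not the actual MemTableFlushExecutor:

```
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Illustrative flush executor: DeltaWriter-style producers submit full
// memtables (reduced here to flush callbacks) and continue ingesting into a
// fresh memtable, while worker threads drain the queue and write to disk.
class FlushExecutorSketch {
public:
    explicit FlushExecutorSketch(int flush_thread_num) {
        for (int i = 0; i < flush_thread_num; ++i) {
            _workers.emplace_back([this] { _work_loop(); });
        }
    }

    ~FlushExecutorSketch() {
        {
            std::lock_guard<std::mutex> guard(_mutex);
            _stopped = true;
        }
        _cv.notify_all();
        for (auto& t : _workers) t.join();
    }

    void submit_flush(std::function<void()> flush_task) {
        {
            std::lock_guard<std::mutex> guard(_mutex);
            _queue.push(std::move(flush_task));
        }
        _cv.notify_one();
    }

private:
    void _work_loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(_mutex);
                _cv.wait(lock, [this] { return _stopped || !_queue.empty(); });
                if (_stopped && _queue.empty()) return;
                task = std::move(_queue.front());
                _queue.pop();
            }
            task();  // the actual memtable->disk write happens here
        }
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    std::queue<std::function<void()>> _queue;
    std::vector<std::thread> _workers;
    bool _stopped = false;
};
```

With this split, ingestion only pays the cost of enqueueing a full memtable, and the number of worker threads plays the role the description assigns to flush_thread_num_per_store.
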
720808fda5 Remove config::max_file_descriptor_number (#1833) 2019-09-20 07:50:57 +08:00
d1676c3c3d Check file descriptor number is larger than 65536 upon start (#1819) 2019-09-19 12:48:36 +08:00
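
A sketch of such a startup check using POSIX getrlimit; the 65536 threshold comes from the commit title, while the error handling and messages are illustrative:

```
#include <sys/resource.h>

#include <cstdio>

int main() {
    // Refuse to start if the open-file-descriptor limit is below 65536.
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        std::perror("getrlimit");
        return 1;
    }
    if (rl.rlim_cur < 65536) {
        std::fprintf(stderr,
                     "file descriptor limit %llu is too small, need at least 65536\n",
                     static_cast<unsigned long long>(rl.rlim_cur));
        return 1;
    }
    std::printf("fd limit ok: %llu\n", static_cast<unsigned long long>(rl.rlim_cur));
    return 0;
}
```
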
981e0feb99 Check rowset is useful atomically (#1750)
* Check rowset is useful atomically

* Only release the rowset id when it is added to the unused rowsets

* Remove releasing the rowset id when saving rowset meta
2019-09-06 17:21:42 +08:00
a63989cc61 Use RowsetFactory to create and init RowsetWriter (#1740) 2019-09-04 17:02:43 +08:00
6f4feca3dc Add rowset id generator to FE and BE (#1678) 2019-09-02 18:51:31 +08:00
7e981b2b14 Limit the disk usage to avoid running out of disk capacity (#1702)
Set a high watermark and flood stage for disk used capacity,
and forbid some operations if disk usage is too high.
2019-08-27 22:18:17 +08:00
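
A small sketch of the watermark/flood-stage check described here; the percentage thresholds and names are illustrative placeholders, not the real BE config defaults:

```
// Illustrative thresholds; the real BE config names and defaults may differ.
constexpr double kHighWatermarkUsedPct = 85.0;  // start limiting some operations
constexpr double kFloodStageUsedPct = 95.0;     // forbid writes entirely

enum class DiskPressure { OK, HIGH_WATERMARK, FLOOD_STAGE };

DiskPressure check_disk_pressure(double used_pct) {
    if (used_pct >= kFloodStageUsedPct) return DiskPressure::FLOOD_STAGE;
    if (used_pct >= kHighWatermarkUsedPct) return DiskPressure::HIGH_WATERMARK;
    return DiskPressure::OK;
}

// Callers would then forbid operations such as load or clone when the
// returned pressure is HIGH_WATERMARK or FLOOD_STAGE, as the commit describes.
```
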
2b2bc82ae2 Add timeout on snapshot of data (#1672)
Release snapshots when finishing or cancelling a backup/restore job.
Snapshots may take a lot of disk space if not released in time.
2019-08-21 21:18:53 +08:00
851b2ca3bd Remove unused code in StorageEngine (#1671) 2019-08-20 10:50:07 +08:00
dcb75729db Change cumulative compaction for decoupling storage from computation (#1576)
1. Calculate the cumulative point when loading a tablet for the first time.
2. Simplify the rowset-picking logic for delete predicates.
3. Save meta and modify rowsets only once after cumulative compaction.
2019-08-13 18:25:56 +08:00
d938f9a6ea Implement the initial version of BetaRowset (#1568) 2019-08-06 10:40:16 +08:00
0d48a3961c Refactor Storage Engine (#1478)
NOTE: This patch modifies all of the Backend's data,
and restarting the BE will take a very long time as a result.
So if you do not want to disrupt your production environment,
you should upgrade the Backends one by one.

1. The BE is refactored to clarify the structure of the code.
2. Use a unique id to identify a rowset.
   Naming a rowset with tablet_id and version would lead to
   many conflicts among compaction, clone, and restore.
3. Extract a rowset interface to encapsulate rowsets
   with different formats.
2019-07-15 21:18:22 +08:00