We can observe the workload of BE with these metrics, and they are also a way to check whether there is any problem in BE, such as some container growing too large and leading to OOM.
This patch adds the following metrics:
| Name | Description |
| --- | --- |
| rowset_count_generated_and_in_use | The total count of rowset ids generated and in use since BE last started |
| unused_rowsets_count | The total count of unused rowsets waiting to be GCed |
| broker_count | The total count of brokers under management |
| data_stream_receiver_count | The total count of data stream receivers under management |
| fragment_endpoint_count | The total count of fragment endpoints of data streams under management; should always equal data_stream_receiver_count |
| active_scan_context_count | The total count of active scan contexts |
| plan_fragment_count | The total count of plan fragments being executed |
| load_channel_count | The total count of load channels under management |
| result_buffer_block_count | The total count of result buffer blocks for queries; each block has a limited queue size (default 1024) |
| result_block_queue_count | The total count of queues for fragments; each queue has a limited size (default 20, by config::max_memory_sink_batch_count) |
| routine_load_task_count | The total count of routine load tasks being executed |
| small_file_cache_count | The total count of cached small files' digest info |
| stream_load_pipe_count | The total count of stream load pipes; each pipe has a limited buffer size (default 1M) |
| tablet_writer_count | The total count of tablet writers |
| brpc_endpoint_stub_count | The total count of brpc endpoint stubs |
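As a minimal sketch (not the actual Doris metrics API) of how such a count is exposed, a gauge can simply read a container's size under its lock on each scrape, so operators can spot a container growing toward OOM; all names below are illustrative:
```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>

class LoadChannelMgrSketch {
public:
    // Value scraped by the metrics endpoint on each collection.
    size_t load_channel_count() {
        std::lock_guard<std::mutex> lock(_lock);
        return _channels.size();
    }

private:
    std::mutex _lock;
    std::map<int64_t, int> _channels;  // load id -> placeholder channel state
};
```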
This PR enhances the performance of transaction management. When there are many transactions in BE, the single txn_map_lock plus the additional _txn_locks may cause poor performance, so we now remove the additional _txn_locks and split txn_map_lock into many small locks.
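A minimal sketch of the lock-striping idea, assuming a fixed shard count; the class and member names below are illustrative, not the actual Doris code:
```cpp
#include <cstdint>
#include <mutex>
#include <shared_mutex>
#include <unordered_map>

class ShardedTxnMap {
public:
    static constexpr int kShardCount = 128;

    void put(int64_t txn_id, int64_t value) {
        Shard& shard = _shard(txn_id);
        std::unique_lock<std::shared_mutex> lock(shard.mutex);
        shard.map[txn_id] = value;
    }

    bool get(int64_t txn_id, int64_t* value) {
        Shard& shard = _shard(txn_id);
        std::shared_lock<std::shared_mutex> lock(shard.mutex);
        auto it = shard.map.find(txn_id);
        if (it == shard.map.end()) return false;
        *value = it->second;
        return true;
    }

private:
    struct Shard {
        std::shared_mutex mutex;
        std::unordered_map<int64_t, int64_t> map;
    };

    // Each txn hashes to one shard, so contention is roughly 1/kShardCount
    // of what a single global txn_map_lock would see.
    Shard& _shard(int64_t txn_id) {
        return _shards[static_cast<uint64_t>(txn_id) % kShardCount];
    }

    Shard _shards[kShardCount];
};
```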
Earlier we introduced `BlockManager` to separate data access logic from
underlying file read and write logic.
This CL further unifies all `SegmentV2` data access under the `BlockManager`,
removes the previous `FileManager` class, and moves the file cache to the `FileBlockManager`.
There are no logic changes in this CL.
After this CL, all user table data is read and written through the `ReadableBlock` and `WritableBlock`
returned by the `BlockManager`, and no file operations are performed directly.
While a `block` is in use, some methods of the block manager will be referenced,
so `file_block_mgr` should be a resident and globally unique object.
I put it in `StorageEngine`.
TODO: the `BlockManager` and `Env` need to be reorganized.
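To illustrate the resulting access pattern, here is a hedged sketch; the interfaces below are minimal stand-ins defined for this example only, and the real `BlockManager` / `WritableBlock` / `ReadableBlock` signatures may differ:
```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>

// Minimal stand-in interfaces, not the actual Doris classes.
struct WritableBlockIface {
    virtual ~WritableBlockIface() = default;
    virtual void append(const std::string& data) = 0;  // buffered write
    virtual void close() = 0;                          // flush and seal the block
};

struct ReadableBlockIface {
    virtual ~ReadableBlockIface() = default;
    virtual std::string read(uint64_t offset, size_t length) = 0;
};

struct BlockManagerIface {
    virtual ~BlockManagerIface() = default;
    virtual std::unique_ptr<WritableBlockIface> create_block(const std::string& path) = 0;
    virtual std::unique_ptr<ReadableBlockIface> open_block(const std::string& path) = 0;
};

// Segment data now flows through the manager, never through raw file APIs.
void write_segment(BlockManagerIface* mgr, const std::string& path,
                   const std::string& payload) {
    std::unique_ptr<WritableBlockIface> block = mgr->create_block(path);
    block->append(payload);
    block->close();
}
```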
This CL tries to fix a potential bug described in issue #3097, but I'm not sure this is the root cause.
It also removes a lot of verbose logging and fixes a memory leak.
Currently the disks_total_capacity metric is a user-specified capacity, while
disks_avail_capacity is the disk's actual available capacity, so
disks_total_capacity may be less than disks_avail_capacity. Since UsedPct on FE
is presumably derived as (total - avail) / total, it may turn negative as a result.
We'd better use the disk's actual capacity for the disks_total_capacity metric.
Currently, the report from BE to FE is completed in the background
threads of `AgentServer` (`report_tablet_thread` and
`report_disk_stat_thread`). These two threads sleep in a standby state
after each report; if a report is needed immediately, they are notified
and wake up right away to report.
For example, when the background thread (`disk_monitor_thread`) in
`StorageEngine` finds that some tablets were deleted, it notifies
`AgentServer` to trigger a report immediately.
In the current implementation, in order to report ASAP, a local variable
(`_is_drop_tables`) and two other flags are used to record whether
reporting is needed, and `StorageEngine::disk_monitor_thread` checks
the value of this variable every time it runs to determine whether a
report needs to be triggered. This is actually superfluous, and it
may result in untimely notifications, as shown below:
```
(thread_1) (thread_2)
disk-monitor disk-stat-reporter
| |
| reporting
| |
notify_1 |
| |
| wait_for_notify(will wait until timeout or next notification)
| |
V V
```
When `report_tablet_thread` has not yet started waiting and
`StorageEngine::disk_monitor_thread` triggers a notification, that
notification will not be received by `report_tablet_thread`, so the BE
will not report to the FE until the wait times out or the next round of
`disk_monitor_thread` detection.
This change restructures the triggering implementation and solves the above problem.
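For example, a condition variable paired with a pending counter avoids the lost wakeup; this is a minimal sketch of the pattern, not the actual implementation in this change:
```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

class ReportTrigger {
public:
    // Called by e.g. disk_monitor_thread to request an immediate report.
    void notify() {
        std::lock_guard<std::mutex> lock(_mutex);
        ++_pending;  // remembered even if no reporter is waiting yet
        _cv.notify_all();
    }

    // Called by a reporter thread between reports; returns when a report
    // was requested or the regular interval elapsed.
    void wait(std::chrono::seconds interval) {
        std::unique_lock<std::mutex> lock(_mutex);
        _cv.wait_for(lock, interval, [this] { return _pending > 0; });
        _pending = 0;  // consume all pending requests at once
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    int _pending = 0;
};
```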
This change also makes some methods (that do not need to be public) private.
In StorageEngine, the variable _min_percentage_of_error_disk was not
initialized (so it defaulted to 0), which caused the process to exit
whenever a single disk failed.
What we expect is to exit the process only when the number of
failed disks reaches a certain percentage.
Also, this variable should mean the maximum percentage of
error disks allowed, not the minimum, so this change renames the
configuration to max_percentage_of_error_disk.
Compaction tasks may sometimes consume a lot of memory and result in OOM.
Currently, there is no good way to predict the memory consumption of
a compaction task, so I add a new BE config, max_compaction_concurrency,
to manually limit the max concurrency of running compaction tasks.
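A minimal sketch of how such a limit can be enforced; the helper names are hypothetical, and only the config name max_compaction_concurrency comes from this patch:
```cpp
#include <atomic>

// Illustrative only: guard compaction submission with a global counter.
static std::atomic<int> g_running_compactions{0};

bool try_start_compaction(int max_concurrency) {
    int current = g_running_compactions.load();
    while (current < max_concurrency) {
        // On failure, compare_exchange_weak reloads `current` and we retry.
        if (g_running_compactions.compare_exchange_weak(current, current + 1)) {
            return true;  // slot acquired; caller must call finish_compaction()
        }
    }
    return false;  // limit reached; skip this compaction round
}

void finish_compaction() {
    g_running_compactions.fetch_sub(1);
}
```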
* Unify the names of methods in `TabletManager` that do not acquire locks themselves
Currently, there are several naming patterns in the `TabletManager` class
for methods (mainly private methods) that need to be executed while a lock is held:
1. **`xxx_with_no_lock()`**:
   The "with_no_lock" suffix has two possible meanings: either no lock
   is needed, or the lock has already been acquired externally;
2. **`xxx_unlock()`**:
   "unlock" is a verb and may be mistaken for meaning that this method
   needs to unlock a mutex.
3. **`xxx_unlocked()`**:
   Note that "unlocked" is an adjective, meaning that the operation
   in this method does not take a lock.
4. **`xxx_locked()`**:
   "locked" is also an adjective, meaning that the method runs locked.
   This is also more likely to be misunderstood: it could mean the lock
   is already held externally, or that the method locks internally.
   What we really want is `xxx_already_locked`, but that name is
   a little long.
5. No indication in the method name:
   the reader cannot intuitively tell whether the method requires the lock to be held.
This patch unifies all the above patterns to `xxx_unlocked()`, and adjusts
some indentation in code style.
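A self-contained sketch of the resulting convention (the class, members, and lock type are illustrative, not the actual `TabletManager` code):
```cpp
#include <cstdint>
#include <map>
#include <mutex>

class TabletManagerSketch {
public:
    // Public method: acquires the lock, then delegates to the
    // _unlocked variant.
    int get_tablet(int64_t tablet_id) {
        std::lock_guard<std::mutex> lock(_tablet_map_lock);
        return _get_tablet_unlocked(tablet_id);
    }

private:
    // "_unlocked": this method takes no lock itself; the caller must
    // already hold _tablet_map_lock.
    int _get_tablet_unlocked(int64_t tablet_id) {
        auto it = _tablet_map.find(tablet_id);
        return it == _tablet_map.end() ? -1 : it->second;
    }

    std::mutex _tablet_map_lock;
    std::map<int64_t, int> _tablet_map;
};
```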
Additionally, this patch removes an unused `add_tablet()` method, because
a newer version is already in use.
This patch doesn't contain any functional modifications.
The current compaction selection strategy and cumulative point update logic
can cause cumulative compaction to stop working, so that all compaction tasks
are completed only by base compaction. This can cause a large number
of data versions to pile up.
In the current cumulative point update logic, when a cumulative compaction
cannot select enough rowsets, it directly increases the cumulative point.
Therefore, when data versions are generated at the same speed as the
cumulative compaction polling, the cumulative point will continuously increase
without ever triggering a cumulative compaction.
The new strategy mainly modifies the update logic of the cumulative point to
ensure that the above problem does not occur. At the same time, the new strategy
also takes into account the problem that compaction cannot be performed if the
cumulative point stagnates for a long time: the cumulative point will be forced
to increase through threshold settings to ensure that compaction has a chance to execute.
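A heavily hedged sketch of the update rule as I understand it from the description above; the names, the sortedness assumption, and the stall counter are all illustrative, not the actual strategy code:
```cpp
#include <cstdint>
#include <vector>

struct RowsetSketch {
    int64_t end_version;  // rowsets are assumed sorted by version
    bool compacted;       // true if produced by a cumulative compaction
};

// Advance the point only past compacted rowsets; if it has stagnated for
// `stall_threshold` rounds, force it forward so compaction gets a chance.
int64_t update_cumulative_point(const std::vector<RowsetSketch>& rowsets,
                                int64_t current_point, int* stall_rounds,
                                int stall_threshold) {
    int64_t new_point = current_point;
    for (const RowsetSketch& rs : rowsets) {
        if (rs.end_version < new_point) continue;  // already behind the point
        if (!rs.compacted) break;                  // stop at the first raw rowset
        new_point = rs.end_version + 1;            // move past compacted output
    }
    if (new_point != current_point) {
        *stall_rounds = 0;
    } else if (++*stall_rounds >= stall_threshold) {
        *stall_rounds = 0;
        new_point = current_point + 1;  // forced advance to break the stall
    }
    return new_point;
}
```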
Also add a new HTTP API to view the compaction status of a specified tablet.
See `compaction-action.md` for details.
Add a flag in RowsetMeta to record whether it has been deleted from the rowset meta.
Before this PR, 37156 rowsets cost 1642 s; with this PR, 37319 rowsets cost just 1 s.
The control framework is implemented through heartbeat messages, using a uint64_t as bit flags to control different functions.
This change adds a flag to set the default rowset type to beta.
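A minimal sketch of the bit-flag idea; the flag name and bit position below are illustrative, not the actual protocol constants:
```cpp
#include <cstdint>

// Illustrative flag constant; the real bit assignment may differ.
constexpr uint64_t kFlagSetDefaultRowsetTypeToBeta = 1ULL << 0;

// Each heartbeat carries a uint64_t; every set bit toggles one function.
void handle_heartbeat_flags(uint64_t flags) {
    if (flags & kFlagSetDefaultRowsetTypeToBeta) {
        // Switch newly created rowsets to the beta format.
    }
}
```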
[Bug][BetaRowset] Fix beta rowset reading slowly with limit
Beta rowset does not update raw_rows_read in its statistics and will read all
data in the tablet when a query carries a limit, which leads to long query times.
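To show why the missing statistic matters, here is an illustrative sketch (not the actual reader code): if raw_rows_read is never updated, the caller's limit check never fires and the scan covers the whole tablet.
```cpp
#include <cstdint>

struct ReaderStatsSketch {
    int64_t raw_rows_read = 0;
};

// Returns true while the caller should keep reading. Without the
// raw_rows_read update (the bug), the limit check below never triggers
// and the reader scans the entire tablet.
bool read_batch(ReaderStatsSketch* stats, int64_t batch_rows, int64_t limit) {
    stats->raw_rows_read += batch_rows;
    return stats->raw_rows_read < limit;
}
```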
The current load process is:
Tablet Sink -> Tablet Channel Mgr -> Tablets Channel -> Delta Writer -> MemTable -> Flush to disk
In the path Tablets Channel -> DeltaWriter -> MemTable -> Flush to disk, the following operations are performed:
1. Insert tuples into different memtables according to tablet ID.
2. When a memtable's size reaches the threshold, write it to disk.
The above operations are executed in a single thread for a single load task.
In fact, the insertion into a memtable and the flush of a memtable can be executed concurrently; performing them in a single thread allows slow disk writes to delay the insertion into the memtable.
In the new implementation, I added a MemTableFlushExecutor class with a set of flush queues and corresponding worker threads.
By default, each data directory uses two worker threads for flushing, which can be modified by the BE parameter flush_thread_num_per_store.
DeltaWriter pushes the full memtable to MemTableFlushExecutor for the flush operation and creates a new memtable for receiving new data.
This design can improve the performance of loading large files.
In single host testing, the time to load a 1GB text file is reduced from 48 seconds to 29 seconds.
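A minimal, self-contained sketch of the queue-plus-workers idea; only the names MemTableFlushExecutor, DeltaWriter, and flush_thread_num_per_store come from this patch, and the code below is illustrative rather than the actual implementation:
```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class FlushExecutorSketch {
public:
    explicit FlushExecutorSketch(int num_threads) {
        for (int i = 0; i < num_threads; ++i) {
            _workers.emplace_back([this] { _work_loop(); });
        }
    }

    ~FlushExecutorSketch() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _stopped = true;
        }
        _cv.notify_all();
        for (auto& t : _workers) t.join();
    }

    // The writer pushes a full memtable's flush job and keeps ingesting
    // into a fresh memtable without waiting for the disk write.
    void submit(std::function<void()> flush_job) {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _queue.push(std::move(flush_job));
        }
        _cv.notify_one();
    }

private:
    void _work_loop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(_mutex);
                _cv.wait(lock, [this] { return _stopped || !_queue.empty(); });
                if (_stopped && _queue.empty()) return;
                job = std::move(_queue.front());
                _queue.pop();
            }
            job();  // perform the actual memtable-to-disk flush
        }
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    std::queue<std::function<void()>> _queue;
    std::vector<std::thread> _workers;
    bool _stopped = false;
};
```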
1. Calculate the cumulative point when loading a tablet for the first time.
2. Simplify the rowset-picking logic for delete predicates.
3. Save meta and modify rowsets only once after cumulative compaction.
NOTE: This patch will modify all Backends' data, which will make BE take
a very long time to restart. So if you don't want to disturb your
production environment, you should upgrade the backends one by one.
1. Refactor BE to clarify the code structure.
2. Use a unique id to identify a rowset (see the sketch after this list).
   Naming a rowset with tablet_id and version leads to many conflicts
   among compaction, clone, and restore.
3. Extract a rowset interface to encapsulate rowsets with different formats.
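An illustrative sketch of point 2, assuming a process-local monotonic counter; the actual Doris id scheme may differ (a real scheme would also need ids to survive restarts):
```cpp
#include <atomic>
#include <cstdint>

// Illustrative only: ids are unique within this BE process, so rowsets
// produced by compaction, clone, and restore can never collide by name
// the way tablet_id+version naming does.
class RowsetIdGeneratorSketch {
public:
    int64_t next_id() { return _next.fetch_add(1); }

private:
    std::atomic<int64_t> _next{1};
};
```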