Sometimes the broker writer fails to close, but we do not handle this failure.
This may leave an empty file on remote storage while the operation is treated as successful.
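The following is a minimal sketch of the intended handling, assuming a hypothetical BrokerWriter with Status-returning write() and close() methods (not the actual Doris API); the point is that the close() result must be checked and propagated instead of being dropped.

    #include <string>

    // Minimal Status type for this sketch.
    struct Status {
        bool ok = true;
        std::string msg;
    };

    // Hypothetical writer interface; the real broker writer differs.
    struct BrokerWriter {
        Status write(const std::string& /*data*/) { return {}; }
        Status close() { return {}; }  // may fail, e.g. on a broker RPC error
    };

    Status export_rows(BrokerWriter& writer, const std::string& data) {
        Status st = writer.write(data);
        if (!st.ok) return st;
        // Previously the result of close() was dropped, so a failed close
        // could leave an empty file on remote storage while the job looked OK.
        st = writer.close();
        if (!st.ok) {
            st.msg = "failed to close broker writer: " + st.msg;
        }
        return st;
    }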
Also enhance usability:
1. Get the latest 2000 transactions instead of the earliest ones.
2. Show which download and upload tasks are being executed on each backend.
Fix direct compilation failures:
fix thirdparty compilation on Ubuntu, where libraries are installed to the lib dir instead of lib64
fix a compile error on GCC 5 caused by a C++11 defect (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60970)
fix the GCC version check, which does not work on some OSes
Each load job has several load tasks, and each task is a query plan
with several plan fragments. Each plan fragment reports its query profile
independently, so we need to collect each plan fragment's report separately.
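As an illustration only, here is a hedged sketch of collecting reports keyed by fragment instance id; RuntimeProfile, FragmentReport, and LoadTaskProfileCollector are placeholder names, not the actual Doris classes.

    #include <cstdint>
    #include <map>
    #include <mutex>
    #include <string>

    struct RuntimeProfile {
        std::string text;
        std::string pretty_print() const { return text; }
    };

    struct FragmentReport {
        int64_t fragment_instance_id;
        RuntimeProfile profile;
    };

    class LoadTaskProfileCollector {
    public:
        // Each plan fragment reports independently, so each report is stored
        // under its own fragment instance id instead of overwriting a single
        // job-level profile.
        void on_report(const FragmentReport& report) {
            std::lock_guard<std::mutex> lock(_mutex);
            _profiles[report.fragment_instance_id] = report.profile;
        }

        std::string full_profile() const {
            std::lock_guard<std::mutex> lock(_mutex);
            std::string result;
            for (const auto& entry : _profiles) {
                result += entry.second.pretty_print() + "\n";
            }
            return result;
        }

    private:
        mutable std::mutex _mutex;
        std::map<int64_t, RuntimeProfile> _profiles;
    };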
In the previous compaction logic, only rowsets were taken into consideration.
When doing streaming load, a single rowset may be made up of many overlapping segments,
and scanning these overlapping segments results in read amplification.
To address this, overlapping segments should also be taken into consideration
when doing cumulative compaction, to reduce read amplification.
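A rough sketch of the idea follows, using a made-up Rowset struct and scoring function rather than the real compaction policy: a rowset whose segments overlap contributes one unit per segment to the compaction score, since every overlapping segment has to be read separately, while a non-overlapping rowset contributes a single unit.

    #include <vector>

    // Illustrative stand-in for rowset metadata, not the Doris Rowset class.
    struct Rowset {
        int num_segments = 1;
        bool segments_overlapping = false;  // e.g. a rowset produced by streaming load
    };

    int compaction_score(const std::vector<Rowset>& candidates) {
        int score = 0;
        for (const Rowset& rs : candidates) {
            // Overlapping segments each add to the score, so such rowsets are
            // picked for cumulative compaction sooner.
            score += rs.segments_overlapping ? rs.num_segments : 1;
        }
        return score;
    }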
The current load process is:
Tablet Sink -> Tablet Channel Mgr -> Tablets Channel -> Delta Writer -> MemTable -> Flush to disk
In the path Tablets Channel -> DeltaWriter -> MemTable -> Flush to disk, the following operations are performed:
Insert tuples into different memtables according to tablet ID.
When a memtable's size reaches the threshold, it is written to disk.
For a single load task, the above operations are executed in a single thread.
In fact, insertion into a memtable and flushing a memtable can be executed concurrently.
Performing these operations in a single thread means insertion into the memtable is delayed by slow disk writes.
In the new implementation, I added a MemTableFlushExecutor class with a set of flush queues and corresponding worker threads.
By default, each data directory uses two worker threads for flushing; this can be changed via the BE parameter flush_thread_num_per_store.
DeltaWriter pushes a full memtable to the MemTableFlushExecutor for flushing and creates a new memtable to receive new data.
This design improves the performance of loading large files.
In a single-host test, the time to load a 1 GB text file was reduced from 48 seconds to 29 seconds.
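Below is a simplified sketch of this flush-executor pattern, not the actual MemTableFlushExecutor code (class and method names are illustrative): a queue of flush jobs is drained by a fixed pool of worker threads, so the writer can switch to a fresh memtable while the full one is written to disk in the background.

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>
    #include <vector>

    class FlushExecutor {
    public:
        explicit FlushExecutor(int num_threads) {
            for (int i = 0; i < num_threads; ++i) {
                _workers.emplace_back([this] { _work_loop(); });
            }
        }

        ~FlushExecutor() {
            {
                std::lock_guard<std::mutex> lock(_mutex);
                _stopped = true;
            }
            _cv.notify_all();
            for (auto& t : _workers) t.join();
        }

        // Called by the writer when a memtable is full; the writer then
        // switches to a new memtable and keeps accepting rows while this
        // one is flushed by a worker thread.
        void submit_flush(std::function<void()> flush_job) {
            {
                std::lock_guard<std::mutex> lock(_mutex);
                _queue.push(std::move(flush_job));
            }
            _cv.notify_one();
        }

    private:
        void _work_loop() {
            while (true) {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> lock(_mutex);
                    _cv.wait(lock, [this] { return _stopped || !_queue.empty(); });
                    if (_stopped && _queue.empty()) return;
                    job = std::move(_queue.front());
                    _queue.pop();
                }
                job();  // write the memtable to disk
            }
        }

        std::mutex _mutex;
        std::condition_variable _cv;
        std::queue<std::function<void()>> _queue;
        std::vector<std::thread> _workers;
        bool _stopped = false;
    };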
The function that assigns conjuncts is invoked before the aggregation plan node is created.
If the empty set node is the child of the aggregation node, the conjuncts are assigned to the empty set node, which cannot be executed correctly in the Backend.
The Backend throws the exception "couldn't resolve slot descriptor" for a query that has both an empty set node and an aggregation node.
For example: select sum(pv) from test where type != 1 and 1=0 group by type;
This commit fixes the bug by removing the conjuncts from the empty set node.
Currently the HyperLogLog struct is so large that rowsets become too small
when ingesting data. In this CL, the registers in a HyperLogLog are only
created when they are needed. When ingesting data, it is the normal case
that one HyperLogLog holds only a few values.
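As an illustration of the lazy allocation (constants and names here are hypothetical, not the Doris HyperLogLog implementation), the register array is only allocated on the first update, so an HLL value that holds few or no elements stays small:

    #include <cstddef>
    #include <cstdint>
    #include <memory>

    class LazyHll {
    public:
        static constexpr int kNumRegisters = 16 * 1024;  // illustrative register count

        void update_register(int index, uint8_t value) {
            // Allocate the registers only when the first value arrives, so an
            // HLL column with few values stays small during ingestion.
            if (_registers == nullptr) {
                _registers.reset(new uint8_t[kNumRegisters]());  // zero-initialized
            }
            if (value > _registers[index]) {
                _registers[index] = value;
            }
        }

        size_t memory_usage() const { return _registers ? kNumRegisters : 0; }

    private:
        std::unique_ptr<uint8_t[]> _registers;  // null until the first update
    };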
Reproduce:
1. Start a routine load and send a routine load task to BE.
2. BE executes the task successfully and commits it to FE.
3. The commit request fails on FE because the database has been renamed (a db-not-found exception is thrown).
4. After the commit fails, BE sends a rollback request to FE.
5. FE receives this rollback request and mistakenly updates the routine load progress,
because the number of loaded rows in the rollback request's attachment is larger than 0.
In Doris, one block has 1024 rows. Consider the following two conditions:
1. The previous ScanKey scans rows across multiple blocks, and its final block holds exactly 1024 rows.
2. The current ScanKey scans fewer rows than one block.
When both conditions hold, if we do not seek the block, the position of the prefix short key columns is wrong.
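For illustration only, a hypothetical condition for when an explicit block seek would be needed; the constant and function below are illustrative and do not reflect the actual Doris reader code.

    #include <cstdint>

    constexpr int64_t kRowsPerBlock = 1024;

    // Assumed reading of the two conditions above: the previous ScanKey ended
    // exactly on a block boundary and the current ScanKey reads less than one
    // block, so the reader must seek the block before positioning the prefix
    // short key columns.
    bool need_seek_block(int64_t prev_scan_rows, int64_t cur_scan_rows) {
        bool prev_ended_on_boundary =
                prev_scan_rows > 0 && prev_scan_rows % kRowsPerBlock == 0;
        bool cur_less_than_one_block = cur_scan_rows < kRowsPerBlock;
        return prev_ended_on_boundary && cur_less_than_one_block;
    }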
Random distribution is no longer supported since version 0.9, so we need a way to convert random distribution to hash distribution:
ALTER TABLE db.tbl SET ("distribution_type" = "hash");