1. Get DataTypeSerde in advance to avoid get temporary DataTypeSerde iterate each column
2. Iterate the original row once is enoungh for deserializing by introducing a map for record the index of each column's unique id
Currently, there are some useless includes in the codebase. We can use a tool named include-what-you-use to optimize these includes. By using a strict include-what-you-use policy, we can get lots of benefits from it.
Arena can replace MemPool in most scenarios. Except for memory reuse, MemPool supports reuse of previous memory chunks after clear, but Arena does not.
Some comparisons between MemPool and Arena:
1. Expansion
Arena is less than 128M index 2 alloc chunk; more than 128M memory, allocate 128M * n > `size`, n is equal to the minimum value that satisfies the expression;
MemPool less than 512K index 2 alloc chunk, greater than 512K memory, separately apply for a `size` length chunk
After Arena applied for a chunk larger than 128M last time, the minimum chunk applied for after that is 128M. Does this seem to be a waste of memory? MemPool is also similar. After the chunk of 512K was applied for last time, the minimum chunk of subsequent applications is 512K.
2. Alignment
MemPool defaults to 16 alignment, because memtable and other places that use int128 require 16 alignment;
Arena has no default alignment;
3. Memory reuse
Arena only supports `rollback`, which reuses the memory of the current chunk, usually the memory requested last time.
MemPool supports clear(), all chunks can be reused; or call ReturnPartialAllocation() to roll back the last requested memory; if the last chunk has no memory, search for the most free chunk for allocation
4. Realloc
Arena supports realloc contiguous memory; it also supports realloc contiguous memory from any position at the time of the last allocation. The difference between `alloc_continue` and `realloc` is:
1. Alloc_continue does not need to specify the old size, but the default old size = head->pos - range_start
2. alloc_continue supports expansion from range_start when additional_bytes is between head and pos, which is equivalent to reusing a part of memory, while realloc completely allocates a new memory
MemPool does not support realloc, but supports transferring or absorbing chunks between two MemPools
5. check mem limit
MemPool checks the mem limit, and Arena checks at the Allocator layer.
6. Support for ASAN
Arena does something extra
7. Error handling
MemPool supports returning the error message of application failure directly through `Status`, and Arena throws Exception.
Tests that Arena can consider
1. After the last applied chunk is larger than 128M, the minimum applied chunk is 128M, which seems to waste memory;
2. Support clear, memory multiplexing;
3. Increase the large list, alloc the memory larger than 128M, and the size is equal to `size`, so as to avoid the current chunk not being fully used, which is wasteful.
4. In some cases, it may be possible to allocate backwards to find chunks t
1. introduce a new type `VARIANT` to encapsulate dynamic generated columns for hidding the detail of types and names of newly generated columns
2. introduce a new expression `SchemaChangeExpr` for doing schema change for extensibility
The current distribution model for Doris is as follows:
OlapTableSink seperate the original Block into serveral subblocks of each node(BE) by tablets distribution and distributes subblocks to storage engine of backends, then the storage engine will seperate the subblock into multiple tablets channel and each delta writer will handle partial of the block.
This model causes blocks to be split according to tablets, and the splitting process can be a relatively heavy operation. After splitting, the blocks are distributed to different DeltaWriters (Memtables) through RPCs to TabletChannels. The distribution operation on TabletChannels is also a relatively heavy operation. If the distribution property of the table is RANDOM distribution, then we have the opportunity to distribute the blocks according to the complete block during distribution. The advantage of doing so is to reduce memory copying and improve write locality, similar to appending the entire block to the memtable.
This optimze could save 10% ~ 20% CPU cost of RANDOM distribution table load when enable load_to_single_tablet
Issue Number: close#16351
Dynamic schema table is a special type of table, it's schema change with loading procedure.Now we implemented this feature mainly for semi-structure data such as JSON, since JSON is schema self-described we could extract schema info from the original documents and inference the final type infomation.This speical table could reduce manual schema change operation and easily import semi-structure data and extends it's schema automatically.
The origin scan pools are in exec_env.
But after enable new_load_scan_node by default, the scan pool in exec_env is no longer used.
All scan task will be submitted to the scan pool in scanner_scheduler.
BTW, reorganize the scan pool into 3 kinds:
local scan pool
For olap scan node
remote scan pool
For file scan node
limited scan pool
For query which set cpu resource limit or with small limit clause
TODO:
Use bthread to unify all IO task.
Some trivial issues:
fix bug that the memtable flush size printed in log is not right
Add RuntimeProfile param in VScanner
* [refactor] delete non vec load from memtable
delete non vec load from memtable totally.
remove function keys_type() in memtable.
Co-authored-by: zhoubintao <1229701101@qq.com>
It is very difficult to investigate the data inconsistency of multiple replicas.
When loading data, the number of rows between replicas is checked to avoid some data inconsistency problems.
when some case(need modify be.conf), a memtable may flush many segments and then calc delete bitmap with new data. but now, it just only load one segment with max sgement id and this bug will not cala delte bitmap with all data of all segment of one memtable, and will get many rows with same key from merge-on-write table.
mem tracker can be logically divided into 4 layers: 1)process 2)type 3)query/load/compation task etc. 4)exec node etc.
type includes
enum Type {
GLOBAL = 0, // Life cycle is the same as the process, e.g. Cache and default Orphan
QUERY = 1, // Count the memory consumption of all Query tasks.
LOAD = 2, // Count the memory consumption of all Load tasks.
COMPACTION = 3, // Count the memory consumption of all Base and Cumulative tasks.
SCHEMA_CHANGE = 4, // Count the memory consumption of all SchemaChange tasks.
CLONE = 5, // Count the memory consumption of all EngineCloneTask. Note: Memory that does not contain make/release snapshots.
BATCHLOAD = 6, // Count the memory consumption of all EngineBatchLoadTask.
CONSISTENCY = 7 // Count the memory consumption of all EngineChecksumTask.
}
Object pointers are no longer saved between each layer, and the values of process and each type are periodically aggregated.
other fix:
In [fix](memtracker) Fix transmit_tracker null pointer because phamp is not thread safe #13528, I tried to separate the memory that was manually abandoned in the query from the orphan mem tracker. But in the actual test, the accuracy of this part of the memory cannot be guaranteed, so put it back to the orphan mem tracker again.
The mem hook record tracker cannot guarantee that the final consumption is 0, nor can it guarantee that the memory alloc and free are recorded in a one-to-one correspondence.
In the life cycle of a memtable from insert to flush, the memory free of hook is more than that of alloc, resulting in tracker consumption less than 0.
In order to avoid the cumulative error of the upper load channel tracker, the memtable tracker consumption is reset to zero on destructor.
1. use rlock in most logic instead of wrlock
2. filter stale rowset's delete bitmap in save meta
3. add a delete_bitmap lock to handle compaction and publish_txn confict
Co-authored-by: yixiutt <yixiu@selectdb.com>
In our origin design, we calc delete bitmap in publish txn, and this operation
will cost too much time as it will load segment data and lookup row key in pre
rowset and segments.And publish version task should run in order, so it'll lead
to timeout in publish_txn.
In this pr, we seperate delete_bitmap calculation to tow part, one of it will be
done in flush mem table, so this work can run parallel. And we calc final
delete_bitmap in publish_txn, get a rowset_id set that should be included and
remove rowsets that has been compacted, the rowset difference between memtable_flush
and publish_txn is really small so publish_txn become very fast.In our test,
publish_txn cost about 10ms.
Co-authored-by: yixiutt <yixiu@selectdb.com>