* [improvement](scanner_schedule) reduce memory consumption of scanner
1. limit scanner by memory consumptin rather than blocks.
2. scheduler run correcty instread of at lest 1.
For load request, there are 2 tuples on scan node, input tuple and output tuple.
The input tuple is for reading file, and it will be converted to output tuple based on user specified column mappings.
And the broker load support different column mapping in different data description to same table(or partition).
So for each scanner, the output tuples are same but the input tuple can be different.
The previous implements save the input tuple in scan node level, causing different scanner using same input tuple,
which is incorrect.
This PR remove the input tuple from scan node and save them in each scanners.
Firstly, to reduce memory usage, we do not pre-allocate blocks, instead we lazily allocate block when upper call get_free_block. And when upper call return_free_block to return free block, we add the block to a queue for memory reuse, and we will free the blocks in the queue when the scanner_context was closed instead of destructed.
Secondly, to limit the memory usage of the scanner, we introduce a variable _free_blocks_capacity to indicate the current number of free blocks available to the scanners. The number of scanners that can be scheduled will be calculated based on this value.
ssb flat test
previous
lineorder 1.2G:
load time: 3s, query time: 0.355s
lineorder 5.8G:
load time: 330s, query time: 0.970s
load time: 349s, query time: 0.949s
load time: 349s, query time: 0.955s
load time: 360s, query time: 0.889s (pipeline enabled)
after
lineorder 1.2G:
load time: 3s, query time: 0.349s
lineorder 5.8G:
load time: 342s, query time: 0.929s
load time: 337s, query time: 0.913s
load time: 345s, query time: 0.946s
load time: 346s, query time: 0.865s (pipeline enabled)
disallow call new method explicitly
force to use create_shared or create_unique to use shared ptr
placement new is allowed
reference https://abseil.io/tips/42 to add factory method to all class.
I think we should follow this guide because if throw exception in new method, the program will terminate.
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
Currently, there are some useless includes in the codebase. We can use a tool named include-what-you-use to optimize these includes. By using a strict include-what-you-use policy, we can get lots of benefits from it.
* [improvement](block exception safe) make block queue exception safe
This is part of exception safe: #16366.
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
make rows_read correct so that the scheduler could using this correctly.
use single scanner if has limit clause. Move it from fragment context to scannode.
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
This PR optimize topn query like `SELECT * FROM tableX ORDER BY columnA ASC/DESC LIMIT N`.
TopN is is compose of SortNode and ScanNode, when user table is wide like 100+ columns the order by clause is just a few columns.But ScanNode need to scan all data from storage engine even if the limit is very small.This may lead to lots of read amplification.So In this PR I devide TopN query into two phase:
1. The first phase we just need to read `columnA`'s data from storage engine along with an extra RowId column called `__DORIS_ROWID_COL__`.The other columns are pruned from ScanNode.
2. The second phase I put it in the ExchangeNode beacuase it's the central node for topn nodes in the cluster.The ExchangeNode will spawn a RPC to other nodes using the RowIds(sorted and limited from SortNode) read from the first phase and read row by row from storage engine.
After the second phase read, Block will contain all the data needed for the query
Now in ScannerContext::push_back_scanner_and_reschedule, _num_running_scanners-- is before _num_scheduling_ctx++.
InPipScannerContext::can_finish, we check _num_running_scanners == 0 && _num_scheduling_ctx == 0 without obtaining _transfer_lock.
In follow case, PipScannerContext::can_finish will return wrong result.
_num_running_scanners--
Check _num_running_scanners == 0 && _num_scheduling_ctx == 0` return true.
_num_scheduling_ctx++
So, we can set _num_running_scanners-- in the last of this func.
Describe your changes.
PipScannerContext::get_block_from_queue not block.
Set _num_running_scanners-- in the last of ScannerContext::push_back_scanner_and_reschedule.
1.
In the previous implementation, the max thread num of olap scanner was set relatively small, such as 3.
which would slow down some of queries.
In this PR, I changed the max thread num to a quarter of the scaner thread pool(default is 12),
which is less than the old scan node's max thread num, but larger than the previous implementation.
The upper limit of the max thread num of the old scan node is too high, which is not reasonable.
2.
Lower down the number of pre allocated free blocks.
This change serves the following purposes:
1. use ScanPredicate instead of TCondition for external table, it can reuse old code branch.
2. simplify and delete some useless old code
3. use ColumnValueRange to save predicate
When select large number of data from a table, the profile will show that:
- ScannerCtxSchedCount: 2.82664M(2826640)
But there is only 8 times of ScannerSchedCount, most of them are busy running.
After improvement, the ScannerCtxSchedCount will be reduced to only 10.
Copy most of profiles from VOlapScanNode and VOlapScanner to NewOlapScanNode and NewOlapScanner.
Fix some blocking bug of new scan framework.
TODO:
Memtracker
Opentelemetry spen
The new framework is still disabled by default, so it will not effect other feature.
There are currently many types of ScanNodes in Doris. And most of the logic of these ScanNodes is the same, including:
Runtime filter
Predicate pushdown
Scanner generation and scheduling
So I intend to unify the common logic of all ScanNodes.
Different data sources only need to implement different Scanners for data access.
So that the future optimization for scan can be applied to the scan of all data sources,
while also reducing the code duplication.
This PR mainly adds 4 new class:
VScanner
All Scanners' parent class. The subclasses can inherit this class to implement specific data access methods.
VScanNode
The unified ScanNode, and is responsible for common logic including RuntimeFilter, predicate pushdown, Scanner generation and scheduling.
ScannerContext
ScannerContext is responsible for recording the execution status
of a group of Scanners corresponding to a ScanNode.
Including how many scanners are being scheduled, and maintaining
a producer-consumer blocks queue between scanners and scan nodes.
ScannerContext is also the scheduling unit of ScannerScheduler.
ScannerScheduler schedules a ScannerContext at a time,
and submits the Scanners to the scanner thread pool for data scanning.
ScannerScheduler
Unified responsible for all Scanner scheduling tasks
Test:
This work is still in progress and default is disabled.
I tested it with jmeter with 50 concurrency, but currently the scanner is just return without data.
The QPS can reach about 9000.
I can't compare it to origin implement because no data is read for now. I will test it when new olap scanner is ready.
Co-authored-by: morningman <morningman@apache.org>