Previously, the counters in `profile` could be updated when the file reader was closed.
And the file reader could be closed while its owning object was being destructed.
But at that point the `profile` object may already have been deleted, causing an NPE that crashes the BE.
This PR tries to fix this issue:
1. Remove the "profile counter update" logic from all `close()` methods.
2. Add a new interface, `ProfileCollector`.
It has 2 methods:
- `collect_profile_at_runtime()`
It can be called at runtime, e.g., in every `get_next_block()` call,
so that the counters in the profile can be updated at runtime.
- `collect_profile_before_close()`
Should be called before the object calls `close()`. It will only be called once.
3. Derive from `ProfileCollector`
All classes that may update the profile counters in their `close()` method, such as `GenericReader`, should extend
`ProfileCollector` and implement `collect_profile_before_close()`.
`collect_profile_before_close()` will be called in `scanner->mark_to_need_to_close()`.
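A minimal C++ sketch of this pattern follows. It shows `close()` no longer touching the profile, so destroying a reader after its profile is gone is harmless. `ProfileCounter` and `ExampleReader` are hypothetical stand-ins, not Doris's real classes. The `_collected` guard is an assumption: it is only one possible way to make `collect_profile_before_close()` run once.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a RuntimeProfile counter.
struct ProfileCounter {
    int64_t value = 0;
    void update(int64_t delta) { value += delta; }
};

class ProfileCollector {
public:
    virtual ~ProfileCollector() = default;

    // May be called repeatedly, e.g. from every get_next_block().
    virtual void collect_profile_at_runtime() {}

    // Called before close(). The guard (an assumption of this sketch) turns
    // any repeated call into a no-op.
    void collect_profile_before_close() {
        if (_collected) {
            return;
        }
        _collect_profile_before_close();
        _collected = true;
    }

protected:
    virtual void _collect_profile_before_close() {}

private:
    bool _collected = false;
};

// Hypothetical reader: it accumulates statistics privately and copies them
// into the profile only when asked, while the profile is known to be alive.
class ExampleReader : public ProfileCollector {
public:
    explicit ExampleReader(ProfileCounter* bytes_read) : _bytes_read(bytes_read) {}

    void read(int64_t n) { _pending_bytes += n; }

    void collect_profile_at_runtime() override { _flush(); }

    // close() no longer touches the profile at all.
    void close() { _closed = true; }
    bool closed() const { return _closed; }

protected:
    void _collect_profile_before_close() override { _flush(); }

private:
    void _flush() {
        _bytes_read->update(_pending_bytes);
        _pending_bytes = 0;
    }

    ProfileCounter* _bytes_read;  // not owned; may be destroyed before this reader
    int64_t _pending_bytes = 0;
    bool _closed = false;
};
```

In this sketch, the scanner would call `collect_profile_before_close()` from `mark_to_need_to_close()` while the profile is still valid. `close()` can then run at any later time.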
`ScannerContext` would schedule scanners even after it had been stopped, and it confused `_is_finished` with `_should_stop`.
This PR only fixes the concurrency bugs, reported in https://github.com/apache/doris/pull/28384, that occur when the scanner is stopped or finished.
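A simplified sketch of keeping the two states apart follows, using hypothetical names (`StoppableScannerContext`, `try_schedule()`). It is not the actual fix. Here `_is_finished` means the scanners produced all their data, and `_should_stop` means the consumer no longer wants data. Either flag blocks further scheduling.

```cpp
#include <atomic>
#include <cassert>

class StoppableScannerContext {
public:
    // All scanners produced all their data (normal completion).
    void set_finished() { _is_finished.store(true, std::memory_order_release); }
    // The consumer no longer wants data (limit reached, cancel, close).
    void set_should_stop() { _should_stop.store(true, std::memory_order_release); }

    bool is_finished() const { return _is_finished.load(std::memory_order_acquire); }
    bool should_stop() const { return _should_stop.load(std::memory_order_acquire); }

    // Returns true if a scanner was submitted. Note: the check and the submit
    // are not atomic together, so a real implementation also needs a lock or
    // a re-check inside the scanner task.
    bool try_schedule() {
        if (should_stop() || is_finished()) {
            return false;
        }
        ++_num_scheduled;
        return true;
    }

    int num_scheduled() const { return _num_scheduled.load(); }

private:
    std::atomic<bool> _is_finished{false};
    std::atomic<bool> _should_stop{false};
    std::atomic<int> _num_scheduled{0};
};
```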
`VScanNode::get_next` checks whether the ScanNode has reached its limit and, if so, sends eos to the TaskScheduler, which then tries to close the ScanNode.
However, the ScanNode must wait for all running scanners to finish, so even if it has reached its limit, it cannot be closed immediately.
This PR tries to interrupt the running readers so that the ScanNode ends as soon as possible.
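The interruption can be sketched as a shared stop flag. The scan node sets it when the limit is reached, and every reader checks it before reading the next batch. `ScanStopFlag`, `InterruptibleReader` and `ReadResult` are hypothetical names used only for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <utility>

// Shared flag owned by the scan node; readers only observe it.
struct ScanStopFlag {
    std::atomic<bool> stop{false};
};

enum class ReadResult { kBatch, kEof, kInterrupted };

class InterruptibleReader {
public:
    InterruptibleReader(std::shared_ptr<const ScanStopFlag> flag, int total_batches)
            : _flag(std::move(flag)), _remaining(total_batches) {}

    ReadResult get_next_block() {
        // Checked before doing any (potentially slow) IO, so a reader with
        // lots of data left still returns quickly once the limit is hit.
        if (_flag->stop.load(std::memory_order_acquire)) {
            return ReadResult::kInterrupted;
        }
        if (_remaining == 0) {
            return ReadResult::kEof;
        }
        --_remaining;
        return ReadResult::kBatch;
    }

private:
    std::shared_ptr<const ScanStopFlag> _flag;
    int _remaining;
};
```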
For a load request, there are 2 tuples on the scan node: the input tuple and the output tuple.
The input tuple is used for reading the file, and it is converted to the output tuple based on the user-specified column mappings.
Broker load supports different column mappings in different data descriptions for the same table (or partition).
So for each scanner, the output tuple is the same, but the input tuple can be different.
The previous implementation saved the input tuple at the scan node level, causing different scanners to use the same input tuple,
which is incorrect.
This PR removes the input tuple from the scan node and saves it in each scanner.
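A minimal sketch of the new ownership follows, using hypothetical simplified types where a tuple is just a list of column names. Each scanner owns the input tuple derived from its own data description, while all scanners share one output tuple.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified descriptor.
struct TupleDesc {
    std::vector<std::string> columns;
};

class LoadScanner {
public:
    LoadScanner(TupleDesc input, std::shared_ptr<const TupleDesc> output)
            : _input_tuple(std::move(input)), _output_tuple(std::move(output)) {}

    const TupleDesc& input_tuple() const { return _input_tuple; }
    const TupleDesc& output_tuple() const { return *_output_tuple; }

private:
    TupleDesc _input_tuple;                          // per scanner
    std::shared_ptr<const TupleDesc> _output_tuple;  // same for all scanners
};
```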
Refactoring the filtering conditions in the current ExecNode from an expression tree to an array can simplify the process of adding runtime filters. It eliminates the need for complex merge operations and removes the requirement for the frontend to combine expressions into a single entity.
By representing the filtering conditions as an array, each condition can be treated individually, making it easier to add runtime filters without the need for complex merging logic. The array can store the individual conditions, and the runtime filter logic can iterate through the array to apply the filters as needed.
This refactoring simplifies the codebase, improves readability, and reduces the complexity associated with handling filtering conditions and adding runtime filters. It separates the conditions into discrete entities, enabling more straightforward manipulation and management within the execution node.
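A small sketch of the idea follows, with hypothetical `Row`, `Conjunct` and `ExecNodeSketch` types. The conjuncts live in a `std::vector` and are implicitly ANDed. A runtime filter that arrives later is therefore simply appended instead of being merged into an expression tree.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

struct Row {
    int a;
    int b;
};
using Conjunct = std::function<bool(const Row&)>;

class ExecNodeSketch {
public:
    void add_conjunct(Conjunct c) { _conjuncts.push_back(std::move(c)); }

    // A late-arriving runtime filter is just one more array element.
    void add_runtime_filter(Conjunct rf) { add_conjunct(std::move(rf)); }

    // A row passes only if every conjunct accepts it (implicit AND).
    bool eval(const Row& row) const {
        return std::all_of(_conjuncts.begin(), _conjuncts.end(),
                           [&row](const Conjunct& c) { return c(row); });
    }

    std::size_t num_conjuncts() const { return _conjuncts.size(); }

private:
    std::vector<Conjunct> _conjuncts;
};
```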
Co-authored-by: yiguolei <yiguolei@gmail.com>
Currently, the exec node saves `ExprContext**`, but the objects live in the object pool, which makes the code very unclear. We can just use `ExprContext*`.
Currently, there are some useless includes in the codebase. We can use a tool named include-what-you-use to optimize these includes. A strict include-what-you-use policy brings benefits such as shorter compile times and fewer hidden transitive-include dependencies.
In the past, only simple predicates (slot = const), AND, LIKE, and OR (only with a bitmap index) could be pushed down to the storage layer. The scan process was:
1. Read some of the columns first, and calculate the row ids with the simple pushed-down predicates.
2. Use the row ids to read the remaining columns and pass them to the scanner, which filters them with the remaining predicates.
This PR also pushes down the remaining predicates in the scanner (functions, nested predicates, ...) to the storage layer for filtering. The scan process is now:
1. Read some of the columns first, and use the simple pushed-down predicates to calculate the row ids (same as above).
2. Use the row ids to read the columns needed by the remaining predicates, and use the pushed-down remaining predicates to reduce the number of row ids again.
3. Use the row ids to read the remaining columns and pass them to the scanner.
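The three-step process can be sketched over a toy in-memory columnar segment. The segment layout, the column names and the example predicates are assumptions. The point is that each step reads only the columns it needs, and only at the row ids that survived the previous step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical segment: column name -> values.
using Segment = std::map<std::string, std::vector<int64_t>>;
using RowIds = std::vector<std::size_t>;

// Step 1: evaluate a simple predicate (slot op const) on a single column.
RowIds filter_simple(const Segment& seg, const std::string& col,
                     const std::function<bool(int64_t)>& pred) {
    RowIds out;
    const auto& values = seg.at(col);
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (pred(values[i])) out.push_back(i);
    }
    return out;
}

// Step 2: read only the columns the remaining predicate needs, only at the
// surviving row ids, and shrink the row id set again.
RowIds filter_remaining(const Segment& seg, const RowIds& ids, const std::string& c1,
                        const std::string& c2,
                        const std::function<bool(int64_t, int64_t)>& pred) {
    RowIds out;
    const auto& v1 = seg.at(c1);
    const auto& v2 = seg.at(c2);
    for (std::size_t id : ids) {
        if (pred(v1[id], v2[id])) out.push_back(id);
    }
    return out;
}

// Step 3: materialize the remaining output column for the final row ids.
std::vector<int64_t> materialize(const Segment& seg, const std::string& col, const RowIds& ids) {
    std::vector<int64_t> out;
    out.reserve(ids.size());
    const auto& values = seg.at(col);
    for (std::size_t id : ids) out.push_back(values[id]);
    return out;
}
```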
- Remove duplicate type definitions in the function context.
- Remove unused methods in the function context.
- There is no need for a stale state in the vexpr context, because vexpr is stateless, and the function context holds the state and is cloned.
- Remove the useless `slot_size` from all tuple and slot descriptors.
- Remove the `doris_udf` namespace, which is useless.
- Remove some unused macro definitions.
- Initialize `v_conjuncts` in VScanner, so the same code does not need to be written in every scanner.
- Use `std::unique_ptr` to manage the function context, since it can only belong to a single expr context.
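The last point can be sketched as follows. `FunctionContext` and `ExprContextSketch` here are simplified hypothetical types, not the real Doris classes. Each expr context exclusively owns its function context through `std::unique_ptr`, and cloning an expr context gives the clone its own copy of the state.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical simplified function context holding per-execution state.
struct FunctionContext {
    std::string state;
    std::unique_ptr<FunctionContext> clone() const {
        return std::make_unique<FunctionContext>(*this);
    }
};

class ExprContextSketch {
public:
    ExprContextSketch() : _fn_ctx(std::make_unique<FunctionContext>()) {}

    // The clone gets its own FunctionContext; nothing is shared.
    std::unique_ptr<ExprContextSketch> clone() const {
        auto copy = std::make_unique<ExprContextSketch>();
        copy->_fn_ctx = _fn_ctx->clone();
        return copy;
    }

    FunctionContext* fn_ctx() const { return _fn_ctx.get(); }

private:
    std::unique_ptr<FunctionContext> _fn_ctx;  // exclusively owned
};
```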
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
Make `rows_read` correct so that the scheduler can use it correctly.
Use a single scanner if the query has a limit clause, and move this logic from the fragment context to the scan node.
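A hypothetical helper illustrating the scanner-count rule follows. The function name and parameters are illustrative only. The "one scanner when there is a LIMIT" rule comes from the description above, and capping by `max_scanners` otherwise is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

int num_scanners_to_create(std::optional<int64_t> limit, int num_ranges, int max_scanners) {
    if (num_ranges <= 0) {
        return 0;
    }
    if (limit.has_value()) {
        // One scanner avoids many scanners each reading up to `limit` rows.
        return 1;
    }
    return std::min(num_ranges, max_scanners);
}
```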
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
1. When mapping columns from an external data source, use date/datetimev2 as the default type.
2. Check `is_cancelled` when reading data, to avoid an endless loop after the query is cancelled.
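Item 2 can be sketched as a read loop that re-checks a cancellation flag on every iteration. In the sketch, `read_once` is a hypothetical callback that returns `false` when the source has nothing to return yet. Without the check, a source that never produces data would keep the loop spinning after the query is cancelled.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

enum class ReadStatus { kOk, kCancelled };

ReadStatus read_until_data(const std::atomic<bool>& is_cancelled,
                           const std::function<bool()>& read_once, int* empty_reads) {
    while (true) {
        // Checked on every iteration, so a cancelled query always exits.
        if (is_cancelled.load(std::memory_order_acquire)) {
            return ReadStatus::kCancelled;
        }
        if (read_once()) {
            return ReadStatus::kOk;
        }
        ++*empty_reads;
    }
}
```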
This PR optimizes TopN queries like `SELECT * FROM tableX ORDER BY columnA ASC/DESC LIMIT N`.
TopN is composed of a SortNode and a ScanNode. When the user table is wide (e.g., 100+ columns), the ORDER BY clause usually involves only a few columns, but the ScanNode still needs to scan all the data from the storage engine even if the limit is very small. This can cause a lot of read amplification. So in this PR I divide the TopN query into two phases:
1. The first phase only reads `columnA`'s data from the storage engine, along with an extra RowId column called `__DORIS_ROWID_COL__`. The other columns are pruned from the ScanNode.
2. The second phase is placed in the ExchangeNode, because it is the central node for the TopN nodes in the cluster. The ExchangeNode sends RPCs to the other nodes with the RowIds (sorted and limited by the SortNode) read in the first phase, and reads the rows one by one from the storage engine.
After the second phase, the Block contains all the data needed for the query.
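A single-process sketch of the two phases follows, assuming a hypothetical `WideRow` type and ascending order. Phase 1 reads only `(sort_key, row_id)` pairs and keeps the N smallest with `std::partial_sort`. Phase 2 fetches full rows by row id. In the PR, phase 2 runs in the ExchangeNode and fetches rows from other nodes via RPC. Here it is only a local lookup.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

struct WideRow {
    int64_t sort_key;
    std::string payload;  // stands in for the 100+ other columns
};

// Phase 1: read only (sort_key, row_id) and keep the N smallest.
std::vector<std::size_t> topn_phase1(const std::vector<WideRow>& table, std::size_t n) {
    std::vector<std::pair<int64_t, std::size_t>> keys;
    keys.reserve(table.size());
    for (std::size_t row_id = 0; row_id < table.size(); ++row_id) {
        keys.emplace_back(table[row_id].sort_key, row_id);
    }
    n = std::min(n, keys.size());
    std::partial_sort(keys.begin(), keys.begin() + static_cast<std::ptrdiff_t>(n), keys.end());
    std::vector<std::size_t> row_ids;
    row_ids.reserve(n);
    for (std::size_t i = 0; i < n; ++i) row_ids.push_back(keys[i].second);
    return row_ids;
}

// Phase 2: fetch full rows only for the selected row ids, keeping their order.
std::vector<WideRow> topn_phase2(const std::vector<WideRow>& table,
                                 const std::vector<std::size_t>& row_ids) {
    std::vector<WideRow> result;
    result.reserve(row_ids.size());
    for (std::size_t id : row_ids) result.push_back(table[id]);
    return result;
}
```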
The original scan pools are in `exec_env`.
But after `enable_new_load_scan_node` is turned on by default, the scan pools in `exec_env` are no longer used.
All scan tasks are submitted to the scan pools in `scanner_scheduler`.
BTW, this PR reorganizes the scan pools into 3 kinds:
- local scan pool: for the olap scan node
- remote scan pool: for the file scan node
- limited scan pool: for queries that set a CPU resource limit or have a small limit clause
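A hypothetical sketch of pool selection follows. `ScanRequest` and its field names are illustrative. Giving the limited pool precedence over the node type is an assumption of this sketch.

```cpp
#include <cassert>

enum class ScanPool { kLocal, kRemote, kLimited };

struct ScanRequest {
    bool is_olap_scan;     // olap scan node vs file scan node
    bool has_cpu_limit;    // query sets a CPU resource limit
    bool has_small_limit;  // query has a small LIMIT clause
};

ScanPool choose_scan_pool(const ScanRequest& req) {
    if (req.has_cpu_limit || req.has_small_limit) {
        return ScanPool::kLimited;
    }
    return req.is_olap_scan ? ScanPool::kLocal : ScanPool::kRemote;
}
```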
TODO:
Use bthread to unify all IO tasks.
Some trivial issues:
- Fix a bug where the memtable flush size printed in the log was wrong.
- Add a RuntimeProfile parameter to VScanner.
Optimize key TopN queries like `SELECT * FROM store_sales ORDER BY ss_sold_date_sk, ss_sold_time_sk LIMIT 100`
(`ss_sold_date_sk, ss_sold_time_sk` is a prefix of the table's sort key).
Check the limit per scanner and set eof to true, to reduce the amount of data that needs to be read.
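Because the ORDER BY columns are a prefix of the sort key, each scanner already returns rows in order. A scanner can therefore set eof as soon as it has returned `limit` rows, since no later row from that scanner can beat them. The sketch below is hypothetical: it assumes ascending order and uses integer keys as stand-in rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

class SortedScanner {
public:
    SortedScanner(std::vector<int64_t> rows, std::size_t limit)
            : _rows(std::move(rows)), _limit(limit) {}

    // Returns false at eof. Eof is also set once `limit` rows were returned.
    bool next(int64_t* out) {
        if (_eof || _pos >= _rows.size() || _pos >= _limit) {
            _eof = true;
            return false;
        }
        *out = _rows[_pos++];
        return true;
    }

    std::size_t rows_read() const { return _pos; }

private:
    std::vector<int64_t> _rows;  // already sorted by the sort key
    std::size_t _limit;
    std::size_t _pos = 0;
    bool _eof = false;
};

// Merge the per-scanner outputs and keep the global top N.
std::vector<int64_t> key_topn(std::vector<SortedScanner>& scanners, std::size_t n) {
    std::vector<int64_t> all;
    for (auto& s : scanners) {
        int64_t v;
        while (s.next(&v)) all.push_back(v);
    }
    std::sort(all.begin(), all.end());
    if (all.size() > n) all.resize(n);
    return all;
}
```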
1. Remove the FE config `enable_array_type`.
2. Limit the nesting depth of arrays on the FE side.
3. Fix a bug where, when loading arrays from parquet, the decimal type was treated as bigint.
4. Fix loading arrays from csv (vectorized engine), handling null and "null".
5. Change the csv array loading behavior: if the array string format in csv is invalid, it is converted to null.
6. Remove `check_array_format()`, because its logic is wrong and meaningless.
7. Add stream load csv test cases and more parquet broker load tests.
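Item 5 can be sketched with a hypothetical parser for integer arrays. It is not Doris's real parser. A malformed field makes the whole value NULL (`std::nullopt`) instead of failing the load. Treating an unquoted `null` element as a NULL element is an assumption of this sketch.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

using IntArray = std::vector<std::optional<int64_t>>;

// Trim leading and trailing spaces.
static std::string trim(const std::string& s) {
    std::size_t b = s.find_first_not_of(' ');
    if (b == std::string::npos) return "";
    std::size_t e = s.find_last_not_of(' ');
    return s.substr(b, e - b + 1);
}

std::optional<IntArray> parse_int_array(const std::string& field) {
    std::string s = trim(field);
    if (s.size() < 2 || s.front() != '[' || s.back() != ']') return std::nullopt;
    std::string body = trim(s.substr(1, s.size() - 2));
    IntArray result;
    if (body.empty()) return result;  // "[]"
    std::size_t start = 0;
    for (;;) {
        std::size_t comma = body.find(',', start);
        std::size_t end = (comma == std::string::npos) ? body.size() : comma;
        std::string item = trim(body.substr(start, end - start));
        if (item == "null") {
            result.push_back(std::nullopt);  // NULL element (assumption)
        } else {
            std::size_t i = (!item.empty() && (item[0] == '-' || item[0] == '+')) ? 1 : 0;
            if (i == item.size()) return std::nullopt;  // empty element or lone sign
            for (std::size_t j = i; j < item.size(); ++j) {
                if (!std::isdigit(static_cast<unsigned char>(item[j]))) return std::nullopt;
            }
            try {
                result.push_back(static_cast<int64_t>(std::stoll(item)));
            } catch (const std::out_of_range&) {
                return std::nullopt;  // does not fit in 64 bits
            }
        }
        if (comma == std::string::npos) break;
        start = comma + 1;
    }
    return result;
}
```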
1. Refactor the file reader creation in FileFactory, for simplicity.
Previously, FileFactory had too many `create_file_reader` interfaces.
They are now unified into two categories: the interface used by the previous BrokerScanNode,
and the interface used by the new FileScanNode.
The creation of readers that read `StreamLoadPipe` is also separated from the creation of other readers that read files.
2. Modify the StreamLoadPlanner on the FE side to support using ExternalFileScanNode.
3. For the generic reader, the file reader is now created inside the reader instead of being passed in from outside.
4. Add some test cases for csv stream load; the behavior is the same as the old broker scanner.
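A hypothetical sketch of the separation follows. The class names and the `FileType` enum are illustrative, not Doris's actual `FileFactory` API. File readers and pipe readers have separate factory entry points, and a generic reader builds its own file reader instead of receiving one from the caller.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

class FileReader {
public:
    virtual ~FileReader() = default;
    virtual std::string describe() const = 0;
};

class PathFileReader : public FileReader {
public:
    PathFileReader(std::string scheme, std::string path)
            : _scheme(std::move(scheme)), _path(std::move(path)) {}
    std::string describe() const override { return _scheme + "://" + _path; }

private:
    std::string _scheme;
    std::string _path;
};

class PipeReader : public FileReader {
public:
    explicit PipeReader(std::string load_id) : _load_id(std::move(load_id)) {}
    std::string describe() const override { return "pipe:" + _load_id; }

private:
    std::string _load_id;
};

enum class FileType { kLocal, kHdfs, kS3 };

struct FileFactorySketch {
    // Readers over files.
    static std::unique_ptr<FileReader> create_file_reader(FileType type, const std::string& path) {
        switch (type) {
        case FileType::kLocal: return std::make_unique<PathFileReader>("file", path);
        case FileType::kHdfs: return std::make_unique<PathFileReader>("hdfs", path);
        case FileType::kS3: return std::make_unique<PathFileReader>("s3", path);
        }
        return nullptr;
    }
    // Readers over a stream load pipe are created separately.
    static std::unique_ptr<FileReader> create_pipe_reader(const std::string& load_id) {
        return std::make_unique<PipeReader>(load_id);
    }
};

// The generic reader creates its file reader internally.
class GenericReaderSketch {
public:
    GenericReaderSketch(FileType type, const std::string& path)
            : _file_reader(FileFactorySketch::create_file_reader(type, path)) {}
    const FileReader& file_reader() const { return *_file_reader; }

private:
    std::unique_ptr<FileReader> _file_reader;
};
```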
Refactor of scanners: support broker load.
This PR is part of the scanner refactoring tasks. It provides support for broker load using the new VFileScanner.
Work is still in progress.
Related PRs:
https://github.com/apache/doris/pull/11582
https://github.com/apache/doris/pull/12048
Use the new file scan node and the new scheduling framework to run load jobs, replacing the old broker scan node.
The load part (BE) is a work in progress. The query part (FE) has been tested with the TPC-H benchmark.
Please review only the FE code in this PR; the BE code is disabled by the `enable_new_load_scan_node` configuration. I will send another PR soon to fix the BE-side code.
Copy most of the profiles from VOlapScanNode and VOlapScanner to NewOlapScanNode and NewOlapScanner.
Fix some blocking bugs in the new scan framework.
TODO:
- Memtracker
- OpenTelemetry span
The new framework is still disabled by default, so it will not affect other features.
There are currently many types of ScanNodes in Doris, and most of their logic is the same, including:
- Runtime filter
- Predicate pushdown
- Scanner generation and scheduling
So I intend to unify the common logic of all ScanNodes.
Different data sources then only need to implement their own Scanners for data access.
This way, future scan optimizations can be applied to scans of all data sources,
while also reducing code duplication.
This PR mainly adds 4 new classes:
- VScanner
The parent class of all Scanners. Subclasses inherit from it to implement data-source-specific access.
- VScanNode
The unified ScanNode, responsible for the common logic, including runtime filters, predicate pushdown, and Scanner generation and scheduling.
- ScannerContext
ScannerContext records the execution status
of the group of Scanners belonging to a ScanNode,
including how many scanners are being scheduled, and it maintains
a producer-consumer block queue between the scanners and the scan node.
ScannerContext is also the scheduling unit of ScannerScheduler:
ScannerScheduler schedules one ScannerContext at a time
and submits its Scanners to the scanner thread pool for data scanning.
- ScannerScheduler
Responsible for scheduling all Scanner tasks in a unified way.
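A deliberately simplified, single-threaded sketch of how these classes relate follows. All names ending in `Sketch`, plus `VectorScanner` and `Block`, are hypothetical. The real ScannerScheduler submits scanners to a thread pool, while here each scheduling round runs inline. The consuming loop in the usage example plays the role of VScanNode.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

using Block = std::vector<int>;  // stand-in for a columnar block

// Parent of all scanners; subclasses implement data-source-specific reads.
class VScannerSketch {
public:
    virtual ~VScannerSketch() = default;
    virtual bool get_block(Block* block) = 0;  // false when exhausted
};

class VectorScanner : public VScannerSketch {
public:
    explicit VectorScanner(std::vector<Block> blocks) : _blocks(std::move(blocks)) {}
    bool get_block(Block* block) override {
        if (_next >= _blocks.size()) return false;
        *block = _blocks[_next++];
        return true;
    }

private:
    std::vector<Block> _blocks;
    std::size_t _next = 0;
};

// Execution status of one scan node's scanners plus the block queue.
class ScannerContextSketch {
public:
    explicit ScannerContextSketch(std::vector<std::unique_ptr<VScannerSketch>> scanners)
            : _scanners(std::move(scanners)), _done(_scanners.size(), false) {}

    std::size_t num_scanners() const { return _scanners.size(); }
    VScannerSketch& scanner(std::size_t i) { return *_scanners[i]; }
    bool is_done(std::size_t i) const { return _done[i]; }
    void mark_done(std::size_t i) { _done[i] = true; }
    bool all_done() const { return std::all_of(_done.begin(), _done.end(), [](bool d) { return d; }); }

    void append_block(Block b) {  // producer side (scanners)
        std::lock_guard<std::mutex> lock(_mu);
        _queue.push_back(std::move(b));
    }
    bool pop_block(Block* b) {  // consumer side (scan node)
        std::lock_guard<std::mutex> lock(_mu);
        if (_queue.empty()) return false;
        *b = std::move(_queue.front());
        _queue.pop_front();
        return true;
    }

private:
    std::vector<std::unique_ptr<VScannerSketch>> _scanners;
    std::vector<bool> _done;
    std::mutex _mu;  // guards _queue
    std::deque<Block> _queue;
};

// Schedules one context at a time; each unfinished scanner reads one block.
class ScannerSchedulerSketch {
public:
    void schedule(ScannerContextSketch& ctx) {
        for (std::size_t i = 0; i < ctx.num_scanners(); ++i) {
            if (ctx.is_done(i)) continue;
            Block b;
            if (ctx.scanner(i).get_block(&b)) {
                ctx.append_block(std::move(b));
            } else {
                ctx.mark_done(i);
            }
        }
    }
};
```

Usage: build the scanners, wrap them in a `ScannerContextSketch`, then loop until `all_done()`. Each iteration calls `schedule()` and drains the queue with `pop_block()`, as the scan node would.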
Test:
This work is still in progress and is disabled by default.
I tested it with JMeter at 50 concurrency, but currently the scanner just returns without data.
The QPS can reach about 9000.
I can't compare it with the original implementation because no data is read for now. I will test it when the new olap scanner is ready.
Co-authored-by: morningman <morningman@apache.org>