Commit Graph

136 Commits

Author SHA1 Message Date
199d7d3be8 [Refactor]Merged string_value into string_ref (#15925) 2023-01-22 16:39:23 +08:00
8920295534 [refactor](remoe non vec code) remove non vectorized conjunctx from scanner (#16121)
1. remove arrow group filter
2. remove non vectorized conjunctx from scanner

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-01-21 19:23:17 +08:00
7814d2b651 [Fix](Oracle External Table) fix that oracle external table can not insert batch values (#16117)
Issue Number: close #xxx

This pr fix two bugs:

_jdbc_scanner may be nullptr in vjdbc_connector.cpp, so we use another method to count jdbc statistic. close [Enhencement](jdbc scanner) add profile for jdbc scanner #15914
In the batch insertion scenario, oracle database does not support syntax insert into tables values (...),(...); , what it supports is:
insert all
into table(col1,col2) values(c1v1, c2v1)
into table(col1,col2) values(c1v2, c2v2)
SELECT 1 FROM DUAL;
2023-01-21 07:57:12 +08:00
3ebc98228d [feature wip](multi catalog)Support iceberg schema evolution. (#15836)
Support iceberg schema evolution for parquet file format.
Iceberg use unique id for each column to support schema evolution.
To support this feature in Doris, FE side need to get the current column id for each column and send the ids to be side.
Be read column id from parquet key_value_metadata, set the changed column name in Block to match the name in parquet file before reading data. And set the name back after reading data.
2023-01-20 12:57:36 +08:00
6e090e4daf [Bug](predicate) fix date predicate (#16053) 2023-01-19 14:14:48 +08:00
3894de49d2 [Enhancement](topn) support two phase read for topn query (#15642)
This PR optimize topn query like `SELECT * FROM tableX ORDER BY columnA ASC/DESC LIMIT N`.

TopN is is compose of SortNode and ScanNode, when user table is wide like 100+ columns the order by clause is just a few columns.But ScanNode need to scan all data from storage engine even if the limit is very small.This may lead to lots of read amplification.So In this PR I devide TopN query into two phase:
1. The first phase we just need to read `columnA`'s data from storage engine along with an extra RowId column called `__DORIS_ROWID_COL__`.The other columns are pruned from ScanNode.
2. The second phase I put it in the ExchangeNode beacuase it's the central node for topn nodes in the cluster.The ExchangeNode will spawn a RPC to other nodes using the RowIds(sorted and limited from SortNode) read from the first phase and read row by row from storage engine.

After the second phase read, Block will contain all the data needed for the query
2023-01-19 10:01:33 +08:00
7d34512501 [Bug](pipeline) Fix DCHECK failure (#15928) 2023-01-17 12:01:20 +08:00
b1caa68706 [Feature-WIP](inverted index) inverted index reader's implementation, and add mysql_fulltext regression case to test fulltext query (#15823)
Issue Number: Step2 of DSIP-023: Add inverted index for full text search
implementation of inverted index reader

dependency pr: #14211 #15807 #15821
2023-01-17 09:13:56 +08:00
bdec4d5ac2 [enhancement](profile) add read columns to scanner profile (#15902) 2023-01-16 19:32:46 +08:00
97fcad76f8 [enhancement](memtracker) Improve readability (#15716) 2023-01-16 16:30:35 +08:00
5af7bcaa55 [Bug](decimalv3) Fix missing precision and scale in predicates (#15930) 2023-01-15 00:01:48 +08:00
c4475a8dbc [Enhencement](jdbc scanner) add profile for jdbc scanner (#15914) 2023-01-14 10:28:59 +08:00
98d69d1568 [fix](compile) fix vscan node compile error (#15805)
conflict merge of #15604 and #15618
2023-01-11 15:08:46 +08:00
3fec5ff0f5 [refactor](scan-pool) move scan pool from env to scanner scheduler (#15604)
The origin scan pools are in exec_env.
But after enable new_load_scan_node by default, the scan pool in exec_env is no longer used.
All scan task will be submitted to the scan pool in scanner_scheduler.

BTW, reorganize the scan pool into 3 kinds:

local scan pool
For olap scan node

remote scan pool
For file scan node

limited scan pool
For query which set cpu resource limit or with small limit clause

TODO:
Use bthread to unify all IO task.

Some trivial issues:

fix bug that the memtable flush size printed in log is not right
Add RuntimeProfile param in VScanner
2023-01-11 09:38:42 +08:00
90a92f0643 [feature-wip](multi-catalog) add iceberg tvf to read snapshots (#15618)
Support new table value function `iceberg_meta("table" = "ctl.db.tbl", "query_type" = "snapshots")`
we can use the sql `select * from iceberg_meta("table" = "ctl.db.tbl", "query_type" = "snapshots")` to get snapshots info  of a table. The other iceberg metadata will be supported later when needed.

One of the usage:

Before we use following sql to time travel:
`select * from ice_table FOR TIME AS OF "2022-10-10 11:11:11"`;
`select * from ice_table FOR VERSION AS OF "snapshot_id"`;
we can use the snapshots metadata to get the `committed time` or `snapshot_id`, 
and then, we can use it as the time or version in time travel clause
2023-01-10 22:37:35 +08:00
f17d69e450 [feature](file cache)Import file cache for remote file reader (#15622)
The main purpose of this pr is to import `fileCache` for lakehouse reading remote files.
Use the local disk as the cache for reading remote file, so the next time this file is read,
the data can be obtained directly from the local disk.
In addition, this pr includes a few other minor changes

Import File Cache:
1. The imported `fileCache` is called `block_file_cache`, which uses lru replacement policy.
2. Implement a new FileRereader `CachedRemoteFilereader`, so that the logic of `file cache` is hidden under `CachedRemoteFilereader`.

Other changes:
1. Add a new interface `fs()` for `FileReader`.
2. `IOContext` adds some statistical information to count the situation of `FileCache`

Co-authored-by: Lightman <31928846+Lchangliang@users.noreply.github.com>
2023-01-10 12:23:56 +08:00
9e3a61989b [refactor](es) remove BE generated dsl for es query #15751
remove fe config enable_new_es_dsl and all related code.
Now the DSL for es is always generated on FE side.
2023-01-10 08:40:32 +08:00
Pxl
1514b5ab5c [Feature](Materialized-View) support advanced Materialized-View (#15212) 2023-01-09 09:53:11 +08:00
c57fa7c930 [Pipeline] Fix PipScannerContext::can_finish return wrong status (#15259)
Now in ScannerContext::push_back_scanner_and_reschedule, _num_running_scanners-- is before _num_scheduling_ctx++.
InPipScannerContext::can_finish, we check _num_running_scanners == 0 && _num_scheduling_ctx == 0 without obtaining _transfer_lock.
In follow case, PipScannerContext::can_finish will return wrong result.

_num_running_scanners--
Check _num_running_scanners == 0 && _num_scheduling_ctx == 0` return true.
_num_scheduling_ctx++
So, we can set _num_running_scanners-- in the last of this func.

Describe your changes.

PipScannerContext::get_block_from_queue not block.
Set _num_running_scanners-- in the last of ScannerContext::push_back_scanner_and_reschedule.
2023-01-09 08:46:58 +08:00
707eab9a63 [opt](multi-catalog) cache and reuse position delete rows in iceberg v2 (#15670)
A deleted file may belong to multiple data files. Each data file will read a full amount of deleted files,
so a deleted file may be read repeatedly. The deleted files can be cached, and multiple data files
can reuse the first read content.

The performance is improved by 60% in the case of single thread, and by 30% in the case of multithreading.
2023-01-07 22:29:11 +08:00
9d1f02c580 [Improvement](topn) runtime prune for topn query (#15558) 2023-01-05 20:10:12 +08:00
4075e3aec6 [fix](csv-reader) fix new csv reader's performance issue (#15581) 2023-01-04 18:25:08 +08:00
c42c61dcad [fix](bitmapfilter) fix bitmap filter not pushing down (#15532) 2023-01-04 14:33:53 +08:00
Pxl
85fe9d2496 [Bug](filter) fix not in(null) return true (#15466)
fix not in(null) return true
2023-01-03 21:14:50 +08:00
edecc2e706 [feature-wip](inverted index) API for inverted index reader and syntax for fulltext match (#14211)
* [feature-wip](inverted index)inverted index api: reader

* [feature-wip](inverted index) Fulltext query syntax with MATCH/MATCH_ALL/MATCH_ALL

* [feature-wip](inverted index) Adapt to index meta

* [enhance] add more metrics

* [enhance] add fulltext match query check for column type and index parser

* [feature-wip](inverted index) Support apply inverted index in compound predicate which except leaf node of and node
2022-12-30 21:48:14 +08:00
2c8de30cce [optimize](multi-catalog) use dictionary encode&filter to process delete files (#15441)
**Optimize**
PR #14470 has used `Expr` to filter delete rows to match current data file,
but the rows in the delete file are [sorted by file_path then position](https://iceberg.apache.org/spec/#position-delete-files)
to optimize filtering rows while scanning, so this PR remove `Expr` and use binary search to filter delete rows.

In addition, delete files are likely to be encoded in dictionary, it's time-consuming to decode `file_path`
columns into `ColumnString`, so this PR use `ColumnDictionary` to read `file_path` column.

After testing, the performance of iceberg v2's MOR is improved by 30%+.

**Fix Bug**
Lazy-read-block may not have the filter column, if the whole group is filtered by `Expr`
and the batch_eof is generated from next batch.
2022-12-30 08:57:55 +08:00
85c7c531f1 [vectorized](jdbc) support array type in jdbc external table (#15303) 2022-12-30 00:29:08 +08:00
987970e8e3 [fix](multi catalog)Set column defualt value for query. (#15415)
Current column default value is used only for load task. But in the case of Iceberg schema change,
query task is also possible to read the default value for columns not exist in old schema.
This pr is to support default value for query task.

Manually tested the broker load and external emr regression cases.
2022-12-29 12:03:17 +08:00
3146fc8189 [bug](jdbc) fix jdbc external table with char type length error (#15386)
Now have test pg and oracle with char(100), if data='abc'
but read string data length is 100, so need trim extral spaces
2022-12-29 11:19:03 +08:00
305dd15fea [improvement](index) Support bitmap index can be applied with compound predicate when enable vectorized engine query (#13035)
Current bitmap index only can apply pushed down predicates which in AND conditions. When predicates in OR conditions and other complex compound conditions, it will not be pushed down to the storage layer, this leads to read more data.

Based on that situation, this pr will do:

1. this pr in order to support bitmap index apply compound predicates, query sql like:
select * from tb where a > 'hello' or b < 100;
select * from tb where a > 'hello' or b < 100 or c > 'ok';
select * from tb where (a > 'hello' or b <100) and (a < 'world' or b > 200);
select * from tb where (not a> 'hello') or b < 100;
...
above sql,column a and b and c has created bitmap_index.

2. this optimization can reduce reading data by index
3. set config enable_index_apply_compound_predicates to use this optimization
2022-12-28 20:08:57 +08:00
0e651365ca [profile](scanner) add per scanner running time profile (#15321)
* [profile](scanner) add per scanner running time profile


Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-12-26 08:55:07 +08:00
e72404c537 [fix](scan) fix that be may core dump when the predicates are all false (#15332) 2022-12-24 15:27:43 +08:00
e336178ef8 [Fix](multi catalog)Fix VFileScanner file not found status bug. #15226
The if condition to check NOT FOUND status for VFileScanner is incorrect, fix it.
2022-12-23 16:45:54 +08:00
8a810cd554 [fix](bitmapfilter) fix core dump caused by bitmap filter (#15296)
Do not push down the bitmap filter to a non-integer column
2022-12-23 16:42:45 +08:00
b085ff49f0 [refactor](non-vec) delete non-vec data sink (#15283)
* [refactor](non-vec) delete non-vec data sink

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-12-23 14:10:47 +08:00
e331e0420b [improvement](topn)add per scanner limit check for new scanner (#15231)
Optimize for key topn query like `SELECT * FROM store_sales ORDER BY ss_sold_date_sk, ss_sold_time_sk LIMIT 100` 
(ss_sold_date_sk, ss_sold_time_sk is prefix of table sort key). 

Check per scanner limit and set eof true to reduce the data need to be read.
2022-12-22 22:39:31 +08:00
e9a201e0ec [refactor](non-vec) delete some non-vec exec node (#15239)
* [refactor](non-vec) delete some non-vec exec node
2022-12-22 14:05:51 +08:00
8ecf69b09b [pipeline](regression) nested loop join test get error result in pipeline engine and refactor the code for need more input data (#15208) 2022-12-21 19:03:51 +08:00
a447121fc3 [fix](scanner scheduler) fix coredump of ScannerScheduler::_scanner_scan (#15199)
* [fix](scanner scheduler) fix coredump of ScannerScheduler::_scanner_scan

* fix
2022-12-21 15:44:47 +08:00
2445ac9520 [Bug](runtimefilter) Fix BE crash due to init failure (#15228) 2022-12-21 15:36:22 +08:00
7730a88d11 [fix](multi-catalog) add support for orc binary type (#15141)
Fix three bugs:
1. DataTypeFactory::create_data_type is missing the conversion of binary type, and OrcReader will failed
2. ScalarType#createType is missing the conversion of binary type, and ExternalFileTableValuedFunction will failed
3. fmt::format can't generate right format string, and will be failed
2022-12-19 14:24:12 +08:00
13bc8c2ef8 [Pipeline](runtime filter) Support runtime filters on pipeline engine (#15040) 2022-12-18 21:48:00 +08:00
728a238564 [vectorized](jdbc) fix external table of oracle with condition about … (#15092)
* [vectorized](jdbc) fix external table of oracle with condition about datetime report error

* formatter
2022-12-16 10:48:17 +08:00
0e1e5a802b [config](load) enable new load scan node by default (#14808)
Set FE `enable_new_load_scan_node` to true by default.
So that all load tasks(broker load, stream load, routine load, insert into) will use FileScanNode instead of BrokerScanNode
to read data

1. Support loading parquet file in stream load with new load scan node.
2. Fix bug that new parquet reader can not read column without logical or converted type.
3. Change jsonb parser function to "jsonb_parse_error_to_null"
    So that if the input string is not a valid json string, it will return null for jsonb column in load task.
2022-12-16 09:41:43 +08:00
e0d528980f [fix](multi catalog)Return emtpy block while external table scanner couldn't find the file (#14997)
FE file path cache for external table may out of date. In this case, BE may fail to find the not exist file from FE cache. 
This pr is to handle this case: instead of throw an error message to the user, we return empty result set to the user.
2022-12-16 09:36:35 +08:00
67e4292533 [fix](iceberg-v2) icebergv2 filter data path (#14470)
1. a icebergv2 delete file may cross many data paths, so the path of a file split is required as a predicate to filter rows of delete file
- create delete file structure to save predicate parameters
- create predicate for file path
2. add some log to print row range
3.  fix bug when create file metadata
2022-12-15 10:18:12 +08:00
Pxl
c25a7235f9 [Pipeline](load) support pipeline broker load (#14940)
support pipeline broker load
2022-12-13 00:28:36 +08:00
f3aea7f0f0 [Enhancement](status) Unify error code and enable customed err msg for BE internal errors (#14744) 2022-12-11 23:33:18 +08:00
wxy
af50461211 [fix](statistics) fix CpuTimeMS in audit log when enable_vectorized_engine=true. (#14853)
Co-authored-by: wangxiangyu@360shuke.com <wangxiangyu@360shuke.com>
2022-12-09 21:13:05 +08:00
fcea89bcf4 [fix](const_expr) fix coredump caused by unsupported cast const expr (#14825) 2022-12-06 10:31:15 +08:00