Commit Graph

157 Commits

Author SHA1 Message Date
09b7c22f6b [Opt](exec) remove unless null key when no split in convert key range (#16624) 2023-02-11 15:44:35 +08:00
aba843bb2b [Improvement](inverted index) inverted index query match bitmap cache (#16578)
Add cache for inverted index query match bitmap to accelerate common query keyword, especially for keyword matching many rows. 

Tests result:
- large result: matching 99% out of 247 million rows shows 8x speed up.
- small result: matching 0.1% out of 247 million rows shows 2x speed up.
2023-02-11 13:38:58 +08:00
37d1519316 [WIP](dynamic-table) support dynamic schema table (#16335)
Issue Number: close #16351

Dynamic schema table is a special type of table, it's schema change with loading procedure.Now we implemented this feature mainly for semi-structure data such as JSON, since JSON is schema self-described we could extract schema info from the original documents and inference the final type infomation.This speical table could reduce manual schema change operation and easily import semi-structure data and extends it's schema automatically.
2023-02-11 13:37:50 +08:00
43eca4f209 [Feature-WIP](inverted index) Implementation for alter inverted index. (#16371)
implementation for add/drop inverted index.
2023-02-10 17:56:17 +08:00
379bef598d [fix-core](block) clear block row_same_bit when block reuse (#16172) 2023-02-10 12:21:27 +08:00
646ba2cc88 [bugfix](scannode) 1. make rows_read correct 2. use single scanner if has limit clause (#16473)
make rows_read correct so that the scheduler could using this correctly.
use single scanner if has limit clause. Move it from fragment context to scannode.
---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2023-02-09 14:12:18 +08:00
0142ef8b95 [improvement](scanner) Supports bthread scanner (#16031) 2023-02-09 10:24:56 +08:00
737c73dcf0 [Improvement](topn) order by key topn query optimization (#15663) 2023-02-06 15:36:05 +08:00
b1b2697cc7 [fix](iceberg) fix iceberg catalog (#16372)
1. Fix iceberg catalog access s3
2. Fix iceberg catalog partition table query
3. Fix persistence
2023-02-05 13:15:28 +08:00
Pxl
5e4bb98900 [Chore](build) enable -Wpedantic and update lowest gcc version to 11.1 (#16290)
enable -Wpedantic and update lowest gcc version to 11.1
2023-02-03 11:28:48 +08:00
7a800bd3c6 [fix](scan) coredump caused by null of _scanner_ctx (#16361) 2023-02-03 09:24:15 +08:00
cb6875b5a4 [improvement](multi-catalog) use date/datetimev2 as default col type for catalog table (#16304)
1. When mapping column from external datasource, use date/datetimev2 as default type
2. check `is_cancelled` when read data, to avoid endless loop after query is cancelled
2023-02-02 17:35:48 +08:00
bb179b77f7 [Feature-WIP](inverted index) support array type for inverted index reader (#16355) 2023-02-02 16:14:14 +08:00
bb0d4ba787 [BugFix](sort) use correct agg function when using 2 phase sort for agg table (#16185) 2023-02-01 20:07:43 +08:00
d224624bbe [improvement](session variable)Add enable_file_cache session variable (#16268)
Add enable_file_cache session variable, so that we can close file cache without restart BE.
2023-02-01 18:15:03 +08:00
Pxl
ca73c60442 [Chore](build) enable ignored-qualifiers check (#16196)
enable ignored-qualifiers check
2023-02-01 15:15:59 +08:00
90b12143a3 [refactor](remove unused code) remove runtime tuple structure and useless utils class (#16237) 2023-01-30 16:45:14 +08:00
4b6a4b3cf7 [refactor](remove unused code) Remove unused mempool declare or function params (#16222)
* Remove unused mempool declare or function params

---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-01-30 13:03:18 +08:00
07d58e531a [improvement](filecache) add profile for file cache (#16223) 2023-01-30 10:46:31 +08:00
6e8eedc521 [refactor](remove unused code) remove storage buffer and orc reader (#16137)
remove olap storage byte buffer
remove orc reader
remove time operator
remove read_write_util
remove aggregate funcs
remove compress.h and cpp
remove bhp_lib

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-01-24 22:29:32 +08:00
79ad74637d [refactor](remove expr) remove non vectorized Expr and ExprContext related codes (#16136) 2023-01-24 10:45:35 +08:00
199d7d3be8 [Refactor]Merged string_value into string_ref (#15925) 2023-01-22 16:39:23 +08:00
8920295534 [refactor](remoe non vec code) remove non vectorized conjunctx from scanner (#16121)
1. remove arrow group filter
2. remove non vectorized conjunctx from scanner

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-01-21 19:23:17 +08:00
7814d2b651 [Fix](Oracle External Table) fix that oracle external table can not insert batch values (#16117)
Issue Number: close #xxx

This pr fix two bugs:

_jdbc_scanner may be nullptr in vjdbc_connector.cpp, so we use another method to count jdbc statistic. close [Enhencement](jdbc scanner) add profile for jdbc scanner #15914
In the batch insertion scenario, oracle database does not support syntax insert into tables values (...),(...); , what it supports is:
insert all
into table(col1,col2) values(c1v1, c2v1)
into table(col1,col2) values(c1v2, c2v2)
SELECT 1 FROM DUAL;
2023-01-21 07:57:12 +08:00
3ebc98228d [feature wip](multi catalog)Support iceberg schema evolution. (#15836)
Support iceberg schema evolution for parquet file format.
Iceberg use unique id for each column to support schema evolution.
To support this feature in Doris, FE side need to get the current column id for each column and send the ids to be side.
Be read column id from parquet key_value_metadata, set the changed column name in Block to match the name in parquet file before reading data. And set the name back after reading data.
2023-01-20 12:57:36 +08:00
6e090e4daf [Bug](predicate) fix date predicate (#16053) 2023-01-19 14:14:48 +08:00
3894de49d2 [Enhancement](topn) support two phase read for topn query (#15642)
This PR optimize topn query like `SELECT * FROM tableX ORDER BY columnA ASC/DESC LIMIT N`.

TopN is is compose of SortNode and ScanNode, when user table is wide like 100+ columns the order by clause is just a few columns.But ScanNode need to scan all data from storage engine even if the limit is very small.This may lead to lots of read amplification.So In this PR I devide TopN query into two phase:
1. The first phase we just need to read `columnA`'s data from storage engine along with an extra RowId column called `__DORIS_ROWID_COL__`.The other columns are pruned from ScanNode.
2. The second phase I put it in the ExchangeNode beacuase it's the central node for topn nodes in the cluster.The ExchangeNode will spawn a RPC to other nodes using the RowIds(sorted and limited from SortNode) read from the first phase and read row by row from storage engine.

After the second phase read, Block will contain all the data needed for the query
2023-01-19 10:01:33 +08:00
7d34512501 [Bug](pipeline) Fix DCHECK failure (#15928) 2023-01-17 12:01:20 +08:00
b1caa68706 [Feature-WIP](inverted index) inverted index reader's implementation, and add mysql_fulltext regression case to test fulltext query (#15823)
Issue Number: Step2 of DSIP-023: Add inverted index for full text search
implementation of inverted index reader

dependency pr: #14211 #15807 #15821
2023-01-17 09:13:56 +08:00
bdec4d5ac2 [enhancement](profile) add read columns to scanner profile (#15902) 2023-01-16 19:32:46 +08:00
97fcad76f8 [enhancement](memtracker) Improve readability (#15716) 2023-01-16 16:30:35 +08:00
5af7bcaa55 [Bug](decimalv3) Fix missing precision and scale in predicates (#15930) 2023-01-15 00:01:48 +08:00
c4475a8dbc [Enhencement](jdbc scanner) add profile for jdbc scanner (#15914) 2023-01-14 10:28:59 +08:00
98d69d1568 [fix](compile) fix vscan node compile error (#15805)
conflict merge of #15604 and #15618
2023-01-11 15:08:46 +08:00
3fec5ff0f5 [refactor](scan-pool) move scan pool from env to scanner scheduler (#15604)
The origin scan pools are in exec_env.
But after enable new_load_scan_node by default, the scan pool in exec_env is no longer used.
All scan task will be submitted to the scan pool in scanner_scheduler.

BTW, reorganize the scan pool into 3 kinds:

local scan pool
For olap scan node

remote scan pool
For file scan node

limited scan pool
For query which set cpu resource limit or with small limit clause

TODO:
Use bthread to unify all IO task.

Some trivial issues:

fix bug that the memtable flush size printed in log is not right
Add RuntimeProfile param in VScanner
2023-01-11 09:38:42 +08:00
90a92f0643 [feature-wip](multi-catalog) add iceberg tvf to read snapshots (#15618)
Support new table value function `iceberg_meta("table" = "ctl.db.tbl", "query_type" = "snapshots")`
we can use the sql `select * from iceberg_meta("table" = "ctl.db.tbl", "query_type" = "snapshots")` to get snapshots info  of a table. The other iceberg metadata will be supported later when needed.

One of the usage:

Before we use following sql to time travel:
`select * from ice_table FOR TIME AS OF "2022-10-10 11:11:11"`;
`select * from ice_table FOR VERSION AS OF "snapshot_id"`;
we can use the snapshots metadata to get the `committed time` or `snapshot_id`, 
and then, we can use it as the time or version in time travel clause
2023-01-10 22:37:35 +08:00
f17d69e450 [feature](file cache)Import file cache for remote file reader (#15622)
The main purpose of this pr is to import `fileCache` for lakehouse reading remote files.
Use the local disk as the cache for reading remote file, so the next time this file is read,
the data can be obtained directly from the local disk.
In addition, this pr includes a few other minor changes

Import File Cache:
1. The imported `fileCache` is called `block_file_cache`, which uses lru replacement policy.
2. Implement a new FileRereader `CachedRemoteFilereader`, so that the logic of `file cache` is hidden under `CachedRemoteFilereader`.

Other changes:
1. Add a new interface `fs()` for `FileReader`.
2. `IOContext` adds some statistical information to count the situation of `FileCache`

Co-authored-by: Lightman <31928846+Lchangliang@users.noreply.github.com>
2023-01-10 12:23:56 +08:00
9e3a61989b [refactor](es) remove BE generated dsl for es query #15751
remove fe config enable_new_es_dsl and all related code.
Now the DSL for es is always generated on FE side.
2023-01-10 08:40:32 +08:00
Pxl
1514b5ab5c [Feature](Materialized-View) support advanced Materialized-View (#15212) 2023-01-09 09:53:11 +08:00
c57fa7c930 [Pipeline] Fix PipScannerContext::can_finish return wrong status (#15259)
Now in ScannerContext::push_back_scanner_and_reschedule, _num_running_scanners-- is before _num_scheduling_ctx++.
InPipScannerContext::can_finish, we check _num_running_scanners == 0 && _num_scheduling_ctx == 0 without obtaining _transfer_lock.
In follow case, PipScannerContext::can_finish will return wrong result.

_num_running_scanners--
Check _num_running_scanners == 0 && _num_scheduling_ctx == 0` return true.
_num_scheduling_ctx++
So, we can set _num_running_scanners-- in the last of this func.

Describe your changes.

PipScannerContext::get_block_from_queue not block.
Set _num_running_scanners-- in the last of ScannerContext::push_back_scanner_and_reschedule.
2023-01-09 08:46:58 +08:00
707eab9a63 [opt](multi-catalog) cache and reuse position delete rows in iceberg v2 (#15670)
A deleted file may belong to multiple data files. Each data file will read a full amount of deleted files,
so a deleted file may be read repeatedly. The deleted files can be cached, and multiple data files
can reuse the first read content.

The performance is improved by 60% in the case of single thread, and by 30% in the case of multithreading.
2023-01-07 22:29:11 +08:00
9d1f02c580 [Improvement](topn) runtime prune for topn query (#15558) 2023-01-05 20:10:12 +08:00
4075e3aec6 [fix](csv-reader) fix new csv reader's performance issue (#15581) 2023-01-04 18:25:08 +08:00
c42c61dcad [fix](bitmapfilter) fix bitmap filter not pushing down (#15532) 2023-01-04 14:33:53 +08:00
Pxl
85fe9d2496 [Bug](filter) fix not in(null) return true (#15466)
fix not in(null) return true
2023-01-03 21:14:50 +08:00
edecc2e706 [feature-wip](inverted index) API for inverted index reader and syntax for fulltext match (#14211)
* [feature-wip](inverted index)inverted index api: reader

* [feature-wip](inverted index) Fulltext query syntax with MATCH/MATCH_ALL/MATCH_ALL

* [feature-wip](inverted index) Adapt to index meta

* [enhance] add more metrics

* [enhance] add fulltext match query check for column type and index parser

* [feature-wip](inverted index) Support apply inverted index in compound predicate which except leaf node of and node
2022-12-30 21:48:14 +08:00
2c8de30cce [optimize](multi-catalog) use dictionary encode&filter to process delete files (#15441)
**Optimize**
PR #14470 has used `Expr` to filter delete rows to match current data file,
but the rows in the delete file are [sorted by file_path then position](https://iceberg.apache.org/spec/#position-delete-files)
to optimize filtering rows while scanning, so this PR remove `Expr` and use binary search to filter delete rows.

In addition, delete files are likely to be encoded in dictionary, it's time-consuming to decode `file_path`
columns into `ColumnString`, so this PR use `ColumnDictionary` to read `file_path` column.

After testing, the performance of iceberg v2's MOR is improved by 30%+.

**Fix Bug**
Lazy-read-block may not have the filter column, if the whole group is filtered by `Expr`
and the batch_eof is generated from next batch.
2022-12-30 08:57:55 +08:00
85c7c531f1 [vectorized](jdbc) support array type in jdbc external table (#15303) 2022-12-30 00:29:08 +08:00
987970e8e3 [fix](multi catalog)Set column defualt value for query. (#15415)
Current column default value is used only for load task. But in the case of Iceberg schema change,
query task is also possible to read the default value for columns not exist in old schema.
This pr is to support default value for query task.

Manually tested the broker load and external emr regression cases.
2022-12-29 12:03:17 +08:00
3146fc8189 [bug](jdbc) fix jdbc external table with char type length error (#15386)
Now have test pg and oracle with char(100), if data='abc'
but read string data length is 100, so need trim extral spaces
2022-12-29 11:19:03 +08:00