Commit Graph

1137 Commits

Author SHA1 Message Date
6e8eedc521 [refactor](remove unused code) remove storage buffer and orc reader (#16137)
remove olap storage byte buffer
remove orc reader
remove time operator
remove read_write_util
remove aggregate funcs
remove compress.h and cpp
remove bhp_lib

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-01-24 22:29:32 +08:00
79ad74637d [refactor](remove expr) remove non vectorized Expr and ExprContext related codes (#16136) 2023-01-24 10:45:35 +08:00
23edb3de5a [fix](icebergv2) fix bug that delete file reader is not opened (#16133)
PR #15836 changed the way the parquet reader is used: first open(), then init_reader().
But we forgot to call open() for the iceberg delete file, which caused a coredump.
2023-01-24 10:19:46 +08:00
a3cd0ddbdc [refactor](remove broker scan node) it is not useful any more (#16128)
remove broker scannode
remove broker table
remove broker scanner
remove json scanner
remove orc scanner
remove hive external table
remove hudi external table
remove broker external table; users can use the broker table value function instead
Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-01-23 19:37:38 +08:00
61fccc88d7 [vectorized](analytic) fix analytic node of window function get wrong… (#16074)
[Bug] rank() window function produces wrong ordering results #15951
2023-01-23 16:09:46 +08:00
199d7d3be8 [Refactor] Merged string_value into string_ref (#15925) 2023-01-22 16:39:23 +08:00
8920295534 [refactor](remove non vec code) remove non vectorized conjunctx from scanner (#16121)
1. remove arrow group filter
2. remove non vectorized conjunctx from scanner

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-01-21 19:23:17 +08:00
253445ca46 [vectorized](jdbc) fix jdbc executor for get result by batch and memo… (#15843)
1. Result set should be fetched by batch size.
2. Fix memory leak.
2023-01-21 08:22:22 +08:00
de12957057 [debug](ParquetReader) print file path if failed to read parquet file (#16118) 2023-01-21 08:05:17 +08:00
7814d2b651 [Fix](Oracle External Table) fix that oracle external table can not insert batch values (#16117)
Issue Number: close #xxx

This PR fixes two bugs:

1. `_jdbc_scanner` may be nullptr in vjdbc_connector.cpp, so we use another method to count jdbc statistics. Closes "[Enhencement](jdbc scanner) add profile for jdbc scanner" #15914.
2. In the batch insertion scenario, the Oracle database does not support the syntax `insert into table values (...),(...);`. What it supports is:
```
insert all
into table(col1,col2) values(c1v1, c2v1)
into table(col1,col2) values(c1v2, c2v2)
SELECT 1 FROM DUAL;
```
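A hedged sketch of how a connector could render that INSERT ALL form (illustrative only; the function name and the pre-escaped row strings are assumptions, not the actual Doris JDBC connector code):

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Builds Oracle's INSERT ALL form from row tuples, since Oracle rejects
// the multi-row `insert into t values (...),(...)` syntax.
// Assumes `rows` holds pre-escaped value lists like "c1v1, c2v1".
std::string build_oracle_insert_all(const std::string& table,
                                    const std::string& columns,
                                    const std::vector<std::string>& rows) {
    std::ostringstream sql;
    sql << "insert all\n";
    for (const auto& row : rows) {
        sql << "into " << table << "(" << columns << ") values(" << row << ")\n";
    }
    sql << "SELECT 1 FROM DUAL";
    return sql.str();
}

int main() {
    std::cout << build_oracle_insert_all(
            "table", "col1,col2", {"c1v1, c2v1", "c1v2, c2v2"});
}
```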
2023-01-21 07:57:12 +08:00
9ffd109b35 [fix](datetimev2) Fix BE datetimev2 type returning wrong result (#15885) 2023-01-20 22:25:20 +08:00
171404228f [improvement](vertical compaction) cache segment in vertical compaction (#16101)
1. In vertical compaction, segments are loaded for every column group, so we cache the segment ptr to avoid repeated IO.
2. Fix a vertical compaction data size bug.
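A minimal sketch of the caching idea under assumed names (`SegmentCache` and `get_or_load` are hypothetical; the real types live in the Doris BE):

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

// Hypothetical stand-ins for the real rowset/segment types.
struct Segment {};
using SegmentSharedPtr = std::shared_ptr<Segment>;

class SegmentCache {
public:
    // Returns the cached segments for a rowset, loading them only once.
    // Subsequent column groups reuse the same shared pointers, so each
    // segment file is opened a single time instead of once per group.
    const std::vector<SegmentSharedPtr>& get_or_load(int64_t rowset_id) {
        auto it = _cache.find(rowset_id);
        if (it == _cache.end()) {
            it = _cache.emplace(rowset_id, _load_segments(rowset_id)).first;
        }
        return it->second;
    }

private:
    std::vector<SegmentSharedPtr> _load_segments(int64_t /*rowset_id*/) {
        // Placeholder for the real segment-loading IO.
        return {std::make_shared<Segment>()};
    }

    std::map<int64_t, std::vector<SegmentSharedPtr>> _cache;
};
```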
2023-01-20 16:38:23 +08:00
1638936e3f [fix](oracle catalog) oracle catalog supports the TIMESTAMP date type of Oracle (#16113)
Oracle's `TIMESTAMP` date type maps to Doris's `DateTime` type.
2023-01-20 14:47:58 +08:00
116e17428b [Enhancement](point query optimize) improve performance of point queries on primary keys (#15491)
1. support row format using the jsonb codec
2. short path optimization for point queries
3. support prepared statements for point queries
4. support mysql binary format
2023-01-20 13:33:01 +08:00
3ebc98228d [feature wip](multi catalog)Support iceberg schema evolution. (#15836)
Support iceberg schema evolution for the parquet file format.
Iceberg uses a unique ID for each column to support schema evolution.
To support this feature in Doris, the FE side needs to get the current column ID for each column and send the IDs to the BE side.
The BE reads the column IDs from the parquet key_value_metadata, sets the changed column names in the Block to match the names in the parquet file before reading data, and sets the names back after reading.
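A simplified sketch of the rename-by-column-id step, with hypothetical types standing in for the real Block and parquet metadata:

```cpp
#include <map>
#include <string>
#include <vector>

struct BlockColumn {
    std::string name;
};

// file_name_by_id: column id -> name recorded in the parquet file's
// key_value_metadata. current_id_by_name: current schema name -> column id
// sent from the FE. Rename block columns to the file's names before reading,
// remembering the originals so they can be restored afterwards.
std::vector<std::string> rename_for_read(
        std::vector<BlockColumn>& block,
        const std::map<std::string, int>& current_id_by_name,
        const std::map<int, std::string>& file_name_by_id) {
    std::vector<std::string> original_names;
    for (auto& col : block) {
        original_names.push_back(col.name);
        auto id_it = current_id_by_name.find(col.name);
        if (id_it == current_id_by_name.end()) continue;
        auto file_it = file_name_by_id.find(id_it->second);
        if (file_it != file_name_by_id.end()) {
            col.name = file_it->second; // match the name inside the file
        }
    }
    return original_names; // use these to set the names back after reading
}
```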
2023-01-20 12:57:36 +08:00
6e090e4daf [Bug](predicate) fix date predicate (#16053) 2023-01-19 14:14:48 +08:00
0b5e71d3b4 [refactor](refactor field) remove unused method (#16068) 2023-01-19 10:16:09 +08:00
3894de49d2 [Enhancement](topn) support two phase read for topn query (#15642)
This PR optimizes TopN queries like `SELECT * FROM tableX ORDER BY columnA ASC/DESC LIMIT N`.

TopN is composed of a SortNode and a ScanNode. When the user table is wide (100+ columns), the order by clause covers just a few columns, but the ScanNode still has to scan all columns from the storage engine even if the limit is very small. This can lead to lots of read amplification. So in this PR I divide the TopN query into two phases:
1. In the first phase we read only `columnA`'s data from the storage engine, along with an extra RowId column called `__DORIS_ROWID_COL__`. The other columns are pruned from the ScanNode.
2. The second phase is placed in the ExchangeNode because it is the central node for the topn nodes in the cluster. The ExchangeNode spawns RPCs to the other nodes using the RowIds (sorted and limited by the SortNode) read in the first phase, and fetches the rows from the storage engine row by row.

After the second-phase read, the Block contains all the data needed for the query.
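A toy sketch of the two-phase idea against an in-memory "storage engine" (purely illustrative; the real implementation spans ScanNode, SortNode, ExchangeNode and RPCs):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Row { int64_t column_a; /* ...100+ other columns... */ };

// Phase 1: read only the sort key plus a row id (__DORIS_ROWID_COL__),
// then sort and apply the limit.
std::vector<size_t> phase1_topn_rowids(const std::vector<Row>& table, size_t n) {
    std::vector<size_t> rowids(table.size());
    for (size_t i = 0; i < table.size(); ++i) rowids[i] = i;
    std::partial_sort(rowids.begin(),
                      rowids.begin() + std::min(n, rowids.size()), rowids.end(),
                      [&](size_t a, size_t b) {
                          return table[a].column_a < table[b].column_a;
                      });
    rowids.resize(std::min(n, rowids.size()));
    return rowids;
}

// Phase 2: fetch the full rows for just the surviving row ids, avoiding
// the read amplification of scanning every column up front.
std::vector<Row> phase2_fetch(const std::vector<Row>& table,
                              const std::vector<size_t>& rowids) {
    std::vector<Row> out;
    out.reserve(rowids.size());
    for (size_t id : rowids) out.push_back(table[id]);
    return out;
}
```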
2023-01-19 10:01:33 +08:00
d5a3e8df3a [Exec](opt) Opt the vexplode_split function performance (#15945) 2023-01-17 19:02:57 +08:00
151ae71761 [fix](be)fix bug of VSetOperationNode::release_resource (#15997)
The child class should call `ExecNode::release_resource(state)` if it overrides the parent's method.
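The pattern the fix enforces, sketched with simplified signatures (the real classes carry many more members):

```cpp
struct RuntimeState {};

struct ExecNode {
    virtual ~ExecNode() = default;
    virtual void release_resource(RuntimeState* state) {
        // base-class cleanup shared by all exec nodes
    }
};

struct VSetOperationNode : ExecNode {
    void release_resource(RuntimeState* state) override {
        // ... release this node's own resources first ...
        // An overriding child class must still invoke the parent's method,
        // otherwise the base-class resources leak:
        ExecNode::release_resource(state);
    }
};
```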
2023-01-17 16:16:25 +08:00
d062ca2944 [refactor](vectorized) remove unnecessary vectorization check (#15984) 2023-01-17 12:21:46 +08:00
7d34512501 [Bug](pipeline) Fix DCHECK failure (#15928) 2023-01-17 12:01:20 +08:00
9f106161a7 [Bug](join) Fix null aware anti join error in fuzzy mode (#15987) 2023-01-17 11:32:16 +08:00
b1caa68706 [Feature-WIP](inverted index) inverted index reader's implementation, and add mysql_fulltext regression case to test fulltext query (#15823)
Issue Number: Step 2 of DSIP-023: Add inverted index for full text search
Implementation of the inverted index reader.

Dependency PRs: #14211 #15807 #15821
2023-01-17 09:13:56 +08:00
0057243f54 [improvement](reader) use union merge when rowsets are non-overlapping (#15749) 2023-01-16 21:53:18 +08:00
bdec4d5ac2 [enhancement](profile) add read columns to scanner profile (#15902) 2023-01-16 19:32:46 +08:00
97fcad76f8 [enhancement](memtracker) Improve readability (#15716) 2023-01-16 16:30:35 +08:00
63d48564ed [fix](datetimev2) fix datetimev2 error with T (#15915)
Signed-off-by: nextdreamblue <zxw520blue1@163.com>
2023-01-16 15:30:48 +08:00
81bab55d43 [Bug](function) catch function calculation error on aggregate node to avoid core dump (#15903) 2023-01-16 11:21:28 +08:00
b727033906 [Chore](build) enable -Wextra and remove some -Wno (#15760)
2023-01-15 10:40:35 +08:00
5af7bcaa55 [Bug](decimalv3) Fix missing precision and scale in predicates (#15930) 2023-01-15 00:01:48 +08:00
c4475a8dbc [Enhencement](jdbc scanner) add profile for jdbc scanner (#15914) 2023-01-14 10:28:59 +08:00
049f8ad2f9 [Bug](sort) fix merge sorter might divide by zero when block bytes are fewer than block rows (#15859)
If a block's byte count is smaller than its row count, integer division makes `avg_size_per_row` zero, which ends up dividing by zero in the following logic.
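A minimal sketch of the failure mode and an obvious guard (names are illustrative):

```cpp
#include <algorithm>
#include <cstddef>

// With integer division, block_bytes < block_rows makes the average zero,
// and a later `x / avg_size_per_row` divides by zero. Clamping to 1 avoids it.
size_t safe_avg_size_per_row(size_t block_bytes, size_t block_rows) {
    if (block_rows == 0) return 1;
    return std::max<size_t>(1, block_bytes / block_rows);
}
```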
2023-01-13 18:33:40 +08:00
34bb9cd5d3 [fix](parquet-reader) fix coredump when loading datetime data into doris from parquet (#15794)
`date_time_v2` checks the scale when a datetimev2 value is constructed:
```
LOG(FATAL) << fmt::format("Scale {} is out of bounds", scale);
```

This [PR](https://github.com/apache/doris/pull/15510) fixed the issue, but the parquet reader does not use this constructor to create `TypeDescriptor`, leaving `scale = -1` when reading datetimev2 data.
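A hedged sketch of the kind of normalization involved (field and function names are assumptions; the assumed default of 6 is datetimev2's maximum scale, not necessarily what the PR picks):

```cpp
struct TypeDescriptor {
    int scale = -1; // -1 is the "unset" sentinel
};

// datetimev2 requires 0 <= scale <= 6; a TypeDescriptor assembled field by
// field (as the parquet reader does) must normalize the sentinel instead of
// passing -1 into the datetimev2 constructor, which FATALs on a bad scale.
int normalized_datetimev2_scale(const TypeDescriptor& type) {
    if (type.scale < 0 || type.scale > 6) return 6; // assumed default
    return type.scale;
}
```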
2023-01-13 11:51:11 +08:00
9468711f9f [Bug](join) fix null aware left anti join returning incorrect results (#15841) 2023-01-13 10:18:05 +08:00
688a0bb96a [feature](multi-catalog) support clickhouse jdbc catalog (#15780) 2023-01-13 10:07:22 +08:00
bae29157aa [fix](olap) dictionary cannot be sorted after inserting some null values (#15829) 2023-01-13 09:28:55 +08:00
730571e386 [fix](sort spill) fix failure to create spill file (#15864)
Also increase buffered block size when it has started to spill.
2023-01-13 09:23:26 +08:00
174e5e601f [refactor](rpc fn) decouple vectorized remote function from row-based one (#15871) 2023-01-13 09:21:33 +08:00
0fbdf8e3e1 [Refactor](table function) Decouple vectorized table functions from non-vectorized ones (#15772) 2023-01-12 15:08:21 +08:00
7441b4dc96 [Feature](function) Support width_bucket function (#14396) 2023-01-12 13:59:21 +08:00
f3ef3f7e15 [fix](sink) fix memory leak in VNodeChannel (#15834) (#15835)
Signed-off-by: freemandealer <freeman.zhang1992@gmail.com>
2023-01-12 09:24:51 +08:00
98d69d1568 [fix](compile) fix vscan node compile error (#15805)
merge conflict between #15604 and #15618
2023-01-11 15:08:46 +08:00
3fec5ff0f5 [refactor](scan-pool) move scan pool from env to scanner scheduler (#15604)
The original scan pools were in exec_env.
But after enabling new_load_scan_node by default, the scan pool in exec_env is no longer used.
All scan tasks are now submitted to the scan pool in scanner_scheduler.

BTW, the scan pools are reorganized into 3 kinds:

local scan pool: for the olap scan node
remote scan pool: for the file scan node
limited scan pool: for queries that set a cpu resource limit or have a small limit clause

TODO:
Use bthread to unify all IO tasks.

Some trivial issues:

fix a bug where the memtable flush size printed in the log was wrong
add a RuntimeProfile param to VScanner
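A rough sketch of the three-pool dispatch described above (names approximate the description, not the actual scanner_scheduler code):

```cpp
struct ThreadPool {};

struct ScannerScheduler {
    ThreadPool local_scan_pool;   // olap scan node
    ThreadPool remote_scan_pool;  // file scan node
    ThreadPool limited_scan_pool; // cpu-limited or small-limit queries

    ThreadPool* pick_pool(bool is_remote, bool cpu_limited, bool small_limit) {
        // Queries with a cpu resource limit or a small limit clause go to a
        // dedicated pool so they cannot starve regular scans (and vice versa).
        if (cpu_limited || small_limit) return &limited_scan_pool;
        return is_remote ? &remote_scan_pool : &local_scan_pool;
    }
};
```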
2023-01-11 09:38:42 +08:00
d857b4af1b [refactor](remove row batch) remove impala rowbatch structure (#15767)

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-01-11 09:37:35 +08:00
8f31a36429 [feature] support spill to disk for sort node (#15624) 2023-01-11 08:40:58 +08:00
4bbc93b7ce [refactor](hashtable) simplify template args of partitioned hash table (#15736) 2023-01-11 08:39:13 +08:00
90a92f0643 [feature-wip](multi-catalog) add iceberg tvf to read snapshots (#15618)
Support new table value function `iceberg_meta("table" = "ctl.db.tbl", "query_type" = "snapshots")`
We can use the SQL `select * from iceberg_meta("table" = "ctl.db.tbl", "query_type" = "snapshots")` to get the snapshot info of a table. Other iceberg metadata will be supported later as needed.

One usage example:

Previously, we used the following SQL to time travel:
`select * from ice_table FOR TIME AS OF "2022-10-10 11:11:11"`;
`select * from ice_table FOR VERSION AS OF "snapshot_id"`;
Now we can use the snapshots metadata to get the `committed time` or `snapshot_id`,
and then use it as the time or version in the time travel clause.
2023-01-10 22:37:35 +08:00
c3da5a687a [fix] fixed dangerous usage of namespace std (#15741)
Co-authored-by: zhaochangle <zhaochangle@selectdb.com>
2023-01-10 16:10:49 +08:00
f17d69e450 [feature](file cache)Import file cache for remote file reader (#15622)
The main purpose of this PR is to introduce a `fileCache` for lakehouse reads of remote files.
The local disk is used as a cache for remote file data, so the next time the file is read,
the data can be fetched directly from the local disk.
In addition, this PR includes a few other minor changes.

Import File Cache:
1. The imported `fileCache` is called `block_file_cache`, and it uses an LRU replacement policy.
2. Implement a new FileReader, `CachedRemoteFileReader`, so that the `file cache` logic is hidden inside `CachedRemoteFileReader`.

Other changes:
1. Add a new interface `fs()` to `FileReader`.
2. `IOContext` adds some statistics to track `FileCache` usage.
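A bare-bones sketch of the `CachedRemoteFileReader` idea (hypothetical interfaces; the real block_file_cache persists blocks on local disk, aligns reads to blocks, applies LRU eviction, and records statistics):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

struct FileReader {
    virtual ~FileReader() = default;
    virtual size_t read_at(int64_t offset, char* buf, size_t len) = 0;
};

// Wraps a remote reader; each read is fetched once and then served from a
// local cache (simplified here to in-memory buffers keyed by offset rather
// than fixed-size on-disk blocks).
class CachedRemoteFileReader : public FileReader {
public:
    explicit CachedRemoteFileReader(FileReader* remote) : _remote(remote) {}

    size_t read_at(int64_t offset, char* buf, size_t len) override {
        auto& block = _cache[offset];
        if (block.size() < len) {
            block.resize(len);
            _remote->read_at(offset, block.data(), len); // cache miss: go remote
        }
        std::copy(block.begin(), block.begin() + len, buf); // cache hit path
        return len;
    }

private:
    FileReader* _remote;
    std::map<int64_t, std::vector<char>> _cache;
};
```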

Co-authored-by: Lightman <31928846+Lchangliang@users.noreply.github.com>
2023-01-10 12:23:56 +08:00