doris

Author	SHA1	Message	Date
Mingyu Chen	3fec5ff0f5	[refactor](scan-pool) move scan pool from env to scanner scheduler (#15604 ) The original scan pools are in exec_env. But after enabling new_load_scan_node by default, the scan pool in exec_env is no longer used. All scan tasks are submitted to the scan pool in scanner_scheduler. BTW, reorganize the scan pool into 3 kinds: 1. local scan pool, for olap scan node; 2. remote scan pool, for file scan node; 3. limited scan pool, for queries that set a cpu resource limit or have a small limit clause. TODO: Use bthread to unify all IO tasks. Some trivial issues: 1. fix bug that the memtable flush size printed in log is not right; 2. add RuntimeProfile param in VScanner.	2023-01-11 09:38:42 +08:00
slothever	90a92f0643	[feature-wip](multi-catalog) add iceberg tvf to read snapshots (#15618 ) Support a new table-valued function `iceberg_meta("table" = "ctl.db.tbl", "query_type" = "snapshots")`. We can use the SQL `select * from iceberg_meta("table" = "ctl.db.tbl", "query_type" = "snapshots")` to get the snapshot info of a table. Other iceberg metadata will be supported later when needed. One usage: time travel is done with the following SQL: `select * from ice_table FOR TIME AS OF "2022-10-10 11:11:11"`; `select * from ice_table FOR VERSION AS OF "snapshot_id"`; we can use the snapshots metadata to get the `committed time` or `snapshot_id`, and then use it as the time or version in the time travel clause.	2023-01-10 22:37:35 +08:00
Tiewei Fang	f17d69e450	[feature](file cache)Import `file cache` for remote file reader (#15622 ) The main purpose of this pr is to introduce `fileCache` for lakehouse reads of remote files. The local disk is used as a cache for remote files, so the next time a file is read, the data can be obtained directly from the local disk. In addition, this pr includes a few other minor changes. Import File Cache: 1. The imported `fileCache` is called `block_file_cache`, which uses an lru replacement policy. 2. Implement a new FileReader `CachedRemoteFilereader`, so that the logic of `file cache` is hidden under `CachedRemoteFilereader`. Other changes: 1. Add a new interface `fs()` for `FileReader`. 2. `IOContext` adds some statistics to track the behavior of `FileCache`. Co-authored-by: Lightman <31928846+Lchangliang@users.noreply.github.com>	2023-01-10 12:23:56 +08:00
Mingyu Chen	9e3a61989b	[refactor](es) remove BE generated dsl for es query #15751 remove fe config enable_new_es_dsl and all related code. Now the DSL for es is always generated on FE side.	2023-01-10 08:40:32 +08:00
Pxl	1514b5ab5c	[Feature](Materialized-View) support advanced Materialized-View (#15212 )	2023-01-09 09:53:11 +08:00
Lijia Liu	c57fa7c930	[Pipeline] Fix PipScannerContext::can_finish return wrong status (#15259 ) Now in ScannerContext::push_back_scanner_and_reschedule, _num_running_scanners-- is before _num_scheduling_ctx++. In PipScannerContext::can_finish, we check _num_running_scanners == 0 && _num_scheduling_ctx == 0 without obtaining _transfer_lock. In the following case, PipScannerContext::can_finish will return a wrong result: 1. _num_running_scanners--; 2. the check `_num_running_scanners == 0 && _num_scheduling_ctx == 0` returns true; 3. _num_scheduling_ctx++. So, we can move _num_running_scanners-- to the end of this function. Changes: PipScannerContext::get_block_from_queue does not block. Set _num_running_scanners-- at the end of ScannerContext::push_back_scanner_and_reschedule.	2023-01-09 08:46:58 +08:00
Ashin Gau	707eab9a63	[opt](multi-catalog) cache and reuse position delete rows in iceberg v2 (#15670 ) A delete file may belong to multiple data files. Each data file reads its delete files in full, so a delete file may be read repeatedly. The delete files can be cached, and multiple data files can reuse the content from the first read. The performance is improved by 60% in the single-threaded case, and by 30% in the multithreaded case.	2023-01-07 22:29:11 +08:00
Kang	9d1f02c580	[Improvement](topn) runtime prune for topn query (#15558 )	2023-01-05 20:10:12 +08:00
Mingyu Chen	4075e3aec6	[fix](csv-reader) fix new csv reader's performance issue (#15581 )	2023-01-04 18:25:08 +08:00
luozenglin	c42c61dcad	[fix](bitmapfilter) fix bitmap filter not pushing down (#15532 )	2023-01-04 14:33:53 +08:00
Pxl	85fe9d2496	[Bug](filter) fix not in(null) return true (#15466 ) fix not in(null) return true	2023-01-03 21:14:50 +08:00
YueW	edecc2e706	[feature-wip](inverted index) API for inverted index reader and syntax for fulltext match (#14211 ) * [feature-wip](inverted index)inverted index api: reader * [feature-wip](inverted index) Fulltext query syntax with MATCH/MATCH_ALL/MATCH_ALL * [feature-wip](inverted index) Adapt to index meta * [enhance] add more metrics * [enhance] add fulltext match query check for column type and index parser * [feature-wip](inverted index) Support apply inverted index in compound predicate which except leaf node of and node	2022-12-30 21:48:14 +08:00
Ashin Gau	2c8de30cce	[optimize](multi-catalog) use dictionary encode&filter to process delete files (#15441 ) Optimize: PR #14470 used `Expr` to filter delete rows to match the current data file, but the rows in the delete file are [sorted by file_path then position](https://iceberg.apache.org/spec/#position-delete-files) to optimize filtering rows while scanning, so this PR removes `Expr` and uses binary search to filter delete rows. In addition, delete files are likely to be dictionary-encoded, and it is time-consuming to decode `file_path` columns into `ColumnString`, so this PR uses `ColumnDictionary` to read the `file_path` column. After testing, the performance of iceberg v2's MOR is improved by 30%+. Fix Bug: Lazy-read-block may not have the filter column, if the whole group is filtered by `Expr` and the batch_eof is generated from the next batch.	2022-12-30 08:57:55 +08:00
zhangstar333	85c7c531f1	[vectorized](jdbc) support array type in jdbc external table (#15303 )	2022-12-30 00:29:08 +08:00
Jibing-Li	987970e8e3	[fix](multi catalog)Set column default value for query. (#15415 ) Currently the column default value is used only for load tasks. But in the case of Iceberg schema change, a query task may also need to read the default value for columns that do not exist in the old schema. This pr supports default values for query tasks. Manually tested the broker load and external emr regression cases.	2022-12-29 12:03:17 +08:00
zhangstar333	3146fc8189	[bug](jdbc) fix jdbc external table with char type length error (#15386 ) Tested pg and oracle with char(100): if data='abc', the string that is read has length 100, so the extra spaces need to be trimmed.	2022-12-29 11:19:03 +08:00
YueW	305dd15fea	[improvement](index) Support bitmap index can be applied with compound predicate when enable vectorized engine query (#13035 ) Currently the bitmap index can only be applied to pushed-down predicates in AND conditions. Predicates in OR conditions and other complex compound conditions are not pushed down to the storage layer, which leads to reading more data. Based on that situation, this pr will do: 1. support applying the bitmap index to compound predicates, for query sql like: select * from tb where a > 'hello' or b < 100; select * from tb where a > 'hello' or b < 100 or c > 'ok'; select * from tb where (a > 'hello' or b <100) and (a < 'world' or b > 200); select * from tb where (not a> 'hello') or b < 100; ... In the above sql, columns a, b and c have bitmap_index created. 2. this optimization can reduce the data read by using the index. 3. set config enable_index_apply_compound_predicates to use this optimization.	2022-12-28 20:08:57 +08:00
yiguolei	0e651365ca	[profile](scanner) add per scanner running time profile (#15321 ) * [profile](scanner) add per scanner running time profile Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-12-26 08:55:07 +08:00
Xin Liao	e72404c537	[fix](scan) fix that be may core dump when the predicates are all false (#15332 )	2022-12-24 15:27:43 +08:00
Jibing-Li	e336178ef8	[Fix](multi catalog)Fix VFileScanner file not found status bug. #15226 The if condition to check NOT FOUND status for VFileScanner is incorrect, fix it.	2022-12-23 16:45:54 +08:00
luozenglin	8a810cd554	[fix](bitmapfilter) fix core dump caused by bitmap filter (#15296 ) Do not push down the bitmap filter to a non-integer column	2022-12-23 16:42:45 +08:00
Gabriel	b085ff49f0	[refactor](non-vec) delete non-vec data sink (#15283 ) * [refactor](non-vec) delete non-vec data sink Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2022-12-23 14:10:47 +08:00
Kang	e331e0420b	[improvement](topn)add per scanner limit check for new scanner (#15231 ) Optimize for key topn query like `SELECT * FROM store_sales ORDER BY ss_sold_date_sk, ss_sold_time_sk LIMIT 100` (ss_sold_date_sk, ss_sold_time_sk is prefix of table sort key). Check per scanner limit and set eof true to reduce the data need to be read.	2022-12-22 22:39:31 +08:00
Gabriel	e9a201e0ec	[refactor](non-vec) delete some non-vec exec node (#15239 ) * [refactor](non-vec) delete some non-vec exec node	2022-12-22 14:05:51 +08:00
HappenLee	8ecf69b09b	[pipeline](regression) nested loop join test get error result in pipeline engine and refactor the code for need more input data (#15208 )	2022-12-21 19:03:51 +08:00
TengJianPing	a447121fc3	[fix](scanner scheduler) fix coredump of ScannerScheduler::_scanner_scan (#15199 ) * [fix](scanner scheduler) fix coredump of ScannerScheduler::_scanner_scan * fix	2022-12-21 15:44:47 +08:00
Gabriel	2445ac9520	[Bug](runtimefilter) Fix BE crash due to init failure (#15228 )	2022-12-21 15:36:22 +08:00
Ashin Gau	7730a88d11	[fix](multi-catalog) add support for orc binary type (#15141 ) Fix three bugs: 1. DataTypeFactory::create_data_type is missing the conversion of binary type, so OrcReader will fail 2. ScalarType#createType is missing the conversion of binary type, so ExternalFileTableValuedFunction will fail 3. fmt::format can't generate the right format string, and will fail	2022-12-19 14:24:12 +08:00
Gabriel	13bc8c2ef8	[Pipeline](runtime filter) Support runtime filters on pipeline engine (#15040 )	2022-12-18 21:48:00 +08:00
zhangstar333	728a238564	[vectorized](jdbc) fix external table of oracle with condition about … (#15092 ) * [vectorized](jdbc) fix external table of oracle with condition about datetime report error * formatter	2022-12-16 10:48:17 +08:00
Mingyu Chen	0e1e5a802b	[config](load) enable new load scan node by default (#14808 ) Set FE `enable_new_load_scan_node` to true by default. So that all load tasks(broker load, stream load, routine load, insert into) will use FileScanNode instead of BrokerScanNode to read data 1. Support loading parquet file in stream load with new load scan node. 2. Fix bug that new parquet reader can not read column without logical or converted type. 3. Change jsonb parser function to "jsonb_parse_error_to_null" So that if the input string is not a valid json string, it will return null for jsonb column in load task.	2022-12-16 09:41:43 +08:00
Jibing-Li	e0d528980f	[fix](multi catalog)Return empty block while external table scanner couldn't find the file (#14997 ) The FE file path cache for external tables may be out of date. In this case, BE may fail to find a file that is listed in the FE cache but no longer exists. This pr handles this case: instead of throwing an error message to the user, we return an empty result set.	2022-12-16 09:36:35 +08:00
slothever	67e4292533	[fix](iceberg-v2) icebergv2 filter data path (#14470 ) 1. an icebergv2 delete file may cross many data paths, so the path of a file split is required as a predicate to filter rows of the delete file - create delete file structure to save predicate parameters - create predicate for file path 2. add some log to print row range 3. fix bug when create file metadata	2022-12-15 10:18:12 +08:00
Pxl	c25a7235f9	[Pipeline](load) support pipeline broker load (#14940 ) support pipeline broker load	2022-12-13 00:28:36 +08:00
plat1ko	f3aea7f0f0	[Enhancement](status) Unify error code and enable customed err msg for BE internal errors (#14744 )	2022-12-11 23:33:18 +08:00
wxy	af50461211	[fix](statistics) fix CpuTimeMS in audit log when enable_vectorized_engine=true. (#14853 ) Co-authored-by: wangxiangyu@360shuke.com <wangxiangyu@360shuke.com>	2022-12-09 21:13:05 +08:00
TengJianPing	fcea89bcf4	[fix](const_expr) fix coredump caused by unsupported cast const expr (#14825 )	2022-12-06 10:31:15 +08:00
HappenLee	b30cd86e9e	[Refactor](pipeline) Refactor operator and builder code of pipeline (#14787 )	2022-12-05 18:35:00 +08:00
TengJianPing	8c0e13ab51	[improvement](profile) add detail memory counter for exec nodes (#14806 ) * [improvement](profile) improve accuracy of memory usage and add detail memory counter * fix	2022-12-05 11:51:52 +08:00
wxy	e141664339	[fix](statistics) fix missing scanBytes and scanRows in query statist… (#14750 ) * [fix](statistics) fix missing scanBytes and scanRows in query statistics when enable_vectorized_engine=true. Co-authored-by: wangxiangyu@360shuke.com <wangxiangyu@360shuke.com>	2022-12-05 09:17:51 +08:00
HappenLee	12304bc0ee	[Pipeline](exec) Support pipeline exec engine (#14736 ) Co-authored-by: Lijia Liu <liutang123@yeah.net> Co-authored-by: HappenLee <happenlee@hotmail.com> Co-authored-by: Jerry Hu <mrhhsg@gmail.com> Co-authored-by: Pxl <952130278@qq.com> Co-authored-by: shee <13843187+qzsee@users.noreply.github.com> Co-authored-by: Gabriel <gabrielleebuaa@gmail.com> ## Problem Summary: ### 1. Design DSIP: https://cwiki.apache.org/confluence/display/DORIS/DSIP-027%3A+Support+Pipeline+Exec+Engine ### 2. How to use: Set the session variable `set enable_pipeline_engine = true;`	2022-12-02 17:11:34 +08:00
Gabriel	9dd1d989e8	[test](decimalv3) add regression test cases for decimalv3 (#14672 )	2022-12-01 15:18:40 +08:00
Xinyi Zou	176f519fa1	[enhancement](memtracker) Optimize exec node memory tracking (#14711 )	2022-12-01 14:52:21 +08:00
Pxl	bba77fa9dd	[Enhancement](profile) enhance column predicates display on profile (#14664 )	2022-12-01 13:07:12 +08:00
luozenglin	7873bc95a6	[Enhancement](bitmapfilter) Support bitmap filter to apply zone_map index to filter pages (#14635 )	2022-12-01 10:41:09 +08:00
lsy3993	f7a827c06b	[fix](new-scan) fix some bugs about new scan node and readers (#14504 ) 1. json reader DCHECK fails because of missing TYPE_STRING. 2. fix bug that if no file is found, the tvf will throw NPE. 3. The predicate conjuncts can not be pushed down to parquet reader if this is a load task, because the predicate should be applied on columns of the dest table, not on columns of the source file. 4. Add a temp property "use_new_load_scan_node" of broker load to make regression test happy, so that we can use new load scan node for a certain job and avoid setting global FE config.	2022-11-29 10:21:41 +08:00
luozenglin	4728e75079	[feature](bitmap) Support in bitmap syntax and bitmap runtime filter (#14340 ) 1.Support in bitmap syntax, like 'where k1 in (select bitmap_column from tbl)'; 2.Support bitmap runtime filter. Generate a bitmap filter using the right table bitmap and push it down to the left table storage layer for filtering.	2022-11-25 15:22:44 +08:00
Jerry Hu	9103ded1dd	[improvement](join)optimize sharing hash table for broadcast join (#14371 ) This PR is to make sharing hash table for broadcast more robust: Add a session variable to enable/disable this function. Do not block the hash join node's close function. Use shared pointer to share hash table and runtime filter in broadcast join nodes. The Hash join node that doesn't need to build the hash table will close the right child without reading any data(the child will close the corresponding sender).	2022-11-24 21:06:44 +08:00
luozenglin	30e1818724	[fix](tracing) fix tracing in the new scan node does not meet expectations (#14155 ) Issue Number: close #14149 - Remove unexpected tracing, like 'vscanner::scan' - Merge span vscannode::get_next	2022-11-22 16:44:02 +08:00
Pxl	bcd641877f	[Enhancement](scan) disable build key range and filters when push down agg work (#14248 ) disable build key range and filters when push down agg work	2022-11-21 12:47:57 +08:00

123 Commits