doris

Author	SHA1	Message	Date
HappenLee	8be43857ef	[feature](executor) Add memory limit for pip_scanner_context (#18238 ) Co-authored-by: wangbo <506340561@qq.com>	2023-03-31 09:36:57 +08:00
Ashin Gau	d6b0fe9072	[feature](jni) jni table scanner framework (#17960 ) A framework that reads data through a JNI scanner, so that data sources from the Java ecosystem (Java API) can be supported. ## Java Interface A Java scanner should extend `org.apache.doris.jni.JniScanner` and implement the following methods: ``` // Initialize JniScanner public abstract void open() throws IOException; // Close JniScanner and release resources public abstract void close() throws IOException; // Scan data and save as vector table public abstract int getNext() throws IOException; ``` See demo usage in `org.apache.doris.jni.MockJniScanner`. ## C++ Interface A C++ reader should use `doris::JniConnector` to get data from `org.apache.doris.jni.JniScanner`. See demo usage in `doris::MockJniReader`. ## Pushed-down predicates A Java scanner can get pushed-down predicates through `org.apache.doris.jni.vec.ScanPredicate`. ## Remaining work 1. Implement complex nested types. 2. Read a Hudi MOR table as the end-to-end demo usage.	2023-03-30 23:47:45 +08:00
Mingyu Chen	05db6e9b55	[refactor](file-system)(step-2) remove env, file_utils and filesystem_utils (#18009 ) Follow-up to #17586. This PR mainly changes: Remove env/; remove FileUtils/FilesystemUtils (some methods are moved to LocalFileSystem); remove olap/file_cache; add an s3 client cache for the s3 file system (in my test, the time to open an s3 file is reduced significantly); fix a cold/hot separation bug for s3 fs. This is the last PR of #17764. After this, all IO operations should be in io/fs. In addition to the tests in #17586, I also tested some cases related to fs io: clone; concurrent queries on local/s3/hdfs; load error log creation and cleanup; disk metrics.	2023-03-29 09:00:52 +08:00
Tiewei Fang	642c378fc7	[feature](table-valued-function) add Backends table-valued-function (#17667 ) This pr implement a new Metadata TVF called backends. And the implement process tutorial is in #17974.	2023-03-27 15:18:31 +08:00
HappenLee	fd5dd9a391	[Opt](Pipeline) opt pipeline code in mult tablet (#17999 )	2023-03-27 10:02:48 +08:00
Mingyu Chen	7c0bcbdca1	[enhance](parquet-reader) cache file meta of parquet to speed up query (#18074 ) Problem: 1. FE splits a parquet file into splits, so one file can have several splits. 2. BE scans each split and reads the footer of the parquet file. 3. If 2 splits belong to the same parquet file, the footer of this file is read twice. This PR mainly changes: 1. Use a kv cache to cache the footer of the parquet file. 2. The kv cache belongs to a scan node, so all parquet readers that belong to this scan node share the same kv cache. 3. In the cache, the key is "meta_file_path" and the value is the parsed thrift footer. The kv cache is sharded into multiple sub-caches, so that different files can use different sub-caches and avoid blocking each other. In my test, a query with 26 splits reduces the footer parse time from 4s to 1s.	2023-03-25 23:22:57 +08:00
奕冷	855852d582	[enhancement](timeout) fix set timeout failure and simplify timeout logic (#17837 )	2023-03-25 21:56:06 +08:00
Gabriel	e8b9587fe6	[Improvement](dict) compute hash only if needed (#18058 )	2023-03-24 11:45:58 +08:00
Mingyu Chen	cb79e42e5c	[refactor](file-system)(step-1) refactor file system on BE and remove storage_backend (#17586 ) See #17764 for details. I have tested: - Unit test for local/s3/hdfs/broker file system: be/test/io/fs/file_system_test.cpp - Outfile to local/s3/hdfs/broker. - Load from local/s3/hdfs/broker. - Query file on local/s3/hdfs/broker file system, with table value function and catalog. - Backup/Restore with local/s3/hdfs/broker file system. Not tested: - cold & hot data separation case.	2023-03-21 21:08:38 +08:00
Gabriel	bd8e3e6405	[refactor](date) unify DateTimeValue and VecDateTimeValue (#17670 )	2023-03-20 16:27:08 +08:00
yiguolei	dd53bc1c8d	[unify type system](remove unused type desc) remove some code (#17921 ) There are many type definitions in BE. Should unify the type system and simplify the development. --------- Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-03-19 14:05:02 +08:00
Qi Chen	d79da2f926	[Fix](parquet-reader) Fix dict filter not enabled. (#17882 )	2023-03-18 22:16:37 +08:00
Tiewei Fang	46d88ede02	[Refactor](Metadata tvf) Reconstruct Metadata table-value function into a more general framework. (#17590 )	2023-03-17 19:54:50 +08:00
Qi Chen	b4b126b817	[Feature](parquet-reader) Implements dict filter functionality parquet reader. (#17594 ) Implements dict filter functionality parquet reader to improve performance.	2023-03-16 20:29:27 +08:00
HappenLee	c29582bd57	[pipeline](split by segment)support segment split by scanner (#17738 ) * support segment split by scanner * change code by cr	2023-03-16 15:25:52 +08:00
lihangyu	9b7596f1c6	[Feature](Dynamic schema table) step1 support schema change expression (#17494 ) 1. Introduce a new type `VARIANT` to encapsulate dynamically generated columns, hiding the details of the types and names of newly generated columns. 2. Introduce a new expression `SchemaChangeExpr` that performs schema changes, for extensibility.	2023-03-13 15:12:42 +08:00
HappenLee	39b5682d59	[Pipeline](shared_scan_opt) Support shared scan opt in pipeline exec engine	2023-03-13 10:33:57 +08:00
Xinyi Zou	f9baf9c556	[improvement](scan) Support pushdown execute expr ctx (#15917 ) In the past, only simple predicates (slot=const), and, like, or (only bitmap index) could be pushed down to the storage layer. Previous scan process: 1. Read part of the columns first and compute the row ids with the simple pushed-down predicates. 2. Use the row ids to read the remaining columns and pass them to the scanner, which filters by the remaining predicates. This PR also pushes the remaining predicates (functions, nested predicates...) from the scanner down to the storage layer for filtering. New scan process: 1. Read part of the columns first and compute the row ids with the simple pushed-down predicates (same as above). 2. Use the row ids to read the columns needed by the remaining predicates, and apply the pushed-down remaining predicates to reduce the row ids again. 3. Use the row ids to read the remaining columns and pass them to the scanner.	2023-03-10 08:35:32 +08:00
zhannngchen	2cf90ddfc5	[fix](scanner) remove useless _src_block_mem_reuse to avoid core dump while loading (#17559 ) The _src_block_mem_reuse variable does not actually work, since the _src_block is cleared each time we call get_block. But the current code may cause a core dump, see issue #17587. This is because we insert some result columns generated by exprs into the dest block, and such a column holds a pointer to a column in the original schema. When the data of _src_block is cleared, the data of some columns in the dest block is also cleared. E.g. coalesce returns a result column that holds a pointer to an original column, see issue #17588.	2023-03-09 09:26:32 +08:00
qiye	3a877857ae	[improvement](inverted index)Remove searcher bitmap timer to improve query speed (#17407 ) Timer becomes a bottleneck when the query hit volume is very high.	2023-03-08 14:03:36 +08:00
yiguolei	4692d6764c	[refactor](remove string val) remove string val structure, it is same with string ref (#17461 ) remove stringval, decimalv2val, bigintval	2023-03-08 10:42:20 +08:00
htyoung	69c62b6c6c	[Fix](vectorization) fixed that when a column's _fixed_values exceeds the max_pushdown_conditions_per_column limit, the column will not perform predicate pushdown, but if there are subsequent columns that need to be pushed down, the subsequent column pushdown will be misplaced in _scan_keys and it causes query results to be wrong (#17405 ) Co-authored-by: tongyang.hty <hantongyang@douyu.tv>	2023-03-08 07:23:56 +08:00
yiguolei	9477c48ef8	[refactor](functioncontext) remove duplicate type definition in function context (#17421 ) Remove duplicate type definitions in function context; remove unused methods in function context; no stale state is needed in vexpr context, because vexpr is stateless while function context saves state, and both are cloned; remove the useless slot_size in all tuple and slot descriptors; remove the doris_udf namespace, it is useless; remove some unused macro definitions; init v_conjuncts in vscanner, so the same code need not be written in every scanner; use unique ptr to manage function context, since it can only belong to a single expr context. --------- Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-03-06 16:07:09 +08:00
Mingyu Chen	3d0beec01d	[fix](orc) fix heap-use-after-free and potential memory leak of orc reader (#17431 ) Fix heap-use-after-free: the OrcReader has an internal FileInputStream. If the file is empty, the memory of the FileInputStream leaks. Besides, there is a Statistics instance in FileInputStream. The FileInputStream may be deleted if the orc reader fails to init, but the Statistics may still be used when the orc reader is closed, causing a heap-use-after-free error. Potential memory leak: when initializing the file scanner in the file scan node, if the file scanner fails to prepare, the memory of the file scanner leaks.	2023-03-06 08:42:35 +08:00
yiguolei	17f4990bd3	[enhancement](functioncontext) function context should use shared ptr and simply function context (#17311 ) Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-03-02 16:23:54 +08:00
YueW	707f814fc2	[fix](inverted index) fix still execute match query after drop inverted index (#17293 ) Background: at the moment, a match query must use an inverted index. Problem description: after dropping an inverted index that is the only index in the table, a match query can still be run on the indexed column. Fix: the index should be updated on BE regardless of whether the indexes_desc from FE is empty.	2023-03-02 11:12:54 +08:00
luozenglin	1771d1e5e7	[fix](value-range) fix the value range of non-nullable column contains null causes query short key index error. (#16943 ) * [fix](value-range) fix the value range of non-nullable column contains null causes query short key index error.	2023-02-28 11:15:32 +08:00
zhannngchen	84413f33b8	[enhancement](merge-on-write) add skip_delete_bitmap session variable for debug purpose (#17127 )	2023-02-27 23:31:28 +08:00
Mingyu Chen	491d269412	[fix](tvf) fix bug that failed to get schema of tvf when file is empty (#16928 ) In the previous implementation, when querying a tvf, FE gets the schema from BE, and BE tries to open the first file to get its schema info. For the orc or parquet format, if the file is empty, this returns an error. But even for an empty file, we can still get schema info from the file's footer, so we should handle the empty file to get schema info correctly. Also modify the catalog doc to add some FAQ.	2023-02-21 14:14:32 +08:00
Mingyu Chen	c0bb2e33a8	[improvement](scan) separate scanner into local and remote scanner pool (#16891 ) There are 2 kinds of scanner thread pool, local and remote. Local is for local file reads, especially for the olap scanner. Remote is for other external data sources, such as the file scanner and jdbc scanner. This PR mainly changes: For the olap scanner, use cold or hot rowset to decide whether to use the local or remote pool. For other scanners, use the remote pool by default. Add a new BE config doris_max_remote_scanner_thread_pool_thread_num, default 512, which indicates the max thread number of the remote scanner thread pool. This will alleviate the interference between olap queries with load jobs and external queries.	2023-02-21 14:13:09 +08:00
Pxl	ea78184551	[Feature](Materialized-View) support multiple slot on one column in materialized view (#16378 )	2023-02-14 16:10:50 +08:00
yiguolei	1b83829cff	[improvement](block exception safe) make block queue exception safe (#16657 ) * [improvement](block exception safe) make block queue exception safe This is part of exception safe: #16366. --------- Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-02-14 10:50:21 +08:00
YueW	f3ab55d27d	[Optimization](index) Optimization for no need to read raw data for index column that only in where clause (#16569 )	2023-02-14 00:12:45 +08:00
yiguolei	be9385d40a	[improvement](lock raii) use raii to lock and unlock (#16652 ) * [improvement](lock raii) use raii to lock and unlock This is part of exception safe: #16366. --------- Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-02-13 14:06:36 +08:00
HappenLee	09b7c22f6b	[Opt](exec) remove useless null key when no split in convert key range (#16624 )	2023-02-11 15:44:35 +08:00
Kang	aba843bb2b	[Improvement](inverted index) inverted index query match bitmap cache (#16578 ) Add cache for inverted index query match bitmap to accelerate common query keyword, especially for keyword matching many rows. Tests result: - large result: matching 99% out of 247 million rows shows 8x speed up. - small result: matching 0.1% out of 247 million rows shows 2x speed up.	2023-02-11 13:38:58 +08:00
lihangyu	37d1519316	[WIP](dynamic-table) support dynamic schema table (#16335 ) Issue Number: close #16351 A dynamic schema table is a special type of table whose schema changes during the loading procedure. We implemented this feature mainly for semi-structured data such as JSON: since JSON is self-describing, we can extract schema info from the original documents and infer the final type information. This special table reduces manual schema change operations, makes it easy to import semi-structured data, and extends its schema automatically.	2023-02-11 13:37:50 +08:00
YueW	43eca4f209	[Feature-WIP](inverted index) Implementation for alter inverted index. (#16371 ) implementation for add/drop inverted index.	2023-02-10 17:56:17 +08:00
xueweizhang	379bef598d	[fix-core](block) clear block row_same_bit when block reuse (#16172 )	2023-02-10 12:21:27 +08:00
yiguolei	646ba2cc88	[bugfix](scannode) 1. make rows_read correct 2. use single scanner if has limit clause (#16473 ) Make rows_read correct so that the scheduler can use it correctly. Use a single scanner if the query has a limit clause; this logic is moved from fragment context to scannode. --------- Co-authored-by: yiguolei <yiguolei@gmail.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2023-02-09 14:12:18 +08:00
Xiaocc	0142ef8b95	[improvement](scanner) Supports bthread scanner (#16031 )	2023-02-09 10:24:56 +08:00
Kang	737c73dcf0	[Improvement](topn) order by key topn query optimization (#15663 )	2023-02-06 15:36:05 +08:00
slothever	b1b2697cc7	[fix](iceberg) fix iceberg catalog (#16372 ) 1. Fix iceberg catalog access s3 2. Fix iceberg catalog partition table query 3. Fix persistence	2023-02-05 13:15:28 +08:00
Pxl	5e4bb98900	[Chore](build) enable -Wpedantic and update lowest gcc version to 11.1 (#16290 ) enable -Wpedantic and update lowest gcc version to 11.1	2023-02-03 11:28:48 +08:00
Jerry Hu	7a800bd3c6	[fix](scan) coredump caused by null of _scanner_ctx (#16361 )	2023-02-03 09:24:15 +08:00
Mingyu Chen	cb6875b5a4	[improvement](multi-catalog) use date/datetimev2 as default col type for catalog table (#16304 ) 1. When mapping column from external datasource, use date/datetimev2 as default type 2. check `is_cancelled` when read data, to avoid endless loop after query is cancelled	2023-02-02 17:35:48 +08:00
YueW	bb179b77f7	[Feature-WIP](inverted index) support array type for inverted index reader (#16355 )	2023-02-02 16:14:14 +08:00
AlexYue	bb0d4ba787	[BugFix](sort) use correct agg function when using 2 phase sort for agg table (#16185 )	2023-02-01 20:07:43 +08:00
Jibing-Li	d224624bbe	[improvement](session variable)Add enable_file_cache session variable (#16268 ) Add enable_file_cache session variable, so that we can turn off the file cache without restarting BE.	2023-02-01 18:15:03 +08:00
Pxl	ca73c60442	[Chore](build) enable ignored-qualifiers check (#16196 ) enable ignored-qualifiers check	2023-02-01 15:15:59 +08:00
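
The Java scanner contract quoted in commit d6b0fe9072 (`open`, `getNext`, `close`) can be sketched as below. This is a minimal illustration only: `JniScannerSketch` is a hypothetical stand-in for `org.apache.doris.jni.JniScanner`, `RangeScannerSketch` is an invented example scanner, and treating the `int` returned by `getNext()` as the number of rows read (0 meaning end of data) is an assumption, not something the commit message states.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for org.apache.doris.jni.JniScanner. Only the three
// method signatures are taken from the commit message; the rest is assumed.
abstract class JniScannerSketch {
    // Initialize the scanner
    public abstract void open() throws IOException;

    // Close the scanner and release resources
    public abstract void close() throws IOException;

    // Scan the next batch. ASSUMPTION: returns the number of rows read, 0 at end.
    public abstract int getNext() throws IOException;
}

// Invented example scanner that serves the integers 0..totalRows-1 in batches.
class RangeScannerSketch extends JniScannerSketch {
    private final int totalRows;
    private final int batchSize;
    private final List<Integer> batch = new ArrayList<>();
    private int nextRow;
    private boolean opened;

    RangeScannerSketch(int totalRows, int batchSize) {
        this.totalRows = totalRows;
        this.batchSize = batchSize;
    }

    @Override
    public void open() {
        nextRow = 0;
        opened = true;
    }

    @Override
    public void close() {
        batch.clear();
        opened = false;
    }

    @Override
    public int getNext() {
        if (!opened) {
            throw new IllegalStateException("scanner is not open");
        }
        // Refill the batch buffer with up to batchSize rows.
        batch.clear();
        while (batch.size() < batchSize && nextRow < totalRows) {
            batch.add(nextRow++);
        }
        return batch.size();
    }

    // Rows produced by the most recent getNext() call.
    List<Integer> currentBatch() {
        return batch;
    }
}
```

A caller would invoke `open()` once, loop on `getNext()` until it returns 0 while consuming `currentBatch()`, and then call `close()`. In the real framework the batches are handed to the C++ side through `doris::JniConnector` rather than a Java list.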


191 Commits