Before this PR, when encountering null values in columns specified as `NOT NULL`, the null values were not filtered; this behavior does not match the original load behavior.
Second, the column alignment logic has a bug:
```
template <typename ColumnInserterFn>
void align_variant_by_name_and_type(ColumnObject& dst, const ColumnObject& src, size_t row_cnt,
ColumnInserterFn inserter) {
CHECK(dst.is_finalized() && src.is_finalized());
// Use rows() here instead of size(), since size() will check_consistency
// but we could not check_consistency since num_rows will be upgraded even
// if src and dst is empty, we just increase the num_rows of dst and fill
// num_rows of default values when meet new data
size_t num_rows = dst.rows();
```
1. Introduce a new type `VARIANT` to encapsulate dynamically generated columns, hiding the details of the types and names of newly generated columns.
2. Introduce a new expression `SchemaChangeExpr` to perform schema change, for extensibility.
Support delta encoding and RLE (bool) to read Glue data:
- add delta bit-packed decoder
- add delta length byte array decoder
- add delta byte array decoder
- add RLE bool decoder

We found that some data types are read with delta encoding on AWS Glue, so they should be supported.
The definition of delta encoding follows the delta encoding in Parquet; a minimal sketch of the value reconstruction step follows.
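For reference, a minimal sketch (not the Doris decoder, and the function name is illustrative) of the reconstruction step of Parquet DELTA_BINARY_PACKED: deltas are stored bit-packed as non-negative offsets from a per-block min_delta, so each value is the previous value plus min_delta plus the unpacked offset.
```
#include <cstdint>
#include <vector>

// Sketch only: assumes the block header (first value, min_delta) and the
// bit-unpacked miniblock deltas have already been read from the page.
std::vector<int64_t> rebuild_delta_packed(int64_t first_value, int64_t min_delta,
                                          const std::vector<uint64_t>& unpacked_deltas) {
    std::vector<int64_t> values;
    values.reserve(unpacked_deltas.size() + 1);
    values.push_back(first_value);              // the first value is stored verbatim
    int64_t prev = first_value;
    for (uint64_t d : unpacked_deltas) {
        prev += min_delta + static_cast<int64_t>(d);  // delta = min_delta + packed offset
        values.push_back(prev);
    }
    return values;
}
```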
In the past, only simple predicates (slot = const, AND, LIKE, and OR with a bitmap index only) could be pushed down to the storage layer. The scan process was:
1. Read part of the columns first and calculate the row ids with the simple pushed-down predicates.
2. Use the row ids to read the remaining columns and pass them to the scanner, which filters on the remaining predicates.

This PR also pushes the remaining predicates (functions, nested predicates, ...) from the scanner down to the storage layer for filtering. The scan process becomes (see the sketch after this list):
1. Read part of the columns first and use the pushed-down simple predicates to calculate the row ids (same as above).
2. Use the row ids to read the columns needed by the remaining predicates, and use those pushed-down remaining predicates to reduce the number of row ids again.
3. Use the row ids to read the remaining columns and pass them to the scanner.
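A minimal, self-contained sketch of the two-stage filtering idea; the types and function names here are hypothetical, not the Doris storage-layer API.
```
#include <cstdint>
#include <functional>
#include <vector>

using RowPredicate = std::function<bool(uint32_t /*row_id*/)>;

// Stage 1: simple predicates (slot = const, ...) produce the initial row ids.
std::vector<uint32_t> filter_by_simple_predicates(uint32_t num_rows, const RowPredicate& pred) {
    std::vector<uint32_t> row_ids;
    for (uint32_t i = 0; i < num_rows; ++i) {
        if (pred(i)) row_ids.push_back(i);
    }
    return row_ids;
}

// Stage 2 (new in this PR): the remaining predicates also run in the storage
// layer, reducing the row ids again before the remaining columns are read.
std::vector<uint32_t> refine_by_remaining_predicates(const std::vector<uint32_t>& row_ids,
                                                     const RowPredicate& pred) {
    std::vector<uint32_t> refined;
    for (uint32_t id : row_ids) {
        if (pred(id)) refined.push_back(id);
    }
    return refined;
}
```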
The `_src_block_mem_reuse` variable actually does not work, since `_src_block` is cleared each time `get_block` is called.
But the current code may cause a core dump, see issue #17587: we insert result columns generated by exprs into the dest block, and such a column can hold a pointer to a column in the original schema. When the data of `_src_block` is cleared, the data of some columns in the dest block is cleared as well.
e.g. `coalesce` can return a result column that holds a pointer to an original column, see issue #17588. A minimal sketch of the hazard is shown below.
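The sketch uses illustrative types, not the Doris Block/Column API: when the dest block stores a pointer to a column owned by the src block, clearing the src block's data also empties what the dest block sees.
```
#include <memory>
#include <vector>

using Column = std::vector<int64_t>;
using ColumnPtr = std::shared_ptr<Column>;

int main() {
    ColumnPtr src_col = std::make_shared<Column>(Column{1, 2, 3});
    ColumnPtr dst_col = src_col;  // e.g. coalesce() just forwards the original column
    src_col->clear();             // clearing _src_block's data for "reuse"
    // dst_col->size() == 0 here: the dest block has silently lost its data.
    return 0;
}
```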
If the number of conditions on a column exceeds the `max_pushdown_conditions_per_column` limit, that column will not perform predicate pushdown; but if there are subsequent columns that need to be pushed down, their pushdown is misplaced in `_scan_keys`, which causes wrong query results.
Co-authored-by: tongyang.hty <hantongyang@douyu.tv>
This PR does three things:
1. Use Druid instead of HikariCP in JdbcClient.
2. When downloading a UDF jar, append the name of the jar package to the local file name.
3. Refactor some jdbcResource code.
Fix three bugs:
1. `repeated_parent_def_level` should be the definition level of its repeated parent.
2. Failed to parse schemas like `decimal(p, s)`.
3. Wrong offsets were filled for the array type.
- Remove the duplicate type definition in function context.
- Remove unused methods in function context.
- Drop the stale state in vexpr context: vexpr is stateless, function context saves the state, and they are cloned.
- Remove the useless slot_size in all tuple and slot descriptors.
- Remove the doris_udf namespace; it is useless.
- Remove some unused macro definitions.
- Init v_conjuncts in vscanner, so the same code does not need to be written in every scanner.
- Use unique_ptr to manage function context, since it can only belong to a single expr context (see the sketch below).
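A minimal sketch of the ownership change (illustrative types, not the actual Doris classes): each expr context exclusively owns its function context, so unique_ptr expresses the single-owner relationship, and cloning a context creates a fresh function context instead of sharing one.
```
#include <memory>

struct FunctionContext {
    int state = 0;  // per-invocation state lives here
};

struct ExprContextSketch {
    std::unique_ptr<FunctionContext> fn_ctx = std::make_unique<FunctionContext>();
};

// Cloning an expr context clones its function context instead of sharing it.
ExprContextSketch clone(const ExprContextSketch& src) {
    ExprContextSketch dst;
    *dst.fn_ctx = *src.fn_ctx;
    return dst;
}
```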
Issue Number: close #xxx
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
Fix heap-use-after-free.
The OrcReader has an internal FileInputStream. If the file is empty, the memory of the FileInputStream will leak.
Besides, there is a Statistics instance in the FileInputStream. The FileInputStream may be deleted if the ORC reader
fails to init, but the Statistics may still be used when the ORC reader is closed, causing a heap-use-after-free error.
Potential memory leak:
When initializing a file scanner in the file scan node, if the file scanner's prepare fails, the memory of the file scanner will leak.
Background:
At the moment, a match query must be used with an inverted index.
Problem description:
After dropping an inverted index that is the only index on the table, a match query can still be used on that index column.
Fix:
The index should be updated on BE regardless of whether the indexes_desc from FE is empty.
* [enhancement](execute model) Use a thread pool to execute report or join tasks instead of starting too many threads.
Doris starts a report thread and a join thread during fragment execution. There are many problems if threads are created and destroyed very frequently; jemalloc may not behave well and may even crash.
jemalloc/jemalloc#1405
It is better to use a thread pool for these tasks; a minimal sketch follows.
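The sketch below is a generic fixed-size thread pool, not the Doris ThreadPool implementation: report/join work is submitted as tasks to a fixed set of workers instead of spawning and joining a fresh thread per fragment.
```
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class SimpleThreadPool {
public:
    explicit SimpleThreadPool(size_t num_threads) {
        for (size_t i = 0; i < num_threads; ++i) {
            _workers.emplace_back([this] { work_loop(); });
        }
    }

    ~SimpleThreadPool() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _stopped = true;
        }
        _cv.notify_all();
        for (auto& t : _workers) t.join();
    }

    // Report/join work is enqueued here instead of starting a new thread.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _tasks.push(std::move(task));
        }
        _cv.notify_one();
    }

private:
    void work_loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(_mutex);
                _cv.wait(lock, [this] { return _stopped || !_tasks.empty(); });
                if (_stopped && _tasks.empty()) return;
                task = std::move(_tasks.front());
                _tasks.pop();
            }
            task();
        }
    }

    std::vector<std::thread> _workers;
    std::queue<std::function<void()>> _tasks;
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _stopped = false;
};
```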
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
* [Optimize](simd json reader) Cache search results for the previous row (keyed as index in the JSON object), used as a hint.
`_simdjson_set_column_value` can become a hot spot while parsing JSON in simdjson mode,
so introduce `_prev_positions` to cache the search results of the previous row (keyed as index in the JSON object): the JSON field name order
should be nearly the same between lines. A minimal sketch of the idea follows.
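The sketch uses illustrative structures, not the actual Doris simdjson reader code: for each destination column, remember the field index it had in the previous JSON object and probe that index first, falling back to a full search only on a miss.
```
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

struct JsonField {
    std::string name;
    std::string_view value;
};

class PrevPositionHint {
public:
    explicit PrevPositionHint(size_t num_columns) : _prev_positions(num_columns, 0) {}

    // Returns the index of `name` in `fields`, trying the previous row's position first.
    size_t find(size_t col, std::string_view name, const std::vector<JsonField>& fields) {
        size_t hint = _prev_positions[col];
        if (hint < fields.size() && fields[hint].name == name) {
            return hint;  // fast path: field order is usually identical between lines
        }
        for (size_t i = 0; i < fields.size(); ++i) {  // slow path: full scan, refresh the hint
            if (fields[i].name == name) {
                _prev_positions[col] = i;
                return i;
            }
        }
        return fields.size();  // not found
    }

private:
    std::vector<size_t> _prev_positions;  // field index of each column in the previous row
};
```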
* fix case
A const reference member variable that stores a temporary object cannot be read after the temporary is destroyed, causing a BE core dump when debug-level logging is enabled:
`_broker_addr` had already been destroyed in BrokerFileReader.
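A minimal sketch of the bug pattern (illustrative names, not the actual BrokerFileReader code): a const reference member bound to a temporary dangles once the temporary is destroyed, so reading it later, e.g. in a debug log, is undefined behavior; storing the value by copy fixes it.
```
#include <iostream>
#include <string>

struct BrokenReader {
    explicit BrokenReader(const std::string& addr) : _addr(addr) {}
    const std::string& _addr;  // dangles if the constructor argument was a temporary
};

struct FixedReader {
    explicit FixedReader(const std::string& addr) : _addr(addr) {}
    std::string _addr;  // owns a copy, safe to log later
};

int main() {
    BrokenReader broken(std::string("broker_host:8000"));  // temporary dies after this statement
    // std::cout << broken._addr;  // undefined behavior: the referenced temporary is gone
    FixedReader fixed(std::string("broker_host:8000"));
    std::cout << fixed._addr << std::endl;  // fine
    return 0;
}
```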
1. Enhancement:
For a single-character column separator, csv_reader uses another method to split values.
2. BugFix:
Make `json` file format loading case-sensitive.
Support parsing map & struct types in the Parquet and ORC readers.
## Remaining Problems
1. Doris uses the array type to build the key and value columns of a `map`, but does not fill the offsets in the value column, so the offsets in the value column are wasted.
2. Parquet supports reading only the key or only the value column of a `map`; this PR does not support that yet.
3. Parquet supports reading partial columns of a `struct`; this PR does not support that yet.
ORC does not fill null values in the new batch, and the former batch has already been released, so a string column's pointers for null slots may still point into released memory.
Other types like int/long/timestamp are flat types without pointers in them,
so they do not need to be handled separately the way string does. A minimal sketch of the string handling is shown below.
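The field names below follow the ORC C++ StringVectorBatch; the copy helper itself is hypothetical, not the Doris ORC reader. For rows marked null, the char* in the batch may still point into the previously released batch, so it is never dereferenced; flat types carry no pointers and need no such care.
```
#include <orc/Vector.hh>
#include <string>
#include <vector>

// Copy strings out of an ORC batch; null slots get a default value so the
// stale pointer (possibly into the released former batch) is never read.
std::vector<std::string> copy_string_batch(orc::StringVectorBatch& batch) {
    std::vector<std::string> out;
    out.reserve(batch.numElements);
    for (uint64_t i = 0; i < batch.numElements; ++i) {
        if (batch.hasNulls && !batch.notNull[i]) {
            out.emplace_back();  // null: do not touch batch.data[i]
        } else {
            out.emplace_back(batch.data[i], batch.length[i]);
        }
    }
    return out;
}
```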
Support decoding nested array columns in the Parquet reader:
1. FE should generate the right nested column type. FE does not check the nesting depth or legality, like map\<array\<int\>, int\>.
2. `ParquetColumnReader` has removed the filtering of the page index to support nested array types.
It is too difficult to skip values in nested complex types. Maybe we should support page index filtering and lazy reads in a later PR.
3. `ExternalFileScanNode` has a bug in creating the default value expression.
4. Maybe it is slow to read repetition levels in a while loop. I will optimize this in the next PR; see the sketch after this list for how repetition levels are turned into array offsets.
5. An array column has temporary `SchemaElement`s in its Thrift definition;
we removed them and kept their parent in the former implementation.
The remaining parent should inherit the repetition and definition level of its child.
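As a reference for item 4, here is a minimal sketch (not the Doris reader; a single-level nullable array\<int\> is used for brevity instead of a nested array) of how repetition/definition levels become array offsets and a null map. Assumed level semantics for this schema: def 0 = null array, def 1 = empty array, def >= 2 = an element slot exists; rep 0 starts a new row, rep 1 appends to the current row's array.
```
#include <cstdint>
#include <vector>

struct ArrayLevels {
    std::vector<int64_t> offsets;   // offsets[i] = end position of row i's elements
    std::vector<uint8_t> null_map;  // 1 if row i's array is null
};

ArrayLevels build_array_offsets(const std::vector<int16_t>& rep_levels,
                                const std::vector<int16_t>& def_levels) {
    ArrayLevels out;
    int64_t element_count = 0;
    for (size_t i = 0; i < rep_levels.size(); ++i) {
        if (rep_levels[i] == 0) {
            // A new top-level row starts here.
            out.null_map.push_back(def_levels[i] == 0 ? 1 : 0);
            if (def_levels[i] >= 2) ++element_count;  // null/empty arrays add no element
            out.offsets.push_back(element_count);
        } else {
            // Another element of the current row's array.
            ++element_count;
            out.offsets.back() = element_count;
        }
    }
    return out;
}
```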
Improve the performance of the Parquet reader's filter calculation.
- Use `filter_data` instead of `(*filter_ptr)` to merge the filter, to improve performance (see the sketch after this list).
- Use the mutable column filter function instead of the original new-column filter function introduced by #16850.
- Avoid increasing the column ref-count, which caused unnecessary copying, by passing a reference to the column pointer.
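A minimal sketch of the filter-merge change (illustrative, not the Doris column code): iterate once over the raw filter bytes obtained from the data pointer instead of going through the column pointer for every element.
```
#include <cstddef>
#include <cstdint>
#include <vector>

// Merge `src` into `dst` with a logical AND over raw byte pointers.
void merge_filter(std::vector<uint8_t>& dst, const std::vector<uint8_t>& src) {
    uint8_t* filter_data = dst.data();
    const uint8_t* src_data = src.data();
    for (size_t i = 0; i < dst.size(); ++i) {
        filter_data[i] &= src_data[i];
    }
}
```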
Fix: Redhat 4.x /proc/meminfo has no MemAvailable, so disable using MemAvailable to control memory there.
Record vm_rss_str and mem_available_str when GC is triggered, to avoid memory changes during GC causing inaccurate logs.
Catch bad_alloc in the join probe, which may allocate 64 GB of memory at a time, to avoid OOM.
Modify the names doris_be_all_segments_num and doris_be_all_rowsets_num in the document.
Issue Number: close #17003
## Problem summary
The linker could not find some symbols because the implementation of the template member function doris::vectorized::Decoder::init_decimal_converter was missing from the header file in which the corresponding declaration is placed. A minimal illustration follows.
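The signature below is hypothetical; the point is that the definition of a member function template must be visible in every translation unit that instantiates it, so it belongs in the header next to its declaration rather than in a .cpp file.
```
// decoder.h (sketch)
#pragma once

class Decoder {
public:
    template <typename DecimalType>
    void init_decimal_converter(int precision, int scale);  // declaration
};

// Keeping the definition in the header lets the compiler instantiate it for
// every DecimalType used by callers, so the linker finds all the symbols.
template <typename DecimalType>
void Decoder::init_decimal_converter(int precision, int scale) {
    // ... set up the converter for the given precision and scale
    (void)precision;
    (void)scale;
}
```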
In the previous implementation, when querying a TVF, FE would get the schema from BE,
and BE would try to open the first file to get its schema info; but for the ORC or Parquet format,
if the file is empty, it returns an error.
However, even for an empty file, we can still get the schema info from the file's footer,
so we should handle the empty file and get the schema info correctly.
Also modify the catalog doc to add some FAQ entries.
There are 2 kinds of scanner thread pools: local and remote.
Local is for local file reads, especially for the olap scanner.
Remote is for other external data sources, such as the file scanner and the jdbc scanner.
This PR mainly changes:
For the olap scanner, use cold or hot rowsets to decide whether to use the local or the remote pool.
For other scanners, use the remote pool by default.
Add a new BE config doris_max_remote_scanner_thread_pool_thread_num, default 512,
indicating the max thread number of the remote scanner thread pool (a sample be.conf entry is shown below).
This will alleviate the problem of interference between olap queries and load jobs or external queries.
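For reference, a sample be.conf entry; the value shown is just the default from the description above.
```
# Max thread number of the remote scanner thread pool (default 512)
doris_max_remote_scanner_thread_pool_thread_num = 512
```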