doris

Author	SHA1	Message	Date
luozenglin	e17c2416f0	[fix](join) fix be core dump when using right join with other join predicates (#13511 )	2022-10-24 10:35:07 +08:00
Mingyu Chen	3a3def447d	[fix](csv-reader) fix bug that csv reader can not read text format hms table (#13515 ) 1. Missing field and line delimiter 2. When querying an external table with text(csv) format, we should pass the column position map to BE, otherwise the column order is wrong. TODO: 1. For now, if we query a csv file with a non-existent column, it will return null. But it should return null or the default value of that column. 2. Add regression test after hive docker is ready.	2022-10-22 22:40:03 +08:00
Gabriel	8e19b13f18	[Improvement](runtimefilter) don't allocate memory if all targets are local (#13557 )	2022-10-21 21:43:38 +08:00
Gabriel	3006b258b0	[Improvement](bloomfilter) allocate memory for BF in open phase (#13494 )	2022-10-21 17:37:26 +08:00
starocean999	5dde13fb7d	[fix](scan)extend_scan_key should not change the range parameter (#13530 ) * [fix](scan)extend_scan_key should not change the range parameter * [fix](scan)new olap scan node has the same issue	2022-10-21 15:17:12 +08:00
Gabriel	d3f65aa746	[Improvement](join) remove unnecessary state for join (#13472 )	2022-10-21 09:59:34 +08:00
Mingyu Chen	32b1456b28	[feature-wip](array) remove array config and check array nested depth (#13428 ) 1. remove FE config `enable_array_type` 2. limit the nested depth of array in FE side. 3. Fix bug that when loading array from parquet, the decimal type is treated as bigint 4. Fix loading array from csv(vec-engine), handle null and "null" 5. Change the csv array loading behavior, if the array string format is invalid in csv, it will be converted to null. 6. Remove `check_array_format()`, because its logic is wrong and meaningless 7. Add stream load csv test cases and more parquet broker load tests	2022-10-20 15:52:31 +08:00
TengJianPing	b5cd167713	[fix](hashjoin) fix coredump of hash join in ubsan build (#13479 ) * [fix](hashjoin) fix coredump of hash join in ubsan build	2022-10-20 10:16:19 +08:00
Ashin Gau	f7c69ade18	[feature-wip](multi-catalog) implement predicate pushdown in native OrcReader (#13453 ) # Proposed changes Implement predicate pushdown in `OrcReader` by converting doris `ColumnValueRange` to orc `SearchArgument`. ## Remaining problems 1. Orc supports `not in`, which may have an effect on the bloom filter. However, the doris `ScanNode` does not push down `not in` to the file scanner. 2. Orc supports `is null`, and the row range has a `hasNull` identifier. However, `_contain_null` in `ColumnValueRange` is ambiguous. `_contain_null = true` only means that the value can be nullable, not that it equals null. 3. `DateTimeV2` has lost microsecond precision in `ColumnValueRange`, which may cause filtering errors when a min-max value equals the predicate value. 4. `DateTimeV1` is not accurate enough, and is only saved to seconds. 5. Orc supports the predicate pushdown of `float&double` types, but doris does not push down `float&double` types for precision reasons.	2022-10-20 10:07:36 +08:00
xy720	f329d33666	[chore](fix) Fix some spelling errors in be's comments. #13452	2022-10-20 08:56:01 +08:00
Mingyu Chen	5423de68dd	[refactor](new-scan) remove old file scan node (#13433 ) All these files are not used anymore, can be removed.	2022-10-19 14:25:32 +08:00
Ashin Gau	21f233d7e7	[feature-wip](multi-catalog) use apache orc reader to read orc file (#13404 ) Use apache orc to read orc file, and convert ColumnVectorBatch to doris block.	2022-10-18 13:47:56 +08:00
Adonis Ling	125def5102	[enhancement](macOS M1) Support building from source on macOS (M1) (#13195 ) # Proposed changes This PR fixed lots of issues when building from source on macOS with Apple M1 chip. ## ATTENTION The job for supporting macOS with Apple M1 chip is too big and there are lots of unresolved issues during runtime: 1. Some errors with memory tracker occur when BE (RELEASE) starts. 2. Some UT cases fail. ... Temporarily, the following changes are made on macOS to start BE successfully. 1. Disable memory tracker. 2. Use tcmalloc instead of jemalloc. This PR kicks off the job. Guys who are interested in this job can continue to fix these runtime issues. ## Use case ```shell ./build.sh -j 8 --be --clean cd output/be/bin ulimit -n 60000 ./start_be.sh --daemon ``` ## Something else It takes around _10+_ minutes to build BE (with prebuilt third-parties) on macOS with M1 chip. We will improve the development experience on macOS greatly when we finish the adaptation job.	2022-10-18 13:10:13 +08:00
Gabriel	cd3450bd9d	[Improvement](join) optimize join probing phase (#13357 )	2022-10-18 12:37:17 +08:00
Mingyu Chen	dbf71ed3be	[feature-wip](new-scan) Support stream load with csv in new scan framework (#13354 ) 1. Refactor the file reader creation in FileFactory, for simplicity. Previously, FileFactory had too many `create_file_reader` interfaces. Now unified into two categories: the interface used by the previous BrokerScanNode, and the interface used by the new FileScanNode. And separate the creation methods of readers that read `StreamLoadPipe` and other readers that read files. 2. Modify the StreamLoadPlanner on FE side to support using ExternalFileScanNode 3. Now for generic reader, the file reader will be created inside the reader, not passed from the outside. 4. Add some test cases for csv stream load, the behavior is same as the old broker scanner.	2022-10-17 23:33:41 +08:00
xy720	c114d87d13	[Enhancement](array-type) Tuple is null predicate support array type (#13307 ) Issue Number: #12689	2022-10-17 18:50:56 +08:00
Pxl	632670a49c	[Enhancement](function) refactor of date function (#13362 ) refactor of date function	2022-10-16 14:31:26 +08:00
zhangstar333	4bc33a54a1	[Fix](agg) fix bitmap agg core dump when phmap pointer assert alignment (#13381 )	2022-10-15 10:39:23 +08:00
Gabriel	8218cfed40	[Bug](function) Fix constant predicate evaluation (#13346 )	2022-10-15 01:05:29 +08:00
Gabriel	baf2689610	[Improvement](join) compute hash values by vectorized way (#13335 )	2022-10-13 16:04:58 +08:00
Gabriel	3e84c04195	[Bug](predicate) fix nullptr in scan node (#13316 )	2022-10-13 12:14:42 +08:00
Gabriel	dfe308f501	[Improvement](join) refine prefetch strategy (#13286 )	2022-10-12 19:02:06 +08:00
slothever	4fc7a048d2	[feature-wip](parquet-reader) fix string test and support decimal64 (#13184 ) 1. Refactor arguments list of parquet min max filter, pass parquet type for min max value parsing 2. Fix the filter of string min max Co-authored-by: jinzhe <jinzhe@selectdb.com>	2022-10-12 16:52:28 +08:00
Ashin Gau	bb4414e303	[feature-wip](multi-catalog) optimize parquet profile & add null map timer (#13257 ) Use indentation to make `ParquetReader`'s profile more readable Add `ParquetReader.DecodeNullMapTime` to show the time of parsing `NullMap` for `NullableColumn` ``` VFILE_SCAN_NODE (id=0):(Active: 279.62ms, % non-child: 85.83%) - FileReadBytes: 2.36 MB - FileReadCalls: 20 - FileReadTime: 5.686ms - MaxScannerThreadNum: 1 - NewlyCreateFreeBlocksNum: 125 - NumScanners: 1 - ParquetReader: 0ns - ColumnReadTime: 259.946ms - DecodeDictTime: 0ns - DecodeHeaderTime: 437.707us - DecodeLevelTime: 30.101us - DecodeNullMapTime: 53.295ms - DecodeValueTime: 62.607ms - DecompressCount: 511 - DecompressTime: 1.159ms - FilteredBytes: 0.00 - FilteredGroups: 0 - FilteredRowsByGroup: 0 - FilteredRowsByPage: 0 - ParseMetaTime: 22.517ms - ReadBytes: 2.36 MB - ReadGroups: 20 ```	2022-10-12 16:51:06 +08:00
Tiewei Fang	b7621e1615	[feature-wip](new-scan) support csv reader (#13282 ) Issue Number: close #12574 This pr adds CsvReader which implements GenericReader interface to support read csv format file.	2022-10-12 16:22:13 +08:00
Xinyi Zou	df54c6b63a	[enhancement](memtracker) Add independent and unique scanner mem tracker for each query (#13262 )	2022-10-11 19:47:12 +08:00
Gabriel	1724a91f53	[Bug](predicate) Cover all const predicates in scan node (#13238 ) For a vectorized expression which meets the condition vexpr->is_constant(), a const column is expected to be returned. But now we still don't cover all predicates for const expressions. For example, for query SELECT col FROM tbl WHERE 'PROMOTION' LIKE 'AAA%', predicate like will return a ColumnVector which contains a single value. This PR wants to cover all const predicates in scan node whether they return a const column or not	2022-10-11 15:49:53 +08:00
Mingyu Chen	c1ce48ffe4	[fix](new-scann) scanner may be marked close twice (#13263 )	2022-10-11 15:37:15 +08:00
Pxl	bdcb600f3d	[Bug](load) fix core dump on big block load (#13014 )	2022-10-10 12:38:32 +08:00
Tiewei Fang	935ef5a598	[feature-wip](new-scan) Add new ES scanner and new ES scan node #13027	2022-10-10 09:56:38 +08:00
Ashin Gau	dd089259be	[feature-wip](multi-catalog) Optimize the performance of boolean & dictionary decoding (#13212 ) Generate vector for dictionary data. Decode boolean values in batch.	2022-10-10 08:41:11 +08:00
Pxl	245490d6b7	[Enhancement](runtime filter) optimize for runtime filter (#12856 ) optimize for runtime filter	2022-10-09 14:11:03 +08:00
Ashin Gau	b81a8789c3	[feature-wip](parquet-reader) optimize the performance of column conversion (#13122 ) Convert Parquet column into doris column via batch method. In the previous implementation, only numeric types can be converted in batches, and other types can only be inserted one by one. This process will generate repeated virtual function calls and container expansion.	2022-10-08 18:03:10 +08:00
slothever	5214e898d9	[fix](parquet-reader) skip date/datetime column predicate filter to avoid coredump (#13072 ) Will be fixed later Co-authored-by: jinzhe <jinzhe@selectdb.com>	2022-10-08 18:02:35 +08:00
Mingyu Chen	cf2b93532b	[fix](file-scanner) fix some logic about broker load with parquet with new file scanner (#13135 ) Fix some logic about broker load using new file scanner, with parquet format: 1. If columns are specified in load stmt, but none of them are in parquet file, error will be thrown like `err: No columns found in file`. See `parquet_s3_case4` 2. If the first column of table are not in table, the result number of rows is wrong. See `parquet_s3_case8` 3. If column specified in `columns` in load stmt does not exist in file and table, error will be thrown like: `failed to find default value expr for slot: x1`. See `parquet_s3_case2`	2022-10-08 13:08:08 +08:00
Tiewei Fang	b41748efa1	[feature-wip](new-scan)Add new jdbc scanner and new jdbc scan node (#12848 ) Related pr: #11582 This pr is the new jdbc scan node and scanner.	2022-10-07 09:55:17 +08:00
Mingyu Chen	d286aa7bf7	[fix](spark-load) no need to filter row group when doing spark load (#13116 ) 1. Fix issue #13115 2. Modify the method `get_next_block` of `GenericReader` to return "read_rows" explicitly. Some columns in the block may not be filled in the reader; if the first column is not filled, `block->rows()` cannot return the real number of rows. 3. Add more checks for broker load test cases.	2022-10-05 23:00:56 +08:00
Lightman	7b75c2df54	[fix](BE) fix the stream load error when upgrade BE from 1.1.2 to master (#13058 )	2022-10-05 12:13:26 +08:00
Ashin Gau	026ffaf10d	[feature-wip](parquet-reader) add detail profile for parquet reader (#13095 ) Add more detail profile for ParquetReader: ParquetColumnReadTime: the total time of reading parquet columns ParquetDecodeDictTime: time to parse dictionary page ParquetDecodeHeaderTime: time to parse page header ParquetDecodeLevelTime: time to parse page's definition/repetition level ParquetDecodeValueTime: time to decode page data into doris column ParquetDecompressCount: counter of decompressing page data ParquetDecompressTime: time to decompress page data ParquetParseMetaTime: time to parse parquet meta data	2022-10-02 15:11:48 +08:00
Gabriel	287ff50a6f	[Bug](datev2) Fix compatible error between datev2 and date (#13024 )	2022-09-29 18:01:55 +08:00
slothever	820ec435ce	[feature-wip](parquet-reader) refactor parquet_predicate (#12896 ) This change serves the following purposes: 1. use ScanPredicate instead of TCondition for external table, it can reuse old code branch. 2. simplify and delete some useless old code 3. use ColumnValueRange to save predicate	2022-09-28 21:27:13 +08:00
Gabriel	1ba9e4b568	[Improvement](sort) Reuse memory in sort node (#12921 )	2022-09-28 09:44:35 +08:00
Mingyu Chen	d80b7b9689	[feature-wip](new-scan) support more load situation (#12953 )	2022-09-27 21:48:32 +08:00
Xiaocc	5790d23624	[fix](transfer_thread) fix the loss of notification. (#12988 )	2022-09-27 08:44:02 +08:00
Pxl	8731eea26e	[Chore](clang) fix some build fail on clang15 (#12882 ) remove unused variables	2022-09-26 23:13:28 +08:00
Tiewei Fang	acd5d67355	[feature-wip](new-scan)Add new odbc scanner and new odbc scan node (#12899 )	2022-09-26 09:24:25 +08:00
Ashin Gau	692176ec07	[feature-wip](parquet-reader) pre read page data in advance to avoid frequent seek (#12898 ) 1. Fix the bug of file position in `HdfsFileReader` 2. Reserve enough buffer for `ColumnColumnReader` to read large continuous memory	2022-09-25 21:21:06 +08:00
Jibing-Li	f1a64ea09f	[fix](new-scan)Fix new scanner load job bugs (#12903 ) Fix bugs: 1. FE needs to send the file format (e.g. parquet, orc ...) to BE while processing load jobs using the new scanner. 2. Try to get the parquet file column type from SchemaElement.type before getting it from Logical type and Converted type.	2022-09-24 17:21:19 +08:00
yiguolei	7b230e41a8	[bugfix](scanner) olap scanner compute is wrong (#12857 ) Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-09-24 09:59:59 +08:00
Ashin Gau	5bfdfac387	[feature-wip](parquet-reader) add parquet reader profile (#12797 ) Add profile for parquet reader. New counters: - ParquetFilteredGroups: Filtered row groups by `RowGroup` min-max statistics - ParquetReadGroups: The number of row groups to read - ParquetFilteredRowsByGroup: The number of filtered rows by `RowGroup` min-max statistics - ParquetFilteredRowsByPage: The number of filtered rows by page min-max statistics - ParquetFilteredBytes: The filtered bytes by `RowGroup` min-max statistics - ParquetReadBytes: The total bytes in `ParquetReadGroups`, may be further filtered If a page is skipped as a whole ## Result ``` ┌──────────────────────────────────────────────────────┐ │[0: VFILE_SCAN_NODE] │ │(Active: 1s29ms, non-child: 96.42) │ │ - Counters: │ │ - BytesRead: 0.00 │ │ - FileReadCalls: 1.826K (1826) │ │ - FileReadTime: 510.627ms │ │ - FileRemoteReadBytes: 65.23 MB │ │ - FileRemoteReadCalls: 1.146K (1146) │ │ - FileRemoteReadRate: 128.29331970214844 MB/sec │ │ - FileRemoteReadTime: 508.469ms │ │ - NumDiskAccess: 0 │ │ - NumScanners: 1 │ │ - ParquetFilteredBytes: 0.00 │ │ - ParquetFilteredGroups: 0 │ │ - ParquetFilteredRowsByGroup: 0 │ │ - ParquetFilteredRowsByPage: 6.600003M (6600003)│ │ - ParquetReadBytes: 2.13 GB │ │ - ParquetReadGroups: 20 │ │ - PeakMemoryUsage: 0.00 │ │ - PredicateFilteredRows: 3.399797M (3399797) │ │ - PredicateFilteredTime: 133.302ms │ │ - RowsRead: 3.399997M (3399997) │ │ - RowsReturned: 200 │ │ - RowsReturnedRate: 194 │ │ - TotalRawReadTime(*): 726.566ms │ │ - TotalReadThroughput: 0.0 /sec │ │ - WaitScannerTime: 1s27ms │ └──────────────────────────────────────────────────────┘ ```	2022-09-23 18:42:14 +08:00

317 Commits