doris

Author	SHA1	Message	Date
slothever	67e4292533	[fix](iceberg-v2) icebergv2 filter data path (#14470 ) 1. An Iceberg v2 delete file may span many data paths, so the path of a file split is required as a predicate to filter the rows of the delete file: - create a delete file structure to save predicate parameters - create a predicate for the file path 2. Add some logs to print the row range 3. Fix a bug when creating file metadata	2022-12-15 10:18:12 +08:00
Tiewei Fang	b8f93681eb	[feature-wip](file reader) Merge broker reader to the new file reader (#14980 ) Currently there are two sets of file readers in Doris. This PR rewrites the old broker reader with the new file reader. TODO: 1. Rewrite the stream load pipe and the Kafka consumer pipe	2022-12-14 12:48:02 +08:00
Jibing-Li	73ee352705	[fix](multi catalog)Fix convert_to_doris_type missing break for some cases (#14992 )	2022-12-13 13:34:55 +08:00
slothever	e7a84e4a16	[fix](multi-catalog)fix page index thrift deserialize (#15001 ) Fix the error raised when parsing the page index: Couldn't deserialize thrift msg. Use two buffers to store the column index and offset index messages, to avoid parsing them from a single buffer	2022-12-13 13:33:19 +08:00
Jibing-Li	8fe0729835	[fix](multi catalog)Check orc file reader is not null before using it. (#14988 ) The external table file path cache may be out of date, which causes the orc reader to visit non-existent files. In this case, the orc file reader is nullptr. This PR checks the reader before using it, to avoid a core dump from dereferencing nullptr.	2022-12-13 11:27:51 +08:00
plat1ko	f3aea7f0f0	[Enhancement](status) Unify error code and enable customed err msg for BE internal errors (#14744 )	2022-12-11 23:33:18 +08:00
Tiewei Fang	00f44257e2	[feature-wip](file-reader) Merge hdfs reader to the new file reader (#14875 )	2022-12-09 13:21:59 +08:00
lsy3993	f7a827c06b	[fix](new-scan) fix some bugs about new scan node and readers (#14504 ) 1. Fix a DCHECK failure in the json reader caused by a missing TYPE_STRING. 2. Fix a bug where the tvf throws an NPE if no file is found. 3. The predicate conjuncts cannot be pushed down to the parquet reader for a load task, because the predicate should be applied to the columns of the dest table, not to the columns of the source file. 4. Add a temporary property "use_new_load_scan_node" to broker load to make the regression test happy, so that the new load scan node can be used for a certain job without setting the global FE config.	2022-11-29 10:21:41 +08:00
Mingyu Chen	064b8d2aa6	[fix](multi-catalog) fix coredump when querying partitioned hive table with text format (#14604 ) BE will crash when querying a partitioned hive table with text format if the partition column is placed first in the select items. 1. FE should use file slots to set the column mapping index of the csv file. 2. BE should use `get_by_name` of the block to get the right column in the csv reader.	2022-11-26 11:42:40 +08:00
Ashin Gau	25de068a05	[fix](parquet-reader) the value of null map will overflow when LazyRead merges too many empty batches (#14558 ) The run length of null map is saved as `uint16_t`. Previously, the run length of null map was limited by `batch_size` in the `ParquetReader`, by setting `batch_size = std::min(batch_size, (size_t)USHRT_MAX)`. It works well when the batch size is less than `USHRT_MAX`. However, [Lazy read](https://github.com/apache/doris/pull/13917) will merge empty batches until reading a non-empty batch or reaching the EOF of a row group, so the `batch_size` may be greater than `USHRT_MAX` in non-predicate columns. In addition, even if the `batch_size` does not exceed `USHRT_MAX`, the adjacent batches may also make the run length exceed the `USHRT_MAX` in `ColumnSelectVector::get_next_run`.	2022-11-25 12:22:18 +08:00
Gabriel	1ec7f45fb6	[Bug](avg) Fix `avg` for bigint (#14433 )	2022-11-22 10:29:59 +08:00
Gabriel	2c42f0a905	[refactor](decimalv3) Refine code for DecimalV3 (#14394 )	2022-11-19 16:57:17 +08:00
Mingyu Chen	512b787559	[fix](parquet-reader) fix stack-use-after-return error (#14411 )	2022-11-19 10:52:50 +08:00
slothever	6da2948283	[feature-wip](multi-catalog) support iceberg v2(step 1) (#13867 ) Support position delete(part of).	2022-11-17 17:56:48 +08:00
Ashin Gau	20634ab7e3	[feature-wip](multi-catalog) support partition&missing columns in parquet lazy read (#14264 ) PR https://github.com/apache/doris/pull/13917 added lazy read for non-predicate columns in ParquetReader, but lazy read cannot be triggered when the predicate columns are partition or missing columns. This PR supports such cases, and fills partition and missing columns in `FileReader`.	2022-11-16 08:43:11 +08:00
huangzhaowei	5badd70db2	[fix](csv-reader) Fix core dump when load text into doris with special delimiter (#14196 )	2022-11-15 16:06:59 +08:00
Ashin Gau	fc70179acb	[multi-catalog](fix) the eof of lazy read columns may be not equal to the eof of predicate columns (#14212 ) Fix three bugs: 1. The EOF of lazy read columns may not be equal to the EOF of predicate columns. (For example: the predicate column has 3 pages with 400 rows each, but the last page is filtered by the page index. When batch_size=992, the EOF of the predicate column is true. However, we should set batch_size=800 for the lazy read column, so the EOF of the lazy read column may be false.) 2. The array column does not count the number of nulls. 3. A wrong NullMap is generated for array columns.	2022-11-14 14:37:21 +08:00
xy720	035657c5a1	[typo](comment) Fix a lot of spell errors in be comments (#14208 ) fix typos.	2022-11-12 16:06:15 +08:00
Ashin Gau	6bd5378f66	[feature-wip](multi-catalog) lazy read for ParquetReader (#13917 ) Read the predicate columns first, and use VExprContext (push-down predicates) to generate the select vector, which is then applied when reading the non-predicate columns. Data in the non-predicate columns may be skipped by the select vector, so the value-decode time can be reduced. If a whole page can be skipped, the decompress time can also be reduced.	2022-11-10 16:56:14 +08:00
Tiewei Fang	43eb946543	[feature](table-valued-function)S3 table valued function supports parquet/orc/json file format #14130 S3 table valued function supports parquet/orc/json file format. For example: parquet format	2022-11-10 10:33:12 +08:00
slothever	c2a01e84b4	[feature-wip](multi-catalog) fix page index filter bug (#14015 ) Fix page index filter not taking effect when there are multiple columns Co-authored-by: jinzhe <jinzhe@selectdb.com>	2022-11-08 12:10:12 +08:00
Tiewei Fang	27549564a7	[feature](table-valued-function) Support S3 tvf (#13959 ) This PR does three things: 1. Modifies the framework of table-valued-function (tvf). 2. BE supports the `fetch_table_schema` rpc. 3. Implements the `S3(path, AK, SK, format)` table-valued-function.	2022-11-06 11:04:26 +08:00
Mingyu Chen	7b4c2cabb4	[feature](new-scan) support transactional insert in new scan framework (#13858 ) Support running transactional insert operations with the new scan framework, e.g.: admin set frontend config("enable_new_load_scan_node" = "true"); begin; insert into tbl1 values(1,2); insert into tbl1 values(3,4); insert into tbl1 values(5,6); commit; Add a limitation to transactional insert: non-literal values in the insert stmt are not supported. Fix some issues with the array type: forbid casting other non-array types to the NESTED array type, which may cause a BE crash; add a getStringValueForArray() method to Expr, to get a valid string-formatted array type value. Add useLocalSessionState=true to the regression-test jdbc url: without this config, the jdbc driver sends some init commands each time it connects to the server, such as select @@session.tx_read_only. But with transactional insert, after the begin command Doris does not support any stmt other than insert, commit, or rollback, so this config is added to stop the jdbc driver from sending commands when connecting.	2022-11-03 08:36:07 +08:00
Adonis Ling	ba918b40e2	[chore](macOS) Fix compilation errors caused by the deprecated function (#13890 )	2022-11-02 13:34:51 +08:00
Ashin Gau	e0667b297f	[feature-wip](multi-catalog) reuse hdfsFs and decode parquet values in batch (#13688 ) PR(https://github.com/apache/doris/pull/13404) made ParquetReader break up batch insertion when encountering null values, which leads to worse performance than OrcReader. This PR pushes the null map into the decode function, reducing the time spent on virtual function calls when encountering null values. Furthermore, hdfsFS is reused among file readers to reduce the time spent building connections to hdfs.	2022-10-28 15:52:52 +08:00
Tiewei Fang	c418bbd2d1	[feature-wip](new-scan) support Json reader (#13546 ) Issue Number: close #12574 This PR adds `NewJsonReader`, which implements the GenericReader interface to support reading json format files. TODO: 1. Modify `_scann_eof` later. 2. Rename `NewJsonReader` to `JsonReader` when `JsonReader` is deleted.	2022-10-26 12:52:21 +08:00
Jibing-Li	44c9163b3c	[Fix](multi-catalog)Fix partition external table query bug. (#13535 ) The index for external table columns from path is incorrect in new scanner. This is a fix for it. e.g. In the next query, nation and city columns are from path ``` mysql> select nation, city, count() from parquet_two_part group by nation, city; +--------+------------+----------+ \| nation \| city \| count() \| +--------+------------+----------+ \| cn \| beijing \| 1199969 \| \| cn \| shanghai \| 1199771 \| \| jp \| tokyo \| 599715 \| \| rus \| moscow \| 600659 \| \| us \| chicago \| 1199805 \| \| us \| washington \| 1201296 \| +--------+------------+----------+ 6 rows in set (0.39 sec) ```	2022-10-26 12:47:37 +08:00
Mingyu Chen	3a3def447d	[fix](csv-reader) fix bug that csv reader can not read text format hms table (#13515 ) 1. Missing field and line delimiter 2. When querying an external table with text (csv) format, we should pass the column position map to BE, otherwise the column order is wrong. TODO: 1. For now, if we query a csv file with a non-existent column, it will return null. But it should return null or the default value of that column. 2. Add a regression test after the hive docker is ready.	2022-10-22 22:40:03 +08:00
Mingyu Chen	32b1456b28	[feature-wip](array) remove array config and check array nested depth (#13428 ) 1. Remove FE config `enable_array_type` 2. Limit the nested depth of arrays on the FE side. 3. Fix a bug where, when loading an array from parquet, the decimal type is treated as bigint 4. Fix loading arrays from csv (vec-engine), handling null and "null" 5. Change the csv array loading behavior: if the array string format is invalid in csv, it will be converted to null. 6. Remove `check_array_format()`, because its logic is wrong and meaningless 7. Add stream load csv test cases and more parquet broker load tests	2022-10-20 15:52:31 +08:00
Ashin Gau	f7c69ade18	[feature-wip](multi-catalog) implement predicate pushdown in native OrcReader (#13453 ) # Proposed changes Implement predicate pushdown in `OrcReader` by converting doris `ColumnValueRange` to orc `SearchArgument`. ## Remaining problems 1. Orc supports `not in`, which may have an effect on the bloom filter. However, doris `ScanNode` does not push down `not in` to the file scanner. 2. Orc supports `is null`, and the row range has a `hasNull` identifier. However, `_contain_null` in `ColumnValueRange` is ambiguous: `_contain_null = true` only means that the value can be nullable, not that it equals null. 3. `DateTimeV2` loses microsecond precision in `ColumnValueRange`, which may cause filtering errors when a min-max value equals the predicate value. 4. `DateTimeV1` is not accurate enough, and is only saved to seconds. 5. Orc supports predicate pushdown for the `float&double` types, but doris does not push down `float&double` types for precision reasons.	2022-10-20 10:07:36 +08:00
Ashin Gau	21f233d7e7	[feature-wip](multi-catalog) use apache orc reader to read orc file (#13404 ) Use apache orc to read orc file, and convert ColumnVectorBatch to doris block.	2022-10-18 13:47:56 +08:00
Mingyu Chen	dbf71ed3be	[feature-wip](new-scan) Support stream load with csv in new scan framework (#13354 ) 1. Refactor the file reader creation in FileFactory, for simplicity. Previously, FileFactory had too many `create_file_reader` interfaces. Now unified into two categories: the interface used by the previous BrokerScanNode, and the interface used by the new FileScanNode. And separate the creation methods of readers that read `StreamLoadPipe` and other readers that read files. 2. Modify the StreamLoadPlanner on FE side to support using ExternalFileScanNode 3. Now for generic reader, the file reader will be created inside the reader, not passed from the outside. 4. Add some test cases for csv stream load, the behavior is same as the old broker scanner.	2022-10-17 23:33:41 +08:00
Pxl	632670a49c	[Enhancement](function) refactor of date function (#13362 ) refactor of date function	2022-10-16 14:31:26 +08:00
slothever	4fc7a048d2	[feature-wip](parquet-reader) fix string test and support decimal64 (#13184 ) 1. Refactor arguments list of parquet min max filter, pass parquet type for min max value parsing 2. Fix the filter of string min max Co-authored-by: jinzhe <jinzhe@selectdb.com>	2022-10-12 16:52:28 +08:00
Ashin Gau	bb4414e303	[feature-wip](multi-catalog) optimize parquet profile & add null map timer (#13257 ) Use indentation to make `ParquetReader`'s profile more readable Add `ParquetReader.DecodeNullMapTime` to show the time of parsing `NullMap` for `NullableColumn` ``` VFILE_SCAN_NODE (id=0):(Active: 279.62ms, % non-child: 85.83%) - FileReadBytes: 2.36 MB - FileReadCalls: 20 - FileReadTime: 5.686ms - MaxScannerThreadNum: 1 - NewlyCreateFreeBlocksNum: 125 - NumScanners: 1 - ParquetReader: 0ns - ColumnReadTime: 259.946ms - DecodeDictTime: 0ns - DecodeHeaderTime: 437.707us - DecodeLevelTime: 30.101us - DecodeNullMapTime: 53.295ms - DecodeValueTime: 62.607ms - DecompressCount: 511 - DecompressTime: 1.159ms - FilteredBytes: 0.00 - FilteredGroups: 0 - FilteredRowsByGroup: 0 - FilteredRowsByPage: 0 - ParseMetaTime: 22.517ms - ReadBytes: 2.36 MB - ReadGroups: 20 ```	2022-10-12 16:51:06 +08:00
Tiewei Fang	b7621e1615	[feature-wip](new-scan) support csv reader (#13282 ) Issue Number: close #12574 This pr adds CsvReader which implements GenericReader interface to support read csv format file.	2022-10-12 16:22:13 +08:00
Ashin Gau	dd089259be	[feature-wip](multi-catalog) Optimize the performance of boolean & dictionary decoding (#13212 ) Generate vector for dictionary data. Decode boolean values in batch.	2022-10-10 08:41:11 +08:00
Ashin Gau	b81a8789c3	[feature-wip](parquet-reader) optimize the performance of column conversion (#13122 ) Convert Parquet column into doris column via batch method. In the previous implementation, only numeric types can be converted in batches, and other types can only be inserted one by one. This process will generate repeated virtual function calls and container expansion.	2022-10-08 18:03:10 +08:00
slothever	5214e898d9	[fix](parquet-reader) skip date/datetime column predicate filter to avoid coredump (#13072 ) Will be fixed later Co-authored-by: jinzhe <jinzhe@selectdb.com>	2022-10-08 18:02:35 +08:00
Mingyu Chen	cf2b93532b	[fix](file-scanner) fix some logic about broker load with parquet with new file scanner (#13135 ) Fix some logic about broker load using new file scanner, with parquet format: 1. If columns are specified in load stmt, but none of them are in parquet file, error will be thrown like `err: No columns found in file`. See `parquet_s3_case4` 2. If the first column of table are not in table, the result number of rows is wrong. See `parquet_s3_case8` 3. If column specified in `columns` in load stmt does not exist in file and table, error will be thrown like: `failed to find default value expr for slot: x1`. See `parquet_s3_case2`	2022-10-08 13:08:08 +08:00
Mingyu Chen	d286aa7bf7	[fix](spark-load) no need to filter row group when doing spark load (#13116 ) 1. Fix issue #13115 2. Modify the `get_next_block` method of `GenericReader` to return "read_rows" explicitly. Some columns in a block may not be filled by the reader; if the first column is not filled, `block->rows()` cannot return the real number of rows. 3. Add more checks for broker load test cases.	2022-10-05 23:00:56 +08:00
Ashin Gau	026ffaf10d	[feature-wip](parquet-reader) add detail profile for parquet reader (#13095 ) Add more detail profile for ParquetReader: ParquetColumnReadTime: the total time of reading parquet columns ParquetDecodeDictTime: time to parse dictionary page ParquetDecodeHeaderTime: time to parse page header ParquetDecodeLevelTime: time to parse page's definition/repetition level ParquetDecodeValueTime: time to decode page data into doris column ParquetDecompressCount: counter of decompressing page data ParquetDecompressTime: time to decompress page data ParquetParseMetaTime: time to parse parquet meta data	2022-10-02 15:11:48 +08:00
slothever	820ec435ce	[feature-wip](parquet-reader) refactor parquet_predicate (#12896 ) This change serves the following purposes: 1. use ScanPredicate instead of TCondition for external table, it can reuse old code branch. 2. simplify and delete some useless old code 3. use ColumnValueRange to save predicate	2022-09-28 21:27:13 +08:00
Mingyu Chen	d80b7b9689	[feature-wip](new-scan) support more load situation (#12953 )	2022-09-27 21:48:32 +08:00
Ashin Gau	692176ec07	[feature-wip](parquet-reader) pre read page data in advance to avoid frequent seek (#12898 ) 1. Fix the bug of file position in `HdfsFileReader` 2. Reserve enough buffer for `ColumnColumnReader` to read large continuous memory	2022-09-25 21:21:06 +08:00
Jibing-Li	f1a64ea09f	[fix](new-scan)Fix new scanner load job bugs (#12903 ) Fix bugs: 1. FE needs to send the file format (e.g. parquet, orc ...) to BE while processing load jobs using the new scanner. 2. Try to get the parquet file column type from SchemaElement.type before getting it from the Logical type and Converted type.	2022-09-24 17:21:19 +08:00
Ashin Gau	5bfdfac387	[feature-wip](parquet-reader) add parquet reader profile (#12797 ) Add profile for parquet reader. New counters: - ParquetFilteredGroups: Filtered row groups by `RowGroup` min-max statistics - ParquetReadGroups: The number of row groups to read - ParquetFilteredRowsByGroup: The number of filtered rows by `RowGroup` min-max statistics - ParquetFilteredRowsByPage: The number of filtered rows by page min-max statistics - ParquetFilteredBytes: The filtered bytes by `RowGroup` min-max statistics - ParquetReadBytes: The total bytes in `ParquetReadGroups`, may be further filtered If a page is skipped as a whole ## Result ``` ┌──────────────────────────────────────────────────────┐ │[0: VFILE_SCAN_NODE] │ │(Active: 1s29ms, non-child: 96.42) │ │ - Counters: │ │ - BytesRead: 0.00 │ │ - FileReadCalls: 1.826K (1826) │ │ - FileReadTime: 510.627ms │ │ - FileRemoteReadBytes: 65.23 MB │ │ - FileRemoteReadCalls: 1.146K (1146) │ │ - FileRemoteReadRate: 128.29331970214844 MB/sec │ │ - FileRemoteReadTime: 508.469ms │ │ - NumDiskAccess: 0 │ │ - NumScanners: 1 │ │ - ParquetFilteredBytes: 0.00 │ │ - ParquetFilteredGroups: 0 │ │ - ParquetFilteredRowsByGroup: 0 │ │ - ParquetFilteredRowsByPage: 6.600003M (6600003)│ │ - ParquetReadBytes: 2.13 GB │ │ - ParquetReadGroups: 20 │ │ - PeakMemoryUsage: 0.00 │ │ - PredicateFilteredRows: 3.399797M (3399797) │ │ - PredicateFilteredTime: 133.302ms │ │ - RowsRead: 3.399997M (3399997) │ │ - RowsReturned: 200 │ │ - RowsReturnedRate: 194 │ │ - TotalRawReadTime(*): 726.566ms │ │ - TotalReadThroughput: 0.0 /sec │ │ - WaitScannerTime: 1s27ms │ └──────────────────────────────────────────────────────┘ ```	2022-09-23 18:42:14 +08:00
Jibing-Li	4b95b4e41d	[feature-wip](file-scanner)Get column type from parquet schema (#12833 ) Get schema from parquet reader. The new VFileScanner needs to get the file schema (column name to type map) from the parquet file while processing a load job. This PR sets the type information for parquet columns.	2022-09-22 09:35:37 +08:00
slothever	1ca6d559e4	[feature-wip](parquet-reader) refactor some arguments for parquet reader (#12771 ) refactor some arguments for parquet reader 1. Add new parquet context to wrap reader arguments 2. Reduced some arguments for function call Co-authored-by: jinzhe <jinzhe@selectdb.com>	2022-09-22 09:34:01 +08:00
Jibing-Li	ec2b3bf220	[feature-wip](new-scan)Refactor VFileScanner, support broker load, remove unused functions in VScanner base class. (#12793 ) Refactor of scanners. Support broker load. This PR is part of the scanner refactoring tasks. It provides support for broker load using the new VFileScanner. Work still in progress.	2022-09-21 12:49:56 +08:00
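
Commit 25de068a05 (#14558) describes null-map run lengths stored as `uint16_t` overflowing once merged batches exceed `USHRT_MAX` rows. The sketch below shows one possible way to keep each stored run within `uint16_t` while preserving an alternating non-null/null run layout. It is an illustration only: `append_run` and `total_rows` are hypothetical helpers, the alternating layout is an assumption, and none of this is taken from the Doris source or claimed to be what the PR does.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper (not Doris code): append a run of `len` rows to a list
// of alternating run lengths (non-null, null, non-null, ...). A run that does
// not fit in uint16_t is split into chunks of at most 65535 rows, separated
// by zero-length runs of the opposite kind so the alternation stays intact.
void append_run(std::vector<uint16_t>& runs, size_t len) {
    constexpr size_t kMax = std::numeric_limits<uint16_t>::max();
    while (len > kMax) {
        runs.push_back(static_cast<uint16_t>(kMax));
        runs.push_back(0); // empty run of the opposite kind
        len -= kMax;
    }
    runs.push_back(static_cast<uint16_t>(len));
}

// Sum of all stored runs, useful to check that no rows were dropped.
size_t total_rows(const std::vector<uint16_t>& runs) {
    size_t total = 0;
    for (uint16_t r : runs) {
        total += r;
    }
    return total;
}
```

With this layout, a single run of 70000 rows is stored as three entries (65535, 0, 4465) instead of wrapping around, and a consumer that already handles zero-length runs needs no other change.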

76 Commits