doris

Author	SHA1	Message	Date
Ashin Gau	6bd5378f66	[feature-wip](multi-catalog) lazy read for ParquetReader (#13917 ) Read predicate columns firstly, and use VExprContext(push-down predicates) to generate the select vector, which is then applied to read the non-predicate columns. The data in non-predicate columns may be skipped by select vector, so the value-decode-time can be reduced. If a whole page can be skipped, the decompress-time can also be reduced.	2022-11-10 16:56:14 +08:00
Mingyu Chen	7b4c2cabb4	[feature](new-scan) support transactional insert in new scan framework (#13858 ) Support running transactional insert operation with new scan framework. eg: admin set frontend config("enable_new_load_scan_node" = "true"); begin; insert into tbl1 values(1,2); insert into tbl1 values(3,4); insert into tbl1 values(5,6); commit; Add some limitation to transactional insert Do not support non-literal value in insert stmt Fix some issue about array type: Forbid cast other non-array type to NESTED array type, it may cause BE crash. Add getStringValueForArray() method for Expr, to get valid string-formatted array type value. Add useLocalSessionState=true in regression-test jdbc url without this config, the jdbc driver will send some init cmd each time it connect to server, such as select @@session.tx_read_only. But when we use transactional insert, after begin command, Doris do not support any other type of stmt except for insert, commit or rollback. So adding this config to let the jdbc NOT send cmd when connecting.	2022-11-03 08:36:07 +08:00
Tiewei Fang	c418bbd2d1	[feature-wip](new-scan) support Json reader (#13546 ) Issue Number: close #12574 This pr adds `NewJsonReader` which implements GenericReader interface to support read json format file. TODO: 1. modify `_scann_eof` later. 2. Rename `NewJsonReader` to `JsonReader` when `JsonReader` is deleted.	2022-10-26 12:52:21 +08:00
Jibing-Li	44c9163b3c	[Fix](multi-catalog)Fix partition external table query bug. (#13535 ) The index for external table columns from path is incorrect in new scanner. This is a fix for it. e.g. In the next query, nation and city columns are from path ``` mysql> select nation, city, count() from parquet_two_part group by nation, city; +--------+------------+----------+ \| nation \| city \| count() \| +--------+------------+----------+ \| cn \| beijing \| 1199969 \| \| cn \| shanghai \| 1199771 \| \| jp \| tokyo \| 599715 \| \| rus \| moscow \| 600659 \| \| us \| chicago \| 1199805 \| \| us \| washington \| 1201296 \| +--------+------------+----------+ 6 rows in set (0.39 sec) ```	2022-10-26 12:47:37 +08:00
Mingyu Chen	3a3def447d	[fix](csv-reader) fix bug that csv reader can not read text format hms table (#13515 ) 1. Missing field and line delimiter 2. When query external table with text(csv) format, we should pass the column position map to BE, otherwise the column order is wrong. TODO: 1. For now, if we query csv file with non-exist column, it will return null. But it should return null or default value of that column. 2. Add regression test after hive docker is ready.	2022-10-22 22:40:03 +08:00
Mingyu Chen	32b1456b28	[feature-wip](array) remove array config and check array nested depth (#13428 ) 1. remove FE config `enable_array_type` 2. limit the nested depth of array in FE side. 3. Fix bug that when loading array from parquet, the decimal type is treated as bigint 4. Fix loading array from csv(vec-engine), handle null and "null" 5. Change the csv array loading behavior, if the array string format is invalid in csv, it will be converted to null. 6. Remove `check_array_format()`, because it's logic is wrong and meaningless 7. Add stream load csv test cases and more parquet broker load tests	2022-10-20 15:52:31 +08:00
Ashin Gau	21f233d7e7	[feature-wip](multi-catalog) use apache orc reader to read orc file (#13404 ) Use apache orc to read orc file, and convert ColumnVectorBatch to doris block.	2022-10-18 13:47:56 +08:00
Mingyu Chen	dbf71ed3be	[feature-wip](new-scan) Support stream load with csv in new scan framework (#13354 ) 1. Refactor the file reader creation in FileFactory, for simplicity. Previously, FileFactory had too many `create_file_reader` interfaces. Now unified into two categories: the interface used by the previous BrokerScanNode, and the interface used by the new FileScanNode. And separate the creation methods of readers that read `StreamLoadPipe` and other readers that read files. 2. Modify the StreamLoadPlanner on FE side to support using ExternalFileScanNode 3. Now for generic reader, the file reader will be created inside the reader, not passed from the outside. 4. Add some test cases for csv stream load, the behavior is same as the old broker scanner.	2022-10-17 23:33:41 +08:00
Tiewei Fang	b7621e1615	[feature-wip](new-scan) support csv reader (#13282 ) Issue Number: close #12574 This pr adds CsvReader which implements GenericReader interface to support read csv format file.	2022-10-12 16:22:13 +08:00
Xinyi Zou	df54c6b63a	[enhancement](memtracker) Add independent and unique scanner mem tracker for each query (#13262 )	2022-10-11 19:47:12 +08:00
Mingyu Chen	cf2b93532b	[fix](file-scanner) fix some logic about broker load with parquet with new file scanner (#13135 ) Fix some logic about broker load using new file scanner, with parquet format: 1. If columns are specified in load stmt, but none of them are in parquet file, error will be thrown like `err: No columns found in file`. See `parquet_s3_case4` 2. If the first column of table are not in table, the result number of rows is wrong. See `parquet_s3_case8` 3. If column specified in `columns` in load stmt does not exist in file and table, error will be thrown like: `failed to find default value expr for slot: x1`. See `parquet_s3_case2`	2022-10-08 13:08:08 +08:00
Mingyu Chen	d286aa7bf7	[fix](spark-load) no need to filter row group when doing spark load (#13116 ) 1. Fix issue #13115 2. Modify the method of `get_next_block` or `GenericReader`, to return "read_rows" explicitly. Some columns in block may not be filled in reader, if the first column is not filled, use `block->rows()` can not return real row numbers. 3. Add more checks for broker load test cases.	2022-10-05 23:00:56 +08:00
Ashin Gau	026ffaf10d	[feature-wip](parquet-reader) add detail profile for parquet reader (#13095 ) Add more detail profile for ParquetReader: ParquetColumnReadTime: the total time of reading parquet columns ParquetDecodeDictTime: time to parse dictionary page ParquetDecodeHeaderTime: time to parse page header ParquetDecodeLevelTime: time to parse page's definition/repetition level ParquetDecodeValueTime: time to decode page data into doris column ParquetDecompressCount: counter of decompressing page data ParquetDecompressTime: time to decompress page data ParquetParseMetaTime: time to parse parquet meta data	2022-10-02 15:11:48 +08:00
slothever	820ec435ce	[feature-wip](parquet-reader) refactor parquet_predicate (#12896 ) This change serves the following purposes: 1. use ScanPredicate instead of TCondition for external table, it can reuse old code branch. 2. simplify and delete some useless old code 3. use ColumnValueRange to save predicate	2022-09-28 21:27:13 +08:00
Mingyu Chen	d80b7b9689	[feature-wip](new-scan) support more load situation (#12953 )	2022-09-27 21:48:32 +08:00
Ashin Gau	5bfdfac387	[feature-wip](parquet-reader) add parquet reader profile (#12797 ) Add profile for parquet reader. New counters: - ParquetFilteredGroups: Filtered row groups by `RowGroup` min-max statistics - ParquetReadGroups: The number of row groups to read - ParquetFilteredRowsByGroup: The number of filtered rows by `RowGroup` min-max statistics - ParquetFilteredRowsByPage: The number of filtered rows by page min-max statistics - ParquetFilteredBytes: The filtered bytes by `RowGroup` min-max statistics - ParquetReadBytes: The total bytes in `ParquetReadGroups`, may be further filtered If a page is skipped as a whole ## Result ``` ┌──────────────────────────────────────────────────────┐ │[0: VFILE_SCAN_NODE] │ │(Active: 1s29ms, non-child: 96.42) │ │ - Counters: │ │ - BytesRead: 0.00 │ │ - FileReadCalls: 1.826K (1826) │ │ - FileReadTime: 510.627ms │ │ - FileRemoteReadBytes: 65.23 MB │ │ - FileRemoteReadCalls: 1.146K (1146) │ │ - FileRemoteReadRate: 128.29331970214844 MB/sec │ │ - FileRemoteReadTime: 508.469ms │ │ - NumDiskAccess: 0 │ │ - NumScanners: 1 │ │ - ParquetFilteredBytes: 0.00 │ │ - ParquetFilteredGroups: 0 │ │ - ParquetFilteredRowsByGroup: 0 │ │ - ParquetFilteredRowsByPage: 6.600003M (6600003)│ │ - ParquetReadBytes: 2.13 GB │ │ - ParquetReadGroups: 20 │ │ - PeakMemoryUsage: 0.00 │ │ - PredicateFilteredRows: 3.399797M (3399797) │ │ - PredicateFilteredTime: 133.302ms │ │ - RowsRead: 3.399997M (3399997) │ │ - RowsReturned: 200 │ │ - RowsReturnedRate: 194 │ │ - TotalRawReadTime(*): 726.566ms │ │ - TotalReadThroughput: 0.0 /sec │ │ - WaitScannerTime: 1s27ms │ └──────────────────────────────────────────────────────┘ ```	2022-09-23 18:42:14 +08:00
slothever	1ca6d559e4	[feature-wip](parquet-reader) refactor some arguments for parquet reader (#12771 ) refactor some arguments for parquet reader 1. Add new parquet context to wrap reader arguments 2. Reduced some arguments for function call Co-authored-by: jinzhe <jinzhe@selectdb.com>	2022-09-22 09:34:01 +08:00
Jibing-Li	fbdebe2424	[feature-wip](new-scan)Add load counter for VFileScanner (#12812 ) The new scanner (VFileScanner) need a counter to record two values in load job. 1. The number of rows unselected by pre-filter, and 2. The number of rows filtered by unmatched schema or other error. This pr is to implement the counter.	2022-09-21 20:59:13 +08:00
Jibing-Li	ec2b3bf220	[feature-wip](new-scan)Refactor VFileScanner, support broker load, remove unused functions in VScanner base class. (#12793 ) Refactor of scanners. Support broker load. This pr is part of the refactor scanner tasks. It provide support for borker load using new VFileScanner. Work still in progress.	2022-09-21 12:49:56 +08:00
Jibing-Li	5978fd9647	[refactor](file scanner)Refactor file scanner. (#12602 ) Refactor the scanners for hms external catalog, work in progress. Use VFileScanner, will remove NewFileParquetScanner, NewFileOrcScanner and NewFileTextScanner after fully tested. Query for parquet file has been tested, still need to add readers for orc file, text file and load logic as well.	2022-09-19 15:23:51 +08:00

20 Commits