doris

Author	SHA1	Message	Date
slothever	455c800405	[feature](parquet-reader) add rle bool and delta decoder to read AWS Glue (#17112 ) Support delta encoding and rle(bool) to read Glue data add delta bit pack decoder, add delta length byte array decoder, add delta byte array decoder. add rle bool decoder. We find some data type is read with delta encoding on AWS Glue, so it should be supported. The definition of delta encoding can refer to the delta encoding in parquet.	2023-03-12 20:09:58 +08:00
Xin Liao	8001d65811	[fix](insert) fix memory leak for insert transaction (#17530 )	2023-03-08 14:10:59 +08:00
Ashin Gau	dca16796ad	[fix](ParquetReader) definition level of repeated parent is wrong (#17337 ) Fix three bugs: 1. `repeated_parent_def_level ` should be the definition of its repeated parent. 2. Failed to parse schema like `decimal(p, s)` 3. Fill wrong offsets for array type	2023-03-06 18:15:57 +08:00
yiguolei	9477c48ef8	[refactor](functioncontext) remove duplicate type definition in function context (#17421 ) remove duplicate type definition in function context remove unused method in function context not need stale state in vexpr context because vexpr is stateless and function context saves state and they are cloned. remove useless slot_size in all tuple or slot descriptor. remove doris_udf namespace, it is useless. remove some unused macro definitions. init v_conjuncts in vscanner, not need write the same code in every scanner. using unique ptr to manage function context since it could only belong to a single expr context. Issue Number: close #xxx --------- Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-03-06 16:07:09 +08:00
luozenglin	e7cba11680	[fix](array)(parquet) fix be core dump due to load from parquet file containing array types (#17298 )	2023-03-06 15:18:42 +08:00
Mingyu Chen	3d0beec01d	[fix](orc) fix heap-use-after-free and potential memory leak of orc reader (#17431 ) Heap-use-after-free: the OrcReader has an internal FileInputStream, and if the file is empty, the memory of the FileInputStream leaks. In addition, FileInputStream holds a Statistics instance. The FileInputStream may be deleted if the ORC reader fails to initialize, but the Statistics instance may still be used when the ORC reader is closed, causing a heap-use-after-free error. Potential memory leak: when the file scanner is initialized in the file scan node and its prepare step fails, the memory of the file scanner leaks.	2023-03-06 08:42:35 +08:00
HappenLee	1244eed1cd	[Opt](exec) opt the dispose nullable column logic (#17192 )	2023-03-01 23:25:40 +08:00
Tiewei Fang	f1db0d9501	[Enhencement](File Reader) delete old file_reader (#17261 ) * delete old file_reader * fix 1	2023-03-01 20:24:03 +08:00
Ashin Gau	bf5037d6d5	[fix](OrcReader) typo in anaylize null values (#17156 ) typographical error in analyzing null values for OrcReader.	2023-02-28 14:29:13 +08:00
slothever	598038e674	[improvement](parquet-reader)support parquet data page v2 (#17054 ) Support parquet data page v2 Now the parquet data on AWS glue use data page v2, but we didn't support before.	2023-02-28 14:23:45 +08:00
Pxl	0723e55f76	[Bug](build) fix compile fail on unused value #17165 error: variable 'nullcount' set but not used [-Werror,-Wunused-but-set-variable] int nullcount = 0;	2023-02-27 14:19:44 +08:00
lihangyu	29dc08fc45	[Optimize](simd json reader) Cached search results for previous row (keyed as index in JSON object) - used as a hint. (#17124 ) * [Optimize](simd json reader) Cached search results for previous row (keyed as index in JSON object) - used as a hint. `_simdjson_set_column_value` could become a hot spot while parsing json in simdjson mode, introduce `_prev_positions` to cache results for previous row (keyed as index in JSON object) due to the json name field order, should be quite the same between each lines * fix case	2023-02-27 10:39:22 +08:00
zxealous	a0782a1855	[fix](file reader) fix be core in broker file reader (#17039 ) A const reference member variable stored a temporary object, which cannot be accessed after the temporary is destroyed. This caused a BE core dump when debug-level logging was enabled, because _broker_addr had already been destroyed in BrokerFileReader.	2023-02-26 12:35:31 +08:00
Tiewei Fang	f6ce072297	[Enhencement](csv-reader) Optimize csv_reader `_split_value` and fix json_reader case sensitive (#17093 ) 1. Enhancement: for a single-character column separator, csv_reader uses a different method to split values. 2. Bug fix: make loading of the `json` file format case-sensitive.	2023-02-26 09:03:04 +08:00
Ashin Gau	c43e521d29	[feature](multi-catalog) support map&struct type in parquet&orc reader (#17087 ) Support parsing map&struct type in parquet&orc reader. ## Remaining Problems 1. Doris use array type to build the key and value column of a `map`, but doesn't fill the offsets in value column, so the offsets in value column is wasted. 2. Parquet support reading only key or value column in `map`, this PR hasn't supported yet. 3. Parquet support reading partial columns in `struct`, this PR hasn't supported yet.	2023-02-26 08:55:39 +08:00
Ashin Gau	e42465ae59	[fix](OrcReader) handle null values in orc reader for string type (#17135 ) ORC doesn't fill null values in a new batch, but the former batch has already been released. Other types such as int/long/timestamp... are flat types that contain no pointers, so they do not need to be handled separately the way string does.	2023-02-26 08:10:40 +08:00
Ashin Gau	3ea6478ba8	[feature](multi-catalog) parquet reader support nested array column (#16961 ) Support to decode nested array column in parquet reader: 1. FE should generate the right nested column type. FE doesn't check the nesting depth and legality, like map\<array\<int\>, int\>. 2. `ParquetColumnReader` has removed the filtering of page index to support nested array type. It's too difficult to skip values in nested complex types. Maybe we should support the filtering of page index and lazy read in later PR. 3. `ExternalFileScanNode` has a bug in creating default value expression. 4. Maybe it's slow to read repetition levels in a while loop. I'll optimize this in next PR. 5. Array column has temporary `SchemaElement` in its thrift definition, we have removed them and keep its parent in former implementation. The remaining parent should inherit the repetition and definition level of its child.	2023-02-23 14:54:58 +08:00
Qi Chen	61826e3a77	[Improvement](parquet-reader) Improve performance of parquet reader filter calculation. (#16934 ) Improve performance of parquet reader filter calculation. - Use `filter_data` instead of `(*filter_ptr)` to merge filter to improve performance. - Use mutable column filter func instead of original new column filter func which introduced by #16850. - Avoid column ref-count increasing which caused unnecessary copying by passing column pointer ref.	2023-02-23 14:41:30 +08:00
zxealous	29c46d6926	[fix](struct-type) fix be core when load array orc file (#16978 ) * fix be core when load array orc file	2023-02-22 10:15:39 +08:00
Adonis Ling	4cb97b6fb7	[chore](macOS) Fix linkage errors for the release build (#17002 ) Issue Number: close #17003 ## Problem summary The linker couldn't find some symbols because the implementation of a template member function doris::vectorized::Decoder::init_decimal_converter is missing in the header file in which the corresponding declaration is placed.	2023-02-22 10:01:51 +08:00
Mingyu Chen	491d269412	[fix](tvf) fix bug that failed to get schema of tvf when file is empty (#16928 ) In previous implementation, when querying tvf, FE will get schema from BE. And BE will try to open the first file to get its schema info, but for orc or parquet format, if the file is empty, it will return error. But even for an empty file, we can still get schema info from file's footer. So we should handle the empty file to get schema info correctly. Also modify the catalog doc to add some FAQ.	2023-02-21 14:14:32 +08:00
lihangyu	113023fb86	(Enhancement)[load-json] support simdjson in new json reader (#16903 ) be config: enable_simdjson_reader=true related PR #11665	2023-02-21 11:31:00 +08:00
Qi Chen	a46941c684	[Fix](multi-catalog) Fix switch-case fall-through issue in multi-catalog module. (#16931 ) Fix switch-case fall-through issue in multi-catalog module.	2023-02-20 21:35:41 +08:00
Qi Chen	ef2fdb79bb	[Improvement](parquet-reader) Optimize and refactor parquet reader to improve performance. (#16818 ) Optimize and refactor parquet reader to improve performance. - Improve 2x performance for small dict string by aligned copying. - Refactor code to decrease condition(if) checking. - Don't call skip(0). - Don't read page index if no condition. ssb-flat-100: (single-machine, single-thread) \| Query \| before opt \| after opt \| \| ------------- \|:-------------:\| ---------:\| \| SELECT count(lo_revenue) FROM lineorder_flat \| 9.23 \| 9.12 \| \| SELECT count(lo_linenumber) FROM lineorder_flat \| 4.50 \| 4.36 \| \| SELECT count(c_name) FROM lineorder_flat \| 18.22 \| 17.88\| \| SELECT count(lo_shipmode) FROM lineorder_flat \|10.09 \| 6.15\|	2023-02-20 11:42:29 +08:00
Jibing-Li	292926e5aa	[Fix](multi catalog)Fix partition case bug (#16763 ) Set column names from path to lower case in case-insensitive case. This is for Iceberg columns from path. Iceberg columns are case sensitive, which may cause error for table with partitions.	2023-02-16 15:47:23 +08:00
Jibing-Li	de8d884ec3	[Fix](multi catalog)Fix iceberg parquet file doesn't have iceberg.schema meta problem (#16764 ) To support schema evolution, Iceberg add schema information to Parquet file metadata. But for early iceberg version, it doesn't write any schema information to Parquet file. This PR is to support read parquet without schema information.	2023-02-16 00:08:59 +08:00
Jibing-Li	0d9714b179	[Fix](multi catalog)Support read hive1.x orc file. (#16677 ) Hive 1.x may write ORC files with internal column names (_col0, _col1, _col2...). This causes query results to be NULL because the column names in the ORC file don't match the column names in the Doris table schema. This PR supports querying Hive ORC files with internal column names. So far we haven't seen this problem in Parquet files; a new PR will be sent to fix Parquet if the problem shows up in the future.	2023-02-14 14:32:27 +08:00
lihangyu	37d1519316	[WIP](dynamic-table) support dynamic schema table (#16335 ) Issue Number: close #16351 A dynamic schema table is a special type of table whose schema changes during the loading procedure. This feature is currently implemented mainly for semi-structured data such as JSON: since JSON is self-describing, we can extract schema info from the original documents and infer the final type information. This special table reduces manual schema change operations, makes it easy to import semi-structured data, and extends its schema automatically.	2023-02-11 13:37:50 +08:00
Xinyi Zou	c1a1275870	[fix](memory) Fix parquet load stack overflow (#16537 )	2023-02-10 08:48:12 +08:00
Ashin Gau	27216dc7e0	[improvement](multi-catalog) push down all predicates into rowgroup/page filtering for ParquetReader (#16388 ) Two improvements: 1. Refactor rowgroup&page filtering in `ParquetReader`, and use the operator overloading of Doris native C++ types to process comparisons. 2. Support decimal/decimal v3/date/datev2/datetime/datetimev2	2023-02-07 11:32:57 +08:00
slothever	b1b2697cc7	[fix](iceberg) fix iceberg catalog (#16372 ) 1. Fix iceberg catalog access s3 2. Fix iceberg catalog partition table query 3. Fix persistence	2023-02-05 13:15:28 +08:00
luozenglin	d2b5015d3f	[enhancement](profile) add the profile counter RawRowsRead to record the rows read from the parquet file (#16328 )	2023-02-04 22:59:34 +08:00
Pxl	5e4bb98900	[Chore](build) enable -Wpedantic and update lowest gcc version to 11.1 (#16290 ) enable -Wpedantic and update lowest gcc version to 11.1	2023-02-03 11:28:48 +08:00
Ashin Gau	9618427020	[improvement](multi-catalog) increase default batch_size to 4064 (#16326 ) The performance of ClickBench Q30 is affected by batch_size: \| batch_size \| 1024 \| 4096 \| 20480 \| \| -- \| -- \| -- \| -- \| \| Q30 query time \| 2.27 \| 1.08 \| 0.62 \| Because aggregation operator will create a new result block for each batch block, and Q30 has 90 columns, which is time-consuming. Larger batch_size will decrease the number of aggregation blocks, so the larger batch_size will improve performance. Doris internal reader will read at least 4064 rows even if batch_size < 4064, so this PR keep the process of reading external table the same as internal table.	2023-02-02 11:51:09 +08:00
Ashin Gau	1c5279d26e	[fix](multi-catalog) remove the eof check among parquet columns (#16302 ) Read parquet file failed: ``` ERROR 1105 (HY000): errCode = 2, detailMessage = [INTERNAL_ERROR]Read parquet file xxx failed, reason = [CORRUPTION]The number of rows are not equal among parquet columns ``` This error may be thrown when reading non-predicate columns in lazy-read. For example: a row group with 1000 rows has two non-predicate columns. Column A has one page; column B has two pages with 500 rows each. The read range of `ParquetColumnReader` is [0, 400), and the rows in [0, 450) are all filtered by predicate columns. So column A can skip its first page and reach EOF, while column B can also skip its first page but does not reach EOF.	2023-02-02 09:22:09 +08:00
huangzhaowei	b878a7e61e	[feature](Load)Suppot skip specific lines number for csv stream load (#16055 ) Support set skip line number for stream load to load csv file. Usage `-H skip_lines:number`: ``` curl --location-trusted -u root: -T test.csv -H skip_lines:5 -XPUT http://127.0.0.1:8030/api/testDb/testTbl/_stream_load ``` Skip line number also can be used in mysql load as below: ```sql LOAD DATA LOCAL INFILE '${mysql_load_skip_lines}' INTO TABLE ${tableName} COLUMNS TERMINATED BY ',' IGNORE 2 LINES PROPERTIES ("auth" = "root:"); ```	2023-02-01 20:42:43 +08:00
Qi Chen	fa14b7ea9c	[Enhancement](icebergv2) Optimize the position delete file filtering mechanism in iceberg v2 parquet reader (#16024 ) close #16023	2023-01-28 00:04:27 +08:00
Jibing-Li	1589d453a3	[fix](multi catalog)Support parquet and orc upper case column name (#16111 ) External hms catalog table column names in doris are all in lower case, while iceberg table or spark-sql created hive table may contain upper case column name, which will cause empty query result. This pr is to fix this bug. 1. For parquet file, transfer all column names to lower case while parse parquet metadata. 2. For orc file, store the origin column names and lower case column names in two vectors, use the suitable names in different cases. 3. FE side, change the column name back to the origin column name in iceberg while doing convertToIcebergExpr.	2023-01-27 23:52:11 +08:00
Mingyu Chen	23edb3de5a	[fix](icebergv2) fix bug that delete file reader is not opened (#16133 ) This pr #15836 change the way to use parquet reader by first open() then init_reader(). But we forgot to call open() for iceberg delete file, which cause coredump.	2023-01-24 10:19:46 +08:00
ZhaoChangle	199d7d3be8	[Refactor]Merged string_value into string_ref (#15925 )	2023-01-22 16:39:23 +08:00
Ashin Gau	de12957057	[debug](ParquetReader) print file path if failed to read parquet file (#16118 )	2023-01-21 08:05:17 +08:00
Jibing-Li	3ebc98228d	[feature wip](multi catalog)Support iceberg schema evolution. (#15836 ) Support iceberg schema evolution for parquet file format. Iceberg uses a unique id for each column to support schema evolution. To support this feature in Doris, the FE side needs to get the current column id for each column and send the ids to the BE side. BE reads the column ids from the parquet key_value_metadata, sets the changed column names in the Block to match the names in the parquet file before reading data, and sets the names back after reading data.	2023-01-20 12:57:36 +08:00
Pxl	b727033906	[Chore](build) enable -Wextra and remove some -Wno (#15760 ) enable -Wextra and remove some -Wno	2023-01-15 10:40:35 +08:00
Ashin Gau	34bb9cd5d3	[fix](parquet-reader) fix coredump when load datatime data to doris from parquet (#15794 ) `date_time_v2` will check scale when constructed datatimev2: ``` LOG(FATAL) << fmt::format("Scale {} is out of bounds", scale); ``` This [PR](https://github.com/apache/doris/pull/15510) has fixed this issue, but parquet does not use constructor to create `TypeDescriptor`, leading the `scale = -1` when reading datetimev2 data.	2023-01-13 11:51:11 +08:00
Tiewei Fang	f17d69e450	[feature](file cache)Import `file cache` for remote file reader (#15622 ) The main purpose of this PR is to introduce `fileCache` for lakehouse reads of remote files. The local disk is used as a cache for remote files, so the next time a file is read, the data can be obtained directly from the local disk. In addition, this PR includes a few other minor changes. Import File Cache: 1. The imported `fileCache` is called `block_file_cache`, which uses an LRU replacement policy. 2. Implement a new FileReader, `CachedRemoteFilereader`, so that the logic of `file cache` is hidden under `CachedRemoteFilereader`. Other changes: 1. Add a new interface `fs()` for `FileReader`. 2. `IOContext` adds some statistics to track `FileCache` usage. Co-authored-by: Lightman <31928846+Lchangliang@users.noreply.github.com>	2023-01-10 12:23:56 +08:00
Ashin Gau	707eab9a63	[opt](multi-catalog) cache and reuse position delete rows in iceberg v2 (#15670 ) A delete file may apply to multiple data files. Each data file reads its delete files in full, so a delete file may be read repeatedly. Delete files can be cached so that multiple data files reuse the content from the first read. Performance improves by 60% in the single-thread case and by 30% in the multithreaded case.	2023-01-07 22:29:11 +08:00
Mingyu Chen	4075e3aec6	[fix](csv-reader) fix new csv reader's performance issue (#15581 )	2023-01-04 18:25:08 +08:00
Ashin Gau	50f1931f96	[fix](multi-catalog) get dictionary-encode from parquet metadata (#15525 )	2022-12-31 19:08:10 +08:00
Ashin Gau	2c8de30cce	[optimize](multi-catalog) use dictionary encode&filter to process delete files (#15441 ) Optimize PR #14470 has used `Expr` to filter delete rows to match current data file, but the rows in the delete file are [sorted by file_path then position](https://iceberg.apache.org/spec/#position-delete-files) to optimize filtering rows while scanning, so this PR remove `Expr` and use binary search to filter delete rows. In addition, delete files are likely to be encoded in dictionary, it's time-consuming to decode `file_path` columns into `ColumnString`, so this PR use `ColumnDictionary` to read `file_path` column. After testing, the performance of iceberg v2's MOR is improved by 30%+. Fix Bug Lazy-read-block may not have the filter column, if the whole group is filtered by `Expr` and the batch_eof is generated from next batch.	2022-12-30 08:57:55 +08:00
luozenglin	f8bb8c7829	[fix](broker) fix be core dump caused by broker load (#15390 ) * [fix](broker) fix be core dump caused by broker load	2022-12-28 10:57:41 +08:00
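
Note on commit 2c8de30cce: the message says position delete rows are sorted by file_path and then position, which lets a reader use binary search instead of an `Expr` filter. The following is a minimal Python sketch of that idea only, not the Doris implementation (which is C++ and reads `file_path` through `ColumnDictionary`). The function names `delete_positions_for` and `surviving_rows` are hypothetical, and the two parallel lists stand in for the columns of one delete file.

```python
from bisect import bisect_left, bisect_right


def delete_positions_for(file_paths, positions, data_file):
    """Return the deleted row positions that belong to data_file.

    file_paths and positions are parallel columns of one position delete
    file, assumed to be sorted by (file_path, position).
    """
    lo = bisect_left(file_paths, data_file)   # first row for data_file
    hi = bisect_right(file_paths, data_file)  # one past its last row
    return positions[lo:hi]


def surviving_rows(first_row, num_rows, deleted):
    """Return the row indices in [first_row, first_row + num_rows) that
    are not listed in deleted (a sorted list of positions)."""
    start = bisect_left(deleted, first_row)
    end = bisect_left(deleted, first_row + num_rows)
    dead = set(deleted[start:end])  # only the deletes inside this range
    return [r for r in range(first_row, first_row + num_rows) if r not in dead]


if __name__ == "__main__":
    paths = ["a.parquet", "a.parquet", "b.parquet", "b.parquet", "b.parquet"]
    pos = [3, 7, 0, 2, 9]
    deleted = delete_positions_for(paths, pos, "b.parquet")
    print(deleted)
    print(surviving_rows(0, 4, deleted))
```

Because both lookups are binary searches over already-sorted data, the cost per data file grows with the logarithm of the delete file size plus the number of matching rows, rather than with a full scan.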


133 Commits
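
Note on commit 29dc08fc45: the message describes caching, for each key index in a JSON object, the lookup result from the previous row and using it as a hint, since field order is usually the same from line to line. The Python sketch below illustrates only that hinting pattern. The class `HintedColumnLookup` and its members are hypothetical names; the real code is C++ in `_simdjson_set_column_value` with a `_prev_positions` cache. In Python a dict lookup is already cheap, so this shows the control flow rather than the speedup.

```python
class HintedColumnLookup:
    """Resolve JSON object keys to column indexes, trying the previous
    row's result at the same key position before a full lookup."""

    def __init__(self, column_names):
        self.column_names = list(column_names)
        self.index_of = {name: i for i, name in enumerate(self.column_names)}
        # prev_positions[i] is the column index matched by the i-th key
        # of the previous row, or -1 if that key matched no column.
        self.prev_positions = []
        self.hint_hits = 0

    def resolve(self, keys):
        """Map each key of one JSON object to a column index, or -1."""
        resolved = []
        for i, key in enumerate(keys):
            col = -1
            if i < len(self.prev_positions):
                hint = self.prev_positions[i]
                if hint >= 0 and self.column_names[hint] == key:
                    col = hint  # same key at the same position as last row
                    self.hint_hits += 1
            if col < 0:
                col = self.index_of.get(key, -1)  # fall back to full lookup
            resolved.append(col)
        self.prev_positions = resolved
        return resolved


if __name__ == "__main__":
    lookup = HintedColumnLookup(["id", "name", "city"])
    print(lookup.resolve(["id", "city", "name"]))
    print(lookup.resolve(["id", "city", "name"]), lookup.hint_hits)
```

A stale hint costs one extra string comparison before the fallback, so rows whose key order changes still resolve correctly.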