Commit Graph

413 Commits

Author SHA1 Message Date
a44a274563 [Fix](parquet-reader) Fix and optimize parquet min-max filtering. (#39375)
Backport #38277.
2024-08-15 14:12:54 +08:00
677435cef8 [Pick](Branch-2.1) pick json reader fix and support specifying $. as a column (#39271)
#39206
#38213
2024-08-13 17:44:45 +08:00
3da2d1c9d6 [bug](parquet)Fix the problem that the parquet reader fails when reading missing sub-columns of a struct. (#38718) (#39192)
bp #38718
2024-08-11 20:37:40 +08:00
607c0b82a9 [opt](serde)Optimize the filling of fixed values into block columns without repeated deserialization. (#37377) (#38245) (#38810)
## Proposed changes
pick pr #38575 and fix the bug of pr #38245
2024-08-05 09:13:08 +08:00
5d02c48715 [feature](hive)Support reading renamed Parquet Hive and Orc Hive tables. (#38432) (#38809)
bp #38432 

## Proposed changes
Add `hive_parquet_use_column_names` and `hive_orc_use_column_names`
session variables to read a table after renaming a column in `Hive`.

These two session variables are referenced from
`parquet_use_column_names` and `orc_use_column_names` of `Trino` hive
connector.

By default, these two session variables are true. When they are set to
false, reading orc/parquet will access the columns according to the
ordinal position in the Hive table definition.

For example:
```mysql
in Hive:
hive> create table tmp (a int , b string) stored as parquet;
hive> insert into table tmp values(1,"2");
hive> alter table tmp  change column  a new_a int;
hive> insert into table tmp values(2,"4");

in Doris:
mysql> set hive_parquet_use_column_names=true;
Query OK, 0 rows affected (0.00 sec)

mysql> select  * from tmp;
+-------+------+
| new_a | b    |
+-------+------+
|  NULL | 2    |
|     2 | 4    |
+-------+------+
2 rows in set (0.02 sec)

mysql> set hive_parquet_use_column_names=false;
Query OK, 0 rows affected (0.00 sec)

mysql> select  * from tmp;
+-------+------+
| new_a | b    |
+-------+------+
|     1 | 2    |
|     2 | 4    |
+-------+------+
2 rows in set (0.02 sec)
```

In Hive 3, you can use `set parquet.column.index.access=true/false` and
`set orc.force.positional.evolution=true/false` to control how the table
is read, similar to these two session variables. However, for a Parquet
table with a renamed field inside a struct column, Hive and Doris behave
differently.
2024-08-05 09:06:49 +08:00
338fa32303 [pick](simdjson) fix simdjson with object array when jsonroot is not empty (#38633)
## Proposed changes
backport: https://github.com/apache/doris/pull/38490
2024-08-01 11:04:54 +08:00
17d351af80 [fix](csv reader) fix csv parser incorrect if enclosing line_delimiter (#38347) (#38445)
The CSV reader parses data incorrectly when an enclosed field contains the
line_delimiter. For example, with line_delimiter \n and enclose ', given data as follows:
```
'aaaaaaaaaaaa
bbbb'
```
it will be parsed as two columns, `'aaaaaaaaaaaa` and `bbbb'`, rather
than one column
```
'aaaaaaaaaaaa
bbbb'
```

This happened because the CSV reader did not reset its intermediate result
when the enclose was not matched within `output_buf_read`, causing an
incorrect truncation.
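
A toy, self-contained illustration of the invariant the fix restores (standalone C++, not Doris's actual streaming reader): a line_delimiter inside an enclosed field must not terminate the record.

```cpp
#include <iostream>
#include <string>
#include <vector>

// Split a buffer into records on '\n', ignoring delimiters that fall
// inside an enclosed field. Doris's reader is streaming and stateful;
// this only models the invariant the fix restores.
std::vector<std::string> split_records(const std::string& buf, char enclose) {
    std::vector<std::string> records;
    std::string cur;
    bool in_enclose = false;
    for (char c : buf) {
        if (c == enclose) {
            in_enclose = !in_enclose;
        }
        if (c == '\n' && !in_enclose) {
            records.push_back(cur);
            cur.clear();
        } else {
            cur += c;
        }
    }
    if (!cur.empty()) {
        records.push_back(cur);
    }
    return records;
}

int main() {
    // "'aaaaaaaaaaaa\nbbbb'" must come back as one record, not two.
    for (const auto& r : split_records("'aaaaaaaaaaaa\nbbbb'\n", '\'')) {
        std::cout << "[" << r << "]\n";
    }
}
```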

Co-authored-by: Xin Liao <liaoxinbit@126.com>
2024-07-29 14:55:45 +08:00
a751372e76 [Feature](multi-catalog) Add memory tracker for orc reader/writer and arrow parquet writer. (#37257)
## Proposed changes

backport #37234
2024-07-25 13:51:59 +08:00
3ea26a8c95 [fix](external) record the number of not-found files (#38253) (#38285)
bp #38253
2024-07-25 11:03:19 +08:00
ef00dad680 [Fix](multi-catalog) Fix some undefined behaviors. (#38274)
## Proposed changes

backport #37845
2024-07-24 16:14:34 +08:00
193be20c86 [feature](csv)Supports reading CSV data using LF and CRLF as line separators. (#37687) (#38099)
bp #37687
2024-07-22 22:53:04 +08:00
d9fd419e47 [Fix](JsonReader) fix json with duplicate key entry that may result in an out-of-bound exception (#38147)
#38146
2024-07-19 22:53:02 +08:00
de2272ce48 [fix](round) fix round decimal128 overflow (#37733) (#37963)
cherry-pick #37733 to branch-2.1
2024-07-18 23:50:23 +08:00
3d5043817a Revert "[opt](serde)Optimize the filling of fixed values into block columns without repeated deserialization. (#37377)" (#38007)
Reverts apache/doris#37530
Needs more testing; revert it temporarily.
2024-07-17 21:44:25 +08:00
6932eef65e [opt](serde)Optimize the filling of fixed values into block columns without repeated deserialization. (#37377) (#37530)
bp #37377
2024-07-16 10:56:13 +08:00
bd24a8bdd9 [Fix](csv_reader) Add a session variable to control whether empty rows in CSV files are read as NULL values (#37153)
bp: #36668
2024-07-02 22:12:17 +08:00
e25717458e [opt](catalog) add some profile for parquet reader and change meta cache config (#37040) (#37146)
bp #37040
2024-07-02 20:58:43 +08:00
d0eea3886d [fix](multi-catalog) Revert #36575 and check nullptr of data column (#37086)
Revert #36575, because `VScanner::get_block` checks
`DCHECK(block->rows() == 0)`, so the block should be cleared when
`eof = true`.
2024-07-02 15:32:52 +08:00
e4b6dac0c1 [fix](ubsan) reinterpret_cast of fixed-length types to int8 is not safe (#36725)
## Proposed changes

Fix the type check reported by ubsan.
```
/root/doris/be/src/vec/exec/format/parquet/fix_length_plain_decoder.h:75:78: runtime error: member call on address 0x5582f35db5c0 which does not point to an object of type 'doris::vectorized::ColumnVector<signed char>'
0x5582f35db5c0: note: object is of type 'doris::vectorized::ColumnVector<int>'
 83 55 00 00  78 c0 b0 5a 82 55 00 00  02 00 00 00 00 00 00 00  10 a0 00 d7 83 55 00 00  10 a0 00 d7
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'doris::vectorized::ColumnVector<int>'
doris::Status doris::vectorized::FixLengthPlainDecoder::_decode_values<false>(COW<doris::vectorized::IColumn>::mutable_ptr<doris::vectorized::IColumn>&, std::shared_ptr<doris::vectorized::IDataType const>&, doris::vectorized::ColumnSelectVector&, bool) at fix_length_plain_decoder.h:75:78
```
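
The report boils down to calling a member function through a pointer whose dynamic type is a different `ColumnVector` instantiation. A minimal standalone sketch of the pattern, using simplified stand-ins rather than Doris's real column classes:

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Simplified stand-ins for doris::vectorized::IColumn / ColumnVector<T>.
struct IColumn {
    virtual ~IColumn() = default;
};
template <typename T>
struct ColumnVector : IColumn {
    std::vector<T> data;
    void insert_value(T v) { data.push_back(v); }
};

int main() {
    std::unique_ptr<IColumn> col = std::make_unique<ColumnVector<int32_t>>();
    // UB: col's dynamic type is ColumnVector<int32_t>, so a member call
    // through the cast pointer is what -fsanitize=undefined flags:
    // auto* bad = reinterpret_cast<ColumnVector<int8_t>*>(col.get());
    // bad->insert_value(1);
    // Safe: confirm the dynamic type before downcasting.
    if (auto* ok = dynamic_cast<ColumnVector<int32_t>*>(col.get())) {
        ok->insert_value(1);
    }
}
```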
2024-06-24 14:03:41 +08:00
17cf34b244 [Fix](multi-catalog) Fix occasional core dump in orc and parquet reader after a low-memory exception. (#36575)
## Proposed changes

Backport #36574.
2024-06-22 11:28:21 +08:00
f7f7b2b738 [Enhancement](multi-catalog) Add more error msgs for wrong data types in orc and parquet reader. (#36580)
Backport #36417
2024-06-20 18:10:25 +08:00
56ccb9a657 [fix](parquet) fix parquet reader handling of missing columns and filtering on missing columns (#36182)
bp #36189
2024-06-13 21:30:05 +08:00
9e972cb0b9 [bugfix](iceberg)Fix the datafile path error issue for 2.1 (#36066)
bp: #35957
2024-06-08 21:51:46 +08:00
bc062a2595 [fix](orc)fix orc reader handling of missing columns. (#35735)
## Proposed changes
bp #35583 
2024-05-31 22:51:44 +08:00
b91d2caab8 [Feature](iceberg-writer) Implement basic iceberg sink functionality for inserting into tables. (#35587)
backport #34929
2024-05-29 16:40:54 +08:00
68eda58a8c [Fix](multi-catalog) Fix string dict filtering when use null related function in parquet and orc reader. (#35335)
When the dictionary-encoded column is wrapped in null-related functions, as in the following SQL, the results will be incorrect.
```
select * from ( select IF(o_orderpriority IS NULL, 'null', o_orderpriority) AS o_orderpriority from test_string_dict_filter_orc ) as A where o_orderpriority = 'null';
```
```
select * from ( select IFNULL(o_orderpriority, 'null') AS o_orderpriority from test_string_dict_filter_parquet ) as A where o_orderpriority = 'null'
```
```
select * from ( select COALESCE(o_orderpriority, 'null') AS o_orderpriority from test_string_dict_filter_parquet ) as A where o_orderpriority = 'null';
```
2024-05-27 15:25:29 +08:00
7284b6959f [Configurations](multi-catalog)Fix enable_orc_filter_by_min_max functionality, the mistake for #35012. (#35320)
fix bug introduced from  #35012
2024-05-27 15:25:07 +08:00
eb49cd839b [refactor](datalake) return the error status instead of static_cast<void> (#34873)
Followup #34797
`static_cast<void>` silently ignored error statuses; some of them should finish the query with an error status, so replace `static_cast<void>` with `RETURN_IF_ERROR` (see the sketch after the list below).

The following three scenarios need to be handled separately and cannot be simply replaced:
1. The outer function returns void;
2. Call status function inner constructors or destructors;
3. Call status function with best effort, and should ignore the wrong status.
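
A minimal self-contained sketch of the replacement pattern; `Status` and the macro below are simplified stand-ins for `doris::Status` and Doris's `RETURN_IF_ERROR`, and `read_next_block` is a hypothetical callee:

```cpp
#include <iostream>
#include <string>
#include <utility>

struct Status {
    std::string msg;
    bool ok() const { return msg.empty(); }
    static Status OK() { return {}; }
    static Status Error(std::string m) { return {std::move(m)}; }
};

#define RETURN_IF_ERROR(stmt)        \
    do {                             \
        Status _s = (stmt);          \
        if (!_s.ok()) return _s;     \
    } while (false)

// A hypothetical callee that can fail.
Status read_next_block(bool fail) {
    return fail ? Status::Error("read failed") : Status::OK();
}

Status scan() {
    // Before: static_cast<void>(read_next_block(true)); dropped the error
    // and the query ran on. After: the error propagates to the caller.
    RETURN_IF_ERROR(read_next_block(true));
    return Status::OK();
}

int main() {
    std::cout << scan().msg << "\n";  // prints "read failed"
}
```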
2024-05-23 19:06:21 +08:00
adc364a6fd [feature](Paimon) support deletion vector for Paimon naive reader (#34743) (#35241)
bp #34743
Co-authored-by: 苏小刚 <suxiaogang223@icloud.com>
2024-05-23 00:01:30 +08:00
291cf57c54 [Configurations](multi-catalog) Add enable_parquet_filter_by_min_max and enable_orc_filter_by_min_max Session variables. (#35012) (#35164)
backport #35012
2024-05-22 19:06:12 +08:00
74d66e9650 [Fix](parquet-reader) Fix incorrect Timestamp Int96 min-max statistics written by some old parquet writers by disabling them. (#35041)
Parquet INT96 timestamp values were compared incorrectly for the purposes of producing statistics
by older parquet writers, so PARQUET-1065 deprecated them. The result is that any writer that produced
stats was producing unusable incorrect values, except the special case where min == max and an incorrect
ordering would not be material to the result. PARQUET-1026 made binary stats available and valid in that special case.
2024-05-21 13:00:22 +08:00
c0fd98abe5 [Fix](tvf) Fix tvf reading of empty files in compressed formats. (#34926)
1. Fix the issue with tvf reading empty compressed files.
2. Move two test cases (`test_local_tvf_compression` and `test_s3_tvf_compression`) from p2 to p0.
2024-05-21 12:59:31 +08:00
6b1c441258 [fix](group_commit) Wal reader should check block length to avoid reading empty block (#34792) 2024-05-18 18:17:56 +08:00
6c515e0c76 [fix](group commit) Make compatibility issues on serializing and deserializing wal file more clear (#34793) 2024-05-18 18:12:43 +08:00
1f0c45204b [fix](iceberg) read the primary key columns if there is an equality delete (#34884)
backport: #34835
2024-05-15 11:37:25 +08:00
02084fd91f [fix](iceberg_orc)Fixed the bug that the iceberg reader did not apply position deletes when reading an orc file without a predicate. (#34814) (#34882)
bp #34814
2024-05-15 11:31:29 +08:00
9491b7d422 [fix](iceberg) prevent coredump if read position delete file failed (#34802) 2024-05-14 14:03:33 +08:00
4be589951b Revert "Revert "[fix](csv-reader) fix column split error when there is escape character (#34364)""
This reverts commit d127d67ebe989484bbdf340a4de5b79ded56eecc.
2024-05-07 18:03:56 +08:00
d127d67ebe Revert "[fix](csv-reader) fix column split error when there is escape character (#34364)"
This reverts commit 971e10a9db782c9986b20e1209468e4d7aeedf71.
2024-05-07 13:36:11 +08:00
9d0d7293f0 [fix](json) fix be crash while load json data (#34283) 2024-05-07 07:42:53 +08:00
971e10a9db [fix](csv-reader) fix column split error when there is escape character (#34364) 2024-05-07 07:38:35 +08:00
35f8563a75 [feature](iceberg) support iceberg equality delete (#34223) (#34327)
bp #34223

Co-authored-by: Ashin Gau <AshinGau@users.noreply.github.com>
2024-04-30 11:51:29 +08:00
1bfe0f0393 [feature](iceberg)support reading iceberg complex types, iceberg.orc format and position deletes. (#33935) (#34256)
master #33935
2024-04-29 14:40:12 +08:00
99af54f779 [Fix](orc-reader) Fix the issue when string col has mixed plain and dict encoding in different stripes. (#34146) (#34248)
backport #34146
2024-04-28 19:43:57 +08:00
0f0c0a266b [opt](parquet)Skip page with offset index (#33082)
Make `skip_page()` in ColumnChunkReader more efficient: no more reading page headers if there are page locations in the chunk.
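
Conceptually, a parquet OffsetIndex stores each page's file offset, compressed size, and first row index, so a reader can step over whole pages by arithmetic instead of decoding their headers. A rough standalone sketch, not Doris's actual `ColumnChunkReader`:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Mirrors the fields of parquet's OffsetIndex PageLocation entries.
struct PageLocation {
    int64_t offset;                // file offset of the page
    int32_t compressed_page_size;  // bytes to skip without decoding
    int64_t first_row_index;       // first row contained in the page
};

// With the index, skipping a page is pure arithmetic; without it, the
// reader must read and parse the page header to learn the page size.
int64_t offset_after_page(const std::vector<PageLocation>& locs, size_t i) {
    return locs[i].offset + locs[i].compressed_page_size;
}

int main() {
    std::vector<PageLocation> locs = {{4, 100, 0}, {104, 80, 1000}};
    std::cout << offset_after_page(locs, 0) << "\n";  // 104: start of page 1
}
```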
2024-04-26 15:06:16 +08:00
5a5063be20 [bug](fix) heap use after free when json parse failed (#33955) 2024-04-22 22:33:24 +08:00
c631f4f8a8 [fix](schema change) resolve the use count check of source logical column (#33932)
Fix error like:
```
8# google::LogMessageFatal::~LogMessageFatal() in /mnt/hdd01/ci/master-deploy/be/lib/doris_be
 9# doris::vectorized::Block::clear_column_data(int) in /mnt/hdd01/ci/master-deploy/be/lib/doris_be
10# doris::vectorized::ParquetReader::get_next_block(doris::vectorized::Block*, unsigned long*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/format/parquet/vparquet_reader.cpp:514
11# doris::vectorized::VFileScanner::_get_block_impl(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/scan/vfile_scanner.cpp:333
12# doris::vectorized::VScanner::get_block(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/scan/vscanner.cpp:132
13# doris::vectorized::VScanner::get_block_after_projects(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/scan/vscanner.cpp:99
```

This is because the source logical column is the destination logical column when the logical converter is consistent. Previously, the column reference was reset only after the conversion completed; if an EOF occurred, the function returned early before the reset, even though EOF is not a true error.
```
if (_logical_converter->is_consistent()) {
    // If logical converter is consistent, _src_logical_column is the final destination column,
    // other components will check the use count
    _src_logical_column.reset();
}
```
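
A toy model of the use-count reasoning, with `shared_ptr` standing in for the column's COW reference (this is not the actual patch): the extra reference must also be released on the early EOF return, or the later use-count check aborts.

```cpp
#include <cassert>
#include <memory>

struct Column {};

// _src_logical_column outlives the call, mirroring the member variable in
// the real converter; when the converter is consistent it aliases the
// destination column.
struct Converter {
    std::shared_ptr<Column> _src_logical_column;

    std::shared_ptr<Column> convert(std::shared_ptr<Column> dst, bool eof) {
        _src_logical_column = dst;  // consistent: same column as dst
        if (eof) {
            // The fix: release the alias on the early EOF return too;
            // previously this path skipped the reset below.
            _src_logical_column.reset();
            return dst;
        }
        // ... conversion work ...
        _src_logical_column.reset();
        return dst;
    }
};

int main() {
    Converter conv;
    auto dst = std::make_shared<Column>();
    dst = conv.convert(dst, /*eof=*/true);
    // Downstream use-count checks (cf. Block::clear_column_data) expect 1.
    assert(dst.use_count() == 1);
}
```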
2024-04-22 12:31:46 +08:00
36a70ba1e7 [Fix](Csv-Reader)Fix the issue of BE core dump caused by improper configuration of column_separator and line_delimiter. (#33693) 2024-04-20 20:06:48 +08:00
0e3ad5cd9d [fix](parquet) fix time zone error(isAdjustedToUTC=true) in parquet reader (#33675) (#33924)
bp (#33675)

Co-authored-by: Ashin Gau <AshinGau@users.noreply.github.com>
2024-04-20 19:06:54 +08:00
25358564ca [Fix](compile) Fix gcc compile on master (#33864)
This was introduced by #33511, which wrongly used

ColumnStr<T> ();

This violates the C++20 standard (see https://wg21.cmeerw.net/cwg/issue2237) but is still accepted by clang up until now (see llvm/llvm-project#58112).
2024-04-19 23:41:37 +08:00