Commit Graph

384 Commits

Author SHA1 Message Date
291cf57c54 [Configurations](multi-catalog) Add enable_parquet_filter_by_min_max and enable_orc_filter_by_min_max Session variables. (#35012) (#35164)
backport #35012
2024-05-22 19:06:12 +08:00
74d66e9650 [Fix](parquet-reader) Disable Timestamp INT96 min-max statistics, which are incorrect when written by some old parquet writers. (#35041)
Parquet INT96 timestamp values were compared incorrectly for the purposes of producing statistics
by older parquet writers, so PARQUET-1065 deprecated them. The result is that any writer that produced
stats was producing unusable incorrect values, except the special case where min == max and an incorrect
ordering would not be material to the result. PARQUET-1026 made binary stats available and valid in that special case.
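A minimal C++ sketch of the resulting policy (the helper name is illustrative, not the Doris reader API): min/max filtering on INT96 timestamp statistics is only trusted in the degenerate min == max case.

```cpp
#include <string>

// Hedged sketch: INT96 timestamp statistics from old writers used a wrong sort order
// (PARQUET-1065), so the general min/max case is unusable; PARQUET-1026 keeps the
// special case min == max valid, because ordering cannot matter there.
bool can_use_int96_min_max(const std::string& min_value, const std::string& max_value) {
    return min_value == max_value;
}
```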
2024-05-21 13:00:22 +08:00
c0fd98abe5 [Fix](tvf) Fix tvf reading of empty files in compressed formats. (#34926)
1. Fix the issue with tvf reading empty compressed files.
2. Move two test cases (`test_local_tvf_compression` and `test_s3_tvf_compression`) from p2 to p0
2024-05-21 12:59:31 +08:00
6b1c441258 [fix](group_commit) Wal reader should check block length to avoid reading empty block (#34792) 2024-05-18 18:17:56 +08:00
6c515e0c76 [fix](group commit) Make compatibility issues on serializing and deserializing wal file more clear (#34793) 2024-05-18 18:12:43 +08:00
1f0c45204b [fix](iceberg) read the primary key columns if there is an equality delete (#34884)
backport: #34835
2024-05-15 11:37:25 +08:00
02084fd91f [fix](iceberg_orc) Fixed the bug where the iceberg reader did not apply position deletes when reading an orc file without a predicate. (#34814) (#34882)
bp #34814
2024-05-15 11:31:29 +08:00
9491b7d422 [fix](iceberg) prevent coredump if read position delete file failed (#34802) 2024-05-14 14:03:33 +08:00
4be589951b Revert "Revert "[fix](csv-reader) fix column split error when there is escape character (#34364)""
This reverts commit d127d67ebe989484bbdf340a4de5b79ded56eecc.
2024-05-07 18:03:56 +08:00
d127d67ebe Revert "[fix](csv-reader) fix column split error when there is escape character (#34364)"
This reverts commit 971e10a9db782c9986b20e1209468e4d7aeedf71.
2024-05-07 13:36:11 +08:00
9d0d7293f0 [fix](json) fix be crash while load json data (#34283) 2024-05-07 07:42:53 +08:00
971e10a9db [fix](csv-reader) fix column split error when there is escape character (#34364) 2024-05-07 07:38:35 +08:00
35f8563a75 [feature](iceberg) support iceberg equality delete (#34223) (#34327)
bp #34223

Co-authored-by: Ashin Gau <AshinGau@users.noreply.github.com>
2024-04-30 11:51:29 +08:00
1bfe0f0393 [feature](iceberg)support read iceberg complex type,iceberg.orc format and position delete. (#33935) (#34256)
master #33935
2024-04-29 14:40:12 +08:00
99af54f779 [Fix](orc-reader) Fix the issue when string col has mixed plain and dict encoding in different stripes. (#34146) (#34248)
backport #34146
2024-04-28 19:43:57 +08:00
0f0c0a266b [opt](parquet) Skip page with offset index (#33082)
Make skip_page() in ColumnChunkReader more efficient: there is no need to read page headers if there are page locations in the chunk.
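A hedged sketch of the idea (the struct follows the parquet OffsetIndex layout, but names and the helper are illustrative, not the Doris API): when the offset index provides each page's file offset and compressed size, skipping a page is just a seek, with no need to deserialize the thrift page header first.

```cpp
#include <cstdint>
#include <vector>

// One entry of the offset index: where the page starts, how many bytes it occupies,
// and the first row it contains.
struct PageLocation {
    int64_t offset;
    int32_t compressed_page_size;
    int64_t first_row_index;
};

// Returns the file offset right after the given page, i.e. where the next page starts,
// so the reader can seek there directly instead of parsing the page header.
int64_t offset_after_page(const std::vector<PageLocation>& locations, size_t page_idx) {
    const PageLocation& loc = locations.at(page_idx);
    return loc.offset + loc.compressed_page_size;
}
```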
2024-04-26 15:06:16 +08:00
Pxl 5a5063be20 [bug](fix) heap use after free when json parse failed (#33955) 2024-04-22 22:33:24 +08:00
c631f4f8a8 [fix](schema change) resolve the use count check of source logical column (#33932)
Fix error like:
```
8# google::LogMessageFatal::~LogMessageFatal() in /mnt/hdd01/ci/master-deploy/be/lib/doris_be
 9# doris::vectorized::Block::clear_column_data(int) in /mnt/hdd01/ci/master-deploy/be/lib/doris_be
10# doris::vectorized::ParquetReader::get_next_block(doris::vectorized::Block*, unsigned long*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/format/parquet/vparquet_reader.cpp:514
11# doris::vectorized::VFileScanner::_get_block_impl(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/scan/vfile_scanner.cpp:333
12# doris::vectorized::VScanner::get_block(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/scan/vscanner.cpp:132
13# doris::vectorized::VScanner::get_block_after_projects(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/scan/vscanner.cpp:99
```

Because the source logical column is the destination logical column when the logical converter is consistent. Previously, the column reference was reset only after the conversion completed; if an EOF occurred, the function returned early without resetting it, even though EOF is not a true error.
```
if (_logical_converter->is_consistent()) {
    // If logical converter is consistent, _src_logical_column is the final destination column,
    // other components will check the use count
    _src_logical_column.reset();
}
```
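A self-contained sketch of the failure mode and fix, with simplified types (not the Doris code): when the converter is consistent, the source and destination are the same column object, so the extra reference has to be released on every exit path, including the early EOF return.

```cpp
#include <memory>
#include <vector>

using ColumnPtr = std::shared_ptr<std::vector<int>>;  // stand-in for a real column

enum class Status { OK, END_OF_FILE };

Status convert(ColumnPtr& src, bool consistent, bool eof) {
    if (eof) {
        // EOF is not a real error, but it used to return early without releasing the
        // reference, so the later use-count check on the block failed.
        if (consistent) src.reset();
        return Status::END_OF_FILE;
    }
    // ... normal conversion work ...
    if (consistent) src.reset();  // the original release point on the normal path
    return Status::OK;
}
```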
2024-04-22 12:31:46 +08:00
36a70ba1e7 [Fix](Csv-Reader) Fix the issue of BE core dump caused by improper configuration of column_separator and line_delimiter. (#33693) 2024-04-20 20:06:48 +08:00
0e3ad5cd9d [fix](parquet) fix time zone error(isAdjustedToUTC=true) in parquet reader (#33675) (#33924)
bp (#33675)

Co-authored-by: Ashin Gau <AshinGau@users.noreply.github.com>
2024-04-20 19:06:54 +08:00
25358564ca [Fix](compile) Fix gcc compile on master (#33864)
This was introduced by #33511, which wrongly used

ColumnStr<T> ();

which violates the C++20 standard (see https://wg21.cmeerw.net/cwg/issue2237) but is still accepted by clang up until now (see llvm/llvm-project#58112)
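A minimal illustration of the rule in question (not the Doris code): inside the class, a constructor has to be declared with the injected-class-name; the simple-template-id form is ill-formed in C++20 per CWG 2237, yet clang still accepts it while gcc rejects it.

```cpp
template <typename T>
struct ColumnStr {
    ColumnStr();        // OK: injected-class-name
    // ColumnStr<T>();  // ill-formed per CWG 2237, but still accepted by clang
};

template <typename T>
ColumnStr<T>::ColumnStr() {}  // out of class, the template-id is required here

int main() {
    ColumnStr<int> c;  // instantiate so the definitions are actually checked
    (void)c;
    return 0;
}
```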
2024-04-19 23:41:37 +08:00
1300317723 [Exec](join) Support column string64 to avoid join failures when string size overflows uint32 (#33511) (#33850) 2024-04-18 19:43:08 +08:00
ae68cca07d [fix](schema change) CastStringConverter is compiled failed in g++ (#33546)
Following #32873: CastStringConverter fails to compile in g++ because of an uninitialized value, which is fine in clang.
2024-04-17 23:42:00 +08:00
9b7af4c0cf [feature](schema change) unified schema change for parquet and orc reader (#32873)
Following #25138, this unifies the schema change interface for the parquet and orc readers, and it can be applied to other format readers as well.
Unified schema change interface for all format readers (see the sketch after this list):
- First, read the data from the file into a source column according to the column type in the file;
- Second, convert the source column to the destination column with the type planned by FE.
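A hedged sketch of that two-step interface (class and method names are illustrative, not the Doris declarations):

```cpp
#include <memory>

struct Column {};  // stand-in for a real column of values

class TypeConverter {
public:
    virtual ~TypeConverter() = default;

    // Step 1: a column typed according to the file schema; the reader fills it directly.
    virtual std::shared_ptr<Column> new_source_column() = 0;

    // Step 2: convert the filled source column into the destination column whose type
    // was planned by FE. Returns false on a conversion error.
    virtual bool convert(const std::shared_ptr<Column>& src,
                         std::shared_ptr<Column>& dst) = 0;
};
```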
2024-04-12 15:09:25 +08:00
b0b5f84e40 [feature](load) support compressed JSON format data for broker load (#30809) 2024-04-10 14:20:53 +08:00
29556f758e [fix](parquet) fix time zone error in parquet reader (#33217)
`isAdjustedToUTC` was interpreted as exactly the opposite in the parquet reader (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md), so timestamps with `isAdjustedToUTC=true` were increased by eight hours (UTC+8).

A parquet file with `isAdjustedToUTC=true` can be produced by spark-sql with the following configuration:
```
--conf spark.sql.session.timeZone=UTC
--conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS
```

However, using the following configuration, there is no logical or converted type in the parquet metadata, so the time read by Doris will also be increased by eight hours (UTC+8). Users need to set the corresponding time zone in Doris themselves (https://doris.apache.org/docs/dev/advanced/time-zone/):
```
--conf spark.sql.session.timeZone=UTC
--conf spark.sql.parquet.outputTimestampType=INT96
```
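A hedged sketch of the corrected semantics (function and parameter names are illustrative, not the Doris reader): a value with `isAdjustedToUTC=true` is an instant in UTC and must be shifted into the session time zone, while a value with `isAdjustedToUTC=false` is already a wall-clock time; the bug was equivalent to inverting this condition.

```cpp
#include <cstdint>

// session_offset_seconds is the session time zone offset, e.g. +8 * 3600 for UTC+8.
int64_t to_session_micros(int64_t value_micros, bool is_adjusted_to_utc,
                          int64_t session_offset_seconds) {
    if (is_adjusted_to_utc) {
        // Stored value is a UTC instant: shift it into the session time zone.
        return value_micros + session_offset_seconds * 1000000;
    }
    // Stored value is already a local wall-clock time: use it as-is.
    return value_micros;
}
```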
2024-04-07 23:24:22 +08:00
2f2d488668 [opt](parquet) Support hive struct schema change (#32438)
Follow-up to #31128.
This optimization allows Doris to correctly read struct type data after the schema has been changed in hive.

## Changing struct schema in hive:
```sql
hive> create table struct_test(id int,sf struct<f1: int, f2: string>) stored as parquet;

hive> insert into struct_test values
    >           (1, named_struct('f1', 1, 'f2', 's1')),
    >           (2, named_struct('f1', 2, 'f2', 's2')),
    >           (3, named_struct('f1', 3, 'f2', 's3'));

hive> alter table struct_test change sf sf struct<f1:int, f3:string>;

hive> select * from struct_test;
OK
1	{"f1":1,"f3":null}
2	{"f1":2,"f3":null}
3	{"f1":3,"f3":null}
Time taken: 5.298 seconds, Fetched: 3 row(s)
```

The previous result of doris was:
```sql
mysql> select * from struct_test;
+------+-----------------------+
| id   | sf                    |
+------+-----------------------+
|    1 | {"f1": 1, "f3": "s1"} |
|    2 | {"f1": 2, "f3": "s2"} |
|    3 | {"f1": 3, "f3": "s3"} |
+------+-----------------------+
```

Now the result is the same as in hive:

```sql
mysql> select * from struct_test;
+------+-----------------------+
| id   | sf                    |
+------+-----------------------+
|    1 | {"f1": 1, "f3": null} |
|    2 | {"f1": 2, "f3": null} |
|    3 | {"f1": 3, "f3": null} |
+------+-----------------------+
```
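A hedged sketch of the matching idea (containers simplified, not the Doris reader): struct fields requested by the table schema are resolved by name against the fields present in the parquet file, so a renamed or dropped field maps to missing and is filled with nulls instead of being read positionally from the old column.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Returns, for each table field, the index of the matching file field, or -1 when the
// field does not exist in the file and must be filled with nulls.
std::vector<int> resolve_struct_fields(const std::vector<std::string>& table_fields,
                                       const std::vector<std::string>& file_fields) {
    std::unordered_map<std::string, int> file_index;
    for (int i = 0; i < static_cast<int>(file_fields.size()); ++i) {
        file_index[file_fields[i]] = i;
    }
    std::vector<int> mapping;
    for (const std::string& name : table_fields) {
        auto it = file_index.find(name);
        mapping.push_back(it == file_index.end() ? -1 : it->second);
    }
    return mapping;
}
```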
2024-03-22 16:35:47 +08:00
2196c534e8 [fix](group commit) Fix compatibility issues on serializing and deserializing wal file (#32299) 2024-03-21 14:07:24 +08:00
f99db38998 [fix](ParquetReader) Fix Parquet Reader to read int96 parquet type problem (#32394)
`hi - JULIAN_EPOCH_OFFSET_DAYS` could be negative, so we can't use unsigned int everywhere.
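A short sketch of the conversion (the epoch constants are standard; the function itself is illustrative, not the Doris code): an INT96 timestamp stores a Julian day plus nanoseconds of day, and the Julian day can be earlier than the Unix epoch day, so the subtraction must happen in signed arithmetic.

```cpp
#include <cstdint>

constexpr int64_t JULIAN_EPOCH_OFFSET_DAYS = 2440588;    // Julian day number of 1970-01-01
constexpr int64_t SECONDS_PER_DAY = 86400;
constexpr int64_t NANOS_PER_SECOND = 1000000000LL;

int64_t int96_to_unix_seconds(uint32_t julian_day, uint64_t nanos_of_day) {
    // The difference may be negative for dates before 1970, hence signed int64.
    int64_t days = static_cast<int64_t>(julian_day) - JULIAN_EPOCH_OFFSET_DAYS;
    return days * SECONDS_PER_DAY + static_cast<int64_t>(nanos_of_day) / NANOS_PER_SECOND;
}
```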
2024-03-21 14:07:24 +08:00
2e564036ef [fix](profile) avoid updating profile in destructor (#32131)
Previously, the counters in `profile` might be updated when closing the file reader, and the file reader might be closed while the object was being destructed.
But at that time, the `profile` object may already have been deleted, causing a null pointer access and a BE crash.

This PR try to fix this issue:

1. Remove the "profile counter update" logic from all `close()` method.

2. Add a new interface `ProfileCollector`

	It has 2 methods:
	
	- `collect_profile_at_runtime()`

		It can be called at runtime, e.g. in every `get_next_block()` method,
		so that the profile counters can be updated at runtime.
		
	- `collect_profile_before_close()`

		Should be called before the object calls `close()`, and it will only be called once.
		
3. Derive from `ProfileCollector`

	All classes which may update profile counters in the `close()` method, such as `GenericReader`, should extend
	`ProfileCollector` and implement `collect_profile_before_close()` (see the sketch after this list).

	And `collect_profile_before_close()` will be called in `scanner->mark_to_need_to_close()`.
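A hedged sketch of the interface described above (simplified, not the exact Doris declaration; the once-only guard is an assumption about how "only called once" is enforced): counters are flushed either periodically at runtime or exactly once before close(), never in a destructor.

```cpp
class ProfileCollector {
public:
    virtual ~ProfileCollector() = default;

    // Safe to call repeatedly at runtime, e.g. in every get_next_block().
    virtual void collect_profile_at_runtime() = 0;

    // Called before close(); the guard makes repeated calls no-ops.
    void collect_profile_before_close() {
        if (!_collected) {
            _collected = true;
            _collect_profile_before_close();
        }
    }

protected:
    virtual void _collect_profile_before_close() = 0;

private:
    bool _collected = false;
};
```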
2024-03-21 14:07:22 +08:00
9c1888e7ec [RuntimeFilter](exec) support min max runtime filter and do refactor (#32210) 2024-03-15 18:06:20 +08:00
7b74b199a5 [fix](memory) Fix LRU cache deleter and memory tracking (#32080)
In order to add common code to the value deleter of the LRU cache, let all LRU cache values inherit from the LRUCacheValueBase class and track memory in the destructor.
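A hedged sketch of the pattern (names besides LRUCacheValueBase are illustrative, and the tracker is a stand-in): every cache value derives from a common base whose destructor releases the tracked memory, so the LRU deleter only needs to delete the base pointer.

```cpp
#include <atomic>
#include <cstddef>

std::atomic<size_t> g_tracked_bytes{0};  // stand-in for a real memory tracker

class LRUCacheValueBase {
public:
    explicit LRUCacheValueBase(size_t bytes) : _bytes(bytes) { g_tracked_bytes += bytes; }
    virtual ~LRUCacheValueBase() { g_tracked_bytes -= _bytes; }  // tracking in the destructor

private:
    size_t _bytes;
};

// The common deleter installed for every cache entry.
void cache_value_deleter(void* value) {
    delete static_cast<LRUCacheValueBase*>(value);  // virtual dtor runs the tracking code
}
```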
2024-03-15 17:57:58 +08:00
0b5b7175d6 [fix](multi-catalog) add max compute custom odps and tunnel url (#31390)
add max compute custom odps and tunnel url
2024-02-29 16:44:40 +08:00
4e5147c6a4 [fix](parquet) Fix possible memory leak if ParquetReader::parse_thrift_footer failed (#31375) 2024-02-25 18:08:19 +08:00
b66583551c [fix](group_commit)Fix bound checking problem when reading wal block (#31112) 2024-02-22 13:01:48 +08:00
278b232e76 [Bug](json reader) object should stop processing when encounter error (#31159)
If DATA_QUALITY_ERROR is encountered, we should stop processing this document any further. Otherwise there will be UB in simdjson.
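A hedged sketch of the control flow (names illustrative, not the Doris json reader): once a row hits a data-quality error, stop iterating the document instead of continuing, since continuing to consume parser state after an error is undefined behaviour in simdjson.

```cpp
#include <string>
#include <vector>

enum class Status { OK, DATA_QUALITY_ERROR };

// Stand-in for the simdjson-based per-row parsing.
Status parse_one_row(const std::string& row) {
    return row.empty() ? Status::DATA_QUALITY_ERROR : Status::OK;
}

Status parse_document(const std::vector<std::string>& rows) {
    for (const std::string& row : rows) {
        Status st = parse_one_row(row);
        if (st == Status::DATA_QUALITY_ERROR) {
            return st;  // stop immediately; do not keep processing this document
        }
    }
    return Status::OK;
}
```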
2024-02-21 13:53:32 +08:00
7ca3be6d51 [fix](parquet) return error if schema changed in complex types (#31128)
Check the column types of complex types to prevent a core dump in BE. ColumnReader will hit a segmentation fault in the following case.
Change complex types in hive:

hive> create table struct_test(
           id int,
           sf struct<f1: int, f2: map<string, string>>) stored as parquet;

hive> insert into struct_test values
          (1, named_struct('f1', 1, 'f2', str_to_map('1:s2,2:s2'))),
          (2, named_struct('f1', 2, 'f2', str_to_map('k1:s3,k2:s4'))),
          (3, named_struct('f1', 3, 'f2', str_to_map('k1:s5,k2:s6')));

hive> alter table struct_test change sf sf struct<f1:int, f2: string>;
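A hedged sketch of the guard (enums and names are illustrative, not the Doris reader): before building a reader for a nested field, verify that the type in the file matches the type in the table schema and return an error on mismatch instead of letting the column reader crash.

```cpp
#include <string>

enum class TypeKind { INT, STRING, STRUCT, MAP, ARRAY };

struct CheckResult {
    bool ok;
    std::string msg;
};

CheckResult check_nested_type(TypeKind file_type, TypeKind table_type,
                              const std::string& field_name) {
    if (file_type != table_type) {
        return {false, "schema changed in complex type: field `" + field_name +
                       "` in the file is incompatible with the table schema"};
    }
    return {true, ""};
}
```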
2024-02-20 09:12:38 +08:00
0d4b8386a2 [bugfix][be][cppcheck] Possible NULL pointer access (#31025) (#31026) 2024-02-16 10:16:40 +08:00
f65844fae4 [Enhancement](Outfile/Export) Export data to csv file format with BOM (#30533)
The UTF-8 format on Windows systems has a BOM.

We add a new user property to `Outfile/Export`. Therefore, when exporting Doris data, users can choose whether to add a BOM at the beginning of the CSV file.

**Usage:**
```sql
-- outfile:
select * from demo.student
into outfile "file:///xxx/export/exp_"
format as csv
properties(
    "column_separator" = ",",
    "with_bom" = "true"
);

-- Export:
EXPORT TABLE student TO "file:///xx/tmpdata/export/exp_"
PROPERTIES(
    "format" = "csv",
    "with_bom" = "true"
);
```
2024-02-16 10:16:40 +08:00
7571ecc42f [fix](group_commit)Add bounds checking when reading wal file on group commit (#30940) 2024-02-16 10:12:24 +08:00
0d32aeeaf6 [improvement](load) Enable lzo & Remove dependency on Markus F.X.J. Oberhumer's lzo library (#30573)
Issue Number: close #29406

1. increase lzop version to 0x1040 (see the sketch after this list),
    I set it to 0x1040 only for decompressing lzo files compressed by a higher version of lzop,
    with no change to the decompressing logic;
    actually, 0x1040 should have the "F_H_FILTER" feature,
    but it is mainly for audio and image data, so we do not support it.
2. use orc::lzoDecompress() instead of lzo1x_decompress_safe() to decompress lzo data
3. use crc32c::Extend() instead of lzo_crc32()
4. use olap_adler32() instead of lzo_adler32()
5. thus, remove the dependency on Markus F.X.J. Oberhumer's lzo library
6. remove DORIS_WITH_LZO, so lzo files are supported by stream and broker load by default
7. add some regression tests
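A hedged sketch of the header compatibility check from item 1 (the function and the flag value are assumptions for illustration, not the Doris code): accept lzop headers up to version 0x1040, but reject the F_H_FILTER feature.

```cpp
#include <cstdint>
#include <string>

constexpr uint16_t kMaxSupportedLzopVersion = 0x1040;
constexpr uint32_t kLzopFlagFilter = 0x00000800;  // assumed F_H_FILTER bit in the lzop header flags

// Returns an empty string when the header is acceptable, otherwise an error message.
std::string check_lzop_header(uint16_t version, uint32_t flags) {
    if (version > kMaxSupportedLzopVersion) {
        return "unsupported lzop version";
    }
    if (flags & kLzopFlagFilter) {
        return "lzop F_H_FILTER feature is not supported";
    }
    return "";
}
```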
2024-02-05 22:00:24 +08:00
4b42156fc0 [chore](clang-tidy): add bugprone linters (#29521)
This PR introduces 4 bugprone linter rules to .clang-tidy; these linters found some bugs in #28965. This PR also adds some comments to mute false positive reports.
2024-02-05 21:58:08 +08:00
92cad69fc4 [Fix](parquet-reader) Fix reading fixed length byte array decimal in parquet reader. (#30535) 2024-01-31 23:53:40 +08:00
73371d44f8 [fix][refactor] refactor schema init of externa table and some parquet issue (#30325)
1. Skip parquet files that are only 4 bytes long, i.e. contain nothing but the magic `PAR1` (see the sketch after this list)
2. Refactor the schema init method of iceberg/hudi/hive tables in the hms catalog
    1. Remove some redundant methods of `getIcebergTable`
    2. Fix the issue described in #23771
3. Support HoodieParquetInputFormatBase, treating it as a normal hive table format
4. When listing files, skip all hidden dirs and files
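A hedged sketch of the check from item 1 (constant and name illustrative): a parquet file that is only the 4-byte magic `PAR1` carries no footer or data, so treat it as empty and skip it instead of trying to parse a footer.

```cpp
#include <cstdint>

bool should_skip_parquet_file(int64_t file_size) {
    constexpr int64_t kParquetMagicLen = 4;  // "PAR1"
    return file_size <= kParquetMagicLen;    // nothing but the magic => no footer, skip
}
```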
2024-01-31 23:53:40 +08:00
7d037c12bf [bugfix](paimon)fix paimon testcases (#30514)
1. set the default timezone
2. do not push down the `char` type (not supported)
2024-01-31 23:53:39 +08:00
8308bc96b9 [fix](paimon)set timestamp's scale for parquet which has no logical type (#30119) 2024-01-23 13:22:14 +08:00
32c5153999 [fix](routine-load) pause job when json path is invalid #30197
If jsonpaths is set incorrectly, the routine load job reports an error but keeps running. For example:

CREATE ROUTINE LOAD jobName ON tableName
PROPERTIES
(
    "format" = "json",
    "max_batch_interval" = "5",
    "max_batch_rows" = "300000",
    "max_batch_size" = "209715200",
    "jsonpaths" = "[\'t\',\'a\']"
)
FROM KAFKA
(
    "kafka_broker_list" = "$IP:PORT",
    "kafka_topic" = "XXX",
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);
The jsonpaths ['t','a'] are invalid, but the job keeps running all the time.
2024-01-23 10:12:37 +08:00
be893d792c [fix](jni) fix jni_reader function name get_nex_block to get_next_block (#29943) 2024-01-16 18:39:00 +08:00
d494674ff4 [opt](parquet-reader) Opt parquet decimal type reading. (#29825) 2024-01-12 13:58:19 +08:00
7287c0ca15 [Opt](exec)(multi-catalog) Opt date type reading. (#29571) 2024-01-12 11:48:39 +08:00