Commit Graph

402 Commits

d9fd419e47 [Fix](JsonReader) fix json with duplicate key entries possibly causing an out-of-bounds exception (#38147)
#38146
2024-07-19 22:53:02 +08:00
de2272ce48 [fix](round) fix round decimal128 overflow (#37733) (#37963)
cherry-pick #37733 to branch-2.1
2024-07-18 23:50:23 +08:00
3d5043817a Revert "[opt](serde)Optimize the filling of fixed values into block columns without repeated deserialization. (#37377)" (#38007)
Reverts apache/doris#37530
Needs more testing; reverting it temporarily.
2024-07-17 21:44:25 +08:00
6932eef65e [opt](serde)Optimize the filling of fixed values into block columns without repeated deserialization. (#37377) (#37530)
bp #37377
2024-07-16 10:56:13 +08:00
bd24a8bdd9 [Fix](csv_reader) Add a session variable to control whether empty rows in CSV files are read as NULL values (#37153)
bp: #36668
2024-07-02 22:12:17 +08:00
e25717458e [opt](catalog) add some profile for parquet reader and change meta cache config (#37040) (#37146)
bp #37040
2024-07-02 20:58:43 +08:00
d0eea3886d [fix](multi-catalog) Revert #36575 and check nullptr of data column (#37086)
Revert #36575, because `VScanner::get_block` checks `DCHECK(block->rows() == 0)`, so the block should be cleared when `eof = true`.
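A minimal self-contained sketch of that contract (hypothetical types, not Doris's real `Block`/`VScanner`): the reader must hand back an empty block whenever it reports `eof`.

```cpp
#include <cassert>
#include <vector>

struct Block {
    std::vector<int> rows_;
    size_t rows() const { return rows_.size(); }
    void clear_column_data() { rows_.clear(); }
};

// Hypothetical reader: on eof it must leave the block empty,
// because the caller asserts that it received no rows.
void get_next_block(Block* block, bool* eof) {
    const bool no_more_data = true;  // pretend the source is exhausted
    if (no_more_data) {
        block->clear_column_data();  // required: the caller asserts rows() == 0
        *eof = true;
    }
}

int main() {
    Block block;
    block.rows_ = {1, 2, 3};  // leftover rows from a previous batch
    bool eof = false;
    get_next_block(&block, &eof);
    // Mirrors the DCHECK(block->rows() == 0) performed by VScanner::get_block.
    assert(!eof || block.rows() == 0);
    return 0;
}
```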
2024-07-02 15:32:52 +08:00
e4b6dac0c1 [fix](ubsan) reinterpret_cast of fixed-length types to int8 is not safe (#36725)
## Proposed changes

Fix the type check flagged by UBSan.
```
/root/doris/be/src/vec/exec/format/parquet/fix_length_plain_decoder.h:75:78: runtime error: member call on address 0x5582f35db5c0 which does not point to an object of type 'doris::vectorized::ColumnVector<signed char>'
0x5582f35db5c0: note: object is of type 'doris::vectorized::ColumnVector<int>'
 83 55 00 00  78 c0 b0 5a 82 55 00 00  02 00 00 00 00 00 00 00  10 a0 00 d7 83 55 00 00  10 a0 00 d7
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'doris::vectorized::ColumnVector<int>'
doris::Status doris::vectorized::FixLengthPlainDecoder::_decode_values<false>(COW<doris::vectorized::IColumn>::mutable_ptr<doris::vectorized::IColumn>&, std::shared_ptr<doris::vectorized::IDataType const>&, doris::vectorized::ColumnSelectVector&, bool) at fix_length_plain_decoder.h:75:78
```
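The dump shows the object's vptr belongs to `ColumnVector<int>` while the call assumed `ColumnVector<signed char>`. A minimal sketch of that class of bug and a checked alternative (illustrative types, not Doris's real column classes):

```cpp
#include <cstdint>
#include <vector>

struct IColumn {
    virtual ~IColumn() = default;
};

template <typename T>
struct ColumnVector : IColumn {
    std::vector<T> data;
};

int main() {
    ColumnVector<int32_t> int_col;
    IColumn* base = &int_col;

    // UB (what the report flags): the object is a ColumnVector<int32_t>, but a
    // reinterpret_cast pretends it is a ColumnVector<int8_t>:
    // auto* bad = reinterpret_cast<ColumnVector<int8_t>*>(base);
    // bad->data.push_back(1);  // member access through the wrong type

    // Safe: check the dynamic type first, so a mismatch is caught instead of
    // silently reading through the wrong vtable.
    if (auto* i8 = dynamic_cast<ColumnVector<int8_t>*>(base)) {
        i8->data.push_back(1);
    }
    return 0;
}
```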
2024-06-24 14:03:41 +08:00
17cf34b244 [Fix](multi-catalog) Fix occasional core in orc and parquet reader after a low-memory exception. (#36575)
## Proposed changes

Backport #36574.
2024-06-22 11:28:21 +08:00
f7f7b2b738 [Enhancement](multi-catalog) Add more error msgs for wrong data types in orc and parquet reader. (#36580)
Backport #36417
2024-06-20 18:10:25 +08:00
56ccb9a657 [fix](parquet) fix parquet reader missing column and filter missing column (#36182)
bp #36189
2024-06-13 21:30:05 +08:00
9e972cb0b9 [bugfix](iceberg)Fix the datafile path error issue for 2.1 (#36066)
bp: #35957
2024-06-08 21:51:46 +08:00
bc062a2595 [fix](orc)fix orc reader missing column. (#35735)
## Proposed changes
bp #35583 
2024-05-31 22:51:44 +08:00
b91d2caab8 [Feature](iceberg-writer) Implement basic iceberg sink functionality for inserting into tables. (#35587)
backport #34929
2024-05-29 16:40:54 +08:00
68eda58a8c [Fix](multi-catalog) Fix string dict filtering when use null related function in parquet and orc reader. (#35335)
When the dictionary column is wrapped in null-related functions, queries like the following returned incorrect results:
```sql
select * from ( select IF(o_orderpriority IS NULL, 'null', o_orderpriority) AS o_orderpriority from test_string_dict_filter_orc ) as A where o_orderpriority = 'null';

select * from ( select IFNULL(o_orderpriority, 'null') AS o_orderpriority from test_string_dict_filter_parquet ) as A where o_orderpriority = 'null';

select * from ( select COALESCE(o_orderpriority, 'null') AS o_orderpriority from test_string_dict_filter_parquet ) as A where o_orderpriority = 'null';
```
2024-05-27 15:25:29 +08:00
7284b6959f [Configurations](multi-catalog) Fix enable_orc_filter_by_min_max functionality; corrects the mistake in #35012. (#35320)
Fix a bug introduced in #35012.
2024-05-27 15:25:07 +08:00
eb49cd839b [refactor](datalake) return the error status instead of static_cast<void> (#34873)
Followup #34797
`static_cast<void>` ignored the wrong status; some of those statuses should finish the query with an error, so `static_cast<void>` is replaced with `RETURN_IF_ERROR` (a sketch of the before/after pattern follows this list).

The following three scenarios need to be handled separately and cannot be simply replaced:
1. The outer function returns void;
2. The status-returning function is called inside a constructor or destructor;
3. The status-returning function is called on a best-effort basis, and the wrong status should be ignored.
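A minimal sketch of the before/after pattern, with a simplified `Status` and `RETURN_IF_ERROR` standing in for Doris's real utilities (hypothetical reconstructions, not the actual definitions):

```cpp
#include <string>
#include <utility>

struct Status {
    bool ok_ = true;
    std::string msg;
    static Status OK() { return {}; }
    static Status Error(std::string m) { return {false, std::move(m)}; }
    bool ok() const { return ok_; }
};

#define RETURN_IF_ERROR(stmt)      \
    do {                           \
        Status _st = (stmt);       \
        if (!_st.ok()) return _st; \
    } while (0)

Status read_footer() { return Status::Error("corrupt footer"); }

// Before: the error status was silently dropped.
Status open_reader_before() {
    static_cast<void>(read_footer());  // wrong status ignored
    return Status::OK();
}

// After: the error propagates and finishes the query with an error status.
Status open_reader_after() {
    RETURN_IF_ERROR(read_footer());
    return Status::OK();
}
```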
2024-05-23 19:06:21 +08:00
adc364a6fd [feature](Paimon) support deletion vector for Paimon naive reader (#34743) (#35241)
bp #34743
Co-authored-by: 苏小刚 <suxiaogang223@icloud.com>
2024-05-23 00:01:30 +08:00
291cf57c54 [Configurations](multi-catalog) Add enable_parquet_filter_by_min_max and enable_orc_filter_by_min_max Session variables. (#35012) (#35164)
backport #35012
2024-05-22 19:06:12 +08:00
74d66e9650 [Fix](parquet-reader) Fix incorrect Timestamp Int96 min-max statistics written by some old parquet writers by disabling them. (#35041)
Older parquet writers compared INT96 timestamp values incorrectly when producing statistics, so PARQUET-1065 deprecated them. The result is that any writer that produced stats was producing unusable, incorrect values, except in the special case where min == max, where an incorrect ordering is not material to the result. PARQUET-1026 made binary stats available and valid in that special case.
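A hedged sketch of the resulting guard (the struct layout is illustrative, not the actual parquet-format definition): INT96 min/max statistics are ignored unless min == max, the one case PARQUET-1026 keeps valid.

```cpp
#include <cstdint>

struct Int96 {
    uint32_t lo, mid;
    uint32_t hi;  // julian day
    bool operator==(const Int96& o) const {
        return lo == o.lo && mid == o.mid && hi == o.hi;
    }
};

bool can_use_int96_stats(const Int96& min, const Int96& max) {
    // Older writers ordered INT96 incorrectly (PARQUET-1065), so the stats
    // are only trustworthy in the degenerate min == max case.
    return min == max;
}
```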
2024-05-21 13:00:22 +08:00
c0fd98abe5 [Fix](tvf) Fix tvf reading empty files in compressed formats. (#34926)
1. Fix the issue of tvf reading empty compressed files.
2. Move two test cases (`test_local_tvf_compression` and `test_s3_tvf_compression`) from p2 to p0.
2024-05-21 12:59:31 +08:00
6b1c441258 [fix](group_commit) Wal reader should check block length to avoid reading empty block (#34792) 2024-05-18 18:17:56 +08:00
6c515e0c76 [fix](group commit) Make compatibility issues on serializing and deserializing wal file more clear (#34793) 2024-05-18 18:12:43 +08:00
1f0c45204b [fix](iceberg) read the primary key columns when there are equality deletes (#34884)
backport: #34835
2024-05-15 11:37:25 +08:00
02084fd91f [fix](iceberg_orc) Fix the bug where the iceberg reader did not apply position deletes when reading an orc file without a predicate. (#34814) (#34882)
bp #34814
2024-05-15 11:31:29 +08:00
9491b7d422 [fix](iceberg) prevent coredump if read position delete file failed (#34802) 2024-05-14 14:03:33 +08:00
4be589951b Revert "Revert "[fix](csv-reader) fix column split error when there is escape character (#34364)""
This reverts commit d127d67ebe989484bbdf340a4de5b79ded56eecc.
2024-05-07 18:03:56 +08:00
d127d67ebe Revert "[fix](csv-reader) fix column split error when there is escape character (#34364)"
This reverts commit 971e10a9db782c9986b20e1209468e4d7aeedf71.
2024-05-07 13:36:11 +08:00
9d0d7293f0 [fix](json) fix BE crash while loading json data (#34283) 2024-05-07 07:42:53 +08:00
971e10a9db [fix](csv-reader) fix column split error when there is escape character (#34364) 2024-05-07 07:38:35 +08:00
35f8563a75 [feature](iceberg) support iceberg equality delete (#34223) (#34327)
bp #34223

Co-authored-by: Ashin Gau <AshinGau@users.noreply.github.com>
2024-04-30 11:51:29 +08:00
1bfe0f0393 [feature](iceberg) support reading iceberg complex types, the iceberg ORC format, and position deletes. (#33935) (#34256)
master #33935
2024-04-29 14:40:12 +08:00
99af54f779 [Fix](orc-reader) Fix the issue when string col has mixed plain and dict encoding in different stripes. (#34146) (#34248)
backport #34146
2024-04-28 19:43:57 +08:00
0f0c0a266b [opt](parquet)Skip page with offset index (#33082)
Make skip_page() in ColumnChunkReader more efficient: no more reading page headers if there are page locations in the chunk.
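A sketch of the idea under hypothetical types (the real `ColumnChunkReader` differs): when the offset index supplies `PageLocation`s, skipping a page is a plain seek rather than a header parse.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Fields mirror the parquet-format PageLocation: offset, compressed size,
// and the first row index of the page.
struct PageLocation {
    int64_t offset;
    int32_t compressed_page_size;
    int64_t first_row_index;
};

struct ChunkStream {
    int64_t pos = 0;
    void seek(int64_t offset) { pos = offset; }
};

void skip_page(ChunkStream& stream,
               const std::optional<std::vector<PageLocation>>& locations,
               size_t page_index) {
    if (locations && page_index + 1 < locations->size()) {
        // Offset index available: jump directly to the next page's offset.
        stream.seek((*locations)[page_index + 1].offset);
    } else {
        // Fallback: parse the current page header to learn its size,
        // which is what the old code always did.
    }
}
```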
2024-04-26 15:06:16 +08:00
5a5063be20 [bug](fix) heap use after free when json parse failed (#33955) 2024-04-22 22:33:24 +08:00
c631f4f8a8 [fix](schema change) resolve the use count check of source logical column (#33932)
Fix error like:
```
8# google::LogMessageFatal::~LogMessageFatal() in /mnt/hdd01/ci/master-deploy/be/lib/doris_be
 9# doris::vectorized::Block::clear_column_data(int) in /mnt/hdd01/ci/master-deploy/be/lib/doris_be
10# doris::vectorized::ParquetReader::get_next_block(doris::vectorized::Block*, unsigned long*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/format/parquet/vparquet_reader.cpp:514
11# doris::vectorized::VFileScanner::_get_block_impl(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/scan/vfile_scanner.cpp:333
12# doris::vectorized::VScanner::get_block(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/scan/vscanner.cpp:132
13# doris::vectorized::VScanner::get_block_after_projects(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/doris_master/doris/be/src/vec/exec/scan/vscanner.cpp:99
```

This happens because the source logical column *is* the destination logical column when the logical converter is consistent. Previously the column reference was reset only after the conversion completed; on EOF the function returned early and skipped the reset, even though EOF is not a true error.
```cpp
if (_logical_converter->is_consistent()) {
    // If the logical converter is consistent, _src_logical_column is the
    // final destination column; other components will check the use count.
    _src_logical_column.reset();
}
```
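A minimal self-contained sketch of the fix (hypothetical names following the snippet above, not the actual Doris function): the reference is released on the early EOF return as well.

```cpp
#include <memory>

enum class Code { OK, END_OF_FILE, ERROR };

struct Converter {
    bool consistent = true;
    bool is_consistent() const { return consistent; }
};

struct Reader {
    std::shared_ptr<int> src_logical_column = std::make_shared<int>(0);
    Converter converter;

    Code convert(Code inner_status) {
        if (inner_status == Code::END_OF_FILE) {
            // Previously this early return skipped the reset below, leaving an
            // extra reference that tripped the use-count check downstream.
            release_if_consistent();
            return inner_status;  // EOF is not a true error
        }
        if (inner_status != Code::OK) return inner_status;
        release_if_consistent();
        return Code::OK;
    }

    void release_if_consistent() {
        // If the converter is consistent, src_logical_column *is* the final
        // destination column; drop our reference so others see use_count == 1.
        if (converter.is_consistent()) src_logical_column.reset();
    }
};
```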
2024-04-22 12:31:46 +08:00
36a70ba1e7 [Fix](Csv-Reader) Fix the issue of BE core dump caused by improper configuration of column_separator and line_delimiter. (#33693) 2024-04-20 20:06:48 +08:00
0e3ad5cd9d [fix](parquet) fix time zone error(isAdjustedToUTC=true) in parquet reader (#33675) (#33924)
bp (#33675)

Co-authored-by: Ashin Gau <AshinGau@users.noreply.github.com>
2024-04-20 19:06:54 +08:00
25358564ca [Fix](compile) Fix gcc compile on master (#33864)
This was introduced by #33511, which wrongly used `ColumnStr<T> ()`; this violates the C++20 standard (see https://wg21.cmeerw.net/cwg/issue2237) but is still accepted by clang up until now (see llvm/llvm-project#58112).
2024-04-19 23:41:37 +08:00
1300317723 [Exec](join) Support column string64 to avoid join failure when string size overflows uint32 (#33511) (#33850) 2024-04-18 19:43:08 +08:00
ae68cca07d [fix](schema change) CastStringConverter is compiled failed in g++ (#33546)
Follow-up to #32873: CastStringConverter failed to compile in g++ because of an uninitialized value, which is accepted by clang.
2024-04-17 23:42:00 +08:00
9b7af4c0cf [feature](schema change) unified schema change for parquet and orc reader (#32873)
Following #25138, this unifies the schema change interface for the parquet and orc readers; it can be applied to other format readers as well.
The unified schema change interface for all format readers (a sketch follows the list):
- First, read the data into the source column according to the column type of the file;
- Second, convert the source column to the destination column whose type is planned by the FE.
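A minimal sketch of the two-step flow under illustrative types (not Doris's real column classes):

```cpp
#include <cstdint>
#include <vector>

struct FileColumn { std::vector<int32_t> values; };    // type stored in the file
struct PlannedColumn { std::vector<int64_t> values; };  // type planned by the FE

// Step 1: read the data according to the file's own column type.
FileColumn read_source_column() { return FileColumn{{1, 2, 3}}; }

// Step 2: convert the source column into the destination column.
PlannedColumn convert(const FileColumn& src) {
    PlannedColumn dst;
    dst.values.reserve(src.values.size());
    for (int32_t v : src.values) {
        dst.values.push_back(static_cast<int64_t>(v));  // widening conversion
    }
    return dst;
}

int main() {
    PlannedColumn col = convert(read_source_column());
    return col.values.size() == 3 ? 0 : 1;
}
```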
2024-04-12 15:09:25 +08:00
b0b5f84e40 [feature](load) support compressed JSON format data for broker load (#30809) 2024-04-10 14:20:53 +08:00
29556f758e [fix](parquet) fix time zone error in parquet reader (#33217)
`isAdjustedToUTC` was interpreted as exactly the opposite in the parquet reader (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md), so the time with `isAdjustedToUTC=true` was increased by eight hours (UTC+8).

Parquet files with `isAdjustedToUTC=true` can be produced by spark-sql with the following configuration:
```
--conf spark.sql.session.timeZone=UTC
--conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS
```

However, with the following configuration there is no logical or converted type in the parquet metadata, so the time read by doris will also be increased by eight hours (UTC+8). Users need to set the UTC time zone themselves in doris (https://doris.apache.org/docs/dev/advanced/time-zone/):
```
--conf spark.sql.session.timeZone=UTC
--conf spark.sql.parquet.outputTimestampType=INT96
```
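A hedged sketch of the intended semantics (function and parameter names are illustrative): a timestamp with `isAdjustedToUTC=true` is a UTC instant and is shifted into the session time zone, while `isAdjustedToUTC=false` is wall-clock time and must not be shifted.

```cpp
#include <cstdint>

int64_t to_session_micros(int64_t micros, bool is_adjusted_to_utc,
                          int64_t session_offset_seconds) {
    if (is_adjusted_to_utc) {
        // Instant semantics: apply the session time-zone offset.
        return micros + session_offset_seconds * 1'000'000;
    }
    // Local semantics: the value is already a wall-clock reading.
    return micros;
}

int main() {
    // With an Asia/Shanghai session (UTC+8), a UTC instant gains 8 hours,
    // the shift the bug above was applying in the wrong direction.
    const int64_t utc_micros = 0;  // 1970-01-01 00:00:00 UTC
    return to_session_micros(utc_micros, true, 8 * 3600) ==
                   8LL * 3600 * 1'000'000
               ? 0
               : 1;
}
```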
2024-04-07 23:24:22 +08:00
2f2d488668 [opt](parquet) Support hive struct schema change (#32438)
Followup: #31128
This optimization allows doris to correctly read struct type data after the schema was changed in hive.

## Changing struct schema in hive:
```sql
hive> create table struct_test(id int,sf struct<f1: int, f2: string>) stored as parquet;

hive> insert into struct_test values
    >           (1, named_struct('f1', 1, 'f2', 's1')),
    >           (2, named_struct('f1', 2, 'f2', 's2')),
    >           (3, named_struct('f1', 3, 'f2', 's3'));

hive> alter table struct_test change sf sf struct<f1:int, f3:string>;

hive> select * from struct_test;
OK
1	{"f1":1,"f3":null}
2	{"f1":2,"f3":null}
3	{"f1":3,"f3":null}
Time taken: 5.298 seconds, Fetched: 3 row(s)
```

The previous result of doris was:
```sql
mysql> select * from struct_test;
+------+-----------------------+
| id   | sf                    |
+------+-----------------------+
|    1 | {"f1": 1, "f3": "s1"} |
|    2 | {"f1": 2, "f3": "s2"} |
|    3 | {"f1": 3, "f3": "s3"} |
+------+-----------------------+
```

Now the result is the same as hive:

```sql
mysql> select * from struct_test;
+------+-----------------------+
| id   | sf                    |
+------+-----------------------+
|    1 | {"f1": 1, "f3": null} |
|    2 | {"f1": 2, "f3": null} |
|    3 | {"f1": 3, "f3": null} |
+------+-----------------------+
```
2024-03-22 16:35:47 +08:00
2196c534e8 [fix](group commit) Fix compatibility issues on serializing and deserializing wal file (#32299) 2024-03-21 14:07:24 +08:00
f99db38998 [fix](ParquetReader) Fix a problem reading the int96 parquet type (#32394)
`hi - JULIAN_EPOCH_OFFSET_DAYS` could be negative, so we can't use unsigned int throughout.
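A minimal sketch of the signed conversion (the constants are the standard Julian-epoch values; the function name is illustrative): `hi` is the Julian day stored in an INT96 timestamp, and for dates before 1970-01-01 it is smaller than the epoch offset.

```cpp
#include <cstdint>

constexpr int64_t JULIAN_EPOCH_OFFSET_DAYS = 2440588;  // Julian day of 1970-01-01
constexpr int64_t SECONDS_PER_DAY = 24 * 60 * 60;
constexpr int64_t NANOS_PER_SECOND = 1'000'000'000;

int64_t int96_to_unix_nanos(uint32_t hi /* julian day */, uint64_t nanos_of_day) {
    // Cast to signed *before* subtracting: with unsigned arithmetic a
    // pre-1970 day would wrap around to a huge positive value.
    int64_t days = static_cast<int64_t>(hi) - JULIAN_EPOCH_OFFSET_DAYS;
    return days * SECONDS_PER_DAY * NANOS_PER_SECOND +
           static_cast<int64_t>(nanos_of_day);
}

int main() {
    // 1969-12-31 (one day before the epoch) must come out negative.
    return int96_to_unix_nanos(2440587, 0) < 0 ? 0 : 1;
}
```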
2024-03-21 14:07:24 +08:00
2e564036ef [fix](profile) avoid updating profile in destructor (#32131)
Previously, the counters in `profile` could be updated when closing the file reader, and the file reader could be closed while the object was being destructed. At that time the `profile` object may already have been deleted, causing a null-pointer access and a BE crash.

This PR tries to fix this issue (a sketch of the interface follows the list):

1. Remove the "profile counter update" logic from all `close()` methods.

2. Add a new interface `ProfileCollector`

    It has 2 methods:

    - `collect_profile_at_runtime()`

        Can be called at runtime, e.g. in every `get_next_block()` method,
        so that the counters in the profile are updated while the query runs.

    - `collect_profile_before_close()`

        Should be called before the object calls `close()`, and it will only be called once.

3. Derive from `ProfileCollector`

    All classes that may update profile counters in their `close()` method, such as
    `GenericReader`, should extend `ProfileCollector` and implement `collect_profile_before_close()`.

    `collect_profile_before_close()` will be called in `scanner->mark_to_need_to_close()`.
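A hedged sketch of what such an interface could look like (simplified signatures, not the exact Doris declarations):

```cpp
class ProfileCollector {
public:
    virtual ~ProfileCollector() = default;

    // Safe to call repeatedly while the query is running, e.g. from
    // every get_next_block() call.
    virtual void collect_profile_at_runtime() = 0;

    // Must run at most once before close(); the guard makes repeated
    // calls harmless.
    void collect_profile_before_close() {
        if (!_collected_before_close) {
            _collected_before_close = true;
            _collect_profile_before_close_impl();
        }
    }

protected:
    virtual void _collect_profile_before_close_impl() = 0;

private:
    bool _collected_before_close = false;
};
```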
2024-03-21 14:07:22 +08:00
9c1888e7ec [RuntimeFilter](exec) support min max runtime filter and do refactor (#32210) 2024-03-15 18:06:20 +08:00
7b74b199a5 [fix](memory) Fix LRU cache deleter and memory tracking (#32080)
In order to add common code to the value deleter of the LRU cache, all lru cache values now inherit from the LRUCacheValueBase class and track memory in the destructor.
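A hedged sketch of the pattern (names simplified from the description, not the real Doris classes):

```cpp
#include <cstddef>

class LRUCacheValueBase {
public:
    explicit LRUCacheValueBase(size_t tracked_bytes)
            : _tracked_bytes(tracked_bytes) {
        // e.g. consume(_tracked_bytes) on a memory tracker here
    }
    virtual ~LRUCacheValueBase() {
        // common code for every cache value: release the tracked memory,
        // e.g. release(_tracked_bytes) on the memory tracker
    }

private:
    size_t _tracked_bytes;
};

// The cache's value deleter then only needs the base type; the virtual
// destructor dispatches to each concrete value's cleanup. The void* must
// have originally been a LRUCacheValueBase*.
void cache_value_deleter(void* value) {
    delete static_cast<LRUCacheValueBase*>(value);
}
```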
2024-03-15 17:57:58 +08:00