Commit Graph

1479 Commits

Author SHA1 Message Date
3103bb08dc [pick](Variant) casting to decimal type may lost precision (#39843)
#39650
2024-08-23 22:47:32 +08:00
6ceb574aa0 [branch-2.1]Pick IO limit/workload group usage table (#39839) 2024-08-23 18:51:47 +08:00
8ce8887b75 [branch-2.1](memory) Refactor refresh workload groups weighted memory ratio and record refresh interval memory growth (#39760)
pick #38168
overwrites changes in #37221 on workload_group_manager.cpp. If need to
pick 37221, ignore it.
2024-08-22 17:33:11 +08:00
a3fd13fee6 [fix](catalog) set timeout for split fetch (#39346) (#39624)
bp #39346
2024-08-20 21:59:55 +08:00
12ed2951c4 [fix] (inverted index) remove tmp columns in block (#39369) (#39533) 2024-08-20 20:53:23 +08:00
5fcd6e6270 [Fix](load) Fix the incorrect src value printed in the error log when strict mode is true #39447 (#39587)
cherry pick from #39447
2024-08-20 12:02:13 +08:00
a44a274563 [Fix](parquet-reader) Fix and optimize parquet min-max filtering. (#39375)
Backport #38277.
2024-08-15 14:12:54 +08:00
677435cef8 [Pick](Branch-2.1) pick json reader fix and support specify $. as column (#39271)
#39206
#38213
2024-08-13 17:44:45 +08:00
7e7729c4b0 [cherry-pick](branch-21) fix partition-topn calculate partition input rows have error (#39100) (#39281)
## Proposed changes

cherry-pick from master: #39100 

<!--Describe your changes.-->
2024-08-13 17:42:29 +08:00
3da2d1c9d6 [bug](parquet)Fix the problem that the parquet reader reads the missing sub-columns of the struct and fails. (#38718) (#39192)
bp #38718
2024-08-11 20:37:40 +08:00
e15b6cfc68 [fix](be) return correct canceled status from scanner (#36392) (#39111)
## Proposed changes

pick #36392
2024-08-09 04:02:42 +08:00
44cb7978a9 [opt](index) add more inverted index profile metrics #36696 (#38858) 2024-08-08 14:16:55 +08:00
bc644cb253 [opt](catalog) merge scan range to avoid too many splits (#38311) (#38964)
bp #38311
2024-08-06 21:57:02 +08:00
607c0b82a9 [opt](serde)Optimize the filling of fixed values ​​into block columns without repeated deserialization. (#37377) (#38245) (#38810)
## Proposed changes
pick pr: #38575  and fix this pr bug :  #38245
2024-08-05 09:13:08 +08:00
5d02c48715 [feature](hive)Support reading renamed Parquet Hive and Orc Hive tables. (#38432) (#38809)
bp #38432 

## Proposed changes
Add `hive_parquet_use_column_names` and `hive_orc_use_column_names`
session variables to read the table after rename column in `Hive`.

These two session variables are referenced from
`parquet_use_column_names` and `orc_use_column_names` of `Trino` hive
connector.

By default, these two session variables are true. When they are set to
false, reading orc/parquet will access the columns according to the
ordinal position in the Hive table definition.

For example:
```mysql
in Hive :
hive> create table tmp (a int , b string) stored as parquet;
hive> insert into table tmp values(1,"2");
hive> alter table tmp  change column  a new_a int;
hive> insert into table tmp values(2,"4");

in Doris :
mysql> set hive_parquet_use_column_names=true;
Query OK, 0 rows affected (0.00 sec)

mysql> select  * from tmp;
+-------+------+
| new_a | b    |
+-------+------+
|  NULL | 2    |
|     2 | 4    |
+-------+------+
2 rows in set (0.02 sec)

mysql> set hive_parquet_use_column_names=false;
Query OK, 0 rows affected (0.00 sec)

mysql> select  * from tmp;
+-------+------+
| new_a | b    |
+-------+------+
|     1 | 2    |
|     2 | 4    |
+-------+------+
2 rows in set (0.02 sec)
```

You can use `set
parquet.column.index.access/orc.force.positional.evolution = true/false`
in hive 3 to control the results of reading the table like these two
session variables. However, for the rename struct inside column parquet
table, the effects of hive and doris are different.
2024-08-05 09:06:49 +08:00
7bdc508ac7 [Bug](fix) fix coredump case in (not null, null) execpt (not null, not null) case (#38756)
## Proposed changes

Issue Number: close #38612

<!--Describe your changes.-->
2024-08-04 10:44:10 +08:00
9d23ccf1f2 [Improvement](schema scan) Use async scanner for schema scanners (#38… (#38666)
…403)
2024-08-01 16:05:24 +08:00
338fa32303 [pick](simdjson) fix simdjson with object array when jsonroot is not empty (#38633)
## Proposed changes
backport: https://github.com/apache/doris/pull/38490
Issue Number: close #xxx

<!--Describe your changes.-->
2024-08-01 11:04:54 +08:00
41fa7bc9fd [bugfix](paimon)Fixed the reading of timestamp with time zone type data for 2.1 (#37716) (#38592)
bp: #37716
2024-08-01 10:23:06 +08:00
17d351af80 [fix](csv reader) fix csv parser incorrect if enclosing line_delimiter (#38347) (#38445)
Csv reader parse data incorrect when data enclosing line_delimiter, for
example, line_delimiter is \n and enclose is ', data as follows:
```
'aaaaaaaaaaaa
bbbb'
```
it will be parsed as two columns: `'aaaaaaaaaaaa` and `bbbb',` rather
than one column
```
'aaaaaaaaaaaa
bbbb'
```

The reason why this happened is csv reader will not reset result when
not match enclose in this `output_buf_read`, causing incorrect
truncation was made.

Co-authored-by: Xin Liao <liaoxinbit@126.com>
2024-07-29 14:55:45 +08:00
e2bb86e7f8 [fix](inverted index) fixed in_list condition not indexed on pipelinex (#38178)
## Proposed changes

https://github.com/apache/doris/pull/36565
https://github.com/apache/doris/pull/37842
https://github.com/apache/doris/pull/37921
https://github.com/apache/doris/pull/37386

<!--Describe your changes.-->
2024-07-25 14:42:34 +08:00
a751372e76 [Feature](multi-catalog) Add memory tracker for orc reader/writer and arrow parquet writer。 (#37257)
## Proposed changes

backport #37234
2024-07-25 13:51:59 +08:00
57864e8554 [cherry-pick](branch-21) fix collect_set function core dump without arena pool (#38234) (#38307)
## Proposed changes

cherry-pick from master #38234

<!--Describe your changes.-->
2024-07-25 12:05:52 +08:00
3ea26a8c95 [fix](external) record not found file number (#38253) (#38285)
bp #38253
2024-07-25 11:03:19 +08:00
ef00dad680 [Fix](multi-catalog) Fix some undefined behaviors. (#38274)
## Proposed changes

backport #37845
2024-07-24 16:14:34 +08:00
193be20c86 [feature](csv)Supports reading CSV data using LF and CRLF as line separators. (#37687) (#38099)
bp #37687
2024-07-22 22:53:04 +08:00
d9fd419e47 [Fix](JsonReader) fix json with duplicate key entry may result out of bound exception (#38147)
#38146
2024-07-19 22:53:02 +08:00
7b141ffde7 [pick]add min scan thread num for workload group's scan thread (#38123)
## Proposed changes

pick #38096
2024-07-19 18:43:05 +08:00
de2272ce48 [fix](round) fix round decimal128 overflow (#37733) (#37963)
cherry-pick #37733 to branch-2.1
2024-07-18 23:50:23 +08:00
88d771d360 [pipeline](fix) Avoid to use a freed dependency when cancelled (#34584) (#38046)
## Proposed changes

pick #34584
<!--Describe your changes.-->
2024-07-18 15:27:10 +08:00
3d5043817a Revert "[opt](serde)Optimize the filling of fixed values ​​into block columns without repeated deserialization. (#37377)" (#38007)
Reverts apache/doris#37530
Need more test, revert it temporarily
2024-07-17 21:44:25 +08:00
6932eef65e [opt](serde)Optimize the filling of fixed values ​​into block columns without repeated deserialization. (#37377) (#37530)
bp #37377
2024-07-16 10:56:13 +08:00
e7a001c420 [enhance](mtmv)support partition tvf (#37795)
pick from: https://github.com/apache/doris/pull/36479 and
https://github.com/apache/doris/pull/37201
2024-07-16 09:27:44 +08:00
1d49d386aa [cherry-pick](branch-21) remove the useless code in column vector (#34432) (#37827)
cherry-pick from master https://github.com/apache/doris/pull/34432

Co-authored-by: HappenLee <happenlee@hotmail.com>
2024-07-15 22:10:58 +08:00
967173d7d0 [cherry-pick-2.1](table-function) pick some table functions exec performance (#34090) (#37778)
## Proposed changes

pick from master:
https://github.com/apache/doris/pull/33904
https://github.com/apache/doris/pull/34090

Co-authored-by: HappenLee <happenlee@hotmail.com>
2024-07-15 17:15:56 +08:00
a4d37d96ca [opt](file-scanner) add not found file number in profile (#37042) (#37764)
bp #37042
2024-07-15 17:11:06 +08:00
20758576b2 [fix](split) remove retry when fetch split batch failed (#37637)
bp: #37636
2024-07-12 22:46:03 +08:00
87912de93f [fix](scan) catch exceptions thrown in scanner (#36101) (#37408)
## Proposed changes

pick #36101

The uncaught exceptions thrown in the scanner will cause the BE to
crash.
2024-07-12 08:49:39 +08:00
217eac790b [pick](Variant) pick some refactor and fix #34925 #36317 #36201 #36793 (#37526) 2024-07-11 21:25:34 +08:00
b272247a57 [pick]log thread num (#37258)
## Proposed changes

pick #37159
2024-07-04 15:27:52 +08:00
bd24a8bdd9 [Fix](csv_reader) Add a session variable to control whether empty rows in CSV files are read as NULL values (#37153)
bp: #36668
2024-07-02 22:12:17 +08:00
e25717458e [opt](catalog) add some profile for parquet reader and change meta cache config (#37040) (#37146)
bp #37040
2024-07-02 20:58:43 +08:00
d0eea3886d [fix](multi-catalog) Revert #36575 and check nullptr of data column (#37086)
Revert #36575, because `VScanner::get_block` will check
`DCHECK(block->rows() == 0)`, so block should be cleared when `eof =
true`.
2024-07-02 15:32:52 +08:00
e686e85f27 [opt](split) add max wait time of getting splits (#36842)
bp: #36843
2024-07-01 22:05:25 +08:00
25fb30c723 [fix](intersect) fix coredump caused by intersect of nullable and not nullable children #36401 (#36441)
## Proposed changes

Pick #36765
2024-06-26 17:45:21 +08:00
695d58f354 [cherry-pick](scan)scanner could eos early when reached limit (#36535) (#36736)
## Proposed changes
cherry-pick from master #36535
2024-06-25 17:22:43 +08:00
e4b6dac0c1 [fix](ubsan) reinterpret_cast fix length types to int8 is not safe (#36725)
## Proposed changes

Fix type check of ubsan. 
```
/root/doris/be/src/vec/exec/format/parquet/fix_length_plain_decoder.h:75:78: runtime error: member call on address 0x5582f35db5c0 which does not point to an object of type 'doris::vectorized::ColumnVector<signed char>'
0x5582f35db5c0: note: object is of type 'doris::vectorized::ColumnVector<int>'
 83 55 00 00  78 c0 b0 5a 82 55 00 00  02 00 00 00 00 00 00 00  10 a0 00 d7 83 55 00 00  10 a0 00 d7
              ^~~~~~~~~~~~~~~~~~~~~~~
              vptr for 'doris::vectorized::ColumnVector<int>'
doris::Status doris::vectorized::FixLengthPlainDecoder::_decode_values<false>(COW<doris::vectorized::IColumn>::mutable_ptr<doris::vectorized::IColumn>&, std::shared_ptr<doris::vectorized::IDataType const>&, doris::vectorized::ColumnSelectVector&, bool) at fix_length_plain_decoder.h:75:78
```
2024-06-24 14:03:41 +08:00
17cf34b244 [Fix](multi-catalog) Fix core in orc and parquet reader sometimes after low mem exception. (#36575)
## Proposed changes

Backport #36574.
2024-06-22 11:28:21 +08:00
f7f7b2b738 [Enhancement](multi-catalog) Add more error msgs for wrong data types in orc and parquet reader. (#36580)
Backport #36417
2024-06-20 18:10:25 +08:00
f59dc4fb37 [opt](split) generate and get split batch concurrently (#36044)
bp #36045, and turn on batch split, which is turn off in #36109
Generate and get split batch concurrently.
`SplitSource.getNextBatch` remove the synchronization, and make each get their splits concurrently, and `SplitAssignment` generates splits asynchronously.
2024-06-19 16:16:02 +08:00