Commit Graph

541 Commits

Author SHA1 Message Date
e15b6cfc68 [fix](be) return correct canceled status from scanner (#36392) (#39111)
## Proposed changes

pick #36392
2024-08-09 04:02:42 +08:00
44cb7978a9 [opt](index) add more inverted index profile metrics #36696 (#38858) 2024-08-08 14:16:55 +08:00
bc644cb253 [opt](catalog) merge scan range to avoid too many splits (#38311) (#38964)
bp #38311
2024-08-06 21:57:02 +08:00
607c0b82a9 [opt](serde)Optimize the filling of fixed values ​​into block columns without repeated deserialization. (#37377) (#38245) (#38810)
## Proposed changes
pick pr: #38575  and fix this pr bug :  #38245
2024-08-05 09:13:08 +08:00
5d02c48715 [feature](hive)Support reading renamed Parquet Hive and Orc Hive tables. (#38432) (#38809)
bp #38432 

## Proposed changes
Add `hive_parquet_use_column_names` and `hive_orc_use_column_names`
session variables to read the table after rename column in `Hive`.

These two session variables are referenced from
`parquet_use_column_names` and `orc_use_column_names` of `Trino` hive
connector.

By default, these two session variables are true. When they are set to
false, reading orc/parquet will access the columns according to the
ordinal position in the Hive table definition.

For example:
```mysql
in Hive :
hive> create table tmp (a int , b string) stored as parquet;
hive> insert into table tmp values(1,"2");
hive> alter table tmp  change column  a new_a int;
hive> insert into table tmp values(2,"4");

in Doris :
mysql> set hive_parquet_use_column_names=true;
Query OK, 0 rows affected (0.00 sec)

mysql> select  * from tmp;
+-------+------+
| new_a | b    |
+-------+------+
|  NULL | 2    |
|     2 | 4    |
+-------+------+
2 rows in set (0.02 sec)

mysql> set hive_parquet_use_column_names=false;
Query OK, 0 rows affected (0.00 sec)

mysql> select  * from tmp;
+-------+------+
| new_a | b    |
+-------+------+
|     1 | 2    |
|     2 | 4    |
+-------+------+
2 rows in set (0.02 sec)
```

You can use `set
parquet.column.index.access/orc.force.positional.evolution = true/false`
in hive 3 to control the results of reading the table like these two
session variables. However, for the rename struct inside column parquet
table, the effects of hive and doris are different.
2024-08-05 09:06:49 +08:00
41fa7bc9fd [bugfix](paimon)Fixed the reading of timestamp with time zone type data for 2.1 (#37716) (#38592)
bp: #37716
2024-08-01 10:23:06 +08:00
e2bb86e7f8 [fix](inverted index) fixed in_list condition not indexed on pipelinex (#38178)
## Proposed changes

https://github.com/apache/doris/pull/36565
https://github.com/apache/doris/pull/37842
https://github.com/apache/doris/pull/37921
https://github.com/apache/doris/pull/37386

<!--Describe your changes.-->
2024-07-25 14:42:34 +08:00
3ea26a8c95 [fix](external) record not found file number (#38253) (#38285)
bp #38253
2024-07-25 11:03:19 +08:00
7b141ffde7 [pick]add min scan thread num for workload group's scan thread (#38123)
## Proposed changes

pick #38096
2024-07-19 18:43:05 +08:00
3d5043817a Revert "[opt](serde)Optimize the filling of fixed values ​​into block columns without repeated deserialization. (#37377)" (#38007)
Reverts apache/doris#37530
Need more test, revert it temporarily
2024-07-17 21:44:25 +08:00
6932eef65e [opt](serde)Optimize the filling of fixed values ​​into block columns without repeated deserialization. (#37377) (#37530)
bp #37377
2024-07-16 10:56:13 +08:00
e7a001c420 [enhance](mtmv)support partition tvf (#37795)
pick from: https://github.com/apache/doris/pull/36479 and
https://github.com/apache/doris/pull/37201
2024-07-16 09:27:44 +08:00
a4d37d96ca [opt](file-scanner) add not found file number in profile (#37042) (#37764)
bp #37042
2024-07-15 17:11:06 +08:00
20758576b2 [fix](split) remove retry when fetch split batch failed (#37637)
bp: #37636
2024-07-12 22:46:03 +08:00
87912de93f [fix](scan) catch exceptions thrown in scanner (#36101) (#37408)
## Proposed changes

pick #36101

The uncaught exceptions thrown in the scanner will cause the BE to
crash.
2024-07-12 08:49:39 +08:00
217eac790b [pick](Variant) pick some refactor and fix #34925 #36317 #36201 #36793 (#37526) 2024-07-11 21:25:34 +08:00
b272247a57 [pick]log thread num (#37258)
## Proposed changes

pick #37159
2024-07-04 15:27:52 +08:00
e686e85f27 [opt](split) add max wait time of getting splits (#36842)
bp: #36843
2024-07-01 22:05:25 +08:00
695d58f354 [cherry-pick](scan)scanner could eos early when reached limit (#36535) (#36736)
## Proposed changes
cherry-pick from master #36535
2024-06-25 17:22:43 +08:00
f59dc4fb37 [opt](split) generate and get split batch concurrently (#36044)
bp #36045, and turn on batch split, which is turn off in #36109
Generate and get split batch concurrently.
`SplitSource.getNextBatch` remove the synchronization, and make each get their splits concurrently, and `SplitAssignment` generates splits asynchronously.
2024-06-19 16:16:02 +08:00
4a277affdc [fix](scan) In-predicate should not be pushed down for non-key column(#35913) (#35968)
pick #35913
2024-06-11 11:13:34 +08:00
fe1a4c4136 [Feature](IP) support ipv4/ipv6 with inverted index and conjuncts for query (#35734)
support data type ipv4/ipv6 with inverted index 
and then we can query like "> or < or >= or <= or in/not in " this
conjuncts expr for ip with inverted index speeding up
2024-06-03 23:24:03 +08:00
680be6d19f [fix](ub) fix uninitialized accesses in BE (#35370)
ubsan hints:
```c++
/root/doris/be/src/olap/hll.h:93:29: runtime error: load of value 3078029312, which is not a valid value for type 'HllDataType'
/root/doris/be/src/olap/hll.h:94:23: runtime error: load of value 3078029312, which is not a valid value for type 'HllDataType'
/root/doris/be/src/runtime/descriptors.h:439:38: runtime error: load of value 118, which is not a valid value for type 'bool'
/root/doris/be/src/vec/exec/vjdbc_connector.cpp:61:50: runtime error: load of value 35, which is not a valid value for type 'bool' 
```
2024-05-29 20:31:07 +08:00
86c7092f21 [opt](external) ignore not find files (#35319)
The file list is got from external meta cache, and the file may already
be removed from storage.
We should ignore not found files and that query continue.
2024-05-28 18:51:56 +08:00
d97788dec8 [Refactor](Status) Refactor the scanner scheduler code make return error msg means (#35286)
## Proposed changes

Before error msg:
```
Failed to submit scanner to scanner pool
```

After error msg:
```
Failed to submit scanner to scanner pool reason:Scan thread pool had shutdown|type 1

```
2024-05-28 18:49:55 +08:00
c44affb43f Add downgrade scan thread num by column num (#35351) 2024-05-27 15:27:12 +08:00
eb49cd839b [refactor](datalake) return the error status instead of static_cast<void> (#34873)
Followup #34797
`static_cast<void>` has ignored the wrong status, some of them should make the query finished with error status, so replace `static_cast<void>`  with `RETURN_IF_ERROR`.

The following three scenarios need to be handled separately and cannot be simply replaced:
1. The outer function returns void;
2. Call status function inner constructors or destructors;
3. Call status function with best effort, and should ignore the wrong status.
2024-05-23 19:06:21 +08:00
adc364a6fd [feature](Paimon) support deletion vector for Paimon naive reader (#34743) (#35241)
bp #34743
Co-authored-by: 苏小刚 <suxiaogang223@icloud.com>
2024-05-23 00:01:30 +08:00
903ff32021 [opt](fe) exit FE when transfer to (non)master failed (#34809) (#35158)
bp #34809
2024-05-21 22:31:47 +08:00
98f8eb5c43 [opt](split) get file splits in batch mode (#34032) (#35107)
bp  #34032
2024-05-21 22:27:07 +08:00
1f0c45204b [fix](iceberg) read the primary key columns if hasing equality delete (#34884)
backport: #34835
2024-05-15 11:37:25 +08:00
cc00666be6 [opt](inverted index) add inlist condition handling to compound (#34134)
1. Previously, the compound did not support the inlist condition, which could impact performance if an inverted index was created.
2024-05-10 14:35:47 +08:00
e085f75a43 [opt](file-scanner) print current path when encountering error (#34365) (#34523)
bp #34365
2024-05-08 14:49:03 +08:00
1bfe0f0393 [feature](iceberg)support read iceberg complex type,iceberg.orc format and position delete. (#33935) (#34256)
master #33935
2024-04-29 14:40:12 +08:00
9f0a5690a6 [profile](scan) add projection time in scaner #34120 2024-04-26 07:43:40 +08:00
47b54d4bd5 Fix remote scan pool (#33976) 2024-04-25 15:04:43 +08:00
4d7ac82305 [profile](scanner) Fix wrong metrics (#33965) 2024-04-22 22:33:24 +08:00
59de97be5e [improvement](mow) Add profile for delete_bitmap get_agg function (#33576) 2024-04-17 23:42:13 +08:00
2cd4012541 [opt](scan) read scan ranges in the order of partitions (#33515) (#33657)
backport: #33515
2024-04-17 23:42:12 +08:00
8ee8de7857 [Fix](executor)reset remote scan thread num #33579 2024-04-17 23:42:11 +08:00
249a9c9875 [Feature](Variant) support aggregation model for Variant type (#33493)
refactor use `insert_from` to replace `replace_column_data` for variable lengths columns
2024-04-17 23:42:00 +08:00
6bcf24b1f6 [bug](not in) if not in (null) could eos early (#33482)
* [bug](not in) if not in (null) could eos early
2024-04-17 23:41:59 +08:00
Pxl
5f30463bb3 [Chore](descriptors) remove unused codes for descriptors (#33408)
remove unused codes for descriptors
2024-04-12 15:09:25 +08:00
f7d52b5b1c [feature](expr) add type check when expr prepare (#33330) 2024-04-11 09:31:50 +08:00
Pxl
8fd6d4c41b [Chore](build) add -Wconversion and remove some unused code (#33127)
add -Wconversion and remove some unused code
2024-04-10 15:26:08 +08:00
8e19cdd745 [featrue](expr) support common subexpression elimination be part (#32673) 2024-04-10 11:56:21 +08:00
cf7595d423 [opt](memory) Optimize mem tracker accuracy (#32039) (#33140) 2024-04-10 11:42:19 +08:00
3c4ccb3981 Revert "[opt](scan) read scan ranges in the order of partitions (#31630)"
This reverts commit 5d99dffe6f1a3fcb107ce56181aeff96ef222def.
2024-04-09 12:37:31 +08:00
0234976ab7 [refactor](meta scan) Remove RPC from execute threads (#33378) 2024-04-08 20:28:02 +08:00
a8232c67f9 [pipelineX](runtime filter) Fix task timeout caused by runtime filter (#33332) (#33369) 2024-04-08 16:30:32 +08:00