Commit Graph

5103 Commits

Author SHA1 Message Date
f498beed07 [improvement](jdbc)Support for automatically obtaining the precision of the trino/presto timestamp type (#21386) 2023-07-04 18:59:42 +08:00
aec5bac498 [improvement](jdbc)Support for automatically obtaining the precision of the hana timestamp type (#21380) 2023-07-04 18:59:21 +08:00
b27fa70558 [fix](jdbc) fix presto jdbc catalog pushDown and nameFormat (#21447) 2023-07-04 18:58:33 +08:00
9d997b9349 [revert](nereids) Revert data size agg (#21216)
To make stats derivation more precise
2023-07-04 18:02:15 +08:00
1b86e658fd [fix](Nereids): decrease the memo GroupExpression of limits (#21354) 2023-07-04 17:15:41 +08:00
c2b483529c [fix](heartbeat) need to set backend status base on edit log (#21410)
For non-master FEs, the Backend's status must be set based on the content of the edit log.
There is a bug: if the fe config `max_backend_heartbeat_failure_tolerance_count` is set larger than one,
a non-master FE will not set a Backend as dead until it receives enough heartbeat edit logs,
which is wrong.
This causes the Backend to be dead on the Master FE but alive on non-master FEs.
2023-07-04 17:12:53 +08:00
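The fixed behavior above can be sketched as follows; this is a minimal illustration with hypothetical class names, not Doris's actual code:

```java
// Hedged sketch: a non-master FE applies Backend liveness directly from the
// replayed edit log instead of counting heartbeat failures locally, so its
// view always matches the master's regardless of the tolerance count config.
public class HeartbeatReplay {
    static class Backend {
        boolean alive = true;
    }

    static class NonMasterFe {
        final Backend backend = new Backend();

        // Replay one heartbeat edit log entry: trust the master's decision,
        // ignoring max_backend_heartbeat_failure_tolerance_count locally.
        void replayHeartbeat(boolean masterSaysAlive) {
            backend.alive = masterSaysAlive;
        }
    }
}
```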
9adbca685a [opt](hudi) use spark bundle to read hudi data (#21260)
Use spark-bundle instead of hive-bundle to read hudi data.

**Advantages** of using spark-bundle to read hudi data:
1. The performance of spark-bundle is more than twice that of hive-bundle
2. spark-bundle uses `UnsafeRow`, which reduces data copying and JVM GC time
3. spark-bundle supports `Time Travel`, `Incremental Read`, and `Schema Change`; these functions can be quickly ported to Doris

**Disadvantages** of using spark-bundle to read hudi data:
1. More dependencies make hudi-dependency.jar very cumbersome (from 138M to 300M)
2. spark-bundle only provides an `RDD` interface and cannot be used directly
2023-07-04 17:04:49 +08:00
90dd8716ed [refactor](multicast) change the way multicast do filter, project and shuffle (#21412)
Co-authored-by: Jerry Hu <mrhhsg@gmail.com>

1. Filtering is done at the sending end rather than the receiving end
2. Projection is done at the sending end rather than the receiving end
3. Each sender can use different shuffle policies to send data
2023-07-04 16:51:07 +08:00
9e8501f191 [Performance](Nereids): speedup analyze by removing sort()/addAll() in OptimizeGroupExpressionJob to (#21452)
Calling sort() and addAll() over all rules costs much time and is unnecessary; remove them to speed up analysis.

explain tpcds q72: 1.72s -> 1.46s
2023-07-04 16:01:54 +08:00
Pxl
65cb91e60e [Chore](agg-state) add sessionvariable enable_agg_state (#21373)
add sessionvariable enable_agg_state
2023-07-04 14:25:21 +08:00
e4c0a0ac24 [improve](dependency)Upgrade dependency version (#21431)
exclude old netty version
upgrade spring-boot version to 2.7.13
use ojdbc8 to replace ojdbc6
upgrade jackson version to 2.15.2
upgrade fabric8 version to 6.7.2
2023-07-04 11:29:21 +08:00
8cbc1d58e1 [fix](MTMV) Disable partition specification temporarily (#20793)
The syntax for partition updates has not yet been investigated, and the current partition syntax has issues. Therefore, the partition syntax is temporarily removed in the current version and will be added back after future research.
2023-07-04 11:09:04 +08:00
d5f39a6e54 [Performance](Nereids) refactor code speedup analyze (#21458)
Refactor the code paths that cost the most time.
2023-07-04 10:59:07 +08:00
599ba4529c [fix](nereids) need run ConvertInnerOrCrossJoin rule again after EliminateNotNull (#21346)
After running the EliminateNotNull rule, the join conjuncts may be removed from an inner join node,
so the ConvertInnerOrCrossJoin rule must run again to convert an inner join with no join conjuncts into a cross join node.
2023-07-04 10:52:36 +08:00
11e18f4c98 [Fix](multi-catalog) fix NPE for FileCacheValue. (#21441)
FileCacheValue.files may be null if no files exist for some partitions.
2023-07-03 23:38:58 +08:00
63b170251e [fix](nereids)cast filter and join conjunct's return type to boolean (#21434) 2023-07-03 17:22:46 +08:00
f80df20b6f [Fix](multi-catalog) Fix read error in mixed partition locations. (#21399)
Issue Number: close #20948

Fix read errors in mixed partition locations (for example, some partition locations are on s3, others on hdfs) by calling `getLocationType` at the file-split level instead of the table level.
2023-07-03 15:14:17 +08:00
9fa2dac352 [fix](Nereids): DefaultPlanRewriter visit plan children. (#21395) 2023-07-03 13:20:01 +08:00
17af099dc3 [fix](nereids)miss group id in explain plan #21402
After we introduced the "PushdownFilterThroughProject" post processor, some plan nodes lost their groupExpression (the withChildren function removes groupExpression).
This is bad for debugging, since it takes more time to find the owner group of a plan node.
This PR records the missing owner group id in the plan node's mutableState.
2023-07-03 13:16:33 +08:00
2827bc1a39 [Fix](nereids) fix a bug in ColumnStatistics.numNulls update #21220
No impact on tpch.
Improves tpcds q95:
before 1.63 sec, after 1.30 sec
2023-07-03 10:51:23 +08:00
Pxl
59c1bbd163 [Feature](materialized view) support query match mv with agg_state on nereids planner (#21067)
* support creating an mv containing an agg_state column
* support query match of mv with agg_state on the nereids planner
2023-07-03 10:19:31 +08:00
124516c1ea [Fix](orc-reader) Fix Wrong data type for column error when column order in hive table is not same in orc file schema. (#21306)
A `Wrong data type for column` error occurs when the column order in the hive table is not the same as in the orc file schema.

The root cause lies in handling the following case:

Tables in orc format from Hive 1.x may carry system column names such as `_col0`, `_col1`, `_col2`... in the underlying orc file schema, which need to be mapped using the column names in the hive table.

### Solution
Currently this issue is fixed by handling the case above when the hive version is specified as 1.x.x in the hive catalog configuration.

```sql
CREATE CATALOG hive PROPERTIES (
    'hive.version' = '1.x.x'
);
```
2023-07-03 09:32:55 +08:00
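The positional mapping the fix relies on can be sketched as follows; this is an illustrative example with a hypothetical helper name, not Doris's actual orc-reader code:

```java
import java.util.ArrayList;
import java.util.List;

public class OrcNameMapping {
    // Hive 1.x orc files may carry synthetic names (_col0, _col1, ...), so map
    // them back to the hive table's column names by position; real names pass
    // through unchanged (illustrative sketch).
    static List<String> mapColumns(List<String> fileCols, List<String> tableCols) {
        List<String> mapped = new ArrayList<>();
        for (int i = 0; i < fileCols.size(); i++) {
            String name = fileCols.get(i);
            mapped.add(name.matches("_col\\d+") ? tableCols.get(i) : name);
        }
        return mapped;
    }
}
```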
f5af735fa6 [fix](multi-catalog)fix obj file cache and dlf iceberg catalog (#21238)
1. Fix the storage prefix for the obj file cache: oss/cos/obs prefixes don't need to be converted to the s3 prefix; convert only when creating a split.
2. dlf iceberg catalog: support dlf iceberg tables, using s3 file io.
2023-07-02 21:08:41 +08:00
f74e635aa5 [bug](proc) fix NumberFormatException in show proc '/current_queries' (#21400)
If the current query has been running for a very long time, its ExecTime may be larger than MAX_INT, and a NumberFormatException will be thrown when executing "show proc '/current_queries'".
The query's ExecTime is of long type; we should not use 'Integer.parseInt' to parse it.
2023-07-01 17:42:46 +08:00
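The overflow described above is easy to reproduce; a minimal illustration (the values and helper name are made up for the example):

```java
public class ExecTimeParse {
    // Fix: parse as long, since the query's ExecTime is a long on the FE side.
    static long parseExecTime(String execTimeMs) {
        return Long.parseLong(execTimeMs);
    }

    public static void main(String[] args) {
        String execTime = "3000000000";  // ~35 days in ms, > Integer.MAX_VALUE (2147483647)
        try {
            Integer.parseInt(execTime);  // the old code path throws here
        } catch (NumberFormatException e) {
            System.out.println("Integer.parseInt overflows: " + e.getMessage());
        }
        System.out.println(parseExecTime(execTime));  // parses fine as a long
    }
}
```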
887d33c789 [fix](cup) add keywords KW_PERCENT (#21404)
Otherwise it may cause edit log replay errors, e.g. when parsing a create routine load stmt that contains this keyword as
a column name.
2023-07-01 16:53:54 +08:00
603f4ab20f [fix](truncate) it will directly return and avoid throwing IllegalStateException caused by bufferSize equals zero when table has no partition (#21378)
If the table currently has no partitions, the truncate SQL is an empty command; it should return directly and avoid throwing an IllegalStateException caused by bufferSize being zero.

Issue Number: close #21316
Co-authored-by: tongyang.han <tongyang.han@jiduauto.com>
2023-07-01 08:39:38 +08:00
0e17cd4d92 [fix](hudi) use hudi api to split the COW table (#21385)
Fix two bugs:

1. COW & Read Optimized tables use the hive splitter to split files, but it can't recognize some specific files:
ERROR 1105 (HY000): errCode = 2, detailMessage =
(172.21.0.101)[CORRUPTION]Invalid magic number in parquet file, bytes read: 3035, file size: 3035,
path: /usr/hive/warehouse/hudi.db/test/.hoodie/metadata/.hoodie/00000000000000.deltacommit.inflight, read magic:
2. The read optimized table created by spark adds an empty partition even if the table has no partitions, so we have to filter these empty partition keys in the hive client.
| test_ro | CREATE TABLE `test_ro`(
  `_hoodie_commit_time` string COMMENT '',
  ...
  `ts` bigint COMMENT '')
PARTITIONED BY (
 `` string)
ROW FORMAT SERDE
2023-07-01 08:35:33 +08:00
96aa0e5876 [fix](tvf) To fix the bug that requires adding backticks on "frontends()" in order to query the frontends TVF. (#21338) 2023-06-30 22:37:21 +08:00
ed2cd4974e [fix](nereids) to_date should return type datev2 for datetimev2 (#21375)
The to_date function in nereids should return DATEV2 if the arg type is DATETIMEV2.
Before, the return type was DATE, which would cause BE to get wrong query results.
2023-06-30 21:42:59 +08:00
18b7d84436 [fix](Nereids): reject infer distinct when children exist NLJ (#21391) 2023-06-30 20:29:48 +08:00
4117f0b93b [improve](nereids) Support outer rf into inner left outer join (#21368)
Support pushing runtime filters into a left outer join from an outer join of an allowed type.
Before this PR, some join types, such as full outer join, were not allowed to do rf pushing at all.
For example, (a left join b on a.id = b.id) inner join c on a.id2 = c.id2 would lose the rf pushing from c.id2 to inner table a.
This PR relaxes that limitation to support pushing rf into a left outer join from an outer join of an allowed type.
2023-06-30 19:07:39 +08:00
164448dac3 [fix](nereids) fix rf info missing for set op (#21367)
During physical set operation translation, we forgot to inherit rf-related info from the set op's children, which leads to merge filter errors and long wait times.
2023-06-30 18:50:29 +08:00
Pxl
88cbea2b56 [Bug](agg-state) fix core dump on not nullable argument for aggstate's nested argument (#21331)
fix core dump on not nullable argument for aggstate's nested argument
2023-06-30 18:20:25 +08:00
de39632f1b [feature](binlog) Add AddPartitionRecord && DROP_PARTITION (#21344)
Signed-off-by: Jack Drogon <jack.xsuperman@gmail.com>
2023-06-30 16:57:11 +08:00
2c3183f5eb [Feature](Job)Provide unified internal Job scheduling (#21113)
We use the time wheel algorithm to schedule and trigger periodic tasks. The time wheel implementation is modeled on netty's HashedWheelTimer.
We will periodically (10 minutes by default) put the events that need to be triggered in the future cycle into the time wheel for periodic scheduling. In order to ensure the efficient triggering of tasks and avoid task blocking and subsequent task scheduling delays, we use Disruptor to implement the production and consumption model.
When the task expires and needs to be triggered, the task will be put into the RingBuffer of the Disruptor, and then the consumer thread will consume the task.
Consumers need to register for events, and event registration needs to provide event executors. Event executors are a functional interface with only one method for executing events.
If it is a single event, the event definition will be deleted after the scheduling is completed; if it is a periodic event, it will be put back into the time wheel according to the periodic scheduling after the scheduling is completed.
2023-06-30 16:43:20 +08:00
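The time wheel described above can be sketched as follows; this is a simplified single-round wheel in the spirit of netty's HashedWheelTimer (no Disruptor, no round counting for delays longer than one revolution), not Doris's implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class MiniTimeWheel {
    // Each bucket holds the tasks expiring at that slot of the wheel.
    private final List<List<Runnable>> buckets = new ArrayList<>();
    private final int size;
    private long currentTick = 0;

    public MiniTimeWheel(int size) {
        this.size = size;
        for (int i = 0; i < size; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    // Schedule a task to fire after delayTicks advances of the wheel.
    // Simplification: assumes delayTicks < size (a real wheel tracks rounds).
    public void schedule(Runnable task, int delayTicks) {
        int slot = (int) ((currentTick + delayTicks) % size);
        buckets.get(slot).add(task);
    }

    // Advance one tick and run everything that expired in the current slot.
    public void tick() {
        currentTick++;
        List<Runnable> due = buckets.get((int) (currentTick % size));
        for (Runnable t : due) {
            t.run();
        }
        due.clear();
    }
}
```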
8809cca74a [fix](nereids) physical sort node's equals method should compare sort phase (#21301) 2023-06-30 14:04:22 +08:00
8f4b7c8f3d [Fix](multi-catalog) optimize hashcode for PartitionKey. (#21307) 2023-06-30 13:48:08 +08:00
df23ab3f29 [Enhancement](tvf) Add authentication for workload group tvf (#21323) 2023-06-30 12:56:23 +08:00
9f44c2d80d [fix](nereids) nest loop join stats estimation (#21275)
1. fix bug in nest loop join estimation
2. update column=column stats estimation
2023-06-30 10:00:30 +08:00
9756ff1e25 [feature](Nereids): infer distinct from SetOperator (#21235)
Infer distinct from Distinct SetOperator, and put distinct above children to reduce data.

tpcds_sf100 q14:

before
100 rows in set (7.60 sec)

after
100 rows in set (6.80 sec)
2023-06-29 22:04:41 +08:00
c7286c620b [fix](unique key) agg_function is NONE when properties is null (#21337) 2023-06-29 20:47:13 +08:00
6259a91d12 [opt](profile) add whether use Nereids info in Profile (#21342)
Add whether Nereids or the pipeline engine is used to the profile, for example:

Summary:
  -  Profile  ID:  460e710601674438-9df2d685bdfc20f8
  -  Task  Type:  QUERY
  ...
  -  Is  Nereids:  Yes
  -  Is  Pipeline:  Yes
  -  Is  Cached:  No
2023-06-29 20:36:15 +08:00
f3fc606312 [minor](Nereids) change Nereids parse failed log level to debug (#21335) 2023-06-29 19:52:48 +08:00
5bb79be932 [opt](Nereids) forbid gather agg and gather set operation (#21332)
Gather agg and gather set operations are usually not good,
and we cannot compute their cost nicely, so just
forbid them until we can really choose the best plan.
2023-06-29 19:52:15 +08:00
419f51ca2c [feature](nereids)set nereids cbo weights by session var #21293
Useful for tuning the cost model.
2023-06-29 18:54:04 +08:00
59198ed59e [improvement](nereids) Support rf into cte (#21114)
Support pushing runtime filters down into cte internals.
2023-06-29 16:58:31 +08:00
64e9eab0dd [fix](nereids)update Agg stats estimation #21300
Agg stats estimation should use the biggest group-by key's NDV as the base and multiply it by an expansion factor calculated from the other group-by keys' NDVs.
Before, we used the smallest NDV as the base.
2023-06-29 16:37:05 +08:00
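The idea can be illustrated numerically; the damping below (square root) and the row-count cap are assumptions for illustration only, and the actual expansion factor in Doris may differ:

```java
public class AggNdvEstimate {
    // Base = largest group-by key NDV; expand it by a damped factor from the
    // other keys' NDVs (sqrt damping is illustrative, not Doris's formula);
    // cap the estimate at the table row count.
    static double estimateGroupByNdv(double[] ndvs, double rowCount) {
        if (ndvs.length == 0) {
            return 1;
        }
        int maxIdx = 0;
        for (int i = 1; i < ndvs.length; i++) {
            if (ndvs[i] > ndvs[maxIdx]) {
                maxIdx = i;
            }
        }
        double est = ndvs[maxIdx];
        for (int i = 0; i < ndvs.length; i++) {
            if (i != maxIdx) {
                est *= Math.sqrt(ndvs[i]);  // expansion from the other keys
            }
        }
        return Math.min(est, rowCount);
    }
}
```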
Pxl
a518ea5063 [Bug](pipeline) do not call cancelPlanFragmentAsync when instance finished (#21193)
do not call cancelPlanFragmentAsync when instance finished
2023-06-29 15:35:23 +08:00
16c218fde5 [feature](nereids) support bind external relation out of Doris fe environment (#21123)
Support binding an external relation outside the Doris FE environment, for example, to analyze SQL in another java application.
See BindRelationTest.bindExternalRelation.
2023-06-29 14:29:29 +08:00
3a12b67517 [Improvement](statistics, multi catalog)Implement hive table statistic connector (#21053)
This PR adds the function of collecting hive statistics. When the CBO fetches hive table statistics, the statistic cache will
first load from the internal stats olap table. If not found, it uses this PR's function to fetch from the remote Hive metastore.
2023-06-29 10:50:54 +08:00