5135 Commits

Author SHA1 Message Date
dc44345ee4 [Fix](Planner) change non boolean return type to boolean (#20599)
Problem: When an expression in a where or having clause returns a non-boolean type, the analyzer checks the return type and throws an error. Some other databases allow this usage.

Solution: Cast the return type to boolean in the where clause and having clause. For example: select *** from *** where case when *** then 1 else 0 end;
2023-07-07 17:12:41 +08:00
0b7b5dc991 [fix](catalog) wrong required slot info causing BE crash (#21598)
The file scan node has a special field, `requiredSlot`, which is set according to the `isMaterialized` info of each slot.
But the `isMaterialized` info can change during the planning process, so `requiredSlot` must be updated
in the `finalize` phase of the scan node; otherwise the BE may crash due to mismatched slot info.
2023-07-07 17:10:50 +08:00
02149ff329 [fix](nereids) Agg on unknown-stats column (#21428) 2023-07-07 17:03:04 +08:00
f908ea5573 [fix](Nereids) union distinct should not prune any column (#21610) 2023-07-07 14:38:28 +08:00
b5f247f73f [Improve](mysql)ensure constant time for computing hash value (#21569) 2023-07-07 14:04:11 +08:00
b70fb4ca8e [fix](test) build internal table for TPCHTest to fix testRank (#21566) 2023-07-07 12:46:07 +08:00
53c10a2389 (chore) Disable ssl connection to FE by default for compatibility reason (#20230)
Older MySQL clients (< 5.7.28) try to connect to the server with TLS 1.1,
which is insecure and is not supported by Doris FE, so the connection
fails.

SSL connection support is disabled on Doris FE by default to keep users' applications
unaffected. To enable SSL support explicitly, put
the following in fe.conf:
```
enable_ssl = true
```
2023-07-07 12:24:55 +08:00
bb985cd9a1 [refactor](udf) refactor java-udf execute method by using for loop (#21388) 2023-07-07 11:43:11 +08:00
64d0e28ed0 [improvement](multi catalog)Use getPartitionsByNames to retrieve hive partitions (#21562)
Previously, hive partitions were fetched with the HMS getPartition api, which requires one api call per partition. Performance is very poor when the number of partitions is large. This PR uses getPartitionsByNames to fetch multiple partitions in one api call.
For 90000 partitions, the time cost is reduced from 108s to 14s.
2023-07-07 10:37:33 +08:00
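The batching idea behind #21562 can be sketched as follows. This is a minimal illustration and not the Doris implementation: `fetch_partitions_batched`, `FakeMetastore`, and the `batch_size` parameter are hypothetical names introduced here, and splitting into fixed-size batches is an assumption (the commit only states that multiple partitions are fetched per call).

```python
from typing import Callable, Dict, List


def fetch_partitions_batched(
    names: List[str],
    fetch_by_names: Callable[[List[str]], Dict[str, dict]],
    batch_size: int = 1000,
) -> Dict[str, dict]:
    """Fetch partition metadata in batches instead of one call per partition.

    `fetch_by_names` stands in for a bulk metastore call; `batch_size` is an
    assumed knob, not something stated in the commit.
    """
    result: Dict[str, dict] = {}
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        result.update(fetch_by_names(batch))
    return result


class FakeMetastore:
    """Hypothetical in-memory stand-in that counts how many calls it receives."""

    def __init__(self) -> None:
        self.calls = 0

    def get_partitions_by_names(self, names: List[str]) -> Dict[str, dict]:
        self.calls += 1
        return {name: {"location": f"/warehouse/t/{name}"} for name in names}
```

With 2500 partition names and a batch size of 1000, the fake metastore above receives 3 calls rather than 2500, which is where the saving in round trips comes from.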
9bcf79178e [Improvement](statistics, multi catalog)Support iceberg table stats collection (#21481)
Fetch iceberg table stats automatically while querying a table.
Collect accurate statistics for Iceberg table by running analyze sql in Doris (remove collect by meta option).
2023-07-07 09:18:37 +08:00
79221a54ca [refactor](Nereids): remove withLogicalProperties & check children size (#21563) 2023-07-06 20:37:17 +08:00
fba3ae96b9 Revert "[Fix](planner) Set inline view output as non constant after analyze (#21212)" (#21581)
This reverts commit 0c3acfdb7c744decb7b60e372007707a55d14e00.
2023-07-06 20:30:27 +08:00
2e651bbc9a [fix](nereids) fix some planner bugs (#21533)
1. Allow casting boolean to date-like types in nereids; the result is null.
2. The PruneOlapScanTablet rule can prune tablets even if an mv index is selected.
3. A constant conjunct should not be pushed through an agg node in the old planner.
2023-07-06 16:13:37 +08:00
0c3acfdb7c [Fix](planner) Set inline view output as non constant after analyze (#21212)
Problem:
The select list should be non-constant when the from list has tables or multiple tuples. Otherwise the upper query gets a wrong isConstant result
and performs wrong constant folding.
For example: when the nullif function is used with a subquery that results in two alternative constants, the planner would treat it as a constant expr, so the analyzer would report an error that the order by clause cannot be constant.

Solution:
Change the inline view output to non-constant, because in (select 1 a from table) as view, a in the output is not constant when we see
view.a outside.
2023-07-06 15:37:43 +08:00
068fe44493 [feature](profile) Add important time of legacy planner to profile (#20602)
Add important time points of the planning process:
// Join reorder end time
queryJoinReorderFinishTime means time after analyze and before join reorder
// Create single node plan end time
queryCreateSingleNodeFinishTime means time after join reorder and before finish create single node plan
// Create distribute plan end time
queryDistributedFinishTime means time after create single node plan and before finish create distributed node plan
2023-07-06 15:36:25 +08:00
bb3b6770b5 [Enhancement](multi-catalog) Make meta cache batch loading concurrently. (#21471)
Improve the performance of querying the meta cache of hms tables in 2 steps:
**Step1** : use concurrent batch loading for meta cache
**Step2** : execute some other tasks concurrently as soon as possible

**This PR mainly covers Step1 and does the following:**
- Create a `CacheBulkLoader` for batch loading
- Remove the executor of the previous async cache loader and change the loader's type to `CacheBulkLoader` (We do not set any refresh strategies for LoadingCache, so the previous executor is not useful)
- Use a `FixedCacheThreadPool` to replace the `CacheThreadPool` (the previous `CacheThreadPool` only logs warnings and does not throw any exception when the pool is full).
- Remove parallel streams and use the `CacheBulkLoader` for batch loading
- Change the value of `max_external_cache_loader_thread_pool_size` to 64, and set the pool size of hms client pool to `max_external_cache_loader_thread_pool_size`
- Fix the spelling mistake for `max_hive_table_catch_num`
2023-07-06 15:18:30 +08:00
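The concurrent batch loading described in #21471 can be illustrated with a small sketch. It is not the `CacheBulkLoader` from the commit: `bulk_load` and `load_one` are hypothetical names, and a Python `ThreadPoolExecutor` merely stands in for a fixed-size thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Hashable, Iterable, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def bulk_load(
    keys: Iterable[K],
    load_one: Callable[[K], V],
    pool: ThreadPoolExecutor,
) -> Dict[K, V]:
    """Load all keys concurrently on a shared fixed-size pool.

    Every key is submitted first, then the results are collected, so slow
    loads overlap instead of running one after another. An exception raised
    by `load_one` is re-raised here by `Future.result()` rather than being
    swallowed.
    """
    futures = {key: pool.submit(load_one, key) for key in keys}
    return {key: future.result() for key, future in futures.items()}
```

A caller would create the pool once (for example `ThreadPoolExecutor(max_workers=64)`) and reuse it across loads, so the pool size bounds the number of concurrent metastore requests.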
8839518bfb [Performance](Nereids): add withGroupExprLogicalPropChildren to reduce new Plan (#21477) 2023-07-06 14:10:31 +08:00
013bfc6a06 [Bug](row store) Fix column aggregate info lost when table is unique model (#21506) 2023-07-06 12:06:22 +08:00
b1be59c799 [enhancement](query) enable strong consistency by syncing max journal id from master (#21205)
Add a session variable and config, enable_strong_consistency_read, to solve the problem that a load result may be briefly invisible to followers, and to meet users' requirements in strong consistency read scenarios.

When enabled, the FE syncs the max journal id from the master and waits until it has been replayed.
2023-07-06 10:25:38 +08:00
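The "sync max journal id and wait for replaying" step in #21205 can be sketched as below. This is an illustration under assumptions: `Follower`, `replay_to`, `wait_until_caught_up`, and the polling and timeout behaviour are all hypothetical and are not taken from the Doris code.

```python
import time


class Follower:
    """Hypothetical follower that tracks the last journal id it has replayed."""

    def __init__(self) -> None:
        self.replayed_journal_id = 0

    def replay_to(self, journal_id: int) -> None:
        # Journal ids only move forward.
        self.replayed_journal_id = max(self.replayed_journal_id, journal_id)


def wait_until_caught_up(
    follower: Follower,
    master_max_journal_id: int,
    timeout_s: float = 1.0,
    poll_s: float = 0.01,
) -> bool:
    """Block until the follower has replayed up to the master's max journal id.

    Returns True once caught up, or False if the (assumed) timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    while follower.replayed_journal_id < master_max_journal_id:
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
    return True
```

A read served only after `wait_until_caught_up` returns True sees every change the master had logged at the time the max journal id was fetched.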
c1e82ce817 [fix](backup) fix show snapshot cauing mysql connection lost (#21520)
If there is no `info file` in the repository, the mysql connection may be lost when a user executes `show snapshot on repo`:
```
2023-07-05 09:22:48,689 WARN (mysql-nio-pool-0|199) [ReadListener.lambda$handleEvent$0():60] Exception happened in one session(org.apache.doris.qe.ConnectContext@730797c1).
java.io.IOException: Error happened when receiving packet.
    at org.apache.doris.qe.ConnectProcessor.processOnce(ConnectProcessor.java:691) ~[doris-fe.jar:1.2-SNAPSHOT]
    at org.apache.doris.mysql.ReadListener.lambda$handleEvent$0(ReadListener.java:52) ~[doris-fe.jar:1.2-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_322]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_322]
    at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_322]
```

This is because some fields are missing in the returned result set.
2023-07-05 22:44:57 +08:00
b6a5afa87d [Feature](multi-catalog) support query hive-view for nereids planner. (#21419)
Related to PR #18815. Support querying hive views in the nereids planner.
2023-07-05 21:58:03 +08:00
b3db904847 [fix](Nereids): when child is Aggregate, don't infer Distinct for it (#21519) 2023-07-05 19:39:41 +08:00
f868aa9d4a [Enhancement](multi-catalog) Add some checks for ShowPartitionsStmt. (#21446)
1. Add some validations for ShowPartitionsStmt with hive tables
2. Make the behavior consistent with hive
2023-07-05 16:28:05 +08:00
0da1bc7acd [Fix](multi-catalog) Fallback to refresh catalog when hms events are missing (#21333)
Fix #20227. The implementation has some problems and cannot catch the event-missing exception.
2023-07-05 16:27:01 +08:00
37a52789bd [improvement](statistics, multi catalog)Estimate hive table row count based on file size. (#21207)
Support estimating table row count based on file size.

With sample size=3000 (total partition number is 87491), load cache time is 45s.
With sample size=100000 (more than total partition number 87505), load cache time is 388s.
2023-07-05 16:07:12 +08:00
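One way to picture a file-size-based estimate like the one in #21207 is sketched below. The commit does not state the formula, so everything here is an assumption: `estimate_row_count` is a hypothetical helper that scales the average size of sampled partitions to the whole table and divides by an assumed average row width.

```python
from typing import Sequence


def estimate_row_count(
    sampled_partition_bytes: Sequence[int],
    total_partitions: int,
    avg_row_bytes: int,
) -> int:
    """Estimate a table's row count from the file sizes of sampled partitions.

    Assumed formula (not taken from the commit):
        rows ~= mean(sampled sizes) * total_partitions / avg_row_bytes
    """
    if not sampled_partition_bytes or avg_row_bytes <= 0:
        return 0
    avg_partition_bytes = sum(sampled_partition_bytes) / len(sampled_partition_bytes)
    estimated_total_bytes = avg_partition_bytes * total_partitions
    return int(estimated_total_bytes // avg_row_bytes)
```

Sampling fewer partitions trades accuracy for a shorter cache load, which matches the direction of the timings quoted in the commit message.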
1121e7d0c3 [feature](Nereids): pushdown distinct through join. (#21437) 2023-07-05 15:55:21 +08:00
4d414c649a [fix](Nereids) set operation physical properties derive is wrong (#21496) 2023-07-05 15:44:40 +08:00
f9bc433917 [fix](nereids) fix runtime filter expr order (#21480)
Currently, when a runtime filter is pushed down into a cte, we construct the runtime filter expr_order with an incrementing number, which is not correct. For runtime filters pushed down into a cte, the join node is always different, so expr_order should be fixed at 0 without incrementing; otherwise the check of expr_order against probe_expr_size becomes illegal, or the query result is wrong.

This PR temporarily reverts 2827bc1; it will break the cte rf pushdown plan pattern.
2023-07-05 14:27:35 +08:00
0084b9fd9a [fix](hudi) scala can't call Properties.putAll in jdk11 (#21494) 2023-07-05 10:53:09 +08:00
de5cfe34bf [fix](feut)should not create a DeriveStatsJob in fe ut (#21498) 2023-07-05 10:38:09 +08:00
15ec191a77 [Fix](CCR) Use tableId as the credential for CCR syncer instead of tableName (#21466) 2023-07-05 10:16:09 +08:00
93795442a4 [Fix](CCR) Binlog config is missed when create replica task (#21397) 2023-07-05 10:15:13 +08:00
f498beed07 [improvement](jdbc)Support for automatically obtaining the precision of the trino/presto timestamp type (#21386) 2023-07-04 18:59:42 +08:00
aec5bac498 [improvement](jdbc)Support for automatically obtaining the precision of the hana timestamp type (#21380) 2023-07-04 18:59:21 +08:00
b27fa70558 [fix](jdbc) fix presto jdbc catalog pushDown and nameFormat (#21447) 2023-07-04 18:58:33 +08:00
9d997b9349 [revert](nereids) Revert data size agg (#21216)
To make stats derivation more precise
2023-07-04 18:02:15 +08:00
1b86e658fd [fix](Nereids): decrease the memo GroupExpression of limits (#21354) 2023-07-04 17:15:41 +08:00
c2b483529c [fix](heartbeat) need to set backend status base on edit log (#21410)
A non-master FE must set a Backend's status based on the content of the edit log.
There is a bug: if the fe config `max_backend_heartbeat_failure_tolerance_count` is set larger than one,
the non-master FE will not mark the Backend as dead until it receives enough heartbeat edit logs,
which is wrong.
As a result, the Backend is dead on the Master FE but alive on the non-master FE.
2023-07-04 17:12:53 +08:00
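The intent of #21410 can be illustrated with a toy model. It is a sketch under assumptions, not Doris code: `BackendState`, `master_handle_heartbeat`, `follower_replay`, and the shape of the log entry are hypothetical. The point is that only the master counts failures against the tolerance, while a non-master FE copies the status recorded in the edit log.

```python
from typing import Dict


class BackendState:
    """Hypothetical per-FE view of one backend."""

    def __init__(self) -> None:
        self.alive = True
        self.failures = 0


def master_handle_heartbeat(state: BackendState, ok: bool, tolerance: int) -> Dict[str, bool]:
    """Master side: count consecutive failures and decide the status.

    Returns the (assumed) edit log entry carrying the decided status.
    """
    if ok:
        state.failures = 0
        state.alive = True
    else:
        state.failures += 1
        if state.failures >= tolerance:
            state.alive = False
    return {"alive": state.alive}


def follower_replay(state: BackendState, entry: Dict[str, bool]) -> None:
    """Non-master side: apply the logged status as-is, with no counting."""
    state.alive = entry["alive"]
```

Because the follower never applies its own tolerance counter, its view of the backend matches the master's as soon as the corresponding entry is replayed.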
9adbca685a [opt](hudi) use spark bundle to read hudi data (#21260)
Use spark-bundle to read hudi data instead of using hive-bundle to read hudi data.

**Advantages** of using spark-bundle to read hudi data:
1. The performance of spark-bundle is more than twice that of hive-bundle
2. spark-bundle uses `UnsafeRow`, which can reduce data copying and GC time of the jvm
3. spark-bundle supports `Time Travel`, `Incremental Read`, and `Schema Change`; these functions can be quickly ported to Doris

**Disadvantages** of using spark-bundle to read hudi data:
1. More dependencies make hudi-dependency.jar very cumbersome (from 138M to 300M)
2. spark-bundle only provides an `RDD` interface and cannot be used directly
2023-07-04 17:04:49 +08:00
90dd8716ed [refactor](multicast) change the way multicast do filter, project and shuffle (#21412)
Co-authored-by: Jerry Hu <mrhhsg@gmail.com>

1. Filtering is done at the sending end rather than the receiving end
2. Projection is done at the sending end rather than the receiving end
3. Each sender can use different shuffle policies to send data
2023-07-04 16:51:07 +08:00
9e8501f191 [Performance](Nereids): speedup analyze by removing sort()/addAll() in OptimizeGroupExpressionJob to (#21452)
Calling sort() and addAll() on all rules costs much time and is useless; remove them to speed up.

explain tpcds q72: 1.72s -> 1.46s
2023-07-04 16:01:54 +08:00
Pxl
65cb91e60e [Chore](agg-state) add sessionvariable enable_agg_state (#21373)
add sessionvariable enable_agg_state
2023-07-04 14:25:21 +08:00
e4c0a0ac24 [improve](dependency)Upgrade dependency version (#21431)
exclude old netty version
upgrade spring-boot version to 2.7.13
use ojdbc8 to replace ojdbc6
upgrade jackson version to 2.15.2
upgrade fabric8 version to 6.7.2
2023-07-04 11:29:21 +08:00
8cbc1d58e1 [fix](MTMV) Disable partition specification temporarily (#20793)
The syntax for supporting partition updates in the future has not been investigated yet and there are issues with partition syntax. Therefore, the partition syntax has been temporarily removed in the current version and will be added after future research.
2023-07-04 11:09:04 +08:00
d5f39a6e54 [Performance](Nereids) refactor code speedup analyze (#21458)
Refactor the code that costs much time.
2023-07-04 10:59:07 +08:00
599ba4529c [fix](nereids) need run ConvertInnerOrCrossJoin rule again after EliminateNotNull (#21346)
After the EliminateNotNull rule runs, the join conjuncts may be removed from an inner join node.
So the ConvertInnerOrCrossJoin rule needs to run again to convert an inner join with no join conjuncts to a cross join node.
2023-07-04 10:52:36 +08:00
11e18f4c98 [Fix](multi-catalog) fix NPE for FileCacheValue. (#21441)
FileCacheValue.files may be null if no files exist for some partitions.
2023-07-03 23:38:58 +08:00
63b170251e [fix](nereids)cast filter and join conjunct's return type to boolean (#21434) 2023-07-03 17:22:46 +08:00
f80df20b6f [Fix](multi-catalog) Fix read error in mixed partition locations. (#21399)
Issue Number: close #20948

Fix the read error with mixed partition locations (for example, some partitions are located on s3 and others on hdfs) by calling `getLocationType` at the file split level instead of the table level.
2023-07-03 15:14:17 +08:00
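The split-level approach in #21399 can be sketched as follows. This is an illustration only: `location_type`, `location_types_per_split`, and the scheme-to-type mapping are assumptions made here and do not mirror the actual `getLocationType` logic.

```python
from typing import Dict, Iterable
from urllib.parse import urlparse

# Assumed mapping from URI scheme to storage type, for illustration only.
SCHEME_TO_TYPE = {
    "s3": "S3",
    "s3a": "S3",
    "s3n": "S3",
    "hdfs": "HDFS",
    "file": "LOCAL",
}


def location_type(path: str) -> str:
    """Derive the storage type from the URI scheme of a single file path."""
    scheme = urlparse(path).scheme.lower()
    return SCHEME_TO_TYPE.get(scheme, "UNKNOWN")


def location_types_per_split(split_paths: Iterable[str]) -> Dict[str, str]:
    """Resolve the type for each split rather than once for the whole table."""
    return {path: location_type(path) for path in split_paths}
```

Resolving the type per split means a table whose partitions live on both s3 and hdfs gets the right reader for each file, instead of one table-wide guess.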
9fa2dac352 [fix](Nereids): DefaultPlanRewriter visit plan children. (#21395) 2023-07-03 13:20:01 +08:00