Commit Graph

11844 Commits

Author SHA1 Message Date
83ce4379ff [regression] add order by in test case for stable output (#21815) 2023-07-14 18:01:43 +08:00
7a6ae12ebb [imporve](bloomfilter) refactor runtime_filter_mgr with bloomfilter and fix bug in change_to_bloom_filter (#21783) 2023-07-14 17:47:32 +08:00
c9a99ce171 [Feature](Nereids) support udf for Nereids (#18257)
Support alias function, Java UDF, Java UDAF for Nereids.
Implementation:
UDFs (alias functions, Java UD(A)Fs) are stored in the database object and looked up by FunctionDesc, which requires the function name and argument types. So we first bind the child expressions to get the return types of the arguments, then select the best-matching function.

Secondly:
For alias function:
The original function of an alias function is stored as a legacy-planner-style function, which is too hard to translate into a Nereids-style expression, so we convert it to the corresponding SQL and parse that. This gives us a Nereids-style function, which we then try to bind.
Binding a function also changes types by adding cast nodes on its children to match the expected input types, so traversing a bound function more than once would produce different cast nodes. To solve this, we add a flag, isAnalyzedFunction. It is false by default and is set to true on return from the visitor function. If the flag is true, the visitor function returns immediately.
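
The effect of the flag can be illustrated with a small Python sketch. This is not the Nereids code: the `Expr` class, the `bind` function, and the rule "wrap every child in a cast" are simplifications assumed only for illustration. The point is that a second traversal of an already bound node returns early and does not stack another cast.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Expr:
    """Hypothetical expression node; is_analyzed stands in for isAnalyzedFunction."""
    name: str
    children: List["Expr"] = field(default_factory=list)
    is_analyzed: bool = False


def bind(expr: Expr) -> Expr:
    # Already bound: return immediately so no extra cast nodes are added.
    if expr.is_analyzed:
        return expr
    # Simplification: always wrap each bound child in a cast node.
    expr.children = [
        Expr("cast", [bind(child)], is_analyzed=True) for child in expr.children
    ]
    expr.is_analyzed = True  # set on return from the visitor
    return expr


tree = Expr("f", [Expr("a")])
bind(tree)
bind(tree)  # second traversal is a no-op
```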

Now we can ensure that the bound functions in the children stay the same even if we traverse them more than once. We can then replace the alias function with its original function and bind the unbound functions.

For JavaUDF and JavaUDAF
JavaUDF and JavaUDAF are recognized as catalog functions and are hard to translate entirely into Nereids-style functions, so we create the Nereids expression objects JavaUdf and JavaUdaf to wrap them.

All in all, Nereids now supports UDFs and nesting them.
2023-07-14 17:02:01 +08:00
d57bb84842 [Enhancement] (binlog) TBinlog and BinlogManager V2 (#21674) 2023-07-14 16:59:32 +08:00
f95d728d3e [shape](nereids) TPCDS check all query shape, except ds64 (#21742)
There is a known bug in analyzing ds64. The ds64 shape check will be added later.
2023-07-14 16:56:46 +08:00
Pxl
4d44cea784 [Bug](materialized-view) check group expr at create mv (#21798)
check group expr at create mv
2023-07-14 15:39:38 +08:00
62214cd1f4 [feature](nereids) adjust min/max of column stats for cast function (#21772)
For cast(A as date), where A is a string column, the min/max of the result column stats should be calculated like this:
convert A.minExpr to a date dateA, then get the double value from dateA.

add "explain memo plan select ..." to print memo from mysql client

dump column stats for FileScanNode, used in datalake.
2023-07-14 12:54:04 +08:00
b013f8006d [enhancement](multi-table) enable mullti table routine load on pipeline engine (#21729) 2023-07-14 12:16:32 +08:00
2c897b82ad [enhance](Nereids) Pushdown Project Through OuterJoin. (#21730)
PushdownJoinOtherCondition pushes expressions in the join condition down into a project, which blocks JoinReorder, so we need to push the project down to help JoinReorder.
2023-07-14 11:46:29 +08:00
b2778d0724 [fix](Nereids) use groupExpr's children to make logicalPlan (#21794)
After mergeGroup, the children of the plan differ from those of the GroupExpr. To avoid optimizing an outdated group, we should construct the new plan with the groupExpr's children rather than the plan's children.
2023-07-14 11:41:38 +08:00
c07e2ada43 [imporve](udaf) refactor java-udaf executor by using for loop (#21713)
refactor java-udaf executor by using for loop
2023-07-14 11:37:19 +08:00
ea73dd5851 [improve](nereids)inner join estimation: assume children output at least one tuple #21792
This assumption limits error propagation when the filter estimate is too low (less than one).
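
The idea can be sketched as a clamp on the children's estimated row counts. This is an illustration in Python, not the Nereids estimator: the function name and the simple `left * right * selectivity` formula are assumptions.

```python
def estimate_inner_join_rows(left_rows: float, right_rows: float,
                             selectivity: float) -> float:
    """Hypothetical inner join estimate with the at-least-one-tuple assumption."""
    # Clamp each child to at least one tuple so that an underestimated
    # filter (row count below 1) does not drag the join estimate toward zero.
    left_rows = max(left_rows, 1.0)
    right_rows = max(right_rows, 1.0)
    return left_rows * right_rows * selectivity
```

With a left child estimated at 0.01 rows and a right child at 1000 rows, the clamp treats the left side as 1 row instead of letting the 0.01 factor propagate into every estimate above the join.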
2023-07-14 11:30:25 +08:00
ebe771d240 [refactor](executor) remove unused variable 2023-07-14 10:35:59 +08:00
ca6e33ec0c [feature](table-value-functions)add catalogs table-value-function (#21790)
mysql> select * from catalogs() order by CatalogId;
2023-07-14 10:25:16 +08:00
352a0c2e17 [Improvement](multi catalog)Cache file system to improve list remote files performance (#21700)
Use file system type and Conf as key to cache remote file system.
This avoids creating a new file system for each external table partition's location.
The time cost for fetching 100000 partitions with 1 file for each partition is reduced to 22s from about 15 minutes.
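
A minimal sketch of such a cache, keyed by file system type and configuration, is shown below. It is written in Python for brevity and is not the Doris implementation; `FileSystemCache`, `factory`, and the `creations` counter are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, FrozenSet, Tuple

CacheKey = Tuple[str, FrozenSet[Tuple[str, str]]]


class FileSystemCache:
    """Hypothetical cache: one file system object per (type, conf) pair."""

    def __init__(self, factory: Callable[[str, Dict[str, str]], object]) -> None:
        self._factory = factory
        self._cache: Dict[CacheKey, object] = {}
        self.creations = 0  # how many file systems were actually created

    def get(self, fs_type: str, conf: Dict[str, str]) -> object:
        # A dict is not hashable, so freeze the conf entries for the key.
        key = (fs_type, frozenset(conf.items()))
        fs = self._cache.get(key)
        if fs is None:
            fs = self._factory(fs_type, conf)
            self._cache[key] = fs
            self.creations += 1
        return fs
```

Partitions that share a file system type and configuration then reuse one object instead of each triggering a new connection.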
2023-07-14 09:59:46 +08:00
cbddff0694 [FIX](map) fix map key-column nullable for arrow serde #21762
Arrow does not support null elements in a map key column, but the Doris map key column is nullable by default. So when a Doris map row has a null element in its key column, we write null to Arrow.
2023-07-14 00:30:07 +08:00
254f76f61d [Agg](exec) support aggregation_node limit short circuit (#21767) 2023-07-14 00:29:19 +08:00
6fd8f5cd2f [Fix](parquet-reader) Fix parquet string column min max statistics issue which caused query result incorrectly. (#21675)
In parquet, min and max statistics may not handle UTF8 correctly.
The current approach uses the min_value and max_value statistics introduced by PARQUET-1025 if they are set.
If not, the statistics are ignored for now. A better way is to read the min and max statistics when they contain only ASCII characters. I will improve this in a future PR.
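
The selection rule described here can be sketched as follows. This is a Python illustration, not the Doris parquet reader; the function name and parameter names are assumptions, with `legacy_min` and `legacy_max` standing for the older `min` and `max` fields.

```python
from typing import Optional, Tuple


def usable_string_stats(
    min_value: Optional[bytes],
    max_value: Optional[bytes],
    legacy_min: Optional[bytes],
    legacy_max: Optional[bytes],
) -> Optional[Tuple[bytes, bytes]]:
    """Return (min, max) usable for filtering a string column, or None."""
    # Prefer the newer min_value/max_value fields when both are present.
    if min_value is not None and max_value is not None:
        return (min_value, max_value)
    # The legacy fields may order UTF8 strings incorrectly, so they are
    # deliberately ignored and no statistics-based filtering is done.
    return None
```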
2023-07-14 00:09:41 +08:00
4158253799 [feature](hudi) support hudi time travel in external table (#21739)
Support hudi time travel in external table:
```
select * from hudi_table for time as of '20230712221248';
```
PR(https://github.com/apache/doris/pull/15418) supports using a timestamp or version as the snapshot ID in iceberg, but hudi only has timestamps as snapshot IDs. Therefore, querying a hudi table with `for version as of` throws an error like:
```
ERROR 1105 (HY000): errCode = 2, detailMessage = Hudi table only supports timestamp as snapshot ID
```
The supported timestamp formats in hudi are 'yyyy-MM-dd HH:mm:ss[.SSS]', 'yyyy-MM-dd', and 'yyyyMMddHHmmss[SSS]', which is consistent with hudi's [time travel query](https://hudi.apache.org/docs/quick-start-guide#time-travel-query).
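
As a rough illustration of the three accepted formats, the following Python sketch parses each of them into a `datetime`. It is not the Doris or Hudi parser; `parse_hudi_instant` and its regular expressions are assumptions derived from the format strings above.

```python
import re
from datetime import datetime

# 'yyyy-MM-dd' with an optional ' HH:mm:ss[.SSS]' suffix
_DASHED = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})(?: (\d{2}):(\d{2}):(\d{2})(?:\.(\d{3}))?)?"
)
# 'yyyyMMddHHmmss[SSS]'
_COMPACT = re.compile(r"(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(\d{3})?")


def parse_hudi_instant(text: str) -> datetime:
    """Hypothetical helper: parse a time travel timestamp string."""
    match = _DASHED.fullmatch(text) or _COMPACT.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported timestamp format: {text!r}")
    # Missing optional parts (time of day, milliseconds) default to zero.
    year, month, day, hour, minute, second, millis = (
        int(group) if group else 0 for group in match.groups()
    )
    return datetime(year, month, day, hour, minute, second, millis * 1000)
```

For example, `parse_hudi_instant('20230712221248')` and `parse_hudi_instant('2023-07-12 22:12:48')` yield the same instant.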

## Partitioning Strategies
Before this PR, hudi's partitions needed to be synchronized to hive through [hive-sync-tool](https://hudi.apache.org/docs/syncing_metastore/#hive-sync-tool), or by setting very complex synchronization parameters in [spark conf](https://hudi.apache.org/docs/syncing_metastore/#sync-template). These processes are exceptionally complex, and unnecessary unless you want to query hudi data through hive.

In addition, partitions can change under time travel, so partition synchronization cannot guarantee the correctness of time travel.

So this PR obtains partitions directly by reading hudi meta information, caches and updates table partition information by hudi instant timestamp, and reuses Doris' partition pruning.
2023-07-13 22:30:07 +08:00
23272abf48 [chore](docs)Removed documentation related to dynamic tables (#21803)
since the feature was reworked
2023-07-13 22:20:20 +08:00
37e247536a [tpcds](nereids) add tpchds 1T shape check #21753
add regression case to simulate tpcds 1T.
shape check will be added later after they are stable.
2023-07-13 21:44:10 +08:00
fd6553b218 [Fix](MoW) Fix bug about caculating all committed rowsets delete bitmaps when do comapction (#21760) 2023-07-13 21:10:15 +08:00
2c83e5a538 [fix](merge-on-write) fix be core and delete unused pending publish info for async publish when tablet dropped (#21793) 2023-07-13 21:09:51 +08:00
35fa9496e7 [fix](merge-on-write) fix wrong result when query with prefix key predicate (#21770) 2023-07-13 19:56:00 +08:00
c5dbd53e6f [fix](multi-catalog)support oss-hdfs service (#21504)
1. support oss-hdfs if it is enabled when use dlf or hms catalog
2. add docs for aliyun dlf and mc.
2023-07-13 18:02:15 +08:00
c78349a4c6 [Docs](statistics)Add external table statistic docs (#21567) 2023-07-13 17:54:34 +08:00
22b59038d5 [pipeline](ckb) Update auto_trigger_teamcity.yml (#21769) 2023-07-13 17:44:25 +08:00
abc21f5d77 [bugfix](ngram bf index) process differently for normal bloom filter index and ngram bf index (#21310)
* process differently for normal bloom filter index and ngram bf index

* fix review comments for readability

* add test case

* add testcase for delete condition
2023-07-13 17:31:45 +08:00
d4bdd6768c [Feature](Nereids) support select into outfile (#21197) 2023-07-13 17:01:47 +08:00
b72e0d9172 [github](labeler) remove scope labeler (#21789)
Scope labeler is useless now, I think we can remove it.
2023-07-13 16:13:58 +08:00
8a42ba5742 [typo](docs) modify bitmap function document (#21721) 2023-07-13 14:02:10 +08:00
06d129c364 [docs](stats) Update statistics related content #21766
1. Update grammar of `ANALYZE`
2. Add command description about how to delete an analyze job
2023-07-13 13:51:26 +08:00
e167394dc1 [Fix](pipeline) close sink when fragment context destructs (#21668)
Co-authored-by: airborne12 <airborne12@gmail.com>
2023-07-13 11:52:24 +08:00
14253b6a30 [fix](ccr) Add tableName in DropInfo && BatchDropInfo (#21736)
Signed-off-by: Jack Drogon <jack.xsuperman@gmail.com>
2023-07-13 11:47:49 +08:00
9cad929e96 [Fix](rowset) When a rowset is cooled down, it is directly deleted. This can result in data query misses in the second phase of a two-phase query. (#21741)
* [Fix](rowset) When a rowset is cooled down, it is directly deleted. This can result in data query misses in the second phase of a two-phase query.

related pr #20732

There are two reasons for moving the delayed deletion logic from the Tablet to the StorageEngine. The first is to consolidate the logic and unify the delayed operations. The second is that delayed garbage collection during queries can leave rowsets in the "stale rowsets" state, which prevents timely deletion of rowset metadata and may cause the rowset metadata to grow too large.

* not use unused rowsets
2023-07-13 11:46:12 +08:00
f863c653e2 [Fix](Planner) fix limit execute before sort in show export job (#21663)
Problem:
Before this change, when running show export, the limit was applied before the sort. The result was not as expected because the limit always cut the results first, so we could not get the rows we wanted.

Example:
We have export job1 and job2 with JobId1 > JobId2, and we want to get the job with JobId1:
show export from db order by JobId desc limit 1;
The limit 1 was applied first, so we would probably get job2, because JobIds are assigned in ascending order.

Solve:
We must not cut the results first when there is an order by clause. The result set is now cut after sorting.
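
The difference between the two orders of operations can be shown with a small Python sketch. It only models the behavior described above and is not the Doris code; `show_export_jobs` and the sample job IDs are made up for illustration.

```python
from typing import List


def show_export_jobs(job_ids: List[int], limit: int, sort_first: bool) -> List[int]:
    """Model 'order by JobId desc limit N' with either order of operations."""
    if sort_first:
        # Fixed behavior: sort, then cut the result set.
        return sorted(job_ids, reverse=True)[:limit]
    # Old behavior: cut first, then sort what is left.
    return sorted(job_ids[:limit], reverse=True)


jobs = [100, 200]  # stored in ascending JobId order; 200 plays the role of JobId1
old_result = show_export_jobs(jobs, limit=1, sort_first=False)
new_result = show_export_jobs(jobs, limit=1, sort_first=True)
```

Here `old_result` contains the smaller JobId, while `new_result` contains the larger one, which is what `order by JobId desc limit 1` is expected to return.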
2023-07-13 11:17:28 +08:00
cf016f210d Revert "[imporve](bloomfilter) refactor runtime_filter_mgr with bloomfilter (#21715)" (#21763)
This reverts commit 925da90480f60afc0e5333a536d41e004234874e.
2023-07-13 10:44:20 +08:00
2d2beb637a [enhancement](RoutineLoad)Mutile table support pipeline load (#21678) 2023-07-13 10:26:46 +08:00
e18465eac7 [feature](TVF) support path partition keys for external file TVF (#21648) 2023-07-13 10:15:55 +08:00
105a162f94 [Enhancement](multi-catalog) Merge hms events every round to speed up events processing. (#21589)
Currently we find that MetastoreEventsProcessor cannot keep up with the event production rate in our cluster, so we need to merge some hms events every round.
2023-07-12 23:41:07 +08:00
2e3d15b552 [Feature](doris compose) A tool for setup and manage doris docker cluster scaling easily (#21649) 2023-07-12 22:13:38 +08:00
00c48f7d46 [opt](regression case) add more index change case (#21734) 2023-07-12 21:52:48 +08:00
7f133b7514 [fix](partial-update) transient rowset writer should not trigger segcompaction when build rowset (#21751) 2023-07-12 21:47:07 +08:00
be55cb8dfc [Improve](jsonb_extract) support jsonb_extract multi parse path (#21555)
support jsonb_extract multi parse path
2023-07-12 21:37:36 +08:00
da67d08bca [fix](compile) fix be compile error (#21765)
* [fix](compile) fix be compile error

* remove warning
2023-07-12 21:14:04 +08:00
3163841a3a [FIX](serde)Fix decimal for arrow serde (#21716) 2023-07-12 19:15:48 +08:00
f0d08da97c [enhancement](merge-on-write) split delete bitmap from tablet meta (#21456) 2023-07-12 19:13:36 +08:00
9d96e18614 [fix](multi-table-load) fix memory leak when processing multi-table routine load (#21611)
* use naked ptr to prevent loop ref

* add comments
2023-07-12 17:32:56 +08:00
0243c403f1 [refactor](nereids)set session var for bushy join (#21744)
add session var: MAX_JOIN_NUMBER_BUSHY_TREE, default is 5
If the number of tables in a join cluster is less than MAX_JOIN_NUMBER_BUSHY_TREE, nereids tries a bushy tree; otherwise it uses a zigzag tree.
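
The rule reads as a simple threshold check. The Python sketch below restates it; the function name and the returned labels are assumptions, and only the variable name and its default of 5 come from the commit message.

```python
MAX_JOIN_NUMBER_BUSHY_TREE = 5  # default stated in the commit message


def choose_join_tree_shape(table_count: int,
                           max_bushy: int = MAX_JOIN_NUMBER_BUSHY_TREE) -> str:
    """Hypothetical helper: pick the join tree shape for one join cluster."""
    # Strictly fewer tables than the threshold: try a bushy tree.
    return "bushy" if table_count < max_bushy else "zigzag"
```

Under the default, a join cluster of 4 tables is tried as a bushy tree, while a cluster of 5 or more falls back to a zigzag tree.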
2023-07-12 16:40:48 +08:00
3b76428de9 [fix](stats) when some stat is NULL, causing an exception during display stats (#21588)
During manual statistics injection, some statistics may be NULL, causing an exception during display.
2023-07-12 14:57:06 +08:00