Commit Graph

9480 Commits

Author SHA1 Message Date
7bdd854fdc [fix](nereids) bucket shuffle and colocate join is not correctly recognized (#17807)
1. close (https://github.com/apache/doris/issues/16458) for nereids
2. varchar and string types should be treated as the same type in the bucket shuffle join scenario.
```
create table shuffle_join_t1 ( a varchar(10) not null )
create table shuffle_join_t2 ( a varchar(5) not null, b string not null, c char(3) not null )
```
The two SQL statements below can use bucket shuffle join:
```
select * from shuffle_join_t1 t1 left join shuffle_join_t2 t2 on t1.a = t2.a;
select * from shuffle_join_t1 t1 left join shuffle_join_t2 t2 on t1.a = t2.b;
```
3. PushdownExpressionsInHashCondition should consider both hash and other conjuncts
4. visitPhysicalProject should handle MarkJoinSlotReference
2023-03-24 19:21:41 +08:00
562f572311 [enhancement](UDF) The user defined functions support global ('show functions'/'show create') operation (#16973) (#17964)
1. add the global keyword.

SHOW [GLOBAL] [FULL] [BUILTIN] FUNCTIONS [IN|FROM db] [LIKE 'function_pattern']

SHOW CREATE GLOBAL FUNCTION function_name(arg_type [, ...]);

2. show the details of the global udf.
2023-03-24 19:07:38 +08:00
354d109130 [feat](Nereids): check Memo Plan for Unit Test. (#18082) 2023-03-24 18:31:33 +08:00
eb7b59c1c6 [docs](plugins) Fix the information in auditlog plugin documentation #18073
The information in the document is incomplete; users may get an error message like:

mysql> INSTALL PLUGIN FROM "http://127.0.0.1:8039/auditloader.zip";
ERROR 1105 (HY000): errCode = 2, detailMessage = http://127.0.0.1:8039/auditloader.zip.md5. you should set md5sum in plugin properties or provide a md5 URI to check plugin file
2023-03-24 18:16:16 +08:00
cd28e9f3b5 [fix](function) fix encrypt/decrypt function bug select list expression not produced by aggregation output #18078
Fix function analysis repeatedly adding the same child.

select list expression not produced by aggregation output (missing from GROUP BY clause?): if(length(`r_2_3`.`name`) % 32 = 0, aes_decrypt(unhex(`r_2_3`.`name`), '***'), `r_2_3`.`name`)
2023-03-24 18:03:18 +08:00
ca0e4844e8 [typo](comment) code comment fix (#17870)
Co-authored-by: wangqingtao6 <wangqingtao6@jd.com>
2023-03-24 17:47:30 +08:00
b244c41371 [Bug](regression-test) Fix grace stop be coredump in pipeline (#18076) 2023-03-24 17:44:06 +08:00
1a3c6b7ed9 [bugfix](testcase) use different table name in map testcases to avoid conflict (#18077) 2023-03-24 17:43:18 +08:00
Pxl
8249441335 [Bug](planner) add conjunct slotref id to table function node to avoid result incorrect (#18063)
add conjunct slotref id to table function node to avoid result incorrect
2023-03-24 14:48:03 +08:00
e8b9587fe6 [Improvement](dict) compute hash only if needed (#18058) 2023-03-24 11:45:58 +08:00
aa3ea4beed [fix](planner) failed to create view when use window function (#17815)
Fix the failure to create a view that uses a window function: the view string contained slot ids, which cannot be parsed.
2023-03-24 10:58:52 +08:00
22fce33fb2 [fix](nereids) fix bitmap function nullable trait and dphyper bugs (#18041)
1. Some bitmap functions, such as bitmap_or, bitmap_and_count and bitmap_or_count, shouldn't follow the constant fold rule for PropagateNullable functions. So the PropagateNullable property is removed, and these functions now use their own constant fold logic correctly.
2. dphyper's PlanReceiver class shouldn't change the hyperGraph's complex project info. PlanReceiver now uses its own copy of the complex project info.
2023-03-24 10:53:45 +08:00
f9f87545d6 [improve](Nereids): check slot from children in validator. (#17951) 2023-03-24 10:52:12 +08:00
a65616a5cd [enhancement](MTMV) Add a timeout for regression tests (#18048)
MTMV regression tests may loop forever due to potential bugs, so we add a timeout to avoid an endless loop. The timeout is currently hard-coded to 30 minutes.
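
The timeout idea can be sketched as a polling loop with a deadline. This is a minimal Python illustration, not the regression-test code itself; the name `wait_until` and its parameters are hypothetical, and the 30-minute value from the message would correspond to `timeout_s=30 * 60`.

```python
import time


def wait_until(condition, timeout_s, poll_interval_s=0.01):
    """Poll `condition` until it returns True or `timeout_s` seconds elapse.

    Returns True if the condition was met, False if the deadline passed.
    (Hypothetical helper; names are assumptions for illustration only.)
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            # Give up instead of looping forever.
            return False
        time.sleep(poll_interval_s)
```

A caller would fail the test when `wait_until(...)` returns False, rather than waiting indefinitely for a job state that never arrives.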
2023-03-24 10:39:42 +08:00
1999cccde9 [feature](array-type) Unique table support array value (#17024)
Unique table support array value

---------

Co-authored-by: huangqixiang.871 <huangqixiang.871@bytedance.com>
2023-03-24 10:18:59 +08:00
1f8ba4948d [Fix](multi-catalog) add handler for hms INSERT EVENT. (#17933)
When we use a Hive client to submit an `INSERT INTO TBL SELECT * FROM ...` or `INSERT INTO TBL VALUES ...`
statement and the table is non-partitioned, HMS generates an insert event. The insert statement may change the
HDFS file distribution of this table, but we did not handle this event, so the file cache of this table could be inaccurate.
2023-03-24 10:17:47 +08:00
2a35adbba8 [vectorized](udaf) fix java-udaf case of P0 is unstable (#18054)
Why the UDAF case is unstable:
when enable_pipeline_engine=true, the aggregate function in this case runs with only 1 instance,
so the default value is not merged; if there is more than 1 instance, the default value is merged.
2023-03-24 09:10:58 +08:00
321bb3e9ee [refactor](Nereids) Refactor and optimize partition pruning (#18003)
The legacy PartitionPruner supports only some simple cases; several useful cases are not supported:
1. It cannot evaluate some builtin functions, such as `cast(part_column as bigint) = 1`.
2. It cannot prune multi-level range partitions. For example, the partition `[('1', 'a'), ('2', 'b'))` implies these constraints:
    - first_part_column between '1' and '2'
    - if first_part_column = '1' then second_part_column >= 'a'
    - if first_part_column = '2' then second_part_column < 'b'

This PR refactors it and adds support for the following:
1. Use a visitor to evaluate functions and fold constants.
2. If the partition column is discrete, like int or date, we can expand the range and evaluate each value, e.g. `[1, 5)` is expanded to `[1, 2, 3, 4]`.
3. Prune multi-level range partitions, as described above.
4. Evaluate functions over a range slot. For example, given the datetime range partition `[('2023-03-21 00:00:00'), ('2023-03-21 23:59:59'))` and the filter `date(col1) = '2023-03-22'`, the partition is pruned because we know the date is always `2023-03-21`. You can implement the visit method in FoldConstantRuleOnFE and OneRangePartitionEvaluator to support such functions.

### How can we do it so finely?
Generally, the columns of a range partition fall into three kinds: `const`, `range`, `other`.
For example, in the partition `[('1', 'a', 'D'), ('1', 'c', 'D'))`:
1. The first partition column is `const`: it always equals '1'.
2. The second partition column is `range`: `slot >= 'a' and slot <= 'c'`. If there were no later slot, it would be `slot >= 'a' and slot < 'c'`.
3. The third partition column is `other`: regardless of whether its upper and lower bounds are the same, it can take multiple values, e.g. `('1', 'a', 'D')`, `('1', 'a', 'F')`, `('1', 'b', 'A')`, `('1', 'c', 'A')`.

A partition has exactly one `range` slot and zero or more `const`/`other` slots.
Normally a partition looks like [const*, range, other*]; these are the possible shapes:
1. [range], e.g. `[('1'), ('10'))`
2. [const, range], e.g. `[('1', 'a'), ('1', 'd'))`
3. [range, other, other], e.g. `[('1', '1', '1'), ('2', '1', '1'))`
4. [const, const, ..., range, other, other, ...], e.g. `[('1', '1', '2', '3', '4'), ('1', '1', '3', '3', '4'))`

The properties of `const`:
1. We can replace the slot with a literal to evaluate the expression tree.

The properties of `range`:
1. If the slot data type is discrete, like int or date, we can expand it to literals and evaluate the expression tree.
2. If the type is not discrete, like datetime, or there are too many discrete values, like [1, 1000000), we keep the slot in the expression tree and assign a range to it. When evaluating the expression tree, we also compute the range and check whether it is the empty set; if so, we can simplify to BooleanLiteral.FALSE and skip this partition.
3. If the range slot satisfies some conditions, we can also fold the slot through some functions; see the datetime example above.

The properties of `other`:
1. Only when the previous slot is a literal equal to the lower bound or upper bound of the partition can we shrink the range of the `other` slot.

Based on these properties, we can prune this finely.


At runtime, the `range` and `other` slots may have their value ranges shrunk,
e.g.
1. The partition `[('a'), ('b'))` with predicate `part_col = 'a'` shrinks the range `['a', 'b')` to `['a']`, as if the `range` slot were downgraded to a `const` slot.
2. The partition `[('a', '1'), ('b', '10'))` with predicate `part_col1 = 'a'` shrinks the range of the `other` slot from unknown (all values) to `['1', +∞)`, as if the `other` slot were downgraded to a `range` slot.

But to keep things simple, I haven't changed the slot type at runtime; I just shrink the ColumnRange.
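
The two evaluation strategies for a `range` slot can be sketched in Python. This is only an illustration under stated assumptions, not the Nereids implementation: the function names and the `max_expand` threshold are hypothetical, and the second function uses plain interval intersection rather than the function folding described for the datetime example.

```python
def can_prune_discrete(lower, upper, predicate, max_expand=1000):
    """Decide whether the integer partition [lower, upper) can be skipped.

    Expands the range to its literal values and evaluates the predicate on
    each one. Returns True only when no value can satisfy the predicate.
    `max_expand` is an assumed threshold: ranges with too many values are
    never expanded, so the partition is conservatively kept.
    """
    if upper - lower > max_expand:
        return False
    return not any(predicate(v) for v in range(lower, upper))


def range_is_empty(part_lower, part_upper, filter_lower, filter_upper):
    """Intersect the half-open partition range [part_lower, part_upper)
    with the closed filter range [filter_lower, filter_upper].

    Returns True when the intersection is empty, i.e. the partition
    can be skipped.
    """
    lo = max(part_lower, filter_lower)
    return lo >= part_upper or lo > filter_upper
```

For the partition `[1, 5)`, a predicate such as `v == 7` lets the partition be skipped, while `v == 1` keeps it. For a non-discrete slot, an empty intersection plays the role of simplifying to BooleanLiteral.FALSE.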
2023-03-24 09:06:52 +08:00
d3e7f12ada [refactor](Nereids) refactor column pruning (#17579)
This PR refactors column pruning to use a visitor. The benefits:
1. It is easy to add column pruning for a new plan: implement the `OutputPrunable` interface if the plan contains an output field, or do nothing if it does not. There is no need to add a new rule like `PruneXxxChildColumns`. Only a few scenarios need to override the visit function with special logic, such as pruning LogicalSetOperation and Aggregate.
2. It supports shrinking the output field in some plans, which skips useless operations and improves performance.

example:
```sql
select id 
from (
  select id, sum(age)
  from student
  group by id
)a
```

We should prune the useless `sum(age)` in the aggregate.
Before the refactor:
```
LogicalProject ( distinct=false, projects=[id#0], excepts=[], canEliminate=true )
+--LogicalSubQueryAlias ( qualifier=[a] )
   +--LogicalAggregate ( groupByExpr=[id#0], outputExpr=[id#0, sum(age#2) AS `sum(age)`#4], hasRepeat=false )
      +--LogicalProject ( distinct=false, projects=[id#0, age#2], excepts=[], canEliminate=true )
         +--LogicalOlapScan ( qualified=default_cluster:test.student, indexName=<index_not_selected>, selectedIndexId=10007, preAgg=ON )
```

After the refactor:
```
LogicalProject ( distinct=false, projects=[id#0], excepts=[], canEliminate=true )
+--LogicalSubQueryAlias ( qualifier=[a] )
   +--LogicalAggregate ( groupByExpr=[id#0], outputExpr=[id#0], hasRepeat=false )
      +--LogicalProject ( distinct=false, projects=[id#0], excepts=[], canEliminate=true )
         +--LogicalOlapScan ( qualified=default_cluster:test.student, indexName=<index_not_selected>, selectedIndexId=10007, preAgg=ON )
```
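
The pruning step for the aggregate in this example can be sketched as follows. This is a simplified Python model, not the Nereids visitor: `prune_aggregate` and its data layout (a dict from output name to the input columns it reads) are assumptions made for illustration.

```python
def prune_aggregate(group_by, outputs, required):
    """Drop aggregate outputs that the parent plan does not require.

    group_by: list of group-by column names.
    outputs:  dict mapping each output name to the set of input columns it reads.
    required: set of output names the parent plan needs.

    Returns (kept_outputs, child_required), where child_required is the set
    of columns the child plan must still produce.
    """
    kept = {name: cols for name, cols in outputs.items() if name in required}
    # Group-by keys are always needed from the child, even if not output.
    child_required = set(group_by)
    for cols in kept.values():
        child_required |= cols
    return kept, child_required
```

With `group_by=["id"]`, outputs `id` and `sum(age)`, and a parent that requires only `id`, the aggregate keeps just `id` and the child project only needs `id`, matching the plans shown above.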
2023-03-24 09:00:48 +08:00
678314d657 [fix](regression)fix glue regression (#17952) 2023-03-24 00:10:20 +08:00
c1bd5b26a8 [refactor](Nereids) expression translate no long rely on legacy planner code (#17671) 2023-03-23 23:05:15 +08:00
47bd3e77e8 [fix](Nereids) cannot select random olap table (#18044) 2023-03-23 22:11:36 +08:00
3bb3c36b9b [bugfix](txn) return when txn state is null when doing abort txn (#18045) 2023-03-23 20:51:21 +08:00
5445a86570 [Bug](array_product) Fix array_product for ARRAY<DECIMAL> (#18014) 2023-03-23 20:29:50 +08:00
b0948ea4cd [Fix](SAP Hana External Table) fix that SAP Hana external table can not insert batch values (#17957)
In the batch insertion scenario, the SAP HANA database does not support the syntax `insert into table values (...),(...);`
What it supports is:
```sql
INSERT INTO table(col1,col2)
SELECT c1v1, c2v1 FROM dummy
UNION ALL
SELECT c1v2, c2v2 FROM dummy;
```
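
A statement of this shape could be generated roughly as follows. This Python sketch is illustrative only and is not the code from the PR: `build_hana_batch_insert` and `render_literal` are hypothetical names, and the literal rendering is deliberately naive (real code should prefer parameter binding over string building).

```python
def render_literal(value):
    """Render a Python value as a SQL literal (naive, for illustration)."""
    if value is None:
        return "NULL"
    if isinstance(value, str):
        # Escape embedded single quotes by doubling them.
        return "'" + value.replace("'", "''") + "'"
    return str(value)


def build_hana_batch_insert(table, columns, rows):
    """Build INSERT ... SELECT ... FROM dummy UNION ALL ... for a batch of rows."""
    if not rows:
        raise ValueError("at least one row is required")
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("row length does not match column count")
    selects = [
        "SELECT " + ", ".join(render_literal(v) for v in row) + " FROM dummy"
        for row in rows
    ]
    return (
        "INSERT INTO " + table + "(" + ",".join(columns) + ")\n"
        + "\nUNION ALL\n".join(selects)
        + ";"
    )
```

Each row becomes one `SELECT ... FROM dummy` branch, so a batch of N rows yields N branches joined by `UNION ALL`.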
2023-03-23 18:49:50 +08:00
bdff9a7a7b [regression-test](merge-on-write) Optimize merge-on-write case (#18038) 2023-03-23 17:59:49 +08:00
4c5ba4bb01 [Improve](point query) optimize sendFields since writeField is heav… (#18000)
Saves about 20% of FE CPU cost for point queries with prepared statements on a table with 100 columns.
2023-03-23 17:45:56 +08:00
8b617afe43 [Improve](point query) improve column match performance when doing computeColumnFilter to prune partition (#17982)
Only use key columns in `computeColumnFilter`; otherwise, for wide tables, the matching process can be very slow.

500 columns table QPS:
6186 -> 13208
2023-03-23 17:45:34 +08:00
Pxl
f43d2ded0a [Chore](case) add order by to testIncorrectMVRewriteInSubquery (#18017)
add order by to testIncorrectMVRewriteInSubquery
2023-03-23 16:39:46 +08:00
20d26397aa [fix](planner) forbid inline view but not the subquery resolve from parent tuples (#18032)
In PR #17813, we wanted to forbid binding slots on a sibling's columns;
however, the fix was not done the correct way.
The correct way is to forbid the subquery from registering itself in the parent's analyzer.

This reverts commit b91a3b5a72520105638dad1079b71a05f02c10a0.
2023-03-23 16:11:04 +08:00
34dc7e57c1 [ehancement](stats) Tune for stats framework (#18035)
1. Estimate timearithmeticexpr instead of setting Double.MAX Double.MIN directly
2. Enable histogram to derive stats
3. Loosen the condition for histogram usage
4. Improve the accuracy for agg on TPC-H 1G greatly
5. Fix avg qerror calculation
2023-03-23 16:03:58 +08:00
e9ff3d185b [Opt](pipeline) disable coloagg when the para instance num >= tablet_num * 2 (#18030) 2023-03-23 15:53:13 +08:00
574365b6d4 [Feature](Nereids) support new mv (#17853)
The metadata storage format of the materialized view has changed, and the new optimizer adapts to the new storage method.

The column storage format of the metadata for the materialized view is changed to start with mv_ or start with mva_
This pr allows the new optimizer to recognize the new materialized view columns and select the correct materialized view.

TODO: support advance mv
2023-03-23 15:25:42 +08:00
6684d65075 [Improvement](TVF)Support file split for TableValueFunction (#17958)
Currently, getSplits for TVF creates one split per file, so scan performance for large files may be poor.
This PR implements the getSplits function in TVFSplitter to split a file into multiple blocks, which
may improve the performance for large files.
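
The idea of turning one file into several block-sized splits can be sketched like this. It is a minimal Python illustration and not the TVFSplitter code: `split_file`, its `(path, offset, length)` tuple layout, and the `block_size` parameter are assumptions.

```python
def split_file(path, file_length, block_size):
    """Cut a file of `file_length` bytes into (path, offset, length) splits.

    Every split is at most `block_size` bytes; the last one may be shorter.
    An empty file yields no splits.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    splits = []
    offset = 0
    while offset < file_length:
        length = min(block_size, file_length - offset)
        splits.append((path, offset, length))
        offset += length
    return splits
```

Several splits per large file let the scan be spread across more scanners instead of one scanner reading the whole file.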
2023-03-23 15:05:44 +08:00
cedd36c786 [improvement](compaction)Support segcompaction for inverted index (#17874)
Since Doris supports segcompaction #12866 during loading, inverted index support is also needed.
2023-03-23 14:41:30 +08:00
11936d85f9 [fix](inverted index) fix erroneous judgement for inverted index not read raw data (#17992)
When applying an inverted index, predicate_params() from ColumnPredicate is used. If a comparison predicate was cloned, the clone did not copy predicate_params(), so applying the inverted index could make the wrong choice.
2023-03-23 14:40:08 +08:00
Pxl
4b626d260a [Build] fix build fail when WITH_MYSQL=OFF (#18021) 2023-03-23 14:01:21 +08:00
2d4f5886ab [Enhancement](Nereids) add single sql fall back to original planner hint (#17994)
Now we can use /*+ SET_VAR(enable_nereids_planner="false") */ to disable Nereids for a single SQL statement.
2023-03-23 13:38:40 +08:00
e415754130 [enhancement](nereids) adjust distribution cost in cost model v1 (#17990)
1. cost model adjustment
The cost of broadcast should be lower than the cost of shuffle when the data size is small.
In broadcast, we do not know the number of receiver BEs, so we use the number of BEs in the system.

2. debug message adjust
a. in explain, print row count after filter
b. if join is not marked join, do not print marked join info
2023-03-23 13:32:36 +08:00
fadf3b906d [enhancement](planner) delete support between predicate (#17892) 2023-03-23 13:24:32 +08:00
0bb04c08aa [improvement](coverage) build be with coverage enabled, which can get coverage data with llvm-cov-15 (#17995) 2023-03-23 12:07:19 +08:00
3870689cbb [Fix](parquet-reader) Fix iceberg_schema_evolution regression test caused by slot col name different with parquet col name. (#17988) 2023-03-23 11:23:08 +08:00
abeec4848a [Fix](Nereids)fix be fold constant incorrectly on from_unixtime. (#18016) 2023-03-23 11:17:08 +08:00
089a91ecd5 [vectorized](function) support array_exists lambda function (#17931)
Co-authored-by: zhangyu209 <zhangyu209@meituan.com>
2023-03-23 11:11:39 +08:00
994a2e967b [chore](git) add git ignore to avoid commit error (#18011) 2023-03-23 10:53:45 +08:00
cfa0a8b136 [Improvement](DECIMALV3) multiply/plus DECIMAL32 and DECIMAL64 safely and not check overflow (#18031) 2023-03-23 10:10:03 +08:00
5a7d99e2f0 [Improvement](statistics) Support for collecting statistics at the granularity of partitions. (#17966)
* Support for collecting statistics at the granularity of partitions

* Add UT and fix some bugs
2023-03-23 09:05:42 +08:00
58b00858ab [Refactor](pipeline) Remove useless fe session variable enable_rpc_opt_for_pipeline (#18019) 2023-03-23 07:27:58 +08:00
d9059ef070 [Docs](multi-catalog) add yarn.resourcemanager.principal for hive catalog with kerberos enabled. (#17930)
Co-authored-by: wangxiangyu@360shuke.com <wangxiangyu@360shuke.com>
2023-03-22 23:34:33 +08:00
7ed15ee8c9 [Fix](multi-catalog) invalidates the file cache when table is non-partitioned. (#17932)
According to `org.apache.doris.planner.external.HiveSplitter`, the file cache of `HiveMetaStoreCache`
may be created even if the table is non-partitioned,
so `RefreshTableStmt` should take this case into account and handle it.
2023-03-22 23:34:18 +08:00