Commit Graph

2931 Commits

Author SHA1 Message Date
68eda58a8c [Fix](multi-catalog) Fix string dict filtering when using null-related functions in parquet and orc reader. (#35335)
When the dictionary-encoded column is used with null-related functions, as in the following SQL, the results will be incorrect.
```
select * from ( select IF(o_orderpriority IS NULL, 'null', o_orderpriority) AS o_orderpriority from test_string_dict_filter_orc ) as A where o_orderpriority = 'null';
```
```
select * from ( select IFNULL(o_orderpriority, 'null') AS o_orderpriority from test_string_dict_filter_parquet ) as A where o_orderpriority = 'null'
```
```
select * from ( select COALESCE(o_orderpriority, 'null') AS o_orderpriority from test_string_dict_filter_parquet ) as A where o_orderpriority = 'null';
```
2024-05-27 15:25:29 +08:00
f98ed4e4c5 [bugfix](hive)Misspelling of class names (#34981) 2024-05-27 15:24:38 +08:00
b1795d44ec [bugfix](hive)fix testcase for test_hive_write_different_path (#35209)
Hive's test environment uses docker, so when 127.0.0.1 is used,
BE writes the file into the docker container on its own machine.
But if FE and BE are not on the same machine,
FE cannot read this file, because it can only read the docker container on its own machine.
Therefore, the address 127.0.0.1 cannot be used in the test environment.
2024-05-27 15:24:30 +08:00
2422439e45 [Update](regression) add case for inverted index (#35305)
Co-authored-by: Kang <kxiao.tiger@gmail.com>
2024-05-27 15:24:09 +08:00
af986c370b [feat](Nereids): Put the Child with Least Row Count in the First Position of Intersect (#34290) (#35339)
In this pull request, we optimize the ordering of children in the Intersect operator to improve query performance. The proposed change is to place the child with the least row count in the first position of the Intersect operator.

The rationale behind this optimization is that the Intersect operator works by first evaluating the leftmost child and then iterating through the results of the other children to find matching rows. By placing the child with the least row count first, we can minimize the number of iterations required to find the matching rows, thereby reducing the overall execution time of the query.
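As an illustrative sketch (the table names small_dim and big_fact are hypothetical), a query like the one below benefits when the input with fewer rows is placed first:

```
-- small_dim is assumed to hold far fewer rows than big_fact;
-- with this change the optimizer places small_dim as the leftmost
-- child of the Intersect operator, so fewer probe iterations are needed
SELECT customer_id FROM small_dim
INTERSECT
SELECT customer_id FROM big_fact;
```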
2024-05-27 11:52:35 +08:00
62998719df [opt](mtmv) Add threshold for relation mapping num when query rewrite (#34694) (#35378)
if the query and the mv definition are as follows:

    def mv1_1 = """
        select  t1.L_LINENUMBER,t2.l_extendedprice, t2.L_ORDERKEY
        from lineitem t1
        inner join lineitem t2 on t1.L_ORDERKEY = t2.L_ORDERKEY;
    """
    def query1_1 = """
        select  t1.L_LINENUMBER, t2.L_ORDERKEY
        from lineitem t1
        inner join lineitem t2 on t1.L_ORDERKEY = t2.L_ORDERKEY;
    """

this generates relation mappings as a Cartesian product; if there are too many self joins, this causes a performance problem.
So we add the `materialized_view_relation_mapping_max_count` session variable, default 8. If the actual number is greater than this value, the excess relation mappings are discarded.
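A minimal usage sketch of the new session variable (the value shown is illustrative; the default is 8 as stated above):

```
-- raise the cap on relation mappings considered during mv query rewrite
SET materialized_view_relation_mapping_max_count = 16;
```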
2024-05-24 20:36:29 +08:00
639c7ee7fb [fix](decimalv2) fix scale of decimalv2 to string (#35222) (#35359)
* [fix](decimalv2) fix scale of decimalv2 to string
2024-05-24 17:20:43 +08:00
1e07971a98 [Feat](nereids)when dealing insert into stmt with empty table source, fe returns directly (#35333)
* [Feat](nereids) when dealing insert into stmt with empty table source, fe returns directly (#34418)

When a LogicalOlapScan has no partitions, transform it to a LogicalEmptyRelation.
When handling an insert-into statement with an empty table source, FE returns directly.

* [Fix](nereids) fix when insert into select empty table

---------

Co-authored-by: feiniaofeiafei <moailing@selectdb.com>
2024-05-24 16:25:00 +08:00
f6beeb1ddd [Enhancement](tvf) select tvf supports using resource (#35139)
Create an S3/HDFS resource, and a TVF can then use it directly to access the data source.
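A hypothetical sketch of the intended usage; the resource property names and the TVF `resource` parameter are assumptions based on the commit title, not confirmed syntax:

```
-- create a reusable S3 resource (property names are illustrative)
CREATE RESOURCE "my_s3_res" PROPERTIES (
    "type" = "s3",
    "s3.endpoint" = "http://s3.example.com",
    "s3.access_key" = "ak",
    "s3.secret_key" = "sk"
);

-- reference the resource from the S3 table-valued function instead of
-- repeating the connection properties inline
SELECT * FROM s3(
    "uri" = "s3://my-bucket/path/data.parquet",
    "format" = "parquet",
    "resource" = "my_s3_res"
);
```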
2024-05-24 16:23:58 +08:00
d6e8fb7d77 [feature](mtmv) Support agg state roll up and optimize the roll up code (#35026)
agg_state is an aggregation intermediate state; for details see the
state combinator: https://doris.apache.org/zh-CN/docs/dev/sql-manual/sql-functions/combinators/state

this supports agg function roll up as follows:

| query                | materialized view                            | roll up              |
| -------------------- | -------------------------------------------- | -------------------- |
| agg_function()       | agg_function_union() or agg_function_state() | agg_function_merge() |
| agg_function_union() | agg_function_union() or agg_function_state() | agg_function_union() |
| agg_function_merge() | agg_function_union() or agg_function_state() | agg_function_merge() |

For example, the following can be rewritten by the mv successfully.

The MV definition is

```
            select
            o_orderstatus,
            l_partkey,
            l_suppkey,
            sum_union(sum_state(o_shippriority)),
            group_concat_union(group_concat_state(l_shipinstruct)),
            avg_union(avg_state(l_linenumber)),
            max_by_union(max_by_state(l_shipmode, l_suppkey)),
            count_union(count_state(l_orderkey)),
            multi_distinct_count_union(multi_distinct_count_state(l_shipmode))
            from lineitem
            left join orders
            on lineitem.l_orderkey = o_orderkey and l_shipdate = o_orderdate
            group by
            o_orderstatus,
            l_partkey,
            l_suppkey;
```

Query is

```
            select
            o_orderstatus,
            l_suppkey,
            sum(o_shippriority),
            group_concat(l_shipinstruct),
            avg(l_linenumber),
            max_by(l_shipmode,l_suppkey),
            count(l_orderkey),
            multi_distinct_count(l_shipmode)
            from lineitem
            left join orders 
            on l_orderkey = o_orderkey and l_shipdate = o_orderdate
            group by
            o_orderstatus,
            l_suppkey;
```
2024-05-24 16:23:58 +08:00
dd567fa774 [fix](function) support return JsonType for If function (#35199)
add a FunctionSignature for If to support JsonType as the return type.
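A hypothetical sketch of a query that relies on the new signature (table and column names are illustrative):

```
-- IF can now return a JSON value directly instead of falling back to string
SELECT IF(k = 1, json_parse('{"a": 1}'), json_parse('{"b": 2}')) FROM t;
```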
2024-05-24 16:23:58 +08:00
98b2bda660 [opt](Nereids) remove restrict for count(*) in window (#35220)
support count(*) used as a window function

```
CREATE TABLE `t1` (
  `id` INT NULL,
  `dt` TEXT NULL
)
DISTRIBUTED BY HASH(`id`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);

select *, count(*) over() from t1;
```
2024-05-24 16:23:58 +08:00
b3f6668464 fix case: test_create_table_without_distribution 2024-05-23 19:03:30 +08:00
4075408b84 [feature](mtmv)Support single table mv rewrite (#34185) (#35242)
Support single table query rewrite without group by.
This is useful for complex filters or expressions.

The mv def and query are as follows,
which can be query rewritten.

mv def:
```
          select *
            from lineitem where l_comment like '%xx%'
```

query:
```
            select l_linenumber, l_receiptdate
            from lineitem where l_comment like '%xx%'
```

Co-authored-by: zfr9527 <qhu15zhang3294197@163.com>
2024-05-23 19:00:36 +08:00
adc364a6fd [feature](Paimon) support deletion vector for Paimon naive reader (#34743) (#35241)
bp #34743
Co-authored-by: 苏小刚 <suxiaogang223@icloud.com>
2024-05-23 00:01:30 +08:00
24990383ff [refactor](jdbc catalog) split clickhouse jdbc executor (#34794) (#35174)
pick master #34794
2024-05-22 19:09:05 +08:00
291cf57c54 [Configurations](multi-catalog) Add enable_parquet_filter_by_min_max and enable_orc_filter_by_min_max Session variables. (#35012) (#35164)
backport #35012
2024-05-22 19:06:12 +08:00
15f70c8183 [Feat](planner) create table stmt offers default distribution attributes: random distribution and auto bucket (#35189)
Co-authored-by: feiniaofeiafei <moailing@selectdb.com>
2024-05-22 15:18:29 +08:00
c23384ff07 [fix](decimal) Fix long string casting to decimalv2 (#35121) 2024-05-22 14:32:29 +08:00
84f7bfffe2 [Bug](bitmap-filter) fix empty bitmap when rf do merge (#34182)
Fix an empty bitmap when the runtime filter performs a merge.
2024-05-22 14:29:50 +08:00
f0b2f5ba36 [Fix](bug) agg limit contains null values may cause error result (#35180) 2024-05-22 10:57:57 +08:00
af7b16f213 [optimize](desc) display the correct data type of aggStateType (#34968)
If a table column is of AGG_STATE type, we can't get its clearly defined data type using the `desc tbl` statement.

```
create table a_table(
    k1 int null,
    k2 agg_state<max_by(int not null,int)> generic,
    k3 agg_state<group_concat(string)> generic
)
aggregate key (k1)
distributed BY hash(k1) buckets 3
properties("replication_num" = "1");
```

before optimize:

mysql> desc a_table;
+-------+------------------------------------------------+------+-------+---------+---------+
| Field | Type                                           | Null | Key   | Default | Extra   |
+-------+------------------------------------------------+------+-------+---------+---------+
| k1    | INT                                            | Yes  | true  | NULL    |         |
| k2    | org.apache.doris.catalog.AggStateType@239f771c | No   | false | NULL    | GENERIC |
| k3    | org.apache.doris.catalog.AggStateType@2e535f50 | No   | false | NULL    | GENERIC |
+-------+------------------------------------------------+------+-------+---------+---------+
3 rows in set (0.00 sec)


after optimize:

mysql> desc a_table;
+-------+------------------------------------+------+-------+---------+---------+
| Field | Type                               | Null | Key   | Default | Extra   |
+-------+------------------------------------+------+-------+---------+---------+
| k1    | INT                                | Yes  | true  | NULL    |         |
| k2    | AGG_STATE<max_by(INT, INT NULL)>   | No   | false | NULL    | GENERIC |
| k3    | AGG_STATE<group_concat(TEXT NULL)> | No   | false | NULL    | GENERIC |
+-------+------------------------------------+------+-------+---------+---------+


Co-authored-by: duanxujian <duanxujian@jd.com>
2024-05-22 10:03:31 +08:00
b96148c9cd [Fix](function) fix days/weeks_diff result wrong on BE #35104
select days_diff('2024-01-01 00:00:00', '2023-12-31 23:59:59');
should be 0 but got 1 on BE.
2024-05-22 10:00:26 +08:00
fb28d0b185 [BUG] fix scan range boundary handling is incorrect (#34832)
Fix incorrect handling of scan range boundaries.
Co-authored-by: shizhiqiang03 <shizhiqiang03@meituan.com>
2024-05-21 13:00:50 +08:00
c0fd98abe5 [Fix](tvf) Fix that tvf reading empty files in compressed formats. (#34926)
1. Fix the issue with tvf reading empty compressed files.
2. move two test cases (`test_local_tvf_compression` and `test_s3_tvf_compression`) from p2 to p0
2024-05-21 12:59:31 +08:00
5872173901 [improve](function) add limit check for lpad/rpad function input big value of length (#34810) 2024-05-21 12:54:25 +08:00
aba00d7146 [Fix](executor)Fix workload reg test #35082 2024-05-20 20:36:29 +08:00
42425808a1 [Cherry-Pick](branch-2.1) Pick "Fix multiple replica partial update auto inc data inconsistency problem #34788" (#35056)
* [Fix](auto inc) Fix multiple replica partial update auto inc data inconsistency problem (#34788)

* **Problem:** For tables with auto-increment columns, updating partial columns can cause data inconsistency among replicas.

**Cause:** Previously, the implementation for updating partial columns in tables with auto-increment columns was done independently on each BE (Backend), leading to potential inconsistencies in the auto-increment column values generated by each BE.

**Solution:** Before distributing blocks, determine if the update involves partial columns of a table with an auto-increment column. If so, add the auto-increment column to the last column of the block. After distributing to each BE, each BE will check if the data key for the partial column update exists. If it exists, the previous auto-increment column value is used; if not, the auto-increment column value from the last column of the block is used. This ensures that the auto-increment column values are consistent across different BEs.

* 2

* [Fix](regression-test) Fix auto inc partial update unstable regression test (#34940)
2024-05-20 15:43:46 +08:00
be50139eb1 [Fix](Nereids) fix leading with cte and same subqueryalias name (#34838) (#35047)
Fix leading hint with a cte and the same subquery alias name.
Example:
with tbl1 as (select t1.c1 from t1)
select tbl2.c2 from (select /*+ leading(t2 tbl1) */ tbl1.c1, t2.c2 from tbl1 join t2) as tbl2 join t3;
Reason:
in this case, before analysis the preprocess step changes the subquery tbl2 into a cte plan; this cte plan should belong to the upper-level cte plan, not to the logical result sink plan.
2024-05-20 10:44:22 +08:00
5ac4ea2cd9 [Fix](Nereids) fix leading hint with update of alias name (#34434) (#35046)
Problem:
when using a leading hint like leading(tbl1 tbl2) in
"select * from (select tbl1.c1 from t1 as tbl1 join t2 as tbl2) join t3 as tbl2 on tbl2.c3 != 101;",
tbl2.c3 refers to t3.c3, not t2.c3.
Cause and solution:
when finding columns in the condition, the leading hint resolves tbl2.c3's RelationId; when we collect RelationId and aliasName,
we should update the mapping if the aliasName is repeated.
2024-05-20 10:40:10 +08:00
7c29a964e5 [Fix](Nereids) fix leading with multi level of brace pairs (#34169) (#35043)
Fix leading hint with multiple levels of brace pairs.
Example:
leading(t1 {{t2 t3} {t4 t5}} t6) can be reduced to leading(t1 {t2 t3 {t4 t5}} t6)
Also update cases, which remove the project node from the explain shape plan.
2024-05-20 10:28:22 +08:00
a6a398d7a4 [Fix](function) remove datev2 signature of microsecond #35017 2024-05-19 19:58:02 +08:00
22f85be712 [fix](hive-ctas) support create hive table with fully qualified name (#34984)
Before, when executing `create table hive.db.table as select` to create a table in a hive catalog,
if the current catalog was not the hive catalog, the default engine name was filled with `olap`, which is wrong.

This PR fills the default engine name based on the specified catalog.
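A minimal sketch of the fully qualified CTAS described above (catalog, database, and table names are hypothetical):

```
-- executed while the current catalog is not the hive catalog;
-- the engine is now derived from the `hive` catalog prefix rather than defaulting to olap
CREATE TABLE hive.tpch.orders_copy AS SELECT * FROM internal.db1.orders;
```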
2024-05-18 18:42:43 +08:00
a59f9c3fa1 [fix](planner) fix unrequired slot bug when join node introduced by #25204 (#34923)
Before the fix, the join node retained some slots that are neither materialized nor required.
The join node needs to remove these slots and not make them output slots.

Signed-off-by: nextdreamblue <zxw520blue1@163.com>
2024-05-18 18:40:56 +08:00
e3e5f18f26 [Fix](Json type) correct cast result for json type (#34764) 2024-05-18 18:40:17 +08:00
81bcb9d490 [opt](planner)(Nereids) support auto aggregation for random distributed table (#33630)
support auto aggregation for querying detail data of a random distributed table:
rows with the same key columns are returned as a single aggregated row.
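A minimal sketch, assuming a random distributed aggregate-key table (names are illustrative):

```
CREATE TABLE rand_agg_t (
    k1 INT,
    v1 BIGINT SUM
)
AGGREGATE KEY (k1)
DISTRIBUTED BY RANDOM BUCKETS 10
PROPERTIES ("replication_num" = "1");

-- a plain detail query now returns one automatically aggregated row per k1
SELECT k1, v1 FROM rand_agg_t;
```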
2024-05-18 18:40:16 +08:00
9b5028785d [fix](prepare) fix datetimev2 return err when binary_row_format (#34662)
Fix datetimev2 returning an error when binary_row_format is used. Before this PR, the Backend always returned datetimev2 via to_string.
Also fix datetimev2 return metadata losing the scale.
2024-05-18 18:37:41 +08:00
eb7eaee386 [fix](function) money format (#34680) 2024-05-18 18:35:29 +08:00
30a036e7a4 [feature](mtmv) create mtmv support partitions rollup (#31812)
if an MTMV is created with `date_trunc(`xxx`,'month')`:
when the related table uses `range` partitioning and has 3 partitions:
```
20200101-20200102
20200102-20200103
20200201-20200202
```
then MTMV will have 2 partitions:
```
20200101-20200201
20200201-20200301
```

when the related table uses `list` partitioning and has 3 partitions:
```
(20200101,20200102)
(20200103)
(20200201)
```
then MTMV will have 2 partitions:
```
(20200101,20200102,20200103)
(20200201)
```
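A hypothetical sketch of an MTMV using this partition rollup (view name, base table, and columns are illustrative, and the exact clause syntax may differ from the actual MTMV DDL):

```
CREATE MATERIALIZED VIEW mv_by_month
BUILD DEFERRED REFRESH AUTO ON MANUAL
PARTITION BY (date_trunc(`dt`, 'month'))
DISTRIBUTED BY RANDOM BUCKETS 2
AS
SELECT dt, sum(amount) AS total FROM sales GROUP BY dt;
```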
2024-05-18 18:14:48 +08:00
1545d96617 [WIP](test) remove enable_nereids_planner in regression cases (part 4) (#34642)
Previous PRs are:
#34417
#34490
#34558
2024-05-18 18:07:39 +08:00
385739564d [test](executor) Add workload group upgrade test #35007 2024-05-17 17:34:08 +08:00
e74b17c761 [Fix](Row store) support decimal256 type (#34887) 2024-05-15 19:01:18 +08:00
baf9a45e57 [fix](mtmv) check groupby in agg-bottom-plan when rewrite agg query by mv (#34274)
Check the group-by in the agg-bottom-plan when rewriting and rolling up an aggregate query by mv.
2024-05-15 12:38:40 +08:00
1f0c45204b [fix](iceberg) read the primary key columns if having equality deletes (#34884)
backport: #34835
2024-05-15 11:37:25 +08:00
d5ab2787ba [Fix](function) fix pad functions behaviour of empty pad string (#34796)
fix pad functions behaviour of empty pad string
2024-05-15 10:28:09 +08:00
0b4d814598 [fix](decimal) Fix wrong result produced by decimal128 multiply (#34825)
* [fix](decimal) Fix wrong result produced by decimal128 multiply

* update
2024-05-14 23:34:11 +08:00
a0a025f763 [fix](regression test)fix test_hive_parquet_alter_column p2 case. (#34727) (#34859)
fix test_hive_parquet_alter_column p2 case.
Since this is a p2 case, the data is stored on EMR, not in docker, so there is no need to consider hive2 and hive3.
2024-05-14 23:30:06 +08:00
4dd5379951 [bugfix](hive)fix error for writing to hive for 2.1 (#34518)
mirror #34520
2024-05-14 23:27:29 +08:00
0ae1b9c70a [chore](remove code) Remove dragonbox related (#34528)
* Revert "[refactor](mysql result format) use new serde framework to tuple convert (#25006)"

This reverts commit e5ef0aa6d439c3f9b1f1fe5bc89c9ea6a71d4019.

* run buildall

* MORE

* FIX
2024-05-13 22:16:57 +08:00
db15c811f8 [opt](Nereids) enhance properties regulator checking (#34603)
Enhance properties regulator checking:
(1) the right bucket shuffle restriction takes effect only when either side has the NATURAL shuffle type.
(2) enhance bothSideShuffleKeysAreSameOrder checking by taking EquivalenceExprIds into consideration.


Co-authored-by: zhongjian.xzj <zhongjian.xzj@zhongjianxzjdeMacBook-Pro.local>
2024-05-13 22:15:16 +08:00