Commit Graph

1011 Commits

Author SHA1 Message Date
5ceb5441f4 [feature](nereids) let set operation syntax be compatible with legacy planner (#15664)
Though this syntax is not supported in many other systems, since the ORDER BY clause here is almost redundant and useless, we have to keep it consistent with the legacy Doris syntax.

Here is an example:
SELECT * FROM (SELECT k1, k3 FROM tbl1 ORDER BY k3 UNION ALL SELECT k1, k5 FROM tbl2) t;
2023-01-09 15:31:29 +08:00
2c9c7c48ac [improvement](decimalv3) Java UDF and array type support DECIMALV3 (#15674) 2023-01-09 15:13:16 +08:00
211cc66d02 [fix](multi-catalog) fix image loading failure when creating catalog with resource (#15692)
Bug fix
Fix image loading failure when creating a catalog with a resource.
When creating a jdbc catalog with a resource, the metadata image fails to be loaded,
because loading the jdbc catalog image tries to get the resource from ResourceMgr,
but ResourceMgr has not been loaded yet, so an NPE is thrown.

This PR fixes this bug and refactors some logic about catalog and resource.

When loading the jdbc catalog image, it no longer gets the resource from ResourceMgr.
Users can now create a catalog with both a resource and properties, like:

create catalog jdbc_catalog with resource jdbc_resource
properties("user" = "user1");
The properties in the "properties" clause will overwrite the properties in "jdbc_resource".

Force adding tinyInt1isBit=false to the jdbc url
The default value of tinyInt1isBit is true, which causes tinyint in MySQL to be mapped to the bit type.
Forcing tinyInt1isBit=false in the jdbc url ensures that tinyint in MySQL maps to tinyint in Doris.
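A minimal sketch of the effect (the host and database name below are hypothetical):

```SQL
-- a jdbc_url configured as:   jdbc:mysql://127.0.0.1:3306/db
-- is effectively used as:     jdbc:mysql://127.0.0.1:3306/db?tinyInt1isBit=false
```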

Avoid calculating the checksum of the jdbc driver jar multiple times
Refactor
Refactor the notification logic when updating properties in a resource.
Previously, updating properties in a resource notified the corresponding catalog to update its own properties.
This PR changes this logic: after updating properties in a resource, it only uninitializes the catalog's internal
objects such as the "jdbc client" or "hms client", and these objects are re-initialized lazily.

All properties are now fetched from the Resource at runtime, so the catalog always gets the latest properties.

Regression test cases
Because we add tinyInt1isBit=false to the jdbc url, some of the cases need to be changed.
2023-01-09 09:56:26 +08:00
Pxl
1514b5ab5c [Feature](Materialized-View) support advanced Materialized-View (#15212) 2023-01-09 09:53:11 +08:00
5dfdacd278 [enhancement](histogram) add histogram syntax and persist histogram statistics (#15490)
Histogram statistics are more expensive to collect, so we collect and persist them separately.

This PR does the following work:
1. Add histogram syntax and the keyword `TABLE`
2. Add the task of collecting histogram statistics
3. Persist histogram statistics
4. Replace fastjson with gson
5. Add unit tests...

Relevant syntax examples:
> Following some databases such as MySQL, the keyword `TABLE` is added.

```SQL
-- collect column statistics
ANALYZE TABLE statistics_test;

-- collect histogram statistics
ANALYZE TABLE statistics_test UPDATE HISTOGRAM ON col1,col2;
```

Based on #15317
2023-01-07 00:55:42 +08:00
7f84db310a [fix](nereids) Convert to datetime when binary expr's left is date and right is int type (#15615)
In the case below, the expression `date > 20200101` should implicitly cast both sides to datetime instead of bigint.

```sql
        CREATE TABLE `part_by_date`
        (
            `date`                  date   NOT NULL COMMENT '',
            `id`                      int(11) NOT NULL COMMENT ''
        ) ENGINE=OLAP
        UNIQUE KEY(`date`, `id`)
        PARTITION BY RANGE(`date`) 
        (PARTITION p201912 VALUES [('0000-01-01'), ('2020-01-01')),
        PARTITION p202001 VALUES [('2020-01-01'), ('2020-02-01')))
        DISTRIBUTED BY HASH(`id`) BUCKETS 3
        PROPERTIES (
        "replication_allocation" = "tag.location.default: 1"
        );

        INSERT INTO  part_by_date VALUES('0001-02-01', 1),('2020-01-15', 2);

        SELECT
            id
        FROM
           part_by_date
        WHERE date > 20200101;
```
2023-01-06 14:08:05 +08:00
b57500d0c3 [Bug](decimalv3) fix wrong result for MOD operation (#15644) 2023-01-06 10:38:53 +08:00
77ffafb766 [vulnerability](CVE-2022-1292) fix CVE-2022-1292 (#15639) 2023-01-05 21:57:16 +08:00
d36b93708c [feature](Nereids): add ExtractFilterFromJoin rule to support more (#14896) 2023-01-05 19:09:43 +08:00
5460c873e8 [Feature] (Nereids) support non-equal conjuncts in non-scalar subqueries (#15591)
Support non-equal conjuncts in non-scalar subqueries.
[fix] wrong result in correlated subquery
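A hedged example of the query shape this enables (table and column names are hypothetical): a non-scalar (IN) subquery whose correlated conjunct is a non-equality comparison.

```SQL
-- t2.k2 > t1.k2 is a non-equal correlated conjunct inside a non-scalar subquery
SELECT *
FROM t1
WHERE t1.k1 IN (SELECT t2.k1 FROM t2 WHERE t2.k2 > t1.k2);
```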
2023-01-05 16:56:14 +08:00
0dfa143140 [enhancement](Nereids) generate colocate join when property is different from the required property (#15479)
1. When checking a HashProperty whose shuffle type is natural, we only need to check whether the required properties contain all shuffle columns
2. In ChildrenPropertiesRegulator.java, when colocate/bucket join is not allowed, we will enforce the required property.
2023-01-05 11:41:18 +08:00
61d538c713 [improvement](storage-policy) Add validity check when creating storage policy. (#14405) 2023-01-04 22:24:49 +08:00
wxy
e0c56bcd20 [Feature](export) Support cancel export statement (#15128)
Co-authored-by: wangxiangyu@360shuke.com <wangxiangyu@360shuke.com>
2023-01-04 14:08:25 +08:00
7728794b4a [fix](Nereids) SimplifyArithmeticRule generate wrong expression after process (#15580)
In the case of 'a / b', if a is constant, after applying SimplifyArithmeticRule the expression will be converted to 'b * a' by mistake.
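A minimal illustration of the problematic shape, using a hypothetical table `t` with a column `b`:

```SQL
-- the left operand is a constant; before this fix the rewrite could
-- turn '100 / b' into 'b * 100', which is not equivalent
SELECT 100 / b FROM t;
```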
2023-01-04 11:10:15 +08:00
18bc354c06 [fix](Nereids) use correct column unique id when read data from non-base index (#15534)
When light schema change is enabled by default, a column in OLAP scan is retrieved by column unique id instead of the column name. Columns with the same name would use different unique IDs among materialized indexes.
This PR ensures that the column in the OLAP scan node uses the correct column unique id.
2023-01-04 01:41:25 +08:00
8d0c06c897 [fix](nereids) binding priority in agg-sort, having, group_by_key (#15240)
This PR defines order_key and having_key binding priority.

1. order key priority
 ```
                select
                        col1 * -1 as col1    # inner_col1 * -1 as alias_col1
                from
                        t
                order by col1;     # order by order_col1
```
to bind `order_col1`, `alias_col1` has higher priority than `inner_col1`

2. having key priority
```
       select (a-1) as a  # inner_a - 1 as alias_a
       from bind_priority_tbl 
       group by a 
       having a=1;
```
to bind having key, `inner_a` has higher priority than `alias_a`

3. group by key binding priority
```
SELECT date_format(b.k10,
         '%Y%m%d') AS k10
FROM test a
LEFT JOIN 
    (SELECT k10
    FROM baseall) b
    ON a.k10 = b.k10
GROUP BY  k10;
```
group_by_key (k10) binding priority:

- agg.child.output
- agg.output
If binding with agg.child.output fails (the slot is not found, or more than one candidate slot is found in agg.child.output), Nereids tries to bind group_by_key with agg.output.
In the above example, Nereids finds 2 candidate slots (a.k10, b.k10) in agg.child.output for group_by_key (k10), so binding with agg.child.output fails. Then Nereids tries to bind group_by_key with agg.output, that is `date_format(b.k10, '%Y%m%d') AS k10`. Finally, group_by_key is bound to the alias `k10`.
2023-01-03 22:09:28 +08:00
55dc541c90 [Fix](Nereids) aggregate function except COUNT should be nullable without group by expr (#15547)
Co-authored-by: mch_ucchi
2023-01-03 21:28:07 +08:00
a365486a25 [fix](Nereids) get datatype for binary arithmetic (#15548)
It is just a temporary fix for binary arithmetic. Next we will refactor the TypeCoercion rule to make the behavior exactly the same as the legacy planner.
2023-01-03 19:09:48 +08:00
02d035466b [refactor] remove partition pruner v1 (#15552)
partition pruner v1 is no longer used.
Also remove session variable partition_prune_algorithm_version
2023-01-03 11:35:30 +08:00
238ae54620 [fix](merge-on-write) unique key mow tables should require distribution columns be key column (#15535)
* [fix](merge-on-write) unique key mow tables should require distribution columns be key column

* fix code style
2023-01-01 15:53:21 +08:00
e89adc6e1d [fix](create-table) wrong judgement about partition column type (#15542)
The following stmt should succeed, but returns an error: `complex type cannt be partition column:ARRAY<VARCHAR(64)>`

```
create table test_array( 
task_insert_time BIGINT NOT NULL DEFAULT "0" COMMENT "" , 
task_project ARRAY<VARCHAR(64)>  DEFAULT NULL COMMENT "" ,
route_key DATEV2 NOT NULL COMMENT "range分区键"
) 
DUPLICATE KEY(`task_insert_time`)  
 COMMENT ""
PARTITION BY RANGE(route_key) 
(PARTITION `p202209` VALUES LESS THAN ("2022-10-01"),
PARTITION `p202210` VALUES LESS THAN ("2022-11-01"),
PARTITION `p202211` VALUES LESS THAN ("2022-12-01")) 
DISTRIBUTED BY HASH(`task_insert_time` ) BUCKETS 32 
PROPERTIES
(
    "replication_num" = "1",    
    "light_schema_change" = "true"    
);
```

This PR fixes this.
2022-12-31 13:10:39 +08:00
c47bdf6606 [vectorized](jdbc) fix external table of Oracle having keyword column (#15487)
If a column name is a keyword of Oracle, the query will report an error.
2022-12-31 12:48:26 +08:00
100834df8b [fix](nereids) fix some aggregate bugs in Nereids (#15326)
1. the agg function without the distinct keyword should be a "merge" function in threePhaseAggregateWithDistinct
2. use aggregateParam.aggMode.consumeAggregateBuffer instead of aggregateParam.aggPhase.isGlobal() to indicate if an agg function is a "merge" function
3. add an AvgDistinctToSumDivCount rule to support avg(distinct xxx) in some cases (see the sketch after this list)
4. AggregateExpression's nullable method should call the inner function's nullable method.
5. add a bind slot rule to bind the pattern "logicalSort(logicalHaving(logicalProject()))"
6. don't remove the project node in PhysicalPlanTranslator
7. add a cast-to-bigint expr for count(distinct <datelike type>)
8. fall back to the old optimizer if bitmap runtime filter is enabled.
9. fix exchange node mem leak
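A sketch of the rewrite in item 3, using a hypothetical table `orders`:

```SQL
-- before: SELECT avg(DISTINCT price) FROM orders;
-- after the AvgDistinctToSumDivCount rewrite, conceptually:
SELECT sum(DISTINCT price) / count(DISTINCT price) FROM orders;
```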
2022-12-30 23:07:37 +08:00
93a25e1af5 [fix](nereids) the project node is lost when creating PhysicalStorageLayerAggregate node (#15467) 2022-12-30 16:33:24 +08:00
6c847daba0 [Feature](Nereids) Support grouping set for materialized index. (#15383)
This PR adds support for materialized index selection when the query has grouping sets.
2022-12-29 23:17:02 +08:00
dda505487c [fix](nereids) SimplifyArithmeticRuleTest ut failed (#15486)
This PR removes typeCoercion on the expected expr in ExpressionRewriteTestHelper, because we should not rewrite the expected expr at all; it would change the expected expr unexpectedly.
2022-12-29 22:53:27 +08:00
79113b0cd1 [Fix](storage) Fix bug that cooldown time is error (#15444)
Cooldown time is wrong for data on SSD, because the cooldown time for all tables/partitions
is only calculated once when the class `DataProperty` is loaded and cannot be updated later.
This patch ensures that the cooldown time for each table/partition is calculated in real time
when the table/partition is created.
Co-authored-by: weizuo <weizuo@xiaomi.com>
2022-12-29 21:01:36 +08:00
25b257e37c [enhancement](session var) variable to control whether to rewrite OR to IN or not (#15437) 2022-12-29 14:50:32 +08:00
5b09d27d54 [feature-wip](nereids) Made decimal in nereids more complete (#15087)
1. Add IntegralDivide operator to support `DIV` semantics (see the example after the cases below)
2. Add more operator rewriters to keep expression types consistent between operators
3. Support the conversion between float type and decimal type.

After this PR, the cases below can be executed normally, like the legacy optimizer:
  use test_query_db;
  select k1, k5,100000*k5 from test order by k1, k2, k3, k4;
  select avg(k9) as a from test group by k1 having a < 100.0 order by a;
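For the `DIV` semantics in item 1, a small hedged example (assuming MySQL-style integer division, which the IntegralDivide operator is meant to cover):

```SQL
-- integer division keeps only the integral part: 7 DIV 2 yields 3
SELECT 7 DIV 2;
```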
2022-12-29 13:01:47 +08:00
0e154feeb9 [feature](multi catalog nereids)Add file scan node to nereids. (#15201)
Add file scan node to nereids, so that the new planner could support external hms table.
2022-12-29 10:31:11 +08:00
1b1083eb52 [fix](metric) fix prometheus metric format error for doris_fe_query_latency_ms (#15447)
Co-authored-by: caiconghui1 <caiconghui1@jd.com>
2022-12-29 08:51:15 +08:00
4336aaa01a [bug](datetimev2) fix wrong info when show create table (#15422)
* [bug](datetimev2) fix wrong info when show create table

* update
2022-12-28 19:55:43 +08:00
8ce62600dc [Bug] #14876 && #15225 have some bugs in rewrite or to in, revert them (#15420) 2022-12-28 13:30:09 +08:00
2af831de33 [Fix](Nereids)fix group by binding error, resulting in incorrect results (#15328)
Original: group by is bound to the outputExpression of the current node.

Problem: When the name of the new reference in outputExpression is the same as the child's output column, the child's output column should be used for group by, but the new reference of the node's outputExpression is used instead, resulting in an error.

Now: Give priority to the child's output for group by binding. If the child does not have a corresponding column, use the outputExpression of this node for binding.
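A hedged illustration of the collision, with a hypothetical table `t` and column `k1`:

```SQL
-- the alias k1 in the output has the same name as the child's column k1;
-- with this fix, GROUP BY k1 binds to the child's column k1, not to the alias (k1 + 1)
SELECT k1 + 1 AS k1 FROM t GROUP BY k1;
```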
2022-12-28 10:42:21 +08:00
28bb13a026 [feature](light-schema-change) enable light schema change by default (#15344) 2022-12-28 09:29:26 +08:00
5ac7b09765 [feature](Nereids) Support SchemaScan (#15411)
such as: select * from information_schema.backends;
2022-12-28 00:33:48 +08:00
0550dfaeb2 [enhancement](rewrite) add OrToIn rule and fix ExtractCommonFactorsRule apply problems (#12872)
Co-authored-by: wuhangze <wuhangze@jd.com>
2022-12-27 18:39:53 +08:00
a07ca41f8e [Fix](Nereids) fix repeat node nullable error bugs (#15251) 2022-12-27 17:01:33 +08:00
69068f9835 [fix](planner) fix hll_union plan: Invalid Aggregate Operator: hll_union (#14931)
When using the hll_union aggregate function, PREAGGREGATION is always OFF and the rollup cannot be hit.
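A hedged example of the affected query shape (table and column names are hypothetical; hll_cardinality is used only to make the aggregated HLL readable):

```SQL
-- before this fix, a query like this kept PREAGGREGATION OFF and could not hit a rollup
SELECT city, hll_cardinality(hll_union(uv_hll)) FROM site_visit GROUP BY city;
```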
2022-12-27 11:20:41 +08:00
325d247b92 [Feature](Nereids) Support hll and count for materialized index. (#15275) 2022-12-27 00:38:04 +08:00
650136c32e [Enhancement](fe): replace assertTrue(X.equals(X)) with assertEquals (#15356) 2022-12-27 00:37:24 +08:00
ae87415174 [Feature](Nereids) add simplify arithmetic rule (#15242)
Support the simplify arithmetic rule.

for example:
a + 1 > 1
=> a > 0
2022-12-26 16:57:59 +08:00
1400a89065 [Bug](Compile) fix compile error by using correct method name (#15355)
fix compile error by using correct method name
2022-12-26 14:58:01 +08:00
8b6e4e74e7 [improvement](jdbc) add default jdbc driver's dir (#15346)
Add a new config "jdbc_drivers_dir" for both FE and BE.
Users can put jdbc driver jar files in this dir and only specify the file name in the "driver_url" property
when creating a jdbc resource.
Doris will find the jar files in this dir.
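A hypothetical sketch of the intended usage (resource name, credentials, and driver file name are illustrative, and the exact CREATE RESOURCE syntax may differ slightly); note that "driver_url" carries only the bare jar file name, which is resolved under jdbc_drivers_dir:

```SQL
CREATE EXTERNAL RESOURCE jdbc_resource PROPERTIES (
    "type" = "jdbc",
    "user" = "user1",
    "password" = "",
    "jdbc_url" = "jdbc:mysql://127.0.0.1:3306/db",
    "driver_url" = "mysql-connector-java-8.0.25.jar",
    "driver_class" = "com.mysql.cj.jdbc.Driver"
);
```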

Also modify the logic so that when the jdbc resource is modified, the corresponding jdbc table
will get the latest properties.
2022-12-26 11:51:12 +08:00
7b5739e9a9 [Fix](Nereids) fix dup key for pull predicate from project children (#15292)
In InferPredicates, we need to pull predicates from project children and then use sid to replace id1.
In our code, the map is built using the alias name as the key and the expression as the value. Since sid has two alias names (id1, id2), a Duplicate key exception is thrown.
2022-12-26 10:57:14 +08:00
a807978882 [refactor](non-vec) Remove rowbatch code from delta writer and some rowbatch related code (#15349)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-12-26 08:54:51 +08:00
a291cb17be [fix](information-schema) fix bug that query tables in information_schema db will return error #15336 2022-12-25 10:09:40 +08:00
27d64964e6 [enhancement](Nereids) cast expression to the type with parameters (#14657) 2022-12-23 18:29:50 +08:00
4b7f279cf9 [Enhancement](Nereids) change expression to conjuncts in filter (#14807) 2022-12-23 15:31:40 +08:00
754fceafaf [feature-wip](statistics) add aggregate function histogram and collect histogram statistics (#14910)
**Histogram statistics**

Currently Doris collects statistics but no histogram data, and by default the optimizer assumes that the distinct values of a column are evenly distributed. This estimate can be problematic when the data distribution is skewed. So this PR implements the collection of histogram statistics.

For data-skewed columns (columns with unevenly distributed values), histogram statistics enable the optimizer to generate more accurate cardinality estimates for filter or join predicates involving these columns, resulting in a more precise execution plan.

The histogram optimizes the execution plan mainly in two aspects: the selection of the where condition and the selection of the join order. The selection principle for the where condition is relatively simple: the histogram is used to calculate the selection rate of each predicate, and the filter with the higher selection rate is preferred.

The selection of join order is based on the estimation of the number of rows in the join result. When the data distribution in the join condition columns is uneven, the histogram can greatly improve the accuracy of this estimation. In addition, if the number of rows in a bucket of one of the columns is 0, it can be marked so that the bucket is skipped in the subsequent join process to improve efficiency.

---

Histogram statistics are mainly collected by the histogram aggregation function, which is used as follows:

**Syntax**

```SQL
histogram(expr)
```

> The histogram function is used to describe the distribution of the data. It uses an "equal height" bucketing strategy and divides the data into buckets according to the values of the data. It describes each bucket with some simple statistics, such as the number of values that fall in the bucket. It is mainly used by the optimizer to estimate range queries.

**example**

```
MySQL [test]> select histogram(login_time) from dev_table;
+------------------------------------------------------------------------------------------------------------------------------+
| histogram(`login_time`)                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------+
| {"bucket_size":5,"buckets":[{"lower":"2022-09-21 17:30:29","upper":"2022-09-21 22:30:29","count":9,"pre_sum":0,"ndv":1},...]}|
+------------------------------------------------------------------------------------------------------------------------------+
```
**description**

```JSON
{
    "bucket_size": 5, 
    "buckets": [
        {
            "lower": "2022-09-21 17:30:29", 
            "upper": "2022-09-21 22:30:29", 
            "count": 9, 
            "pre_sum": 0, 
            "ndv": 1
        }, 
        {
            "lower": "2022-09-22 17:30:29", 
            "upper": "2022-09-22 22:30:29", 
            "count": 10, 
            "pre_sum": 9, 
            "ndv": 1
        }, 
        {
            "lower": "2022-09-23 17:30:29", 
            "upper": "2022-09-23 22:30:29", 
            "count": 9, 
            "pre_sum": 19, 
            "ndv": 1
        }, 
        {
            "lower": "2022-09-24 17:30:29", 
            "upper": "2022-09-24 22:30:29", 
            "count": 9, 
            "pre_sum": 28, 
            "ndv": 1
        }, 
        {
            "lower": "2022-09-25 17:30:29", 
            "upper": "2022-09-25 22:30:29", 
            "count": 9, 
            "pre_sum": 37, 
            "ndv": 1
        }
    ]
}
```

TODO:
- histogram func supports parameters and sample statistics (this will be done in another PR)
- use histogram statistics
- add p0 regression
2022-12-22 16:42:17 +08:00