Commit Graph

1351 Commits

Author SHA1 Message Date
Pxl
87e64115ae [Chore](materialized-view) add case about insert data immediately after create mv (#21281)
add case about insert data immediately after create mv
2023-06-29 11:17:38 +08:00
3a12b67517 [Improvement](statistics, multi catalog)Implement hive table statistic connector (#21053)
This PR adds a function for collecting Hive statistics. When the CBO fetches Hive table statistics, the statistics cache
first loads them from the internal stats OLAP table. If they are not found there, it uses this PR's function to fetch them from the remote Hive metastore.
2023-06-29 10:50:54 +08:00
Pxl
45f1909bc3 [Bug](lateral-view) make lateral view function's nullable mode work (#21242)
make lateral view function's nullable mode work
2023-06-29 10:50:07 +08:00
449c8d4568 [fix](jdbc) Handling Zero DateTime Values in Non-nullable Columns for JDBC Catalog Reading MySQL (#21296) 2023-06-28 22:51:17 +08:00
7588abe76b [refactor](Nereids) refactor physical properties and plan translator (#21168)
this PR
1. refactors physical properties, the property deriver, and the property regulator
to ensure Nereids can generate plans with sufficient PhysicalDistribute.
2. refactors PhysicalPlanTranslator to ensure all ExchangeNodes are generated
by PhysicalDistribute, except for CTEConsumer. We will refactor all CTE-related
nodes later.

the detail changes of this PR:
1. update DistributionSpec of physical properties:
- Any: random distribution, used in output and require
- StorageAny: random distribution but constrained by where the data is stored, used in output
- ExecutionAny: random distribution representing a random shuffle, used in output
- Gather: gather distribution, used in output and require
- StorageGather: gather distribution but constrained by where the data is stored, used in output
- Replicated: broadcast distribution
- Hash: bucket distribution

2. update shuffle type of DistributionSpecHash
- REQUIRE: used in require
- NATURAL: distribution as storage engine hash algorithm, constrained by where the data is stored
- STORAGE_BUCKETED: distribution as storage engine hash algorithm
- EXECUTION_BUCKETED: distribution as execution engine hash algorithm

3. update HideOneRowRelationUnderSetOperation to MergeOneRowRelationIntoSetOperation

4. update property deriver of SetOperation to ensure a suitable PhysicalDistribute is added
above and below SetOperation

5. refactor PhysicalPlanTranslator to ensure no unplanned exchange node will be added
2023-06-28 15:15:11 +08:00
b1e973b721 [Improve](func)support array to window-func first-last-value arg type (#21201)
* support array to window-func first-last-value arg type

* add regress test for first-last-value of array type

* update

* format be:
2023-06-28 10:02:00 +08:00
d871df64ca [improvement](oracle jdbc)Support for automatically obtaining the precision of the oracle timestamp type (#21252) 2023-06-28 00:19:01 +08:00
5506faa7b4 [datetimev2](minor) Add scale parameter for datetimev2 (#21176) 2023-06-27 19:55:35 +08:00
64a1eb77f0 [opt](planner) support delete with a subquery in predicate by construct an insert. (#20983)
A complex predicate in a delete stmt, such as:
```sql
delete from t1 where t1.id in (select id from t2);
```

will be rewritten as an insert stmt:
```sql
insert into t1(id, __DORIS_DELETE_SIGN__) select id, 1 from t1 where id in (select id from t2);
```
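
The rewrite above can be sketched as a plain string transformation. This is a minimal illustration in Python, not the planner's actual code: the function name `rewrite_delete_as_insert` and its parameters are hypothetical, and it assumes the key columns are already known and the predicate is passed through unchanged.

```python
# Hypothetical sketch: rewrite a DELETE with a complex predicate as an
# INSERT that marks the matching rows as deleted.
DELETE_SIGN = "__DORIS_DELETE_SIGN__"  # hidden column name taken from the example above


def rewrite_delete_as_insert(table, key_columns, predicate):
    """Build the INSERT statement that stands in for the DELETE."""
    cols = ", ".join(key_columns)
    return (
        f"insert into {table}({cols}, {DELETE_SIGN}) "
        f"select {cols}, 1 from {table} where {predicate};"
    )


print(rewrite_delete_as_insert("t1", ["id"], "id in (select id from t2)"))
```

Called with the arguments shown, the sketch reproduces the insert stmt from the commit message.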
2023-06-27 17:51:13 +08:00
84554ec0fd [fix](planner) the resultExprs should be substituted using table function node's outputSmap (#21182) 2023-06-27 17:19:49 +08:00
Pxl
70ddf64126 [Chore](agg-state) add documentation about agg_state, add group_concat agg_state test case (#21147)
add documentation about agg_state, add group_concat agg_state test case
2023-06-27 11:28:19 +08:00
e0b20f0437 [feature](function) add ip function ipv4numtostring (alias inet_ntoa) (#20936) 2023-06-27 10:17:40 +08:00
c9306e9c48 [improvement](ms jdbc)Support for automatically obtaining the precision of the sqlserver datetime type (#21145) 2023-06-26 23:10:46 +08:00
50c1d55769 [Improve](dynamic schema) support filtering invalid data (#21160)
* [Improve](dynamic schema) support filtering invalid data

1. Support dynamic schema to filter illegal data.
2. Expand the regular expression for ColumnName to support more column names.
3. Be compatible with PropertyAnalyzer and support legacy tables.
4. Disable parsing of multi-dimensional arrays by default, since some bugs remain unresolved.
2023-06-26 19:32:43 +08:00
1ac8cdec7e [Fix](inverted index) fix inverted query cache for chinese tokenizer (#21106)
1. query cache for the Chinese tokenizer is confusing when just converting w_char to char.
2. separate query_type from inverted_index_reader to clean up the code.
2023-06-25 22:04:02 +08:00
2d1163c4d8 [refactor](nereids) update Agg stats derive method #21036
This pr has no effect on tpch queries.
Some tpcds queries are impacted.
The affected queries are 4/11/23/24/47/51/57/65/74, of which 4 and 51 are improved.
2023-06-25 21:47:32 +08:00
638aa41988 [fix](planner) fix push filter through agg #21080
In the previous implementation, the check for group by exprs was skipped. This change adds the necessary check so that the rule works correctly.

You can reproduce it by running the SQL below:

CREATE TABLE t_push_filter_through_agg (col1 varchar(11451) not null, col2 int not null, col3 int not null)
UNIQUE KEY(col1)
DISTRIBUTED BY HASH(col1)
BUCKETS 3
PROPERTIES(
    "replication_num"="1"
);

CREATE VIEW `view_i` AS 
SELECT 
    `b`.`col1` AS `col1`, 
    `b`.`col2` AS `col2`
FROM 
(
    SELECT 
        `col1` AS `col1`, 
        sum(`cost`) AS `col2`
    FROM 
    (
        SELECT 
            `col1` AS `col1`, 
            sum(CAST(`col3` AS INT)) AS `cost` 
        FROM 
            `t_push_filter_through_agg` 
        GROUP BY 
            `col1`
    ) a 
    GROUP BY 
        `col1`
) b;

SELECT SUM(`total_cost`) FROM view_a WHERE `dt` BETWEEN '2023-06-12' AND '2023-06-18' LIMIT 1;
2023-06-25 19:14:20 +08:00
6896776034 [test](regression) update some case in p2 (#21094)
update some case in p2
2023-06-25 11:16:56 +08:00
8b561cfb03 [fix](nereids)create datev2 and datetimev2 literal if enable_date_conversion is true (#21065) 2023-06-21 20:29:36 +08:00
6ac0bfeceb [Feature](inverted index) add unicode parser for inverted index (#21035) 2023-06-21 20:14:06 +08:00
cc53391c9a Revert "[feature](merge-on-write) enable merge on write by default (#… (#21041) 2023-06-21 18:36:46 +08:00
2beed11256 [Bug](streamload) fix inconsistent load result of be and fe (#20950) 2023-06-21 18:12:51 +08:00
8bcd42d3f6 [test](regression) update some case in brown_p2 #21037 2023-06-21 16:25:07 +08:00
4d84cd8ca1 Revert "Revert "[Test](regression) CCR syncer thrift interface regression test (#20935)" (#20990)" (#21022)
This reverts commit 2a294801f1324a999570158eea3224239eefbb29.
2023-06-21 15:20:21 +08:00
bad22dd4e2 [Fix](orc-reader) Fix orc dict filter null value issue in _convert_dict_cols_to_string_cols which caused incorrect result. (#21047)
Query results should not have empty values.
```
use regresssion.multi_catalog;
select commit_id from github_events_orc WHERE (event_type = 'CommitCommentEvent') AND commit_id != "" limit 10;
```
```
+------------------------------------------+
| commit_id                                |
+------------------------------------------+
| 685c1fd8dbbdc10c042932f9a9f88be00ff96c75 |
| 685c1fd8dbbdc10c042932f9a9f88be00ff96c75 |
| 4e3ab2ff2d2474f5d51334b9b0fdf17e9845a166 |
|                                          |
|                                          |
|                                          |
|                                          |
|                                          |
|                                          |
| 7191c20cb49da07a7fc16aa32dc0de4faff528b2 |
+------------------------------------------+
10 rows in set (0.54 sec) 
```
2023-06-21 14:54:01 +08:00
Pxl
5f0bb49d46 [Feature](materialized-view) support create mv contain aggstate column (#20812)
support create mv contain aggstate column
2023-06-21 13:06:52 +08:00
18beb822a3 [FIX](array-type) fix array string output with fe const expr (#21042)
The FE foldconstRule evaluates an array() function expr with const literals itself and does not pass the resulting array literal to the BE, so the FE array string output format should be the same as the BE array string output format.
2023-06-21 11:52:02 +08:00
0cf9de8cef [fix](decimalv3) fix result error when cast a round decimalv3 to double (#20678) 2023-06-21 00:02:48 +08:00
2c11ce0a02 [bugfix](topn) fix key topn merge block conflict with index predicate result columns (#20820) 2023-06-20 21:23:00 +08:00
f10258577b [Fix](Planner) Fix group concat with multi distinct and segs (#20912)
Problem:
When using select group_concat(distinct a, 'seg1'), group_concat(distinct b, 'seg2') ..., an error is raised.
Reason:
The group_concat function also treats 'seg' as an argument, so a multi distinct column error is raised.
Solution:
Let the multi distinct group_concat function take only its first argument as the real argument.
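
The idea of the fix can be sketched as follows. This is an illustrative Python sketch, not the planner's code: the helper name `distinct_columns` and the way a call is represented (a name plus a list of argument strings) are assumptions made for the example.

```python
# Hypothetical sketch: decide which arguments count as "distinct columns"
# when a query contains several DISTINCT aggregates.
def distinct_columns(func_name, args):
    """Return the arguments that are treated as real distinct columns."""
    if func_name.lower() == "group_concat":
        # The second argument is only the separator, so ignore it.
        return list(args[:1])
    return list(args)


calls = [("group_concat", ["a", "'seg1'"]), ("group_concat", ["b", "'seg2'"])]
# Each call now contributes exactly one distinct column.
print([distinct_columns(name, args) for name, args in calls])
```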
2023-06-20 21:00:18 +08:00
7e01f074e2 [improvement](jdbc mysql) support auto calculate the precision of timestamp/datetime (#20788) 2023-06-20 10:39:34 +08:00
824bc02603 [Function] Support date function: microsecond() (#20044) 2023-06-20 10:32:54 +08:00
d02ecef406 [fix](Nereids): revert push down alias into union (#20991)
revert #20543 to temporarily avoid the problem
2023-06-20 09:32:26 +08:00
5a28b6f9fc [fix](datetime) Fix the error in date calculation that includes constants (#20863)
before

```
mysql> select hours_add('2023-03-30 22:23:45.23452',8);
+-------------------------------------+
| hours_add('2023-03-30 22:23:45', 8) |
+-------------------------------------+
| 2023-03-31 06:23:45                 |
+-------------------------------------+

mysql> select date_add('2023-03-30 22:23:45.23452',8);
+------------------------------------+
| date_add('2023-03-30 22:23:45', 8) |
+------------------------------------+
| 2023-04-07 22:23:45                |
+------------------------------------+

mysql [test]>select hours_add('2023-03-30 22:23:45.23452',8);
+-------------------------------------------+
| hours_add('2023-03-30 22:23:45.23452', 8) |
+-------------------------------------------+
| 2023-03-31 06:23:45.000234                |
+-------------------------------------------+
```

after

```
mysql [test]>select hours_add('2023-03-30 22:23:45.23452',8);
+-------------------------------------------+
| hours_add('2023-03-30 22:23:45.23452', 8) |
+-------------------------------------------+
| 2023-03-31 06:23:45.23452                 |
+-------------------------------------------+
1 row in set (0.01 sec)

mysql [test]>select date_add('2023-03-30 22:23:45.23452',8);
+------------------------------------------+
| date_add('2023-03-30 22:23:45.23452', 8) |
+------------------------------------------+
| 2023-04-07 22:23:45.23452                |
+------------------------------------------+
1 row in set (0.00 sec)

mysql [test]>set enable_nereids_planner=true;
Query OK, 0 rows affected (0.00 sec)

mysql [test]>set enable_fallback_to_original_planner=false;
Query OK, 0 rows affected (0.00 sec)

mysql [test]>select hours_add('2023-03-30 22:23:45.23452',8);
+-------------------------------------------+
| hours_add('2023-03-30 22:23:45.23452', 8) |
+-------------------------------------------+
| 2023-03-31 06:23:45.23452                 |
+-------------------------------------------+
1 row in set (0.03 sec)

mysql [test]>select date_add('2023-03-30 22:23:45.23452',8);
+------------------------------------------+
| days_add('2023-03-30 22:23:45.23452', 8) |
+------------------------------------------+
| 2023-04-07 22:23:45.23452                |
+------------------------------------------+
1 row in set (0.00 sec)
```
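
For comparison, the same arithmetic can be checked with Python's standard `datetime` module. This is only a cross-check of the expected values under Python's own parsing rules, not Doris code: `%f` pads a shorter fraction on the right, so `.23452` is read as 234520 microseconds and survives the addition.

```python
from datetime import datetime, timedelta

# Parse a timestamp with a 5-digit fractional part; %f right-pads it to
# microseconds, so ".23452" becomes 234520 microseconds.
ts = datetime.strptime("2023-03-30 22:23:45.23452", "%Y-%m-%d %H:%M:%S.%f")

hours_added = ts + timedelta(hours=8)  # counterpart of hours_add(..., 8)
days_added = ts + timedelta(days=8)    # counterpart of date_add(..., 8)

print(hours_added)  # 2023-03-31 06:23:45.234520
print(days_added)   # 2023-04-07 22:23:45.234520
```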
2023-06-19 23:44:30 +08:00
e6f50c04f1 [fix](nereids)SubqueryToApply rule lost is null condition (#20971)
* [fix](nereids)SubqueryToApply rule lost is null condition
2023-06-19 23:43:40 +08:00
f20ef165fe [opt](Nereids) update join stats derive (#20895)
In a hash join condition, some equal conditions are trustable and some are not.
An equal condition is trustable if one side is almost unique, like a primary key. For such a condition we can estimate more accurately.
The problem is that in the rewritten q20 there are 2 equal conditions: one is trustable, the other is not. But we treated both of them as trustable.

Test result:
on tpch100, from 2.2 sec to 0.44 sec
no impact on other tpch queries
no performance impact on tpcds queries
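
The notion of a trustable equal condition described above can be sketched like this. It is a rough Python illustration only: the function name `is_trustable`, the use of distinct-value counts (ndv) against row counts, and the 0.9 threshold are all assumptions, not values taken from the PR.

```python
# Hypothetical sketch: an equal condition is "trustable" when one side is
# almost unique, i.e. its distinct-value count is close to its row count.
def is_trustable(left_ndv, left_rows, right_ndv, right_rows, threshold=0.9):
    """threshold is an assumed cut-off for 'almost unique'."""

    def almost_unique(ndv, rows):
        return rows > 0 and ndv / rows >= threshold

    return almost_unique(left_ndv, left_rows) or almost_unique(right_ndv, right_rows)


# A primary-key-like side makes the condition trustable.
print(is_trustable(1000, 1000, 50, 100000))
# Neither side is close to unique, so the condition is not trustable.
print(is_trustable(10, 1000, 50, 100000))
```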
2023-06-19 23:40:44 +08:00
2a294801f1 Revert "[Test](regression) CCR syncer thrift interface regression test (#20935)" (#20990)
This reverts commit dd482b74c849b022862e7cfb1f1d0b933a84e3d2.
2023-06-19 21:38:03 +08:00
dd5ecea36a [fix](compress) snappy does not work right (#20934) 2023-06-19 14:11:10 +08:00
fb9fcf460a [fix](leftjoin) fix bug of left and full join with other conjuncts (#20946)
Fix a bug in left and full outer join with other conjuncts. When the number of equal-matched rows for a probe row exceeds batch_size, the _join_node->_is_any_probe_match_row_output flag is sometimes not set correctly, which results in extra rows being output for the probe row.
2023-06-19 12:27:06 +08:00
Pxl
85c5d7c6a9 [Chore](materialized-view) add ssb_flat mv test case (#20869)
add ssb_flat mv test case
2023-06-19 10:51:50 +08:00
1efd345963 [Enhancement](table) adding information_schema.parameters table (#20259)
This is a virtual table for compatibility with the information_schema parameters table.
2023-06-19 09:05:46 +08:00
8366ce7a81 [enhancement](insert-stmt) Make insert into tbl values(); compatible with mysql (#20694) 2023-06-18 19:56:07 +08:00
ac3290021d [fix](Nereids): MergeSetOperations can merge SetOperation ALL. (#20902) 2023-06-18 17:49:03 +08:00
5ae14549d1 [Feature](Nereids) support delete using syntax to delete data from unique key table (#20452) 2023-06-18 16:22:21 +08:00
dd482b74c8 [Test](regression) CCR syncer thrift interface regression test (#20935) 2023-06-18 00:13:09 +08:00
fe18cfa2fb [improvement](pg jdbc)Support for automatically obtaining the precision of the postgresql timestamp type (#20909) 2023-06-16 23:41:09 +08:00
367f64e7bd [improvement](jdbc) support insert autoinc and default value column to mysql (#20765)
In JdbcMysqlClient, I've added methods to retrieve auto-increment and default value columns from MySQL. These columns are then mapped into Doris metadata to make them visible to users.

When handling the InsertStmt into an execution plan, Doris used to automatically fill in NULL or default values for columns not specified in the InsertStmt. However, in the JDBC catalog, we don't need Doris to handle these unspecified columns, so I've made changes to skip them directly.

For the insert prepared statement required for writing, our previous behavior was to obtain all columns for placeholders. So, the change I made is to pass in the columns processed by the execution plan during the sink task generation stage for dynamic generation.
2023-06-16 23:38:11 +08:00
e834637a5b [improvement](ck jdbc) Support for automatically getting the precision of clickhouse's datetime64 type (#20887) 2023-06-16 23:37:30 +08:00
bf197ee8d2 [opt](nereids) adjust cost model for BroadCastJoin and PartitionJoin (#20713)
We add a penalty for broadcast join (bc for short in the following).
The intuition behind the penalty is as follows:
1. if the build side is very small (< 1M), we prefer bc and set `penalty=1`, which means no penalty
2. if the build side is more than 1M, we consider the ratio of the probe row count to the build row count. The lower the ratio, the higher the penalty.

This PR has a positive impact on tpch queries. Only q3 is changed. In our test (tpch 1T, 3BE) q3 improved from 5.1 sec to 2.5 sec.
This PR has a positive impact on tpcds queries. In a test on tpcds sf100 (3BE), the cold run improved from 163 sec to 156 sec and the hot run improved from 155 sec to 149 sec.
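
The two rules above can be sketched as a small function. This is only an illustration in Python: the commit message does not give the actual formula, so the shape `1 + 1 / ratio`, the name `broadcast_penalty`, and reading "1M" as a row count are all assumptions.

```python
SMALL_BUILD_ROWS = 1_000_000  # assumed: "1M" read as a row count


def broadcast_penalty(build_rows, probe_rows):
    """Return a cost multiplier for broadcast join; 1.0 means no penalty."""
    if build_rows < SMALL_BUILD_ROWS:
        # Rule 1: a very small build side is not penalized.
        return 1.0
    # Rule 2: the lower the probe/build ratio, the higher the penalty.
    ratio = probe_rows / build_rows
    # Assumed shape; the guard avoids division by zero for an empty probe side.
    return 1.0 + 1.0 / max(ratio, 1e-9)


print(broadcast_penalty(500_000, 1_000_000_000))
print(broadcast_penalty(2_000_000, 200_000_000))
print(broadcast_penalty(2_000_000, 20_000_000))
```

With the same build side, the call with the smaller probe side yields the larger penalty, which is the behavior rule 2 describes.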
2023-06-16 22:49:04 +08:00
5dc0f90c7f [opt](Nereids) revert convert IN with 2 options to OR expression rule (#20894)
revert this rule because it has a negative effect on predicate push-down-to-storage-layer
2023-06-16 19:11:37 +08:00