Commit Graph

1368 Commits

Author SHA1 Message Date
90dd8716ed [refactor](multicast) change the way multicast do filter, project and shuffle (#21412)
Co-authored-by: Jerry Hu <mrhhsg@gmail.com>

1. Filtering is done at the sending end rather than the receiving end
2. Projection is done at the sending end rather than the receiving end
3. Each sender can use different shuffle policies to send data
2023-07-04 16:51:07 +08:00
599ba4529c [fix](nereids) need run ConvertInnerOrCrossJoin rule again after EliminateNotNull (#21346)
after running EliminateNotNull rule, the join conjuncts may be removed from inner join node.
So need run ConvertInnerOrCrossJoin rule to convert inner join with no join conjuncts to cross join node.
2023-07-04 10:52:36 +08:00
f80df20b6f [Fix](multi-catalog) Fix read error in mixed partition locations. (#21399)
Issue Number: close #20948

Fix read error in mixed partition locations(for example, some partitions locations are on s3, other are on hdfs) by `getLocationType` of file split level instead of the table level.
2023-07-03 15:14:17 +08:00
8e8a8da2e7 [Improve](regresstest) update collect distinct regress test for array hash (#21417)
this regress sql can make sense of array hashing function is working fine
2023-07-03 12:16:11 +08:00
2827bc1a39 [Fix](nereids) fix a bug in ColumnStatistics.numNulls update #21220
no impact on tpch
has impact on tpcds 95,
before 1.63 sec, after 1.30 sec
2023-07-03 10:51:23 +08:00
Pxl
59c1bbd163 [Feature](materialized view) support query match mv with agg_state on nereids planner (#21067)
* support create mv contain aggstate column

* update

* update

* update

* support query match mv with agg_state on nereids planner

update

* update

* update
2023-07-03 10:19:31 +08:00
124516c1ea [Fix](orc-reader) Fix Wrong data type for column error when column order in hive table is not same in orc file schema. (#21306)
`Wrong data type for column` error when column order in hive table is not same in orc file schema.

The root cause is in order to handle the following case:

The table in orc format of Hive 1.x may encounter system column names such as `_col0`, `_col1`, `_col2`... in the underlying orc file schema, which need to use the column names in the hive table for mapping.

### Solution
Currently fix this issue by handling the following case by specifying hive version to 1.x.x in the hive catalog configuration.

```sql
CREATE CATALOG hive PROPERTIES (
    'hive.version' = '1.x.x'
);
```
2023-07-03 09:32:55 +08:00
4ad3a7a8de [fix](exec) run exec_plan_fragment in pthread to avoid BE crash (#21343)
If there is only one fragment of a query plan, FE will call `exec_plan_fragment` rpc to BE.
And on BE side, the `exec_plan_fragment()` will be executed directly in bthread, but it may call
some JNI method like `AttachCurrentThread()`, which will return error in bthread.

So I modify the `exec_plan_fragment` to make sure it will be executed in pthread pool.
2023-07-01 12:29:22 +08:00
ed2cd4974e [fix](nereids) to_date should return type datev2 for datetimev2 (#21375)
To_date function in nereids return type should be DATEV2 if the arg type is DATETIMEV2.
Before the return type was DATE which would cause BE get wrong query result.
2023-06-30 21:42:59 +08:00
Pxl
88cbea2b56 [Bug](agg-state) fix core dump on not nullable argument for aggstate's nested argument (#21331)
fix core dump on not nullable argument for aggstate's nested argument
2023-06-30 18:20:25 +08:00
d76fa427a3 [improve](jsonb)Invalid json path prompts an error instead of null (#19646)
1. Invalid json path prompts an error instead of null:
before:
```sql
mysql> SELECT jsonb_extract('[{"k1":"v41","k2":400},1,"a",3.14]', '$[a]');
+-------------------------------------------------------------+
| jsonb_extract('[{"k1":"v41","k2":400},1,"a",3.14]', '$[a]') |
+-------------------------------------------------------------+
| NULL                                                        |
+-------------------------------------------------------------+
1 row in set (0.01 sec)
```
now
```sql
mysql> SELECT jsonb_extract('[{"k1":"v41","k2":400},1,"a",3.14]', '$[a]');
ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[INVALID_ARGUMENT]Json path error: Invalid Json Path for value: $[a]
```
2. fix some problem: https://github.com/apache/doris/pull/19185
   a. support negative numbers
```sql
mysql> SELECT jsonb_extract('[{"k1":"v41","k2":400},1,"a",3.14]', '$[-2]');
+--------------------------------------------------------------+
| jsonb_extract('[{"k1":"v41","k2":400},1,"a",3.14]', '$[-2]') |
+--------------------------------------------------------------+
| "a"                                                          |
+--------------------------------------------------------------+
1 row in set (0.02 sec)
```
  b. Avoid using unnecessary memory
3. Supplementary regression test
2023-06-30 14:29:21 +08:00
8809cca74a [fix](nereids) physical sort node's equals method should compare sort phase (#21301) 2023-06-30 14:04:22 +08:00
33fa5dd1e9 [fix](cast) fix coredump of cast string of invalid datetime (#21350)
For sql like select cast("627492340" as datetime); the string is an invalid datetime, function DateV2Value<T>::from_date_str cast it as datetime 2062-74-92 23:40:00, with an out-of-range month and day value, which cause memory violation in function DateV2Value<T>::format_datetime when trying to access s_days_in_month.

==256444==ERROR: AddressSanitizer: global-buffer-overflow on address 0x55a7c1a5cff8 at pc 0x55a7e5aa3d2a bp 0x7f3b805f0370 sp 0x7f3b805f0368
READ of size 4 at 0x55a7c1a5cff8 thread T390 (FragmentMgrThre)
    #0 0x55a7e5aa3d29 in doris::vectorized::DateV2Value<doris::vectorized::DateTimeV2ValueType>::format_datetime(unsigned int*, bool*) const /home/zcp/repo_center/doris_master/doris/be/src/vec/runtime/vdatetime_value.cpp:1821:31
    #1 0x55a7e5aa3052 in doris::vectorized::DateV2Value<doris::vectorized::DateTimeV2ValueType>::from_date_str(char const*, int, int) /home/zcp/repo_center/doris_master/doris/be/src/vec/runtime/vdatetime_value.cpp:1968:5
    #2 0x55a7d48f0c49 in bool doris::vectorized::read_datetime_v2_text_impl<unsigned long>(unsigned long&, doris::vectorized::ReadBuffer&, unsigned int) /home/zcp/repo_center/doris_master/doris/be/src/vec/io/io_helper.h:309:19
    #3 0x55a7ddb21642 in bool doris::vectorized::try_read_datetime_v2_text<unsigned long>(unsigned long&, doris::vectorized::ReadBuffer&, unsigned int) /home/zcp/repo_center/doris_master/doris/be/src/vec/io/io_helper.h:409:12
    #4 0x55a7ddb215ec in bool doris::vectorized::try_parse_impl<doris::vectorized::DataTypeDateTimeV2, unsigned int, void*>(doris::vectorized::DataTypeDateTimeV2::FieldType&, doris::vectorized::ReadBuffer&, DateLUTImpl const*, unsigned int) /home/zcp/repo_center/doris_master/doris/be/src/vec/functions/function_cast.h:839:16
    #5 0x55a7ddb21c84 in auto doris::Status doris::vectorized::ConvertThroughParsing<doris::vectorized::DataTypeString, doris::vectorized::DataTypeDateTimeV2, doris::vectorized::NameCast>::execute<void*>(doris::FunctionContext*, doris::vectorized::Block&, std::vector<unsigned long, std::allocator<unsigned long>> const&, unsigned long, unsigned long, bool, void*)::'lambda'(void*, auto)::operator()<std::integral_constant<bool, false>, std::integral_constant<bool, true>>(void*, auto) const /home/zcp/repo_center/doris_master/doris/be/src/vec/functions/function_cast.h:1340:38
    #6 0x55a7ddb216f7 in void* std::__invoke_impl<doris::Status, doris::Status doris::vectorized::ConvertThroughParsing<doris::vectorized::DataTypeString, doris::vectorized::DataTypeDateTimeV2, doris::vectorized::NameCast>::execute<void*>(doris::FunctionContext*, doris::vectorized::Block&, std::vector<unsigned long, std::allocator<unsigned long>> const&, unsigned long, unsigned long, bool, void*)::'lambda'(void*, auto), std::integral_constant<bool, false>, std::integral_constant<bool, true>>(std::__invoke_other, auto&&, std::integral_constant<bool, false>&&, std::integral_constant<bool, true>&&) /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61:14
    #7 0x55a7ddb2167f in std::__invoke_result<void*, std::integral_constant<bool, false>, std::integral_constant<bool, true>>::type std::__invoke<doris::Status doris::vectorized::ConvertThroughParsing<doris::vectorized::DataTypeString, doris::vectorized::DataTypeDateTimeV2, doris::vectorized::NameCast>::execute<void*>(doris::FunctionContext*, doris::vectorized::Block&, std::vector<unsigned long, std::allocator<unsigned long>> const&, unsigned long, unsigned long, bool, void*)::'lambda'(void*, auto), std::integral_constant<bool, false>, std::integral_constant<bool, true>>(void*&&, std::integral_constant<bool, false>&&, std::integral_constant<bool, true>&&) /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96:14
    #8 0x55a7ddb20d14 in std::__detail::__variant::__gen_vtable_impl<std::__detail::__variant::_Multi_array<std::__detail::__variant::__deduce_visit_result<doris::Status> (*)(doris::Status doris::vectorized::ConvertThroughParsing<doris::vectorized::DataTypeString, doris::vectorized::DataTypeDateTimeV2, doris::vectorized::NameCast>::execute<void*>(doris::FunctionContext*, doris::vectorized::Block&, std::vector<unsigned long, std::allocator<unsigned long>> const&, unsigned long, unsigned long, bool, void*)::'lambda'(void*, auto)&&, std::variant<std::integral_constant<bool, false>, std::integral_constant<bool, true>>&&, std::variant<std::integral_constant<bool, false>, std::integral_constant<bool, true>>&&)>, std::integer_sequence<unsigned long, 0ul, 1ul>>::__visit_invoke(doris::Status doris::vectorized::ConvertThroughParsing<doris::vectorized::DataTypeString, doris::vectorized::DataTypeDateTimeV2, doris::vectorized::NameCast>::execute<void*>(doris::FunctionContext*, doris::vectorized::Block&, std::vector<unsigned long, std::allocator<unsigned long>> const&, unsigned long, unsigned long, bool, void*)::'lambda'(void*, auto)&&, std::variant<std::integral_constant<bool, false>, std::integral_constant<bool, true>>&&, std::variant<std::integral_constant<bool, false>, std::integral_constant<bool, true>>&&) /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/variant:1013:11
    #9 0x55a7ddb20c15 in decltype(auto) std::__do_visit<std::__detail::__variant::__deduce_visit_result<doris::Status>, doris::Status doris::vectorized::ConvertThroughParsing<doris::vectorized::DataTypeString, doris::vectorized::DataTypeDateTimeV2, doris::vectorized::NameCast>::execute<void*>(doris::FunctionContext*, doris::vectorized::Block&, std::vector<unsigned long, std::allocator<unsigned long>> const&, unsigned long, unsigned long, bool, void*)::'lambda'(void*, auto), std::variant<std::integral_constant<bool, false>, std::integral_constant<bool, true>>, std::variant<std::integral_constant<bool, false>, std::integral_constant<bool, true>>>(auto&&, std::variant<std::integral_constant<bool, false>, std::integral_constant<bool, true>>&&, std::variant<std::integral_constant<bool, false>, std::integral_constant<bool, true>>&&) /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/variant:1714:14
    #10 0x55a7ddb20b6a in decltype(auto) std::visit<doris::Status doris::vectorized::ConvertThroughParsing<doris::vectorized::DataTypeString, doris::vectorized::DataTypeDateTimeV2, doris::vectorized::NameCast>::execute<void*>(doris::FunctionContext*, doris::vectorized::Block&, std::vector<unsigned long, std::allocator<unsigned long>> const&, unsigned long, unsigned long, bool, void*)::'lambda'(void*, auto), std::variant<std::integral_constant<bool, false>, std::integral_constant<bool, true>>, std::variant<std::integral_constant<bool, false>, std::integral_constant<bool, true>>>(void*&&, std::variant<std::integral_constant<bool, false>, std::integral_constant<bool, true>>&&, std::variant<std::integral_constant<bool, false>, std::integral_constant<bool, true>>&&) /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/variant:1769:9
    #11 0x55a7ddb205ff in doris::Status doris::vectorized::ConvertThroughParsing<doris::vectorized::DataTypeString, doris::vectorized::DataTypeDateTimeV2, doris::vectorized::NameCast>::execute<void*>(doris::FunctionContext*, doris::vectorized::Block&, std::vector<unsigned long, std::allocator<unsigned long>> const&, unsigned long, unsigned long, bool, void*) /home/zcp/repo_center/doris_master/doris/be/src/vec/functions/function_cast.h:1321:23
    #12 0x55a7ddb1f2c7 in doris::vectorized::FunctionConvertFromString<doris::vectorized::DataTypeDateTimeV2, doris::vectorized::NameCast>::execute_impl(doris::FunctionContext*, doris::vectorized::Block&, std::vector<unsigned long, std::allocator<unsigned long>> const&, unsigned long, unsigned long) /home/zcp/repo_center/doris_master/doris/be/src/vec/functions/function_cast.h:1417:20
2023-06-30 10:12:31 +08:00
9f44c2d80d [fix](nereids) nest loop join stats estimation (#21275)
1. fix bug in nest loop join estimation
2. update column=column stats estimation
2023-06-30 10:00:30 +08:00
9756ff1e25 [feature](Nereids): infer distinct from SetOperator (#21235)
Infer distinct from Distinct SetOperator, and put distinct above children to reduce data.

tpcds_sf100 q14:

before
100 rows in set (7.60 sec)

after
100 rows in set (6.80 sec)
2023-06-29 22:04:41 +08:00
5bb79be932 [opt](Nereids) forbid gather agg and gather set operation (#21332)
gather agg and gather set operation usually not good
we cannot compute cost on them nicely, so just
forbid them until we could choose realy best plan
2023-06-29 19:52:15 +08:00
64e9eab0dd [fix](nereids)update Agg stats estimation #21300
Agg stats estimation should use the biggest groupby key's NDV as base, and multiply expansion factor, which is calculated by other groupby key' ndv.
Before, we use the smallest ndv as base
2023-06-29 16:37:05 +08:00
Pxl
87e64115ae [Chore](materialized-view) add case about insert data imidiately after create mv(#21281)
add case about insert data imidiately after create mv
2023-06-29 11:17:38 +08:00
3a12b67517 [Improvement](statistics, multi catalog)Implement hive table statistic connector (#21053)
This pr is to add the collecting hive statistic function. While the CBO fetching hive table statistics, statistic cache will 
first load from internal stats olap table. If not found, then using this pr's function to fetch from remote Hive metastore.
2023-06-29 10:50:54 +08:00
Pxl
45f1909bc3 [Bug](lateral-view) make lateral view function's nullable mode work (#21242)
make lateral view function's nullable mode work
2023-06-29 10:50:07 +08:00
449c8d4568 [fix](jdbc) Handling Zero DateTime Values in Non-nullable Columns for JDBC Catalog Reading MySQL (#21296) 2023-06-28 22:51:17 +08:00
7588abe76b [refactor](Nereids) refactor physical properties and plan translator (#21168)
this PR
1. refactor physical properties, property deriver and property regular 
to ensure Nereids could generate plan with sufficent PhysicalDistribute.
2. refactor PhyscialPlanTranslator to ensure all ExchangeNode generated
by PhysicalDistribute, except CTEConsumer. We will refactor all cte
related node later. 

the detail changes of this PR:
1. update DistributionSpec of physical properties:
- Any: random distribution, used in output and require
- StorageAny: random distribution but constrained by where the data is stored, used in output
- ExecutionAny: random distribution to present random shuffle, used in output
- Gather: gather distribution, used in output and require
- StorageGather: gather distribution but constrained by where the data is stored, used in output
- Replicated: broadcast distribution
- Hash: bucket distribution

2. update shuffle type of DistributionSpecHash
- REQUIRE: used in require
- NATURAL: distribution as storage engine hash algorithm, constrained by where the data is stored
- STORAGE_BUCKETED: distribution as storage engine hash algorithm
- EXECUTION_BUCKETED: distribution as execution engine hash algorithm

3. update HideOneRowRelationUnderSetOperation to MergeOneRowRelationIntoSetOperation

4. update property deriver of SetOperation to ensure suitable PhysicalDistribute be added
at top and below of SetOperation

5. refactor PhysicalPlanTranslator to ensure no unplanned exchange node will be added
2023-06-28 15:15:11 +08:00
b1e973b721 [Improve](func)support array to window-func first-last-value arg type (#21201)
* support array to windown-func first-last-value arg type

* add regress test for first-last-value of array type

* update

* format be:
2023-06-28 10:02:00 +08:00
d871df64ca [improvement](oracle jdbc)Support for automatically obtaining the precision of the oracle timestamp type (#21252) 2023-06-28 00:19:01 +08:00
5506faa7b4 [datetimev2](minor) Add scale parameter for datetimev2 (#21176) 2023-06-27 19:55:35 +08:00
64a1eb77f0 [opt](planner) support delete with a subquery in predicate by construct an insert. (#20983)
complex predicate in delete stmt like: 
```sql
delete from t1 where t1.id in (select id from t2);
```

will be replaced to an insert stmt.
```sql
insert into t1(id, __DORIS_DELETE_SIGN__) select id, 1 from t1 where id in (select id from t2);
```
2023-06-27 17:51:13 +08:00
84554ec0fd [fix](planner) the resultExprs should be substituted using table function node's outputSmap (#21182) 2023-06-27 17:19:49 +08:00
Pxl
70ddf64126 [Chore](agg-state) add documentation about agg_state, add group_concat agg_state test case (#21147)
add documentation about agg_state, add group_concat agg_state test case
2023-06-27 11:28:19 +08:00
e0b20f0437 [feature](function) add ip function ipv4numtostring (alias inet_ntoa) (#20936) 2023-06-27 10:17:40 +08:00
c9306e9c48 [improvement](ms jdbc)Support for automatically obtaining the precision of the sqlserver datetime type (#21145) 2023-06-26 23:10:46 +08:00
50c1d55769 [Improve](dynamic schema) support filtering invalid data (#21160)
* [Improve](dynamic schema) support filtering invalid data

1. Support dynamic schema to filter illegal data.
2. Expand the regular expression for ColumnName to support more column names.
3. Be compatible with PropertyAnalyzer and support legacy tables.
4. Default disable parse multi dimenssion array, since some bug unresolved
2023-06-26 19:32:43 +08:00
1ac8cdec7e [Fix](inverted index) fix inverted query cache for chinese tokenizer (#21106)
1. query cache for chinese tokenizer is confusing when just converting w_char to char.
2. seperate query_type from inverted_index_reader to clean code.
2023-06-25 22:04:02 +08:00
2d1163c4d8 [refactor](nereids) update Agg stats derive method #21036
This pr has no effect on tpch queries.
Some tpcds queries are impacted.
They are 4/11/23/24/47/51/57/65/74, in which 4 and 51 are improved
2023-06-25 21:47:32 +08:00
638aa41988 [fix](planner) fix push filter through agg #21080
In the previous implementation, the check for groupby exprs was ignored. Add this necessary check to make sure it would work

You could reproduce it by runnning belowing sql:

CREATE TABLE t_push_filter_through_agg (col1 varchar(11451) not null, col2 int not null, col3 int not null)
UNIQUE KEY(col1)
DISTRIBUTED BY HASH(col1)
BUCKETS 3
PROPERTIES(
    "replication_num"="1"
);

CREATE VIEW `view_i` AS 
SELECT 
    `b`.`col1` AS `col1`, 
    `b`.`col2` AS `col2`
FROM 
(
    SELECT 
        `col1` AS `col1`, 
        sum(`cost`) AS `col2`
    FROM 
    (
        SELECT 
            `col1` AS `col1`, 
            sum(CAST(`col3` AS INT)) AS `cost` 
        FROM 
            `t_push_filter_through_agg` 
        GROUP BY 
            `col1`
    ) a 
    GROUP BY 
        `col1`
) b;

SELECT SUM(`total_cost`) FROM view_a WHERE `dt` BETWEEN '2023-06-12' AND '2023-06-18' LIMIT 1;
2023-06-25 19:14:20 +08:00
6896776034 [test](regression) update some case in p2 (#21094)
update some case in p2
2023-06-25 11:16:56 +08:00
8b561cfb03 [fix](nereids)create datev2 and datetimev2 literal if enable_date_conversion is true (#21065) 2023-06-21 20:29:36 +08:00
6ac0bfeceb [Feature](inverted index) add unicode parser for inverted index (#21035) 2023-06-21 20:14:06 +08:00
cc53391c9a Revert "[feature](merge-on-write) enable merge on write by default (#… (#21041) 2023-06-21 18:36:46 +08:00
2beed11256 [Bug](streamload) fix inconsistent load result of be and fe (#20950) 2023-06-21 18:12:51 +08:00
8bcd42d3f6 [test](regression) update some case in brown_p2 #21037 2023-06-21 16:25:07 +08:00
4d84cd8ca1 Revert "Revert "[Test](regression) CCR syncer thrift interface regression test (#20935)" (#20990)" (#21022)
This reverts commit 2a294801f1324a999570158eea3224239eefbb29.
2023-06-21 15:20:21 +08:00
bad22dd4e2 [Fix](orc-reader) Fix orc dict filter null value issue in _convert_dict_cols_to_string_cols which caused incorrect result. (#21047)
Query results should not have empty values.
```
use regresssion.multi_catalog;
select commit_id from github_events_orc WHERE (event_type = 'CommitCommentEvent') AND commit_id != "" limit 10;
```
```
+------------------------------------------+
| commit_id                                |
+------------------------------------------+
| 685c1fd8dbbdc10c042932f9a9f88be00ff96c75 |
| 685c1fd8dbbdc10c042932f9a9f88be00ff96c75 |
| 4e3ab2ff2d2474f5d51334b9b0fdf17e9845a166 |
|                                          |
|                                          |
|                                          |
|                                          |
|                                          |
|                                          |
| 7191c20cb49da07a7fc16aa32dc0de4faff528b2 |
+------------------------------------------+
10 rows in set (0.54 sec) 
```
2023-06-21 14:54:01 +08:00
Pxl
5f0bb49d46 [Feature](materialized-view) support create mv contain aggstate column (#20812)
support create mv contain aggstate column
2023-06-21 13:06:52 +08:00
18beb822a3 [FIX](array-type) fix array string output with fe const expr (#21042)
fe foldconstRule make array() function expr with const literal , and would not pass this array literal to be . but we should make fe array string output format is same with be array string output
2023-06-21 11:52:02 +08:00
0cf9de8cef [fix](decimalv3) fix result error when cast a round decimalv3 to double (#20678) 2023-06-21 00:02:48 +08:00
2c11ce0a02 [bugfix](topn) fix key topn merge block conflict with index predicate result columns (#20820) 2023-06-20 21:23:00 +08:00
f10258577b [Fix](Planner) Fix group concat with multi distinct and segs (#20912)
Problem:
when use select group_concat(distinct a, 'seg1'), group_concat(distinct b, 'seg2') ... Error would rised
Reason:
Group_concat function regard 'seg' as arguments also, so multi distinct column error would rised
Solved:
let Multi Distinct group_concat function only get first argument as real argument
2023-06-20 21:00:18 +08:00
7e01f074e2 [improvement](jdbc mysql) support auto calculate the precision of timestamp/datetime (#20788) 2023-06-20 10:39:34 +08:00
824bc02603 [Function] Support date function: microsecond() (#20044) 2023-06-20 10:32:54 +08:00
d02ecef406 [fix](Nereids): revert push down alias into union (#20991)
revert #20543 to tmp avoid problem
2023-06-20 09:32:26 +08:00