1. Filtering is done at the sending end rather than the receiving end
2. Projection is done at the sending end rather than the receiving end
3. Each sender can use different shuffle policies to send data
After running the EliminateNotNull rule, the join conjuncts may be removed from an inner join node.
So the ConvertInnerOrCrossJoin rule needs to run afterwards to convert an inner join with no join conjuncts into a cross join node.
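A hypothetical shape that triggers this (table and column names are illustrative): when the only join conjunct is always true, EliminateNotNull can strip it away.

```sql
-- Assume t1.id is declared NOT NULL: the predicate below is always
-- true, so EliminateNotNull removes the only join conjunct, and
-- ConvertInnerOrCrossJoin must turn the inner join into a cross join.
SELECT * FROM t1 INNER JOIN t2 ON t1.id IS NOT NULL;
```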
Issue Number: close #20948
Fix a read error with mixed partition locations (for example, some partition locations are on S3, others are on HDFS) by calling `getLocationType` at the file-split level instead of the table level.
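For illustration, a Hive-side layout that produces such a table (paths and names are hypothetical): two partitions of the same table living on different storage systems, so the location type must be resolved per file split rather than once per table.

```sql
-- Hive DDL (illustrative): one partition on HDFS, another on S3.
ALTER TABLE sales ADD PARTITION (dt='2023-01-01')
    LOCATION 'hdfs://nameservice1/warehouse/sales/dt=2023-01-01';
ALTER TABLE sales ADD PARTITION (dt='2023-01-02')
    LOCATION 's3://my-bucket/warehouse/sales/dt=2023-01-02';
```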
Fix the `Wrong data type for column` error that occurred when the column order in the Hive table is not the same as in the ORC file schema.
The root cause lies in the handling of the following case:
ORC-format tables written by Hive 1.x may have system column names such as `_col0`, `_col1`, `_col2`... in the underlying ORC file schema, which must be mapped to the column names in the Hive table.
### Solution
This issue is currently fixed by only applying that mapping when the Hive version is specified as 1.x.x in the Hive catalog configuration:
```sql
-- 'type' and 'hive.metastore.uris' are added here to make the example
-- complete; replace the URI with your own metastore address.
CREATE CATALOG hive PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://<metastore-host>:9083',
    'hive.version' = '1.x.x'
);
```
If a query plan has only one fragment, the FE calls the `exec_plan_fragment` RPC on the BE.
On the BE side, `exec_plan_fragment()` was executed directly in a bthread, but it may call
JNI methods such as `AttachCurrentThread()`, which return an error when invoked from a bthread.
So `exec_plan_fragment` is modified to make sure it is executed in a pthread pool.
The `to_date` function's return type in Nereids should be DATEV2 when the argument type is DATETIMEV2.
Previously the return type was DATE, which caused the BE to return wrong query results.
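A minimal illustration (the table and column are hypothetical): with a DATETIMEV2 argument, `to_date` should now be planned with a DATEV2 result type.

```sql
-- dt is a DATETIMEV2 column; to_date(dt) should be typed as DATEV2,
-- not DATE, so the BE computes the result on the correct type.
SELECT to_date(dt) FROM events;
```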
Infer distinct from a Distinct SetOperator and push the distinct down onto its children to reduce the data volume (see the sketch below the benchmark numbers).
tpcds_sf100 q14:
before: 100 rows in set (7.60 sec)
after: 100 rows in set (6.80 sec)
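A sketch of the rewrite on a toy query (tables are illustrative); the two forms are logically equivalent, but the second deduplicates before the set operation:

```sql
-- Original: UNION deduplicates only the combined result.
SELECT a FROM t1 UNION SELECT a FROM t2;
-- After inferring distinct for the children, each input is
-- deduplicated first, reducing the data fed into the union.
SELECT a FROM (SELECT DISTINCT a FROM t1) x
UNION
SELECT a FROM (SELECT DISTINCT a FROM t2) y;
```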
Aggregation stats estimation should use the largest group-by key's NDV as the base and multiply it by an expansion factor, which is calculated from the other group-by keys' NDVs.
Before, we used the smallest NDV as the base.
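A worked example with hypothetical numbers: for GROUP BY (a, b) with NDV(a) = 1000 and NDV(b) = 10, the estimate now starts from the base 1000 and is scaled up by an expansion factor derived from NDV(b), instead of starting from the base 10, which badly underestimated the output row count.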
This PR adds the function for collecting Hive statistics. When the CBO fetches Hive table statistics, the statistics cache
first loads from the internal stats OLAP table. If the statistics are not found there, the function added in this PR fetches them from the remote Hive metastore.
This PR:
1. refactors physical properties, the property deriver, and the property regulator
to ensure Nereids can generate plans with sufficient PhysicalDistribute nodes.
2. refactors PhysicalPlanTranslator to ensure all ExchangeNodes are generated
by PhysicalDistribute, except CTEConsumer. We will refactor all CTE-related
nodes later.
The detailed changes of this PR:
1. update DistributionSpec of physical properties:
- Any: random distribution, used in output and require
- StorageAny: random distribution but constrained by where the data is stored, used in output
- ExecutionAny: random distribution representing a random shuffle, used in output
- Gather: gather distribution, used in output and require
- StorageGather: gather distribution but constrained by where the data is stored, used in output
- Replicated: broadcast distribution
- Hash: bucket distribution
2. update shuffle type of DistributionSpecHash
- REQUIRE: used in require
- NATURAL: distributed by the storage engine's hash algorithm, constrained by where the data is stored
- STORAGE_BUCKETED: shuffled by the storage engine's hash algorithm
- EXECUTION_BUCKETED: shuffled by the execution engine's hash algorithm
3. rename HideOneRowRelationUnderSetOperation to MergeOneRowRelationIntoSetOperation
4. update the property deriver of SetOperation to ensure suitable PhysicalDistribute nodes
are added above and below the SetOperation
5. refactor PhysicalPlanTranslator to ensure no unplanned exchange node is added
A complex predicate in a delete statement, such as:
```sql
delete from t1 where t1.id in (select id from t2);
```
will be rewritten into an insert statement:
```sql
insert into t1(id, __DORIS_DELETE_SIGN__) select id, 1 from t1 where id in (select id from t2);
```
* [Improve](dynamic schema) support filtering invalid data
1. Support dynamic schema filtering out illegal data.
2. Expand the regular expression for ColumnName to support more column names.
3. Be compatible with PropertyAnalyzer and support legacy tables.
4. Disable parsing of multi-dimensional arrays by default, since some bugs remain unresolved.
1. The query cache for the Chinese tokenizer gave confusing results when just converting w_char to char.
2. Separate query_type out of inverted_index_reader to clean up the code.
In the previous implementation, the check on group-by exprs was skipped. Add this necessary check to make sure the rule works correctly.
You can reproduce the issue by running the SQL below:
```sql
CREATE TABLE t_push_filter_through_agg (
    col1 varchar(11451) NOT NULL,
    col2 int NOT NULL,
    col3 int NOT NULL
)
UNIQUE KEY(col1)
DISTRIBUTED BY HASH(col1) BUCKETS 3
PROPERTIES (
    "replication_num" = "1"
);

CREATE VIEW `view_i` AS
SELECT
    `b`.`col1` AS `col1`,
    `b`.`col2` AS `col2`
FROM
    (
        SELECT
            `col1` AS `col1`,
            sum(`cost`) AS `col2`
        FROM
            (
                SELECT
                    `col1` AS `col1`,
                    sum(CAST(`col3` AS INT)) AS `cost`
                FROM
                    `t_push_filter_through_agg`
                GROUP BY
                    `col1`
            ) a
        GROUP BY
            `col1`
    ) b;

-- View and column names below are aligned with the definitions above
-- so the repro is self-contained.
SELECT SUM(`col2`) FROM view_i WHERE `col1` BETWEEN '2023-06-12' AND '2023-06-18' LIMIT 1;
```
The FE FoldConstantRule folds an `array()` function expression with constant literals into an array literal and does not pass the original expression to the BE, so the FE's array string output format must be the same as the BE's array string output format.
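A minimal illustration (values arbitrary): the whole expression is folded on the FE, so the FE-rendered literal must print exactly as the BE would.

```sql
-- array(1, 2, 3) is folded to an array literal on the FE; its string
-- form must match the BE's array output format (e.g. [1, 2, 3]).
SELECT array(1, 2, 3);
```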
Problem:
Using `SELECT group_concat(DISTINCT a, 'seg1'), group_concat(DISTINCT b, 'seg2') ...` raised an error.
Reason:
The group_concat function treats the separator 'seg1' as an argument too, so the multiple-distinct-columns check raised an error.
Solution:
Let the multi-distinct group_concat function take only its first argument as the real distinct argument.
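A minimal repro of the shape that now works (table and column names are illustrative):

```sql
-- Each group_concat has one distinct column plus a literal separator;
-- the separator is no longer counted as a distinct argument.
SELECT
    group_concat(DISTINCT a, 'seg1'),
    group_concat(DISTINCT b, 'seg2')
FROM t;
```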