For the file scan node, there is a special field `requiredSlot`, which is set based on the `isMaterialized` info of each slot.
But the `isMaterialized` info can change during planning, so we must update `requiredSlot`
in the `finalize` phase of the scan node; otherwise BE may crash due to mismatched slot info.
1. Allow casting boolean to date-like types in Nereids; the result is NULL (see the example below).
2. The PruneOlapScanTablet rule can prune tablets even if an MV index is selected.
3. Constant conjuncts should not be pushed through the agg node in the old planner.
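A minimal illustration of item 1; the query is made up for this example:

```sql
-- casting boolean to a date-like type is now allowed in Nereids and yields NULL
SELECT CAST(true AS DATE), CAST(false AS DATETIME);
```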
Problem:
The select list should be non-constant when the from list has tables or multiple tuples. Otherwise the upper query will get a wrong `isConstant` result
and do wrong constant folding.
For example, when using the `nullif` function with a subquery that results in two alternative constants, the planner would treat it as a constant expression, so the analyzer would report an error that the ORDER BY clause cannot be constant.
Solution:
Change the inline view output to non-constant, because for `(select 1 a from table) as view`, `a` in the output is not constant when we see
`view.a` outside the view.
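A hypothetical query of the shape described above (table `t` and column names are made up); the `nullif` expression yields one of two constant alternatives, and before this fix the planner treated the view output as constant, so ordering by it failed:

```sql
-- ordering by the inline view output used to fail with "ORDER BY clause cannot be constant"
SELECT v.a
FROM (SELECT nullif(1, (SELECT 0 FROM t LIMIT 1)) AS a FROM t) v
ORDER BY v.a;
```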
Currently, when pushing a runtime filter down into the internals of a CTE, we construct the runtime filter `expr_order` with an incrementing number, which is not correct. For runtime filters pushed down into a CTE, the join node is always different, so `expr_order` should be fixed at 0 instead of incremented; otherwise the check on `expr_order` and `probe_expr_size` becomes invalid, or the query returns wrong results.
This PR temporarily reverts 2827bc1; this will break the CTE runtime filter push-down plan pattern for now.
Co-authored-by: Jerry Hu <mrhhsg@gmail.com>
1. Filtering is done at the sending end rather than the receiving end
2. Projection is done at the sending end rather than the receiving end
3. Each sender can use different shuffle policies to send data
After running the EliminateNotNull rule, the join conjuncts may be removed from an inner join node.
So we need to run the ConvertInnerOrCrossJoin rule to convert an inner join with no join conjuncts into a cross join node.
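A hedged sketch of a query shape that could hit this path (table and column names are made up, and `t1.id` is assumed to be declared NOT NULL):

```sql
-- if EliminateNotNull drops the only join conjunct, the inner join is left with no
-- conjuncts and must be converted to a cross join
SELECT * FROM t1 INNER JOIN t2 ON t1.id IS NOT NULL;
```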
Issue Number: close #20948
Fix a read error with mixed partition locations (for example, some partition locations are on S3 and others are on HDFS) by calling `getLocationType` at the file split level instead of the table level.
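For illustration, a Hive table can end up with partitions on different storage systems like this (hypothetical Hive DDL on a made-up `sales` table):

```sql
-- one partition on S3, another on HDFS; the location type must be decided per file split
ALTER TABLE sales PARTITION (dt='2023-01-01') SET LOCATION 's3://my-bucket/warehouse/sales/dt=2023-01-01';
ALTER TABLE sales PARTITION (dt='2023-01-02') SET LOCATION 'hdfs://nameservice1/warehouse/sales/dt=2023-01-02';
```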
`Wrong data type for column` error when the column order in the hive table is not the same as in the orc file schema.
The root cause is the handling of the following case:
Tables in ORC format from Hive 1.x may have system column names such as `_col0`, `_col1`, `_col2`... in the underlying orc file schema, which need to be mapped using the column names in the hive table.
### Solution
Currently this issue is fixed by only applying that handling when the hive version is specified as 1.x.x in the hive catalog configuration:
```sql
CREATE CATALOG hive PROPERTIES (
'hive.version' = '1.x.x'
);
```
If a query plan has only one fragment, FE will call the `exec_plan_fragment` rpc to BE.
On the BE side, `exec_plan_fragment()` would be executed directly in a bthread, but it may call
JNI methods such as `AttachCurrentThread()`, which return errors in a bthread.
So I modified `exec_plan_fragment` to make sure it is executed in the pthread pool.
The return type of the `to_date` function in Nereids should be DATEV2 if the argument type is DATETIMEV2.
Before, the return type was DATE, which would cause BE to produce wrong query results.
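A small illustration (the literal is made up); with a DATETIMEV2 argument, `to_date` should now return DATEV2:

```sql
SELECT to_date(CAST('2023-06-01 10:30:00' AS datetimev2));
```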
Infer DISTINCT from a distinct SetOperator, and put a distinct aggregate above its children to reduce data.
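For example (hypothetical tables), a distinct set operation like the following allows a distinct aggregate to be placed on each child, so less data flows into the union:

```sql
-- UNION is a distinct SetOperator; each child can be deduplicated before merging
SELECT c1 FROM t1
UNION
SELECT c1 FROM t2;
```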
tpcds_sf100 q14:
- before: 100 rows in set (7.60 sec)
- after: 100 rows in set (6.80 sec)
Agg stats estimation should use the largest group-by key NDV as the base and multiply it by an expansion factor calculated from the other group-by keys' NDVs.
Before, we used the smallest NDV as the base. For example, with group-by keys whose NDVs are 1000 and 10, the estimate now starts from 1000 and is scaled up by a factor derived from the key with NDV 10, instead of starting from 10.
This PR adds the function for collecting hive statistics. When the CBO fetches hive table statistics, the statistics cache will
first load from the internal stats olap table. If not found, it uses this PR's function to fetch the statistics from the remote Hive metastore.
This PR:
1. refactors physical properties, the property deriver and the property regulator
to ensure Nereids can generate plans with sufficient PhysicalDistribute nodes.
2. refactors PhysicalPlanTranslator to ensure all ExchangeNodes are generated
by PhysicalDistribute, except for CTEConsumer. We will refactor all CTE
related nodes later.
The detailed changes of this PR:
1. update DistributionSpec of physical properties:
- Any: random distribution, used in output and require
- StorageAny: random distribution but constrained by where the data is stored, used in output
- ExecutionAny: random distribution to present random shuffle, used in output
- Gather: gather distribution, used in output and require
- StorageGather: gather distribution but constrained by where the data is stored, used in output
- Replicated: broadcast distribution
- Hash: bucket distribution
2. update shuffle type of DistributionSpecHash
- REQUIRE: used in require
- NATURAL: distribution as storage engine hash algorithm, constrained by where the data is stored
- STORAGE_BUCKETED: distribution as storage engine hash algorithm
- EXECUTION_BUCKETED: distribution as execution engine hash algorithm
3. update HideOneRowRelationUnderSetOperation to MergeOneRowRelationIntoSetOperation
4. update the property deriver of SetOperation to ensure suitable PhysicalDistribute nodes are added
above and below the SetOperation
5. refactor PhysicalPlanTranslator to ensure no unplanned exchange node will be added
A complex predicate in a delete statement like:
```sql
delete from t1 where t1.id in (select id from t2);
```
will be rewritten into an insert statement:
```sql
insert into t1(id, __DORIS_DELETE_SIGN__) select id, 1 from t1 where id in (select id from t2);
```
* [Improve](dynamic schema) support filtering invalid data
1. Support dynamic schema to filter illegal data.
2. Expand the regular expression for ColumnName to support more column names.
3. Be compatible with PropertyAnalyzer and support legacy tables.
4. Disable parsing multi-dimensional arrays by default, since some bugs are unresolved.
1. The query cache for the Chinese tokenizer is confusing when it just converts w_char to char.
2. Separate query_type from inverted_index_reader to clean up the code.
In the previous implementation, the check on group-by exprs was ignored. Add this necessary check to make sure the rule works correctly.
You can reproduce it by running the following SQL:
```sql
CREATE TABLE t_push_filter_through_agg (col1 varchar(11451) not null, col2 int not null, col3 int not null)
UNIQUE KEY(col1)
DISTRIBUTED BY HASH(col1)
BUCKETS 3
PROPERTIES(
    "replication_num"="1"
);

CREATE VIEW `view_i` AS
SELECT
    `b`.`col1` AS `col1`,
    `b`.`col2` AS `col2`
FROM
    (
        SELECT
            `col1` AS `col1`,
            sum(`cost`) AS `col2`
        FROM
            (
                SELECT
                    `col1` AS `col1`,
                    sum(CAST(`col3` AS INT)) AS `cost`
                FROM
                    `t_push_filter_through_agg`
                GROUP BY
                    `col1`
            ) a
        GROUP BY
            `col1`
    ) b;

SELECT SUM(`total_cost`) FROM view_a WHERE `dt` BETWEEN '2023-06-12' AND '2023-06-18' LIMIT 1;
```