* [hotfix](dev-1.0.1) fix colocate join bug in vec engine after introducing output tuple (#10651)
to support vectorized outer join, we introduced a out tuple for hash join node,
but it breaks the checking for colocate join.
To solve this problem, we need map the output slot id to the children's slot id of hash join node,
and the colocate join can be checked correctly.
* fix colocate join bug
* fix non vec colocate join issue
Co-authored-by: lichi <lichi@rateup.com.cn>
* add test cases
Co-authored-by: lichi <lichi@rateup.com.cn>
when show databases/tables/table status where xxx, it will change a selectStmt to select result from
information_schema, it need catalog info to scan schema table, otherwise may get many
database or table info from multi catalog.
for example
mysql> show databases where schema_name='test';
+----------+
| Database |
+----------+
| test |
| test |
+----------+
MySQL [internal.test]> show tables from test where table_name='test_dc';
+----------------+
| Tables_in_test |
+----------------+
| test_dc |
| test_dc |
+----------------+
Fix three bugs:
1. DataTypeFactory::create_data_type is missing the conversion of binary type, and OrcReader will failed
2. ScalarType#createType is missing the conversion of binary type, and ExternalFileTableValuedFunction will failed
3. fmt::format can't generate right format string, and will be failed
Add one metric to detect the publish txn num per db. User can get the relative speed of the txns processing per db using this metric and doris_fe_txn_num.
# Proposed changes
## refactor
- add AggregateExpression to shield the difference of AggregateFunction before disassemble and after
- request `GATHER` physicalProperties for query, because query always collect result to the coordinator, use `GATHER` maybe select a better plan
- refactor `NormalizeAggregate`
- remove some physical fields for the `LogicalAggregate`, like `AggPhase`, `isDisassemble`
- remove `AggregateDisassemble` and `DistinctAggregateDisassemble`, and use `AggregateStrategies` to generate various of PhysicalHashAggregate, like `two phases aggregate`, `three phases aggregate`, and cascades can auto select the lowest cost alternative.
- move `PushAggregateToOlapScan` to `AggregateStrategies`
- separate the traverse and visit method in FoldConstantRuleOnFE
- if some expression not implement the visit method, the traverse method can handle and rewrite the children by default
- if some expression implement the visit, the user defined traverse(invoke accept/visit method) will quickly return because the default visit method will not forward to the children, and the pre-process in traverse method will not be skipped.
## new feature
- support `disable_nereids_rules` to skip some rules.
example:
1. create 1 bucket table `n`
```sql
CREATE TABLE `n` (
`id` bigint(20) NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`id`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`id`) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"in_memory" = "false",
"storage_format" = "V2",
"disable_auto_compaction" = "false"
);
```
2. insert some rows into `n`
```sql
insert into n select * from numbers('number'='20000000')
```
3. query table `n`
```sql
SET enable_nereids_planner=true;
SET enable_vectorized_engine=true;
SET enable_fallback_to_original_planner=false;
explain plan select id from n group by id;
```
the result show that we use the one stage aggregate
```
| PhysicalHashAggregate ( aggPhase=LOCAL, aggMode=INPUT_TO_RESULT, groupByExpr=[id#0], outputExpr=[id#0], partitionExpr=Optional.empty, requestProperties=[GATHER], stats=(rows=1, width=1, penalty=2.0E7) ) |
| +--PhysicalProject ( projects=[id#0], stats=(rows=20000000, width=1, penalty=0.0) ) |
| +--PhysicalOlapScan ( qualified=default_cluster:test.n, output=[id#0, name#1], stats=(rows=20000000, width=1, penalty=0.0) ) |
```
4. disable one stage aggregate
```sql
explain plan select
/*+SET_VAR(disable_nereids_rules=DISASSEMBLE_ONE_PHASE_AGGREGATE_WITHOUT_DISTINCT)*/
id
from n
group by id
```
the result is two stage aggregate
```
| PhysicalHashAggregate ( aggPhase=GLOBAL, aggMode=BUFFER_TO_RESULT, groupByExpr=[id#0], outputExpr=[id#0], partitionExpr=Optional[[id#0]], requestProperties=[GATHER], stats=(rows=1, width=1, penalty=2.0E7) ) |
| +--PhysicalHashAggregate ( aggPhase=LOCAL, aggMode=INPUT_TO_BUFFER, groupByExpr=[id#0], outputExpr=[id#0], partitionExpr=Optional[[id#0]], requestProperties=[ANY], stats=(rows=1, width=1, penalty=2.0E7) ) |
| +--PhysicalProject ( projects=[id#0], stats=(rows=20000000, width=1, penalty=0.0) ) |
| +--PhysicalOlapScan ( qualified=default_cluster:test.n, output=[id#0, name#1], stats=(rows=20000000, width=1, penalty=0.0) ) |
```
Set FE `enable_new_load_scan_node` to true by default.
So that all load tasks(broker load, stream load, routine load, insert into) will use FileScanNode instead of BrokerScanNode
to read data
1. Support loading parquet file in stream load with new load scan node.
2. Fix bug that new parquet reader can not read column without logical or converted type.
3. Change jsonb parser function to "jsonb_parse_error_to_null"
So that if the input string is not a valid json string, it will return null for jsonb column in load task.
Doris does not support disjunctive predicates very well, which causes some problems in partition prune.
For example, sqls like the followings will trigger a full table scan without partition pruning
select * from test.t1
where (dt between 20211121 and 20211122) or (dt between 20211125 and 20211126)
1. a icebergv2 delete file may cross many data paths, so the path of a file split is required as a predicate to filter rows of delete file
- create delete file structure to save predicate parameters
- create predicate for file path
2. add some log to print row range
3. fix bug when create file metadata
add DateV2 and DateTimeV2 for Literal.uncheckCastTo()
move nereids tpch cases into suite nereids_tpch_p1
move nereids datav2 cases into suite nereids_datav2_p1
The PhysicalFilter can't be assigned to ExchangeNode, SortNode and UnionNode. The nereids would create a standalone SelectNode to do the filter work properly.