1. For a query with 1656 unions, the plan thrift size is reduced from 400MB+ to 2MB.
This optimization was introduced in #4904 but was lost after #9720.
2. Disable `ExprSubstitutionMap.verify` when debug is disabled, so that the planning time of the same query drops from 20s to 2s (sketched below).
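A minimal sketch of the second fix's idea: gate the expensive `verify` pass behind a debug switch so production planning does not pay the quadratic cost. The class is a stripped-down stand-in, not Doris's actual `ExprSubstitutionMap`, and the `planner.debug` system property is an assumption for illustration.
```java
import java.util.ArrayList;
import java.util.List;

class SubstitutionMapSketch {
    // assumed debug switch; the real code would read the FE config instead
    static final boolean DEBUG = Boolean.getBoolean("planner.debug");

    private final List<String> lhs = new ArrayList<>();
    private final List<String> rhs = new ArrayList<>();

    void put(String from, String to) {
        lhs.add(from);
        rhs.add(to);
        if (DEBUG) {
            verify(); // O(n^2): with thousands of union branches this dominates plan time
        }
    }

    // quadratic scan for duplicate left-hand-side expressions
    private void verify() {
        for (int i = 0; i < lhs.size(); i++) {
            for (int j = i + 1; j < lhs.size(); j++) {
                if (lhs.get(i).equals(lhs.get(j))) {
                    throw new IllegalStateException("duplicate substitution: " + lhs.get(i));
                }
            }
        }
    }
}
```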
The original statistics derivation algorithm relies on NDV and other column statistics, but we cannot get these stats in a production environment.
This PR changes these operators' stats calculation to use a DEFAULT_RATIO variable instead of column statistics.
We should change these algorithms back once column stats are available in production.
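A minimal sketch of the idea for a filter operator, assuming a made-up DEFAULT_RATIO of 0.5; the names are illustrative, not the actual stats-calculation code:
```java
class RatioStatsSketch {
    // assumed constant; the real value in the PR may differ
    static final double DEFAULT_RATIO = 0.5;

    // no column stats needed: just scale the input cardinality by a fixed ratio
    static double estimateFilterRows(double inputRows) {
        return inputRows * DEFAULT_RATIO;
    }

    public static void main(String[] args) {
        System.out.println(estimateFilterRows(1_000_000)); // 500000.0
    }
}
```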
Implement `uncheckedCast` on `VarcharLiteral` as a temporary way to make `TimestampArithmetic` work.
We should remove this code and do the implicit cast in the `TypeCoercion` rule in the future.
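A minimal sketch of what such an unchecked cast looks like, using a simplified stand-in rather than the real Nereids `VarcharLiteral`; the datetime format is an assumption:
```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Parse the varchar eagerly instead of inserting an implicit CAST; a proper
// TypeCoercion rule should replace this later, as the description says.
class VarcharLiteralSketch {
    final String value;

    VarcharLiteralSketch(String value) {
        this.value = value;
    }

    // "unchecked": throws at plan time if the string is not a datetime
    LocalDateTime uncheckedCastToDateTime() {
        return LocalDateTime.parse(value, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
    }

    public static void main(String[] args) {
        // e.g. '2022-08-01 10:00:00' + INTERVAL 1 DAY needs a datetime operand
        System.out.println(new VarcharLiteralSketch("2022-08-01 10:00:00").uncheckedCastToDateTime());
    }
}
```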
Just like the legacy planner, Nereids parses all fractional literals as decimal.
In the future, we will add more syntax for users to control the fractional literal type.
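A minimal sketch of the behavior, assuming `java.math.BigDecimal` as the decimal representation; the parser class is hypothetical:
```java
import java.math.BigDecimal;

// Every fractional literal becomes an exact decimal, never a binary double.
class FractionalLiteralSketch {
    static BigDecimal parse(String text) {
        return new BigDecimal(text); // keeps precision and scale exactly
    }

    public static void main(String[] args) {
        BigDecimal v = parse("0.1");
        // precision 1, scale 1 -> maps to DECIMAL(1,1), with no 0.1000...04 noise
        System.out.println(v.precision() + ", " + v.scale());
    }
}
```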
Execution plan displayed when using the `orthogonal_bitmap_union_count` function:
PREAGGREGATION: OFF
Reason: Invalid Aggregate Operator: orthogonal_bitmap_union_count
The correct plan is: `PREAGGREGATION: ON`
Co-authored-by: lihuigang <lihuigang@meituan.com>
Support OneRowRelation and EmptyRelation.
OneRowRelation: `select 100, 'abc', substring('abc', 1, 2)`
EmptyRelation: `select * from tbl limit 0`
Note:
PhysicalOneRowRelation will be translated to a UnionNode(constExpr) for BE execution, as sketched below.
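A minimal sketch of that shape, with illustrative names rather than the real FE/BE structures: a union node carrying one constant row and no child plans.
```java
import java.util.List;

// A union node with zero child plans and one row of constant expressions:
// evaluating the constants yields the single output row.
class ConstUnionSketch {
    final List<List<Object>> constExprRows;

    ConstUnionSketch(List<List<Object>> rows) {
        this.constExprRows = rows;
    }

    public static void main(String[] args) {
        // select 100, 'abc' -> one constant row, nothing to scan
        List<Object> row = List.of(100, "abc");
        ConstUnionSketch oneRow = new ConstUnionSketch(List.of(row));
        System.out.println(oneRow.constExprRows); // [[100, abc]]
    }
}
```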
In the earlier PR #11976, we changed `DistributionSpecHash#equalsSatisfy` and forgot to check whether the lengths of both sides are the same. When the required shuffle slot list is longer than the current one, an exception is thrown. A sketch of the missing guard follows.
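A minimal sketch of the fix, with simplified types rather than the real `DistributionSpecHash` signature:
```java
import java.util.List;

class DistributionCheckSketch {
    // compare shuffle slots pairwise, but only after checking arity:
    // without the size check, a longer `required` list walks past the
    // end of `current` and throws
    static boolean equalsSatisfy(List<Integer> current, List<Integer> required) {
        if (current.size() != required.size()) {
            return false;
        }
        for (int i = 0; i < current.size(); i++) {
            if (!current.get(i).equals(required.get(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(equalsSatisfy(List.of(1, 2), List.of(1, 2, 3))); // false, no exception
    }
}
```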
## Fix five bugs:
1. Parquet dictionary data may be compressed, but `ColumnChunkReader` tries to parse the dictionary data before creating the compression codec, causing unexpected data errors.
2. `FE` doesn't resolve the array type.
3. `ParquetFileHdfsScanner` doesn't fill partition values when the table is partitioned.
4. `ParquetFileHdfsScanner` sets `_scanner_eof = true` when a scan range is empty, prematurely ending the whole scanner and causing data loss.
5. A typographical error in `PageReader`.
Simplify the code of getting input/output slots from `Expression` or `Plan`.
**New interfaces added**
`Expression`:
- `getInputSlots`: Get all the input slots of the expression.
`Plan`:
- `getOutputSet`: Get the output slot set of the plan.
- `getInputSlots`: Get the input slot set of the plan.
**Changed interface**
`TreeNode`:
- `collect`: returns a `Set` as the result instead of a `List`.
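A minimal sketch of how these accessors fit together, using stripped-down stand-ins for the real Nereids interfaces:
```java
import java.util.HashSet;
import java.util.Set;

interface ExpressionSketch {
    Set<String> getInputSlots(); // every slot this expression reads
}

interface PlanSketch {
    Set<String> getOutputSet();  // slots the plan produces
    Set<String> getInputSlots(); // slots the plan consumes
}

// a leaf expression contributes exactly one input slot
class SlotRefSketch implements ExpressionSketch {
    final String name;

    SlotRefSketch(String name) {
        this.name = name;
    }

    @Override
    public Set<String> getInputSlots() {
        Set<String> slots = new HashSet<>();
        slots.add(name); // a Set, so repeated references de-duplicate naturally
        return slots;
    }
}
```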
In the earlier PR #11812, we split the join condition into two parts: hash join conjuncts and other conditions. But we forgot to translate the other conditions into other conjuncts in the legacy planner's HashJoinNode, so we get wrong results when a query has other conditions on the join node. For example:
`SELECT * FROM lineorder INNER JOIN part ON lo_partkey = p_partkey WHERE lo_orderkey > p_size;`
In the current spark load implementation, the types of the source data that BE reads from the Broker are all set to varchar.
However, varchar and bitmap are no longer compatible since version 1.1.0, which causes spark load to fail.
An example spark load error message:
detailMessage = type not match, originType=VARCHAR(*), targeType=BITMAP
Set the src type of bitmap columns from varchar to bitmap when FE pushes tasks, as sketched below.
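A minimal sketch of the idea, with hypothetical types rather than the actual FE push-task code:
```java
class PushTaskColumnSketch {
    enum Type { VARCHAR, BITMAP }

    // before the fix this always returned VARCHAR, which BITMAP destination
    // columns no longer accept after 1.1.0
    static Type srcType(Type destType) {
        return destType == Type.BITMAP ? Type.BITMAP : Type.VARCHAR;
    }

    public static void main(String[] args) {
        System.out.println(srcType(Type.BITMAP)); // BITMAP, not VARCHAR
    }
}
```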
Implement the having clause for Nereids Planner.
NOTE:
This PR only aims to make the Nereids Planner generate correct logical and physical plans. Runtime correctness is not a goal of this PR, because GROUP BY is not ready in the Nereids Planner yet.
This PR:
1. adds support in Nereids for the join algorithms below, which the legacy planner already supports:
- colocate join
- bucket shuffle join
- shuffle join
- broadcast join
2. updates all cost/enforcer derivation utilities:
- ChildOutputPropertyDeriver
- EnforceMissingPropertiesHelper
- RequestPropertyDeriver
3. adds a local quick sort plan used by the enforcer
4. sets `PhysicalProperties` on the `PhysicalPlan` when choosing the best plan from the memo
5. renames `Job#pushTask` to `Job#pushJob`
After applying the NormalizeAggregate rule, the owner groups of all the aggregate's children are removed.
The root cause is that the new aggregate node is regarded as the old aggregate node, because `LogicalAggregate.equals()` does not take some attributes ("normalized", "disassembled") into account.
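A minimal sketch of the fix, using the attribute names from the description above on a stand-in class rather than the real `LogicalAggregate`:
```java
import java.util.Objects;

class AggregateEqualsSketch {
    final String exprs; // stands in for the real group-by/output fields
    final boolean normalized;
    final boolean disassembled;

    AggregateEqualsSketch(String exprs, boolean normalized, boolean disassembled) {
        this.exprs = exprs;
        this.normalized = normalized;
        this.disassembled = disassembled;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof AggregateEqualsSketch)) {
            return false;
        }
        AggregateEqualsSketch that = (AggregateEqualsSketch) o;
        return normalized == that.normalized         // previously missing
                && disassembled == that.disassembled // previously missing
                && exprs.equals(that.exprs);
    }

    @Override
    public int hashCode() {
        return Objects.hash(exprs, normalized, disassembled);
    }
}
```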
In the earlier PR #11842, we added the ability to do projection on each ExecNode.
But the projection expr list did not show up in explain output, which is inconvenient for debugging.
This PR adds it to the explain string when present.
Add a new property called 'reserve_replica', which means you can
get a table with the same partitions and the same replication num
as before the backup.
Co-authored-by: Stalary <stalary@163.com>
Co-authored-by: camby <104178625@qq.com>
Fix some bugs when adding REWRITE rules to the Cascades optimizer:
- all rules should be set as non-rewrite rules when used in the Cascades optimizer
- the IMPLEMENT rule promise should be larger than the others', since we should do exploration first
In the old planner, Predicate sets its type in analyzeImpl(). However, analyzeImpl() is on the old planner's path but not on the Nereids path, and hence the type is invalid there.
Because all predicates have type boolean, we set the type in the constructor, as sketched below.
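A minimal sketch of the approach, with illustrative stand-ins for the Nereids expression classes:
```java
// Every predicate evaluates to boolean, so the type is fixed in the
// constructor; no analyze pass is needed to discover it.
class PredicateSketch {
    final String dataType;

    PredicateSketch() {
        this.dataType = "BOOLEAN"; // invariant for all predicates
    }
}

class GreaterThanSketch extends PredicateSketch {
    final Object left;
    final Object right;

    GreaterThanSketch(Object left, Object right) {
        this.left = left;   // dataType is already BOOLEAN
        this.right = right; // via the PredicateSketch constructor
    }
}
```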
Currently, Nereids doesn't support aggregate functions with no slot reference in the query, since all the columns would be pruned, e.g.
`SELECT COUNT(1) FROM t;`
This PR reserves the column with the smallest amount of data when doing column pruning in this situation (see the sketch below).
To be noticed, this PR ONLY handles aggregate functions, so projections with no slot reference still need to be handled in the future.
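A minimal sketch of the selection step, assuming a per-column byte-width estimate; the width map is an illustration, not Doris's actual cost model:
```java
import java.util.Map;

class ColumnPruneSketch {
    // keep the column with the smallest estimated width so the scan still
    // produces rows for COUNT(1) to count
    static String smallestColumn(Map<String, Integer> columnWidthBytes) {
        return columnWidthBytes.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalStateException("no columns"));
    }

    public static void main(String[] args) {
        System.out.println(smallestColumn(Map.of("id", 4, "name", 32, "flag", 1))); // flag
    }
}
```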
#11392 made `_input_block` in each `BetaRowsetReader` sharable. However, for some types (e.g. nested arrays more than one level deep), the `_column_vector_batches` in `RowBlockV2` can be nested, which means there is a `ColumnVectorBatch` inside another `ColumnVectorBatch`. In this case, the data of the inner `ColumnVectorBatch` may be corrupted, because the data of `_input_block` is copied shallowly to the `_output_block`.
Currently, the explain string prints every expression as a slot id, e.g. `<slot 1>`.
This PR prints the name together with the slot id instead, e.g. `column_a[#1]`. In detail:
- print the qualified table name for OlapScanNode
- print the NamedExpression name with the SlotId instead of just the SlotId
- OlapScanNode's node name uses "OlapScanNode" instead of the table name
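A minimal sketch of the new slot formatting described above; the helper is hypothetical:
```java
class ExplainSlotSketch {
    // column_a[#1] instead of <slot 1>
    static String slotString(String name, int slotId) {
        return name + "[#" + slotId + "]";
    }

    public static void main(String[] args) {
        System.out.println(slotString("column_a", 1)); // column_a[#1]
    }
}
```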
Currently, there are still lots of bugs related to ARRAY<NOT_NULL(T)>.
We have decided not to support ARRAY<NOT_NULL(T)> types in the first version; all elements in an ARRAY are nullable.
Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>