Failed when reading parquet file with many columns(>1600).
mysql> select int_col from types_sf100_r100w limit 5;
ERROR 1105 (HY000): errCode = 2, detailMessage = Couldn't deserialize thrift msg:
TProtocolException: Invalid data
parse_thrift_footer uses fixed length buffer(=64k) to read parquet footer, but the meta data of a parquet file with 1600 columns can exceed 5MB.
Therefore, the buffer size needs to be applied according to the actual length.
1. For query with 1656 union, the plan thrift size will be reduced from 400MB+ to 2MB.
This optimization is introduced from #4904, but lost after #9720
2. Disable ExprSubstitutionMap.verify when debug is disable.
So that the plan time of query with 1656 union will be reduced from 20s to 2s
The original statistic derive calculate algorithm rely on NDV and other column statistics. But we cannot get these stats in product environment.
This PR change these operator's stats calc algorithm to use a DEFAULT RATIO variable instead of column statistics.
We should change these algorithm when we could get column stats in product environment
Implement uncheckedCast on VarcharLiteral for a temp way to let TimestampArithmetic work.
We should remove these code and do implicit cast in TypeCoercion rule in future.
Just as legacy planner, Nereids parse all fractional literal to decimal.
In the future, we will add more syntax for user to control the fractional literal type.
Execution plan display when using orthogonal_bitmap_union_count function:
PREAGGREGATION: OFF
Reason: Invalid Aggregate Operator: orthogonal_bitmap_union_count
The correct plan is: PREAGGREGATION: ON
Co-authored-by: lihuigang <lihuigang@meituan.com>
* [fix](threadpool) threadpool schedules does not work right on concurrent token
Assuming there is a concurrent thread token whose concurrency is 2, and the 1st
submit on the token is submitted to threadpool while the 2nd is not submitted due
to busy. The token's active_threads is 1, then thread pool does not schedule the
token.
The patch fixes the problem.
We already separate Array Offset64 and String Offset(32bit) in PR: #12341
Now we limit: Offset inside IColumn, Offset64 only inside ColumnArray, to avoid abuse of them.
If we use the wrong one, it will compile failed.
* using pooled connection and enlarge brpc connection timeout and retry times
When a connection failure happen, doris fails queries using the connection.
We should lower the impact of a connection failure by using pooled connection
and enlaring connection timeout and retry times.
* clang format
Support OneRowRelation and EmptyRelation.
OneRowRelation: `select 100, 'abc', substring('abc', 1, 2)`
EmptyRelation: `select * from tbl limit 0`
Note:
PhysicalOneRowRelation will translate to UnionNode(constExpr) for BE execution
In earlier PR #11976 , we changed DistributionSpecHash#equalsSatisfy, and forgot to check whether the length of both side are same. When required's shuffle slot size longer than current one, exception will be thrown.
## Fix five bugs:
1. Parquet dictionary data may be compressed, but `ColumnChunkReader` try to parse dictionary data before creating compression codec, causing unexpected data errors.
2. `FE` doesn't resolve array type
3. `ParquetFileHdfsScanner` doesn't fill partition values when the table is partitioned
4. `ParquetFileHdfsScanner` set `_scanner_eof = true` when a scan range is empty, causing the end of the scanner, and resulting in data loss
5. typographical error in `PageReader`
When the load channel is canceled, the memtracker does not subtract the memory released by the load channel. This will cause the memory usage counted by the memtracker of the load channel mgr to be larger than the actual memory usage.
Each NodeChannel has its own queue, with size up to 1/20 exec_mem_limit.
User will crash into OOM if set exec_mem_limit high. This commit uses
fixed number to control the total max memory used by NodeChannels.
Signed-off-by: freemandealer <freeman.zhang1992@gmail.com>
Simplify the code of getting input/output slots from `Expression` or `Plan`.
**new interfaces add**
`Expression`:
`getInputSlots`: Get all the input slots of the expression.
`Plan`:
- `getOutputSet`: Get the output slot set of the plan.
- `getInputSlots`: Get the input slot set of the plan.
**changed interface**
`TreeNode`:
- `collect`: return `set` as result instead of `list`.
In the earlier PR #11812 , we split join condition into two parts: hash join conjuncts and other condition. But we forgot to translate other condition into other conjuncts in HashJoinNode of legacy planner. So we get wrong result if query has other condition on join node. Such as:
SELECT * FROM lineorder INNER JOIN part ON lo_partkey = p_partkey WHERE lo_orderkey > p_size;