Create partitions in batch using:
```
PARTITION BY RANGE(event_day)(
FROM ("2000-11-14") TO ("2021-11-14") INTERVAL 1 YEAR,
FROM ("2021-11-14") TO ("2022-11-14") INTERVAL 1 MONTH,
FROM ("2022-11-14") TO ("2023-01-03") INTERVAL 1 WEEK,
FROM ("2023-01-03") TO ("2023-01-14") INTERVAL 1 DAY,
PARTITION p_20230114 VALUES [('2023-01-14'), ('2023-01-15'))
)
PARTITION BY RANGE(event_time)(
FROM ("2023-01-03 12") TO ("2023-01-14 22") INTERVAL 1 HOUR
)
```
This syntax creates year/month/week/day/hour date partitions in a batch, and it remains compatible with the existing single-partition syntax.
## Problem summary
This PR supports:
1. A `numbers` table-valued function for Nereids tests, e.g. `select * from numbers(number = 10, backend_num = 1)`
2. Bitmap/HLL aggregate functions
3. Finding variable-arity functions in the function registry, such as `coalesce`
4. A fix for a bug where printing the Nereids trace throws an exception because a RewriteRule (e.g. `AggregateDisassemble`) is used in ApplyRuleJob; introduced by #13957
ORC's NextStripeReader only supports reading columns by index, but it is hard to compute column indices for complex types. We patch the ORC adapter to support reading columns by column name.
Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>
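For illustration, a minimal sketch of by-name column selection using the Apache ORC C++ API, whose `orc::RowReaderOptions::include` has an overload that takes column names; the file path and column names below are hypothetical, and the actual Doris patch to the vendored adapter differs in detail.
```
#include <list>
#include <memory>
#include <string>

#include <orc/OrcFile.hh>

// Select columns by name instead of by index, so complex (nested)
// types do not require computing flattened column indices by hand.
std::unique_ptr<orc::RowReader> open_reader_by_names(const std::string& path) {
    orc::ReaderOptions reader_opts;
    std::unique_ptr<orc::Reader> reader =
        orc::createReader(orc::readLocalFile(path), reader_opts);

    orc::RowReaderOptions row_opts;
    // include() also accepts top-level column names ("user_id" and
    // "events" are made-up examples).
    row_opts.include(std::list<std::string>{"user_id", "events"});
    return reader->createRowReader(row_opts);
}
```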
For runtime filters, signal is called from a thread different from the awaiting thread, so there is a potential race on the variable is_ready.
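One standard way to remove such a race is to guard the flag with a mutex and condition variable so the signalling and awaiting threads synchronize; this is a minimal sketch with illustrative names, not Doris's actual runtime filter code.
```
#include <condition_variable>
#include <mutex>

class RuntimeFilterSignal {
public:
    void signal() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _is_ready = true;  // write happens under the lock
        }
        _cv.notify_all();
    }

    void await() {
        std::unique_lock<std::mutex> lock(_mutex);
        // read happens under the same lock, so no race on _is_ready
        _cv.wait(lock, [this] { return _is_ready; });
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _is_ready = false;
};
```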
To support queries like:
```
SELECT c1 + 1 as a, sum(c2) FROM t GROUP BY c1 + 1 ORDER BY c1 + 1
```
after the rewrite, the plan is equivalent to:
```
SELECT c1 + 1 as a, sum(c2) FROM t GROUP BY c1 + 1 ORDER BY a
```
1. Add a post-processor: the runtime filter pruner.
Doris generates RFs (runtime filters) on Join nodes to reduce the probe table at the scan stage, but some RFs have no effect because their selectivity is 100%. This PR removes them (see the sketch after this list).
An RF is effective if:
a. the build column's value range covers only part of the probe column's range, OR
b. the build column's NDV is less than the probe column's, OR
c. the build column's ColumnStats.selectivity < 1, OR
d. the build column is reduced by another RF that satisfies the criteria above.
2. Explain graph
a. Add RF info to Join and Scan nodes
b. Add a predicate count to Scan nodes
3. Rename the session variable `enable_remove_no_conjuncts_runtime_filter_policy` to `enable_runtime_filter_prune`
4. Fix a min/max column-stats derivation bug: for `select max(A) as X from T group by B`, X.min is A.min, not A.max.
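Item 1's effectiveness test, as a hedged sketch: the `ColumnStats` fields and function names below are illustrative stand-ins, not the actual Doris planner code.
```
// Simplified column statistics; the real stats are typed per column.
struct ColumnStats {
    double min_value;
    double max_value;
    double ndv;          // number of distinct values
    double selectivity;  // fraction of rows surviving earlier predicates
};

bool is_effective_rf(const ColumnStats& build, const ColumnStats& probe,
                     bool build_reduced_by_effective_rf) {
    // a. build range covers only part of the probe range
    bool narrower_range =
        build.min_value > probe.min_value || build.max_value < probe.max_value;
    // b. build side has fewer distinct values than the probe side
    bool smaller_ndv = build.ndv < probe.ndv;
    // c. build column is already filtered by other predicates
    bool filtered = build.selectivity < 1.0;
    // d. build side was shrunk by another effective RF upstream
    return narrower_range || smaller_ndv || filtered ||
           build_reduced_by_effective_rf;
}
```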
Currently, building BE from source in workflow environments (P0/P1) takes too much time, which hurts the efficiency of daily development.
We can measure the build time with the following command:
```
time EXTRA_CXX_FLAGS='-O3' BUILD_TYPE=ASAN ./build.sh --be --fe --clean -j "$(nproc)"
```
This PR reduces compilation time using the following methods:
1. Reduce codegen by removing some useless `std::visit` calls.
2. Disable optimization for some template functions that are instantiated conditionally via `std::visit`, except in the RELEASE build (see the sketch after this list).
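A minimal sketch of method 2, assuming GCC/Clang per-function attributes (`optnone` on Clang, `optimize("O0")` on GCC); the macro name and the `NDEBUG` gate are illustrative stand-ins for the real build-type check, not Doris's actual flags.
```
// In non-RELEASE builds (ASAN, DEBUG, ...), compile heavy template
// instantiations at -O0 to cut compile time; RELEASE keeps them optimized.
#ifndef NDEBUG
#  if defined(__clang__)
#    define COMPILE_FAST __attribute__((optnone))
#  else
#    define COMPILE_FAST __attribute__((optimize("O0")))
#  endif
#else
#  define COMPILE_FAST
#endif

// A template reached conditionally via std::visit and instantiated for
// many column types; skipping optimization here is cheap at runtime
// but saves a lot of compiler work.
template <typename T>
COMPILE_FAST void process_column(const T& column) {
    // ... heavy templated body ...
}
```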
When upgrading from 1.1 to master, rolling back to 1.1, and then upgrading to master again, BE coredumps because some rowsets have a schema and some do not. On the first upgrade from 1.1, BE flushes the schema into every rowset; after the rollback to 1.1, compaction creates new rowsets without a schema; on the second upgrade, BE coredumps because some code paths assume that either all rowsets or none of them have a schema.
1. Support persisting collected statistics to a pre-built OLAP table named `column_statistics`.
2. Use a much simpler mechanism to collect statistics: all gauges are collected with a single SQL statement per partition and then for the whole column, as defined in the class `AnalysisJob`.
3. Implement a cache in FE to manage the statistics records.
TODO:
1. Use OpenTelemetry to monitor the execution time of each job
2. Format the internal analysis SQL
3. Split the SQL so that the IN expression's child count does not exceed the FE limit when generating SQL for deleting expired records
4. Implement SHOW statements
Read the predicate columns first and use VExprContext (pushed-down predicates) to generate the select vector, which is then applied when reading the non-predicate columns. Values in the non-predicate columns that the select vector skips never need to be decoded, which reduces value-decode time. If a whole page can be skipped, decompression time is also reduced.
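A self-contained sketch of the two-phase read under simplified assumptions; the vector types and the hard-coded predicate (`v > 10`) are stand-ins, not Doris's actual segment reader.
```
#include <cstdint>
#include <vector>

using SelectVector = std::vector<uint32_t>;  // surviving row ids

// Phase 1: scan only the predicate column and evaluate the pushed-down
// predicate to build the select vector.
SelectVector eval_predicate(const std::vector<int64_t>& pred_col) {
    SelectVector sel;
    for (uint32_t i = 0; i < pred_col.size(); ++i) {
        if (pred_col[i] > 10) sel.push_back(i);
    }
    return sel;
}

// Phase 2: materialize a non-predicate column only at the selected
// rows; skipped values are never decoded, and a page containing no
// selected row would never be decompressed at all.
std::vector<int64_t> read_selected(const std::vector<int64_t>& col,
                                   const SelectVector& sel) {
    std::vector<int64_t> out;
    out.reserve(sel.size());
    for (uint32_t row : sel) out.push_back(col[row]);
    return out;
}
```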