Improve the accuracy of sample stats collection. For non distribution columns, use
`n*d / (n - f1 + f1*n/N)`
where `f1` is the number of distinct values that occurred exactly once in our sample of n rows (from a total of N),
and `d` is the total number of distinct values in the sample.
For distribution columns, use `ndv(n) * fraction of tablets sampled` for NDV.
For very large tablet to sample, use limit to control the total lines to scan (for non key column only, because key column is sorted and will be inaccurate using limit).
This commit overhauls the JDBC connector logic within our project, transitioning from the previous mechanism of fetching data through JNI calls for individual ResultSet items to a more efficient and unified approach using the VectorTable data structure.
sql select * from t1 a join t1 b on b.id in (select 1) and a.id = b.id; will report an error.
This pr support uncorrelated subquery in join condition to fix it
When doing LoadPendingTask or LoadLoadingTask, there may be some Error thrown,
such as `NoClassDefFoundError`, but previously, we only catch java's `Exception`, so
other kind of error can not be shown clearly.
Now the update time of hms table is generated by every FE node (Use `System.currentTimestamp()` separately), so the update time of a hms table may be different between FE nodes, always the same query can not hit the sql-cache if we submit it more than one times through different FE nodes. This pr mainly do following changes to avoid this problem.
- Use the `transient_lastDdlTime` instead of `System.currentTimestamp` as the `schemaUpdateTime` of hms tables
- Use the `eventTime` in hms event instead of `System.currentTimestamp` as the update time when processing hms events
The local meta/info files generated during backup are not distinguished
by repo names. If two backup jobs with the same name are submitted to
different repos at the same time, meta/info may be overwritten by another
backup job.
disable is null predicate constant fold rule for inline view
consider sql
select c.*
from (
select a.*, b.x
from test_insert a left join
(select 'some_const_str' x from test_insert) b on true
) c
where c.x is null;
when push “c.x is null” into c, after folding constant rule, it will get empty result. Because x is 'some_const_str' and "x is null" will be evaluated to false. This is wrong.
forbid one phase agg for pattern: agg-unionAll
one phase agg plan: agg-union-hashDistribute-children
two phase agg plan: agg(global) - hashDistribute-agg(local)-union-randomDistribute
the key point is the cost of randomDistribute is much lower than the hashDistribute, and hence two-phase agg wins.