Estimate broadcast cost by an empirical formula: beNumber^0.5 * rowCount
1. The sender and receiver numbers are not available at the RBO stage yet, so we use beNumber instead.
2. Senders and receivers work in parallel, which is why we use the square root of beNumber (see the sketch below).
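A minimal sketch of the formula (hypothetical method name, not actual planner code):

```java
// Hypothetical sketch of the empirical broadcast-cost formula described above.
class BroadcastCost {
    // beNumber: number of backends; rowCount: rows to be broadcast.
    static double estimate(int beNumber, double rowCount) {
        // senders and receivers work in parallel, hence sqrt(beNumber) rather than beNumber
        return Math.sqrt(beNumber) * rowCount;
    }
}
```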
1. Evict dropped stats from the cache.
2. Remove the code for partition-level stats collection.
3. Disable analyzing a whole database directly.
4. Fix a potential infinite loop in the stats cleaner.
5. Sleep in each loop iteration when scanning the stats table, to avoid excessive IO usage by this task (see the sketch below).
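A minimal sketch of that throttled scan (hypothetical names; the batch size and sleep time are illustrative, not the values used by Doris):

```java
import java.util.List;

// Hypothetical sketch: scan the stats table in batches and pause between batches
// so this background task does not cause excessive IO.
abstract class ThrottledStatsScan {
    abstract List<String> fetchNextBatch(int batchSize); // stand-in for one stats-table read
    abstract void process(List<String> batch);           // stand-in for handling one batch

    void scan() throws InterruptedException {
        List<String> batch;
        while (!(batch = fetchNextBatch(1000)).isEmpty()) { // illustrative batch size
            process(batch);
            Thread.sleep(100);                              // illustrative pause per loop iteration
        }
    }
}
```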
`commons-lang` (versions 1 and 2) has not been maintained since 2011, and the official recommendation is `commons-lang3`, which is a smooth, compatible upgrade from `commons-lang`.
We currently use both dependencies in `fe`, and they can be fully unified.
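For most call sites the migration is only a package change; a minimal illustration (assuming a StringUtils usage, which exists in both libraries):

```java
// old: import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang3.StringUtils;

class Example {
    static boolean hasText(String s) {
        return StringUtils.isNotBlank(s); // same API in commons-lang and commons-lang3
    }
}
```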
`PatternGenerator#generateTypePattern` contains many meaningless loops, and IntegerRange was introduced for them,
which is unnecessary. So I refactored it.
TPC-H q10 and q5 benefit from this optimization.
For a given hash join condition A = B, sometimes both A and B are reduced by filters. In this PR, both reductions are counted in join estimation.
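To illustrate where both reductions enter, here is a sketch based on the classic equi-join estimate; this is only an illustration of the idea with hypothetical names, not Doris's actual cost model:

```java
// Hypothetical sketch: apply each side's filter selectivity to its join-key NDV
// before using the textbook equi-join estimate, so both reductions are counted.
class EquiJoinEstimate {
    static double estimateRows(double leftRows, double rightRows,
                               double ndvA, double ndvB,
                               double leftFilterSelectivity, double rightFilterSelectivity) {
        double reducedNdvA = Math.max(1.0, ndvA * leftFilterSelectivity);
        double reducedNdvB = Math.max(1.0, ndvB * rightFilterSelectivity);
        // classic estimate |L| * |R| / max(ndv(A), ndv(B)), with both NDVs reduced
        return leftRows * rightRows / Math.max(reducedNdvA, reducedNdvB);
    }
}
```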
PR (#17960) introduced a vector table that can map a Java table to a C++ block.
In some cases (Java UDF & JDBC executor), we need to map a C++ block to a Java table. This PR implements that direction.
The memory layout of the Java vector table and the C++ block is consistent,
so the implementation doesn't copy the block; it just passes the memory address.
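Conceptually, the Java side only needs to record where the native block lives; a minimal sketch of that zero-copy idea (a hypothetical class, not the actual vector table code):

```java
// Hypothetical sketch: the Java-side view stores the native block's address and
// shape instead of copying column data, since both sides share the same layout.
public class NativeBlockView {
    private final long blockAddress; // address of the C++ block, handed over via JNI
    private final int numColumns;
    private final int numRows;

    public NativeBlockView(long blockAddress, int numColumns, int numRows) {
        this.blockAddress = blockAddress; // no data copy here
        this.numColumns = numColumns;
        this.numRows = numRows;
    }

    public long address()   { return blockAddress; }
    public int numColumns() { return numColumns; }
    public int numRows()    { return numRows; }
}
```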
`hudi-common` depends on `parquet-avro`, but the dependency scope is `provided`.
When we use `hudi-catalog`, `HoodieAvroWriteSupport` will be called. This class depends on `parquet-avro`, so it will throw ClassNotFoundException.
Each release of Doris ships some experimental features.
These features may not be stable or mature enough, and users need to enable them via a config or session variable,
e.g. set enable_mtmv = true; otherwise, these features are disabled by default.
We should explicitly tell users which features are experimental, so that they will notice and can decide whether to
use them.
Changes
In this PR, I add support for the experimental_ prefix on FE configs and session variables.
Session Variable
Take enable_nereids_planner as an example.
The Nereids planner is an experimental feature in Doris, so there is an EXPERIMENTAL annotation for it:
@VariableMgr.VarAttr(..., expType = ExperimentalType.EXPERIMENTAL)
private boolean enableNereidsPlanner = false;
For compatibility, users can set it by either name:
set enable_nereids_planner = true;
set experimental_enable_nereids_planner = true;
In `show variables`, only the experimental_enable_nereids_planner entry will be shown.
You can also list all experimental session variables with:
show variables like "%experimental%"
Config
Same as session variables; take enable_mtmv as an example.
@ConfField(..., expType = ExperimentalType.EXPERIMENTAL)
public static boolean enable_mtmv = false;
Users can set it in fe.conf or via the ADMIN SET FRONTEND CONFIG statement under either name:
enable_mtmv
experimental_enable_mtmv
Users can see all experimental FE configs with:
ADMIN SHOW FRONTEND CONFIG LIKE "%experimental%";
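Under the hood, both names can map to the same underlying entry simply by stripping the prefix on lookup; a minimal sketch (a hypothetical helper, not the actual VariableMgr/Config code):

```java
class ExperimentalNames {
    private static final String PREFIX = "experimental_";

    // Resolve either "enable_mtmv" or "experimental_enable_mtmv" to the base name.
    static String toBaseName(String name) {
        return name.startsWith(PREFIX) ? name.substring(PREFIX.length()) : name;
    }

    // Display name for an entry, as listed by SHOW VARIABLES / ADMIN SHOW FRONTEND CONFIG.
    static String toDisplayName(String baseName, boolean experimental) {
        return experimental ? PREFIX + baseName : baseName;
    }
}
```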
TODO
Support this feature for BE configs.
The experimental prefix is currently only added for these session variables:
enable_pipeline_engine
enable_nereids_planner
enable_single_replica_insert
and FE config:
enable_mtmv
enable_ssl
enable_fqdn_mode
Other configs and session variables should be modified later.
select cast(k1 as INT) as id from tbl1 order by id limit 2;
is not valid for the topN optimization, because 'id' is
a cast expression, not a table column from the scan node.
This PR addresses this issue.
Support deleting expired stats both periodically and manually.
The default cleaner running interval is 2 days.
The manual clean syntax is:
```sql
DROP EXPIRED STATS
```
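A minimal sketch of the periodic part (hypothetical names; only the 2-day interval comes from this PR):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: run the expired-stats cleaner on a fixed 2-day interval.
class StatsCleanerScheduler {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void start(Runnable dropExpiredStats) {
        // dropExpiredStats stands in for the actual cleanup routine
        scheduler.scheduleWithFixedDelay(dropExpiredStats, 2, 2, TimeUnit.DAYS);
    }
}
```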
TODO:
1. Process external catalogs' stats
2. Run the drop at an appointed time
3. Sleep for a short time after dropping one batch
Add a session variable forbid_unknown_col_stats. When this variable is true, Nereids refuses to use unknown column stats.
The main purpose of this PR is to save debugging effort.
`select count(*) from T group by A, B`
suppose `ndv(A) > ndv(B)`
The estimated row count of the aggregate is between ndv(A) and ndv(A) * ndv(B).
In the previous version, we chose the upper bound, i.e. ndv(A) * ndv(B). The drawback of this choice is that the estimate is often bigger than the row count of T.
In this version, we choose the lower bound.
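For instance, with ndv(A) = 100 and ndv(B) = 10, the previous estimate was 1000 groups while the new estimate is 100. A minimal sketch of the two choices (hypothetical helpers, not the actual estimator):

```java
class GroupByEstimate {
    // Upper bound (previous behavior): assume every (A, B) combination appears.
    static long oldEstimate(long ndvA, long ndvB) { return ndvA * ndvB; }

    // Lower bound (this PR): at least max(ndv(A), ndv(B)) groups.
    static long newEstimate(long ndvA, long ndvB) { return Math.max(ndvA, ndvB); }
}
```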
When querying tables in the information_schema database, the query may time out because:
1. There are external catalogs with too many tables.
2. The external catalog is unreachable.
So I add a new FE config infodb_support_ext_catalog.
The default is false, which means that when selecting from tables in the information_schema database,
the result will not contain information about tables in external catalogs.
If we have an expression like the one below:
```
date(c1) -- c1's type is date or datev2
```
the expression's result is exactly the same as c1, so we should
remove the date function call. This expression optimization simplifies
the expression, speeds up execution, and increases the opportunity to
push filters down to the storage layer.
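A minimal, self-contained sketch of that rewrite (all types here are hypothetical stand-ins, not the actual Nereids classes):

```java
class SimplifyDateCall {
    enum SqlType { DATE, DATEV2, DATETIME, OTHER }

    interface Expr { SqlType type(); }

    record ColumnRef(String name, SqlType type) implements Expr {}

    record FuncCall(String name, Expr child, SqlType type) implements Expr {}

    // date(c1) is an identity when c1 is already DATE or DATEV2: return the child
    // and drop the redundant function call.
    static Expr simplify(Expr expr) {
        if (expr instanceof FuncCall call
                && call.name().equals("date")
                && (call.child().type() == SqlType.DATE
                    || call.child().type() == SqlType.DATEV2)) {
            return call.child();
        }
        return expr;
    }
}
```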
1. Fix the ambiguous-slot binding exception caused by selecting the same slot more than once
2. Fix SetOperation being bound multiple times because of CTEs
3. Fix CASE WHEN clauses not being coerced to the same type
4. Fix an exception when a set_var hint exists in a subquery or CTE