* [fix](Inbitmap) fix bitmap result error when the left expr of an in predicate is a constant
1. When the left expr of the in predicate is a constant, instead of generating a bitmap filter, rewrite the SQL to use `bitmap_contains`.
For example, "select k1, k2 from (select 2 k1, 11 k2) t where k1 in (select bitmap_col from bitmap_tbl)"
=> "select k1, k2 from (select 2 k1, 11 k2) t left semi join bitmap_tbl b on bitmap_contains(b.bitmap_col, t.k1)"
* add regression test
**Histogram statistics**
Currently Doris collects statistics but no histogram data, and by default the optimizer assumes that the distinct values of a column are evenly distributed. This assumption can be problematic when the data distribution is skewed, so this PR implements the collection of histogram statistics.
For skewed columns (columns whose data is unevenly distributed), histogram statistics enable the optimizer to generate more accurate cardinality estimates for filter or join predicates involving these columns, resulting in a more precise execution plan.
Histograms improve the execution plan mainly in two aspects: the selection of where conditions and the selection of join order. The selection principle for where conditions is relatively simple: the histogram is used to calculate the selectivity of each predicate, and the more selective filter is preferred.
Join order selection is based on estimating the number of rows in the join result. When the data in the join condition columns is unevenly distributed, histograms can greatly improve the accuracy of this row count estimate. In addition, if a bucket of one of the columns contains 0 rows, it can be marked and skipped directly in the subsequent join to improve efficiency.
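A hedged illustration, reusing the `dev_table` example further below: with an equal-height histogram on `login_time`, the optimizer can estimate the fraction of rows matching a range predicate from the bucket boundaries instead of assuming an even distribution.
```sql
-- with a histogram on login_time, the estimate for this range predicate comes from
-- the buckets overlapping the range rather than from the uniform-distribution assumption
SELECT count(*) FROM dev_table WHERE login_time >= '2022-09-24 00:00:00';
```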
---
Histogram statistics are mainly collected by the histogram aggregation function, which is used as follows:
**Syntax**
```SQL
histogram(expr)
```
> The histogram function is used to describe the distribution of the data. It uses an "equal height" bucketing strategy and divides the data into buckets according to the values. Each bucket is described with a few simple statistics, such as the number of values that fall into it. It is mainly used by the optimizer to estimate the selectivity of range queries.
**Example**
```
MySQL [test]> select histogram(login_time) from dev_table;
+------------------------------------------------------------------------------------------------------------------------------+
| histogram(`login_time`) |
+------------------------------------------------------------------------------------------------------------------------------+
| {"bucket_size":5,"buckets":[{"lower":"2022-09-21 17:30:29","upper":"2022-09-21 22:30:29","count":9,"pre_sum":0,"ndv":1},...]}|
+------------------------------------------------------------------------------------------------------------------------------+
```
**Description**
```JSON
{
    "bucket_size": 5,
    "buckets": [
        {
            "lower": "2022-09-21 17:30:29",
            "upper": "2022-09-21 22:30:29",
            "count": 9,
            "pre_sum": 0,
            "ndv": 1
        },
        {
            "lower": "2022-09-22 17:30:29",
            "upper": "2022-09-22 22:30:29",
            "count": 10,
            "pre_sum": 9,
            "ndv": 1
        },
        {
            "lower": "2022-09-23 17:30:29",
            "upper": "2022-09-23 22:30:29",
            "count": 9,
            "pre_sum": 19,
            "ndv": 1
        },
        {
            "lower": "2022-09-24 17:30:29",
            "upper": "2022-09-24 22:30:29",
            "count": 9,
            "pre_sum": 28,
            "ndv": 1
        },
        {
            "lower": "2022-09-25 17:30:29",
            "upper": "2022-09-25 22:30:29",
            "count": 9,
            "pre_sum": 37,
            "ndv": 1
        }
    ]
}
```
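Reading the fields above (as inferred from the example): `count` is the number of rows in a bucket, `pre_sum` is the cumulative count of all preceding buckets, and `ndv` is the number of distinct values within the bucket. For instance, an estimate for `login_time < '2022-09-23 17:30:29'` is roughly the third bucket's `pre_sum`, i.e. 9 + 10 = 19 rows.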
TODO:
- histogram function supports parameters and sampled statistics (covered in another PR)
- use histogram statistics in the optimizer
- add p0 regression tests
This PR adds the rewriting and matching logic for the bitmap_union column in a materialized index.
If a materialized index has a bitmap_union column, we try to rewrite count distinct or bitmap_union_count to use the bitmap_union column in the materialized index, as sketched below.
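A minimal sketch, with hypothetical table and column names, of the kind of rewrite this enables:
```sql
-- hypothetical materialized view containing a bitmap_union column
CREATE MATERIALIZED VIEW visit_uv AS
SELECT dt, bitmap_union(to_bitmap(user_id))
FROM visit_log
GROUP BY dt;

-- both queries can now be matched against the materialized index and rewritten
-- to aggregate over its bitmap_union column instead of the raw user_id values
SELECT dt, count(distinct user_id) FROM visit_log GROUP BY dt;
SELECT dt, bitmap_union_count(to_bitmap(user_id)) FROM visit_log GROUP BY dt;
```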
When we process a not-in subquery, if the column returned by the subquery is nullable, we need a NULL AWARE ANTI JOIN instead of an ANTI JOIN.
Doris already supports NULL AWARE ANTI JOIN since PR #13871.
Nereids needs to do the same.
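A minimal sketch of the affected query shape, with hypothetical table and column names:
```sql
-- if t2.c1 is nullable, a plain ANTI JOIN would lose the NULL semantics of NOT IN,
-- so the planner must generate a NULL AWARE ANTI JOIN here
SELECT * FROM t1 WHERE t1.k1 NOT IN (SELECT c1 FROM t2);
```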
If we switch to an external catalog and use a database that has the same name as a database in the internal catalog,
querying 'show data' will return the data info from the internal catalog.
In a previous PR (#14876) we compact equality predicates like "a=1 or a=2 or a=3" into "a in (1, 2, 3)".
This PR sets a lower bound on the number of equality predicates, COMPACT_EQUAL_TO_IN_PREDICATE_THRESHOLD (default is 2).
For performance reasons we collect the literals in a hashSet, like {1,2,3}, and hence the literals in the resulting in-predicate are in random order.
For regression tests that need a stable explain output, set COMPACT_EQUAL_TO_IN_PREDICATE_THRESHOLD to a large number to avoid the compaction rule.
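A sketch of how a regression test might do this, assuming the threshold is exposed as a session variable of the same name (an assumption, not confirmed by this description):
```sql
-- assumed session variable name; raising the threshold far above any realistic OR chain
-- keeps "a=1 or a=2 or a=3" from being rewritten, so the explain output stays stable
SET compact_equal_to_in_predicate_threshold = 100000;
```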
When running create table like on a JDBC table, it fails with an error like
'errCode = 2, detailMessage = Failed to execute CREATE TABLE LIKE baseall_mysql.
Reason: errCode = 2, detailMessage = property table_type must be set'
This PR fixes it.
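A hypothetical reproduction, assuming baseall_mysql is a JDBC external table:
```sql
-- baseall_mysql is assumed to be a JDBC external table; before this fix the
-- statement failed with "property table_type must be set"
CREATE TABLE baseall_mysql_copy LIKE baseall_mysql;
```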
This PR implements the SetOperation.
- Adapt the EliminateUnnecessaryProject rule to ensure that the project under a SetOperation is not deleted.
- Add predicate pushdown for SetOperation.
- Optimization: merge multiple SetOperations with the same type and the same qualifier (see the sketch after this list).
- Optimization: merge OneRowRelation and union.
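A minimal sketch, with hypothetical table names, of the query shapes the two merge optimizations target:
```sql
-- nested UNION ALLs with the same qualifier can be merged into a single SetOperation
SELECT k1 FROM t1
UNION ALL
SELECT k1 FROM t2
UNION ALL
SELECT k1 FROM t3;

-- constant selects (OneRowRelation) can be folded into the union they feed
SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT k1 FROM t1;
```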
* [hotfix](dev-1.0.1) fix colocate join bug in vec engine after introducing output tuple (#10651)
To support vectorized outer join, we introduced an output tuple for the hash join node,
but it breaks the check for colocate join.
To solve this problem, we map the output slot ids to the children's slot ids of the hash join node,
so that colocate join can be checked correctly.
* fix colocate join bug
* fix non vec colocate join issue
Co-authored-by: lichi <lichi@rateup.com.cn>
* add test cases
Co-authored-by: lichi <lichi@rateup.com.cn>
When executing show databases/tables/table status where xxx, the statement is rewritten into a SelectStmt that selects the result from
information_schema. Scanning the schema table needs catalog info, otherwise it may return
database or table info from multiple catalogs.
For example:
mysql> show databases where schema_name='test';
+----------+
| Database |
+----------+
| test |
| test |
+----------+
MySQL [internal.test]> show tables from test where table_name='test_dc';
+----------------+
| Tables_in_test |
+----------------+
| test_dc |
| test_dc |
+----------------+
Fix three bugs:
1. DataTypeFactory::create_data_type is missing the conversion for the binary type, so OrcReader fails.
2. ScalarType#createType is missing the conversion for the binary type, so ExternalFileTableValuedFunction fails.
3. fmt::format cannot generate the right format string and fails.
Add one metric to report the number of published transactions per db. Users can combine this metric with doris_fe_txn_num to get the relative transaction processing speed per db.
# Proposed changes
## refactor
- add AggregateExpression to hide the differences between AggregateFunctions before and after disassembly
- request `GATHER` physicalProperties for the query, because a query always collects its result to the coordinator; using `GATHER` may select a better plan
- refactor `NormalizeAggregate`
- remove some physical fields from `LogicalAggregate`, like `AggPhase` and `isDisassemble`
- remove `AggregateDisassemble` and `DistinctAggregateDisassemble`, and use `AggregateStrategies` to generate the various PhysicalHashAggregate alternatives, like `two phases aggregate` and `three phases aggregate`, so cascades can automatically select the lowest-cost alternative.
- move `PushAggregateToOlapScan` to `AggregateStrategies`
- separate the traverse and visit methods in FoldConstantRuleOnFE
- if an expression does not implement the visit method, the traverse method can handle it and rewrite the children by default
- if an expression implements the visit method, the user-defined traverse (which invokes accept/visit) returns quickly because the default visit method does not forward to the children, and the pre-processing in the traverse method is not skipped.
## new feature
- support `disable_nereids_rules` to skip some rules.
example:
1. create 1 bucket table `n`
```sql
CREATE TABLE `n` (
  `id` bigint(20) NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`id`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`id`) BUCKETS 1
PROPERTIES (
  "replication_allocation" = "tag.location.default: 1",
  "in_memory" = "false",
  "storage_format" = "V2",
  "disable_auto_compaction" = "false"
);
```
2. insert some rows into `n`
```sql
insert into n select * from numbers('number'='20000000')
```
3. query table `n`
```sql
SET enable_nereids_planner=true;
SET enable_vectorized_engine=true;
SET enable_fallback_to_original_planner=false;
explain plan select id from n group by id;
```
the result shows that the one-stage aggregate is used
```
| PhysicalHashAggregate ( aggPhase=LOCAL, aggMode=INPUT_TO_RESULT, groupByExpr=[id#0], outputExpr=[id#0], partitionExpr=Optional.empty, requestProperties=[GATHER], stats=(rows=1, width=1, penalty=2.0E7) ) |
| +--PhysicalProject ( projects=[id#0], stats=(rows=20000000, width=1, penalty=0.0) ) |
| +--PhysicalOlapScan ( qualified=default_cluster:test.n, output=[id#0, name#1], stats=(rows=20000000, width=1, penalty=0.0) ) |
```
4. disable one stage aggregate
```sql
explain plan select
/*+SET_VAR(disable_nereids_rules=DISASSEMBLE_ONE_PHASE_AGGREGATE_WITHOUT_DISTINCT)*/
id
from n
group by id
```
the result is a two-stage aggregate
```
| PhysicalHashAggregate ( aggPhase=GLOBAL, aggMode=BUFFER_TO_RESULT, groupByExpr=[id#0], outputExpr=[id#0], partitionExpr=Optional[[id#0]], requestProperties=[GATHER], stats=(rows=1, width=1, penalty=2.0E7) ) |
| +--PhysicalHashAggregate ( aggPhase=LOCAL, aggMode=INPUT_TO_BUFFER, groupByExpr=[id#0], outputExpr=[id#0], partitionExpr=Optional[[id#0]], requestProperties=[ANY], stats=(rows=1, width=1, penalty=2.0E7) ) |
| +--PhysicalProject ( projects=[id#0], stats=(rows=20000000, width=1, penalty=0.0) ) |
| +--PhysicalOlapScan ( qualified=default_cluster:test.n, output=[id#0, name#1], stats=(rows=20000000, width=1, penalty=0.0) ) |
```