Commit Graph

25 Commits

Author SHA1 Message Date
bfba058ecf [Feature](join) Support null aware left anti join (#13871) 2022-11-03 12:11:25 +08:00
3006b258b0 [Improvement](bloomfilter) allocate memory for BF in open phase (#13494) 2022-10-21 17:37:26 +08:00
d3f65aa746 [Improvement](join) remove unnecessary state for join (#13472) 2022-10-21 09:59:34 +08:00
b5cd167713 [fix](hashjoin) fix coredump of hash join in ubsan build (#13479)
* [fix](hashjoin) fix coredump of hash join in ubsan build
2022-10-20 10:16:19 +08:00
cd3450bd9d [Improvement](join) optimize join probing phase (#13357) 2022-10-18 12:37:17 +08:00
baf2689610 [Improvement](join) compute hash values by vectorized way (#13335) 2022-10-13 16:04:58 +08:00
3a49156e30 [performance] (vectorization)optimize In Expr (#11826)
Co-authored-by: Wang Bo <wangbo36@meituan.com>
2022-08-17 10:46:37 +08:00
fdb4193e1b [Vectorized][Refactor] Refactor the function of tuple_is_null, only do work in hash join node (#11109) 2022-07-23 11:50:07 +08:00
7e3fc0d321 [enhancement](vec) Support outer join for vectorized exec engine (#11068)
Hash join node adds three new attributes.
The following will take an SQL as an example to illustrate the meaning of these three attributes

```
select t1. a from t1 left join t2 on t1. a=t2. b;
```
1. vOutputTupleDesc:Tuple2(a'')

2. vIntermediateTupleDescList: Tuple1(a', b'<nullable>)

2. vSrcToOutputSMap: <Tuple1(a'), Tuple2(a'')>

The slot in intermediatetuple corresponds to the slot in output tuple one by one through the expr calculation of the left child in vsrctooutputsmap.

This code mainly merges the contents of two PRs:
1.  [fix](vectorized) Support outer join for vectorized exec engine (https://github.com/apache/doris/pull/10323)
2. [Fix](Join) Fix the bug of outer join function under vectorization #9954

The following is the specific description of the first PR
In a vectorized scenario, the query plan will generate a new tuple for the join node.
This tuple mainly describes the output schema of the join node.
Adding this tuple mainly solves the problem that the input schema of the join node is different from the output schema.
For example:
1. The case where the null side column caused by outer join is converted to nullable.
2. The projection of the outer tuple.

The following is the specific description of the second PR
This pr mainly fixes the following problems:
1. Solve the query combined with inline view and outer join. After adding a tuple to the join operator, the position of the `tupleisnull` function is inconsistent with the row storage. Currently the vectorized `tupleisnull` will be calculated in the HashJoinNode.computeOutputTuple() function.
2. Column nullable property error problem. At present, once the outer join occurs, the column on the null-side side will be planned to be nullable in the semantic parsing stage.

For example:
```
select * from (select a as k1 from test) tmp right join b on tmp.k1=b.k1
```
At this time, the nullable property of column k1 in the `tmp` inline view should be true.

In the vectorized code, the virtual `tableRef` of tmp will be used in constructing the output tuple of HashJoinNode (specifically, the function HashJoinNode.computeOutputTuple()). So the **correctness** of the column nullable property of this tableRef is very important.
In the above case, since the tmp table needs to perform a right join with the b table, as a null-side tmp side, it is necessary to change the column attributes involved in the tmp table to nullable.

In non-vectorized code, since the virtual tableRef tmp is not used at all, it uses the `TupleIsNull` function in `outputsmp` to ensure data correctness.
That is to say, the a column of the original table test is still non-null, and it does not affect the correctness of the result.

The vectorized nullable attribute requirements are very strict.
Outer join will change the nullable attribute of the join column, thereby changing the nullable attribute of the column in the upper operator layer by layer.
Since FE has no mechanism to modify the nullable attribute in the upper operator tuple layer by layer after the analyzer.
So at present, we can only preset the attributes before the lower join as nullable in the analyzer stage in advance, so as to avoid the problem.
(At the same time, be also wrote some evasive code in order to deal with the problem of null to non-null.)

Co-authored-by: EmmyMiao87
Co-authored-by: HappenLee
Co-authored-by: morrySnow

Co-authored-by: EmmyMiao87 <522274284@qq.com>
2022-07-21 23:39:25 +08:00
4960043f5e [enhancement] Refactor to improve the usability of MemTracker (step2) (#10823) 2022-07-21 17:11:28 +08:00
3e87960202 [bugfix] fix bug of vhash join build (#10614)
* [bugfix] fix bug of vhash join build

* format code
2022-07-05 19:14:42 +08:00
Pxl
6a566ccb74 [Enhancement][Vectorized] add constexpr_loop_match (#10283) 2022-06-29 14:58:50 +08:00
eebfbd0c91 Revert "[fix](vectorized) Support outer join for vectorized exec engine (#10323)" (#10424)
This reverts commit 2cc670dba697a330358ae7d485d856e4b457c679.
2022-06-25 22:18:08 +08:00
2cc670dba6 [fix](vectorized) Support outer join for vectorized exec engine (#10323)
In a vectorized scenario, the query plan will generate a new tuple for the join node.
This tuple mainly describes the output schema of the join node.
Adding this tuple mainly solves the problem that the input schema of the join node is different from the output schema.
For example:
1. The case where the null side column caused by outer join is converted to nullable.
2. The projection of the outer tuple.
2022-06-24 08:59:30 +08:00
0683181fef [API changed](parser) Remove merge join syntax (#9795)
Remove merge join sql and merge join node
2022-05-30 09:04:21 +08:00
90e8cda5f2 [Enhancement](Vectorized)build hash table with new thread, as non-vec… (#9290)
* [Enhancement][Vectorized]build hash table with new thread, as non-vectorized past do

edit after comments

* format code with clang format

Co-authored-by: lidongyang <dongyang.li@rateup.com.cn>
Co-authored-by: stephen <hello-stephen@qq.com>
2022-05-24 10:23:15 +08:00
c9961c9bb9 [style] clang-format all c++ code (#9305)
- sh build-support/clang-format.sh  to  clang-format all c++ code
2022-04-29 16:14:22 +08:00
71ac86b183 [improvement](join) Support join project in query engine (#8722) 2022-03-31 23:00:07 +08:00
eeae516e37 [Feature](Memory) Hook TCMalloc new/delete automatically counts to MemTracker (#8476)
Early Design Documentation: https://shimo.im/docs/DT6JXDRkdTvdyV3G

Implement a new way of memory statistics based on TCMalloc New/Delete Hook,
MemTracker and TLS, and it is expected that all memory new/delete/malloc/free
of the BE process can be counted.
2022-03-20 23:06:54 +08:00
705989d239 [improvement](VHashJoin) add probe timer (#8233) 2022-03-13 20:54:44 +08:00
6f4cf57b17 [Bug][Vectorized] Fix bug of decimal cast to double loss precision (#8221)
Co-authored-by: lihaopeng <lihaopeng@baidu.com>
2022-02-25 11:04:11 +08:00
b1e7343532 [Vectorized] [HashJoin] Opt HashJoin Performance (#8119)
Co-authored-by: lihaopeng <happenlee@hotmail.com>
2022-02-23 10:28:16 +08:00
50864aca7d [refactor] fix warings when compile with clang (#8069) 2022-02-19 11:29:02 +08:00
bdd78f20c8 [Vectorized][HashJoin] Eliminate hashjoin branch prediction (#8051)
Co-authored-by: jewisliu <jewisliu@tencent.com>
2022-02-17 19:00:26 +08:00
e1d7233e9c [feature](vectorization) Support Vectorized Exec Engine In Doris (#7785)
# Proposed changes

Issue Number: close #6238

    Co-authored-by: HappenLee <happenlee@hotmail.com>
    Co-authored-by: stdpain <34912776+stdpain@users.noreply.github.com>
    Co-authored-by: Zhengguo Yang <yangzhgg@gmail.com>
    Co-authored-by: wangbo <506340561@qq.com>
    Co-authored-by: emmymiao87 <522274284@qq.com>
    Co-authored-by: Pxl <952130278@qq.com>
    Co-authored-by: zhangstar333 <87313068+zhangstar333@users.noreply.github.com>
    Co-authored-by: thinker <zchw100@qq.com>
    Co-authored-by: Zeno Yang <1521564989@qq.com>
    Co-authored-by: Wang Shuo <wangshuo128@gmail.com>
    Co-authored-by: zhoubintao <35688959+zbtzbtzbt@users.noreply.github.com>
    Co-authored-by: Gabriel <gabrielleebuaa@gmail.com>
    Co-authored-by: xinghuayu007 <1450306854@qq.com>
    Co-authored-by: weizuo93 <weizuo@apache.org>
    Co-authored-by: yiguolei <guoleiyi@tencent.com>
    Co-authored-by: anneji-dev <85534151+anneji-dev@users.noreply.github.com>
    Co-authored-by: awakeljw <993007281@qq.com>
    Co-authored-by: taberylyang <95272637+taberylyang@users.noreply.github.com>
    Co-authored-by: Cui Kaifeng <48012748+azurenake@users.noreply.github.com>


## Problem Summary:

### 1. Some code from clickhouse

**ClickHouse is an excellent implementation of the vectorized execution engine database,
so here we have referenced and learned a lot from its excellent implementation in terms of
data structure and function implementation.
We are based on ClickHouse v19.16.2.2 and would like to thank the ClickHouse community and developers.**

The following comment has been added to the code from Clickhouse, eg:
// This file is copied from
// https://github.com/ClickHouse/ClickHouse/blob/master/src/Interpreters/AggregationCommon.h
// and modified by Doris

### 2. Support exec node and query:
* vaggregation_node
* vanalytic_eval_node
* vassert_num_rows_node
* vblocking_join_node
* vcross_join_node
* vempty_set_node
* ves_http_scan_node
* vexcept_node
* vexchange_node
* vintersect_node
* vmysql_scan_node
* vodbc_scan_node
* volap_scan_node
* vrepeat_node
* vschema_scan_node
* vselect_node
* vset_operation_node
* vsort_node
* vunion_node
* vhash_join_node

You can run exec engine of SSB/TPCH and 70% TPCDS stand query test set.

### 3. Data Model

Vec Exec Engine Support **Dup/Agg/Unq** table, Support Block Reader Vectorized.
Segment Vec is working in process.

### 4. How to use

1. Set the environment variable `set enable_vectorized_engine = true; `(required)
2. Set the environment variable `set batch_size = 4096; ` (recommended)

### 5. Some diff from origin exec engine

https://github.com/doris-vectorized/doris-vectorized/issues/294

## Checklist(Required)

1. Does it affect the original behavior: (No)
2. Has unit tests been added: (Yes)
3. Has document been added or modified: (No)
4. Does it need to update dependencies: (No)
5. Are there any changes that cannot be rolled back: (Yes)
2022-01-18 10:07:15 +08:00