
Mingyu Chen 7e3fc0d321 [enhancement](vec) Support outer join for vectorized exec engine (#11068)

The hash join node gains three new attributes.
The following SQL statement is used as an example to illustrate what each of them means:

```
select t1.a from t1 left join t2 on t1.a = t2.b;
```
1. vOutputTupleDesc: Tuple2(a'')

2. vIntermediateTupleDescList: Tuple1(a', b'<nullable>)

3. vSrcToOutputSMap: <Tuple1(a'), Tuple2(a'')>

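Each entry in vSrcToOutputSMap pairs an expr over the intermediate tuple (the left-hand side of the entry) with a slot in the output tuple. Evaluating that expr yields the value of the corresponding output slot, so the slots of the intermediate tuple and the output tuple correspond one to one.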
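
The relationship between the three attributes can be sketched in a few lines of Python. This is an illustrative model only: the `Slot` class, the dictionary-based rows and the `project` helper are assumptions made for this example and do not mirror the actual FE or BE classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    name: str
    nullable: bool = False


# Intermediate tuple (Tuple1): the concatenated join row. b' is nullable
# because t2 is the null side of the left join.
intermediate_tuple = [Slot("a'"), Slot("b'", nullable=True)]

# Output tuple (Tuple2): only the column that the query selects.
output_tuple = [Slot("a''")]

# Source-to-output map: (source expr over Tuple1, output slot in Tuple2).
# Each source expr is modelled as a function of an intermediate row.
src_to_output_smap = [(lambda row: row["a'"], output_tuple[0])]


def project(intermediate_row):
    """Evaluate each source expr and store the result in its output slot."""
    return {out.name: expr(intermediate_row) for expr, out in src_to_output_smap}


# The second row has no match in t2, so a' is kept and b' is None.
rows = [{"a'": 1, "b'": 1}, {"a'": 2, "b'": None}]
result = [project(r) for r in rows]
```

In this model the nullable b' exists only in the intermediate tuple; the output rows carry just a''.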

This commit mainly merges the contents of two PRs:
1. [fix](vectorized) Support outer join for vectorized exec engine (https://github.com/apache/doris/pull/10323)
2. [Fix](Join) Fix the bug of outer join function under vectorization #9954

The following is a description of the first PR.
In the vectorized engine, the query planner generates a new tuple for the join node.
This tuple describes the output schema of the join node.
It is needed because the input schema of the join node can differ from its output schema.
For example:
1. Columns on the null side of an outer join are converted to nullable.
2. The outer tuple is projected.
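
A minimal sketch of how such an output schema could be derived, assuming a simplified model in which a schema is a list of `(name, nullable)` pairs. The function name `join_output_schema` and its parameters are hypothetical and are not taken from the Doris code base.

```python
def join_output_schema(left, right, join_type, projection=None):
    """Return the join's output schema as a list of (name, nullable) pairs.

    left, right: input schemas, lists of (name, nullable).
    join_type:   "inner", "left_outer", "right_outer" or "full_outer".
    projection:  optional list of column names to keep, in output order.
    """
    if join_type not in ("inner", "left_outer", "right_outer", "full_outer"):
        raise ValueError(f"unsupported join type: {join_type}")

    null_left = join_type in ("right_outer", "full_outer")
    null_right = join_type in ("left_outer", "full_outer")

    # Case 1: columns on the null side become nullable.
    schema = [(name, nullable or null_left) for name, nullable in left]
    schema += [(name, nullable or null_right) for name, nullable in right]

    # Case 2: keep only the projected columns.
    if projection is not None:
        by_name = dict(schema)
        schema = [(name, by_name[name]) for name in projection]
    return schema


t1 = [("t1.a", False)]
t2 = [("t2.b", False)]
full = join_output_schema(t1, t2, "left_outer")
projected = join_output_schema(t1, t2, "left_outer", projection=["t1.a"])
```

Here `full` differs from the input schemas only in the nullable flag of `t2.b`, and `projected` additionally drops that column.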

The following is a description of the second PR.
This PR mainly fixes the following problems:
1. Queries that combine an inline view with an outer join. After a tuple was added to the join operator, the position at which the `tupleisnull` function is evaluated became inconsistent with the row-based (non-vectorized) engine. The vectorized `tupleisnull` is now calculated in the HashJoinNode.computeOutputTuple() function.
2. Incorrect column nullable properties. Now, whenever an outer join occurs, the columns on the null side are planned as nullable in the semantic analysis stage.

For example:
```
select * from (select a as k1 from test) tmp right join b on tmp.k1=b.k1
```
In this case, the nullable property of column k1 in the `tmp` inline view should be true.

In the vectorized code, the virtual `tableRef` of tmp is used to construct the output tuple of HashJoinNode (specifically, in the function HashJoinNode.computeOutputTuple()), so the **correctness** of the column nullable property of this tableRef is very important.
In the example above, tmp is right-joined with table b, which makes tmp the null side, so the columns of tmp that are involved must be changed to nullable.

The non-vectorized code does not use the virtual tableRef tmp at all; it relies on the `TupleIsNull` function in `outputSmap` to ensure data correctness.
That is, column a of the original table test remains non-nullable, and this does not affect the correctness of the result.
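
The row-based behaviour can be pictured as wrapping each inline-view output expr in a null check on the whole tuple. The sketch below is a simplified illustration under that assumption; `tuple_is_null`, `wrap_with_tuple_is_null` and the row layout are hypothetical names for this example, not the engine's actual functions.

```python
def tuple_is_null(row, tuple_name):
    """True when the outer join produced no match for this tuple."""
    return row.get(tuple_name) is None


def wrap_with_tuple_is_null(expr, tuple_name):
    """Return an expr that yields None whenever the tuple itself is null."""
    def wrapped(row):
        if tuple_is_null(row, tuple_name):
            return None
        return expr(row)
    return wrapped


# k1 is defined as column a of test; column a itself stays non-nullable.
k1 = wrap_with_tuple_is_null(lambda row: row["tmp"]["a"], "tmp")

matched = {"tmp": {"a": 7}, "b": {"k1": 7}}
unmatched = {"tmp": None, "b": {"k1": 9}}  # right join: no tmp row for this b row

matched_value = k1(matched)
unmatched_value = k1(unmatched)
```

Because the null check sits in the wrapper, the underlying column never needs its own nullable flag in this model, which is why the row-based path is unaffected by the declared nullability of a.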

The vectorized engine is very strict about the nullable attribute.
An outer join changes the nullable attribute of the join columns, which in turn changes the nullable attribute of those columns in each upper operator, layer by layer.
However, FE has no mechanism for modifying the nullable attribute in the tuples of the upper operators layer by layer once the analyzer has run.
So for now, the only option is to preset the affected attributes below the join as nullable in advance, during the analyzer stage, which avoids the problem.
(In addition, BE contains some workaround code to deal with the null-to-non-null problem.)

Co-authored-by: EmmyMiao87
Co-authored-by: HappenLee
Co-authored-by: morrySnow

Co-authored-by: EmmyMiao87 <522274284@qq.com>

2022-07-21 23:39:25 +08:00

| Name | Last commit | Date |
| --- | --- | --- |
| .github | [action](Nereids): add label auto for nereids UT. (#10665) | 2022-07-07 18:21:04 +08:00 |
| | [enhancement](vec) Support outer join for vectorized exec engine (#11068) | 2022-07-21 23:39:25 +08:00 |
| bin | [Enhancement] check vm.max_map_count before starting (#11052) | 2022-07-21 21:16:48 +08:00 |
| build-support | [enhancement](vec) Support outer join for vectorized exec engine (#11068) | 2022-07-21 23:39:25 +08:00 |
| conf | [enhancement](be) be asan add asan_suppr.conf to ignore known leak. (#10768) | 2022-07-12 19:51:34 +08:00 |
| contrib/udf | [UDF] support RPC udaf part 1: support create RPC udaf in fe (#8510) | 2022-04-21 17:38:58 +08:00 |
| dist | [TLP](step-1) Remove incubator prefix (#10230) | 2022-06-19 19:34:52 +08:00 |
| docker | [TLP](step-1) Remove incubator prefix (#10230) | 2022-06-19 19:34:52 +08:00 |
| docs | [enhancement] Refactor to improve the usability of MemTracker (step2) (#10823) | 2022-07-21 17:11:28 +08:00 |
| extension | [compile]Update init-env.sh (#10451) | 2022-06-30 11:28:06 +08:00 |
| | [enhancement](vec) Support outer join for vectorized exec engine (#11068) | 2022-07-21 23:39:25 +08:00 |
| fe_plugins | [TLP](step-1) Remove incubator prefix (#10230) | 2022-06-19 19:34:52 +08:00 |
| fs_brokers/apache_hdfs_broker | [TLP](step-1) Remove incubator prefix (#10230) | 2022-06-19 19:34:52 +08:00 |
| gensrc | [enhancement](vec) Support outer join for vectorized exec engine (#11068) | 2022-07-21 23:39:25 +08:00 |
| regression-test | [enhancement] Refactor to improve the usability of MemTracker (step2) (#10823) | 2022-07-21 17:11:28 +08:00 |
| samples | [feature](udf) Vectorization support remote udaf #10683 (#10685) | 2022-07-18 17:15:34 +08:00 |
| thirdparty | [dependency](arrow) Add GetRawORCReader function for arrow orc reader (#11069) | 2022-07-21 22:23:05 +08:00 |
| tools | [tools] add clickbench tools (#11009) | 2022-07-20 17:59:04 +08:00 |
| | [improvement]Remove the page button on the System page (#10900) | 2022-07-16 06:00:08 +08:00 |
| webroot | [License] Organize and modify the license of the code (#4371) | 2020-08-24 21:51:55 +08:00 |
| .asf.yaml | [TLP](step-2) update .asf.yaml (#10273) | 2022-06-21 09:43:56 +08:00 |
| .clang-format | [refactor][style] Use clang-format to sort includes (#9483) | 2022-05-10 21:25:35 +08:00 |
| .clang-format-ignore | [chore](clang-format)(license-eye) Add Clang Format/Skywalking eyes github action (#7132) | 2021-11-24 10:41:02 +08:00 |
| .clang-tidy | [refactor] add some clang-tidy checks && some code style fix (#8752) | 2022-03-31 13:53:41 +08:00 |
| .clangd | [enhancement][diagnostics] Add a diagnostic: detect unused includes (#9117) | 2022-06-08 11:52:48 +08:00 |
| .editorconfig | [enhancement](thirdparty) Support building thirdparty on macOS (#10677) | 2022-07-18 10:50:30 +08:00 |
| .gitattributes | Enable auto convert when check in (#1926) | 2019-10-09 22:31:27 +08:00 |
| .gitignore | [feature-wip](multi-catalog) Support s3 storage for file scan node (#10977) | 2022-07-21 17:38:53 +08:00 |
| .gitmodules | [chore]replace checkstyle action with mvn checkstyle:check (#10474) | 2022-06-30 11:20:50 +08:00 |
| .licenserc.yaml | [website](doc)add package-lock.json to resolve docs build failure (#10558) | 2022-07-02 17:20:11 +08:00 |
| .rat-excludes | [improvement] Improve sig handler (#8545) | 2022-03-22 10:40:31 +08:00 |
| build_plugin.sh | [Fix Bug] Fix ehco command not found (#9021) | 2022-04-15 13:43:47 +08:00 |
| build.sh | [Enhancement] [Memory] add strict memory usage compile option STRICT_MEMORY_USE (#10936) | 2022-07-18 16:16:43 +08:00 |
| CODE_OF_CONDUCT.md | [Doc] Create CODE_OF_CONDUCT.md (#4070) | 2020-07-14 22:28:38 +08:00 |
| CONTRIBUTING_CN.md | [TLP](step-1) Remove incubator prefix (#10230) | 2022-06-19 19:34:52 +08:00 |
| CONTRIBUTING.md | [TLP](step-1) Remove incubator prefix (#10230) | 2022-06-19 19:34:52 +08:00 |
| env.sh | [refactor] fix warings when compile with clang (#8069) | 2022-02-19 11:29:02 +08:00 |
| LICENSE.txt | [website] add website external resource (#10416) | 2022-06-26 01:22:14 +08:00 |
| NOTICE.txt | [TLP](step-1) Remove incubator prefix (#10230) | 2022-06-19 19:34:52 +08:00 |
| README.md | [Bug][docs] Fix wrong links in README.md (#10394) | 2022-07-04 14:44:23 +08:00 |
| run-be-ut.sh | [Enhancement] [Memory] add strict memory usage compile option STRICT_MEMORY_USE (#10936) | 2022-07-18 16:16:43 +08:00 |
| run-fe-ut.sh | [Fix Bug] Fix ehco command not found (#9021) | 2022-04-15 13:43:47 +08:00 |
| run-regression-test.sh | [fix](regression-test) fix run-regression-test Xmx2048m param (#10234) | 2022-06-17 23:30:44 +08:00 |
| tsan_suppressions | [TSAN] Fix tsan bugs (part 1) (#5162) | 2021-01-15 09:45:11 +08:00 |

README.md

Apache Doris

Doris is an MPP-based interactive SQL data warehouse for reporting and analysis. It was originally developed at Baidu under the name Palo and was renamed Doris after being donated to the Apache Software Foundation.

Doris provides high-concurrency, low-latency point queries as well as high-throughput ad-hoc analytical queries.
Doris provides batch data loading and real-time mini-batch data loading.
Doris provides high availability, reliability, fault tolerance, and scalability.

The main advantages of Doris are its simplicity (of development, deployment and use) and its ability to meet many data serving requirements in a single system. For details, refer to Overview.

Official website: https://doris.apache.org/

License

Apache License, Version 2.0

Note

Some licenses of the third-party dependencies are not compatible with the Apache 2.0 License, so you need to disable some Doris features to comply with it. For details, refer to thirdparty/LICENSE.txt.

Technology

Doris mainly integrates the technology of Google Mesa and Apache Impala. It is based on a column-oriented storage engine and can be accessed with a MySQL client.