doris

Author	SHA1	Message	Date
Jerry Hu	08adf914f9	[improvement](vec) avoid creating a new column while filtering mutable columns (#16850 ) Currently, when filtering a column, a new column will be created to store the filtering result, which will cause some performance loss。 ssb-flat without pushdown expr from 19s to 15s.	2023-02-21 09:47:21 +08:00
lihangyu	37d1519316	[WIP](dynamic-table) support dynamic schema table (#16335 ) Issue Number: close #16351 Dynamic schema table is a special type of table, it's schema change with loading procedure.Now we implemented this feature mainly for semi-structure data such as JSON, since JSON is schema self-described we could extract schema info from the original documents and inference the final type infomation.This speical table could reduce manual schema change operation and easily import semi-structure data and extends it's schema automatically.	2023-02-11 13:37:50 +08:00
HappenLee	40141a9c9c	[opt](vectorized) opt the null map _has_null logic (#15181 ) opt the null map _has_null logic	2022-12-20 10:01:54 +08:00
Jerry Hu	daeabcf053	[improvement](vec) optimize the logic for _has_null in ColumnNullable (#14633 )	2022-11-29 08:53:30 +08:00
HappenLee	70a424d6e3	[Bug](regression) Fail regression test in test_grouping_sets in fuzzy mode (#14601 )	2022-11-26 12:17:31 +08:00
abmdocrt	70ea07bc4b	[fix](nullable) Fix nullable cache to avoid function returning wrong value (#14463 )	2022-11-24 09:35:08 +08:00
starocean999	277025b046	[fix](join)ColumnNullable need handle const column with nullable const value (#13866 )	2022-11-02 08:52:49 +08:00
Gabriel	cd3450bd9d	[Improvement](join) optimize join probing phase (#13357 )	2022-10-18 12:37:17 +08:00
Jerry Hu	9b590ac4cb	[improvement](olap) cache value of has_null in ColumnNullable (#13289 )	2022-10-13 09:12:02 +08:00
HappenLee	36bf8ad3eb	[Opt](Vec) Support const column check nullable and remove nullable (#13020 )	2022-09-29 08:39:19 +08:00
Shane	35076431ab	[fix](column)fix get_shrinked_column misspell (#12961 ) Fix misspell	2022-09-26 17:32:03 +08:00
HappenLee	e413a2b8e9	[Opt](vectorized) Use new way to do hash shffle to speed up query (#12586 )	2022-09-15 11:08:04 +08:00
camby	56b2fc43d4	[enhancement](array-type) shrink column suffix zero for type ARRAY<CHAR> (#12443 ) In compute level, CHAR type will shrink suffix zeros. To keep the logic the same as CHAR type, we also shrink for ARRAY or ARRAY<ARRAY> types. Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>	2022-09-13 23:24:48 +08:00
HappenLee	d913ca5731	[Opt](vectorized) Speed up bucket shuffle join hash compute (#12407 ) * [Opt](vectorized) Speed up bucket shuffle join hash compute	2022-09-13 20:19:22 +08:00
Gabriel	66491ec137	[Improvement](sort) improve partial sort algorithm (#12349 ) * [Improvement](sort) improve partial sort algorithm	2022-09-09 15:44:18 +08:00
HappenLee	54d1630c42	[Opt](vectorized) speed up hash function compute in hash partition (#12334 ) After do the opt of hash function, the compute of siphash in HASH_PARTITION in vdata_stream_sender Before: 1s800ms After: 800ms	2022-09-07 10:11:40 +08:00
HappenLee	fdb4193e1b	[Vectorized][Refactor] Refactor the function of `tuple_is_null`, only do work in hash join node (#11109 )	2022-07-23 11:50:07 +08:00
Jerry Hu	b7c9007776	[improvement][agg]Process aggregated results in the vectorized way (#11084 )	2022-07-22 22:04:43 +08:00
Mingyu Chen	7e3fc0d321	[enhancement](vec) Support outer join for vectorized exec engine (#11068 ) Hash join node adds three new attributes. The following will take an SQL as an example to illustrate the meaning of these three attributes ``` select t1. a from t1 left join t2 on t1. a=t2. b; ``` 1. vOutputTupleDesc：Tuple2(a'') 2. vIntermediateTupleDescList: Tuple1(a', b'<nullable>) 2. vSrcToOutputSMap: <Tuple1(a'), Tuple2(a'')> The slot in intermediatetuple corresponds to the slot in output tuple one by one through the expr calculation of the left child in vsrctooutputsmap. This code mainly merges the contents of two PRs: 1. [fix](vectorized) Support outer join for vectorized exec engine (https://github.com/apache/doris/pull/10323) 2. [Fix](Join) Fix the bug of outer join function under vectorization #9954 The following is the specific description of the first PR In a vectorized scenario, the query plan will generate a new tuple for the join node. This tuple mainly describes the output schema of the join node. Adding this tuple mainly solves the problem that the input schema of the join node is different from the output schema. For example: 1. The case where the null side column caused by outer join is converted to nullable. 2. The projection of the outer tuple. The following is the specific description of the second PR This pr mainly fixes the following problems: 1. Solve the query combined with inline view and outer join. After adding a tuple to the join operator, the position of the `tupleisnull` function is inconsistent with the row storage. Currently the vectorized `tupleisnull` will be calculated in the HashJoinNode.computeOutputTuple() function. 2. Column nullable property error problem. At present, once the outer join occurs, the column on the null-side side will be planned to be nullable in the semantic parsing stage. For example： ``` select * from (select a as k1 from test) tmp right join b on tmp.k1=b.k1 ``` At this time, the nullable property of column k1 in the `tmp` inline view should be true. In the vectorized code, the virtual `tableRef` of tmp will be used in constructing the output tuple of HashJoinNode (specifically, the function HashJoinNode.computeOutputTuple()). So the correctness of the column nullable property of this tableRef is very important. In the above case, since the tmp table needs to perform a right join with the b table, as a null-side tmp side, it is necessary to change the column attributes involved in the tmp table to nullable. In non-vectorized code, since the virtual tableRef tmp is not used at all, it uses the `TupleIsNull` function in `outputsmp` to ensure data correctness. That is to say, the a column of the original table test is still non-null, and it does not affect the correctness of the result. The vectorized nullable attribute requirements are very strict. Outer join will change the nullable attribute of the join column, thereby changing the nullable attribute of the column in the upper operator layer by layer. Since FE has no mechanism to modify the nullable attribute in the upper operator tuple layer by layer after the analyzer. So at present, we can only preset the attributes before the lower join as nullable in the analyzer stage in advance, so as to avoid the problem. (At the same time, be also wrote some evasive code in order to deal with the problem of null to non-null.) Co-authored-by: EmmyMiao87 Co-authored-by: HappenLee Co-authored-by: morrySnow Co-authored-by: EmmyMiao87 <522274284@qq.com>	2022-07-21 23:39:25 +08:00
Jerry Hu	e293fbd277	[improvement]pre-serialize aggregation keys (#10700 )	2022-07-09 06:21:56 +08:00
Gabriel	eebfbd0c91	Revert "[fix](vectorized) Support outer join for vectorized exec engine (#10323 )" (#10424 ) This reverts commit 2cc670dba697a330358ae7d485d856e4b457c679.	2022-06-25 22:18:08 +08:00
HappenLee	2cc670dba6	[fix](vectorized) Support outer join for vectorized exec engine (#10323 ) In a vectorized scenario, the query plan will generate a new tuple for the join node. This tuple mainly describes the output schema of the join node. Adding this tuple mainly solves the problem that the input schema of the join node is different from the output schema. For example: 1. The case where the null side column caused by outer join is converted to nullable. 2. The projection of the outer tuple.	2022-06-24 08:59:30 +08:00
camby	2725127421	[fix] group by with two NULL rows after left join (#9688 ) Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>	2022-05-25 16:43:55 +08:00
chenlinzhong	c9961c9bb9	[style] clang-format all c++ code (#9305 ) - sh build-support/clang-format.sh to clang-format all c++ code	2022-04-29 16:14:22 +08:00
HappenLee	2c63fc1d6c	[improvement](vectorized) Support BetweenPredicate enable fold const expr (#8450 )	2022-03-13 09:36:24 +08:00
HappenLee	68dd799796	[improvement](vectorized) Support function tuple is null (#8442 )	2022-03-11 16:54:37 +08:00
HappenLee	68b24d608f	[fix] (vectorization)Fix nullable column compute the hash value error (#8105 ) Co-authored-by: lihaopeng <lihaopeng@baidu.com>	2022-02-18 11:20:47 +08:00
wangbo	b9f0b5565c	[refactor](storage) refactor some interfaces of storage layer column (#8064 ) 1 format binary plain 2 remove batch_set_null_bitmap 3 fix segiter return value 4 set insert_many_binary_data args	2022-02-18 10:54:51 +08:00
HappenLee	358bd79fb1	[improvement](vec)(Join) Mem reuse to speed up join operator (#7905 ) 1. Reuse the mem of output block in vec join node 2. Add the function `replicate` in column	2022-01-31 22:14:12 +08:00
HappenLee	e1d7233e9c	[feature](vectorization) Support Vectorized Exec Engine In Doris (#7785 ) # Proposed changes Issue Number: close #6238 Co-authored-by: HappenLee <happenlee@hotmail.com> Co-authored-by: stdpain <34912776+stdpain@users.noreply.github.com> Co-authored-by: Zhengguo Yang <yangzhgg@gmail.com> Co-authored-by: wangbo <506340561@qq.com> Co-authored-by: emmymiao87 <522274284@qq.com> Co-authored-by: Pxl <952130278@qq.com> Co-authored-by: zhangstar333 <87313068+zhangstar333@users.noreply.github.com> Co-authored-by: thinker <zchw100@qq.com> Co-authored-by: Zeno Yang <1521564989@qq.com> Co-authored-by: Wang Shuo <wangshuo128@gmail.com> Co-authored-by: zhoubintao <35688959+zbtzbtzbt@users.noreply.github.com> Co-authored-by: Gabriel <gabrielleebuaa@gmail.com> Co-authored-by: xinghuayu007 <1450306854@qq.com> Co-authored-by: weizuo93 <weizuo@apache.org> Co-authored-by: yiguolei <guoleiyi@tencent.com> Co-authored-by: anneji-dev <85534151+anneji-dev@users.noreply.github.com> Co-authored-by: awakeljw <993007281@qq.com> Co-authored-by: taberylyang <95272637+taberylyang@users.noreply.github.com> Co-authored-by: Cui Kaifeng <48012748+azurenake@users.noreply.github.com> ## Problem Summary: ### 1. Some code from clickhouse ClickHouse is an excellent implementation of the vectorized execution engine database, so here we have referenced and learned a lot from its excellent implementation in terms of data structure and function implementation. We are based on ClickHouse v19.16.2.2 and would like to thank the ClickHouse community and developers. The following comment has been added to the code from Clickhouse, eg: // This file is copied from // https://github.com/ClickHouse/ClickHouse/blob/master/src/Interpreters/AggregationCommon.h // and modified by Doris ### 2. Support exec node and query: * vaggregation_node * vanalytic_eval_node * vassert_num_rows_node * vblocking_join_node * vcross_join_node * vempty_set_node * ves_http_scan_node * vexcept_node * vexchange_node * vintersect_node * vmysql_scan_node * vodbc_scan_node * volap_scan_node * vrepeat_node * vschema_scan_node * vselect_node * vset_operation_node * vsort_node * vunion_node * vhash_join_node You can run exec engine of SSB/TPCH and 70% TPCDS stand query test set. ### 3. Data Model Vec Exec Engine Support Dup/Agg/Unq table, Support Block Reader Vectorized. Segment Vec is working in process. ### 4. How to use 1. Set the environment variable `set enable_vectorized_engine = true; `(required) 2. Set the environment variable `set batch_size = 4096; ` (recommended) ### 5. Some diff from origin exec engine https://github.com/doris-vectorized/doris-vectorized/issues/294 ## Checklist(Required) 1. Does it affect the original behavior: (No) 2. Has unit tests been added: (Yes) 3. Has document been added or modified: (No) 4. Does it need to update dependencies: (No) 5. Are there any changes that cannot be rolled back: (Yes)	2022-01-18 10:07:15 +08:00

30 Commits