doris

Author	SHA1	Message	Date
HappenLee	1244eed1cd	[Opt](exec) opt the dispose nullable column logic (#17192 )	2023-03-01 23:25:40 +08:00
Jerry Hu	08adf914f9	[improvement](vec) avoid creating a new column while filtering mutable columns (#16850 ) Currently, when filtering a column, a new column will be created to store the filtering result, which will cause some performance loss。 ssb-flat without pushdown expr from 19s to 15s.	2023-02-21 09:47:21 +08:00
lihangyu	37d1519316	[WIP](dynamic-table) support dynamic schema table (#16335 ) Issue Number: close #16351 Dynamic schema table is a special type of table, it's schema change with loading procedure.Now we implemented this feature mainly for semi-structure data such as JSON, since JSON is schema self-described we could extract schema info from the original documents and inference the final type infomation.This speical table could reduce manual schema change operation and easily import semi-structure data and extends it's schema automatically.	2023-02-11 13:37:50 +08:00
Kang	737c73dcf0	[Improvement](topn) order by key topn query optimization (#15663 )	2023-02-06 15:36:05 +08:00
yiguolei	90b12143a3	[refactor](remove unused code) remove runtime tuple structure and useless utils class (#16237 )	2023-01-30 16:45:14 +08:00
yiguolei	adb758dcac	[refactor](remove non vec code) remove json functions string functions match functions and some code (#16141 ) remove json functions code remove string functions code remove math functions code move MatchPredicate to olap since it is only used in storage predicate process remove some code in tuple, Tuple structure should be removed in the future. remove many code in collection value structure, they are useless	2023-01-26 16:21:12 +08:00
yiguolei	79ad74637d	[refactor](remove expr) remove non vectorized Expr and ExprContext related codes (#16136 )	2023-01-24 10:45:35 +08:00
ZhaoChangle	199d7d3be8	[Refactor]Merged string_value into string_ref (#15925 )	2023-01-22 16:39:23 +08:00
lihangyu	3894de49d2	[Enhancement](topn) support two phase read for topn query (#15642 ) This PR optimize topn query like `SELECT * FROM tableX ORDER BY columnA ASC/DESC LIMIT N`. TopN is is compose of SortNode and ScanNode, when user table is wide like 100+ columns the order by clause is just a few columns.But ScanNode need to scan all data from storage engine even if the limit is very small.This may lead to lots of read amplification.So In this PR I devide TopN query into two phase: 1. The first phase we just need to read `columnA`'s data from storage engine along with an extra RowId column called `__DORIS_ROWID_COL__`.The other columns are pruned from ScanNode. 2. The second phase I put it in the ExchangeNode beacuase it's the central node for topn nodes in the cluster.The ExchangeNode will spawn a RPC to other nodes using the RowIds(sorted and limited from SortNode) read from the first phase and read row by row from storage engine. After the second phase read, Block will contain all the data needed for the query	2023-01-19 10:01:33 +08:00
yiguolei	d857b4af1b	[refactor](remove row batch) remove impala rowbatch structure (#15767 ) * [refactor](remove row batch) remove impala rowbatch structure Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-01-11 09:37:35 +08:00
xueweizhang	40c53931e5	[fix](vec) VMergeIterator add key same label for agg table (#14722 )	2023-01-02 22:54:21 +08:00
yiguolei	a807978882	[refactor](non-vec) Remove rowbatch code from delta writer and some rowbatch related code (#15349 ) Co-authored-by: yiguolei <yiguolei@gmail.com>	2022-12-26 08:54:51 +08:00
Gabriel	732417258c	[Bug](pipeline) Fix bugs to pass TPCDS cases (#15194 )	2022-12-20 22:29:55 +08:00
Mingyu Chen	0b945e4ee3	[fix](csv-reader) fix be crash when reading invalid value (#14951 )	2022-12-10 18:45:47 +08:00
HappenLee	9d36931038	[Refactor](NLJ) refactor the nested loop join node (#14911 ) * [Refactor](NLJ) refactor the nested loop join node * change the logic of alloc/release resource	2022-12-09 14:10:26 +08:00
Gabriel	2c42f0a905	[refactor](decimalv3) Refine code for DecimalV3 (#14394 )	2022-11-19 16:57:17 +08:00
Ashin Gau	6bd5378f66	[feature-wip](multi-catalog) lazy read for ParquetReader (#13917 ) Read predicate columns firstly, and use VExprContext(push-down predicates) to generate the select vector, which is then applied to read the non-predicate columns. The data in non-predicate columns may be skipped by select vector, so the value-decode-time can be reduced. If a whole page can be skipped, the decompress-time can also be reduced.	2022-11-10 16:56:14 +08:00
Pxl	2fab0c45c7	[Feature](runtime-filter) add runtime filter breaking change adapt (#13246 ) add runtime filter breaking change adapt	2022-10-28 10:59:28 +08:00
Adonis Ling	125def5102	[enhancement](macOS M1) Support building from source on macOS (M1) (#13195 ) # Proposed changes This PR fixed lots of issues when building from source on macOS with Apple M1 chip. ## ATTENTION The job for supporting macOS with Apple M1 chip is too big and there are lots of unresolved issues during runtime: 1. Some errors with memory tracker occur when BE (RELEASE) starts. 2. Some UT cases fail. ... Temporarily, the following changes are made on macOS to start BE successfully. 1. Disable memory tracker. 2. Use tcmalloc instead of jemalloc. This PR kicks off the job. Guys who are interested in this job can continue to fix these runtime issues. ## Use case ```shell ./build.sh -j 8 --be --clean cd output/be/bin ulimit -n 60000 ./start_be.sh --daemon ``` ## Something else It takes around _10+_ minutes to build BE (with prebuilt third-parties) on macOS with M1 chip. We will improve the development experience on macOS greatly when we finish the adaptation job.	2022-10-18 13:10:13 +08:00
Pxl	bdcb600f3d	[Bug](load) fix core dump on big block load (#13014 )	2022-10-10 12:38:32 +08:00
Pxl	9607f60845	[Feature](serialize) move block_data_version to fe heart beat (#12667 ) Move block_data_version from be config to fe heart beat	2022-09-27 18:25:54 +08:00
Shane	35076431ab	[fix](column)fix get_shrinked_column misspell (#12961 ) Fix misspell	2022-09-26 17:32:03 +08:00
yixiutt	b136d80e1a	[enhancement](compress) reuse compression ctx and buffer (#12573 ) Reuse compression ctx and buffer. Use a global instance for every compression algorithm, and use a thread saft buffer pool to reuse compression buffer, pool size is equal to max parallel thread num in compression, and this will not be too large. Test shows this feature increase 5% of data import and compaction. Co-authored-by: yixiutt <yixiu@selectdb.com>	2022-09-15 10:59:46 +08:00
Pxl	0ead048b93	[Enhancement](column) remove ColumnString terminating zero and add a data_version for pblock (#12456 ) 1. remove ColumnString terminating zero 2. add a data_version for pblock 3. change EncryptionMode to enum class	2022-09-14 21:25:22 +08:00
camby	56b2fc43d4	[enhancement](array-type) shrink column suffix zero for type ARRAY<CHAR> (#12443 ) In compute level, CHAR type will shrink suffix zeros. To keep the logic the same as CHAR type, we also shrink for ARRAY or ARRAY<ARRAY> types. Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>	2022-09-13 23:24:48 +08:00
HappenLee	b8e38b9167	[Bug](load) block call clear_column_data may have ref not equal 1 (#12350 )	2022-09-05 20:40:40 +08:00
Jerry Hu	7b352c93ff	[improvement](sink) avoid frequent allocation and deallocation when serializing block (#12310 )	2022-09-05 12:23:43 +08:00
Gabriel	90fb3b7783	[Improvement](load) accelerate tablet sink (#12174 )	2022-09-01 10:08:09 +08:00
HappenLee	573e5476dd	[Opt](load) Speed up the vectorized load (#12146 ) * [Opt](load) Speed up the vectorized load	2022-08-31 16:23:36 +08:00
Pxl	620d33a763	[Enchancement](optimize) set result_size_hint to filter_block (#11972 )	2022-08-25 11:42:52 +08:00
Jerry Hu	c22d097b59	[improvement](compress) Support compress/decompress block with lz4 (#11955 )	2022-08-22 17:35:43 +08:00
Gabriel	b8a33d2629	[Improvement](load) turn `enable_vectorized_load` on by default (#11833 )	2022-08-18 14:43:09 +08:00
lihangyu	01383c3217	[Enhancement](stream-load-json) using simdjson to parse json (#11665 ) Currently we use rapidjson to parse json document, It's fast but not fast enough compare to simdjson.And I found that the simdjson has a parsing front-end called simdjson::ondemand which will parse json when accessing fields and could strip the field token from the original document, using this feature we could reduce the cost of string copy(eg. we convert everthing to a string literal in _write_data_to_column by sprintf, I saw a hotspot from the flamegrame in this function, using simdjson::to_json_string will strip the token(a string piece) which is std::string_view and this is exactly we need).And second in _set_column_value we could iterate through the json document by for (auto field: object_val) {xxx}, this is much faster than looking up a field by it's field name like objectValue.FindMember("k1").The third optimization is the at_pointer interface simdjson provided, this could directly get the json field from original document.	2022-08-16 14:49:50 +08:00
Mingyu Chen	7e3fc0d321	[enhancement](vec) Support outer join for vectorized exec engine (#11068 ) Hash join node adds three new attributes. The following will take an SQL as an example to illustrate the meaning of these three attributes ``` select t1. a from t1 left join t2 on t1. a=t2. b; ``` 1. vOutputTupleDesc：Tuple2(a'') 2. vIntermediateTupleDescList: Tuple1(a', b'<nullable>) 2. vSrcToOutputSMap: <Tuple1(a'), Tuple2(a'')> The slot in intermediatetuple corresponds to the slot in output tuple one by one through the expr calculation of the left child in vsrctooutputsmap. This code mainly merges the contents of two PRs: 1. [fix](vectorized) Support outer join for vectorized exec engine (https://github.com/apache/doris/pull/10323) 2. [Fix](Join) Fix the bug of outer join function under vectorization #9954 The following is the specific description of the first PR In a vectorized scenario, the query plan will generate a new tuple for the join node. This tuple mainly describes the output schema of the join node. Adding this tuple mainly solves the problem that the input schema of the join node is different from the output schema. For example: 1. The case where the null side column caused by outer join is converted to nullable. 2. The projection of the outer tuple. The following is the specific description of the second PR This pr mainly fixes the following problems: 1. Solve the query combined with inline view and outer join. After adding a tuple to the join operator, the position of the `tupleisnull` function is inconsistent with the row storage. Currently the vectorized `tupleisnull` will be calculated in the HashJoinNode.computeOutputTuple() function. 2. Column nullable property error problem. At present, once the outer join occurs, the column on the null-side side will be planned to be nullable in the semantic parsing stage. For example： ``` select * from (select a as k1 from test) tmp right join b on tmp.k1=b.k1 ``` At this time, the nullable property of column k1 in the `tmp` inline view should be true. In the vectorized code, the virtual `tableRef` of tmp will be used in constructing the output tuple of HashJoinNode (specifically, the function HashJoinNode.computeOutputTuple()). So the correctness of the column nullable property of this tableRef is very important. In the above case, since the tmp table needs to perform a right join with the b table, as a null-side tmp side, it is necessary to change the column attributes involved in the tmp table to nullable. In non-vectorized code, since the virtual tableRef tmp is not used at all, it uses the `TupleIsNull` function in `outputsmp` to ensure data correctness. That is to say, the a column of the original table test is still non-null, and it does not affect the correctness of the result. The vectorized nullable attribute requirements are very strict. Outer join will change the nullable attribute of the join column, thereby changing the nullable attribute of the column in the upper operator layer by layer. Since FE has no mechanism to modify the nullable attribute in the upper operator tuple layer by layer after the analyzer. So at present, we can only preset the attributes before the lower join as nullable in the analyzer stage in advance, so as to avoid the problem. (At the same time, be also wrote some evasive code in order to deal with the problem of null to non-null.) Co-authored-by: EmmyMiao87 Co-authored-by: HappenLee Co-authored-by: morrySnow Co-authored-by: EmmyMiao87 <522274284@qq.com>	2022-07-21 23:39:25 +08:00
Tiewei Fang	c9f86bc7e2	[refactor] Refactoring Status static methods to format message using fmt(#9533 )	2022-07-02 18:58:23 +08:00
camby	06e436b7cc	[bugfix]dump_one_line failed to dump last column (#10522 ) Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>	2022-07-01 10:33:49 +08:00
Gabriel	eebfbd0c91	Revert "[fix](vectorized) Support outer join for vectorized exec engine (#10323 )" (#10424 ) This reverts commit 2cc670dba697a330358ae7d485d856e4b457c679.	2022-06-25 22:18:08 +08:00
HappenLee	2cc670dba6	[fix](vectorized) Support outer join for vectorized exec engine (#10323 ) In a vectorized scenario, the query plan will generate a new tuple for the join node. This tuple mainly describes the output schema of the join node. Adding this tuple mainly solves the problem that the input schema of the join node is different from the output schema. For example: 1. The case where the null side column caused by outer join is converted to nullable. 2. The projection of the outer tuple.	2022-06-24 08:59:30 +08:00
Xinyi Zou	d58e00c49c	[fix](brpc) Embed serialized request into the attachment and transmit it through http brpc (#9803 ) When the length of `Tuple/Block data` is greater than 2G, serialize the protoBuf request and embed the `Tuple/Block data` into the controller attachment and transmit it through http brpc. This is to avoid errors when the length of the protoBuf request exceeds 2G: `Bad request, error_text=[E1003]Fail to compress request`. In #7164, `Tuple/Block data` was put into attachment and sent via default `baidu_std brpc`, but when the attachment exceeds 2G, it will be truncated. There is no 2G limit for sending via `http brpc`. Also, in #7921, consider putting `Tuple/Block data` into attachment transport by default, as this theoretically reduces one serialization and improves performance. However, the test found that the performance did not improve, but the memory peak increased due to the addition of a memory copy.	2022-06-13 20:41:48 +08:00
HappenLee	c426c2e4b1	[Vectorized-Load] Support vectorized load table with materialized view (#9923 ) * [Vectorized-Load] Support vectorized load table with materialized view * fix ut Co-authored-by: lihaopeng <lihaopeng@baidu.com>	2022-06-02 14:59:01 +08:00
Adonis Ling	2a11a4ab99	[feature-wip][array-type] Support more sub types. (#9466 ) Please refer to #9465	2022-05-26 08:41:34 +08:00
spaces-x	73e31a2179	[stream-load-vec]: memtable flush only if necessary after aggregated (#9459 ) Co-authored-by: weixiang <weixiang06@meituan.com>	2022-05-25 21:12:24 +08:00
Zhengguo Yang	f5bef328fe	[fix] disable transfer data large than 2GB by brpc (#9770 ) because of brpc and protobuf cannot transfer data large than 2GB, if large than 2GB will overflow, so add a check before send	2022-05-25 18:41:13 +08:00
HappenLee	8fa677b59c	[Refactor][Bug-Fix][Load Vec] Refactor code of basescanner and vjson/vparquet/vbroker scanner (#9666 ) * [Refactor][Bug-Fix][Load Vec] Refactor code of basescanner and vjson/vparquet/vbroker scanner 1. fix bug of vjson scanner not support `range_from_file_path` 2. fix bug of vjson/vbrocker scanner core dump by src/dest slot nullable is different 3. fix bug of vparquest filter_block reference of column in not 1 4. refactor code to simple all the code It only changed vectorized load, not original row based load. Co-authored-by: lihaopeng <lihaopeng@baidu.com>	2022-05-20 11:43:03 +08:00
chenlinzhong	c9961c9bb9	[style] clang-format all c++ code (#9305 ) - sh build-support/clang-format.sh to clang-format all c++ code	2022-04-29 16:14:22 +08:00
HappenLee	d330bc3806	[Vectorized](stream-load-vec) Support stream load in vectorized engine (#8709 ) (#9280 ) Implement vectorized stream load. Added fe configuration option `enable_vectorized_load` to enable vectorized stream load. Co-authored-by: tengjp@outlook.com Co-authored-by: mrhhsg@gmail.com Co-authored-by: minghong.zhou@163.com Co-authored-by: HappenLee <happenlee@hotmail.com> Co-authored-by: zhoubintao <35688959+zbtzbtzbt@users.noreply.github.com>	2022-04-29 09:50:51 +08:00
Zhengguo Yang	ae680b4248	[UDF] support RPC udaf part 1: support create RPC udaf in fe (#8510 )	2022-04-21 17:38:58 +08:00
Pxl	dda7604e16	[Bug][Storage-vectorized] fix code dump on outer join with not nullable column (#9112 )	2022-04-21 11:02:04 +08:00
HappenLee	fcefed7c1c	[Bug][Vectorized] Fix core bug of segment vectorized (#8800 ) * [Bug][Vectorized] Fix core bug of segment vectorized 1. Read table with delete condition 2. Read table with default value HLL/Bitmap Column * refactor some code Co-authored-by: lihaopeng <lihaopeng@baidu.com>	2022-04-03 19:50:25 +08:00
Pxl	2760bcbcc1	[fix] fix core dump on deep_copy_tuple when data is null (#8620 )	2022-03-24 09:15:38 +08:00

1 2

60 Commits