doris

Author	SHA1	Message	Date
yiguolei	a1114d46e8	[refactor](unify type system) remove switch case in histogram helper (#18222 )	2023-03-30 10:54:08 +08:00
hqx871	1999cccde9	[feature](array-type) Unique table support array value (#17024 ) Unique table support array value --------- Co-authored-by: huangqixiang.871 <huangqixiang.871@bytedance.com>	2023-03-24 10:18:59 +08:00
yiguolei	dd53bc1c8d	[unify type system](remove unused type desc) remove some code (#17921 ) There are many type definitions in BE. Should unify the type system and simplify the development. --------- Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-03-19 14:05:02 +08:00
Pxl	16fc3a0e22	[Chore](compile) remove some unused static on inline function to reduce compile time (#17603 ) remove some unused static on inline function to reduce compile time	2023-03-13 11:11:59 +08:00
Pxl	65b8dfc7ff	[Enchancement](function) Inline some aggregate function && remove nullable combinator (#17328 ) 1. Inline some aggregate function 2. remove nullable combinator	2023-03-09 10:39:04 +08:00
ElvinWei	bd5ed2b0c2	[enhancement](histogram) optimize the histogram bucketing strategy, etc (#17264 ) * optimize the histogram bucketing strategy, etc * fix p0 regression of histogram	2023-03-08 20:12:05 +08:00
ElvinWei	f32cd2c123	[fix](statistics) fix a problem with histogram statistics collection parameters (#16918 ) 1. Fixed a problem with histogram statistics collection parameters. 2. Solved the problem that it takes a long time to collect histogram statistics. TODO: Optimize histogram statistics sampling method and make the sampling parameters effective. The problem is that the histogram function works as expected in the single-node test, but doesn't work in the multi-node test. In addition, the performance of the current support sampling to collect histogram is low, resulting in a large time consumption when collecting histogram information. Fixed the parameter issue and temporarily removed support for sampling to speed up the collection of histogram statistics. Will next support sampling to collect histogram information.	2023-02-20 16:33:18 +08:00
HappenLee	9114896178	[DecimalV3](opt) opt the function of decimalv3 to_string logic (#16427 )	2023-02-07 13:28:07 +08:00
lihangyu	f94a78ab4a	[Fix](topn) fix wrong nullable cast for RowId column and use heapsorter for two phase read (#16399 ) convert_nullable_flags does not contain nullable info for RowID column, but valid_column_ids contain RowID column, nullable falg will be undefined for RowID column	2023-02-03 20:49:45 +08:00
Pxl	ca73c60442	[Chore](build) enable ignored-qualifiers check (#16196 ) enable ignored-qualifiers check	2023-02-01 15:15:59 +08:00
yiguolei	79ad74637d	[refactor](remove expr) remove non vectorized Expr and ExprContext related codes (#16136 )	2023-01-24 10:45:35 +08:00
lihangyu	3894de49d2	[Enhancement](topn) support two phase read for topn query (#15642 ) This PR optimize topn query like `SELECT * FROM tableX ORDER BY columnA ASC/DESC LIMIT N`. TopN is is compose of SortNode and ScanNode, when user table is wide like 100+ columns the order by clause is just a few columns.But ScanNode need to scan all data from storage engine even if the limit is very small.This may lead to lots of read amplification.So In this PR I devide TopN query into two phase: 1. The first phase we just need to read `columnA`'s data from storage engine along with an extra RowId column called `__DORIS_ROWID_COL__`.The other columns are pruned from ScanNode. 2. The second phase I put it in the ExchangeNode beacuase it's the central node for topn nodes in the cluster.The ExchangeNode will spawn a RPC to other nodes using the RowIds(sorted and limited from SortNode) read from the first phase and read row by row from storage engine. After the second phase read, Block will contain all the data needed for the query	2023-01-19 10:01:33 +08:00
ElvinWei	76ad599fd7	[enhancement](histogram) optimise aggregate function histogram (#15317 ) This pr mainly to optimize the histogram(👉🏻 https://github.com/apache/doris/pull/14910) aggregation function. Including the following: 1. Support input parameters `sample_rate` and `max_bucket_num` 2. Add UT and regression test 3. Add documentation 4. Optimize function implementation logic Parameter description： - `sample_rate`：Optional. The proportion of sample data used to generate the histogram. The default is 0.2. - `max_bucket_num`：Optional. Limit the number of histogram buckets. The default value is 128. --- Example： ``` MySQL [test]> SELECT histogram(c_float) FROM histogram_test; +-------------------------------------------------------------------------------------------------------------------------------------+ \| histogram(`c_float`) \| +-------------------------------------------------------------------------------------------------------------------------------------+ \| {"sample_rate":0.2,"max_bucket_num":128,"bucket_num":3,"buckets":[{"lower":"0.1","upper":"0.1","count":1,"pre_sum":0,"ndv":1},...]} \| +-------------------------------------------------------------------------------------------------------------------------------------+ MySQL [test]> SELECT histogram(c_string, 0.5, 2) FROM histogram_test; +-------------------------------------------------------------------------------------------------------------------------------------+ \| histogram(`c_string`) \| +-------------------------------------------------------------------------------------------------------------------------------------+ \| {"sample_rate":0.5,"max_bucket_num":2,"bucket_num":2,"buckets":[{"lower":"str1","upper":"str7","count":4,"pre_sum":0,"ndv":3},...]} \| +-------------------------------------------------------------------------------------------------------------------------------------+ ``` Query result description： ``` { "sample_rate": 0.2, "max_bucket_num": 128, "bucket_num": 3, "buckets": [ { "lower": "0.1", "upper": "0.2", "count": 2, "pre_sum": 0, "ndv": 2 }, { "lower": "0.8", "upper": "0.9", "count": 2, "pre_sum": 2, "ndv": 2 }, { "lower": "1.0", "upper": "1.0", "count": 2, "pre_sum": 4, "ndv": 1 } ] } ``` Field description： - sample_rate：Rate of sampling - max_bucket_num：Limit the maximum number of buckets - bucket_num：The actual number of buckets - buckets：All buckets - lower：Upper bound of the bucket - upper：Lower bound of the bucket - count：The number of elements contained in the bucket - pre_sum：The total number of elements in the front bucket - ndv：The number of different values in the bucket > Total number of histogram elements = number of elements in the last bucket(count) + total number of elements in the previous bucket(pre_sum).	2023-01-07 00:50:32 +08:00
lihangyu	9eeb8b8711	[Bugfix](Jsonb) fix jsonb load in unique key model (#14958 ) * [Bugfix](Jsonb) fix jsonb load in unique key model Register JSONB to `replace_load` agg function * fix test	2022-12-09 21:10:14 +08:00
Gabriel	2c42f0a905	[refactor](decimalv3) Refine code for DecimalV3 (#14394 )	2022-11-19 16:57:17 +08:00
xy720	f329d33666	[chore](fix) Fix some spell errors in be's comments. #13452	2022-10-20 08:56:01 +08:00
Gabriel	1ba9e4b568	[Improvement](sort) Reuse memory in sort node (#12921 )	2022-09-28 09:44:35 +08:00
Mingyu Chen	d80b7b9689	[feature-wip](new-scan) support more load situation (#12953 )	2022-09-27 21:48:32 +08:00
Pxl	0ead048b93	[Enhancement](column) remove ColumnString terminating zero and add a data_version for pblock (#12456 ) 1. remove ColumnString terminating zero 2. add a data_version for pblock 3. change EncryptionMode to enum class	2022-09-14 21:25:22 +08:00
Ashin Gau	6d925054de	[feature-wip](parquet-reader) decode parquet time & datetime & decimal (#11845 ) 1. Spark can set the timestamp precision by the following configuration: spark.sql.parquet.outputTimestampType = INT96(NANOS), TIMESTAMP_MICROS, TIMESTAMP_MILLIS DATETIME V1 only keeps the second precision, DATETIME V2 keeps the microsecond precision. 2. If using DECIMAL V2, the BE saves the value as decimal128, and keeps the precision of decimal as (precision=27, scale=9). DECIMAL V3 can maintain the right precision of decimal	2022-08-22 10:15:35 +08:00
zhangstar333	1f30e563a7	[refactor][vectorized] refactor first/last value agg functions (#10661 ) * refactor first and last [refactor][vectorized] refactor first/last value agg functions * add some change * remove first/last about always nullable * remove always nullable and register it * refactor value remove bool null flag * refactor win first last to ptr and pos	2022-07-30 18:38:56 +08:00
Gabriel	babab5d535	[feature-wip] support datetimev2 (#11085 )	2022-07-23 16:07:59 +08:00
camby	00c9455f16	[fix](array-type) fix arrow column to doris array column (#10855 ) * support merge array column, while convert from arrow column to doris array column * fix typo Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>	2022-07-16 11:49:42 +08:00
Mingyu Chen	ba1c527a23	[improvement](arrow) Avoid parse timezone for each datetime value (#10869 ) * [improvement](arrow) Avoid parse timezone for each datetime value Convert arrow batch to doris block is too slow when there are datetime values. Because we call `TimezoneUtils::find_cctz_time_zone` for each values. After modify, the tpch-100 q1 with external table cost from 40s -> 9s Co-authored-by: morningman <morningman@apache.org>	2022-07-15 21:19:36 +08:00
Pxl	6a566ccb74	[Enhancement][Vectorized] add constexpr_loop_match (#10283 )	2022-06-29 14:58:50 +08:00
Gabriel	ca94867b4e	[Feature-wip] add date v2 type (#9916 )	2022-06-26 16:07:56 +08:00
camby	0e404edf54	[improvement] Change array offset type from UInt32 to UInt64 (#10070 ) Now column `Array<T>` contains column `offsets` and `data`, and type of column `offsets` is UInt32 now. If we call array_union to merge arrays repeatedly, the size of array may overflow. So we need to extend it before `Array Data Type` release.	2022-06-19 10:24:08 +08:00
yinzhijian	19bc14cf8d	[feature-wip](array-type) Add array type support for vectorized parquet-orc scanner (#9856 ) Only support one level array now. for example: - nullable(array(nullable(tinyint))) is support. - nullable(array(nullable(array(xx))) is not support.	2022-06-09 12:11:47 +08:00
HappenLee	35f99faa0a	[Bug][Vectorized] fix core dump on vcase_expr::close (#9893 ) Co-authored-by: lihaopeng <lihaopeng@baidu.com>	2022-06-01 08:05:09 +08:00
jacktengg	f4dd3bf013	[bugfix] fix memleak in olapscannode(#9736 )	2022-05-26 15:06:54 +08:00
yinzhijian	bee5c2f8aa	[feature-wip](parquet-vec) Support parquet scanner in vectorized engine (#9433 )	2022-05-17 09:37:17 +08:00
Pxl	0ee53be883	[fix][improvement](runtime-filter) fix string type length limit error && add runtime filter decimal support (#8282 )	2022-03-03 22:44:49 +08:00
Pxl	668188b91f	[improvement][vectorized] support es node predicate peel (#8174 )	2022-02-26 17:02:54 +08:00
Pxl	90a8ca808a	[Bug][Vectorized] fix bitmap_min(empty) not return null (#8190 )	2022-02-24 11:06:27 +08:00
Pxl	87e555c27d	[Feature][Vectorized] support function json_array/json_object/json_quote (#8158 )	2022-02-22 09:29:56 +08:00
Pxl	143c4085ee	[Feature][Vectorized] support aggregate function ndv()/approx_count_distinct() (#8044 )	2022-02-16 14:30:13 +08:00
HappenLee	e1d7233e9c	[feature](vectorization) Support Vectorized Exec Engine In Doris (#7785 ) # Proposed changes Issue Number: close #6238 Co-authored-by: HappenLee <happenlee@hotmail.com> Co-authored-by: stdpain <34912776+stdpain@users.noreply.github.com> Co-authored-by: Zhengguo Yang <yangzhgg@gmail.com> Co-authored-by: wangbo <506340561@qq.com> Co-authored-by: emmymiao87 <522274284@qq.com> Co-authored-by: Pxl <952130278@qq.com> Co-authored-by: zhangstar333 <87313068+zhangstar333@users.noreply.github.com> Co-authored-by: thinker <zchw100@qq.com> Co-authored-by: Zeno Yang <1521564989@qq.com> Co-authored-by: Wang Shuo <wangshuo128@gmail.com> Co-authored-by: zhoubintao <35688959+zbtzbtzbt@users.noreply.github.com> Co-authored-by: Gabriel <gabrielleebuaa@gmail.com> Co-authored-by: xinghuayu007 <1450306854@qq.com> Co-authored-by: weizuo93 <weizuo@apache.org> Co-authored-by: yiguolei <guoleiyi@tencent.com> Co-authored-by: anneji-dev <85534151+anneji-dev@users.noreply.github.com> Co-authored-by: awakeljw <993007281@qq.com> Co-authored-by: taberylyang <95272637+taberylyang@users.noreply.github.com> Co-authored-by: Cui Kaifeng <48012748+azurenake@users.noreply.github.com> ## Problem Summary: ### 1. Some code from clickhouse ClickHouse is an excellent implementation of the vectorized execution engine database, so here we have referenced and learned a lot from its excellent implementation in terms of data structure and function implementation. We are based on ClickHouse v19.16.2.2 and would like to thank the ClickHouse community and developers. The following comment has been added to the code from Clickhouse, eg: // This file is copied from // https://github.com/ClickHouse/ClickHouse/blob/master/src/Interpreters/AggregationCommon.h // and modified by Doris ### 2. Support exec node and query: * vaggregation_node * vanalytic_eval_node * vassert_num_rows_node * vblocking_join_node * vcross_join_node * vempty_set_node * ves_http_scan_node * vexcept_node * vexchange_node * vintersect_node * vmysql_scan_node * vodbc_scan_node * volap_scan_node * vrepeat_node * vschema_scan_node * vselect_node * vset_operation_node * vsort_node * vunion_node * vhash_join_node You can run exec engine of SSB/TPCH and 70% TPCDS stand query test set. ### 3. Data Model Vec Exec Engine Support Dup/Agg/Unq table, Support Block Reader Vectorized. Segment Vec is working in process. ### 4. How to use 1. Set the environment variable `set enable_vectorized_engine = true; `(required) 2. Set the environment variable `set batch_size = 4096; ` (recommended) ### 5. Some diff from origin exec engine https://github.com/doris-vectorized/doris-vectorized/issues/294 ## Checklist(Required) 1. Does it affect the original behavior: (No) 2. Has unit tests been added: (Yes) 3. Has document been added or modified: (No) 4. Does it need to update dependencies: (No) 5. Are there any changes that cannot be rolled back: (Yes)	2022-01-18 10:07:15 +08:00

37 Commits