doris

Author	SHA1	Message	Date
TengJianPing	2bdfaac609	[fix](ubsan) fix ubsan errors (#19658 ) ixu ubsan errors: doris/be/src/util/string_parser.hpp:275:58: runtime error: signed integer overflow: 2147483647 + 1 cannot be represented in type 'int' doris/be/src/vec/functions/functions_comparison.h:214:51: runtime error: addition of unsigned offset to 0x7fea6c6b7010 overflowed to 0x7fea6c6b700c doris/be/src/vec/functions/multiply.cpp:67:50: runtime error: signed integer overflow: 1295699415680000000 * 0x0000000000015401d0a4cd4890a77700 cannot be represented in type '__int128 doris/be/src/vec/aggregate_functions/aggregate_function_percentile_approx.h:445:73: runtime error: addition of unsigned offset to 0x7feca3343d10 overflowed to 0x7feca3343d08 doris/be/src/exec/schema_scanner/schema_tables_scanner.cpp:330:24: run	2023-05-17 09:32:03 +08:00
Pxl	4eb2604789	[Bug](function) fix function define of Retention inconsist and change some static_cast to assert cast (#19455 ) 1. fix function define of `Retention` inconsist, this function return tinyint on `FE` and return uint8 on `BE` 2. make assert_cast support cast to derived 3. change some static cast to assert cast 4. support sum(bool)/avg(bool)	2023-05-15 11:50:02 +08:00
zhangstar333	1d1b2f98c3	[refactor](function) let agg functions exception safety (#19109 )	2023-05-11 10:17:11 +08:00
Xinyi Zou	f2a34dde52	[fix](memory) Fix memory leak due to incorrect block reuse of AggregateFunctionSortData #19214	2023-05-05 14:29:34 +08:00
Xinyi Zou	f23c93b3c6	[fix](memory) Fix AggFunc memory leak due to incorrect destroy (#19126 )	2023-04-27 14:58:32 +08:00
Xinyi Zou	98a975b013	[fix](memory) Fix SchemaChange memory leak due to incorrect aggfunc destroy (#19130 )	2023-04-27 14:44:00 +08:00
Yongqiang YANG	8412571030	[fix](memleak) avoid memleak due to race condition (#19071 )	2023-04-27 14:22:09 +08:00
zclllyybb	e1651bfea5	[bugfix](aggregate_function) Fix wrong registration for percentile_approx #19070	2023-04-26 16:17:46 +08:00
Jerry Hu	c4e469c82c	[feature](agg) Support spill to disk in aggregation (#18051 )	2023-04-20 18:59:08 +08:00
Adonis Ling	e412dd12e8	[chore](build) Use include-what-you-use to optimize includes (PART II) (#18761 ) Currently, there are some useless includes in the codebase. We can use a tool named include-what-you-use to optimize these includes. By using a strict include-what-you-use policy, we can get lots of benefits from it.	2023-04-19 23:11:48 +08:00
amory	564446e52f	[Refact](type system) refact serde for type system and pb serde impl (#18627 )	2023-04-18 14:13:56 +08:00
Jerry Hu	3de4d64657	[chore](hashtable) Use doris' Allocator to replace std::allocator in phmap (#18735 )	2023-04-18 09:58:28 +08:00
Pxl	76d76f672c	[Chore](build) enchancement for backend build time usage (#18344 )	2023-04-06 11:13:21 +08:00
yiguolei	a77921d767	[refactor](typesystem) remove unused rpc common file and using function rpc (#18270 ) rpc common is duplicate, all its method is included in function rpc. So that I remove it. get_field_type is never used, remove it. --------- Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-03-31 18:13:25 +08:00
TengJianPing	3b04d42779	[fix](bitmap) fix bug: orthogonal_bitmap_union_count coredump when arg is nullable (#18182 ) Query cause be cordump: select ORTHOGONAL_BITMAP_UNION_COUNT( cast(null as bitmap)) from t;	2023-03-30 09:31:58 +08:00
Pxl	45ad297a1d	[Enchancement](function) change aggregate function creator to return AggregateFunctionPtr (#18025 ) change creator_type to return AggregateFunctionPtr. remove some function and use creator directly.	2023-03-26 11:41:34 +08:00
yiguolei	7ae51c856e	[refactor](unify exception) unify exception definition and error code (#18006 ) * [refactor](unify exception) unify exception definition and error code --------- Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-03-25 12:41:07 +08:00
hqx871	1999cccde9	[feature](array-type) Unique table support array value (#17024 ) Unique table support array value --------- Co-authored-by: huangqixiang.871 <huangqixiang.871@bytedance.com>	2023-03-24 10:18:59 +08:00
Gabriel	5445a86570	[Bug](array_product) Fix array_product for ARRAY<DECIMAL> (#18014 )	2023-03-23 20:29:50 +08:00
HappenLee	3661ca4510	[Bug](regression-test) Fix the collect_set/list size limit result failed (#17963 ) fix regression test failed in fuzzy mode:collect_set	2023-03-20 21:00:41 +08:00
zhangstar333	e359e412e1	[vectorized](udaf) fix java udaf meet error of std::bad_alloc (#17848 ) Now if the user code of java udaf throws exception, because c++ code of agg function nobody could deal with it, so maybe get error of std::bad_alloc	2023-03-19 11:52:15 +08:00
zhbinbin	ff9e03e2bf	[Feature](add bitmap udaf) add the bitmap intersection and difference set for mixed calculation of udaf (#15588 ) * Add the bitmap intersection and difference set for mixed calculation of udaf Co-authored-by: zhangbinbin05 <zhangbinbin05@baidu.com>	2023-03-14 20:40:37 +08:00
spaces-x	5b39fa9843	[Feature](vec)(quantile_state): support quantile state in vectorized engine (#16562 ) * [Feature](vectorized)(quantile_state): support vectorized quantile state functions 1. now quantile column only support not nullable 2. add up some regression test cases 3. set default enable_quantile_state_type = true --------- Co-authored-by: spaces-x <weixiang06@meituan.com>	2023-03-14 10:54:04 +08:00
Pxl	16fc3a0e22	[Chore](compile) remove some unused static on inline function to reduce compile time (#17603 ) remove some unused static on inline function to reduce compile time	2023-03-13 11:11:59 +08:00
zhangstar333	4ef46159ae	[vectorized](udaf) support array type for java-udaf (#17351 )	2023-03-09 11:30:07 +08:00
Pxl	65b8dfc7ff	[Enchancement](function) Inline some aggregate function && remove nullable combinator (#17328 ) 1. Inline some aggregate function 2. remove nullable combinator	2023-03-09 10:39:04 +08:00
starocean999	2b6d971c2f	[fix](nereids)fix first_value/lead/lag window function bug in nereids (#17315 ) * [fix](nereids)fix first_value/lead/lag window function bug in nereids * add more test * add order by to fix test case * fix test cases	2023-03-09 09:35:27 +08:00
ElvinWei	bd5ed2b0c2	[enhancement](histogram) optimize the histogram bucketing strategy, etc (#17264 ) * optimize the histogram bucketing strategy, etc * fix p0 regression of histogram	2023-03-08 20:12:05 +08:00
Pxl	e2ac06d6d6	[Chore](execution) change PipelineTaskState to enum class && remove some row-based code (#17300 ) 1. change PipelineTaskState to enum class 2. remove some row-based code on FoldConstantExecutor::_get_result 3. reduce memcpy on minmax runtime filter function(Now we can guarantee that the input data is aligned) 4. add Wunused-template check, and remove some unused function, change some static function to inline function.	2023-03-08 12:41:15 +08:00
yiguolei	4692d6764c	[refactor](remove string val) remove string val structure, it is same with string ref (#17461 ) remove stringval, decimalv2val, bigintval	2023-03-08 10:42:20 +08:00
starocean999	27352afdf6	[fix](fe)support multi distinct group_concat (#17237 ) * [fix](fe)support multi distinct group_concat * update based on comments	2023-03-02 17:53:13 +08:00
Pxl	527eb5b059	[Enchancement](function) nullable inline refactor of min_max_by/bitmap && add register_functio… (#17228 ) 1. nullable inline refactor of min_max_by/bitmap/group_concat/histogram/topn 2. add register_function_both method 3. add datetimev2 type creator of min_max_by 4. remove uint16/32/64 in FOR_INTEGER_TYPES	2023-03-02 00:00:01 +08:00
zhangstar333	1dd2a41e38	[vectorized](bug) fix window function can't handle first row of beyond (#17084 ) Issue Number: close #16845	2023-02-28 17:30:23 +08:00
奕冷	c0360f80bb	[enhancement](aggregate-function) enhance aggregate funtion collect and add group_array aliases (#15339 ) Enhance aggregate function `collect_set` and `collect_list` to support optional `max_size` param, which enables to limit the number of elements in result array.	2023-02-27 14:22:30 +08:00
Pxl	c4edea5936	[Enchancement](function) refact and optimize some function register (#16955 ) refact and optimize some function register	2023-02-24 10:05:11 +08:00
ElvinWei	f32cd2c123	[fix](statistics) fix a problem with histogram statistics collection parameters (#16918 ) 1. Fixed a problem with histogram statistics collection parameters. 2. Solved the problem that it takes a long time to collect histogram statistics. TODO: Optimize histogram statistics sampling method and make the sampling parameters effective. The problem is that the histogram function works as expected in the single-node test, but doesn't work in the multi-node test. In addition, the performance of the current support sampling to collect histogram is low, resulting in a large time consumption when collecting histogram information. Fixed the parameter issue and temporarily removed support for sampling to speed up the collection of histogram statistics. Will next support sampling to collect histogram information.	2023-02-20 16:33:18 +08:00
Pxl	2bc014d83a	[Enchancement](function) remove unused params on aggregate function (#16886 ) remove unused params on aggregate function	2023-02-20 11:08:45 +08:00
HappenLee	24ef60b491	[Opt](exec) opt aggreate function performance in nullable column	2023-02-16 22:26:12 +08:00
abmdocrt	41947c73eb	[Feature](array-function) Support array functions for nested type datev2 and datetimev2 (#16382 )	2023-02-08 12:51:07 +08:00
Pxl	5e4bb98900	[Chore](build) enable -Wpedantic and update lowest gcc version to 11.1 (#16290 ) enable -Wpedantic and update lowest gcc version to 11.1	2023-02-03 11:28:48 +08:00
abmdocrt	ca7eb94f23	[improvement](agg-function) Increase the limit maximum number of agg function parameters (#15924 )	2023-01-31 21:03:50 +08:00
yiguolei	e49766483e	[refactor](remove unused code) remove many xxxVal structure (#16143 ) remove many xxxVal structure remove BetaRowsetWriter::_add_row remove anyval_util.cpp remove non-vectorized geo functions remove non-vectorized like predicate Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-01-28 14:17:43 +08:00
yiguolei	615a5e7b51	[refactor](remove non vec code) remove non vec functions and AggregateInfo (#16138 ) Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-01-25 12:53:05 +08:00
yiguolei	79ad74637d	[refactor](remove expr) remove non vectorized Expr and ExprContext related codes (#16136 )	2023-01-24 10:45:35 +08:00
ZhaoChangle	199d7d3be8	[Refactor]Merged string_value into string_ref (#15925 )	2023-01-22 16:39:23 +08:00
ElvinWei	36590da24b	[fix](regression p0) add the alias function hist to histogram and fix p0 (#15708 ) add the alias function hist to histogram and fix p0	2023-01-08 11:31:23 +08:00
ElvinWei	76ad599fd7	[enhancement](histogram) optimise aggregate function histogram (#15317 ) This pr mainly to optimize the histogram(👉🏻 https://github.com/apache/doris/pull/14910) aggregation function. Including the following: 1. Support input parameters `sample_rate` and `max_bucket_num` 2. Add UT and regression test 3. Add documentation 4. Optimize function implementation logic Parameter description： - `sample_rate`：Optional. The proportion of sample data used to generate the histogram. The default is 0.2. - `max_bucket_num`：Optional. Limit the number of histogram buckets. The default value is 128. --- Example： ``` MySQL [test]> SELECT histogram(c_float) FROM histogram_test; +-------------------------------------------------------------------------------------------------------------------------------------+ \| histogram(`c_float`) \| +-------------------------------------------------------------------------------------------------------------------------------------+ \| {"sample_rate":0.2,"max_bucket_num":128,"bucket_num":3,"buckets":[{"lower":"0.1","upper":"0.1","count":1,"pre_sum":0,"ndv":1},...]} \| +-------------------------------------------------------------------------------------------------------------------------------------+ MySQL [test]> SELECT histogram(c_string, 0.5, 2) FROM histogram_test; +-------------------------------------------------------------------------------------------------------------------------------------+ \| histogram(`c_string`) \| +-------------------------------------------------------------------------------------------------------------------------------------+ \| {"sample_rate":0.5,"max_bucket_num":2,"bucket_num":2,"buckets":[{"lower":"str1","upper":"str7","count":4,"pre_sum":0,"ndv":3},...]} \| +-------------------------------------------------------------------------------------------------------------------------------------+ ``` Query result description： ``` { "sample_rate": 0.2, "max_bucket_num": 128, "bucket_num": 3, "buckets": [ { "lower": "0.1", "upper": "0.2", "count": 2, "pre_sum": 0, "ndv": 2 }, { "lower": "0.8", "upper": "0.9", "count": 2, "pre_sum": 2, "ndv": 2 }, { "lower": "1.0", "upper": "1.0", "count": 2, "pre_sum": 4, "ndv": 1 } ] } ``` Field description： - sample_rate：Rate of sampling - max_bucket_num：Limit the maximum number of buckets - bucket_num：The actual number of buckets - buckets：All buckets - lower：Upper bound of the bucket - upper：Lower bound of the bucket - count：The number of elements contained in the bucket - pre_sum：The total number of elements in the front bucket - ndv：The number of different values in the bucket > Total number of histogram elements = number of elements in the last bucket(count) + total number of elements in the previous bucket(pre_sum).	2023-01-07 00:50:32 +08:00
zhangstar333	b50448d5c4	[vectorized](udaf) fix udaf result is null when has multiple aggs (#15554 )	2023-01-03 16:03:43 +08:00
AlexYue	fe562bc3e7	[Bug](Agg) fix crash when encountering not supported agg function like last_value(bitmap) (#15257 ) The former logic inside aggregate_function_window.cpp would shutdown BE once encountering agg function with complex type like BITMAP. This pr makes it don't crash and would return one more concrete error message which tells the unsupported function signature to user.	2022-12-23 14:23:21 +08:00
ElvinWei	754fceafaf	[feature-wip](statistics) add aggregate function histogram and collect histogram statistics (#14910 ) Histogram statistics Currently doris collects statistics, but no histogram data, and by default the optimizer assumes that the different values of the columns are evenly distributed. This calculation can be problematic when the data distribution is skewed. So this pr implements the collection of histogram statistics. For columns containing data skew columns (columns with unevenly distributed data in the column), histogram statistics enable the optimizer to generate more accurate estimates of cardinality for filtering or join predicates involving these columns, resulting in a more precise execution plan. The optimization of the execution plan by histogram is mainly in two aspects: the selection of where condition and the selection of join order. The selection principle of the where condition is relatively simple: the histogram is used to calculate the selection rate of each predicate, and the filter with higher selection rate is preferred. The selection of join order is based on the estimation of the number of rows in the join result. In the case of uneven data distribution in the join condition columns, histogram can greatly improve the accuracy of the prediction of the number of rows in the join result. At the same time, if the number of rows of a bucket in one of the columns is 0, you can mark it and directly skip the bucket in the subsequent join process to improve efficiency. --- Histogram statistics are mainly collected by the histogram aggregation function, which is used as follows: Syntax ```SQL histogram(expr) ``` > The histogram function is used to describe the distribution of the data. It uses an "equal height" bucking strategy, and divides the data into buckets according to the value of the data. It describes each bucket with some simple data, such as the number of values that fall in the bucket. It is mainly used by the optimizer to estimate the range query. example ``` MySQL [test]> select histogram(login_time) from dev_table; +------------------------------------------------------------------------------------------------------------------------------+ \| histogram(`login_time`) \| +------------------------------------------------------------------------------------------------------------------------------+ \| {"bucket_size":5,"buckets":[{"lower":"2022-09-21 17:30:29","upper":"2022-09-21 22:30:29","count":9,"pre_sum":0,"ndv":1},...]}\| +------------------------------------------------------------------------------------------------------------------------------+ ``` description ```JSON { "bucket_size": 5, "buckets": [ { "lower": "2022-09-21 17:30:29", "upper": "2022-09-21 22:30:29", "count": 9, "pre_sum": 0, "ndv": 1 }, { "lower": "2022-09-22 17:30:29", "upper": "2022-09-22 22:30:29", "count": 10, "pre_sum": 9, "ndv": 1 }, { "lower": "2022-09-23 17:30:29", "upper": "2022-09-23 22:30:29", "count": 9, "pre_sum": 19, "ndv": 1 }, { "lower": "2022-09-24 17:30:29", "upper": "2022-09-24 22:30:29", "count": 9, "pre_sum": 28, "ndv": 1 }, { "lower": "2022-09-25 17:30:29", "upper": "2022-09-25 22:30:29", "count": 9, "pre_sum": 37, "ndv": 1 } ] } ``` TODO: - histogram func supports parameter and sample statistics (It's got another pr) - use histogram statistics - add p0 regression	2022-12-22 16:42:17 +08:00

1 2 3 4

152 Commits