doris

Author	SHA1	Message	Date
Pxl	5e4bb98900	[Chore](build) enable -Wpedantic and update lowest gcc version to 11.1 (#16290 )	2023-02-03 11:28:48 +08:00
abmdocrt	ca7eb94f23	[improvement](agg-function) Increase the limit maximum number of agg function parameters (#15924 )	2023-01-31 21:03:50 +08:00
yiguolei	e49766483e	[refactor](remove unused code) remove many xxxVal structure (#16143 ) remove many xxxVal structure remove BetaRowsetWriter::_add_row remove anyval_util.cpp remove non-vectorized geo functions remove non-vectorized like predicate Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-01-28 14:17:43 +08:00
yiguolei	615a5e7b51	[refactor](remove non vec code) remove non vec functions and AggregateInfo (#16138 ) Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-01-25 12:53:05 +08:00
yiguolei	79ad74637d	[refactor](remove expr) remove non vectorized Expr and ExprContext related codes (#16136 )	2023-01-24 10:45:35 +08:00
ZhaoChangle	199d7d3be8	[Refactor]Merged string_value into string_ref (#15925 )	2023-01-22 16:39:23 +08:00
ElvinWei	36590da24b	[fix](regression p0) add the alias function hist to histogram and fix p0 (#15708 )	2023-01-08 11:31:23 +08:00
ElvinWei	76ad599fd7	[enhancement](histogram) optimise aggregate function histogram (#15317 ) This pr mainly to optimize the histogram(👉🏻 https://github.com/apache/doris/pull/14910) aggregation function. Including the following: 1. Support input parameters `sample_rate` and `max_bucket_num` 2. Add UT and regression test 3. Add documentation 4. Optimize function implementation logic Parameter description： - `sample_rate`：Optional. The proportion of sample data used to generate the histogram. The default is 0.2. - `max_bucket_num`：Optional. Limit the number of histogram buckets. The default value is 128. --- Example： ``` MySQL [test]> SELECT histogram(c_float) FROM histogram_test; +-------------------------------------------------------------------------------------------------------------------------------------+ \| histogram(`c_float`) \| +-------------------------------------------------------------------------------------------------------------------------------------+ \| {"sample_rate":0.2,"max_bucket_num":128,"bucket_num":3,"buckets":[{"lower":"0.1","upper":"0.1","count":1,"pre_sum":0,"ndv":1},...]} \| +-------------------------------------------------------------------------------------------------------------------------------------+ MySQL [test]> SELECT histogram(c_string, 0.5, 2) FROM histogram_test; +-------------------------------------------------------------------------------------------------------------------------------------+ \| histogram(`c_string`) \| +-------------------------------------------------------------------------------------------------------------------------------------+ \| {"sample_rate":0.5,"max_bucket_num":2,"bucket_num":2,"buckets":[{"lower":"str1","upper":"str7","count":4,"pre_sum":0,"ndv":3},...]} \| +-------------------------------------------------------------------------------------------------------------------------------------+ ``` Query result description： ``` { "sample_rate": 0.2, "max_bucket_num": 128, "bucket_num": 3, 
"buckets": [ { "lower": "0.1", "upper": "0.2", "count": 2, "pre_sum": 0, "ndv": 2 }, { "lower": "0.8", "upper": "0.9", "count": 2, "pre_sum": 2, "ndv": 2 }, { "lower": "1.0", "upper": "1.0", "count": 2, "pre_sum": 4, "ndv": 1 } ] } ``` Field description： - sample_rate：Rate of sampling - max_bucket_num：Limit the maximum number of buckets - bucket_num：The actual number of buckets - buckets：All buckets - lower：Lower bound of the bucket - upper：Upper bound of the bucket - count：The number of elements contained in the bucket - pre_sum：The total number of elements in the preceding buckets - ndv：The number of distinct values in the bucket > Total number of histogram elements = number of elements in the last bucket (count) + total number of elements in the preceding buckets (pre_sum).	2023-01-07 00:50:32 +08:00
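The result format described in the commit message for #15317 can be checked with a short script. The following is a minimal Python sketch, not code from the Doris repository: it parses the sample result document quoted in that message and applies the stated rule (total = `count` + `pre_sum` of the last bucket). The helper names `total_elements` and `pre_sums_consistent` are hypothetical and exist only for this illustration.

```python
import json

# Sample result copied from the commit message for #15317.
HISTOGRAM_JSON = """
{
  "sample_rate": 0.2,
  "max_bucket_num": 128,
  "bucket_num": 3,
  "buckets": [
    {"lower": "0.1", "upper": "0.2", "count": 2, "pre_sum": 0, "ndv": 2},
    {"lower": "0.8", "upper": "0.9", "count": 2, "pre_sum": 2, "ndv": 2},
    {"lower": "1.0", "upper": "1.0", "count": 2, "pre_sum": 4, "ndv": 1}
  ]
}
"""


def total_elements(histogram: dict) -> int:
    """Total elements = count of the last bucket + its pre_sum."""
    buckets = histogram["buckets"]
    if not buckets:
        return 0
    last = buckets[-1]
    return last["count"] + last["pre_sum"]


def pre_sums_consistent(histogram: dict) -> bool:
    """Check that each pre_sum equals the running total of earlier counts."""
    running = 0
    for bucket in histogram["buckets"]:
        if bucket["pre_sum"] != running:
            return False
        running += bucket["count"]
    return True


histogram = json.loads(HISTOGRAM_JSON)
print(total_elements(histogram))       # 6
print(pre_sums_consistent(histogram))  # True
```

Because `pre_sum` is a running total, a consumer can read the overall element count from the last bucket alone without summing every bucket.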
zhangstar333	b50448d5c4	[vectorized](udaf) fix udaf result is null when has multiple aggs (#15554 )	2023-01-03 16:03:43 +08:00
AlexYue	fe562bc3e7	[Bug](Agg) fix crash when encountering not supported agg function like last_value(bitmap) (#15257 ) The former logic inside aggregate_function_window.cpp would shut down BE on encountering an agg function with a complex type like BITMAP. This PR prevents the crash and returns a more concrete error message that tells the user which function signature is unsupported.	2022-12-23 14:23:21 +08:00
ElvinWei	754fceafaf	[feature-wip](statistics) add aggregate function histogram and collect histogram statistics (#14910 ) Histogram statistics Currently Doris collects statistics, but no histogram data, and by default the optimizer assumes that the distinct values of a column are evenly distributed. This assumption can be problematic when the data distribution is skewed. So this PR implements the collection of histogram statistics. For skewed columns (columns with unevenly distributed data), histogram statistics enable the optimizer to generate more accurate cardinality estimates for filter or join predicates involving these columns, resulting in a more precise execution plan. The histogram improves the execution plan mainly in two ways: the selection of where conditions and the selection of join order. The selection principle for where conditions is relatively simple: the histogram is used to calculate the selection rate of each predicate, and the filter with the higher selection rate is preferred. The selection of join order is based on the estimated number of rows in the join result. When the data in the join condition columns is unevenly distributed, the histogram can greatly improve the accuracy of the predicted number of rows in the join result. At the same time, if the number of rows of a bucket in one of the columns is 0, it can be marked and skipped directly in the subsequent join process to improve efficiency. --- Histogram statistics are mainly collected by the histogram aggregation function, which is used as follows: Syntax ```SQL histogram(expr) ``` > The histogram function is used to describe the distribution of the data. It uses an "equal height" bucketing strategy, and divides the data into buckets according to the value of the data. It describes each bucket with some simple data, such as the number of values that fall in the bucket. 
It is mainly used by the optimizer to estimate the range query. example ``` MySQL [test]> select histogram(login_time) from dev_table; +------------------------------------------------------------------------------------------------------------------------------+ \| histogram(`login_time`) \| +------------------------------------------------------------------------------------------------------------------------------+ \| {"bucket_size":5,"buckets":[{"lower":"2022-09-21 17:30:29","upper":"2022-09-21 22:30:29","count":9,"pre_sum":0,"ndv":1},...]}\| +------------------------------------------------------------------------------------------------------------------------------+ ``` description ```JSON { "bucket_size": 5, "buckets": [ { "lower": "2022-09-21 17:30:29", "upper": "2022-09-21 22:30:29", "count": 9, "pre_sum": 0, "ndv": 1 }, { "lower": "2022-09-22 17:30:29", "upper": "2022-09-22 22:30:29", "count": 10, "pre_sum": 9, "ndv": 1 }, { "lower": "2022-09-23 17:30:29", "upper": "2022-09-23 22:30:29", "count": 9, "pre_sum": 19, "ndv": 1 }, { "lower": "2022-09-24 17:30:29", "upper": "2022-09-24 22:30:29", "count": 9, "pre_sum": 28, "ndv": 1 }, { "lower": "2022-09-25 17:30:29", "upper": "2022-09-25 22:30:29", "count": 9, "pre_sum": 37, "ndv": 1 } ] } ``` TODO: - histogram func supports parameter and sample statistics (It's got another pr) - use histogram statistics - add p0 regression	2022-12-22 16:42:17 +08:00
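The "equal height" strategy and the range estimation mentioned in the commit message for #14910 can be illustrated with a small sketch. This is an assumption-laden Python illustration, not the Doris implementation: the function names `equal_height_histogram` and `estimate_rows_at_most` are hypothetical, equal values are kept in a single bucket, and the estimator assumes numeric values spread uniformly inside a partially covered bucket.

```python
from collections import Counter


def equal_height_histogram(values, max_bucket_num):
    """Group sorted distinct values into buckets of roughly equal row count."""
    if not values or max_bucket_num <= 0:
        return []
    target = -(-len(values) // max_bucket_num)  # ceil(n / max_bucket_num)
    buckets, current, pre_sum = [], None, 0
    for value, count in sorted(Counter(values).items()):
        if current is None:
            current = {"lower": value, "upper": value,
                       "count": 0, "pre_sum": pre_sum, "ndv": 0}
        current["upper"] = value
        current["count"] += count
        current["ndv"] += 1
        if current["count"] >= target:  # bucket is full, close it
            buckets.append(current)
            pre_sum += current["count"]
            current = None
    if current is not None:  # leftover rows form the last bucket
        buckets.append(current)
    return buckets


def estimate_rows_at_most(buckets, x):
    """Estimate how many rows satisfy value <= x (assumed uniform per bucket)."""
    total = 0.0
    for b in buckets:
        if b["upper"] <= x:
            total += b["count"]  # bucket lies entirely within the range
        elif b["lower"] <= x:
            # Partial overlap: linear interpolation inside the bucket.
            total += b["count"] * (x - b["lower"]) / (b["upper"] - b["lower"])
    return total


data = [1, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9]
buckets = equal_height_histogram(data, 3)
for b in buckets:
    print(b)
print(estimate_rows_at_most(buckets, 6))
```

With skewed input such as the repeated `1` values above, the buckets cover value ranges of different widths while holding the same number of rows, which is what lets an estimator avoid the even-distribution assumption described in the commit message.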
Gabriel	0b6054a4ce	[Bug](decimalv3) Fix wrong argument for min_by/max_by (#15153 )	2022-12-19 10:15:28 +08:00
Pxl	1b07e3e18b	[Chore](refactor) some modifications to pass the C++20 standard (#15042 )	2022-12-17 14:41:07 +08:00
zhengshengjun	1f56279fd8	[Vectorized] Use SIMD to skip batches of null data in aggregation (#10392 )	2022-12-12 23:40:31 +08:00
plat1ko	f3aea7f0f0	[Enhancement](status) Unify error code and enable customized err msg for BE internal errors (#14744 )	2022-12-11 23:33:18 +08:00
Adonis Ling	ec2539e2a3	[chore](macOS) Resolve the issue with missing python program (#14864 )	2022-12-07 15:30:12 +08:00
Gabriel	9dd1d989e8	[test](decimalv3) add regression test cases for decimalv3 (#14672 )	2022-12-01 15:18:40 +08:00
Gabriel	3e8b3658c7	[feature-wip](decimalv3) Support basic agg and arithmetic operations for decimal v3 (#14513 )	2022-11-29 15:12:41 +08:00
abmdocrt	529bdfb153	[Fix](function) Fix retention function return wrong value type (#14552 ) MySQL [db]> SELECT SUM(a.r[1]) as active_user_num, SUM(a.r[2]) as active_user_num_1day, SUM(a.r[3]) as active_user_num_3day, SUM(a.r[4]) as active_user_num_7day FROM ( SELECT user_id, retention( day = '2022-11-01', day = '2022-11-02', day = '2022-11-04', day = '2022-11-07') as r FROM login_event WHERE (day >= '2022-11-01') AND (day <= '2022-11-21') GROUP BY user_id ) a; ERROR 1105 (HY000): errCode = 2, detailMessage = sum requires a numeric parameter: sum(%element_extract%(a.r, 1))	2022-11-28 15:56:18 +08:00
zy-kkk	59b31a03c4	[Improvement](agg function) support group_bit_and/group_bit_or/group_bit_xor functions (#14386 )	2022-11-24 16:46:42 +08:00
zhangstar333	b04ec41c1d	[Vectorized](udaf) fix java-udaf couldn't get jar core dump (#14393 )	2022-11-22 20:49:02 +08:00
Gabriel	1ec7f45fb6	[Bug](avg) Fix `avg` for bigint (#14433 )	2022-11-22 10:29:59 +08:00
Gabriel	2c42f0a905	[refactor](decimalv3) Refine code for DecimalV3 (#14394 )	2022-11-19 16:57:17 +08:00
zhangstar333	70cc725649	[Vectorized](function) support avg_weighted/percentile_array/topn_wei… (#14209 ) * [Vectorized](function) support avg_weighted/percentile_array/topn_weighted functions * update add to stringRef	2022-11-15 16:38:38 +08:00
abmdocrt	6cc5ae077e	[Improvement](Sequence function) Capitalize const variables (#14270 )	2022-11-15 10:41:53 +08:00
xy720	035657c5a1	[typo](comment) Fix a lot of spell errors in be comments (#14208 ) fix typos.	2022-11-12 16:06:15 +08:00
abmdocrt	b6ba654f5b	[Feature](Sequence) Support sequence_match and sequence_count functions (#13785 )	2022-11-11 13:38:45 +08:00
Zhengguo Yang	12652ebb0e	[UDF](java udf) using config to enable java udf instead of macro at compile time (#14062 )	2022-11-11 09:03:52 +08:00
zhangstar333	df622d8b7d	[Bug](udf) fix java-udaf process string type error and add some tests (#14106 )	2022-11-10 09:30:57 +08:00
xy720	d183199319	[Bug](array-type) Fix array product calculate decimal type return wrong result (#13794 )	2022-11-03 17:26:34 +08:00
HappenLee	fbc8b7311f	[Opt](function) opt the function of ndv (#13887 )	2022-11-02 22:21:20 +08:00
zhangstar333	374303186c	[Vectorized](function) support topn_array function (#13869 )	2022-11-02 19:49:23 +08:00
Gabriel	287a739510	[javaudf](string) Fix string format in java udf (#13854 )	2022-11-01 21:25:12 +08:00
Pxl	2fab0c45c7	[Feature](runtime-filter) add runtime filter breaking change adapt (#13246 )	2022-10-28 10:59:28 +08:00
xy720	f329d33666	[chore](fix) Fix some spell errors in be's comments. #13452	2022-10-20 08:56:01 +08:00
Adonis Ling	125def5102	[enhancement](macOS M1) Support building from source on macOS (M1) (#13195 ) # Proposed changes This PR fixed lots of issues when building from source on macOS with Apple M1 chip. ## ATTENTION The job for supporting macOS with Apple M1 chip is too big and there are lots of unresolved issues during runtime: 1. Some errors with memory tracker occur when BE (RELEASE) starts. 2. Some UT cases fail. ... Temporarily, the following changes are made on macOS to start BE successfully. 1. Disable memory tracker. 2. Use tcmalloc instead of jemalloc. This PR kicks off the job. Guys who are interested in this job can continue to fix these runtime issues. ## Use case ```shell ./build.sh -j 8 --be --clean cd output/be/bin ulimit -n 60000 ./start_be.sh --daemon ``` ## Something else It takes around _10+_ minutes to build BE (with prebuilt third-parties) on macOS with M1 chip. We will improve the development experience on macOS greatly when we finish the adaptation job.	2022-10-18 13:10:13 +08:00
luozenglin	207f4e559e	[feature](agg) support `group_bitmap_xor` agg function. (#13287 )	2022-10-17 18:40:06 +08:00
abmdocrt	045bccdbea	[Feature](Retention) support retention function (#13056 )	2022-10-17 11:00:47 +08:00
starocean999	f2fa9606c9	[fix](agg) count function should return 0 for null value (#13247 ) count(null) should return 0 instead of 1. The streaming_agg_serialize_to_column function did not handle null input values; this PR fixes it.	2022-10-15 10:40:52 +08:00
luozenglin	cb300b0b39	[feature](agg) support `any`,`any_value` agg functions. (#13228 )	2022-10-13 18:31:19 +08:00
Gabriel	1ba9e4b568	[Improvement](sort) Reuse memory in sort node (#12921 )	2022-09-28 09:44:35 +08:00
HappenLee	9d6c199553	[Bug](vec) Fix avg overflow in clickbench (#12621 )	2022-09-16 14:43:40 +08:00
Pxl	0ead048b93	[Enhancement](column) remove ColumnString terminating zero and add a data_version for pblock (#12456 ) 1. remove ColumnString terminating zero 2. add a data_version for pblock 3. change EncryptionMode to enum class	2022-09-14 21:25:22 +08:00
Gabriel	af09c1f4eb	[Improvement](window funnel) restrict timestamp to datetime type in window funnel (#12123 )	2022-08-29 12:14:04 +08:00
Pxl	3af0745c8f	[Bug](function) fix aggFnParams set not correct (#12006 )	2022-08-26 14:29:56 +08:00
Jerry Hu	f875684345	[fix](agg) Crashing caused by serialization in streaming aggregation (#12027 )	2022-08-24 14:38:25 +08:00
HappenLee	3abc4f357f	[Bug](bitmap) intersect_count function use in string cause ASAN error (#11936 )	2022-08-24 08:51:53 +08:00
Jerry Hu	c22d097b59	[improvement](compress) Support compress/decompress block with lz4 (#11955 )	2022-08-22 17:35:43 +08:00
Jerry Hu	dc8f64b3e3	[improvement](agg) Serialize the fixed-length aggregation results with corresponding columns instead of ColumnString (#11801 )	2022-08-22 10:12:06 +08:00
Adonis Ling	982c5f06b5	[fix](build) Resolve the conflicts when building be with java-udf (#11938 )	2022-08-20 18:24:32 +08:00

113 Commits