doris

Author	SHA1	Message	Date
ElvinWei	754fceafaf	[feature-wip](statistics) add aggregate function histogram and collect histogram statistics (#14910 ) Histogram statistics Currently Doris collects statistics but no histogram data, and by default the optimizer assumes that the distinct values of a column are evenly distributed. This assumption can be problematic when the data distribution is skewed, so this PR implements the collection of histogram statistics. For skewed columns (columns whose data is unevenly distributed), histogram statistics enable the optimizer to generate more accurate cardinality estimates for filter or join predicates involving these columns, resulting in a more precise execution plan. Histograms improve the execution plan mainly in two ways: the selection of where conditions and the selection of join order. The selection principle for where conditions is relatively simple: the histogram is used to calculate the selection rate of each predicate, and the filter with the higher selection rate is preferred. The selection of join order is based on the estimated number of rows in the join result. When the data in the join condition columns is unevenly distributed, a histogram can greatly improve the accuracy of the predicted number of rows in the join result. In addition, if a bucket in one of the columns contains 0 rows, it can be marked and skipped directly in the subsequent join process to improve efficiency. --- Histogram statistics are mainly collected by the histogram aggregation function, which is used as follows: Syntax ```SQL histogram(expr) ``` > The histogram function is used to describe the distribution of the data. It uses an "equal height" bucketing strategy and divides the data into buckets according to the value of the data. It describes each bucket with some simple data, such as the number of values that fall in the bucket. 
It is mainly used by the optimizer to estimate the range query. example ``` MySQL [test]> select histogram(login_time) from dev_table; +------------------------------------------------------------------------------------------------------------------------------+ \| histogram(`login_time`) \| +------------------------------------------------------------------------------------------------------------------------------+ \| {"bucket_size":5,"buckets":[{"lower":"2022-09-21 17:30:29","upper":"2022-09-21 22:30:29","count":9,"pre_sum":0,"ndv":1},...]}\| +------------------------------------------------------------------------------------------------------------------------------+ ``` description ```JSON { "bucket_size": 5, "buckets": [ { "lower": "2022-09-21 17:30:29", "upper": "2022-09-21 22:30:29", "count": 9, "pre_sum": 0, "ndv": 1 }, { "lower": "2022-09-22 17:30:29", "upper": "2022-09-22 22:30:29", "count": 10, "pre_sum": 9, "ndv": 1 }, { "lower": "2022-09-23 17:30:29", "upper": "2022-09-23 22:30:29", "count": 9, "pre_sum": 19, "ndv": 1 }, { "lower": "2022-09-24 17:30:29", "upper": "2022-09-24 22:30:29", "count": 9, "pre_sum": 28, "ndv": 1 }, { "lower": "2022-09-25 17:30:29", "upper": "2022-09-25 22:30:29", "count": 9, "pre_sum": 37, "ndv": 1 } ] } ``` TODO: - histogram func supports parameter and sample statistics (It's got another pr) - use histogram statistics - add p0 regression	2022-12-22 16:42:17 +08:00
Gabriel	0b6054a4ce	[Bug](decimalv3) Fix wrong argument for min_by/max_by (#15153 )	2022-12-19 10:15:28 +08:00
Pxl	1b07e3e18b	[Chore](refactor) some modify for pass c++20 standard (#15042 ) some modify for pass c++20 standard	2022-12-17 14:41:07 +08:00
zhengshengjun	1f56279fd8	[Vectorized] Use SIMD to skip batches of null data in aggregation (#10392 )	2022-12-12 23:40:31 +08:00
plat1ko	f3aea7f0f0	[Enhancement](status) Unify error code and enable customed err msg for BE internal errors (#14744 )	2022-12-11 23:33:18 +08:00
Adonis Ling	ec2539e2a3	[chore](macOS) Resolve the issue with missing python program (#14864 )	2022-12-07 15:30:12 +08:00
Gabriel	9dd1d989e8	[test](decimalv3) add regression test cases for decimalv3 (#14672 )	2022-12-01 15:18:40 +08:00
Gabriel	3e8b3658c7	[feature-wip](decimalv3) Support basic agg and arithmetic operations for decimal v3 (#14513 )	2022-11-29 15:12:41 +08:00
abmdocrt	529bdfb153	[Fix](function) Fix retention function return wrong value type (#14552 ) MySQL [db]> SELECT SUM(a.r[1]) as active_user_num, SUM(a.r[2]) as active_user_num_1day, SUM(a.r[3]) as active_user_num_3day, SUM(a.r[4]) as active_user_num_7day FROM ( SELECT user_id, retention( day = '2022-11-01', day = '2022-11-02', day = '2022-11-04', day = '2022-11-07') as r FROM login_event WHERE (day >= '2022-11-01') AND (day <= '2022-11-21') GROUP BY user_id ) a; ERROR 1105 (HY000): errCode = 2, detailMessage = sum requires a numeric parameter: sum(%element_extract%(a.r, 1))	2022-11-28 15:56:18 +08:00
zy-kkk	59b31a03c4	[Improvement](agg function) support group_bit_and/group_bit_or/group_bit_xor functions (#14386 )	2022-11-24 16:46:42 +08:00
zhangstar333	b04ec41c1d	[Vectorized](udaf) fix java-udaf couldn't get jar core dump (#14393 ) fix java-udaf couldn't get jar core dump	2022-11-22 20:49:02 +08:00
Gabriel	1ec7f45fb6	[Bug](avg) Fix `avg` for bigint (#14433 )	2022-11-22 10:29:59 +08:00
Gabriel	2c42f0a905	[refactor](decimalv3) Refine code for DecimalV3 (#14394 )	2022-11-19 16:57:17 +08:00
zhangstar333	70cc725649	[Vectorized](function) support avg_weighted/percentile_array/topn_wei… (#14209 ) * [Vectorized](function) support avg_weighted/percentile_array/topn_weighted functions * update add to stringRef	2022-11-15 16:38:38 +08:00
abmdocrt	6cc5ae077e	[Improvement](Sequence function) Capitalize const variables (#14270 )	2022-11-15 10:41:53 +08:00
xy720	035657c5a1	[typo](comment) Fix a lot of spell errors in be comments (#14208 ) fix typos.	2022-11-12 16:06:15 +08:00
abmdocrt	b6ba654f5b	[Feature](Sequence) Support sequence_match and sequence_count functions (#13785 )	2022-11-11 13:38:45 +08:00
Zhengguo Yang	12652ebb0e	[UDF](java udf) using config to enable java udf instead of macro at compile time (#14062 ) * [UDF](java udf) using config to enable java udf instead of macro at compile time	2022-11-11 09:03:52 +08:00
zhangstar333	df622d8b7d	[Bug](udf) fix java-udaf process string type error and add some tests (#14106 )	2022-11-10 09:30:57 +08:00
xy720	d183199319	[Bug](array-type) Fix array product calculate decimal type return wrong result (#13794 )	2022-11-03 17:26:34 +08:00
HappenLee	fbc8b7311f	[Opt](function) opt the function of ndv (#13887 )	2022-11-02 22:21:20 +08:00
zhangstar333	374303186c	[Vectorized](function) support topn_array function (#13869 )	2022-11-02 19:49:23 +08:00
Gabriel	287a739510	[javaudf](string) Fix string format in java udf (#13854 )	2022-11-01 21:25:12 +08:00
Pxl	2fab0c45c7	[Feature](runtime-filter) add runtime filter breaking change adapt (#13246 ) add runtime filter breaking change adapt	2022-10-28 10:59:28 +08:00
xy720	f329d33666	[chore](fix) Fix some spell errors in be's comments. #13452	2022-10-20 08:56:01 +08:00
Adonis Ling	125def5102	[enhancement](macOS M1) Support building from source on macOS (M1) (#13195 ) # Proposed changes This PR fixed lots of issues when building from source on macOS with Apple M1 chip. ## ATTENTION The job for supporting macOS with Apple M1 chip is too big and there are lots of unresolved issues during runtime: 1. Some errors with memory tracker occur when BE (RELEASE) starts. 2. Some UT cases fail. ... Temporarily, the following changes are made on macOS to start BE successfully. 1. Disable memory tracker. 2. Use tcmalloc instead of jemalloc. This PR kicks off the job. Guys who are interested in this job can continue to fix these runtime issues. ## Use case ```shell ./build.sh -j 8 --be --clean cd output/be/bin ulimit -n 60000 ./start_be.sh --daemon ``` ## Something else It takes around _10+_ minutes to build BE (with prebuilt third-parties) on macOS with M1 chip. We will improve the development experience on macOS greatly when we finish the adaptation job.	2022-10-18 13:10:13 +08:00
luozenglin	207f4e559e	[feature](agg) support `group_bitmap_xor` agg function. (#13287 ) support `group_bitmap_xor` agg function	2022-10-17 18:40:06 +08:00
abmdocrt	045bccdbea	[Feature](Retention) support retention function (#13056 )	2022-10-17 11:00:47 +08:00
starocean999	f2fa9606c9	[fix](agg)count function should return 0 for null value (#13247 ) count(null) should return 0 instead of 1. The streaming_agg_serialize_to_column function didn't handle the case where the input value is null; this PR fixes it.	2022-10-15 10:40:52 +08:00
luozenglin	cb300b0b39	[feature](agg) support `any`,`any_value` agg functions. (#13228 )	2022-10-13 18:31:19 +08:00
Gabriel	1ba9e4b568	[Improvement](sort) Reuse memory in sort node (#12921 )	2022-09-28 09:44:35 +08:00
HappenLee	9d6c199553	[Bug](vec) Fix avg overflow in clickbench (#12621 )	2022-09-16 14:43:40 +08:00
Pxl	0ead048b93	[Enhancement](column) remove ColumnString terminating zero and add a data_version for pblock (#12456 ) 1. remove ColumnString terminating zero 2. add a data_version for pblock 3. change EncryptionMode to enum class	2022-09-14 21:25:22 +08:00
Gabriel	af09c1f4eb	[Improvement](window funnel) restrict timestamp to datetime type in window funnel (#12123 )	2022-08-29 12:14:04 +08:00
Pxl	3af0745c8f	[Bug](function) fix aggFnParams set not correct (#12006 )	2022-08-26 14:29:56 +08:00
Jerry Hu	f875684345	[fix](agg) Crashing caused by serialization in streaming aggregation (#12027 )	2022-08-24 14:38:25 +08:00
HappenLee	3abc4f357f	[Bug](bitmap) intersect_count function use in string cause ASAN error (#11936 )	2022-08-24 08:51:53 +08:00
Jerry Hu	c22d097b59	[improvement](compress) Support compress/decompress block with lz4 (#11955 )	2022-08-22 17:35:43 +08:00
Jerry Hu	dc8f64b3e3	[improvement](agg) Serialize the fixed-length aggregation results with corresponding columns instead of ColumnString (#11801 )	2022-08-22 10:12:06 +08:00
Adonis Ling	982c5f06b5	[fix](build) Resolve the conflicts when building be with java-udf (#11938 )	2022-08-20 18:24:32 +08:00
chenlinzhong	7a505cf040	[remote-udaf](optimize) Optimize RPC exception handling logic (#11680 )	2022-08-19 10:25:01 +08:00
zhangstar333	7df8c6f493	[vectorized](improvement) improve agg function of bitmap_union with f… (#11822 ) * [vectorized](improvement) improve agg function of bitmap_union with fastunion	2022-08-17 14:13:01 +08:00
ZenoYang	288b440b14	[improvement](vectorized) Improve count distinct performance by using fastunion (#11516 ) Improve count distinct performance by using fastunion. Tests on real user data show a 10-40% performance improvement.	2022-08-16 12:18:46 +08:00
starocean999	092a394782	[improvement](agg)limit the output of agg node (#11461 ) * [improvement](agg)limit the output of agg node	2022-08-05 07:53:55 +08:00
Gabriel	27be5e8667	[feature-wip](decimalv3) Fix UTs when `decimalv3` is enabled (#11380 )	2022-08-01 23:07:38 +08:00
Jerry Hu	d360974dce	[improvement](agg)Use phmap::flat_hash_set in AggregateFunctionUniq (#11363 ) This reverts commit 688b55053dd1fc5113343a6f565ad732ddd9612a.	2022-08-01 10:36:11 +08:00
Mingyu Chen	688b55053d	Revert "[improvement]Use phmap::flat_hash_set in AggregateFunctionUniq (#11257 )" (#11356 ) This reverts commit a7199fb98e18b925664b38460b667d04cbee8e01.	2022-07-30 23:15:36 +08:00
zhangstar333	1f30e563a7	[refactor][vectorized] refactor first/last value agg functions (#10661 ) * refactor first and last [refactor][vectorized] refactor first/last value agg functions * add some change * remove first/last about always nullable * remove always nullable and register it * refactor value remove bool null flag * refactor win first last to ptr and pos	2022-07-30 18:38:56 +08:00
Pxl	532395c6a0	[Bug][Function] core dump on sum(distinct) (#11308 ) * fix align size calculate for distinct combinator	2022-07-30 10:24:48 +08:00
Jerry Hu	a7199fb98e	[improvement]Use phmap::flat_hash_set in AggregateFunctionUniq (#11257 )	2022-07-29 16:55:22 +08:00

103 Commits
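
The histogram commit (754fceafaf) describes an "equal height" bucketing strategy and shows output in which each bucket carries `lower`, `upper`, `count`, `pre_sum`, and `ndv`. The following is a minimal Python sketch of that idea, not the Doris implementation. The helper names `build_equal_height_histogram` and `estimate_le_fraction` are hypothetical, `bucket_size` is assumed to mean the number of buckets, `pre_sum` is assumed to be the running count of rows in earlier buckets (which matches the sample output), and the partial-bucket interpolation assumes numeric values spread uniformly inside a bucket. Unlike a production histogram, this sketch may split runs of equal values across two buckets.

```python
def build_equal_height_histogram(values, max_buckets):
    """Split sorted values into buckets holding roughly the same number of rows.

    Hypothetical helper: mirrors the JSON shape shown in the commit message.
    """
    ordered = sorted(values)
    if not ordered:
        return {"bucket_size": 0, "buckets": []}

    # Ceiling division: rows per bucket so that at most max_buckets are produced.
    per_bucket = -(-len(ordered) // max_buckets)

    buckets = []
    pre_sum = 0  # assumed meaning: rows contained in all earlier buckets
    for start in range(0, len(ordered), per_bucket):
        chunk = ordered[start:start + per_bucket]
        buckets.append({
            "lower": chunk[0],
            "upper": chunk[-1],
            "count": len(chunk),
            "pre_sum": pre_sum,
            "ndv": len(set(chunk)),  # number of distinct values in the bucket
        })
        pre_sum += len(chunk)
    return {"bucket_size": len(buckets), "buckets": buckets}


def estimate_le_fraction(hist, x):
    """Estimate the fraction of rows with value <= x (numeric values only)."""
    total = sum(b["count"] for b in hist["buckets"])
    if total == 0:
        return 0.0

    rows = 0.0
    for b in hist["buckets"]:
        if x >= b["upper"]:
            # The whole bucket satisfies the predicate.
            rows += b["count"]
        elif x >= b["lower"]:
            # Partial bucket: interpolate linearly (uniformity assumption).
            # Here lower <= x < upper, so the span is strictly positive.
            span = b["upper"] - b["lower"]
            rows += b["count"] * (x - b["lower"]) / span
    return rows / total


if __name__ == "__main__":
    hist = build_equal_height_histogram(range(1, 101), max_buckets=5)
    for bucket in hist["buckets"]:
        print(bucket)
    print(estimate_le_fraction(hist, 40))
```

An optimizer could compare such estimated fractions across predicates when ordering filters, or multiply them by table row counts when estimating join result sizes, which is the use the commit message outlines.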