Design Documentation Linked to #25514
The regression test adds a new group: arrow_flight_sql.
`./run-regression-test.sh -g arrow_flight_sql` runs the regression tests using jdbc:arrow-flight-sql for all suites whose group contains arrow_flight_sql.
`./run-regression-test.sh -g p0,arrow_flight_sql` runs the regression tests using jdbc:arrow-flight-sql for all suites whose group contains arrow_flight_sql, and jdbc:mysql for the other suites whose group contains p0 but not arrow_flight_sql.
Note that the result formats of jdbc:arrow-flight-sql, jdbc:mysql, and the mysql client differ, for example:
Datetime field type: jdbc:mysql returns 2010-01-02T05:09:06, the mysql client returns 2010-01-02 05:09:06, and jdbc:arrow-flight-sql also returns 2010-01-02 05:09:06.
Array and Map field types: jdbc:mysql returns ["ab", "efg", null] and {"f1": 1, "f2": "a"}, while jdbc:arrow-flight-sql returns ["ab","efg",null] and {"f1":1,"f2":"a"}, i.e. without the spaces.
Float field type: jdbc:mysql and the mysql client return 6.333, jdbc:arrow-flight-sql returns 6.333000183105469, in query_p0/subquery/test_subquery.groovy.
If the query result is empty, jdbc:arrow-flight-sql returns an empty value while jdbc:mysql returns \N.
`use database;` and the query should be executed as two separate SQL statements whenever possible; otherwise the results may not be as expected. For example, `USE information_schema; select cast("0.0101031417" as datetime)` returns 2000-01-01 03:14:17 (constant folding), while `select cast("0.0101031417" as datetime)` alone returns null (no constant folding).
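A sketch of the recommended split, reusing the statements from the example above:
```sql
-- Preferred: send these as two separate executions over jdbc:arrow-flight-sql.
USE information_schema;
SELECT CAST("0.0101031417" AS DATETIME);
-- Sending "USE information_schema; SELECT CAST(...)" as a single execution may
-- fold constants differently and change the result, as described above.
```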
In addition, Doris jdbc:arrow-flight-sql still has some unfinished parts, such as:
Unsupported data type: Decimal256. INVALID_ARGUMENT: [INTERNAL_ERROR]Fail to convert block data to arrow data, error: [E3] write_column_to_arrow with type Decimal256
Unsupported null value of map key. INVALID_ARGUMENT: [INTERNAL_ERROR]Fail to convert block data to arrow data, error: [E33] Can not write null value of map key to arrow.
Unsupported data type: ARRAY<MAP<TEXT,TEXT>>
jdbc:arrow-flight-sql does not support connecting to a specific DB name, such as `jdbc:arrow-flight-sql://127.0.0.1:9090/{db_name}`. To stay compatible with the regression test, `use db_name` is added before all SQLs when jdbc:arrow-flight-sql runs the regression test.
`select timediff("2010-01-01 01:00:00", "2010-01-02 01:00:00");` fails with the error java.lang.NumberFormatException: For input string: "-24:00:00".
FEATURE:
1. enable array type in Nereids
2. support generics in function signatures
3. support array and map type in type coercion and type check
4. add element_at and element_slice syntax in Nereids parser
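A minimal illustration of the new element_at syntax (a sketch only; the array literals and 1-based indexing are assumptions, and element_slice is not shown):
```sql
-- element_at(arr, pos) reads one element from an array, with pos starting at 1.
SELECT element_at([1, 2, 3], 2);        -- expected: 2
SELECT element_at(['a', 'b', 'c'], 3);  -- expected: 'c'
```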
REFACTOR:
1. remove AbstractDataType
BUG FIX:
1. remove FROM from nonReserved keyword list
TODO:
1. support lambda expression
2. use Nereids' way to do function type coercion
3. use castIfnotSame when doing implicit casts on BoundFunction
4. let AnyDataType type coercion do the same thing as function type coercion
5. add the array functions below:
- array_apply
- array_concat
- array_filter
- array_sortby
- array_exists
- array_first_index
- array_last_index
- array_count
- array_shuffle (alias: shuffle)
- array_pushfront
- array_pushback
- array_repeat
- array_zip
- reverse
- concat_ws
- split_by_string
- explode
- bitmap_from_array
- bitmap_to_array
- multi_search_all_positions
- multi_match_any
- tokenize
count_by_enum(expr1, expr2, ... , exprN);
Treats the data in a column as an enumeration and counts the number of occurrences of each enumerated value. For each column, it returns the count of each enumerated value along with the number of non-null and null values.
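A minimal usage sketch, assuming a hypothetical table `people(id INT, gender VARCHAR(10), city VARCHAR(20))`; the exact output layout is omitted here:
```sql
-- Each argument is treated as an enum-like column; the result reports, per column,
-- the count of each distinct value plus the null and non-null counts.
SELECT count_by_enum(gender, city) FROM people;
```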
The legacy PartitionPruner only supports some simple cases; some useful cases are not supported:
1. it cannot evaluate some builtin functions, like `cast(part_column as bigint) = 1`
2. it cannot prune multi-level range partitions; for the partition `[('1', 'a'), ('2', 'b'))`, the following constraints hold:
- first_part_column between '1' and '2'
- if first_part_column = '1' then second_part_column >= 'a'
- if first_part_column = '2' then second_part_column < 'b'
This PR refactors it and supports:
1. using a visitor to evaluate functions and fold constants
2. if the partition column is a discrete type like int or date, we can expand it and evaluate, e.g. `[1, 5)` will be expanded to `[1, 2, 3, 4]`
3. pruning multi-level range partitions, as previously described (see the sketch after this list)
4. evaluation capabilities for a range slot, e.g. for the datetime range partition `[('2023-03-21 00:00:00'), ('2023-03-21 23:59:59'))`, if the filter is `date(col1) = '2023-03-22'`, this partition will be pruned. We can do this because we know the date is always `2023-03-21`. You can implement the visit method in FoldConstantRuleOnFE and OneRangePartitionEvaluator to support such functions.
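A sketch of case 3, using a hypothetical two-column range-partitioned table (table, column, and partition names are illustrative, not from the original PR):
```sql
-- Partition p1 covers [("1", "10"), ("2", "20")), matching the constraints listed earlier.
CREATE TABLE multi_level_part (
    k1 INT,
    k2 INT,
    v  INT
)
DUPLICATE KEY(k1, k2)
PARTITION BY RANGE (k1, k2) (
    PARTITION p1 VALUES [("1", "10"), ("2", "20"))
)
DISTRIBUTED BY HASH(k1) BUCKETS 1
PROPERTIES ("replication_num" = "1");

-- If k1 = 1, the partition requires k2 >= 10, so k2 = 5 can never fall into p1;
-- the refactored pruner can skip this partition, while the legacy pruner could not
-- reason about the second partition column.
SELECT * FROM multi_level_part WHERE k1 = 1 AND k2 = 5;
```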
### How can we do it so finely?
Generally, the range partition columns can be separated into three kinds: `const`, `range`, and `other`.
For example, in the partition `[('1', 'a', 'D'), ('1', 'c', 'D'))`:
1. the first partition column is `const`: it always equals '1'
2. the second partition column is `range`: `slot >= 'a' and slot <= 'c'`. If there were no later slot, it would be `slot >= 'a' and slot < 'c'`
3. the third partition column is `other`: regardless of whether its upper and lower bounds are the same, it can take multiple values, e.g. `('1', 'a', 'D')`, `('1', 'a', 'F')`, `('1', 'b', 'A')`, `('1', 'c', 'A')`
In a partition, there is exactly one `range` slot; there may be zero, one, or many `const`/`other` slots.
Normally, a partition looks like [const*, range, other*]; these are the possible shapes:
1. [range], e.g. `[('1'), ('10'))`
2. [const, range], e.g. `[('1', 'a'), ('1', 'd'))`
3. [range, other, other], e.g. `[('1', '1', '1'), ('2', '1', '1'))`
4. [const, const, ..., range, other, other, ...], e.g. `[('1', '1', '2', '3', '4'), ('1', '1', '3', '3', '4'))`
The properties of `const`:
1. we can replace the slot with its literal value when evaluating the expression tree.
The properties of `range`:
1. if the slot data type is a discrete type, like int or date, we can expand it to literals and evaluate the expression tree
2. if it is not a discrete type, like datetime, or there are too many discrete values, like [1, 1000000), we can keep the slot in the expression tree and assign a range to it; when evaluating the expression tree, we also compute the range and check whether it is an empty set; if so we can simplify the expression to BooleanLiteral.FALSE and skip this partition.
3. if the range slot satisfies some conditions, we can fold the slot with some functions too; see the datetime example above
The properties of `other`:
1. only when the previous slot is a literal that equals the lower or upper bound of the partition can we shrink the range of the `other` slot
According to these properties, we can do the pruning finely.
At runtime, the `range` and `other` slots may shrink their ranges of values, e.g.
1. the partition `[('a'), ('b'))` with predicate `part_col = 'a'` will shrink the range `['a', 'b')` to `['a']`, like a `range` slot downgrading to a `const` slot;
2. the partition `[('a', '1'), ('b', '10'))` with predicate `part_col1 = 'a'` will shrink the range of the `other` slot from unknown (the full range) to `['1', +∞)`, like an `other` slot downgrading to a `range` slot.
But to keep things simple, I don't change the slot type at runtime, just shrink the ColumnRange.
This PR refactors column pruning using a visitor. The benefits:
1. it is easy to provide column-pruning ability for a new plan: implement the interface `OutputPrunable` if the plan contains an output field, or do nothing if it does not; there is no need to add a new rule like `PruneXxxChildColumns`. Only a few scenarios need to override the visit function to write special logic, like pruning LogicalSetOperation and Aggregate
2. it supports shrinking output fields in some plans, which skips some useless operations and improves performance
example:
```sql
select id
from (
select id, sum(age)
from student
group by id
)a
```
we should prune the useless `sum(age)` in the aggregate.
before refactor:
```
LogicalProject ( distinct=false, projects=[id#0], excepts=[], canEliminate=true )
+--LogicalSubQueryAlias ( qualifier=[a] )
+--LogicalAggregate ( groupByExpr=[id#0], outputExpr=[id#0, sum(age#2) AS `sum(age)`#4], hasRepeat=false )
+--LogicalProject ( distinct=false, projects=[id#0, age#2], excepts=[], canEliminate=true )
+--LogicalOlapScan ( qualified=default_cluster:test.student, indexName=<index_not_selected>, selectedIndexId=10007, preAgg=ON )
```
after refactor:
```
LogicalProject ( distinct=false, projects=[id#0], excepts=[], canEliminate=true )
+--LogicalSubQueryAlias ( qualifier=[a] )
+--LogicalAggregate ( groupByExpr=[id#0], outputExpr=[id#0], hasRepeat=false )
+--LogicalProject ( distinct=false, projects=[id#0], excepts=[], canEliminate=true )
+--LogicalOlapScan ( qualified=default_cluster:test.student, indexName=<index_not_selected>, selectedIndexId=10007, preAgg=ON )
```
* [Feature](vectorized)(quantile_state): support vectorized quantile state functions
1. the quantile column currently only supports non-nullable
2. add some regression test cases
3. set enable_quantile_state_type = true by default
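A minimal sketch of the intended usage, assuming the table definition and function names follow the existing QUANTILE_STATE documentation (table and column names are illustrative):
```sql
-- Aggregate table with a non-nullable QUANTILE_STATE column (see point 1 above).
CREATE TABLE quantile_demo (
    dt      DATE,
    id      INT,
    latency QUANTILE_STATE QUANTILE_UNION NOT NULL
)
AGGREGATE KEY(dt, id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES ("replication_num" = "1");

-- Build the quantile state on insert, then read back a percentile.
INSERT INTO quantile_demo VALUES ("2023-01-01", 1, TO_QUANTILE_STATE(10, 2048));
SELECT QUANTILE_PERCENT(QUANTILE_UNION(latency), 0.99) FROM quantile_demo;
```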
---------
Co-authored-by: spaces-x <weixiang06@meituan.com>
WITH t0 AS(
SELECT report.date1 AS date2 FROM(
SELECT DATE_FORMAT(date, '%Y%m%d') AS date1 FROM cir_1756_t1
) report GROUP BY report.date1
),
t3 AS(
SELECT date_format(date, '%Y%m%d') AS date3
FROM cir_1756_t2
)
SELECT row_number() OVER(ORDER BY date2)
FROM(
SELECT t0.date2 FROM t0 LEFT JOIN t3 ON t0.date2 = t3.date3
) tx;
The DATE_FORMAT(date, '%Y%m%d') was calculated in the GROUP BY node, which is wrong. This expression should be calculated inside the subquery.
Enhance the aggregate functions `collect_set` and `collect_list` to support an optional `max_size` parameter,
which makes it possible to limit the number of elements in the result array.
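A minimal usage sketch, assuming a hypothetical table `grades(k INT, score INT)`:
```sql
-- max_size = 3 caps each result array at three elements.
SELECT k,
       collect_set(score, 3)  AS distinct_scores,
       collect_list(score, 3) AS scores
FROM grades
GROUP BY k;
```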
1. Fixed a problem with histogram statistics collection parameters.
2. Solved the problem that it takes a long time to collect histogram statistics.
TODO: Optimize histogram statistics sampling method and make the sampling parameters effective.
The problem is that the histogram function works as expected in the single-node test but does not work in the multi-node test. In addition, the current sampling support for collecting histograms performs poorly, resulting in large time consumption when collecting histogram information.
Fixed the parameter issue and temporarily removed support for sampling to speed up the collection of histogram statistics.
Sampling support for collecting histogram information will be added next.
This PR mainly optimizes the histogram aggregation function (👉🏻 https://github.com/apache/doris/pull/14910), including the following:
1. Support input parameters `sample_rate` and `max_bucket_num`
2. Add UT and regression test
3. Add documentation
4. Optimize function implementation logic
Parameter description:
- `sample_rate`: Optional. The proportion of sampled data used to generate the histogram. The default is 0.2.
- `max_bucket_num`: Optional. Limits the number of histogram buckets. The default value is 128.
---
Example:
```
MySQL [test]> SELECT histogram(c_float) FROM histogram_test;
+-------------------------------------------------------------------------------------------------------------------------------------+
| histogram(`c_float`) |
+-------------------------------------------------------------------------------------------------------------------------------------+
| {"sample_rate":0.2,"max_bucket_num":128,"bucket_num":3,"buckets":[{"lower":"0.1","upper":"0.1","count":1,"pre_sum":0,"ndv":1},...]} |
+-------------------------------------------------------------------------------------------------------------------------------------+
MySQL [test]> SELECT histogram(c_string, 0.5, 2) FROM histogram_test;
+-------------------------------------------------------------------------------------------------------------------------------------+
| histogram(`c_string`) |
+-------------------------------------------------------------------------------------------------------------------------------------+
| {"sample_rate":0.5,"max_bucket_num":2,"bucket_num":2,"buckets":[{"lower":"str1","upper":"str7","count":4,"pre_sum":0,"ndv":3},...]} |
+-------------------------------------------------------------------------------------------------------------------------------------+
```
Query result description:
```
{
"sample_rate": 0.2,
"max_bucket_num": 128,
"bucket_num": 3,
"buckets": [
{
"lower": "0.1",
"upper": "0.2",
"count": 2,
"pre_sum": 0,
"ndv": 2
},
{
"lower": "0.8",
"upper": "0.9",
"count": 2,
"pre_sum": 2,
"ndv": 2
},
{
"lower": "1.0",
"upper": "1.0",
"count": 2,
"pre_sum": 4,
"ndv": 1
}
]
}
```
Field description:
- `sample_rate`: the sampling rate
- `max_bucket_num`: the maximum number of buckets
- `bucket_num`: the actual number of buckets
- `buckets`: all buckets
- `lower`: lower bound of the bucket
- `upper`: upper bound of the bucket
- `count`: the number of elements contained in the bucket
- `pre_sum`: the total number of elements in all preceding buckets
- `ndv`: the number of distinct values in the bucket
> Total number of histogram elements = count of the last bucket + pre_sum of the last bucket. In the result above, the last bucket has count = 2 and pre_sum = 4, so the histogram contains 6 elements in total.
MySQL [db]> SELECT SUM(a.r[1]) as active_user_num, SUM(a.r[2]) as active_user_num_1day, SUM(a.r[3]) as active_user_num_3day, SUM(a.r[4]) as active_user_num_7day FROM ( SELECT user_id, retention( day = '2022-11-01', day = '2022-11-02', day = '2022-11-04', day = '2022-11-07') as r FROM login_event WHERE (day >= '2022-11-01') AND (day <= '2022-11-21') GROUP BY user_id ) a;
ERROR 1105 (HY000): errCode = 2, detailMessage = sum requires a numeric parameter: sum(%element_extract%(a.r, 1))
1. remove FE config `enable_array_type`
2. limit the nesting depth of array on the FE side.
3. Fix a bug where, when loading array from parquet, the decimal type was treated as bigint
4. Fix loading array from csv (vec-engine); handle null and "null"
5. Change the csv array loading behavior: if the array string format is invalid in csv, it will be converted to null.
6. Remove `check_array_format()`, because its logic is wrong and meaningless
7. Add stream load csv test cases and more parquet broker load tests