Design Documentation Linked to #25514
The regression test adds a new group: arrow_flight_sql.
`./run-regression-test.sh -g arrow_flight_sql` runs the regression tests using jdbc:arrow-flight-sql for all suites whose group contains arrow_flight_sql.
`./run-regression-test.sh -g p0,arrow_flight_sql` runs the regression tests using jdbc:arrow-flight-sql for all suites whose group contains arrow_flight_sql, and jdbc:mysql for the other suites whose group contains p0 but not arrow_flight_sql.
Note that the result formats of jdbc:arrow-flight-sql, jdbc:mysql, and the mysql client differ, for example:
Datetime field type: jdbc:mysql returns 2010-01-02T05:09:06, the mysql client returns 2010-01-02 05:09:06, and jdbc:arrow-flight-sql also returns 2010-01-02 05:09:06.
Array and Map field types: jdbc:mysql returns ["ab", "efg", null] and {"f1": 1, "f2": "a"}, while jdbc:arrow-flight-sql returns ["ab","efg",null] and {"f1":1,"f2":"a"}, i.e. without the spaces.
Float field type: jdbc:mysql and the mysql client return 6.333, jdbc:arrow-flight-sql returns 6.333000183105469, in query_p0/subquery/test_subquery.groovy.
If the query result is empty, jdbc:arrow-flight-sql returns an empty value while jdbc:mysql returns \N.
`use database;` and the query should be executed as two separate SQL statements whenever possible; otherwise the results may not be as expected. For example, `USE information_schema; select cast("0.0101031417" as datetime)` returns 2000-01-01 03:14:17 (constant folding), while `select cast("0.0101031417" as datetime)` alone returns null (no constant folding).
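A sketch of the recommended split, reusing the statements from the example above:
```sql
-- Preferred: send these as two separate executions over jdbc:arrow-flight-sql.
USE information_schema;
SELECT CAST("0.0101031417" AS DATETIME);
-- Sending "USE information_schema; SELECT CAST(...)" as a single execution may
-- fold constants differently and change the result, as described above.
```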
In addition, Doris jdbc:arrow-flight-sql still has some unfinished parts, such as:
Unsupported data type: Decimal256. INVALID_ARGUMENT: [INTERNAL_ERROR]Fail to convert block data to arrow data, error: [E3] write_column_to_arrow with type Decimal256
Unsupported null value of map key. INVALID_ARGUMENT: [INTERNAL_ERROR]Fail to convert block data to arrow data, error: [E33] Can not write null value of map key to arrow.
Unsupported data type: ARRAY<MAP<TEXT,TEXT>>
jdbc:arrow-flight-sql does not support connecting to a specific DB name, such as `jdbc:arrow-flight-sql://127.0.0.1:9090/{db_name}`. To stay compatible with the regression test, `use db_name` is added before all SQLs when jdbc:arrow-flight-sql runs the regression test.
`select timediff("2010-01-01 01:00:00", "2010-01-02 01:00:00");` fails with the error java.lang.NumberFormatException: For input string: "-24:00:00".
FEATURE:
1. enable array type in Nereids
2. support generics in function signatures
3. support array and map type in type coercion and type check
4. add element_at and element_slice syntax in Nereids parser
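A minimal illustration of the new element_at syntax (a sketch only; the array literals and 1-based indexing are assumptions, and element_slice is not shown):
```sql
-- element_at(arr, pos) reads one element from an array, with pos starting at 1.
SELECT element_at([1, 2, 3], 2);        -- expected: 2
SELECT element_at(['a', 'b', 'c'], 3);  -- expected: 'c'
```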
REFACTOR:
1. remove AbstractDataType
BUG FIX:
1. remove FROM from nonReserved keyword list
TODO:
1. support lambda expression
2. use Nereids' way to do function type coercion
3. use castIfnotSame when doing implicit casts on BoundFunction
4. let AnyDataType type coercion do the same thing as function type coercion
5. add the array functions below:
- array_apply
- array_concat
- array_filter
- array_sortby
- array_exists
- array_first_index
- array_last_index
- array_count
- array_shuffle (alias: shuffle)
- array_pushfront
- array_pushback
- array_repeat
- array_zip
- reverse
- concat_ws
- split_by_string
- explode
- bitmap_from_array
- bitmap_to_array
- multi_search_all_positions
- multi_match_any
- tokenize
count_by_enum(expr1, expr2, ... , exprN);
Treats the data in a column as an enumeration and counts the number of occurrences of each enumerated value. For each column, it returns the count of each enumerated value along with the number of non-null and null values.
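A minimal usage sketch, assuming a hypothetical table `people(id INT, gender VARCHAR(10), city VARCHAR(20))`; the exact output layout is omitted here:
```sql
-- Each argument is treated as an enum-like column; the result reports, per column,
-- the count of each distinct value plus the null and non-null counts.
SELECT count_by_enum(gender, city) FROM people;
```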
The legacy PartitionPruner only supports some simple cases; some useful cases are not supported:
1. it cannot evaluate some builtin functions, like `cast(part_column as bigint) = 1`
2. it cannot prune multi-level range partitions; for the partition `[('1', 'a'), ('2', 'b'))`, the following constraints hold:
- first_part_column between '1' and '2'
- if first_part_column = '1' then second_part_column >= 'a'
- if first_part_column = '2' then second_part_column < 'b'
This PR refactors it and supports:
1. using a visitor to evaluate functions and fold constants
2. if the partition column is a discrete type like int or date, we can expand it and evaluate, e.g. `[1, 5)` will be expanded to `[1, 2, 3, 4]`
3. pruning multi-level range partitions, as previously described (see the sketch after this list)
4. evaluation capabilities for a range slot, e.g. for the datetime range partition `[('2023-03-21 00:00:00'), ('2023-03-21 23:59:59'))`, if the filter is `date(col1) = '2023-03-22'`, this partition will be pruned. We can do this because we know the date is always `2023-03-21`. You can implement the visit method in FoldConstantRuleOnFE and OneRangePartitionEvaluator to support such functions.
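A sketch of case 3, using a hypothetical two-column range-partitioned table (table, column, and partition names are illustrative, not from the original PR):
```sql
-- Partition p1 covers [("1", "10"), ("2", "20")), matching the constraints listed earlier.
CREATE TABLE multi_level_part (
    k1 INT,
    k2 INT,
    v  INT
)
DUPLICATE KEY(k1, k2)
PARTITION BY RANGE (k1, k2) (
    PARTITION p1 VALUES [("1", "10"), ("2", "20"))
)
DISTRIBUTED BY HASH(k1) BUCKETS 1
PROPERTIES ("replication_num" = "1");

-- If k1 = 1, the partition requires k2 >= 10, so k2 = 5 can never fall into p1;
-- the refactored pruner can skip this partition, while the legacy pruner could not
-- reason about the second partition column.
SELECT * FROM multi_level_part WHERE k1 = 1 AND k2 = 5;
```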
### How can we do it so finely?
Generally, the range partition columns can be separated into three kinds: `const`, `range`, and `other`.
For example, in the partition `[('1', 'a', 'D'), ('1', 'c', 'D'))`:
1. the first partition column is `const`: it always equals '1'
2. the second partition column is `range`: `slot >= 'a' and slot <= 'c'`. If there were no later slot, it would be `slot >= 'a' and slot < 'c'`
3. the third partition column is `other`: regardless of whether its upper and lower bounds are the same, it can take multiple values, e.g. `('1', 'a', 'D')`, `('1', 'a', 'F')`, `('1', 'b', 'A')`, `('1', 'c', 'A')`
In a partition, there is exactly one `range` slot; there may be zero, one, or many `const`/`other` slots.
Normally, a partition looks like [const*, range, other*]; these are the possible shapes:
1. [range], e.g. `[('1'), ('10'))`
2. [const, range], e.g. `[('1', 'a'), ('1', 'd'))`
3. [range, other, other], e.g. `[('1', '1', '1'), ('2', '1', '1'))`
4. [const, const, ..., range, other, other, ...], e.g. `[('1', '1', '2', '3', '4'), ('1', '1', '3', '3', '4'))`
The properties of `const`:
1. we can replace the slot with its literal value when evaluating the expression tree.
The properties of `range`:
1. if the slot data type is a discrete type, like int or date, we can expand it to literals and evaluate the expression tree
2. if it is not a discrete type, like datetime, or there are too many discrete values, like [1, 1000000), we can keep the slot in the expression tree and assign a range to it; when evaluating the expression tree, we also compute the range and check whether it is an empty set; if so we can simplify the expression to BooleanLiteral.FALSE and skip this partition.
3. if the range slot satisfies some conditions, we can fold the slot with some functions too; see the datetime example above
The properties of `other`:
1. only when the previous slot is a literal that equals the lower or upper bound of the partition can we shrink the range of the `other` slot
According to these properties, we can do the pruning finely.
At runtime, the `range` and `other` slots may shrink their ranges of values, e.g.
1. the partition `[('a'), ('b'))` with predicate `part_col = 'a'` will shrink the range `['a', 'b')` to `['a']`, like a `range` slot downgrading to a `const` slot;
2. the partition `[('a', '1'), ('b', '10'))` with predicate `part_col1 = 'a'` will shrink the range of the `other` slot from unknown (the full range) to `['1', +∞)`, like an `other` slot downgrading to a `range` slot.
But to keep things simple, I don't change the slot type at runtime, just shrink the ColumnRange.
This PR refactors column pruning using a visitor. The benefits:
1. it is easy to provide column-pruning ability for a new plan: implement the interface `OutputPrunable` if the plan contains an output field, or do nothing if it does not; there is no need to add a new rule like `PruneXxxChildColumns`. Only a few scenarios need to override the visit function to write special logic, like pruning LogicalSetOperation and Aggregate
2. it supports shrinking output fields in some plans, which skips some useless operations and improves performance
example:
```sql
select id
from (
select id, sum(age)
from student
group by id
)a
```
we should prune the useless `sum(age)` in the aggregate.
before refactor:
```
LogicalProject ( distinct=false, projects=[id#0], excepts=[], canEliminate=true )
+--LogicalSubQueryAlias ( qualifier=[a] )
+--LogicalAggregate ( groupByExpr=[id#0], outputExpr=[id#0, sum(age#2) AS `sum(age)`#4], hasRepeat=false )
+--LogicalProject ( distinct=false, projects=[id#0, age#2], excepts=[], canEliminate=true )
+--LogicalOlapScan ( qualified=default_cluster:test.student, indexName=<index_not_selected>, selectedIndexId=10007, preAgg=ON )
```
after refactor:
```
LogicalProject ( distinct=false, projects=[id#0], excepts=[], canEliminate=true )
+--LogicalSubQueryAlias ( qualifier=[a] )
+--LogicalAggregate ( groupByExpr=[id#0], outputExpr=[id#0], hasRepeat=false )
+--LogicalProject ( distinct=false, projects=[id#0], excepts=[], canEliminate=true )
+--LogicalOlapScan ( qualified=default_cluster:test.student, indexName=<index_not_selected>, selectedIndexId=10007, preAgg=ON )
```
* [Feature](vectorized)(quantile_state): support vectorized quantile state functions
1. the quantile column currently only supports non-nullable
2. add some regression test cases
3. set enable_quantile_state_type = true by default
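A minimal sketch of the intended usage, assuming the table definition and function names follow the existing QUANTILE_STATE documentation (table and column names are illustrative):
```sql
-- Aggregate table with a non-nullable QUANTILE_STATE column (see point 1 above).
CREATE TABLE quantile_demo (
    dt      DATE,
    id      INT,
    latency QUANTILE_STATE QUANTILE_UNION NOT NULL
)
AGGREGATE KEY(dt, id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES ("replication_num" = "1");

-- Build the quantile state on insert, then read back a percentile.
INSERT INTO quantile_demo VALUES ("2023-01-01", 1, TO_QUANTILE_STATE(10, 2048));
SELECT QUANTILE_PERCENT(QUANTILE_UNION(latency), 0.99) FROM quantile_demo;
```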
---------
Co-authored-by: spaces-x <weixiang06@meituan.com>
WITH t0 AS(
SELECT report.date1 AS date2 FROM(
SELECT DATE_FORMAT(date, '%Y%m%d') AS date1 FROM cir_1756_t1
) report GROUP BY report.date1
),
t3 AS(
SELECT date_format(date, '%Y%m%d') AS date3
FROM cir_1756_t2
)
SELECT row_number() OVER(ORDER BY date2)
FROM(
SELECT t0.date2 FROM t0 LEFT JOIN t3 ON t0.date2 = t3.date3
) tx;
The DATE_FORMAT(date, '%Y%m%d') was calculated in the GROUP BY node, which is wrong. This expression should be calculated inside the subquery.
Enhance the aggregate functions `collect_set` and `collect_list` to support an optional `max_size` parameter,
which makes it possible to limit the number of elements in the result array.
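A minimal usage sketch, assuming a hypothetical table `grades(k INT, score INT)`:
```sql
-- max_size = 3 caps each result array at three elements.
SELECT k,
       collect_set(score, 3)  AS distinct_scores,
       collect_list(score, 3) AS scores
FROM grades
GROUP BY k;
```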
1. Fixed a problem with histogram statistics collection parameters.
2. Solved the problem that it takes a long time to collect histogram statistics.
TODO: Optimize histogram statistics sampling method and make the sampling parameters effective.
The problem is that the histogram function works as expected in the single-node test but does not work in the multi-node test. In addition, the current sampling support for collecting histograms performs poorly, resulting in large time consumption when collecting histogram information.
Fixed the parameter issue and temporarily removed support for sampling to speed up the collection of histogram statistics.
Sampling support for collecting histogram information will be added next.
This PR mainly optimizes the histogram aggregation function (👉🏻 https://github.com/apache/doris/pull/14910), including the following:
1. Support input parameters `sample_rate` and `max_bucket_num`
2. Add UT and regression test
3. Add documentation
4. Optimize function implementation logic
Parameter description:
- `sample_rate`: Optional. The proportion of sampled data used to generate the histogram. The default is 0.2.
- `max_bucket_num`: Optional. Limits the number of histogram buckets. The default value is 128.
---
Example:
```
MySQL [test]> SELECT histogram(c_float) FROM histogram_test;
+-------------------------------------------------------------------------------------------------------------------------------------+
| histogram(`c_float`) |
+-------------------------------------------------------------------------------------------------------------------------------------+
| {"sample_rate":0.2,"max_bucket_num":128,"bucket_num":3,"buckets":[{"lower":"0.1","upper":"0.1","count":1,"pre_sum":0,"ndv":1},...]} |
+-------------------------------------------------------------------------------------------------------------------------------------+
MySQL [test]> SELECT histogram(c_string, 0.5, 2) FROM histogram_test;
+-------------------------------------------------------------------------------------------------------------------------------------+
| histogram(`c_string`) |
+-------------------------------------------------------------------------------------------------------------------------------------+
| {"sample_rate":0.5,"max_bucket_num":2,"bucket_num":2,"buckets":[{"lower":"str1","upper":"str7","count":4,"pre_sum":0,"ndv":3},...]} |
+-------------------------------------------------------------------------------------------------------------------------------------+
```
Query result description:
```
{
"sample_rate": 0.2,
"max_bucket_num": 128,
"bucket_num": 3,
"buckets": [
{
"lower": "0.1",
"upper": "0.2",
"count": 2,
"pre_sum": 0,
"ndv": 2
},
{
"lower": "0.8",
"upper": "0.9",
"count": 2,
"pre_sum": 2,
"ndv": 2
},
{
"lower": "1.0",
"upper": "1.0",
"count": 2,
"pre_sum": 4,
"ndv": 1
}
]
}
```
Field description:
- `sample_rate`: the sampling rate
- `max_bucket_num`: the maximum number of buckets
- `bucket_num`: the actual number of buckets
- `buckets`: all buckets
- `lower`: lower bound of the bucket
- `upper`: upper bound of the bucket
- `count`: the number of elements contained in the bucket
- `pre_sum`: the total number of elements in all preceding buckets
- `ndv`: the number of distinct values in the bucket
> Total number of histogram elements = count of the last bucket + pre_sum of the last bucket. In the result above, the last bucket has count = 2 and pre_sum = 4, so the histogram contains 6 elements in total.
MySQL [db]> SELECT SUM(a.r[1]) as active_user_num, SUM(a.r[2]) as active_user_num_1day, SUM(a.r[3]) as active_user_num_3day, SUM(a.r[4]) as active_user_num_7day FROM ( SELECT user_id, retention( day = '2022-11-01', day = '2022-11-02', day = '2022-11-04', day = '2022-11-07') as r FROM login_event WHERE (day >= '2022-11-01') AND (day <= '2022-11-21') GROUP BY user_id ) a;
ERROR 1105 (HY000): errCode = 2, detailMessage = sum requires a numeric parameter: sum(%element_extract%(a.r, 1))
1. remove FE config `enable_array_type`
2. limit the nesting depth of array on the FE side.
3. Fix a bug where, when loading array from parquet, the decimal type was treated as bigint
4. Fix loading array from csv (vec-engine); handle null and "null"
5. Change the csv array loading behavior: if the array string format is invalid in csv, it will be converted to null.
6. Remove `check_array_format()`, because its logic is wrong and meaningless
7. Add stream load csv test cases and more parquet broker load tests