Commit Graph

1504 Commits

Author SHA1 Message Date
7fe08c74fe [fix](inverted index) return empty result instead of error for empty match query (#22592)
Return an empty result instead of an error for an empty match query, such as:

`SELECT * FROM t WHERE msg MATCH ''`

`SELECT * FROM t WHERE msg MATCH 'stop_word'`
2023-08-04 17:36:32 +08:00
ef53a27887 [fix](nereids) allow in or exists subquery in binary operator (#22391)
Support subqueries in binary operators, e.g. if(xx IN (subquery), 1, 0); a sketch follows.
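A sketch of the now-supported shape (t1, t2 and k1 are hypothetical names):
```
-- IN subquery inside a conditional/binary expression
SELECT if(t1.k1 IN (SELECT k1 FROM t2), 1, 0) AS flag FROM t1;
-- EXISTS subquery works the same way
SELECT if(EXISTS (SELECT 1 FROM t2 WHERE t2.k1 = t1.k1), 1, 0) AS flag FROM t1;
```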
2023-08-04 15:35:19 +08:00
658d75c816 [feature](Nereids): normalize join condition after expanding or condition NLJ (#22555) 2023-08-04 13:37:37 +08:00
f828a3d826 [shape](nereids) ssb sf100 plan shape check (#22596) 2023-08-04 13:12:21 +08:00
62b1a7bcf3 [tpcds](nereids) add rule to eliminate empty relation #22203
1. eliminate empty relations
2. constant-fold after filter pushdown (see the sketch below)
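An illustrative case (table t is hypothetical) where both rules combine: the pushed-down filter constant-folds to false, so the scan collapses into an empty relation:
```
-- 1 = 0 folds to FALSE, so the plan becomes an EmptyRelation
SELECT * FROM t WHERE 1 = 0;
```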
2023-08-04 12:49:53 +08:00
0e9fad4fe9 [stats](nereids) improve Anti join stats estimation #22444
No impact on TPC-H.
TPC-DS queries 16, 69 and 94 improved.
2023-08-04 12:48:39 +08:00
3447a70b25 [Fix](planner) fix delete stmt that contains WHERE but deletes all data. (#22563) 2023-08-03 23:44:05 +08:00
469886eb4e [FIX](array)fix if function for array() #22553
2023-08-03 19:40:45 +08:00
4322fdc96d [feature](Nereids): add OR expansion in CBO (#22465) 2023-08-03 13:29:33 +08:00
938f768aba [fix](parquet) resolve offset check failed in parquet map type (#22510)
Fix an error when reading empty map values in parquet: `offsets.back()` does not equal the number of elements in the map's key column.

### How does this happen
Map in parquet is stored as repeated group, and `repeated_parent_def_level` is set incorrectly when parsing map node in parquet schema.
The map definition in parquet:
```
optional group <name> (MAP) {
  repeated group map (MAP_KEY_VALUE) {
    required <type> key;
    optional <type> value;
  }
}
```

### How to fix
Set the `repeated_parent_def_level` of key/value node as the definition level of map node.

`repeated_parent_def_level` is the definition level of the first ancestor node whose `repetition_type` equals `REPEATED`. Empty array/map values are not stored in the doris column, so we have to use `repeated_parent_def_level` to skip the empty or null values in the ancestor node.

For instance, consider an array-of-strings column with 3 rows:
`null, [], [a, b, c]`
We store four elements in the data column: `null, a, b, c`,
the offsets column is `1, 1, 4`,
and the null map is `1, 0, 0`.
For the `i-th` row, the range from `offsets[i - 1]` to `offsets[i]` gives the elements of that row, so we can't store empty array/map values in the doris data column. By comparison, spark does not require `repeated_parent_def_level`, because spark columns do store empty array/map values and use another length column to indicate them. See: https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetColumnVector.java

Furthermore, we could also avoid storing null array/map values in the doris data column. For the same three rows as above, we would store only three elements in the data column: `a, b, c`,
with the offsets column `0, 0, 3`
and the null map `1, 0, 0`.
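A minimal reproduction sketch, assuming a hive table t_map (hypothetical) whose parquet files contain empty maps in column m; before the fix the offset check failed on such data:
```
-- column m holds rows such as {} and {"a": 1}
SELECT m FROM hive_catalog.db.t_map;
```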
2023-08-02 22:33:10 +08:00
0cd5183556 [Refactor](inverted index) refact tokenize function for inverted index (#22313) 2023-08-02 19:12:22 +08:00
ddd90855a9 [vectorized](udaf) java udaf support with map type (#22397)
2023-08-02 15:03:44 +08:00
bf50f9fa7f [fix](decimal) fix cast rounding half up with negative number (#22450) 2023-08-01 21:47:42 +08:00
b8399148ef [fix](DOE) es catalog not working with pipeline,datetimev2, array and esquery (#22046) 2023-08-01 21:45:16 +08:00
d5d82b7c31 [stats](nereids) fix bug for avg-size (#22421) 2023-08-01 17:13:00 +08:00
8d16f1bb09 [Chore](materialized-view) update documentation about materialized-view and update test (#22350)
2023-08-01 15:13:34 +08:00
7a2ff56863 [regression](fix) fix test_round case (#22441) 2023-08-01 11:35:44 +08:00
c1f36639fd [fix](sort) VSortedRunMerger does not return any rows with a large offset value (#22191) 2023-07-31 22:28:13 +08:00
450e0b1078 [fix](nereids) recompute logical properties in plan post process (#22356)
The join commute rule swaps the left and right children, which changes the logical properties, so we need to recompute the logical properties in plan post process to get the correct result.
2023-07-31 21:04:39 +08:00
3a1d678ca9 [Fix](Planner) fix parse error of view with group_concat order by (#22196)
Problem:
    When creating a view whose projection contains group_concat(xxx, xxx ORDER BY orderkey), the second parse of the inline view fails.

For example:
    this works:
    "SELECT id, group_concat(`name`, "," ORDER BY id) AS test_group_column FROM test GROUP BY id"
    but creating a view from it does not:
    "create view test_view as SELECT id, group_concat(`name`, "," ORDER BY id) AS test_group_column FROM test GROUP BY id"

Reason:
    When creating a view, we parse view.toSql() again to check for syntax errors. toSql() of group_concat with ORDER BY adds a separator ', ' between the second parameter and the ORDER BY clause, so the second parse fails because the statement no longer has the same semantics as the original:
    group_concat(`name`, "," ORDER BY id)  ==> group_concat(`name`, "," , ORDER BY id)

Solved:
    Change toSql() of group_concat, and analyze the ORDER BY clause of group_concat in the Planner; this works because we take the ORDER BY from the view statement without analyzing it and binding slot references to it.
2023-07-31 17:20:23 +08:00
7261845b3d [FIX](complex-type)fix complex type nested col_const (#22375)
For array/map/struct in mysql_writer, unpack_if_const only unpacks the column itself, not the nested columns, so col_const should not be used for nested columns.
2023-07-31 14:53:18 +08:00
f2919567df [feature](datetime) Support timezone when insert datetime value (#21898) 2023-07-31 13:08:28 +08:00
79289e32dc [fix](cast) fix wrong result of casting empty string to array date (#22281) 2023-07-30 21:15:03 +08:00
03761c37cd [Improvement](multi catalog) Support Iceberg, Paimon and MaxCompute table in nereids. (#22338) 2023-07-29 21:43:35 +08:00
47c2cc5c74 [vectorized](udf) java udf support with return map type (#22300) 2023-07-29 12:52:27 +08:00
ae8a26335c [opt](hive) opt select count(*) stmt: push down agg on parquet in hive (#22115)
Optimize the "select count(*) from table" statement by pushing the "count" down to BE; the optimized statement shape is sketched after the numbers below.
Supported file types: parquet and orc in hive.

1. 4k files, 600 million rows
    before: 1 min 37.70 sec
    after: 50.18 sec

2. 50 files, 600 million rows
    before: 1.12 sec
    after: 0.82 sec
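The optimized statement shape is just the bare count (catalog and table names hypothetical):
```
-- count(*) is answered from parquet/orc file metadata on BE
-- instead of scanning the row data
SELECT count(*) FROM hive_catalog.db.t;
```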
2023-07-29 00:31:01 +08:00
53d255f482 [fix](partial update) remove CHECK on illegal number of partial columns (#22319) 2023-07-28 23:11:58 +08:00
f7c106c709 [opt](nereids) enhance broadcast join cost calculation (#22092)
Enhance broadcast join cost calculation by considering both the extra build-side effort of building a bigger hash table and the extra probe-side effort (the bigger cost of ProbeWhenBuildSideOutput and ProbeWhenSearchHashTable) when parallel_fragment_exec_instance_num is more than 1.

The current solution applies a penalty factor to rightRowCount; the factor is the total instance number to the power of 2.
A penalty on outputRows is not applied currently and will be refined in the next-generation cost model.

Also brings some updates for shape checking:

- update the control variable in shape files from parallel_fragment_exec_instance_num to parallel_pipeline_task_num when pipeline is enabled.
- fix an issue where the be_number variable was inactive.
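Read literally from the description above (a sketch of the penalty, not the exact cost-model code): penalizedBuildRows = rightRowCount × totalInstanceNum^2, applied when more than one instance runs.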
2023-07-28 23:06:02 +08:00
2f43e59535 [test](regression) add partial update seq_col delete cases (#22340) 2023-07-28 17:36:55 +08:00
05abfbc5ef [improvement](regression-test) add compression algorithm regression test (#22303) 2023-07-28 17:28:52 +08:00
5a0ad09856 [fix](nereids) SubqueryToApply may lost conjunct (#22262)
Consider this SQL:
```
SELECT *
        FROM sub_query_correlated_subquery1 t1
        WHERE coalesce(bitand( 
        cast(
            (SELECT sum(k1)
            FROM sub_query_correlated_subquery3 ) AS int), 
            cast(t1.k1 AS int)), 
            coalesce(t1.k1, t1.k2)) is NULL
        ORDER BY  t1.k1, t1.k2;
```
The `is NULL` conjunct was lost in the SubqueryToApply rule. This PR fixes it.
2023-07-28 15:08:56 +08:00
0c734a861e [Enhancement](delete) eliminate reading the old values of non-key columns for delete stmt (#22270) 2023-07-28 14:37:33 +08:00
5da5fac37a [refactor](Nereids) add result sink node (#22254)
Use ResultSink as the query root node so that the plan of a query statement has the same pattern as an insert statement.
2023-07-28 11:31:09 +08:00
adc44d9f46 [regression-test] add list partition case and multi partition keys case (#22042)
* [regression-test] add list partition case and multi partition keys case

* fix delete failed
2023-07-28 10:12:35 +08:00
0d7d9b92db [fix](multi-catalog) complex types parsing failed, with unexpected nulls and rows (#22228)
Fix two bugs:
1. Unexpected null values in an array column. If 65535 consecutive values in a nullable array column are not null, this error is triggered; the array parser did not handle this boundary condition.
2. The number of rows in the key field and the value field of a map column are not equal. Similarly, the number of rows differs among the fields of a struct column. This is triggered when the number of rows differs among the parquet pages of different columns in a row group.
2023-07-28 10:03:08 +08:00
8caa5a9ba4 [Fix](multi-catalog) Fix null partitions error in iceberg tables. (#22185)
### Issue
When a table has null partitions, reading them throws the error
`Failed to fill partition column: t_int=null`

### Resolution
- Fix the null partitions error in iceberg tables by replacing null partition values with '\N' (a repro sketch follows).
- Add regression test for hive null partition.
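A hedged reproduction, assuming an iceberg table t partitioned by t_int with a NULL partition value:
```
-- previously failed with "Failed to fill partition column: t_int=null";
-- with the fix the NULL partition is stored as '\N' and read back as NULL
SELECT * FROM iceberg_catalog.db.t WHERE t_int IS NULL;
```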
2023-07-27 23:57:35 +08:00
b5fa29e138 [fix](bitmap) incorrect result of function 'bitmap_from_array' (#22305) 2023-07-27 22:44:06 +08:00
716d58f5ff [fix](Nereids) decimal divide should not return null if numerator is zero (#22309) 2023-07-27 20:23:04 +08:00
a87d34b19b [Fix](multi catalog statistics)Improve external table statistics collection (#22224)
Improve external table statistics collection, including logging, observability, and some bug fixes:
1. Add a Running state for statistics jobs.
2. Add progress to show analyze job (n/m tasks finished, n/m tasks failed, and so on).
3. Add analyze time cost to show analyze task.
4. Make task failure messages clearer.
5. Synchronize the job status updating code in updateTaskStatus.
6. Fix an NPE in HMSAnalyzeTask (avoid refreshing the statistics cache if the collection sql failed).
7. Return an error message for `with sync` collection on timeout.
8. Log level improvements.
9. Fix misuse of logCreateAnalysisJob for tasks.
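For context, a flow that exercises the job/task reporting above (a sketch; the catalog and table names are hypothetical):
```
-- synchronous collection; on timeout an error message is now returned
ANALYZE TABLE hive_catalog.db.t WITH SYNC;
-- now shows progress such as n/m tasks finished and n/m tasks failed
SHOW ANALYZE;
```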
2023-07-27 20:01:14 +08:00
ae5e39ad26 [opt](Nereids) add double signature back for round like function (#22284)
2023-07-27 19:10:43 +08:00
6f1c03c766 [fix](jdbc_catalog) fix int and bigint in mysql view when use doris catalog (#22251) 2023-07-27 16:50:42 +08:00
0512e0b168 [test](regression) add cases for partial update with sequence_type (#22215) 2023-07-27 15:51:01 +08:00
4f6a3c5bf0 [feature](catalog) support clob type in oracle jdbc catalog (#21532) 2023-07-27 15:49:15 +08:00
ddfdf62993 [opt](planner) support to parse scientific notation(aEb) (#22248) 2023-07-27 13:31:37 +08:00
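For example:
```
-- aEb scientific notation now parses; 2.5E3 evaluates to 2500
SELECT 2.5E3;
```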
41a230b721 [fix] iceberg catalog: support specifying the version and time (#22209)
problem:
1. create an iceberg-type catalog
2. use the iceberg catalog to specify a version
```
mysql> show catalog iceberg;
+----------------------+--------------------------+
| Key                  | Value                    |
+----------------------+--------------------------+
| type                 | iceberg                  |
| iceberg.catalog.type | hms                      |
| hive.metastore.uris  | thrift://127.0.0.1:9083  |
| hadoop.username      | hadoop                   |
| create_time          | 2023-07-25 16:51:00.522  |
+----------------------+--------------------------+
5 rows in set (0.02 sec)

mysql> select * from iceberg.iceberg_db.tb1 FOR VERSION AS OF 8783036402036752909;
ERROR 5090 (42000): errCode = 2, detailMessage = Only iceberg/hudi external table supports time travel in current version
```

change:
Add an `ICEBERG_EXTERNAL_TABLE` type so that specifying the version and time works; examples follow.
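With the new table type, both forms from the repro are expected to work (the snapshot id is from the example above; the timestamp is illustrative):
```
SELECT * FROM iceberg.iceberg_db.tb1 FOR VERSION AS OF 8783036402036752909;
SELECT * FROM iceberg.iceberg_db.tb1 FOR TIME AS OF '2023-07-25 16:51:00';
```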
2023-07-27 12:04:41 +08:00
619a2857e1 [improvement](jdbc catalog) improve mysql jdbc catalog reading of bytea types & other improvements (#22233) 2023-07-27 10:18:37 +08:00
341c45974c [round](decimalv2) round precise decimalv2 value (#22258) 2023-07-27 10:00:36 +08:00
163a38a527 [opt](Nereids) support sql cache (#22144)
1. let Nereids support sql cache
2. let the legacy planner's sql cache support union all
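A usage sketch, assuming the existing enable_sql_cache session variable gates the cache (table t is hypothetical):
```
SET enable_sql_cache = true;
-- repeated identical statements may now be served from the SQL cache under Nereids
SELECT count(*) FROM t;
```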
2023-07-27 09:57:31 +08:00
8fb28ecc9e [test](partial-update) add some cases for partial-update (#22210) 2023-07-27 09:52:40 +08:00
dcd6844ea5 [improvement](regression-test) add partial update with schema change case (#22213) 2023-07-27 09:51:42 +08:00