Commit Graph

3105 Commits

Author SHA1 Message Date
3e1e8db173 [fix](exec) fix thread token shutdown (#14418)
Fix Thread pool token was shut down error.
This is because when there are more than 1 fragment of a query on one BE, the thread token maybe
reset incorrectly, causing thread token shutdown earlier.
cherry-pick from master
Introduced from #13021
2022-11-20 00:04:48 +08:00
2c42f0a905 [refactor](decimalv3) Refine code for DecimalV3 (#14394) 2022-11-19 16:57:17 +08:00
1f2c06dd6e [enhancement](rewrite) Remove unused wide common factors to improve scan performance in ExtractCommonFactorsRule (#14381)
* [enhancemeng](sql) Remove unused wide common factors to improve scan performance in ExtractCommonFactorsRule

* fix regression test

Co-authored-by: caiconghui1 <caiconghui1@jd.com>
2022-11-19 13:23:49 +08:00
f5f2e84e31 [refactor](planner) remove the limit return rows of order by (#12478)
Originally, Order By Limit returned a maximum of 65535 rows of data by default during the query,
but now many businesses do not apply this limit.
It is necessary to add larger data after the query statement to complete the full data query,
which is extremely inconvenient, so adjustments have been made.

At the same time, I added the variable DEFAULT_ORDER_BY_LIMIT to the SessionVariable,
the default value is -1, if the user does not use the LIMIT keyword or the LIMIT value is a negative integer,
the default query return value is Long.MAX_VALUE. If the corresponding maximum query value is set,
the number of data items is returned according to the maximum query value or the value followed by the
LIMIT keyword.
2022-11-19 12:45:44 +08:00
1b6e872a8a [improvement](common) table name length exceeds limit error message (#14368)
For the table name check, the regular match error and the length exceeds the limit, both of which display the message "Incorrect table name 'xxx'. Table name regex is 'xxx'".
Obviously, the message cannot clearly point out what kind of error it is.
So it is a better way to separate the two error messages.
2022-11-19 11:36:08 +08:00
63a2344e68 [Enhancement](Nereids) Refactor AggregateFunction and support explain plan (#14380)
# Proposed changes

- Refactor AggregateFunction
    1. AggregateFunction implement ComputeSignature
    3. Add a CustomSignature to dynamic compute signature, we can check input type and compute implicit cast type in the `customSignature` method
    2. Add PartialAggType to record some type information before disassemble aggregate
    4. Refine and create a custom catalog function when translate AggregateFunction, without `finalizeForNereids`
-  Support explain plan
    1. explain parsed plan select ...
    5. explain analyzed plan select ...
    6. explain rewritten/logical plan select ...
    7. explain optimized/physical plan select ...
    8. explain all plan select ...
2022-11-18 23:40:33 +08:00
c4bade71c8 [refactor](nereids) remove ColumnStatistics.UNKNOWN from StatsDerive (#14343)
ColumnStatistics.UNKNOWN can be replaced by ColumnStatistics.DEFAULT
2022-11-18 23:40:00 +08:00
a82896f420 [fix](broker-load) fix that broker load don not set be exec version and limit node channel memory (#14399) 2022-11-18 23:38:37 +08:00
68da6bccb7 [fix](type) fix DECIMAL scale when cast function on fe (#12877)
before:
MySQL [test]> select cast('135.759999999' as DECIMAL(10,3));
+----------------------------------------+
| CAST('135.759999999' AS DECIMAL(10,3)) |
+----------------------------------------+
| 135.759999999 |
+----------------------------------------+
1 row in set (0.00 sec)

now:
MySQL [stage]> select cast('135.759999999' as DECIMAL(10,3));
+----------------------------------------+
| CAST('135.759999999' AS DECIMAL(10,3)) |
+----------------------------------------+
| 135.759 |
+----------------------------------------+
1 row in set (0.01 sec)
2022-11-18 19:36:14 +08:00
2c4236fd24 [improvement](ctas) use string type for varchar/char/string (#14382)
When executing create table as select stmt,
the varchar/char/string type of column in created table will be unified to string type.

Because when select from external table (mysql/pg, etc), the length of varchar in external database
is calculated by "char" length, not "byte" length.
So if there is a column with varchar(10) in external table, then there will be a same varchar(10)
in created table. But the byte length of data in external table may be larger than 10, causing failure of CTAS.

Change to string will not impact performance of the capacity of disk storage.
And notice that if a string type column is the first column, it will be changed to varchar(65535),
because we do not allow string type column as sort key column.
2022-11-18 14:20:13 +08:00
a1d02f36ac [feature](table-valued-function) support hdfs() tvf (#14213)
This pr does two things:
1. support `hdfs()` table valued function.
2. add regression test
2022-11-18 14:17:02 +08:00
7952bce03f [compatibility](Nereids) process escape in string literal (#14294) 2022-11-18 11:24:00 +08:00
9e25aa8d3e [feature](Nereids): Add subgraph enumerator #14291
Add subgraph enumerator to find the best plan

For DPHyp, we need an enumerator for all csg-cmp pairs to find the best plan
2022-11-18 10:33:30 +08:00
da0b09caea [fix](Nereids) DateTimeType migrate to DateType is wrong when hour, minute and second all zero (#14327)
1. fix DateTimeType migrate to DateType is wrong when hour, minute and second all zero
2. add TPC-H regression test with DATEV2 type
2022-11-18 01:38:03 +08:00
fb140d0180 [Enhancement](sequence-column) optimize the use of sequence column (#13872)
When you create the Uniq table, you can specify the mapping of sequence column to other columns.
You no longer need to specify mapping column when importing.
2022-11-17 22:39:09 +08:00
50bfd99b59 [feature](join) support nested loop semi/anti join (#14227) 2022-11-17 22:20:08 +08:00
8fe5211df4 [improvement](multi-catalog)(cache) invalidate catalog cache when refresh (#14342)
Invalidate catalog/db/table cache when doing
refresh catalog/db/table.

Tested table with 10000 partitions. The refresh operation will cost about 10-20 ms.
2022-11-17 20:47:46 +08:00
ccf4db394c [feature-wip](multi-catalog) Collect external table statistics (#14160)
Collect HMS external table statistic information through external metadata.
Insert the result into __internal_schema.column_statistics using insert into SQL.
2022-11-17 20:41:09 +08:00
44ee4386f7 [test](multi-catalog)Regression test for external hive orc table (#13762)
Add regression test for external hive orc table. This PR has generated all basic types support by hive orc, and create a hive external table to touch them in docker environment.
Functions to be tested:
1. Ensure that all types are parsed correctly
2. Ensure that the null map of all types are parsed correctly
3. Ensure that the `SearchArgument` of `OrcReader` works well
4. Only select partition columns
2022-11-17 20:36:02 +08:00
98956dfa19 [fix](statistics) statistics inaccurate after analyze same table more than once (#14279)
If a table already been analyzed, then we analyze it again, the new statistics would larger than expected since the incremental would contain the values from table level statistics since the SQL lack the predication for the nullability of part_id
2022-11-17 20:18:14 +08:00
6da2948283 [feature-wip](multi-catalog) support iceberg v2(step 1) (#13867)
Support position delete(part of).
2022-11-17 17:56:48 +08:00
af462b07c7 [enhancement](explain) compress descriptor table explain string (#14152)
1. compress slot descriptor explain string to one row
2. remove unmaterialized tuple descriptor and slot descriptor

before this PR descriptor table explain string is like this:
```
TupleDescriptor{id=0, tbl=lineitem, byteSize=176, materialized=true}
  SlotDescriptor{id=0, col=l_shipdate, type=DATEV2}
    parent=0
    materialized=true
    byteSize=4
    byteOffset=0
    nullIndicatorByte=0
    nullIndicatorBit=-1
    nullable=false
    slotIdx=0

  SlotDescriptor{id=1, col=l_orderkey, type=BIGINT}
    parent=0
    materialized=true
    byteSize=8
    byteOffset=24
    nullIndicatorByte=0
    nullIndicatorBit=-1
    nullable=false
    slotIdx=6
```

after this PR descriptor table explain string is like this:
```
TupleDescriptor{id=2, tbl=lineitem}
  SlotDescriptor{id=1, col=l_extendedprice, type=DECIMAL(15,2), nullable=false}
  SlotDescriptor{id=2, col=l_discount, type=DECIMAL(15,2), nullable=false}
```
2022-11-17 15:19:17 +08:00
afc9065b51 [test](nereids) add filter estimation ut cases (#14293)
fix a bug for filter estimation, in pattern of A>10 and A<20.
2022-11-17 11:01:30 +08:00
7182f14645 [improvement][fix](multi-catalog) speed up list partition prune (#14268)
In previous implementation, when doing list partition prune, we need to generation `rangeToId`
every time we doing prune.
But `rangeToId` is actually a static data that should be create-once-use-every-where.

So for hive partition, I created the `rangeToId` and all other necessary data structures for partition prunning
in partition cache, so that we can use it directly.

In my test, the cost of partition prune for 10000 partitions reduce from 8s -> 0.2s.

Aslo add "partition" info in explain string for hive table.
```
|   0:VEXTERNAL_FILE_SCAN_NODE                           |
|      predicates: `nation` = '0024c95b'                 |
|      inputSplitNum=1, totalFileSize=4750, scanRanges=1 |
|      partition=1/10000                                 |
|      numNodes=1                                        |
|      limit: 10                                         |
```

Bug fix:
1. Fix bug that es scan node can not filter data
2. Fix bug that query es with predicate like `where substring(test2,2) = "ext2";` will fail at planner phase.
`Unexpected exception: org.apache.doris.analysis.FunctionCallExpr cannot be cast to org.apache.doris.analysis.SlotRef`

TODO:
1. Some problem when quering es version 8: ` Unexpected exception: Index: 0, Size: 0`, will be fixed later.
2022-11-17 08:30:03 +08:00
wxy
943e014414 [enhancement](decommission) speed up decommission process (#14028) (#14006) 2022-11-16 20:43:07 +08:00
47a6373e0a [feature](Nereids) support datev2 and datetimev2 type (#14263)
1. split DateLiteral and DateTimeLiteral into V1 and V2
2. add a type coercion about DateLikeType: DateTimeV2Type > DateTimeType > DateV2Type > DateType
3. add a rule to remove unnecessary CAST on DateLikeType in ComparisonPredicate
2022-11-16 15:51:48 +08:00
6881989dd9 [Bug](jvm memory) Support multiple java version to get max heap size (#14295)
`sun.misc.VM.maxDirectMemory` is used in JDK1.8 only. This PR add the interface for JDK11.
2022-11-16 10:58:58 +08:00
70cc725649 [Vectorized](function) support avg_weighted/percentile_array/topn_wei… (#14209)
* [Vectorized](function) support avg_weighted/percentile_array/topn_weighted functions

* update add to stringRef
2022-11-15 16:38:38 +08:00
87544a017f [fuzztest](fe session variable) add fuzzy test config for fe session variables. (#14272)
Many feature in fe session variable is disabled by default. So that these features do not pass github workflow test actually. I add a fuzzy test config in fe.conf. If it is set to true, then we will use fuzzy session variables for every connection so that every feature developer could set fuzzy values for its config.



Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-11-15 15:43:21 +08:00
5ae046b208 [bugfix](log) fix wrong print introduced by 49fecd2a6dae #14266 2022-11-15 11:39:05 +08:00
a3062c662c [feature-wip](statistics) support statistics injection and show statistics (#14201)
1. Reduce the configuration options for statistics framework, and add comment for those rest.
2. Move the logic of creation of analysis job to the `StatisticsRepository` which defined all the functions used to interact with internal statistics table
3. Move AnalysisJobScheduler to the statistics package
4. Support display and injections manually for statistics
2022-11-15 11:29:51 +08:00
89db3fee00 [feature-wip](MTMV)Add show statement for MTMV (#13786)
Use Case

mysql> CREATE TABLE t1 (pk INT, v1 INT SUM) AGGREGATE KEY (pk) DISTRIBUTED BY hash (pk) PROPERTIES ('replication_num' = '1');
mysql> CREATE TABLE t2 (pk INT, v2 INT SUM) AGGREGATE KEY (pk) DISTRIBUTED BY hash (pk) PROPERTIES ('replication_num' = '1');
mysql > CREATE MATERIALIZED VIEW mv BUILD IMMEDIATE REFRESH COMPLETE KEY (mv_pk) DISTRIBUTED BY HASH (mv_pk) PROPERTIES ('replication_num' = '1') AS SELECT t1.pk as mv_pk FROM t1, t2 WHERE t1.pk = t2.pk;

mysql> SHOW MTMV JOB;
mysql> SHOW MTMV TASK;
2022-11-15 10:32:47 +08:00
37fdd011b4 [fix](fe-metric) Prometheus read format error #13831 (#13832)
Co-authored-by: 迟成 <chicheng@meituan.com>
2022-11-14 22:07:00 +08:00
b0ff852d74 [opt](Nereids) right deep tree penalty adjust: use right rowCount, not abs(left - right) (#14239)
in origin algorithm, the penalty is abs(leftRowCount - RightRowCount). this will make some right deep tree escape from penalty, because the substraction is almost zero. Penalty by RightRowCount can avoid this escape.
2022-11-14 16:40:26 +08:00
bea66e6a12 [fix](nereids) cannot generate RF on colocate join and prune useful RF in RF prune (#14234)
1. when we translate colocated join, we lost RF information attached to the right child, and hence BE will not generate those RFs.
2. when a RF is useless, we prune all RFs on the scan node by mistake
2022-11-14 16:36:55 +08:00
8dd2f8b349 [enhancement](nereids) set Ndv=rowCount if ndv is almost equal to rowCount on ColumnStatisitics load (#14238) 2022-11-14 16:30:35 +08:00
bdf7d2779a [fix](Nereids) aggregate always report has 1 row count (#14236)
the data structure of new stats is changed, bug Agg-estimation is not changed
2022-11-14 16:27:55 +08:00
47326f951d [fix](nereids) count(*) reports npe when do filter selectivity estimation (#14235) 2022-11-14 16:11:08 +08:00
cf5e2a2eb6 [fix](nereids) new statistics use wrong default selectivity (#14233)
by default, column selectivity MUST be 1.0, not ZERO
2022-11-14 16:09:17 +08:00
7eed5a292c [feature-wip](multi-catalog) Support hive partition cache (#14134) 2022-11-14 14:12:40 +08:00
594e3b8224 [feature](Nereids) add circle detector and avoid overlap (#14164) 2022-11-14 14:02:14 +08:00
23a8c7eeb6 (fix)(multi-catalog)(es) Fix error result because not used fields_context (#14229)
Fix error result because not used fields_context
2022-11-14 14:00:55 +08:00
49fecd2a6d [improvement](log) print info of error replicas (#14220) 2022-11-14 11:37:18 +08:00
13b1f92c63 [enhancement](Nereids) add output set and output exprid set cache (#14151) 2022-11-14 11:24:57 +08:00
8263c34da6 [fix](ctas) use json_object in CTAS get wrong result (#14173)
* [fix](ctas) use json_object in CTAS get wrong result

Signed-off-by: nextdreamblue <zxw520blue1@163.com>
2022-11-14 09:13:05 +08:00
beaf2fcaf6 [feature](partition) support new create partition syntax (#13772)
Create partitions use :
```
PARTITION BY RANGE(event_day)(
        FROM ("2000-11-14") TO ("2021-11-14") INTERVAL 1 YEAR,
        FROM ("2021-11-14") TO ("2022-11-14") INTERVAL 1 MONTH,
        FROM ("2022-11-14") TO ("2023-01-03") INTERVAL 1 WEEK,
        FROM ("2023-01-03") TO ("2023-01-14") INTERVAL 1 DAY,
        PARTITION p_20230114 VALUES [('2023-01-14'), ('2023-01-15'))
)

PARTITION BY RANGE(event_time)(
        FROM ("2023-01-03 12") TO ("2023-01-14 22") INTERVAL 1 HOUR
)
```
can create a year/month/week/day/hour's date partitions in a batch,
also it is compatible with the single partitioning method.
2022-11-12 20:52:37 +08:00
d9913b1317 [Enhancement](Nerieds) Support numbers TableValuedFunction and some bitmap/hll aggregate function (#14169)
## Problem summary
This pr support
1. `numbers` TableValuedFunction for nereids test, like `select * from numbers(number = 10, backend_num = 1)`
2. bitmap/hll aggregate function
3. support find variable length function in function registry, like `coalesce`
4. fix a bug that print nerieds trace will throw exception because use RewriteRule in ApplyRuleJob, e.g: `AggregateDisassemble`, introduced by #13957
2022-11-11 16:29:15 +08:00
7c48168a53 [refactor](Nereids) remove DecimalType, use DecimalV2Type instead (#14166) 2022-11-11 13:58:16 +08:00
b6ba654f5b [Feature](Sequence) Support sequence_match and sequence_count functions (#13785) 2022-11-11 13:38:45 +08:00
5fad4f4c7b [feature](Nereids) replace order by keys by child output if possible (#14108)
To support query like that:
SELECT c1 + 1 as a, sum(c2) FROM t GROUP BY c1 + 1 ORDER BY c1 + 1

After rewrite, plan will equal to
SELECT c1 + 1 as a, sum(c2) FROM t GROUP BY c1 + 1 ORDER BY a
2022-11-11 13:34:29 +08:00