5055 Commits

Author SHA1 Message Date
16c218fde5 [feature](nereids) support bind external relation out of Doris fe environment (#21123)
Support binding external relations outside the Doris FE environment, for example when analyzing SQL in another Java application.
See BindRelationTest.bindExternalRelation.
2023-06-29 14:29:29 +08:00
3a12b67517 [Improvement](statistics, multi catalog)Implement hive table statistic connector (#21053)
This PR adds the function for collecting Hive statistics. When the CBO fetches Hive table statistics, the statistics cache
first loads them from the internal stats olap table. If they are not found there, it uses this PR's function to fetch them from the remote Hive metastore.
2023-06-29 10:50:54 +08:00
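The lookup order described in #21053 can be sketched as a two-tier fallback. This is a minimal illustration in Python, not the Doris implementation; `load_from_internal_table` and `fetch_from_hive_metastore` are hypothetical stand-ins for the two sources named in the message.

```python
from typing import Callable, Dict, Optional

Stats = Dict[str, float]


def load_table_stats(
    table: str,
    load_from_internal_table: Callable[[str], Optional[Stats]],
    fetch_from_hive_metastore: Callable[[str], Optional[Stats]],
) -> Optional[Stats]:
    """Return stats from the internal table if present, else from the remote metastore."""
    stats = load_from_internal_table(table)  # hypothetical local lookup
    if stats is not None:
        return stats
    # Fall back to the remote source only when the local lookup misses.
    return fetch_from_hive_metastore(table)  # hypothetical remote lookup
```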
Pxl
45f1909bc3 [Bug](lateral-view) make lateral view function's nullable mode work (#21242)
make lateral view function's nullable mode work
2023-06-29 10:50:07 +08:00
30b1b93353 [dependency](fe)Dependency version upgrade (#21191)
Keep hadoop-aliyun version consistent with hadoop main version (3.3.5)
upgrade jackson to 2.14.3
upgrade netty version to 4.1.94.final
binding check.framework version to 3.32.0
upgrade snappy-java to 1.1.10.1
upgrade hudi version to 0.13.1
upgrade spring version to 2.7.13
upgrade orc version to 1.8.4
revert nonsensical changes
2023-06-29 10:01:33 +08:00
64ffb06a79 [fix](Nereids) olap scan should not be gather since coordinator could not process (#21298)
In PR #21168, we refactored physical properties and the translator
to ensure no useless exchange is generated. An olap scan node
could be gather in Nereids but be translated to hash partitioned.
Since the coordinator cannot process a gather olap scan node,
we remove this candidate distribution spec from olap scan.
2023-06-29 09:12:08 +08:00
9af714bceb [fix](catalog) disable FileSystem Cache to avoid too many fs cache (#21283)
When creating a new hive catalog or refreshing a hive catalog, the HiveMetaStore cache is refreshed,
which calls "FileInputFormat.setInputPaths()".
That method creates a new FileSystem instance and stores it in FileSystem's cache.
So if a catalog is refreshed frequently, too many FileSystem instances accumulate in the cache, causing OOM.

This PR disables the FileSystem cache.
2023-06-29 09:06:00 +08:00
884c908e25 [Enhancement](multi-catalog) try to reuse existing ugi. (#21274)
Try to reuse an existing ugi in DFSFileSystem. Otherwise, querying an hms table with more than ten thousand partitions performs more than ten thousand login operations, and each login operation costs hundreds of ms in my test.
Co-authored-by: 王翔宇 <wangxiangyu@360shuke.com>
2023-06-29 09:04:59 +08:00
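The idea behind #21274, logging in once and reusing the result, can be sketched as a small per-user cache. This is an illustrative Python sketch under assumptions: `LoginCache` and the `login` callable are hypothetical names, and the real change reuses a Hadoop ugi inside DFSFileSystem rather than anything shown here.

```python
from typing import Callable, Dict


class LoginCache:
    """Caches one login handle per user so repeated requests skip the login call."""

    def __init__(self, login: Callable[[str], object]) -> None:
        self._login = login  # hypothetical expensive login function
        self._handles: Dict[str, object] = {}
        self.login_calls = 0  # counter kept only to make the saving observable

    def get(self, user: str) -> object:
        if user not in self._handles:
            self.login_calls += 1
            self._handles[user] = self._login(user)
        return self._handles[user]
```

With this shape, ten thousand partition reads for the same user trigger a single login. The sketch is not thread-safe; a concurrent version would need a lock around the lookup.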
449c8d4568 [fix](jdbc) Handling Zero DateTime Values in Non-nullable Columns for JDBC Catalog Reading MySQL (#21296) 2023-06-28 22:51:17 +08:00
e7dd65f551 [fix](test) fix PlannerTest testEliminatingSortNode (#21112)
testEliminatingSortNode needs to check whether a SortNode exists in the plan tree, so it should check plan1.contains("order by:") rather than plan1.contains("SORT INFO:") or plan1.contains("SORT LIMIT:").
2023-06-28 21:29:23 +08:00
a6b51ec19a [Feature](avro) Support Apache Avro file format (#19990)
Support reading Avro files via hdfs() or s3().
```sql
select * from s3(
         "uri" = "http://127.0.0.1:9312/test2/person.avro",
         "ACCESS_KEY" = "ak",
         "SECRET_KEY" = "sk",
         "FORMAT" = "avro");
+--------+--------------+-------------+-----------------+
| name   | boolean_type | double_type | long_type       |
+--------+--------------+-------------+-----------------+
| Alyssa |            1 |     10.0012 | 100000000221133 |
| Ben    |            0 |    5555.999 |      4009990000 |
| lisi   |            0 | 5992225.999 |      9099933330 |
+--------+--------------+-------------+-----------------+

select * from hdfs(
                "uri" = "hdfs://127.0.0.1:9000/input/person2.avro",
                "fs.defaultFS" = "hdfs://127.0.0.1:9000",
                "hadoop.username" = "doris",
                "format" = "avro");
+--------+--------------+-------------+-----------+
| name   | boolean_type | double_type | long_type |
+--------+--------------+-------------+-----------+
| Alyssa |            1 |  8888.99999 |  89898989 |
+--------+--------------+-------------+-----------+
```

The current Avro reader only supports common data types; complex data types will be supported later.
2023-06-28 21:15:35 +08:00
325504deeb [bugfix](recover) do not need dynamic partition recover except olap table (#21290)
introduced by #19031

FE could no longer recover because the code contains a convert-to-olap-table operation. But many table types are not olap tables, such as view, jdbc table ...
The conversion fails and FE does not start correctly.
Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-06-28 19:56:17 +08:00
016870b673 [opt](nereids) use Expression's isConstant to check whether could be remove from group by key (#21195) 2023-06-28 19:12:36 +08:00
76620c21aa [improvement](nereids) prune hash join output slot ids list (#20789)
1. prune the hash join output slot ids list based on the slot ids used in the required project and in other conjuncts, to reduce BE-side work.
2. support pruning for semi/anti joins as well
2023-06-28 17:28:18 +08:00
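The pruning rule in #20789 amounts to keeping only the output slots that something still references. A minimal Python sketch, assuming slots are plain integer ids; the function and parameter names are hypothetical and do not come from the patch.

```python
from typing import Iterable, List, Set


def prune_join_output_slots(
    join_output: List[int],
    project_slots: Iterable[int],
    other_conjunct_slots: Iterable[int],
) -> List[int]:
    """Keep only the join output slot ids that are still referenced."""
    required: Set[int] = set(project_slots) | set(other_conjunct_slots)
    # Preserve the original output order while dropping unreferenced slots.
    return [slot for slot in join_output if slot in required]
```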
7588abe76b [refactor](Nereids) refactor physical properties and plan translator (#21168)
This PR
1. refactors physical properties, the property deriver and the property regulator
to ensure Nereids can generate plans with sufficient PhysicalDistribute.
2. refactors PhysicalPlanTranslator to ensure all ExchangeNodes are generated
by PhysicalDistribute, except for CTEConsumer. We will refactor all CTE-related
nodes later.

the detail changes of this PR:
1. update DistributionSpec of physical properties:
- Any: random distribution, used in output and require
- StorageAny: random distribution but constrained by where the data is stored, used in output
- ExecutionAny: random distribution representing a random shuffle, used in output
- Gather: gather distribution, used in output and require
- StorageGather: gather distribution but constrained by where the data is stored, used in output
- Replicated: broadcast distribution
- Hash: bucket distribution

2. update shuffle type of DistributionSpecHash
- REQUIRE: used in require
- NATURAL: distribution as storage engine hash algorithm, constrained by where the data is stored
- STORAGE_BUCKETED: distribution as storage engine hash algorithm
- EXECUTION_BUCKETED: distribution as execution engine hash algorithm

3. update HideOneRowRelationUnderSetOperation to MergeOneRowRelationIntoSetOperation

4. update the property deriver of SetOperation to ensure a suitable PhysicalDistribute is added
above and below SetOperation

5. refactor PhysicalPlanTranslator to ensure no unplanned exchange node will be added
2023-06-28 15:15:11 +08:00
08fe22cb0c [improvement](backup) Add BackupJobInfo with tableCommitSeqMap (#21255)
Signed-off-by: Jack Drogon <jack.xsuperman@gmail.com>
2023-06-28 11:10:12 +08:00
853fa5f688 [typo](nativeInsertStmt) fix object-stored column exception description (#21221) 2023-06-28 10:12:55 +08:00
b1e973b721 [Improve](func)support array to window-func first-last-value arg type (#21201)
* support array to window-func first-last-value arg type

* add regress test for first-last-value of array type

* update

* format be:
2023-06-28 10:02:00 +08:00
98b2bc87b5 [typo](MultiPartitionDesc) fix Multi partition time interval exception description (#21222) 2023-06-28 00:42:25 +08:00
d871df64ca [improvement](oracle jdbc)Support for automatically obtaining the precision of the oracle timestamp type (#21252) 2023-06-28 00:19:01 +08:00
92882ebd91 [fix](inverted index) update output rowset index meta with input rowset when drop inverted index (#21248) 2023-06-27 23:54:35 +08:00
5506faa7b4 [datetimev2](minor) Add scale parameter for datetimev2 (#21176) 2023-06-27 19:55:35 +08:00
acba8648a5 [enhancement](nereids) Add log for stats (#21164)
1. Log the SQL when analyze fails
2. Return directly from the analyze_test suite when there is more than one frontend
3. Set query_timeout for tpcds suites to avoid unnecessary failures caused by analyze sync
2023-06-27 19:17:22 +08:00
7d22910fbd [improvement](workloadgroup)add check when drop/set workload group (#21174)
1. Check that the group exists when setting a group in a user property;
e.g., if g1 does not exist, the set operation should fail.

mysql [test]>SET PROPERTY FOR 'root' 'default_workload_group' = 'g1';
ERROR 1105 (HY000): errCode = 2, detailMessage = workload group g1 not exists
2. Check whether the group is in use by a user when dropping a group;
e.g., if a group is set for root, the drop should fail.

mysql [test]>drop workload group test_g1;
ERROR 1105 (HY000): errCode = 2, detailMessage = workload group test_g1 is set for user root
2023-06-27 18:10:32 +08:00
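The two checks in #21174 can be sketched as guards on a small registry. This is an illustrative Python model, not Doris code; `WorkloadGroupRegistry` and its methods are hypothetical, and only the message wording is copied from the errors shown above.

```python
from typing import Dict, Set


class WorkloadGroupError(Exception):
    """Raised when a set or drop operation violates one of the two checks."""


class WorkloadGroupRegistry:
    def __init__(self) -> None:
        self.groups: Set[str] = set()
        self.user_group: Dict[str, str] = {}

    def create(self, group: str) -> None:
        self.groups.add(group)

    def set_user_group(self, user: str, group: str) -> None:
        # Check 1: the group must exist before it can be assigned to a user.
        if group not in self.groups:
            raise WorkloadGroupError(f"workload group {group} not exists")
        self.user_group[user] = group

    def drop(self, group: str) -> None:
        # Check 2: a group that is still assigned to a user cannot be dropped.
        for user, assigned in self.user_group.items():
            if assigned == group:
                raise WorkloadGroupError(
                    f"workload group {group} is set for user {user}"
                )
        self.groups.discard(group)
```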
64a1eb77f0 [opt](planner) support delete with a subquery in predicate by construct an insert. (#20983)
A delete stmt with a complex predicate, such as:
```sql
delete from t1 where t1.id in (select id from t2);
```

will be rewritten as an insert stmt:
```sql
insert into t1(id, __DORIS_DELETE_SIGN__) select id, 1 from t1 where id in (select id from t2);
```
2023-06-27 17:51:13 +08:00
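Why an insert can stand in for a delete is easier to see with a toy model of a key-unique table carrying a delete sign. The Python sketch below is a simplification under assumptions: the table is a dict from `id` to delete sign, the newest row per key wins, and rows with sign 1 are hidden on read. The helper names are hypothetical and the model says nothing about how Doris stores or compacts data.

```python
from typing import Dict, Iterable, List, Set, Tuple


def apply_upserts(table: Dict[int, int], rows: Iterable[Tuple[int, int]]) -> None:
    """Upsert (id, delete_sign) rows; the newest row per key wins."""
    for row_id, delete_sign in rows:
        table[row_id] = delete_sign


def visible_ids(table: Dict[int, int]) -> List[int]:
    """Rows whose delete sign is 0 are the only ones a reader sees."""
    return sorted(row_id for row_id, sign in table.items() if sign == 0)


def delete_where_in(table: Dict[int, int], subquery_ids: Set[int]) -> None:
    """Model of: insert into t1(id, sign) select id, 1 from t1 where id in (...)."""
    matching = [(row_id, 1) for row_id in visible_ids(table) if row_id in subquery_ids]
    apply_upserts(table, matching)
```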
c52c73c1c6 [fix](nereids)return original expr if cast to decimal literal overflow (#21189) 2023-06-27 17:25:04 +08:00
84554ec0fd [fix](planner) the resultExprs should be substituted using table function node's outputSmap (#21182) 2023-06-27 17:19:49 +08:00
7b93b26b8c [feature-wip](MTMV) optimize lock of mtmv job & task, to avoid dead lock (#21054) 2023-06-27 16:23:50 +08:00
efcc65a0d3 [feature-wip](workload-group) Support for workload group Authentication (#20242) 2023-06-27 09:57:18 +08:00
c9306e9c48 [improvement](ms jdbc)Support for automatically obtaining the precision of the sqlserver datetime type (#21145) 2023-06-26 23:10:46 +08:00
095550271b [fix](nereids) set proper sort info to scan node to enable TopN-opt (#21148) 2023-06-26 19:54:37 +08:00
c19e35116b [fix](inverted index)fix transaction id not unique for one index change job when light index change (#21180) 2023-06-26 19:54:05 +08:00
50c1d55769 [Improve](dynamic schema) support filtering invalid data (#21160)
* [Improve](dynamic schema) support filtering invalid data

1. Support dynamic schema to filter illegal data.
2. Expand the regular expression for ColumnName to support more column names.
3. Be compatible with PropertyAnalyzer and support legacy tables.
4. Disable parsing of multi-dimension arrays by default, since some bugs are unresolved
2023-06-26 19:32:43 +08:00
9c5a0cc471 [bug](jdbc catalog) fix getPrimaryKeys fun bug (#21137) 2023-06-26 17:13:50 +08:00
cdc2d42c3a [refactor](Nereids): adjust order of rewrite rules. (#21133)
Put the rules that eliminate plan nodes first so they do not block other rules; this avoids invoking pushdown filter/limit again.
2023-06-26 16:47:33 +08:00
f2ed1bce1a [fix](nereids)change PushdownFilterThroughProject post processor from bottom up to top down rewrite (#21125)
1. pass physicalProperties in withChildren function
2. use top-down traversal in the PushdownFilterThroughProject post processor
2023-06-26 15:34:41 +08:00
2b3c82f57a [fix](multi-catalog)fix max compute scanner OOM and datetime (#20957)
1. Fix MC jni scanner OOM
2. add the second datetime type for MC SDK timestamp
3. make s3 uri case insensitive by the way
4. optimize max compute scanner parallel model
2023-06-26 13:53:29 +08:00
d4240ac21b [fix](multi-catalog)add oss sdk, supported oss properties (#21029) 2023-06-26 13:00:44 +08:00
f8ef4ed18f [fix](log4j) fix some issues when modify log config (#21099)
Co-authored-by: caiconghui1 <caiconghui1@jd.com>
2023-06-26 08:46:33 +08:00
Pxl
0122aa79df [Chore](vectorized) remove all isVectorized (#21076)
isVectorized is always true now
2023-06-25 23:13:34 +08:00
58b3e5ebdb [fix](nereids)scan node's smap should use materialized slots and project list as left and right expr list (#21142) 2023-06-25 22:34:43 +08:00
8f7a62c79b [improvement](multi-catalog) PaimonColumnValue support short and Decimal (#20723) 2023-06-25 22:31:38 +08:00
2c2d56e8a0 [Feature](broker-load) Add priority info for ShowLoadStmt. (#20984)
Following pr #20628 , add priority information of the load job.
2023-06-25 22:11:21 +08:00
64790a3a86 [bugfix](workloadgroup) could not upgrade from 2.0 alpha (#21149)
---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-06-25 22:02:53 +08:00
2d1163c4d8 [refactor](nereids) update Agg stats derive method #21036
This pr has no effect on tpch queries.
Some tpcds queries are impacted.
They are 4/11/23/24/47/51/57/65/74, in which 4 and 51 are improved
2023-06-25 21:47:32 +08:00
34b048a2bd [fix](nereids) update outer join estimation #21126
the row count of left outer join should be no less than left child row count.
2023-06-25 21:37:55 +08:00
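The bound stated in #21126 follows from the semantics of a left outer join: every left row appears at least once in the output, matched or not. A minimal Python sketch, assuming an inner-join row estimate is already available; the function and parameter names are hypothetical and the actual estimation formula in the patch is not shown in this log.

```python
def estimate_left_outer_join_rows(left_rows: float, inner_join_rows: float) -> float:
    """Clamp the estimate so it is never below the left child's row count."""
    # Unmatched left rows are still emitted (padded with NULLs), so the
    # output cannot be smaller than the left input.
    return max(left_rows, inner_join_rows)
```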
af2b67e65a [Fix](multi-catalog) Invalidate cache when enable auto refresh catalog. (#21070)
The default value of RefreshCatalogStmt.invalidCache is now false, but RefreshManager.RefreshTask does not invoke RefreshCatalogStmt.analyze(), so it never invalidates the cache. This PR mainly fixes that problem.
2023-06-25 19:14:44 +08:00
638aa41988 [fix](planner) fix push filter through agg #21080
In the previous implementation, the check on group-by exprs was skipped. This PR adds that check so the rule works correctly.

You can reproduce the problem by running the SQL below:

CREATE TABLE t_push_filter_through_agg (col1 varchar(11451) not null, col2 int not null, col3 int not null)
UNIQUE KEY(col1)
DISTRIBUTED BY HASH(col1)
BUCKETS 3
PROPERTIES(
    "replication_num"="1"
);

CREATE VIEW `view_i` AS 
SELECT 
    `b`.`col1` AS `col1`, 
    `b`.`col2` AS `col2`
FROM 
(
    SELECT 
        `col1` AS `col1`, 
        sum(`cost`) AS `col2`
    FROM 
    (
        SELECT 
            `col1` AS `col1`, 
            sum(CAST(`col3` AS INT)) AS `cost` 
        FROM 
            `t_push_filter_through_agg` 
        GROUP BY 
            `col1`
    ) a 
    GROUP BY 
        `col1`
) b;

SELECT SUM(`total_cost`) FROM view_a WHERE `dt` BETWEEN '2023-06-12' AND '2023-06-18' LIMIT 1;
2023-06-25 19:14:20 +08:00
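One common form of a group-by check for pushing a filter below an aggregation is that the predicate may reference group-by keys only. The Python sketch below illustrates that condition; it is an assumption about the shape of the check, not taken from the patch for #21080, and the function name is hypothetical.

```python
from typing import Iterable


def can_push_below_agg(
    predicate_columns: Iterable[str], group_by_columns: Iterable[str]
) -> bool:
    """A filter is safe to move below the aggregation only if every column
    it references is a group-by key."""
    return set(predicate_columns) <= set(group_by_columns)
```

A predicate on an aggregated value such as `sum(cost)` fails this test, because its result only exists after the aggregation has run.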
b6c9feb458 [fix](nereids) check table privilege when it's needed (#21130)
check privilege on LogicalOlapScan, LogicalEsScan, LogicalFileScan and LogicalSchemaScan
2023-06-25 18:35:39 +08:00
46f0295b78 [feature](load-refactor-with-tvf) S3 load with S3 tvf and native insert (#19937) 2023-06-25 17:45:31 +08:00
771b0cbb4c [fix](stats) Update analyze task execute time (#21026)
Before this PR, last_execute_time of pending analyze jobs would be 1970-01-01; you can reproduce this by running show analyze.
2023-06-25 15:52:33 +08:00