Commit Graph

5755 Commits

Author SHA1 Message Date
7adb2be360 [Fix](Nereids) fix insert into return npe from follower node. (#22734)
When an insert into table command runs on a follower node, it is forwarded to the master node; the parsed statement is not set on the cascades context but on executor::parsedStmt, so we use the latter to get the user info.
2023-08-16 16:37:17 +08:00
5148bc6fa7 [fix](partial update)allow delete sign column in partial update in planForPipeline (#23034) 2023-08-16 14:20:39 +08:00
4510e16845 [improvement](delete) support delete predicate on value column for merge-on-write unique table (#21933)
Previously, delete statements with conditions on value columns were only supported on duplicate tables. After we introduced the delete sign mechanism for batch delete, a delete statement with conditions on value columns on a unique table is transformed into the corresponding `insert into ..., __DELETE_SIGN__ select ...` statement. However, for unique tables with merge-on-write enabled, the overhead of inserting these data can be eliminated. So this PR adds the ability to allow delete predicates on value columns for merge-on-write unique tables.
2023-08-16 12:18:05 +08:00
3efa06e63e [Fix](View)varchar type conversion error (#22987) 2023-08-16 11:49:04 +08:00
221e7bdd17 [test](jdbc external) fix mysql and pg external regression test (#22998) 2023-08-16 10:44:47 +08:00
423002b20a [fix](nereids) partitionTopN & Window estimation (#22953)
* partitionTopN & winExpr estimation

* tpcds 44/47/57
2023-08-15 20:19:03 +08:00
80566f7fed [stats](nereids)support partition stats (#22606) 2023-08-15 17:52:25 +08:00
9b2323b7fd [Pipeline](exec) support async writer in pipeline query engine (#22901) 2023-08-15 17:32:53 +08:00
d7a5c37672 [improvement](tablet clone) update the capacity coefficient for calculating backend load score (#22857)
Update the capacity coefficient for calculating the backend load score:
1. Add FE config entry `backend_load_capacity_coeficient` to allow setting the capacity coefficient manually;
2. Adjust the calculation of the capacity coefficient as below.

We emphasize disk usage when calculating the load score:
if a BE has a high used capacity percent, we should increase its load score,
so we increase the capacity coefficient with the BE's used capacity percent.

But this is not enough. For example, if the tablets differ greatly in data size,
the following two BEs may get the same load score:
BE A:  disk usage = 60%,  replica number = 2000  (it holds the big tablets)
BE B:  disk usage = 30%,  replica number = 4000  (it holds the small tablets)

But what we want is: first move some big tablets from A to B; after their disk usages are close,
move some small tablets from B to A, so that finally both their disk usages and replica numbers
are close.

To achieve this, when the max difference between all BEs' disk usages is >= 30%, we set the capacity coefficient to 1.0 to avoid the effect of replica number. As the disk usage difference decreases, the capacity coefficient is decreased so that replica number becomes effective again.
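A minimal sketch of this scheme (illustrative only: the 30% threshold and the two example BEs come from the description above, while the coefficient shape and the score blending are assumptions, not the actual FE code):

```
// Illustrative Java sketch of a capacity coefficient and a blended load score.
public final class LoadScoreSketch {
    // Capacity coefficient: 1.0 when BEs' disk usages differ by >= 30%,
    // otherwise it grows with this BE's used capacity percent (assumed linear).
    static double capacityCoefficient(double usedPercent, double maxUsageDiff) {
        if (maxUsageDiff >= 0.30) {
            return 1.0; // let disk usage fully dominate the score
        }
        return Math.min(1.0, 0.5 + usedPercent); // assumed shape, not the FE code
    }

    // The load score blends disk usage and replica proportion by the coefficient.
    static double loadScore(double usedPercent, double replicaProportion, double maxUsageDiff) {
        double c = capacityCoefficient(usedPercent, maxUsageDiff);
        return c * usedPercent + (1 - c) * replicaProportion;
    }

    public static void main(String[] args) {
        // BE A (60% used, 2000/6000 replicas) vs BE B (30% used, 4000/6000 replicas):
        // the 30% usage gap forces coefficient 1.0, so A scores higher and its
        // big tablets are moved first.
        System.out.println(loadScore(0.60, 2000.0 / 6000, 0.30)); // 0.6
        System.out.println(loadScore(0.30, 4000.0 / 6000, 0.30)); // 0.3
    }
}
```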
2023-08-15 17:27:31 +08:00
7de362f646 [fix](Nereids): expand other join which has or condition (#22809) 2023-08-15 16:49:19 +08:00
dd09e42ca9 [enhancement](Nereids): expression unify constructor by using List (#22985) 2023-08-15 16:47:58 +08:00
140ab60a74 [Enhancement](multi-catalog) add a BE selection strategy for hdfs short-circuit-read. (#22697)
Sometimes BEs are deployed on the same nodes as HDFS DataNodes, so we can use a more reasonable BE selection policy to use HDFS short-circuit read as much as possible.
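A minimal sketch of such a policy (hypothetical `Backend` type and names, not the actual selector):

```
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public final class BeSelectSketch {
    record Backend(long id, String host) {}

    // Prefer a BE co-located with the split's DataNode so the read can take
    // the HDFS short-circuit (local) path; otherwise fall back to any BE.
    static Backend select(List<Backend> backends, String dataNodeHost) {
        for (Backend be : backends) {
            if (be.host().equals(dataNodeHost)) {
                return be; // short-circuit read possible
            }
        }
        return backends.get(ThreadLocalRandom.current().nextInt(backends.size()));
    }

    public static void main(String[] args) {
        List<Backend> bes = List.of(new Backend(10001, "node-a"), new Backend(10002, "node-b"));
        System.out.println(select(bes, "node-b").id()); // 10002
    }
}
```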
2023-08-15 15:34:39 +08:00
a2e00727d6 [feature](auth)support Col auth (#22629)
support GRANT privilege[(col1,col2...)] [, privilege] ON db.tbl TO user_identity [ROLE 'role'];
2023-08-15 15:32:51 +08:00
9b42093742 [feature](agg) Make 'map_agg' support array type as value (#22945) 2023-08-15 14:44:50 +08:00
c2ff940947 [refactor](parquet)change decimal type export as fixed-len-byte on parquet write (#22792)
Previously, the parquet writer exported decimal as byte-binary,
but those fields could not be imported into Hive.
Now, decimal is exported as fixed-len-byte-array so that it can be imported into Hive directly.
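For context, the fixed length is derived from the decimal precision. A minimal sketch of the common Parquet convention (an assumption here, not necessarily this PR's exact code):

```
public final class DecimalBytesSketch {
    // Smallest FIXED_LEN_BYTE_ARRAY size whose 8n-1 magnitude bits can hold
    // `precision` decimal digits.
    static int minBytesForPrecision(int precision) {
        int n = 1;
        while (Math.floor(Math.log10(Math.pow(2, 8 * n - 1) - 1)) < precision) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(minBytesForPrecision(18)); // 8
        System.out.println(minBytesForPrecision(38)); // 16
    }
}
```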
2023-08-15 13:17:50 +08:00
f6ca16e273 [fix](analysis) fix error msg (#22950) 2023-08-15 13:15:13 +08:00
1eab93b368 [chore](Nereids): remove useless code (#22960) 2023-08-15 13:14:20 +08:00
13d24297a7 [fix](Nereids) type check could not work when root node is table or file sink (#22902)
type check could not work because there is no expression in the plan:
sink and scan have no expressions at all, so their types cannot be checked.
This PR adds expressions on the logical sink so that type check works well.
2023-08-15 11:45:16 +08:00
bfc1efe1aa [fix](createTableStmt) fix a bug in createTableStmt toSql (#22750)
Issue Number: https://github.com/apache/doris/issues/22749
2023-08-15 10:35:22 +08:00
b49dc8042d [feature](load) refactor CSV reading process during scanning, and support enclose and escape for stream load (#22539)
## Proposed changes

Refactor thoughts: close #22383
Descriptions about `enclose` and `escape`: #22385

## Further comments

2023-08-09: 
It's a pity that experiments show the original way of parsing plain CSV is faster. Therefore, the refactor is only applied to the enclose-related code; the plain CSV parser keeps the original logic.

Some performance fallback is unavoidable anyway. From the `CSV reader`'s perspective, the real weak point may be the write-column behavior, as proved by the flame graph.
 
Trimming escape will be enabled after fix #22411 is merged.

Cases that should be discussed:

1. When an incomplete enclose appears at the beginning of large-scale data, the line delimiter will be unreachable until EOF; will the buffer become extremely large?
2. What if an infinite line occurs? Essentially, case 1 is equivalent to this.

Only stream load is supported in this PR as a trial, to avoid too many unrelated changes. Docs will be added when `enclose` and `escape` are available for all kinds of load.
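A minimal sketch (not the actual reader code) of how `enclose` and `escape` change field splitting:

```
import java.util.ArrayList;
import java.util.List;

public final class EncloseCsvSketch {
    // Splits `line` on `sep`; fields wrapped in `enclose` may contain the
    // separator, and `escape` makes the next character literal.
    static List<String> split(String line, char sep, char enclose, char escape) {
        List<String> fields = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean enclosed = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == escape && i + 1 < line.length()) {
                cur.append(line.charAt(++i)); // escaped char is taken literally
            } else if (c == enclose) {
                enclosed = !enclosed;         // toggle enclosed state
            } else if (c == sep && !enclosed) {
                fields.add(cur.toString());   // field boundary outside an enclosure
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        fields.add(cur.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Prints [a, b|c, d] for separator '|', enclose '"', escape '\'.
        System.out.println(split("a|\"b|c\"|d", '|', '"', '\\'));
    }
}
```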
2023-08-15 09:23:53 +08:00
ad8a8203a2 [fix](mysql compatibility) add an internal database mysql to improve mysql compatibility (#22868) 2023-08-14 17:03:11 +08:00
45481f5fe2 [optimize](Nereids): optimize Nereids performance (#22885) 2023-08-14 15:21:29 +08:00
8f471a3a1f [fix](Nereids) push agg to meta scan is not work well (#22811) 2023-08-14 14:35:21 +08:00
fa6110accd [fix](catalog)paimon support more data type (#22899) 2023-08-14 13:48:33 +08:00
49d503911e [MV](exec) disable create mv with select star (#22895) 2023-08-13 19:28:51 +08:00
bddab94121 [Enhancement](partial update) Support including delete sign column in partial update stream load (#22874) 2023-08-13 10:32:21 +08:00
bff3b90263 [fix](tablet clone) fix tablet sched failed when tablet missing tag and version incomplete (#22861) 2023-08-13 10:18:01 +08:00
5b09254fac [improvement](external statistics)Fix external stats collection bugs (#22788)
1. Collect external table row count when executing analyze database.
2. Support showing cached table stats (row count).
3. Support altering external table column stats.
4. Refresh/invalidate the table row count stat memory cache when an analyze task finishes and when table stats are dropped.
2023-08-11 21:58:24 +08:00
130c47e669 [Fix](Nereids)add need forward for enable_nereids_dml and format some cases (#22888) 2023-08-11 19:35:29 +08:00
045843991a [Fix](Nereids) fix insert into table of random distribution for nereids (#22831)
Currently, insert into a table with random distribution info is not supported; we fix it by setting the physical properties to Any.
2023-08-11 19:26:39 +08:00
a2fd488438 [chore](Nereids): polish StatsCalculatorTest (#22884) 2023-08-11 18:08:18 +08:00
a089fe3e43 [Improve](jni-avro)Reduce the volume of the avro-scanner-jar package (#22276)
The avro-scanner-jar package is reduced from 204M to 160M.

Hadoop-related dependencies in the original avro pom were packaged directly into the jar, resulting in a jar size of about 200M. Since BE's lib already provides a Hadoop jar environment, those dependencies can now be referenced directly instead.
2023-08-11 17:26:14 +08:00
db69457576 [fix](avro)Fix S3 TVF avro format reading failure (#22199)
This PR fixes two issues:

1. When using the S3 TVF to query files in AVRO format, due to a change of `TFileType`, the originally queried `FILE_S3` became `FILE_LOCAL`, causing the query to fail.
2. Both parameters `s3.virtual.key` and `s3.virtual.bucket` are removed; a new `S3Utils` in jni-avro parses the bucket and key of S3. The main purpose of this change is to unify the S3 parameters.
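A minimal sketch of the bucket/key split such a helper performs (illustrative names, not the actual `S3Utils` API):

```
public final class S3UriSketch {
    // "s3://my-bucket/path/to/file" -> { "my-bucket", "path/to/file" }
    static String[] parse(String uri) {
        String rest = uri.substring(uri.indexOf("://") + 3);
        int slash = rest.indexOf('/');
        return new String[] { rest.substring(0, slash), rest.substring(slash + 1) };
    }

    public static void main(String[] args) {
        String[] parts = parse("s3://my-bucket/path/to/file.avro");
        System.out.println(parts[0] + " | " + parts[1]); // my-bucket | path/to/file.avro
    }
}
```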
2023-08-11 17:22:48 +08:00
548226acfc [fix](planner)shouldn't change the child type to assignmentCompatibleType if it's INVALID_TYPE (#22841)
if the child type were changed to INVALID_TYPE, the later getBuiltinFunction call would fail
2023-08-11 17:14:49 +08:00
8c3b95c523 [Fix](multi-catalog) sync default catalog when forwarding query to master. (#22684)
Assume that there is a hive catalog named hive_ctl, a hive db named db1, and a table named tbl1. If we connect to a slave FE and execute the following commands:

1. `switch hive_ctl`
2. `show partitions from db1.tbl1`

Then we will meet the error like this:
```
MySQL [(none)]> show partitions from db1.tbl1;
ERROR 1049 (42000): errCode = 2, detailMessage = Unknown database 'default_cluster:db1'
```

The reason is that the slave FE forwards the `ShowPartitionStmt` to the master FE, but we do not sync the default catalog information, so the parser cannot find the db and throws this exception. This is just one case; some other similar cases would fail too.
2023-08-11 14:59:04 +08:00
72837a3ab4 [enhancement](Nereids): Plan equals() hashCode() don't need LogicalProperties (#22774)
- deepEquals doesn't need to compare LogicalProperties
- Plan equals() hashCode() don't need LogicalProperties
2023-08-11 14:53:47 +08:00
209f36f1bf [fix](multi-catalog)fix jdbc loader (#22814) 2023-08-11 14:36:19 +08:00
94a7b44540 [Improvement](log) add config to control compression of fe log & fe audit log (#22865)
FE logs are large for a busy Doris cluster; if you want to preserve some historical logs, they cost too much disk space.
Enabling compression is a good way to save space,
and a gzip-compressed text file can be viewed without decompression.
2023-08-11 14:08:08 +08:00
080d613238 [enhancement](Nereids): speed up rewrite() (#22846)
- use Set<Integer> instead of Set<String> to speed up `contains`
- remove `getValidRules` and use `if` in `for` to avoid `toImmutableList`
2023-08-11 13:04:30 +08:00
caf496a67e [Chore](RoutineLoad)Change max_batch_interval minimum limit from 5 to 1 (#22858) 2023-08-11 12:02:20 +08:00
b9b9071c9b [improvement](create partition) create partition require quorum replicas succ (#22554) 2023-08-11 11:59:05 +08:00
e17779f193 [Dependency](fe)Upgrade dependency version (#22496)
Upgrade guava to 32.1.2-jre
Set ck dependency scope to provided
Upgrade okio to 3.4.0
Upgrade snake yaml to 1.33
Upgrade aws-java-sdk to 1.12.519
Upgrade hadoop to 3.3.6
2023-08-11 10:54:37 +08:00
0aa00026bb [fix](autoinc) ignore column property isAutoInc() for create table as select ... statement (#22827) 2023-08-10 23:25:54 +08:00
9dc0f80386 [log](tablet clone) add decommission replica log (#22799) 2023-08-10 21:41:45 +08:00
a99211d818 [test](ctas) add some ut for testing varchar length in ctas (#22817)
1. If derived from an origin column, e.g. `create table tbl1 as select col1 from tbl2`, the length will be the same as the origin column's.
2. If derived from a function, e.g. `create table tbl1 as select func(col1) from tbl2`, the length will be 65533.
3. If derived from a constant value, e.g. `create table tbl1 as select "abc" from tbl2`, the length will be 65533.
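A minimal sketch of the rule these tests assert (hypothetical names):

```
public final class CtasVarcharLenSketch {
    // An origin column keeps its own length; functions and constants get the
    // maximum varchar length 65533.
    static int ctasVarcharLength(boolean derivedFromOriginColumn, int originLength) {
        return derivedFromOriginColumn ? originLength : 65533;
    }

    public static void main(String[] args) {
        System.out.println(ctasVarcharLength(true, 20)); // 20
        System.out.println(ctasVarcharLength(false, 0)); // 65533
    }
}
```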
2023-08-10 20:48:12 +08:00
71807ceb5f [Enhancement](tvf) Table value function support reading local file (#17404)
I tested the local TVF with TPC-H queries. First, generate the `lineitem` dataset with 6,001,215 rows, and load it into the `lineitem` table by:
```
insert into lineitem select c11, c1, c4, c2, c3, c5, c6, c7, c8, c9, c10, c12, c13, c14, c15, c16 
from local(
        "file_path" = "tools/tpch-tools/bin/tpch-data/lineitem.tbl.1", 
        "backend_id" = "10003", 
        "format" = "csv", 
        "column_separator" = "|"
);
```
Then run the `q1` and `q16` TPC-H queries; the query results are correct.

It can also be used to analyze the BE's log directly, like:

```
mysql> select * from local(
        "file_path" = "log/be.out",
        "backend_id" = "10006",
        "format" = "csv")
       where c1 like "%start_time%" limit 10;
+--------------------------------------------------------+
| c1                                                     |
+--------------------------------------------------------+
| start time: 2023年 08月 07日 星期一 23:20:32 CST       |
| start time: 2023年 08月 07日 星期一 23:32:10 CST       |
| start time: 2023年 08月 08日 星期二 00:20:50 CST       |
| start time: 2023年 08月 08日 星期二 00:29:15 CST       |
+--------------------------------------------------------+
```
2023-08-10 20:07:42 +08:00
879024a3a2 disable costmodelV2 (#22830) 2023-08-10 19:22:24 +08:00
8e5b4005dc [enhancement](data type) add use_mysql_bigint_for_largeint config to tell Doris to use bigint when returning largeint type to MySQL JDBC (#22835) 2023-08-10 18:53:31 +08:00
a1223218f3 [pipeline](exec) Support shared scan in jdbc and odbc scan node (#22826)
Support shared scan in jdbc and odbc scan node to improve exec performance
2023-08-10 18:34:45 +08:00
94c9dce308 [fix](iceberg) fix iceberg's filter expr to filter file (#22740)
Fix iceberg's filter expr used to filter files, and add a count of the number of partitions read
2023-08-10 18:20:57 +08:00