8276 Commits

Author SHA1 Message Date
b4aef889f2 [feature-array](array-function) add array constructor function array() (#14250)
* [feature-array](array-function) add array constructor function `array()`

```
mysql>  select array(qid, creationDate) from nested_c_2  limit 10;
+------------------------------+
| array(`qid`, `creationDate`) |
+------------------------------+
| [1000038, 20090616074056]    |
| [1000069, 20090616075005]    |
| [1000130, 20090616080918]    |
| [1000145, 20090616081545]    |
+------------------------------+
10 rows in set (0.01 sec)
```
2022-11-19 10:49:50 +08:00
02372ca2ea [test](jdbc external table) add new jdbc mysql external table (#14323) 2022-11-19 09:46:48 +08:00
eb76160b48 [chore](third-party) Use GNU official mirror to boost the download speed (#14358)
According to the description in https://www.gnu.org/server/mirror.html, using the address http://ftpmirror.gnu.org/ to download GNU packages is recommended. It can boost the download speed worldwide.
2022-11-19 00:04:52 +08:00
63a2344e68 [Enhancement](Nereids) Refactor AggregateFunction and support explain plan (#14380)
# Proposed changes

- Refactor AggregateFunction
    1. AggregateFunction implements ComputeSignature
    2. Add a CustomSignature to compute the signature dynamically; we can check the input types and compute the implicit cast types in the `customSignature` method
    3. Add PartialAggType to record some type information before disassembling the aggregate
    4. Refine and create a custom catalog function when translating AggregateFunction, without `finalizeForNereids`
-  Support explain plan
    1. explain parsed plan select ...
    2. explain analyzed plan select ...
    3. explain rewritten/logical plan select ...
    4. explain optimized/physical plan select ...
    5. explain all plan select ...
2022-11-18 23:40:33 +08:00
c4bade71c8 [refactor](nereids) remove ColumnStatistics.UNKNOWN from StatsDerive (#14343)
ColumnStatistics.UNKNOWN can be replaced by ColumnStatistics.DEFAULT
2022-11-18 23:40:00 +08:00
a82896f420 [fix](broker-load) fix that broker load does not set be exec version and limit node channel memory (#14399) 2022-11-18 23:38:37 +08:00
21416f9947 [enhancement](memory) Support Jemalloc metrics and default allocator changed to Jemalloc (#14384) 2022-11-18 21:02:54 +08:00
68da6bccb7 [fix](type) fix DECIMAL scale when cast function on fe (#12877)
before:
MySQL [test]> select cast('135.759999999' as DECIMAL(10,3));
+----------------------------------------+
| CAST('135.759999999' AS DECIMAL(10,3)) |
+----------------------------------------+
| 135.759999999                          |
+----------------------------------------+
1 row in set (0.00 sec)

now:
MySQL [stage]> select cast('135.759999999' as DECIMAL(10,3));
+----------------------------------------+
| CAST('135.759999999' AS DECIMAL(10,3)) |
+----------------------------------------+
| 135.759                                |
+----------------------------------------+
1 row in set (0.01 sec)
2022-11-18 19:36:14 +08:00
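The before/now output in 68da6bccb7 above is consistent with the fractional part being cut to the declared scale. The following is a minimal Python sketch of that behavior, not the FE implementation: `cast_to_decimal` is a hypothetical helper, truncation (rather than rounding) is an assumption inferred only from the single example shown, and the precision argument of `DECIMAL(10,3)` is not enforced here.

```python
from decimal import Decimal, ROUND_DOWN


def cast_to_decimal(text: str, scale: int) -> Decimal:
    """Cut a numeric string to `scale` fractional digits.

    Hypothetical helper: truncation is assumed from the example in the
    commit message; the real cast may handle other inputs differently.
    """
    # scale=3 gives a quantum with exponent -3, i.e. three fractional digits.
    quantum = Decimal(1).scaleb(-scale)
    return Decimal(text).quantize(quantum, rounding=ROUND_DOWN)


print(cast_to_decimal("135.759999999", 3))
```

With rounding instead of truncation the same input would give 135.760, which is why the sketch uses `ROUND_DOWN` to match the output in the commit message.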
eab0af7afe [optimization](array-type) optimize the export precision of floating point numbers (#14261)
Co-authored-by: hucheng01 <hucheng01@baidu.com>
2022-11-18 18:24:11 +08:00
bd5882d08a [fix](datax)doris writer write error (#14276)
* doris writer write error
2022-11-18 18:20:13 +08:00
Pxl
734525de86 [Bug](runtime filter) fix minmax filter not copied correctly on shared hash join (#14367)
fix minmax filter not copied correctly on shared hash join
2022-11-18 17:52:45 +08:00
2c4236fd24 [improvement](ctas) use string type for varchar/char/string (#14382)
When executing a create table as select (CTAS) statement,
the varchar/char/string columns in the created table are unified to the string type.

When selecting from an external table (MySQL, PostgreSQL, etc.), the length of varchar in the external database
is measured in characters, not bytes.
So if there is a varchar(10) column in the external table, the created table gets the same varchar(10) column.
But the byte length of the data in the external table may be larger than 10, causing the CTAS to fail.

Changing to string does not impact performance or disk storage capacity.
Note that if a string column is the first column, it is changed to varchar(65535),
because a string column is not allowed as a sort key column.
2022-11-18 14:20:13 +08:00
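The char-versus-byte mismatch described in 2c4236fd24 above can be illustrated with a short Python sketch. `fits_varchar` is a hypothetical helper, and UTF-8 encoding is an assumption; the point is only that a value can fit a limit of 10 when counted in characters and exceed it when counted in bytes.

```python
def fits_varchar(value: str, limit: int, unit: str = "byte") -> bool:
    """Check a value against a varchar(limit) measured in chars or bytes.

    Hypothetical helper; UTF-8 is assumed for the byte length.
    """
    size = len(value) if unit == "char" else len(value.encode("utf-8"))
    return size <= limit


text = "数据库管理系统的设计"  # 10 characters, 3 bytes each in UTF-8

print(fits_varchar(text, 10, unit="char"))  # limit counted in characters
print(fits_varchar(text, 10, unit="byte"))  # limit counted in bytes
```

The first check passes and the second fails, which is the situation where a copied varchar(10) column would reject data that was valid in the external table.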
a1d02f36ac [feature](table-valued-function) support hdfs() tvf (#14213)
This pr does two things:
1. support `hdfs()` table valued function.
2. add regression test
2022-11-18 14:17:02 +08:00
1f326fc0d6 [enhancement](be)limit mem cost to 16m when pre serialize keys in agg node (#14321)
* [enhancement](be)limit mem cost to 16m when pre serialize keys in agg node

* use only one chunk memory when serializing keys in agg node
2022-11-18 12:31:52 +08:00
7952bce03f [compatibility](Nereids) process escape in string literal (#14294) 2022-11-18 11:24:00 +08:00
9e25aa8d3e [feature](Nereids): Add subgraph enumerator #14291
Add subgraph enumerator to find the best plan

For DPHyp, we need an enumerator for all csg-cmp pairs to find the best plan
2022-11-18 10:33:30 +08:00
2b6f85ab96 [chore](macOS) Fix BE UT (#14307)
#13195 left some unresolved issues. One of them is that some BE unit tests fail.
This PR fixes this issue. Now, we can run the command ./run-be-ut.sh --run successfully on macOS.
2022-11-18 10:13:38 +08:00
da0b09caea [fix](Nereids) fix wrong migration of DateTimeType to DateType when hour, minute and second are all zero (#14327)
1. fix wrong migration of DateTimeType to DateType when hour, minute and second are all zero
2. add TPC-H regression test with DATEV2 type
2022-11-18 01:38:03 +08:00
bd5a593403 [enhancement](memtracker) Use proc/meminfo MemAvailable to control memory and optimize MemTracker log printing (#14335) 2022-11-17 22:46:07 +08:00
fb140d0180 [Enhancement](sequence-column) optimize the use of sequence column (#13872)
When you create the Uniq table, you can specify the mapping of sequence column to other columns.
You no longer need to specify mapping column when importing.
2022-11-17 22:39:09 +08:00
1a035e2073 [fix](profile)(AggNode) fix GetResultsTime always being zero (#14366)
add scoped_timer in _serialize_with_serialized_key_result
2022-11-17 22:30:21 +08:00
50bfd99b59 [feature](join) support nested loop semi/anti join (#14227) 2022-11-17 22:20:08 +08:00
d5af4f6558 [Nereids](Profile) Add projection timer for Nereids (#14286) 2022-11-17 22:17:55 +08:00
8fe5211df4 [improvement](multi-catalog)(cache) invalidate catalog cache when refresh (#14342)
Invalidate catalog/db/table cache when doing
refresh catalog/db/table.

Tested table with 10000 partitions. The refresh operation will cost about 10-20 ms.
2022-11-17 20:47:46 +08:00
ccf4db394c [feature-wip](multi-catalog) Collect external table statistics (#14160)
Collect HMS external table statistic information through external metadata.
Insert the result into __internal_schema.column_statistics using insert into SQL.
2022-11-17 20:41:09 +08:00
44ee4386f7 [test](multi-catalog)Regression test for external hive orc table (#13762)
Add regression test for external hive orc table. This PR generates data for all basic types supported by hive orc, and creates a hive external table to exercise them in a docker environment.
Functions to be tested:
1. Ensure that all types are parsed correctly
2. Ensure that the null maps of all types are parsed correctly
3. Ensure that the `SearchArgument` of `OrcReader` works well
4. Only select partition columns
2022-11-17 20:36:02 +08:00
98956dfa19 [fix](statistics) statistics inaccurate after analyze same table more than once (#14279)
If a table has already been analyzed and we analyze it again, the new statistics are larger than expected: the incremental values include the values from the table-level statistics, because the SQL lacks a predicate on the nullability of part_id.
2022-11-17 20:18:14 +08:00
a382bb95e7 [fix](runtimefilter) fix heap-use-after-free of runtime filter merge (#14362) 2022-11-17 19:38:45 +08:00
dba19e591c [cherry-pick](scanner) using avg rowset to calculate batch size instead of using total_bytes since it costs a lot of cpu (#14345)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-11-17 18:57:21 +08:00
6da2948283 [feature-wip](multi-catalog) support iceberg v2(step 1) (#13867)
Support position delete(part of).
2022-11-17 17:56:48 +08:00
a9e53e5c86 [improvement](test) add conf for pipeline (#14254)
Add the conf used by the pipeline to git, so that pipeline conf changes go
through a PR like code commits.
2022-11-17 16:05:24 +08:00
af462b07c7 [enhancement](explain) compress descriptor table explain string (#14152)
1. compress slot descriptor explain string to one row
2. remove unmaterialized tuple descriptor and slot descriptor

before this PR descriptor table explain string is like this:
```
TupleDescriptor{id=0, tbl=lineitem, byteSize=176, materialized=true}
  SlotDescriptor{id=0, col=l_shipdate, type=DATEV2}
    parent=0
    materialized=true
    byteSize=4
    byteOffset=0
    nullIndicatorByte=0
    nullIndicatorBit=-1
    nullable=false
    slotIdx=0

  SlotDescriptor{id=1, col=l_orderkey, type=BIGINT}
    parent=0
    materialized=true
    byteSize=8
    byteOffset=24
    nullIndicatorByte=0
    nullIndicatorBit=-1
    nullable=false
    slotIdx=6
```

after this PR descriptor table explain string is like this:
```
TupleDescriptor{id=2, tbl=lineitem}
  SlotDescriptor{id=1, col=l_extendedprice, type=DECIMAL(15,2), nullable=false}
  SlotDescriptor{id=2, col=l_discount, type=DECIMAL(15,2), nullable=false}
```
2022-11-17 15:19:17 +08:00
a4d4fc8c02 datax doris writer doc fix (#14344) 2022-11-17 13:08:32 +08:00
afc9065b51 [test](nereids) add filter estimation ut cases (#14293)
Fix a bug in filter estimation for patterns like A>10 and A<20.
2022-11-17 11:01:30 +08:00
0bf6d1fd79 [typo](doc)Datax doris writer doc update (#14328) 2022-11-17 08:53:55 +08:00
7182f14645 [improvement][fix](multi-catalog) speed up list partition prune (#14268)
In the previous implementation, list partition pruning had to generate `rangeToId`
every time a prune was performed.
But `rangeToId` is actually static data that should be created once and used everywhere.

So for hive partitions, I create `rangeToId` and all other data structures needed for partition pruning
in the partition cache, so that they can be used directly.

In my test, the cost of partition pruning for 10000 partitions dropped from 8s to 0.2s.

Also add "partition" info to the explain string for hive tables.
```
|   0:VEXTERNAL_FILE_SCAN_NODE                           |
|      predicates: `nation` = '0024c95b'                 |
|      inputSplitNum=1, totalFileSize=4750, scanRanges=1 |
|      partition=1/10000                                 |
|      numNodes=1                                        |
|      limit: 10                                         |
```

Bug fix:
1. Fix bug that es scan node can not filter data
2. Fix bug that query es with predicate like `where substring(test2,2) = "ext2";` will fail at planner phase.
`Unexpected exception: org.apache.doris.analysis.FunctionCallExpr cannot be cast to org.apache.doris.analysis.SlotRef`

TODO:
1. Some problems when querying es version 8: ` Unexpected exception: Index: 0, Size: 0`, will be fixed later.
2022-11-17 08:30:03 +08:00
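The create-once idea behind the partition prune speedup in 7182f14645 above can be sketched in Python. This is an illustration only, not the FE code: `PartitionCacheEntry`, `value_to_id`, and `build_count` are hypothetical names, and a plain value-to-id dictionary stands in for the real `rangeToId` structure.

```python
from functools import cached_property


class PartitionCacheEntry:
    """Hypothetical cache entry holding prune structures for one table."""

    def __init__(self, partition_values):
        # partition_values: dict mapping partition id -> partition value
        self._partition_values = dict(partition_values)
        self.build_count = 0  # counts how often the lookup map is built

    @cached_property
    def value_to_id(self):
        # Built on first access, then reused by every later prune call.
        self.build_count += 1
        return {value: pid for pid, value in self._partition_values.items()}

    def prune(self, wanted_value):
        """Return the ids of partitions matching an equality predicate."""
        pid = self.value_to_id.get(wanted_value)
        return [] if pid is None else [pid]


entry = PartitionCacheEntry({i: f"{i:08x}" for i in range(10000)})
for _ in range(3):
    matched = entry.prune("00000018")
print(matched, entry.build_count)
```

Repeated prune calls reuse the same map, so the build cost is paid once per cache entry instead of once per query.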
3259fcb790 [typo](docs) fix docs kafka-load.md (#14313) 2022-11-16 23:17:30 +08:00
wxy
943e014414 [enhancement](decommission) speed up decommission process (#14028) (#14006) 2022-11-16 20:43:07 +08:00
47a6373e0a [feature](Nereids) support datev2 and datetimev2 type (#14263)
1. split DateLiteral and DateTimeLiteral into V1 and V2
2. add a type coercion about DateLikeType: DateTimeV2Type > DateTimeType > DateV2Type > DateType
3. add a rule to remove unnecessary CAST on DateLikeType in ComparisonPredicate
2022-11-16 15:51:48 +08:00
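The precedence listed in item 2 of 47a6373e0a above (DateTimeV2Type > DateTimeType > DateV2Type > DateType) can be read as "pick the higher-ranked of the two operand types". A minimal Python sketch under that reading follows; `common_date_like_type` is a hypothetical function and type names are plain strings here, so this is not the Nereids API.

```python
# Lowest to highest precedence, taken from the ordering in the commit message.
DATE_LIKE_PRECEDENCE = ["DateType", "DateV2Type", "DateTimeType", "DateTimeV2Type"]


def common_date_like_type(left: str, right: str) -> str:
    """Return the operand type with the higher precedence.

    Hypothetical helper; raises ValueError for a name not in the list.
    """
    return max(left, right, key=DATE_LIKE_PRECEDENCE.index)


print(common_date_like_type("DateV2Type", "DateTimeType"))
```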
6881989dd9 [Bug](jvm memory) Support multiple java version to get max heap size (#14295)
`sun.misc.VM.maxDirectMemory` is used in JDK1.8 only. This PR adds the interface for JDK11.
2022-11-16 10:58:58 +08:00
20634ab7e3 [feature-wip](multi-catalog) support partition&missing columns in parquet lazy read (#14264)
PR https://github.com/apache/doris/pull/13917 added lazy read for non-predicate columns in ParquetReader,
but lazy read cannot be triggered when the predicate columns are partition or missing columns.
This PR supports such cases, and fills partition and missing columns in `FileReader`.
2022-11-16 08:43:11 +08:00
442b844b22 [regressiontest](delete)delete-where-in-test (#14036)
* delete-where-in-test

* Update test_delete_where_in.groovy

* Update test_delete_where_in.groovy
2022-11-15 18:35:31 +08:00
3ea9d3f2e1 [enhancement](array) support read list(Array) type from orc file (#14132)
Before this PR, loading an ORC file with native list (or array) type data crashed the BE.
Because a complex type in an ORC file spans multiple real columns, we need to filter columns by column name;
otherwise we cannot read all the columns we need.
Arrow release-7.0.0 only supports creating a stripe reader by column index, so we patch it to support creating a stripe reader by column names.
Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>
2022-11-15 17:48:17 +08:00
9d70c531a3 [improvement](publish) fix publish timeout in concurrent load (#14231)
In concurrent load, publish timeouts happen occasionally. They are
caused by the meta lock being held by another thread, so adding the
incremental rowset in publish hangs for several seconds.
StorageEngine::start_delete_unused_rowset holds gc_mutex for a long
time, so add_unused_rowset waits for the lock. Compaction modify_rowset
and other tablet methods hold meta_lock while calling add_unused_rowset, which
keeps meta_lock occupied for too long and finally makes publish time out.

In this PR, I copy unused_rowsets under the lock and delete these rowsets without the lock,
which makes gc_mutex more lightweight so the meta lock can be acquired immediately in the publish thread.
In my test, no publish timeout occurred in concurrent stream load.
2022-11-15 16:39:38 +08:00
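The locking change described in 9d70c531a3 above (copy the unused rowsets while holding the lock, then delete them after releasing it) can be sketched in Python. `RowsetGc` and `delete_fn` are hypothetical names for illustration; the real change is in the C++ StorageEngine, and this sketch only shows the pattern.

```python
import threading


class RowsetGc:
    """Hypothetical stand-in for the unused-rowset bookkeeping."""

    def __init__(self):
        self._gc_mutex = threading.Lock()
        self._unused = {}  # rowset id -> path

    def add_unused_rowset(self, rowset_id, path):
        # Short critical section: only a dictionary update.
        with self._gc_mutex:
            self._unused[rowset_id] = path

    def start_delete_unused_rowset(self, delete_fn):
        # Take the pending set under the lock, leaving an empty one behind.
        with self._gc_mutex:
            to_delete = self._unused
            self._unused = {}
        # The slow deletion runs without the lock, so callers of
        # add_unused_rowset are not blocked while it is in progress.
        for path in to_delete.values():
            delete_fn(path)
        return len(to_delete)


gc = RowsetGc()
gc.add_unused_rowset("r1", "/data/r1")
gc.add_unused_rowset("r2", "/data/r2")
deleted = []
print(gc.start_delete_unused_rowset(deleted.append), sorted(deleted))
```

Because the mutex is released before any deletion starts, a thread that holds another lock and calls `add_unused_rowset` waits only for the brief swap, not for the whole cleanup.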
70cc725649 [Vectorized](function) support avg_weighted/percentile_array/topn_wei… (#14209)
* [Vectorized](function) support avg_weighted/percentile_array/topn_weighted functions

* update add to stringRef
2022-11-15 16:38:38 +08:00
5badd70db2 [fix](csv-reader) Fix core dump when load text into doris with special delimiter (#14196) 2022-11-15 16:06:59 +08:00
6d2e6d85d3 [enhancement](be)release memory in Node's close() method (#14258)
* [enhancement](be)release memory in Node's close() method

* format code
2022-11-15 15:59:23 +08:00
333c6390ee [fix](be-ut) AddressSanitizer detects container-overflow issues (#14255)
* [chore] Fix the container-overflow errors detected by address sanitizer

* Fix compilation errors
2022-11-15 15:49:55 +08:00
a45685d028 [fix](regression) concurrent regression cases may fail #14271
Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>
2022-11-15 15:46:34 +08:00
Pxl
e298696baf [Chore](env) add error information when DORIS_GCC_HOME not set well (#14249) 2022-11-15 15:45:35 +08:00