287 Commits

Author SHA1 Message Date
a4e415ab09 [feature](hive)Support hive tables after alter type. (#25138)
1. Rework the parquet decoding logic: the parquet reader first reads the data according to the parquet physical type, and then performs a type conversion.

2. Support hive alter table.
2023-11-02 00:24:21 +08:00
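The two-step read described in a4e415ab09 can be pictured with a minimal Python sketch. It assumes a column stored as parquet `INT32` in PLAIN encoding (4-byte little-endian values) whose hive type was later altered. `decode_plain_int32`, `CONVERTERS`, and `read_column` are hypothetical names used only for illustration; they are not Doris APIs.

```python
import struct

def decode_plain_int32(buf: bytes) -> list:
    """Step 1: decode by physical type (PLAIN INT32 = 4-byte little-endian)."""
    count = len(buf) // 4
    return list(struct.unpack(f"<{count}i", buf[: count * 4]))

# Step 2: convert the physical values to the type the table declares now.
# The set of supported target types here is an assumption for the sketch.
CONVERTERS = {
    "int": lambda v: v,
    "bigint": lambda v: v,
    "double": float,
    "string": str,
}

def read_column(buf: bytes, table_type: str) -> list:
    physical = decode_plain_int32(buf)
    convert = CONVERTERS[table_type]
    return [convert(v) for v in physical]

if __name__ == "__main__":
    data = struct.pack("<3i", 1, -2, 300)
    print(read_column(data, "string"))
```

Keeping decoding and conversion separate means a file written before the `alter` can still be read with its original physical layout.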
3e10e5af39 [Fix](Serde) Fix content displayed by complex types in MySQL Client (#25946)
This PR makes three changes to the display of complex types:
1. A NULL value inside a complex type is displayed as `null`, not `NULL`.
2. A struct field is displayed as `"column_name": column_value`.
3. Time types such as `datetime` and `date` are displayed with double quotes inside complex types, for example
    `{1, "2023-10-26 12:12:12"}`

This PR also refactors the code:
1. `nesting_level` is now a member variable of `DataTypeSerDe` rather than a method parameter.

In addition, this PR fixes an incorrect `fileSize`, a bug introduced by #25854.
2023-11-01 23:48:55 +08:00
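The display rules listed in 3e10e5af39 can be sketched in Python as follows. This is an illustration only: `render` is a hypothetical function, a Python `dict` stands in for a struct and a `list` for an array, and the top-level behaviour (unquoted time values, `NULL` for a null) is an assumption inferred from the wording of the commit.

```python
from datetime import date, datetime

def render(value, nesting_level: int = 0) -> str:
    """Render a value as text; nesting_level > 0 means 'inside a complex type'."""
    nested = nesting_level > 0
    if value is None:
        return "null" if nested else "NULL"
    # datetime is a subclass of date, so it must be checked first.
    if isinstance(value, datetime):
        text = value.strftime("%Y-%m-%d %H:%M:%S")
        return f'"{text}"' if nested else text
    if isinstance(value, date):
        text = value.isoformat()
        return f'"{text}"' if nested else text
    if isinstance(value, dict):  # struct: "column_name": column_value
        fields = ", ".join(
            f'"{name}": {render(v, nesting_level + 1)}' for name, v in value.items()
        )
        return "{" + fields + "}"
    if isinstance(value, list):  # array
        return "[" + ", ".join(render(v, nesting_level + 1) for v in value) + "]"
    return str(value)

if __name__ == "__main__":
    print(render({"id": 1, "ts": datetime(2023, 10, 26, 12, 12, 12)}))
    print(render([1, None]))
```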
aafd53766b [chore](file-reader) rm unused interface from generic reader (#26205) 2023-11-01 18:43:14 +08:00
Pxl
696ecc8c83 [Chore](log) adjust error code on too many filtered rows (#26168) 2023-11-01 00:15:56 +08:00
b98744ae90 [Bug](iceberg)fix read partitioned iceberg without partition path (#25503)
Iceberg does not require partition values to exist on file paths, so we should get the partition value from `PartitionScanTask.partition`.
2023-10-31 18:09:53 +08:00
6dd60c6ebb [Enhance](BE) Add -Wshadow-field compile option to avoid unexpected shadowing behavior (#25698)
* Fix `Tablet::_meta_lock` shadows member inherited from `BaseTablet`

* Add -Wshadow-field compile option to avoid unexpected shadowing behavior
2023-10-26 10:00:28 +08:00
693982fd1a [feature](decimal) support decimal256 (#25386) 2023-10-25 15:47:51 +08:00
88dd480c2e [enhancement](CSV-reader) enhance err log for csv reading containing enclose or escape (#25816) 2023-10-24 22:10:08 +08:00
d62e914205 [opt](profile) set datalake profile level as 1 (#25686)
Following #25491, only profile entries marked as level 1 are shown in the simplified profile.
2023-10-24 09:55:25 +08:00
0e0f8090f7 [refactor](text_convert)Use serde to replace text_convert. (#25543)
Remove text_convert and use serde to replace it.
2023-10-24 09:52:43 +08:00
08832d9f3a [Fix](exec) Fix date dict dead loop. (#25570) 2023-10-24 02:51:43 +08:00
9006e2b8a5 [fix](prefetch-read) make prefetch range correct to accelerate S3 load and fix its speed unbalance (#25775) 2023-10-23 20:02:24 +08:00
Pxl
642c149e6a remove datetime_value and move vecdatetime_value to doris namespace (#25695)
remove datetime_value and move vecdatetime_value to doris namespace
2023-10-20 22:08:17 +08:00
e4a83a22d1 [opt](error msg) Make data codec error clearly when load csv data can't display (#25540)
Co-authored-by: Tanya-W <tanya1218w@163,com>
2023-10-18 16:12:22 +08:00
47689fd452 [refactor](jni) unified jni framework for java udf (#25302)
Use the unified jni framework to refactor java udf.
The unified jni framework uses VectorTable as the container to transfer data between C++ and Java, and hides the details of data format conversion.
In addition, the unified framework supports complex and nested types.
Performance for basic types is unchanged, string types improve by 30%, and complex types improve by an order of magnitude.
2023-10-18 09:27:54 +08:00
18c2a13e09 [fix](multi-catalog)fix maxcompute partition filter and session creation (#24911)
add maxcompute partition support
fix maxcompute partition filter
modify maxcompute session create method
2023-10-17 22:36:10 +08:00
2664d1cffb [chore](vec) Make this copy constructor of StringRef explicit (#25337) 2023-10-12 14:12:46 +08:00
58d96ecdbf [Improve](status) avoid print too may stack log for DATA_QUALITY_ERROR code (#25292) 2023-10-12 09:58:51 +08:00
46ab4346ca [Opt](parquet reader) Optimize the performance of reading decimal in parquet reader. (#25012)
Optimize the performance of reading decimal in parquet reader.

- Static dispatch `DecimalScaleParams`.
- Optimize `memcpy`, static dispatch copy size in fixed length cases.
- Use right shift bit operator to convert decimals.
2023-10-12 09:53:08 +08:00
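As background for the bullets in 46ab4346ca: parquet stores a `FIXED_LEN_BYTE_ARRAY` decimal as a big-endian two's-complement unscaled integer. One plausible reading of "use right shift bit operator to convert decimals" is the sign-extension trick below, where the bytes are placed in the high end of a wide integer and shifted back down. This is a sketch under that assumption, not the code from the PR; `decode_fixed_len_decimal` is a hypothetical name.

```python
def decode_fixed_len_decimal(raw: bytes, width: int = 16) -> int:
    """Decode a big-endian two's-complement unscaled decimal value."""
    if not 0 < len(raw) <= width:
        raise ValueError("unexpected decimal byte length")
    # Put the value into the high-order bytes of a `width`-byte integer.
    padded = raw + b"\x00" * (width - len(raw))
    wide = int.from_bytes(padded, byteorder="big", signed=True)
    # An arithmetic right shift moves it back down and sign-extends it.
    return wide >> (8 * (width - len(raw)))

if __name__ == "__main__":
    print(decode_fixed_len_decimal(b"\x00\x7b"))
    print(decode_fixed_len_decimal(b"\xff\x85"))
```

The appeal of the shift form in a compiled language is that it avoids a per-value branch on the sign bit.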
bb670118f5 [coverage](test) Delete unused function to improve test coverage (#25233) 2023-10-11 11:50:51 +08:00
2f706cc84b [compile](simdjson reader) use __AVX2__ macro to decide whether use simdjson to parse (#25165) 2023-10-11 10:50:13 +08:00
6fe060b79e [fix](streamload) fix http_stream retry mechanism (#24978)
If a failure occurs, Doris may retry. Because `ctx->is_read_schema` is a global variable that is not reset in time, the retry may cause exceptions.


---------

Co-authored-by: yiguolei <676222867@qq.com>
2023-10-08 11:16:21 +08:00
4e8cde127c [Enhance](catalog)add table cache in paimon jni (#25014)
- fix get old schema after refresh paimon table
- add table cache in paimon jni
2023-10-08 10:36:18 +08:00
642e5cdb69 [Fix](Status) Make Status [[nodiscard]] and handle returned Status correctly (#23395) 2023-09-29 22:38:52 +08:00
082bcd820b [feature](insert) Support wal for group commit insert (#23053) 2023-09-26 14:46:24 +08:00
3c99743bf2 [enhancement](csv_reader)Optimize the reading efficiency of nullable (string) columns. (#24698)
Optimize the performance of stream load for tsv by reducing virtual function calls when reading nullable (string) columns.
before: 600+ s
after: 560+ s
2023-09-22 13:44:37 +08:00
c704497d02 [fix](csv_reader)Fixed bug when parsing multi-character delimiters. (#24572)
Fixed bug when parsing multi-character delimiters.
2023-09-20 12:41:35 +08:00
4dad7c94da [fix](orc) fix the count(*) pushdown issue in orc format (#24446)
Previously, when querying a hive table in orc format whose files are split,
the result of `select count(*)` could be a multiple of the real row count.

This is because the row count must be taken after orc stripe pruning;
otherwise, it may return a wrong result.
2023-09-16 09:57:39 +08:00
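The over-counting fixed in 4dad7c94da can be reproduced with a small Python model. The `Stripe` class and the rule "a stripe belongs to the split that contains its start offset" are assumptions made for this sketch; the commit only states that the row count has to be taken after stripe pruning.

```python
from dataclasses import dataclass

@dataclass
class Stripe:
    offset: int  # byte offset of the stripe within the file
    rows: int    # number of rows stored in the stripe

def count_without_pruning(stripes, start, length):
    """Buggy variant: every split reports the row count of the whole file."""
    return sum(s.rows for s in stripes)

def count_after_pruning(stripes, start, length):
    """Fixed variant: only stripes that fall inside this split are counted."""
    end = start + length
    return sum(s.rows for s in stripes if start <= s.offset < end)

if __name__ == "__main__":
    stripes = [Stripe(0, 100), Stripe(1000, 200), Stripe(2000, 50)]
    splits = [(0, 1500), (1500, 1500)]
    print(sum(count_without_pruning(stripes, s, n) for s, n in splits))
    print(sum(count_after_pruning(stripes, s, n) for s, n in splits))
```

With two splits, the unpruned variant reports twice the real row count, which matches the "multiple of the real row count" symptom.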
b9ddcbf729 [feature](merge-cloud) Rewrite code related to IOContext (#24269) 2023-09-15 19:57:58 +08:00
9c681692bd Revert "[fix] fix http_stream retry mechanism (#23969)" (#24407)
This reverts commit 05e365ea137eb8c92b8e7eedc7d1435e83f065ae.
2023-09-15 10:07:53 +08:00
05e365ea13 [fix] fix http_stream retry mechanism (#23969)
Co-authored-by: yiguolei <676222867@qq.com>
2023-09-14 21:41:11 +08:00
d8ef9dda59 [feature](merge-cloud) Rewrite FS interface (#23953) 2023-09-12 19:20:25 +08:00
6e28d878b5 [fix](hudi) compatible with hudi spark configuration and support skip merge (#24067)
Fix three bugs:
1. A Hudi file slice may contain only log files, so `new Path(filePath)` throws an error.
2. Hive column names are lowercase only, so match column names case-insensitively.
3. Be compatible with [Spark Datasource Configs](https://hudi.apache.org/docs/configurations/#Read-Options), so users can add `hoodie.datasource.merge.type=skip_merge` in catalog properties to skip merging log files.
2023-09-11 19:54:59 +08:00
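The second fix in 6e28d878b5, case-insensitive column matching, amounts to a lookup keyed by lowercased names. A minimal Python sketch follows; `match_columns` and the sample column names are hypothetical.

```python
def match_columns(hive_columns, file_columns):
    """Map each (lowercase) hive column to the file column with the same
    name ignoring case, or None when the file has no such column."""
    by_lower = {name.lower(): name for name in file_columns}
    return {col: by_lower.get(col.lower()) for col in hive_columns}

if __name__ == "__main__":
    print(match_columns(["id", "username"], ["ID", "userName", "extra"]))
```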
9b3be0ba7a [Fix](multi-catalog) Do not throw exceptions when file not exists for external hive tables. (#23799)
A bug similar to #22140.

When executing a query with an hms catalog, the query may fail because some hdfs files no longer exist. We should identify this kind of error and skip the file.

```
errCode = 2, detailMessage = (xxx.xxx.xxx.xxx)[CANCELLED][INTERNAL_ERROR]failed to init reader for file hdfs://xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc, err: [INTERNAL_ERROR]Init OrcReader failed. reason = Failed to read hdfs://xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc: [INTERNAL_ERROR]Read hdfs file failed. (BE: xxx.xxx.xxx.xxx) namenode:hdfs://xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc, err: (2), No such file or directory), reason: RemoteException: File does not exist: /xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:158)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1927)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:738)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:426)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
```
2023-09-10 21:55:09 +08:00
f9a75b5c4f [feature](csv_serde)1.append csv serde for serialize to csv and deserialize from csv. 2.let csvReader use csv serde not text_converter. (#23352)
1. append csv serde for serialize to csv and deserialize from csv.
2. let csvReader use csv serde not text_converter.
2023-09-10 00:16:21 +08:00
0f0ffa3482 [Fix](Parquet Reader) fix parquet read issue (#24092) 2023-09-09 00:35:18 +08:00
6b56896a01 [chore](json reader) add original data to error messge for tracing (#22803) 2023-09-02 20:15:18 +08:00
657e927d50 [fix](json)Fix the bug that read json file Out of bounds access (#23411) 2023-09-02 01:11:37 +08:00
eaf2a6a80e [fix](date) return right date value even if out of the range of date dictionary(#23664)
PR(https://github.com/apache/doris/pull/22360) and PR(https://github.com/apache/doris/pull/22384) optimized the performance of the date type. However, hive supports dates outside 1970~2038, which led to wrong date values in the tpcds benchmark.
How to fix:
1. Increase the dictionary range to 1900 ~ 2038.
2. Dates outside 1900 ~ 2038 are regenerated.
2023-09-01 14:40:20 +08:00
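The idea behind eaf2a6a80e, a precomputed date dictionary with a fallback for values outside its range, can be sketched in Python as follows. The exact dictionary bounds (1900-01-01 through 2038-12-31), the days-since-epoch input, and the name `days_to_date` are assumptions for this illustration.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)
# Dictionary bounds, expressed as days since the epoch (assumed inclusive).
DICT_START = (date(1900, 1, 1) - EPOCH).days
DICT_END = (date(2038, 12, 31) - EPOCH).days

# Built once: one entry per day in the covered range.
DATE_DICT = [EPOCH + timedelta(days=d) for d in range(DICT_START, DICT_END + 1)]

def days_to_date(days: int) -> date:
    """Fast path: dictionary lookup. Slow path: compute the date directly."""
    if DICT_START <= days <= DICT_END:
        return DATE_DICT[days - DICT_START]
    return EPOCH + timedelta(days=days)

if __name__ == "__main__":
    print(days_to_date(0))
    print(days_to_date((date(1850, 6, 1) - EPOCH).days))
```

The fallback keeps results correct for any representable date while the common range still benefits from the lookup table.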
3a2c0d16f7 [fix](parquet) fix potential heap-use-after-free issue and cache issue (#23638)
1. When file meta cache is disabled (by setting `max_external_file_meta_cache_num=0` in be.conf),
the parquet's meta info is owned by parquet reader and will be released when calling `reader->close()`.

But the underlying file reader of this parquet reader is released after `reader->close()`,
which may cause a `heap-use-after-free` bug because parts of the meta info may still be referenced by the file reader.

This PR fixes it by making sure that the meta info is released after the file reader is released.

2. Add modification time for file meta cache in BE, to avoid parquet read error like:
`Failed to deserialize parquet page header`
2023-08-31 18:23:05 +08:00
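The second change in 3a2c0d16f7 adds the modification time to the file meta cache. A minimal Python sketch of that idea is a cache keyed by path and modification time, so a rewritten file never matches a stale entry. `FileMetaCache` and its `loader` callback are hypothetical, and eviction of old entries is left out.

```python
class FileMetaCache:
    """Cache of parsed file metadata keyed by (path, modification time)."""

    def __init__(self, loader):
        self._loader = loader  # callable that parses metadata for a path
        self._entries = {}

    def get(self, path: str, mtime: int):
        key = (path, mtime)
        if key not in self._entries:
            # Either the file is new or it changed since it was cached.
            self._entries[key] = self._loader(path)
        return self._entries[key]

if __name__ == "__main__":
    loads = []
    cache = FileMetaCache(lambda p: loads.append(p) or f"meta#{len(loads)}")
    cache.get("/warehouse/t/part-0.parquet", 100)
    cache.get("/warehouse/t/part-0.parquet", 100)  # served from the cache
    cache.get("/warehouse/t/part-0.parquet", 200)  # file rewritten: reload
    print(len(loads))
```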
40be6a0b05 [fix](hive) do not split compress data file and support lz4/snappy block codec (#23245)
1. Do not split compressed data files
Some data files in hive are compressed with gzip, deflate, etc.
These kinds of files cannot be split.

2. Support lz4 block codec
For the hive scan node, use the lz4 block codec instead of the lz4 frame codec.

3. Support snappy block codec
For hadoop snappy.

4. Optimize the `count(*)` query of csv files
For a query like `select count(*) from tbl`, only lines need to be split, not columns.

Need to pick to branch-2.0 after this PR: #22304
2023-08-26 12:59:05 +08:00
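The "do not split compressed data files" rule from 40be6a0b05 can be sketched as a small split planner in Python. Detecting compression by file suffix, the suffix list, and the name `plan_splits` are assumptions for this illustration.

```python
# Suffixes treated as non-splittable in this sketch (an assumption).
COMPRESSED_SUFFIXES = (".gz", ".deflate")

def plan_splits(path: str, file_size: int, split_size: int):
    """Return (offset, length) ranges to scan for one file."""
    if path.endswith(COMPRESSED_SUFFIXES):
        # A compressed stream must be read from the start: one range only.
        return [(0, file_size)]
    return [
        (start, min(split_size, file_size - start))
        for start in range(0, file_size, split_size)
    ]

if __name__ == "__main__":
    print(plan_splits("part-0.csv", 250, 100))
    print(plan_splits("part-0.csv.gz", 250, 100))
```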
f66f161017 [fix](multi-catalog)fix hive table with cosn location issue (#23409)
Sometimes the partitions of a hive table are on different storage systems, e.g., some on HDFS and others on object storage (cos, etc.).
This PR mainly changes:

1. Fix the bug of accessing files via cosn.
2. Add a new field `fs_name` in TFileRangeDesc
    When accessing a file, the BE gets an hdfs client from the hdfs client cache, and different files in one query
request may have different fs names, e.g., some are `hdfs://` and some are `cosn://`. So we need to specify the fs name
for each file; otherwise, it may return an error:

`reason: IllegalArgumentException: Wrong FS: cosn://doris-build-1308700295/xxxx, expected: hdfs://172.xxxx:4007`
2023-08-26 00:16:00 +08:00
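The reason for a per-file `fs_name` in f66f161017 can be illustrated with a client cache keyed by file system name. In this Python sketch, `fs_name_of`, `ClientCache`, and the sample paths are hypothetical, and a plain dict stands in for a real client handle.

```python
from urllib.parse import urlparse

def fs_name_of(path: str) -> str:
    """Derive the file system name (scheme plus authority) from a file URI."""
    parsed = urlparse(path)
    return f"{parsed.scheme}://{parsed.netloc}"

class ClientCache:
    """One client per file system name, created on first use."""

    def __init__(self):
        self._clients = {}

    def get(self, fs_name: str):
        if fs_name not in self._clients:
            # Stand-in for creating a real client bound to this file system.
            self._clients[fs_name] = {"fs_name": fs_name}
        return self._clients[fs_name]

if __name__ == "__main__":
    cache = ClientCache()
    for path in ("hdfs://namenode:4007/warehouse/t/p=1/part-0.orc",
                 "cosn://bucket-a/warehouse/t/p=2/part-0.orc"):
        print(cache.get(fs_name_of(path))["fs_name"])
```

Looking the client up per file, rather than once per query, is what prevents a `cosn://` path from being handed to a client bound to `hdfs://`.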
8af1e7f27f [Fix](orc-reader) Fix incorrect result if null partition fields in orc file. (#23369)
Fix incorrect result if null partition fields in orc file. 

### Root Cause
Theoretically, the underlying files of a hive partitioned table should not contain partition fields. But we found that in some user scenarios the partition fields do exist in the underlying orc/parquet files, with null values. As a result, a predicate pushed down on a partition field is evaluated against these null values and filters incorrectly.

### Solution
We handle this case by reading only non-partition fields. The parquet reader already handles it this way; this PR does the same for the orc reader.
2023-08-26 00:13:11 +08:00
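The approach described in 8af1e7f27f, reading only non-partition fields from the file, can be sketched in Python. It assumes the partition values are known from outside the file (for example, from table metadata); `plan_file_columns` and `assemble_row` are hypothetical names.

```python
def plan_file_columns(requested, partition_values):
    """Columns to read from the data file: everything except partition fields."""
    return [col for col in requested if col not in partition_values]

def assemble_row(file_row, requested, partition_values):
    """Build an output row: partition fields come from the known partition
    values, never from the (possibly null) copies stored in the file."""
    return {
        col: partition_values[col] if col in partition_values else file_row[col]
        for col in requested
    }

if __name__ == "__main__":
    requested = ["id", "name", "dt"]
    partition_values = {"dt": "2023-08-25"}
    file_row = {"id": 1, "name": "a", "dt": None}  # null partition field in file
    print(plan_file_columns(requested, partition_values))
    print(assemble_row(file_row, requested, partition_values))
```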
a3a951c71d [Fix](multi-catalog) Fix load string dict issue for transactional hive tables. (#23306)
Fix load string dict issue for transactional hive tables. The column name needs to be passed as 'row.column_name'.

apache/doris-thirdparty#112
2023-08-26 00:09:12 +08:00
29273771f7 [Fix](multi-catalog) Fix hive incorrect result by disable string dict filter if exprs contain null expr. (#23361)
Issue Number: close #21960

Fix incorrect hive results by disabling the string dict filter if the exprs contain a null expr.
2023-08-25 21:16:43 +08:00
caddcc6215 [Fix](orc-reader) Fix decimal type check for ColumnValueRange issue and use primitive_type. (#23424)
Fix decimal type check for ColumnValueRange issue and use primitive_type in orc_reader. This is needed because #22842 changed the `CppType` of `PrimitiveTypeTraits<TYPE_DECIMALXXX>`.
2023-08-24 23:26:41 +08:00
2dda44d7b5 [fix](csv-reader)fix bug of multi-char delimiter in csv reader
Fix a bug in how csv_reader parses a line to get columns.
2023-08-23 15:19:13 +08:00
527293aa41 [refactor](dynamic table) remove dynamic table (#23298) 2023-08-23 14:15:14 +08:00
Pxl
8ed4045df9 [Chore](primitive-type) remove VecPrimitiveTypeTraits (#22842) 2023-08-23 08:37:40 +08:00
Pxl
1a1f86486d [Improvement](function) opt for case when (#23068)
opt for case when
2023-08-22 18:31:40 +08:00