Commit Graph

295 Commits

Author SHA1 Message Date
3585c7e216 [test](parquet) add parquet reader byte_array_decimal and rle_bool cases (#26751) 2023-11-14 15:05:10 +08:00
ec40603b93 [fix](parquet) compressed_page_size has the same meaning in page v1 and v2 (#26783)
1. Parquet pages in the v2 format were parsed incorrectly when using any codec other than snappy. `compressed_page_size` has the same meaning in page v1 and v2: it always counts the bytes of the definition levels, the repetition levels, and the compressed data (see the sketch after this list).
2. Add regression tests for decimal types stored as `fix_length_byte_array` and for dictionary-encoded date/datetime types.
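
To make the size accounting concrete, here is a minimal sketch of how a reader would derive the bytes to hand to the decompressor for a v2 page. The struct fields mirror the Parquet Thrift definitions but are hypothetical simplifications, not Doris's actual types:

```
#include <cstdint>

// Hypothetical, simplified mirror of the Parquet Thrift page headers.
struct DataPageHeaderV2 {
    int32_t repetition_levels_byte_length = 0;
    int32_t definition_levels_byte_length = 0;
};

struct PageHeader {
    // Covers levels + data in BOTH page v1 and v2 (the point of this fix).
    int32_t compressed_page_size = 0;
    DataPageHeaderV2 data_page_header_v2;
};

// In page v2, rep/def levels are stored uncompressed in front of the data,
// so the codec must only see the remaining bytes; subtracting the level
// lengths works for every codec, not just snappy.
int32_t compressed_data_size(const PageHeader& header) {
    const DataPageHeaderV2& v2 = header.data_page_header_v2;
    return header.compressed_page_size
           - v2.repetition_levels_byte_length
           - v2.definition_levels_byte_length;
}
```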
2023-11-14 08:30:42 +08:00
c07a70e22a [Fix](orc-reader) Add missing break introduced by #26548. (#26633)
Add missing break introduced by #26548. Sorry for this mistake.
2023-11-09 18:29:44 +08:00
a5565f68b2 [Refactor](opentelemetry) Remove opentelemetry (#26605) 2023-11-09 18:05:34 +08:00
22bf2889e5 [feature](tvf)(jni-avro)jni-avro scanner add complex data types (#26236)
Support Avro's enum, record, and union data types.
2023-11-09 13:58:49 +08:00
d1438a8563 [Fix](orc-reader) Fix orc complex types when late materialization was turned on by disabling late materialization in this case. (#26548)
Fix reading ORC complex types when late materialization is turned on, by disabling late materialization in that case.
2023-11-09 12:05:43 +08:00
3bce6d3828 [Opt](orc-reader) Optimize orc string dict filter in not_single_conjunct case. (#26386)
Optimize the ORC/Parquet string dictionary filter in the not_single_conjunct case: filter the block by dictionary code first, then by the not_single_conjunct. Because dictionary codes are integers, they filter faster than strings.

For example:
```
select count(l_receiptdate) from lineitem_date_as_string where l_shipmode in ('MAIL', 'SHIP') and l_commitdate < l_receiptdate  and l_receiptdate >= '1994-01-01' and l_receiptdate < '1995-01-01';
```
`l_receiptdate` and `l_shipmode` use string dictionary filtering, and `l_commitdate < l_receiptdate` is a not_single_conjunct that contains a dictionary-filtered field, so the block can be filtered by the integer dictionary codes first and by the not_single_conjunct second.
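
A minimal sketch of the two-phase idea (hypothetical types and names, not Doris's actual API): filter by the integer dictionary codes first, then evaluate the remaining conjunct only on the surviving rows:

```
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

// Phase 1 filters by int32 dictionary codes (cheap integer comparisons);
// phase 2 runs the not_single_conjunct only on rows that survived phase 1.
std::vector<uint32_t> filter_rows(
        const std::vector<int32_t>& dict_codes,
        const std::unordered_set<int32_t>& accepted_codes,
        const std::function<bool(uint32_t)>& not_single_conjunct) {
    std::vector<uint32_t> survivors;
    for (uint32_t row = 0; row < static_cast<uint32_t>(dict_codes.size()); ++row) {
        if (accepted_codes.count(dict_codes[row]) > 0) {
            survivors.push_back(row);
        }
    }
    std::vector<uint32_t> result;
    for (uint32_t row : survivors) {
        if (not_single_conjunct(row)) {
            result.push_back(row);
        }
    }
    return result;
}
```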

### Test Result:
Before:
```
mysql> select count(l_receiptdate) from lineitem_date_as_string where l_shipmode in ('MAIL', 'SHIP') and l_commitdate < l_receiptdate  and l_receiptdate >= '1994-01-01' and l_receiptdate < '1995-01-01';
+----------------------+
| count(l_receiptdate) |
+----------------------+
|             49314694 |
+----------------------+
1 row in set (6.87 sec)
```

After:
```
mysql> select count(l_receiptdate) from lineitem_date_as_string where l_shipmode in ('MAIL', 'SHIP') and l_commitdate < l_receiptdate  and l_receiptdate >= '1994-01-01' and l_receiptdate < '1995-01-01';
+----------------------+
| count(l_receiptdate) |
+----------------------+
|             49314694 |
+----------------------+
1 row in set (4.85 sec)
```
2023-11-08 18:03:18 +08:00
44b51bf0b9 [Feature](Variant) support variant load (#26572) 2023-11-08 00:37:57 -06:00
a4e415ab09 [feature](hive)Support hive tables after alter type. (#25138)
1. Refactor the Parquet decode logic: the Parquet reader first reads data according to the Parquet physical type, and then performs a type conversion.

2. Support Hive tables after a column type has been altered.
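
A minimal sketch of the decode-then-convert shape (a hypothetical function, not the actual reader code): values are first materialized as the Parquet physical type, then converted to the type the table schema currently declares, which is what lets a Hive table keep working after the column type is altered:

```
#include <cstdint>
#include <vector>

// E.g. a column written as INT32 but now declared BIGINT after ALTER TABLE:
// read the physical int32 values, then convert to the logical type.
std::vector<int64_t> read_int32_column_as_bigint(const std::vector<int32_t>& physical) {
    std::vector<int64_t> logical;
    logical.reserve(physical.size());
    for (int32_t v : physical) {
        logical.push_back(static_cast<int64_t>(v)); // widening type conversion
    }
    return logical;
}
```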
2023-11-02 00:24:21 +08:00
3e10e5af39 [Fix](Serde) Fix content displayed by complex types in MySQL Client (#25946)
This PR makes three changes to the display of complex types:
1. NULL values inside complex types are displayed as `null`, not `NULL`.
2. Struct types are displayed as "column_name": column_value.
3. Time types such as `datetime` and `date` are displayed with double quotes inside complex types, like
    `{1, "2023-10-26 12:12:12"}`

This PR also refactors the code:
1. nesting_level becomes a member variable of `DataTypeSerDe` rather than a parameter passed through its methods.

What's more, this PR fixes a bug, introduced by #25854, where fileSize was not correct.
2023-11-01 23:48:55 +08:00
aafd53766b [chore](file-reader) rm unused interface from generic reader (#26205) 2023-11-01 18:43:14 +08:00
Pxl 696ecc8c83 [Chore](log) adjust error code on too many filtered rows (#26168) 2023-11-01 00:15:56 +08:00
b98744ae90 [Bug](iceberg)fix read partitioned iceberg without partition path (#25503)
Iceberg does not require partition values to appear in file paths, so we should get the partition value from `PartitionScanTask.partition`.
2023-10-31 18:09:53 +08:00
6dd60c6ebb [Enhance](BE) Add -Wshadow-field compile option to avoid unexpected shadowing behavior (#25698)
* Fix `Tablet::_meta_lock` shadowing the member inherited from `BaseTablet`

* Add -Wshadow-field compile option to avoid unexpected shadowing behavior
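
For illustration, a minimal example of the pattern -Wshadow-field catches (the `Tablet::_meta_lock` case reduced to a sketch):

```
#include <mutex>

struct BaseTablet {
    std::mutex _meta_lock;
};

struct Tablet : BaseTablet {
    // Triggers roughly: non-static data member '_meta_lock' of 'Tablet'
    // shadows member inherited from type 'BaseTablet' [-Wshadow-field].
    // Code locking Tablet::_meta_lock no longer synchronizes with code
    // that locks through a BaseTablet reference.
    std::mutex _meta_lock;
};
```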
2023-10-26 10:00:28 +08:00
693982fd1a [feature](decimal) support decimal256 (#25386) 2023-10-25 15:47:51 +08:00
88dd480c2e [enhancement](CSV-reader) enhance err log for csv reading containing enclose or escape (#25816) 2023-10-24 22:10:08 +08:00
d62e914205 [opt](profile) set datalake profile level as 1 (#25686)
Following #25491, only profile entries marked as level 1 are shown in the simplified profile.
2023-10-24 09:55:25 +08:00
0e0f8090f7 [refactor](text_convert)Use serde to replace text_convert. (#25543)
Remove text_convert and use serde to replace it.
2023-10-24 09:52:43 +08:00
08832d9f3a [Fix](exec) Fix date dict dead loop. (#25570) 2023-10-24 02:51:43 +08:00
9006e2b8a5 [fix](prefetch-read) make the prefetch range correct to accelerate S3 load and fix its speed imbalance (#25775) 2023-10-23 20:02:24 +08:00
Pxl 642c149e6a remove datetime_value and move vecdatetime_value to doris namespace (#25695)
2023-10-20 22:08:17 +08:00
e4a83a22d1 [opt](error msg) Make data codec errors clear when loaded CSV data can't be displayed (#25540)
Co-authored-by: Tanya-W <tanya1218w@163.com>
2023-10-18 16:12:22 +08:00
47689fd452 [refactor](jni) unified jni framework for java udf (#25302)
Use the unified JNI framework to refactor Java UDFs.
The unified JNI framework uses VectorTable as the container to transfer data between C++ and Java, and hides the details of data format conversion.
In addition, the unified framework supports complex and nested types.
The performance of basic types remains consistent, with a 30% improvement for string types and an order-of-magnitude improvement for complex types.
2023-10-18 09:27:54 +08:00
18c2a13e09 [fix](multi-catalog)fix maxcompute partition filter and session creation (#24911)
- Add MaxCompute partition support
- Fix the MaxCompute partition filter
- Modify the MaxCompute session creation method
2023-10-17 22:36:10 +08:00
2664d1cffb [chore](vec) Make this copy constructor of StringRef explicit (#25337) 2023-10-12 14:12:46 +08:00
58d96ecdbf [Improve](status) avoid printing too many stack logs for DATA_QUALITY_ERROR code (#25292) 2023-10-12 09:58:51 +08:00
46ab4346ca [Opt](parquet reader) Optimize the performance of reading decimal in parquet reader. (#25012)
Optimize the performance of reading decimals in the Parquet reader (a sketch of the static-dispatch idea follows this list):

- Statically dispatch `DecimalScaleParams`.
- Optimize `memcpy`: statically dispatch the copy size in fixed-length cases.
- Use the right-shift bit operator to convert decimals.
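
A minimal sketch of the static-dispatch idea (hypothetical names, not the actual `DecimalScaleParams` code): lifting the scale decision into a template parameter removes the per-value branch from the hot loop:

```
#include <cstdint>
#include <vector>

enum class ScaleType { kNoScale, kScaleUp, kScaleDown };

// The scale decision is made once per column chunk at the call site; inside
// the loop, `if constexpr` compiles to straight-line code per instantiation.
template <ScaleType scale_type>
void convert_decimals(const std::vector<int64_t>& src, std::vector<int64_t>& dst,
                      int64_t scale_factor) {
    dst.reserve(src.size());
    for (int64_t v : src) {
        if constexpr (scale_type == ScaleType::kScaleUp) {
            dst.push_back(v * scale_factor);
        } else if constexpr (scale_type == ScaleType::kScaleDown) {
            dst.push_back(v / scale_factor);
        } else {
            dst.push_back(v);
        }
    }
}
```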
2023-10-12 09:53:08 +08:00
bb670118f5 [coverage](test) Delete unused function to improve test coverage (#25233) 2023-10-11 11:50:51 +08:00
2f706cc84b [compile](simdjson reader) use the __AVX2__ macro to decide whether to use simdjson to parse (#25165) 2023-10-11 10:50:13 +08:00
6fe060b79e [fix](streamload) fix http_stream retry mechanism (#24978)
If a failure occurs, Doris may retry. Because ctx->is_read_schema is a global variable that was not reset in a timely manner, retries could cause exceptions.


Co-authored-by: yiguolei <676222867@qq.com>
2023-10-08 11:16:21 +08:00
4e8cde127c [Enhance](catalog)add table cache in paimon jni (#25014)
- Fix getting an old schema after refreshing a Paimon table
- Add a table cache in Paimon JNI
2023-10-08 10:36:18 +08:00
642e5cdb69 [Fix](Status) Make Status [[nodiscard]] and handle returned Status correctly (#23395) 2023-09-29 22:38:52 +08:00
082bcd820b [feature](insert) Support wal for group commit insert (#23053) 2023-09-26 14:46:24 +08:00
3c99743bf2 [enhancement](csv_reader)Optimize the reading efficiency of nullable (string) columns. (#24698)
Optimize the performance of stream load TSV by reducing virtual function calls, i.e., optimize the read performance of nullable (string) columns.
Before: 600+ s
After: 560+ s
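
A minimal sketch of the virtual-call reduction (a hypothetical interface, not the actual reader code): replace one virtual dispatch per value with one per batch, so a concrete override can run a tight, inlinable loop:

```
#include <string_view>
#include <vector>

struct ColumnDeserializer {
    virtual ~ColumnDeserializer() = default;
    // Old shape: one virtual call per cell.
    virtual void deserialize_one(std::string_view cell) = 0;
    // New shape: one virtual call per column of cells; an override can loop
    // over all cells without any per-cell virtual dispatch.
    virtual void deserialize_column(const std::vector<std::string_view>& cells) {
        for (std::string_view cell : cells) {
            deserialize_one(cell);
        }
    }
};
```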
2023-09-22 13:44:37 +08:00
c704497d02 [fix](csv_reader)Fixed bug when parsing multi-character delimiters. (#24572)
Fixed bug when parsing multi-character delimiters.
2023-09-20 12:41:35 +08:00
4dad7c94da [fix](orc) fix the count(*) pushdown issue in orc format (#24446)
Previously, when querying a Hive table in ORC format whose files are split,
the result of select count(*) could be a multiple of the real row count.

This is because the number of rows must be obtained after ORC stripe pruning;
otherwise a wrong result may be returned.
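
A minimal sketch of the corrected accounting (hypothetical types): the pushed-down count(*) sums the rows of the stripes this split actually owns after pruning, rather than taking the file-level row count, so overlapping splits do not each report the whole file:

```
#include <cstdint>
#include <vector>

struct StripeInfo {
    int64_t num_rows = 0;
    bool selected_by_this_split = false; // set during stripe pruning
};

int64_t pushed_down_row_count(const std::vector<StripeInfo>& stripes) {
    int64_t rows = 0;
    for (const StripeInfo& s : stripes) {
        if (s.selected_by_this_split) {
            rows += s.num_rows;
        }
    }
    return rows;
}
```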
2023-09-16 09:57:39 +08:00
b9ddcbf729 [feature](merge-cloud) Rewrite code related to IOContext (#24269) 2023-09-15 19:57:58 +08:00
9c681692bd Revert "[fix] fix http_stream retry mechanism (#23969)" (#24407)
This reverts commit 05e365ea137eb8c92b8e7eedc7d1435e83f065ae.
2023-09-15 10:07:53 +08:00
05e365ea13 [fix] fix http_stream retry mechanism (#23969)
Co-authored-by: yiguolei <676222867@qq.com>
2023-09-14 21:41:11 +08:00
d8ef9dda59 [feature](merge-cloud) Rewrite FS interface (#23953) 2023-09-12 19:20:25 +08:00
6e28d878b5 [fix](hudi) compatible with hudi spark configuration and support skip merge (#24067)
Fix three bugs:
1. A Hudi slice may contain log files only, so `new Path(filePath)` will throw errors.
2. Hive column names are lowercase only, so match column names case-insensitively.
3. Be compatible with [Spark Datasource Configs](https://hudi.apache.org/docs/configurations/#Read-Options), so users can add `hoodie.datasource.merge.type=skip_merge` to catalog properties to skip merging log files.
2023-09-11 19:54:59 +08:00
9b3be0ba7a [Fix](multi-catalog) Do not throw exceptions when file not exists for external hive tables. (#23799)
A bug similar to #22140.

When executing a query through an HMS catalog, the query may fail because some HDFS files no longer exist. We should distinguish this kind of error and skip the missing files.

```
errCode = 2, detailMessage = (xxx.xxx.xxx.xxx)[CANCELLED][INTERNAL_ERROR]failed to init reader for file hdfs://xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc, err: [INTERNAL_ERROR]Init OrcReader failed. reason = Failed to read hdfs://xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc: [INTERNAL_ERROR]Read hdfs file failed. (BE: xxx.xxx.xxx.xxx) namenode:hdfs://xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc, err: (2), No such file or directory), reason: RemoteException: File does not exist: /xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86) 
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76) 
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:158) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1927) 
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:738) 
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:426) 
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) 
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) 
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) 
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
```
2023-09-10 21:55:09 +08:00
f9a75b5c4f [feature](csv_serde)1.append csv serde for serialize to csv and deserialize from csv. 2.let csvReader use csv serde not text_converter. (#23352)
1. Add a CSV SerDe for serializing to and deserializing from CSV.
2. Let csvReader use the CSV SerDe instead of text_converter.
2023-09-10 00:16:21 +08:00
0f0ffa3482 [Fix](Parquet Reader) fix parquet read issue (#24092) 2023-09-09 00:35:18 +08:00
6b56896a01 [chore](json reader) add original data to error message for tracing (#22803) 2023-09-02 20:15:18 +08:00
657e927d50 [fix](json) Fix out-of-bounds access when reading a JSON file (#23411) 2023-09-02 01:11:37 +08:00
eaf2a6a80e [fix](date) return right date value even if out of the range of date dictionary(#23664)
PR https://github.com/apache/doris/pull/22360 and PR https://github.com/apache/doris/pull/22384 optimized the performance of the date type. However, Hive supports dates outside 1970~2038, which led to wrong date values in the TPC-DS benchmark.
How to fix:
1. Increase the dictionary range to 1900~2038.
2. Dates outside 1900~2038 are regenerated instead of looked up.
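
A minimal sketch of the fix's shape (hypothetical helpers, declared but not shown): dates inside the enlarged dictionary range take the fast lookup, and anything outside is computed instead of read from the dictionary:

```
#include <cstdint>

// Hypothetical helpers standing in for the real lookup/compute paths.
int64_t lookup_date_dict(int32_t year, int32_t month, int32_t day);
int64_t compute_date_value(int32_t year, int32_t month, int32_t day);

constexpr int32_t kDictMinYear = 1900;
constexpr int32_t kDictMaxYear = 2038;

int64_t date_value(int32_t year, int32_t month, int32_t day) {
    if (year >= kDictMinYear && year <= kDictMaxYear) {
        return lookup_date_dict(year, month, day);  // fast path: dictionary hit
    }
    return compute_date_value(year, month, day);    // out of range: regenerate
}
```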
2023-09-01 14:40:20 +08:00
3a2c0d16f7 [fix](parquet) fix potential heap-use-after-free issue and cache issue (#23638)
1. When the file meta cache is disabled (by setting `max_external_file_meta_cache_num=0` in be.conf),
the parquet meta info is owned by the parquet reader and is released when `reader->close()` is called.

But the underlying file reader of this parquet reader is released after `reader->close()`,
which may cause a `heap-use-after-free` bug because parts of the meta info may still be referenced by the file reader.

This PR fixes it by making sure the meta info is released after the file reader is released.

2. Add the modification time to the file meta cache in BE, to avoid parquet read errors like:
`Failed to deserialize parquet page header`
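
A minimal sketch of the ownership ordering for point 1 (hypothetical members, not the actual reader layout): C++ destroys members in reverse declaration order, so declaring the file reader after the meta info guarantees the meta info is released only after the file reader is gone:

```
#include <memory>

struct FileReader {};    // may reference parts of the meta info
struct FileMetaData {};

class ParquetReader {
    // Members are destroyed in reverse declaration order: _file_reader goes
    // first, _meta second, so nothing the file reader still references is
    // freed before the file reader itself is released.
    std::shared_ptr<FileMetaData> _meta;
    std::unique_ptr<FileReader> _file_reader;
};
```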
2023-08-31 18:23:05 +08:00
40be6a0b05 [fix](hive) do not split compress data file and support lz4/snappy block codec (#23245)
1. Do not split compressed data files.
Some data files in Hive are compressed with gzip, deflate, etc.
These kinds of files cannot be split.

2. Support the lz4 block codec.
For the Hive scan node, use the lz4 block codec instead of the lz4 frame codec.

3. Support the snappy block codec.
For Hadoop snappy.

4. Optimize the `count(*)` query of CSV files (see the sketch below).
For a query like `select count(*) from tbl`, we only need to split lines, not columns.

Need to pick to branch-2.0 after this PR: #22304
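
A minimal sketch of the count(*) fast path for item 4 (a hypothetical helper, not the actual scanner code): scan only for line delimiters and never split columns:

```
#include <cstddef>
#include <cstring>

// Counts rows in a CSV buffer by locating line delimiters with memchr;
// no per-column splitting is needed when only count(*) is requested.
size_t count_csv_rows(const char* buf, size_t len, char line_delim = '\n') {
    size_t rows = 0;
    const char* end = buf + len;
    while (buf < end) {
        const void* pos = memchr(buf, line_delim, static_cast<size_t>(end - buf));
        if (pos == nullptr) break; // a trailing partial line is handled elsewhere
        ++rows;
        buf = static_cast<const char*>(pos) + 1;
    }
    return rows;
}
```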
2023-08-26 12:59:05 +08:00
f66f161017 [fix](multi-catalog)fix hive table with cosn location issue (#23409)
Sometimes the partitions of a Hive table may be on different storage systems, e.g., some on HDFS and others on object storage (COS, etc.).
This PR mainly changes:

1. Fix the bug of accessing files via cosn.
2. Add a new field `fs_name` in TFileRangeDesc.
    This is because, when accessing a file, the BE gets an HDFS client from the HDFS client cache, and different files in one query
request may have different fs names, e.g., some are `hdfs://` and some are `cosn://`, so we need to specify the fs name
for each file; otherwise it may return an error:

`reason: IllegalArgumentException: Wrong FS: cosn://doris-build-1308700295/xxxx, expected: hdfs://172.xxxx:4007`
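
A minimal sketch of the per-file resolution (hypothetical types, not the actual BE cache): key the client cache by each range's `fs_name`, so `hdfs://` and `cosn://` files in one query each get a matching handle:

```
#include <map>
#include <memory>
#include <string>

struct FileSystem {}; // stands in for a connected hdfs/cosn client

class FsClientCache {
public:
    // fs_name comes from the file's own TFileRangeDesc, e.g.
    // "hdfs://172.xxxx:4007" or "cosn://doris-build-1308700295".
    std::shared_ptr<FileSystem> get(const std::string& fs_name) {
        auto it = _cache.find(fs_name);
        if (it != _cache.end()) {
            return it->second;
        }
        auto fs = std::make_shared<FileSystem>(); // connect using fs_name here
        _cache.emplace(fs_name, fs);
        return fs;
    }

private:
    std::map<std::string, std::shared_ptr<FileSystem>> _cache;
};
```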
2023-08-26 00:16:00 +08:00