doris

Author	SHA1	Message	Date
daidai	3585c7e216	[test](parquet)append parquet reader byte_array_decimal and rle_bool case (#26751 )	2023-11-14 15:05:10 +08:00
Ashin Gau	ec40603b93	[fix](parquet) compressed_page_size has the same meaning in page v1 and v2 (#26783 ) 1. Parquet files with page v2 fail to parse when a codec other than snappy is used. `compressed_page_size` has the same meaning in page v1 and v2: it always covers the bytes of the definition levels, the repetition levels, and the compressed data. 2. Add regression tests for the decimal type stored as `fix_length_byte_array`, and for dictionary-encoded date/datetime types.	2023-11-14 08:30:42 +08:00
Qi Chen	c07a70e22a	[Fix](orc-reader) Add missing `break` introduced by #26548 . (#26633 ) Add missing break introduced by #26548. Sorry for this mistake.	2023-11-09 18:29:44 +08:00
zhiqiang	a5565f68b2	[Refactor](opentelemetry) Remove opentelemetry (#26605 )	2023-11-09 18:05:34 +08:00
wudongliang	22bf2889e5	[feature](tvf)(jni-avro)jni-avro scanner add complex data types (#26236 ) Support avro's enum, record, union data types	2023-11-09 13:58:49 +08:00
Qi Chen	d1438a8563	[Fix](orc-reader) Fix orc complex types when late materialization was turned on by disabling late materialization in this case. (#26548 ) Fix orc complex types when late materialization was turned on in orc reader by disabling late materialization in this case.	2023-11-09 12:05:43 +08:00
Qi Chen	3bce6d3828	[Opt](orc-reader) Optimize orc string dict filter in not_single_conjunct case. (#26386 ) Optimize the orc/parquet string dict filter in the not_single_conjunct case: filter the block first by dict code, then by the not_single_conjunct. Because the dict code is an int, it filters faster than a string. For example: ``` select count(l_receiptdate) from lineitem_date_as_string where l_shipmode in ('MAIL', 'SHIP') and l_commitdate < l_receiptdate and l_receiptdate >= '1994-01-01' and l_receiptdate < '1995-01-01'; ``` `l_receiptdate` and `l_shipmode` will use string dict filtering, and `l_commitdate < l_receiptdate` is a not_single_conjunct that contains a dict filter field. ### Test Result: Before: mysql> select count(l_receiptdate) from lineitem_date_as_string where l_shipmode in ('MAIL', 'SHIP') and l_commitdate < l_receiptdate and l_receiptdate >= '1994-01-01' and l_receiptdate < '1995-01-01'; +----------------------+ | count(l_receiptdate) | +----------------------+ | 49314694 | +----------------------+ 1 row in set (6.87 sec) After: mysql> select count(l_receiptdate) from lineitem_date_as_string where l_shipmode in ('MAIL', 'SHIP') and l_commitdate < l_receiptdate and l_receiptdate >= '1994-01-01' and l_receiptdate < '1995-01-01'; +----------------------+ | count(l_receiptdate) | +----------------------+ | 49314694 | +----------------------+ 1 row in set (4.85 sec)	2023-11-08 18:03:18 +08:00
lihangyu	44b51bf0b9	[Feature](Variant) support variant load (#26572 )	2023-11-08 00:37:57 -06:00
daidai	a4e415ab09	[feature](hive)Support hive tables after alter type. (#25138 ) 1. Rework the decode logic for reading parquet: the parquet reader first reads the data according to the parquet physical type, and then performs a type conversion. 2. Support hive alter table.	2023-11-02 00:24:21 +08:00
Tiewei Fang	3e10e5af39	[Fix](Serde) Fix content displayed by complex types in MySQL Client (#25946 ) This PR makes three changes to the display of complex types: 1. A NULL value inside a complex type is displayed as `null`, not `NULL`. 2. A struct type is displayed as "column_name": column_value. 3. Time types such as `datetime` and `date` are displayed with double quotes inside complex types, like `{1, "2023-10-26 12:12:12"}`. This PR also refactors the code: `nesting_level` is now a member variable of `DataTypeSerDe` rather than a method parameter. In addition, this PR fixes a bug introduced by #25854 where fileSize was not correct.	2023-11-01 23:48:55 +08:00
Siyang Tang	aafd53766b	[chore](file-reader) rm unused interface from generic reader (#26205 )	2023-11-01 18:43:14 +08:00
Pxl	696ecc8c83	[Chore](log) adjust error code on too many filtered rows (#26168 )	2023-11-01 00:15:56 +08:00
wuwenchi	b98744ae90	[Bug](iceberg)fix read partitioned iceberg without partition path (#25503 ) Iceberg does not require partition values to exist on file paths, so we should get the partition value from `PartitionScanTask.partition`.	2023-10-31 18:09:53 +08:00
plat1ko	6dd60c6ebb	[Enhance](BE) Add -Wshadow-field compile option to avoid unexpected shadowing behavior (#25698 ) * Fix `Tablet::_meta_lock` shadows member inherited from `BaseTablet` * Add -Wshadow-field compile option to avoid unexpected shadowing behavior	2023-10-26 10:00:28 +08:00
TengJianPing	693982fd1a	[feature](decimal) support decimal256 (#25386 )	2023-10-25 15:47:51 +08:00
Siyang Tang	88dd480c2e	[enhancement](CSV-reader) enhance err log for csv reading containing enclose or escape (#25816 )	2023-10-24 22:10:08 +08:00
Ashin Gau	d62e914205	[opt](profile) set datalake profile level as 1 (#25686 ) Follow #25491, only the profile marked as 1 will be shown in simplified profile.	2023-10-24 09:55:25 +08:00
daidai	0e0f8090f7	[refactor](text_convert)Use serde to replace text_convert. (#25543 ) Remove text_convert and use serde to replace it.	2023-10-24 09:52:43 +08:00
Qi Chen	08832d9f3a	[Fix](exec) Fix date dict dead loop. (#25570 )	2023-10-24 02:51:43 +08:00
Siyang Tang	9006e2b8a5	[fix](prefetch-read) make prefetch range correct to accelerate S3 load and fix its speed unbalance (#25775 )	2023-10-23 20:02:24 +08:00
Pxl	642c149e6a	remove datetime_value and move vecdatetime_value to doris namespace (#25695 ) remove datetime_value and move vecdatetime_value to doris namespace	2023-10-20 22:08:17 +08:00
YueW	e4a83a22d1	[opt](error msg) Make data codec error clearly when load csv data can't display (#25540 ) Co-authored-by: Tanya-W <tanya1218w@163,com>	2023-10-18 16:12:22 +08:00
Ashin Gau	47689fd452	[refactor](jni) unified jni framework for java udf (#25302 ) Use the unified jni framework to refactor java udf. The unified jni framework takes VectorTable as the container to transform data between c++ and java, and hide the details of data format conversion. In addition, the unified framework supports complex and nested types. The performance of basic types remains consistent, with a 30% improvement in string types and an order of magnitude improvement in complex types.	2023-10-18 09:27:54 +08:00
slothever	18c2a13e09	[fix](multi-catalog)fix maxcompute partition filter and session creation (#24911 ) Add maxcompute partition support; fix the maxcompute partition filter; modify the maxcompute session creation method.	2023-10-17 22:36:10 +08:00
Jerry Hu	2664d1cffb	[chore](vec) Make this copy constructor of StringRef explicit (#25337 )	2023-10-12 14:12:46 +08:00
lihangyu	58d96ecdbf	[Improve](status) avoid print too many stack log for `DATA_QUALITY_ERROR` code (#25292 )	2023-10-12 09:58:51 +08:00
Qi Chen	46ab4346ca	[Opt](parquet reader) Optimize the performance of reading decimal in parquet reader. (#25012 ) Optimize the performance of reading decimal in parquet reader. - Static dispatch `DecimalScaleParams`. - Optimize `memcpy`, static dispatch copy size in fixed length cases. - Use right shift bit operator to convert decimals.	2023-10-12 09:53:08 +08:00
Gabriel	bb670118f5	[coverage](test) Delete unused function to improve test coverage (#25233 )	2023-10-11 11:50:51 +08:00
lihangyu	2f706cc84b	[compile](simdjson reader) use `__AVX2__` macro to decide whether use simdjson to parse (#25165 )	2023-10-11 10:50:13 +08:00
zzzzzzzs	6fe060b79e	[fix](streamload) fix http_stream retry mechanism (#24978 ) If a failure occurs, doris may retry. Because `ctx->is_read_schema` is a global variable that is not reset in time, the retry may cause exceptions. --------- Co-authored-by: yiguolei <676222867@qq.com>	2023-10-08 11:16:21 +08:00
zhangdong	4e8cde127c	[Enhance](catalog)add table cache in paimon jni (#25014 ) - fix get old schema after refresh paimon table - add table cache in paimon jni	2023-10-08 10:36:18 +08:00
bobhan1	642e5cdb69	[Fix](Status) Make `Status` `[[nodiscard]]` and handle returned `Status` correctly (#23395 )	2023-09-29 22:38:52 +08:00
huanghaibin	082bcd820b	[feature](insert) Support wal for group commit insert (#23053 )	2023-09-26 14:46:24 +08:00
daidai	3c99743bf2	[enhancement](csv_reader)Optimize the reading efficiency of nullable (string) columns. (#24698 ) Optimize the performance of stream load tsv by reducing virtual function calls (that is, optimize the read performance of nullable (string) columns by reducing virtual function calls). Before: 600+ s. After: 560+ s.	2023-09-22 13:44:37 +08:00
daidai	c704497d02	[fix](csv_reader)Fixed bug when parsing multi-character delimiters. (#24572 ) Fixed bug when parsing multi-character delimiters.	2023-09-20 12:41:35 +08:00
Mingyu Chen	4dad7c94da	[fix](orc) fix the count() pushdown issue in orc format (#24446 ) Previously, when querying a hive table in orc format whose file was split, the result of select count() could be a multiple of the real row count. The number of rows must be obtained after orc stripe pruning; otherwise a wrong result may be returned.	2023-09-16 09:57:39 +08:00
plat1ko	b9ddcbf729	[feature](merge-cloud) Rewrite code related to IOContext (#24269 )	2023-09-15 19:57:58 +08:00
yiguolei	9c681692bd	Revert "[fix] fix http_stream retry mechanism (#23969 )" (#24407 ) This reverts commit 05e365ea137eb8c92b8e7eedc7d1435e83f065ae.	2023-09-15 10:07:53 +08:00
zzzzzzzs	05e365ea13	[fix] fix http_stream retry mechanism (#23969 ) Co-authored-by: yiguolei <676222867@qq.com>	2023-09-14 21:41:11 +08:00
plat1ko	d8ef9dda59	[feature](merge-cloud) Rewrite FS interface (#23953 )	2023-09-12 19:20:25 +08:00
Ashin Gau	6e28d878b5	[fix](hudi) compatible with hudi spark configuration and support skip merge (#24067 ) Fix three bugs: 1. A Hudi slice may contain only log files, in which case `new Path(filePath)` throws errors. 2. Hive column names are lowercase only, so match column names in ignore-case mode. 3. Be compatible with [Spark Datasource Configs](https://hudi.apache.org/docs/configurations/#Read-Options), so users can add `hoodie.datasource.merge.type=skip_merge` in catalog properties to skip merging log files.	2023-09-11 19:54:59 +08:00
Xiangyu Wang	9b3be0ba7a	[Fix](multi-catalog) Do not throw exceptions when file not exists for external hive tables. (#23799 ) A bug similar to #22140. When executing a query with an hms catalog, the query may fail because some hdfs files do not exist. We should detect this kind of error and skip the file. ``` errCode = 2, detailMessage = (xxx.xxx.xxx.xxx)[CANCELLED][INTERNAL_ERROR]failed to init reader for file hdfs://xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc, err: [INTERNAL_ERROR]Init OrcReader failed. reason = Failed to read hdfs://xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc: [INTERNAL_ERROR]Read hdfs file failed. (BE: xxx.xxx.xxx.xxx) namenode:hdfs://xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc, err: (2), No such file or directory), reason: RemoteException: File does not exist: /xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86) at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76) at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:158) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1927) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:738) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:426) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682) ```	2023-09-10 21:55:09 +08:00
daidai	f9a75b5c4f	[feature](csv_serde)1.append csv serde for serialize to csv and deserialize from csv. 2.let csvReader use csv serde not text_converter. (#23352 ) 1. append csv serde for serialize to csv and deserialize from csv. 2. let csvReader use csv serde not text_converter.	2023-09-10 00:16:21 +08:00
GoGoWen	0f0ffa3482	[Fix](Parquet Reader) fix parquet read issue (#24092 )	2023-09-09 00:35:18 +08:00
lihangyu	6b56896a01	[chore](json reader) add original data to error message for tracing (#22803 )	2023-09-02 20:15:18 +08:00
daidai	657e927d50	[fix](json)Fix the bug that read json file Out of bounds access (#23411 )	2023-09-02 01:11:37 +08:00
Ashin Gau	eaf2a6a80e	[fix](date) return right date value even if out of the range of date dictionary(#23664 ) PR(https://github.com/apache/doris/pull/22360) and PR(https://github.com/apache/doris/pull/22384) optimized the performance of the date type. However, hive supports dates outside 1970~2038, which led to wrong date values in the tpcds benchmark. How to fix: 1. Increase the dictionary range to 1900 ~ 2038. 2. Dates outside 1900 ~ 2038 are regenerated.	2023-09-01 14:40:20 +08:00
Mingyu Chen	3a2c0d16f7	[fix](parquet) fix potential heap-use-after-free issue and cache issue (#23638 ) 1. When the file meta cache is disabled (by setting `max_external_file_meta_cache_num=0` in be.conf), the parquet meta info is owned by the parquet reader and is released when `reader->close()` is called. But the underlying file reader of this parquet reader is released after `reader->close()`, which may cause a `heap-use-after-free` bug because some parts of the meta info may still be referenced by the file reader. This PR fixes it by making sure that the meta info is released after the file reader is released. 2. Add the modification time to the file meta cache in BE, to avoid parquet read errors like: `Failed to deserialize parquet page header`	2023-08-31 18:23:05 +08:00
Mingyu Chen	40be6a0b05	[fix](hive) do not split compress data file and support lz4/snappy block codec (#23245 ) 1. Do not split compressed data files. Some data files in hive are compressed with gzip, deflate, etc. These kinds of files cannot be split. 2. Support the lz4 block codec for the hive scan node; use the lz4 block codec instead of the lz4 frame codec. 3. Support the snappy block codec for hadoop snappy. 4. Optimize the `count()` query on csv files. For a query like `select count() from tbl`, only the lines need to be split, not the columns. Need to pick to branch-2.0 after this PR: #22304	2023-08-26 12:59:05 +08:00
slothever	f66f161017	[fix](multi-catalog)fix hive table with cosn location issue (#23409 ) Sometimes the partitions of a hive table are on different storage systems, e.g., some on HDFS and others on object storage (cos, etc.). This PR mainly changes: 1. Fix the bug of accessing files via cosn. 2. Add a new field `fs_name` in TFileRangeDesc. This is because, when accessing a file, the BE gets an hdfs client from the hdfs client cache, and different files in one query request may have different fs names, e.g., some are `hdfs://` and some are `cosn://`. So we need to specify the fs name for each file; otherwise it may return an error: `reason: IllegalArgumentException: Wrong FS: cosn://doris-build-1308700295/xxxx, expected: hdfs://172.xxxx:4007`	2023-08-26 00:16:00 +08:00
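
Commit 3bce6d3828 above describes filtering a block by dict code first and only then by the not_single_conjunct. The Python sketch below illustrates that ordering on toy data. It is not the Doris reader code: the function names, the list-based columns, and the sample values are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def filter_by_dict_code(
    dictionary: Sequence[str],
    codes: Sequence[int],
    predicate: Callable[[str], bool],
) -> List[bool]:
    """Hypothetical helper: evaluate a string predicate through dict codes.

    The predicate runs once per distinct dictionary entry; each row is then
    checked with an integer set lookup instead of a string comparison.
    """
    matching_codes = {
        code for code, value in enumerate(dictionary) if predicate(value)
    }
    return [code in matching_codes for code in codes]


def filter_rows(
    shipmode_dict: Sequence[str],
    shipmode_codes: Sequence[int],
    commitdate: Sequence[str],
    receiptdate: Sequence[str],
) -> List[int]:
    """Return indexes of rows that pass both filters (illustrative only)."""
    # Step 1: cheap dict-code filter for `l_shipmode in ('MAIL', 'SHIP')`.
    keep = filter_by_dict_code(
        shipmode_dict, shipmode_codes, lambda s: s in ("MAIL", "SHIP")
    )
    # Step 2: the multi-column conjunct `l_commitdate < l_receiptdate`
    # is evaluated only on rows that survived step 1.
    return [
        row
        for row, kept in enumerate(keep)
        if kept and commitdate[row] < receiptdate[row]
    ]
```

With a dictionary of `["AIR", "MAIL", "SHIP"]`, step 1 evaluates the string predicate three times regardless of the row count, which is where the saving described in the commit message would come from.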

295 Commits