doris

Author	SHA1	Message	Date
Mingyu Chen	35f8563a75	[feature](iceberg) support iceberg equality delete (#34223 ) (#34327 ) bp #34223 Co-authored-by: Ashin Gau <AshinGau@users.noreply.github.com>	2024-04-30 11:51:29 +08:00
Qi Chen	7cb00a8e54	[Feature](hive-writer) Implements s3 file committer. (#34307 ) Backport #33937.	2024-04-29 19:56:49 +08:00
daidai	1bfe0f0393	[feature](iceberg) support reading iceberg complex types, iceberg.orc format and position delete. (#33935 ) (#34256 ) master #33935	2024-04-29 14:40:12 +08:00
Mingyu Chen	45556686ea	[fix](test) fix some external test cases (#34209 ) Fix some test cases and enable `test_information_schema_external` suite	2024-04-27 23:25:33 +08:00
Mingyu Chen	50f9d47e96	[test](hive) run suite cases both in hive2 and hive3 (#33874 ) (#34156 ) bp #33874 Co-authored-by: 苏小刚 <suxiaogang223@icloud.com>	2024-04-26 13:48:09 +08:00
Mingyu Chen	0e3ad5cd9d	[fix](parquet) fix time zone error(isAdjustedToUTC=true) in parquet reader (#33675 ) (#33924 ) bp (#33675) Co-authored-by: Ashin Gau <AshinGau@users.noreply.github.com>	2024-04-20 19:06:54 +08:00
Mingyu Chen	4740b22481	[fix](test) fix some p2 external table test cases (#33624 ) bp #33621 Also fix a merge bug from #33245	2024-04-17 23:42:12 +08:00
Ashin Gau	9b7af4c0cf	[feature](schema change) unified schema change for parquet and orc reader (#32873 ) Following #25138, unified schema change interface for parquet and orc reader, and can be applied to other format readers as well. Unified schema change interface for all format readers: - First, read the data according to the column type of the file into source column; - Second, convert source column to the destination column with type planned by FE.	2024-04-12 15:09:25 +08:00
Ashin Gau	29556f758e	[fix](parquet) fix time zone error in parquet reader (#33217 ) `isAdjustedToUTC` was interpreted in exactly the opposite way in the parquet reader (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md), so times with `isAdjustedToUTC=true` were increased by eight hours (UTC+8). Parquet files with `isAdjustedToUTC=true` can be produced by spark-sql with the following configuration: ``` --conf spark.sql.session.timeZone=UTC --conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS ``` However, with the following configuration, there is no logical or converted type in the parquet metadata, so the time read by doris will also be increased by eight hours (UTC+8). Users need to set the UTC time zone themselves in doris (https://doris.apache.org/docs/dev/advanced/time-zone/) ``` --conf spark.sql.session.timeZone=UTC --conf spark.sql.parquet.outputTimestampType=INT96 ```	2024-04-07 23:24:22 +08:00
Mingyu Chen	d9d950d98e	[fix](iceberg) fix iceberg predicate conversion bug (#33283 ) Followup #32923 Some cases are not covered in #32923	2024-04-07 22:12:38 +08:00
wuwenchi	190763e301	[bugfix](iceberg)Convert the datetime type in the predicate according to the target column (#32923 ) Convert the datetime type in the predicate according to the target column. And add a testcase for #32194 related #30478 #30162	2024-04-07 22:12:33 +08:00
Mingyu Chen	71e16e6f35	[fix](iceberg) fix iceberg catalog bug and p2 test cases (#32898 ) 1. Fix iceberg catalog bug PR #30198 changed the logic of `IcebergHMSExternalCatalog.java` to get locationUrl by calling hive metastore's `getCatalog()` method. But this method only exists in hive 3+, so it fails when using hive 2.x. I temporarily removed this logic, because it is only used for iceberg table writing, which is still under development. We will rethink this logic later. 2. Fix test cases Some P2 test cases missed `order_qt`. And because the output format of the floating point type has changed, some results in `out` files need to be regenerated.	2024-03-27 20:44:38 +08:00
Mingyu Chen	c0d7a5660e	[fix](paimon) support paimon with hive2 (#32455 ) In order to support paimon with hive2, we need to modify the original HiveMetastoreClient.java to make it compatible with both hive2 and hive3. This modified HiveMetastoreClient should be at the front of the CLASSPATH, so that it can override the HiveMetastoreClient in the hadoop jar. This PR mainly changes: 1. Copy HiveMetastoreClient.java in FE to BE's preload jar. 2. Split the original `preload-extensions-jar-with-dependencies.jar` into 2 jars 1. `preload-extensions-project.jar`, which contains the modified HiveMetastoreClient. 2. `preload-extensions-jar-with-dependencies.jar`, which contains other dependency jars. 3. Modify `start_be.sh` to let `preload-extensions-project.jar` be loaded first. 4. Change the way the jni scanner jar is assembled Only the project jar needs to be assembled, without other dependencies, because we actually only use classes under the `org.apache.doris` package. Removing the other unused dependency jars also reduces the output size of BE. 5. Fix the bug that the prefix of paimon properties should be `paimon.`, not `paimon` 6. Support paimon with hive2 Users can set `hive.version` in paimon catalog properties to specify the hive version.	2024-03-26 15:31:07 +08:00
Ashin Gau	ec43f65235	[feature](hudi) support hudi incremental read (#32052 ) * [feature](hudi) support incremental read for hudi table * fix jdk17 java options	2024-03-26 15:31:07 +08:00
Ashin Gau	260568db17	[update](hudi) update hudi version to 0.14.1 and compatible with flink hive catalog (#31181 ) 1. Update hudi version from 0.13.1 to 0.14.1 2. Compatible with the hudi table created by flink hive catalog	2024-02-22 19:51:20 +08:00
wuwenchi	4648902350	[bugfix](iceberg)fix read NULL with date partition (#30478 ) * fix date * fix date * add case	2024-01-30 15:32:43 +08:00
wuwenchi	8308bc96b9	[fix](paimon)set timestamp's scale for parquet which has no logical type (#30119 )	2024-01-23 13:22:14 +08:00
wuwenchi	44ba9e102c	[feature](statistics)support statistics for iceberg/paimon/hudi table (#29868 )	2024-01-18 12:03:07 +08:00
wuwenchi	74991c4af2	[bugfix](paimon)support native and jni to read paimon for minio/cos #29933	2024-01-16 18:49:01 +08:00
Ashin Gau	96d4778f2e	[fix](parquet) the end offset of column chunk may be wrong in parquet metadata (#28891 )	2023-12-23 22:21:04 +08:00
Ashin Gau	c72ad9b673	[fix](regression) fix regression error of test_compress_type (#28826 )	2023-12-22 12:08:23 +08:00
Mingyu Chen	5d8c465644	[regression](p2) fix test cases result (#28768 ) regression-test/data/external_table_p2/hive/test_hive_hudi.out regression-test/data/external_table_p2/hive/test_hive_to_array.out regression-test/suites/external_table_p2/tvf/test_local_tvf_compression.groovy regression-test/suites/external_table_p2/tvf/test_path_partition_keys.groovy regression-test/data/external_table_p2/hive/test_hive_text_complex_type.out	2023-12-21 14:38:30 +08:00
Qi Chen	eb99e4270d	[Fix](parquet_reader) Fix dict filtering doesn't work with plain dict encoding in parquet reader. (#28290 )	2023-12-15 09:27:02 +08:00
daidai	80d2c7ab41	[feature](parquet)support read parquet lzo compress. (#27706 )	2023-12-03 09:55:52 +08:00
slothever	1706699e7e	[fix](multi-catalog)support the max compute partition prune (#27154 ) 1. max compute partition prune: previously we only supported filtering mc partitions by '=', which can filter just one partition; to support multiple-partition filters and range operators ('>', '<', '>=', ...), partition pruning should be supported. 2. add max compute row count cache and partitionValues cache 3. add max compute regression case	2023-12-01 22:28:26 +08:00
daidai	ce271ff382	[fix](parquet)fix can not read parquet lz4 compress. (#27383 ) Fixed the problem of not being able to read parquet lz4 compressed format. By default, it is decompressed according to the Hadoop lz4 format. If it fails, it will fall back to the standard lz4 compression format.	2023-11-29 19:04:53 +08:00
slothever	add6bdb240	[fix](multi-catalog)add the max compute fe ut and fix download expired (#27007 ) 1. add the max compute fe ut and fix download expired 2. solve memory leak when allocator close 3. add correct partition rows	2023-11-20 10:42:07 +08:00
Ashin Gau	52995c528e	[fix](iceberg) iceberg uses custom method to encode special characters of field name (#27108 ) Fix two bugs: 1. Missing column is case sensitive, change the column name to lower case in FE for hive/iceberg/hudi 2. Iceberg uses a custom method to encode special characters in column names. Decode the column name to match the right column in parquet reader.	2023-11-17 18:38:55 +08:00
Qi Chen	a0661ed9d2	[Fix](multi-catalog) Fix complex type crash when using dict filter facility in the parquet-reader. (#27151 ) - Fix complex type crash when using the dict filter facility in the parquet-reader by turning off the dict filter facility in this case. - Add orc complex types regression test.	2023-11-17 13:43:58 +08:00
Ashin Gau	ec40603b93	[fix](parquet) compressed_page_size has the same meaning in page v1 and v2 (#26783 ) 1. Parquet with page v2 is parsed incorrectly when using codecs other than snappy. Because `compressed_page_size` has the same meaning in page v1 and v2, it always contains the bytes of definition level, repetition level and compressed data. 2. Add regression test for `fix_length_byte_array` stored decimal type, and dictionary encoded date/datetime type.	2023-11-14 08:30:42 +08:00
Tiewei Fang	57ed781bb6	[fix](regression-test) Add tvf regression tests (#26455 )	2023-11-09 12:09:32 +08:00
daidai	a4e415ab09	[feature](hive)Support hive tables after alter type. (#25138 ) 1.Reconstruct the logic of decode to read parquet. The parquet reader first reads the data according to the parquet physical type, and then performs a type conversion. 2.Support hive alter table.	2023-11-02 00:24:21 +08:00
wuwenchi	b98744ae90	[Bug](iceberg)fix read partitioned iceberg without partition path (#25503 ) Iceberg does not require partition values to exist on file paths, so we should get the partition value from `PartitionScanTask.partition`.	2023-10-31 18:09:53 +08:00
wuwenchi	9633d0a83b	[case](iceberg)add test case (#26107 )	2023-10-31 17:23:22 +08:00
Jibing-Li	8a8ae44eee	[Fix](regression)Fix statistics related regression test (#25888 )	2023-10-25 05:59:13 -05:00
slothever	40e430ca55	[regression](multi-catalog) add aliyun dlf hive on oss and huawei obs test case (#25650 ) Add aliyun dlf hive on oss and huawei obs test cases. The obs cases currently have some problems; these are not fixed in this PR, only commented.	2023-10-24 20:52:50 +08:00
slothever	18c2a13e09	[fix](multi-catalog)fix maxcompute partition filter and session creation (#24911 ) add maxcompute partition support; fix maxcompute partition filter; modify maxcompute session create method	2023-10-17 22:36:10 +08:00
Ashin Gau	26818de9c8	[feature](jni) support complex types in jni framework (#24810 ) Support complex types in jni framework, and successfully run end-to-end on hudi. ### How to Use Other scanners only need to implement three interfaces in `ColumnValue`: ``` // Get array elements and append into values void unpackArray(List<ColumnValue> values); // Get map key array&value array, and append into keys&values void unpackMap(List<ColumnValue> keys, List<ColumnValue> values); // Get the struct fields specified by `structFieldIndex`, and append into values void unpackStruct(List<Integer> structFieldIndex, List<ColumnValue> values); ``` Developers can take `HudiColumnValue` as an example.	2023-09-27 14:47:41 +08:00
Jibing-Li	b4432ce577	[Feature](statistics)Support external table analyze partition (#24154 ) Enable collect partition level stats for hive external table.	2023-09-18 14:59:26 +08:00
Jibing-Li	f3e350e8ec	[Improvement](statistics)Improve statistics user experience (#24414 ) Two improvements: 1. Move the `Job_id` column for the return info of `Analyze table` command to the first column. To keep consistent with `show analyze`. ``` mysql> analyze table hive.tpch100.region; +--------+--------------+-------------------------+------------+--------------------------------+ \| Job_Id \| Catalog_Name \| DB_Name \| Table_Name \| Columns \| +--------+--------------+-------------------------+------------+--------------------------------+ \| 14403 \| hive \| default_cluster:tpch100 \| region \| [r_regionkey,r_comment,r_name] \| +--------+--------------+-------------------------+------------+--------------------------------+ 1 row in set (0.03 sec) ``` 2. Add `analyze_timeout` session variable, to control `analyze table/database with sync` timeout.	2023-09-18 13:36:41 +08:00
Mingyu Chen	4dad7c94da	[fix](orc) fix the count() pushdown issue in orc format (#24446 ) Previously, when querying a hive table in orc format whose file is split, the result of select count() may be a multiple of the real row number. This is because the number of rows should be obtained after orc stripe pruning; otherwise, it may return a wrong result	2023-09-16 09:57:39 +08:00
Tiewei Fang	c5ef6cfea2	[fix](Table-Valued Function) fix be core when user specified empty `column_separator` using hdfs tvf (#24369 )	2023-09-14 23:19:48 +08:00
daidai	e30c3f3a65	[fix](csv_reader) fix bug that reading garbled files caused be crash. (#24164 )	2023-09-13 14:12:55 +08:00
daidai	ebe3749996	[fix](tvf)support s3,local compress_type and append regression test (#24055 ) support s3,local compress_type and append regression test.	2023-09-13 00:32:59 +08:00
Qi Chen	9df72a96f3	[Feature](multi-catalog) Support hadoop viewfs. (#24168 ) ### Feature Support hadoop viewfs. ### Test - Regression tests: - hive viewfs test. - tvf viewfs test. - Broker load with broker and with hdfs tests manually.	2023-09-13 00:20:12 +08:00
Ashin Gau	6e28d878b5	[fix](hudi) compatible with hudi spark configuration and support skip merge (#24067 ) Fix three bugs: 1. A hudi slice may have log files only, so `new Path(filePath)` will throw errors. 2. Hive column names are lowercase only, so match column names in ignore-case-mode. 3. Compatible with [Spark Datasource Configs](https://hudi.apache.org/docs/configurations/#Read-Options), so users can add `hoodie.datasource.merge.type=skip_merge` in catalog properties to skip merging log files.	2023-09-11 19:54:59 +08:00
daidai	f9a75b5c4f	[feature](csv_serde)1.append csv serde for serialize to csv and deserialize from csv. 2.let csvReader use csv serde not text_converter. (#23352 ) 1. append csv serde for serialize to csv and deserialize from csv. 2. let csvReader use csv serde not text_converter.	2023-09-10 00:16:21 +08:00
zhangguoqiang	2cb7536c6c	[fix](regression)fix case test_external_catalog_es (#23908 )	2023-09-06 12:12:43 +08:00
Ashin Gau	eaf2a6a80e	[fix](date) return right date value even if out of the range of date dictionary(#23664 ) PR(https://github.com/apache/doris/pull/22360) and PR(https://github.com/apache/doris/pull/22384) optimized the performance of date type. However, hive supports dates outside 1970~2038, leading to wrong date values in the tpcds benchmark. How to fix: 1. Increase dictionary range: 1900 ~ 2038 2. Dates outside 1900 ~ 2038 are regenerated.	2023-09-01 14:40:20 +08:00
Mingyu Chen	40be6a0b05	[fix](hive) do not split compress data file and support lz4/snappy block codec (#23245 ) 1. do not split compress data file Some data files in hive are compressed with gzip, deflate, etc. These kinds of files cannot be split. 2. Support lz4 block codec for hive scan node, use lz4 block codec instead of lz4 frame codec 3. Support snappy block codec For hadoop snappy 4. Optimize the `count()` query of csv file For query like `select count() from tbl`, only need to split the line, no need to split the column. Need to pick to branch-2.0 after this PR: #22304	2023-08-26 12:59:05 +08:00

67 Commits