Commit Graph

86 Commits

Author SHA1 Message Date
1b783aaa7f [fix](p2)Fix analyze hive partition column p2 case after row count change. #31958 2024-03-09 19:45:03 +08:00
ad3308c8ab [fix](hive) support partition prune for _HIVE_DEFAULT_PARTITION_ (#31736)
This PR #23026 added support for partition prune on hive tables with `_HIVE_DEFAULT_PARTITION_`,
but it would always select the partition with `_HIVE_DEFAULT_PARTITION_`.

PR #31613 added null partition support for olap tables' list partitions, so we can treat `_HIVE_DEFAULT_PARTITION_`
as the null partition of a hive table.

So this PR changes the partition prune logic accordingly.
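A minimal sketch of the idea in Java (hypothetical names, not the actual Doris code): normalize the `_HIVE_DEFAULT_PARTITION_` sentinel to a null value before evaluating null-related partition predicates, so the default partition is only selected when the predicate actually matches null.

```
// Hypothetical sketch: map the default-partition sentinel to null before
// evaluating partition predicates, so IS NULL / IS NOT NULL prune correctly.
import java.util.ArrayList;
import java.util.List;

public class DefaultPartitionPrune {
    static final String DEFAULT_PARTITION = "_HIVE_DEFAULT_PARTITION_";

    // Keep partitions matching "col IS NULL" (isNull) or "col IS NOT NULL".
    static List<String> prune(List<String> partitionValues, boolean isNull) {
        List<String> selected = new ArrayList<>();
        for (String value : partitionValues) {
            String normalized = DEFAULT_PARTITION.equals(value) ? null : value;
            if ((normalized == null) == isNull) {
                selected.add(normalized);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> parts = List.of("beijing", DEFAULT_PARTITION, "shanghai");
        System.out.println(prune(parts, true));  // [null]
        System.out.println(prune(parts, false)); // [beijing, shanghai]
    }
}
```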
2024-03-06 13:07:49 +08:00
32033d08c6 Fix hive p2 cases. (#31541) 2024-02-29 12:37:38 +08:00
260568db17 [update](hudi) update hudi version to 0.14.1 and compatible with flink hive catalog (#31181)
1. Update hudi version from 0.13.1 to 0.14.1
2. Make it compatible with hudi tables created by the flink hive catalog
2024-02-22 19:51:20 +08:00
87b5ed187e Fix hive p2 case (#31149) 2024-02-21 13:53:18 +08:00
ae809cd900 Fix hive p2 case. (#31072) 2024-02-19 17:20:21 +08:00
3da168afc9 Fix hive sample case ndv value. (#31043) 2024-02-19 17:20:21 +08:00
9e76592297 Support analyze materialized view. (#30540) 2024-02-04 22:21:16 +08:00
1548813a17 [fix](test) fix case with same catalog name (#30585) 2024-02-01 19:00:50 +08:00
7d037c12bf [bugfix](paimon)fix paimon testcases (#30514)
1. set default timezone
2. do not push down the unsupported `char` type
2024-01-31 23:53:39 +08:00
4648902350 [bugfix](iceberg)fix read NULL with date partition (#30478)
* fix date

* add case
2024-01-30 15:32:43 +08:00
8308bc96b9 [fix](paimon)set timestamp's scale for parquet which has no logical type (#30119) 2024-01-23 13:22:14 +08:00
44ba9e102c [feature](statistics)support statistics for iceberg/paimon/hudi table (#29868) 2024-01-18 12:03:07 +08:00
74991c4af2 [bugfix](paimon)support native and jni to read paimon for minio/cos #29933 2024-01-16 18:49:01 +08:00
40badbf5c5 Fix analyze empty external NPE bug. (#29675) 2024-01-12 11:41:21 +08:00
96d4778f2e [fix](parquet) the end offset of column chunk may be wrong in parquet metadata (#28891) 2023-12-23 22:21:04 +08:00
c72ad9b673 [fix](regression) fix regression error of test_compress_type (#28826) 2023-12-22 12:08:23 +08:00
5d8c465644 [regression](p2) fix test cases result (#28768)
regression-test/data/external_table_p2/hive/test_hive_hudi.out
regression-test/data/external_table_p2/hive/test_hive_to_array.out
regression-test/suites/external_table_p2/tvf/test_local_tvf_compression.groovy
regression-test/suites/external_table_p2/tvf/test_path_partition_keys.groovy
regression-test/data/external_table_p2/hive/test_hive_text_complex_type.out
2023-12-21 14:38:30 +08:00
64ebdb2777 [fix](regression)Change analyze_timeout to global. (#28587)
Fix hive statistics regression case. analyze_timeout is a global session variable.
2023-12-19 15:52:38 +08:00
80d2c7ab41 [feature](parquet)support read parquet lzo compress. (#27706) 2023-12-03 09:55:52 +08:00
1706699e7e [fix](multi-catalog)support the max compute partition prune (#27154)
1. max compute partition prune:
we previously only supported filtering mc partitions by '=', which selects just one partition.
To support multiple partition filters and range operators ('>', '<', '>=', ...), partition prune needs to be implemented.

2. add max compute row count cache and partitionValues cache

3. add max compute regression case
2023-12-01 22:28:26 +08:00
ce271ff382 [fix](parquet)fix can not read parquet lz4 compress. (#27383)
Fixed the problem of not being able to read the parquet lz4 compressed format. By default, data is decompressed according to the Hadoop lz4 framing; if that fails, the reader falls back to the standard lz4 block format.
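A rough Java sketch of the fallback, using the lz4-java library; the Hadoop framing layout assumed here (4-byte big-endian size prefixes) is an approximation, and the real fix lives in the BE's native parquet reader.

```
// Sketch only: try Hadoop-framed lz4 first, then fall back to raw lz4.
import java.nio.ByteBuffer;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4SafeDecompressor;

public class Lz4Fallback {
    static final LZ4SafeDecompressor LZ4 =
            LZ4Factory.fastestInstance().safeDecompressor();

    static byte[] decompress(byte[] src, int uncompressedSize) {
        try {
            return decompressHadoopFramed(src, uncompressedSize);
        } catch (RuntimeException e) {
            // Fall back to the standard lz4 block format.
            byte[] dest = new byte[uncompressedSize];
            LZ4.decompress(src, 0, src.length, dest, 0);
            return dest;
        }
    }

    static byte[] decompressHadoopFramed(byte[] src, int uncompressedSize) {
        ByteBuffer in = ByteBuffer.wrap(src);
        byte[] dest = new byte[uncompressedSize];
        int destOff = 0;
        while (in.remaining() >= 8) {
            in.getInt(); // per-block uncompressed size, unused in this sketch
            int chunkLen = in.getInt(); // compressed chunk size
            if (chunkLen <= 0 || chunkLen > in.remaining()) {
                throw new IllegalStateException("not hadoop-framed lz4");
            }
            destOff += LZ4.decompress(src, in.position(), chunkLen, dest, destOff);
            in.position(in.position() + chunkLen);
        }
        if (destOff != uncompressedSize) {
            throw new IllegalStateException("decompressed size mismatch");
        }
        return dest;
    }
}
```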
2023-11-29 19:04:53 +08:00
39a5229027 [fix](regression)Fix hive p2 case (#27466) 2023-11-23 23:32:21 +08:00
add6bdb240 [fix](multi-catalog)add the max compute fe ut and fix download expired (#27007)
1. add the max compute FE UT and fix the expired download issue
2. fix a memory leak when the allocator closes
3. add correct partition row counts
2023-11-20 10:42:07 +08:00
52995c528e [fix](iceberg) iceberg use customer method to encode special characters of field name (#27108)
Fix two bugs:
1. Missing-column lookup was case sensitive; change column names to lower case in FE for hive/iceberg/hudi.
2. Iceberg uses a custom method to encode special characters in column names. Decode the column name to match the right column in the parquet reader.
2023-11-17 18:38:55 +08:00
ec92ba4af1 [fix](statistics)Fix alter column stats bug (#27093)
Encode the min and max values with a base64 encoder while injecting the column stats.
2023-11-17 15:40:47 +08:00
a0661ed9d2 [Fix](multi-catalog) Fix complex type crash when using dict filter facility in the parquet-reader. (#27151)
- Fix complex type crash when using the dict filter facility in the parquet-reader by turning off the dict filter facility in this case.
- Add orc complex types regression test.
2023-11-17 13:43:58 +08:00
ec40603b93 [fix](parquet) compressed_page_size has the same meaning in page v1 and v2 (#26783)
1. Parquet with page v2 was parsed incorrectly when using any codec other than snappy, because `compressed_page_size` has the same meaning in page v1 and v2: it always covers the bytes of the definition levels, repetition levels and compressed data (a small sketch of the arithmetic follows).
2. Add regression test for `fix_length_byte_array` stored decimal type, and dictionary encoded date/datetime type.
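A small sketch of the v2 size bookkeeping; the field names follow the parquet-format thrift definitions, and the numbers are made up.

```
// compressed_page_size covers rep levels + def levels + compressed values
// in both v1 and v2, but in v2 the level sections are stored uncompressed.
public class PageV2Sizes {
    public static void main(String[] args) {
        int compressedPageSize = 4096;        // from the page header
        int repetitionLevelsByteLength = 64;  // DataPageHeaderV2 field
        int definitionLevelsByteLength = 128; // DataPageHeaderV2 field

        // Only this tail of the page goes through the codec.
        int compressedValuesSize = compressedPageSize
                - repetitionLevelsByteLength
                - definitionLevelsByteLength;
        System.out.println("bytes to decompress: " + compressedValuesSize);
    }
}
```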
2023-11-14 08:30:42 +08:00
57ed781bb6 [fix](regression-test) Add tvf regression tests (#26455) 2023-11-09 12:09:32 +08:00
a4e415ab09 [feature](hive)Support hive tables after alter type. (#25138)
1. Reconstruct the decode logic for reading parquet. The parquet reader first reads the data according to the parquet physical type, and then performs a type conversion (see the sketch below).

2. Support hive alter table.
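A toy sketch of the two-step decode; the class and method names are invented for illustration, not the actual reader code.

```
// Illustrative two-step decode: read by the file's physical type,
// then convert to the logical type the (possibly altered) table declares.
public class AlterTypeRead {
    enum LogicalType { INT, BIGINT, STRING }

    // Step 1: the file stores INT32, so decode a little-endian Java int.
    static int readPhysicalInt32(byte[] page, int offset) {
        return (page[offset] & 0xFF)
                | (page[offset + 1] & 0xFF) << 8
                | (page[offset + 2] & 0xFF) << 16
                | (page[offset + 3] & 0xFF) << 24;
    }

    // Step 2: convert to whatever the table schema now declares.
    static Object convert(int physicalValue, LogicalType target) {
        switch (target) {
            case INT:    return physicalValue;
            case BIGINT: return (long) physicalValue;
            case STRING: return Integer.toString(physicalValue);
            default:     throw new IllegalArgumentException("unsupported " + target);
        }
    }

    public static void main(String[] args) {
        byte[] page = {42, 0, 0, 0};
        // The column was altered from int to string; old files still decode.
        System.out.println(convert(readPhysicalInt32(page, 0), LogicalType.STRING));
    }
}
```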
2023-11-02 00:24:21 +08:00
b98744ae90 [Bug](iceberg)fix read partitioned iceberg without partition path (#25503)
Iceberg does not require partition values to exist on file paths, so we should get the partition value from `PartitionScanTask.partition`.
2023-10-31 18:09:53 +08:00
9633d0a83b [case](iceberg)add test case (#26107) 2023-10-31 17:23:22 +08:00
78204f7c92 [Fix](statistics)Fix external couldn't analyze database bug (#26025) 2023-10-31 11:32:47 +08:00
8a8ae44eee [Fix](regression)Fix statistics related regression test (#25888) 2023-10-25 05:59:13 -05:00
40e430ca55 [regression](multi-catalog) add aliyun dlf hive on oss and huawei obs test case (#25650)
add aliyun dlf hive on oss and huawei obs test cases
The obs cases currently have some problems that will not be fixed in this PR; a comment was added instead.
2023-10-24 20:52:50 +08:00
4d2e7d7c86 [improvement](statistics)Set min max to NULL when collect stats with sample (#25593)
1. To avoid misleading users with inaccurate min/max stats, set those values to NULL when using sampling to collect stats.
2. Fix an NDV_SAMPLE_TEMPLATE typo; it shouldn't contain row-count related content.
2023-10-19 18:00:55 +08:00
7cfb1d9b0e [Regression case](statistics) Add regression test case for fetching HMSExternalTable through hms. (#25548)
Regression case for fetching HMSExternalTable statistics through HMS when the table is not analyzed.
2023-10-18 09:57:58 +08:00
26e332c608 [fix](multi-catalog)add exception for unsupported hive input format (#25490)
add exception for unsupported hive input format
2023-10-17 22:53:53 +08:00
18c2a13e09 [fix](multi-catalog)fix maxcompute partition filter and session creation (#24911)
add maxcompute partition support
fix maxcompute partition filter
modify the maxcompute session creation method
2023-10-17 22:36:10 +08:00
1130317b91 [Improvement](statistics)Collect stats for hive partition column using metadata (#24853)
Hive partition columns' stats can be calculated from hive metastore data, so there is no need to execute SQL to get them.
This PR uses hive partition metadata to collect partition column stats.
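A simplified sketch of deriving stats from partition values alone (illustrative code, not the Doris implementation): the distinct partition values directly give NDV and min/max without scanning any data.

```
// Sketch: derive stats for a hive partition column purely from the
// partition values listed by the metastore; no SQL scan is needed.
import java.util.List;
import java.util.TreeSet;

public class PartitionColumnStats {
    public static void main(String[] args) {
        // One value per partition, as returned by the metastore listing.
        List<String> partitionValues =
                List.of("2023-10-01", "2023-10-02", "2023-10-03");

        TreeSet<String> distinct = new TreeSet<>(partitionValues);
        long ndv = distinct.size(); // each distinct value is one partition
        String min = distinct.first();
        String max = distinct.last();

        System.out.printf("ndv=%d min=%s max=%s%n", ndv, min, max);
    }
}
```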
2023-10-17 10:31:57 +08:00
c63bf24c84 [Improvement](statistics) Improve sample count accuracy (#25175)
While doing sample analyze, the row count, null count and data size results need to be multiplied by a coefficient based on
the sample percent/rows. This PR mainly calculates the coefficient according to the ratio of sampled file size to total size.
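A back-of-the-envelope sketch of the scaling, with made-up numbers:

```
// Sampled counts are scaled by the ratio of sampled file size to total size.
public class SampleScale {
    public static void main(String[] args) {
        long totalFileSize = 10L * 1024 * 1024 * 1024;  // 10 GiB in all files
        long sampledFileSize = 1L * 1024 * 1024 * 1024; // 1 GiB actually read

        double sampleRatio = (double) sampledFileSize / totalFileSize;

        long sampledRows = 1_200_000;
        long sampledNulls = 30_000;

        // Scale the sampled counts up to full-table estimates.
        System.out.println("rows  ~ " + Math.round(sampledRows / sampleRatio));
        System.out.println("nulls ~ " + Math.round(sampledNulls / sampleRatio));
    }
}
```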
2023-10-12 14:42:02 +08:00
9a4baf7ccf [fix](Nereids)Fix the bug that count(*) does not push down for tables with only one column. (#25222)
Follow-up to PR #22115.

Fixed the bug that when selecting count(*) from a table that has only one column, the aggregate count was not pushed down.
2023-10-11 23:17:30 +08:00
727fa2c0cd [opt](tvf) refine the class of ExternalFileTableValuedFunction (#24706)
`ExternalFileTableValuedFunction` now has 3 derived classes:

- LocalTableValuedFunction
- HdfsTableValuedFunction
- S3TableValuedFunction

All these tvfs read data from files; the difference is where the file is read from, e.g., HDFS or the local filesystem.

So I refined the fields and methods of these classes.
Now there are 3 kinds of properties for these tvfs:

1. File format properties

	File format properties, such as `format` and `column_separator`, are common to all these tvfs.
	So these properties should be analyzed in the parent class `ExternalFileTableValuedFunction`.
	
2. URI or file path

	The URI or file path property indicates the file location. The format of the URI differs between storages,
	so it should be analyzed in each derived class.
	
3. Other properties

	All other properties are specific to a certain tvf,
	so they should be analyzed in each derived class.
	
There are 2 new classes:

- `FileFormatConstants`: Define some common property names or variables related to file format.
- `FileFormatUtils`: Define some util methods related to file format.

After this PR, if we want to add a common property for all these tvfs, we only need to handle it in
`ExternalFileTableValuedFunction`, avoiding the risk of missing it in any one of them (see the sketch after the behavior changes).

### Behavior change

1. Remove the `fs.defaultFS` property in `hdfs()`; it can be derived from `uri`
2. Use `\t` as the default column separator of csv format, same as stream load
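A condensed sketch of the resulting shape. The class names and the `format`, `column_separator` and `uri` keys come from the PR; the `file_path` key and the method names are illustrative only.

```
// Condensed sketch of the refactored hierarchy.
import java.util.Map;

abstract class ExternalFileTableValuedFunction {
    protected String format;
    protected String columnSeparator;

    // Common file-format properties are analyzed once, in the parent.
    protected void parseCommonProperties(Map<String, String> props) {
        format = props.getOrDefault("format", "csv");
        columnSeparator = props.getOrDefault("column_separator", "\t"); // new default
    }

    // The location property differs per storage, so each subclass owns it.
    protected abstract void parseLocationProperties(Map<String, String> props);
}

class HdfsTableValuedFunction extends ExternalFileTableValuedFunction {
    private String uri;

    @Override
    protected void parseLocationProperties(Map<String, String> props) {
        uri = props.get("uri"); // fs.defaultFS is gone; it is derived from uri
    }
}

class LocalTableValuedFunction extends ExternalFileTableValuedFunction {
    private String filePath;

    @Override
    protected void parseLocationProperties(Map<String, String> props) {
        filePath = props.get("file_path");
    }
}
```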
2023-10-07 12:44:04 +08:00
26818de9c8 [feature](jni) support complex types in jni framework (#24810)
Support complex types in the jni framework; it runs successfully end-to-end on hudi.
### How to Use
Other scanners only need to implement three `ColumnValue` methods:
```
// Get array elements and append into values
void unpackArray(List<ColumnValue> values);

// Get map key array&value array, and append into keys&values
void unpackMap(List<ColumnValue> keys, List<ColumnValue> values);

// Get the struct fields specified by `structFieldIndex`, and append into values
void unpackStruct(List<Integer> structFieldIndex, List<ColumnValue> values);
```
Developers can take `HudiColumnValue` as an example.
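A stripped-down illustration of what such an implementation might look like; `ToyColumnValue` and its backing representation are invented for this example, loosely modeled on the `HudiColumnValue` idea.

```
// Toy illustration: a value backed by plain Java collections that knows
// how to unpack itself into child values, mirroring the three methods above.
import java.util.List;
import java.util.Map;

class ToyColumnValue {
    private final Object inner; // a List (array/struct), a Map, or a scalar

    ToyColumnValue(Object inner) {
        this.inner = inner;
    }

    // Append one wrapper per array element.
    void unpackArray(List<ToyColumnValue> values) {
        for (Object element : (List<?>) inner) {
            values.add(new ToyColumnValue(element));
        }
    }

    // Append key and value wrappers pairwise, in iteration order.
    void unpackMap(List<ToyColumnValue> keys, List<ToyColumnValue> values) {
        for (Map.Entry<?, ?> e : ((Map<?, ?>) inner).entrySet()) {
            keys.add(new ToyColumnValue(e.getKey()));
            values.add(new ToyColumnValue(e.getValue()));
        }
    }

    // Append only the struct fields the reader asked for, by field index.
    void unpackStruct(List<Integer> structFieldIndex, List<ToyColumnValue> values) {
        List<?> fields = (List<?>) inner;
        for (int idx : structFieldIndex) {
            values.add(new ToyColumnValue(fields.get(idx)));
        }
    }
}
```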
2023-09-27 14:47:41 +08:00
ef72321878 [fix](regression-text)fix test_path_partition_keys regression test (#24796)
fix test_path_partition_keys regression test
2023-09-24 23:32:33 +08:00
80bcb43143 [Feature]Support external table sample stats collection (#24376)
Support hive table sample stats collection. Grammar is like

`analyze table with sample percent 10`
2023-09-19 11:20:27 +08:00
b4432ce577 [Feature](statistics)Support external table analyze partition (#24154)
Enable collecting partition-level stats for hive external tables.
2023-09-18 14:59:26 +08:00
f3e350e8ec [Improvement](statistics)Improve statistics user experience (#24414)
Two improvements:
1. Move the `Job_id` column of the `Analyze table` command's return info to the first column, to keep it consistent with `show analyze`.
```
mysql> analyze table hive.tpch100.region;
+--------+--------------+-------------------------+------------+--------------------------------+
| Job_Id | Catalog_Name | DB_Name                 | Table_Name | Columns                        |
+--------+--------------+-------------------------+------------+--------------------------------+
| 14403  | hive         | default_cluster:tpch100 | region     | [r_regionkey,r_comment,r_name] |
+--------+--------------+-------------------------+------------+--------------------------------+
1 row in set (0.03 sec)
```
2. Add `analyze_timeout` session variable, to control `analyze table/database with sync` timeout.
2023-09-18 13:36:41 +08:00
4dad7c94da [fix](orc) fix the count(*) pushdown issue in orc format (#24446)
Previously, when querying a hive table in orc format whose file is split,
the result of select count(*) could be a multiple of the real row count.

This is because the number of rows should be obtained after orc stripe prune;
otherwise, a wrong result may be returned.
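A small sketch of the corrected accounting, with invented stand-in types for the stripe metadata: row counts are summed only over stripes that survive pruning and belong to the current split.

```
// Sketch: sum rows only over stripes that survive predicate pruning
// and belong to the current split, instead of using file-level counts.
import java.util.List;

public class StripeRowCount {
    record Stripe(long offset, long numRows, boolean matchesPredicate) {}

    static long countRows(List<Stripe> stripes, long splitStart, long splitEnd) {
        long rows = 0;
        for (Stripe s : stripes) {
            boolean inThisSplit = s.offset() >= splitStart && s.offset() < splitEnd;
            if (inThisSplit && s.matchesPredicate()) {
                rows += s.numRows();
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Stripe> stripes = List.of(
                new Stripe(0, 1000, true),
                new Stripe(5000, 1000, false), // pruned
                new Stripe(9000, 1000, true));
        // Two splits over one file must not both report the full count.
        System.out.println(countRows(stripes, 0, 6000));     // 1000
        System.out.println(countRows(stripes, 6000, 12000)); // 1000
    }
}
```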
2023-09-16 09:57:39 +08:00
b407f275c8 [fix](hive) fix partition prune issue and some external table test cases (#24338)
1. Fix a hive partition prune bug, introduced by #23845, which failed the `test_hive_default_partition` test case.
2. Fix the `test_local_tvf.groovy` test case; the path of a local tvf should be a relative path.
3. Fix the `test_external_catalog_hive` test case; `partitions` is now a reserved keyword.
4. Support the `local` tvf in Nereids and fix related issues like:

```
Caused by: java.lang.NullPointerException
        at org.apache.doris.nereids.stats.ExpressionEstimation.castMinMax(ExpressionEstimation.java:171) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.nereids.stats.ExpressionEstimation.visitCast(ExpressionEstimation.java:167) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.nereids.stats.ExpressionEstimation.visitCast(ExpressionEstimation.java:109) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.nereids.trees.expressions.Cast.accept(Cast.java:55) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.nereids.stats.ExpressionEstimation.visitAlias(ExpressionEstimation.java:394) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.nereids.stats.ExpressionEstimation.visitAlias(ExpressionEstimation.java:109) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.nereids.trees.expressions.Alias.accept(Alias.java:145) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.nereids.stats.ExpressionEstimation.estimate(ExpressionEstimation.java:119) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.nereids.stats.StatsCalculator.lambda$computeProject$7(StatsCalculator.java:785) ~[doris-fe.jar:1.2-SNAPSHOT]
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_341]
        at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_341]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[?:1.8.0_341]
```
2023-09-15 20:57:04 +08:00