Commit Graph

86 Commits

Author SHA1 Message Date
1b783aaa7f [fix](p2)Fix analyze hive partition column p2 case after row count change. #31958 2024-03-09 19:45:03 +08:00
ad3308c8ab [fix](hive) support partition prune for _HIVE_DEFAULT_PARTITION_ (#31736)
This PR #23026 added support for partition prune on hive tables with `_HIVE_DEFAULT_PARTITION_`,
but it would always select the partition with `_HIVE_DEFAULT_PARTITION_`.

PR #31613 added null partition support for olap tables' list partitions, so we can treat `_HIVE_DEFAULT_PARTITION_`
as the null partition of a hive table.

So this PR changes the partition prune logic accordingly.
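A minimal sketch of the idea in Java (hypothetical names, not the actual Doris code): normalize the `_HIVE_DEFAULT_PARTITION_` sentinel to a null value before evaluating null-related partition predicates, so the default partition is only selected when the predicate actually matches null.

```
// Hypothetical sketch: map the default-partition sentinel to null before
// evaluating partition predicates, so IS NULL / IS NOT NULL prune correctly.
import java.util.ArrayList;
import java.util.List;

public class DefaultPartitionPrune {
    static final String DEFAULT_PARTITION = "_HIVE_DEFAULT_PARTITION_";

    // Keep partitions matching "col IS NULL" (isNull) or "col IS NOT NULL".
    static List<String> prune(List<String> partitionValues, boolean isNull) {
        List<String> selected = new ArrayList<>();
        for (String value : partitionValues) {
            String normalized = DEFAULT_PARTITION.equals(value) ? null : value;
            if ((normalized == null) == isNull) {
                selected.add(normalized);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> parts = List.of("beijing", DEFAULT_PARTITION, "shanghai");
        System.out.println(prune(parts, true));  // [null]
        System.out.println(prune(parts, false)); // [beijing, shanghai]
    }
}
```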
2024-03-06 13:07:49 +08:00
32033d08c6 Fix hive p2 cases. (#31541) 2024-02-29 12:37:38 +08:00
260568db17 [update](hudi) update hudi version to 0.14.1 and compatible with flink hive catalog (#31181)
1. Update hudi version from 0.13.1 to 0.14.1
2. Make it compatible with hudi tables created by the flink hive catalog
2024-02-22 19:51:20 +08:00
87b5ed187e Fix hive p2 case (#31149) 2024-02-21 13:53:18 +08:00
ae809cd900 Fix hive p2 case. (#31072) 2024-02-19 17:20:21 +08:00
3da168afc9 Fix hive sample case ndv value. (#31043) 2024-02-19 17:20:21 +08:00
9e76592297 Support analyze materialized view. (#30540) 2024-02-04 22:21:16 +08:00
1548813a17 [fix](test) fix case with same catalog name (#30585) 2024-02-01 19:00:50 +08:00
7d037c12bf [bugfix](paimon)fix paimon testcases (#30514)
1. set default timezone
2. do not push down the unsupported `char` type
2024-01-31 23:53:39 +08:00
4648902350 [bugfix](iceberg)fix read NULL with date partition (#30478)
* fix date

* add case
2024-01-30 15:32:43 +08:00
8308bc96b9 [fix](paimon)set timestamp's scale for parquet which has no logical type (#30119) 2024-01-23 13:22:14 +08:00
44ba9e102c [feature](statistics)support statistics for iceberg/paimon/hudi table (#29868) 2024-01-18 12:03:07 +08:00
74991c4af2 [bugfix](paimon)support native and jni to read paimon for minio/cos #29933 2024-01-16 18:49:01 +08:00
40badbf5c5 Fix analyze empty external NPE bug. (#29675) 2024-01-12 11:41:21 +08:00
96d4778f2e [fix](parquet) the end offset of column chunk may be wrong in parquet metadata (#28891) 2023-12-23 22:21:04 +08:00
c72ad9b673 [fix](regression) fix regression error of test_compress_type (#28826) 2023-12-22 12:08:23 +08:00
5d8c465644 [regression](p2) fix test cases result (#28768)
regression-test/data/external_table_p2/hive/test_hive_hudi.out
regression-test/data/external_table_p2/hive/test_hive_to_array.out
regression-test/suites/external_table_p2/tvf/test_local_tvf_compression.groovy
regression-test/suites/external_table_p2/tvf/test_path_partition_keys.groovy
regression-test/data/external_table_p2/hive/test_hive_text_complex_type.out
2023-12-21 14:38:30 +08:00
64ebdb2777 [fix](regression)Change analyze_timeout to global. (#28587)
Fix hive statistics regression case. analyze_timeout is a global session variable.
2023-12-19 15:52:38 +08:00
80d2c7ab41 [feature](parquet)support read parquet lzo compress. (#27706) 2023-12-03 09:55:52 +08:00
1706699e7e [fix](multi-catalog)support the max compute partition prune (#27154)
1. max compute partition prune:
we previously only supported filtering mc partitions by '=', which selects just one partition.
To support multiple partition filters and range operators ('>', '<', '>=', ...), partition prune needs to be implemented.

2. add max compute row count cache and partitionValues cache

3. add max compute regression case
2023-12-01 22:28:26 +08:00
ce271ff382 [fix](parquet)fix can not read parquet lz4 compress. (#27383)
Fixed the problem of not being able to read the parquet lz4 compressed format. By default, data is decompressed according to the Hadoop lz4 framing; if that fails, the reader falls back to the standard lz4 block format.
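A rough Java sketch of the fallback, using the lz4-java library; the Hadoop framing layout assumed here (4-byte big-endian size prefixes) is an approximation, and the real fix lives in the BE's native parquet reader.

```
// Sketch only: try Hadoop-framed lz4 first, then fall back to raw lz4.
import java.nio.ByteBuffer;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4SafeDecompressor;

public class Lz4Fallback {
    static final LZ4SafeDecompressor LZ4 =
            LZ4Factory.fastestInstance().safeDecompressor();

    static byte[] decompress(byte[] src, int uncompressedSize) {
        try {
            return decompressHadoopFramed(src, uncompressedSize);
        } catch (RuntimeException e) {
            // Fall back to the standard lz4 block format.
            byte[] dest = new byte[uncompressedSize];
            LZ4.decompress(src, 0, src.length, dest, 0);
            return dest;
        }
    }

    static byte[] decompressHadoopFramed(byte[] src, int uncompressedSize) {
        ByteBuffer in = ByteBuffer.wrap(src);
        byte[] dest = new byte[uncompressedSize];
        int destOff = 0;
        while (in.remaining() >= 8) {
            in.getInt(); // per-block uncompressed size, unused in this sketch
            int chunkLen = in.getInt(); // compressed chunk size
            if (chunkLen <= 0 || chunkLen > in.remaining()) {
                throw new IllegalStateException("not hadoop-framed lz4");
            }
            destOff += LZ4.decompress(src, in.position(), chunkLen, dest, destOff);
            in.position(in.position() + chunkLen);
        }
        if (destOff != uncompressedSize) {
            throw new IllegalStateException("decompressed size mismatch");
        }
        return dest;
    }
}
```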
2023-11-29 19:04:53 +08:00
39a5229027 [fix](regression)Fix hive p2 case (#27466) 2023-11-23 23:32:21 +08:00
add6bdb240 [fix](multi-catalog)add the max compute fe ut and fix download expired (#27007)
1. add the max compute FE UT and fix the expired download issue
2. fix a memory leak when the allocator closes
3. add correct partition row counts
2023-11-20 10:42:07 +08:00
52995c528e [fix](iceberg) iceberg use customer method to encode special characters of field name (#27108)
Fix two bugs:
1. Missing-column lookup was case sensitive; change column names to lower case in FE for hive/iceberg/hudi.
2. Iceberg uses a custom method to encode special characters in column names. Decode the column name to match the right column in the parquet reader.
2023-11-17 18:38:55 +08:00
ec92ba4af1 [fix](statistics)Fix alter column stats bug (#27093)
Encode the min and max values with a base64 encoder while injecting the column stats.
2023-11-17 15:40:47 +08:00
a0661ed9d2 [Fix](multi-catalog) Fix complex type crash when using dict filter facility in the parquet-reader. (#27151)
- Fix complex type crash when using the dict filter facility in the parquet-reader by turning off the dict filter facility in this case.
- Add orc complex types regression test.
2023-11-17 13:43:58 +08:00
ec40603b93 [fix](parquet) compressed_page_size has the same meaning in page v1 and v2 (#26783)
1. Parquet with page v2 was parsed incorrectly when using any codec other than snappy, because `compressed_page_size` has the same meaning in page v1 and v2: it always covers the bytes of the definition levels, repetition levels and compressed data (a small sketch of the arithmetic follows).
2. Add regression test for `fix_length_byte_array` stored decimal type, and dictionary encoded date/datetime type.
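A small sketch of the v2 size bookkeeping; the field names follow the parquet-format thrift definitions, and the numbers are made up.

```
// compressed_page_size covers rep levels + def levels + compressed values
// in both v1 and v2, but in v2 the level sections are stored uncompressed.
public class PageV2Sizes {
    public static void main(String[] args) {
        int compressedPageSize = 4096;        // from the page header
        int repetitionLevelsByteLength = 64;  // DataPageHeaderV2 field
        int definitionLevelsByteLength = 128; // DataPageHeaderV2 field

        // Only this tail of the page goes through the codec.
        int compressedValuesSize = compressedPageSize
                - repetitionLevelsByteLength
                - definitionLevelsByteLength;
        System.out.println("bytes to decompress: " + compressedValuesSize);
    }
}
```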
2023-11-14 08:30:42 +08:00
57ed781bb6 [fix](regression-test) Add tvf regression tests (#26455) 2023-11-09 12:09:32 +08:00
a4e415ab09 [feature](hive)Support hive tables after alter type. (#25138)
1. Reconstruct the decode logic for reading parquet. The parquet reader first reads the data according to the parquet physical type, and then performs a type conversion (see the sketch below).

2. Support hive alter table.
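A toy sketch of the two-step decode; the class and method names are invented for illustration, not the actual reader code.

```
// Illustrative two-step decode: read by the file's physical type,
// then convert to the logical type the (possibly altered) table declares.
public class AlterTypeRead {
    enum LogicalType { INT, BIGINT, STRING }

    // Step 1: the file stores INT32, so decode a little-endian Java int.
    static int readPhysicalInt32(byte[] page, int offset) {
        return (page[offset] & 0xFF)
                | (page[offset + 1] & 0xFF) << 8
                | (page[offset + 2] & 0xFF) << 16
                | (page[offset + 3] & 0xFF) << 24;
    }

    // Step 2: convert to whatever the table schema now declares.
    static Object convert(int physicalValue, LogicalType target) {
        switch (target) {
            case INT:    return physicalValue;
            case BIGINT: return (long) physicalValue;
            case STRING: return Integer.toString(physicalValue);
            default:     throw new IllegalArgumentException("unsupported " + target);
        }
    }

    public static void main(String[] args) {
        byte[] page = {42, 0, 0, 0};
        // The column was altered from int to string; old files still decode.
        System.out.println(convert(readPhysicalInt32(page, 0), LogicalType.STRING));
    }
}
```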
2023-11-02 00:24:21 +08:00
b98744ae90 [Bug](iceberg)fix read partitioned iceberg without partition path (#25503)
Iceberg does not require partition values to exist on file paths, so we should get the partition value from `PartitionScanTask.partition`.
2023-10-31 18:09:53 +08:00
9633d0a83b [case](iceberg)add test case (#26107) 2023-10-31 17:23:22 +08:00
78204f7c92 [Fix](statistics)Fix external couldn't analyze database bug (#26025) 2023-10-31 11:32:47 +08:00
8a8ae44eee [Fix](regression)Fix statistics related regression test (#25888) 2023-10-25 05:59:13 -05:00
40e430ca55 [regression](multi-catalog) add aliyun dlf hive on oss and huawei obs test case (#25650)
add aliyun dlf hive on oss and huawei obs test cases
The obs cases currently have some problems that will not be fixed in this PR; a comment was added instead.
2023-10-24 20:52:50 +08:00
4d2e7d7c86 [improvement](statistics)Set min max to NULL when collect stats with sample (#25593)
1. To avoid misleading users with inaccurate min/max stats, set those values to NULL when using sampling to collect stats.
2. Fix an NDV_SAMPLE_TEMPLATE typo; it shouldn't contain row-count related content.
2023-10-19 18:00:55 +08:00
7cfb1d9b0e [Regression case](statistics) Add regression test case for fetching HMSExternalTable through hms. (#25548)
Regression case for fetching HMSExternalTable statistics through HMS when the table is not analyzed.
2023-10-18 09:57:58 +08:00
26e332c608 [fix](multi-catalog)add exception for unsupported hive input format (#25490)
add exception for unsupported hive input format
2023-10-17 22:53:53 +08:00
18c2a13e09 [fix](multi-catalog)fix maxcompute partition filter and session creation (#24911)
add maxcompute partition support
fix maxcompute partition filter
modify the maxcompute session creation method
2023-10-17 22:36:10 +08:00
1130317b91 [Improvement](statistics)Collect stats for hive partition column using metadata (#24853)
Hive partition columns' stats can be calculated from hive metastore data, so there is no need to execute SQL to get them.
This PR uses hive partition metadata to collect partition column stats.
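A simplified sketch of deriving stats from partition values alone (illustrative code, not the Doris implementation): the distinct partition values directly give NDV and min/max without scanning any data.

```
// Sketch: derive stats for a hive partition column purely from the
// partition values listed by the metastore; no SQL scan is needed.
import java.util.List;
import java.util.TreeSet;

public class PartitionColumnStats {
    public static void main(String[] args) {
        // One value per partition, as returned by the metastore listing.
        List<String> partitionValues =
                List.of("2023-10-01", "2023-10-02", "2023-10-03");

        TreeSet<String> distinct = new TreeSet<>(partitionValues);
        long ndv = distinct.size(); // each distinct value is one partition
        String min = distinct.first();
        String max = distinct.last();

        System.out.printf("ndv=%d min=%s max=%s%n", ndv, min, max);
    }
}
```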
2023-10-17 10:31:57 +08:00
c63bf24c84 [Improvement](statistics) Improve sample count accuracy (#25175)
While doing sample analyze, the row count, null count and data size results need to be multiplied by a coefficient based on
the sample percent/rows. This PR mainly calculates the coefficient according to the ratio of sampled file size to total size.
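A back-of-the-envelope sketch of the scaling, with made-up numbers:

```
// Sampled counts are scaled by the ratio of sampled file size to total size.
public class SampleScale {
    public static void main(String[] args) {
        long totalFileSize = 10L * 1024 * 1024 * 1024;  // 10 GiB in all files
        long sampledFileSize = 1L * 1024 * 1024 * 1024; // 1 GiB actually read

        double sampleRatio = (double) sampledFileSize / totalFileSize;

        long sampledRows = 1_200_000;
        long sampledNulls = 30_000;

        // Scale the sampled counts up to full-table estimates.
        System.out.println("rows  ~ " + Math.round(sampledRows / sampleRatio));
        System.out.println("nulls ~ " + Math.round(sampledNulls / sampleRatio));
    }
}
```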
2023-10-12 14:42:02 +08:00
9a4baf7ccf [fix](Nereids)Fix the bug that count(*) does not push down for tables with only one column. (#25222)
Follow-up to PR #22115.

Fixed the bug that when selecting count(*) from a table that has only one column, the aggregate count was not pushed down.
2023-10-11 23:17:30 +08:00
727fa2c0cd [opt](tvf) refine the class of ExternalFileTableValuedFunction (#24706)
`ExternalFileTableValuedFunction` now has 3 derived classes:

- LocalTableValuedFunction
- HdfsTableValuedFunction
- S3TableValuedFunction

All these tvfs read data from files; the difference is where the file is read from, e.g., HDFS or the local filesystem.

So I refined the fields and methods of these classes.
Now there are 3 kinds of properties for these tvfs:

1. File format properties

	File format properties, such as `format` and `column_separator`, are common to all these tvfs.
	So these properties should be analyzed in the parent class `ExternalFileTableValuedFunction`.
	
2. URI or file path

	The URI or file path property indicates the file location. The format of the URI differs between storages,
	so it should be analyzed in each derived class.
	
3. Other properties

	All other properties are specific to a certain tvf,
	so they should be analyzed in each derived class.
	
There are 2 new classes:

- `FileFormatConstants`: Define some common property names or variables related to file format.
- `FileFormatUtils`: Define some util methods related to file format.

After this PR, if we want to add a common property for all these tvfs, we only need to handle it in
`ExternalFileTableValuedFunction`, avoiding the risk of missing it in any one of them (see the sketch after the behavior changes).

### Behavior change

1. Remove the `fs.defaultFS` property in `hdfs()`; it can be derived from `uri`
2. Use `\t` as the default column separator of csv format, same as stream load
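A condensed sketch of the resulting shape. The class names and the `format`, `column_separator` and `uri` keys come from the PR; the `file_path` key and the method names are illustrative only.

```
// Condensed sketch of the refactored hierarchy.
import java.util.Map;

abstract class ExternalFileTableValuedFunction {
    protected String format;
    protected String columnSeparator;

    // Common file-format properties are analyzed once, in the parent.
    protected void parseCommonProperties(Map<String, String> props) {
        format = props.getOrDefault("format", "csv");
        columnSeparator = props.getOrDefault("column_separator", "\t"); // new default
    }

    // The location property differs per storage, so each subclass owns it.
    protected abstract void parseLocationProperties(Map<String, String> props);
}

class HdfsTableValuedFunction extends ExternalFileTableValuedFunction {
    private String uri;

    @Override
    protected void parseLocationProperties(Map<String, String> props) {
        uri = props.get("uri"); // fs.defaultFS is gone; it is derived from uri
    }
}

class LocalTableValuedFunction extends ExternalFileTableValuedFunction {
    private String filePath;

    @Override
    protected void parseLocationProperties(Map<String, String> props) {
        filePath = props.get("file_path");
    }
}
```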
2023-10-07 12:44:04 +08:00
26818de9c8 [feature](jni) support complex types in jni framework (#24810)
Support complex types in the jni framework; it runs successfully end-to-end on hudi.
### How to Use
Other scanners only need to implement three `ColumnValue` methods:
```
// Get array elements and append into values
void unpackArray(List<ColumnValue> values);

// Get map key array&value array, and append into keys&values
void unpackMap(List<ColumnValue> keys, List<ColumnValue> values);

// Get the struct fields specified by `structFieldIndex`, and append into values
void unpackStruct(List<Integer> structFieldIndex, List<ColumnValue> values);
```
Developers can take `HudiColumnValue` as an example.
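A stripped-down illustration of what such an implementation might look like; `ToyColumnValue` and its backing representation are invented for this example, loosely modeled on the `HudiColumnValue` idea.

```
// Toy illustration: a value backed by plain Java collections that knows
// how to unpack itself into child values, mirroring the three methods above.
import java.util.List;
import java.util.Map;

class ToyColumnValue {
    private final Object inner; // a List (array/struct), a Map, or a scalar

    ToyColumnValue(Object inner) {
        this.inner = inner;
    }

    // Append one wrapper per array element.
    void unpackArray(List<ToyColumnValue> values) {
        for (Object element : (List<?>) inner) {
            values.add(new ToyColumnValue(element));
        }
    }

    // Append key and value wrappers pairwise, in iteration order.
    void unpackMap(List<ToyColumnValue> keys, List<ToyColumnValue> values) {
        for (Map.Entry<?, ?> e : ((Map<?, ?>) inner).entrySet()) {
            keys.add(new ToyColumnValue(e.getKey()));
            values.add(new ToyColumnValue(e.getValue()));
        }
    }

    // Append only the struct fields the reader asked for, by field index.
    void unpackStruct(List<Integer> structFieldIndex, List<ToyColumnValue> values) {
        List<?> fields = (List<?>) inner;
        for (int idx : structFieldIndex) {
            values.add(new ToyColumnValue(fields.get(idx)));
        }
    }
}
```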
2023-09-27 14:47:41 +08:00
ef72321878 [fix](regression-text)fix test_path_partition_keys regression test (#24796)
fix test_path_partition_keys regression test
2023-09-24 23:32:33 +08:00
80bcb43143 [Feature]Support external table sample stats collection (#24376)
Support hive table sample stats collection. Grammar is like

`analyze table with sample percent 10`
2023-09-19 11:20:27 +08:00
b4432ce577 [Feature](statistics)Support external table analyze partition (#24154)
Enable collecting partition-level stats for hive external tables.
2023-09-18 14:59:26 +08:00
f3e350e8ec [Improvement](statistics)Improve statistics user experience (#24414)
Two improvements:
1. Move the `Job_id` column of the `Analyze table` command's return info to the first column, to keep it consistent with `show analyze`.
```
mysql> analyze table hive.tpch100.region;
+--------+--------------+-------------------------+------------+--------------------------------+
| Job_Id | Catalog_Name | DB_Name                 | Table_Name | Columns                        |
+--------+--------------+-------------------------+------------+--------------------------------+
| 14403  | hive         | default_cluster:tpch100 | region     | [r_regionkey,r_comment,r_name] |
+--------+--------------+-------------------------+------------+--------------------------------+
1 row in set (0.03 sec)
```
2. Add `analyze_timeout` session variable, to control `analyze table/database with sync` timeout.
2023-09-18 13:36:41 +08:00
4dad7c94da [fix](orc) fix the count(*) pushdown issue in orc format (#24446)
Previously, when querying a hive table in orc format whose file is split,
the result of select count(*) could be a multiple of the real row count.

This is because the number of rows should be obtained after orc stripe prune;
otherwise, a wrong result may be returned.
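A small sketch of the corrected accounting, with invented stand-in types for the stripe metadata: row counts are summed only over stripes that survive pruning and belong to the current split.

```
// Sketch: sum rows only over stripes that survive predicate pruning
// and belong to the current split, instead of using file-level counts.
import java.util.List;

public class StripeRowCount {
    record Stripe(long offset, long numRows, boolean matchesPredicate) {}

    static long countRows(List<Stripe> stripes, long splitStart, long splitEnd) {
        long rows = 0;
        for (Stripe s : stripes) {
            boolean inThisSplit = s.offset() >= splitStart && s.offset() < splitEnd;
            if (inThisSplit && s.matchesPredicate()) {
                rows += s.numRows();
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Stripe> stripes = List.of(
                new Stripe(0, 1000, true),
                new Stripe(5000, 1000, false), // pruned
                new Stripe(9000, 1000, true));
        // Two splits over one file must not both report the full count.
        System.out.println(countRows(stripes, 0, 6000));     // 1000
        System.out.println(countRows(stripes, 6000, 12000)); // 1000
    }
}
```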
2023-09-16 09:57:39 +08:00
b407f275c8 [fix](hive) fix partition prune issue and some external table test cases (#24338)
1. Fix a hive partition prune bug, introduced by #23845, which failed the `test_hive_default_partition` test case.
2. Fix the `test_local_tvf.groovy` test case; the path of a local tvf should be a relative path.
3. Fix the `test_external_catalog_hive` test case; `partitions` is now a reserved keyword.
4. Support the `local` tvf in Nereids and fix related issues like:

```
Caused by: java.lang.NullPointerException
        at org.apache.doris.nereids.stats.ExpressionEstimation.castMinMax(ExpressionEstimation.java:171) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.nereids.stats.ExpressionEstimation.visitCast(ExpressionEstimation.java:167) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.nereids.stats.ExpressionEstimation.visitCast(ExpressionEstimation.java:109) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.nereids.trees.expressions.Cast.accept(Cast.java:55) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.nereids.stats.ExpressionEstimation.visitAlias(ExpressionEstimation.java:394) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.nereids.stats.ExpressionEstimation.visitAlias(ExpressionEstimation.java:109) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.nereids.trees.expressions.Alias.accept(Alias.java:145) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.nereids.stats.ExpressionEstimation.estimate(ExpressionEstimation.java:119) ~[doris-fe.jar:1.2-SNAPSHOT]
        at org.apache.doris.nereids.stats.StatsCalculator.lambda$computeProject$7(StatsCalculator.java:785) ~[doris-fe.jar:1.2-SNAPSHOT]
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_341]
        at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_341]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[?:1.8.0_341]
```
2023-09-15 20:57:04 +08:00