Following #25138, unified schema change interface for parquet and orc reader, and can be applied to other format readers as well.
Unified schema change interface for all format readers:
- First, read the data according to the column type of the file into source column;
- Second, convert source column to the destination column with type planned by FE.
`isAdjustedToUTC` is exactly the opposite in parquet reader(https://github.com/apache/parquet-format/blob/master/LogicalTypes.md), resulting the time with `isAdjustedToUTC=true` has increased by eight hours(UTC8).
The parquet with `isAdjustedToUTC=true` can be produced by spark-sql with the following configuration:
```
--conf spark.sql.session.timeZone=UTC
--conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS
```
However, using the following configuration, there's no logical and convert type in parquet meta data, so the time read by doris will also increase by eight hours(UTC8). Users need to set their own UTC time zone in doris(https://doris.apache.org/docs/dev/advanced/time-zone/)
```
--conf spark.sql.session.timeZone=UTC
--conf spark.sql.parquet.outputTimestampType=INT96
```
1. Fix iceberg catalog bug
This PR #30198 change the logic of `IcebergHMSExternalCatalog.java`,
to get locationUrl by calling hive metastore's `getCatalog()` method.
But this method only exists in hive 3+. So it will fail if we using hive 2.x.
I temporary remove this logic, because this logic is only used from iceberg table writing.
Which is still under development. We will rethink this logic later.
2. Fix test cases
Some of P2 test cases missed `order_qt`. And because the output format of the floating point
type is changed, some result in `out` files need to be regenerated.
In order to support paimon with hive2, we need to modify the origin HiveMetastoreClient.java
to let it compatible with both hive2 and hive3.
And this modified HiveMetastoreClient should be at the front of the CLASSPATH, so that
it can overwrite the HiveMetastoreClient in hadoop jar.
This PR mainly changes:
1. Copy HiveMetastoreClient.java in FE to BE's preload jar.
2. Split the origin `preload-extensions-jar-with-dependencies.jar` into 2 jars
1. `preload-extensions-project.jar`, which contains the modified HiveMetastoreClient.
2. `preload-extensions-jar-with-dependencies.jar`, which contains other dependency jars.
3. Modify the `start_be.sh`, to let `preload-extensions-project.jar` be loaded first.
4. Change the way the assemble the jni scanner jar
Only need to assemble the project jar, without other dependencies.
Because actually we only use classed under `org.apache.doris` package.
So remove other unused dependency jars can also reduce the output size of BE.
5. fix bug that the prefix of paimon properties should be `paimon.`, not `paimon`
6. Support paimon with hive2
User can set `hive.version` in paimon catalog properties to specify the hive version.
1. max compute partition prune,
we just support filter mc partitions by '=',it can filter just one partition
to support multiple partition filter and range operator('>','<', '>='..), the partition prune should be supported.
2. add max compute row count cache and partitionValues cache
3. add max compute regression case
Fixed the problem of not being able to read parquet lz4 compressed format. By default, it is decompressed according to the Hadoop lz4 format. If it fails, it will fall back to the standard lz4 compression format.
Fix two bugs:
1. Missing column is case sensitive, change the column name to lower case in FE for hive/iceberg/hudi
2. Iceberg use custom method to encode special characters in column name. Decode the column name to match the right column in parquet reader.
- Fix complex type crash when using the dict filter facility in the parquet-reader by turning off the dict filter facility in this case.
- Add orc complex types regression test.
1. Parquet with page v2 is parsed error when using other codec except snappy. Because `compressed_page_size` has the same meaning in page v1 and v2, it always contains the bytes of definition level, repetition level and compressed data.
2. Add regression test for `fix_length_byte_array` stored decimal type, and dictionary encoded date/datetime type.
1.Reconstruct the logic of decode to read parquet. The parquet reader first reads the data according to the parquet physical type, and then performs a type conversion.
2.Support hive alter table.
Support complex types in jni framework, and successfully run end-to-end on hudi.
### How to Use
Other scanners only need to implement three interfaces in `ColumnValue`:
```
// Get array elements and append into values
void unpackArray(List<ColumnValue> values);
// Get map key array&value array, and append into keys&values
void unpackMap(List<ColumnValue> keys, List<ColumnValue> values);
// Get the struct fields specified by `structFieldIndex`, and append into values
void unpackStruct(List<Integer> structFieldIndex, List<ColumnValue> values);
```
Developers can take `HudiColumnValue` as an example.
Two improvements:
1. Move the `Job_id` column for the return info of `Analyze table` command to the first column. To keep consistent with `show analyze`.
```
mysql> analyze table hive.tpch100.region;
+--------+--------------+-------------------------+------------+--------------------------------+
| Job_Id | Catalog_Name | DB_Name | Table_Name | Columns |
+--------+--------------+-------------------------+------------+--------------------------------+
| 14403 | hive | default_cluster:tpch100 | region | [r_regionkey,r_comment,r_name] |
+--------+--------------+-------------------------+------------+--------------------------------+
1 row in set (0.03 sec)
```
2. Add `analyze_timeout` session variable, to control `analyze table/database with sync` timeout.
In previous, when querying hive table in orc format, and the file is splitted.
the result of select count(*) may be multiple of the real row number.
This is because the number of rows should be got after orc strip prune,
otherwise, it may return wrong result
Fix three bugs:
1. Hudi slice maybe has log files only, so `new Path(filePath)` will throw errors.
2. Hive column names are lowercase only, so match column names in ignore-case-mode.
3. Compatible with [Spark Datasource Configs](https://hudi.apache.org/docs/configurations/#Read-Options), so users can add `hoodie.datasource.merge.type=skip_merge` in catalog properties to skip merge logs files.
1. do not split compress data file
Some data file in hive is compressed with gzip, deflate, etc.
These kinds of file can not be splitted.
2. Support lz4 block codec
for hive scan node, use lz4 block codec instead of lz4 frame codec
4. Support snappy block codec
For hadoop snappy
5. Optimize the `count(*)` query of csv file
For query like `select count(*) from tbl`, only need to split the line, no need to split the column.
Need to pick to branch-2.0 after this PR: #22304