Files

daidai 5d02c48715 [feature](hive)Support reading renamed Parquet Hive and Orc Hive tables. (#38432 ) (#38809 )

bp #38432 

## Proposed changes
Add the `hive_parquet_use_column_names` and `hive_orc_use_column_names`
session variables so that tables whose columns were renamed in `Hive` can be read.

These two session variables are modeled on
`parquet_use_column_names` and `orc_use_column_names` in the `Trino` Hive
connector.

Both session variables default to true. When they are set to
false, ORC/Parquet reads access columns by their
ordinal position in the Hive table definition.

For example:
```mysql
-- In Hive:
hive> create table tmp (a int , b string) stored as parquet;
hive> insert into table tmp values(1,"2");
hive> alter table tmp  change column  a new_a int;
hive> insert into table tmp values(2,"4");

-- In Doris:
mysql> set hive_parquet_use_column_names=true;
Query OK, 0 rows affected (0.00 sec)

mysql> select  * from tmp;
+-------+------+
| new_a | b    |
+-------+------+
|  NULL | 2    |
|     2 | 4    |
+-------+------+
2 rows in set (0.02 sec)

mysql> set hive_parquet_use_column_names=false;
Query OK, 0 rows affected (0.00 sec)

mysql> select  * from tmp;
+-------+------+
| new_a | b    |
+-------+------+
|     1 | 2    |
|     2 | 4    |
+-------+------+
2 rows in set (0.02 sec)
```
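
The difference between the two result sets above can be reproduced with a small stand-alone model. This is a sketch in plain Groovy, not the Doris reader: the map-based file layout and the names `oldFile`, `newFile`, `readFile`, and `scan` are hypothetical and exist only for illustration.

```groovy
// Hypothetical model: a data file is a list of column names plus rows of values.
def oldFile = [columns: ['a', 'b'],     rows: [[1, '2']]]   // written before the rename
def newFile = [columns: ['new_a', 'b'], rows: [[2, '4']]]   // written after the rename
def tableSchema = ['new_a', 'b']                            // current Hive table definition

// Resolve each table column against one file, by name or by ordinal position.
def readFile = { Map file, List schema, boolean useColumnNames ->
    file.rows.collect { row ->
        (0..<schema.size()).collect { pos ->
            int idx = useColumnNames ? file.columns.indexOf(schema[pos]) : pos
            // A column that cannot be found in the file reads as NULL.
            (idx >= 0 && idx < row.size()) ? row[idx] : null
        }
    }
}

// Read both files the way a table scan would.
def scan = { boolean useColumnNames ->
    [oldFile, newFile].collectMany { readFile(it, tableSchema, useColumnNames) }
}

println scan(true)    // name matching: the old file has no `new_a`, so that value is null
println scan(false)   // positional matching: position 0 is used, so the old value is kept
```

With name matching the row written before the rename loses its first column, which corresponds to the `NULL` in the first query; positional matching recovers it, as in the second query.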

In Hive 3, you can use `set parquet.column.index.access = true/false` or
`set orc.force.positional.evolution = true/false`
to control the results of reading the table in the same way as these two
session variables. However, for Parquet tables in which a field inside a struct
column has been renamed, Hive and Doris behave differently.

2024-08-05 09:06:49 +08:00

common

…

conf

[regression](s3) add default conf for s3 releated cases (#37952 ) (#38472 )

2024-07-29 18:01:27 +08:00

ctas_p0

…

data

[feature](hive)Support reading renamed Parquet Hive and Orc Hive tables. (#38432 ) (#38809 )

2024-08-05 09:06:49 +08:00

framework

[Fix](regression) fix regression sql which has schema change (#37941 ) (#38456 )

2024-07-31 22:31:38 +08:00

java-udf-src

…

pipeline

[chore](test) disable fault injection to make pipeline task check happy (#38665 ) (#38821 )

2024-08-04 11:18:56 +08:00

plugins

[test](inverted index)Add cases for inverted index format v2 (#38132 )(#38443 ) (#38222 )

2024-08-02 12:04:26 +08:00

script

…

ssl_default_certificate

…

suites

[feature](hive)Support reading renamed Parquet Hive and Orc Hive tables. (#38432 ) (#38809 )

2024-08-05 09:06:49 +08:00

certificate.p12

…

README.md

[case](restart_fe) add demo case for restart_fe test (#37091 ) (#37313 )

2024-07-15 19:42:20 +08:00

README.md

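Notes for adding new cases

Regular cases

Declare variables with `def`. Without it, a variable is global and may be affected by other cases when cases run in parallel.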

Problematic code:
```
ret = ***
```
Correct code:
```
def ret = ***
```
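
The effect of omitting `def` can be seen in plain Groovy scripts. The following is a minimal sketch using the standard `Binding` and `GroovyShell` classes; the shared binding merely stands in for state visible to several cases, and it is an assumption that the regression framework shares state in a comparable way.

```groovy
// One Binding shared by several scripts stands in for state shared between cases.
def shared = new Binding()
def shell = new GroovyShell(shared)

// Script A: no `def`, so `ret` is stored in the shared binding.
shell.evaluate('ret = 1')
// Script B: also no `def`; it silently overwrites script A's value.
shell.evaluate('ret = 2')
// Script C: with `def`, the variable is local to the script and never reaches the binding.
shell.evaluate('def localRet = 3')

assert shared.getVariable('ret') == 2
assert !shared.hasVariable('localRet')
```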
Avoid setting session variables globally or changing cluster configuration in a case, because doing so may affect other cases.

Problematic code:
```
sql """set global enable_pipeline_x_engine=true;"""
```
Correct code:
```
sql """set enable_pipeline_x_engine=true;"""
```
If you must set a global variable or change cluster configuration, mark the case to run as nonConcurrent.

Example
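
A minimal sketch of such a suite, assuming the group is passed as the second argument of `suite` as in the framework's other cases. The suite name is hypothetical, and resetting the variable to `false` assumes that was its value before the case ran. The block only runs inside the regression framework.

```groovy
// Hypothetical suite name; "nonConcurrent" is given as the suite's group.
suite("test_global_variable_example", "nonConcurrent") {
    try {
        sql """set global enable_pipeline_x_engine=true;"""
        // ... statements that depend on the global setting ...
    } finally {
        // Reset the global setting (assumed original value) so later cases are not affected.
        sql """set global enable_pipeline_x_engine=false;"""
    }
}
```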
For time-related cases, use fixed times instead of dynamic values such as `now()`, so that the case does not start failing after some time has passed.

Problematic code:
```
sql """select count(*) from table where created < now();"""
```
Correct code:
```
sql """select count(*) from table where created < '2023-11-13';"""
```

After a streamload in a case, add a `sync` to avoid unstable results in multi-FE environments.

Problematic code:
```
streamLoad { ... }
sql """select count(*) from table """
```
Correct code:
```
streamLoad { ... }
sql """sync"""
sql """select count(*) from table """
```

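For UDF cases, the corresponding jar must be copied to all BE machines.

Example

Compatibility cases

These are cases, in FE restart tests or upgrade tests, where resources or rules created on the initial cluster (such as privileges and UDFs) must still work after the cluster is restarted or upgraded. Such a case must be split into two files, load.groovy and xxxx.groovy, placed in the same folder, and tagged with the restart_fe group; see the example.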