Files
doris/regression-test
daidai 5d02c48715 [feature](hive)Support reading renamed Parquet Hive and Orc Hive tables. (#38432) (#38809)
bp #38432 

## Proposed changes
Add `hive_parquet_use_column_names` and `hive_orc_use_column_names`
session variables to read the table after rename column in `Hive`.

These two session variables are referenced from
`parquet_use_column_names` and `orc_use_column_names` of `Trino` hive
connector.

By default, these two session variables are true. When they are set to
false, reading orc/parquet will access the columns according to the
ordinal position in the Hive table definition.

For example:
```mysql
in Hive :
hive> create table tmp (a int , b string) stored as parquet;
hive> insert into table tmp values(1,"2");
hive> alter table tmp  change column  a new_a int;
hive> insert into table tmp values(2,"4");

in Doris :
mysql> set hive_parquet_use_column_names=true;
Query OK, 0 rows affected (0.00 sec)

mysql> select  * from tmp;
+-------+------+
| new_a | b    |
+-------+------+
|  NULL | 2    |
|     2 | 4    |
+-------+------+
2 rows in set (0.02 sec)

mysql> set hive_parquet_use_column_names=false;
Query OK, 0 rows affected (0.00 sec)

mysql> select  * from tmp;
+-------+------+
| new_a | b    |
+-------+------+
|     1 | 2    |
|     2 | 4    |
+-------+------+
2 rows in set (0.02 sec)
```

You can use `set
parquet.column.index.access/orc.force.positional.evolution = true/false`
in hive 3 to control the results of reading the table like these two
session variables. However, for the rename struct inside column parquet
table, the effects of hive and doris are different.
2024-08-05 09:06:49 +08:00
..

新加case注意事项

常规 case

  1. 变量名前要写 def,否则是全局变量,并行跑的 case 的时候可能被其他 case 影响。

    Problematic code:

    ret = ***
    

    Correct code:

    def ret = ***
    
  2. 尽量不要在 case 中 global 的设置 session variable,或者修改集群配置,可能会影响其他 case。

    Problematic code:

    sql """set global enable_pipeline_x_engine=true;"""
    

    Correct code:

    sql """set enable_pipeline_x_engine=true;"""
    
  3. 如果必须要设置 global,或者要改集群配置,可以指定 case 以 nonConcurrent 的方式运行。

    示例

  4. case 中涉及时间相关的,最好固定时间,不要用类似 now() 函数这种动态值,避免过一段时间后 case 就跑不过了。

    Problematic code:

    sql """select count(*) from table where created < now();"""
    

    Correct code:

    sql """select count(*) from table where created < '2023-11-13';"""
    
  5. case 中 streamload 后请加上 sync 一下,避免在多 FE 环境中执行不稳定。

    Problematic code:

    streamLoad { ... }
    sql """select count(*) from table """
    

    Correct code:

    streamLoad { ... }
    sql """sync"""
    sql """select count(*) from table """
    
  6. UDF 的 case,需要把对应的 jar 包拷贝到所有 BE 机器上。

    示例

兼容性 case

指重启 FE 测试或升级测试中,在初始集群上创建的资源或规则,在集群重启或升级后也能正常使用,比如权限、UDF等。 这些 case 需要拆分成两个文件,load.groovy 和 xxxx.groovy,放到一个文件夹中并加上 restart_fe 组标签,示例