bp #41956
This PR #40225 tries to pass time zone info from BE to JNI, using
`_state->timezone_obj().name()`
to get the time zone name.
But during a rolling upgrade of BE, this may core dump like:
```
*** SIGSEGV address not mapped to object (@0x610) received by PID 72661 (TID 73538 OR 0x7f2e898d1640) from PID 1552; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_branch-2.1/doris/be/src/common/signal_handler.h:421
1# os::Linux::chained_handler(int, siginfo_t*, void*) in /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
2# JVM_handle_linux_signal in /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
3# signalHandler(int, siginfo_t*, void*) in /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
4# 0x00007F3070D3E520 in /lib/x86_64-linux-gnu/libc.so.6
5# cctz::time_zone::name[abi:cxx11]() const in /mnt/hdd01/ci/compatibility-deploy/be/lib/doris_be
6# doris::vectorized::JniConnector::open(doris::RuntimeState*, doris::RuntimeProfile*) at /home/zcp/repo_center/doris_branch-2.1/doris/be/src/vec/exec/jni_connector.cpp:87
7# doris::vectorized::AvroJNIReader::init_fetch_table_schema_reader() at /home/zcp/repo_center/doris_branch-2.1/doris/be/src/vec/exec/format/avro/avro_jni_reader.cpp:119
8# std::_Function_handler::_M_invoke(std::_Any_data const&) at /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
9# doris::WorkThreadPool::work_thread(int) at /home/zcp/repo_center/doris_branch-2.1/doris/be/src/util/work_thread_pool.hpp:159
10# execute_native_thread_routine at ../../../../../libstdc++-v3/src/c++11/thread.cc:84
11# start_thread at ./nptl/pthread_create.c:442
12# 0x00007F3070E22850 at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:83
172.20.50.206 last coredump sql: 2024-10-13 04:12:23,985 [query]
```
This PR uses another method, `_state->timezone()`, which just returns a
string instead of reading and initializing the time zone info file,
to avoid the potential core dump (sketched below).
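A minimal sketch of the safe path, with a stand-in `RuntimeState` and a hypothetical parameter key (the real code is in `be/src/vec/exec/jni_connector.cpp`):

```cpp
#include <map>
#include <string>

// Minimal stand-in for the relevant slice of doris::RuntimeState; the real
// class also exposes timezone_obj(), which returns a cctz::time_zone that is
// only valid once the time zone database has been read and initialized.
class RuntimeState {
public:
    const std::string& timezone() const { return _timezone; }

private:
    std::string _timezone = "Asia/Shanghai"; // set from the query options
};

// Sketch of the fixed path in JniConnector::open: forward the plain string.
// The key name "time_zone" is illustrative, not copied from the source.
void fill_scanner_params(const RuntimeState* state,
                         std::map<std::string, std::string>* params) {
    // Before the fix this read state->timezone_obj().name(), which can
    // dereference an uninitialized cctz::time_zone during a rolling upgrade.
    (*params)["time_zone"] = state->timezone();
}
```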
bp #40225, #40888, #41386
## Proposed changes
Among them, #40225 introduces the new MC API,
#40888 fixes a bug when reading null values between the new and old APIs,
and #41386 handles compatibility between the new and old versions.
Also move the analysis exception "Not support insert with partition
spec in hive catalog."
from the create-sink phase to the bind-sink phase,
so that with `set enable_fallback_to_original_planner=false;` the
returned error is correct.
backport https://github.com/apache/doris/pull/33836
1. Fix the issue with tvf reading empty compressed files (see the sketch after this list).
2. Move two test cases (`test_local_tvf_compression` and `test_s3_tvf_compression`) from p2 to p0.
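For item 1, a self-contained sketch of the guard; the function and buffer-based signature are illustrative stand-ins, not Doris's actual decompressor API:

```cpp
#include <cstddef>

// Illustrative only: when the compressed file is empty, report EOF with zero
// rows instead of handing a zero-length buffer to the decompressor, which is
// the kind of input the tvf path previously mishandled.
bool read_compressed_block(const unsigned char* data, size_t len, bool* eof) {
    if (len == 0) { // empty .gz/.bz2/.lz4 file: nothing to decompress
        *eof = true;
        return true; // success, zero rows produced
    }
    *eof = false;
    // ... normal decompression path using `data` ...
    return true;
}
```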
Fix the `test_hive_parquet_alter_column` p2 case.
Since this is a p2 case, the data is stored on EMR rather than in Docker, so there is no need to consider Hive 2 vs. Hive 3.
Following #25138, this unifies the schema change interface for the Parquet and ORC readers; it can be applied to other format readers as well.
The unified schema change interface for all format readers (sketched below):
- First, read the data into a source column according to the column type of the file;
- Second, convert the source column to the destination column with the type planned by FE.
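A toy sketch of the two-step interface; `Column` stands in for Doris's vectorized columns, and the real code dispatches to a converter keyed on the (source, destination) type pair:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Toy column type; the real readers operate on doris::vectorized::IColumn.
struct Column {
    std::string type;          // logical type name
    std::vector<int64_t> data; // simplified payload
};

// Step 1: read the data exactly as the file stores it (file column type).
Column read_source_column(const std::string& file_type) {
    return Column{file_type, {1, 2, 3}};
}

// Step 2: convert the source column to the destination type planned by FE.
Column convert_to_dest_column(const Column& src, const std::string& dest_type) {
    return Column{dest_type, src.data}; // real code converts per type pair
}
```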
`isAdjustedToUTC` was interpreted exactly the opposite way in the Parquet reader (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md), so times with `isAdjustedToUTC=true` were increased by eight hours (UTC+8); see the sketch after the configurations below.
Parquet files with `isAdjustedToUTC=true` can be produced by spark-sql with the following configuration:
```
--conf spark.sql.session.timeZone=UTC
--conf spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS
```
However, with the following configuration there is no logical or converted type in the Parquet metadata, so the time read by Doris will still be increased by eight hours (UTC+8). Users need to set the UTC time zone in Doris themselves (https://doris.apache.org/docs/dev/advanced/time-zone/):
```
--conf spark.sql.session.timeZone=UTC
--conf spark.sql.parquet.outputTimestampType=INT96
```
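A sketch of the intended semantics from the Parquet spec; `session_offset_seconds` stands in for the session time zone's UTC offset (e.g. +8h for UTC+8), and this is not Doris's actual conversion code:

```cpp
#include <cstdint>

// isAdjustedToUTC=true: the stored value is a UTC instant (micros since the
// epoch) and must be shifted into the session time zone for display.
// isAdjustedToUTC=false: the value is already local wall-clock time and is
// taken as-is. Reading the flag the opposite way shifts timestamps by the
// session offset (eight hours in UTC+8) in exactly the wrong cases.
int64_t to_wall_clock_micros(int64_t stored_micros, bool is_adjusted_to_utc,
                             int64_t session_offset_seconds) {
    if (is_adjusted_to_utc) {
        return stored_micros + session_offset_seconds * 1000000LL;
    }
    return stored_micros;
}
```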
1. Fix iceberg catalog bug
PR #30198 changed the logic of `IcebergHMSExternalCatalog.java`
to get the location URL by calling the Hive metastore's `getCatalog()` method.
But this method only exists in Hive 3+, so it fails when using Hive 2.x.
This logic is temporarily removed, because it is only used for Iceberg table writing,
which is still under development. We will rethink this logic later.
2. Fix test cases
Some of the P2 test cases were missing `order_qt`. And because the output format of the floating point
type has changed, some results in the `out` files need to be regenerated.
In order to support Paimon with Hive 2, we need to modify the original `HiveMetastoreClient.java`
to make it compatible with both Hive 2 and Hive 3.
This modified `HiveMetastoreClient` must be at the front of the CLASSPATH, so that
it overrides the `HiveMetastoreClient` in the Hadoop jar.
This PR mainly changes:
1. Copy `HiveMetastoreClient.java` in FE to BE's preload jar.
2. Split the original `preload-extensions-jar-with-dependencies.jar` into 2 jars:
    1. `preload-extensions-project.jar`, which contains the modified `HiveMetastoreClient`.
    2. `preload-extensions-jar-with-dependencies.jar`, which contains the other dependency jars.
3. Modify `start_be.sh` to let `preload-extensions-project.jar` be loaded first.
4. Change the way the JNI scanner jar is assembled:
only the project jar needs to be assembled, without other dependencies,
because we actually only use classes under the `org.apache.doris` package.
Removing the unused dependency jars also reduces the output size of BE.
5. Fix a bug: the prefix of Paimon properties should be `paimon.`, not `paimon`.
6. Support Paimon with Hive 2.
Users can set `hive.version` in the Paimon catalog properties to specify the Hive version.
PR #23026 supported partition pruning for Hive tables with `_HIVE_DEFAULT_PARTITION`,
but it always selected the partition with `_HIVE_DEFAULT_PARTITION`.
PR #31613 supported null partitions for OLAP tables' list partitions, so we can treat `_HIVE_DEFAULT_PARTITION`
as the null partition of a Hive table.
So this PR changes the partition prune logic accordingly (a sketch follows).
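A language-agnostic sketch of the new behaviour, written in C++ for illustration (the real change is in FE's Java partition pruner; the partition name literal follows this description):

```cpp
#include <optional>
#include <string>
#include <vector>

// _HIVE_DEFAULT_PARTITION is modelled as a null partition value, so it goes
// through the same predicate as every other partition and is selected only
// when the predicate can be satisfied by null, instead of unconditionally.
std::vector<std::optional<std::string>> prune_partitions(
        const std::vector<std::string>& partitions,
        bool (*predicate)(const std::optional<std::string>&)) {
    std::vector<std::optional<std::string>> kept;
    for (const std::string& p : partitions) {
        std::optional<std::string> value;
        if (p != "_HIVE_DEFAULT_PARTITION") value = p; // nullopt = null partition
        if (predicate(value)) kept.push_back(value);
    }
    return kept;
}
```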