In previous, when querying hive table in orc format, and the file is splitted.
the result of select count(*) may be multiple of the real row number.
This is because the number of rows should be got after orc strip prune,
otherwise, it may return wrong result
Fix three bugs:
1. Hudi slice maybe has log files only, so `new Path(filePath)` will throw errors.
2. Hive column names are lowercase only, so match column names in ignore-case-mode.
3. Compatible with [Spark Datasource Configs](https://hudi.apache.org/docs/configurations/#Read-Options), so users can add `hoodie.datasource.merge.type=skip_merge` in catalog properties to skip merge logs files.
Issue Number: close #xxx
If the user has configured the wrong priority_network, direct startup failure to avoid users mistakenly assuming that the configuration is correct
If the user has not configured p_ n. Select only the first IP from the IPv4 list, rather than selecting from all IPs, to avoid users' servers not supporting IPv4
extends #23784
A similar bug compares to #22140 .
When executing a query with hms catalog, the query maybe failed because some hdfs files are not existed. We should just distinguish this kind of errors and skip it.
```
errCode = 2, detailMessage = (xxx.xxx.xxx.xxx)[CANCELLED][INTERNAL_ERROR]failed to init reader for file hdfs://xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc, err: [INTERNAL_ERROR]Init OrcReader failed. reason = Failed to read hdfs://xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc: [INTERNAL_ERROR]Read hdfs file failed. (BE: xxx.xxx.xxx.xxx) namenode:hdfs://xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc, err: (2), No such file or directory), reason: RemoteException: File does not exist: /xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:158) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1927)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:738)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:426)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
```
Add 2 metrics in jdbc scan node profile:
- `CallJniNextTime`: call get next from jdbc result set
- `ConvertBatchTime`: call convert jobject to columm block
Also fix a potential concurrency issue when init jdbc connection cache pool
```if (!new_top.is_null() && new_top != old_top)``` is always false since old_top is Null when init and Field == Null always return true.
We add old_top.is_null() check first to avoid the problem and then issue more carefull discussion about Field == Null semantics.
1. When file meta cache is disabled (by setting `max_external_file_meta_cache_num=0` in be.conf),
the parquet's meta info is owned by parquet reader and will be released when calling `reader->close()`.
But the underlying file reader of this parquet reader will be released after `reader->close()`,
this may causing `heap-use-after-free` bug because some part of meta info may be referenced by file reader.
This PR fix it by making sure that meta info is released after file reader released.
2. Add modification time for file meta cache in BE, to avoid parquet read error like:
`Failed to deserialize parquet page header`
1. do not split compress data file
Some data file in hive is compressed with gzip, deflate, etc.
These kinds of file can not be splitted.
2. Support lz4 block codec
for hive scan node, use lz4 block codec instead of lz4 frame codec
4. Support snappy block codec
For hadoop snappy
5. Optimize the `count(*)` query of csv file
For query like `select count(*) from tbl`, only need to split the line, no need to split the column.
Need to pick to branch-2.0 after this PR: #22304
Sometimes, the partitions of a hive table may on different storage, eg, some is on HDFS, others on object storage(cos, etc).
This PR mainly changes:
1. Fix the bug of accessing files via cosn.
2. Add a new field `fs_name` in TFileRangeDesc
This is because, when accessing a file, the BE will get a hdfs client from hdfs client cache, and different file in one query
request may have different fs name, eg, some of are `hdfs://`, some of are `cosn://`, so we need to specify fs name
for each file, otherwise, it may return error:
`reason: IllegalArgumentException: Wrong FS: cosn://doris-build-1308700295/xxxx, expected: hdfs://[172.xxxx:4007](http://172.xxxxx:4007/)`
Fix incorrect result if null partition fields in orc file.
### Root Cause
Theoretically, the underlying file of the hive partition table should not contain partition fields. But we found that in some user scenarios, the partition field will exist in the underlying orc/parquet file and are null values. As a result, the pushed down partition field which are null values. filter incorrectly.
### Solution
we handle this case by only reading non-partition fields. The parquet reader is already handled this way, this PR handles the orc reader.