1. Fix a bug where auto analyze of external tables recursively loaded the schema cache.
2. Move some functions from the StatisticsAutoAnalyzer class to TableIf, so that external tables and internal tables can implement the logic separately.
3. Disable auto analyze for external catalogs by default; it can be enabled by adding the catalog property "enable.auto.analyze"="true" (see the example below).
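For example, auto analyze could be turned on for one external catalog like this (a minimal sketch; the catalog name is hypothetical, the property name is the one introduced here):

```sql
-- "hive_catalog" is a hypothetical catalog name.
ALTER CATALOG hive_catalog SET PROPERTIES ("enable.auto.analyze" = "true");
```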
Before, `show column stats` ignored columns with errors.
With this PR, when the min or max value fails to deserialize, `show column stats` uses N/A as the min or max value and still shows the remaining stats (count, null_count, ndv and so on).
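Illustrative usage (table and column names are hypothetical): for such a column the min/max fields now show `N/A`, while count, null_count, ndv and the other stats are still reported.

```sql
-- Hypothetical table and column names.
SHOW COLUMN STATS tbl1 (c1);
```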
1. jdbc
Before, the constructor of the JDBC catalog might call the checksum action of the JDBC driver.
But the download link of the JDBC driver may not be available when replaying, which caused a replay error.
This PR changes the logic to avoid computing the checksum when replaying the creation of a JDBC catalog (see the example after this list).
2. ranger
After this PR, when creating a catalog, Doris will try to init the access controller to make sure the config is OK.
3. catalog priv check
When creating/dropping/altering a catalog, Doris will only use the internal access controller to check catalog-level privileges.
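For reference, a typical JDBC catalog definition looks roughly like the sketch below (all values are illustrative). It is the jar behind `driver_url` that may no longer be downloadable when the FE replays this statement, which is why the checksum is only computed at creation time.

```sql
-- All values are illustrative; the driver checksum is computed when the
-- catalog is first created, not when the edit log is replayed.
CREATE CATALOG jdbc_mysql PROPERTIES (
    "type" = "jdbc",
    "user" = "root",
    "password" = "",
    "jdbc_url" = "jdbc:mysql://127.0.0.1:3306/demo",
    "driver_url" = "mysql-connector-java-8.0.25.jar",
    "driver_class" = "com.mysql.cj.jdbc.Driver"
);
```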
1. Do not split compressed data files
Some data files in Hive are compressed with gzip, deflate, etc.
These kinds of files cannot be split.
2. Support lz4 block codec
For the hive scan node, use the lz4 block codec instead of the lz4 frame codec.
3. Support snappy block codec
For Hadoop snappy.
4. Optimize the `count(*)` query of csv files
For a query like `select count(*) from tbl`, we only need to split lines, not columns (see the example below).
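As a rough illustration (catalog/db/table names are hypothetical), a query like the following over a CSV-backed Hive table only needs line splitting after this change:

```sql
-- Hypothetical names; only row delimiters are parsed for this query,
-- column splitting is skipped.
SELECT COUNT(*) FROM hive_catalog.db1.csv_tbl;
```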
Needs to be picked to branch-2.0 after this PR: #22304
Sometimes the partitions of a Hive table may be on different storage, e.g. some on HDFS and others on object storage (COS, etc.).
This PR mainly changes:
1. Fix the bug of accessing files via cosn.
2. Add a new field `fs_name` in `TFileRangeDesc`.
This is because, when accessing a file, the BE gets an HDFS client from the HDFS client cache, and different files in one query request may have different fs names, e.g. some are `hdfs://` and some are `cosn://`. So we need to specify the fs name for each file; otherwise, it may return an error:
`reason: IllegalArgumentException: Wrong FS: cosn://doris-build-1308700295/xxxx, expected: hdfs://172.xxxx:4007`
Fix incorrect results when partition fields are stored as null values in ORC files.
### Root Cause
Theoretically, the underlying files of a Hive partitioned table should not contain the partition fields. But we found that in some user scenarios the partition fields do exist in the underlying orc/parquet files and hold null values. As a result, predicates pushed down on the partition fields are evaluated against these null values and filter rows incorrectly.
### Solution
We handle this case by reading only the non-partition fields from the file. The parquet reader already works this way; this PR applies the same handling to the orc reader.
Nereids doesn't support views based on table-valued functions, because a TVF view doesn't carry the proper qualifier (catalog, db and table name). This PR adds support for it.
Also, fix a bug where the explain output exprs of a Nereids table-valued function were incorrect.
A SQL statement like
`select count(1) from view`
would contain all the columns in the old planner's execution plan, which is slow because the BE needs to read all the columns from the data files. This PR improves the plan to contain only one column (see the example below).
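A minimal sketch of the now-supported pattern, using the `numbers` table-valued function (the view name is hypothetical):

```sql
-- Hypothetical view over a table-valued function.
CREATE VIEW v_numbers AS SELECT * FROM numbers("number" = "100");
-- With the improved plan, count(1) over the view no longer reads every column.
SELECT COUNT(1) FROM v_numbers;
```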
1. Load the row count cache for external tables while initializing the table, so that row count stats are available even the very first time a SQL statement runs.
2. Show the cardinality of an external scan node when explaining the SQL (see the example below).
3. Fix bugs introduced by https://github.com/apache/doris/pull/22963
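For example (names are hypothetical), the external scan node in the explain output now reports its cardinality:

```sql
-- Hypothetical external table; the scan node now shows its estimated row count.
EXPLAIN SELECT * FROM hive_catalog.db1.tbl1;
```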
A row of a complex type may be stored across two (or more) pages, and the parameter `align_rows` indicates whether the reader should read the remaining values of the last row from the previous page.
`ParquetReader` confuses the logical/physical/slot ids of columns. When reading only scalar types nothing goes wrong, but when reading complex types, `RowGroup` and `PageIndex` get wrong statistics. Therefore, if a query contains complex types and pushed-down predicates, the result set is likely to be incorrect.
[Fix](orc-reader) Fix filling partition or missing columns with an incorrect row count.
`_row_reader->nextBatch` returns the number of rows read. When ORC lazy materialization is turned on, this number includes filtered rows, so the caller must look at `numElements` in the row batch to determine how many rows were not filtered and should be filled into the block.
Previously, filling partition or missing columns used the wrong row count, which caused the BE to crash on `filter.size() != offsets.size()` in the filter-column step.
When ORC lazy materialization is turned off, call `_convert_dict_cols_to_string_cols(block, nullptr)` if `block->rows() == 0`.
1. The effective value of the BE config `file_cache_max_file_reader_cache_size` will be 1/3 of the process's max open file number.
2. Use a thread pool to create or init the file caches concurrently.
This solves the issue that when there are lots of files in the file cache dirs, BE startup is very slow because it traverses all file cache dirs sequentially.
1. Collect external table row counts when executing `ANALYZE DATABASE`.
2. Support showing cached table stats (row count).
3. Support altering external table column stats (see the examples after this list).
4. Refresh/invalidate the table row count stats memory cache when an analyze task finishes and when table stats are dropped.
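Illustrative statements for the points above (catalog/db/table/column names and stat values are hypothetical, and the exact set of supported stat keys may differ):

```sql
-- Hypothetical names and values.
ANALYZE DATABASE hive_catalog.db1;
SHOW TABLE STATS hive_catalog.db1.tbl1;
ALTER TABLE hive_catalog.db1.tbl1
MODIFY COLUMN c1 SET STATS ('row_count' = '1000', 'ndv' = '100');
```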
Sort out the test cases of external tables.
After this change, there are 2 directories:
1. `external_table_p0`: all p0 cases of external tables: hive, es, jdbc and tvf
2. `external_table_p2`: all p2 cases of external tables: hive, es, mysql, pg, iceberg and tvf
So we can run them with a one-line command like:
```
sh run-regression-test.sh --run -d external_table_p0,external_table_p2
```