1. Parquet with page v2 is parsed error when using other codec except snappy. Because `compressed_page_size` has the same meaning in page v1 and v2, it always contains the bytes of definition level, repetition level and compressed data.
2. Add regression test for `fix_length_byte_array` stored decimal type, and dictionary encoded date/datetime type.
`ExternalFileTableValuedFunction` now has 3 derived classes:
- LocalTableValuedFunction
- HdfsTableValuedFunction
- S3TableValuedFunction
All these tvfs are for reading data from file. The difference is where to read the file, eg, from HDFS or from local filesystem.
So I refine the fields and methods of these classes.
Now there 3 kinds of properties of these tvfs:
1. File format properties
File format properties, such as `format`, `column_separator`. For all these tvfs, they are common properties.
So these properties should be analyzed in parenet class `ExternalFileTableValuedFunction`.
2. URI or file path
The URI or file path property indicate the file location. For different storage, the format of the uri are not same.
So they should be analyzed in each derived classes.
3. Other properties
All other properties which are special for certain tvf.
So they should be analyzed in each derived classes.
There are 2 new classes:
- `FileFormatConstants`: Define some common property names or variables related to file format.
- `FileFormatUtils`: Define some util methods related to file format.
After this PR, if we want to add some common properties for all these tvfs, only need to handled it in
`ExternalFileTableValuedFunction`, to avoid missing handle it in any one of them.
### Behavior change
1. Remove `fs.defaultFS` property in `hdfs()`, it can be got from `uri`
2. Use `\t` as the default column separator of csv format, same as stream load
In previous, when querying hive table in orc format, and the file is splitted.
the result of select count(*) may be multiple of the real row number.
This is because the number of rows should be got after orc strip prune,
otherwise, it may return wrong result
1. Fix hive partition prune bug, introduced from #23845, will fail `test_hive_default_partition` test case.
2. Fix `test_local_tvf.groovy` test case, the path of local tvf should be relative path.
3. Fix `test_external_catalog_hive` test case, the `partitions` is now reserve keywords
4. Support `local` tvf in Nereids, but fix related issue like:
```
Caused by: java.lang.NullPointerException
at org.apache.doris.nereids.stats.ExpressionEstimation.castMinMax(ExpressionEstimation.java:171) ~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.nereids.stats.ExpressionEstimation.visitCast(ExpressionEstimation.java:167) ~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.nereids.stats.ExpressionEstimation.visitCast(ExpressionEstimation.java:109) ~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.nereids.trees.expressions.Cast.accept(Cast.java:55) ~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.nereids.stats.ExpressionEstimation.visitAlias(ExpressionEstimation.java:394) ~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.nereids.stats.ExpressionEstimation.visitAlias(ExpressionEstimation.java:109) ~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.nereids.trees.expressions.Alias.accept(Alias.java:145) ~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.nereids.stats.ExpressionEstimation.estimate(ExpressionEstimation.java:119) ~[doris-fe.jar:1.2-SNAPSHOT]
at org.apache.doris.nereids.stats.StatsCalculator.lambda$computeProject$7(StatsCalculator.java:785) ~[doris-fe.jar:1.2-SNAPSHOT]
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_341]
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_341]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[?:1.8.0_341]
```
Nereids doesn't support view based table value function, because tvf view doesn't contain the proper qualifier (catalog, db and table name). This pr is to support this function.
Also, fix nereids table value function explain output exprs incorrect bug.
The sql like
Select count(1) from view
would contain all the columns in old planner's execution plan, which is slow, because BE need to read all the column in data files. This pr is to improve the plan to only contain one column.
A row of complex type may be stored across two(or more) pages, and the parameter `align_rows` indicates that whether the reader should read the remaining value of the last row in previous page.
sort out the test cases of external table.
After modify, there are 2 directories:
1. `external_table_p0`: all p0 cases of external tables: hive, es, jdbc and tvf
2. `external_table_p2`: all p2 cases of external tables: hive, es, mysql, pg, iceberg and tvf
So that we can run it with one line command like:
```
sh run-regression-test.sh --run -d external_table_p0,external_table_p2
```