We can support this by adding a new property for the TVF, like:
`select * from hdfs("uri" = "xxx", ..., "compress_type" = "lz4", ...)`
Users can:
- Specify the compression type explicitly by setting `"compress_type" = "xxx"`.
- Let Doris infer the compression type from the file name suffix (e.g. `file1.gz`).
Currently, we only support reading compressed files in `csv` format, and the BE side already supports this.
All that needs to be done is to analyze `"compress_type"` on the FE side and pass it to the BE.
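A minimal sketch of both usages (the URIs and any other required connection properties are placeholders):

```SQL
-- Explicit: declare the compression of the underlying files.
SELECT * FROM hdfs(
    "uri" = "hdfs://host:8020/path/file1.csv.lz4",  -- placeholder URI
    "format" = "csv",
    "compress_type" = "lz4"
);

-- Inferred: no "compress_type"; Doris infers gzip from the .gz suffix.
SELECT * FROM hdfs(
    "uri" = "hdfs://host:8020/path/file1.gz",       -- placeholder URI
    "format" = "csv"
);
```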
1. Fix the inconsistent definition of `Retention`: this function returns tinyint on `FE` but uint8 on `BE`.
2. Make assert_cast support casting to derived classes.
3. Change some static casts to assert casts.
4. Support sum(bool)/avg(bool) (see the sketch below).
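A minimal illustration of item 4 (table and column names are hypothetical):

```SQL
-- Assuming a table t with a BOOLEAN column `flag`:
-- sum(flag) adds the values as 0/1, i.e. counts the true rows;
-- avg(flag) gives the fraction of rows where flag is true.
SELECT sum(flag), avg(flag) FROM t;
```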
1. Support exporting the `LARGEINT` data type to the parquet/orc file formats.
2. Export the Doris `DATE/DATETIME` types to the `Date/Timestamp` logical types of the parquet file format (see the example below).
3. Fix incorrect data when `DATE` type data is exported to ORC.
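For example, with `SELECT ... INTO OUTFILE` (table, columns, and output path are placeholders; connection properties are omitted):

```SQL
-- LARGEINT can now be written to parquet/orc, and DATE/DATETIME columns
-- are written as the parquet Date/Timestamp logical types.
SELECT largeint_col, date_col, datetime_col
FROM example_table
INTO OUTFILE "hdfs://host:8020/output/result_"
FORMAT AS PARQUET;
```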
Support users manually injecting table-level statistics.
Table stats types:
- row_count
Modify table or partition statistics:
```SQL
ALTER TABLE table_name SET STATS ('k1' = 'v1', ...)
```
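For example, to inject the row count (the only stats type currently supported; the table name is a placeholder):

```SQL
ALTER TABLE example_table SET STATS ('row_count' = '1000000')
```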
TODO:
- support other table stats types if necessary
- update the statistics cache if necessary
1. Refactor aggregate normalization to avoid data amplification before aggregation.
2. Remove useless aggregate processing in ExtractAndNormalizeWindowExpression.
3. Only push down the children of distinct aggregate functions (see the sketch after the TODO list).
TODO:
1. Push down redundant expressions in aggregate functions.
2. Refactor the normalize repeat rule.
3. Move expression normalization and optimization after plan normalization to avoid unexpected expression optimization.
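A hedged sketch of item 3 above (the query shape is assumed, not taken from the PR):

```SQL
-- `a + 1` is the child of a DISTINCT aggregate, so it may be pushed down
-- into a project below the aggregate node; `b + 1` inside sum(b + 1) is
-- not pushed, since it is not the child of a distinct aggregate.
SELECT count(DISTINCT a + 1), sum(b + 1)
FROM t
GROUP BY c;
```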
When the user sets `default_storage_medium` to SSD, the storage medium of all partitions should be SSD,
and the cooldown time should be 9999-12-31 23:59:59,
so that it won't change to HDD.
But it looks like sometimes it still changes to HDD.
So I changed the debug log to info level to observe it.
1. Some encrypt and decrypt functions have the wrong blockEncryptionMode.
2. The topN node should compare tuples from intermediate_row_desc with first_sort_slot.tuple_id.
3. Must keep the limit if it's an uncorrelated in-subquery with a limit on sort, like `select a from t1 where a in (select b from t2 order by xx limit yy)`.
Make Nereids generate more reasonable plans with table row count, but without column stats.
TODO: q5 and q7 are not good because of the column correlation between ps_suppkey and ps_partkey.
If a submitted query contains HMS tables whose data files are compressed (bz2, lzo, lz4, ...), an error like this occurs:
```[INTERNAL_ERROR]Only support csv data in utf8 codec```
This is because `org.apache.doris.planner.external.HiveScanNode` sets `fileFormatType` to `TFileFormatType.FORMAT_CSV_PLAIN` regardless of the actual compression algorithm of the data files. This PR tries to fix this problem.
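As a hedged illustration (catalog and table names are hypothetical), a query like the following over lz4-compressed CSV files used to fail with the error above:

```SQL
SELECT * FROM hive_catalog.db.lz4_csv_table LIMIT 10;
```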
We shouldn't push down the top Project of a Join cluster in PushdownProjectThroughJoin,
like:
```
Project (id + 1)   <- the top Project of the Join cluster
        |
       Join
      /    \
   Join    Join
    /
   ....
  Join
```
Get the last modification time from the file status, and use the combination of path and modification time to generate the cache identifier.
When a file is changed, its modification time changes, so the former cache path becomes invalid.
Fix some hive partition issues.
1. Fix BE crash when using hive partition fields of `date`, `timestamp`, or `decimal` type.
2. Fix an HDFS URI decode error when using a `timestamp` partition field, whose values trigger url-encoding of special chars, e.g. `:` is encoded as `%3A` (example below).
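For example (names hypothetical), a filter on a `timestamp` partition field whose values contain `:` previously hit the decode error, since `:` appears as `%3A` in the partition path:

```SQL
SELECT * FROM hive_catalog.db.example_table
WHERE ts_partition = '2023-01-01 00:00:00';
```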