1. Do not split compressed data files
Some data files in Hive are compressed with gzip, deflate, etc.
These kinds of files cannot be split (see the first sketch after this list).
2. Support the lz4 block codec
For the hive scan node, use the lz4 block codec instead of the lz4 frame codec (the block framing is sketched after this list).
3. Support the snappy block codec
For hadoop-snappy.
4. Optimize `count(*)` queries over csv files
For queries like `select count(*) from tbl`, we only need to split lines, not columns (see the line-counting sketch after this list).
Needs to be cherry-picked to branch-2.0 after this PR: #22304
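A minimal sketch of item 1, with hypothetical `FileSplit`/`is_splittable`/`make_splits` names (not Doris' actual split-generation code): a file compressed with a stream codec such as gzip or deflate is emitted as a single split covering the whole file, while other files are still cut at `block_size` boundaries.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative split descriptor; the names here are hypothetical.
struct FileSplit {
    std::string path;
    int64_t offset;
    int64_t length;
};

// gzip/deflate are stream codecs: decompression must start at byte 0,
// so files with these suffixes cannot be split.
bool is_splittable(const std::string& path) {
    for (const std::string suffix : {".gz", ".gzip", ".deflate"}) {
        if (path.size() >= suffix.size() &&
            path.compare(path.size() - suffix.size(), suffix.size(), suffix) == 0) {
            return false;
        }
    }
    return true;
}

std::vector<FileSplit> make_splits(const std::string& path, int64_t file_size,
                                   int64_t block_size) {
    std::vector<FileSplit> splits;
    if (!is_splittable(path)) {
        splits.push_back({path, 0, file_size}); // whole file as one split
        return splits;
    }
    for (int64_t off = 0; off < file_size; off += block_size) {
        splits.push_back({path, off, std::min(block_size, file_size - off)});
    }
    return splits;
}
```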
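For items 2 and 3, a sketch of decoding one Hadoop block-codec block, the framing used by hadoop-lz4 and hadoop-snappy (as opposed to the lz4 frame format): a big-endian uncompressed length followed by chunks that each carry their own big-endian compressed length. `LZ4_decompress_safe` is the real liblz4 call; the surrounding names are illustrative, and for hadoop-snappy the per-chunk decompressor would be snappy instead.

```cpp
#include <lz4.h>   // LZ4_decompress_safe

#include <cstddef>
#include <cstdint>
#include <stdexcept>

// 4-byte big-endian length, as written by Hadoop's block compressor stream.
static uint32_t read_be32(const uint8_t* p) {
    return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
           (uint32_t(p[2]) << 8) | uint32_t(p[3]);
}

// Decode one block: [uncompressed length][compressed chunk length, chunk]...
size_t decode_hadoop_lz4_block(const uint8_t* in, size_t in_len,
                               char* out, size_t out_cap) {
    if (in_len < 4) throw std::runtime_error("truncated block header");
    size_t pos = 4;
    size_t produced = 0;
    uint32_t remaining = read_be32(in);
    while (remaining > 0) {
        if (pos + 4 > in_len) throw std::runtime_error("truncated chunk header");
        uint32_t chunk_len = read_be32(in + pos);
        pos += 4;
        int n = LZ4_decompress_safe(reinterpret_cast<const char*>(in + pos),
                                    out + produced,
                                    static_cast<int>(chunk_len),
                                    static_cast<int>(out_cap - produced));
        if (n < 0) throw std::runtime_error("bad LZ4 chunk");
        pos += chunk_len;
        produced += static_cast<size_t>(n);
        remaining -= static_cast<uint32_t>(n);
    }
    return produced;
}
```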
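For item 4, a sketch of the line-counting idea, assuming an unquoted single-character line delimiter (the real reader also has to respect enclosing quotes and escapes): to answer `select count(*)` we only scan for line delimiters and never split columns.

```cpp
#include <cstddef>
#include <cstring>

// Count rows in a CSV buffer by scanning for line delimiters only.
size_t count_csv_rows(const char* buf, size_t len, char line_delim = '\n') {
    size_t rows = 0;
    const char* p = buf;
    const char* end = buf + len;
    while (p < end) {
        const char* nl = static_cast<const char*>(std::memchr(p, line_delim, end - p));
        ++rows;
        if (nl == nullptr) break;   // last line has no trailing delimiter
        p = nl + 1;
    }
    return rows;
}
```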
Problem:
A result is returned even when a wrong ak/sk/bucket name is used, such as:
```sql
mysql> select * from demo.student
-> into outfile "s3://xxxx/exp_"
-> format as csv
-> properties(
-> "s3.endpoint" = "https://cos.ap-beijing.myqcloud.com",
-> "s3.region" = "ap-beijing",
-> "s3.access_key"= "xxx",
-> "s3.secret_key" = "yyyy"
-> );
+------------+-----------+----------+----------------------------------------------------------------------------------------------------+
| FileNumber | TotalRows | FileSize | URL |
+------------+-----------+----------+----------------------------------------------------------------------------------------------------+
| 1 | 3 | 26 | s3://xxxx/exp_2ae166e2981d4c08-b577290f93aa82ba_ |
+------------+-----------+----------+----------------------------------------------------------------------------------------------------+
1 row in set (0.15 sec)
```
The reason is that we did not catch the error returned in the `close()` phase.
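A hedged sketch of the fix idea, using a stand-in `Status`/`S3FileWriter` rather than Doris' actual classes: with a wrong ak/sk/bucket the failure typically surfaces only when the upload is finalized, so the status returned by `close()` must be checked and propagated instead of being dropped.

```cpp
#include <string>
#include <utility>

// Stand-in Status type; illustrative only.
struct Status {
    bool is_ok;
    std::string msg;
    static Status OK() { return {true, ""}; }
    static Status IOError(std::string m) { return {false, std::move(m)}; }
    bool ok() const { return is_ok; }
};

// Stand-in writer: with a wrong ak/sk/bucket the error often appears only
// when the (multipart) upload is completed, i.e. in close().
struct S3FileWriter {
    Status close() { return Status::IOError("AccessDenied on complete upload"); }
};

// Before the fix the return value of close() was ignored, so the outfile
// statement reported FileNumber/TotalRows as if it had succeeded.
Status finish_outfile(S3FileWriter& writer) {
    Status st = writer.close();
    if (!st.ok()) {
        return st;   // propagate the error to the client instead of swallowing it
    }
    return Status::OK();
}
```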
Fix incorrect results when partition fields in orc files are null.
### Root Cause
Theoretically, the underlying files of a Hive partitioned table should not contain partition columns. However, we found that in some user scenarios the partition columns do exist in the underlying orc/parquet files and hold null values. As a result, the pushed-down predicates on these partition columns compare against the null values and filter incorrectly.
### Solution
We handle this case by reading only the non-partition columns. The parquet reader already works this way; this PR applies the same handling to the orc reader.
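A minimal sketch of the approach with illustrative names (not the real reader interface): the column list handed to the ORC reader excludes partition columns even when they physically exist in the file, and their values are materialized from the partition spec instead, so the nulls stored in the file never reach the pushed-down filters.

```cpp
#include <set>
#include <string>
#include <vector>

// Build the column list passed to the ORC reader: drop partition columns even
// if they physically exist in the file; their values come from the partition
// key/value pairs, not from the file.
std::vector<std::string> orc_read_columns(const std::vector<std::string>& required_columns,
                                          const std::set<std::string>& partition_columns) {
    std::vector<std::string> read_columns;
    for (const auto& col : required_columns) {
        if (partition_columns.count(col) == 0) {
            read_columns.push_back(col);   // read from the ORC file
        }
        // else: fill the column from the partition spec
    }
    return read_columns;
}
```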
Nereids doesn't support view-based table-valued functions, because a tvf view doesn't contain the proper qualifier (catalog, db and table name). This PR adds support for it.
Also fixes a bug where the explain output exprs of a Nereids table-valued function were incorrect.
* Revert "[fix](testcase) fix test case failure of insert null value into not null column (#20963)"
This reverts commit 55a6649da962fb170ddb40fea8ef26bdc552a51a.
Manually revert "fix in strict mode, return error for insert if datatype convert fails (#20378)"
This manually reverts commit 1b94b6368f5e871c9a0fe53dd7c64409079a4c9d
* fix case failure
* support int128 in jsonb
* fix jsonb int128 write
* fix jsonb to json int128
* fix json functions for int128
* add nereids function jsonb_extract_largeint
* add testcase for json int128
* change docs for json int128
* add nereids function jsonb_extract_largeint
* clang format
* fix check style
* using int128_t = __int128_t for all int128
* use fmt::format_to instead of snprintf digit by digit for int128 (see the sketch after this list)
* clang format
* delete useless check
* add warn log
* clang format
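A small sketch of the formatting change referenced above, assuming fmt is built with 128-bit integer support (FMT_USE_INT128, the default when the compiler provides __int128); the helper name is made up for the example.

```cpp
#include <fmt/format.h>

#include <iterator>
#include <string>

using int128_t = __int128_t;   // unified int128 alias, as in this change

// Format an int128 with fmt in one call instead of emitting digits one by one
// with snprintf.
std::string int128_to_string(int128_t v) {
    fmt::memory_buffer buf;
    fmt::format_to(std::back_inserter(buf), "{}", v);
    return std::string(buf.data(), buf.size());
}
```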
This problem is caused by #21197.
Fixed an issue where the `csv_with_names` and `csv_with_names_and_types` file formats could not be exported by the Nereids optimizer when using `select...into outfile`.
1. rename TVFProperties to Properties
2. add the generating functions explode and explode_outer
3. fix that concat_ws could not be applied to array
4. check the format of tokenize's second argument on FE
5. add test cases for concat_ws, tokenize, explode, explode_outer and split_by_string
- add config `enable_col_auth` to temporarily disable column permissions (because the old/new planner has a bug when selecting from a view)
- restore the old optimizer to the previous authentication method
- support authentication for the new optimizer (legacy issue: when querying a view, the permissions of the base table are authenticated; the view's own permissions should be authenticated and handled after the new optimizer is improved)
- fix: show grants for non-existent users
- fix: the `admin` role cannot grant/revoke to/from users
Problem: the tablet specified in `select...from tbl tablet()` is ignored when there are predicates, such as:
```sql
-- All the data:
mysql> select * from student3;
+------+------+------+
| id | name | age |
+------+------+------+
| 1 | ftw | 18 |
| 3 | yy | 19 |
| 4 | xx | 21 |
| 2 | cyx | 20 |
+------+------+------+
-- When we specify a tablet to read:
mysql> select * from student3 tablet(131131);
+------+------+------+
| id | name | age |
+------+------+------+
| 1 | ftw | 18 |
| 3 | yy | 19 |
+------+------+------+
-- However, when predicates exist, tablet(131131) is ignored:
mysql> select * from student3 tablet(131131) where id > 1;
+------+------+------+
| id | name | age |
+------+------+------+
| 4 | xx | 21 |
| 3 | yy | 19 |
| 2 | cyx | 20 |
+------+------+------+
```
After the fix, we get the expected data:
```sql
mysql> select * from student3 tablet(131131) where id > 1;
+------+------+------+
| id | name | age |
+------+------+------+
| 3 | yy | 19 |
+------+------+------+
```
fix array index functions with decimal
In the old analyzer, SQL using array_position or array_contains with decimal may lose precision, which makes the result wrong.
The changes of this PR for JdbcOracleClient are as follows:
#### bug fixes:
1. Fix the problem that when synchronizing the schema of a table whose name contains `/` characters, the synchronized columns could be mixed up with those of a similarly named table
2. Fix the NPE problem in metadata synchronization after enabling the lower_case_table_names configuration
#### improvement:
1. Change the way Oracle users are mapped to Doris databases during synchronization: use `metadata.getSchemas` instead of `SELECT DISTINCT OWNER FROM all_tables`
2. When synchronizing metadata, change `null` at the catalog level to `conn.getCatalog()`
A row of a complex type may be stored across two (or more) pages, and the parameter `align_rows` indicates whether the reader should read the remaining values of the last row in the previous page.
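A rough sketch of what `align_rows` means in terms of Parquet repetition levels (a repetition level of 0 marks the first value of a new row); the struct and function names are illustrative, not the reader's real interface.

```cpp
#include <cstddef>
#include <vector>

// Values decoded from one page of a nested (complex-type) column.
struct PageValues {
    std::vector<int> rep_levels;   // repetition levels; 0 starts a new row
};

// How many leading values of this page still belong to the last row of the
// previous page. With align_rows = true the reader consumes these values
// first, so a row is never cut in half between two reads.
size_t remaining_values_of_previous_row(const PageValues& page, bool align_rows) {
    if (!align_rows) return 0;
    size_t n = 0;
    while (n < page.rep_levels.size() && page.rep_levels[n] != 0) {
        ++n;   // rep_level > 0: this value continues the previous row
    }
    return n;
}
```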
`ParquetReader` confuses the logical/physical/slot ids of columns. When only scalar types are read, nothing goes wrong, but when complex types are read, `RowGroup` and `PageIndex` get wrong statistics. Therefore, if a query contains complex types and pushed-down predicates, the result set is likely to be incorrect.
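A hedged illustration of the id confusion (all names here are made up for the sketch): a complex column expands to several physical leaf columns in the file, so a slot's position in the query no longer matches the leaf index that `RowGroup`/`PageIndex` statistics are keyed by, and the lookup has to go through the file schema.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// One query column (slot) maps to a schema field; each scalar leaf of that
// field has a physical column index in the row group / page index.
struct FieldSchema {
    std::string name;
    std::vector<size_t> physical_leaf_indices;   // leaves in file order
};

// Look up the physical index through the schema instead of reusing the slot's
// position, which only happens to work when every column is a scalar.
size_t physical_index_for_stats(
        const std::unordered_map<std::string, FieldSchema>& schema,
        const std::string& slot_name) {
    return schema.at(slot_name).physical_leaf_indices.front();
}
```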