Commit Graph

19 Commits

Author SHA1 Message Date
eaf2a6a80e [fix](date) return right date value even if out of the range of date dictionary(#23664)
PR(https://github.com/apache/doris/pull/22360) and PR(https://github.com/apache/doris/pull/22384) optimized the performance of date type. However hive supports date out of 1970~2038, leading wrong date value in tpcds benchmark.
How to fix:
1. Increase dictionary range: 1900 ~ 2038
2. The date out of 1900 ~ 2038 is regenerated.
2023-09-01 14:40:20 +08:00
40be6a0b05 [fix](hive) do not split compress data file and support lz4/snappy block codec (#23245)
1. do not split compress data file
Some data file in hive is compressed with gzip, deflate, etc.
These kinds of file can not be splitted.

2. Support lz4 block codec
for hive scan node, use lz4 block codec instead of lz4 frame codec

4. Support snappy block codec
For hadoop snappy

5. Optimize the `count(*)` query of csv file
For query like `select count(*) from tbl`, only need to split the line, no need to split the column.

Need to pick to branch-2.0 after this PR: #22304
2023-08-26 12:59:05 +08:00
8af1e7f27f [Fix](orc-reader) Fix incorrect result if null partition fields in orc file. (#23369)
Fix incorrect result if null partition fields in orc file. 

### Root Cause
Theoretically, the underlying file of the hive partition table should not contain partition fields. But we found that in some user scenarios, the partition field will exist in the underlying orc/parquet file and are null values. As a result, the  pushed down partition field which are null values. filter incorrectly.

### Solution
we handle this case by only reading non-partition fields. The parquet reader is already handled this way, this PR handles the orc reader.
2023-08-26 00:13:11 +08:00
00826185c1 [fix](tvf view)Support Table valued function view for nereids (#23317)
Nereids doesn't support view based table value function, because tvf view doesn't contain the proper qualifier (catalog, db and table name). This pr is to support this function.

Also, fix nereids table value function explain output exprs incorrect bug.
2023-08-25 21:23:16 +08:00
29273771f7 [Fix](multi-catalog) Fix hive incorrect result by disable string dict filter if exprs contain null expr. (#23361)
Issue Number: close #21960

Fix hive incorrect result by disable string dict filter if exprs contain null expr.
2023-08-25 21:16:43 +08:00
ceb931c513 [regression-test](hdfs_tvf)append regression test that hdfs_tvf read compression file (#23454) 2023-08-25 09:00:21 +08:00
9d2e23b1aa [fix](parquet) A row of complex type may be stored across more pages (#23277)
A row of complex type may be stored across two(or more) pages, and the parameter `align_rows` indicates that whether the reader should read the remaining value of the last row in previous page.
2023-08-22 14:47:10 +08:00
6c8af92175 [fix])(nereids)Support select catalog.db.table.column from xxx for nereids planner. #23221
Nereids doesn't support select table.* from table, this pr is to fix this bug.
Support three layer qualifier. (catalog.database.table)
2023-08-22 13:58:25 +08:00
5ff7b57fc1 [fix](parquet) parquet reader confuses logical/physical/slot id of columns (#23198)
`ParquetReader` confuses logical/physical/slot id of columns. If only reading the scalar types, there's nothing wrong, but when reading complex types, `RowGroup` and `PageIndex` will get wrong statistics. Therefore, if the query contains complex types and pushed-down predicates, the probability of the result set is incorrect.
2023-08-22 13:35:29 +08:00
4bf055c818 [fix](parquet) the key colum of map type in parquet may be nullable (#23180)
Fix errors when reading map type with nullable key column in parquet file. `ParquetReader` support to read nullable key column, but add a check to prevent reading nullable key column. Unfortunately, this check error was not thrown correctly, causing the BE to crash, and thrown meaningless error logs in be.out:
```
...
11# doris::vectorized::ParquetReader::get_columns(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, doris::TypeDescriptor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, doris::TypeDescriptor> > >*, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*) at /root/doris/be/src/vec/exec/format/parquet/vparquet_reader.cpp:508
12# doris::vectorized::VFileScanner::_get_next_reader() in /root/yun_you_external/output/be/lib/doris_be
13# doris::vectorized::VFileScanner::_get_block_impl(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /root/doris/be/src/vec/exec/scan/vfile_scanner.cpp:241
...
```
2023-08-20 22:59:18 +08:00
97fa840324 [feature](multi-catalog)support iceberg hadoop catalog external table query (#22949)
support iceberg hadoop catalog external table query
2023-08-20 19:29:25 +08:00
7c4870c371 [fix](catalog) fix hive partition prune bug on nereids (#23026) 2023-08-18 18:31:01 +08:00
795006ea3d [fix](multi-catalog) conversion of compatible numerical types (#23113)
Hive support schema change, but doesn't rewrite the parquet file, so the physical type of parquet file may not equal the logical type of table schema.
2023-08-18 14:05:33 +08:00
314f5a5143 [Fix](orc-reader) Fix filling partition or missing column used incorrect row count. (#23096)
[Fix](orc-reader) Fix filling partition or missing column used incorrect row count.

`_row_reader->nextBatch` returns number of read rows. When orc lazy materialization is turned on, the number of read rows includes filtered rows, so caller must look at `numElements` in the row batch to determine how
many rows were not filtered which will to fill to the block.

In this case, filling partition or missing column used incorrect row count which will cause be crash by `filter.size() != offsets.size()` in filter column step.

When orc lazy materialization is turned off, add `_convert_dict_cols_to_string_cols(block, nullptr)` if `(block->rows() == 0)`.
2023-08-17 23:26:11 +08:00
fa6110accd [fix](catalog)paimon support more data type (#22899) 2023-08-14 13:48:33 +08:00
41ff48f838 [regresstion][external]fix case test_show_where and es_query 0811 (#22898) 2023-08-12 19:41:55 +08:00
f1db6bd8c1 [feature](hive)append support for struct and map column type on textfile format of hive table (#22347)
1. append support for struct and map column type on textfile format  of hive table.
2. optimizer code that array column type.

```mysql
+------+------------------------------------+
| id   | perf                               |
+------+------------------------------------+
| 1    | {"key1":"value1", "key2":"value2"} |
| 1    | {"key1":"value1", "key2":"value2"} |
| 2    | {"name":"John", "age":"30"}        |
+------+------------------------------------+
```

```mysql
+---------+------------------+
| column1 | column2          |
+---------+------------------+
|       1 | {10, "data1", 1} |
|       2 | {20, "data2", 0} |
|       3 | {30, "data3", 1} |
+---------+------------------+
```
Summarizes support for complex types(support assign delimiter) :

1. array< primitive_type > and array< array< ... > >
2. map< primitive_type , primitive_type >
3. Struct< primitive_type , primitive_type ... >
2023-08-10 13:47:58 +08:00
91b15183e7 [enhance][external]enhance and fix external cases 0807 (#22689)
enhance and fix external cases 0807
2023-08-08 10:53:08 +08:00
c31226b144 [refractor](regression-test) sort out test cases of external tables (#22640)
sort out the test cases of external table.
After modify, there are 2 directories:

1. `external_table_p0`: all p0 cases of external tables: hive, es, jdbc and tvf
2. `external_table_p2`: all p2 cases of external tables: hive, es, mysql, pg, iceberg and tvf

So that we can run it with one line command like:

```
sh run-regression-test.sh --run -d external_table_p0,external_table_p2
```
2023-08-07 11:12:30 +08:00