Commit Graph

29 Commits

Author SHA1 Message Date
b4432ce577 [Feature](statistics)Support external table analyze partition (#24154)
Enable collect partition level stats for hive external table.
2023-09-18 14:59:26 +08:00
f3e350e8ec [Improvement](statistics)Improve statistics user experience (#24414)
Two improvements:
1. Move the `Job_id` column for the return info of `Analyze table` command to the first column. To keep consistent with `show analyze`.
```
mysql> analyze table hive.tpch100.region;
+--------+--------------+-------------------------+------------+--------------------------------+
| Job_Id | Catalog_Name | DB_Name                 | Table_Name | Columns                        |
+--------+--------------+-------------------------+------------+--------------------------------+
| 14403  | hive         | default_cluster:tpch100 | region     | [r_regionkey,r_comment,r_name] |
+--------+--------------+-------------------------+------------+--------------------------------+
1 row in set (0.03 sec)
```
2. Add `analyze_timeout` session variable, to control `analyze table/database with sync` timeout.
2023-09-18 13:36:41 +08:00
4dad7c94da [fix](orc) fix the count(*) pushdown issue in orc format (#24446)
In previous, when querying hive table in orc format, and the file is splitted.
the result of select count(*) may be multiple of the real row number.

This is because the number of rows should be got after orc strip prune,
otherwise, it may return wrong result
2023-09-16 09:57:39 +08:00
c5ef6cfea2 [fix](Table-Valued Function) fix be core when user sepcified empty column_separator using hdfs tvf (#24369) 2023-09-14 23:19:48 +08:00
e30c3f3a65 [fix](csv_reader)fix bug that Read garbled files caused be crash. (#24164)
fix bug that read garbled files caused be crash.
2023-09-13 14:12:55 +08:00
ebe3749996 [fix](tvf)support s3,local compress_type and append regression test (#24055)
support s3,local compress_type and append regression test.
2023-09-13 00:32:59 +08:00
9df72a96f3 [Feature](multi-catalog) Support hadoop viewfs. (#24168)
### Feature

Support hadoop viewfs.

### Test

- Regression tests: 
  - hive viewfs test.
  - tvf viewfs test.

- Broker load with broker and with hdfs tests manually.
2023-09-13 00:20:12 +08:00
6e28d878b5 [fix](hudi) compatible with hudi spark configuration and support skip merge (#24067)
Fix three bugs:
1. Hudi slice maybe has log files only, so `new Path(filePath)`  will throw errors.
2. Hive column names are lowercase only, so match column names in ignore-case-mode.
3.  Compatible with [Spark Datasource Configs](https://hudi.apache.org/docs/configurations/#Read-Options), so users can add `hoodie.datasource.merge.type=skip_merge` in catalog properties to skip merge logs files.
2023-09-11 19:54:59 +08:00
f9a75b5c4f [feature](csv_serde)1.append csv serde for serialize to csv and deserialize from csv. 2.let csvReader use csv serde not text_converter. (#23352)
1. append csv serde for serialize to csv and deserialize from csv.
2. let csvReader use csv serde not text_converter.
2023-09-10 00:16:21 +08:00
2cb7536c6c [fix](regression)fix case test_external_catalog_es (#23908) 2023-09-06 12:12:43 +08:00
eaf2a6a80e [fix](date) return right date value even if out of the range of date dictionary(#23664)
PR(https://github.com/apache/doris/pull/22360) and PR(https://github.com/apache/doris/pull/22384) optimized the performance of date type. However hive supports date out of 1970~2038, leading wrong date value in tpcds benchmark.
How to fix:
1. Increase dictionary range: 1900 ~ 2038
2. The date out of 1900 ~ 2038 is regenerated.
2023-09-01 14:40:20 +08:00
40be6a0b05 [fix](hive) do not split compress data file and support lz4/snappy block codec (#23245)
1. do not split compress data file
Some data file in hive is compressed with gzip, deflate, etc.
These kinds of file can not be splitted.

2. Support lz4 block codec
for hive scan node, use lz4 block codec instead of lz4 frame codec

4. Support snappy block codec
For hadoop snappy

5. Optimize the `count(*)` query of csv file
For query like `select count(*) from tbl`, only need to split the line, no need to split the column.

Need to pick to branch-2.0 after this PR: #22304
2023-08-26 12:59:05 +08:00
8af1e7f27f [Fix](orc-reader) Fix incorrect result if null partition fields in orc file. (#23369)
Fix incorrect result if null partition fields in orc file. 

### Root Cause
Theoretically, the underlying file of the hive partition table should not contain partition fields. But we found that in some user scenarios, the partition field will exist in the underlying orc/parquet file and are null values. As a result, the  pushed down partition field which are null values. filter incorrectly.

### Solution
we handle this case by only reading non-partition fields. The parquet reader is already handled this way, this PR handles the orc reader.
2023-08-26 00:13:11 +08:00
00826185c1 [fix](tvf view)Support Table valued function view for nereids (#23317)
Nereids doesn't support view based table value function, because tvf view doesn't contain the proper qualifier (catalog, db and table name). This pr is to support this function.

Also, fix nereids table value function explain output exprs incorrect bug.
2023-08-25 21:23:16 +08:00
29273771f7 [Fix](multi-catalog) Fix hive incorrect result by disable string dict filter if exprs contain null expr. (#23361)
Issue Number: close #21960

Fix hive incorrect result by disable string dict filter if exprs contain null expr.
2023-08-25 21:16:43 +08:00
ceb931c513 [regression-test](hdfs_tvf)append regression test that hdfs_tvf read compression file (#23454) 2023-08-25 09:00:21 +08:00
9d2e23b1aa [fix](parquet) A row of complex type may be stored across more pages (#23277)
A row of complex type may be stored across two(or more) pages, and the parameter `align_rows` indicates that whether the reader should read the remaining value of the last row in previous page.
2023-08-22 14:47:10 +08:00
6c8af92175 [fix])(nereids)Support select catalog.db.table.column from xxx for nereids planner. #23221
Nereids doesn't support select table.* from table, this pr is to fix this bug.
Support three layer qualifier. (catalog.database.table)
2023-08-22 13:58:25 +08:00
5ff7b57fc1 [fix](parquet) parquet reader confuses logical/physical/slot id of columns (#23198)
`ParquetReader` confuses logical/physical/slot id of columns. If only reading the scalar types, there's nothing wrong, but when reading complex types, `RowGroup` and `PageIndex` will get wrong statistics. Therefore, if the query contains complex types and pushed-down predicates, the probability of the result set is incorrect.
2023-08-22 13:35:29 +08:00
4bf055c818 [fix](parquet) the key colum of map type in parquet may be nullable (#23180)
Fix errors when reading map type with nullable key column in parquet file. `ParquetReader` support to read nullable key column, but add a check to prevent reading nullable key column. Unfortunately, this check error was not thrown correctly, causing the BE to crash, and thrown meaningless error logs in be.out:
```
...
11# doris::vectorized::ParquetReader::get_columns(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, doris::TypeDescriptor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, doris::TypeDescriptor> > >*, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*) at /root/doris/be/src/vec/exec/format/parquet/vparquet_reader.cpp:508
12# doris::vectorized::VFileScanner::_get_next_reader() in /root/yun_you_external/output/be/lib/doris_be
13# doris::vectorized::VFileScanner::_get_block_impl(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /root/doris/be/src/vec/exec/scan/vfile_scanner.cpp:241
...
```
2023-08-20 22:59:18 +08:00
97fa840324 [feature](multi-catalog)support iceberg hadoop catalog external table query (#22949)
support iceberg hadoop catalog external table query
2023-08-20 19:29:25 +08:00
7c4870c371 [fix](catalog) fix hive partition prune bug on nereids (#23026) 2023-08-18 18:31:01 +08:00
795006ea3d [fix](multi-catalog) conversion of compatible numerical types (#23113)
Hive support schema change, but doesn't rewrite the parquet file, so the physical type of parquet file may not equal the logical type of table schema.
2023-08-18 14:05:33 +08:00
314f5a5143 [Fix](orc-reader) Fix filling partition or missing column used incorrect row count. (#23096)
[Fix](orc-reader) Fix filling partition or missing column used incorrect row count.

`_row_reader->nextBatch` returns number of read rows. When orc lazy materialization is turned on, the number of read rows includes filtered rows, so caller must look at `numElements` in the row batch to determine how
many rows were not filtered which will to fill to the block.

In this case, filling partition or missing column used incorrect row count which will cause be crash by `filter.size() != offsets.size()` in filter column step.

When orc lazy materialization is turned off, add `_convert_dict_cols_to_string_cols(block, nullptr)` if `(block->rows() == 0)`.
2023-08-17 23:26:11 +08:00
fa6110accd [fix](catalog)paimon support more data type (#22899) 2023-08-14 13:48:33 +08:00
41ff48f838 [regresstion][external]fix case test_show_where and es_query 0811 (#22898) 2023-08-12 19:41:55 +08:00
f1db6bd8c1 [feature](hive)append support for struct and map column type on textfile format of hive table (#22347)
1. append support for struct and map column type on textfile format  of hive table.
2. optimizer code that array column type.

```mysql
+------+------------------------------------+
| id   | perf                               |
+------+------------------------------------+
| 1    | {"key1":"value1", "key2":"value2"} |
| 1    | {"key1":"value1", "key2":"value2"} |
| 2    | {"name":"John", "age":"30"}        |
+------+------------------------------------+
```

```mysql
+---------+------------------+
| column1 | column2          |
+---------+------------------+
|       1 | {10, "data1", 1} |
|       2 | {20, "data2", 0} |
|       3 | {30, "data3", 1} |
+---------+------------------+
```
Summarizes support for complex types(support assign delimiter) :

1. array< primitive_type > and array< array< ... > >
2. map< primitive_type , primitive_type >
3. Struct< primitive_type , primitive_type ... >
2023-08-10 13:47:58 +08:00
91b15183e7 [enhance][external]enhance and fix external cases 0807 (#22689)
enhance and fix external cases 0807
2023-08-08 10:53:08 +08:00
c31226b144 [refractor](regression-test) sort out test cases of external tables (#22640)
sort out the test cases of external table.
After modify, there are 2 directories:

1. `external_table_p0`: all p0 cases of external tables: hive, es, jdbc and tvf
2. `external_table_p2`: all p2 cases of external tables: hive, es, mysql, pg, iceberg and tvf

So that we can run it with one line command like:

```
sh run-regression-test.sh --run -d external_table_p0,external_table_p2
```
2023-08-07 11:12:30 +08:00