Commit Graph

106 Commits

Author SHA1 Message Date
37d1519316 [WIP](dynamic-table) support dynamic schema table (#16335)
Issue Number: close #16351

A dynamic schema table is a special type of table whose schema changes with the loading procedure. We implemented this feature mainly for semi-structured data such as JSON: since JSON is self-describing, we can extract schema info from the original documents and infer the final type information. This special table reduces manual schema change operations and makes it easy to import semi-structured data while extending the schema automatically.
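For illustration, a minimal sketch of the kind of schema inference described above, not Doris's actual implementation: each JSON document contributes its fields, and conflicting observations are widened to a common type. The rapidjson usage, the `InferredType` enum, and the widening rules are all assumptions for this example.
```cpp
#include <rapidjson/document.h>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical inferred column types for the sketch.
enum class InferredType { Bool, Int64, Double, String, Unknown };

static InferredType infer(const rapidjson::Value& v) {
    if (v.IsBool()) return InferredType::Bool;
    if (v.IsInt64()) return InferredType::Int64;
    if (v.IsDouble()) return InferredType::Double;
    if (v.IsString()) return InferredType::String;
    return InferredType::Unknown;
}

// Widen the previously seen type to cover a newly observed one,
// e.g. Int64 + Double -> Double; anything conflicting falls back to String.
static InferredType widen(InferredType a, InferredType b) {
    if (a == b) return a;
    if ((a == InferredType::Int64 && b == InferredType::Double) ||
        (a == InferredType::Double && b == InferredType::Int64))
        return InferredType::Double;
    return InferredType::String;
}

int main() {
    std::vector<std::string> docs = {
        R"({"id": 1, "tags": "a"})",
        R"({"id": 2.5, "name": "x"})",
    };
    std::map<std::string, InferredType> schema;
    for (const auto& doc : docs) {
        rapidjson::Document d;
        d.Parse(doc.c_str());
        for (auto it = d.MemberBegin(); it != d.MemberEnd(); ++it) {
            auto t = infer(it->value);
            auto [pos, inserted] = schema.emplace(it->name.GetString(), t);
            if (!inserted) pos->second = widen(pos->second, t);
        }
    }
    // id -> Double (widened from Int64), name -> String, tags -> String
    for (auto& [k, v] : schema) std::cout << k << " -> " << (int)v << "\n";
}
```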
2023-02-11 13:37:50 +08:00
c1a1275870 [fix](memory) Fix parquet load stack overflow (#16537) 2023-02-10 08:48:12 +08:00
27216dc7e0 [improvement](multi-catalog) push down all predicates into rowgroup/page filtering for ParquetReader (#16388)
Two improvements:
1. Refactor rowgroup & page filtering in `ParquetReader`, and use operator overloading on Doris native C++ types to process comparisons (see the sketch after this list).
2. Support decimal/decimal v3/date/datev2/datetime/datetimev2
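As a sketch of the idea in item 1: once min/max statistics are decoded into a native C++ type, one templated pruning routine serves every supported type through `operator<`. The names `ColumnStats`, `can_skip_eq`, and `can_skip_lt` are hypothetical, not the PR's actual API.
```cpp
// A minimal, hypothetical sketch of statistics-based pruning. With the
// rowgroup/page statistics decoded into a Doris-native C++ type T, the same
// templated code handles int, double, decimal, date, datetime, ... as long
// as T provides operator<.
template <typename T>
struct ColumnStats {
    T min_value;
    T max_value;
};

// Predicate "col = value": the row group / page can be skipped when the
// value falls outside [min, max].
template <typename T>
bool can_skip_eq(const ColumnStats<T>& stats, const T& value) {
    return value < stats.min_value || stats.max_value < value;
}

// Predicate "col < value": skippable when even the minimum is not smaller.
template <typename T>
bool can_skip_lt(const ColumnStats<T>& stats, const T& value) {
    return !(stats.min_value < value);
}
```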
2023-02-07 11:32:57 +08:00
b1b2697cc7 [fix](iceberg) fix iceberg catalog (#16372)
1. Fix iceberg catalog access to S3
2. Fix iceberg catalog partition table query
3. Fix persistence
2023-02-05 13:15:28 +08:00
d2b5015d3f [enhancement](profile) add the profile counter RawRowsRead to record the rows read from the parquet file (#16328) 2023-02-04 22:59:34 +08:00
Pxl
5e4bb98900 [Chore](build) enable -Wpedantic and update lowest gcc version to 11.1 (#16290)
2023-02-03 11:28:48 +08:00
9618427020 [improvement](multi-catalog) increase default batch_size to 4064 (#16326)
The performance of ClickBench Q30 is affected by batch_size:
| batch_size | 1024 | 4096 | 20480 |
| -- | -- | -- | -- |
| Q30 query time (s) | 2.27 | 1.08 | 0.62 |

Because the aggregation operator creates a new result block for each batch block, and Q30 has 90 columns, this is time-consuming. A larger batch_size decreases the number of aggregation blocks, and therefore improves performance.

The Doris internal reader reads at least 4064 rows even if batch_size < 4064, so this PR keeps the process of reading external tables the same as for internal tables.
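A tiny sketch of the floor being described; the constant and function names are hypothetical, and only the value 4064 comes from the text above.
```cpp
#include <algorithm>
#include <cstddef>

// Assumed constant name; 4064 is the internal reader's minimum batch.
constexpr size_t kInternalMinBatchRows = 4064;

// External readers return at least as many rows as the internal reader
// would, so both paths produce comparably sized blocks.
size_t effective_batch_rows(size_t configured_batch_size) {
    return std::max(configured_batch_size, kInternalMinBatchRows);
}
```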
2023-02-02 11:51:09 +08:00
1c5279d26e [fix](multi-catalog) remove the eof check among parquet columns (#16302)
Read parquet file failed:
```
ERROR 1105 (HY000): errCode = 2, detailMessage = [INTERNAL_ERROR]Read parquet file xxx failed, reason = [CORRUPTION]The number of rows are not equal among parquet columns
```
This error may be thrown when reading non-predicate columns in lazy-read, for example:
A row group with 1000 rows has two non-predicate columns.
Column A has one page; column B has two pages with 500 rows each.
The read range of `ParquetColumnReader` is [0, 400), and the rows in [0, 450) are all filtered by predicate columns.
So column A can skip its first page and reach EOF, while column B can also skip its first page but does not reach EOF.
2023-02-02 09:22:09 +08:00
b878a7e61e [feature](Load)Support skipping a specific number of lines for csv stream load (#16055)
Support setting the number of lines to skip when loading a csv file via stream load.

Usage `-H skip_lines:number`:
```
curl --location-trusted -u root: -T test.csv -H skip_lines:5  -XPUT http://127.0.0.1:8030/api/testDb/testTbl/_stream_load
```

The skip-lines option can also be used in mysql load, as below:
```sql
LOAD DATA
LOCAL
INFILE '${mysql_load_skip_lines}'
INTO TABLE ${tableName}
COLUMNS TERMINATED BY ','
IGNORE 2 LINES
PROPERTIES ("auth" = "root:");
```
2023-02-01 20:42:43 +08:00
fa14b7ea9c [Enhancement](icebergv2) Optimize the position delete file filtering mechanism in iceberg v2 parquet reader (#16024)
close #16023
2023-01-28 00:04:27 +08:00
1589d453a3 [fix](multi catalog)Support parquet and orc upper case column name (#16111)
External hms catalog table column names in doris are all in lower case,
while an iceberg table or a spark-sql created hive table may contain upper case column names,
which will cause empty query results. This pr fixes this bug.
1. For parquet files, transfer all column names to lower case while parsing parquet metadata.
2. For orc files, store the original column names and the lower case column names in two vectors, and use the suitable names in different cases (a sketch follows this list).
3. On the FE side, change the column name back to the original column name for iceberg when doing convertToIcebergExpr.
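A hypothetical sketch of the two-sided bookkeeping from items 1 and 2: keep the file's original names next to their lower-cased forms so Doris's lower-case lookups can be mapped back. None of these identifiers are the PR's actual names.
```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>
#include <vector>

static std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return std::tolower(c); });
    return s;
}

// Keep the original names alongside a lower-case index so lookups from
// Doris (always lower case) can be resolved to the file's real columns.
struct ColumnNameMap {
    std::vector<std::string> original;  // names as stored in the file
    std::unordered_map<std::string, size_t> lower_to_idx;

    void add(const std::string& file_name) {
        lower_to_idx.emplace(to_lower(file_name), original.size());
        original.push_back(file_name);
    }

    // Resolve a lower-case Doris column name to the file's original name.
    const std::string* resolve(const std::string& doris_name) const {
        auto it = lower_to_idx.find(doris_name);
        return it == lower_to_idx.end() ? nullptr : &original[it->second];
    }
};
```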
2023-01-27 23:52:11 +08:00
23edb3de5a [fix](icebergv2) fix bug that delete file reader is not opened (#16133)
PR #15836 changed the way the parquet reader is used: first open(), then init_reader().
But we forgot to call open() for the iceberg delete file, which caused a coredump.
2023-01-24 10:19:46 +08:00
199d7d3be8 [Refactor]Merged string_value into string_ref (#15925) 2023-01-22 16:39:23 +08:00
de12957057 [debug](ParquetReader) print file path if failed to read parquet file (#16118) 2023-01-21 08:05:17 +08:00
3ebc98228d [feature wip](multi catalog)Support iceberg schema evolution. (#15836)
Support iceberg schema evolution for the parquet file format.
Iceberg uses a unique id for each column to support schema evolution.
To support this feature in Doris, the FE side needs to get the current column id for each column and send the ids to the BE side.
The BE reads column ids from the parquet key_value_metadata, sets the changed column names in the Block to match the names in the parquet file before reading data, and sets the names back after reading data.
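A minimal sketch, with assumed types and function names, of the rename-before-read step: map each table column name to its iceberg field id, then to the name the parquet file currently uses for that id.
```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type: one name per block column.
struct BlockColumn {
    std::string name;
};

// Before reading, rename block columns to the names the parquet file uses
// for the same iceberg field id (taken from key_value_metadata); the caller
// is expected to restore the original names after the read.
void rename_columns_for_read(
        std::vector<BlockColumn>& block,
        const std::unordered_map<std::string, int>& table_name_to_field_id,
        const std::unordered_map<int, std::string>& file_field_id_to_name) {
    for (auto& col : block) {
        auto id_it = table_name_to_field_id.find(col.name);
        if (id_it == table_name_to_field_id.end()) continue;  // e.g. partition column
        auto name_it = file_field_id_to_name.find(id_it->second);
        if (name_it != file_field_id_to_name.end()) {
            col.name = name_it->second;  // the file's (possibly renamed) column
        }
    }
}
```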
2023-01-20 12:57:36 +08:00
Pxl
b727033906 [Chore](build) enable -Wextra and remove some -Wno (#15760)
2023-01-15 10:40:35 +08:00
34bb9cd5d3 [fix](parquet-reader) fix coredump when loading datetime data into doris from parquet (#15794)
`date_time_v2` checks the scale when constructing datetimev2:
```
LOG(FATAL) << fmt::format("Scale {} is out of bounds", scale);
```

This [PR](https://github.com/apache/doris/pull/15510) has fixed this issue, but parquet does not use the constructor to create `TypeDescriptor`, leaving `scale = -1` when reading datetimev2 data.
2023-01-13 11:51:11 +08:00
f17d69e450 [feature](file cache)Import file cache for remote file reader (#15622)
The main purpose of this pr is to introduce `fileCache` for lakehouse reads of remote files.
The local disk is used as a cache for remote files, so the next time a file is read,
the data can be obtained directly from the local disk.
In addition, this pr includes a few other minor changes.

Import File Cache:
1. The imported `fileCache` is called `block_file_cache`, which uses an LRU replacement policy (see the sketch after this list).
2. Implement a new FileReader, `CachedRemoteFileReader`, so that the `file cache` logic is hidden inside `CachedRemoteFileReader`.
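As an illustration of the LRU policy in item 1, here is a minimal in-memory sketch; the real `block_file_cache` persists blocks on local disk, and this class name and API are assumptions.
```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <vector>

class LruBlockCache {
public:
    explicit LruBlockCache(size_t capacity_bytes) : _capacity(capacity_bytes) {}

    // Returns true and fills `out` on a cache hit, promoting the block.
    bool get(const std::string& key, std::vector<char>* out) {
        auto it = _index.find(key);
        if (it == _index.end()) return false;
        _lru.splice(_lru.begin(), _lru, it->second);  // move to front; iterators stay valid
        *out = it->second->data;
        return true;
    }

    void put(const std::string& key, std::vector<char> data) {
        if (auto it = _index.find(key); it != _index.end()) {
            _size -= it->second->data.size();
            _lru.erase(it->second);
            _index.erase(it);
        }
        _size += data.size();
        _lru.push_front(Entry{key, std::move(data)});
        _index[key] = _lru.begin();
        while (_size > _capacity && !_lru.empty()) {  // evict least recently used
            _size -= _lru.back().data.size();
            _index.erase(_lru.back().key);
            _lru.pop_back();
        }
    }

private:
    struct Entry {
        std::string key;
        std::vector<char> data;
    };
    size_t _capacity;
    size_t _size = 0;
    std::list<Entry> _lru;  // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> _index;
};
```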

Other changes:
1. Add a new interface `fs()` for `FileReader`.
2. `IOContext` adds some statistics to track `FileCache` usage

Co-authored-by: Lightman <31928846+Lchangliang@users.noreply.github.com>
2023-01-10 12:23:56 +08:00
707eab9a63 [opt](multi-catalog) cache and reuse position delete rows in iceberg v2 (#15670)
A delete file may apply to multiple data files. Each data file would otherwise read the full set
of delete files, so a delete file may be read repeatedly. The delete files can be cached, so that
multiple data files reuse the content from the first read (see the sketch below).

The performance is improved by 60% in the single-threaded case, and by 30% in the multithreaded case.
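A hypothetical sketch of the reuse described above: a small cache keyed by delete file path, so only the first reader actually loads the file. The types, names, and stubbed `load` are assumptions for the example.
```cpp
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Parsed content of one position-delete file.
struct DeleteFileContent {
    std::vector<int64_t> positions;
};

class DeleteFileCache {
public:
    std::shared_ptr<DeleteFileContent> get_or_load(const std::string& path) {
        std::lock_guard<std::mutex> lock(_mu);
        auto it = _cache.find(path);
        if (it != _cache.end()) return it->second;  // reuse the first read
        auto content = std::make_shared<DeleteFileContent>(load(path));
        _cache.emplace(path, content);
        return content;
    }

private:
    // Stub: a real implementation would read the parquet delete file here.
    DeleteFileContent load(const std::string& /*path*/) { return {}; }

    std::mutex _mu;
    std::unordered_map<std::string, std::shared_ptr<DeleteFileContent>> _cache;
};
```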
2023-01-07 22:29:11 +08:00
4075e3aec6 [fix](csv-reader) fix new csv reader's performance issue (#15581) 2023-01-04 18:25:08 +08:00
50f1931f96 [fix](multi-catalog) get dictionary-encode from parquet metadata (#15525) 2022-12-31 19:08:10 +08:00
2c8de30cce [optimize](multi-catalog) use dictionary encode&filter to process delete files (#15441)
**Optimize**
PR #14470 used `Expr` to filter delete rows to match the current data file,
but the rows in a delete file are [sorted by file_path then position](https://iceberg.apache.org/spec/#position-delete-files)
to optimize filtering while scanning, so this PR removes `Expr` and uses binary search to filter delete rows.
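A minimal sketch of the binary-search idea, assuming the delete rows are already materialized sorted by (file_path, position); the types and function are illustrative only.
```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// One row of a position-delete file.
struct DeleteRow {
    std::string file_path;
    int64_t position;
};

// Because rows are sorted by (file_path, position), the positions belonging
// to one data file form a contiguous run that lower_bound can locate
// without evaluating any Expr per row.
std::vector<int64_t> positions_for_file(const std::vector<DeleteRow>& rows,
                                        const std::string& data_file) {
    auto cmp = [](const DeleteRow& r, const std::string& path) {
        return r.file_path < path;
    };
    auto lo = std::lower_bound(rows.begin(), rows.end(), data_file, cmp);
    std::vector<int64_t> out;
    for (auto it = lo; it != rows.end() && it->file_path == data_file; ++it) {
        out.push_back(it->position);  // already sorted by position
    }
    return out;
}
```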

In addition, delete files are likely to be dictionary-encoded; it's time-consuming to decode the `file_path`
column into `ColumnString`, so this PR uses `ColumnDictionary` to read the `file_path` column.

After testing, the performance of iceberg v2's MOR is improved by 30%+.

**Fix Bug**
The lazy-read block may not have the filter column, if the whole row group is filtered by `Expr`
and the batch_eof is generated from the next batch.
2022-12-30 08:57:55 +08:00
f8bb8c7829 [fix](broker) fix be core dump caused by broker load (#15390)
2022-12-28 10:57:41 +08:00
fc8f6a0715 [fix](multi-catalog) throw NPE when reading data after EOF (#15358)
1. Fix a bug:  
A null pointer exception was thrown when reading data after the reader reached the end of the file; we should return directly when `_do_lazy_read` reads no data.

2. Optimize code:  
Remove unused parameters.

3. Fix regression test
2022-12-26 22:49:35 +08:00
bf71943605 [feature](load) stream load trim double quotes for csv (#15241) 2022-12-26 11:45:54 +08:00
ec055e1acb [feature](new file reader) Integrate new file reader (#15175) 2022-12-26 08:55:52 +08:00
5cefd05869 [fix](multi-catalog) fix and optimize iceberg v2 reader (#15274)
Fix three bugs when reading iceberg v2 tables:
1. The `delete position` in a `delete file` represents the position of the deleted row in the entire file, but the
`read range` in `RowGroupReader` represents the position in the current row group. Therefore, we need to subtract
the position of the first row of the current row group from the `delete position`.
2. When only reading the partition columns, `RowGroupReader` skipped processing the `delete position`.
3. If the `delete position` has deleted all rows in a row group, the `read range` is empty, but we still read the
whole row group in that case.

Optimize four performance issues:
1. Previously we changed `delete position` to `delete range`, and then merged `delete range` and `read range` into
the final read ranges. This process is tedious and time-consuming; we can merge `delete position` and `read range`
directly (as sketched below).
2. `delete position` is ordered within a `delete file`, so we can use merge-sort instead of an ordered set.
3. Initialize `RowGroupReader` when reading, instead of initializing all row groups when opening a `ParquetReader`,
to save memory; the same applies to `IcebergReader`.
4. Change the recursive call of `_do_lazy_read` into a loop.
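A sketch of the direct merge from item 1: given one read range and sorted, row-group-relative delete positions (per bug 1 above), emit the surviving sub-ranges in a single pass. The names are assumptions, not the PR's code.
```cpp
#include <cstdint>
#include <vector>

struct ReadRange {
    int64_t start;  // inclusive
    int64_t end;    // exclusive
};

// `sorted_deletes` must be ascending and already made relative to the row
// group's first row; each deleted position splits the range around it.
std::vector<ReadRange> merge_deletes(int64_t start, int64_t end,
                                     const std::vector<int64_t>& sorted_deletes) {
    std::vector<ReadRange> out;
    int64_t cursor = start;
    for (int64_t pos : sorted_deletes) {
        if (pos < cursor) continue;  // before the current range
        if (pos >= end) break;       // past the range, done
        if (pos > cursor) out.push_back({cursor, pos});
        cursor = pos + 1;            // skip the deleted row
    }
    if (cursor < end) out.push_back({cursor, end});
    return out;
}
```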
2022-12-24 16:02:07 +08:00
7730a88d11 [fix](multi-catalog) add support for orc binary type (#15141)
Fix three bugs:
1. DataTypeFactory::create_data_type is missing the conversion for the binary type, so OrcReader will fail
2. ScalarType#createType is missing the conversion for the binary type, so ExternalFileTableValuedFunction will fail
3. fmt::format can't generate the right format string, and will fail
2022-12-19 14:24:12 +08:00
0e1e5a802b [config](load) enable new load scan node by default (#14808)
Set FE `enable_new_load_scan_node` to true by default,
so that all load tasks (broker load, stream load, routine load, insert into) will use FileScanNode instead of BrokerScanNode
to read data.

1. Support loading parquet files in stream load with the new load scan node.
2. Fix a bug where the new parquet reader can not read a column without a logical or converted type.
3. Change the jsonb parse function to "jsonb_parse_error_to_null",
    so that if the input string is not a valid json string, it returns null for the jsonb column in the load task.
2022-12-16 09:41:43 +08:00
c6d93f739c [feature-wip](file reader) Merge stream_load_pipe to the new file reader (#15035)
Currently, there are two sets of file readers in Doris; this pr rewrites the old stream_load_pipe with the new file reader.
2022-12-15 16:31:22 +08:00
67e4292533 [fix](iceberg-v2) icebergv2 filter data path (#14470)
1. An icebergv2 delete file may span many data paths, so the path of a file split is required as a predicate to filter the rows of the delete file:
- create a delete file structure to save predicate parameters
- create a predicate for the file path
2. Add some logs to print the row range.
3. Fix a bug when creating file metadata.
2022-12-15 10:18:12 +08:00
b8f93681eb [feature-wip](file reader) Merge broker reader to the new file reader (#14980)
Currently, there are two sets of file readers in Doris; this pr rewrites the old broker reader with the new file reader.

TODO:
1. rewrite stream load pipe and kafka consumer pipe
2022-12-14 12:48:02 +08:00
73ee352705 [fix](multi catalog)Fix convert_to_doris_type missing break for some cases (#14992) 2022-12-13 13:34:55 +08:00
e7a84e4a16 [fix](multi-catalog)fix page index thrift deserialize (#15001)
Fix the error when parsing the page index: `Couldn't deserialize thrift msg`.
Use two buffers to store the column index and offset index messages, to avoid parsing them in a single buffer.
2022-12-13 13:33:19 +08:00
8fe0729835 [fix](multi catalog)Check orc file reader is not null before using it. (#14988)
The external table file path cache may be out of date, which will cause the orc reader to visit non-existent files.
In this case, the orc file reader is nullptr.
This pr checks the reader before using it, to avoid a core dump from dereferencing a nullptr.
2022-12-13 11:27:51 +08:00
f3aea7f0f0 [Enhancement](status) Unify error code and enable customed err msg for BE internal errors (#14744) 2022-12-11 23:33:18 +08:00
00f44257e2 [feature-wip](file-reader) Merge hdfs reader to the new file reader (#14875) 2022-12-09 13:21:59 +08:00
f7a827c06b [fix](new-scan) fix some bugs about new scan node and readers (#14504)
The json reader DCHECK fails because of the missing TYPE_STRING.

Fix a bug where the tvf throws an NPE if no file is found.

The predicate conjuncts can not be pushed down to the parquet reader if this is a load task,
because the predicate should be applied to the columns of the destination table, not the columns of the source file.

Add a temp property "use_new_load_scan_node" to broker load to make the regression test happy,
so that we can use the new load scan node for a certain job without setting the global FE config.
2022-11-29 10:21:41 +08:00
064b8d2aa6 [fix](multi-catalog) fix coredump when querying partitioned hive table with text format (#14604)
BE crashes when querying a partitioned hive table with text format
when a partition column is placed first in the select items.

1. FE should use file slots to set the column mapping index of the csv file.
2. BE should use `get_by_name` of the block to get the right column in the csv reader.
2022-11-26 11:42:40 +08:00
25de068a05 [fix](parquet-reader) the value of null map will overflow when LazyRead merges too many empty batches (#14558)
The run length of null map is saved as `uint16_t`. Previously, the run length of null map was
limited by `batch_size` in the `ParquetReader`, by setting `batch_size = std::min(batch_size, (size_t)USHRT_MAX)`.
It works well when the batch size is less than `USHRT_MAX`.
However, [Lazy read](https://github.com/apache/doris/pull/13917) will merge empty batches until reading
a non-empty batch or reaching the EOF of a row group, so the `batch_size` may be greater than `USHRT_MAX`
in non-predicate columns.
In addition, even if the `batch_size` does not exceed `USHRT_MAX`, adjacent batches may also make
the run length exceed `USHRT_MAX` in `ColumnSelectVector::get_next_run`.
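A small sketch of the required splitting, assuming runs are emitted as `uint16_t`: any merged run longer than `USHRT_MAX` has to be flushed in chunks. The function name is hypothetical.
```cpp
#include <climits>
#include <cstddef>
#include <cstdint>
#include <vector>

// The run length is stored as uint16_t, so a merged run longer than
// USHRT_MAX must be emitted as several chunks that each fit.
void emit_run(std::vector<uint16_t>* runs, size_t total_len) {
    while (total_len > USHRT_MAX) {
        runs->push_back(static_cast<uint16_t>(USHRT_MAX));
        total_len -= USHRT_MAX;
    }
    if (total_len > 0) runs->push_back(static_cast<uint16_t>(total_len));
}
```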
2022-11-25 12:22:18 +08:00
1ec7f45fb6 [Bug](avg) Fix avg for bigint (#14433) 2022-11-22 10:29:59 +08:00
2c42f0a905 [refactor](decimalv3) Refine code for DecimalV3 (#14394) 2022-11-19 16:57:17 +08:00
512b787559 [fix](parquet-reader) fix stack-use-after-return error (#14411) 2022-11-19 10:52:50 +08:00
6da2948283 [feature-wip](multi-catalog) support iceberg v2(step 1) (#13867)
Support position delete (partially).
2022-11-17 17:56:48 +08:00
20634ab7e3 [feature-wip](multi-catalog) support partition&missing columns in parquet lazy read (#14264)
PR https://github.com/apache/doris/pull/13917 supported lazy read for non-predicate columns in ParquetReader,
but couldn't trigger lazy read when the predicate columns are partition or missing columns.
This PR supports such cases, and fills partition and missing columns in `FileReader`.
2022-11-16 08:43:11 +08:00
5badd70db2 [fix](csv-reader) Fix core dump when load text into doris with special delimiter (#14196) 2022-11-15 16:06:59 +08:00
fc70179acb [multi-catalog](fix) the eof of lazy read columns may not be equal to the eof of predicate columns (#14212)
Fix three bugs:
1. The EOF of lazy read columns may not be equal to the EOF of predicate columns.
(For example: the predicate column has 3 pages with 400 rows each, but the last page
is filtered by the page index. With batch_size=992, the EOF of the predicate column is true.
However, we should set batch_size=800 for the lazy read columns, so their EOF may be false.)
2. The array column does not count the number of nulls.
3. A wrong NullMap is generated for array columns.
2022-11-14 14:37:21 +08:00
035657c5a1 [typo](comment) Fix a lot of spell errors in be comments (#14208)
fix typos.
2022-11-12 16:06:15 +08:00
6bd5378f66 [feature-wip](multi-catalog) lazy read for ParquetReader (#13917)
Read the predicate columns first, and use VExprContext (push-down predicates)
to generate the select vector, which is then applied when reading the non-predicate columns.
The data in non-predicate columns may be skipped according to the select vector, so value decoding time can be reduced.
If a whole page can be skipped, decompression time can also be reduced.
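A minimal end-to-end sketch of this flow, with an assumed hard-coded predicate standing in for `VExprContext`: decode the predicate column, build the select vector, then filter the non-predicate column with it. All names here are illustrative, not the reader's real API.
```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Build the select vector from an already-decoded predicate column; the
// predicate "col > 100" is just a stand-in for the pushed-down expression.
std::vector<uint8_t> build_select_vector(const std::vector<int64_t>& pred_col) {
    std::vector<uint8_t> selected(pred_col.size());
    for (size_t i = 0; i < pred_col.size(); ++i) {
        selected[i] = pred_col[i] > 100 ? 1 : 0;
    }
    return selected;
}

// Read a non-predicate column under the select vector: unselected values
// need no value decoding, and a run covering a whole page would let the
// reader skip that page's decompression entirely.
void read_lazy_column(const std::vector<int64_t>& file_values,
                      const std::vector<uint8_t>& selected,
                      std::vector<int64_t>* out) {
    for (size_t i = 0; i < file_values.size(); ++i) {
        if (selected[i]) out->push_back(file_values[i]);
    }
}
```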
2022-11-10 16:56:14 +08:00
43eb946543 [feature](table-valued-function)S3 table valued function supports parquet/orc/json file format #14130
S3 table valued function supports the parquet/orc/json file formats (for example, parquet).
2022-11-10 10:33:12 +08:00