Commit Graph

123 Commits

Author SHA1 Message Date
Pxl
0723e55f76 [Bug](build) fix compile fail on unused value #17165
```
error: variable 'nullcount' set but not used [-Werror,-Wunused-but-set-variable]
int nullcount = 0;
```
2023-02-27 14:19:44 +08:00
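The usual fixes for this class of warning, sketched (not the actual patch; the function below is hypothetical):

```cpp
// Two typical ways to silence -Wunused-but-set-variable under -Werror:
// delete the dead counter entirely, or mark it explicitly if it is kept
// only for debugging purposes.
void example() {
    [[maybe_unused]] int nullcount = 0;  // kept but marked as intentionally unused
    // ...or simply remove the variable and every place that sets it.
}
```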
29dc08fc45 [Optimize](simd json reader) Cached search results for previous row (keyed as index in JSON object) - used as a hint. (#17124)
* [Optimize](simd json reader) Cached search results for previous row (keyed as index in JSON object) - used as a hint.

`_simdjson_set_column_value` can become a hot spot while parsing JSON in simdjson mode.
This introduces `_prev_positions` to cache the lookup results of the previous row (keyed by index in the JSON object) and use them as hints,
since the field order is usually the same between lines (sketched below).

* fix case
2023-02-27 10:39:22 +08:00
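A minimal sketch of the hint idea, using a plain field list instead of the real simdjson API; `Field`, `JsonObject`, and `find_field_with_hint` are illustrative names, not Doris code:

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

struct Field {
    std::string key;
    std::string value;
};
using JsonObject = std::vector<Field>;  // stands in for a parsed JSON object

// One cached position per target column, filled from the previous row.
// Because the field order is usually identical between lines, the hint hits
// most of the time and turns the per-column lookup into O(1).
const Field* find_field_with_hint(const JsonObject& obj, std::string_view key,
                                  size_t* prev_position) {
    if (*prev_position < obj.size() && obj[*prev_position].key == key) {
        return &obj[*prev_position];  // hint hit: same slot as in the previous row
    }
    for (size_t i = 0; i < obj.size(); ++i) {  // fallback: linear scan
        if (obj[i].key == key) {
            *prev_position = i;  // remember for the next row
            return &obj[i];
        }
    }
    return nullptr;
}
```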
a0782a1855 [fix](file reader) fix be core in broker file reader (#17039)
A const reference member variable stored a temporary object, which cannot be accessed after the temporary is destroyed, causing a BE core dump when debug-level logging is enabled.

`_broker_addr` had already been destroyed in `BrokerFileReader`.
2023-02-26 12:35:31 +08:00
f6ce072297 [Enhancement](csv-reader) Optimize csv_reader _split_value and fix json_reader case sensitivity (#17093)
1. Enhancement:
    For a single-character column separator, csv_reader uses a faster `split value` method (see the sketch below).
2. Bug fix:
    Make `json` file format loading case-sensitive.
2023-02-26 09:03:04 +08:00
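A sketch of a single-character fast path based on `memchr` (illustrative, not the actual `_split_value` implementation):

```cpp
#include <cstddef>
#include <cstring>
#include <string_view>
#include <vector>

// Fast path for a single-character column separator: one memchr call per
// field instead of a generic multi-character substring search.
std::vector<std::string_view> split_single_char(std::string_view line, char sep) {
    std::vector<std::string_view> fields;
    if (line.empty()) {
        fields.emplace_back();
        return fields;
    }
    size_t start = 0;
    while (start <= line.size()) {
        const void* hit = start < line.size()
                ? memchr(line.data() + start, sep, line.size() - start)
                : nullptr;
        if (hit == nullptr) {
            fields.push_back(line.substr(start));  // last field (possibly empty)
            break;
        }
        size_t pos = static_cast<size_t>(static_cast<const char*>(hit) - line.data());
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```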
c43e521d29 [feature](multi-catalog) support map&struct type in parquet&orc reader (#17087)
Support parsing map&struct type in parquet&orc reader.

## Remaining Problems
1. Doris uses the array type to build the key and value columns of a `map`, but doesn't fill the offsets in the value column, so the offsets in the value column are wasted (see the sketch below).
2. Parquet supports reading only the key or only the value column of a `map`; this PR doesn't support that yet.
3. Parquet supports reading a subset of the columns in a `struct`; this PR doesn't support that yet.
2023-02-26 08:55:39 +08:00
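A simplified sketch of the layout described in item 1: the key and value columns of a map can share one offsets array, so filling a second offsets array for the value column is redundant. The `MapColumn` struct is illustrative, not Doris' actual ColumnMap:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Simplified layout of a map<string, int64> column: one flat key column, one
// flat value column, and a single shared offsets array.
struct MapColumn {
    std::vector<std::string> keys;  // flattened keys of all rows
    std::vector<int64_t> values;    // flattened values, same length as keys
    std::vector<int64_t> offsets;   // offsets[i] = end of row i in keys/values
};

// Number of key/value pairs in one row, derived from the shared offsets.
size_t map_size_of_row(const MapColumn& col, size_t row) {
    const int64_t begin = row == 0 ? 0 : col.offsets[row - 1];
    return static_cast<size_t>(col.offsets[row] - begin);
}
```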
e42465ae59 [fix](OrcReader) handle null values in orc reader for string type (#17135)
ORC doesn't fill null values in the new batch, but the former batch has already been released.
Other types like int/long/timestamp are flat types without pointers in them,
so they do not need the separate handling that strings require (see the sketch below).
2023-02-26 08:10:40 +08:00
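A hedged sketch of why string batches need the extra pass; `StringBatch` here is a simplification of an ORC string column batch, not the real type:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for an ORC string column batch: for null slots ORC
// leaves data/length untouched, so they may still point into the previous
// (already released) batch. Numeric batches carry no pointers, so stale
// values there are harmless.
struct StringBatch {
    std::vector<const char*> data;
    std::vector<int64_t> length;
    std::vector<uint8_t> not_null;  // 1 = value present
};

void copy_strings(const StringBatch& batch, std::vector<std::string>* out) {
    out->clear();
    for (size_t i = 0; i < batch.not_null.size(); ++i) {
        if (batch.not_null[i]) {
            out->emplace_back(batch.data[i], batch.length[i]);
        } else {
            out->emplace_back();  // emit an empty value instead of touching the
                                  // dangling pointer left over from the old batch
        }
    }
}
```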
3ea6478ba8 [feature](multi-catalog) parquet reader support nested array column (#16961)
Support to decode nested array column in parquet reader:
1. FE should generate the right nested column type. FE doesn't check the nesting depth or legality, e.g. map\<array\<int\>, int\>.
2. `ParquetColumnReader` removes the page-index filtering to support the nested array type.
    It's too difficult to skip values in nested complex types. We may support page-index filtering and lazy read in a later PR.
3. `ExternalFileScanNode` had a bug in creating the default value expression.
4. Reading repetition levels in a while loop may be slow. I'll optimize this in the next PR.
5. An array column has temporary `SchemaElement`s in its thrift definition;
the former implementation removed them and kept only the parent.
The remaining parent should inherit the repetition and definition levels of its child (see the sketch below).
2023-02-23 14:54:58 +08:00
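A simplified illustration of what decoding repetition levels for a one-level array involves, assuming no nulls and no empty lists (real Parquet decoding must also consult definition levels); not Doris code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Repetition level 0 starts a new top-level row, level 1 appends to the
// current row's array. offsets[i+1] - offsets[i] is the length of row i.
std::vector<int64_t> rep_levels_to_offsets(const std::vector<int16_t>& rep_levels) {
    std::vector<int64_t> offsets = {0};
    for (size_t i = 0; i < rep_levels.size(); ++i) {
        if (rep_levels[i] == 0) {
            offsets.push_back(offsets.back() + 1);  // new row with one element so far
        } else {
            ++offsets.back();  // continue the current row's array
        }
    }
    return offsets;
}
```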
61826e3a77 [Improvement](parquet-reader) Improve performance of parquet reader filter calculation. (#16934)
Improve performance of parquet reader filter calculation.

- Use `filter_data` instead of `(*filter_ptr)` to merge the filter, improving performance (sketched below).
- Use the mutable column filter function instead of the new column filter function introduced by #16850.
- Avoid increasing the column ref-count, which caused unnecessary copying, by passing the column pointer by reference.
2023-02-23 14:41:30 +08:00
29c46d6926 [fix](struct-type) fix be core when load array orc file (#16978)
* fix be core when load array orc file
2023-02-22 10:15:39 +08:00
4cb97b6fb7 [chore](macOS) Fix linkage errors for the release build (#17002)
Issue Number: close #17003

## Problem summary
The linker couldn't find some symbols because the implementation of the template member function `doris::vectorized::Decoder::init_decimal_converter` was missing from the header file that contains the corresponding declaration.
2023-02-22 10:01:51 +08:00
491d269412 [fix](tvf) fix bug that failed to get schema of tvf when file is empty (#16928)
In the previous implementation, when querying a tvf, FE gets the schema from BE,
and BE tries to open the first file to get its schema info. For the orc or parquet format,
this returns an error if the file is empty.
But even an empty file still has a footer we can read schema info from,
so we should handle empty files and get the schema info correctly.

Also modify the catalog doc to add some FAQ.
2023-02-21 14:14:32 +08:00
113023fb86 (Enhancement)[load-json] support simdjson in new json reader (#16903)
be config:
enable_simdjson_reader=true

related PR #11665
2023-02-21 11:31:00 +08:00
a46941c684 [Fix](multi-catalog) Fix switch-case fall-through issue in multi-catalog module. (#16931)
Fix switch-case fall-through issue in multi-catalog module.
2023-02-20 21:35:41 +08:00
ef2fdb79bb [Improvement](parquet-reader) Optimize and refactor parquet reader to improve performance. (#16818)
Optimize and refactor parquet reader to improve performance.
- Improve performance 2x for small dict strings by aligned copying.
- Refactor code to reduce conditional (if) checks.
- Don't call skip(0).
- Don't read the page index if there is no condition.

**ssb-flat-100**: (single-machine, single-thread)
| Query        | before opt           | after opt  |
| ------------- |:-------------:| ---------:|
| SELECT count(lo_revenue) FROM lineorder_flat       | 9.23   | 9.12 |
| SELECT count(lo_linenumber) FROM lineorder_flat | 4.50    | 4.36 |
| SELECT count(c_name) FROM lineorder_flat             | 18.22 | 17.88| 
| **SELECT count(lo_shipmode) FROM lineorder_flat**     |**10.09** | **6.15**|
2023-02-20 11:42:29 +08:00
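A sketch of the raw-pointer filter merge from the first bullet; in Doris the filter is a PODArray of UInt8, here a plain vector, and the function name is illustrative:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Filter = std::vector<uint8_t>;

// Merging through the raw data pointer avoids per-element indirection through
// a column handle ((*filter_ptr)[i]) in the hot loop.
void merge_filter(Filter& dst, const Filter& src) {
    uint8_t* filter_data = dst.data();
    const uint8_t* src_data = src.data();
    const size_t n = dst.size() < src.size() ? dst.size() : src.size();
    for (size_t i = 0; i < n; ++i) {
        filter_data[i] &= src_data[i];  // row survives only if both filters keep it
    }
}
```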
292926e5aa [Fix](multi catalog)Fix partition case bug (#16763)
Set column names from the path to lower case in the case-insensitive case.
This is for Iceberg columns taken from the path. Iceberg columns are case-sensitive,
which may cause errors for tables with partitions.
2023-02-16 15:47:23 +08:00
de8d884ec3 [Fix](multi catalog)Fix iceberg parquet file doesn't have iceberg.schema meta problem (#16764)
To support schema evolution, Iceberg adds schema information to the Parquet file metadata.
But early Iceberg versions don't write any schema information to the Parquet file.
This PR supports reading parquet files without schema information.
2023-02-16 00:08:59 +08:00
0d9714b179 [Fix](multi catalog)Support read hive1.x orc file. (#16677)
Hive 1.x may write ORC files with internal column names (_col0, _col1, _col2...).
This causes query results to be NULL because the column names in the ORC file don't match
the column names in the Doris table schema. This PR supports querying Hive ORC files with internal column names (see the sketch below).

For now, we haven't seen any problem with Parquet files; we will send a new PR to fix parquet if any problem shows up in the future.
2023-02-14 14:32:27 +08:00
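A sketch of the positional mapping idea: when every file column is named `_col<N>`, resolve it to the table schema column at position N (names are illustrative, not the Doris implementation):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Build a file-name -> table-name mapping for ORC files whose columns use the
// Hive 1.x internal names _col0, _col1, ... The table schema order defines N.
std::unordered_map<std::string, std::string> map_internal_orc_names(
        const std::vector<std::string>& table_columns) {
    std::unordered_map<std::string, std::string> file_to_table;
    for (size_t i = 0; i < table_columns.size(); ++i) {
        file_to_table["_col" + std::to_string(i)] = table_columns[i];
    }
    return file_to_table;
}
```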
37d1519316 [WIP](dynamic-table) support dynamic schema table (#16335)
Issue Number: close #16351

A dynamic schema table is a special type of table whose schema changes with the loading procedure. We implemented this feature mainly for semi-structured data such as JSON: since JSON is schema self-describing, we can extract schema info from the original documents and infer the final type information. This special table reduces manual schema change operations and makes it easy to import semi-structured data while its schema extends automatically.
2023-02-11 13:37:50 +08:00
c1a1275870 [fix](memory) Fix parquet load stack overflow (#16537) 2023-02-10 08:48:12 +08:00
27216dc7e0 [improvement](multi-catalog) push down all predicates into rowgroup/page filtering for ParquetReader (#16388)
Two improvements:
1. Refactor the rowgroup & page filtering in `ParquetReader`, and use the operator overloading of Doris native C++ types to process comparisons.
2. Support decimal/decimal v3/date/datev2/datetime/datetimev2
2023-02-07 11:32:57 +08:00
b1b2697cc7 [fix](iceberg) fix iceberg catalog (#16372)
1. Fix iceberg catalog access s3
2. Fix iceberg catalog partition table query
3. Fix persistence
2023-02-05 13:15:28 +08:00
d2b5015d3f [enhancement](profile) add the profile counter RawRowsRead to record the rows read from the parquet file (#16328) 2023-02-04 22:59:34 +08:00
Pxl
5e4bb98900 [Chore](build) enable -Wpedantic and update lowest gcc version to 11.1 (#16290)
enable -Wpedantic and update lowest gcc version to 11.1
2023-02-03 11:28:48 +08:00
9618427020 [improvement](multi-catalog) increase default batch_size to 4064 (#16326)
The performance of ClickBench Q30 is affected by batch_size:
| batch_size | 1024 | 4096 | 20480 |
| -- | -- | -- | -- |
| Q30 query time | 2.27 | 1.08 | 0.62 |

Because the aggregation operator creates a new result block for each batch block, and Q30 has 90 columns, this is time-consuming. A larger batch_size decreases the number of aggregation blocks, so a larger batch_size improves performance.

The Doris internal reader reads at least 4064 rows even if batch_size < 4064, so this PR keeps the process of reading external tables the same as internal tables.
2023-02-02 11:51:09 +08:00
1c5279d26e [fix](multi-catalog) remove the eof check among parquet columns (#16302)
Read parquet file failed:
```
ERROR 1105 (HY000): errCode = 2, detailMessage = [INTERNAL_ERROR]Read parquet file xxx failed, reason = [CORRUPTION]The number of rows are not equal among parquet columns
```
This error may be thrown when reading non-predicate columns in lazy-read, for example:
A row group with 1000 rows has two non-predicate columns.
Column A has one page, Column B has two pages with 500 rows for each page.
The read range of `ParquetColumnReader` is [0, 400), and the rows between [0, 450) are all filtered by predicate columns.
So column A can skip the first page and reach the EOF, while column B can also skip the first page, but doesn't reach the EOF.
2023-02-02 09:22:09 +08:00
b878a7e61e [feature](Load)Support skipping a specific number of lines for csv stream load (#16055)
Support setting the number of lines to skip when loading a CSV file via stream load.

Usage `-H skip_lines:number`:
```
curl --location-trusted -u root: -T test.csv -H skip_lines:5  -XPUT http://127.0.0.1:8030/api/testDb/testTbl/_stream_load
```

The skip line number can also be used in MySQL load as below:
```sql
LOAD DATA
LOCAL
INFILE '${mysql_load_skip_lines}'
INTO TABLE ${tableName}
COLUMNS TERMINATED BY ','
IGNORE 2 LINES
PROPERTIES ("auth" = "root:");
```
2023-02-01 20:42:43 +08:00
fa14b7ea9c [Enhancement](icebergv2) Optimize the position delete file filtering mechanism in iceberg v2 parquet reader (#16024)
close #16023
2023-01-28 00:04:27 +08:00
1589d453a3 [fix](multi catalog)Support parquet and orc upper case column name (#16111)
External HMS catalog table column names in Doris are all lower case,
while Iceberg tables or Spark-SQL-created Hive tables may contain upper case column names,
which causes empty query results. This PR fixes this bug.
1. For parquet files, convert all column names to lower case while parsing parquet metadata.
2. For orc files, store the original column names and the lower case column names in two vectors, and use the suitable names in the different cases.
3. On the FE side, change the column name back to the original column name in Iceberg while doing convertToIcebergExpr.
2023-01-27 23:52:11 +08:00
23edb3de5a [fix](icebergv2) fix bug that delete file reader is not opened (#16133)
PR #15836 changed the way the parquet reader is used: first open(), then init_reader().
But we forgot to call open() for the iceberg delete file, which caused a coredump.
2023-01-24 10:19:46 +08:00
199d7d3be8 [Refactor]Merged string_value into string_ref (#15925) 2023-01-22 16:39:23 +08:00
de12957057 [debug](ParquetReader) print file path if failed to read parquet file (#16118) 2023-01-21 08:05:17 +08:00
3ebc98228d [feature wip](multi catalog)Support iceberg schema evolution. (#15836)
Support iceberg schema evolution for parquet file format.
Iceberg uses a unique id for each column to support schema evolution.
To support this feature in Doris, the FE side needs to get the current column id for each column and send the ids to the BE side.
BE reads the column ids from parquet key_value_metadata and sets the changed column names in the Block to match the names in the parquet file before reading data, then sets the names back after reading data.
2023-01-20 12:57:36 +08:00
Pxl
b727033906 [Chore](build) enable -Wextra and remove some -Wno (#15760)
enable -Wextra and remove some -Wno
2023-01-15 10:40:35 +08:00
34bb9cd5d3 [fix](parquet-reader) fix coredump when loading datetime data to doris from parquet (#15794)
`date_time_v2` checks the scale when constructing datetimev2:
```
LOG(FATAL) << fmt::format("Scale {} is out of bounds", scale);
```

This [PR](https://github.com/apache/doris/pull/15510) fixed the issue, but parquet does not use the constructor to create `TypeDescriptor`, leaving `scale = -1` when reading datetimev2 data.
2023-01-13 11:51:11 +08:00
f17d69e450 [feature](file cache)Import file cache for remote file reader (#15622)
The main purpose of this PR is to introduce a `fileCache` for lakehouse reads of remote files.
The local disk is used as the cache for remote file reads, so the next time a file is read,
the data can be obtained directly from the local disk.
In addition, this PR includes a few other minor changes.

Import File Cache:
1. The imported `fileCache` is called `block_file_cache`, which uses an LRU replacement policy (sketched below).
2. Implement a new FileReader, `CachedRemoteFileReader`, so that the `file cache` logic is hidden under `CachedRemoteFileReader`.

Other changes:
1. Add a new interface `fs()` to `FileReader`.
2. `IOContext` adds some statistics to record `FileCache` behavior.

Co-authored-by: Lightman <31928846+Lchangliang@users.noreply.github.com>
2023-01-10 12:23:56 +08:00
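A compact sketch of the LRU bookkeeping behind a block cache of this kind; the real `block_file_cache` persists blocks on local disk, while this illustration only keeps in-memory buffers, and all names are assumptions:

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Key could be "file_path:block_index"; only the LRU mechanics are shown.
class LruBlockCache {
public:
    explicit LruBlockCache(size_t capacity) : _capacity(capacity) {}

    const std::vector<char>* get(const std::string& key) {
        auto it = _index.find(key);
        if (it == _index.end()) return nullptr;
        _lru.splice(_lru.begin(), _lru, it->second);  // move to front = most recent
        return &it->second->second;
    }

    void put(const std::string& key, std::vector<char> block) {
        if (auto it = _index.find(key); it != _index.end()) {
            it->second->second = std::move(block);
            _lru.splice(_lru.begin(), _lru, it->second);
            return;
        }
        _lru.emplace_front(key, std::move(block));
        _index[key] = _lru.begin();
        if (_index.size() > _capacity) {  // evict the least recently used block
            _index.erase(_lru.back().first);
            _lru.pop_back();
        }
    }

private:
    using Entry = std::pair<std::string, std::vector<char>>;
    size_t _capacity;
    std::list<Entry> _lru;
    std::unordered_map<std::string, std::list<Entry>::iterator> _index;
};
```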
707eab9a63 [opt](multi-catalog) cache and reuse position delete rows in iceberg v2 (#15670)
A delete file may apply to multiple data files. Each data file reads the full set of delete files,
so a delete file may be read repeatedly. Delete files can be cached, and multiple data files
can reuse the content from the first read (see the sketch below).

Performance improves by 60% in the single-thread case and by 30% in the multi-thread case.
2023-01-07 22:29:11 +08:00
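A sketch of the reuse idea: parse each delete file once and let every data file that references it share the parsed rows. Types and names are illustrative, not the Doris implementation:

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Delete positions already extracted from one delete file.
using DeleteRows = std::vector<int64_t>;

// Cache keyed by delete-file path; the first reader parses the file, later
// data files that reference the same delete file reuse the parsed result.
class DeleteFileCache {
public:
    std::shared_ptr<DeleteRows> get_or_load(
            const std::string& delete_file_path,
            const std::function<DeleteRows()>& load) {
        std::lock_guard<std::mutex> lock(_mutex);
        auto it = _cache.find(delete_file_path);
        if (it != _cache.end()) return it->second;  // reuse the first read
        auto rows = std::make_shared<DeleteRows>(load());
        _cache.emplace(delete_file_path, rows);
        return rows;
    }

private:
    std::mutex _mutex;
    std::unordered_map<std::string, std::shared_ptr<DeleteRows>> _cache;
};
```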
4075e3aec6 [fix](csv-reader) fix new csv reader's performance issue (#15581) 2023-01-04 18:25:08 +08:00
50f1931f96 [fix](multi-catalog) get dictionary-encode from parquet metadata (#15525) 2022-12-31 19:08:10 +08:00
2c8de30cce [optimize](multi-catalog) use dictionary encode&filter to process delete files (#15441)
**Optimize**
PR #14470 used `Expr` to filter delete rows to match the current data file,
but the rows in a delete file are [sorted by file_path then position](https://iceberg.apache.org/spec/#position-delete-files)
to optimize row filtering while scanning, so this PR removes `Expr` and uses binary search to filter delete rows (see the sketch below).

In addition, delete files are likely to be dictionary-encoded, and it's time-consuming to decode the `file_path`
column into `ColumnString`, so this PR uses `ColumnDictionary` to read the `file_path` column.

After testing, the performance of iceberg v2's MOR is improved by 30%+.

**Fix Bug**
The lazy-read block may not have the filter column if the whole row group is filtered by `Expr`
and the batch_eof is generated from the next batch.
2022-12-30 08:57:55 +08:00
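Since rows in a position delete file are sorted by (file_path, position), the delete rows for one data file can be located with binary search instead of an `Expr` filter. A sketch under that assumption (names are illustrative):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct DeleteRow {
    std::string file_path;  // data file this delete applies to
    int64_t position;       // row position inside that data file
};

// delete_rows must be sorted by (file_path, position), which the Iceberg spec
// guarantees for position delete files. Returns the positions for one split.
std::vector<int64_t> positions_for_file(const std::vector<DeleteRow>& delete_rows,
                                        const std::string& data_file_path) {
    auto cmp = [](const DeleteRow& row, const std::string& path) {
        return row.file_path < path;
    };
    auto begin = std::lower_bound(delete_rows.begin(), delete_rows.end(),
                                  data_file_path, cmp);
    std::vector<int64_t> positions;
    for (auto it = begin; it != delete_rows.end() && it->file_path == data_file_path; ++it) {
        positions.push_back(it->position);  // already sorted ascending
    }
    return positions;
}
```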
f8bb8c7829 [fix](broker) fix be core dump caused by broker load (#15390)
* [fix](broker) fix be core dump caused by broker load
2022-12-28 10:57:41 +08:00
fc8f6a0715 [fix](multi-catalog) throw NPE when reading data after EOF (#15358)
1. Fix 1 bug:  
A null pointer exception was thrown when reading data after the reader reached the end of the file, so we should return directly when `_do_lazy_read` reads no data.

2. Optimize code:  
Remove unused parameters.

3. Fix regression test
2022-12-26 22:49:35 +08:00
bf71943605 [feature](load) stream load trim double quotes for csv (#15241) 2022-12-26 11:45:54 +08:00
ec055e1acb [feature](new file reader) Integrate new file reader (#15175) 2022-12-26 08:55:52 +08:00
5cefd05869 [fix](multi-catalog) fix and optimize iceberg v2 reader (#15274)
Fix three bugs when read iceberg v2 tables:
1. The `delete position` in a `delete file` represents the position of the deleted row in the entire file, but the `read range` in 
`RowGroupReader` represents the position in the current row group. Therefore, we need to subtract the position of the first 
row of the current row group from the `delete position`.
2. When only reading the partition columns, `RowGroupReader` skips processing the `delete position`.
3. If the `delete position` has deleted all rows in a row group, the `read range` is empty, but we read the whole row 
group in that case.

Optimize four performance issues:
1. We used to change `delete position` to `delete range`, and then merge `delete range` and `read range` into the final read 
ranges. This process is tedious and time-consuming; we can merge `delete position` and `read range` directly (see the sketch below).
2. `delete position` is ordered within a `delete file`, so we can use merge-sort instead of an ordered set.
3. Initialize `RowGroupReader` when reading, instead of initializing all row groups when opening a `ParquetReader`, to 
save memory; the same applies to `IcebergReader`.
4. Change the recursive call of `_do_lazy_read` to a loop.
2022-12-24 16:02:07 +08:00
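A sketch of merging `delete position` directly into read ranges for one row group, covering the first and third fixes above: positions are shifted by the row group's first row, and the surviving gaps become the ranges to read. All names are illustrative:

```cpp
#include <cstdint>
#include <vector>

struct RowRange {
    int64_t first_row;  // inclusive, relative to the row group
    int64_t last_row;   // exclusive
};

// deletes: positions in the whole file, sorted ascending.
// first_row/num_rows describe the current row group.
std::vector<RowRange> build_read_ranges(const std::vector<int64_t>& deletes,
                                        int64_t first_row, int64_t num_rows) {
    std::vector<RowRange> ranges;
    int64_t cursor = 0;  // position relative to the row group start
    for (int64_t pos : deletes) {
        int64_t local = pos - first_row;  // translate to an in-group position
        if (local < 0 || local >= num_rows) continue;  // belongs to another group
        if (local > cursor) ranges.push_back({cursor, local});
        cursor = local + 1;
    }
    if (cursor < num_rows) ranges.push_back({cursor, num_rows});
    return ranges;  // may be empty when every row in the group is deleted
}
```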
7730a88d11 [fix](multi-catalog) add support for orc binary type (#15141)
Fix three bugs:
1. DataTypeFactory::create_data_type was missing the conversion for the binary type, so OrcReader would fail.
2. ScalarType#createType was missing the conversion for the binary type, so ExternalFileTableValuedFunction would fail.
3. fmt::format couldn't generate the right format string and would fail.
2022-12-19 14:24:12 +08:00
0e1e5a802b [config](load) enable new load scan node by default (#14808)
Set FE `enable_new_load_scan_node` to true by default.
So that all load tasks (broker load, stream load, routine load, insert into) use FileScanNode instead of BrokerScanNode
to read data.

1. Support loading parquet file in stream load with new load scan node.
2. Fix a bug where the new parquet reader could not read columns without a logical or converted type.
3. Change the jsonb parser function to "jsonb_parse_error_to_null",
    so that if the input string is not a valid JSON string, it returns null for the jsonb column in the load task.
2022-12-16 09:41:43 +08:00
c6d93f739c [feature-wip](file reader) Merge stream_load_pipe to the new file reader (#15035)
Currently there are two sets of file readers in Doris; this PR rewrites the old stream_load_pipe with the new file reader.
2022-12-15 16:31:22 +08:00
67e4292533 [fix](iceberg-v2) icebergv2 filter data path (#14470)
1. An icebergv2 delete file may span many data paths, so the path of a file split is required as a predicate to filter rows of the delete file
- create a delete file structure to save the predicate parameters
- create a predicate for the file path
2. Add some logs to print the row ranges
3. Fix a bug when creating file metadata
2022-12-15 10:18:12 +08:00
b8f93681eb [feature-wip](file reader) Merge broker reader to the new file reader (#14980)
Currently there are two sets of file readers in Doris; this PR rewrites the old broker reader with the new file reader.

TODO:
1. rewrite stream load pipe and kafka consumer pipe
2022-12-14 12:48:02 +08:00
73ee352705 [fix](multi catalog)Fix convert_to_doris_type missing break for some cases (#14992) 2022-12-13 13:34:55 +08:00