133 Commits

Author SHA1 Message Date
455c800405 [feature](parquet-reader) add rle bool and delta decoder to read AWS Glue (#17112)
Support delta encoding and RLE (bool) decoding to read AWS Glue data:
- add delta bit pack decoder
- add delta length byte array decoder
- add delta byte array decoder
- add RLE bool decoder

Some data types written by AWS Glue are delta encoded, so these encodings need to be supported.
For the definition of each delta encoding, refer to the delta encodings in Parquet.
2023-03-12 20:09:58 +08:00
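As a rough illustration of the delta encodings named in 455c800405, the sketch below rebuilds values from inputs that are assumed to be already unpacked. It skips Parquet's block headers, varints and bit unpacking, and the function names `decode_deltas` and `decode_delta_byte_array` are hypothetical, not Doris APIs.

```python
def decode_deltas(first_value, min_delta, stored_deltas):
    """Rebuild integers from a first value and deltas stored relative to min_delta."""
    values = [first_value]
    for stored in stored_deltas:
        # Each stored delta is non-negative; the real delta is min_delta + stored.
        values.append(values[-1] + min_delta + stored)
    return values


def decode_delta_byte_array(prefix_lengths, suffixes):
    """Rebuild byte strings: each value reuses a prefix of the previous value."""
    values = []
    previous = b""
    for prefix_len, suffix in zip(prefix_lengths, suffixes):
        current = previous[:prefix_len] + suffix
        values.append(current)
        previous = current
    return values


ints = decode_deltas(7, -2, [0, 0, 0, 3, 3, 3, 3])
strings = decode_delta_byte_array([0, 5, 3], [b"apple", b"sauce", b"ly"])
```

Storing deltas relative to the minimum delta keeps every stored number non-negative, which is what makes fixed-width bit packing possible in the real format.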
8001d65811 [fix](insert) fix memory leak for insert transaction (#17530) 2023-03-08 14:10:59 +08:00
dca16796ad [fix](ParquetReader) definition level of repeated parent is wrong (#17337)
Fix three bugs:
1. `repeated_parent_def_level` should be the definition level of its repeated parent.
2. Schemas like `decimal(p, s)` failed to parse.
3. Wrong offsets were filled for array types.
2023-03-06 18:15:57 +08:00
9477c48ef8 [refactor](functioncontext) remove duplicate type definition in function context (#17421)
- Remove duplicate type definitions in function context.
- Remove unused methods in function context.
- No stale state is needed in vexpr context, because vexpr is stateless; the function context saves the state, and both are cloned.
- Remove the useless slot_size from all tuple and slot descriptors.
- Remove the doris_udf namespace, which is useless.
- Remove some unused macro definitions.
- Init v_conjuncts in vscanner, so the same code does not need to be written in every scanner.
- Use a unique ptr to manage the function context, since it can belong to only a single expr context.
Issue Number: close #xxx
---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-03-06 16:07:09 +08:00
e7cba11680 [fix](array)(parquet) fix be core dump due to load from parquet file containing array types (#17298) 2023-03-06 15:18:42 +08:00
3d0beec01d [fix](orc) fix heap-use-after-free and potential memory leak of orc reader (#17431)
Fix heap-use-after-free
The OrcReader has an internal FileInputStream. If the file is empty, the memory of the FileInputStream leaks.
Besides, the FileInputStream holds a Statistics instance. The FileInputStream may be deleted if the orc reader
fails to initialize, but the Statistics may still be used when the orc reader is closed, causing a heap-use-after-free error.

Potential memory leak
When the file scanner is initialized in the file scan node and fails to prepare, the memory of the file scanner leaks.
2023-03-06 08:42:35 +08:00
1244eed1cd [Opt](exec) opt the dispose nullable column logic (#17192) 2023-03-01 23:25:40 +08:00
f1db0d9501 [Enhencement](File Reader) delete old file_reader (#17261)
* delete old file_reader

* fix 1
2023-03-01 20:24:03 +08:00
bf5037d6d5 [fix](OrcReader) typo in anaylize null values (#17156)
typographical error in analyzing null values for OrcReader.
2023-02-28 14:29:13 +08:00
598038e674 [improvement](parquet-reader)support parquet data page v2 (#17054)
Support parquet data page v2
Parquet data on AWS Glue now uses data page v2, which was not supported before.
2023-02-28 14:23:45 +08:00
Pxl
0723e55f76 [Bug](build) fix compile fail on unused value #17165
error: variable 'nullcount' set but not used [-Werror,-Wunused-but-set-variable]
int nullcount = 0;
2023-02-27 14:19:44 +08:00
29dc08fc45 [Optimize](simd json reader) Cached search results for previous row (keyed as index in JSON object) - used as a hint. (#17124)
* [Optimize](simd json reader) Cached search results for previous row (keyed as index in JSON object) - used as a hint.

`_simdjson_set_column_value` can become a hot spot while parsing JSON in simdjson mode.
Introduce `_prev_positions` to cache the lookup results of the previous row (keyed by index in the JSON object),
since the order of the JSON field names should be nearly the same from line to line.

* fix case
2023-02-27 10:39:22 +08:00
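The hint described in 29dc08fc45 can be sketched roughly as follows. This is an illustration of the idea only, not the Doris implementation: `map_fields_to_columns`, `prev_positions` and the `hits` counter are hypothetical names, and a Python dict lookup stands in for the more expensive name search.

```python
def map_fields_to_columns(rows, column_index):
    """Map each JSON field to a column, reusing the previous row's result as a hint.

    rows: list of rows, each a list of (key, value) pairs in document order.
    column_index: dict from field name to column number (the slow-path lookup).
    """
    prev_positions = []  # prev_positions[i] = (key, column) seen at field index i
    hits = 0
    mapped_rows = []
    for row in rows:
        mapped = {}
        for i, (key, value) in enumerate(row):
            if i < len(prev_positions) and prev_positions[i][0] == key:
                # Same key at the same index as in the previous row: reuse it.
                column = prev_positions[i][1]
                hits += 1
            else:
                # Hint missed: fall back to the full lookup and refresh the hint.
                column = column_index.get(key)
                if i < len(prev_positions):
                    prev_positions[i] = (key, column)
                else:
                    prev_positions.append((key, column))
            if column is not None:
                mapped[column] = value
        mapped_rows.append(mapped)
    return mapped_rows, hits


rows = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("b", 4)],
    [("b", 5), ("a", 6)],
]
mapped_rows, hits = map_fields_to_columns(rows, {"a": 0, "b": 1})
```

The hint costs one comparison per field when the field order is stable, and degrades to the normal lookup when the order changes.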
a0782a1855 [fix](file reader) fix be core in broker file reader (#17039)
A const reference class member was bound to a temporary object, which can no longer be accessed after the temporary is destroyed, causing a BE core dump when debug-level logging is enabled.

_broker_addr has been destroyed in BrokerFileReader
2023-02-26 12:35:31 +08:00
f6ce072297 [Enhencement](csv-reader) Optimize csv_reader _split_value and fix json_reader case sensitive (#17093)
1. Enhancement:
    For a single-character column separator, csv_reader uses a different method to split values.
2. Bug fix:
    Make `json` file format loading case sensitive.
2023-02-26 09:03:04 +08:00
c43e521d29 [feature](multi-catalog) support map&struct type in parquet&orc reader (#17087)
Support parsing map&struct type in parquet&orc reader.

## Remaining Problems
1. Doris uses array types to build the key and value columns of a `map`, but doesn't fill the offsets in the value column, so the offsets in the value column are wasted.
2. Parquet supports reading only the key or the value column of a `map`; this PR doesn't support that yet.
3. Parquet supports reading partial columns of a `struct`; this PR doesn't support that yet.
2023-02-26 08:55:39 +08:00
e42465ae59 [fix](OrcReader) handle null values in orc reader for string type (#17135)
Orc doesn't fill null values in a new batch, but the former batch has already been released.
Other types like int/long/timestamp... are flat types without pointers in them,
so they do not need to be handled separately like string.
2023-02-26 08:10:40 +08:00
3ea6478ba8 [feature](multi-catalog) parquet reader support nested array column (#16961)
Support decoding nested array columns in the parquet reader:
1. FE should generate the right nested column type. FE doesn't check the nesting depth or legality, e.g. `map<array<int>, int>`.
2. `ParquetColumnReader` has removed page index filtering in order to support nested array types.
    It's too difficult to skip values in nested complex types. Page index filtering and lazy read may be supported in a later PR.
3. `ExternalFileScanNode` has a bug in creating the default value expression.
4. Reading repetition levels in a while loop may be slow. This will be optimized in the next PR.
5. An array column has temporary `SchemaElement`s in its thrift definition;
the former implementation removed them and kept their parent.
The remaining parent should inherit the repetition and definition levels of its child.
2023-02-23 14:54:58 +08:00
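To make the role of repetition and definition levels in 3ea6478ba8 more concrete, here is a minimal sketch for a single-level list column. It is not the Doris decoder: `array_end_offsets` and `repeated_def_level` are hypothetical names, and null lists are treated the same as empty lists.

```python
def array_end_offsets(rep_levels, def_levels, repeated_def_level):
    """Compute per-row end offsets for a single-level list column.

    rep_levels: 0 starts a new row, 1 continues the current list.
    def_levels: an element slot exists when the level reaches repeated_def_level.
    Assumes well-formed input, i.e. the first repetition level is 0.
    """
    offsets = []
    count = 0
    for rep, definition in zip(rep_levels, def_levels):
        if rep == 0:
            # A new row begins; until an element arrives it is empty.
            offsets.append(count)
        if definition >= repeated_def_level:
            count += 1
            offsets[-1] = count
    return offsets


# Rows [[1, 2], [], [3]] for an optional list of required ints (max definition level 2).
offsets = array_end_offsets([0, 1, 0, 0], [2, 2, 1, 2], 2)
```

Each offset marks where a row's elements end in the flattened value column, so row `i` owns the values between `offsets[i - 1]` and `offsets[i]`.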
61826e3a77 [Improvement](parquet-reader) Improve performance of parquet reader filter calculation. (#16934)
Improve performance of parquet reader filter calculation.

- Use `filter_data` instead of `(*filter_ptr)` to merge filters to improve performance.
- Use the mutable column filter func instead of the original new column filter func, which was introduced by #16850.
- Pass the column pointer by reference to avoid increasing the column ref-count, which caused unnecessary copying.
2023-02-23 14:41:30 +08:00
29c46d6926 [fix](struct-type) fix be core when load array orc file (#16978)
* fix be core when load array orc file
2023-02-22 10:15:39 +08:00
4cb97b6fb7 [chore](macOS) Fix linkage errors for the release build (#17002)
Issue Number: close #17003

## Problem summary
The linker couldn't find some symbols because the implementation of a template member function doris::vectorized::Decoder::init_decimal_converter is missing in the header file in which the corresponding declaration is placed.
2023-02-22 10:01:51 +08:00
491d269412 [fix](tvf) fix bug that failed to get schema of tvf when file is empty (#16928)
In previous implementation, when querying tvf, FE will get schema from BE.
And BE will try to open the first file to get its schema info, but for orc or parquet format,
if the file is empty, it will return error.
But even for an empty file, we can still get schema info from file's footer.
So we should handle the empty file to get schema info correctly.

Also modify the catalog doc to add some FAQ.
2023-02-21 14:14:32 +08:00
113023fb86 (Enhancement)[load-json] support simdjson in new json reader (#16903)
be config:
enable_simdjson_reader=true

related PR #11665
2023-02-21 11:31:00 +08:00
a46941c684 [Fix](multi-catalog) Fix switch-case fall-through issue in multi-catalog module. (#16931)
Fix switch-case fall-through issue in multi-catalog module.
2023-02-20 21:35:41 +08:00
ef2fdb79bb [Improvement](parquet-reader) Optimize and refactor parquet reader to improve performance. (#16818)
Optimize and refactor parquet reader to improve performance.
- Improve 2x performance for small dict string by aligned copying.
- Refactor code to decrease condition(if) checking.
- Don't call skip(0).
- Don't read page index if no condition.

**ssb-flat-100**: (single-machine, single-thread)
| Query        | before opt           | after opt  |
| ------------- |:-------------:| ---------:|
| SELECT count(lo_revenue) FROM lineorder_flat       | 9.23   | 9.12 |
| SELECT count(lo_linenumber) FROM lineorder_flat | 4.50    | 4.36 |
| SELECT count(c_name) FROM lineorder_flat             | 18.22 | 17.88| 
| **SELECT count(lo_shipmode) FROM lineorder_flat**     |**10.09** | **6.15**|
2023-02-20 11:42:29 +08:00
292926e5aa [Fix](multi catalog)Fix partition case bug (#16763)
Set column names from path to lower case when matching is case-insensitive.
This is for Iceberg columns from path. Iceberg columns are case sensitive,
which may cause errors for tables with partitions.
2023-02-16 15:47:23 +08:00
de8d884ec3 [Fix](multi catalog)Fix iceberg parquet file doesn't have iceberg.schema meta problem (#16764)
To support schema evolution, Iceberg adds schema information to Parquet file metadata.
But early Iceberg versions don't write any schema information to Parquet files.
This PR supports reading Parquet files without schema information.
2023-02-16 00:08:59 +08:00
0d9714b179 [Fix](multi catalog)Support read hive1.x orc file. (#16677)
Hive 1.x may write orc files with internal column names (_col0, _col1, _col2...).
This causes query results to be NULL because the column names in the orc file don't match
the column names in the Doris table schema. This PR supports querying Hive orc files with internal column names.

So far we haven't seen any problem with Parquet files; a new PR will be sent to fix Parquet if any problem shows up in the future.
2023-02-14 14:32:27 +08:00
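One way to handle the internal names described in 0d9714b179 is to fall back to positional matching when every file column looks like `_colN`. The sketch below assumes that strategy for illustration; the commit message does not spell out the exact rule, and `resolve_orc_columns` is a hypothetical helper.

```python
import re

# Hive 1.x style internal column names: _col0, _col1, ...
_INTERNAL_NAME = re.compile(r"_col\d+")


def resolve_orc_columns(file_columns, table_columns):
    """Return {table column name: file column name} for the columns that can be read."""
    if file_columns and all(_INTERNAL_NAME.fullmatch(name) for name in file_columns):
        # Assumption: internal names carry no meaning, so match by position.
        return dict(zip(table_columns, file_columns))
    # Otherwise match by name and skip columns the file does not contain.
    present = set(file_columns)
    return {name: name for name in table_columns if name in present}


by_position = resolve_orc_columns(["_col0", "_col1"], ["id", "name"])
by_name = resolve_orc_columns(["id", "name"], ["id", "name", "age"])
```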
37d1519316 [WIP](dynamic-table) support dynamic schema table (#16335)
Issue Number: close #16351

A dynamic schema table is a special type of table whose schema changes with the loading procedure. This feature is currently implemented mainly for semi-structured data such as JSON: since JSON is self-describing, schema info can be extracted from the original documents and the final type information inferred. This special table reduces manual schema change operations, makes it easy to import semi-structured data, and extends its schema automatically.
2023-02-11 13:37:50 +08:00
c1a1275870 [fix](memory) Fix parquet load stack overflow (#16537) 2023-02-10 08:48:12 +08:00
27216dc7e0 [improvement](multi-catalog) push down all predicates into rowgroup/page filtering for ParquetReader (#16388)
Two improvements:
1. Refactor rowgroup&page filtering in `ParquetReader`, and use the operator overloading of Doris native C++ types to process comparisons.
2. Support decimal/decimal v3/date/datev2/datetime/datetimev2
2023-02-07 11:32:57 +08:00
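The row group filtering that 27216dc7e0 refers to can be illustrated with min/max statistics. This is a simplified sketch, not the `ParquetReader` code: `ColumnStats`, `may_match` and `select_row_groups` are hypothetical names, and Python's own comparison operators on `date` stand in for the operator overloading of native types.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ColumnStats:
    min_value: object
    max_value: object


def may_match(stats, op, literal):
    """Return False only when no value in [min, max] can satisfy `column op literal`."""
    if op == "=":
        return stats.min_value <= literal <= stats.max_value
    if op == "<":
        return stats.min_value < literal
    if op == "<=":
        return stats.min_value <= literal
    if op == ">":
        return stats.max_value > literal
    if op == ">=":
        return stats.max_value >= literal
    return True  # unknown operator: be conservative and keep the row group


def select_row_groups(row_groups, column, op, literal):
    """Return indexes of row groups that still need to be read."""
    return [
        index
        for index, stats_by_column in enumerate(row_groups)
        if column not in stats_by_column
        or may_match(stats_by_column[column], op, literal)
    ]


row_groups = [
    {"d": ColumnStats(date(2023, 1, 1), date(2023, 1, 31))},
    {"d": ColumnStats(date(2023, 2, 1), date(2023, 2, 28))},
]
selected = select_row_groups(row_groups, "d", ">=", date(2023, 2, 10))
```

Because the comparison relies only on the ordering of the value type, the same function works for dates, decimals and integers alike.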
b1b2697cc7 [fix](iceberg) fix iceberg catalog (#16372)
1. Fix iceberg catalog access s3
2. Fix iceberg catalog partition table query
3. Fix persistence
2023-02-05 13:15:28 +08:00
d2b5015d3f [enhancement](profile) add the profile counter RawRowsRead to record the rows read from the parquet file (#16328) 2023-02-04 22:59:34 +08:00
Pxl
5e4bb98900 [Chore](build) enable -Wpedantic and update lowest gcc version to 11.1 (#16290)
enable -Wpedantic and update lowest gcc version to 11.1
2023-02-03 11:28:48 +08:00
9618427020 [improvement](multi-catalog) increase default batch_size to 4064 (#16326)
The performance of ClickBench Q30 is affected by batch_size:
| batch_size | 1024 | 4096 | 20480 |
| -- | -- | -- | -- |
| Q30 query time | 2.27 | 1.08 | 0.62 |

The aggregation operator creates a new result block for each batch block, which is time-consuming because Q30 has 90 columns. A larger batch_size decreases the number of aggregation blocks and therefore improves performance.

The Doris internal reader reads at least 4064 rows even if batch_size < 4064, so this PR makes reading external tables behave the same as reading internal tables.
2023-02-02 11:51:09 +08:00
1c5279d26e [fix](multi-catalog) remove the eof check among parquet columns (#16302)
Read parquet file failed:
```
ERROR 1105 (HY000): errCode = 2, detailMessage = [INTERNAL_ERROR]Read parquet file xxx failed, reason = [CORRUPTION]The number of rows are not equal among parquet columns
```
This error may be thrown when reading non-predicate columns in lazy-read, for example:
A row group with 1000 rows has two non-predicate columns.
Column A has one page; column B has two pages with 500 rows each.
The read range of `ParquetColumnReader` is [0, 400), and the rows in [0, 450) are all filtered by predicate columns.
So column A can skip the first page and reach the EOF, while column B can also skip the first page but doesn't reach the EOF.
2023-02-02 09:22:09 +08:00
b878a7e61e [feature](Load)Suppot skip specific lines number for csv stream load (#16055)
Support setting the number of lines to skip when loading a csv file through stream load.

Usage `-H skip_lines:number`:
```
curl --location-trusted -u root: -T test.csv -H skip_lines:5  -XPUT http://127.0.0.1:8030/api/testDb/testTbl/_stream_load
```

Skipping lines can also be used in mysql load, as below:
```sql
LOAD DATA
LOCAL
INFILE '${mysql_load_skip_lines}'
INTO TABLE ${tableName}
COLUMNS TERMINATED BY ','
IGNORE 2 LINES
PROPERTIES ("auth" = "root:");
```
2023-02-01 20:42:43 +08:00
fa14b7ea9c [Enhancement](icebergv2) Optimize the position delete file filtering mechanism in iceberg v2 parquet reader (#16024)
close #16023
2023-01-28 00:04:27 +08:00
1589d453a3 [fix](multi catalog)Support parquet and orc upper case column name (#16111)
Column names of external hms catalog tables in Doris are all in lower case,
while an iceberg table or a hive table created by spark-sql may contain upper case column names,
which causes an empty query result. This PR fixes this bug.
1. For parquet files, convert all column names to lower case while parsing parquet metadata.
2. For orc files, store the original column names and the lower case column names in two vectors, and use the suitable names in different cases.
3. On the FE side, change the column name back to the original iceberg column name in convertToIcebergExpr.
2023-01-27 23:52:11 +08:00
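The two-vector approach that 1589d453a3 describes for orc files can be sketched as below. The helper names `build_name_lookup` and `find_file_column` are hypothetical and only illustrate keeping the original and lower case names side by side.

```python
def build_name_lookup(file_columns):
    """Keep the original names and their lower case forms in two parallel lists."""
    original = list(file_columns)
    lowered = [name.lower() for name in original]
    return original, lowered


def find_file_column(table_column, original, lowered):
    """Find the original file column name for a lower case table column, or None."""
    try:
        return original[lowered.index(table_column.lower())]
    except ValueError:
        return None


original, lowered = build_name_lookup(["ID", "userName"])
match = find_file_column("username", original, lowered)
```

The lower case list is used for matching against the table schema, while the original list is what gets passed back to the file reader.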
23edb3de5a [fix](icebergv2) fix bug that delete file reader is not opened (#16133)
PR #15836 changed the way the parquet reader is used: first open(), then init_reader().
But open() was not called for iceberg delete files, which caused a coredump.
2023-01-24 10:19:46 +08:00
199d7d3be8 [Refactor]Merged string_value into string_ref (#15925) 2023-01-22 16:39:23 +08:00
de12957057 [debug](ParquetReader) print file path if failed to read parquet file (#16118) 2023-01-21 08:05:17 +08:00
3ebc98228d [feature wip](multi catalog)Support iceberg schema evolution. (#15836)
Support iceberg schema evolution for the parquet file format.
Iceberg uses a unique id for each column to support schema evolution.
To support this feature in Doris, the FE side needs to get the current column id for each column and send the ids to the BE side.
BE reads the column ids from parquet key_value_metadata and sets the changed column names in the Block to match the names in the parquet file before reading data. It sets the names back after reading data.
2023-01-20 12:57:36 +08:00
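The id-based renaming in 3ebc98228d can be illustrated with plain dictionaries. This is a sketch under assumptions: `read_with_schema_evolution` is a hypothetical function, the reader is passed in as a callable, and columns that do not exist in the file are simply left out.

```python
def read_with_schema_evolution(table_ids, file_ids, read_file_columns):
    """Read columns by id so that renamed columns still resolve.

    table_ids: {current table column name: column id} (sent by FE in the commit's design).
    file_ids: {column id: column name stored in the file metadata}.
    read_file_columns: callable taking file column names and returning {name: values}.
    """
    # Translate current table names into the names used when the file was written.
    table_to_file = {
        name: file_ids[column_id]
        for name, column_id in table_ids.items()
        if column_id in file_ids
    }
    data = read_file_columns(list(table_to_file.values()))
    # Set the names back to the current table names after reading.
    return {table_name: data[file_name] for table_name, file_name in table_to_file.items()}


storage = {"old_name": [1, 2], "b": [3, 4]}
result = read_with_schema_evolution(
    {"new_name": 1, "b": 2, "added": 3},
    {1: "old_name", 2: "b"},
    lambda names: {name: storage[name] for name in names},
)
```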
Pxl
b727033906 [Chore](build) enable -Wextra and remove some -Wno (#15760)
enable -Wextra and remove some -Wno
2023-01-15 10:40:35 +08:00
34bb9cd5d3 [fix](parquet-reader) fix coredump when load datatime data to doris from parquet (#15794)
`date_time_v2` checks the scale when constructing datetimev2:
```
LOG(FATAL) << fmt::format("Scale {} is out of bounds", scale);
```

This [PR](https://github.com/apache/doris/pull/15510) has fixed this issue, but parquet does not use the constructor to create `TypeDescriptor`, leaving `scale = -1` when reading datetimev2 data.
2023-01-13 11:51:11 +08:00
f17d69e450 [feature](file cache)Import file cache for remote file reader (#15622)
The main purpose of this PR is to introduce `fileCache` for lakehouse reads of remote files.
The local disk is used as the cache for remote files, so the next time a file is read,
the data can be obtained directly from the local disk.
In addition, this PR includes a few other minor changes.

Import File Cache:
1. The imported `fileCache` is called `block_file_cache`, which uses an LRU replacement policy.
2. Implement a new FileReader, `CachedRemoteFilereader`, so that the logic of `file cache` is hidden under `CachedRemoteFilereader`.

Other changes:
1. Add a new interface `fs()` for `FileReader`.
2. `IOContext` adds some statistics to track the behavior of `FileCache`.

Co-authored-by: Lightman <31928846+Lchangliang@users.noreply.github.com>
2023-01-10 12:23:56 +08:00
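The LRU block cache idea from f17d69e450 can be sketched in memory as follows. This is only an illustration: `LruBlockCache` and `fetch_remote` are hypothetical names, blocks are kept in a dict rather than on local disk, and the real `block_file_cache` is not modeled.

```python
from collections import OrderedDict


class LruBlockCache:
    """Cache fixed-size blocks of remote files and evict the least recently used."""

    def __init__(self, capacity, block_size, fetch_remote):
        self.capacity = capacity          # maximum number of cached blocks
        self.block_size = block_size
        self.fetch_remote = fetch_remote  # callable(path, offset, size) -> bytes
        self.blocks = OrderedDict()       # (path, block index) -> bytes
        self.hits = 0
        self.misses = 0

    def _get_block(self, path, index):
        key = (path, index)
        if key in self.blocks:
            self.blocks.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.blocks[key]
        self.misses += 1
        data = self.fetch_remote(path, index * self.block_size, self.block_size)
        self.blocks[key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        return data

    def read(self, path, offset, size):
        """Serve a byte range from cached blocks, fetching missing ones."""
        if size <= 0:
            return b""
        first = offset // self.block_size
        last = (offset + size - 1) // self.block_size
        buffer = bytearray()
        for index in range(first, last + 1):
            buffer += self._get_block(path, index)
        start = offset - first * self.block_size
        return bytes(buffer[start:start + size])


remote = bytes(range(100))
calls = []


def fetch(path, offset, size):
    calls.append(offset)
    return remote[offset:offset + size]


cache = LruBlockCache(capacity=2, block_size=10, fetch_remote=fetch)
first_read = cache.read("f", 5, 10)
second_read = cache.read("f", 5, 10)
```

Hiding the cache behind a `read(path, offset, size)` call mirrors the commit's point that callers of the reader do not need to know whether the bytes came from the cache or from remote storage.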
707eab9a63 [opt](multi-catalog) cache and reuse position delete rows in iceberg v2 (#15670)
A delete file may apply to multiple data files. Each data file reads all of its delete files in full,
so a delete file may be read repeatedly. The delete files can be cached, so that multiple data files
reuse the content from the first read.

Performance improves by 60% in the single-thread case and by 30% in the multithreading case.
2023-01-07 22:29:11 +08:00
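The reuse described in 707eab9a63 amounts to memoizing the parsed delete file. A minimal sketch, assuming a hypothetical `DeleteFileCache` class and a loader callable that returns `(data_file_path, position)` pairs:

```python
class DeleteFileCache:
    """Parse each position delete file once and share the rows across data files."""

    def __init__(self, load_delete_file):
        self._load = load_delete_file  # callable(path) -> list of (data file path, position)
        self._cache = {}               # delete file path -> {data file path: [positions]}
        self.loads = 0

    def rows_for(self, delete_file, data_file):
        if delete_file not in self._cache:
            self.loads += 1
            grouped = {}
            for path, position in self._load(delete_file):
                grouped.setdefault(path, []).append(position)
            self._cache[delete_file] = grouped
        return self._cache[delete_file].get(data_file, [])


delete_files = {"del-1": [("a.parquet", 1), ("a.parquet", 4), ("b.parquet", 0)]}
cache = DeleteFileCache(lambda path: delete_files[path])
rows_a = cache.rows_for("del-1", "a.parquet")
rows_b = cache.rows_for("del-1", "b.parquet")
```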
4075e3aec6 [fix](csv-reader) fix new csv reader's performance issue (#15581) 2023-01-04 18:25:08 +08:00
50f1931f96 [fix](multi-catalog) get dictionary-encode from parquet metadata (#15525) 2022-12-31 19:08:10 +08:00
2c8de30cce [optimize](multi-catalog) use dictionary encode&filter to process delete files (#15441)
**Optimize**
PR #14470 used `Expr` to filter the delete rows that match the current data file,
but the rows in a delete file are [sorted by file_path then position](https://iceberg.apache.org/spec/#position-delete-files)
to optimize filtering rows while scanning, so this PR removes `Expr` and uses binary search to filter delete rows.

In addition, delete files are likely to be dictionary encoded, and it's time-consuming to decode the `file_path`
column into `ColumnString`, so this PR uses `ColumnDictionary` to read the `file_path` column.

In testing, the performance of iceberg v2's MOR improved by 30%+.

**Fix Bug**
A lazy-read block may not have the filter column if the whole group is filtered by `Expr`
and the batch_eof is generated from the next batch.
2022-12-30 08:57:55 +08:00
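Because the delete rows are sorted by file_path and then position, the range for one data file can be located with two binary searches. The sketch below uses Python's `bisect` module; `deleted_positions` and `filter_rows` are hypothetical helpers, not the functions added in 2c8de30cce.

```python
from bisect import bisect_left, bisect_right


def deleted_positions(file_paths, positions, data_file):
    """Return the positions deleted in data_file.

    file_paths and positions are parallel lists sorted by (file_path, position).
    """
    low = bisect_left(file_paths, data_file)
    high = bisect_right(file_paths, data_file)
    return positions[low:high]


def filter_rows(row_count, deleted):
    """Return the row indexes that survive, given sorted deleted positions."""
    kept = []
    for row in range(row_count):
        i = bisect_left(deleted, row)
        if i == len(deleted) or deleted[i] != row:
            kept.append(row)
    return kept


paths = ["a.parquet", "a.parquet", "b.parquet", "b.parquet", "b.parquet", "c.parquet"]
positions = [1, 4, 0, 2, 3, 7]
deleted = deleted_positions(paths, positions, "b.parquet")
kept = filter_rows(5, deleted)
```

Two binary searches replace evaluating an expression against every delete row, which is where the saving comes from when a delete file covers many data files.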
f8bb8c7829 [fix](broker) fix be core dump caused by broker load (#15390)
* [fix](broker) fix be core dump caused by broker load
2022-12-28 10:57:41 +08:00