doris

Files

Ashin Gau b837b2eb95 [feature-wip](parquet-reader) filter rows by page index (#12664 )

# Proposed changes

[Parquet v1.11+ supports page skipping](https://github.com/apache/parquet-format/blob/master/PageIndex.md), 
which helps the scanner reduce the amount of data that is scanned, decompressed, decoded, and inserted.
According to the performance flame graph, decompression takes up 20% of CPU time.
If a page can be filtered out as a whole, it does not need to be decompressed at all.

However, row boundaries are not aligned across the pages of different columns. Columns that carry predicates can be filtered at page granularity,
but other columns must skip rows within pages, so non-predicate columns only save the decoding and insertion time.

Array columns need the repetition level to align with other columns, so array columns also only save the decoding and insertion time.
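
The alignment issue described above can be illustrated with a minimal sketch. It is not the code in this PR: `PageStats`, `RowRange`, `candidate_row_ranges`, and `page_fully_skippable` are hypothetical names, and the statistics are simplified to integer min/max values. The sketch assumes each page's first row index is known (as `OffsetIndex` provides) and that the predicate is a closed range `[lower, upper]`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for per-page index metadata.
struct PageStats {
    int64_t first_row;  // index of the first row stored in the page
    int64_t min_value;  // smallest value in the page
    int64_t max_value;  // largest value in the page
};

// Half-open row interval [begin, end).
struct RowRange {
    int64_t begin;
    int64_t end;
};

// Rows of the predicate column that may satisfy lower <= v <= upper.
// Pages whose [min, max] cannot overlap the predicate are dropped whole.
std::vector<RowRange> candidate_row_ranges(const std::vector<PageStats>& pages,
                                           int64_t total_rows,
                                           int64_t lower, int64_t upper) {
    std::vector<RowRange> out;
    for (std::size_t i = 0; i < pages.size(); ++i) {
        int64_t begin = pages[i].first_row;
        int64_t end = (i + 1 < pages.size()) ? pages[i + 1].first_row : total_rows;
        assert(begin <= end);
        bool may_match = pages[i].max_value >= lower && pages[i].min_value <= upper;
        if (!may_match) continue;
        if (!out.empty() && out.back().end == begin) {
            out.back().end = end;  // merge with the adjacent surviving page
        } else {
            out.push_back({begin, end});
        }
    }
    return out;
}

// A page of another column, covering rows [page_begin, page_end), can be
// skipped whole only if it overlaps none of the candidate ranges.
// Otherwise the unwanted rows must be skipped inside the page.
bool page_fully_skippable(int64_t page_begin, int64_t page_end,
                          const std::vector<RowRange>& ranges) {
    for (const RowRange& r : ranges) {
        if (r.begin < page_end && page_begin < r.end) return false;
    }
    return true;
}
```

With three predicate-column pages of 100 rows each and only the last one surviving, a non-predicate column page covering rows 150 to 300 still has to be read, and rows 150 to 199 are skipped inside it.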

## Explore
`OffsetIndex` in the column metadata can locate the page position.
Theoretically, a page can be completely skipped, including the time of reading from HDFS.
However, the average size of a page is around 500KB, and skipping a page requires a call to `skip`.
`skip` performs poorly when it is called frequently,
and may be no faster than reading large contiguous blocks of data (such as 4MB).

If multiple consecutive pages are filtered out, a single `skip` can be performed according to `OffsetIndex`.
However, for ease of programming and readability, the data of all pages is currently loaded and filtered in turn.
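
As a rough illustration of the alternative that is not implemented here, the sketch below merges runs of consecutive filtered pages into byte ranges that could each be passed to one `skip` call. `PageLoc`, `SkipRange`, `coalesce_skips`, and the `min_skip_bytes` threshold are assumptions made for this example. The threshold reflects the observation above that frequent small skips may cost more than reading through.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a page's location as recorded in OffsetIndex.
struct PageLoc {
    int64_t offset;           // byte offset of the page in the file
    int64_t compressed_size;  // size of the page on disk, in bytes
};

// A contiguous byte range that can be skipped with one call.
struct SkipRange {
    int64_t offset;
    int64_t length;
};

// Merge runs of consecutive filtered pages into skip ranges.
// Runs shorter than min_skip_bytes are not emitted, so those pages
// are read and discarded instead of triggering a small skip.
std::vector<SkipRange> coalesce_skips(const std::vector<PageLoc>& pages,
                                      const std::vector<bool>& filtered,
                                      int64_t min_skip_bytes) {
    assert(pages.size() == filtered.size());
    std::vector<SkipRange> out;
    std::size_t i = 0;
    while (i < pages.size()) {
        if (!filtered[i]) {
            ++i;
            continue;
        }
        int64_t start = pages[i].offset;
        int64_t end = start;
        // Extend the run across every consecutive filtered page.
        while (i < pages.size() && filtered[i]) {
            end = pages[i].offset + pages[i].compressed_size;
            ++i;
        }
        if (end - start >= min_skip_bytes) {
            out.push_back({start, end - start});
        }
    }
    return out;
}
```

A suitable threshold would have to be measured against the actual storage backend.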

2022-09-20 15:55:19 +08:00

src

[feature-wip](parquet-reader) filter rows by page index (#12664 )

2022-09-20 15:55:19 +08:00

test

[feature-wip](parquet-reader) add page index row range (#12652 )

2022-09-20 10:36:19 +08:00

CMakeLists.txt

[chore](build) add build param to version string (#12591 )

2022-09-15 17:09:22 +08:00