doris/be/test
Ashin Gau 29f502380c [opt](FileReader) merge small IO to optimize read performance (#18796)
Add `MergeRangeFileReader`, which merges small IOs to improve Parquet and ORC read performance.

`MergeRangeFileReader` is a FileReader that efficiently supports random access in formats such as Parquet and ORC.
To merge small IOs in Parquet and ORC, the random access ranges must be generated when the reader is created.
The random access ranges are a list of ranges ordered by offset.
The ranges must be read sequentially: a range can be skipped, but it cannot be read more than once.
When `read_at` is called with a start offset that falls inside a random access range, the requested slice must not span two ranges.

For example, in Parquet, the random access ranges are the column offsets in a row group.
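
The constraints on random access ranges can be illustrated with a minimal sketch. It is not the code from this PR: the `Range` struct and the helpers `ranges_are_ordered`, `find_range`, and `read_is_valid` are hypothetical names, and ranges are assumed to be half-open `[start, end)` intervals.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one random access range: [start, end).
struct Range {
    std::size_t start;
    std::size_t end;
};

// Returns true if every range is non-empty and the list is ordered by
// offset without overlaps.
bool ranges_are_ordered(const std::vector<Range>& ranges) {
    for (std::size_t i = 0; i < ranges.size(); ++i) {
        if (ranges[i].start >= ranges[i].end) return false;
        if (i > 0 && ranges[i].start < ranges[i - 1].end) return false;
    }
    return true;
}

// Binary search for the range that contains `offset`.
// Returns its index, or -1 if the offset lies outside every range.
int find_range(const std::vector<Range>& ranges, std::size_t offset) {
    std::size_t lo = 0;
    std::size_t hi = ranges.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (ranges[mid].end <= offset) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    if (lo < ranges.size() && ranges[lo].start <= offset) {
        return static_cast<int>(lo);
    }
    return -1;
}

// A read that starts inside a range must end inside the same range.
// A read that starts outside every range is not constrained here.
bool read_is_valid(const std::vector<Range>& ranges, std::size_t offset,
                   std::size_t size) {
    assert(ranges_are_ordered(ranges));
    int idx = find_range(ranges, offset);
    if (idx < 0) return true;
    return offset + size <= ranges[idx].end;
}
```

With ranges `{0, 100}, {200, 300}, {300, 450}`, a read of 100 bytes at offset 200 stays inside one range and is accepted, while a read of 100 bytes at offset 250 would cross into the next range and is rejected.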

When reading at an offset, if [offset, offset + 8MB) contains many random access ranges,
the reader reads the data in [offset, offset + 8MB) as a whole and copies the data that falls in random access ranges into small
buffers (called boxes; 1MB each by default, 64MB in total). A box can be occupied by many ranges,
and a reference counter records how many ranges are cached in the box. When the reference counter drops to zero,
the box can be released or reused by other ranges. When there is no empty box for a new read operation,
the read is performed directly.
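
The merge window and the reference-counted boxes can be sketched as follows. This is a simplified model under stated assumptions, not the implementation in this PR: `Range`, `MergePlan`, `plan_merge`, and `BoxPool` are hypothetical names, the window and box sizes are parameters rather than the 8MB and 1MB defaults, and the sketch only tracks byte counts instead of copying data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one random access range: [start, end).
struct Range {
    std::size_t start;
    std::size_t end;
};

// Result of planning one merged read that starts at `offset`.
struct MergePlan {
    std::size_t merged_end;     // the merged IO covers [offset, merged_end)
    std::size_t request_bytes;  // bytes that belong to random access ranges
    std::size_t num_ranges;     // ranges served by this single IO
};

// Collects every range that ends inside [offset, offset + window).
// `ranges` is assumed to be ordered by offset.
MergePlan plan_merge(const std::vector<Range>& ranges, std::size_t offset,
                     std::size_t window) {
    MergePlan plan{offset, 0, 0};
    const std::size_t limit = offset + window;
    for (const Range& r : ranges) {
        if (r.end <= offset) continue;  // already behind the read position
        if (r.end > limit) break;       // does not fit in the window
        plan.request_bytes += r.end - std::max(r.start, offset);
        plan.merged_end = r.end;
        ++plan.num_ranges;
    }
    return plan;
}

// Fixed-size boxes, each with a counter of the ranges cached in it.
class BoxPool {
public:
    BoxPool(std::size_t box_size, std::size_t num_boxes)
            : _box_size(box_size), _used(num_boxes, 0), _refs(num_boxes, 0) {}

    // Tries to cache `bytes` bytes of one range. On success, `boxes` lists
    // the boxes that hold a piece of it. Returns false and changes nothing
    // when space is short, in which case the caller reads directly.
    bool cache_range(std::size_t bytes, std::vector<std::size_t>* boxes) {
        boxes->clear();
        std::size_t available = 0;
        for (std::size_t i = 0; i < _refs.size(); ++i) available += free_space(i);
        if (available < bytes) return false;
        std::size_t remaining = bytes;
        for (std::size_t i = 0; i < _refs.size() && remaining > 0; ++i) {
            std::size_t take = std::min(free_space(i), remaining);
            if (take == 0) continue;
            _used[i] += take;
            ++_refs[i];
            boxes->push_back(i);
            remaining -= take;
        }
        return true;
    }

    // Called once a cached range has been consumed by the caller.
    void release(const std::vector<std::size_t>& boxes) {
        for (std::size_t i : boxes) {
            assert(_refs[i] > 0);
            if (--_refs[i] == 0) _used[i] = 0;  // box is empty and reusable
        }
    }

    // Number of boxes that no range currently references.
    std::size_t free_boxes() const {
        return static_cast<std::size_t>(std::count(_refs.begin(), _refs.end(), 0));
    }

private:
    std::size_t free_space(std::size_t i) const { return _box_size - _used[i]; }

    std::size_t _box_size;
    std::vector<std::size_t> _used;
    std::vector<int> _refs;
};
```

In this model a box becomes reusable only when the last range cached in it is released, which mirrors the reference-counter rule described above. A failed `cache_range` call corresponds to the fallback where the read is issued directly.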

## Effects
The runtime of ClickBench reduces from 102s to 77s, and the runtime of Query 24 reduces from 24.74s to 9.45s.
The profile of Query 24:
```
 VFILE_SCAN_NODE  (id=0):(Active:  8s344ms,  %  non-child:  83.06%)
    -  FileReadBytes:  534.46  MB
    -  FileReadCalls:  1.031K  (1031)
    -  FileReadTime:  28s801ms
    -  GetNextTime:  8s304ms
    -  MaxScannerThreadNum:  12
    -  MergedSmallIO:  0ns
        -  CopyTime:  157.774ms
        -  MergedBytes:  549.91  MB
        -  MergedIO:  94
        -  ReadTime:  28s642ms
        -  RequestBytes:  507.96  MB
        -  RequestIO:  1.001K  (1001)
    -  NumScanners:  18
```
1001 requested IOs have been merged into 94 IOs.

## Remaining problems
1. Add p2 regression test in the next PR
2. Profiles are scattered across the code and will be refactored in the next PR
3. Support ORC reader
2023-04-23 10:51:38 +08:00