[opt](file_reader) add prefetch buffer to read csv&json file (#18301)

Co-authored-by: ByteYue <yj976240184@gmail.com>
This PR is a follow-up optimization to https://github.com/apache/doris/pull/17478:
1. Change the buffer size of `LineReader` to 4MB to match the size of the prefetch buffer.
2. Prefetch data lazily, on the first read, to avoid wasted reads.
3. The S3 block size is only 32MB, which is too small for a file split. Set the default file split size to 128MB.
4. Add `_end_offset` to the prefetch buffer so that it does not read data past that offset.
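
Items 2 and 4 can be illustrated with a minimal sketch. It is written in Java to match the config diff in this commit and is not the actual Doris reader; `LazyPrefetchReader`, `readAt`, `fill`, and `fetchCount` are hypothetical names, and a byte array stands in for the remote object.

```java
import java.util.Arrays;

/** Hypothetical sketch: fetches nothing until the first read and never reads past endOffset. */
final class LazyPrefetchReader {
    private final byte[] source;          // stands in for a remote object, e.g. a file on S3
    private final long endOffset;         // exclusive end of the range this reader may touch
    private final int bufferSize;         // prefetch buffer capacity in bytes
    private byte[] buffer = new byte[0];  // empty until the first read triggers a fetch
    private long bufferStart = 0;         // file offset of buffer[0]
    int fetchCount = 0;                   // number of simulated remote reads

    LazyPrefetchReader(byte[] source, long endOffset, int bufferSize) {
        this.source = source;
        this.endOffset = Math.min(endOffset, source.length);
        this.bufferSize = bufferSize;
        // Deliberately no fetch here: the constructor is free of I/O.
    }

    /** Copies up to len bytes starting at offset into dst; returns the number of bytes copied. */
    int readAt(long offset, byte[] dst, int len) {
        if (offset < 0 || offset >= endOffset || len <= 0) {
            return 0;
        }
        boolean miss = offset < bufferStart || offset >= bufferStart + buffer.length;
        if (miss) {
            fill(offset);
        }
        int available = (int) (bufferStart + buffer.length - offset);
        int n = Math.min(len, available);
        System.arraycopy(buffer, (int) (offset - bufferStart), dst, 0, n);
        return n;
    }

    /** Simulated remote read, clamped so it never crosses endOffset. */
    private void fill(long offset) {
        int n = (int) Math.min(bufferSize, endOffset - offset);
        buffer = Arrays.copyOfRange(source, (int) offset, (int) offset + n);
        bufferStart = offset;
        fetchCount++;
    }
}
```

In this sketch a reader that is constructed but never read costs no remote request, and the last fetch of a range shrinks to the bytes that remain before `endOffset`.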

Query performance when reading data on object storage improves by more than 3x.
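
The effect of item 3 on the number of splits can be sketched as follows. This is an illustration only: `SplitPlanner`, `effectiveSplitSize`, and `plan` are hypothetical names, and the real split logic in the FE may differ. The only behavior taken from this commit is that `file_split_size = 0` falls back to the storage block size and that the new default is 134217728 bytes (128MB).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of how a split size turns a file into byte ranges. */
final class SplitPlanner {
    static final long DEFAULT_FILE_SPLIT_SIZE = 134217728L; // 128MB, the new default

    /** A configured size of 0 means "use the storage block size". */
    static long effectiveSplitSize(long fileSplitSize, long blockSize) {
        return fileSplitSize > 0 ? fileSplitSize : blockSize;
    }

    /** Returns {start, length} pairs that cover a file of fileLength bytes. */
    static List<long[]> plan(long fileLength, long fileSplitSize, long blockSize) {
        long splitSize = effectiveSplitSize(fileSplitSize, blockSize);
        if (splitSize <= 0) {
            throw new IllegalArgumentException("split size must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            // The last split is shorter when the file is not a multiple of splitSize.
            splits.add(new long[] {start, Math.min(splitSize, fileLength - start)});
        }
        return splits;
    }
}
```

Under these assumptions, a 1GB file on S3 yields 32 splits with the old default (0, which falls back to the 32MB block size) and 8 splits with the new 128MB default.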
Ashin Gau
2023-04-04 19:05:22 +08:00
committed by GitHub
parent d7623028e9
commit 66bfd18601
8 changed files with 309 additions and 6 deletions


@@ -1742,8 +1742,11 @@ public class Config extends ConfigBase {
 @ConfField(mutable = true, masterOnly = false)
 public static long file_scan_node_split_num = 128;
+// 0 means use the block size in HDFS/S3 as split size.
+// HDFS block size is 128MB, while S3 block size is 32MB.
+// 32MB is too small for a S3 file split, so set 128MB as default split size.
 @ConfField(mutable = true, masterOnly = false)
-public static long file_split_size = 0; // 0 means use the block size in HDFS/S3 as split size
+public static long file_split_size = 134217728;
 /**
 * If set to TRUE, FE will: