doris

Author	SHA1	Message	Date
Tiewei Fang	f17d69e450	[feature](file cache)Import `file cache` for remote file reader (#15622 ) The main purpose of this pr is to import `fileCache` for lakehouse reading remote files. Use the local disk as the cache for reading remote file, so the next time this file is read, the data can be obtained directly from the local disk. In addition, this pr includes a few other minor changes Import File Cache: 1. The imported `fileCache` is called `block_file_cache`, which uses lru replacement policy. 2. Implement a new FileRereader `CachedRemoteFilereader`, so that the logic of `file cache` is hidden under `CachedRemoteFilereader`. Other changes: 1. Add a new interface `fs()` for `FileReader`. 2. `IOContext` adds some statistical information to count the situation of `FileCache` Co-authored-by: Lightman <31928846+Lchangliang@users.noreply.github.com>	2023-01-10 12:23:56 +08:00
luozenglin	f8bb8c7829	[fix](broker) fix be core dump caused by broker load (#15390 ) * [fix](broker) fix be core dump caused by broker load	2022-12-28 10:57:41 +08:00
Tiewei Fang	ec055e1acb	[feature](new file reader) Integrate new file reader (#15175 )	2022-12-26 08:55:52 +08:00
Ashin Gau	5cefd05869	[fix](multi-catalog) fix and optimize iceberg v2 reader (#15274 ) Fix three bugs when read iceberg v2 tables: 1. The `delete position` in `delete file` represents the position of delete row in the entire file, but the `read range` in `RowGroupReader` represents the position in current row group. Therefore, we need to subtract the position of first row of current row group from `delete position`. 2. When only reading the partition columns, `RowGroupReader` skips processing the `delete position`. 3. If the `delete position` has delete all rows in a row group, the `read range` is empty, but we read the whole row group in such case. Optimize four performance issues: 1. We change `delete position` to `delete range`, and then merge `delete range` and `read range` into the final read ranges. This process is too tedious and time-consuming. . we can merge `delete position` and `read range` directly. 2. `delete position` is ordered in a `delete file`, so we can use merge-sort, instead of ordered-set. 3. Initialize `RowGroupReader` when reading, instead of initialize all row groups when opening a `ParquetReader`, to save memory usage, and the same as `IcebergReader`. 4. Change the recursive call of `_do_lazy_read` to loop logic.	2022-12-24 16:02:07 +08:00
Ashin Gau	7730a88d11	[fix](multi-catalog) add support for orc binary type (#15141 ) Fix three bugs: 1. DataTypeFactory::create_data_type is missing the conversion of binary type, and OrcReader will failed 2. ScalarType#createType is missing the conversion of binary type, and ExternalFileTableValuedFunction will failed 3. fmt::format can't generate right format string, and will be failed	2022-12-19 14:24:12 +08:00
Jibing-Li	8fe0729835	[fix](multi catalog)Check orc file reader is not null before using it. (#14988 ) The external table file path cache may out of date, which will cause orc reader to visit non-exist files. In this case, orc file reader is nullptr. This pr is to check the reader before using it to avoid core dump of visiting nullptr.	2022-12-13 11:27:51 +08:00
Gabriel	1ec7f45fb6	[Bug](avg) Fix `avg` for bigint (#14433 )	2022-11-22 10:29:59 +08:00
Gabriel	2c42f0a905	[refactor](decimalv3) Refine code for DecimalV3 (#14394 )	2022-11-19 16:57:17 +08:00
Ashin Gau	6bd5378f66	[feature-wip](multi-catalog) lazy read for ParquetReader (#13917 ) Read predicate columns firstly, and use VExprContext(push-down predicates) to generate the select vector, which is then applied to read the non-predicate columns. The data in non-predicate columns may be skipped by select vector, so the value-decode-time can be reduced. If a whole page can be skipped, the decompress-time can also be reduced.	2022-11-10 16:56:14 +08:00
Tiewei Fang	43eb946543	[feature](table-valued-function)S3 table valued function supports parquet/orc/json file format #14130 S3 table valued function supports parquet/orc/json file format. For example: parquet format	2022-11-10 10:33:12 +08:00
Ashin Gau	f7c69ade18	[feature-wip](multi-catalog) implement predicate pushdown in native OrcReader (#13453 ) # Proposed changes Implement predicate pushdown in `OrcReader` by converting doris `ColumnValueRange` to orc `SearchArgument`. ## Remaining problems 1. Orc support `not in`, which may have effect on bloom filter. However, doris `ScanNode` has not push down `not in` to file scanner. 2. Orc support `is null`, and row range has `hasNull` identifier. However, `_contain_null` in `ColumnValueRange` is ambiguous. `_contain_null = true` only means that the value can be nullable, not equal to null. 3. `DateTimeV2` has lost microsecond precision in `ColumnValueRange`, which may cause filtering error when a min-max value equals to the predicate value. 4. `DateTimeV1` is not accurate enough, and only saved to seconds. 5. Orc support the predicate pushdown of `float&double` type, but doris has not push down `float&double` type for precision reason.	2022-10-20 10:07:36 +08:00
Ashin Gau	21f233d7e7	[feature-wip](multi-catalog) use apache orc reader to read orc file (#13404 ) Use apache orc to read orc file, and convert ColumnVectorBatch to doris block.	2022-10-18 13:47:56 +08:00

12 Commits