Commit Graph

5 Commits

Author SHA1 Message Date
c5ad989065 [refactor](reader) refactor the interface of file reader (#12574)
Currently, Doris has a variety of readers for different file formats,
such as parquet reader, orc reader, csv reader, json reader and so on.

The interfaces of these readers are not unified, which makes it impossible to call them through a unified method.

In this PR, I added a `GenericReader` interface class, and other Readers will implement this interface class
to use the `get_next_block()` method.

This PR currently only modifies `arrow_reader` and `parquet reader`.
Other readers will be modified one by one in subsequent PRs.
2022-09-14 22:31:11 +08:00
54f878b781 [feature-wip](multi-catalog) Support orc format file split for file scan node (#11046) 2022-07-25 11:41:46 +08:00
8a366c9ba2 [feature](multi-catalog) read parquet file by start/offset (#10843)
To avoid reading the repeat row group, we should align offsets
2022-07-18 20:51:08 +08:00
8e364fb848 [fix](load) skip empty orc file (#10593)
Something the upstream system(eg, hive) may create empty orc file
which only has a header and footer, without schema.
And if we call `_reader->createRowReader()` with selected columns,
it will throw ParserError: Invalid column selected xx.
So here we first check its number of rows and skip these kind of files.

This is only a fix for non-vec load, for vec load, it use arrow scanner
to read orc file, which does not have this problem.
2022-07-05 22:18:56 +08:00
cbbda7857b [feature-wip](parquet-orc) Support orc scanner in vectorized engine (#9541) 2022-05-26 21:39:12 +08:00