doris

Author	SHA1	Message	Date
slothever	d435f0de41	[feature-wip](parquet-reader) add page index row range (#12652 ) Add some utils and provide the candidate row range to read for the page index filter (generated from the skipped row ranges of each column). This version supports binary operator filters. TODO: - use context instead of structures in close() - process complex type filter - use this instead of row group minmax filter - refactor _eval_binary() for row group filter and page index filter	2022-09-20 10:36:19 +08:00
Jibing-Li	5978fd9647	[refactor](file scanner)Refactor file scanner. (#12602 ) Refactor the scanners for hms external catalog, work in progress. Use VFileScanner, will remove NewFileParquetScanner, NewFileOrcScanner and NewFileTextScanner after fully tested. Query for parquet file has been tested, still need to add readers for orc file, text file and load logic as well.	2022-09-19 15:23:51 +08:00
yixiutt	b136d80e1a	[enhancement](compress) reuse compression ctx and buffer (#12573 ) Reuse compression ctx and buffer. Use a global instance for every compression algorithm, and use a thread-safe buffer pool to reuse compression buffers. The pool size equals the maximum number of parallel compression threads, so it will not grow too large. Tests show about a 5% improvement in data import and compaction. Co-authored-by: yixiutt <yixiu@selectdb.com>	2022-09-15 10:59:46 +08:00
Mingyu Chen	c5ad989065	[refactor](reader) refactor the interface of file reader (#12574 ) Currently, Doris has a variety of readers for different file formats, such as parquet reader, orc reader, csv reader, json reader and so on. The interfaces of these readers are not unified, which makes it impossible to call them through a unified method. In this PR, I added a `GenericReader` interface class, and other Readers will implement this interface class to use the `get_next_block()` method. This PR currently only modifies `arrow_reader` and `parquet reader`. Other readers will be modified one by one in subsequent PRs.	2022-09-14 22:31:11 +08:00
slothever	9f25544f2f	[feature-wip](parquet-reader) page index bug fix (#12428 ) Co-authored-by: jinzhe <jinzhe@selectdb.com>	2022-09-13 10:28:53 +08:00
Ashin Gau	b4663062da	[feature-wip](parquet-reader) bug fix, parquet footer buffer is small when containing many columns (#12477 ) Reading a parquet file with many columns (>1600) failed: mysql> select int_col from types_sf100_r100w limit 5; ERROR 1105 (HY000): errCode = 2, detailMessage = Couldn't deserialize thrift msg: TProtocolException: Invalid data. parse_thrift_footer uses a fixed-length buffer (64k) to read the parquet footer, but the metadata of a parquet file with 1600 columns can exceed 5MB. Therefore, the buffer needs to be sized according to the actual footer length.	2022-09-09 09:12:34 +08:00
Ashin Gau	dd2f834c79	[feature-wip](parquet-reader) bug fix, create compress codec before parsing dictionary (#12422 ) ## Fix five bugs: 1. Parquet dictionary data may be compressed, but `ColumnChunkReader` tries to parse dictionary data before creating the compression codec, causing unexpected data errors. 2. `FE` doesn't resolve array type 3. `ParquetFileHdfsScanner` doesn't fill partition values when the table is partitioned 4. `ParquetFileHdfsScanner` sets `_scanner_eof = true` when a scan range is empty, which ends the scanner early and results in data loss 5. typographical error in `PageReader`	2022-09-08 09:54:25 +08:00
slothever	4a55b504c0	[feature-wip](parquet-reader) bug fix, get the correct group reader (#12294 ) Fix the failure to read the lineitem table of TPCH, and a memory allocation error. Co-authored-by: jinzhe <jinzhe@selectdb.com>	2022-09-06 13:59:35 +08:00
Ashin Gau	202ad5c659	[feature-wip](parquet-reader) bug fix, the number of rows are different among columns in a block (#12228 ) 1. `ExprContext` is deleted in `ParquetReader::close()`, but it has not been closed, so the `DCHECK` in `~ExprContext()` fails. The lifetime of `ExprContext` is managed by the scan node, so we should not delete its pointer in `ParquetReader::close()`. 2. `RowGroupReader::next_batch` updates `_read_rows` in every column loop, and does not ensure that the number of rows in every column is equal. 3. The skipped row ranges are stack variables, which are released when calling `ArrayColumnReader::read_column_data`, so we should copy them out.	2022-09-02 09:50:25 +08:00
Ashin Gau	1cc9eeeb1a	[feature-wip](parquet-reader) read and generate array column (#12166 ) Read and generate parquet array column. D=1, R=0 represents an empty array. An empty array is not a null value, so the NullMap for this row is false, and the offset for this row is [offset_start, offset_end) with `offset_start == offset_end`, where offset_end is the start offset of the next row, so there is no value in the nested primitive column. D=0, R=0 represents a null array, and the NullMap for this row is true.	2022-08-31 17:08:12 +08:00
Ashin Gau	dec576a991	[feature-wip](parquet-reader) generate null values and NullMap for parquet column (#12115 ) Generate null values and NullMap for the nullable column by analyzing the definition levels.	2022-08-29 09:30:32 +08:00
Ashin Gau	0b5bb565a7	[feature-wip](parquet-reader) parquet dictionary decoder (#11981 ) Parse parquet data with dictionary encoding. Using the PLAIN_DICTIONARY enum value is deprecated in the Parquet 2.0 specification. Prefer using RLE_DICTIONARY in a data page and PLAIN in a dictionary page for Parquet 2.0+ files. refer: https://github.com/apache/parquet-format/blob/master/Encodings.md	2022-08-26 19:24:37 +08:00
slothever	0c16740f5c	[feature-wip](parquet-reader) parquet scanner can read data (#11970 ) Co-authored-by: jinzhe <jinzhe@selectdb.com>	2022-08-26 09:43:46 +08:00
Ashin Gau	6d925054de	[feature-wip](parquet-reader) decode parquet time & datetime & decimal (#11845 ) 1. Spark can set the timestamp precision by the following configuration: spark.sql.parquet.outputTimestampType = INT96(NANOS), TIMESTAMP_MICROS, TIMESTAMP_MILLIS. DATETIME V1 only keeps second precision; DATETIME V2 keeps microsecond precision. 2. If using DECIMAL V2, the BE saves the value as decimal128, and keeps the precision of decimal as (precision=27, scale=9). DECIMAL V3 can maintain the right precision of decimal.	2022-08-22 10:15:35 +08:00
slothever	124b4f7694	[feature-wip](parquet-reader) row group reader ut finish (#11887 ) Co-authored-by: jinzhe <jinzhe@selectdb.com>	2022-08-18 17:18:14 +08:00
slothever	f39f57636b	[feature-wip](parquet-reader) update column read model and add page index (#11601 )	2022-08-16 15:04:07 +08:00
Ashin Gau	0b9bfd15b7	[feature-wip](parquet-reader) parquet physical type to doris logical type (#11769 ) Two improvements have been added: 1. Translate parquet physical type into doris logical type. 2. Decode parquet column chunk into doris ColumnPtr, and add unit tests to show how to use related API.	2022-08-15 16:08:11 +08:00
Ashin Gau	8f5aed27ec	[feature-wip](parquet-reader)read and decode parquet physical type (#11637 ) # Proposed changes Read and decode parquet physical type. 1. The encoding type of boolean is bit-packing; this PR introduces the implementation of bit-packing from Impala 2. Create a parquet file including all the primitive types supported by hive ## Remaining Problems 1. At present, only physical types are decoded, and there are no methods for mapping and converting them to doris logical types. 2. Decimal / Timestamp / Date types are not yet parsed or processed. 3. Int_8 / Int_16 is stored as Int_32; how to resolve these types is still open.	2022-08-11 10:17:32 +08:00
Ashin Gau	37d1180cca	[feature-wip](parquet-reader)decode parquet data (#11536 )	2022-08-08 12:44:06 +08:00
slothever	e8a344b683	[feature-wip](parquet-reader) add predicate filter and column reader (#11488 )	2022-08-08 10:21:24 +08:00
slothever	95753ec868	[feature](parquet-reader) add group filter util (#11533 ) * [feature-wip](parquet-reader) add group filter util Co-authored-by: jinzhe <jinzhe@selectdb.com>	2022-08-05 14:02:48 +08:00
Ashin Gau	aed0282046	[feature-wip](parquet-reader)get compressed parquet page data (#11493 )	2022-08-04 17:44:52 +08:00
slothever	1b4d6a620a	(feature-wip)[parquet-reader] support page index serde (#11415 )	2022-08-03 10:36:06 +08:00
Ashin Gau	44a1a20e65	[feature-wip](parquet-reader)parse parquet schema (#11381 ) Analyze schema elements in parquet FileMetaData, and generate the hierarchy of nested fields. For example: 1. primitive type ``` // thrift: optional int32 <column-name>; // sql definition: <column-name> int32; ``` 2. nested type ``` // thrift: optional group <column-name> (LIST) { repeated group bag { optional group array_element (LIST) { repeated group bag { optional int32 array_element } } } } // sql definition: <column-name> array<array<int32>> ```	2022-08-02 10:56:13 +08:00
slothever	e4bc3f6b6f	[feature-wip] (parquet-reader) add parquet reader impl template (#11285 )	2022-07-29 14:30:31 +08:00

25 Commits
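
The array-column commit above (1cc9eeeb1a) describes how definition (D) and repetition (R) levels map to offsets and the NullMap. The following is a minimal Python sketch of that rule, not the Doris implementation. It assumes a single-level `optional list<optional int32>` column with a maximum definition level of 3 (D=0 null array, D=1 empty array, D=2 null element, D=3 present element); the function name `assemble_array_column` and the leading `0` in `offsets` are illustrative choices.

```python
def assemble_array_column(levels, values):
    """Rebuild an array column from parquet (D, R) level pairs.

    levels: list of (definition_level, repetition_level) tuples.
    values: the non-null leaf values, in file order.
    Assumes max definition level 3 and max repetition level 1 (hypothetical schema).
    """
    offsets = [0]    # row i spans nested[offsets[i]:offsets[i + 1]]
    null_map = []    # True when the array itself is null
    nested = []      # flattened elements; None marks a null element
    value_iter = iter(values)

    for d, r in levels:
        if r == 0:
            # R=0 starts a new row; D=0 means the whole array is null.
            null_map.append(d == 0)
            offsets.append(offsets[-1])  # empty until elements are added
        if d >= 2:
            # D=2 is a null element, D=3 is a present element.
            nested.append(next(value_iter) if d == 3 else None)
            offsets[-1] += 1

    return offsets, null_map, nested


# Rows: [1, 2], [], NULL, [NULL, 3]
levels = [(3, 0), (3, 1), (1, 0), (0, 0), (2, 0), (3, 1)]
offsets, null_map, nested = assemble_array_column(levels, [1, 2, 3])
print(offsets, null_map, nested)
```

For the empty row and the null row, `offset_start == offset_end`, so neither contributes values to the nested column; only the NullMap distinguishes them.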