doris

Author	SHA1	Message	Date
Qi Chen	cc4778a271	[Fix](orc-reader) Check hasNulls() firstly when use notNull data in ColumnVectorBatch. #18674	2023-04-15 19:48:31 +08:00
lihangyu	6c0af24e9d	[Improve](simdjson reader) support UTF-8 unicode (with BOM) (#18585 )	2023-04-13 21:58:44 +08:00
Jerry Hu	6d91635c5b	[fix](json_reader) Do not increase the value of read_rows for empty line (#18611 ) If read an empty row the row num++, the row num will be larger than actual column size, it will core.	2023-04-13 10:08:11 +08:00
Pxl	297764b37d	[Chore](build) fix some compile fail on gnu20 && remove some unused compatibility codes (#18467 )	2023-04-10 18:05:52 +08:00
Ashin Gau	f28c75bd80	[fix](file_reader) bad_typeid when reading csv&json files (#18400 ) PR(#18340) resolve the conflict with PR(#18301) has changed the file_reader to create, resulting in e: [E-123] std::bad_typeid exception.	2023-04-06 10:00:29 +08:00
Ashin Gau	47aa8a6d8a	[fix](file_cache) turn on file cache by FE session variable (#18340 ) Fix tow bugs: 1. Enabling file caching requires both `FE session` and `BE` configurations(enable_file_cache=true) to be enabled. 2. `ParquetReader` has not used `IOContext` previously, but `CachedRemoteFileReader::read_at` needs `IOContext` after PR(#17586).	2023-04-05 15:51:47 +08:00
Ashin Gau	66bfd18601	[opt](file_reader) add prefetch buffer to read csv&json file (#18301 ) Co-authored-by: ByteYue <[yj976240184@gmail.com](mailto:yj976240184@gmail.com)> This PR is an optimization for https://github.com/apache/doris/pull/17478: 1. Change the buffer size of `LineReader` to 4MB to align with the size of prefetch buffer. 2. Lazily prefetch data in the first read to prevent wasted reading. 3. S3 block size is 32MB only, which is too small for a file split. Set 128MB as default file split size. 4. Add `_end_offset` for prefetch buffer to prevent wasted reading. The query performance of reading data on object storage is improved by more than 3x+.	2023-04-04 19:05:22 +08:00
Qi Chen	eb0fd0017e	[Fix](orc-reader) Fix the scale of decimal column is incorrect when query orc tables. (#18324 ) The scale of decimal column is incorrect when query orc tables.	2023-04-04 08:50:47 +08:00
slothever	97aab138aa	[fix](parquet-reader) reset value idx in bool rle decoder and support iceberg datetime(3) (#18245 ) 1. Fix value idx in bool rle decoder 2. Iceberg table support datetimev2(3). In the previous version, we converted hive timestamp to datetimev2(0) default.	2023-04-01 21:00:01 +08:00
Mingyu Chen	7e61a85331	[refactor](libhdfs) introduce hadoop libhdfs (#18204 ) 1. Introduce hadoop libhdfs 2. For Linux-X86 platform, use the hadoop libhdfs 3. For other platform, use libhdfs3, because currently we don't have hadoop libhdfs binary for other platform Co-authored-by: adonis0147 <adonis0147@gmail.com>	2023-03-31 18:41:39 +08:00
huanghaibin	e3bd812887	[fix](stream-load) find line delimiter in csv should start with no offset (#18161 ) when loading big file with multi bytes line delimiter, some line record maybe incomplete because of _output_buf_limit, so this incomplete data will move to the beginning of the output buf and read more data into output buf. In this case, find line delimiter should start with no offset to avoid a bug that spilt two lines as one line.	2023-03-30 14:42:34 +08:00
Ashin Gau	a813ad56ad	[fix](multi-catalog) key and value columns of map are normal column type (#18160 ) PR(#17330) has changed the column type of kay and value from array to normal column, but orc&parquet reader still cast to array column, resulting in cast error.	2023-03-28 23:11:40 +08:00
Qi Chen	6b6682cd96	[Enhancement](Expr) Opt In Set by small size fixed container to improve performance. (#17976 )	2023-03-28 23:10:39 +08:00
Mingyu Chen	7c0bcbdca1	[enhance](parquet-reader) cache file meta of parquet to speed up query (#18074 ) Problem: 1. FE will split the parquet file into split. So a file can have several splits. 2. BE will scan each split, read the footer of the parquet file. 3. If 2 splits belongs to a same parquet file, the footer of this file will be read twice. This PR mainly changes: 1. Use kv cache to cache the footer of parquet file. 2. The kv cache is belong to a scan node, so all parquet reader belong to this scan node will share same kv cache. 3. In cache, the key is "meta_file_path", the value is parsed thrift footer. The KV Cache is sharded into mutlti sub cache. So that different file can use different sub cache, avoid blocking each other In my test, a query with 26 splits can reduce the footer parse time from 4s -> 1s	2023-03-25 23:22:57 +08:00
Qi Chen	3870689cbb	[Fix](parquet-reader) Fix iceberg_schema_evolution regression test caused by slot col name different with parquet col name. (#17988 )	2023-03-23 11:23:08 +08:00
Mingyu Chen	cb79e42e5c	[refactor](file-system)(step-1) refactor file sysmte on BE and remove storage_backend (#17586 ) See #17764 for details I have tested: - Unit test for local/s3/hdfs/broker file system: be/test/io/fs/file_system_test.cpp - Outfile to local/s3/hdfs/broker. - Load from local/s3/hdfs/broker. - Query file on local/s3/hdfs/broker file system, with table value function and catalog. - Backup/Restore with local/s3/hdfs/broker file system Not test: - cold & host data separation case.	2023-03-21 21:08:38 +08:00
Gabriel	bd8e3e6405	[refactor](date) unify DateTimeValue and VecDateTimeValue (#17670 )	2023-03-20 16:27:08 +08:00
Qi Chen	378789ba8a	[Fix](parquet-reader) Fix dict_filter crashed caused by VDirectInPredicate checking expr result is not nullable. (#17924 ) Be crashed in parquet dict_filter function caused by VDirectInPredicate checking expr result is not nullable.	2023-03-20 00:02:59 +08:00
lihangyu	043f77200f	[Bug](dynamic-table) Fix column alignment logic and support filtering null values when slot is not null (#17842 ) Before this PR when encountering null values with some columns which is specified as `NOT NULL`, null values will not be filtered,thi behavior does not match with the original load behavior. Second column alignment logic has bug : ``` template <typename ColumnInserterFn> void align_variant_by_name_and_type(ColumnObject& dst, const ColumnObject& src, size_t row_cnt, ColumnInserterFn inserter) { CHECK(dst.is_finalized() && src.is_finalized()); // Use rows() here instead of size(), since size() will check_consistency // but we could not check_consistency since num_rows will be upgraded even // if src and dst is empty, we just increase the num_rows of dst and fill // num_rows of default values when meet new data size_t num_rows = dst.rows(); ```	2023-03-17 16:53:30 +08:00
Qi Chen	b4b126b817	[Feature](parquet-reader) Implements dict filter functionality parquet reader. (#17594 ) Implements dict filter functionality parquet reader to improve performance.	2023-03-16 20:29:27 +08:00
lihangyu	9b7596f1c6	[Feature](Dynamic schema table) step1 support schema change expression (#17494 ) 1. introduce a new type `VARIANT` to encapsulate dynamic generated columns for hidding the detail of types and names of newly generated columns 2. introduce a new expression `SchemaChangeExpr` for doing schema change for extensibility	2023-03-13 15:12:42 +08:00
Pxl	16fc3a0e22	[Chore](compile) remove some unused static on inline function to reduce compile time (#17603 ) remove some unused static on inline function to reduce compile time	2023-03-13 11:11:59 +08:00
slothever	455c800405	[feature](parquet-reader) add rle bool and delta decoder to read AWS Glue (#17112 ) Support delta encoding and rle(bool) to read Glue data add delta bit pack decoder, add delta length byte array decoder, add delta byte array decoder. add rle bool decoder. We find some data type is read with delta encoding on AWS Glue, so it should be supported. The definition of delta encoding can refer to the delta encoding in parquet.	2023-03-12 20:09:58 +08:00
Xin Liao	8001d65811	[fix](insert) fix memory leak for insert transaction (#17530 )	2023-03-08 14:10:59 +08:00
Ashin Gau	dca16796ad	[fix](ParquetReader) definition level of repeated parent is wrong (#17337 ) Fix three bugs: 1. `repeated_parent_def_level ` should be the definition of its repeated parent. 2. Failed to parse schema like `decimal(p, s)` 3. Fill wrong offsets for array type	2023-03-06 18:15:57 +08:00
yiguolei	9477c48ef8	[refactor](functioncontext) remove duplicate type definition in function context (#17421 ) remove duplicate type definition in function context remove unused method in function context not need stale state in vexpr context because vexpr is stateless and function context saves state and they are cloned. remove useless slot_size in all tuple or slot descriptor. remove doris_udf namespace, it is useless. remove some unused macro definitions. init v_conjuncts in vscanner, not need write the same code in every scanner. using unique ptr to manage function context since it could only belong to a single expr context. Issue Number: close #xxx --------- Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-03-06 16:07:09 +08:00
luozenglin	e7cba11680	[fix](array)(parquet) fix be core dump due to load from parquet file containing array types (#17298 )	2023-03-06 15:18:42 +08:00
Mingyu Chen	3d0beec01d	[fix](orc) fix heap-use-after-free and potential memory leak of orc reader (#17431 ) fix heap-use-after-free The OrcReader has a internal FileInputStream, If the file is empty, the memory of FileInputStream will leak. Besides, there is a Statistics instance in FileInputStream. FileInputStream maybe delete if the orc reader is inited failed, but Statistics maybe used when orc reader is closed, causing heap-use-after-free error. Potential memory leak When init file scanner in file scan node, the file scanner prepare failed, the memory of file scanner will leak.	2023-03-06 08:42:35 +08:00
HappenLee	1244eed1cd	[Opt](exec) opt the dispose nullable column logic (#17192 )	2023-03-01 23:25:40 +08:00
Tiewei Fang	f1db0d9501	[Enhencement](File Reader) delete old file_reader (#17261 ) * delete old file_reader * fix 1	2023-03-01 20:24:03 +08:00
Ashin Gau	bf5037d6d5	[fix](OrcReader) typo in anaylize null values (#17156 ) typographical error in analyzing null values for OrcReader.	2023-02-28 14:29:13 +08:00
slothever	598038e674	[improvement](parquet-reader)support parquet data page v2 (#17054 ) Support parquet data page v2 Now the parquet data on AWS glue use data page v2, but we didn't support before.	2023-02-28 14:23:45 +08:00
Pxl	0723e55f76	[Bug](build) fix compile fail on unused value #17165 error: variable 'nullcount' set but not used [-Werror,-Wunused-but-set-variable] int nullcount = 0;	2023-02-27 14:19:44 +08:00
lihangyu	29dc08fc45	[Optimize](simd json reader) Cached search results for previous row (keyed as index in JSON object) - used as a hint. (#17124 ) * [Optimize](simd json reader) Cached search results for previous row (keyed as index in JSON object) - used as a hint. `_simdjson_set_column_value` could become a hot spot while parsing json in simdjson mode, introduce `_prev_positions` to cache results for previous row (keyed as index in JSON object) due to the json name field order, should be quite the same between each lines * fix case	2023-02-27 10:39:22 +08:00
zxealous	a0782a1855	[fix](file reader) fix be core in broker file reader (#17039 ) A const reference member variables as class member stores a temporary object, which cannot be got after the temporary object being destroyed, cause be core dump while enable debug level log _broker_addr has been destroyed in BrokerFileReader	2023-02-26 12:35:31 +08:00
Tiewei Fang	f6ce072297	[Enhencement](csv-reader) Optimize csv_reader `_split_value` and fix json_reader case sensitive (#17093 ) 1. Enhencement: For single-charset column separator，csv_reader use another method of `split value`. 2. BugFix Set `json` file format loading to be sensitive.	2023-02-26 09:03:04 +08:00
Ashin Gau	c43e521d29	[feature](multi-catalog) support map&struct type in parquet&orc reader (#17087 ) Support parsing map&struct type in parquet&orc reader. ## Remaining Problems 1. Doris use array type to build the key and value column of a `map`, but doesn't fill the offsets in value column, so the offsets in value column is wasted. 2. Parquet support reading only key or value column in `map`, this PR hasn't supported yet. 3. Parquet support reading partial columns in `struct`, this PR hasn't supported yet.	2023-02-26 08:55:39 +08:00
Ashin Gau	e42465ae59	[fix](OrcReader) handle null values in orc reader for string type (#17135 ) Orc doesn't fill null values in new batch, but the former batch has been release. Other types like int/long/timestamp... are flat types without pointer in them, so other types do not need to be handled separately like string.	2023-02-26 08:10:40 +08:00
Ashin Gau	3ea6478ba8	[feature](multi-catalog) parquet reader support nested array column (#16961 ) Support to decode nested array column in parquet reader: 1. FE should generate the right nested column type. FE doesn't check the nesting depth and legality, like map\<array\<int\>, int\>. 2. `ParquetColumnReader` has removed the filtering of page index to support nested array type. It's too difficult to skip values in nested complex types. Maybe we should support the filtering of page index and lazy read in later PR. 3. `ExternalFileScanNode` has a bug in creating default value expression. 4. Maybe it's slow to read repetition levels in a while loop. I'll optimize this in next PR. 5. Array column has temporary `SchemaElement` in its thrift definition, we have removed them and keep its parent in former implementation. The remaining parent should inherit the repetition and definition level of its child.	2023-02-23 14:54:58 +08:00
Qi Chen	61826e3a77	[Improvement](parquet-reader) Improve performance of parquet reader filter calculation. (#16934 ) Improve performance of parquet reader filter calculation. - Use `filter_data` instead of `(*filter_ptr)` to merge filter to improve performance. - Use mutable column filter func instead of original new column filter func which introduced by #16850. - Avoid column ref-count increasing which caused unnecessary copying by passing column pointer ref.	2023-02-23 14:41:30 +08:00
zxealous	29c46d6926	[fix](struct-type) fix be core when load array orc file (#16978 ) * fix be core when load array orc file	2023-02-22 10:15:39 +08:00
Adonis Ling	4cb97b6fb7	[chore](macOS) Fix linkage errors for the release build (#17002 ) Issue Number: close #17003 ## Problem summary The linker couldn't find some symbols because the implementation of a template member function doris::vectorized::Decoder::init_decimal_converter is missing in the header file in which the corresponding declaration is placed.	2023-02-22 10:01:51 +08:00
Mingyu Chen	491d269412	[fix](tvf) fix bug that failed to get schema of tvf when file is empty (#16928 ) In previous implementation, when querying tvf, FE will get schema from BE. And BE will try to open the first file to get its schema info, but for orc or parquet format, if the file is empty, it will return error. But even for an empty file, we can still get schema info from file's footer. So we should handle the empty file to get schema info correctly. Also modify the catalog doc to add some FAQ.	2023-02-21 14:14:32 +08:00
lihangyu	113023fb86	(Enhancement)[load-json] support simdjson in new json reader (#16903 ) be config: enable_simdjson_reader=true related PR #11665	2023-02-21 11:31:00 +08:00
Qi Chen	a46941c684	[Fix](multi-catalog) Fix switch-case fall-through issue in multi-catalog module. (#16931 ) Fix switch-case fall-through issue in multi-catalog module.	2023-02-20 21:35:41 +08:00
Qi Chen	ef2fdb79bb	[Improvement](parquet-reader) Optimize and refactor parquet reader to improve performance. (#16818 ) Optimize and refactor parquet reader to improve performance. - Improve 2x performance for small dict string by aligned copying. - Refactor code to decrease condition(if) checking. - Don't call skip(0). - Don't read page index if no condition. ssb-flat-100: (single-machine, single-thread) \| Query \| before opt \| after opt \| \| ------------- \|:-------------:\| ---------:\| \| SELECT count(lo_revenue) FROM lineorder_flat \| 9.23 \| 9.12 \| \| SELECT count(lo_linenumber) FROM lineorder_flat \| 4.50 \| 4.36 \| \| SELECT count(c_name) FROM lineorder_flat \| 18.22 \| 17.88\| \| SELECT count(lo_shipmode) FROM lineorder_flat \|10.09 \| 6.15\|	2023-02-20 11:42:29 +08:00
Jibing-Li	292926e5aa	[Fix](multi catalog)Fix partition case bug (#16763 ) Set column names from path to lower case in case-insensitive case. This is for Iceberg columns from path. Iceberg columns are case sensitive, which may cause error for table with partitions.	2023-02-16 15:47:23 +08:00
Jibing-Li	de8d884ec3	[Fix](multi catalog)Fix iceberg parquet file doesn't have iceberg.schema meta problem (#16764 ) To support schema evolution, Iceberg add schema information to Parquet file metadata. But for early iceberg version, it doesn't write any schema information to Parquet file. This PR is to support read parquet without schema information.	2023-02-16 00:08:59 +08:00
Jibing-Li	0d9714b179	[Fix](multi catalog)Support read hive1.x orc file. (#16677 ) Hive 1.x may write orc file with internal column name (_col0, _col1, _col2...). This will cause query result be NULL because column name in orc file doesn't match with column name in Doris table schema. This pr is to support query Hive orc files with internal column names. For now, we haven't see any problem in Parquet file, will send new pr to fix parquet if any problem show up in the future.	2023-02-14 14:32:27 +08:00
lihangyu	37d1519316	[WIP](dynamic-table) support dynamic schema table (#16335 ) Issue Number: close #16351 Dynamic schema table is a special type of table, it's schema change with loading procedure.Now we implemented this feature mainly for semi-structure data such as JSON, since JSON is schema self-described we could extract schema info from the original documents and inference the final type infomation.This speical table could reduce manual schema change operation and easily import semi-structure data and extends it's schema automatically.	2023-02-11 13:37:50 +08:00

1 2 3 4

155 Commits