Fixed the problem of being unable to read LZ4-compressed Parquet data. By default, the data is decompressed as the Hadoop LZ4 block format; if that fails, the reader falls back to the standard LZ4 format (sketched below).
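As an illustration of the fallback order only (a minimal sketch, not the Doris code; it assumes the lz4-java library and the usual Hadoop LZ4 block layout of 4-byte big-endian size prefixes):
```
import java.nio.ByteBuffer;
import net.jpountz.lz4.LZ4Factory;
import net.jpountz.lz4.LZ4SafeDecompressor;

// Illustrative sketch only, not the Doris implementation. It first tries the
// Hadoop LZ4 block layout (4-byte big-endian uncompressed size, then one or
// more [4-byte compressed size + LZ4 chunk] pairs) and falls back to plain
// LZ4 block decompression if that layout does not parse.
public class Lz4Fallback {
    private static final LZ4SafeDecompressor DECOMPRESSOR =
            LZ4Factory.fastestInstance().safeDecompressor();

    public static byte[] decompress(byte[] src, int uncompressedSize) {
        byte[] dst = new byte[uncompressedSize];
        try {
            ByteBuffer in = ByteBuffer.wrap(src); // big-endian by default
            int dstOff = 0;
            while (in.remaining() >= 8) {
                int blockUncompressed = in.getInt(); // uncompressed size of this block
                int remaining = blockUncompressed;
                while (remaining > 0) {
                    int chunkLen = in.getInt();      // compressed size of the next chunk
                    if (chunkLen <= 0 || chunkLen > in.remaining()) {
                        throw new IllegalStateException("not the Hadoop LZ4 layout");
                    }
                    int n = DECOMPRESSOR.decompress(src, in.position(), chunkLen, dst, dstOff);
                    in.position(in.position() + chunkLen);
                    dstOff += n;
                    remaining -= n;
                }
            }
            return dst;
        } catch (RuntimeException e) {
            // Fallback: treat the whole buffer as a single standard LZ4 block.
            DECOMPRESSOR.decompress(src, 0, src.length, dst, 0);
            return dst;
        }
    }
}
```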
- Fix a crash on complex types when the dictionary filter is used in the parquet reader, by disabling the dictionary filter in that case.
- Add a regression test for ORC complex types.
1. Refactor the decoding logic of the parquet reader: the reader first reads data according to the Parquet physical type, and then performs a type conversion (see the sketch after this list).
2. Support Hive `ALTER TABLE`.
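A minimal, self-contained sketch of the two-stage decoding described in item 1 (illustrative only, not the Doris reader): stage 1 decodes raw values by the Parquet physical type, stage 2 converts them to the target logical type, using DATE stored as INT32 days since the epoch as the example.
```
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.LocalDate;

// Illustrative two-stage decode: physical decode first, then type conversion.
public class TwoStageDecode {
    // Stage 1: decode by the physical type (INT32, PLAIN encoding, little-endian).
    static int[] decodeInt32Plain(byte[] page, int numValues) {
        ByteBuffer buf = ByteBuffer.wrap(page).order(ByteOrder.LITTLE_ENDIAN);
        int[] out = new int[numValues];
        for (int i = 0; i < numValues; i++) {
            out[i] = buf.getInt();
        }
        return out;
    }

    // Stage 2: convert the physical values to the logical/engine type
    // (DATE is stored as days since the Unix epoch).
    static LocalDate[] convertToDate(int[] daysSinceEpoch) {
        LocalDate[] out = new LocalDate[daysSinceEpoch.length];
        for (int i = 0; i < daysSinceEpoch.length; i++) {
            out[i] = LocalDate.ofEpochDay(daysSinceEpoch[i]);
        }
        return out;
    }
}
```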
Support complex types in the JNI framework, verified end-to-end on Hudi.
### How to Use
Other scanners only need to implement three methods of `ColumnValue`:
```
// Get array elements and append into values
void unpackArray(List<ColumnValue> values);
// Get map key array&value array, and append into keys&values
void unpackMap(List<ColumnValue> keys, List<ColumnValue> values);
// Get the struct fields specified by `structFieldIndex`, and append into values
void unpackStruct(List<Integer> structFieldIndex, List<ColumnValue> values);
```
Developers can take `HudiColumnValue` as an example.
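For illustration only, a hypothetical value type that wraps already-parsed Java objects might implement the three methods roughly as follows (`MyColumnValue` and its internals are made up for this sketch; the rest of the real `ColumnValue` interface is omitted):
```
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a scanner-side value that wraps a parsed Java object.
// Only the three complex-type methods are shown; see HudiColumnValue for a
// real implementation of the full interface.
public class MyColumnValue /* implements ColumnValue */ {
    private final Object inner;

    public MyColumnValue(Object inner) {
        this.inner = inner;
    }

    // Get array elements and append them into `values`.
    public void unpackArray(List<MyColumnValue> values) {
        for (Object element : (List<?>) inner) {
            values.add(new MyColumnValue(element));
        }
    }

    // Get the map's keys and values and append them into `keys` and `values`.
    public void unpackMap(List<MyColumnValue> keys, List<MyColumnValue> values) {
        for (Map.Entry<?, ?> entry : ((Map<?, ?>) inner).entrySet()) {
            keys.add(new MyColumnValue(entry.getKey()));
            values.add(new MyColumnValue(entry.getValue()));
        }
    }

    // Get the struct fields selected by `structFieldIndex` and append them into `values`.
    public void unpackStruct(List<Integer> structFieldIndex, List<MyColumnValue> values) {
        List<?> fields = (List<?>) inner; // assume struct fields are stored as a list
        for (int idx : structFieldIndex) {
            values.add(new MyColumnValue(fields.get(idx)));
        }
    }
}
```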
Two improvements:
1. Move the `Job_Id` column of the `ANALYZE TABLE` command's result set to the first column, to keep it consistent with `SHOW ANALYZE`.
```
mysql> analyze table hive.tpch100.region;
+--------+--------------+-------------------------+------------+--------------------------------+
| Job_Id | Catalog_Name | DB_Name | Table_Name | Columns |
+--------+--------------+-------------------------+------------+--------------------------------+
| 14403 | hive | default_cluster:tpch100 | region | [r_regionkey,r_comment,r_name] |
+--------+--------------+-------------------------+------------+--------------------------------+
1 row in set (0.03 sec)
```
2. Add the `analyze_timeout` session variable to control the timeout of `ANALYZE TABLE/DATABASE ... WITH SYNC`.
Fix three bugs:
1. A Hudi file slice may contain only log files, in which case `new Path(filePath)` throws an error.
2. Hive column names are always lowercase, so column names are now matched case-insensitively (see the sketch after this list).
3. Support [Spark Datasource Configs](https://hudi.apache.org/docs/configurations/#Read-Options), so users can add `hoodie.datasource.merge.type=skip_merge` to the catalog properties to skip merging log files.
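For bug 2, a minimal sketch of the case-insensitive matching idea (illustrative only; the names below are not the real code):
```
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch: Hive stores column names in lowercase, so requested
// column names are matched against the file schema case-insensitively.
public class CaseInsensitiveMatch {
    public static Map<String, Integer> buildIndex(List<String> fileColumns) {
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < fileColumns.size(); i++) {
            index.put(fileColumns.get(i).toLowerCase(Locale.ROOT), i);
        }
        return index;
    }

    // Returns the position of `requested` in the file schema, or -1 if missing.
    public static int resolve(Map<String, Integer> index, String requested) {
        return index.getOrDefault(requested.toLowerCase(Locale.ROOT), -1);
    }
}
```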
1. Do not split compressed data files.
Some data files in Hive are compressed with gzip, deflate, etc., and these files cannot be split.
2. Support the LZ4 block codec.
For the Hive scan node, use the LZ4 block codec instead of the LZ4 frame codec.
3. Support the Snappy block codec (Hadoop Snappy).
4. Optimize `count(*)` queries on CSV files.
For a query like `select count(*) from tbl`, only the lines need to be split; there is no need to split the columns (see the sketch below).
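A minimal sketch of the `count(*)` idea in item 4 (illustrative only; it simply counts line delimiters and ignores details such as a missing trailing newline or quoted fields):
```
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch: for `select count(*)`, counting line delimiters in the
// raw bytes is enough; the columns never need to be split or parsed.
public class CsvCountStar {
    public static long countRows(InputStream in) throws IOException {
        byte[] buf = new byte[64 * 1024];
        long rows = 0;
        int n;
        while ((n = in.read(buf)) > 0) {
            for (int i = 0; i < n; i++) {
                if (buf[i] == '\n') {
                    rows++;
                }
            }
        }
        return rows;
    }
}
```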
Needs to be picked to branch-2.0 after this PR: #22304
Fix incorrect results when partition fields appear as null values in ORC files.
### Root Cause
In theory, the underlying files of a Hive partitioned table should not contain the partition fields. However, we found that in some user scenarios the partition fields do exist in the underlying ORC/Parquet files with null values. As a result, predicates pushed down on these partition fields are evaluated against the null values and filter incorrectly.
### Solution
We handle this case by reading only the non-partition fields from the file, as sketched below. The parquet reader already works this way; this PR applies the same handling to the ORC reader.
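A minimal sketch of the column pruning described above (illustrative only; the names are made up):
```
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch of the fix: the columns actually read from the ORC/Parquet
// file are the required columns minus the partition columns; partition values
// are filled from the partition path instead of from the (possibly null) file data.
public class PartitionColumnPruning {
    public static List<String> columnsToReadFromFile(List<String> requiredColumns,
                                                     Set<String> partitionColumns) {
        List<String> readColumns = new ArrayList<>();
        for (String col : requiredColumns) {
            if (!partitionColumns.contains(col)) {
                readColumns.add(col);
            }
        }
        return readColumns;
    }
}
```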
`ParquetReader` confuses the logical, physical, and slot IDs of columns. When only scalar types are read, nothing goes wrong, but when complex types are read, `RowGroup` and `PageIndex` get wrong statistics. Therefore, if a query contains complex types and pushed-down predicates, the result set is likely to be incorrect.
[Fix](orc-reader) Fix filling partition or missing columns with an incorrect row count.
`_row_reader->nextBatch` returns the number of rows read. When ORC lazy materialization is enabled, this number includes filtered rows, so the caller must look at `numElements` in the row batch to determine how many rows survived filtering and should be filled into the block (see the sketch below).
Previously, filling partition or missing columns used the wrong row count, which caused a crash on `filter.size() != offsets.size()` in the filter-column step.
When ORC lazy materialization is disabled, also call `_convert_dict_cols_to_string_cols(block, nullptr)` when `block->rows() == 0`.
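A rough sketch of the corrected row-count handling (illustrative Java, not the actual C++ code; all names are made up):
```
// Illustrative sketch: when lazy materialization is on, the value returned by
// nextBatch() also counts filtered rows, so the number of rows to fill into the
// block must come from batch.numElements instead.
public class OrcRowCountSketch {
    static class Batch {
        long numElements; // rows that survived lazy-materialization filtering
    }

    interface RowReader {
        long nextBatch(Batch batch); // returns rows read, including filtered ones
    }

    static long rowsToFill(RowReader reader, Batch batch, boolean lazyMaterialization) {
        long rowsRead = reader.nextBatch(batch);
        // Using rowsRead to size partition/missing columns under lazy
        // materialization is what led to filter.size() != offsets.size().
        return lazyMaterialization ? batch.numElements : rowsRead;
    }
}
```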
Sort out the test cases of external tables.
After the change, there are two directories:
1. `external_table_p0`: all P0 cases of external tables: hive, es, jdbc and tvf
2. `external_table_p2`: all P2 cases of external tables: hive, es, mysql, pg, iceberg and tvf
So that we can run them with a one-line command like:
```
sh run-regression-test.sh --run -d external_table_p0,external_table_p2
```