Check the column type of complex types to prevent a core dump in BE. ColumnReader will throw a segmentation fault in the following case:
Change complex types in Hive:
```sql
hive> create table struct_test(
    id int,
    sf struct<f1: int, f2: map<string, string>>) stored as parquet;
hive> insert into struct_test values
    (1, named_struct('f1', 1, 'f2', str_to_map('1:s2,2:s2'))),
    (2, named_struct('f1', 2, 'f2', str_to_map('k1:s3,k2:s4'))),
    (3, named_struct('f1', 3, 'f2', str_to_map('k1:s5,k2:s6')));
hive> alter table struct_test change sf sf struct<f1:int, f2: string>;
```
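A minimal sketch of the kind of defensive check this implies, using hypothetical types and names rather than the actual Doris API: before building a ColumnReader for a complex column, verify that the table column type still matches the type stored in the parquet file, and fail gracefully instead of crashing.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical, simplified types -- not the real Doris classes.
struct TypeDescriptor { std::string name; /* children omitted */ };
struct ColumnReader {};

// Refuse to build a reader when the table schema and the schema stored in
// the parquet file disagree on a complex column.
std::unique_ptr<ColumnReader> create_complex_column_reader(
        const TypeDescriptor& table_type,
        const TypeDescriptor& file_type) {
    if (table_type.name != file_type.name) {
        // e.g. the table now says struct<f1:int, f2:string>, but the old
        // parquet file still stores struct<f1:int, f2:map<string,string>>.
        throw std::runtime_error(
                "column type mismatch between table schema and parquet file");
    }
    return std::make_unique<ColumnReader>();
}
```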
The UTF-8 format on Windows systems has a BOM. We add a new user property to `Outfile`/`Export`, so that when exporting Doris data, users can choose whether to write a BOM at the beginning of the CSV file.
**Usage:**
```sql
-- outfile:
select * from demo.student
into outfile "file:///xxx/export/exp_"
format as csv
properties(
"column_separator" = ",",
"with_bom" = "true"
);
-- Export:
EXPORT TABLE student TO "file:///xx/tmpdata/export/exp_"
PROPERTIES(
"format" = "csv",
"with_bom" = "true"
);
```
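For reference, writing the BOM amounts to emitting the three bytes `0xEF 0xBB 0xBF` before the first CSV row. A hedged sketch (the function and option names here are illustrative, not the actual Doris writer code):

```cpp
#include <ostream>

// Illustrative CSV writer fragment: when the (hypothetical) with_bom option
// is enabled, emit the UTF-8 BOM before any data.
void write_csv_prefix(std::ostream& out, bool with_bom) {
    if (with_bom) {
        static const char kUtf8Bom[] = {'\xEF', '\xBB', '\xBF'};
        out.write(kUtf8Bom, sizeof(kUtf8Bom));
    }
    // ...column data follows...
}
```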
Issue Number: close #29406
1. Increase the supported lzop version to 0x1040. This is set to 0x1040 only to allow decompressing lzo files compressed by newer versions of lzop; the decompression logic is unchanged. Strictly speaking, 0x1040 should also bring the `F_H_FILTER` feature, but that is mainly for audio and image data, so we do not support it (see the header-check sketch after this list).
2. Use `orc::lzoDecompress()` instead of `lzo1x_decompress_safe()` to decompress lzo data.
3. Use `crc32c::Extend()` instead of `lzo_crc32()`.
4. Use `olap_adler32()` instead of `lzo_adler32()`.
5. As a result, remove the dependency on Markus F.X.J. Oberhumer's lzo library.
6. Remove `DORIS_WITH_LZO`, so lzo files are supported by stream load and broker load by default.
7. Add some regression tests.
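A minimal sketch of the header check described in item 1; the constant names and the flag value are assumptions for illustration, not necessarily the ones used in the Doris lzo reader:

```cpp
#include <cstdint>
#include <stdexcept>

// Assumed constants for illustration.
constexpr uint16_t kMaxSupportedLzopVersion = 0x1040;
constexpr uint32_t F_H_FILTER = 0x00000800;  // lzop filter feature flag (assumed value)

// Reject files we cannot handle: anything requiring a version newer than
// 0x1040, or files that actually use the unsupported F_H_FILTER feature.
void check_lzop_header(uint16_t version_needed_to_extract, uint32_t flags) {
    if (version_needed_to_extract > kMaxSupportedLzopVersion) {
        throw std::runtime_error("unsupported lzop version");
    }
    if (flags & F_H_FILTER) {
        throw std::runtime_error("lzop F_H_FILTER feature is not supported");
    }
}
```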
This PR introduces 4 `bugprone-*` linter rules to `.clang-tidy`; these checks found some bugs in #28965. This PR also adds some comments to mute false-positive reports.
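As an illustration of how such false positives are typically muted (the check name and code below are only an example, not taken from this PR), a single-line suppression names the clang-tidy check:

```cpp
#include <cstring>

// clang-tidy warnings can be suppressed on one line with a NOLINT comment
// naming the check, when the reported pattern is actually intentional.
void copy_prefix(char* dst, const char* src) {
    // NOLINTNEXTLINE(bugprone-not-null-terminated-result)
    std::memcpy(dst, src, std::strlen(src));  // caller appends the terminator
}
```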
1. Skip parquet files that have only a 4-byte length, i.e. nothing but the `PAR1` magic (a small sketch follows this list).
2. Refactor the schema init method of iceberg/hudi/hive tables in the HMS catalog:
   1. Remove some redundant `getIcebergTable` methods.
   2. Fix the issue described in #23771.
3. Support `HoodieParquetInputFormatBase`, treating it as the normal hive table format.
4. When listing files, skip all hidden dirs and files.
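A hedged sketch of the 4-byte check from item 1 (function and constant names are illustrative):

```cpp
#include <cstdint>

// A parquet file both starts and ends with the 4-byte magic "PAR1". A file
// that is only 4 bytes long therefore contains no footer and no row groups,
// so it can be skipped instead of being parsed and failing later.
bool should_skip_parquet_file(uint64_t file_size) {
    constexpr uint64_t kParquetMagicLen = 4;  // strlen("PAR1")
    return file_size <= kParquetMagicLen;
}
```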
Before this PR, Paimon created the schema of `VectorTable` by accessing meta information. However, once the schema of `VectorTable` on the Java side differed from the `Block` on the C++ side, BE would crash, and there was no good way to troubleshoot the error.
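To illustrate the failure mode, here is a hedged sketch of the kind of defensive validation such a mismatch calls for; all names are hypothetical and this is not the actual Doris/Paimon JNI code: compare the schema coming from the Java side with the C++ `Block` before filling it, and report a readable error instead of crashing.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified column descriptors for illustration only.
struct JavaColumnDesc { std::string name; std::string type; };
struct BlockColumnDesc { std::string name; std::string type; };

// Fail with a descriptive message instead of letting a mismatched column
// layout crash the BE later during deserialization.
void check_schema_compatible(const std::vector<JavaColumnDesc>& java_schema,
                             const std::vector<BlockColumnDesc>& block_schema) {
    if (java_schema.size() != block_schema.size()) {
        throw std::runtime_error("column count mismatch between VectorTable and Block");
    }
    for (size_t i = 0; i < java_schema.size(); ++i) {
        if (java_schema[i].type != block_schema[i].type) {
            throw std::runtime_error("type mismatch on column " + java_schema[i].name);
        }
    }
}
```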
`ScannerContext` would schedule scanners even after it was stopped, and it confused `_is_finished` with `_should_stop`. This PR only fixes the concurrency bugs that occur when a scanner is stopped or finished, as reported in https://github.com/apache/doris/pull/28384.
VScanNode::get_next checks whether the ScanNode has reached its limit condition and sends eos to the TaskScheduler, which then tries to close the ScanNode. However, the ScanNode must wait for all running scanners to finish, so even if it has reached the limit condition it cannot be closed immediately. This PR tries to interrupt the running readers so that the ScanNode can end as soon as possible.
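A minimal sketch of the interruption idea, assuming a hypothetical shared stop flag rather than the actual Doris member names: the ScanNode sets the flag when the limit is reached, and each running reader checks it between batches so it can exit early.

```cpp
#include <atomic>

// Hypothetical scanner read loop (not the actual Doris code): check a shared
// stop flag between batches so a running reader can exit early once the
// ScanNode has reached its limit and no longer needs more data.
void scanner_read_loop(std::atomic<bool>& should_stop) {
    bool eof = false;
    while (!eof && !should_stop.load(std::memory_order_acquire)) {
        // eof = read_next_batch(...);   // produce one batch of rows
        eof = true;                       // placeholder so the sketch terminates
    }
}

// Called by the ScanNode once the limit condition is reached.
void interrupt_running_readers(std::atomic<bool>& should_stop) {
    should_stop.store(true, std::memory_order_release);
}
```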
1. Max Compute partition prune: previously we only supported filtering MaxCompute partitions by `=`, which can select just one partition. To support multiple partition filters and range operators (`>`, `<`, `>=`, ...), partition pruning needs to be supported.
2. Add a MaxCompute row count cache and a partitionValues cache.
3. Add MaxCompute regression cases.
Fix a null map issue in the parquet reader that caused incorrect results for functions such as `min()` and `max()`.
To avoid copying, the null map is shared between the parquet-converted src column and the dst column. The tricky part is that this calls the mutable accessor `doris_nullable_column->get_null_map_column_ptr()`, which sets `_need_update_has_null = true`, while some operations such as aggregation call `has_null()`, which sets `_need_update_has_null = false`.
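A simplified sketch of the flag interaction described above, using a toy nullable column rather than the real `ColumnNullable`: handing out the mutable null-map pointer marks `has_null` as stale, but a later `has_null()` call clears that mark, so modifications made afterwards through the shared pointer are no longer reflected.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Toy model of a nullable column to illustrate the _need_update_has_null
// interaction; this is not the real Doris ColumnNullable.
struct ToyNullableColumn {
    std::shared_ptr<std::vector<uint8_t>> null_map =
            std::make_shared<std::vector<uint8_t>>();
    mutable bool _has_null = false;
    mutable bool _need_update_has_null = true;

    // Mutable accessor: the caller may modify the null map through the
    // returned pointer, so mark has_null as needing recomputation.
    std::shared_ptr<std::vector<uint8_t>>& get_null_map_column_ptr() {
        _need_update_has_null = true;
        return null_map;
    }

    // Recomputes lazily, then clears the flag. If the shared null map is
    // mutated after this call, _has_null silently goes stale -- the kind of
    // inconsistency that produced wrong min()/max() results.
    bool has_null() const {
        if (_need_update_has_null) {
            _has_null = false;
            for (uint8_t v : *null_map) {
                if (v) { _has_null = true; break; }
            }
            _need_update_has_null = false;
        }
        return _has_null;
    }
};
```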