Commit Graph

356 Commits

f99db38998 [fix](ParquetReader) Fix Parquet Reader problem when reading the int96 parquet type (#32394)
`hi - JULIAN_EPOCH_OFFSET_DAYS` could be negative, so we can't do the arithmetic entirely in unsigned int.
2024-03-21 14:07:24 +08:00
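A minimal sketch of the conversion in question, assuming the standard int96 layout (8 bytes of nanoseconds-of-day followed by a 4-byte Julian day number); the epoch constant is the well-known Julian day of 1970-01-01, the rest is illustrative:

```cpp
#include <cstdint>

// Julian day number of 1970-01-01 (Unix epoch).
constexpr int64_t JULIAN_EPOCH_OFFSET_DAYS = 2440588;
constexpr int64_t MICROS_PER_DAY = 86400LL * 1000 * 1000;

struct Int96Timestamp {
    uint64_t nanos_of_day; // first 8 bytes
    uint32_t julian_day;   // last 4 bytes
};

int64_t int96_to_unix_micros(const Int96Timestamp& ts) {
    // For dates before 1970-01-01 this difference is negative, so it must
    // be computed in signed arithmetic; unsigned would wrap around.
    int64_t days = static_cast<int64_t>(ts.julian_day) - JULIAN_EPOCH_OFFSET_DAYS;
    return days * MICROS_PER_DAY + static_cast<int64_t>(ts.nanos_of_day / 1000);
}
```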
2e564036ef [fix](profile) avoid update profile in deconstructor (#32131)
Previously, the counters in `profile` could be updated when closing the file reader,
and the file reader could be closed while the object was being destructed.
But at that point the `profile` object may already have been deleted, causing a null pointer dereference that crashes BE.

This PR tries to fix this issue:

1. Remove the "profile counter update" logic from all `close()` methods.

2. Add a new interface `ProfileCollector`

	It has 2 methods:
	
	- `collect_profile_at_runtime()`

		It can be called at runtime, e.g., in every `get_next_block()` method,
		so that the counters in the profile can be updated at runtime.
		
	- `collect_profile_before_close()`

		Should be called before the object calls `close()`, and it will only be called once.
		
3. Derive from `ProfileCollector`

	All classes that may update profile counters in their `close()` method, such as `GenericReader`, should extend
	`ProfileCollector` and implement `collect_profile_before_close()`.

	`collect_profile_before_close()` will be called in `scanner->mark_to_need_to_close()`. A minimal sketch of this interface follows this entry.
2024-03-21 14:07:22 +08:00
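A minimal sketch of the interface described above; the two method names come from the commit message, while the once-only guard and everything else are assumptions:

```cpp
#include <atomic>

class ProfileCollector {
public:
    virtual ~ProfileCollector() = default;

    // May be called repeatedly at runtime, e.g. from get_next_block().
    void collect_profile_at_runtime() { _collect_profile_at_runtime(); }

    // Must run exactly once, before the object's close(); the flag guards re-entry.
    void collect_profile_before_close() {
        if (!_collected.exchange(true)) {
            _collect_profile_before_close();
        }
    }

protected:
    virtual void _collect_profile_at_runtime() {}
    virtual void _collect_profile_before_close() {}

private:
    std::atomic<bool> _collected{false};
};
```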
9c1888e7ec [RuntimeFilter](exec) support min max runtime filter and do refactor (#32210) 2024-03-15 18:06:20 +08:00
7b74b199a5 [fix](memory) Fix LRU cache deleter and memory tracking (#32080)
In order to add common code to the value deleter of the LRU cache, all LRU cache values now inherit from the LRUCacheValueBase class and track memory in the destructor.
2024-03-15 17:57:58 +08:00
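A sketch of that pattern, with `LRUCacheValueBase` taken from the commit and the tracking details invented for illustration:

```cpp
#include <cstddef>

// Stand-in for the real memory tracker release call.
inline void release_tracked_memory(size_t bytes) { (void)bytes; /* MemTracker release would go here */ }

class LRUCacheValueBase {
public:
    explicit LRUCacheValueBase(size_t bytes) : _bytes(bytes) {}
    // Virtual destructor: tracked memory is released exactly once,
    // no matter which concrete value type the cache held.
    virtual ~LRUCacheValueBase() { release_tracked_memory(_bytes); }

private:
    size_t _bytes;
};

// With a common base, a single type-erased deleter serves every entry.
void cache_entry_deleter(void* value) {
    delete static_cast<LRUCacheValueBase*>(value);
}
```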
0b5b7175d6 [fix](multi-catalog) add max compute custom odps and tunnel url (#31390)
2024-02-29 16:44:40 +08:00
4e5147c6a4 [fix](parquet) Fix possible memory leak if ParquetReader::parse_thrift_footer failed (#31375) 2024-02-25 18:08:19 +08:00
b66583551c [fix](group_commit)Fix bound checking problem when reading wal block (#31112) 2024-02-22 13:01:48 +08:00
278b232e76 [Bug](json reader) object should stop processing when encountering an error (#31159)
If a DATA_QUALITY_ERROR is encountered, we should stop processing the document. Otherwise there will be UB in simdjson.
2024-02-21 13:53:32 +08:00
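An illustrative sketch of the intended control flow (the status type and parsing helper are stand-ins, not the real Doris/simdjson API): bail out of the document at the first data-quality error rather than keep touching parser state whose further use is undefined behavior:

```cpp
#include <vector>

enum class Status { OK, DATA_QUALITY_ERROR };

Status parse_one_value(int v) {
    return v < 0 ? Status::DATA_QUALITY_ERROR : Status::OK;
}

Status process_document(const std::vector<int>& values) {
    for (int v : values) {
        Status st = parse_one_value(v);
        if (st != Status::OK) {
            return st; // stop processing this document immediately
        }
    }
    return Status::OK;
}
```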
7ca3be6d51 [fix](parquet) return error if schema changed in complex types (#31128)
Check the column types of complex types to prevent a core dump in BE. ColumnReader throws a segmentation fault in the following case, where a complex type is changed in Hive:

hive> create table struct_test(
           id int,
           sf struct<f1: int, f2: map<string, string>>) stored as parquet;

hive> insert into struct_test values
          (1, named_struct('f1', 1, 'f2', str_to_map('1:s2,2:s2'))),
          (2, named_struct('f1', 2, 'f2', str_to_map('k1:s3,k2:s4'))),
          (3, named_struct('f1', 3, 'f2', str_to_map('k1:s5,k2:s6')));

hive> alter table struct_test change sf sf struct<f1:int, f2: string>;
2024-02-20 09:12:38 +08:00
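A hypothetical sketch of the guard: compare the type kind recorded in the file footer with what the current table schema expects, and return an error instead of letting the reader walk a mismatched child node. After the `alter table` above, old files still store `f2` as a map while the table now declares a string:

```cpp
enum class TypeKind { PRIMITIVE, STRUCT, MAP, ARRAY };

// Returning false lets the caller raise a schema-change error
// instead of core dumping inside ColumnReader.
bool schema_compatible(TypeKind file_kind, TypeKind table_kind) {
    return file_kind == table_kind;
}
```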
0d4b8386a2 [bugfix][be][cppcheck] Possible NULL pointer access (#31025) (#31026) 2024-02-16 10:16:40 +08:00
f65844fae4 [Enhancement](Outfile/Export) Export data to csv file format with BOM (#30533)
UTF-8 files on Windows systems usually start with a BOM.

We add a new user property to `Outfile/Export`. Therefore, when exporting Doris data, users can choose whether to write a BOM at the beginning of the CSV file.

**Usage:**
```sql
-- outfile:
select * from demo.student
into outfile "file:///xxx/export/exp_"
format as csv
properties(
    "column_separator" = ",",
    "with_bom" = "true"
);

-- Export:
EXPORT TABLE student TO "file:///xx/tmpdata/export/exp_"
PROPERTIES(
    "format" = "csv",
    "with_bom" = "true"
);
```
2024-02-16 10:16:40 +08:00
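What `"with_bom" = "true"` amounts to at the byte level, shown as a sketch (the writer below is illustrative, not Doris code): the three bytes 0xEF 0xBB 0xBF are emitted before the first CSV row:

```cpp
#include <fstream>
#include <string>

void write_csv_with_bom(const std::string& path) {
    std::ofstream out(path, std::ios::binary);
    static const char kUtf8Bom[] = {'\xEF', '\xBB', '\xBF'};
    out.write(kUtf8Bom, sizeof(kUtf8Bom)); // BOM first...
    out << "id,name\n1,alice\n";           // ...then the data rows
}
```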
7571ecc42f [fix](group_commit)Add bounds checking when reading wal file on group commit (#30940) 2024-02-16 10:12:24 +08:00
0d32aeeaf6 [improvement](load) Enable lzo & Remove dependency on Markus F.X.J. Oberhumer's lzo library (#30573)
Issue Number: close #29406

1. Increase the supported lzop version to 0x1040.
    I set it to 0x1040 only so that lzo files compressed by higher versions of lzop can be decompressed;
    there is no change to the decompression logic.
    Strictly, 0x1040 implies the "F_H_FILTER" feature, but that is mainly for audio and image data,
    so we do not support it (see the sketch after this entry).
2. Use orc::lzoDecompress() instead of lzo1x_decompress_safe() to decompress lzo data.
3. Use crc32c::Extend() instead of lzo_crc32().
4. Use olap_adler32() instead of lzo_adler32().
5. Thus, remove the dependency on Markus F.X.J. Oberhumer's lzo library.
6. Remove DORIS_WITH_LZO, so lzo files are supported by stream load and broker load by default.
7. Add some regression tests.
2024-02-05 22:00:24 +08:00
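A sketch of the version/feature check in item 1 (the constants follow the lzop file format as I understand it; the function itself is illustrative):

```cpp
#include <cstdint>

constexpr uint16_t MAX_SUPPORTED_LZOP_VERSION = 0x1040;
constexpr uint32_t F_H_FILTER = 0x00000800; // lzop filter feature flag

bool lzop_header_supported(uint16_t version_needed_to_extract, uint32_t flags) {
    if (version_needed_to_extract > MAX_SUPPORTED_LZOP_VERSION) return false;
    // F_H_FILTER is mainly for audio/image data and is not supported.
    return (flags & F_H_FILTER) == 0;
}
```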
4b42156fc0 [chore](clang-tidy): add bugprone linters (#29521)
This PR introduces 4 bugprone linter rules to .clang-tidy; these linters found some bugs in #28965. This PR also adds some comments to mute false-positive reports.
2024-02-05 21:58:08 +08:00
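An example of how such a mute comment looks in C++ (the check name and code are illustrative, not taken from the PR):

```cpp
#include <optional>

int value_or_zero(const std::optional<int>& o) {
    if (!o.has_value()) return 0;
    // Suppress a false positive from the named bugprone check:
    return *o; // NOLINT(bugprone-unchecked-optional-access)
}
```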
92cad69fc4 [Fix](parquet-reader) Fix reading fixed length byte array decimal in parquet reader. (#30535) 2024-01-31 23:53:40 +08:00
73371d44f8 [fix][refactor] refactor schema init of external table and fix some parquet issues (#30325)
1. Skip parquet files whose length is only 4 bytes, i.e. just the magic `PAR1` (see the sketch after this entry)
2. Refactor the schema init method of iceberg/hudi/hive tables in the hms catalog
    1. Remove some redundant methods of `getIcebergTable`
    2. Fix the issue described in #23771
3. Support HoodieParquetInputFormatBase, treating it as a normal hive table format
4. When listing files, skip all hidden dirs and files
2024-01-31 23:53:40 +08:00
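A sketch of the guard in item 1. The 4-byte magic `PAR1` appears at both ends of a valid parquet file, with a 4-byte footer length before the tail magic, so anything shorter than 12 bytes, and in particular a 4-byte file containing only `PAR1`, can be skipped outright (the function is illustrative):

```cpp
#include <cstdint>

constexpr int64_t kMagicLen = 4;                   // "PAR1"
constexpr int64_t kMinFileLen = kMagicLen * 2 + 4; // head magic + footer length + tail magic

bool parquet_file_worth_parsing(int64_t file_size) {
    return file_size >= kMinFileLen;
}
```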
7d037c12bf [bugfix](paimon) fix paimon testcases (#30514)
1. set the default timezone
2. do not push down the unsupported `char` type
2024-01-31 23:53:39 +08:00
8308bc96b9 [fix](paimon)set timestamp's scale for parquet which has no logical type (#30119) 2024-01-23 13:22:14 +08:00
32c5153999 [fix](routine-load) pause job when json path is invalid #30197
If jsonpaths is set incorrectly, the routine load job reports an error but keeps running all the time. For example:

```sql
CREATE ROUTINE LOAD jobName ON tableName
PROPERTIES
(
    "format" = "json",
    "max_batch_interval" = "5",
    "max_batch_rows" = "300000",
    "max_batch_size" = "209715200",
    "jsonpaths" = "[\'t\',\'a\']"
)
FROM KAFKA
(
    "kafka_broker_list" = "$IP:PORT",
    "kafka_topic" = "XXX",
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);
```
The jsonpaths value ['t','a'] is invalid, but the job keeps running all the time instead of being paused.
2024-01-23 10:12:37 +08:00
be893d792c [fix](jni) fix jni_reader function name get_nex_block to get_next_block (#29943) 2024-01-16 18:39:00 +08:00
d494674ff4 [opt](parquet-reader) Opt parquet decimal type reading. (#29825) 2024-01-12 13:58:19 +08:00
7287c0ca15 [Opt](exec)(multi-catalog) Opt date type reading. (#29571) 2024-01-12 11:48:39 +08:00
0b731800a0 [enhancement](group_commit) refactor wal manager code (#29560) 2024-01-07 18:54:41 +08:00
2adb0fcc50 [opt](hive) support orc generated from hive 1.x for all file scan node (#28806) 2024-01-06 17:33:16 +08:00
8c40f04f2b [Opt](parquet-reader) Opt ColumnSelectVector::set_run_length_null_map() in parquet reader. (#28954) (#29527) 2024-01-05 11:13:40 +08:00
706463781c [refactor](group commit) refactor group commit wal code (#29375) 2024-01-02 15:52:03 +08:00
03901b9a7a [enhancement](group_commit): refactor wal replay code (#29183) 2023-12-30 12:59:46 +08:00
a525d5c5a3 [refactor](decimal) change type name Decimal128 to Decimal128V2, Decimal128I to Decimal128V3 to avoid confusion (#29265)
2023-12-29 10:11:44 +08:00
a90304c208 [fix](parquet) complex type in parquet is case sensitive (#29245)
Match the names of complex types in parquet case-insensitively. Otherwise, uppercase column names of complex types will return null.
2023-12-28 22:43:11 +08:00
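A sketch of case-insensitive matching (the helpers are illustrative): lowercase both names before comparing, so a table column named `SF` still finds the file's `sf` field:

```cpp
#include <algorithm>
#include <cctype>
#include <string>

std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

bool same_field(const std::string& table_name, const std::string& file_name) {
    return to_lower(table_name) == to_lower(file_name);
}
```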
2d2f14bc75 [fix](paimon) use SlotDescriptor to parse the required fields (#28990)
Before this PR, Paimon created the schema of `VectorTable` by accessing meta information. However, if the schema of `VectorTable` in Java is not the same as that of `Block` in C++, BE will crash, and there is no good way to troubleshoot the error.
2023-12-27 15:45:53 +08:00
137f785698 [fix](parquet_reader) misused bool pointer (#28986)
Signed-off-by: pengyu <pengyu@selectdb.com>
2023-12-25 22:58:08 +08:00
7081139bdc [fix](block) fix BE core when a mutable block merge causes different row sizes between columns in the origin block (#27943) 2023-12-25 20:35:22 +08:00
96d4778f2e [fix](parquet) the end offset of column chunk may be wrong in parquet metadata (#28891) 2023-12-23 22:21:04 +08:00
0070909d30 [fix](group commit) Fix the issue of duplicate addition of wal path when encountering an exception (#28691) 2023-12-21 20:27:33 +08:00
bcf2683b9d [fix](scanner) fix concurrency bugs when scanner is stopped or finished (#28650)
`ScannerContext` would schedule scanners even after being stopped, and conflated `_is_finished` with `_should_stop`.
This change only fixes the concurrency bugs, seen when the scanner is stopped or finished, reported in https://github.com/apache/doris/pull/28384
2023-12-21 10:37:58 +08:00
36857006cd [Fix](json reader) fix json reader crash due to fmt::format_to (#28737)
```
4# __gnu_cxx::__verbose_terminate_handler() [clone .cold] at ../../../../libstdc++-v3/libsupc++/vterminate.cc:75
5# __cxxabiv1::__terminate(void (*)()) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
6# 0x00005622F33D22B1 in /mnt/ssd01/pipline/OpenSourceDoris/clusterEnv/P0/Cluster0/be/lib/doris_be
7# 0x00005622F33D2404 in /mnt/ssd01/pipline/OpenSourceDoris/clusterEnv/P0/Cluster0/be/lib/doris_be
8# fmt::v7::detail::error_handler::on_error(char const*) in /mnt/ssd01/pipline/OpenSourceDoris/clusterEnv/P0/Cluster0/be/lib/doris_be
9# char const* fmt::v7::detail::parse_replacement_field<char, fmt::v7::detail::format_handler<fmt::v7::detail::buffer_appender<char>, char, fmt::v7::basic_format_context<fmt::v7::detail::buffer_appender<char>, char> >&>(char const*, char const*, fmt::v7::detail::format_handler<fmt::v7::detail::buffer_appender<char>, char, fmt::v7::basic_format_context<fmt::v7::detail::buffer_appender<char>, char> >&) in /mnt/ssd01/pipline/OpenSourceDoris/clusterEnv/P0/Cluster0/be/lib/doris_be
10# void fmt::v7::detail::vformat_to<char>(fmt::v7::detail::buffer<char>&, fmt::v7::basic_string_view<char>, fmt::v7::basic_format_args<fmt::v7::basic_format_context<fmt::v7::detail::buffer_appender<fmt::v7::type_identity<char>::type>, fmt::v7::type_identity<char>::type> >, fmt::v7::detail::locale_ref) in /mnt/ssd01/pipline/OpenSourceDoris/clusterEnv/P0/Cluster0/be/lib/doris_be
11# doris::vectorized::NewJsonReader::_append_error_msg(rapidjson::GenericValue<rapidjson::UTF8<char>, rapidjson::MemoryPoolAllocator<rapidjson::CrtAllocator> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, bool*) at /root/doris/be/src/vec/exec/format/json/new_json_reader.cpp:924
12# doris::vectorized::NewJsonReader::_set_column_value
```
2023-12-20 19:58:30 +08:00
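A plausible shape of this class of crash, shown as a sketch (the function is illustrative, not the actual `_append_error_msg`): if externally supplied text reaches fmt as the format string, a stray `{` makes fmt's parser invoke its error handler and terminate the process; passing the text as an argument avoids that:

```cpp
#include <fmt/format.h>
#include <string>

std::string build_error(const std::string& user_text) {
    // Dangerous: formatting user_text directly would parse any braces in
    // it as replacement fields and can throw fmt::format_error.
    // Safe: user_text is treated as data, not as a format string.
    return fmt::format("error: {}", user_text);
}
```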
111185407c [Improve](tvf) jni-avro supports split files (#27933) 2023-12-19 16:37:34 +08:00
66fbb22ad7 [fix](group commit) Fix some wal problems on group commit (#28554) 2023-12-19 09:51:03 +08:00
97e63516b7 [fix](streamload) catch exception when reading arrow data (#28558) 2023-12-18 22:03:57 +08:00
eb99e4270d [Fix](parquet_reader) Fix dict filtering doesn't work with plain dict encoding in parquet reader. (#28290) 2023-12-15 09:27:02 +08:00
48937fef48 [Performance](json reader) optimize filling default values (#25542)
Add a faster path for filling default values, since looking up the value map is relatively slow.
2023-12-14 10:20:29 +08:00
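A sketch of such a fast path (names illustrative): when a column is missing from the input entirely, append its default value n times in one shot instead of doing a per-row lookup in the value map:

```cpp
#include <string>
#include <vector>

void fill_default_values(std::vector<std::string>& column,
                         const std::string& default_value,
                         size_t num_rows) {
    // One bulk insert instead of num_rows map lookups.
    column.insert(column.end(), num_rows, default_value);
}
```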
ec91dd1129 [opt](vfilescanner) interrupt running parquet/orc readers when scannode is finished (#28223)
VScanNode::get_next checks whether the ScanNode has reached its limit condition and sends eos to the TaskScheduler, and the TaskScheduler then tries to close the ScanNode.
However, the ScanNode must wait for all running scanners to finish, so even if it has reached the limit condition, it can't be closed immediately.
This PR tries to interrupt the running readers and make the ScanNode end as soon as possible.
2023-12-13 19:31:08 +08:00
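A minimal sketch of the interruption idea, assuming a shared stop flag (all names invented): the scan node sets the flag when it hits its limit, and the reader polls it between batches so a long-running parquet/orc read can end early:

```cpp
#include <atomic>

std::atomic<bool> scan_node_finished{false}; // set by the scan node on limit

bool read_next_batch() { return false; /* decode one batch; stubbed here */ }

void reader_loop() {
    while (!scan_node_finished.load(std::memory_order_relaxed)) {
        if (!read_next_batch()) break; // end of file
    }
    // Returning early lets the scanner, and then the ScanNode, close.
}
```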
9861cfc4bc [Fix](Transactional-Hive) Fix transactional hive core dump when TransactionalHiveReader::init_row_filters(). (#28238)
2023-12-12 14:17:26 +08:00
d8d8f15bf3 [improvement](vectorization) Use requires instead of specialization for doris::vectorized::Decimal (#28027)
2023-12-08 09:59:52 +08:00
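Illustratively, the difference looks like this; the real `doris::vectorized::Decimal` is more involved, and the constraint below is an assumption:

```cpp
#include <concepts>
#include <cstdint>

// One constrained template instead of one specialization per type.
template <typename T>
concept DecimalNative =
        std::same_as<T, int32_t> || std::same_as<T, int64_t> ||
        std::same_as<T, __int128>;

template <DecimalNative T>
struct Decimal {
    T value{};
};

static_assert(sizeof(Decimal<int64_t>) == sizeof(int64_t));
```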
f9d4690023 [improve](stack_trace) avoid printing stack trace in csv and json reader #28129 2023-12-07 22:45:18 +08:00
cb9a6f63ab [refactor](simd_json_reader) refactor simd json parse to adapt stream parse (#27972) 2023-12-07 14:45:15 +08:00
54d062ddee [feature](stream load) (step one) Add arrow data type for stream load (#26709)
By using the Arrow data format, we can reduce the amount of data transferred in stream load and improve data import performance
2023-12-06 23:29:46 +08:00
3e8c75e246 [minor](orc) opt the log info in orc reader (#27951) 2023-12-06 20:47:36 +08:00
2b4c4bb442 [Fix][Opt](parquet-reader) Fix filter push down with decimal types in parquet reader. (#27897)
Fix filter push down with decimal types in parquet reader introduced by #22842
2023-12-04 22:25:39 +08:00
97d36b4f38 [fix](csv_reader) fix trim_double_quotes behavior change (#27882) 2023-12-03 22:57:55 +08:00