Commit Graph

879 Commits

Author SHA1 Message Date
Pxl
560731f392 [Bug](runtime-filter) fix probe expr prepared twice on minmax runtime filter (#22229)
fix probe expr prepared twice on minmax runtime filter
2023-07-26 19:44:35 +08:00
21ea0055fc [improvement](scanner) use batch size of session instead of limit to improve performance of reading (#22240) 2023-07-26 18:57:42 +08:00
Pxl
9451382428 [Improvement](aggregate) optimization for AggregationMethodKeysFixed::insert_keys_into_columns (#22216)
optimization for AggregationMethodKeysFixed::insert_keys_into_columns
2023-07-26 16:19:15 +08:00
4c4f08f805 [fix](hudi) the required fields are empty if only reading partition columns (#22187)
1. If only the partition columns are read, `JniConnector` will produce empty required fields, so `HudiJniScanner` should read at least the "_hoodie_record_key" field to know how many rows are in the current hoodie split. Even if `JniConnector` doesn't read this field, the call to `releaseTable` in `JniConnector` will reclaim the resource.

2. To prevent BE failure and exit, `JniConnector` should call release methods only after `HudiJniScanner` has been initialized (see the sketch below). Note that `VectorTable` is created lazily in `JniScanner`, so we don't need to reclaim the resource when `HudiJniScanner` fails to initialize.

## Remaining works
Other JNI readers like `paimon` and `maxcompute` may encounter the same problems; each JNI reader needs to handle this abnormal situation on its own, and currently this fix can only ensure that BE will not exit.
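
Illustrating point 2, a minimal C++ sketch of the release-after-init guard, using a hypothetical JniScannerWrapper and a simplified Status type (the real JniConnector/HudiJniScanner interaction goes through JNI and is more involved):

#include <string>
#include <utility>

// Hypothetical stand-ins; names and fields are illustrative only.
struct Status {
    bool ok;
    std::string msg;
    static Status OK() { return {true, ""}; }
    static Status Error(std::string m) { return {false, std::move(m)}; }
};

class JniScannerWrapper {
public:
    Status open() {
        // ... call into the Java-side scanner; this may fail before any
        // VectorTable has been created ...
        _opened = true;
        return Status::OK();
    }
    void close() {
        // Release Java-side resources (e.g. via releaseTable) only if the
        // scanner was actually initialized; VectorTable is created lazily,
        // so there is nothing to reclaim when open() never succeeded.
        if (_opened) {
            // releaseTable();
        }
    }
private:
    bool _opened = false;
};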
2023-07-26 10:59:45 +08:00
7b270d1ae9 [Fix](multi-catalog) Fix orc reader crash on hdfs read error by catching exception. (#22193)
The orc reader crashed when an hdfs read error occurred:

0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/zcp_repo/be/src/common/signal_handler.h:413
1# 0x00007F6F8B3C00C0 in /lib/x86_64-linux-gnu/libc.so.6
2# raise in /lib/x86_64-linux-gnu/libc.so.6
3# abort in /lib/x86_64-linux-gnu/libc.so.6
4# __gnu_cxx::__verbose_terminate_handler() [clone .cold] at ../../../../libstdc++-v3/libsupc++/vterminate.cc:75
5# __cxxabiv1::__terminate(void (*)()) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:48
6# 0x0000555CBC4718C1 in /mnt/hdd01/STRESS_ENV/be/lib/doris_be
7# 0x0000555CBC471A14 in /mnt/hdd01/STRESS_ENV/be/lib/doris_be
8# doris::vectorized::ORCFileInputStream::read(void*, unsigned long, unsigned long) at /home/zcp/repo_center/zcp_repo/be/src/vec/exec/format/orc/vorc_reader.cpp:121
9# orc::SeekableFileInputStream::Next(void const**, int*) in /mnt/hdd01/STRESS_ENV/be/lib/doris_be
10# orc::DecompressionStream::readHeader() in /mnt/hdd01/STRESS_ENV/be/lib/doris_be
11# orc::DecompressionStream::Next(void const**, int*) in /mnt/hdd01/STRESS_ENV/be/lib/doris_be
12# void orc::RleDecoderV2::next<long>(long*, unsigned long, char const*) in /mnt/hdd01/STRESS_ENV/be/lib/doris_be
13# orc::StringDictionaryColumnReader::loadDictionary() in /mnt/hdd01/STRESS_ENV/be/lib/doris_be
14# orc::StructColumnReader::loadStringDicts(std::unordered_map<unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, orc::StringDictionary*, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, orc::StringDictionary*> > >, orc::StringDictFilter const) in /mnt/hdd01/STRESS_ENV/be/lib/doris_be
15# orc::RowReaderImpl::startNextStripe(orc::ReadPhase const&) in /mnt/hdd01/STRESS_ENV/be/lib/doris_be
16# orc::RowReaderImpl::nextBatch(orc::ColumnVectorBatch&, void*) in /mnt/hdd01/STRESS_ENV/be/lib/doris_be
17# doris::vectorized::OrcReader::get_next_block(doris::vectorized::Block*, unsigned long*, bool*) at /home/zcp/repo_center/zcp_repo/be/src/vec/exec/format/orc/vorc_reader.cpp:1420
18# doris::vectorized::VFileScanner::_get_block_impl(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /home/zcp/repo_center/zcp_repo/be/src/vec/exec/scan/vfile_scanner.cpp:250
19# doris::vectorized::VScanner::get_block(doris::RuntimeState*, doris::vectorized::Block*, bool*) in /mnt/hdd01/STRESS_ENV/be/lib/doris_be
20# doris::vectorized::ScannerScheduler::_scanner_scan(doris::vectorized::ScannerScheduler*, doris::vectorized::ScannerContext*, std::shared_ptr<doris::vectorized::VScanner>) at /home/zcp/repo_center/zcp_repo/be/src/vec/exec/scan/scanner_scheduler.cpp:335
21# std::_Function_handler<void (), doris::vectorized::ScannerScheduler::_schedule_scanners(doris::vectorized::ScannerContext*)::$_1::operator()() const::
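
The fix is essentially to fence calls into the ORC reader with a try/catch, so an hdfs read error comes back as an error status instead of an uncaught exception terminating the BE. A minimal sketch, assuming a simplified Status type and a hypothetical call_orc_guarded helper (not the actual Doris code):

#include <exception>
#include <iostream>
#include <stdexcept>
#include <string>
#include <utility>

// Simplified stand-in for the real Status type.
struct Status {
    bool ok;
    std::string msg;
    static Status OK() { return {true, ""}; }
    static Status InternalError(std::string m) { return {false, std::move(m)}; }
};

// The ORC C++ library reports errors by throwing (e.g. orc::ParseError derives
// from std::runtime_error), so every call into the reader is wrapped and any
// exception is converted into a Status.
template <typename Fn>
Status call_orc_guarded(Fn&& fn) {
    try {
        fn();
        return Status::OK();
    } catch (const std::exception& e) {
        return Status::InternalError(std::string("orc read failed: ") + e.what());
    }
}

int main() {
    // Simulate an hdfs read error surfacing as an exception from the reader.
    Status s = call_orc_guarded([] { throw std::runtime_error("hdfs read error"); });
    std::cout << (s.ok ? std::string("ok") : s.msg) << std::endl;
    return 0;
}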
2023-07-26 08:57:31 +08:00
cf677b327b [fix](jdbc catalog) Fixed mappings with type errors for bool and tinyint(1) (#22089)
First of all, MySQL does not have a boolean type; its boolean type is actually tinyint(1). In the previous logic, we forced tinyint(1) to be a boolean by passing tinyInt1isBit=true, which causes an error if a tinyint(1) value is not 0 or 1. Therefore, we need to map tinyint(1) to tinyint instead of boolean, and this change does not affect the correctness of `where k = 1` or `where k = true` queries.
2023-07-25 22:45:22 +08:00
23e7423748 [pipeline](refactor) refactor pipeline task schedule logics (#22028) 2023-07-25 17:18:26 +08:00
103c473b96 [Bug](pipeline) fix pipeline shared scan + topn optimization (#21940) 2023-07-25 12:48:27 +08:00
752cec9e19 [Fix](multi-catalog) Fix not single slot filter conjuncts with dict filter issue. (#22052)
### Issue
Dictionary filtering is a mechanism that evaluates a filter condition on a single string column directly against its dictionary encoding. But a dictionary-filtered single string column may also appear in other multi-column filter conditions, which can cause problems.

For example:
`select * from multi_catalog.lineitem_string_date_orc where l_commitdate < l_receiptdate and l_receiptdate = '1995-01-01'  order by l_orderkey, l_partkey, l_suppkey, l_linenumber limit 10;`

`l_receiptdate` is a string filter column, and it is also included in the multi-column filter condition `l_commitdate < l_receiptdate`.

### Solution
Resolve it by separating out the multi-column filter conditions and executing them after the dictionary filter column has been converted back to strings, as sketched below.
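
A rough sketch of that split, with a deliberately simplified Conjunct type that only records which slots an expression reads (the real Doris expressions carry much more):

#include <set>
#include <vector>

// Hypothetical, simplified representation of a conjunct and the slots it reads.
struct Conjunct {
    std::set<int> slot_ids;
};

// Only conjuncts that touch exactly the dict-encoded string slot are safe to
// evaluate directly on dictionary codes; everything else must run later, after
// the dictionary column has been decoded back to strings.
void split_for_dict_filter(const std::vector<Conjunct>& conjuncts, int dict_slot,
                           std::vector<Conjunct>* dict_filter_conjuncts,
                           std::vector<Conjunct>* remaining_conjuncts) {
    for (const auto& c : conjuncts) {
        if (c.slot_ids.size() == 1 && *c.slot_ids.begin() == dict_slot) {
            dict_filter_conjuncts->push_back(c);  // e.g. l_receiptdate = '1995-01-01'
        } else {
            remaining_conjuncts->push_back(c);    // e.g. l_commitdate < l_receiptdate
        }
    }
}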
2023-07-24 22:31:18 +08:00
7fcf702081 [improvement](multi catalog)paimon support filesystem metastore (#21910)
1. Support filesystem metastore.

2. Support predicate and projection when generating splits.

3. Fix a partition table query error.

TODO: for now you need to manually put paimon-s3-0.4.0-incubating.jar in be/lib/java_extensions when using an S3 filesystem.

doc pr: #21966
2023-07-24 22:02:57 +08:00
Pxl
19ba6bec38 [Improvement](pipeline) support send eos on local exchange and remove some unused code (#22086)
support send eos on local exchange and remove some unused code
2023-07-24 09:25:32 +08:00
2c16fe0da9 [bugfix](runtimefilter) runtime filter is shared between multi instances with same node id, should not cache exprs (#22114)
The runtime filter is shared among multiple instances.
In the past, we cached the pushdown expr (generated by the runtime filter).
Every scan node (runtime filter consumer) will try to prepare that expr,
but the expr may be generated with a different fn_context_id.
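
A simplified sketch of the direction of the fix, with hypothetical Expr and SharedRuntimeFilter types: the shared filter keeps an unprepared template, and each consumer prepares its own clone with its own fn_context_id instead of all of them preparing one cached expr.

#include <memory>

// Hypothetical, heavily simplified model; the real VExpr/RuntimeFilter classes
// are far richer.
struct Expr {
    bool prepared = false;
    int fn_context_id = -1;
};

struct SharedRuntimeFilter {
    Expr push_down_expr_template;  // shared by all instances, never prepared directly

    // Each consumer (scan node instance) clones the template and prepares the
    // clone with its own context id, so no single expr object is prepared twice.
    std::unique_ptr<Expr> get_push_expr_for_consumer(int fn_context_id) const {
        auto expr = std::make_unique<Expr>(push_down_expr_template);
        expr->prepared = true;
        expr->fn_context_id = fn_context_id;
        return expr;
    }
};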
---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-07-23 13:04:33 +08:00
256051a965 [bug](node) fix partition sort node core dump when eos (#22108)
fix partition sort node core dump when eos
2023-07-23 12:00:53 +08:00
f8307f1a1a [bugfix](scanner) when scanner init failed during get tablet, not need call update counters (#22117)
Co-authored-by: yiguolei <yiguolei@gmail.com>
If the scanner fails during init or open, there is no need to update counters because the query has failed and the counters are useless.
Updating counters may also cause a core dump: for example, counter updates depend on the scanner's tablet, but the tablet is null when init failed.
2023-07-23 10:19:20 +08:00
afeac4419f [Bug](node) fix partition sort node forget handle some type of key in hashmap (#22037)
* [enhancement](repeat) add filter in repeat node in BE

* update
2023-07-21 23:30:40 +08:00
bed940b7fc [fix](log) column index off-by-one error in scanner logs (#19747) 2023-07-21 18:30:01 +08:00
6512893257 [refactor](vectorized) Remove useless control variables to simplify aggregation node code (#22026)
* [refactor](vectorized) Remove useless control variables to simplify aggregation node code

* fix
2023-07-21 12:45:23 +08:00
6875ef4b8b [refactor](mem_reuse) refactor mem_reuse in MutableBlock (#21564) 2023-07-20 22:53:19 +08:00
7947569993 [Bug][RegressionTest] fix the DCHECK failed in join code (#22021) 2023-07-20 18:12:20 +08:00
650d7cfc8c [enhancement](repeat) add filter in repeat node in BE (#21984)
[enhancement](repeat) add filter in repeat node in BE (#21984)
2023-07-20 17:25:13 +08:00
20242d9a0e [Improve](simdjson) put unescaped string value after parsed (#21866)
In some cases, it is necessary to unescape the original value, for example when converting a string to JSONB.
If it is not unescaped, the later JSONB parse will fail.
2023-07-20 10:33:17 +08:00
b35cfc5d5e [opt](join) Opt the performance of join probe (#21845) 2023-07-19 01:21:22 +08:00
Pxl
4171309b9b [Bug](scanner) fix core dump due to release ScannerContext too early #21946 2023-07-19 00:53:23 +08:00
c36d225a27 [feature](profile) add process hashtable time in join node (#21878)
add process hashtable time in join node
2023-07-18 18:09:42 +08:00
Pxl
3089e4b3b6 [Bug](execution) fix ScannerContext is done make query failed (#21923)
fix ScannerContext is done make query failed
2023-07-18 17:58:00 +08:00
Pxl
19492b06c1 [Bug](decimalv3) fix failed on test_dup_tab_decimalv3 due to wrong precision (#21890)
fix failed on test_dup_tab_decimalv3 due to wrong precision
2023-07-18 12:53:09 +08:00
Pxl
b3d3ffa2de [Bug](pipeline) adjust scanner scheduler.submit and _num_scheduling_ctx maintain (#21843)
adjust scanner scheduler.submit and _num_scheduling_ctx maintain
2023-07-18 11:55:21 +08:00
5fc0a84735 [improvement](catalog) reduce the size thrift params for external table query (#21771)
### 1
In the previous implementation, there was a `TFileScanRange` for each FileSplit, and each `TFileScanRange`
contained a list of `TFileRangeDesc` and a `TFileScanRangeParams`.
So if there are thousands of FileSplits, there will be thousands of `TFileScanRange`, which makes the thrift
data sent to BE too large, resulting in:

1. the rpc that sends the fragment may fail due to timeout
2. FE may OOM

For a given query request, the `TFileScanRangeParams` is the common part and is the same for all `TFileScanRange`,
so I moved it into `TExecPlanFragmentParams`.
After that, each FileSplit carries only a list of `TFileRangeDesc` (see the sketch below).

In my test, querying a hive table with 100000 partitions, the size of the thrift data was reduced from 151MB to 15MB,
and the above 2 issues are gone.
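
A simplified C++ mirror of the structures involved (the field lists are illustrative, not the actual generated Thrift definitions):

#include <cstdint>
#include <string>
#include <vector>

struct TFileRangeDesc { std::string path; int64_t start_offset; int64_t size; };
struct TFileScanRangeParams { /* format, compression, column descs, ... */ };

// Before: every FileSplit carried its own copy of the common params.
struct TFileScanRangeOld {
    std::vector<TFileRangeDesc> ranges;
    TFileScanRangeParams params;  // duplicated thousands of times
};

// After: the common params live once per fragment; each split only keeps its ranges.
struct TExecPlanFragmentParams {
    TFileScanRangeParams file_scan_params;  // shared by all splits of the query
};
struct TFileScanRangeNew {
    std::vector<TFileRangeDesc> ranges;
};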

### 2
Support disabling the file meta cache for parquet footers by setting `max_external_file_meta_cache_num` <= 0.
I found that for some wide tables the footer is too large (1MB after compaction, and much more after being
deserialized to thrift), so it consumes too much BE memory when there are many files.

This will be optimized later; here I just add support for disabling the cache.
2023-07-17 13:37:02 +08:00
a7eb186801 [Bug](CSVReader) fix null pointer coredump in CSVReader in p2 (#20811) 2023-07-15 22:50:10 +08:00
7f50c07219 [Opt](exec) opt the outer join performance in TPCDS Q95 (#21806) 2023-07-14 18:42:08 +08:00
b013f8006d [enhancement](multi-table) enable multi table routine load on pipeline engine (#21729) 2023-07-14 12:16:32 +08:00
ca6e33ec0c [feature](table-value-functions)add catalogs table-value-function (#21790)
mysql> select * from catalogs() order by CatalogId;
2023-07-14 10:25:16 +08:00
254f76f61d [Agg](exec) support aggregation_node limit short circuit (#21767) 2023-07-14 00:29:19 +08:00
6fd8f5cd2f [Fix](parquet-reader) Fix parquet string column min max statistics issue which caused query result incorrectly. (#21675)
In parquet, the min and max statistics may not handle UTF-8 correctly.
The current processing method uses the min_value and max_value statistics introduced by PARQUET-1025 if they are present.
If not, the statistics are temporarily ignored. A better way would be to use the min and max statistics only if they contain
nothing but ASCII characters. I will improve this in a future PR.
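
A sketch of that "ASCII only" check (hypothetical helper names; not what the current code does, which simply ignores the legacy statistics):

#include <string>

// Byte-wise parquet min/max statistics from older writers are only trustworthy
// for ordering when every byte is plain ASCII; otherwise UTF-8 values may sort
// differently than the writer assumed.
bool is_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c > 0x7F) return false;
    }
    return true;
}

bool can_use_legacy_min_max(const std::string& min_v, const std::string& max_v) {
    return is_ascii(min_v) && is_ascii(max_v);
}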
2023-07-14 00:09:41 +08:00
9cad929e96 [Fix](rowset) When a rowset is cooled down, it is directly deleted. This can result in data query misses in the second phase of a two-phase query. (#21741)
* [Fix](rowset) When a rowset is cooled down, it is directly deleted. This can result in data query misses in the second phase of a two-phase query.

related pr #20732

There are two reasons for moving the logic of delayed deletion from the Tablet to the StorageEngine. The first reason is to consolidate the logic and unify the delayed operations. The second reason is that delayed garbage collection during queries can cause rowsets to remain in the "stale rowsets" state, preventing the timely deletion of rowset metadata, which may cause the rowset metadata to grow too large.

* not use unused rowsets
2023-07-13 11:46:12 +08:00
d86c67863d Remove unused code (#21735) 2023-07-12 14:48:13 +08:00
ff42cd9b49 [feature](hive)add read of the hive table textfile format array type (#21514) 2023-07-11 22:37:48 +08:00
d3317aa33b [Fix](executor) Fix scan entity core #21696
After the last call to scan_task.scan_func(), the query may already be finished, which means the PipelineFragmentContext could be released.
Once the PipelineFragmentContext is released, visiting its fields such as query_ctx or _state may cause a core dump.
But this can only explain core dump 2.

void ScannerScheduler::_task_group_scanner_scan(ScannerScheduler* scheduler,
                                                taskgroup::ScanTaskTaskGroupQueue* scan_queue) {
    while (!_is_closed) {
        taskgroup::ScanTask scan_task;
        auto success = scan_queue->take(&scan_task);
        if (success) {
            int64_t time_spent = 0;
            {
                SCOPED_RAW_TIMER(&time_spent);
                scan_task.scan_func();
            }
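            // As described above: once the last scan_task.scan_func() call has run,
            // the PipelineFragmentContext may already be released, so anything it
            // owns that is reached below (e.g. query_ctx or _state) may be dangling.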
            scan_queue->update_statistics(scan_task, time_spent);
        }
    }
}
2023-07-11 15:56:13 +08:00
Pxl
ca71048f7f [Chore](status) avoid empty error msg on status (#21454)
avoid empty error msg on status
2023-07-11 13:48:16 +08:00
f87a3ccba2 [fix](runtime_filter) runtime_profile was not initialized in multi_cast_data_stream_source (#21690) 2023-07-11 00:16:29 +08:00
8973610543 [feature](datetime) "timediff" supports calculating microseconds (#21371) 2023-07-10 19:21:32 +08:00
0be349e250 [feature](jdbc) Support jdbc catalog to read json types (#21341) 2023-07-10 16:21:00 +08:00
469c8b7ece [Fix](JSON LOAD) fix json load issue when string conforms with RFC 4627 #21390
You should set enable_simdjson_reader=false on master, as master has enable_simdjson_reader=true by default.

Issue Number: close #21389

from rapidjson:

Query String
In addition to GetString(), the Value class also contains GetStringLength(). Here is why:

According to RFC 4627, JSON strings can contain Unicode character U+0000, which must be escaped as "\u0000". The problem is that, C/C++ often uses null-terminated string, which treats \0 as the terminator symbol.

To conform with RFC 4627, RapidJSON supports string containing U+0000 character. If you need to handle this, you can use GetStringLength() to obtain the correct string length.

For example, after parsing the following JSON to Document d:

{ "s" : "a\u0000b" }
The correct length of the string "a\u0000b" is 3, as returned by GetStringLength(). But strlen() returns 1.

GetStringLength() can also improve performance, as users often need to call strlen() to allocate a buffer.

Besides, std::string also supports a constructor:

string(const char* s, size_t count);
which accepts the length of the string as a parameter. This constructor supports storing null characters within the string, and should also provide better performance.
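
A small self-contained example of the behavior quoted above (assuming the rapidjson headers are available):

#include <cstring>
#include <iostream>
#include <string>
#include "rapidjson/document.h"

int main() {
    rapidjson::Document d;
    d.Parse("{ \"s\" : \"a\\u0000b\" }");
    const rapidjson::Value& s = d["s"];
    std::cout << s.GetStringLength() << std::endl;         // 3
    std::cout << std::strlen(s.GetString()) << std::endl;  // 1
    // std::string keeps the embedded null when the length is passed explicitly.
    std::string full(s.GetString(), s.GetStringLength());  // full.size() == 3
    std::cout << full.size() << std::endl;
    return 0;
}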
2023-07-09 17:16:03 +08:00
2678afd2db [fix][improvement](fs) add HdfsIO profile and modification time (#21638)
Refactor the interface of create_file_reader

The file_size and mtime are merged into FileDescription and are no longer in FileReaderOptions.
Now the file handle cache can get the file's correct modification time from FileDescription.
Add an HdfsIO profile for the hdfs file reader,
picked from [Enhancement](multi-catalog) Add hdfs read statistics profile. #21442
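
A rough sketch of the refactored input, with field names taken from the description above (the real FileDescription and FileReaderOptions definitions contain more):

#include <cstdint>
#include <string>

struct FileDescription {
    std::string path;
    int64_t file_size = -1;  // moved here from FileReaderOptions
    int64_t mtime = 0;       // modification time, now visible to the file handle cache
};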
2023-07-08 14:49:44 +08:00
9ee7fa45d1 [Refactor](multi-catalog) Refactor to process splitted conjuncts for dict filter. (#21459)
Conjuncts are currently split, so refactor source code to handle split conjuncts for dict filters.
2023-07-07 09:19:08 +08:00
4d17400244 [profile](join) add collisions into profile (#21510) 2023-07-06 14:30:10 +08:00
009b300abd [Fix](ScannerScheduler) fix dead lock when shutdown group_local_scan_thread_pool (#21553) 2023-07-06 13:09:37 +08:00
242a35fa80 [fix](s3) fix s3 fs benchmark tool (#21401)
1. fix a concurrency bug in the s3 fs benchmark tool to avoid crashes with multiple threads.
2. Add a `prefetch_read` operation to test the prefetch reader.
3. add the `AWS_EC2_METADATA_DISABLED` env in `start_be.sh` to avoid calling the EC2 metadata service when creating the s3 client.
4. add the `AWS_MAX_ATTEMPTS` env in `start_be.sh` to avoid warning logs from the s3 sdk.
2023-07-05 16:20:58 +08:00
9adbca685a [opt](hudi) use spark bundle to read hudi data (#21260)
Use spark-bundle instead of hive-bundle to read hudi data.

**Advantage** for using spark-bundle to read hudi data:
1. The performance of spark-bundle is more than twice that of hive-bundle
2. spark-bundle uses `UnsafeRow`, which can reduce data copying and JVM GC time
3. spark-bundle supports `Time Travel`, `Incremental Read`, and `Schema Change`; these features can be quickly ported to Doris

**Disadvantage** for using spark-bundle to read hudi data:
1. More dependencies make hudi-dependency.jar very cumbersome (growing from 138M to 300M)
2. spark-bundle only provides the `RDD` interface and cannot be used directly
2023-07-04 17:04:49 +08:00
90dd8716ed [refactor](multicast) change the way multicast do filter, project and shuffle (#21412)
Co-authored-by: Jerry Hu <mrhhsg@gmail.com>

1. Filtering is done at the sending end rather than the receiving end
2. Projection is done at the sending end rather than the receiving end
3. Each sender can use different shuffle policies to send data
2023-07-04 16:51:07 +08:00