Commit Graph

1612 Commits

Author SHA1 Message Date
aeb3450151 [feature](graph) Support querying data from the Nebula graph database (#19209)
Support querying data from the Nebula graph database.
This feature comes from the needs of commercial customers who use both Doris and Nebula and want to connect the two databases.

The changes mainly include:

* add a new graph database JDBC type
* adapt the types and map graph types to Doris types
2023-05-09 15:30:11 +08:00
e08de52ee7 [chore](compile) using PCH for compilation acceleration under clang (#19303) 2023-05-08 19:51:06 +08:00
e78149cb65 [Enhancement](Export) add property for outfile/export and add test (#18997)
This PR does three things:
1. add a `delete_existing_files` property for outfile/export. If `delete_existing_files = true`, export/outfile will first delete all files under `file_path` (see the sketch after this list).
2. add a p2 test for export
3. update the docs
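A minimal sketch of what `delete_existing_files = true` implies on the writer side. `prepare_export_dir` is a hypothetical name, and the real implementation goes through Doris' file system layer rather than `std::filesystem`:
```
#include <filesystem>
#include <string>

// Hypothetical sketch: when delete_existing_files = true, clear everything
// under file_path before the export/outfile job writes new files.
void prepare_export_dir(const std::string& file_path, bool delete_existing_files) {
    namespace fs = std::filesystem;
    if (delete_existing_files && fs::exists(file_path)) {
        for (const auto& entry : fs::directory_iterator(file_path)) {
            fs::remove_all(entry.path()); // remove files and subdirectories
        }
    }
}
```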
2023-05-08 14:02:20 +08:00
673cbe3317 [chore](build) Porting to GCC-13 (#19293)
Support using GCC-13 to build the codebase.
2023-05-08 10:42:06 +08:00
b50e2a8c08 [Fix](parquet-reader) Fix dict cols not being converted back to string type in some cases. (#19348)
Fix dict columns not being converted back to string type in some cases, including one introduced by #19039.
For dict columns, we first convert them to int32 type, then convert them back to string type after the block is read.
The block will be reused, so it is necessary to convert it back.
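A minimal sketch of the round trip described above, with hypothetical names (`decode_dict_column`); the real reader operates on Doris columns rather than `std::vector`:
```
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: dict columns are materialized as int32 codes first,
// then converted back to strings after the block has been read.
std::vector<std::string> decode_dict_column(const std::vector<std::string>& dict,
                                            const std::vector<int32_t>& codes) {
    std::vector<std::string> out;
    out.reserve(codes.size());
    for (int32_t code : codes) {
        out.push_back(dict[code]); // map each code back to its string value
    }
    return out;
}
```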
2023-05-07 10:05:23 +08:00
9edbfa37cd [Enhancement](Broker Load) New progress manager for showing loading progress status (#19170)
This work is at an early stage: the current progress is not accurate because the scan range used for gathering information is too coarse, and only the file scan node and import jobs support the new progress manager.

## How it works

For example, when we use the following load query:
```
LOAD LABEL test_broker_load
(
	DATA INFILE("XXX")
	INTO TABLE `XXX`
        ......
)
```

Initial progress: the query calls `BrokerLoadJob` to create the job, then the `coordinator` is called to calculate the scan ranges and their locations.
Update progress: BE reports `runtime_state` to FE, and FE updates the progress status according to jobID and fragmentID.
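A minimal sketch of the ratio behind the percentages shown below (`load_progress` is a hypothetical name; the real manager tracks progress per jobID and fragmentID):
```
#include <cstdio>

// Hypothetical sketch: progress is the fraction of finished scan ranges.
double load_progress(int finished_ranges, int total_ranges) {
    if (total_ranges <= 0) return 0.0;
    return 100.0 * finished_ranges / total_ranges;
}

int main() {
    std::printf("Progress: %.2f%% (%d/%d)\n", load_progress(1, 7), 1, 7); // 14.29% (1/7)
}
```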

We can use `show load` to see the progress.

PENDING:
```
         State: PENDING
      Progress: 0.00%
```

LOADING:
```
         State: LOADING
      Progress: 14.29% (1/7)
```

FINISH:
```
         State: FINISHED
      Progress: 100.00% (7/7)
```

Currently, the full output of `show load\G` looks like this:

```
*************************** 1. row ***************************
         JobId: 25052
         Label: test_broker
         State: LOADING
      Progress: 0.00% (0/7)
          Type: BROKER
       EtlInfo: NULL
      TaskInfo: cluster:N/A; timeout(s):250000; max_filter_ratio:0.0
      ErrorMsg: NULL
    CreateTime: 2023-05-03 20:53:13
  EtlStartTime: 2023-05-03 20:53:15
 EtlFinishTime: 2023-05-03 20:53:15
 LoadStartTime: 2023-05-03 20:53:15
LoadFinishTime: NULL
           URL: NULL
    JobDetails: {"Unfinished backends":{"5a9a3ecd203049bc-85e39a765c043228":[10080]},"ScannedRows":39611808,"TaskNumber":1,"LoadBytes":7398908902,"All backends":{"5a9a3ecd203049bc-85e39a765c043228":[10080]},"FileNumber":1,"FileSize":7895697364}
 TransactionId: 14015
  ErrorTablets: {}
          User: root
       Comment: 
```

## TODO:

1. The current partition granularity of scan ranges is too coarse, resulting in uneven progress during loading.
2. Only broker load supports the new progress manager; progress should be supported for other query types as well.
2023-05-06 22:44:40 +08:00
4c6ca88088 Revert "[refactor](function) ignore DST for function from_unixtime (#19151)" (#19333)
This reverts commit 9dd6c8f87b73db238bfd38fb1d76f3796910f398.
2023-05-06 16:33:58 +08:00
Pxl
dff669899a [Feature](generic-aggregation) add some type define for generic aggregate functions support (#19252)
add some type define for generic aggregate functions support
2023-05-06 11:30:13 +08:00
153f42a873 [enhancement](exprcontext) make the get_output_block_after_execute_exprs method clearer to avoid misuse (#19310)
The original method signature is:
Block VExprContext::get_output_block_after_execute_exprs(
        const std::vector<vectorized::VExprContext*>& output_vexpr_ctxs, const Block& input_block,
        Status& status)
It returns the error status as an out parameter and the block as the return value, so callers have to check block.rows() == 0 and then check the error status.
This does not conform to the convention.
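A self-contained sketch of the more conventional shape (assumed, with stand-in `Status` and `Block` types, not the PR's exact code): return the `Status` directly and pass the output block as an out parameter, so callers check one thing.
```
#include <cstddef>
#include <string>

// Stand-in types for illustration only; the real ones are Doris classes.
struct Status {
    bool ok = true;
    std::string msg;
    static Status OK() { return {}; }
};
struct Block {
    size_t rows = 0;
};

// Assumed new shape: Status is the return value, the block is an out param,
// so callers no longer check rows() == 0 *and* a smuggled-out status.
Status get_output_block_after_execute_exprs(const Block& input_block, Block* output_block) {
    output_block->rows = input_block.rows; // placeholder for real expression evaluation
    return Status::OK();
}
```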


---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-05-06 09:03:22 +08:00
58cb404661 [fix](memory) Allocator throws Exception instead of std::bad_alloc (#19285)
W0505 01:31:25.840227 1727715 scanner_scheduler.cpp:340] Scan thread read VScanner failed: [MEM_LIMIT_EXCEEDED]PreCatch error code:11, [E11] Allocator sys memory check failed: Cannot alloc:16384, consuming tracker:<Orphan>, exec node:<>, process memory used 5.87 GB exceed limit 5.64 GB or sys mem available 252.17 GB less than low water mark 1.60 GB, failed alloc size 16.00 KB.
    @     0x555c19e0cca8  doris::Exception::Exception()
    @     0x555c1c3e0c3f  Allocator<>::sys_memory_check()
    @     0x555c1c3e1052  Allocator<>::memory_check()
    @     0x555c19e0a645  Allocator<>::alloc()
    @     0x555c1c34508b  COWHelper<>::create<>()
    @     0x555c1e23f574  doris::vectorized::ConvertThroughParsing<>::execute<>()
    @     0x555c1e23f209  doris::vectorized::FunctionConvertFromString<>::execute_impl()
    @     0x555c1e23f4aa  doris::vectorized::FunctionConvertFromString<>::execute_impl()
    @     0x555c1e15ac29  doris::vectorized::PreparedFunctionImpl::execute_without_low_cardinality_columns()
    @     0x555c1e15ac56  doris::vectorized::PreparedFunctionImpl::execute()
    @     0x555c1e245276  _ZNSt17_Function_handlerIFN5doris6StatusEPNS0_15FunctionContextERNS0_10vectorized5BlockERKSt6vectorImSaImEEmmEZNKS4_12FunctionCast14create_wrapperINS4_14DataTypeNumberIiEEEESt8functionISC_ERKSt10shared_ptrIKNS4_9IDataTypeEEPKT_bEUlS3_S6_SB_mmE_E9_M_invokeERKSt9_Any_dataOS3_S6_SB_OmSY_
    @     0x555c1e2a9341  _ZZNK5doris10vectorized12FunctionCast23prepare_remove_nullableEPNS_15FunctionContextERKSt10shared_ptrIKNS0_9IDataTypeEES9_bENKUlS3_RNS0_5BlockERKSt6vectorImSaImEEmmE_clES3_SB_SG_mm
    @     0x555c1e2a8d42  _ZNSt17_Function_handlerIFN5doris6StatusEPNS0_15FunctionContextERNS0_10vectorized5BlockERKSt6vectorImSaImEEmmEZNKS4_12FunctionCast23prepare_remove_nullableES3_RKSt10shared_ptrIKNS4_9IDataTypeEESJ_bEUlS3_S6_SB_mmE_E9_M_invokeERKSt9_Any_dataOS3_S6_SB_OmSQ_
    @     0x555c1e20e42b  doris::vectorized::PreparedFunctionCast::execute_impl()
    @     0x555c1e15ac29  doris::vectorized::PreparedFunctionImpl::execute_without_low_cardinality_columns()
    @     0x555c1e15ac56  doris::vectorized::PreparedFunctionImpl::execute()
    @     0x555c1d63e960  doris::vectorized::IFunctionBase::execute()
    @     0x555c1d628700  doris::vectorized::VCastExpr::execute()
    @     0x555c1d6163e5  doris::vectorized::VExprContext::execute()
    @     0x555c20a83fe1  doris::vectorized::VFileScanner::_convert_to_output_block()
    @     0x555c20a809af  doris::vectorized::VFileScanner::_get_block_impl()
    @     0x555c209b9bc4  doris::vectorized::VScanner::get_block()
    @     0x555c209b1a50  doris::vectorized::ScannerScheduler::_scanner_scan()
    @     0x555c209b2ac1  _ZNSt17_Function_handlerIFvvEZZN5doris10vectorized16ScannerScheduler18_schedule_scannersEPNS2_14ScannerContextEENK3$_0clEvEUlvE1_E9_M_invokeERKSt9_Any_data
    @     0x555c1a8378cf  doris::ThreadPool::dispatch_thread()
    @     0x555c1a830fac  doris::Thread::supervise_thread()
    @     0x7f461faa117a  start_thread
    @     0x7f462033bdf3  __GI___clone
    @              (nil)  (unknown)
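A minimal sketch of the pattern behind this change, assuming a simplified limit check (`MemLimitException` is a hypothetical stand-in for `doris::Exception`): instead of letting the allocator surface an opaque `std::bad_alloc`, the memory check throws an exception carrying an error code and a readable message like the log above.
```
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for doris::Exception with an error code.
struct MemLimitException : std::runtime_error {
    int code;
    MemLimitException(int c, const std::string& m) : std::runtime_error(m), code(c) {}
};

void* checked_alloc(size_t size, size_t used, size_t limit) {
    // Check the limit first; throw a rich exception instead of std::bad_alloc.
    if (used + size > limit) {
        throw MemLimitException(11, "Allocator sys memory check failed: cannot alloc " +
                                        std::to_string(size) + " bytes");
    }
    return std::malloc(size);
}
```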
2023-05-05 18:01:48 +08:00
f2a34dde52 [fix](memory) Fix memory leak due to incorrect block reuse of AggregateFunctionSortData #19214 2023-05-05 14:29:34 +08:00
b6c7f3aeb8 [opt](FileCache) Add file cache metrics and management (#19177)
Add file cache metrics and management.
1. Get file cache metrics
> If the performance of file cache is not efficient, there are currently no metrics to investigate the cause. In practice, hit ratio, disk usage, and segments removed status are very important information. 

API: `http://be_host:be_webserver_port/metrics`
File cache metrics for each base path start with the `doris_be_file_cache_` prefix. `hits_ratio` is the hit ratio of the cache since BE startup; `removed_elements` is the number of removed segment files since BE startup. Every cache path has three queues: index, normal, and disposable; their capacity ratio is 1:17:2.
```
doris_be_file_cache_hits_ratio{path="/mnt/datadisk1/gaoxin/file_cache"} 0.500000
doris_be_file_cache_hits_ratio{path="/mnt/datadisk1/gaoxin/small_file_cache"} 0.500000
doris_be_file_cache_removed_elements{path="/mnt/datadisk1/gaoxin/file_cache"} 0
doris_be_file_cache_removed_elements{path="/mnt/datadisk1/gaoxin/small_file_cache"} 0

doris_be_file_cache_normal_queue_max_size{path="/mnt/datadisk1/gaoxin/file_cache"} 912680550400
doris_be_file_cache_normal_queue_max_size{path="/mnt/datadisk1/gaoxin/small_file_cache"} 8500000000
doris_be_file_cache_normal_queue_max_elements{path="/mnt/datadisk1/gaoxin/file_cache"} 217600
doris_be_file_cache_normal_queue_max_elements{path="/mnt/datadisk1/gaoxin/small_file_cache"} 102400

doris_be_file_cache_normal_queue_curr_size{path="/mnt/datadisk1/gaoxin/file_cache"} 14129846
doris_be_file_cache_normal_queue_curr_size{path="/mnt/datadisk1/gaoxin/small_file_cache"} 14874904
doris_be_file_cache_normal_queue_curr_elements{path="/mnt/datadisk1/gaoxin/file_cache"} 18
doris_be_file_cache_normal_queue_curr_elements{path="/mnt/datadisk1/gaoxin/small_file_cache"} 22

...
```
2. Release file cache
> Frequent segment files swapping can seriously affect the performance of file cache. Adding a deletion interface helps users clean up the file cache.

API: `http://be_host:be_webserver_port/api/file_cache?op=release&base_path=${file_cache_base_path}`
Returns the number of released segment files. If `base_path` is not provided in the URL, all cache paths are released.
This API is thread-safe to call, so only segment files that are not currently being read are released.
```
{"released_elements":22}
```
3. Specify the base path to store cache data
> Currently, regression testing lacks test cases of file cache, which cannot guarantee the stability of file cache. This interface is generally used in regression testing scenarios. Different queries use different paths to verify different usage cases and performance.

Users can set the session variable `file_cache_base_path` to specify the base path for storing cache data. The default is `file_cache_base_path="random"`, which means choosing a random path from the cache paths to store cache data. If `file_cache_base_path` is not one of the base paths in the BE configuration, a random path is used.
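A minimal sketch of that path choice (`choose_cache_path` is a hypothetical name; the real BE takes its cache paths from the BE configuration and assumes at least one is configured):
```
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical sketch: use the requested base path only if it is one of the
// configured cache paths, otherwise fall back to a random configured path.
std::string choose_cache_path(const std::vector<std::string>& configured,
                              const std::string& requested) {
    for (const auto& p : configured) {
        if (p == requested) return p; // explicit, valid base path
    }
    return configured[std::rand() % configured.size()]; // "random" fallback
}
```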
2023-05-05 14:28:01 +08:00
9dd6c8f87b [refactor](function) ignore DST for function from_unixtime (#19151) 2023-05-05 11:51:49 +08:00
4e4fb33995 [refactor](conjuncts) simplify conjuncts in exec node (#19254)
Co-authored-by: yiguolei <yiguolei@gmail.com>
Currently, exec nodes store `ExprContext**`, but the objects live in an object pool, which makes the code very unclear. We can simply use `ExprContext*`.
2023-05-04 18:04:32 +08:00
e9a4cbcdf9 [Refactor](type system) refactor column with arrow serde (#19091)
* refactor arrow serde

* add date serde

* update arrow and fix nullable and date types
2023-05-04 15:28:46 +08:00
e17a171a3c [fix](vertical_compaction) Fix continuous_agg_count PODArray wrong boundary judgment #19187 2023-05-04 14:50:30 +08:00
eac61dc410 [vectorized](function) add some check about result type in array map (#19228) 2023-05-01 16:28:11 +08:00
8eab20d3df [bugfix](low cardinality) wrong cached dict code causes wrong query results when there are many null pages (#19221)
Sometimes the dictionary is not initialized when a comparison predicate runs here; for example, when a page is entirely null, the reader skips reading it, so the dictionary is never initialized. Caching the code in this case is wrong, because a following page may not be null and the dictionary may gain items later.
This causes queries on dict string columns to return wrong results when the column contains many null values.
I also added regression tests for the dict column's equal, greater-than, and less-than queries.
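A minimal sketch of the hazard with hypothetical names (`DictPredicate`, `code_for`); the point is that a "not found" code must not be cached while the dictionary is still empty:
```
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical sketch of the fix: only cache the dictionary code once the
// dictionary is actually initialized. An all-null page leaves the dictionary
// empty, and a code cached at that point would be stale once a later,
// non-null page fills the dictionary in.
struct DictPredicate {
    std::string target;
    std::optional<int32_t> cached_code;

    int32_t code_for(const std::unordered_map<std::string, int32_t>& dict) {
        if (cached_code) return *cached_code;
        auto it = dict.find(target);
        int32_t code = (it == dict.end()) ? -1 : it->second;
        if (!dict.empty()) cached_code = code; // only cache against a loaded dict
        return code;
    }
};
```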

---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-04-29 21:28:41 +08:00
c74c2a4f8e [fix](Metadata tvf) Metadata TVF supports reading the specified columns from FE (#19110) 2023-04-29 00:06:08 +08:00
a324ee794c [fix](memory) Fix Aggregation null key memory leak due to incorrect aggfunc destroy #19201 2023-04-28 18:41:41 +08:00
1379d7f3e0 [fix](memory) mmap threshold can be modified in conf; increase it to 128M 2023-04-28 18:17:22 +08:00
6626f26506 [optimize](string) optimize char_length function by SIMD (#18925)
Optimize the char_length function with SIMD:
(1) optimize the utf8_len computation
(2) performance up 840%
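The heart of the utf8_len optimization is counting UTF-8 lead bytes. A portable sketch of the idea (the PR uses explicit SIMD; this branch-free loop is the form compilers auto-vectorize):
```
#include <cstddef>
#include <cstdint>

// Count UTF-8 code points: every byte that is NOT a continuation byte
// (10xxxxxx) starts a new character, so a branch-free count suffices.
size_t utf8_len(const uint8_t* data, size_t size) {
    size_t len = 0;
    for (size_t i = 0; i < size; ++i) {
        len += (data[i] & 0xC0) != 0x80;
    }
    return len;
}
```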
2023-04-28 17:22:35 +08:00
aef9355cd3 [feature-wip](partial update) PART1: support basic partial write (#17542) 2023-04-28 17:17:57 +08:00
Pxl
ec517a53a8 [Chore](build) upgrade clang-format version to 16 && move thrift to fe-common (#19155)
upgrade clang-format version to 16
move thrift to fe-common
fix core dump on pipeline engine when operator canceled and not prepared
2023-04-28 14:14:51 +08:00
65a82a0b57 [opt](FileReader) turn off prefetch data in parquet page reader when using MergeRangeFileReader (#19102)
Using both `MergeRangeFileReader` and `BufferedStreamReader` simultaneously would waste a lot of memory,
so prefetching is turned off in `BufferedStreamReader` when `MergeRangeFileReader` is in use.
2023-04-28 09:27:56 +08:00
28016c53f0 [profile](rf) refactor profile of runtime filters (#19134)
* [profile](rf) refactor profile of runtime filters


---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2023-04-28 08:46:42 +08:00
3ed5cf8350 [Optimize] add a has_filter template param to get_next_run() to reduce the number of _has_filter condition checks in the loop. (#19043) 2023-04-27 21:23:36 +08:00
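A minimal sketch of that technique, with hypothetical names: lifting the filter flag into a template parameter removes the per-iteration branch from the no-filter instantiation.
```
#include <cstdint>
#include <vector>

// Hypothetical sketch: with has_filter as a compile-time parameter, the
// compiler drops the filter branch entirely from the no-filter instantiation.
template <bool has_filter>
size_t count_matches(const std::vector<int64_t>& vals, const std::vector<uint8_t>& filter,
                     int64_t target) {
    size_t n = 0;
    for (size_t i = 0; i < vals.size(); ++i) {
        if constexpr (has_filter) {
            if (!filter[i]) continue; // checked only in the filtered instantiation
        }
        n += (vals[i] == target);
    }
    return n;
}
```
Callers pick `count_matches<true>` or `count_matches<false>` once, based on `_has_filter`, outside the hot loop.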
e4f7d77c5c [Optimize](parquet-reader) Opt by filtering null count statistics in rowgroup and page level. (#19106)
Issue Number: About #19038. We found that in this case l_orderkey has many nulls,
so we can filter it using null count statistics at the row group and page level,
which improves performance a lot in this case.
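A minimal sketch of the statistics check (`PageStats` is a hypothetical stand-in; real parquet metadata carries a null count per row group and, with column indexes, per page):
```
#include <cstdint>

// Hypothetical sketch: statistics carry a null_count per row group/page.
struct PageStats {
    int64_t num_values;
    int64_t null_count;
};

// A predicate like `l_orderkey = 42` rejects nulls, so an all-null page or
// row group can be skipped purely from statistics, with no decode at all.
bool can_skip_all_null(const PageStats& s, bool predicate_rejects_null) {
    return predicate_rejects_null && s.null_count == s.num_values;
}
```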
2023-04-27 21:21:30 +08:00
9e2b118288 [RegressTest](Exec) Add DCHECK null_aware_left_anti_join in mark join (#19149) 2023-04-27 17:52:03 +08:00
f23c93b3c6 [fix](memory) Fix AggFunc memory leak due to incorrect destroy (#19126) 2023-04-27 14:58:32 +08:00
98a975b013 [fix](memory) Fix SchemaChange memory leak due to incorrect aggfunc destroy (#19130) 2023-04-27 14:44:00 +08:00
8412571030 [fix](memleak) avoid memleak due to race condition (#19071) 2023-04-27 14:22:09 +08:00
20395ce501 [feature](array_function): add support for array_cum_sum function (#18231) 2023-04-27 09:57:13 +08:00
a262f42a28 [refactor](exceptionsafe) make scanner and scancontext exception safe (#19057) 2023-04-27 09:23:01 +08:00
925efc1902 [bug](map-type)fix some bugs in map and map element function (#18935)
fix some bugs in map and map element function.
2023-04-26 22:10:15 +08:00
aabcab9dbe [Improvement](runtime filter) Improve merge phase (#18828) 2023-04-26 21:01:20 +08:00
e1651bfea5 [bugfix](aggregate_function) Fix wrong registration for percentile_approx #19070 2023-04-26 16:17:46 +08:00
1dfc5ea34c [bugfix](jsonb) fix jsonb parser crash on noavx2 host (#18977)
Support both AVX2 and non-AVX2 builds of the jsonb parser using the `__AVX2__` macro.
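A minimal sketch of the `__AVX2__` split (an illustrative byte-count kernel, not the jsonb parser's actual code):
```
#include <cstddef>
#include <cstdint>
#ifdef __AVX2__
#include <immintrin.h>
#endif

// Count bytes equal to `c`, with an AVX2 path compiled in only when the
// target supports it; the scalar fallback keeps noavx2 hosts from crashing.
size_t count_byte(const uint8_t* data, size_t size, uint8_t c) {
    size_t n = 0, i = 0;
#ifdef __AVX2__
    const __m256i needle = _mm256_set1_epi8(static_cast<char>(c));
    for (; i + 32 <= size; i += 32) {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        uint32_t mask = static_cast<uint32_t>(
                _mm256_movemask_epi8(_mm256_cmpeq_epi8(v, needle)));
        n += __builtin_popcount(mask);
    }
#endif
    for (; i < size; ++i) n += (data[i] == c); // scalar tail / noavx2 path
    return n;
}
```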
2023-04-26 15:10:12 +08:00
94b11af17c [fixbug](json-reader) fix memory leak of new_json_reader #19067 2023-04-26 12:54:47 +08:00
5bd4a3897e [optimize](multi-catalog) Skip whole row group in lazy_read if data has been filtered. (#19039)
We found that qt_q11 in the regression test test_external_catalog_hive is very slow.
The result is only one record, so the other data should be filtered out in the parquet lazy-read situation.
We then found that the parquet reader reads many records because we can only skip a parquet page, and in order to skip a page we currently need to read the page header, which triggers data prefetching. Prefetching data in this case may therefore not be good.

So there are two issues:

1. Skip the whole row group in this case.
2. Prefetching data in this case may not be good; it needs improvement.

This PR resolves issue 1.
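A minimal sketch of the shape of issue 1's fix, under assumed names (`should_skip_rest_of_row_group`), not the PR's exact logic:
```
#include <cstdint>
#include <vector>

// Hypothetical sketch: evaluate the predicate columns first; if the filter
// rejects every row of the row group, skip the whole group instead of
// reading page headers (which triggers prefetch) to skip page by page.
bool should_skip_rest_of_row_group(const std::vector<uint8_t>& filter) {
    for (uint8_t keep : filter) {
        if (keep) return false; // at least one row survives: lazy-read it
    }
    return true; // all rows filtered out: skip the remaining pages outright
}
```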
2023-04-26 12:10:14 +08:00
375789d345 [enhancement](JNI) Provide default environment variables if it is unset (#19041) 2023-04-26 12:06:38 +08:00
5fd6d8ebd4 [fix](function) Support more MySQL behaviors when casting time 2023-04-26 07:49:54 +08:00
17b59df8dd [fix](function) Array_map compares offset rows one by one (#18406)
When array_map compares multiple columns, not only must the nested data rows be equal, the offsets data must also match.
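A minimal sketch of the offsets check, assuming the usual flattened-array layout (hypothetical `offsets_match`):
```
#include <cstdint>
#include <vector>

// Hypothetical sketch: array columns store flattened data plus per-row
// offsets. Equal nested data with different offsets still means different
// row shapes, so array_map must compare offsets row by row as well.
bool offsets_match(const std::vector<int64_t>& lhs_offsets,
                   const std::vector<int64_t>& rhs_offsets) {
    if (lhs_offsets.size() != rhs_offsets.size()) return false;
    for (size_t i = 0; i < lhs_offsets.size(); ++i) {
        if (lhs_offsets[i] != rhs_offsets[i]) return false;
    }
    return true;
}
```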
2023-04-25 19:12:19 +08:00
fa0f3a2859 [fix](planner) vdatetime_value.cpp:1585 Array access may overflow. (#18872)
int64_t months = _year * 12 + _month - 1 + sign * (12 * interval.year + interval.month);
    _year = months / 12;
    if (_year > 9999) {
        return false;
    }
    _month = (months % 12) + 1;
    if (_day > s_days_in_month[_month]) {
        _day = s_days_in_month[_month];
        if (_month == 2 && doris::is_leap(_year)) {
            _day++;
        }
    }
The variable `months` may be negative, so taking the modulus to compute `_month` may also produce a negative value, which can cause an out-of-bounds array access.
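A worked example of the hazard plus a hedged fix (a sketch, not the PR's exact patch): C++ `%` truncates toward zero, so a negative `months` yields a negative month index.
```
#include <cassert>

// In C++, % truncates toward zero, so a negative months value gives a
// negative month index: (-5 % 12) + 1 == -4, an out-of-bounds array access.
int month_from(long long months) {
    int m = static_cast<int>(months % 12);
    if (m < 0) m += 12; // floor the modulus so the result is always in [0, 11]
    return m + 1;       // month in [1, 12]
}

int main() {
    assert(month_from(-5) == 8); // instead of the invalid -4
    assert(month_from(25) == 2);
}
```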
2023-04-25 17:57:21 +08:00
8d21f20753 [enhancement](javaudf) not depending on the parent will cause a core dump in the destructor (#18948)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-04-25 15:26:54 +08:00
339d804ec4 [Refactor](exceptionsafe) add factory creator to some class (#19000) 2023-04-25 14:33:47 +08:00
39d66ca2c6 [fix](parquet) select vector is not initialized when the number of nested values equals zero (#18953)
Fix bug when reading array type in parquet file:
```
ERROR 1105 (HY000): errCode = 2, detailMessage = [INTERNAL_ERROR]Read parquet file xxx failed,
reason = [IO_ERROR]Decode too many values in current page
```
When reading normal columns, `ScalarColumnReader::_read_values` still calls `ColumnSelectVector::set_run_length_null_map` to initialize the select vector, but `ScalarColumnReader::_read_nested_column` does not, making the number of values wrong.
The situation in which this error occurs is particularly extreme: the column pages have remaining values to be read,
but all of them are null at an ancestor level, so there is no actual read operation, just skipping of the null values at the ancestor level.
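A minimal sketch of the fixed control flow, with a hypothetical stand-in for `ColumnSelectVector` (the real signature differs):
```
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in: the select vector records runs of (is_null, length)
// so later decoding knows how many values each batch covers.
struct SelectVector {
    std::vector<std::pair<bool, size_t>> runs;
    void set_run_length_null_map(bool is_null, size_t num_values) {
        runs.emplace_back(is_null, num_values);
    }
};

// Sketch of the fix: even when every value in the batch is null at an
// ancestor level and nothing is actually decoded, the select vector must
// still record the run, or the page's remaining-value accounting drifts
// ("Decode too many values in current page").
void read_nested_batch(SelectVector& sv, size_t num_values, size_t nulls_at_ancestor) {
    if (num_values == nulls_at_ancestor) {
        sv.set_run_length_null_map(/*is_null=*/true, num_values);
        return; // skip: no actual read operation
    }
    sv.set_run_length_null_map(/*is_null=*/false, num_values - nulls_at_ancestor);
    // ... decode the non-null values here ...
}
```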
2023-04-25 14:21:33 +08:00
d555bae290 [Bug](serde) fix serializing columns to jsonb when meeting boolean and decimal_v3 (#19011)
* [Bug](serde) fix serializing columns to jsonb when meeting boolean and decimal_v3

* add comment to explain why use uint8
2023-04-25 10:48:13 +08:00
b2c26e17e1 [Compile](vec) Fix compilation when BTHREAD_SCANNER is enabled (#18979) 2023-04-24 17:07:06 +08:00
16a394da0e [chore](build) Use include-what-you-use to optimize includes (PART III) (#18958)
Currently, there are some useless includes in the codebase. We can use a tool named include-what-you-use to optimize these includes. By using a strict include-what-you-use policy, we can get lots of benefits from it.
2023-04-24 14:51:51 +08:00