Commit Graph

14741 Commits

Author SHA1 Message Date
9c6c2f736e [Improvement](statistics)Improve stats sample strategy (#26435)
Improve the accuracy of sampled statistics collection. For non-distribution columns, estimate NDV as
`n*d / (n - f1 + f1*n/N)`

where `f1` is the number of distinct values that occurred exactly once in our sample of n rows (from a total of N),
and `d` is the total number of distinct values in the sample.

For distribution columns, use `ndv(n) * fraction of tablets sampled` for NDV.

For very large tablets, use LIMIT to cap the total number of rows scanned (non-key columns only: key columns are sorted, so sampling them with LIMIT would be inaccurate).
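
The estimator quoted above can be sketched in a few lines. This is an illustrative Python rendering of the formula only, not the code from the PR; the function name `estimate_ndv`, its parameter names, and the final clamp to the range `[d, N]` are assumptions added for the example.

```python
def estimate_ndv(n: int, d: int, f1: int, total_rows: int) -> float:
    """Scale a sample's distinct count up to the whole table.

    n          -- number of rows in the sample
    d          -- number of distinct values seen in the sample
    f1         -- number of values seen exactly once in the sample
    total_rows -- number of rows in the full table (N in the formula)
    """
    if n <= 0 or total_rows <= 0:
        return 0.0
    # n*d / (n - f1 + f1*n/N), as written in the commit message.
    estimate = n * d / (n - f1 + f1 * n / total_rows)
    # Assumption for this sketch: keep the result between the observed
    # distinct count and the table's row count.
    return min(max(estimate, float(d)), float(total_rows))
```

Two limiting cases show how the formula behaves: with no singletons (`f1 = 0`) the estimate stays at `d`, and when every sampled value is unique (`f1 = d = n`) it grows to `N`.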
2023-11-13 15:52:21 +08:00
c6b97c4daa [Improvement](segment iterator) remove range in first read to save time (#26689)
Currently, rowids may be significantly fragmented after `_get_row_ranges_by_column_conditions`, potentially leading to high CPU costs when processing these scattered rowid ranges.

This PR enhances the `SegmentIterator` by eliminating the initial range read in the `BitmapRangeIterator` constructor and introducing a `read_batch_rowids` method to both `BitmapRangeIterator` and `BackwardBitmapRangeIterator` classes. The aim is to boost performance by omitting redundant read operations, thereby reducing execution time.

Moreover, to avoid unnecessary reads when the range is relatively complete, we employ a simple `is_continuous` check to determine if the block of rows is continuous. If so, we call `next_batch` instead of `read_by_rowids`, streamlining the processing of consecutive rowids.
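
The continuity check described above can be sketched as follows. This is a hypothetical Python illustration, not the C++ code from the PR: it assumes each batch of rowids is sorted in ascending order without duplicates (as a forward bitmap iteration would produce), and `plan_read` with its tuple return values is invented here to show which read path would be chosen.

```python
from typing import Sequence, Tuple


def is_continuous(rowids: Sequence[int]) -> bool:
    """True when a sorted, duplicate-free batch covers one unbroken range."""
    return len(rowids) > 0 and rowids[-1] - rowids[0] + 1 == len(rowids)


def plan_read(rowids: Sequence[int]) -> Tuple:
    """Pick the cheaper read path for one batch of rowids (hypothetical helper)."""
    if is_continuous(rowids):
        # Consecutive rows: a sequential read of `count` rows from `start`.
        return ("next_batch", rowids[0], len(rowids))
    # Scattered rows: fall back to fetching each rowid individually.
    return ("read_by_rowids", list(rowids))
```

The check costs one subtraction per batch because, for sorted unique rowids, the span between the first and last element equals the batch length only when there are no gaps.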


We selected three SQL statement scenarios to test the effects of the optimization, which are:

1. ```select COUNT() from wc_httplogs_inverted_index where request match "images" and (size >= 10 and status = 200);```
2. ```select COUNT() from wc_httplogs_inverted_index where request match "HTTP" and (size >= 10 and status = 200);```
3. ```select COUNT() from wc_httplogs_inverted_index where request match "GET" and (size >= 10 and status = 200);```

- The first SQL statement represents the scenario primarily optimized in this PR, where the first read matches a large number of rows but is highly fragmented. 
- The second SQL statement represents a scenario where the first read fully hits, mainly to verify if there is any performance degradation in the PR when hitting a complete rowid range. 
- The third SQL statement represents a near-total hit with only occasional misses, used to check if the PR degrades when the rowid range contains many continuous ranges.

The results are as follows:

1. For the first SQL statement:
    1. Before optimization: Execution time: 0.32 sec, FirstReadTime: 6s628ms
    2. After optimization: Execution time: 0.16 sec, FirstReadTime: 1s604ms
2. For the second SQL statement:
    1. Before optimization: Execution time: 0.16 sec, FirstReadTime: 682.816ms
    2. After optimization: Execution time: 0.15 sec, FirstReadTime: 635.156ms
3. For the third SQL statement:
    1. Before optimization: Execution time: 0.16 sec, FirstReadTime: 787.904ms
    2. After optimization: Execution time: 0.16 sec, FirstReadTime: 798.861ms
2023-11-13 15:51:48 +08:00
b0c92d408b [bug](function) add signature for percentile function (#26867) 2023-11-13 15:43:10 +08:00
761fa68ab2 [docs](readme)Update README.md (#26844) 2023-11-13 14:29:39 +08:00
2f32a721ee [refactor](jni) unified jni framework for jdbc catalog (#26317)
This commit overhauls the JDBC connector logic within our project, transitioning from the previous mechanism of fetching data through JNI calls for individual ResultSet items to a more efficient and unified approach using the VectorTable data structure.
2023-11-13 14:28:15 +08:00
5a7c0ec9dc [fix](broker load) pass loadToSingleTablet to olapTableSink (#26680) 2023-11-13 14:14:25 +08:00
7e62c3c2de [fix](Nereids) store user variable in connect context (#26655)
1. User variables should be case-insensitive.
2. User variables should be cleared after the connection is reset.
2023-11-13 12:25:08 +08:00
fa3c7d98c8 [fix](map) the implementation of ColumnMap::replicate was incorrect (#26647) 2023-11-13 12:17:14 +08:00
17b1108635 [fix](nereids)support uncorrelated subquery in join condition (#26672)
The SQL `select * from t1 a join t1 b on b.id in (select 1) and a.id = b.id;` reports an error.
This PR supports uncorrelated subqueries in join conditions to fix it.
2023-11-13 11:49:11 +08:00
a78e0f8309 [enhancement](nereids)make error message more readable when bind logicalRepeat node (#26744) 2023-11-13 10:52:27 +08:00
db29850e1c [bug](user login)fix PASSWORD_LOCK_TIME setting UNBOUNDED does not take effect (#26585) 2023-11-13 10:41:49 +08:00
7e36ab838f [regression](partial update) Add cases when the deleted rows have non nullable columns without default value (#26776) 2023-11-13 10:36:59 +08:00
c0fda8c5c2 [improve](group commit) Add a switch to wait internal group commit lo… (#26734)
* [improve](group commit) Add a switch to make internal group commit load finish

* modify group commit tvf plan
2023-11-13 10:35:35 +08:00
7332b1b371 [fix](decimal) fix undefined behaviour of divide by zero when cast string to decimal (#26822)
* [fix](decimal) fix undefined behaviour of divide by zero when cast string to decimal

* fix format
2023-11-13 10:09:06 +08:00
d34dc1c133 [enhancement](regression test) stream load support direct load to be (#26829) 2023-11-13 10:07:10 +08:00
183c74f6ae [decimal](test case) porting postgres regression test cases (#26836) 2023-11-13 10:06:43 +08:00
d9e0a9fa2e [enhancement](230) print max version and spec version when -230 happens (#26643)
More information is provided.
2023-11-13 09:57:22 +08:00
fa8c3aec07 [opt](load) catch Throwable to make load error msg more clear (#26821)
When running LoadPendingTask or LoadLoadingTask, an `Error` such as `NoClassDefFoundError`
may be thrown. Previously, we only caught Java's `Exception`, so other kinds of errors
could not be shown clearly.
2023-11-13 09:39:29 +08:00
4230b8c36c [doc](hive) fix hive.version doc (#26806) 2023-11-12 19:38:12 +08:00
07f1114ffa [chore](fs) Don't print the stack for file system and its derived classes (#26814) 2023-11-12 19:22:01 +08:00
b2dd58a666 [fix](disk migrate) migrate ignore not exists tablet (#26779) 2023-11-12 18:04:33 +08:00
66054a5c78 [opt](scanner) increase the connection num of s3 client (#26795) 2023-11-12 00:29:11 -06:00
8cf360fff7 [refactor](closure) remove ref count closure using auto release closure (#26718)
1. The closure should be managed by a unique ptr and released by brpc; it should not be held by our code. If our code holds it, we need to wait for brpc to finish during cancel or close.
2. The closure should be exception safe: if any exception happens, it should not leak memory.
3. Use a specific callback interface to be implemented by Doris's code; we can write any code in it, and Doris manages the callback's lifecycle.
4. Use a weak ptr between the callback and the closure. If the callback is destructed before the closure's `Run`, it should not core dump.
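
Point 4 can be illustrated with a small sketch. The real change is C++ built on brpc closures; the Python below only models the ownership idea with `weakref`, and the class names `Callback` and `Closure`, their methods, and the `log` list are all hypothetical. It also relies on CPython's reference counting to free the callback as soon as its last strong reference is dropped.

```python
import weakref


class Callback:
    """Owned by application code; records results into a shared log."""

    def __init__(self, log: list):
        self._log = log

    def on_done(self, result) -> None:
        self._log.append(result)


class Closure:
    """Owned by the RPC layer; holds only a weak reference to the callback."""

    def __init__(self, callback: Callback):
        self._callback_ref = weakref.ref(callback)

    def run(self, result) -> bool:
        callback = self._callback_ref()  # None once the callback is gone
        if callback is None:
            # The owner already destroyed the callback: skip instead of crashing.
            return False
        callback.on_done(result)
        return True
```

Because the closure never keeps the callback alive, the owner can drop it during cancel or close without waiting for the RPC to finish, and a late `run` becomes a no-op.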
2023-11-12 11:57:46 +08:00
ef880166bb [regression-test](stream load)Invalid EXEC_MEM_LIMIT check (#26717) 2023-11-12 11:55:44 +08:00
8392e49983 [fix](hudi) fix wrong schema when query hudi table on obs (#26789) 2023-11-11 21:10:30 -06:00
2937b5166e [fix](refresh) fix priv issue of refresh database and table operation (#26793) 2023-11-11 21:09:53 -06:00
b23dd27c5e [chore](regression-test) Fix error add partition operation due to duplicate partition range (#26742) 2023-11-12 11:00:52 +08:00
ad754cb58f [fix](fe ut) Fix set traceid failed #26808
related to #26605
2023-11-12 10:55:10 +08:00
12b2b0f366 [fix](s3) Prevent data race when finishing s3 file writer's _put_object operation (#26811) 2023-11-12 07:29:14 +08:00
c26f5a2bd2 [improvement](BE) Remove unnecessary error handling codes (#26760) 2023-11-12 00:02:51 +08:00
3044b8397e [feature](fe) Add coverage tool for FE UT (#26203) 2023-11-11 19:54:04 +08:00
196fadc044 [enhancement](metrics) enhance visibility of flush thread pool (#26544) 2023-11-11 19:53:24 +08:00
8b33b0c4a4 [Fix](row store) cache invalidate key should not include sequence column (#26771) 2023-11-11 01:30:32 -06:00
ca47d75e83 [fix](regression) Add regression for group commit executed on observe… (#26692) 2023-11-10 18:53:45 +08:00
70fdd1f1af [fix](ci) fix bug, tpch pipeline upload log (#26627)
* [fix](ci) fix bug, tpch pipeline upload log
Co-authored-by: stephen <hello-stephen@qq.com>
2023-11-10 18:01:40 +08:00
fd43e64a72 [Enhancement](sql-cache) Use update time of hive to avoid cache miss through multi fe nodes. (#26424)
Currently, each FE node generates the update time of an hms table separately (using `System.currentTimestamp()`), so the update time of the same hms table may differ between FE nodes, and the same query cannot hit the sql-cache when it is submitted more than once through different FE nodes. This PR makes the following changes to avoid the problem:

- Use the `transient_lastDdlTime` instead of `System.currentTimestamp` as the `schemaUpdateTime` of hms tables
- Use the `eventTime` in hms event instead of `System.currentTimestamp` as the update time when processing hms events
2023-11-10 17:36:00 +08:00
8ee237c55a [Enhance](regression)enhance case test_hdfs_json_load #26358
enhance case test_hdfs_json_load
2023-11-10 17:29:11 +08:00
0e0cd3b256 [fix](action) Update pr-approve-status.yml (#26577)
According to https://docs.github.com/en/rest/pulls/reviews?apiVersion=2022-11-28#list-reviews-for-a-pull-request,
the default number of results per page is 30 (max 100).
APPROVED reviews beyond the first 30 are not listed,
so raise the page size to 100 to fix it.
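
As a rough illustration of the page-size setting, the sketch below builds the request URL for the "list reviews for a pull request" endpoint with the `per_page` query parameter. It is not the workflow's actual code: the helper `reviews_url` and the constant names are made up for this example, and it only constructs the URL without sending a request.

```python
from urllib.parse import urlencode

GITHUB_API = "https://api.github.com"
MAX_PER_PAGE = 100  # documented maximum; the default is 30


def reviews_url(owner: str, repo: str, pull_number: int,
                page: int = 1, per_page: int = MAX_PER_PAGE) -> str:
    """Build the URL that lists reviews for one pull request (hypothetical helper)."""
    # Clamp to the range the API accepts.
    per_page = max(1, min(per_page, MAX_PER_PAGE))
    query = urlencode({"per_page": per_page, "page": page})
    return f"{GITHUB_API}/repos/{owner}/{repo}/pulls/{pull_number}/reviews?{query}"
```

A pull request with more than 100 reviews would still need further pages, which the `page` parameter covers.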
2023-11-10 17:01:37 +08:00
Pxl
2712bb9f60 [Bug](decimalv2) getCmpType return decimalv2 when lhs/rhs type both is decimalv2 (#26705)
getCmpType return decimalv2 when lhs/rhs type both is decimalv2
2023-11-10 16:21:28 +08:00
59efebce3b [opt](nereids) estimate join cost when col stats are not available (#26086)
no stats left zigzag
2023-11-10 16:13:53 +08:00
0749d632c4 [feature](diagnose) diagnose for cluster balance (#26085) 2023-11-10 15:31:58 +08:00
4ebb517af0 [fix](be-ut) Fix compilation errors caused by missing opentelemetry headers (#26739) 2023-11-10 14:58:46 +08:00
ce64f0c917 [enhancement](Nereids): add phase in shape string (#26682) 2023-11-10 14:56:28 +08:00
5c3fed216d [fix](transaction) Fix publish txn wait too long when not meet quorum (#26659) 2023-11-10 14:55:26 +08:00
9f6c6ffc92 [regression-test](stream load)Invalid file format check (#26713) 2023-11-10 14:53:01 +08:00
899630d0eb [chore](key_util) remove useless null_first parameter (#26635)
Doris always puts nulls first when sorting keys, so the `null_first` parameter of `encode_keys` is useless.
2023-11-10 14:27:47 +08:00
cdba4936b4 [feature](nereids) Support group commit insert (#26075) 2023-11-10 14:20:14 +08:00
019fb956d3 [docs](cache) Refactor query-cache docs (#26418) 2023-11-10 13:57:20 +08:00
7878c08e15 [Revert](merge-on-write) Don't use delete bitmap to mark delete for rows with delete sign when sequence column doesn't exist (#26721) 2023-11-10 13:55:40 +08:00
27a21aa150 [fix](balance) Delete useless debug log (#26732) 2023-11-10 12:57:13 +08:00