Commit Graph

6137 Commits

Author SHA1 Message Date
de6ecd2035 [fix](tls) Manually track memory in Allocator instead of mem hook and ThreadContext life cycle to manual control (#26904)
Manually track query/load/compaction/etc. memory in Allocator instead of mem hook.
Can still use Mem Hook when cannot manually track memory code segments and find memory locations during debugging.
This will cause memory tracking loss for Query, loss less than 10% compared to the past, but this is expected to be more controllable.
Similarly, Mem Hook will no longer track unowned memory to the orphan mem tracker by default, so the total memory of all MemTrackers will be less than before.
Not need to get memory size from jemalloc in Mem Hook each memory alloc and free, which would lose performance in the past.
Not require caching bthread local in pthread local for memory hook, in the past this has caused core dumps inside bthread, seems to be a bug in bthread.
ThreadContext life cycle to manual control
In the past, ThreadContext was automatically created when it was used for the first time (this was usually in the Jemalloc Hook when the first malloc memory), and was automatically destroyed when the thread exited.
Now instead of manually controlling the create and destroy of ThreadContext, it is mainly created manually when the task thread start and destroyed before the task thread end.
Run 43 clickbench query tests.
Use MemHook in the past:
2023-11-14 10:30:42 +08:00
34edc578f1 [opt](MergeIO) use equivalent merge size to measure merge effectiveness (#26741)
`MergeRangeFileReader` is used to merge small IOs, and `max_amplified_read_ratio` controls the proportion of read amplification. However, in some extreme cases(eg. `orc strip size`/`parquet row group size` is less than 3MB), the control effect of `max_amplified_read_ratio` is not good, resulting in a large amount of small IOs.

After testing, the return time of a single IO for IO size smaller than 4kb in hdfs(512kb in oss) remains basically unchanged. Therefore, equivalent IO size is used to measure merge effectiveness:
```
EquivalentIOSize = MergeSize / Request IOs
```
When `EquivalentIOSize` is greater than 4kb in hdfs, or 512kb in oss, we believe that this kind of merge is effective.
2023-11-14 10:07:14 +08:00
6f82c798eb [fix](delta-writer) fix total received rows in delta writer incorrect (#26905) 2023-11-14 08:31:16 +08:00
ec40603b93 [fix](parquet) compressed_page_size has the same meaning in page v1 and v2 (#26783)
1. Parquet with page v2 is parsed error when using other codec except snappy. Because `compressed_page_size` has the same meaning in page v1 and v2, it always contains the bytes of definition level, repetition level and compressed data.
2. Add regression test for `fix_length_byte_array` stored decimal type, and dictionary encoded date/datetime type.
2023-11-14 08:30:42 +08:00
b19abac5e2 [fix](move-memtable) pass num local sink to backends (#26897) 2023-11-14 08:28:49 +08:00
de62c00f4e [fix](move-memtable) init auto partition context in VRowDistribution::open (#26911) 2023-11-14 08:16:14 +08:00
5ad49dceaa [fix](scanner_schedule) scanner hangs due to negative num_running_scanners (#26816)
* [fix] scanner hangs due to negative num_running_scanners

Before the patch, num_running_scanners is increased after submitting,
then it may be decreased before increasing then negative values can
be seen by get_block_from_queue and a expected submit does not happend.

Co-authored-by: Mingyu Chen <morningman.cmy@gmail.com>
2023-11-13 23:03:49 +08:00
f698bb7be2 [Feature](inverted index) index tool add match_all and match_phrase (#26896) 2023-11-13 22:53:11 +08:00
ebc15fc6cc [fix](transaction) Fix concurrent schema change and txn cause dead lock (#26428)
Concurrent schema change and txn may cause dead lock. An example:

Txn T commit but not publish;
Run schema change or rollup on T's related partition, add alter replica R;
sc/rollup add a sched txn watermark M;
Restart fe;
After fe restart, T's loadedTblIndexes will clear because it's not save to disk;
T will publish version to all tablet, including sc/rollup's new alter replica R;
Since R not contains txn data, so the T will fail. It will then always waitting for R's data;
sc/rollup wait for txn before M to finish, only after that it will let R copy history data;
Since T's not finished, so sc/rollup will always wait, so R will nerver copy history data;
Txn T and sc/rollup will wait each other forever, cause dead lock;
Fix: because sc/rollup will ensure double write after the sched watermark M, so for finish transaction, when checking a alter replica:

if txn id is bigger than M, check it just like a normal replica;
otherwise skip check this replica, the BE will modify history data later.
2023-11-13 21:39:28 +08:00
504ec324bb Revert "[refactor](scan) delete bloom_filter_predicate (#26499)" (#26851)
This reverts commit 2bb3ef198144954583aea106591959ee09932cba.
2023-11-13 16:27:23 +08:00
c6b97c4daa [Improvement](segment iterator) remove range in first read to save time (#26689)
Currently, rowids may be fragmented significantly after `_get_row_ranges_by_column_conditions`, potentially leading to high CPU costs when processing these scattered ranges of rowid.

This PR enhances the `SegmentIterator` by eliminating the initial range read in the `BitmapRangeIterator` constructor and introducing a `read_batch_rowids` method to both `BitmapRangeIterator` and `BackwardBitmapRangeIterator` classes. The aim is to boost performance by omitting redundant read operations, thereby reducing execution time.

Moreover, to avoid unnecessary reads when the range is relatively complete, we employ a simple `is_continuous` check to determine if the block of rows is continuous. If so, we call `next_batch` instead of `read_by_rowids`, streamlining the processing of consecutive rowids.


We selected three SQL statement scenarios to test the effects of the optimization, which are:

1. ```select COUNT() from wc_httplogs_inverted_index where request match "images" and (size >= 10 and status = 200);```
2. ```select COUNT() from wc_httplogs_inverted_index where request match "HTTP" and (size >= 10 and status = 200);```
3. ```select COUNT() from wc_httplogs_inverted_index where request match "GET" and (size >= 10 and status = 200);```

- The first SQL statement represents the scenario primarily optimized in this PR, where the first read matches a large number of rows but is highly fragmented. 
- The second SQL statement represents a scenario where the first read fully hits, mainly to verify if there is any performance degradation in the PR when hitting a complete rowid range. 
- The third SQL statement represents a near-total hit with only occasional misses, used to check if the PR degrades when the rowid range contains many continuous ranges.

The results are as follows:

1. For the first SQL statement:
    1. Before optimization: Execution time: 0.32 sec, FirstReadTime: 6s628ms
    2. After optimization: Execution time: 0.16 sec, FirstReadTime: 1s604ms
2. For the second SQL statement:
    1. Before optimization: Execution time: 0.16 sec, FirstReadTime: 682.816ms
    2. After optimization: Execution time: 0.15 sec, FirstReadTime: 635.156ms
3. For the third SQL statement:
    1. Before optimization: Execution time: 0.16 sec, FirstReadTime: 787.904ms
    2. After optimization: Execution time: 0.16 sec, FirstReadTime: 798.861ms
2023-11-13 15:51:48 +08:00
2f32a721ee [refactor](jni) unified jni framework for jdbc catalog (#26317)
This commit overhauls the JDBC connector logic within our project, transitioning from the previous mechanism of fetching data through JNI calls for individual ResultSet items to a more efficient and unified approach using the VectorTable data structure.
2023-11-13 14:28:15 +08:00
fa3c7d98c8 [fix](map) the implementation of ColumnMap::replicate was incorrect" (#26647) 2023-11-13 12:17:14 +08:00
c0fda8c5c2 [improve](group commit) Add a swicth to wait internal group commit lo… (#26734)
* [improve](group commit) Add a swicth to make internal group commit load finish

* modify group commit tvf plan
2023-11-13 10:35:35 +08:00
7332b1b371 [fix](decimal) fix undefined behaviour of divide by zero when cast string to decimal (#26822)
* [fix](decimal) fix undefined behaviour of divide by zero when cast string to decimal

* fix format
2023-11-13 10:09:06 +08:00
d9e0a9fa2e [enhancement](230) print max version and spec version when -230 happens (#26643)
More information is provided.
2023-11-13 09:57:22 +08:00
07f1114ffa [chore](fs) Don't print the stack for file system and it's derived class (#26814) 2023-11-12 19:22:01 +08:00
66054a5c78 [opt](scanner) increase the connection num of s3 client (#26795) 2023-11-12 00:29:11 -06:00
8cf360fff7 [refactor](closure) remove ref count closure using auto release closure (#26718)
1. closure should be managed by a unique ptr and released by brpc , should not hold by our code. If hold by our code, we need to wait brpc finished during cancel or close.
2. closure should be exception safe, if any exception happens, should not memory leak.
3. using a specific callback interface to be implemented by Doris's code, we could write any code and doris should manage callback's lifecycle.
4. using a weak ptr between callback and closure. If callback is deconstruted before closure'Run, should not core.
2023-11-12 11:57:46 +08:00
12b2b0f366 [fix](s3) Prevent data race when finishing s3 file writer's _put_object operation (#26811) 2023-11-12 07:29:14 +08:00
c26f5a2bd2 [improvement](BE) Remove unnecessary error handling codes (#26760) 2023-11-12 00:02:51 +08:00
196fadc044 [enhancement](metrics) enhance visibility of flush thread pool (#26544) 2023-11-11 19:53:24 +08:00
8b33b0c4a4 [Fix](row store) cache invalidate key should not include sequence column (#26771) 2023-11-11 01:30:32 -06:00
70fdd1f1af [fix](ci) fix bug, tpch pipeline upload log (#26627)
* [fix](ci) fix bug, tpch pipeline upload log
Co-authored-by: stephen <hello-stephen@qq.com>
2023-11-10 18:01:40 +08:00
4ebb517af0 [fix](be-ut) Fix compilation errors caused by missing opentelemetry headers (#26739) 2023-11-10 14:58:46 +08:00
899630d0eb [chore](key_util) remove useless null_first parameter (#26635)
Doris always put null in the first when sorting key, the parameter null_first of encode_keys is useless.
2023-11-10 14:27:47 +08:00
7878c08e15 [Revert](merge-on-write) Don't use delete bitmap to mark delete for rows with delete sign when sequence column doesn't exist (#26721) 2023-11-10 13:55:40 +08:00
7754791146 [improvement](disk balance) Prevent duplicate disk balance tasks afte… (#25990) 2023-11-10 10:14:42 +08:00
2bf48d7829 Revert "[Coverage](BE) Delete vinfo_func in BE (#26562)" (#26723)
This reverts commit 01094fd25ed539a8025066d8823c1e907109048a.
2023-11-10 10:14:11 +08:00
d767804815 [feature](merge-cloud) Decouple rowset id generator and local rowsets gc implementation (#25921) 2023-11-10 10:07:02 +08:00
d988193d39 [pipelineX](shuffle) block exchange sink by memory usage (#26595) 2023-11-09 21:28:22 +08:00
c07a70e22a [Fix](orc-reader) Add missing break introduced by #26548. (#26633)
Add missing break introduced by #26548. Sorry for this mistake.
2023-11-09 18:29:44 +08:00
a5565f68b2 [Refactor](opentelemetry) Remove opentelemetry (#26605) 2023-11-09 18:05:34 +08:00
eca747413d [Fix](partial update) Fix core when doing partial update on tables with row column after schema change (#26632) 2023-11-09 18:00:05 +08:00
baae7bf339 [fix](information_schema)fix bug that metadata_name_ids error tableid and append information_schema case. (#26238)
fix bug that  #24059 .
Added some information_schema scanner tests.
files
schema_privileges
table_privileges
partitions
rowsets
statistics
table_constraints

Based on infodb_support_ext_catalog=false, it currently includes tests for all tables under the information_schema database.
2023-11-09 14:07:12 +08:00
22bf2889e5 [feature](tvf)(jni-avro)jni-avro scanner add complex data types (#26236)
Support avro's enum, record, union data types
2023-11-09 13:58:49 +08:00
5f62a4462d [Enhancement](wal) Add wal space back pressure (#26483) 2023-11-09 12:29:05 +08:00
33e46ee13d [enhancement](config) enable single_replica_load by default in BE (#26619)
Signed-off-by: freemandealer <freeman.zhang1992@gmail.com>
2023-11-09 12:14:37 +08:00
d1438a8563 [Fix](orc-reader) Fix orc complex types when late materialization was turned on by disabling late materialization in this case. (#26548)
Fix orc complex types when late materialization was turned on in orc reader by disabling late materialization in this case.
2023-11-09 12:05:43 +08:00
01094fd25e [Coverage](BE) Delete vinfo_func in BE (#26562)
Delete vinfo_func in BE
2023-11-09 11:00:15 +08:00
95f74f1544 [FIX](complextype)fix shrink in topN for complex type #26609 2023-11-09 10:56:14 +08:00
74e452f19c [bug](bitmap) fix bitmap value copy operator not call reset (#26451)
when a empty bitmap assign to other bitmap
the other bitmap should reset self firstly, and then set empty type.
2023-11-09 10:05:09 +08:00
66e591f7f2 [enhancement](brpc) add a auto release closure to ensure the closue safety (#26567) 2023-11-09 08:50:42 +08:00
55b2988bfd [Opt](date_add/sub) Throw exception when result of date_add/sub out of range (#26475) 2023-11-09 08:46:51 +08:00
a6f9df7096 [LOG] Add fatal log in exchange sink buffer (#26594) 2023-11-08 21:52:21 +08:00
d0960bac56 [Fix](partial update) Fix partial update info loss when the delete bitmaps of the committed transactions are calculated by the compaction (#26556)
a fix for #25147
2023-11-08 19:56:31 +08:00
3bce6d3828 [Opt](orc-reader) Optimize orc string dict filter in not_single_conjunct case. (#26386)
Optimize orc/parquet string dict filter in not_single_conjunct case. We can optimize this processing to filter block firstly by dict code, then filter by not_single_conjunct. Because dict code is int, it will filter faster than string.

For example:
```
select count(l_receiptdate) from lineitem_date_as_string where l_shipmode in ('MAIL', 'SHIP') and l_commitdate < l_receiptdate  and l_receiptdate >= '1994-01-01' and l_receiptdate < '1995-01-01';
```
 `l_receiptdate` and `l_shipmode` will using string dict filtering, and `l_commitdate < l_receiptdate` is the an not_single_conjunct which contains dict filter field. We can optimize this processing to filter block firstly by dict code, then filter by not_single_conjunct. Because dict code is int, it will filter faster than string.

### Test Result:
Before:
 mysql> select count(l_receiptdate) from lineitem_date_as_string where l_shipmode in ('MAIL', 'SHIP') and l_commitdate < l_receiptdate  and l_receiptdate >= '1994-01-01' and l_receiptdate < '1995-01-01';
+----------------------+
| count(l_receiptdate) |
+----------------------+
|             49314694 |
+----------------------+
1 row in set (6.87 sec)

After:
mysql> select count(l_receiptdate) from lineitem_date_as_string where l_shipmode in ('MAIL', 'SHIP') and l_commitdate < l_receiptdate  and l_receiptdate >= '1994-01-01' and l_receiptdate < '1995-01-01';
+----------------------+
| count(l_receiptdate) |
+----------------------+
|             49314694 |
+----------------------+
1 row in set (4.85 sec)
2023-11-08 18:03:18 +08:00
58bf79f79e [fix](move-memtable) pass load stream num to backends (#26198) 2023-11-08 16:16:33 +08:00
6637f9c15f Add enable_cgroup_cpu_soft_limit (#26510) 2023-11-08 15:52:13 +08:00
f018b00646 [ci](perf) add new pipeline of tpch-sf100 (#26334)
* [ci](perf) add new pipeline of tpch-sf100
Co-authored-by: stephen <hello-stephen@qq.com>
2023-11-08 15:32:02 +08:00