1. Fix the memory leak. When the load task is canceled, the `IndexChannel` and `NodeChannel` mem trackers cannot be destructed in time.
2. Fix Load task being frequently canceled by oom and inaccurate `LoadChannel` mem tracker limit, and rewrite the variable name of `mem limit` in `LoadChannel`.
3. Fix core dump, when logout task mem tracker, phmap erase fails, resulting in repeated logout of the same tracker.
4. Fix the deadlock, when add_child_tracker mem limit exceeds, calling log_usage causes `_child_trackers_lock` deadlock.
5. Fix frequent log printing when thread mem tracker limit exceeds, which will affect readability and performance.
6. Optimize some details of mem tracker display.
When the length of `Tuple/Block data` is greater than 2G, serialize the protoBuf request and embed the
`Tuple/Block data` into the controller attachment and transmit it through http brpc.
This is to avoid errors when the length of the protoBuf request exceeds 2G:
`Bad request, error_text=[E1003]Fail to compress request`.
In #7164, `Tuple/Block data` was put into attachment and sent via default `baidu_std brpc`,
but when the attachment exceeds 2G, it will be truncated. There is no 2G limit for sending via `http brpc`.
Also, in #7921, consider putting `Tuple/Block data` into attachment transport by default, as this theoretically
reduces one serialization and improves performance. However, the test found that the performance did not improve,
but the memory peak increased due to the addition of a memory copy.
1. Fix LoadTask, ChunkAllocator, TabletMeta, Brpc, the accuracy of memory track.
2. Modified some MemTracker names, deleted some unnecessary trackers, and improved readability.
3. More powerful MemTracker debugging capabilities.
4. Avoid creating TabletColumn temporary objects and improve BE startup time by 8%.
5. Fix some other details.
1. fix track bthread
- Bthread, a high performance M:N thread library used by brpc. In Doris, a brpc server response runs on one bthread, possibly on multiple pthreads. Currently, MemTracker consumption relies on pthread local variables (TLS).
- This caused pthread TLS MemTracker confusion when switching pthread TLS MemTracker in brpc server response. So replacing pthread TLS with bthread TLS in the brpc server response saves the MemTracker.
Ref: 731730da85/docs/en/server.md (bthread-local)
2. fix track vectorized query
- Added track mmap. Currently, mmap allocates memory in many places of the vectorized execution engine.
- Refactored ThreadContext to avoid dependency conflicts and make it easier to debug.
- Fix some bugs.
This reverts commit de7dce4df84fcbfbbaf715cbac151e802321f80f.
Reverts apache/incubator-doris#8976
This cause BE ut failed: sh run-be-ut.sh --run --filter OlapTableSinkTest.*
```
==62008==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x7ffff36867c0 in thread T0
```
We can not shutdown _send_batch_thread_pool_token, because _packet_in_flight
has to be clear finally. Otherwise a never ended join on rpc would happen.
It is difficult to handle concurrent problem if a flag setter is not guaranteed to run.
The patch fixes two problems.
1. Memory order problem accessing _last_patch_processed_finished and in_flight, actually _last_patch_processed_finished is redundant, so the patch removes it.
2. synchronization in join on cid.
Fix for #8725.
1. add a config string_type_soft_limit to soft limit max length of string type
2. disable using String type in Key column, partition column and
distribution column
3. remove String type alias BLOB for futrue use
Early Design Documentation: https://shimo.im/docs/DT6JXDRkdTvdyV3G
Implement a new way of memory statistics based on TCMalloc New/Delete Hook,
MemTracker and TLS, and it is expected that all memory new/delete/malloc/free
of the BE process can be counted.
1.
The methods in the IndexChannel are called back in the RpcClosure in the NodeChannel.
However, this callback may occur after the whole task is finished (e.g. due to network latency),
and by that time the IndexChannel may have been destructured, so we should not call
the IndexChannel methods anymore, otherwise the BE will crash.
Therefore, we use the `_is_closed` variable and `_closed_lock` to ensure that the RPC callback function
will not call the IndexChannel's method after the NodeChannel is closed.
2.
Do not add IndexChannel to the ObjectPool.
Because when deconstruct IndexChannel, it may call the deconstruction of NodeChannel.
And the deconstruction of NodeChannel maybe time consuming(wait rpc finished).
But the ObjectPool will hold a SpinLock to destroy the objects, so it may cause CPU busy.
Due to unlimited queue in OlapScanNode and NodeChannel, memory usage can be
very large for reading and writing large table, e.g 'insert into tableB select * from tableA'.
Modify the implementation of MemTracker:
1. Simplify a lot of useless logic;
2. Added MemTrackerTaskPool, as the ancestor of all query and import trackers, This is used to track the local memory usage of all tasks executing;
3. Add cosume/release cache, trigger a cosume/release when the memory accumulation exceeds the parameter mem_tracker_consume_min_size_bytes;
4. Add a new memory leak detection mode (Experimental feature), throw an exception when the remaining statistical value is greater than the specified range when the MemTracker is destructed, and print the accurate statistical value in HTTP, the parameter memory_leak_detection
5. Added Virtual MemTracker, cosume/release will not sync to parent. It will be used when introducing TCMalloc Hook to record memory later, to record the specified memory independently;
6. Modify the GC logic, register the buffer cached in DiskIoMgr as a GC function, and add other GC functions later;
7. Change the global root node from Root MemTracker to Process MemTracker, and remove Process MemTracker in exec_env;
8. Modify the macro that detects whether the memory has reached the upper limit, modify the parameters and default behavior of creating MemTracker, modify the error message format in mem_limit_exceeded, extend and apply transfer_to, remove Metric in MemTracker, etc.;
Modify where MemTracker is used:
1. MemPool adds a constructor to create a temporary tracker to avoid a lot of redundant code;
2. Added trackers for global objects such as ChunkAllocator and StorageEngine;
3. Added more fine-grained trackers such as ExprContext;
4. RuntimeState removes FragmentMemTracker, that is, PlanFragmentExecutor mem_tracker, which was previously used for independent statistical scan process memory, and replaces it with _scanner_mem_tracker in OlapScanNode;
5. MemTracker is no longer recorded in ReservationTracker, and ReservationTracker will be removed later;
In some scenarios, users cannot find a suitable hash key to avoid data skew, so we need to provide an additional data distribution for olap table to avoid data skew
example:
CREATE TABLE random_table
(
siteid INT DEFAULT '10',
citycode SMALLINT,
username VARCHAR(32) DEFAULT '',
pv BIGINT SUM DEFAULT '0'
)
AGGREGATE KEY(siteid, citycode, username)
DISTRIBUTED BY random BUCKETS 10
PROPERTIES("replication_num" = "1");
Co-authored-by: caiconghui1 <caiconghui1@jd.com>
1. Fix the problem of BE crash caused by destruct sequence. (close#8058)
2. Add a new BE config `compaction_task_num_per_fast_disk`
This config specify the max concurrent compaction task num on fast disk(typically .SSD).
So that for high speed disk, we can execute more compaction task at same time,
to compact the data as soon as possible
3. Avoid frequent selection of unqualified tablet to perform compaction.
4. Modify some log level to reduce the log size of BE.
5. Modify some clone logic to handle error correctly.
1. set both `tuple_offsets` and `new_tuple_offsets` in PRowBatch for compatibility
2. set FE config `repair_slow_replica` default to false
Avoid impacting the load process after upgrading.
Eg, if there are only 2 replicas, one is with high version count. After upgrade,
that replica will be set to bad, so that the load process will be stopped
because only 1 replica is alive.
3. Fix a bug that NodeChannel may be blocked at `close_wait()`
Forget to set `add_batch_finish` flag after the last rpc finished.
4. Fix a NPE of RoutineLoadScheduler
Support implement UDF through GRPC protocol. This brings several benefits:
1. The udf implementation language is not limited to c++, users can use any familiar language to implement udf
2. UDF is decoupled from Doris, udf will not cause doris coredump, udf computing resources are separated from doris, and doris services are not affected
But RPC's UDF has a fixed overhead, so its performance is much slower than C++ UDF, especially when the amount of data is large.
Create function like
```
CREATE FUNCTION rpc_add(INT, INT) RETURNS INT PROPERTIES (
"SYMBOL"="add_int",
"OBJECT_FILE"="127.0.0.1:9999",
"TYPE"="RPC"
);
```
Function service need to implement `check_fn` and `fn_call` methods
Note:
THIS IS AN EXPERIMENTAL FEATURE, THE INTERFACE AND DATA STRUCTURE MAY BE CHANGED IN FUTURE !!!
This PR mainly changes:
1. Fix bug when enable `transfer_data_by_brpc_attachment`
In `data_stream_sender`, we will send a serialized PRowBatch data to multiple Channels.
And if `transfer_data_by_brpc_attachment` is enabled, we will mistakenly clear the data in PRowBatch
after sending PRowBatch to the first Channel.
As a result, the following Channel cannot receive the correct data, causing an error.
So I use a separate buffer instead of `tuple_data` in PRowBatch to store the serialized data
and reuse it in multiple channels.
2. Fix bug that the the offset in serialized row batch may overflow
Use int64 to replace int32 offset. And for compatibility, add a new field `new_tuple_offsets` in PRowBatch.
Currently, if we encounter a problem with a replica of a tablet during the load process,
such as a write error, rpc error, -235, etc., it will cause the entire load job to fail,
which results in a significant reduction in Doris' fault tolerance.
This PR mainly changes:
1. refined the judgment of failed replicas in the load process, so that the failure of a few replicas will not affect the normal completion of the load job.
2. fix a bug introduced from #7754 that may cause BE coredump
This PR mainly changes:
1. Help to Cancel the load job ASAP when encounter unqualified data.
Solution is described in #6318 .
Also replace some std::stringstream with fmt::memory_buffer to avoid performance issues.
2. fix a NPE bug when create user with empty host
3. fix compile warning after rebasing the master(vectorization)
# Proposed changes
Issue Number: close#6238
Co-authored-by: HappenLee <happenlee@hotmail.com>
Co-authored-by: stdpain <34912776+stdpain@users.noreply.github.com>
Co-authored-by: Zhengguo Yang <yangzhgg@gmail.com>
Co-authored-by: wangbo <506340561@qq.com>
Co-authored-by: emmymiao87 <522274284@qq.com>
Co-authored-by: Pxl <952130278@qq.com>
Co-authored-by: zhangstar333 <87313068+zhangstar333@users.noreply.github.com>
Co-authored-by: thinker <zchw100@qq.com>
Co-authored-by: Zeno Yang <1521564989@qq.com>
Co-authored-by: Wang Shuo <wangshuo128@gmail.com>
Co-authored-by: zhoubintao <35688959+zbtzbtzbt@users.noreply.github.com>
Co-authored-by: Gabriel <gabrielleebuaa@gmail.com>
Co-authored-by: xinghuayu007 <1450306854@qq.com>
Co-authored-by: weizuo93 <weizuo@apache.org>
Co-authored-by: yiguolei <guoleiyi@tencent.com>
Co-authored-by: anneji-dev <85534151+anneji-dev@users.noreply.github.com>
Co-authored-by: awakeljw <993007281@qq.com>
Co-authored-by: taberylyang <95272637+taberylyang@users.noreply.github.com>
Co-authored-by: Cui Kaifeng <48012748+azurenake@users.noreply.github.com>
## Problem Summary:
### 1. Some code from clickhouse
**ClickHouse is an excellent implementation of the vectorized execution engine database,
so here we have referenced and learned a lot from its excellent implementation in terms of
data structure and function implementation.
We are based on ClickHouse v19.16.2.2 and would like to thank the ClickHouse community and developers.**
The following comment has been added to the code from Clickhouse, eg:
// This file is copied from
// https://github.com/ClickHouse/ClickHouse/blob/master/src/Interpreters/AggregationCommon.h
// and modified by Doris
### 2. Support exec node and query:
* vaggregation_node
* vanalytic_eval_node
* vassert_num_rows_node
* vblocking_join_node
* vcross_join_node
* vempty_set_node
* ves_http_scan_node
* vexcept_node
* vexchange_node
* vintersect_node
* vmysql_scan_node
* vodbc_scan_node
* volap_scan_node
* vrepeat_node
* vschema_scan_node
* vselect_node
* vset_operation_node
* vsort_node
* vunion_node
* vhash_join_node
You can run exec engine of SSB/TPCH and 70% TPCDS stand query test set.
### 3. Data Model
Vec Exec Engine Support **Dup/Agg/Unq** table, Support Block Reader Vectorized.
Segment Vec is working in process.
### 4. How to use
1. Set the environment variable `set enable_vectorized_engine = true; `(required)
2. Set the environment variable `set batch_size = 4096; ` (recommended)
### 5. Some diff from origin exec engine
https://github.com/doris-vectorized/doris-vectorized/issues/294
## Checklist(Required)
1. Does it affect the original behavior: (No)
2. Has unit tests been added: (Yes)
3. Has document been added or modified: (No)
4. Does it need to update dependencies: (No)
5. Are there any changes that cannot be rolled back: (Yes)
If an load task has a relatively short timeout, then we need to ensure that
each RPC of this task does not get blocked for a long time.
And an RPC is usually blocked for two reasons.
1. handling "memory exceeds limit" in the RPC
If the system finds that the memory occupied by the load exceeds the threshold,
it will select the load channel that occupies the most memory and flush the memtable in it.
this operation is done in the RPC, which may be more time consuming.
2. close the load channel
When the load channel receives the last batch, it will end the task.
It will wait for all memtables flushes to finish synchronously. This process is also time consuming.
Therefore, this PR solves this problem by.
1. Use timeout to determine whether it is a high-priority load task
If the timeout of an load task is relatively short, then we mark it as a high-priority task.
2. not processing "memory exceeds limit" for high priority tasks
3. use a separate flush thread to flush memtable for high priority tasks.
Transfer RowBatch in Protobuf Request to Controller Attachment,
when the maximum length of the RowBatch in the Protobuf Request is exceeded.
This can avoid reaching the upper limit of the Protobuf Request length (2G),
and it is expected that performance can be improved.
Added bprc stub cache check and reset api, used to test whether the bprc stub cache is available, and reset the bprc stub cache
add a config used for auto check and reset bprc stub
Add a use_path_style property for S3
Upgrade hadoop-common and hadoop-aws to 2.8.0 to support path style property
Fix some S3 URI bugs
Add some logs for tracing load process.
This CL mainly changes:
1. Add star schema benchmark tools in `tools/ssb-tools`, for user to easy load and test with SSB data set.
2. Disable the segment cache for some read scenario such as compaction and alter operation.(Fix#6924 )
3. Fix a bug that `max_segment_num_per_rowset` won't work(Fix#6926)
4. Enable `enable_batch_delete_by_default` by default.
1. Fix a memory leak in `collect_iterator.cpp` (Fix#6700)
2. Add a new BE config `max_segment_num_per_rowset` to limit the num of segment in new rowset.(Fix#6701)
3. Make the error msg of stream load more friendly.
1.Fix a potential BE coredump of sending batch when loading data. (Fix [Bug] BE crash when loading data #6656)
2.Fix a potential BE coredump when doing schema change. (Fix [Bug] BE crash when doing alter task #6657)
3.Optimize the metric of base_compaction_request_failed.
4.Add Order column in show tablet result. (Fix [Feature] Add order column in SHOW TABLET stmt result #6658)
5.Fix bug that tablet repair slot not being released. (Fix [Bug] Tablet scheduler stop working #6659)
6.Fix bug that REPLICA_MISSING error can not be handled. (Fix [Bug] REPLICA_MISSING error can not be handled. #6660)
7.Modify column name of SHOW PROC "/cluster_balance/cluster_load_stat"
8.Optimize the result of SHOW PROC "/statistic" to show COLOCATE_MISMATCH tablets (Fix [Feature] the health status of colocate table's tablet is not shown in show proc statistic #6663)
9.Fix bug that show load where state='pending' can not be executed. (Fix [Bug] show load where state='pending' can not be executed. #6664)
The log4j-config.xml will be generated at startup of FE and also when modifying FE config.
But in some deploy environment such as k8s, the conf dir is not writable.
So change the dir of log4j-config.xml to Config.custom_conf_dir.
Also fix some small bugs:
1. Typo "less then" -> "less than"
2. Duplicated `exec_mem_limit` showed in SHOW ROUTINE LOAD
3. Allow MAXVALUE in single partition column table.
4. Add IP info for "intolerate index channel failure" msg.
Change-Id: Ib4e1182084219c41eae44d3a28110c0315fdbd7d
Co-authored-by: chenmingyu <chenmingyu@baidu.com>