In pr #8476, all memory usage of a process is recorded in the process mem tracker,
and all memory usage of a query is recorded in the query mem tracker,
and it is still necessary to manually call `transfer to` to track the cached memory size.
We hope to separate out more detailed memory usage based on Hook TCMalloc new/delete + TLS mem tracker.
In this pr, the more detailed mem tracker is switched to TLS, which automatically and accurately
counts more detailed memory usage than before.
Add a new column-type to speed up the approximation of quantiles.
1. The new column-type is named `quantile_state` with fixed aggregation function `quantile_union`, which stores the intermediate results of pre-aggregated approximation calculations for quantiles.
2. support pre-aggregation of new column-type and quantile_state related functions.
This feature is propsoed in [DSIP-1](https://cwiki.apache.org/confluence/display/DORIS/DSIP-001%3A+Java+UDF).
This PR support fixed-length input and output Java UDF. Phase I in DIP-1 is done after this PR.
To support Java UDF effeciently, I use no data copy in JNI call and all compute operations are off-heap in Java.
To achieve that, I use a UdfExecutor instead.
For users, a UDF class must have a public evaluate method.
Early Design Documentation: https://shimo.im/docs/DT6JXDRkdTvdyV3G
Implement a new way of memory statistics based on TCMalloc New/Delete Hook,
MemTracker and TLS, and it is expected that all memory new/delete/malloc/free
of the BE process can be counted.
Modify the implementation of MemTracker:
1. Simplify a lot of useless logic;
2. Added MemTrackerTaskPool, as the ancestor of all query and import trackers, This is used to track the local memory usage of all tasks executing;
3. Add cosume/release cache, trigger a cosume/release when the memory accumulation exceeds the parameter mem_tracker_consume_min_size_bytes;
4. Add a new memory leak detection mode (Experimental feature), throw an exception when the remaining statistical value is greater than the specified range when the MemTracker is destructed, and print the accurate statistical value in HTTP, the parameter memory_leak_detection
5. Added Virtual MemTracker, cosume/release will not sync to parent. It will be used when introducing TCMalloc Hook to record memory later, to record the specified memory independently;
6. Modify the GC logic, register the buffer cached in DiskIoMgr as a GC function, and add other GC functions later;
7. Change the global root node from Root MemTracker to Process MemTracker, and remove Process MemTracker in exec_env;
8. Modify the macro that detects whether the memory has reached the upper limit, modify the parameters and default behavior of creating MemTracker, modify the error message format in mem_limit_exceeded, extend and apply transfer_to, remove Metric in MemTracker, etc.;
Modify where MemTracker is used:
1. MemPool adds a constructor to create a temporary tracker to avoid a lot of redundant code;
2. Added trackers for global objects such as ChunkAllocator and StorageEngine;
3. Added more fine-grained trackers such as ExprContext;
4. RuntimeState removes FragmentMemTracker, that is, PlanFragmentExecutor mem_tracker, which was previously used for independent statistical scan process memory, and replaces it with _scanner_mem_tracker in OlapScanNode;
5. MemTracker is no longer recorded in ReservationTracker, and ReservationTracker will be removed later;
The thread context saves some info about a working thread.
1. thread_id: Current thread id, Auto generated.
2. type: The type is a enum value indicating which type of task current thread is running.
For example: QUERY, LOAD, COMPACTION, ...
3. task id: A unique id to identify this task. maybe query id, load job id, etc.
Using gcc11 compiles thread_local variable on lower versions of GLIBC will report an error, see https://github.com/apache/incubator-doris/pull/7911
This is very difficult to solve, so kudu Class-scoped static thread local implementation was introduced.
Solve the above problem by Thread-scopedthread local + Class-scoped thread local.
See the comments for ThreadContextPtr for details.
Also fix BE ut:
1. fix scheme_change_test memory leak
2. fix mem_pool_test
Do not using DEFAULT_PADDING_SIZE = 0x10 in mem_pool when running ut.
3. remove plugin_test
* make ASAN poisoning work as much as possible
Before this patch a use after poison is reported like below
==19305==ERROR: AddressSanitizer: unknown-crash on address
0x625000137013 at pc 0x561c44bcf6b8 bp 0x7ffb75a00910 sp 0x7ffb75a000b8
After this patch the use after poison is reported like below
==17782==ERROR: AddressSanitizer: use-after-poison on address
0x625000137033 at pc 0x55633c8f56b8 bp 0x7ff3dc437930 sp 0x7ff3dc43
Before this patch, a false memory usage is reported like below
==33080==AddressSanitizer CHECK failed: ../../../../src/libsanitizer/
asan/asan_allocator.cpp:189 "((old)) == ((kAllocBegMagic))"
Support thread pool per disk for scanners to prevent pool performance from some high ioutil disks happening
key point:
1. each disk has a thread pool for scanners
2. whenever a thread pool of one disk runs out of local work, tasks can be retrieved from other threads(disks). This is done round-robin.
performance testing:
vec version: 25% faster than single thread pool in a high io util disk test case
normal version: 8% faster than single thread pool in a high io util disk test case
1. Fix the problem of BE crash caused by destruct sequence. (close#8058)
2. Add a new BE config `compaction_task_num_per_fast_disk`
This config specify the max concurrent compaction task num on fast disk(typically .SSD).
So that for high speed disk, we can execute more compaction task at same time,
to compact the data as soon as possible
3. Avoid frequent selection of unqualified tablet to perform compaction.
4. Modify some log level to reduce the log size of BE.
5. Modify some clone logic to handle error correctly.
The two phase batch commit means:
During Stream load, after data is written, the message will be returned to the client,
the data is invisible at this point and the transaction status is PRECOMMITTED.
The data will be visible only after COMMIT is triggered by client.
1. User can invoke the following interface to trigger commit operations for transaction:
curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:commit" \
http://fe_host:http_port/api/{db}/_stream_load_2pc
or
curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:commit" \
http://be_host:webserver_port/api/{db}/_stream_load_2pc
2.User can invoke the following interface to trigger abort operations for transaction:
curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:abort" \
http://fe_host:http_port/api/{db}/_stream_load_2pc
or
curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:abort" \
http://be_host:webserver_port/api/{db}/_stream_load_2pc
1. set both `tuple_offsets` and `new_tuple_offsets` in PRowBatch for compatibility
2. set FE config `repair_slow_replica` default to false
Avoid impacting the load process after upgrading.
Eg, if there are only 2 replicas, one is with high version count. After upgrade,
that replica will be set to bad, so that the load process will be stopped
because only 1 replica is alive.
3. Fix a bug that NodeChannel may be blocked at `close_wait()`
Forget to set `add_batch_finish` flag after the last rpc finished.
4. Fix a NPE of RoutineLoadScheduler
Support implement UDF through GRPC protocol. This brings several benefits:
1. The udf implementation language is not limited to c++, users can use any familiar language to implement udf
2. UDF is decoupled from Doris, udf will not cause doris coredump, udf computing resources are separated from doris, and doris services are not affected
But RPC's UDF has a fixed overhead, so its performance is much slower than C++ UDF, especially when the amount of data is large.
Create function like
```
CREATE FUNCTION rpc_add(INT, INT) RETURNS INT PROPERTIES (
"SYMBOL"="add_int",
"OBJECT_FILE"="127.0.0.1:9999",
"TYPE"="RPC"
);
```
Function service need to implement `check_fn` and `fn_call` methods
Note:
THIS IS AN EXPERIMENTAL FEATURE, THE INTERFACE AND DATA STRUCTURE MAY BE CHANGED IN FUTURE !!!
This PR mainly changes:
1. Fix bug when enable `transfer_data_by_brpc_attachment`
In `data_stream_sender`, we will send a serialized PRowBatch data to multiple Channels.
And if `transfer_data_by_brpc_attachment` is enabled, we will mistakenly clear the data in PRowBatch
after sending PRowBatch to the first Channel.
As a result, the following Channel cannot receive the correct data, causing an error.
So I use a separate buffer instead of `tuple_data` in PRowBatch to store the serialized data
and reuse it in multiple channels.
2. Fix bug that the the offset in serialized row batch may overflow
Use int64 to replace int32 offset. And for compatibility, add a new field `new_tuple_offsets` in PRowBatch.
The tuple ids of the empty set node must be exactly the same as the tuple ids of the origin root node.
In the issue, we found that once the tree where the root node is located has a window function,
the tuple ids of the empty set node cannot be calculated correctly.
This pr mostly fixes the problem.
In order to calculate the correct tuple ids,
the tuple ids obtained from the SelectStmt.getMaterializedTupleIds() function in the past
are changed to directly use the tuple ids of the origin root node.
Although we tried to fix#7929 by modifying the SelectStmt.getMaterializedTupleIds() function,
this method can't get the tuple of the last correct window function.
So we use other ways to construct tupleids of empty nodes.
Currently, if we encounter a problem with a replica of a tablet during the load process,
such as a write error, rpc error, -235, etc., it will cause the entire load job to fail,
which results in a significant reduction in Doris' fault tolerance.
This PR mainly changes:
1. refined the judgment of failed replicas in the load process, so that the failure of a few replicas will not affect the normal completion of the load job.
2. fix a bug introduced from #7754 that may cause BE coredump