We make all MemTrackers shared so that real-time MemTracker consumption can be shown on the web page.
The changes are as follows:
1. Nearly all MemTracker raw pointers are changed to shared_ptr.
2. Use CreateTracker() to create a new MemTracker, so that it adds itself to its parent.
3. RowBatch & MemPool still use raw pointers to MemTracker, because it is easy to ensure that the RowBatch & MemPool destructors run
before the MemTracker's destructor. So this code is unchanged.
4. MemTracker can use a RuntimeProfile counter to calculate consumption, so that counter needs to be shared
too. We add a shared counter pool to store the shared counters; the other counters of RuntimeProfile are unchanged.
Note that this PR doesn't change the MemTracker tree structure, so there are still some orphan trackers, e.g. RowBlockV2's MemTracker. If you find that some shared MemTrackers track little memory but cost too much time, you can make them orphans, and then it's fine to use raw pointers.
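The ownership model described above can be sketched roughly as follows. This is a simplified illustration, not the real Doris MemTracker: the class name `Tracker`, its members, and the choice of holding children as weak references are assumptions made for the example. Only the idea of a `CreateTracker()` factory that registers the new tracker with its parent comes from the description.

```cpp
#include <atomic>
#include <cstdint>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified tracker; not the real Doris MemTracker.
class Tracker {
public:
    // Factory: a tracker is always created as a shared_ptr so that it can
    // register itself with its parent.
    static std::shared_ptr<Tracker> CreateTracker(std::string label,
                                                  std::shared_ptr<Tracker> parent = nullptr) {
        std::shared_ptr<Tracker> t(new Tracker(std::move(label), std::move(parent)));
        if (t->parent_) {
            std::lock_guard<std::mutex> l(t->parent_->lock_);
            // Assumption: the parent keeps a weak reference, so there is no ownership cycle.
            t->parent_->children_.push_back(t);
        }
        return t;
    }

    // Adds `bytes` to this tracker and to every ancestor.
    void Consume(int64_t bytes) {
        for (Tracker* t = this; t != nullptr; t = t->parent_.get()) {
            t->consumption_ += bytes;
        }
    }

    int64_t consumption() const { return consumption_.load(); }
    const std::string& label() const { return label_; }

    // Children that are still alive, e.g. for rendering the tree on a web page.
    std::vector<std::shared_ptr<Tracker>> Children() {
        std::lock_guard<std::mutex> l(lock_);
        std::vector<std::shared_ptr<Tracker>> live;
        for (const std::weak_ptr<Tracker>& w : children_) {
            if (std::shared_ptr<Tracker> s = w.lock()) live.push_back(s);
        }
        return live;
    }

private:
    Tracker(std::string label, std::shared_ptr<Tracker> parent)
            : label_(std::move(label)), parent_(std::move(parent)) {}

    std::string label_;
    std::shared_ptr<Tracker> parent_;  // keeps the parent alive while the child exists
    std::atomic<int64_t> consumption_{0};
    std::mutex lock_;
    std::vector<std::weak_ptr<Tracker>> children_;
};
```

Because the child holds a shared_ptr to its parent, a web handler that walks the tree from the root never sees a dangling parent; a child that has been released simply disappears from `Children()`.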
Result may be wrong when ORC loads a negative decimal value
When loading a negative decimal that has leading zeros after the decimal point, the result is wrong.
e.g. for -0.0014, the ORC result is -14 (precision ... 0)
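To see why the leading zeros matter, here is a minimal sketch. It assumes the decimal arrives as an unscaled integer plus a scale (so -0.0014 is the pair -14, 4). The helper `format_decimal` is hypothetical and is not the code changed by this patch.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: render an unscaled decimal value that has `scale`
// fractional digits, keeping the sign and the leading fractional zeros.
std::string format_decimal(int64_t unscaled, int scale) {
    assert(scale >= 0 && scale <= 18);
    const bool negative = unscaled < 0;
    // Work on the magnitude as unsigned so that INT64_MIN does not overflow.
    const uint64_t mag = negative ? 0 - static_cast<uint64_t>(unscaled)
                                  : static_cast<uint64_t>(unscaled);
    uint64_t pow10 = 1;
    for (int i = 0; i < scale; ++i) pow10 *= 10;

    std::string result = negative ? "-" : "";
    result += std::to_string(mag / pow10);
    if (scale > 0) {
        std::string frac = std::to_string(mag % pow10);
        // Left-pad with zeros up to `scale` digits. Dropping this padding is
        // the kind of mistake that turns 0.0014 into 14.
        frac = std::string(static_cast<std::size_t>(scale) - frac.size(), '0') + frac;
        result += "." + frac;
    }
    return result;
}
```

With this helper, `format_decimal(-14, 4)` yields `-0.0014`, whereas ignoring the scale would yield `-14`.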
Main changes:
1. Fix the bug in `update_status(status)` of `PlanFragmentExecutor`.
2. When the FE Coordinator executes `execRemoteFragmentAsync()` and hits an RPC error, return a Future with an error code instead of throwing an exception.
3. Protect the `_status` in RuntimeState with a lock.
4. Move the `_runtime_profile` of RuntimeState before the `_obj_pool`, so that the profile will be
destructed after the object pool.
5. Remove the unused `ObjectPool` param from the RuntimeProfile constructor. If it is not removed,
RuntimeProfile will depend on the `_obj_pool` in RuntimeState.
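Points 3 and 4 rely on two C++ rules: a mutex serializes access to a shared field, and non-static members are destroyed in reverse order of declaration. The sketch below illustrates both. `State` and `Member` are hypothetical stand-ins, not the real RuntimeState, and keeping only the first error in `update_status` is an assumption made for the example.

```cpp
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Records its name in a log when destroyed, to make the order observable.
struct Member {
    Member(std::string n, std::vector<std::string>* l) : name(std::move(n)), log(l) {}
    ~Member() { log->push_back(name); }
    std::string name;
    std::vector<std::string>* log;
};

// Hypothetical stand-in for RuntimeState.
class State {
public:
    explicit State(std::vector<std::string>* log)
            : profile_("profile", log), pool_("pool", log) {}

    // Assumption for this sketch: the first reported error is kept.
    void update_status(const std::string& error) {
        std::lock_guard<std::mutex> l(status_lock_);
        if (status_.empty()) status_ = error;
    }

    std::string status() {
        std::lock_guard<std::mutex> l(status_lock_);
        return status_;
    }

private:
    // Declared before pool_, therefore destroyed after pool_.
    Member profile_;
    Member pool_;

    std::mutex status_lock_;
    std::string status_;  // guarded by status_lock_
};
```

Objects owned by the pool can therefore still touch the profile in their destructors, because the profile outlives the pool.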
#3479
Here I try to explain the cause of the problem and how to fix it.
**The Cause of the Problem**
Take the case in issue (#3479) as an example:
The general results are as follows:
```
GET table/_doc/_search
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
"took": 6,
"timed_out": false,
"_shards": {
……
},
"hits": {
"total": 3,
"max_score": null,
"hits": [
{
"_index": "table",
"_score": null,
"sort": [
0
]
},
{
"_index": "table",
"_score": null,
"fields": {
"k1": [
"kkk1"
]
},
"sort": [
0
]
},
{
"_index": "table",
"_score": null,
"sort": [
0
]
}
]
}
}
```
But in Doris on ES, the BE fetches data from all shards in parallel and uses `filter_path` to reduce the network cost. The process is as follows:
```
GET table/_doc/_search?preference=_shards:1&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
"hits": {
"total": 0
}
}
GET table/_doc/_search?preference=_shards:2&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
"hits": {
"total": 1
}
}
GET table/_doc/_search?preference=_shards:3&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
"hits": {
"total": 1,
"hits": [
{
"fields": {
"k1": [
"kkk1"
]
}
}
]
}
}
```
*The scan worker on BE that processes the result of shard 2 will fail.*
**The reasons are as follows:**
1. "filter_path" causes the hits.hits object to be absent.
2. In the current implementation, if there are some data rows (total > 0), the hits.hits object must be an array.
**How To Fix It**
Two methods:
1. Modify "filter_path" to contain the hits.
Pros: The fix is very simple
Cons: More network cost
2. Deal with the case where fields are missing in a batch.
Pros: No loss of performance
Cons: The code is more complex
Performance comes first, so I use Method 2.
**Design**
1. Add a variable "_doc_value_mode" to class "EsScrollParser" to indicate whether the data processed by this parser is in doc_value_mode or not.
2. "_doc_value_mode" is passed from ESScollReader <- ESScanner <- ScrollQueryBuilder::build(), which determines whether the DSL enables doc_value_mode.
3. When hits.hits of the response from ES is empty and total > 0, we know there are data rows but the corresponding fields do not exist. EsScrollParser will use "_doc_value_mode" and _total to construct _total rows whose fields are assigned 'NULL'.
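Step 3 of the design can be sketched as below. The types `EsBatch` and `Cell` and the function `parse_batch` are hypothetical: they model an already-decoded response with a single column `k1` and are not the real EsScrollParser interface. Throwing when hits.hits is missing outside doc_value_mode is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// One output cell; is_null == true stands for SQL NULL.
struct Cell {
    bool is_null;
    std::string value;
};

// Hypothetical, already-decoded view of one ES response for column k1.
struct EsBatch {
    int total;                           // hits.total
    bool has_hits;                       // false when filter_path removed hits.hits
    std::vector<std::string> k1_values;  // docvalue of k1 for each returned hit
};

std::vector<Cell> parse_batch(const EsBatch& batch, bool doc_value_mode) {
    assert(batch.total >= 0);
    std::vector<Cell> rows;
    if (batch.total == 0) return rows;  // nothing matched on this shard

    if (!batch.has_hits) {
        // total > 0 but no hits.hits: the rows exist, their fields do not.
        if (!doc_value_mode) {
            throw std::runtime_error("hits.hits is missing although total > 0");
        }
        rows.assign(static_cast<std::size_t>(batch.total), Cell{true, ""});
        return rows;
    }

    for (const std::string& v : batch.k1_values) rows.push_back(Cell{false, v});
    return rows;
}
```

Applied to the three shard responses shown earlier, this gives no rows for shard 1, one NULL row for shard 2, and one row with `kkk1` for shard 3.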
Errors detected by LSAN have been fixed by a prior patch (#3326), but
there are still some errors detected by ASAN.
This patch tries to fix these errors to make Doris BE more robust.
Then we can add a CI run in LSAN/ASAN mode to detect memory errors
as early as possible.
Implementation Notes
NodeChannel
_cur_batch -> _pending_batches: when _cur_batch is filled up, move it to _pending_batches.
add_row() just produces batches.
try_send_and_fetch_status() tries to consume one pending batch. If there is an in-flight packet, it skips sending in this round.
So we can add one sender thread to be in charge of try_send for all node channels.
IndexChannel
init() and open() stay the same.
Use for_each_node_channel() to expose the detailed changes of NodeChannel. (It's easier to read & modify.)
Sender thread
See func OlapTableSink::_send_batch_process()
Why use polling?
If we used wait/notify, a notification would be sent when a new batch is generated. We couldn't skip sending this batch, because the same batch won't be notified again. So wait/notify cannot simply avoid blocking.
So I choose polling.
It's wasteful to call try_send() continuously, but it's difficult to set a suitable polling interval. Thus, I add std::this_thread::yield() to give up the time slice and give priority to other processes/threads (if there are other processes/threads waiting in the queue).
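The polling loop can be sketched as follows. `Channel` and `send_batch_process` are simplified, hypothetical stand-ins for NodeChannel and OlapTableSink::_send_batch_process(). In particular, the "RPC" here completes instantly and the in-flight flag is cleared on the next polling round, which is an assumption made only to show the skip.

```cpp
#include <atomic>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical, simplified node channel: batches are plain ints.
class Channel {
public:
    // Producer side: only queues the batch.
    void add_batch(int batch) {
        std::lock_guard<std::mutex> l(lock_);
        pending_.push_back(batch);
    }

    // Sender side: sends at most one pending batch per call.
    void try_send() {
        std::lock_guard<std::mutex> l(lock_);
        if (in_flight_) {
            // A packet is in flight: skip this round. The sketch pretends the
            // reply arrives before the next round.
            in_flight_ = false;
            return;
        }
        if (pending_.empty()) return;
        sent_.push_back(pending_.front());
        pending_.pop_front();
        in_flight_ = true;
    }

    bool has_pending() {
        std::lock_guard<std::mutex> l(lock_);
        return !pending_.empty();
    }

    std::vector<int> sent() {
        std::lock_guard<std::mutex> l(lock_);
        return sent_;
    }

private:
    std::mutex lock_;
    std::deque<int> pending_;
    std::vector<int> sent_;
    bool in_flight_ = false;
};

// One sender loop serves all channels and never blocks on a single channel.
void send_batch_process(std::vector<Channel>& channels, const std::atomic<bool>& producer_done) {
    for (;;) {
        // Read the flag first: if it is already set, no batch can be added
        // after the scan below, so "all drained" is reliable.
        const bool done = producer_done.load();
        bool all_drained = true;
        for (Channel& c : channels) {
            c.try_send();
            if (c.has_pending()) all_drained = false;
        }
        if (done && all_drained) break;
        std::this_thread::yield();  // give up the time slice instead of sleeping
    }
}
```

In a real sink the loop would run on its own `std::thread` while `add_batch()` is called from the producer; the function is written so that it can also be called directly once the producer has finished.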
With these metrics we can observe the workload of BE, and they are also a way to check
whether there is any problem in BE, such as some container growing
too large and leading to OOM.
This patch adds the following metrics:
```
Name                               Description
rowset_count_generated_and_in_use  The total count of rowset ids generated and in use since BE last started
unused_rowsets_count               The total count of unused rowsets waiting to be GCed
broker_count                       The total count of brokers in management
data_stream_receiver_count         The total count of data stream receivers in management
fragment_endpoint_count            The total count of fragment endpoints of data streams in management, should always equal data_stream_receiver_count
active_scan_context_count          The total count of active scan contexts
plan_fragment_count                The total count of plan fragments being executed
load_channel_count                 The total count of load channels in management
result_buffer_block_count          The total count of result buffer blocks for queries, each block has a limited queue size (default 1024)
result_block_queue_count           The total count of queues for fragments, each queue has a limited size (default 20, by config::max_memory_sink_batch_count)
routine_load_task_count            The total count of routine load tasks being executed
small_file_cache_count             The total count of cached small files' digest info
stream_load_pipe_count             The total count of stream load pipes, each pipe has a limited buffer size (default 1M)
tablet_writer_count                The total count of tablet writers
brpc_endpoint_stub_count           The total count of brpc endpoints
```
Related issue: #3306
Note: this PR just removes es_scan_node_test.cpp, which is useless.
For the moment, this just adds a simple explain syntax for EsTable without translating the native predicates to ES query DSL. That translation is better finished together with moving the predicate translation from Doris BE to Doris FE; the whole work is still WIP.
Related issue: https://github.com/apache/incubator-doris/issues/3248
SQL:
```
select * from test where (k2 = 6 and k3 = 1) or (k2 = 2 and k3 =3 and k4 = 'beijing');
```
Output filter:
```
((#k2:[6 TO 6] #k3:[1 TO 1]) (#(#k2:[2 TO 2] #k3:[3 TO 3]) #k4:beijing))~1
```
SQL:
```
select * from test where (k2 = 6 or k3 = 7) or (k2 = 2 and k3 =3 and (k4 = 'beijing' or k4 = 'zhaochun'));
```
Output filter:
```
(k2:[6 TO 6] k3:[7 TO 7] (#(#k2:[2 TO 2] #k3:[3 TO 3]) #((k4:beijing k4:zhaochun)~1)))~1
```
SQL:
```
select * from test where (k2 = 6 or k3 = 7) or (k2 = 2 and abs(k3) =3 and (k4 = 'beijing' or k4 = 'zhaochun'));
```
Output filter (`abs` cannot be pushed down to ES, so Doris on ES does not process this scenario):
```
match_all
```
The timestamp value loaded from an ORC file is wrong: it has an offset compared with Hive and Spark.
Because the time zone of ORC's timestamp is stored inside ORC's stripe information, the timestamp obtained here is already an offset timestamp, so parsing the timestamp as UTC gives the actual datetime literal.
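As an illustration of "parse the timestamp as UTC", the following sketch converts seconds since the epoch into a datetime literal without applying the local time zone. The helper `format_utc` is hypothetical and is not the function changed by this patch.

```cpp
#include <cstdint>
#include <ctime>
#include <string>

// Hypothetical helper: format seconds since the epoch as a datetime literal,
// interpreting them as UTC so that no local time zone offset is added.
std::string format_utc(int64_t seconds) {
    const std::time_t t = static_cast<std::time_t>(seconds);
    // std::gmtime converts to UTC; std::localtime would apply the local offset.
    // Note: std::gmtime returns a pointer to shared static storage, so this
    // sketch is not thread-safe.
    const std::tm* utc = std::gmtime(&t);
    if (utc == nullptr) return "";
    char buf[32];
    if (std::strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S", utc) == 0) return "";
    return buf;
}
```

Using `std::localtime` in place of `std::gmtime` would shift the result by the machine's time zone offset, which is the kind of discrepancy described above.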
Pure DocValue optimization for doris-on-es
Future todo:
Today we check whether pure_docvalue is enabled for every tuple scanned. This is not reasonable; the check should be done once for the whole scan, outside the per-tuple loop. I will address this todo in the future.
Leverage gitattributes to enable automatic conversion of end-of-line to LF when
checking in. Convert the existing CRLF to LF by removing all files and
checking them out with the new .gitattributes file. Except for .gitattributes, all
files are modified only at the ends of lines.
Currently, we do not support parsing encoded/compressed columns in the file path, e.g. extracting column k1 from file path /path/to/dir/k1=1/xxx.csv
This patch makes it possible to parse columns from the file path as in Spark (Partition Discovery).
This patch parses partition columns in BrokerScanNode.java and saves the parsing result of each file path as a property of TBrokerRangeDesc, so that the broker reader of BE can read the value of the specified partition column.
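The path convention can be illustrated with a short sketch. The actual parsing lives in BrokerScanNode.java; the C++ helper `parse_columns_from_path` below is hypothetical and only shows the `key=value` directory convention, assuming that the file name itself is never treated as a column.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: collect key=value directory components of a file path.
std::map<std::string, std::string> parse_columns_from_path(const std::string& path) {
    std::map<std::string, std::string> columns;
    const std::size_t last_slash = path.find_last_of('/');
    if (last_slash == std::string::npos) return columns;  // no directory part

    // Only directory components are inspected; the file name is skipped.
    std::istringstream dirs(path.substr(0, last_slash));
    std::string part;
    while (std::getline(dirs, part, '/')) {
        const std::size_t eq = part.find('=');
        if (eq == std::string::npos || eq == 0) continue;  // not a key=value component
        columns[part.substr(0, eq)] = part.substr(eq + 1);
    }
    return columns;
}
```

For `/path/to/dir/k1=1/xxx.csv` this returns a single entry mapping `k1` to `1`.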
Use the same UUID as query ID and load ID of a load execution plan.
Each load execution plan has a load ID, and as a plan, it also has a query ID.
We can use the same UUID as query ID and load ID, to trace the load process more easily.
Change the load ID when retrying a load execution plan.
When a load execution plan is retried, the load ID should be changed, otherwise BE cannot
distinguish the old and new load requests.
Cancel the running loading task when cancelling the broker load.
When a user cancels a broker load, the running loading task should also be cancelled, or
it may occupy the worker thread for a long time.
Remove the unnecessary query report when running a load execution plan.
Only the last query report is needed.
Add a new BE config tablet_writer_rpc_timeout_sec.
It is used for the RPC of the tablet sink. The default is 600 seconds, which is long enough for flushing
about 6GB of data. The long timeout will reduce the possibility of encountering the "fail to send batch" error when loading.
Use streaming_load_max_mb instead of mini_load_max_mb in BE config.
Add more logs to trace a broker load process easily.
NOTE: This patch modifies all Backends' data,
and this will make restarting BE take a very long time.
So to avoid interfering with your production environment,
you should upgrade the backends one by one.
1. Refactor BE to clarify the structure of the code.
2. Use a unique id to identify a rowset.
Naming a rowset with tablet_id and version leads to
many conflicts among compaction, clone, and restore.
3. Extract a rowset interface to encapsulate rowsets
with different formats.
* Add UserFunctionCache to cache UDF's library
This patch replaces LibCache with UserFunctionCache. LibCache uses an HDFS
URL to identify a UDF's library, and when the BE process restarts, all
downloaded libraries have to be downloaded again. We now use the function id
to identify a library, so that when the process restarts, all downloaded
libraries can be loaded without another download.
* update
* Reduce UT binary size
Almost every module depends on ExecEnv, and ExecEnv contains all the
singletons, which makes each UT binary contain all object files.
This patch separates ExecEnv's initialization and destruction into another file to
avoid dependence from other files. Also, status.cc includes debug_util.h, which
depends on tuple.h and tuple_row.h, so I move get_stack_trace() to
stack_util.cpp to reduce status.cc's dependence.
I add USE_RTTI=1 to build rocksdb to avoid linking librocksdb.a
Issue: #292
* Update