This PR is the first step to make Doris stream load more robust with higher concurrent
performance (#3368). The main work is to support transaction management with database-level isolation
and to use an ArrayDeque to store txns in final status.
#3479
Here I try to explain the cause of the problem and how to fix it.
**The Cause of the Problem**
Take the case in issue #3479 as an example:
The general results are as follows:
```
GET table/_doc/_search
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
  "took": 6,
  "timed_out": false,
  "_shards": {
    ……
  },
  "hits": {
    "total": 3,
    "max_score": null,
    "hits": [
      {
        "_index": "table",
        "_score": null,
        "sort": [
          0
        ]
      },
      {
        "_index": "table",
        "_score": null,
        "fields": {
          "k1": [
            "kkk1"
          ]
        },
        "sort": [
          0
        ]
      },
      {
        "_index": "table",
        "_score": null,
        "sort": [
          0
        ]
      }
    ]
  }
}
```
But in Doris on ES, the BE fetches data from all shards in parallel and uses `filter_path` to reduce network cost. The process is as follows:
```
GET table/_doc/_search?preference=_shards:1&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
  "hits": {
    "total": 0
  }
}
GET table/_doc/_search?preference=_shards:2&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
  "hits": {
    "total": 1
  }
}
GET table/_doc/_search?preference=_shards:3&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
  "hits": {
    "total": 1,
    "hits": [
      {
        "fields": {
          "k1": [
            "kkk1"
          ]
        }
      }
    ]
  }
}
```
*The scan worker on BE that processes the result of shard 2 will fail.*
**The reasons are as follows:**
1. `filter_path` causes the hits.hits object to be absent from the response.
2. In the current implementation, if there are data rows (total > 0), the hits.hits object must be an array.
**How To Fix It**
Two methods:
1. Modify `filter_path` so that it keeps the hits objects.
Pros: the fix is very simple.
Cons: more network cost.
2. Handle the case where fields are missing in a batch.
Pros: no loss of performance.
Cons: the code is more complex.
Since performance comes first, Method 2 is used.
**Design**
1. Add a variable "_doc_value_mode" to class "EsScrollParser" to indicate whether the data processed by this parser is in doc_value mode or not.
2. "_doc_value_mode" is passed from ESScollReader <- ESScanner <- ScrollQueryBuilder::build(), which determines whether the DSL enables doc_value mode.
3. When hits.hits in the response from ES is empty and total > 0, we know there are data rows but the corresponding fields do not exist. EsScrollParser will use "_doc_value_mode" and _total to construct _total rows whose fields are set to NULL (see the sketch below).
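A minimal sketch of step 3, with types and names simplified (this is not the actual EsScrollParser API): when the parser runs in doc_value mode and the response reports total > 0 but carries no hits.hits array, it emits `total` rows whose requested fields are all NULL.
```
#include <string>
#include <vector>

struct Row {
    // One slot per requested field; "NULL" marks a missing doc_value.
    std::vector<std::string> fields;
};

// Build `total` placeholder rows when filter_path stripped hits.hits
// even though the shard reported matching documents.
std::vector<Row> fill_missing_docvalue_rows(bool doc_value_mode, int total,
                                            size_t num_fields,
                                            bool has_hits_array) {
    std::vector<Row> rows;
    if (doc_value_mode && !has_hits_array && total > 0) {
        rows.assign(total, Row{std::vector<std::string>(num_fields, "NULL")});
    }
    return rows;
}
```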
LSAN-detected errors have been fixed by a prior patch (#3326), but
there are still some ASAN-detected errors.
This patch tries to fix these errors to make Doris BE more robust.
Then we can add a CI run in LSAN/ASAN mode to detect memory errors
as early as possible.
Each test case in ExternalScanContextMgrTest may cost 1 minute,
which is too long, so we'd better disable the background scan context
gc to speed up the unit test.
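One way to do this, as a hypothetical sketch (the constructor flag and names are illustrative, not the real ExternalScanContextMgr API): let tests construct the manager without starting the background gc thread.
```
#include <thread>

class ExternalScanContextMgr {
public:
    // Hypothetical flag: unit tests pass false so no background gc thread
    // is started and each test case returns immediately.
    explicit ExternalScanContextMgr(bool start_background_gc = true) {
        if (start_background_gc) {
            _gc_thread = std::thread([this] { gc_expired_context(); });
        }
    }
    ~ExternalScanContextMgr() {
        if (_gc_thread.joinable()) {
            _gc_thread.join();
        }
    }

private:
    void gc_expired_context() {
        // The real implementation would loop and reclaim expired scan
        // contexts periodically; omitted here.
    }
    std::thread _gc_thread;
};

// In the unit test:
//   ExternalScanContextMgr mgr(/*start_background_gc=*/false);
```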
The callback added to the CallbackFactory should not be removed until the
transaction is aborted or visible. Otherwise, some callback methods may fail
to be called.
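A minimal sketch of the intended lifecycle (names and types are illustrative): the callback stays registered through prepare/commit and is only removed once the transaction becomes visible or is aborted, so later notifications can still find it.
```
#include <cstdint>
#include <memory>
#include <mutex>
#include <unordered_map>

struct TxnCallback {
    virtual ~TxnCallback() = default;
    virtual void on_finished() {}
};

class CallbackFactory {
public:
    void add(int64_t txn_id, std::shared_ptr<TxnCallback> cb) {
        std::lock_guard<std::mutex> l(_lock);
        _callbacks[txn_id] = std::move(cb);
    }
    std::shared_ptr<TxnCallback> get(int64_t txn_id) {
        std::lock_guard<std::mutex> l(_lock);
        auto it = _callbacks.find(txn_id);
        return it == _callbacks.end() ? nullptr : it->second;
    }
    void remove(int64_t txn_id) {
        std::lock_guard<std::mutex> l(_lock);
        _callbacks.erase(txn_id);
    }

private:
    std::mutex _lock;
    std::unordered_map<int64_t, std::shared_ptr<TxnCallback>> _callbacks;
};

// Only when the txn becomes VISIBLE or ABORTED: invoke the callback, then drop it.
void on_txn_finished(CallbackFactory* factory, int64_t txn_id) {
    if (auto cb = factory->get(txn_id)) {
        cb->on_finished();
    }
    factory->remove(txn_id);
}
```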
When BE sets `ignore_broken_disk` to true, it is expected that a non-existent path in storage_root_path will not prevent BE from launching, but in 0.12 BE fails to launch in this scenario.
```
W0506 14:46:11.039953 17040 options.cpp:64] path can not be canonicalized. may be not exist. path=/data11/olap
W0506 14:46:11.040014 17040 options.cpp:141] failed to parse store path /data11/olap, res=-203
```
The reason is that #2861 adds a path existence check in `parse_root_path` which precedes the usage of `ignore_broken_disk` in the main method.
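A sketch of the intended behavior, with simplified signatures (assumed names, not the actual parse_root_path code): when `ignore_broken_disk` is set, a path that cannot be canonicalized should be tolerated instead of failing the whole parse.
```
#include <filesystem>
#include <iostream>
#include <string>

// Returns false only when the path is unusable AND broken disks are not
// ignored; with ignore_broken_disk the raw path is kept and BE can launch.
bool parse_root_path(const std::string& root_path, bool ignore_broken_disk,
                     std::string* out_path) {
    std::error_code ec;
    auto canonical = std::filesystem::canonical(root_path, ec);
    if (ec) {
        if (ignore_broken_disk) {
            std::cerr << "ignore broken disk, path=" << root_path << std::endl;
            *out_path = root_path;
            return true;
        }
        return false;  // old behavior: res=-203, BE refuses to start
    }
    *out_path = canonical.string();
    return true;
}
```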
Implementation Notes
NodeChannel
_cur_batch -> _pending_batches: when _cur_batch is filled up, it is moved into _pending_batches.
add_row() only produces batches (a sketch follows below).
try_send_and_fetch_status() tries to consume one pending batch. If there is an in-flight packet, sending is skipped in this round.
So we can add one sender thread that is in charge of try_send for all node channels.
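A minimal sketch of the producer side, with simplified types (RowBatch and the batch size here are illustrative): add_row() fills _cur_batch and, once it is full, moves it into _pending_batches for the sender thread to pick up.
```
#include <memory>
#include <mutex>
#include <queue>

struct RowBatch {
    int rows = 0;
    bool is_full() const { return rows >= 1024; }  // batch size is illustrative
    void add_row() { ++rows; }
};

class NodeChannel {
public:
    void add_row() {
        _cur_batch->add_row();
        if (_cur_batch->is_full()) {
            // Hand the full batch over to the sender thread and start a new one.
            std::lock_guard<std::mutex> l(_lock);
            _pending_batches.push(std::move(_cur_batch));
            _cur_batch = std::make_unique<RowBatch>();
        }
    }

private:
    std::unique_ptr<RowBatch> _cur_batch = std::make_unique<RowBatch>();
    std::queue<std::unique_ptr<RowBatch>> _pending_batches;
    std::mutex _lock;
};
```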
IndexChannel
init(), open() stay the same.
Use for_each_node_channel() to expose the detailed changes of NodeChannel. (This makes it easier to read & modify.)
Sender thread
See func OlapTableSink::_send_batch_process()
Why use polling?
If we used wait/notify, the notification would fire when a new batch is generated. We can't skip sending that batch, because it won't be notified for the same batch again, so wait/notify can't simply avoid blocking.
So I chose polling.
It's wasteful to call try_send() continuously, but it's difficult to choose a suitable polling interval. Thus, I add std::this_thread::yield() to give up the time slice and give priority to other processes/threads (if there are any waiting in the queue); see the sketch below.
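A minimal sketch of the polling sender loop described above (the NodeChannel interface is reduced to the one call that matters here, and its body is a stand-in):
```
#include <atomic>
#include <thread>
#include <vector>

struct NodeChannel {
    int pending = 0;
    // Returns 1 while the channel still has pending work (or skipped this
    // round because a packet is in flight), 0 once everything is sent.
    int try_send_and_fetch_status() {
        if (pending == 0) return 0;
        --pending;  // stand-in for actually sending one batch
        return 1;
    }
};

void send_batch_process(std::vector<NodeChannel*>& channels,
                        std::atomic<bool>& stop) {
    while (!stop.load()) {
        int running = 0;
        for (auto* ch : channels) {
            running += ch->try_send_and_fetch_status();
        }
        if (running == 0) {
            break;  // all node channels have finished
        }
        // Give up the time slice instead of spinning hard, so other
        // threads waiting in the run queue get priority.
        std::this_thread::yield();
    }
}
```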
RandomAccessFileOptions, WritableFileOptions, and RandomRWFileOptions are
defined as structs but were previously forward-declared as classes; this is valid,
but it produces a compile warning or error under the clang compiler.
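For illustration, the kind of mismatch clang flags (e.g. via -Wmismatched-tags); the member shown is made up:
```
// Forward declaration uses `class` ...
class RandomAccessFileOptions;

// ... but the definition uses `struct`: valid C++, yet clang warns about the
// mismatched class-key (-Wmismatched-tags), which breaks -Werror builds.
struct RandomAccessFileOptions {
    bool verify_checksum = true;  // illustrative member only
};
```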
For SQL like:
```
select * from
join1 left semi join join2
on join1.id = join2.id and join2.id > 1;
```
the predicate `join2.id > 1` cannot be pushed down to table join1.
1. Reserve `SegmentWriter::_column_writers` before writing to it.
2. Remove some condition branches in SegmentWriter::init.
3. Fix hard-coded library names in build-thirdparty.sh.
Add a new config `drop_backend_after_decommission` in FE. If this config
is false, the BE will not be dropped after finishing the decommission operation.
This new config tries to solve the problem described in ISSUE #3460.
TODO:
This method will generate a lot of data migration, so it is only a temporary solution.
After that, we should try to solve the problem of data balancing within the BE.
This CL also adds documents for the FE and BE configuration.
These documents are incomplete and can be completed later.
When there is a subquery in the where clause, the query will be rewritten to a join operation,
and some auxiliary binary predicates will be generated. These binary predicates
will not go through the ExprRewriteRule, so they are not normalized into the
"column on the left and constant on the right" format.
We need to take this case into account so that the `canPushDownPredicate()` judgement
will not throw an exception.
The current implementation is very simple and conservative, because our query planner is error-prone.
After we implement the new query planner, we can do this work with a `Predicate Equivalence Class` and a `PredicatePushDown` rule like Presto.
The resource pool in the runtime state will be automatically unregistered
when the RuntimeState is destructed. So there is no need to unregister it when
closing the plan fragment executor.
* [Fix] Fix bug that rowset meta is deleted after compaction
After compaction, the tablet rowset meta will be modified by
adding the new output rowsets and deleting the old input rowsets.
The output version may equal the input version.
So we should delete the "input" version from _rs_version_map
before adding the "output" version to _rs_version_map. Otherwise,
the new "output" version will be lost from _rs_version_map.