Ref https://github.com/apache/incubator-doris/issues/3566
Introduce trace utility from Kudu to BE. This utility has been widely used in Kudu,
Impala also import this trace utility.
This trace util is used for tracing each phases in a thread, and can be dumped to
string to see each phases' time cost and diagnose which phase cost more time.
This util store a Trace object as a threadlocal variable, we can add trace entries
which record the current file name, line number, user specified symbols and
timestamp to this object, and it's able to add some counters to this Trace
object. And then, it can be dumped to human readable string.
There are some helpful macros defined in trace.h, here is a simple example for
usage:
```
scoped_refptr<Trace> t1(new Trace); // New 2 traces
scoped_refptr<Trace> t2(new Trace);
t1->AddChildTrace("child_trace", t2.get()); // t1 add t2 as a child named "child_trace"
TRACE_TO(t1, "step $0", 1); // Explicitly trace to t1
usleep(10);
// ... do some work
ADOPT_TRACE(t1.get()); // Explicitly adopt to trace to t1
TRACE("step $0", 2); // Implicitly trace to t1
{
// The time spent in this scope is added to counter t1.scope_time_cost
TRACE_COUNTER_SCOPE_LATENCY_US("scope_time_cost");
ADOPT_TRACE(t2.get()); // Adopt to trace to t2 for the duration of the current scope
TRACE("sub start"); // Implicitly trace to t2
usleep(10);
// ... do some work
TRACE("sub before loop");
for (int i = 0; i < 10; ++i) {
TRACE_COUNTER_INCREMENT("iterate_count", 1); // Increase counter t2.iterate_count
MicrosecondsInt64 start_time = GetMonoTimeMicros();
usleep(10);
// ... do some work
MicrosecondsInt64 end_time = GetMonoTimeMicros();
int64_t dur = end_time - start_time;
// t2's simple histogram metric with name prefixed with "lbm_writes"
const char* counter = BUCKETED_COUNTER_NAME("lbm_writes", dur);
TRACE_COUNTER_INCREMENT(counter, 1);
}
TRACE("sub after loop");
}
TRACE("goodbye $0", "cruel world"); // Automatically restore to trace to t1
std::cout << t1->DumpToString(Trace::INCLUDE_ALL) << std::endl;
```
output looks like:
```
0514 02:16:07.988054 (+ 0us) trace_test.cpp:76] step 1
0514 02:16:07.988112 (+ 58us) trace_test.cpp:80] step 2
0514 02:16:07.988863 (+ 751us) trace_test.cpp:103] goodbye cruel world
Related trace 'child_trace':
0514 02:16:07.988120 (+ 0us) trace_test.cpp:85] sub start
0514 02:16:07.988188 (+ 68us) trace_test.cpp:88] sub before loop
0514 02:16:07.988850 (+ 662us) trace_test.cpp:101] sub after loop
Metrics: {"scope_time_cost":744,"child_traces":[["child_trace",{"iterate_count":10,"lbm_writes_lt_1ms":10}]]}
```
Exclude the original source code, this patch
do the following work to adapt to Doris:
- Rename "kudu" namespace to "doris"
- Update some names to the existing function names in Doris, i.g. strings::internal::SubstituteArg::kNoArg -> strings::internal::SubstituteArg::NoArg
- Use doris::SpinLock instead of kudu::simple_spinlock which hasn't been imported
- Use manual malloc() and free() instead of kudu::Arena which hasn't been imported
- Use manual rapidjson::Writer instead of kudu::JsonWriter which hasn't been imported
- Remove all TRACE_EVENT related unit tests since TRACE_EVENT is not imported this time
- Update CMakeLists.txt
1. Delete Invalid Counter In Data_Stream_Sender. (#3598)
2. Add Counter For PartitionHashTable of PartitionAggregationNode:
* Hash Probe Method
* Row processed by Aggregation
* HashFilledBuckets: Counter How Many FilledBuckets in Aggragation
* HTResize: Counter How Many Resize of HashTable
* HashProbe: Counter Probe of HashTable
* HashFailedProbe: Counter Failed Probe of HashTable
* HashTravelLength: Total TravelLength for Probe
* HashCollisions: Counter of HashCollision
3. Del some unecessary code in PartitionHashTable by template
#3479
Here I try to explain the cause of the problem and how to fix it.
**The Cause of The problem**
Take the case in issue(#3479 ) as an example:
The general results are as follows:
```
GET table/_doc/_search
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
"took": 6,
"timed_out": false,
"_shards": {
……
},
"hits": {
"total": 3,
"max_score": null,
"hits": [
{
"_index": "table",
"_score": null,
"sort": [
0
]
},
{
"_index": "table",
"_score": null,
"fields": {
"k1": [
"kkk1"
]
},
"sort": [
0
]
},
{
"_index": "table",
"_score": null,
"sort": [
0
]
}
]
}
}
```
But in Doris on ES,Be fetched data parallelly on all shards, and use `filter_path` to reduce the network cost. The process will be as follows:
```
GET table/_doc/_search?preference=_shards:1&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
"hits": {
"total": 0
}
}
GET table/_doc/_search?preference=_shards:2&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
"hits": {
"total": 1
}
}
GET table/_doc/_search?preference=_shards:3&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
"hits": {
"total": 1,
"hits": [
{
"fields": {
"k1": [
"kkk1"
]
}
}
]
}
}
```
*Scan-Worker On BE which processed result of shard2 will failed.*
**The reasons are as follows:**
1. "filter_path" causes the hits.hits object not exist.
2. In the current implementation, if there are some data rows(total > 0), the hits.hits. object must be an array
**How To Fix it**
Two Method:
1. modify "filter_path" to contain the hits.
Pros: Fixed Code is very simple
Cons: More network cost
2. Deal with the case where fields are missing in a batch.
Pros: No loss of performance
Cons: Code is more complex
Performance first, I use Method2.
**Design**
1. Add a variable "_doc_value_mode" into Class "EsScrollParser" to =indicate whether the data processed by this parser is doc_value_mode or not.
2. "_doc_value_mode" is passed from ESScollReader <- ESScanner <- ScrollQueryBuilder::build() that determines whether DSL is enable doc_value_mode
3. When hits.hits of response from ES is empty and total > 0. We know there are data lines, but the corresponding fields do not exist. EsScrollParser will use "_doc_value_mode" and _total to construct _total lines which fields are assigned with 'NULL'
LSAN detected errors have been fixed by a prior pathch (#3326), but
there are still some ASAN detected errors.
This patch try to fix these errors to make Doris BE more robustness.
And then we can add CI run in LSAN/ASAN mode to detect memory errors
as early as possible.
Each test case in ExternalScanContextMgrTest may cost 1 minitue
which is too long, we'd better disable backgrounp scan context
gc to speed up unit test.
When BE sets `ignore_broken_disk` to true, it's expected that non-exist path in storage_root_path won't prevent BE from launching, but in 0.12 BE fails to launch in such scenario.
```
W0506 14:46:11.039953 17040 options.cpp:64] path can not be canonicalized. may be not exist. path=/data11/olap
W0506 14:46:11.040014 17040 options.cpp:141] failed to parse store path /data11/olap, res=-203
```
The reason is that #2861 adds a path existence check in `parse_root_path` which precedes the usage of `ignore_broken_disk` in the main method.
ImplementaItion Notes
NodeChannel
_cur_batch -> _pending_batches: when _cur_batch is filled up, move it to _pending_batches.
add_row() just produce batches.
try_send_and_fetch_status() tries to consume one pending batch. If has in flight packet, skip send in this round.
So we can add one sender thread to be in charge of all node channels try_send.
IndexChannel
init(), open() stay the same.
Use for_each_node_channel() to expose the detailed changes of NodeChannel.(It's more easy to read & modify)
Sender thread
See func OlapTableSink::_send_batch_process()
Why use polling?
If we use wait/notify, it will notify when generate a new batch. We can't skip sending this batch, coz it won't notify the same batch again. So wait/notify can't avoid blocking simply.
So I choose polling.
It's wasting to continuously try_send(), but it's difficult to set the suitable polling interval. Thus, I add std::this_thread::yield() to give up the time slice, give priority to other process/threads (if there are other process/threads waiting in the queue).
RandomAccessFileOptions, WritableFileOptions, RandomRWFileOptions
defined as a struct but previously declared as a class; this is valid,
but will result in compile warning or error under clang compiler
1. reserve `SegmentWriter::_column_writers` before writing it
2. remove some condition branchs in SegmentWriter::init
3. fix hard-coded library names in build-thirdpary.sh
The resource pool in runtime state will be automatically unregistered
when deconstructing the RuntimeState. So no need to unregister it when
closing the plan fragment executor.
* [Fix] Fix bug that rowset meta is deleted after compaction
After compaction, the tablet rowset meta will be modified by
adding to new output rowsets and deleting the old input rowsets.
The output version may equals to the input version.
So we should delete the "input" version from _rs_version_map
before adding the "output" version to _rs_version_map. Otherwise,
the new "output" version will be lost in _rs_version_map.
Adding a new storage engine, we need to make an extensible tablet interface, so olap/StorageEngine can support and manage new tablet types.
To start, this commit creates a class BaseTablet and make Tablet and new MemTablet inherit
this base class, some common fields & methods are moved to BaseTablet class, which fields
and methods belong to base/old class is not finalized yet, it will change as the project evolves.
Fix#3384
This CL mainly made the following modifications:
1. Delete Invalid MemoryUsed Counter and Add PeakMemUsage in each exec node and datastreamsender
2. Add intent in child execnode profile,make it is easily to know the relationship between execnode
3. Del _is_result_order we not support any more in olap_scan_node.h and olap_scan_node.cpp
4. Add scan_disk method to olap_scanner to fix the counter _num_disks_accessed_counter
5. Now we do not use buffer pool to read and write disk, so annotation eadio counter and
6. Delete the MemUsed counter in exec node.
The child.open() function is not called before this commit.
If the assert num rows node has child which process data in open function, the assert num rows node will fetch no data from child. So the result will be empty(incorrect).
This error only appear in inner subquery which has a aggregation function.
For example:
`select * from table where k1=(select k1 from (select avg(k1) from table) a);`
The first level of subquery returns a non-scalar value, so the assert num rows node is needed.
The second level of subquery has a aggregation function, so the child of assert node is aggregate node.
However, if the open stage of the aggregate node is not called, the get next state of aggregate node will return empty set.
So the result is wrong.
Fixed#3435.
Fix#3390
This CL add more info in `JobDetails` column of `SHOW LOAD` result for Broker Load Job.
For example:
```
{
"Unfinished backends": {
"9c3441027ff948a0-8287923329a2b6a7": [10002]
},
"All backends": {
"9c3441027ff948a0-8287923329a2b6a7": [10002, 10004, 10006]
},
"ScannedRows": 2390016,
"TaskNumber": 1,
"FileNumber": 1,
"FileSize": 1073741824
}
```
2 newly added keys:
`Unfinished backends` indicates the BE which task on them are not finished.
`All backends` indicates the BE which this job has tasks on it.
One more thing, I pass the Backend Id along with the heartbeat msg from FE to BE, so that BE can
know the Id of themselves.
We can observe the workload of BE, and also it's a way to check
whether there is any problem in BE, like some container increase
too large and lead to OOM.
This patch add the following metrics:
```
Name Description
rowset_count_generated_and_in_use The total count of rowset id generated and in use since BE last start
unused_rowsets_count The total count of unused rowset waiting to be GC
broker_count The total count of brokers in management
data_stream_receiver_count The total count of data stream receivers in management
fragment_endpoint_count The total count of fragment endpoints of data stream in management, should always equal to data_stream_receiver_count
active_scan_context_count The total count of active scan contexts
plan_fragment_count The total count of plan fragments in executing
load_channel_count The total count of load channels in management
result_buffer_block_count The total count of result buffer blocks for queries, each block has a limited queue size (default 1024)
result_block_queue_count The total count of queues for fragments, each queue has a limited size (default 20, by config::max_memory_sink_batch_count)
routine_load_task_count The total count of routine load tasks in executing
small_file_cache_count The total count of cached small files' digest info
stream_load_pipe_count The total count of stream load pipes, each pipe has a limited buffer size (default 1M)
tablet_writer_count The total count of tablet writers
brpc_endpoint_stub_count The total count of brpc endpoints
```
There is no functional changes in this patch.
Key refactor points are:
- Remove meaningless return value of functions in class Tablet, and
also some related functions in other classes
- Allow RowsetGraph::capture_consistent_versions to pass a nullptr
to the output parameter
- Use CHECK instead of LOG(FATAL) to simplify code
This CL mainly made the following modifications:
1. Delete Invalid method in Running Profile Class.
2. Move Memlimit Counter from blockmgr to fragment and add PeakMemUsage Counter
3. Fix the bug of buffer pool memlimit counter
4. Call compute_time_in_profile() before pretty_print() to show the _local_time_percent without child running profile
5. Add TransferThread ThreadToken count in AveThreadToken Counter
Process castexpr, such as: k (float) > 2.0, k(int) > 3.2, Doris On Es should ignore this doris native cast transformation for every row's col value, we push down this `cast semantic` to Elasticsearch.
I believe in this `predicate` situation, would decrease the mount of data for transmission。
k1 is float:
````
k1 >= 5
````
push-down filter:
```
{"range":{"k1":{"gte":"5.000000"}}}
```
k2 is int :
```
k2 > 3.2
```
push-down filter:
```
{"range":{"k2":{"gte":"3.2"}}}
```
related issue: #3306
Note: this PR just remove the es_scan_node_test.cpp which is useless
For the moment, just add a simple explain syntax for EsTable without translating the native predicates to ES queryDSL which is better to finished with moving the predicate translating from Doris BE to Doris FE, the whole work is still WIP.