To avoid showing too many memtracker on BE web pages.
The MemTracker level now has 3 levels: OVERVIEW, TASK and VERBOSE.
OVERVIEW Mainly used for main memory consumption module such as Query/Load/Metadata.
TASK is mainly used to record the memory overhead of a single task such as a single query, load, and compaction task.
VERBOSE is used for other more detailed memtrackers.
1 Make some MemTracker have reasonable parent MemTracker not the root tracker
2 Make each MemTracker can be easily to trace.
3 Add show level of MemTracker to reduce the MemTracker show in the web page to have a way to control show how many tracker in web page.
1. The MaterializedViewSelector should be reset for each scan node
2. On the BE side, columns with delete conditions must be added to the return column.
1. Add BlockColumnPredicate support OR and AND column predicate in RowBlockV2
2. Support evaluate vectorization delete predicate in storage engine not in Reader in SegmentV2
At present, the application of vlog in the code is quite confusing.
It is inherited from impala VLOG_XX format, and there is also VLOG(number) format.
VLOG(number) format does not have a unified specification, so this pr standardizes the use of VLOG
Scanner threads may be running and using the member vars of OlapScanNode,
when the OlapScanNode has already destroyed.
We can use `_running_thread` to be the last accessed member variable.
And `transfer_thread` need to wait for `_running_thread==0`.
After `transfer_thread` joined, `OlapScanNode::close()` can continue.
In the previous implementation, whether a subtask is in commit or abort state,
we will try to update the job progress, such as the consumed offset of kafka.
Under normal circumstances, the aborted transaction does not consume any data,
and all progress is 0, so even we update the progress, the progress will remain
unchanged.
However, in the case of high cluster load, the subtask may fail half of the execution on the BE side.
At this time, although the task is aborted, part of the progress is updated.
Cause the next subtask to skip these data for consumption, resulting in data loss.
Introduced by PR #5051.
As @liutang123 said, when PlanFragmentExecutor is destructed, it will call
`close -> ExecNode::close -> OlapScanNode::close`. OlapScanNode will wait for `_transfer_thread`.
`_transfer_thread` will wait for all OlapScanner processing to complete.
OlapScanner is processed by the scanner thread. When the last scanner processing is completed,
`_transfer_thread` will break out of the loop, and PlanFragmentExecutor will continue to destruct.
And if it is completed, its RuntimeProfile::Counter will also be destructed.
At this time, the ScopedTimer in the Scan thread may still use this Counter when it is destructed.
So we must make sure that the timer is deconstructed before deconstructing the runtime profile.
A large number of small segment files will lead to low efficiency for scan operations.
Multiple small files can be merged into a large file by compaction operation.
So we could take the tablet scan frequency into consideration when selecting an tablet for compaction
and preferentially do compaction for those tablets which are scanned frequently during a
latest period of time at the present.
Using the compaction strategy of Kudu for reference, scan frequency can be calculated
for tablet during a latest period of time and be taken into consideration when calculating compaction score.
mainly includes:
- `OLAP_SCAN_NODE` profile layering: `OLAP_SCAN_NODE`,`OlapScanner`, and `SegmentIterator`.
- Delete meaningless statistical values. mainly in scan_node.cpp.
- Increase `RowsConditionsFiltered` statistical, split from `RowsDelFiltered`, the meaning is the number of rows filtered by various column indexes, only in segment V2.
- Modify the document based on the above, and enhance readability.
Sometimes we want to detect the hotspot of a cluster, for example, hot scanned tablet, hot wrote tablet,
but we have no insight about tablets in the cluster.
This patch introduce tablet level metrics to help to achieve this object, now support 4 metrics on tablets: `query_scan_bytes `, `query_scan_rows `, `flush_bytes `, `flush_count `.
However, one BE may holds hundreds of thousands of tablets, so I add a parameter for the metrics HTTP request,
and not return tablet level metrics by default.
Sometimes we want to detect the hotspot of a cluster, for example, hot scanned tablet, hot wrote tablet,
but we have no insight about tablets in the cluster.
This patch introduce tablet level metrics to help to achieve this object, now support 4 metrics on tablets: `query_scan_bytes `, `query_scan_rows `, `flush_bytes `, `flush_count `.
However, one BE may holds hundreds of thousands of tablets, so I add a parameter for the metrics HTTP request,
and not return tablet level metrics by default.
We make all MemTrackers shared, in order to show MemTracker real-time consumptions on the web.
As follows:
1. nearly all MemTracker raw ptr -> shared_ptr
2. Use CreateTracker() to create new MemTracker(in order to add itself to its parent)
3. RowBatch & MemPool still use raw ptrs of MemTracker, it's easy to ensure RowBatch & MemPool destructor exec
before MemTracker's destructor. So we don't change these code.
4. MemTracker can use RuntimeProfile's counter to calc consumption. So RuntimeProfile's counter need to be shared
too. We add a shared counter pool to store the shared counter, don't change other counters of RuntimeProfile.
Note that, this PR doesn't change the MemTracker tree structure. So there still have some orphan trackers, e.g. RowBlockV2's MemTracker. If you find some shared MemTrackers are little memory consumption & too time-consuming, you could make them be the orphan, then it's fine to use the raw ptr.
1. when read column data page:
for compaction, schema_change, check_sum: we don't use page cache
for query and config::disable_storage_page_cache is false, we use page cache
2. when read column index page
if config::disable_storage_page_cache is false, we use page cache
[STORAGE][SEGMENTV2]
use block split bloom filter
build bloom filter against data page
add distinct value to bloom filter
add ordinal index to bloom filter index
Add a new type: Object. Currently, it's mainly for complex aggregate metrics(HLL , Bitmap).
The Object type has the following constraints:
1 Object type could not as key column type
2 Object type doesn't support all indices (BloomFilter, short key, zone map, invert index)
3 Object type doesn't support filter and group by
In the implementation:
The Object type reuse the StringValue and StringVal, because in storage engine, the Object type is binary, it has a pointer and length.
Currently, we have Field and ColumnSchema to access column data in a
row. These two classes are mostly the same. So we should unify these to
one class. Now, Field has offset information, which is an row attribute,
so we remove offset in Field.
RowCursor now has some logic which belong to Schema, so in this patch I
add Schema attribute to RowCursor to make RowCursor simple. After this
change, only Schema will handle Field/ColumnSchema.
I extract some logic from RowCursor to be/src/olap/row.h, then we can
use same logic to handle different types of row. Each type of row has
same function that to get Cell of this row. A cell represent a column
content with a null indicator.
NOTE: This patch would modify all Backend's data.
And this will cause a very long time to restart be.
So if you want to interferer your product environment,
you should upgrade backend one by one.
1. Refactoring be is to clarify the structure the codes.
2. Use unique id to indicate a rowset.
Nameing rowset with tablet_id and version will lead to
many conflicts among compaction, clone, restore.
3. Extract an rowset interface to encapsulate rowsets
with different format.
Before RowBlock will reserve memory for all columns in schema, even if
it is not queried. Which will cause bad performance when quering wide
table.
In this patch, RowBlock will reserve memory for needed columns. In a
case, this reduce ConvertBatchTime from 10s to 60ms when quering a wide
table who has 178 columns.
#1236