#5146
Add histogram metrics to util/metrics.h. The histogram data structure is implemented in util/histogram.h,
which can also be used anywhere else a histogram is needed. Unit tests are added as well.
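As a rough illustration of this kind of data structure, here is a minimal fixed-bucket histogram. It is a sketch, not the contents of util/histogram.h. The class name `SimpleHistogram`, the bucket bounds and the "upper bound of the bucket" percentile estimate are all assumptions.

```python
import bisect


class SimpleHistogram:
    """Hypothetical fixed-bucket histogram (not Doris's util/histogram.h)."""

    def __init__(self, bounds):
        # bounds[i] is the inclusive upper limit of bucket i; one extra
        # overflow bucket collects values above the last bound.
        self.bounds = sorted(bounds)
        self.buckets = [0] * (len(self.bounds) + 1)
        self.count = 0
        self.total = 0.0

    def add(self, value):
        # bisect_left returns the index of the first bound >= value.
        self.buckets[bisect.bisect_left(self.bounds, value)] += 1
        self.count += 1
        self.total += value

    def mean(self):
        return self.total / self.count if self.count else 0.0

    def percentile(self, p):
        # Estimate: upper bound of the bucket holding the p-th percentile.
        if self.count == 0:
            return 0.0
        threshold = self.count * p / 100.0
        seen = 0
        for i, n in enumerate(self.buckets):
            seen += n
            if seen >= threshold:
                return self.bounds[i] if i < len(self.bounds) else float("inf")
        return float("inf")
```

Fixed buckets keep `add` cheap and the memory bounded, at the cost of percentiles that are only as precise as the bucket layout.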
Fix #5138
1. Fix a bug when creating a colocate table with an empty partition.
2. Move `groupName2Id.put(fullGroupName, groupId)` to the end to avoid an inconsistent state when an exception is thrown.
3. Do not check whether backendsPerBucketSeq is empty in replayAddTableToGroup(),
because backendsPerBucketSeq can be empty for a colocate table with an empty partition.
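Item 2 follows a general pattern: do all work that can throw first, and publish the name-to-id mapping last. A minimal sketch, where `ColocateIndex`, `create_group` and `alloc_backends` are hypothetical stand-ins for the FE code:

```python
class ColocateIndex:
    """Hypothetical stand-in for the FE colocate index."""

    def __init__(self):
        self.group_name_to_id = {}
        self.backends_per_bucket_seq = {}

    def create_group(self, full_group_name, group_id, alloc_backends):
        # Step 1: work that may raise; no shared state is touched yet.
        backends = alloc_backends()
        # Step 2: dependent state (may legitimately be empty, e.g. for a
        # colocate table with an empty partition).
        self.backends_per_bucket_seq[group_id] = backends
        # Step 3: publish the name -> id mapping last, so a failure above
        # never leaves a name pointing at a half-initialized group.
        self.group_name_to_id[full_group_name] = group_id
```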
Scanner threads may still be running and using the member variables of OlapScanNode
after the OlapScanNode has already been destroyed.
We can make `_running_thread` the last member variable that a scanner thread accesses,
and have `transfer_thread` wait until `_running_thread == 0`.
Only after `transfer_thread` has been joined can `OlapScanNode::close()` continue.
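A sketch of this shutdown order using a condition variable. `ScanNodeSketch` and its methods are hypothetical; Python's garbage collector rules out a real use-after-free, so the sketch only demonstrates the ordering: each scanner touches `_running_thread` last, and close waits for the transfer thread, which waits for the count to reach zero.

```python
import threading


class ScanNodeSketch:
    """Hypothetical model of OlapScanNode's scanner/transfer shutdown."""

    def __init__(self, num_scanners):
        self._cond = threading.Condition()
        self._running_thread = num_scanners
        self.results = []

    def _scanner(self, i):
        # Use member variables freely while scanning ...
        with self._cond:
            self.results.append(i)
        # ... and touch _running_thread last: after this decrement the
        # scanner must not access the node again.
        with self._cond:
            self._running_thread -= 1
            self._cond.notify_all()

    def _transfer(self):
        # The transfer thread waits until no scanner is running.
        with self._cond:
            self._cond.wait_for(lambda: self._running_thread == 0)

    def run_and_close(self):
        scanners = [threading.Thread(target=self._scanner, args=(i,))
                    for i in range(self._running_thread)]
        transfer = threading.Thread(target=self._transfer)
        for t in scanners:
            t.start()
        transfer.start()
        # close() continues only after the transfer thread has joined.
        transfer.join()
        return sorted(self.results)
```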
Bucket shuffle join is an algorithm for joining two tables. The left table is distributed by a column,
and the right table sends its data to the left table's buckets for the join.
This reduces the network cost. However, when both tables contain no data, bucket shuffle join fails.
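A sketch of how right-table rows are routed to the hosts owning the left table's buckets. The function name, the `key % bucket_num` stand-in for Doris's real hash function, and the treatment of an empty `bucket_to_host` map as "no data" are assumptions; this is not necessarily how the patch fixes the empty case.

```python
def route_right_rows(rows, key_of, bucket_num, bucket_to_host):
    """Assign each right-table row to the host owning its left-table bucket.

    Hypothetical sketch: key_of(row) % bucket_num stands in for the real
    hash of the distribution column.
    """
    plan = {}
    if not bucket_to_host:
        # In this sketch an empty map means the tables hold no data:
        # return an empty plan instead of failing.
        return plan
    for row in rows:
        bucket = key_of(row) % bucket_num
        host = bucket_to_host[bucket]
        plan.setdefault(host, []).append(row)
    return plan
```

Because every right row is sent only to the single host that holds its matching bucket, the right table does not have to be broadcast, which is where the network saving comes from.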
Related Issue: #5144
When two colocate tables are joined, the tablets belonging to the same bucket sequence are assigned
to the same host so that the join can be executed locally.
Previously, the host for each bucket sequence was chosen at random.
A random strategy cannot balance the load of a single query across hosts.
Therefore, this patch uses a round-robin strategy so that buckets are distributed evenly.
For example, if there are 6 bucket sequences and 3 hosts,
each host should be assigned 2 bucket sequences.
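A minimal round-robin sketch. The input format (a list of candidate hosts per bucket sequence) and the function name are assumptions:

```python
def assign_round_robin(bucket_candidates):
    """Pick one host per bucket sequence in round-robin order.

    bucket_candidates[i] lists the hosts holding a replica of bucket
    sequence i (hypothetical input format). Bucket i takes the candidate
    at position i % len(candidates), so consecutive buckets rotate across
    hosts instead of being placed at random.
    """
    assignment = []
    for i, candidates in enumerate(bucket_candidates):
        assignment.append(candidates[i % len(candidates)])
    return assignment
```

With 6 bucket sequences whose replicas all live on the same 3 hosts, this yields 2 buckets per host. If buckets list their hosts in different orders, plain index rotation may not stay perfectly even; that is a limitation of the sketch, not a claim about the patch.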
The rebalancer type can be configured via Config.rebalancer_type (BeLoad or Partition).
PartitionRebalancer is based on TwoDimensionalGreedyAlgo.
In Doris, the two dimensions are cluster and partition. Only the replica count is considered,
not the replica size.
See #4845 for further details.
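A simplified single greedy step over both dimensions, using replica counts only. This is a sketch of the idea, not Doris's TwoDimensionalGreedyAlgo; the function name and input format are assumptions.

```python
def pick_move(replicas):
    """One greedy step over (partition, backend) replica counts.

    replicas: {partition: {backend: replica_count}}; backends holding zero
    replicas of a partition must be listed explicitly with 0. Returns
    (partition, src_backend, dst_backend), or None when every partition is
    balanced within one replica. Replica sizes are ignored.
    """
    # Cluster dimension: total replicas per backend.
    totals = {}
    for per_be in replicas.values():
        for be, n in per_be.items():
            totals[be] = totals.get(be, 0) + n

    # Partition dimension: pick the partition with the largest skew.
    best_part, best_skew = None, 1
    for part, per_be in replicas.items():
        skew = max(per_be.values()) - min(per_be.values())
        if skew > best_skew:
            best_part, best_skew = part, skew
    if best_part is None:
        return None

    per_be = replicas[best_part]
    hi, lo = max(per_be.values()), min(per_be.values())
    # Break ties with the cluster dimension so the move also helps
    # overall balance: take from the busiest, give to the idlest backend.
    src = max((be for be, n in per_be.items() if n == hi), key=lambda be: totals[be])
    dst = min((be for be, n in per_be.items() if n == lo), key=lambda be: totals[be])
    return best_part, src, dst
```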
Doris supports two cache modes: sql_cache and partition_cache.
sql_cache uses the SQL string as the key and caches the whole result.
partition_cache splits the result by partition and caches each partition's data separately.
Therefore, a query may hit only part of the partition_cache data.
If a query hits the left part of the data, the hit range is called left.
If a query hits the right part of the data, the hit range is called right.
If a query hits all of the data, the hit range is called full.
When a query did not hit any partition cache, the algorithm still returned hit range right.
It should return hit range none.
Related issue: #5136
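A simplified sketch of the hit-range classification. The function name, the string results and the rule that a hit only in the middle counts as none are assumptions:

```python
def hit_range(partition_keys, cached):
    """Classify a partition-cache lookup (hypothetical simplification).

    partition_keys: the partitions a query needs, in ascending order.
    cached: set of partition keys present in the cache.
    Only a cached prefix (LEFT) or suffix (RIGHT) is usable here; a hit
    only in the middle is treated as NONE.
    """
    hits = [key in cached for key in partition_keys]
    if not any(hits):
        return "NONE"  # the case that was wrongly reported as RIGHT
    if all(hits):
        return "FULL"
    if hits[0]:
        return "LEFT"
    if hits[-1]:
        return "RIGHT"
    return "NONE"
```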
There are some long loops and sleeps in the unit tests, which make running all unit tests take a
very long time, especially in TSAN mode.
This patch speeds up the unit tests by shortening long loops and sleeps;
in my environment, all unit tests finished in 1 minute. This is useful
for running basic functional unit tests.
You can control this with a new environment variable,
'DORIS_ALLOW_SLOW_TESTS'. For example, to enable slow tests, set:
export DORIS_ALLOW_SLOW_TESTS=1
and to disable them, set:
export DORIS_ALLOW_SLOW_TESTS=0
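A sketch of how a test might pick its loop length from the variable, assuming that `1` restores the original long loops. The real check lives in the BE's C++ test code; `allow_slow_tests` and `loop_count` are hypothetical helpers.

```python
import os


def allow_slow_tests():
    """Hypothetical helper: True when DORIS_ALLOW_SLOW_TESTS is set to 1."""
    return os.environ.get("DORIS_ALLOW_SLOW_TESTS", "0") == "1"


def loop_count(fast, slow):
    # Use the short loop by default and the original long loop only
    # when slow tests are allowed.
    return slow if allow_slow_tests() else fast
```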
Add a fuzzy_parse flag. If all objects in the JSON file have the same keys in the same order, only the first row needs to be parsed by key; the remaining rows can then be parsed by index instead of by key.
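A sketch of the idea for flat JSON rows: with the flag on, column positions are learned from the first row and later rows are read by position. `parse_rows` is hypothetical, and it silently gives wrong results if a later row breaks the same-keys-same-order assumption, which is exactly what the flag promises.

```python
import json


def parse_rows(lines, columns, fuzzy_parse):
    """Parse flat JSON objects into rows ordered like `columns` (sketch)."""
    rows = []
    key_order = None  # positions of `columns` within each object
    for line in lines:
        # object_pairs_hook=list keeps the keys in file order.
        pairs = json.loads(line, object_pairs_hook=list)
        if fuzzy_parse and key_order is not None:
            # Same keys, same order assumed: read values by index.
            values = [v for _, v in pairs]
            rows.append([values[i] for i in key_order])
        else:
            # Look up every column by key.
            obj = dict(pairs)
            rows.append([obj.get(c) for c in columns])
            if fuzzy_parse:
                keys = [k for k, _ in pairs]
                key_order = [keys.index(c) for c in columns]
    return rows
```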
#4996
This handles the case where BE is restarting and an older tablet has been added to the garbage collection queue but not yet deleted.
Because the data_dirs are loaded in parallel, a tablet loaded later may be older than one loaded earlier, and this should not be treated as a failure.
Note that the _add_tablet_unlocked() method is also called when creating a new tablet. In that case, the code changed by this pull request is not reached, so the tablet creation process is not affected.
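A sketch of the load-time rule: if an older copy of a tablet shows up after a newer one, keep the newer one and do not report an error. The function and the `creation_time` field used to decide "older" are hypothetical.

```python
def add_tablet_on_load(tablets, tablet_id, candidate):
    """Register a tablet found while loading data_dirs (hypothetical).

    tablets maps tablet_id -> the registered tablet; tablets are dicts
    with a hypothetical 'creation_time'. Returns True if the candidate
    was registered, False if it was skipped as an older copy.
    """
    existing = tablets.get(tablet_id)
    if existing is None or candidate["creation_time"] > existing["creation_time"]:
        tablets[tablet_id] = candidate
        return True
    # An older copy (e.g. one already queued for garbage collection) was
    # loaded later because data_dirs load in parallel: skip it, no error.
    return False
```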
In the previous implementation, the job progress (such as the consumed Kafka offset)
was updated whether a subtask was committed or aborted.
Under normal circumstances, an aborted transaction consumes no data
and all progress is 0, so even if we update the progress, it remains
unchanged.
However, under high cluster load, a subtask may fail midway through execution on the BE side.
In this case, although the task is aborted, part of the progress is updated.
This causes the next subtask to skip that data, resulting in data loss.
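The description implies that progress should be applied only when a subtask commits. A sketch of that rule; `RoutineLoadJobSketch` and its method are hypothetical.

```python
class RoutineLoadJobSketch:
    """Hypothetical model of a Kafka routine load job's progress."""

    def __init__(self, partitions):
        # Next offset to consume for each Kafka partition.
        self.offsets = {p: 0 for p in partitions}

    def on_task_finished(self, committed, consumed):
        """consumed maps partition -> number of messages the subtask read."""
        if not committed:
            # An aborted subtask may still carry partial progress if it
            # failed midway on BE; ignoring it makes the next subtask
            # re-consume that data instead of skipping it.
            return
        for partition, n in consumed.items():
            self.offsets[partition] += n
```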
Add tracing for create tablet tasks. It is a useful tool for administrators to find
the bottleneck when tablet creation times out.
For example, an administrator could increase 'tablet_map_shard_size' upon finding that
the 'got tablets shard lock' step takes too much time.
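A sketch of step tracing: each step records the time since the previous one, so the slowest step stands out. The `Trace` class is hypothetical and not Doris's trace utility.

```python
import time


class Trace:
    """Hypothetical step tracer for a create tablet task."""

    def __init__(self):
        self._last = time.monotonic()
        self.steps = []  # (label, seconds since the previous step)

    def step(self, label):
        now = time.monotonic()
        self.steps.append((label, now - self._last))
        self._last = now

    def slowest(self):
        # The step an administrator would look at first.
        return max(self.steps, key=lambda s: s[1])[0]
```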
When the cached partitions are not contiguous, a range query may fail.
For example, if partition keys 20201011 and 20201013 are cached but 20201012 is not,
a range query between 20201011 and 20201013 will not hit the cache.
Issue: #5059
If a column does not contain any null values and a delete operation
with "where k1 is null" is executed on it, BE may crash.
This bug was introduced in #5030.
Regardless of whether the tablet is submitted for compaction,
we need to call 'reset_compaction' to clean up the base_compaction or cumulative_compaction objects
in the tablet, because these two objects hold a shared_ptr to the tablet itself.
If they are not cleaned up, the tablet's reference count will always be greater than 1,
so the tablet can never be collected by the garbage collector (TabletManager::start_trash_sweep).
This bug was introduced in #4891.
When a user tries to create a materialized view on an agg-family table with an mv column transformed
from an original column, Doris now throws the error message
"The mv column of agg or uniq table cannot be transformed from original column"
instead of "column not exists".
Add a viewable profile for broker load. As with the query profile,
the user can set the session variable is_report_success to true before submitting the load job,
and then view the job's running profile on the FE web page for easier analysis and debugging.
- There is an FE configuration called dynamic_partition_enable
which enables or disables the dynamic partition feature.
When this configuration is false, no table supports dynamic partitioning.
- However, when a user created a dynamic partition table, Doris did not check this configuration.
As a result, the user could create a dynamic partition table normally,
but Doris could not actually create any partitions for it.
- This PR checks this configuration when creating the table.
A dynamic partition table can be created only when dynamic_partition_enable is true.
If it is false, the statement that creates a dynamic partition table reports an error directly.
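A sketch of the create-time check. The function, the `DdlError` type and the error text are hypothetical, and treating any `dynamic_partition.`-prefixed table property as the marker of a dynamic partition table is an assumption.

```python
class DdlError(Exception):
    """Hypothetical DDL error type."""


def check_dynamic_partition(table_properties, dynamic_partition_enable):
    """Reject dynamic partition tables when the FE config disables them.

    table_properties: properties from CREATE TABLE. Any key starting with
    'dynamic_partition.' is assumed to mark a dynamic partition table.
    """
    wants_dynamic = any(key.startswith("dynamic_partition.")
                        for key in table_properties)
    if wants_dynamic and not dynamic_partition_enable:
        raise DdlError("dynamic_partition_enable is false")
```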
For #4674
This is a UDAF for approximate top-N using the Space-Saving algorithm. At present, it can only calculate
the frequent items in a column and their frequencies; based on this, topN functions similar to those
supported by Kylin can be implemented in the future.
I have also added a test that measures the accuracy of this algorithm. The following is a rough result.
The data set has 1 million rows following a Zipfian distribution. Element cardinality is the number of
distinct values, and the 20X, 50X and 100X columns correspond to space_expand_rate values of 20, 50 and 100,
which set the number of counters in the Space-Saving algorithm. The percentages are the measured accuracy.
```
Zipf exponent = 0.5
Element cardinality   20X    50X    100X
1000                  100%   100%   100%
10000                 100%   100%   100%
100000                100%   100%   100%
500000                94%    98%    99%

Zipf exponent = 0.6, 1
Element cardinality   20X    50X    100X
1000                  100%   100%   100%
10000                 100%   100%   100%
100000                100%   100%   100%
500000                100%   100%   100%
```
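A compact sketch of the Space-Saving algorithm itself. It takes the counter number directly rather than deriving it from space_expand_rate, does not track per-item error bounds, and is not the UDAF's code.

```python
def space_saving_top_n(stream, num_counters, n):
    """Approximate top-N with the Space-Saving algorithm.

    Keeps at most num_counters (item -> count) pairs. A new item that
    finds all counters taken replaces the item with the smallest count
    and inherits that count plus one, so a count can overestimate by at
    most the replaced minimum. Returns the n largest (item, count) pairs.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < num_counters:
            counters[item] = 1
        else:
            victim = min(counters, key=counters.get)
            min_count = counters.pop(victim)
            counters[item] = min_count + 1
    ranked = sorted(counters.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]
```

Frequent items are rarely the minimum and so are rarely evicted, which is why more counters (a larger expand rate) improve accuracy on high-cardinality data.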
This bug was introduced by PR #5051.
As @liutang123 pointed out, when PlanFragmentExecutor is destructed, it calls
`close -> ExecNode::close -> OlapScanNode::close`. OlapScanNode waits for `_transfer_thread`,
and `_transfer_thread` waits for all OlapScanner processing to complete.
OlapScanners are processed by scanner threads. When the last scanner finishes,
`_transfer_thread` breaks out of its loop, and PlanFragmentExecutor continues its destruction.
Once that completes, its RuntimeProfile::Counter objects are destructed as well.
At this point, the ScopedTimer in the scanner thread may still use such a Counter when the timer itself is destructed.
So we must make sure that the timer is destructed before the runtime profile is destructed.
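A sketch of the required ordering: the timer's scope closes, performing its final counter write, before the scanner signals completion, so whoever releases the profile afterwards never races with the timer. `Counter`, `ScopedTimer` and `scanner` are hypothetical stand-ins; Python's garbage collector prevents a real use-after-free, so only the ordering is shown.

```python
import threading
import time


class Counter:
    """Stand-in for RuntimeProfile::Counter (hypothetical)."""

    def __init__(self):
        self.value = 0.0


class ScopedTimer:
    """Adds the elapsed time to a Counter when the scope ends."""

    def __init__(self, counter):
        self._counter = counter

    def __enter__(self):
        self._start = time.monotonic()
        return self

    def __exit__(self, *exc):
        # The final write to the counter: it must happen before the
        # profile that owns the counter can be released.
        self._counter.value += time.monotonic() - self._start
        return False


def scanner(counter, done, events):
    with ScopedTimer(counter):
        time.sleep(0.01)  # scan work
    # The timer's scope has already closed; only now signal completion.
    events.append("timer stopped")
    done.set()
```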
The string '_engine_data_path' was mistakenly used as the path. Since the storage engine is not actually opened,
the option/path is unnecessary. Clean it up to avoid any confusion about file path management.
And Refactor ColumnRangeValue and OlapScanNode
This patch mainly does the following:
- Fix issue #5071
- Make type_min in ColumnRangeValue static
- Add a type_limit class to make the code clearer
- Refactor the normalize_in_and_eq_predicate function
Add Ninja build system support. If Ninja is installed, you can build BE with Ninja using bash build.sh --be --ninja.
A Ninja build is faster than make.