1. BlockManager has been added to StorageEngine, so StorageEngine must be initialized when starting the BetaRowset unit test.
2. The cache should not reuse the same buffer to store a value; otherwise the address will be freed twice and cause a crash.
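A minimal sketch of the fix pattern for the second point, with illustrative names rather than the actual Doris cache API: each cached value gets its own heap allocation, so the cache's deleter and the caller never free the same address.
```cpp
#include <cstddef>
#include <cstring>

// Illustrative only (not the actual Doris cache API): before handing a
// value to a cache that frees entries with its own deleter, copy it into
// a buffer that the cache alone owns. If the cache stored the caller's
// `buf` directly, the same address would be freed twice and crash.
char* copy_value_for_cache(const char* buf, std::size_t len) {
    char* owned = new char[len];  // cache-owned copy, freed once by the cache
    std::memcpy(owned, buf, len);
    return owned;
}
```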
Improve the performance of hash join when the build table has too many duplicated rows, which causes hash table collisions and slows down the probe phase.
In this PR, when the join type is semi join or anti join, we build the hash table without duplicated rows.
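A minimal sketch of the idea on a simplified single-column build side (illustrative, not the actual Doris hash-join code): for semi and anti joins only key existence matters, so duplicated build keys can be dropped up front.
```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Illustrative only (not the actual Doris hash-join code): for a semi or
// anti join, only the existence of a build-side key matters, so duplicated
// build rows can be dropped. This keeps hash chains short and makes every
// probe O(1) regardless of build-side duplication.
std::unordered_set<int64_t> build_semi_join_table(const std::vector<int64_t>& build_keys) {
    std::unordered_set<int64_t> table;
    table.reserve(build_keys.size());
    for (int64_t key : build_keys) {
        table.insert(key);  // duplicates are silently dropped
    }
    return table;
}

bool probe(const std::unordered_set<int64_t>& table, int64_t key) {
    return table.count(key) > 0;  // a semi join keeps the probe row iff true
}
```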
benchmark:
dataset: the TPC-DS tables `store_sales` and `catalog_sales`
```
mysql> select count(*) from catalog_sales;
+----------+
| count(*) |
+----------+
| 14401261 |
+----------+
1 row in set (0.44 sec)
mysql> select count(distinct cs_bill_cdemo_sk) from catalog_sales;
+------------------------------------+
| count(DISTINCT `cs_bill_cdemo_sk`) |
+------------------------------------+
| 1085080 |
+------------------------------------+
1 row in set (2.46 sec)
mysql> select count(*) from store_sales;
+----------+
| count(*) |
+----------+
| 28800991 |
+----------+
1 row in set (0.84 sec)
mysql> select count(distinct ss_addr_sk) from store_sales;
+------------------------------+
| count(DISTINCT `ss_addr_sk`) |
+------------------------------+
| 249978 |
+------------------------------+
1 row in set (2.57 sec)
```
test queries:
query1: `select count(*) from (select store_sales.ss_addr_sk from store_sales left semi join catalog_sales on catalog_sales.cs_bill_cdemo_sk = store_sales.ss_addr_sk) a;`
query2: `select count(*) from (select catalog_sales.cs_bill_cdemo_sk from catalog_sales left semi join store_sales on catalog_sales.cs_bill_cdemo_sk = store_sales.ss_addr_sk) a;`
benchmark result:
||query1|query2|
|:--:|:--:|:--:|
|before|14.76 sec|3 min 16.52 sec|
|after|12.64 sec|10.34 sec|
Fix a bug in const union queries like `select null union select null`. This is because the type of the SlotDescriptor for a `select null` clause is null, which causes the BE to core dump and the FE to find the wrong cast function.
Related issues: #2663, #2828.
This CL supports loading data into specified temporary partitions.
```
INSERT INTO tbl TEMPORARY PARTITIONS(tp1, tp2, ..) ....;
curl .... -H "temporary_partition: tp1, tp, .. " ....
LOAD LABEL db1.label1 (
DATA INFILE("xxxx")
INTO TABLE `tbl2`
TEMPORARY PARTITION(tp1, tp2, ...)
...
```
NOTICE: this CL changes the FE meta version to 77.
There are 3 major changes in this CL:
## Syntax reorganization
Reorganized the syntax related to specifying partitions: removed some redundant syntax definitions and unified all partition-specifying syntax under one entry.
## Meta refactor
To support specifying temporary partitions, I changed the way partition information is stored in the table. It is now organized as follows:
The following two maps are reserved in OlapTable for storing formal partitions:
```
idToPartition
nameToPartition
```
Use the `TempPartitions` class for storing temporary partitions.
All the partition attributes of the formal partition and the temporary partition,
such as the range, the number of replicas, and the storage medium, are all stored
in the `partitionInfo` of the OlapTable.
In `partitionInfo`, we use two maps to store the range of formal partition
and temporary partition:
```
idToRange
idToTempRange
```
Separate maps are used because the ranges of formal partitions and temporary partitions may overlap; keeping them apart makes range checking easier.
All partition attributes except the partition range are stored using the same map,
and the partition id is used as the map key.
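A rough sketch of this layout (C++-style for illustration; the actual FE code is Java, and these names only mirror the description above):
```cpp
#include <cstdint>
#include <map>
#include <string>

// Illustrative C++ sketch only; the real FE code is Java. Ranges for formal
// and temporary partitions live in two separate maps because they may
// overlap, while all other attributes share one map keyed by partition id.
struct RangeSketch {
    int64_t lower;
    int64_t upper;
};

struct PartitionInfoSketch {
    std::map<int64_t, RangeSketch> id_to_range;       // formal partition ranges
    std::map<int64_t, RangeSketch> id_to_temp_range;  // temporary partition ranges
    std::map<int64_t, std::string> id_to_storage_medium;  // shared attribute map
};
```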
## Methods to get partitions
A table may contain both formal and temporary partitions.
There are several methods to get the partition of a table.
They typically fall into two categories:
1. Get partition by id
2. Get partition by name
Depending on the requirement, the caller may want a formal partition or a temporary partition. The methods are described below so that callers can obtain partitions the correct way.
1. Get by name
This type of request usually comes from a user who specifies partition names, such as `select * from tbl partition(p1);`.
This type of request has clear information to indicate whether to obtain a
formal or temporary partition.
Therefore, we need to get the partition through this method:
`getPartition(String partitionName, boolean isTemp)`
To avoid modifying too much code, we keep `getPartition(String partitionName)`, which is the same as `getPartition(partitionName, false)`.
2. Get by id
This type of request usually means that the previous step has obtained
certain partition ids in some way,
so we only need to get the corresponding partition through this method:
`getPartition(long partitionId)`.
This method will try both formal partitions and temporary partitions (a sketch follows this list).
3. Get all partition instances
Depending on the requirements, the caller may want to obtain all formal partitions, all temporary partitions, or all partitions. Therefore we provide 3 methods, and the caller chooses according to need.
`getPartitions()`
`getTempPartitions()`
`getAllPartitions()`
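A tiny sketch of the id-based lookup from point 2 (C++-style for illustration, the actual FE code is Java; the formal-first lookup order shown is an assumption of the sketch):
```cpp
#include <cstdint>
#include <map>

struct Partition;  // opaque in this sketch

// Illustrative only: getPartition(long id) tries both kinds of partition,
// so a caller holding an id does not care which kind it refers to.
// The formal-first order here is an assumption, not confirmed by the CL.
Partition* get_partition(int64_t id,
                         const std::map<int64_t, Partition*>& id_to_partition,
                         const std::map<int64_t, Partition*>& id_to_temp_partition) {
    auto it = id_to_partition.find(id);
    if (it != id_to_partition.end()) {
        return it->second;  // formal partition
    }
    auto temp_it = id_to_temp_partition.find(id);
    return temp_it != id_to_temp_partition.end() ? temp_it->second : nullptr;
}
```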
While a `block` is in use, some methods of the block manager are referenced, so `file_block_mgr` should be a resident, globally unique object. I put it in `StorageEngine`.
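A minimal sketch of that residency, with assumed names rather than the exact Doris classes:
```cpp
// Illustrative only: the block manager is owned by a process-wide
// engine singleton, so it stays alive as long as any block that
// references its methods. Names are sketch assumptions.
class FileBlockManager {
    // create/open/delete blocks ...
};

class StorageEngineSketch {
public:
    static StorageEngineSketch* instance() {
        static StorageEngineSketch engine;  // constructed once, lives until exit
        return &engine;
    }
    FileBlockManager* block_manager() { return &_file_block_mgr; }

private:
    FileBlockManager _file_block_mgr;  // resident: same lifetime as the engine
};
```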
TODO: the `BlockManager`, `Env` need to be reorganized.
This CL tries to fix a potential bug described in issue #3097, though I'm not sure this is the root cause.
It also removes a lot of verbose logging and fixes a memory leak.
In a large-scale cluster we may rolling-upgrade BEs. This patch adds a
column named 'Version' to the 'show backends;' command, as well as to the web page
'/system?path=//backends', to provide a way to check whether any BE has
not been upgraded yet.
```
be/src/olap/rowset/segment_v2/ordinal_page_index.cpp:103:22: warning: ‘ordinal’ may be used
uninitialized in this function [-Wmaybe-uninitialized]
_ordinals[i] = ordinal;
```
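A sketch of the usual fix for this warning class (not the actual patch): give `ordinal` a definite initial value so every path writes defined data.
```cpp
#include <cstdint>
#include <vector>

// Illustrative only: if `ordinal` is only assigned on some branches, GCC
// cannot prove it is set before `_ordinals[i] = ordinal;` runs. Initializing
// it at declaration removes both the warning and the undefined behavior.
void fill_ordinals_sketch(std::vector<uint64_t>& ordinals, bool decoded) {
    uint64_t ordinal = 0;  // definite value on every path (was uninitialized)
    if (decoded) {
        ordinal = 42;      // stand-in for the real page-decoding logic
    }
    for (size_t i = 0; i < ordinals.size(); ++i) {
        ordinals[i] = ordinal;
    }
}
```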
This bug occurred when the BE made a snapshot: the version required by the FE had been merged into the cumulative version, so the snapshot task could not complete even if it retried. To solve this problem, the BackupJob can be set to CANCELLED, and the user can then retry the job.
Fix #3057
If delete predicates exist in the meta of Doris-0.10, all of these predicates should
be retained. There is a confusing aspect of Doris-0.10: the delete predicate
only exists in OLAPHeaderMessage and PPendingDelta, not in PDelta.
This quirk causes the bug.
The timestamp values loaded from ORC files are wrong; they are offset relative to Hive and Spark.
Because the time zone of an ORC timestamp is stored inside the ORC stripe information, the timestamp obtained here is an offset timestamp, so parsing it as UTC yields the actual datetime literal.
e.g.:
select str_to_date('2014-12-21 12%3A34%3A56', '%Y-%m-%d %H%%3A%i%%3A%s');
select unix_timestamp('2007-11-30 10:30%3A19', '%Y-%m-%d %H:%i%%3A%s');
This also enables us to extract column fields from HDFS file paths that contain '%'.
Normalize the setting of the mem limit to avoid unexpected exceptions. For example, a user may not set the query mem limit in the query plan, which may cause the BE to crash.
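A minimal sketch of one way to normalize it (bounds and names are assumptions, not the actual Doris defaults): clamp an unset or out-of-range limit into a sane window instead of trusting the plan.
```cpp
#include <algorithm>
#include <cstdint>

// Illustrative only: clamp the query mem limit into [min, max] so an unset
// (<= 0) or absurd value from the query plan cannot crash the BE.
int64_t normalize_mem_limit(int64_t requested, int64_t min_limit, int64_t max_limit) {
    if (requested <= 0) {
        return min_limit;  // plan did not set a limit; fall back to the floor
    }
    return std::min(std::max(requested, min_limit), max_limit);
}
```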
Currently the disks_total_capacity metric is a user-specified capacity, while
disks_avail_capacity is the disk's actual available capacity. As a result,
disks_total_capacity may be less than disks_avail_capacity, and
UsedPct on the FE may turn out negative.
We had better use the disk's actual capacity for the disks_total_capacity metric.
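A sketch of deriving both metrics from the filesystem itself (using standard POSIX `statvfs`; the actual Doris code may differ), which guarantees total is never smaller than available:
```cpp
#include <sys/statvfs.h>
#include <cstdint>

// Illustrative only: reading both values from statvfs keeps
// disks_total_capacity >= disks_avail_capacity by construction.
bool disk_capacity(const char* path, int64_t* total, int64_t* avail) {
    struct statvfs vfs;
    if (statvfs(path, &vfs) != 0) {
        return false;
    }
    *total = static_cast<int64_t>(vfs.f_blocks) * vfs.f_frsize;  // actual capacity
    *avail = static_cast<int64_t>(vfs.f_bavail) * vfs.f_frsize;  // available to non-root
    return true;
}
```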
The abstraction of the Block layer, inspired by Kudu, lies between the "business
layer" and the "underlying file storage layer" (`Env`), making them no longer
strongly coupled.
In this way, the business layer (such as `SegmentWriter`) no longer needs to
perform file operations directly, which gives better encapsulation.
An ideal situation in the future is: when we need to support a
new file storage system, we only need to add a corresponding type of
BlockManager without modifying the business code (such as `SegmentWriter`).
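A condensed sketch of what such a layer's surface could look like (assumed names inspired by the Kudu design mentioned above, not the actual Doris header):
```cpp
#include <memory>
#include <string>

// Illustrative only, not the actual Doris classes: the business layer
// (e.g. SegmentWriter) programs against these interfaces; only a concrete
// BlockManager implementation knows about the underlying Env / file system.
class WritableBlockSketch {
public:
    virtual ~WritableBlockSketch() = default;
    virtual bool append(const std::string& data) = 0;  // write payload bytes
    virtual bool close() = 0;  // persist durably, then release the handle
};

class BlockManagerSketch {
public:
    virtual ~BlockManagerSketch() = default;
    // Hides naming, placement, and Env selection from the caller.
    virtual std::unique_ptr<WritableBlockSketch> create_block() = 0;
};
```
Supporting a new storage system would then mean implementing these two interfaces, leaving callers such as `SegmentWriter` untouched.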
With the Block layer, there are some benefits:
1. First and foremost, the mapping relationship between data and `Env` is more
flexible. For example, in the storage engine, the data of the tablet can be
placed in multiple file systems (`Env`) at the same time; that is, one-to-many
relationships are supported, for example one copy on local storage and one on
remote storage.
2. The mapping relationship between blocks and files can be adjusted, for example,
it may not be a one-to-one relationship. For example, the data of multiple
blocks can be stored in one physical file, which reduces the number of files
that need to be opened during querying. This is like the `LogBlockManager` in Kudu.
3. We can move the opened-file-cache under the Block layer, which can automatically
close and open the files used by the upper layer, so that the upper business
layer does not need to be aware of file handle limits at all
(a problem often encountered in production now).
4. Better automatic cleanup logic when there are exceptions. For example, a block
that is not closed explicitly can automatically clean up its corresponding file,
thereby avoiding most garbage files.
5. More convenient for batch file creation and deletion. Some business operations
create multiple files, such as compaction. At present, the processing flow that
these files go through is executed one by one: 1) creation; 2) writing data;
3) fsync to disk. But in fact this is not necessary; we only need to fsync the
whole batch of files at the end, which gives the operating system more
opportunities to merge IO and thereby improves performance. However, this
batching is relatively tedious and need not be coupled with the business code;
the Block layer is an ideal place for it.
This is the first patch; it just adds the related classes, laying the groundwork
for later switching of the read and write logic.