The problem with the current implementation is that all data to be
inserted will be counted in memory, but for the aggregation model or
some other special cases, not all data will be inserted into `MemTable`,
and these data should not be counted in memory.
This change makes the `SkipList` use the exclusive `MemPool`,
and only the data will be inserted into the `SkipList` can use this
`MemPool`. In other words, those discarded rows will not be
counted by the `MemPool` of` SkipList`.
In order to avoid duplicate checking whether a row already exists in
`SkipList`, this change also modifies the `SkipList` interface(A `Hint`
will be fetched when `Find()`, and then use it in `InsertUseHint()`),
and made `SkipList` no longer aware of the aggregation logic.
At present, because of the data row(`Tuple`) generated by the upper layer
is different from the data row(`Row`) internally represented by the
engine, when inserting `MemTable`, the data row must be copied.
If the row needs to be inserted into SkipList, we need copy it again
to `MemPool` of `SkipList`.
And, at present, the aggregation function only supports `MemPool` when
copying, so even if the data will not be inserted into` SkipList`,
`MemPool` is still used (in the future, it can be replaced with an
ordinary` Buffer`). However, we reuse the allocated memory in MemPool,
that is, we do not reallocate new memory every time.
Note: Due to the characteristics of `MemPool` (once inserted, it cannot
be partially cleared), the following scenarios may still cause multiple
flushes. For example, the aggregation model of a string column is `MAX`,
and the data inserted at the same time is in ascending order, then for
each data row, it must apply for memory from `MemPool` in `SkipList`,
that is, although the old rows in SkipList` will be discarded,
the memory occupied will still be counted.
I did a test on my development machine using `STREAM LOAD`: a table with
only one tablet and all columns are keys, the original data was
1.1G (9318799 rows), and there were 377745 rows after removing duplicates.
It can be found that both the number of files and the query efficiency are
greatly improved, the price paid is only a slight increase in load time.
before:
```
$ ll storage/data/0/10019/1075020655/
total 4540
-rw------- 1 dev dev 393152 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_0_0.dat
-rw------- 1 dev dev 1135 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_0_0.idx
-rw------- 1 dev dev 421660 Dec 2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_10_0.dat
-rw------- 1 dev dev 1185 Dec 2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_10_0.idx
-rw------- 1 dev dev 184214 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_1_0.dat
-rw------- 1 dev dev 610 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_1_0.idx
-rw------- 1 dev dev 329181 Dec 2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_11_0.dat
-rw------- 1 dev dev 935 Dec 2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_11_0.idx
-rw------- 1 dev dev 343813 Dec 2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_12_0.dat
-rw------- 1 dev dev 985 Dec 2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_12_0.idx
-rw------- 1 dev dev 315364 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_2_0.dat
-rw------- 1 dev dev 885 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_2_0.idx
-rw------- 1 dev dev 423806 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_3_0.dat
-rw------- 1 dev dev 1185 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_3_0.idx
-rw------- 1 dev dev 294811 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_4_0.dat
-rw------- 1 dev dev 835 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_4_0.idx
-rw------- 1 dev dev 403241 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_5_0.dat
-rw------- 1 dev dev 1135 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_5_0.idx
-rw------- 1 dev dev 350753 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_6_0.dat
-rw------- 1 dev dev 860 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_6_0.idx
-rw------- 1 dev dev 266966 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_7_0.dat
-rw------- 1 dev dev 735 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_7_0.idx
-rw------- 1 dev dev 451191 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_8_0.dat
-rw------- 1 dev dev 1235 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_8_0.idx
-rw------- 1 dev dev 398439 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_9_0.dat
-rw------- 1 dev dev 1110 Dec 2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_9_0.idx
{
"TxnId": 16,
"Label": "cd9f8392-dfa0-4626-8034-22f7cb97044c",
"Status": "Success",
"Message": "OK",
"NumberTotalRows": 9318799,
"NumberLoadedRows": 9318799,
"NumberFilteredRows": 0,
"NumberUnselectedRows": 0,
"LoadBytes": 1079581477,
"LoadTimeMs": 46907
}
mysql> select count(*) from xxx_before;
+----------+
| count(*) |
+----------+
| 377745 |
+----------+
1 row in set (0.91 sec)
```
aftr:
```
$ ll storage/data/0/10013/1075020655/
total 3612
-rw------- 1 dev dev 3328992 Dec 2 18:26 0200000000000003d44e5cc72626f95a0b196b52a05c0f8a_0_0.dat
-rw------- 1 dev dev 8460 Dec 2 18:26 0200000000000003d44e5cc72626f95a0b196b52a05c0f8a_0_0.idx
-rw------- 1 dev dev 350576 Dec 2 18:26 0200000000000003d44e5cc72626f95a0b196b52a05c0f8a_1_0.dat
-rw------- 1 dev dev 985 Dec 2 18:26 0200000000000003d44e5cc72626f95a0b196b52a05c0f8a_1_0.idx
{
"TxnId": 12,
"Label": "88f606d5-8095-4f15-b61d-49b7080c16b8",
"Status": "Success",
"Message": "OK",
"NumberTotalRows": 9318799,
"NumberLoadedRows": 9318799,
"NumberFilteredRows": 0,
"NumberUnselectedRows": 0,
"LoadBytes": 1079581477,
"LoadTimeMs": 48771
}
mysql> select count(*) from xxx_after;
+----------+
| count(*) |
+----------+
| 377745 |
+----------+
1 row in set (0.38 sec)
```
Apache Doris (incubating)
Doris is an MPP-based interactive SQL data warehousing for reporting and analysis. Its original name was Palo, developed in Baidu. After donating it to Apache Software Foundation, it was renamed Doris.
1. License
2. Technology
Doris mainly integrates the technology of Google Mesa and Apache Impala, and it is based on a column-oriented storage engine and can communicate by MySQL client.
3. User cases
Doris not only provides high concurrent low latency point query performance, but also provides high throughput queries of ad-hoc analysis.
Doris not only provides batch data loading, but also provides near real-time mini-batch data loading.
Doris also provides high availability, reliability, fault tolerance, and scalability.
The simplicity (of developing, deploying and using) and meeting many data serving requirements in single system are the main features of Doris (refer to Overview).
4. Compile and install
Currently only supports Docker environment and Linux OS, such as Ubuntu and CentOS.
4.1 Compile in Docker environment (Recommended)
We offer a docker image as a Doris compilation environment. You can compile Doris from source in it and run the output binaries in other Linux environment.
Firstly, you must be install and start docker service.
And then you could build Doris as following steps:
Step1: Pull the docker image with Doris building environment
$ docker pull apachedoris/doris-dev:build-env
You can check it by listing images, for example:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
apachedoris/doris-dev build-env f8bc5d4024e0 21 hours ago 3.28GB
NOTE: You may have to use different images to compile from source.
image version commit id release version apachedoris/doris-dev:build-env before ff0dd0d 0.8.x, 0.9.x apachedoris/doris-dev:build-env-1.1 ff0dd0d or later 0.10.x or later
Step2: Run the Docker image
You can run the image directly:
$ docker run -it apachedoris/doris-dev:build-env
Or if you want to compile the source located in your local host, you can map the local directory to the image by running:
$ docker run -it -v /your/local/path/incubator-doris-DORIS-x.x.x-release/:/root/incubator-doris-DORIS-x.x.x-release/ apachedoris/doris-dev:build-env
Step3: Download Doris source
Now you should attached in docker environment.
You can download Doris source by release package or by git clone in image.
(If you already downloaded the source in your local host and map it to the image in Step2, you can skip this step.)
$ wget https://dist.apache.org/repos/dist/dev/incubator/doris/xxx.tar.gz
or
$ git clone https://github.com/apache/incubator-doris.git
Step4: Build Doris
Enter Doris source path and build Doris.
$ sh build.sh
After successfully building, it will install binary files in the directory output/.
4.2 For Linux OS
Prerequisites
You must be install following softwares:
GCC 5.3.1+, Oracle JDK 1.8+, Python 2.7+, Apache Maven 3.5+, CMake 3.4.3+
After you installed above all, you also must be set them to environment variable PATH and set JAVA_HOME.
If your GCC version is lower than 5.3.1, you can run:
sudo yum install devtoolset-4-toolchain -y
and then, set the path of GCC (e.g /opt/rh/devtoolset-4/root/usr/bin) to the environment variable PATH.
Compile and install
Run the following script, it will compile thirdparty libraries and build whole Doris.
sh build.sh
After successfully building, it will install binary files in the directory output/.
5. License Notice
Some of the third-party dependencies' license are not compatible with Apache 2.0 License. So you may have to disable
some features of Doris to be complied with Apache 2.0 License. Details can be found in thirdparty/LICENSE.txt
6. Reporting Issues
If you find any bugs, please file a GitHub issue.
7. Links
- Doris official site - http://doris.incubator.apache.org
- User Manual (GitHub Wiki) - https://github.com/apache/incubator-doris/wiki
- Developer Mailing list - Subscribe to dev@doris.incubator.apache.org to discuss with us.
- Gitter channel - https://gitter.im/apache-doris/Lobby - Online chat room with Doris developers.
- Overview - https://github.com/apache/incubator-doris/wiki/Doris-Overview
- Compile and install - https://github.com/apache/incubator-doris/wiki/Doris-Install
- Getting start - https://github.com/apache/incubator-doris/wiki/Getting-start
- Deploy and Upgrade - https://github.com/apache/incubator-doris/wiki/Doris-Deploy-%26-Upgrade
- User Manual - https://github.com/apache/incubator-doris/wiki/Doris-Create%2C-Load-and-Delete
- FAQs - https://github.com/apache/incubator-doris/wiki/Doris-FAQ