40 Commits

Author SHA1 Message Date
e5e0dc421d [refactor] Change ALL OLAPStatus to Status (#8855)
Currently, there are two status types in BE: one is Status in common/Status.h,
and the other is OLAPStatus in olap/olap_define.h.
OLAPStatus is just an enum type; it is very simple and cannot carry much information,
so I will unify this code on common/Status.
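
To illustrate the motivation, here is a minimal sketch of the difference between a bare enum and a status object that carries a message. The names `LegacyStatus`, `Status`, `IOError` and `open_segment` are hypothetical and are not taken from common/Status.h; the sketch only assumes that a status object can hold more than an error code.

```cpp
#include <string>
#include <utility>

// Old style (hypothetical): a bare enum can only say which error occurred.
enum class LegacyStatus { kOk, kIoError };

// New style (hypothetical sketch): a status object carries a code and a message.
class Status {
public:
    static Status OK() { return Status(0, ""); }
    static Status IOError(std::string msg) { return Status(1, std::move(msg)); }

    bool ok() const { return code_ == 0; }
    int code() const { return code_; }
    const std::string& message() const { return msg_; }

private:
    Status(int code, std::string msg) : code_(code), msg_(std::move(msg)) {}

    int code_;
    std::string msg_;
};

// Hypothetical call site: the caller gets the reason along with the failure.
Status open_segment(const std::string& path) {
    if (path.empty()) {
        return Status::IOError("empty segment path");
    }
    return Status::OK();
}
```

With the enum alone, the caller would only learn that an I/O error happened; with the status object, the message travels with the error.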
2022-04-14 11:43:49 +08:00
519305cb22 [feature-wip] (memory tracker) (step4) Switch TLS mem tracker to separate more detailed memory usage (#8669)
Based on #8605, Separate out the memory usage of each operator from the Query/Load/StorageEngine mem tracker.
2022-04-08 09:02:26 +08:00
e285d09157 [Enhancement](load) speed up stream load for duplicate table, use template for faster get_type_info. (#8500) 2022-03-25 15:18:43 +08:00
eeae516e37 [Feature](Memory) Hook TCMalloc new/delete automatically counts to MemTracker (#8476)
Early Design Documentation: https://shimo.im/docs/DT6JXDRkdTvdyV3G

Implement a new way of memory statistics based on TCMalloc New/Delete Hook,
MemTracker and TLS, and it is expected that all memory new/delete/malloc/free
of the BE process can be counted.
2022-03-20 23:06:54 +08:00
e17aef9467 [refactor] refactor the implementation of MemTracker, and related usage (#8322)
Modify the implementation of MemTracker:
1. Simplify a lot of useless logic;
2. Added MemTrackerTaskPool as the ancestor of all query and import trackers. It is used to track the local memory usage of all executing tasks;
3. Add consume/release cache, trigger a consume/release when the memory accumulation exceeds the parameter mem_tracker_consume_min_size_bytes;
4. Add a new memory leak detection mode (experimental feature, parameter memory_leak_detection): throw an exception when the remaining statistical value is greater than the specified range when the MemTracker is destructed, and print the accurate statistical value in HTTP;
5. Added Virtual MemTracker, consume/release will not sync to parent. It will be used when introducing TCMalloc Hook to record memory later, to record the specified memory independently;
6. Modify the GC logic, register the buffer cached in DiskIoMgr as a GC function, and add other GC functions later;
7. Change the global root node from Root MemTracker to Process MemTracker, and remove Process MemTracker in exec_env;
8. Modify the macro that detects whether the memory has reached the upper limit, modify the parameters and default behavior of creating MemTracker, modify the error message format in mem_limit_exceeded, extend and apply transfer_to, remove Metric in MemTracker, etc.;
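
The consume/release cache in item 3 can be sketched roughly as follows. This is a simplified illustration, not the MemTracker code: the class name `BatchedTracker` and its members are hypothetical, and `min_batch_bytes` stands in for the parameter mem_tracker_consume_min_size_bytes.

```cpp
#include <cstdint>

// Hypothetical sketch: small consume/release calls accumulate in a local
// cache and are only applied to the tracked total once the accumulated
// amount reaches a minimum batch size (in either direction).
class BatchedTracker {
public:
    explicit BatchedTracker(std::int64_t min_batch_bytes)
            : min_batch_bytes_(min_batch_bytes) {}

    void consume(std::int64_t bytes) { add(bytes); }
    void release(std::int64_t bytes) { add(-bytes); }

    std::int64_t consumption() const { return consumption_; }
    std::int64_t pending() const { return pending_; }

private:
    void add(std::int64_t delta) {
        pending_ += delta;
        // Flush the cache when it has grown or shrunk past the threshold.
        if (pending_ >= min_batch_bytes_ || pending_ <= -min_batch_bytes_) {
            consumption_ += pending_;
            pending_ = 0;
        }
    }

    std::int64_t min_batch_bytes_;
    std::int64_t consumption_ = 0;
    std::int64_t pending_ = 0;
};
```

The trade-off is that the tracked value can lag behind real usage by up to the batch size, in exchange for fewer updates to the shared counter.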

Modify where MemTracker is used:
1. MemPool adds a constructor to create a temporary tracker to avoid a lot of redundant code;
2. Added trackers for global objects such as ChunkAllocator and StorageEngine;
3. Added more fine-grained trackers such as ExprContext;
4. RuntimeState removes FragmentMemTracker, that is, PlanFragmentExecutor mem_tracker, which was previously used for independent statistical scan process memory, and replaces it with _scanner_mem_tracker in OlapScanNode;
5. MemTracker is no longer recorded in ReservationTracker, and ReservationTracker will be removed later;
2022-03-11 22:04:23 +08:00
50864aca7d [refactor] fix warnings when compile with clang (#8069) 2022-02-19 11:29:02 +08:00
dd36ccc3bf [feature](storage-format) Z-Order Implement (#7149)
Support sort data by Z-Order:

```
CREATE TABLE table2 (
siteid int(11) NULL DEFAULT "10" COMMENT "",
citycode int(11) NULL COMMENT "",
username varchar(32) NULL DEFAULT "" COMMENT "",
pv bigint(20) NULL DEFAULT "0" COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(siteid, citycode)
COMMENT "OLAP"
DISTRIBUTED BY HASH(siteid) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"data_sort.sort_type" = "ZORDER",
"data_sort.col_num" = "2",
"in_memory" = "false",
"storage_format" = "V2"
);
```
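
For readers unfamiliar with Z-Order, the general idea is to interleave the bits of the sort columns into one key, so that rows close in all columns stay close in the sorted order. The sketch below shows this for two 32-bit unsigned values; it is a generic Morton-code illustration, and the function name `interleave_bits` is hypothetical. How Doris actually encodes the columns named by `data_sort.col_num` is not shown in this commit message.

```cpp
#include <cstdint>

// Generic Z-order (Morton) key for two columns: bit i of x goes to bit 2*i,
// bit i of y goes to bit 2*i + 1. Sorting by this key interleaves both
// dimensions instead of ordering by x first and y second.
std::uint64_t interleave_bits(std::uint32_t x, std::uint32_t y) {
    std::uint64_t z = 0;
    for (int i = 0; i < 32; ++i) {
        z |= static_cast<std::uint64_t>((x >> i) & 1u) << (2 * i);
        z |= static_cast<std::uint64_t>((y >> i) & 1u) << (2 * i + 1);
    }
    return z;
}
```

For example, `interleave_bits(2, 1)` yields binary `110`: the low bit of `y` lands between the two bits of `x`.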
2021-12-02 11:39:51 +08:00
db1c281be5 [Enhance][Load] Reduce the number of segments when loading a large volume data in one batch (#6947)
## Case

In the load process, each tablet will have a memtable to save the incoming data,
and if the data in a memtable is larger than 100MB, it will be flushed to disk as a `segment` file. And then
a new memtable will be created to save the following data.

Assume that this is a table with N buckets(tablets). So the max size of all memtables will be `N * 100MB`.
If N is large, it will cost too much memory.

So, to limit memory usage, when the total size of all memtables reaches a threshold (2GB by default), Doris will
try to flush all current memtables to disk (even if their sizes have not reached 100MB).

So you will see that a memtable is flushed when its size reaches `2GB/N`, which may be much smaller
than 100MB, resulting in too many small segment files.

## Solution

When deciding to flush memtables to reduce memory consumption, do NOT flush all memtables; flush only part
of them.
For example, there are 50 tablets (with 50 memtables). The memory limit is 1GB, so when each memtable reaches
20MB, the total size reaches 1GB, and a flush will occur.

If I only flush 25 of the 50 memtables, then the next time the total size reaches 1GB, there will be 25 memtables with
size 10MB, and the other 25 memtables with size 30MB. So I can flush those memtables with size 30MB, which is larger
than 20MB.

The main idea is to introduce some jitter during flush so that memtable sizes stay slightly uneven, which ensures that a flush is only triggered for memtables that are large enough.
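
The arithmetic of the 50-memtable example can be replayed with a small simulation. This is a sketch under stated assumptions, not the code of this patch: the helper names `fill_evenly` and `flush_largest_half` are hypothetical, data is assumed to arrive evenly across memtables, 1GB is treated as 1000MB to match the numbers above, and "flush the largest half" is one possible selection policy.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Spread incoming data evenly over all memtables until the total reaches limit_mb.
void fill_evenly(std::vector<long>& sizes_mb, long limit_mb) {
    if (sizes_mb.empty()) return;
    long total = std::accumulate(sizes_mb.begin(), sizes_mb.end(), 0L);
    long per_table = (limit_mb - total) / static_cast<long>(sizes_mb.size());
    for (long& s : sizes_mb) s += per_table;
}

// Flush (reset to 0) the largest half of the memtables and return the
// sizes that were flushed.
std::vector<long> flush_largest_half(std::vector<long>& sizes_mb) {
    std::vector<std::size_t> order(sizes_mb.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    // Indices sorted by memtable size, largest first.
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return sizes_mb[a] > sizes_mb[b]; });
    std::vector<long> flushed;
    for (std::size_t i = 0; i < order.size() / 2; ++i) {
        flushed.push_back(sizes_mb[order[i]]);
        sizes_mb[order[i]] = 0;
    }
    return flushed;
}
```

Starting from 50 empty memtables and a 1000MB limit, the first flush writes 25 memtables of 20MB each; after refilling to the limit, the second flush writes 25 memtables of 30MB each, leaving the 10MB ones in memory.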

In my test, loading a table with 48 buckets and a 2GB memory limit, the average memtable size was 44MB in the previous version;
after the modification, the average size is 82MB.
2021-11-01 10:51:50 +08:00
63662194ab [BUG] Fix Stream Load cost too much memory (#5875) 2021-05-25 10:34:10 +08:00
1a81b9e160 [MemTracker] Some enhancements of MemTracker (#5783)
1. Give some MemTrackers a reasonable parent MemTracker instead of the root tracker.
2. Make each MemTracker easy to trace.
3. Add a show level to MemTracker to control how many trackers are shown on the web page.
2021-05-19 09:27:50 +08:00
0131c33966 [Enhance] Improve the readability of memtrackers' name (#5455)
Improve the readability of memtrackers' names, so that you will be happy to read the web page be_ip:port/mem_tracker
2021-03-11 22:33:31 +08:00
ab06e92021 [Load Parallel][2/3] Support parallel flushing memtable during load (#5163)
In the previous implementation, in a load job,
multiple memtables of the same tablet are written to disk sequentially.
In fact, multiple memtables can be written out of order in parallel;
it is only necessary to ensure that each memtable uses a different segment writer.
2021-01-24 10:10:30 +08:00
6fedf5881b [CodeFormat] Clang-format cpp sources (#4965)
Clang-format all c++ source files.
2020-11-28 18:36:49 +08:00
068707484d Support sequence column for UNIQUE_KEYS Table (#4256)
* add sequence  col

Co-authored-by: yangwenbo6 <yangwenbo3@jd.com>
2020-09-04 10:10:17 +08:00
498b06fbe2 [Metrics] Support tablet level metrics (#4428)
Sometimes we want to detect the hotspots of a cluster, for example, hot scanned tablets or hot written tablets,
but we have no insight into the tablets in the cluster.
This patch introduces tablet level metrics to help achieve this objective. It now supports 4 metrics on tablets: `query_scan_bytes`, `query_scan_rows`, `flush_bytes`, `flush_count`.
However, one BE may hold hundreds of thousands of tablets, so I add a parameter for the metrics HTTP request,
and do not return tablet level metrics by default.
2020-09-02 10:39:41 +08:00
10f822eb43 [MemTracker] make all MemTrackers shared (#4135)
We make all MemTrackers shared, in order to show MemTracker real-time consumptions on the web.
As follows:
1. nearly all MemTracker raw ptr -> shared_ptr
2. Use CreateTracker() to create new MemTracker(in order to add itself to its parent)
3. RowBatch & MemPool still use raw ptrs of MemTracker. It's easy to ensure that the RowBatch & MemPool destructors execute
   before MemTracker's destructor, so we don't change this code.
4. MemTracker can use RuntimeProfile's counter to calculate consumption, so RuntimeProfile's counter needs to be shared
   too. We add a shared counter pool to store the shared counters, and don't change other counters of RuntimeProfile.
Note that this PR doesn't change the MemTracker tree structure, so there are still some orphan trackers, e.g. RowBlockV2's MemTracker. If you find that some shared MemTrackers track little memory but are too time-consuming, you could make them orphans; then it's fine to use the raw ptr.
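
The pattern in items 1 and 2 (shared ownership plus a factory that registers the new tracker with its parent) might look roughly like this. It is a simplified sketch, not the MemTracker class: the class name `Tracker`, the factory signature, and the use of `std::weak_ptr` for the parent-to-child link are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: trackers are only created through a factory, which
// returns a shared_ptr and registers the new tracker with its parent so a
// web page could walk the tree and show live consumption.
class Tracker {
public:
    static std::shared_ptr<Tracker> CreateTracker(std::string label,
                                                  std::shared_ptr<Tracker> parent) {
        std::shared_ptr<Tracker> t(new Tracker(std::move(label), std::move(parent)));
        if (t->parent_) {
            t->parent_->children_.push_back(t);  // parent keeps a weak reference
        }
        return t;
    }

    // Consumption is propagated up to every ancestor.
    void consume(std::int64_t bytes) {
        for (Tracker* t = this; t != nullptr; t = t->parent_.get()) {
            t->consumption_ += bytes;
        }
    }

    std::int64_t consumption() const { return consumption_; }
    const std::string& label() const { return label_; }

    // Number of children that are still alive.
    std::size_t live_children() const {
        std::size_t n = 0;
        for (const auto& c : children_) {
            if (!c.expired()) ++n;
        }
        return n;
    }

private:
    Tracker(std::string label, std::shared_ptr<Tracker> parent)
            : label_(std::move(label)), parent_(std::move(parent)) {}

    std::string label_;
    std::shared_ptr<Tracker> parent_;               // child keeps parent alive
    std::vector<std::weak_ptr<Tracker>> children_;  // no ownership cycle
    std::int64_t consumption_ = 0;
};
```

Because a child holds a strong reference to its parent while the parent holds only weak references to its children, a tracker can be destroyed as soon as its last user drops it, and the parent simply sees an expired entry.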
2020-07-31 21:57:21 +08:00
b58b1b3953 [metrics] Make DorisMetrics to be a real singleton (#3417) 2020-05-04 09:20:53 +08:00
7c4149cf27 Improve comparison and printing of Version (#2796)
* Improve comparison and printing of Version

There are two members in `Version`: `first` and `second`.
There are many places where we need to print one `Version` object and
compare two `Version` objects, but in the current code, these two members
are accessed directly, which makes the code very tedious.

This patch mainly does the following:
1. Adds an overload of `operator<<()` for `Version`, so
   we can directly print a Version object;
2. Adds the `contains()` method to determine whether there is a containment
   relationship;
3. Uses `operator==()` to determine if two `Version` objects are equal.
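
A minimal sketch of what these three additions could look like is given below. The member types, the `[first-second]` print format and the `to_string` helper are assumptions for illustration; only the member names `first` and `second` and the three operations come from the description above.

```cpp
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Sketch of a version range [first, second].
struct Version {
    std::int64_t first;
    std::int64_t second;

    // True when `other` lies entirely inside this range.
    bool contains(const Version& other) const {
        return first <= other.first && second >= other.second;
    }

    bool operator==(const Version& other) const {
        return first == other.first && second == other.second;
    }
};

// Print format is an assumption, chosen only for the example.
inline std::ostream& operator<<(std::ostream& os, const Version& v) {
    return os << "[" << v.first << "-" << v.second << "]";
}

// Hypothetical convenience helper built on operator<<.
inline std::string to_string(const Version& v) {
    std::ostringstream ss;
    ss << v;
    return ss.str();
}
```

Call sites can then write `LOG(INFO) << version` style code or `a.contains(b)` instead of comparing `first` and `second` by hand.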

Because there are too many places that need to be modified, there is still some
naked code left, which will be modified later.

This patch also removes some unnecessary header file references.

No functional changes in this patch.
2020-01-19 18:04:28 +08:00
913792ce2b Add copy_object() method for HLL columns when loading (#2422)
Currently, special treatment is used for HLL types (and OBJECT types).
When loading data, because there is no need to serialize HLL content
(the upper layer has already done so), we directly save the pointer
of the `HyperLogLog` object in `Slice->data` (at the corresponding `Cell`
in each `Row`) and set `Slice->size` to 0. This logic is different
from the one used when reading the HLL column. When reading, we need to deserialize
the HLL object from the `Slice` object. This causes us to have different
implementations of `copy_row()` when loading and reading.

In the optimization (commit: 177fec8917304e399aa7f3facc4cc4804e72ce8b),
the logic of `copy_row()` was added before a row can be added into the
`MemTable`, but the current `copy_row()` treats the HLL column `Cell`
as a normal Slice object (i.e. it will memcpy its data according to its size).

So this change adds a `copy_object()` method to `TypeInfo`, which is
used to copy the HLL column during loading data.

Note: The way of copying rows should be unified in the future. At that
time, we can delete the `copy_object()` method.
2019-12-11 22:07:51 +08:00
177fec8917 Improve SkipList memory usage tracking (#2359)
The problem with the current implementation is that all data to be
inserted will be counted in memory, but for the aggregation model or
some other special cases, not all data will be inserted into `MemTable`,
and these data should not be counted in memory.

This change makes the `SkipList` use an exclusive `MemPool`,
and only the data that will be inserted into the `SkipList` can use this
`MemPool`. In other words, discarded rows will not be
counted by the `MemPool` of `SkipList`.

In order to avoid checking twice whether a row already exists in
`SkipList`, this change also modifies the `SkipList` interface (a `Hint`
is fetched by `Find()` and then used in `InsertUseHint()`),
and makes `SkipList` no longer aware of the aggregation logic.
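
The find-then-insert-with-hint idea can be illustrated with `std::map`, which offers the same pattern through `lower_bound()` and `emplace_hint()`. This is an analogy rather than the `SkipList` interface itself: the function name `upsert_sum` is hypothetical, and SUM is used as a stand-in aggregation.

```cpp
#include <map>

// Search once, then either aggregate into the existing row or insert a new
// one at the position the search already found. The container itself knows
// nothing about the aggregation; the caller decides what to do with the hint.
// Returns true when a new key was inserted, false when the value was merged.
bool upsert_sum(std::map<int, long>& table, int key, long value) {
    auto hint = table.lower_bound(key);  // the only search
    if (hint != table.end() && hint->first == key) {
        hint->second += value;  // row exists: aggregate, nothing is inserted
        return false;
    }
    table.emplace_hint(hint, key, value);  // reuse the position as a hint
    return true;
}
```

Rows that are merged into an existing entry never allocate a new node, which mirrors the point above that discarded rows should not be charged to the list's memory.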

At present, because the data row (`Tuple`) generated by the upper layer
is different from the data row (`Row`) internally represented by the
engine, the data row must be copied when it is inserted into `MemTable`.
If the row needs to be inserted into `SkipList`, we need to copy it again
to the `MemPool` of `SkipList`.

And, at present, the aggregation function only supports `MemPool` when
copying, so even if the data will not be inserted into `SkipList`,
`MemPool` is still used (in the future, it can be replaced with an
ordinary `Buffer`). However, we reuse the allocated memory in `MemPool`,
that is, we do not reallocate new memory every time.

Note: Due to the characteristics of `MemPool` (once inserted, it cannot
be partially cleared), the following scenarios may still cause multiple
flushes. For example, if the aggregation model of a string column is `MAX`
and the data inserted at the same time is in ascending order, then each
data row must apply for memory from the `MemPool` in `SkipList`;
that is, although the old rows in `SkipList` will be discarded,
the memory they occupied will still be counted.

I did a test on my development machine using `STREAM LOAD`: a table with
only one tablet and all columns are keys, the original data was
1.1G (9318799 rows), and there were 377745 rows after removing duplicates.

The results show that both the number of files and the query efficiency are
greatly improved; the only price paid is a slight increase in load time.

before:
```
  $ ll storage/data/0/10019/1075020655/
  total 4540
  -rw------- 1 dev dev 393152 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_0_0.dat
  -rw------- 1 dev dev   1135 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_0_0.idx
  -rw------- 1 dev dev 421660 Dec  2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_10_0.dat
  -rw------- 1 dev dev   1185 Dec  2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_10_0.idx
  -rw------- 1 dev dev 184214 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_1_0.dat
  -rw------- 1 dev dev    610 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_1_0.idx
  -rw------- 1 dev dev 329181 Dec  2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_11_0.dat
  -rw------- 1 dev dev    935 Dec  2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_11_0.idx
  -rw------- 1 dev dev 343813 Dec  2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_12_0.dat
  -rw------- 1 dev dev    985 Dec  2 18:43 0200000000000004f5404b740288294b21e52b0786adf3be_12_0.idx
  -rw------- 1 dev dev 315364 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_2_0.dat
  -rw------- 1 dev dev    885 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_2_0.idx
  -rw------- 1 dev dev 423806 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_3_0.dat
  -rw------- 1 dev dev   1185 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_3_0.idx
  -rw------- 1 dev dev 294811 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_4_0.dat
  -rw------- 1 dev dev    835 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_4_0.idx
  -rw------- 1 dev dev 403241 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_5_0.dat
  -rw------- 1 dev dev   1135 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_5_0.idx
  -rw------- 1 dev dev 350753 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_6_0.dat
  -rw------- 1 dev dev    860 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_6_0.idx
  -rw------- 1 dev dev 266966 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_7_0.dat
  -rw------- 1 dev dev    735 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_7_0.idx
  -rw------- 1 dev dev 451191 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_8_0.dat
  -rw------- 1 dev dev   1235 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_8_0.idx
  -rw------- 1 dev dev 398439 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_9_0.dat
  -rw------- 1 dev dev   1110 Dec  2 18:42 0200000000000004f5404b740288294b21e52b0786adf3be_9_0.idx

  {
    "TxnId": 16,
    "Label": "cd9f8392-dfa0-4626-8034-22f7cb97044c",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 9318799,
    "NumberLoadedRows": 9318799,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 1079581477,
    "LoadTimeMs": 46907
  }

  mysql> select count(*) from xxx_before;
  +----------+
  | count(*) |
  +----------+
  |   377745 |
  +----------+
  1 row in set (0.91 sec)

```

after:
```
  $ ll storage/data/0/10013/1075020655/
  total 3612
  -rw------- 1 dev dev 3328992 Dec  2 18:26 0200000000000003d44e5cc72626f95a0b196b52a05c0f8a_0_0.dat
  -rw------- 1 dev dev    8460 Dec  2 18:26 0200000000000003d44e5cc72626f95a0b196b52a05c0f8a_0_0.idx
  -rw------- 1 dev dev  350576 Dec  2 18:26 0200000000000003d44e5cc72626f95a0b196b52a05c0f8a_1_0.dat
  -rw------- 1 dev dev     985 Dec  2 18:26 0200000000000003d44e5cc72626f95a0b196b52a05c0f8a_1_0.idx

  {
    "TxnId": 12,
    "Label": "88f606d5-8095-4f15-b61d-49b7080c16b8",
    "Status": "Success",
    "Message": "OK",
    "NumberTotalRows": 9318799,
    "NumberLoadedRows": 9318799,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 1079581477,
    "LoadTimeMs": 48771
  }

  mysql> select count(*) from xxx_after;
  +----------+
  | count(*) |
  +----------+
  |   377745 |
  +----------+
  1 row in set (0.38 sec)

```
2019-12-06 17:31:18 +08:00
c5f7f7e0f4 Check the return status of _flush_memtable_async() (#2332)
This commit also contains some adjustments to forward declarations
2019-11-29 21:05:17 +08:00
62acf5d098 Limit the memory usage of Loading process (#1954) 2019-10-15 09:26:20 +08:00
4e8d728e75 Remove unused code and unnecessary check (#1918) 2019-09-30 18:35:30 +08:00
cafb9f1e62 Replace Arena with MemPool first step (#1899) 2019-09-28 01:12:22 +08:00
b246d93128 Avoid SerDe for aggregation query with object pool (#1854) 2019-09-26 13:51:13 +08:00
c643cbd30c Optimize the load performance for large file (#1798)
The current load process is:

Tablet Sink -> Tablet Channel Mgr -> Tablets Channel -> Delta Writer -> MemTable -> Flush to disk

In the path of Tablets Channel -> DeltaWriter -> MemTable -> Flush to disk, the following operations are performed:

1. Insert tuples into different memtables according to tablet ID.
2. When the memtable size reaches the threshold, it is written to disk.

The above operations are equivalent to single thread execution for a single load task.
In fact, the insertion into a memtable and the flush of a memtable can be executed concurrently.
Performing these operations in separate threads prevents the insertion into the memtable from being delayed by slow disk writing.

In the new implementation, I added a MemTableFlushExecutor class with a set of flush queues and corresponding worker threads.
By default, each data directory uses two worker threads for flush, which can be modified by the parameter flush_thread_num_per_store of BE.
DeltaWriter will push the full memtable to MemTableFlushExecutor for flush operation and generate a new memtable for receiving new data.
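
The hand-off described above can be sketched as follows. This is a single-threaded model for illustration only: `MemTable`, `FlushQueue` and `Writer` are hypothetical stand-ins for the real classes, and `drain()` is called by hand where the real executor would have worker threads taking memtables off the queue.

```cpp
#include <cstddef>
#include <deque>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct MemTable {
    std::vector<std::string> rows;
    std::size_t bytes = 0;
};

// Stand-in for the flush executor: full memtables wait here until flushed.
class FlushQueue {
public:
    void submit(std::unique_ptr<MemTable> full) { pending_.push_back(std::move(full)); }

    // Flush everything that is queued; returns the number of rows written.
    std::size_t drain() {
        std::size_t flushed_rows = 0;
        while (!pending_.empty()) {
            flushed_rows += pending_.front()->rows.size();
            pending_.pop_front();
        }
        return flushed_rows;
    }

    std::size_t size() const { return pending_.size(); }

private:
    std::deque<std::unique_ptr<MemTable>> pending_;
};

// Stand-in for the writer: it never waits for the disk, it only hands off
// the full memtable and immediately starts a fresh one.
class Writer {
public:
    Writer(FlushQueue& queue, std::size_t threshold_bytes)
            : queue_(queue), threshold_(threshold_bytes), current_(new MemTable) {}

    void write(const std::string& row) {
        current_->rows.push_back(row);
        current_->bytes += row.size();
        if (current_->bytes >= threshold_) {
            queue_.submit(std::move(current_));
            current_.reset(new MemTable);
        }
    }

    std::size_t buffered_rows() const { return current_->rows.size(); }

private:
    FlushQueue& queue_;
    std::size_t threshold_;
    std::unique_ptr<MemTable> current_;
};
```

The point of the design is that `write()` returns as soon as the full memtable is queued, so slow disk writes no longer block the insertion path.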

This design can improve the performance of loading large files.
In single host testing, the time to load a 1GB text file is reduced from 48 seconds to 29 seconds.
2019-09-25 13:49:32 +08:00
9aa2045987 Refactor alter job (#1695) 2019-09-12 16:31:29 +08:00
cd5cfea5cc Encapsulate HLL logic (#1756) 2019-09-09 15:52:10 +08:00
a63989cc61 Use RowsetFactory to create and init RowsetWriter (#1740) 2019-09-04 17:02:43 +08:00
1e4dd77d2a Add bitmap agg type and udaf (#1610) 2019-08-26 14:24:42 +08:00
c5edf9dae0 Unify Field and ColumnSchema in Storage (#1561)
Currently, we have Field and ColumnSchema to access column data in a
row. These two classes are mostly the same, so we should unify them into
one class. Now, Field has offset information, which is a row attribute,
so we remove offset in Field.

RowCursor now has some logic which belongs to Schema, so in this patch I
add a Schema attribute to RowCursor to make RowCursor simple. After this
change, only Schema will handle Field/ColumnSchema.

I extract some logic from RowCursor to be/src/olap/row.h, so that we can
use the same logic to handle different types of row. Each type of row has
the same function to get a Cell of this row. A cell represents a column's
content with a null indicator.
2019-07-30 14:01:57 +08:00
dbc912d2df Unify ColumnSchemaV2 and ColumnSchema to one (#1545)
Currently, we have two versions of ColumnSchema, in this patch, we unify
these two classes to one class.
2019-07-25 10:48:16 +08:00
a88b55e649 Add more logs and metrics to trace the broker load process (#1530)
The operator wants to know when the job is scheduled as PENDING
and LOADING, and how long it takes to finish these sub-states.

Also add 2 metrics on BE to monitor the memtable's flush time:
`memtable_flush_total` and `memtable_flush_duration_us`
2019-07-23 21:42:44 +08:00
0d48a3961c Refactor Storage Engine (#1478)
NOTE: This patch will modify all of the Backend's data,
and this will make restarting the BE take a very long time.
So if you do not want to interfere with your production environment,
you should upgrade backends one by one.

1. Refactor BE to clarify the structure of the code.
2. Use a unique id to identify a rowset.
   Naming a rowset with tablet_id and version leads to
   many conflicts among compaction, clone, and restore.
3. Extract a rowset interface to encapsulate rowsets
   with different formats.
2019-07-15 21:18:22 +08:00
c34b306b4f Decimal optimize branch #695 (#727) 2019-03-22 17:22:16 +08:00
6b4049e21c Unify Slice code path (#380) 2018-12-03 18:11:47 +08:00
1ba8a4ee4e Transform row-oriented table to columnar-oriented table (#311) 2018-11-16 16:03:56 +08:00
37b4cafe87 Change variable and namespace name in BE (#268)
Change 'palo' to 'doris'
2018-11-02 10:22:32 +08:00
2868793b6b Change license to Apache License 2.0 (#262) 2018-11-01 09:06:01 +08:00
5d3fc80067 Added:
* Add streaming load feature. You can execute 'help stream load;' to see more information.

Changed:
* Loading phase of a certain table can be parallelized, to reduce the load job execution time when multiple load jobs target a single table.
* Using RocksDB to save the header info of tablets in Backends, to reduce the IO operations and increase the speed of restarting.

Fixed:
* A lot of bugs fixed.
2018-10-31 14:46:22 +08:00