Commit Graph

48 Commits

Author SHA1 Message Date
20ef8a6e21 [feature-wip](remote storage)(step1) use a struct instead of string for parameter path, add basic remote method (#7098)
For the first, we need to make a parameter to discribe the data is local or remote.
At then, we need to support some basic function to support the operation for remote storage.
2021-12-22 22:58:23 +08:00
dd36ccc3bf [feature](storage-format) Z-Order Implement (#7149)
Support sort data by Z-Order:

```
CREATE TABLE table2 (
siteid int(11) NULL DEFAULT "10" COMMENT "",
citycode int(11) NULL COMMENT "",
username varchar(32) NULL DEFAULT "" COMMENT "",
pv bigint(20) NULL DEFAULT "0" COMMENT ""
) ENGINE=OLAP
DUPLICATE KEY(siteid, citycode)
COMMENT "OLAP"
DISTRIBUTED BY HASH(siteid) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"data_sort.sort_type" = "ZORDER",
"data_sort.col_num" = "2",
"in_memory" = "false",
"storage_format" = "V2"
);
```
2021-12-02 11:39:51 +08:00
6c6380969b [refactor] replace boost smart ptr with stl (#6856)
1. replace all boost::shared_ptr to std::shared_ptr
2. replace all boost::scopted_ptr to std::unique_ptr
3. replace all boost::scoped_array to std::unique<T[]>
4. replace all boost:thread to std::thread
2021-11-17 10:18:35 +08:00
7297b275f1 [Optimize] Optimize cpu consumption when importing parquet files (#6782)
Remove part of dynamic_cast, reduce the overhead caused by type conversion,
and probably reduce the cpu consumption of parquet file import by about 10%
2021-10-03 12:14:35 +08:00
d8cde8c044 (#6454) Remove useless code for Segment V2 (#6455) 2021-09-02 09:59:21 +08:00
8738ce380b Add long text type STRING, with a maximum length of 2GB. Usage is similar to varchar, and there is no guarantee for the performance of storing extremely long data (#6391) 2021-08-18 09:05:40 +08:00
ed3ff470ce [ARRAY] Support array type load and select not include access by index (#5980)
This is part of the array type support and has not been fully completed. 
The following functions are implemented
1. fe array type support and implementation of array function, support array syntax analysis and planning
2. Support import array type data through insert into
3. Support select array type data
4. Only the array type is supported on the value lie of the duplicate table

this pr merge some code from #4655 #4650 #4644 #4643 #4623 #2979
2021-07-13 14:02:39 +08:00
4d64612b96 [ARRAY]Save array's size instead of offset. (#5983)
* Save array's size instead of offset.

* Optimize variable name

* Fix comment
2021-06-10 12:32:58 +08:00
b423274f17 [Enhance] Make MemTracker more accurate (#5515) (#5516)
* [Enhance] Make MemTracker more accurate (#5515)
 This PR main about:
 1. Improve the readability of MemTrackers' name
 2. Add the MemTracker of:
    * Load
    * Compaction
    * SchemaChange
    * StoragePageCache
    * TabletManager
 3. Change SchemaChange to a Singleon

* revise some code for Code Review

* change the name of mem_tracker

* keep reader_context have the same lifetime of rowset_reader in schema change.

* change vlog notice to log(warning) in schema change
2021-04-08 09:14:55 +08:00
139709d060 [Storage] Optimize Zone map create policy (#5260)
If there are too large fields in the table, there may be only one row in each page,
and this row also has a zone map index
This causes the stored data to expand three times the original data,
It also takes up more memory when reading those segments
Therefore, we need to Disable the creation of zonemap indexes for segments with too few rows
2021-01-24 10:11:21 +08:00
a5298d617d [Performance Improve] Push Down _conjunctf of 'not in' and '!=' to Storage Engine. (#5207) 2021-01-23 21:07:01 +08:00
6c098e45fc [Optimize][Cache]Implementation of Separated Page Cache (#5008)
#4995
**Implementation of Separated Page Cache**
- Add config "index_page_cache_ratio" to set the ratio of capacity of index page cache
- Change the member of StoragePageCache to maintain two type of cache
- Change the interface of StoragePageCache for selecting type of cache
- Change the usage of page cache in read_and_decompress_page in page_io.cpp
  - add page type as argument
  - check if current page type is available in StoragePageCache (cover the situation of ratio == 0 or 1)
- Add type as argument in superior call of read_and_decompress_page
- Change Unit Test
2021-01-04 12:19:24 +08:00
11c0aafa5c [UT] Speed up BE unit test (#5131)
There are some long loops and sleeps in unit tests, it will cost a
very long time to run all unit tests, especially run in TSAN mode.
This patch speed up unit tests by shortening long loops and sleeps,
on my environment all unit tests finished in 1 minite. It's useful
to do basic functional unit tests.
You can switch to run in this mode by adding a new environment variable
'DORIS_ALLOW_SLOW_TESTS'. For example, you can set:
export DORIS_ALLOW_SLOW_TESTS=1
and also you can disable it by setting:
export DORIS_ALLOW_SLOW_TESTS=0
2020-12-27 22:19:56 +08:00
6fedf5881b [CodeFormat] Clang-format cpp sources (#4965)
Clang-format all c++ source files.
2020-11-28 18:36:49 +08:00
10e1e29711 Remove header file common/names.h (#4945) 2020-11-26 17:00:48 +08:00
b48c768dc7 [ComplexType] Restructure storage type to support complex types expending (#4905)
This CL includes:
* Change the column metadata to a tree structure.
* Refactor the segment_v2.ColumnReader and sgment_v2.ColumnWriter to support complex type.
* Implements the reading and writing of array type.
2020-11-16 21:59:41 +08:00
75e0ba32a1 Fixes some be typo (#4714) 2020-10-13 09:37:15 +08:00
9785e103ea [Bug] Fix bug that delete stmt with filter condition delete all data from table on segment v2 (#3943)
When we get different columns's row ranges by column_delete_conditions, we should use union operation instead of intersection operation to get final get final row ranges.
The root cause is that we lost the relationship of the two delete conditions in same delete stmt.
Base data:
```
    k1, k2
    1,  2
    1,  3
case 1:
   delete from tbl where k1=1 and k2=2;
case 2:
   delete from tbl where k1=1;
   delete from tbl where k2=2;
```
We treat the above 2 cases as same, which is incorrect.
So we need to process every rowset of delete conditions separately.
2020-07-02 11:07:23 +08:00
8aa8b8c96d [Code Refactor] Using block manager to unify the data file access. (#3189)
Earlier we introduced `BlockManager` to separate data access logic from
underlying file read and write logic.

This CL further unifies all `SegmentV2` data access to the `BlockManager`, 
removes the previous `FileManager` class, and move the file cache to the `FileBlockManager`.

There are no logical changes to this CL.

After this CL, all user table data is read through the `WritableBlock` and `ReadableBlock` 
returned by the `BlockManager`, and no file operations are performed directly.
2020-03-25 20:39:07 +08:00
6beadfda71 [Bug] Fix delete predicate bug for segment v2 (#3164)
This bug is because the min and max wrapper field is not initialized
when there is no predicate of that column.
2020-03-20 20:35:55 +08:00
d2d95bfa84 [segment_v2] Switch to Unified and Extensible Page Format (#2953)
Fixes #2892 

IMPORTANT NOTICE: this CL makes incompatible changes to V2 storage format, developers need to create new tables for test.

This CL refactors the metadata and page format for segment_v2 in order to
* make it easy to extend existing page type
* make it easy to add new page type while not sacrificing code reuse
* make it possible to use SIMD to speed up page decoding

Here we summary the main code changes
* Page and index metadata is redesigned, please see `segment_v2.proto`
* The new class `PageIO` is the single place for reading and writing all pages. This removes lots of duplicated code. `PageCompressor` and `PageDecompressor` are now useless and removed. 
* The type of value ordinal is changed from `rowid_t` to 64-bits `ordinal_t`, this affects ordinal index as well.
* Column's ordinal index is now implemented by IndexPage, the same with IndexedColumn.
* Zone map index is now implemented by IndexedColumn
2020-02-27 15:09:57 +08:00
625411bd28 Doris support in memory olap table (#2847) 2020-02-18 10:45:54 +08:00
6c33f80544 Add disable_storage_page_cache config (#2890)
1. when read column data page:
    for compaction, schema_change, check_sum: we don't use page cache
    for query and config::disable_storage_page_cache is false, we use page cache
2. when read column index page
    if config::disable_storage_page_cache is false, we use page cache
2020-02-16 19:13:30 +08:00
a27e89065b Add file cache for v2 (#2782)
Add file descriptor cache for segment v2 to solve too many open file problems
2020-02-04 00:16:01 +08:00
5229ea24da Fix bloom filter statistics bug (#2609) 2019-12-30 23:23:39 +08:00
5fd7133e69 Fix bitmap, hll, segment v2 DefaultValue bug (#2570)
1. Change the bitmap and HLL default value to empty bitmap and empty bitmap HLL
2. Fix DefaultValueColumnIterator bug
3. Fix uint24.h ostream bug
2019-12-27 14:01:45 +08:00
b4d935ab37 Fix compaction with delete rowset bug (#2523)
[STORAGE][SEGMENTV2]
when base compaction rowsets with delete rowset of more than two
condition, stats rows_del_filtered is wrong and compaction will
fail because of line check.
2019-12-21 12:13:46 +08:00
d31f774852 Add block split bloom filter (#2471)
[STORAGE][SEGMENTV2]

    use block split bloom filter
    build bloom filter against data page
    add distinct value to bloom filter
    add ordinal index to bloom filter index
2019-12-18 12:57:44 +08:00
f828670245 Add Bitmap index reader (#2319)
[STORAGE] [INDEX]

For #2061 and #2062

Add bitmap index reader
SegmentIterator support bitmap index
Add some metrics
2019-12-03 23:01:40 +08:00
068eed8eb0 Add delete state of row block v2 for performance (#2055) 2019-11-11 20:07:37 +08:00
c25e826dce Fix default value column bug (#2134) 2019-11-07 19:06:24 +08:00
cfc98e3571 Fix string type column zone map bug (#2144)
string type column's zone map of segment is wrong and segments are filtered incorrectly.
2019-11-07 15:57:38 +08:00
d25f0ba69a Make ColumnReader load lazily (#2026)
[Storage][SegmentV2]
Currently `segment_v2::Segment::open` will eagerly initialize all column readers, regardless of whether the column is queried or not. Initializing `segment_v2::ColumnReader` incurs additional I/O cost to read ordinal index and zonemap index and should be delayed to the time it's needed.
2019-10-23 10:25:28 +08:00
9c2d149c36 add profile for segment v2 (#2015) 2019-10-22 09:43:16 +08:00
8aa2cbe12d Load Rowset only once in a thread-safe manner (#2022)
[Storage]
This PR implements thread-safe `Rowset::load()` for both AlphaRowset and BetaRowset. The main changes are 

1. Introduce `DorisCallOnce<ReturnType>` to be the replacement for `DorisInitOnce` . It works for both Status and OLAPStatus.
2. `segment_v2::ColumnReader::init()` is now implemented by DorisCallOnce.
3. `segment_v2::Segment` is now created by a factory open() method. This guarantees all Segment instances are in opened state.
4. `segment_v2::Segment::_load_index()` is now implemented by DorisCallOnce.
5. Implement thread-safe load() for AlphaRowset and BetaRowset
2019-10-21 16:05:12 +08:00
8aa8e08f27 v2 segment support string encode(#1766) (#1816)
major change

change data format of binary dict page, appending (dict page data) and (dict page offset) to binary dict page;
add new decoding method for new binary dict page format
add ut for segment test
set the elements of initial array to 0 ,when calling arena.AllocateNewBlock
hard code way to choose dict coding for string
0919 commit major change

change dict file format:when saving binary dict page, separate dict page from dict page,one dict page may have multi data pages;when reading a binary dict page,one ColumnReader keeps one dict page
loading dict when calling column_reader._read_page
3.rollback BinaryDictPage
no longer using memset(0) to inital column_zonemap.max_value
0926 17 commit major change

init column_zone_map min value column_zone_map slice's data array;
set char/varchar column_zone_map'max value size to 0
add ut for char column zone map query hit/miss
0929 10 commit major change

allocate mem for column_zone_map 's max and min value
direct copy content to column_zone_map's max and min value
2019-09-30 16:25:31 +08:00
2cecf5901f Fix segment v2 bug (#1904) 2019-09-30 13:50:39 +08:00
8d0fee7e64 Add default value column iterator #1834 (#1835) 2019-09-24 14:39:10 +08:00
fe27969978 add delete predicate filter(#1636) (#1745)
Delete predicate can be used to prune data by zone map.
2019-09-24 14:38:19 +08:00
0f44ce99ce Fix segment v2 comment (#1769) 2019-09-09 18:26:48 +08:00
65dcabf1df Use crc32c checksum for segment v2 (#1753) 2019-09-06 15:23:57 +08:00
6d040a33af Add zone map page(#1390) (#1633) 2019-08-24 00:57:30 +08:00
acf868c9d0 Support page compression and checksum in BetaRowset (#1646) 2019-08-19 09:40:47 +08:00
2bd01b23c7 Add page cache for column page in BetaRowset (#1607) 2019-08-12 10:42:00 +08:00
67b370a1ed Add ColumnBlock (#1450)
Use ColumnBlock to read data from Page.
2019-07-09 21:52:27 +08:00
4747bed306 Add rle page (#1379) 2019-06-27 22:25:19 +08:00
b17d1c5348 Fix a bug of v2 ColumnReader when reading not-null column (#1398) 2019-06-26 22:58:30 +08:00
e30844a321 Add column reader writer for segment V2 (#1346) 2019-06-25 16:59:26 +08:00