Commit Graph

130 Commits

Author SHA1 Message Date
1228995dec [improvement](segment) reduce memory footprint of column_reader and segment (#24140) 2023-09-11 21:54:00 +08:00
153c7982f3 [Optimize](invert index) Optimize multiple terms conjunction query (#23871) 2023-09-09 01:52:58 +08:00
f7a3d2778a [FIX](array)update array olapconvertor and support array nested other complex type (#23489)
* update array olapconvertor and support array nested other complex type

* update for inverted index
2023-08-29 16:18:11 +08:00
9c65b7ab96 [improvement](column_reader) move load once to index reader to reduce (#23537)
memory footprint of column reader
2023-08-29 09:34:27 +08:00
392437008c [Improvement](ColumnReader) optimize memory using of ColumnReader meta (#23528) 2023-08-28 17:57:59 +08:00
687c676160 [FIX](map)fix column map for offset next_array_item_rowid order (#23250)
* fix column map for offset next_array_item_rowid order

* add regress test
2023-08-24 10:57:40 +08:00
Pxl
8ed4045df9 [Chore](primitive-type) remove VecPrimitiveTypeTraits (#22842) 2023-08-23 08:37:40 +08:00
707a527775 [FIX](map)insert into doris table with array/map type by local tvf (#22955) 2023-08-15 13:11:23 +08:00
abc21f5d77 [bugfix](ngram bf index) process differently for normal bloom filter index and ngram bf index (#21310)
* process differently for normal bloom filter index and ngram bf index

* fix review comments for readbility

* add test case

* add testcase for delete condition
2023-07-13 17:31:45 +08:00
9d2f879bd2 [Enhancement](inverted index) make InvertedIndexReader shared_from_this (#21381)
This PR proposes several changes to improve code safety and readability by replacing raw pointers with smart pointers in several places.

use enable_factory_creator in InvertedIndexIterator and InvertedIndexReader, remove explicit new constructor.
make InvertedIndexReader shared_from_this, it may desctruct when InvertedIndexIterator use it.
2023-07-06 11:52:59 +08:00
7f0e37069f [improvement](olap) filter the whole segment by dictionary (#21239) 2023-06-29 10:34:29 +08:00
28abeef72b [performace](colddata) opt cold data read performance (#21141)
In #10370, we try to opt string evaluate performance by rewrite the predicate using dict value. But it has to check if the string column is full dict encoding. So that we add a logic to read the last page of the string column to check it.

But it has some bad performance for cold data because it has to load the column's ordinal index and zone map index. In some scenario for example, select * from table where pk_col=1. If the query condition is primary key, the result maybe just a few rows but the result may have 100 columns, it will cost a lot of time to load these indices. We could find a lot of time is spending on block_init_time.

In my test, a table with 50 string columns and query with primary key.

The first read time will reduce from 220ms to 40ms.
2023-06-26 10:39:20 +08:00
4bc221aa25 [improvement](column reader) lazy load indices (#20456)
Currently when reading column data, all types of indice are read even if they are not actually used, this PR implements lazy load of indices.
2023-06-06 16:36:06 +08:00
608d2a3eca [Bug](exec) push down no group by agg min cause error result (#20289)
sql """
CREATE TABLE t1_int (
num int(11) NULL,
dgs_jkrq bigint(20) NULL
) ENGINE=OLAP
DUPLICATE KEY(num)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(num) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"storage_format" = "V2",
"light_schema_change" = "true",
"disable_auto_compaction" = "false",
"enable_single_replica_compaction" = "false"
);
"""
sql """insert into t1_int values(1,1),(1,2),(1,3),(1,4),(1,null);"""
qt_sql """
select min(dgs_jkrq) from t1_int;
"""

get the error result:4

after change we get the right result:1
2023-06-01 17:29:46 +08:00
90b4e127e3 [Feature](inverted index) add parser_mode properties for inverted index parser (#20116)
We add parser mode for inverted index, usage like this:
```
CREATE TABLE `inverted` (
  `FIELD0` text NULL,
  `FIELD1` text NULL,
  `FIELD2` text NULL,
  `FIELD3` text NULL,
  INDEX idx_name1 (`FIELD0`) USING INVERTED PROPERTIES("parser" = "chinese", "parser_mode" = "fine_grained") COMMENT '',
  INDEX idx_name2 (`FIELD1`) USING INVERTED PROPERTIES("parser" = "chinese", "parser_mode" = "coarse_grained") COMMENT ''
) ENGINE=OLAP
);
```
2023-05-29 23:21:52 +08:00
ab8125d56f [Improve](performance) introduce SchemaCache to cache TabletSchame & Schema (#20037)
* [Improve](performance) introduce SchemaCache to cache TabletSchame & Schema

1. When the system is under high-concurrency load with wide table point queries, the frequent memory allocation and deallocation of Schema become evident system bottlenecks. Additionally, the initialization of TabletSchema and Schema also becomes a CPU hotspot.Therefore, the introduction of a SchemaCache is implemented to cache these resources for reuse.

2. Make some variables wrapped with std::unique<unique_ptr>

Performance:
| 状态              | QPS | 平均响应时间 (avg) | P99 响应时间 |
|------------------|-----|------------------|-------------|
| 开启 SchemaCache | 501 | 20ms             | 34ms        |
| 关闭 SchemaCache | 321 | 31ms             | 61ms        |

* handle schema change with schema version

* remove useless header

* rebase
2023-05-29 17:34:53 +08:00
Pxl
dfad7b6b38 [Feature](generic-aggregation) some prowork of generic aggregation (#19343)
some prowork of generic aggregation
2023-05-09 21:42:21 +08:00
e412dd12e8 [chore](build) Use include-what-you-use to optimize includes (PART II) (#18761)
Currently, there are some useless includes in the codebase. We can use a tool named include-what-you-use to optimize these includes. By using a strict include-what-you-use policy, we can get lots of benefits from it.
2023-04-19 23:11:48 +08:00
Pxl
c9b4eaea76 [Chore](storage) change FieldType to enum class #18500 2023-04-10 08:53:44 +08:00
d36e9bd523 [chore](scan) Disable low cardinality optimization for compaction (#18424) 2023-04-07 14:19:11 +08:00
06dee69174 [Refactor](map) remove using column array in map to reduce offset column (#17330)
1. remove column array in map 
2. add offsets column in map 
Aim to reduce duplicate offset  from key-array and value-array in disk
2023-03-09 11:22:26 +08:00
8ccc805cd0 [Fix](Lightweight schema Change) query error caused by array default type is unsupported (#17331)
We have supportted array type default [], but when using lightweight schema Change to add column array type, query failed as follows:

Fix "array default type is unsupported" error.
Fix the default value filling assignment digit problem.
2023-03-07 16:30:41 +08:00
de725d5d44 [bugfix](column_reader) index_page should not be pre-decoded (#16605)
In our current logic, index page will be pre-decoded but it will return OK
as index page use BinaryPlainPageBuilder and first 4 bytes of the page is a offset
so it's high probablility not equal to EncodingTypePB::DICT_ENCODING which
is 5.
Code in bitshuffle_page_pre_decode.h
```
   	if constexpr (USED_IN_DICT_ENCODING) {
            auto type = decode_fixed32_le((const uint8_t*)&data.data[0]);
            if (static_cast<EncodingTypePB>(type) != EncodingTypePB::DICT_ENCODING) {
                return Status::OK();
            }
            size_of_dict_header = BINARY_DICT_PAGE_HEADER_SIZE;
            data.remove_prefix(4);
        }
```
But if type just equal to EncodingTypePB::DICT_ENCODING and then it will use
BitShuffle to decode BinaryPlainPage, which will leads to an fatal error.
2023-02-14 00:06:14 +08:00
aba843bb2b [Improvement](inverted index) inverted index query match bitmap cache (#16578)
Add cache for inverted index query match bitmap to accelerate common query keyword, especially for keyword matching many rows. 

Tests result:
- large result: matching 99% out of 247 million rows shows 8x speed up.
- small result: matching 0.1% out of 247 million rows shows 2x speed up.
2023-02-11 13:38:58 +08:00
1b3902baa2 [Feature](Complex-type) Add struct and map type to Doris (#16444)
This commit support:
1、Insert + select for struct/map type
2、Json stream load for struct type
3、m[key] function for map type

How to use:
Set the fe config to create table for struct and map type
1、admin set frontend config("enable_struct_type" = "true");
2、admin set frontend config("enable_map_type" = "true");

#16547

Co-authored-by: xy720 <xuyang25@baidu.com>
Co-authored-by: amory <wangqiannan@selectdb.com>
Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>
Co-authored-by: hucheng01 <hucheng01@baidu.com>
2023-02-10 11:00:33 +08:00
5eaa995704 [refactor](some mempool) not memset 0 in default value iterator (#16194)
---------

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-01-29 22:50:39 +08:00
199d7d3be8 [Refactor]Merged string_value into string_ref (#15925) 2023-01-22 16:39:23 +08:00
42b5d17fa1 [refactor](remove non vec) remove column block and column view (#16022)
* [refactor](remove non vec) remove column block and column view and column vectorized batch

Co-authored-by: yiguolei <yiguolei@gmail.com>
2023-01-18 12:40:53 +08:00
31cc99964c [Feature-WIP](inverted index)(bkd) bdk index'reader implementation which in inverted index using for numeric types (#15994)
Step3 of DSIP-023: Add inverted index for full text search
implementation of bkd index's reader which in inverted index using for numeric types
dependency pr: #14211 #15807 #15823
2023-01-18 09:24:19 +08:00
95397ff05d [refactor](array) remove depandancy of ColumnBlock, ColumnBlockView (#16002)
change to vectorized::MutableColumnPtr
2023-01-17 19:16:16 +08:00
b1caa68706 [Feature-WIP](inverted index) inverted index reader's implementation, and add mysql_fulltext regression case to test fulltext query (#15823)
Issue Number: Step2 of DSIP-023: Add inverted index for full text search
implementation of inverted index reader

dependency pr: #14211 #15807 #15821
2023-01-17 09:13:56 +08:00
Pxl
b727033906 [Chore](build) enable -Wextra and remove some -Wno (#15760)
enable -Wextra and remove some -Wno
2023-01-15 10:40:35 +08:00
edecc2e706 [feature-wip](inverted index) API for inverted index reader and syntax for fulltext match (#14211)
* [feature-wip](inverted index)inverted index api: reader

* [feature-wip](inverted index) Fulltext query syntax with MATCH/MATCH_ALL/MATCH_ALL

* [feature-wip](inverted index) Adapt to index meta

* [enhance] add more metrics

* [enhance] add fulltext match query check for column type and index parser

* [feature-wip](inverted index) Support apply inverted index in compound predicate which except leaf node of and node
2022-12-30 21:48:14 +08:00
81fece5360 [improvement](cache) close compaction&schema_change&checksum index meta cache (#14586) 2022-11-26 12:15:32 +08:00
d55faa7f6a [feature](remote)Only query can use local cache when reading remote files. (#13865)
When calling select on remote files, download cache files to local disk.
When calling alter table on remote files, read files directly from remote storage. So if tablet is too large, it will not take up too many local disk when creating local cache file.
2022-11-14 10:30:15 +08:00
42b2725f03 [Bug](delete) Fix wrong delete operation (#13840) 2022-11-01 13:38:43 +08:00
20b583c91e [Bug](array-type) Fix memory buffer overflow (#13074) 2022-10-10 11:42:13 +08:00
59699a4321 [feature](JSON datatype)Support JSON datatype (#10322)
Add `JSON` datatype, following features are implemented by this PR:
1. `CREATE` tables with `JSON` type columns
2. `INSERT` values containing `JSON` type value stored in `String`, which is represented as binary format(AKA `JSONB`) at BE 
3. `SELECT` JSON columns

Detail design refers [DSIP-016: Support JSON type](https://cwiki.apache.org/confluence/display/DORIS/DSIP-016%3A+Support+JSON+type)

* add JSONB data storage format type

* fix JsonLiteral resolve bug

* add DataTypeJson case in data_type_factory

* add JSON syntax check in FE

* add operators for jsonb_document, currently not support comparison between any JSON type value

* add ColumnJson and DataTypeJson

* add JsonField to store JsonValue

* add JsonValue to convert String JSON to BINARY JSON and JsonLiteral case for vliteral

* add push_json for MysqlResultWriter

* JSON column need no zone_map_index

* Revert "JSON column need no zone_map_index"

This reverts commit f71d1ce1ded9dbae44a5d58abcec338816b70d79.

* add JSON writer and reader, ignore zone-map for JSON column

* add json_to_string for DataTypeJson

* add olap_data_convertor for JSON type

* add some enum

* add OLAP_FIELD_TYPE_JSON type, FieldTypeTraits for it and corresponding cases or functions

* fix column_json offsets overflow bug, format code

* remove useless TODOs, add CmpType cases for JSON type

* add license header

* format license

* format be codes

* resolve rebase master conflicts

* fix bugs for CREATE and meta related code

* refactor JsonValue constructors, add fe JSON cases and fix some bugs, reformat codes

* modification be codes along code review advice

* fix rebase conflicts with master

* add unit test for json_value and column_json

* fix rebase error

* rename json to jsonb

* fix some data convert bugs, set Mysql type to JSON
2022-09-25 14:06:49 +08:00
f7e3ca29b5 [Opt](Vectorized) Support push down no grouping agg (#12803)
Support push down no grouping agg
2022-09-23 18:29:54 +08:00
b136d80e1a [enhancement](compress) reuse compression ctx and buffer (#12573)
Reuse compression ctx and buffer.
Use a global instance for every compression algorithm, and use a
thread saft buffer pool to reuse compression buffer, pool size is equal
to max parallel thread num in compression, and this will not be too large.

Test shows this feature increase 5% of data import and compaction.

Co-authored-by: yixiutt <yixiu@selectdb.com>
2022-09-15 10:59:46 +08:00
f50054f547 [Enhancement](array-type) record offsets info to speed up the seek performance (#12293)
Store the offset rather than the length in file for the data with array type. The new file format can improve the seek performance. Please refer to #12246 to get the performance report.

Co-authored-by: xy720 <22125576+xy720@users.noreply.github.com>
2022-09-14 22:41:54 +08:00
5f7d6e8f2b [Refactor](predicate) Unify Conditions and ColumnPredicate (#11985) 2022-08-29 12:11:22 +08:00
6ffdd0cbdd [Fix] Coredump in column dictionary insert data from DefaultValueColumnIterator (#11217)
Co-authored-by: lihaopeng <lihaopeng@baidu.com>
2022-07-28 14:08:24 +08:00
eec142ae90 [Enhancement] Use shared file reader when read a segment (#10896)
* readers under a segment use a shared FileReader

* no need to cache fd in LocalFileReader
2022-07-17 07:54:58 +08:00
7be2ef79ed array column support read by rowids (#10886)
Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>
2022-07-15 21:19:02 +08:00
3b46242483 [feature-wip] Optimize Decimal type (#10794)
* [feature-wip](decimalv3) support decimalv3

* [feature-wip] Optimize Decimal type

Co-authored-by: liaoxin <liaoxinbit@126.com>
2022-07-14 10:50:50 +08:00
ed4b2140d7 [BUG](datev2) fix bloom filter for datev2 and remove redundant code (#10695) 2022-07-09 06:26:28 +08:00
331fa50501 [feature](cold-data) move cold data to object storage without losing any feature(BE) (#10280)
This PR supports rowset level data upload on the BE side, so that there can be both cold data and hot data in a tablet,
and there is no necessary to prohibit loading new data to cooled tablets.

Each rowset is bound to a `FileSystem`, so that the storage layer can read and write rowsets without
perceiving the underlying filesystem.

The abstracted `RemoteFileSystem` can try local caching strategies with different granularity,
instead of caching segment files as before.

To avoid conflicts with the code in be/src/io, we temporarily put the file system related code in the be/src/io/fs directory.
In the future, `FileReader`s and `FileWriter`s should be unified.
2022-07-08 12:18:39 +08:00
Pxl
e68ab0084b [bugfix]fix default value get wrong result because no implement read_by_rowids (#10582) 2022-07-04 19:30:49 +08:00
c9f86bc7e2 [refactor] Refactoring Status static methods to format message using fmt(#9533) 2022-07-02 18:58:23 +08:00