* process differently for normal bloom filter index and ngram bf index
* fix review comments for readbility
* add test case
* add testcase for delete condition
This PR proposes several changes to improve code safety and readability by replacing raw pointers with smart pointers in several places.
use enable_factory_creator in InvertedIndexIterator and InvertedIndexReader, remove explicit new constructor.
make InvertedIndexReader shared_from_this, it may desctruct when InvertedIndexIterator use it.
In #10370, we try to opt string evaluate performance by rewrite the predicate using dict value. But it has to check if the string column is full dict encoding. So that we add a logic to read the last page of the string column to check it.
But it has some bad performance for cold data because it has to load the column's ordinal index and zone map index. In some scenario for example, select * from table where pk_col=1. If the query condition is primary key, the result maybe just a few rows but the result may have 100 columns, it will cost a lot of time to load these indices. We could find a lot of time is spending on block_init_time.
In my test, a table with 50 string columns and query with primary key.
The first read time will reduce from 220ms to 40ms.
* [Improve](performance) introduce SchemaCache to cache TabletSchame & Schema
1. When the system is under high-concurrency load with wide table point queries, the frequent memory allocation and deallocation of Schema become evident system bottlenecks. Additionally, the initialization of TabletSchema and Schema also becomes a CPU hotspot.Therefore, the introduction of a SchemaCache is implemented to cache these resources for reuse.
2. Make some variables wrapped with std::unique<unique_ptr>
Performance:
| 状态 | QPS | 平均响应时间 (avg) | P99 响应时间 |
|------------------|-----|------------------|-------------|
| 开启 SchemaCache | 501 | 20ms | 34ms |
| 关闭 SchemaCache | 321 | 31ms | 61ms |
* handle schema change with schema version
* remove useless header
* rebase
Currently, there are some useless includes in the codebase. We can use a tool named include-what-you-use to optimize these includes. By using a strict include-what-you-use policy, we can get lots of benefits from it.
We have supportted array type default [], but when using lightweight schema Change to add column array type, query failed as follows:
Fix "array default type is unsupported" error.
Fix the default value filling assignment digit problem.
In our current logic, index page will be pre-decoded but it will return OK
as index page use BinaryPlainPageBuilder and first 4 bytes of the page is a offset
so it's high probablility not equal to EncodingTypePB::DICT_ENCODING which
is 5.
Code in bitshuffle_page_pre_decode.h
```
if constexpr (USED_IN_DICT_ENCODING) {
auto type = decode_fixed32_le((const uint8_t*)&data.data[0]);
if (static_cast<EncodingTypePB>(type) != EncodingTypePB::DICT_ENCODING) {
return Status::OK();
}
size_of_dict_header = BINARY_DICT_PAGE_HEADER_SIZE;
data.remove_prefix(4);
}
```
But if type just equal to EncodingTypePB::DICT_ENCODING and then it will use
BitShuffle to decode BinaryPlainPage, which will leads to an fatal error.
Add cache for inverted index query match bitmap to accelerate common query keyword, especially for keyword matching many rows.
Tests result:
- large result: matching 99% out of 247 million rows shows 8x speed up.
- small result: matching 0.1% out of 247 million rows shows 2x speed up.
This commit support:
1、Insert + select for struct/map type
2、Json stream load for struct type
3、m[key] function for map type
How to use:
Set the fe config to create table for struct and map type
1、admin set frontend config("enable_struct_type" = "true");
2、admin set frontend config("enable_map_type" = "true");
#16547
Co-authored-by: xy720 <xuyang25@baidu.com>
Co-authored-by: amory <wangqiannan@selectdb.com>
Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>
Co-authored-by: hucheng01 <hucheng01@baidu.com>
Step3 of DSIP-023: Add inverted index for full text search
implementation of bkd index's reader which in inverted index using for numeric types
dependency pr: #14211#15807#15823
* [feature-wip](inverted index)inverted index api: reader
* [feature-wip](inverted index) Fulltext query syntax with MATCH/MATCH_ALL/MATCH_ALL
* [feature-wip](inverted index) Adapt to index meta
* [enhance] add more metrics
* [enhance] add fulltext match query check for column type and index parser
* [feature-wip](inverted index) Support apply inverted index in compound predicate which except leaf node of and node
When calling select on remote files, download cache files to local disk.
When calling alter table on remote files, read files directly from remote storage. So if tablet is too large, it will not take up too many local disk when creating local cache file.
Add `JSON` datatype, following features are implemented by this PR:
1. `CREATE` tables with `JSON` type columns
2. `INSERT` values containing `JSON` type value stored in `String`, which is represented as binary format(AKA `JSONB`) at BE
3. `SELECT` JSON columns
Detail design refers [DSIP-016: Support JSON type](https://cwiki.apache.org/confluence/display/DORIS/DSIP-016%3A+Support+JSON+type)
* add JSONB data storage format type
* fix JsonLiteral resolve bug
* add DataTypeJson case in data_type_factory
* add JSON syntax check in FE
* add operators for jsonb_document, currently not support comparison between any JSON type value
* add ColumnJson and DataTypeJson
* add JsonField to store JsonValue
* add JsonValue to convert String JSON to BINARY JSON and JsonLiteral case for vliteral
* add push_json for MysqlResultWriter
* JSON column need no zone_map_index
* Revert "JSON column need no zone_map_index"
This reverts commit f71d1ce1ded9dbae44a5d58abcec338816b70d79.
* add JSON writer and reader, ignore zone-map for JSON column
* add json_to_string for DataTypeJson
* add olap_data_convertor for JSON type
* add some enum
* add OLAP_FIELD_TYPE_JSON type, FieldTypeTraits for it and corresponding cases or functions
* fix column_json offsets overflow bug, format code
* remove useless TODOs, add CmpType cases for JSON type
* add license header
* format license
* format be codes
* resolve rebase master conflicts
* fix bugs for CREATE and meta related code
* refactor JsonValue constructors, add fe JSON cases and fix some bugs, reformat codes
* modification be codes along code review advice
* fix rebase conflicts with master
* add unit test for json_value and column_json
* fix rebase error
* rename json to jsonb
* fix some data convert bugs, set Mysql type to JSON
Reuse compression ctx and buffer.
Use a global instance for every compression algorithm, and use a
thread saft buffer pool to reuse compression buffer, pool size is equal
to max parallel thread num in compression, and this will not be too large.
Test shows this feature increase 5% of data import and compaction.
Co-authored-by: yixiutt <yixiu@selectdb.com>
Store the offset rather than the length in file for the data with array type. The new file format can improve the seek performance. Please refer to #12246 to get the performance report.
Co-authored-by: xy720 <22125576+xy720@users.noreply.github.com>
This PR supports rowset level data upload on the BE side, so that there can be both cold data and hot data in a tablet,
and there is no necessary to prohibit loading new data to cooled tablets.
Each rowset is bound to a `FileSystem`, so that the storage layer can read and write rowsets without
perceiving the underlying filesystem.
The abstracted `RemoteFileSystem` can try local caching strategies with different granularity,
instead of caching segment files as before.
To avoid conflicts with the code in be/src/io, we temporarily put the file system related code in the be/src/io/fs directory.
In the future, `FileReader`s and `FileWriter`s should be unified.