The element in InvertedIndexSearcherCache is inverted index searcher, which is a file descriptor of inverted index file, so InvertedIndexSearcherCache is actually cache file descriptor of inverted index file.
If open file descriptor limit of the Linux system is set too small and config inverted_index_searcher_cache_limit is too big, during high pressure load maybe cause "Too many open files".
So, when insert inverted index searcher into InvertedIndexSearcherCache, need also check whether reach file_descriptor_number limit for inverted index file.
Add cache for inverted index query match bitmap to accelerate common query keyword, especially for keyword matching many rows.
Tests result:
- large result: matching 99% out of 247 million rows shows 8x speed up.
- small result: matching 0.1% out of 247 million rows shows 2x speed up.
Issue Number: close#16351
Dynamic schema table is a special type of table, it's schema change with loading procedure.Now we implemented this feature mainly for semi-structure data such as JSON, since JSON is schema self-described we could extract schema info from the original documents and inference the final type infomation.This speical table could reduce manual schema change operation and easily import semi-structure data and extends it's schema automatically.
1. support row format using codec of jsonb
2. short path optimize for point query
3. support prepared statement for point query
4. support mysql binary format
Since Filesystem inherited std::enable_shared_from_this , it is dangerous to create native point of FileSystem.
To avoid this behavior, making the constructor of XxxFileSystem a private method and using the static method create(...) to get a new FileSystem object.
The main purpose of this pr is to import `fileCache` for lakehouse reading remote files.
Use the local disk as the cache for reading remote file, so the next time this file is read,
the data can be obtained directly from the local disk.
In addition, this pr includes a few other minor changes
Import File Cache:
1. The imported `fileCache` is called `block_file_cache`, which uses lru replacement policy.
2. Implement a new FileRereader `CachedRemoteFilereader`, so that the logic of `file cache` is hidden under `CachedRemoteFilereader`.
Other changes:
1. Add a new interface `fs()` for `FileReader`.
2. `IOContext` adds some statistical information to count the situation of `FileCache`
Co-authored-by: Lightman <31928846+Lchangliang@users.noreply.github.com>
Tablet::version_for_delete_predicate should travel all rowset metas in tablet meta which complex is O(N), however we can directly judge whether this rowset is a delete rowset by RowsetMeta::has_delete_predicate which complex is O(1).
As we won't call Tablet::version_for_delete_predicate when pick input rowsets for compaction, we can reduce the critical area of Tablet::_meta_lock.
This PR implement the new bloom filter index: NGram bloom filter index, which was proposed in #10733.
The new index can improve the like query performance greatly, from our some test case , can get order of magnitude improve.
For how to use it you can check the docs in this PR, and the index based on the ```enable_function_pushdown```,
you need set it to ```true```, to make the index work for like query.
The segment group is useless in current codebase, remove all the related code inside Doris. As for the related protobuf code, use reserved flag to prevent any future user from using that field.
Currently, newly created segment could be chosen to be compaction
candidate, which is prone to bugs and segment file open failures. We
should skip last (maybe active) segment while doing segcompaction.
1.remove quick_compaction's rowset pick policy, call cu compaction when trigger
quick compaction
2. skip tablet's compaction task when compaction score is too small
Co-authored-by: yixiutt <yixiu@selectdb.com>
mem tracker can be logically divided into 4 layers: 1)process 2)type 3)query/load/compation task etc. 4)exec node etc.
type includes
enum Type {
GLOBAL = 0, // Life cycle is the same as the process, e.g. Cache and default Orphan
QUERY = 1, // Count the memory consumption of all Query tasks.
LOAD = 2, // Count the memory consumption of all Load tasks.
COMPACTION = 3, // Count the memory consumption of all Base and Cumulative tasks.
SCHEMA_CHANGE = 4, // Count the memory consumption of all SchemaChange tasks.
CLONE = 5, // Count the memory consumption of all EngineCloneTask. Note: Memory that does not contain make/release snapshots.
BATCHLOAD = 6, // Count the memory consumption of all EngineBatchLoadTask.
CONSISTENCY = 7 // Count the memory consumption of all EngineChecksumTask.
}
Object pointers are no longer saved between each layer, and the values of process and each type are periodically aggregated.
other fix:
In [fix](memtracker) Fix transmit_tracker null pointer because phamp is not thread safe #13528, I tried to separate the memory that was manually abandoned in the query from the orphan mem tracker. But in the actual test, the accuracy of this part of the memory cannot be guaranteed, so put it back to the orphan mem tracker again.