Use weak_ptr to cache the file handle of a file segment. The maximum number of cached file handles can be configured by `file_cache_max_file_reader_cache_size`, default `1000000`.
Users can inspect the number of cached file handles by requesting the BE metrics at `http://be_host:be_webserver_port/metrics`:
```
# TYPE doris_be_file_cache_segment_reader_cache_size gauge
doris_be_file_cache_segment_reader_cache_size{path="/mnt/datadisk1/gaoxin/file_cache"} 2500
```
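A minimal sketch of the weak_ptr caching idea described above, not the actual Doris implementation. `SegmentReader` and `WeakReaderCache` are hypothetical names, and the size bound only loosely mirrors `file_cache_max_file_reader_cache_size`. The cache holds `std::weak_ptr`s, so it never keeps a reader alive on its own; a reader is reused only while some query still owns it.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a reader that owns an open file handle.
struct SegmentReader {
    explicit SegmentReader(std::string p) : path(std::move(p)) {}
    std::string path;
};

class WeakReaderCache {
public:
    explicit WeakReaderCache(std::size_t max_size) : _max_size(max_size) {}

    // Returns a live cached reader if one exists, otherwise opens a new one.
    std::shared_ptr<SegmentReader> get_or_open(const std::string& path) {
        std::lock_guard<std::mutex> lock(_mutex);
        auto it = _readers.find(path);
        if (it != _readers.end()) {
            if (auto reader = it->second.lock()) {
                return reader;  // still owned by someone: reuse it
            }
            _readers.erase(it);  // the last owner released it
        }
        auto reader = std::make_shared<SegmentReader>(path);
        if (_readers.size() >= _max_size) {
            purge_expired();
        }
        if (_readers.size() < _max_size) {
            _readers.emplace(path, reader);  // stored as weak_ptr
        }
        return reader;
    }

    // Number of tracked entries (may include expired ones not yet purged).
    std::size_t size() const {
        std::lock_guard<std::mutex> lock(_mutex);
        return _readers.size();
    }

private:
    void purge_expired() {
        for (auto it = _readers.begin(); it != _readers.end();) {
            if (it->second.expired()) {
                it = _readers.erase(it);
            } else {
                ++it;
            }
        }
    }

    const std::size_t _max_size;
    mutable std::mutex _mutex;
    std::unordered_map<std::string, std::weak_ptr<SegmentReader>> _readers;
};
```

When the cache is full of live readers, this sketch simply returns an uncached reader instead of evicting one; the real policy may differ.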
Arrow does not support null elements in a map key column, but the Doris map key column is nullable by default. We therefore need to handle Doris map rows whose key column contains a null element; in that case we put null into Arrow.
1. When calculating an array hash, the element size does not need to be seeded into the hash:
hash = HashUtil::zlib_crc_hash(reinterpret_cast<const char*>(&elem_size),
sizeof(elem_size), hash);
However, we need to be careful with [[], [1]] vs [[1], []]: when an array is nested inside an array and the nested array is empty, we should still seed the hash with the element size so that the two values hash differently.
2. Use a range for one hash value to avoid a virtual function call in the loop, which doubles the performance. I verified this in a unit test:
column: array[int64]
50 rows, each array has 100,000 elements
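The empty-nested-array case can be illustrated with a small sketch. It uses FNV-1a as a stand-in for `HashUtil::zlib_crc_hash`, and `hash_bytes`, `hash_nested`, and `kInitialSeed` are hypothetical names, so treat it as an illustration of the idea rather than the actual column code. Non-empty inner arrays are hashed as one contiguous range in a single call; empty inner arrays contribute their size (0) only when `seed_empty` is set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kInitialSeed = 2166136261u;  // FNV-1a offset basis

// Stand-in for HashUtil::zlib_crc_hash: FNV-1a chained through `seed`.
inline uint32_t hash_bytes(const void* data, std::size_t len, uint32_t seed) {
    const auto* p = static_cast<const unsigned char*>(data);
    uint32_t h = seed;
    for (std::size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= 16777619u;  // FNV prime
    }
    return h;
}

using NestedArray = std::vector<std::vector<int64_t>>;  // one array<array<int64>> value

inline uint32_t hash_nested(const NestedArray& row, bool seed_empty,
                            uint32_t seed = kInitialSeed) {
    uint32_t h = seed;
    for (const auto& inner : row) {
        if (inner.empty()) {
            if (seed_empty) {
                // An empty inner array still has to leave a trace in the hash.
                const uint64_t elem_size = 0;
                h = hash_bytes(&elem_size, sizeof(elem_size), h);
            }
        } else {
            // Hash the whole contiguous range in one call, not per element.
            h = hash_bytes(inner.data(), inner.size() * sizeof(int64_t), h);
        }
    }
    return h;
}
```

With `seed_empty == false`, `[[], [1]]` and `[[1], []]` both reduce to hashing the single element `1` and therefore collide; with `seed_empty == true`, the size of the empty array is mixed in at a different position and the two hashes differ.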
Here we calculate the delete bitmaps of all rowsets that are committed but not yet published, to reduce the computation pressure of the publish phase.
Step1: collect the delete bitmaps of all committed rowsets of this tablet.
Step2: calculate the delete bitmaps of all rowsets that were published during compaction.
Step3: write back the updated delete bitmap and tablet info.
Problem fixed:
When a drop index request and a build index request run concurrently on the same column, and build index obtains the lock before drop index, build index creates a new index file. When the drop index request then executes, the linked files do not contain all index files for the column, so the new index file is lost.
To address this, use the index id instead of the column unique id to determine whether a hard link is required when building an index.
Refactor the interface of create_file_reader
The file_size and mtime are moved into FileDescription and are no longer part of FileReaderOptions.
The file handle cache can now get the correct file modification time from FileDescription.
Add HdfsIO for the HDFS file reader.
Picked from [Enhancement](multi-catalog) Add hdfs read statistics profile. #21442
No hash functions are implemented for array/map/struct columns, so a SQL statement like the following makes the BE core dump:
select * from (
select
bdp.nc_num,
collect_list(distinct(bd.catalog_name)) as catalog_name,
material_qty
from
dataease.bu_delivery_product bdp
left join dataease.bu_trans_transfer btt on bdp.delivery_product_id = btt.delivery_product_id
left join dataease.bu_delivery bd on bdp.delivery_id = bd.delivery_id
where
bd.val_status in ('10', '20', '30', '90')
and bd.delivery_type in (0, 1, 2)
group by
nc_num,
material_qty
union all
select
bdp.nc_num,
collect_list(distinct(bd.catalog_name)) as catalog_name,
material_qty
from
dataease.bu_trans_transfer btt
left join dataease.bu_delivery_product bdp on bdp.delivery_product_id = btt.delivery_product_id
left join dataease.bu_delivery bd on bdp.delivery_id = bd.delivery_id
where
bd.val_status in ('10', '20', '30', '90')
and bd.delivery_type in (0, 1, 2)
group by
nc_num,
material_qty
) aa;
core :
1. Add an HDFS file handle cache for the HDFS file reader
Copied from Impala, `https://github.com/apache/impala/blob/master/be/src/util/lru-multi-cache.h`. (Thanks to the Impala team)
This is an LRU cache that can store multiple entries with the same key.
The key is built from {file name + modification time}.
The value is the hdfsFile pointer that points to a certain HDFS file.
This cache avoids reopening the same HDFS file multiple times, which saves
query time.
Add a BE config `max_hdfs_file_handle_cache_num` to limit the maximum number
of cached file handles; the default is 20000.
2. Add a file meta cache
The file meta cache is an LRU cache. The key is {file name + modification time},
the value is the parsed file meta info of that file, which saves
the time of re-parsing the file meta every time.
Currently, it is only used for caching the parquet file footer.
Tests show that when the cache is hit, `FileOpenTime` and `ParseFooterTime` in the query profile
drop to almost 0, which saves time when there are many files to read.
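The multi-entry LRU idea from item 1 can be sketched as follows. This is a simplified illustration, not the Impala `lru-multi-cache.h` code or its API: `MultiLruCache`, `Handle`, `acquire`, and `release` are hypothetical names, only idle handles are tracked, and a real cache would close the handle on eviction. The key is {file name, modification time}, so a changed mtime never matches a stale handle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <list>
#include <map>
#include <string>
#include <utility>

using Key = std::pair<std::string, int64_t>;  // {file name, modification time}

// Hypothetical stand-in for an hdfsFile pointer.
struct Handle {
    int id;
};

class MultiLruCache {
public:
    explicit MultiLruCache(std::size_t capacity) : _capacity(capacity) {}

    // Take one idle handle for `key` out of the cache, if any exists.
    bool acquire(const Key& key, Handle* out) {
        auto it = _index.find(key);
        if (it == _index.end()) {
            return false;
        }
        *out = it->second->second;
        _lru.erase(it->second);
        _index.erase(it);
        return true;
    }

    // Give a handle back; several handles may share the same key.
    void release(const Key& key, Handle handle) {
        _lru.emplace_front(key, handle);
        _index.emplace(key, _lru.begin());
        while (_lru.size() > _capacity) {
            evict_oldest();
        }
    }

    std::size_t size() const { return _lru.size(); }

private:
    using LruList = std::list<std::pair<Key, Handle>>;

    // Drop the least recently released handle (a real cache would close it).
    void evict_oldest() {
        auto victim = std::prev(_lru.end());
        auto range = _index.equal_range(victim->first);
        for (auto it = range.first; it != range.second; ++it) {
            if (it->second == victim) {
                _index.erase(it);
                break;
            }
        }
        _lru.erase(victim);
    }

    std::size_t _capacity;
    LruList _lru;                                   // front = most recent
    std::multimap<Key, LruList::iterator> _index;   // key -> entries in _lru
};
```

The file meta cache in item 2 uses the same kind of key, but it would only need a single entry per key, since parsed footers can be shared rather than checked out.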
For routine load (Kafka load), users can produce data for different
tables into a single topic, and Doris will dispatch the data to the corresponding
tables.
Signed-off-by: freemandealer <freeman.zhang1992@gmail.com>
1. Use heap sort to find duplicated keys between segments and update the delete-bitmap. The old implementation traversed all keys in all segments, used each key to search for duplicates in earlier segments, and then marked them for deletion.
2. Trick: Each time the heap top is popped as a key1, the new heap top is key2, allowing for jumping directly from key1 to key2 instead of advancing iteratively.
3. Effect: This technique works well when there are many segments within the same rowset and the imported data is relatively ordered.
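The heap-based approach above can be sketched as a k-way merge. This is an illustrative simplification under stated assumptions, not the actual implementation: each segment is modeled as a sorted vector of unique `int64_t` keys, a higher segment index is assumed to be newer, and `find_stale_rows`, `Cursor`, and `RowLocation` are hypothetical names. The "jump" is modeled with `std::lower_bound`, which moves a segment's cursor straight to the first key that is not smaller than the new heap top.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <set>
#include <utility>
#include <vector>

using Segment = std::vector<int64_t>;                      // sorted, unique keys
using RowLocation = std::pair<std::size_t, std::size_t>;   // {segment, row}

struct Cursor {
    std::size_t seg;
    std::size_t row;
};

// Returns the rows whose key also appears in a newer (higher-index) segment.
inline std::set<RowLocation> find_stale_rows(const std::vector<Segment>& segments) {
    auto key_of = [&](const Cursor& c) { return segments[c.seg][c.row]; };
    // Heap top = smallest key; for equal keys, the newest segment comes first.
    auto lower_priority = [&](const Cursor& a, const Cursor& b) {
        if (key_of(a) != key_of(b)) return key_of(a) > key_of(b);
        return a.seg < b.seg;
    };
    std::priority_queue<Cursor, std::vector<Cursor>, decltype(lower_priority)> heap(
            lower_priority);
    for (std::size_t s = 0; s < segments.size(); ++s) {
        if (!segments[s].empty()) heap.push(Cursor{s, 0});
    }

    std::set<RowLocation> stale;
    while (!heap.empty()) {
        Cursor cur = heap.top();  // newest copy of the smallest key (key1)
        heap.pop();
        const int64_t key = key_of(cur);
        // Every other cursor on the same key points at an older copy.
        while (!heap.empty() && key_of(heap.top()) == key) {
            Cursor dup = heap.top();
            heap.pop();
            stale.emplace(dup.seg, dup.row);
            if (++dup.row < segments[dup.seg].size()) heap.push(dup);
        }
        if (heap.empty()) break;  // no other segment left, so no more duplicates
        // Jump: keys in cur's segment below the new heap top (key2) are unique.
        const Segment& seg = segments[cur.seg];
        auto next = std::lower_bound(seg.begin() + cur.row + 1, seg.end(),
                                     key_of(heap.top()));
        if (next != seg.end()) {
            heap.push(Cursor{cur.seg, static_cast<std::size_t>(next - seg.begin())});
        }
    }
    return stale;
}
```

For segments `{1, 2, 3, 7}`, `{3, 5, 7}`, and `{7, 8}`, key 2 in the first segment is skipped without ever entering the heap, which is where the benefit for relatively ordered data comes from.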
Test on SSB 100g:
select lo_suppkey, count(distinct lo_linenumber) from lineorder group by lo_suppkey;
exec time: 4.388s
create materialized view:
create materialized view customer_uv as select lo_suppkey, bitmap_union(to_bitmap(lo_linenumber)) from lineorder group by lo_suppkey;
select lo_suppkey, count(distinct lo_linenumber) from lineorder group by lo_suppkey;
exec time: 12.908s
test with the patch, exec time: 5.790s
Currently, compaction is executed separately for each backend, and the reconstruction of the index during compaction leads to high CPU usage. To address this, we are introducing single replica compaction, where a specific primary replica is selected to perform compaction, and the remaining replicas fetch the compaction results from the primary replica.
The Backend (BE) requests replica information for all peers corresponding to a tablet from the Frontend (FE). This information includes the host where the replica is located and the replica_id. By calculating hash(replica_id), the replica with the smallest hash value is responsible for executing compaction, while the remaining replicas are responsible for fetching the compaction results from this replica.
The compaction task producer thread, before submitting a compaction task, checks whether the local replica should fetch from its peer. If it should, the task is then submitted to the single replica compaction thread pool.
When performing single replica compaction, the process begins by requesting rowset versions from the target replica. These rowset_versions are then compared with the local rowset versions. The first version that can be fetched is selected.
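The replica selection rule described above can be sketched as follows. The hash function actually used is not stated here, so `mix64` (a SplitMix64-style finalizer) is only a stand-in, and `ReplicaInfo`, `pick_compaction_replica`, and `should_fetch_from_peer` are hypothetical names. The sketch also assumes the peer list returned by the FE includes the local replica. Because every BE evaluates the same deterministic function over the same replica set, all replicas agree on who executes compaction without extra coordination.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct ReplicaInfo {
    std::string host;
    int64_t replica_id;
};

// Stand-in hash (SplitMix64 finalizer); the real hash function may differ.
inline uint64_t mix64(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Returns the replica with the smallest hash(replica_id), or nullptr if empty.
inline const ReplicaInfo* pick_compaction_replica(const std::vector<ReplicaInfo>& peers) {
    const ReplicaInfo* best = nullptr;
    for (const auto& r : peers) {
        if (best == nullptr) {
            best = &r;
            continue;
        }
        const uint64_t hr = mix64(static_cast<uint64_t>(r.replica_id));
        const uint64_t hb = mix64(static_cast<uint64_t>(best->replica_id));
        // Tie-break on replica_id so the choice does not depend on list order.
        if (hr < hb || (hr == hb && r.replica_id < best->replica_id)) {
            best = &r;
        }
    }
    return best;
}

// True if the local replica should fetch results instead of compacting itself.
inline bool should_fetch_from_peer(const std::vector<ReplicaInfo>& peers,
                                   int64_t local_replica_id) {
    const ReplicaInfo* chosen = pick_compaction_replica(peers);
    return chosen != nullptr && chosen->replica_id != local_replica_id;
}
```

A compaction task producer could call `should_fetch_from_peer` before submitting a task and route the task to the single replica compaction thread pool when it returns true.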
* [Improve](performance) introduce SchemaCache to cache TabletSchema & Schema
1. When the system is under high-concurrency load with wide table point queries, the frequent memory allocation and deallocation of Schema become evident system bottlenecks. Additionally, the initialization of TabletSchema and Schema also becomes a CPU hotspot. Therefore, a SchemaCache is introduced to cache these resources for reuse.
2. Wrap some variables in std::unique_ptr
Performance:
| Status               | QPS | Average response time (avg) | P99 response time |
|----------------------|-----|-----------------------------|-------------------|
| SchemaCache enabled  | 501 | 20ms                        | 34ms              |
| SchemaCache disabled | 321 | 31ms                        | 61ms              |
* handle schema change with schema version
* remove useless header
* rebase
Refactoring the filtering conditions in the current ExecNode from an expression tree to an array can simplify the process of adding runtime filters. It eliminates the need for complex merge operations and removes the requirement for the frontend to combine expressions into a single entity.
By representing the filtering conditions as an array, each condition can be treated individually, making it easier to add runtime filters without the need for complex merging logic. The array can store the individual conditions, and the runtime filter logic can iterate through the array to apply the filters as needed.
This refactoring simplifies the codebase, improves readability, and reduces the complexity associated with handling filtering conditions and adding runtime filters. It separates the conditions into discrete entities, enabling more straightforward manipulation and management within the execution node.
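A minimal sketch of the array-of-conditions idea, assuming a much simplified model: `Row`, `Conjunct`, and `ScanNode` are hypothetical names and do not reflect the real ExecNode or expression classes. Each condition is an independent entry, a row passes only if every entry accepts it, and a runtime filter is added by appending one more entry instead of merging it into an expression tree.

```cpp
#include <cassert>
#include <functional>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical row type for illustration only.
struct Row {
    int id;
    int price;
};

using Conjunct = std::function<bool(const Row&)>;

class ScanNode {
public:
    // Static predicates and runtime filters are added the same way.
    void add_conjunct(Conjunct conjunct) { _conjuncts.push_back(std::move(conjunct)); }

    // A row is kept only if all conjuncts accept it (implicit AND).
    std::vector<Row> filter(const std::vector<Row>& rows) const {
        std::vector<Row> out;
        for (const Row& row : rows) {
            bool keep = true;
            for (const Conjunct& conjunct : _conjuncts) {
                if (!conjunct(row)) {
                    keep = false;
                    break;
                }
            }
            if (keep) out.push_back(row);
        }
        return out;
    }

private:
    std::vector<Conjunct> _conjuncts;
};
```

For example, a node that starts with `price > 10` can later receive a runtime filter built from join keys through another `add_conjunct` call, with no change to the existing entry.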