I want to use Doris Multi-catalog to accelerate HMS query. My organization has custom distributed file system, and we think wrapping the fs access difference into broker (listLocatedFiles, openReader..) would be a elegant approach.
This pr introduce HMS catalog conf `bind.broker.name`. If we set this conf, file split, query scan operation will send to broker.
usage:
create a hms catalog with broker usage
```
CREATE CATALOG hive_catalog_broker PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://xxx',
'broker.name' = 'hdfs_broker'
);
```
When we try to query from this catalog, file split and query scan request will send to broker `hdfs_broker`.
More details about this pr:
1. Introduce HMS catalog proporty `bind.broker.name` to specify broker name to do remote path work. When `broker.name` is set, `enable.self.splitter` must be `true` to ensure file splitting process is executed in Fe
2. Introduce 2 more interfaces to broker service:
- `TBrokerIsSplittableResponse isSplittable(1: TBrokerIsSplittableRequest request)`, helps to invoke input format `isSplitable` interface.
- `TBrokerListResponse listLocatedFiles(1: TBrokerListPathRequest request)`, helps to do `listFiles` or `listLocatedStatus` for remote file system
3. 3 parts of whole processing will be executed in broker:
- Check whether the path with specified input format name `isSplittable`
- `listLocatedFiles` of table / partition locations.
- `OpenReader` for specified file splits.
Co-authored-by: chenlinzhong <490103404@qq.com>
In previous, if user property `'resource_tags.location'` is not set, the can use Backends with any resource tag.
It may confuse that when the DBA set part of Backends to resource group A, then the current existing user
should not be able to use this group A util it's `'resource_tags.location'` is set.
So in this PR, I change the behavior, that if user property `'resource_tags.location'` is not set, it can only use the
Backends with `default` tag.
- db support replication_allocation,when create table,if not set `replication_num` or `replication_allocation `,will use it in db
- fix partition property will disappear when table partition is not null
While doing sample analyze, the result of row count, null number and datasize need to multiply a coefficient based on
the sample percent/rows. This pr is mainly to calculate the coefficient according to the sampled file size over total size.
- fix when modifying comments in property, it will modify the comments in the catalog
- add `alter catalog modify comment` to modify comment for catalog
- abstract some logic of `alter catalog` to parent class
Now FE does not record the update time of hms tbl's partitons, so the sql cache may be hit even the hive table's partitions have changed. This pr add a field to record the partition update time, and use it when enable sql-cache.
The cache will be missed if any partition has changed at hive side.
Use System.currentTimeMillis() but not the event time of hms event because we would better keep the same measurement with the schemaUpdateTime of external table. Add this value to ExternalObjectLog and let slave FEs replay it because it is better to keep the same value with all FEs, so the sql-cache can be hit by the querys through different FEs.
now create table use auto create partition:
AUTO PARTITION BY RANGE date_trunc(event_day, 'day')
so the value of event_day will be insert into partition of date_trunc(event_day, 'day'),
eg: select * from partition_range where date_trunc(event_day,"day")= "2023-08-07 11:00:00";
we can prune some partitions by invoke function of date_trunc("2023-08-07 11:00:00","day" );
metadata tvf list:
- backends
- catalogs
- frontends
- frontends_disks
- group_commit
- iceberg_meta
- workload_groups
fix group_commit bugs
- throw NPE when properties do not contain 'table_id'
- throw NPE when table_id's table do not exist
- throw class Cast failed when table_id's table's type is not OLAP
When the rowCount exceeds a certain threshold, refrain from generating a broadcast join.
Only enforce the best expression in CostAndEnforce Job, rather than enforcing every expression.
Remove lower bound group pruning
this pr is aim to update for filter_by_select logic and change delete limit
only support scala type in delete statement where condition
only support column nullable and predict column support filter_by_select logic, because we can not push down non-scala type to storage layer to pack in predict column but do filter logic
Since Doris support query specific tablet only, so we don't depend on tableSample to do sample, instead use grammar: TABLET(id) to do so. In OlapAnalyzeTask, we calculate which tablets would be hit and set theirs id in it, so we could get how many rows actually queried and furthur we could get the scale up ratio here