[refactor](doc) add first class category "lakehouse" for multi catalog and external table (#15934)

* [refactor](doc) add new first-class category "Lakehouse" for multi-catalog and external tables
This change updates the docs for multi-catalog and external tables.

There is now a first-class category named "Lakehouse" in the doc sidebar.
Mingyu Chen authored on 2023-01-16 09:27:02 +08:00, committed by GitHub
parent 151fdc224e, commit 3cb7b2ea50
42 changed files with 2220 additions and 4229 deletions


@ -9,7 +9,7 @@
},
"sidebar.docs.category.Doris Introduction": {
"message": "Doris 介绍",
"description": "The label for category Doris Architecture in sidebar docs"
"description": "The label for category Doris Introduction in sidebar docs"
},
"sidebar.docs.category.Install And Deploy": {
"message": "安装部署",
@ -73,7 +73,7 @@
},
"sidebar.docs.category.Data Cache": {
"message": "数据缓存",
"description": "The label for category Date Cache in sidebar docs"
"description": "The label for category Data Cache in sidebar docs"
},
"sidebar.docs.category.Best Practice": {
"message": "最佳实践",
@ -83,10 +83,6 @@
"message": "生态扩展",
"description": "The label for category Ecosystem in sidebar docs"
},
"sidebar.docs.category.Expansion table": {
"message": "扩展表",
"description": "The label for category Expansion table in sidebar docs"
},
"sidebar.docs.category.Doris Manager": {
"message": "Doris Manager",
"description": "The label for category Doris Manager in sidebar docs"
@ -290,5 +286,17 @@
"sidebar.docs.category.Benchmark": {
"message": "性能测试",
"description": "The label for category Benchmark in sidebar docs"
},
"sidebar.docs.category.Lakehouse": {
"message": "数据湖分析",
"description": "The label for category Lakehouse in sidebar docs"
},
"sidebar.docs.category.Lakehouse.Multi Catalog": {
"message": "多源数据目录",
"description": "The label for category Lakehouse.Multi Catalog in sidebar docs"
},
"sidebar.docs.category.Lakehouse.External Table": {
"message": "外部表",
"description": "The label for category Lakehouse.External Table in sidebar docs"
}
}


@ -132,7 +132,7 @@ In the existing Doris import process, the data structure of global dictionary is
## Hive Bitmap UDF
Spark supports loading Hive-generated bitmap data directly into Doris; see the [hive-bitmap-udf documentation](../../../ecosystem/external-table/hive-bitmap-udf)
Spark supports loading Hive-generated bitmap data directly into Doris; see the [hive-bitmap-udf documentation](../../../ecosystem/hive-bitmap-udf)
## Basic operation
@ -603,7 +603,7 @@ The data type applicable to the aggregate column of the doris table is bitmap ty
There is no need to build a global dictionary; just specify the corresponding field in the load command, in the format: ```doris field name=binary_bitmap (hive table field name)```
Similarly, loading data of the binary (bitmap) type is currently only supported when the upstream data source is a Hive table. You can refer to the use of hive bitmap [hive-bitmap-udf](../../../ecosystem/external-table/hive-bitmap-udf)
Similarly, loading data of the binary (bitmap) type is currently only supported when the upstream data source is a Hive table. You can refer to the use of hive bitmap [hive-bitmap-udf](../../../ecosystem/hive-bitmap-udf)
### Show Load


@ -1,591 +0,0 @@
---
{
"title": "Doris On ES",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Doris On ES
Doris-On-ES combines Doris's distributed query planning capability with the full-text search capability of ES (Elasticsearch) to provide a more complete OLAP solution:
1. Distributed join queries across multiple indexes in ES
2. Joint queries over tables in Doris and ES, with more complex full-text retrieval and filtering
This document mainly introduces how this feature works and how to use it.
## Glossary
### Terms in Doris
* FE: Frontend, the front-end node of Doris. Responsible for metadata management and request access.
* BE: Backend, Doris's back-end node. Responsible for query execution and data storage.
### Terms in ES
* DataNode: The data storage and computing node of ES.
* MasterNode: The Master node of ES, which manages metadata, nodes, data distribution, etc.
* scroll: The built-in data set cursor feature of ES for streaming scanning and filtering of data.
* _source: contains the original JSON document body that was passed at index time
* doc_values: store the same values as the _source but in a column-oriented fashion
* keyword: string datatype in ES; the content is not processed by an analyzer
* text: string datatype in ES; the content is processed by an analyzer
## How To Use
### Create ES Index
```
PUT test
{
"settings": {
"index": {
"number_of_shards": "1",
"number_of_replicas": "0"
}
},
"mappings": {
"doc": { // There is no need to specify the type when creating indexes after ES7.x version, there is one and only type of `_doc`
"properties": {
"k1": {
"type": "long"
},
"k2": {
"type": "date"
},
"k3": {
"type": "keyword"
},
"k4": {
"type": "text",
"analyzer": "standard"
},
"k5": {
"type": "float"
}
}
}
}
}
```
### Add JSON documents to ES index
```
POST /_bulk
{"index":{"_index":"test","_type":"doc"}}
{ "k1" : 100, "k2": "2020-01-01", "k3": "Trying out Elasticsearch", "k4": "Trying out Elasticsearch", "k5": 10.0}
{"index":{"_index":"test","_type":"doc"}}
{ "k1" : 100, "k2": "2020-01-01", "k3": "Trying out Doris", "k4": "Trying out Doris", "k5": 10.0}
{"index":{"_index":"test","_type":"doc"}}
{ "k1" : 100, "k2": "2020-01-01", "k3": "Doris On ES", "k4": "Doris On ES", "k5": 10.0}
{"index":{"_index":"test","_type":"doc"}}
{ "k1" : 100, "k2": "2020-01-01", "k3": "Doris", "k4": "Doris", "k5": 10.0}
{"index":{"_index":"test","_type":"doc"}}
{ "k1" : 100, "k2": "2020-01-01", "k3": "ES", "k4": "ES", "k5": 10.0}
```
### Create external ES table
For the detailed table creation syntax, refer to [CREATE TABLE](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md)
```
CREATE EXTERNAL TABLE `test` // If no schema is specified, es mapping is automatically pulled to create a table
ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200",
"index" = "test",
"type" = "doc",
"user" = "root",
"password" = "root"
);
CREATE EXTERNAL TABLE `test` (
`k1` bigint(20) COMMENT "",
`k2` datetime COMMENT "",
`k3` varchar(20) COMMENT "",
`k4` varchar(100) COMMENT "",
`k5` float COMMENT ""
) ENGINE=ELASTICSEARCH // ENGINE must be Elasticsearch
PROPERTIES (
"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200",
"index" = "test",
"type" = "doc",
"user" = "root",
"password" = "root"
);
```
The following parameters are accepted by ES table:
Parameter | Description
---|---
**hosts** | ES cluster connection address; one or more nodes, or a load balancer address
**index** | the name of the corresponding ES index; aliases are supported, but if doc values are used, the real index name is required
**type** | the type of the index; do not pass this parameter for ES 7.x and later
**user** | username for ES
**password** | password for the user
* For clusters before 7.x, please pay attention to choosing the correct type when building the table
* Only HTTP Basic authentication is supported. Make sure the user has access to paths such as /\_cluster/state/ and \_nodes/http, and has read permission on the index. If the cluster has not enabled security authentication, the user name and password do not need to be set
* The column names in the Doris table need to exactly match the field names in ES, and the field types should be as consistent as possible
* **ENGINE** must be: **Elasticsearch**
##### Filter pushdown
An important capability of `Doris On ES` is the pushdown of filter conditions: filter conditions are pushed to ES, so that only data that actually matches is returned. This significantly improves query performance and reduces CPU, memory and IO usage on both Doris and ES.
The following operators are converted into the corresponding ES queries:
| SQL syntax | ES 5.x+ syntax |
|-------|:---:|
| = | term query|
| in | terms query |
| > , < , >= , <= | range query |
| and | bool.filter |
| or | bool.should |
| not | bool.must_not |
| not in | bool.must_not + terms query |
| is\_not\_null | exists query |
| is\_null | bool.must_not + exists query |
| esquery | QueryDSL in ES native json form |
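For illustration, assuming the `test` external table defined above, the predicates in a query like the following are pushed down according to this table and evaluated inside ES before any rows reach Doris:
```
-- `=` maps to a term query, `in` to a terms query, and `>=` to a range query;
-- together they are combined under bool.filter on the ES side.
select k1, k3 from test where k3 = 'Doris' and k1 in (100, 200) and k2 >= '2020-01-01';
```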
##### Data type mapping
Doris\ES | byte | short | integer | long | float | double| keyword | text | date
------------- | ------------- | ------ | ---- | ----- | ---- | ------ | ----| --- | --- |
tinyint | &radic; | | | | | | | |
smallint | &radic; | &radic; | | | | | | |
int | &radic; | &radic; | &radic; | | | | | |
bigint | &radic; | &radic; | &radic; | &radic; | | | | |
float | | | | | &radic; | | | |
double | | | | | | &radic; | | |
char | | | | | | | &radic; | &radic; |
varchar | | | | | | | &radic; | &radic; |
date | | | | | | | | | &radic;|
datetime | | | | | | | | | &radic;|
### Enable column scan to optimize query speed (enable\_docvalue\_scan=true)
```
CREATE EXTERNAL TABLE `test` (
`k1` bigint(20) COMMENT "",
`k2` datetime COMMENT "",
`k3` varchar(20) COMMENT "",
`k4` varchar(100) COMMENT "",
`k5` float COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200",
"index" = "test",
"user" = "root",
"password" = "root",
"enable_docvalue_scan" = "true"
);
```
Parameter Description:
Parameter | Description
---|---
**enable\_docvalue\_scan** | whether to read the values of the queried fields from ES/Lucene columnar storage (doc values); the default is false
Doris obtains data from ES following two principles:
* **Best effort**: Doris automatically detects whether the columns to be read have columnar storage enabled (doc_value: true). If all requested fields have doc values, Doris reads all field values from columnar storage (doc_values).
* **Automatic downgrade**: If any requested field does not have doc values, the values of all fields are parsed from the row store `_source`.
##### Advantage:
By default, Doris On ES fetches all required columns from the row store `_source`, which holds the original JSON document. Row storage is inferior to columnar storage for batch reads, especially when only a few columns are needed; in that case, reading from doc values is about ten times faster than reading from `_source`.
##### Tip
1. Fields of type `text` do not have columnar storage in ES, so if any requested field is of type `text`, Doris automatically falls back to reading from `_source`.
2. When too many fields are requested (`>= 25`), reading from doc values performs roughly the same as reading from `_source`.
### Detect keyword type fields (enable\_keyword\_sniff=true)
```
CREATE EXTERNAL TABLE `test` (
`k1` bigint(20) COMMENT "",
`k2` datetime COMMENT "",
`k3` varchar(20) COMMENT "",
`k4` varchar(100) COMMENT "",
`k5` float COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200",
"index" = "test",
"user" = "root",
"password" = "root",
"enable_keyword_sniff" = "true"
);
```
Parameter Description:
Parameter | Description
---|---
**enable\_keyword\_sniff** | whether to sniff string-type (**text**) `fields` in ES for an additional non-analyzed (**keyword**) sub-field (the multi-fields mechanism)
You can write data into ES without creating an index first. In that case ES automatically creates the index and, for a string field, creates both a `text` field and a `keyword` sub-field. This is the multi-fields feature of ES; the mapping is as follows:
```
"k4": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
```
When performing conditional filtering on k4, for example `=`, Doris On ES converts the query into an ES term query.
SQL filter:
```
k4 = "Doris On ES"
```
The query DSL converted into ES is:
```
"term" : {
"k4": "Doris On ES"
}
```
Because the primary type of k4 is `text`, the data is tokenized at import time by the analyzer configured for k4 (the standard analyzer if none is set), producing the three terms doris, on and es, as shown by the ES analyze API:
```
POST /_analyze
{
"analyzer": "standard",
"text": "Doris On ES"
}
```
The analysis result is:
```
{
"tokens": [
{
"token": "doris",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "on",
"start_offset": 6,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "es",
"start_offset": 9,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 2
}
]
}
```
The query uses:
```
"term" : {
"k4": "Doris On ES"
}
```
This term does not match any term in the dictionary, so the query returns no results. Enabling `enable_keyword_sniff: true` automatically converts `k4 = "Doris On ES"` into `k4.keyword = "Doris On ES"` to exactly match the SQL semantics. The converted ES query DSL is:
```
"term" : {
"k4.keyword": "Doris On ES"
}
```
The type of `k4.keyword` is `keyword`, so the data is written into ES as a complete term and can therefore be matched.
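With `enable_keyword_sniff` enabled, a plain equality filter on the analyzed field therefore behaves as an exact match in SQL, for example:
```
select * from test where k4 = 'Doris On ES';
```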
### Enable the node discovery mechanism, default is true (nodes\_discovery=true)
```
CREATE EXTERNAL TABLE `test` (
`k1` bigint(20) COMMENT "",
`k2` datetime COMMENT "",
`k3` varchar(20) COMMENT "",
`k4` varchar(100) COMMENT "",
`k5` float COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200",
"index" = "test",
"user" = "root",
"password" = "root",
"nodes_discovery" = "true"
);
```
Parameter Description:
Parameter | Description
---|---
**nodes\_discovery** | whether to enable ES node discovery; the default is true
When this is true, Doris finds all available data nodes (the nodes the shards are allocated on) from ES. Set it to false if the addresses of the ES data nodes cannot be reached by the Doris BEs, e.g. when the ES cluster is deployed in an intranet isolated from the public Internet and is accessed through a proxy.
### Whether the ES cluster uses HTTPS access mode; set to `true` if enabled, default is false (http\_ssl\_enabled=true)
```
CREATE EXTERNAL TABLE `test` (
`k1` bigint(20) COMMENT "",
`k2` datetime COMMENT "",
`k3` varchar(20) COMMENT "",
`k4` varchar(100) COMMENT "",
`k5` float COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://192.168.0.1:8200,http://192.168.0.2:8200",
"index" = "test",
"user" = "root",
"password" = "root",
"http_ssl_enabled" = "true"
);
```
Parameter Description:
Parameter | Description
---|---
**http\_ssl\_enabled** | whether the ES cluster has HTTPS access mode enabled
The current FE/BE implementation trusts all certificates. This is a temporary solution; support for user-configured certificates will be added later.
### Query usage
After creating the ES external table in Doris, it is no different from other tables in Doris, except that the Doris data models (rollup, pre-aggregation, materialized view, etc.) cannot be used.
#### Basic usage
```
select * from es_table where k1 > 1000 and k3 ='term' or k4 like 'fu*z_'
```
#### Extended esquery(field, QueryDSL)
Through the `esquery(field, QueryDSL)` function, queries that cannot be expressed in SQL, such as match_phrase and geo_shape, are pushed down to ES for filtering. The first parameter of `esquery` is a column name, used to associate the query with the index; the second parameter is the basic JSON expression of ES `Query DSL`, enclosed in curly braces `{}`. The JSON must have exactly one root key, such as match_phrase, geo_shape or bool.
Match query:
```
select * from es_table where esquery(k4, '{
"match": {
"k4": "doris on es"
}
}');
```
Geo related queries:
```
select * from es_table where esquery(k4, '{
"geo_shape": {
"location": {
"shape": {
"type": "envelope",
"coordinates": [
[
13,
53
],
[
14,
52
]
]
},
"relation": "within"
}
}
}');
```
Bool query:
```
select * from es_table where esquery(k4, ' {
"bool": {
"must": [
{
"terms": {
"k1": [
11,
12
]
}
},
{
"terms": {
"k2": [
100
]
}
}
]
}
}');
```
## Principle
```
+----------------------------------------------+
| |
| Doris +------------------+ |
| | FE +--------------+-------+
| | | Request Shard Location
| +--+-------------+-+ | |
| ^ ^ | |
| | | | |
| +-------------------+ +------------------+ | |
| | | | | | | | |
| | +----------+----+ | | +--+-----------+ | | |
| | | BE | | | | BE | | | |
| | +---------------+ | | +--------------+ | | |
+----------------------------------------------+ |
| | | | | | |
| | | | | | |
| HTTP SCROLL | | HTTP SCROLL | |
+-----------+---------------------+------------+ |
| | v | | v | | |
| | +------+--------+ | | +------+-------+ | | |
| | | | | | | | | | |
| | | DataNode | | | | DataNode +<-----------+
| | | | | | | | | | |
| | | +<--------------------------------+
| | +---------------+ | | |--------------| | | |
| +-------------------+ +------------------+ | |
| Same Physical Node | |
| | |
| +-----------------------+ | |
| | | | |
| | MasterNode +<-----------------+
| ES | | |
| +-----------------------+ |
+----------------------------------------------+
```
1. FE requests the hosts specified in the table properties to obtain the HTTP ports of the nodes and the shard locations of the index. If a request fails, FE traverses the host list sequentially until it succeeds or all hosts fail.
2. When querying, a query plan is generated and sent to the corresponding BE nodes according to the node information obtained by FE and the index metadata.
3. Following the `proximity principle`, each BE node requests the locally deployed ES nodes. The BEs receive data concurrently from every shard of the ES index in `HTTP Scroll` mode.
4. After computing the result, it is returned to the client.
## Best Practices
### Suggestions for using Date type fields
Date fields in ES are very flexible, but in Doris On ES, if the date field's type is not set properly, filter conditions cannot be pushed down.
When creating the index, set the date format for maximum compatibility:
```
"dt": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}
```
When creating this field in Doris, it is recommended to set it to `date` or `datetime` (it can also be `varchar`). The following SQL statements will push the filter conditions down to ES directly:
```
select * from doe where k2 > '2020-06-21';
select * from doe where k2 < '2020-06-21 12:00:00';
select * from doe where k2 < 1593497011;
select * from doe where k2 < now();
select * from doe where k2 < date_format(now(), '%Y-%m-%d');
```
`Notice`:
* If you do not set a format for a date field in ES, the default format is
```
strict_date_optional_time||epoch_millis
```
* If a Unix timestamp is indexed into ES as a date field, it needs to be in `ms`; ES processes timestamps internally in milliseconds, otherwise Doris On ES will read wrong column data.
### Fetch ES metadata field `_id`
When indexing documents without specifying `_id`, ES assigns each document a globally unique `_id` field. Users can also specify an `_id` with business meaning when indexing. If needed, Doris On ES can fetch this field by adding an `_id` column of type `varchar` when creating the ES external table:
```
CREATE EXTERNAL TABLE `doe` (
`_id` varchar COMMENT "",
`city` varchar COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://127.0.0.1:8200",
"user" = "root",
"password" = "root",
"index" = "doe"
);
```
`Notice`:
1. Filtering on the `_id` field only supports `=` and `in`
2. The `_id` field can only be of type `varchar`
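For illustration (the document ids below are hypothetical), filtering on `_id` looks like:
```
select city from doe where _id = 'A6L1aIUBbssyHsKQtgNB';
select city from doe where _id in ('A6L1aIUBbssyHsKQtgNB', 'BKL1aIUBbssyHsKQtgNB');
```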
## Q&A
1. ES Version Requirements
The major version of ES must be greater than 5. The data scanning mechanisms of ES before 2.x and after 5.x are different; currently only ES 5.x and later are supported.
2. Does ES Cluster Support X-Pack Authentication
Yes. All ES clusters that use HTTP Basic authentication are supported.
3. Some queries are much slower than requesting ES directly
Yes. For example, for queries like `_count`, ES can read the number of matching documents directly from its metadata, without filtering the real data.
4. Whether the aggregation operation can be pushed down
At present, Doris On ES does not push down aggregations such as sum, avg and min/max. All documents satisfying the conditions are streamed from ES in batches and then aggregated in Doris.


@ -1,208 +0,0 @@
---
{
"title": "Doris On Hive",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Hive External Table of Doris
<version deprecated="1.2.0" comment="Please use the multi-directory function to access Hive">
Hive External Table of Doris provides Doris with direct access to Hive external tables, which eliminates the need for cumbersome data import and solves the problem of analyzing Hive tables with the help of Doris' OLAP capabilities:
1. support for Hive data sources to access Doris
2. Support joint queries between Doris and Hive data sources to perform more complex analysis operations
3. Support access to kerberos-enabled Hive data sources
4. Support access to hive tables whose data stored on tencent chdfs
This document introduces how to use this feature and the considerations.
</version>
## Glossary
### Terms in Doris
* FE: Frontend, the front-end node of Doris, responsible for metadata management and request access.
* BE: Backend, the backend node of Doris, responsible for query execution and data storage
## How To Use
### Create Hive External Table
```sql
-- Syntax
CREATE [EXTERNAL] TABLE table_name (
col_name col_type [NULL | NOT NULL] [COMMENT "comment"]
) ENGINE=HIVE
[COMMENT "comment"] )
PROPERTIES (
'property_name'='property_value',
...
);
-- Example 1: Create the hive_table table under hive_db in a Hive cluster
CREATE TABLE `t_hive` (
`k1` int NOT NULL COMMENT "",
`k2` char(10) NOT NULL COMMENT "",
`k3` datetime NOT NULL COMMENT "",
`k5` varchar(20) NOT NULL COMMENT "",
`k6` double NOT NULL COMMENT ""
) ENGINE=HIVE
COMMENT "HIVE"
PROPERTIES (
'hive.metastore.uris' = 'thrift://192.168.0.1:9083',
'database' = 'hive_db',
'table' = 'hive_table'
);
-- Example 2: Create the hive_table table under hive_db in a Hive cluster with HDFS HA configuration.
CREATE TABLE `t_hive` (
`k1` int NOT NULL COMMENT "",
`k2` char(10) NOT NULL COMMENT "",
`k3` datetime NOT NULL COMMENT "",
`k5` varchar(20) NOT NULL COMMENT "",
`k6` double NOT NULL COMMENT ""
) ENGINE=HIVE
COMMENT "HIVE"
PROPERTIES (
'hive.metastore.uris' = 'thrift://192.168.0.1:9083',
'database' = 'hive_db',
'table' = 'hive_table',
'dfs.nameservices'='hacluster',
'dfs.ha.namenodes.hacluster'='n1,n2',
'dfs.namenode.rpc-address.hacluster.n1'='192.168.0.1:8020',
'dfs.namenode.rpc-address.hacluster.n2'='192.168.0.2:8020',
'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
-- Example 3: Create the hive external table under hive_db in Hive cluster with HDFS HA and enable kerberos authentication.
CREATE TABLE `t_hive` (
`k1` int NOT NULL COMMENT "",
`k2` char(10) NOT NULL COMMENT "",
`k3` datetime NOT NULL COMMENT "",
`k5` varchar(20) NOT NULL COMMENT "",
`k6` double NOT NULL COMMENT ""
) ENGINE=HIVE
COMMENT "HIVE"
PROPERTIES (
'hive.metastore.uris' = 'thrift://192.168.0.1:9083',
'database' = 'hive_db',
'table' = 'hive_table',
'dfs.nameservices'='hacluster',
'dfs.ha.namenodes.hacluster'='n1,n2',
'dfs.namenode.rpc-address.hacluster.n1'='192.168.0.1:8020',
'dfs.namenode.rpc-address.hacluster.n2'='192.168.0.2:8020',
'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',
'hadoop.security.authentication'='kerberos',
'dfs.namenode.kerberos.principal'='hadoop/_HOST@REALM.COM'
'hadoop.kerberos.principal'='doris_test@REALM.COM',
'hadoop.kerberos.keytab'='/path/to/doris_test.keytab'
);
-- Example 4: Create the hive_table table under hive_db in a Hive cluster with data stored on S3
CREATE TABLE `t_hive` (
`k1` int NOT NULL COMMENT "",
`k2` char(10) NOT NULL COMMENT "",
`k3` datetime NOT NULL COMMENT "",
`k5` varchar(20) NOT NULL COMMENT "",
`k6` double NOT NULL COMMENT ""
) ENGINE=HIVE
COMMENT "HIVE"
PROPERTIES (
'hive.metastore.uris' = 'thrift://192.168.0.1:9083',
'database' = 'hive_db',
'table' = 'hive_table',
'AWS_ACCESS_KEY'='your_access_key',
'AWS_SECRET_KEY'='your_secret_key',
'AWS_ENDPOINT'='s3.us-east-1.amazonaws.com',
'AWS_REGION'='us-east-1'
);
```
#### Parameter Description
- External Table Columns
- Column names should correspond to the Hive table
- The order of the columns should be the same as the Hive table
- Must contain all the columns in the Hive table
- Hive table partition columns do not need to be specified, they can be defined as normal columns.
- ENGINE should be specified as HIVE
- PROPERTIES attribute.
- `hive.metastore.uris`: Hive Metastore service address
- `database`: the name of the database to which Hive is mounted
- `table`: the name of the table to which Hive is mounted
- `hadoop.username`: the username to visit HDFS (need to specify it when the authentication type is simple)
- `dfs.nameservices`: the logical name for this new nameservice. See hdfs-site.xml
- `dfs.ha.namenodes.[nameservice ID]`:unique identifiers for each NameNode in the nameservice. See hdfs-site.xml
- `dfs.namenode.rpc-address.[nameservice ID].[name node ID]`:the fully-qualified RPC address for each NameNode to listen on. See hdfs-site.xml
- `dfs.client.failover.proxy.provider.[nameservice ID]`:the Java class that HDFS clients use to contact the Active NameNode, usually it is org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
- For a kerberos enabled Hive datasource, additional properties need to be set:
- `dfs.namenode.kerberos.principal`: HDFS namenode service principal
- `hadoop.security.authentication`: HDFS authentication type; set it to kerberos (the default is simple)
- `hadoop.kerberos.principal`: the Kerberos principal that Doris will use when connecting to HDFS.
- `hadoop.kerberos.keytab`: HDFS client keytab location.
- `AWS_ACCESS_KEY`: AWS access key id.
- `AWS_SECRET_KEY`: AWS secret access key.
- `AWS_ENDPOINT`: S3 endpoint. e.g. s3.us-east-1.amazonaws.com
- `AWS_REGION`: AWS region. e.g. us-east-1
**Note:**
- To enable Doris to access a Hadoop cluster with Kerberos authentication enabled, you need to deploy the Kerberos client (kinit) on all Doris FE and BE nodes, configure krb5.conf, and fill in the KDC service information.
- The value of the PROPERTIES property `hadoop.kerberos.keytab` needs to specify the absolute path of the keytab local file and allow the Doris process to access it.
- The HDFS cluster configuration can be written into an hdfs-site.xml file placed in the conf directory of both FE and BE. Then, when creating a Hive table, users do not need to fill in the HDFS cluster configuration.
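As a sketch (the local table `doris_tbl` and its columns are hypothetical), once `t_hive` is created it can be queried and joined with internal Doris tables directly:
```sql
-- join the mounted Hive table with a local Doris table
SELECT h.k1, h.k2, d.col1
FROM t_hive h
JOIN doris_tbl d ON h.k1 = d.k1
WHERE h.k3 >= '2022-01-01 00:00:00';
```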
## Data Type Matching
The supported Hive column types correspond to Doris in the following table.
| Hive | Doris | Description |
| :------: | :----: | :-------------------------------: |
| BOOLEAN | BOOLEAN | |
| CHAR | CHAR | Only UTF8 encoding is supported |
| VARCHAR | VARCHAR | Only UTF8 encoding is supported |
| TINYINT | TINYINT | |
| SMALLINT | SMALLINT | |
| INT | INT | |
| BIGINT | BIGINT | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| DECIMAL | DECIMAL | |
| DATE | DATE | |
| TIMESTAMP | DATETIME | Timestamp to Datetime will lose precision |
**Note:**
- Hive table Schema changes **are not automatically synchronized** and require rebuilding the Hive external table in Doris.
- The current Hive storage format only supports Text, Parquet and ORC types
- The Hive versions currently supported by default are `2.3.7` and `3.1.2`; other versions have not been tested. More versions will be supported in the future.
### Query Usage
After you finish building the Hive external table in Doris, it is no different from a normal Doris OLAP table except that you cannot use the data model in Doris (rollup, preaggregation, materialized view, etc.)
```sql
select * from t_hive where k1 > 1000 and k3 = 'term' or k4 like '%doris';
```


@ -1,141 +0,0 @@
---
{
"title": "Doris Hudi external table",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Hudi External Table of Doris
<version deprecated="1.2.0" comment="Please use the Multi-Catalog function to access Hudi">
Hudi External Table of Doris provides Doris with the ability to access Hudi external tables directly, eliminating the need for cumbersome data import and leveraging Doris' own OLAP capabilities to solve Hudi table data analysis problems.
1. Support for Hudi data sources accessed from Doris
2. Support for joint queries between Doris and Hudi data source tables to perform more complex analysis operations
This document introduces how to use this feature and the considerations.
</version>
## Glossary
### Terms in Doris
* FE: Frontend, the front-end node of Doris, responsible for metadata management and request access
* BE: Backend, the backend node of Doris, responsible for query execution and data storage
## How to use
### Create Hudi External Table
Hudi tables can be created in Doris with or without a schema. You do not need to declare the column definitions when creating the external table; Doris can resolve them from the Hive Metastore when the table is queried.
1. Create a separate external table to mount the Hudi table.
The syntax can be viewed in `HELP CREATE TABLE`.
```sql
-- Syntax
CREATE [EXTERNAL] TABLE table_name
[(column_definition1[, column_definition2, ...])]
ENGINE = HUDI
[COMMENT "comment"]
PROPERTIES (
"hudi.database" = "hudi_db_in_hive_metastore",
"hudi.table" = "hudi_table_in_hive_metastore",
"hudi.hive.metastore.uris" = "thrift://127.0.0.1:9083"
);
-- Example: Mount hudi_table_in_hive_metastore under hudi_db_in_hive_metastore in Hive MetaStore
CREATE TABLE `t_hudi`
ENGINE = HUDI
PROPERTIES (
"hudi.database" = "hudi_db_in_hive_metastore",
"hudi.table" = "hudi_table_in_hive_metastore",
"hudi.hive.metastore.uris" = "thrift://127.0.0.1:9083"
);
-- Example:Mount hudi table with schema.
CREATE TABLE `t_hudi` (
`id` int NOT NULL COMMENT "id number",
`name` varchar(10) NOT NULL COMMENT "user name"
) ENGINE = HUDI
PROPERTIES (
"hudi.database" = "hudi_db_in_hive_metastore",
"hudi.table" = "hudi_table_in_hive_metastore",
"hudi.hive.metastore.uris" = "thrift://127.0.0.1:9083"
);
```
#### Parameter Description
- column_definition
- When creating a Hudi table without a schema (recommended), Doris resolves the columns from the Hive Metastore at query time.
- When creating a Hudi table with a schema, the columns must exist in the corresponding table in the Hive Metastore.
- ENGINE needs to be specified as HUDI
- PROPERTIES property.
- `hudi.hive.metastore.uris`: Hive Metastore service address
- `hudi.database`: the name of the database to which Hudi is mounted
- `hudi.table`: the name of the table to which Hudi is mounted, not required when mounting Hudi database.
### Show table structure
Show table structure can be viewed by `HELP SHOW CREATE TABLE`.
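For example, to view the structure that Doris resolved for the table created above:
```sql
SHOW CREATE TABLE t_hudi;
```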
## Data Type Matching
The supported Hudi column types correspond to Doris in the following table.
| Hudi | Doris | Description |
| :------: | :----: | :-------------------------------: |
| BOOLEAN | BOOLEAN | |
| INTEGER | INT | |
| LONG | BIGINT | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| DATE | DATE | |
| TIMESTAMP | DATETIME | Timestamp to Datetime with loss of precision |
| STRING | STRING | |
| UUID | VARCHAR | Use VARCHAR instead |
| DECIMAL | DECIMAL | |
| TIME | - | not supported |
| FIXED | - | not supported |
| BINARY | - | not supported |
| STRUCT | - | not supported |
| LIST | - | not supported |
| MAP | - | not supported |
**Note:**
- The currently supported Hudi version is 0.10.0 by default; other versions have not been tested. More versions will be supported in the future.
### Query Usage
Once you have finished building the Hudi external table in Doris, it is no different from a normal Doris OLAP table except that you cannot use the data models in Doris (rollup, preaggregation, materialized views, etc.)
```sql
select * from t_hudi where k1 > 1000 and k3 = 'term' or k4 like '%doris';
```


@ -1,246 +0,0 @@
---
{
"title": "Doris On Iceberg",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Iceberg External Table of Doris
<version deprecated="1.2.0" comment="Please use the multi-directory function to access Iceberg">
Iceberg External Table of Doris provides Doris with the ability to access Iceberg external tables directly, eliminating the need for cumbersome data import and leveraging Doris' own OLAP capabilities to solve Iceberg table data analysis problems.
1. support Iceberg data sources to access Doris
2. Support joint query between Doris and Iceberg data source tables to perform more complex analysis operations
This document introduces how to use this feature and the considerations.
</version>
## Glossary
### Terms in Doris
* FE: Frontend, the front-end node of Doris, responsible for metadata management and request access
* BE: Backend, the backend node of Doris, responsible for query execution and data storage
## How to use
### Create Iceberg External Table
Iceberg tables can be created in Doris in two ways. You do not need to declare the column definitions when creating the external table; Doris can automatically convert them based on the column definitions of the table in Iceberg.
1. Create a separate external table to mount the Iceberg table.
The syntax can be viewed in [CREATE TABLE](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md).
```sql
-- Syntax
CREATE [EXTERNAL] TABLE table_name
ENGINE = ICEBERG
[COMMENT "comment"]
PROPERTIES (
"iceberg.database" = "iceberg_db_name",
"iceberg.table" = "icberg_table_name",
"iceberg.hive.metastore.uris" = "thrift://192.168.0.1:9083",
"iceberg.catalog.type" = "HIVE_CATALOG"
);
-- Example 1: Mount iceberg_table under iceberg_db in Iceberg
CREATE TABLE `t_iceberg`
ENGINE = ICEBERG
PROPERTIES (
"iceberg.database" = "iceberg_db",
"iceberg.table" = "iceberg_table",
"iceberg.hive.metastore.uris" = "thrift://192.168.0.1:9083",
"iceberg.catalog.type" = "HIVE_CATALOG"
-- Example 2: Mount iceberg_table under iceberg_db in Iceberg, with HDFS HA enabled.
CREATE TABLE `t_iceberg`
ENGINE = ICEBERG
PROPERTIES (
"iceberg.database" = "iceberg_db",
"iceberg.table" = "iceberg_table"
"iceberg.hive.metastore.uris" = "thrift://192.168.0.1:9083",
"iceberg.catalog.type" = "HIVE_CATALOG",
"dfs.nameservices"="HDFS8000463",
"dfs.ha.namenodes.HDFS8000463"="nn2,nn1",
"dfs.namenode.rpc-address.HDFS8000463.nn2"="172.21.16.5:4007",
"dfs.namenode.rpc-address.HDFS8000463.nn1"="172.21.16.26:4007",
"dfs.client.failover.proxy.provider.HDFS8000463"="org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
);
```
2. Create an Iceberg database to mount the corresponding Iceberg database on the remote side, and mount all the tables under the database.
You can check the syntax with [CREATE DATABASE](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-DATABASE.md).
```sql
-- Syntax
CREATE DATABASE db_name
[COMMENT "comment"]
PROPERTIES (
"iceberg.database" = "iceberg_db_name",
"iceberg.hive.metastore.uris" = "thrift://192.168.0.1:9083",
"iceberg.catalog.type" = "HIVE_CATALOG"
);
-- Example: mount the iceberg_db in Iceberg and mount all tables under that db
CREATE DATABASE `iceberg_test_db`
PROPERTIES (
"iceberg.database" = "iceberg_db",
"iceberg.hive.metastore.uris" = "thrift://192.168.0.1:9083",
"iceberg.catalog.type" = "HIVE_CATALOG"
);
```
The progress of the table build in `iceberg_test_db` can be viewed by `HELP SHOW TABLE CREATION`.
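A minimal sketch of checking that progress (see `HELP SHOW TABLE CREATION` for the exact syntax supported by your version):
```sql
SHOW TABLE CREATION FROM iceberg_test_db;
```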
You can also create an Iceberg table by explicitly specifying the column definitions according to your needs.
1. Create an Iceberg table
```sql
-- Syntax
CREATE [EXTERNAL] TABLE table_name (
col_name col_type [NULL | NOT NULL] [COMMENT "comment"]
) ENGINE = ICEBERG
[COMMENT "comment"] )
PROPERTIES (
"iceberg.database" = "iceberg_db_name",
"iceberg.table" = "icberg_table_name",
"iceberg.hive.metastore.uris" = "thrift://192.168.0.1:9083",
"iceberg.catalog.type" = "HIVE_CATALOG"
);
-- Example 1: Mount iceberg_table under iceberg_db in Iceberg
CREATE TABLE `t_iceberg` (
`id` int NOT NULL COMMENT "id number",
`name` varchar(10) NOT NULL COMMENT "user name"
) ENGINE = ICEBERG
PROPERTIES (
"iceberg.database" = "iceberg_db",
"iceberg.table" = "iceberg_table",
"iceberg.hive.metastore.uris" = "thrift://192.168.0.1:9083",
"iceberg.catalog.type" = "HIVE_CATALOG"
);
-- Example 2: Mount iceberg_table under iceberg_db in Iceberg, with HDFS HA enabled.
CREATE TABLE `t_iceberg` (
`id` int NOT NULL COMMENT "id number",
`name` varchar(10) NOT NULL COMMENT "user name"
) ENGINE = ICEBERG
PROPERTIES (
"iceberg.database" = "iceberg_db",
"iceberg.table" = "iceberg_table",
"iceberg.hive.metastore.uris" = "thrift://192.168.0.1:9083",
"iceberg.catalog.type" = "HIVE_CATALOG",
"dfs.nameservices"="HDFS8000463",
"dfs.ha.namenodes.HDFS8000463"="nn2,nn1",
"dfs.namenode.rpc-address.HDFS8000463.nn2"="172.21.16.5:4007",
"dfs.namenode.rpc-address.HDFS8000463.nn1"="172.21.16.26:4007",
"dfs.client.failover.proxy.provider.HDFS8000463"="org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
);
```
#### Parameter Description
- External Table Columns
- Column names should correspond to the Iceberg table
- The order of the columns needs to be consistent with the Iceberg table
- ENGINE needs to be specified as ICEBERG
- PROPERTIES property.
- `iceberg.hive.metastore.uris`: Hive Metastore service address
- `iceberg.database`: the name of the database to which Iceberg is mounted
- `iceberg.table`: the name of the table to which Iceberg is mounted, not required when mounting Iceberg database.
- `iceberg.catalog.type`: the catalog type used by Iceberg. The default is `HIVE_CATALOG`; currently only this type is supported, and more Iceberg catalog access methods will be supported in the future.
### Show table structure
Show table structure can be viewed by [SHOW CREATE TABLE](../../sql-manual/sql-reference/Show-Statements/SHOW-CREATE-TABLE.md).
### Synchronized mounts
When the Iceberg table schema changes, you can manually synchronize it with the `REFRESH` command, which removes and rebuilds the Iceberg external table or database in Doris; see `HELP REFRESH` for details.
```sql
-- Synchronize the Iceberg table
REFRESH TABLE t_iceberg;
-- Synchronize the Iceberg database
REFRESH DATABASE iceberg_test_db;
```
## Data Type Matching
The supported Iceberg column types correspond to Doris in the following table.
| Iceberg | Doris | Description |
| :------: | :----: | :-------------------------------: |
| BOOLEAN | BOOLEAN | |
| INTEGER | INT | |
| LONG | BIGINT | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| DATE | DATE | |
| TIMESTAMP | DATETIME | Timestamp to Datetime with loss of precision |
| STRING | STRING | |
| UUID | VARCHAR | Use VARCHAR instead |
| DECIMAL | DECIMAL | |
| TIME | - | not supported |
| FIXED | - | not supported |
| BINARY | - | not supported |
| STRUCT | - | not supported |
| LIST | - | not supported |
| MAP | - | not supported |
**Note:**
- Iceberg table Schema changes **are not automatically synchronized** and require synchronization of Iceberg external tables or databases in Doris via the `REFRESH` command.
- The currently supported Iceberg versions are 0.12.0 and 0.13.2 by default; other versions have not been tested. More versions will be supported in the future.
### Query Usage
Once you have finished building the Iceberg external table in Doris, it is no different from a normal Doris OLAP table except that you cannot use the data models in Doris (rollup, preaggregation, materialized views, etc.)
```sql
select * from t_iceberg where k1 > 1000 and k3 = 'term' or k4 like '%doris';
```
## Related system configurations
### FE Configuration
The following configurations are at the Iceberg external table system level and can be configured by modifying `fe.conf` or by `ADMIN SET CONFIG`.
- `iceberg_table_creation_strict_mode`
Iceberg tables are created with strict mode enabled by default.
strict mode means that the column types of the Iceberg table are strictly filtered, and if there are data types that Doris does not currently support, the creation of the table will fail.
- `iceberg_table_creation_interval_second`
The background task execution interval for automatic creation of Iceberg tables, default is 10s.
- `max_iceberg_table_creation_record_size`
The maximum number of Iceberg table creation records to retain, default is 2000. This only applies to records generated by creating an Iceberg database.
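As noted above, these can also be changed at runtime via `ADMIN SET CONFIG`; a minimal sketch (the values are illustrative):
```sql
-- adjust the background creation interval and the record limit
ADMIN SET FRONTEND CONFIG ("iceberg_table_creation_interval_second" = "5");
ADMIN SET FRONTEND CONFIG ("max_iceberg_table_creation_record_size" = "5000");
```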


@ -1,329 +0,0 @@
---
{
"title": "Doris On JDBC",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# JDBC External Table Of Doris
<version since="1.2.0">
JDBC External Table of Doris allows Doris to access external tables through the standard database access interface (JDBC). External tables eliminate tedious data import work, give Doris the ability to access various databases, and leverage Doris's own capabilities to solve data analysis problems over external table data:
1. Support for various data sources accessed from Doris
2. Support for joint queries between Doris and tables in various data sources for more complex analysis operations
This document mainly introduces how to use this function.
</version>
## Instructions
### Create JDBC external table in Doris
For the detailed table creation syntax, refer to [CREATE TABLE](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md)
#### 1. Create JDBC external table through JDBC_Resource
```sql
CREATE EXTERNAL RESOURCE jdbc_resource
properties (
"type"="jdbc",
"user"="root",
"password"="123456",
"jdbc_url"="jdbc:mysql://192.168.0.1:3306/test",
"driver_url"="http://IP:port/mysql-connector-java-5.1.47.jar",
"driver_class"="com.mysql.jdbc.Driver"
);
CREATE EXTERNAL TABLE `baseall_mysql` (
`k1` tinyint(4) NULL,
`k2` smallint(6) NULL,
`k3` int(11) NULL,
`k4` bigint(20) NULL,
`k5` decimal(9, 3) NULL
) ENGINE=JDBC
PROPERTIES (
"resource" = "jdbc_resource",
"table" = "baseall",
"table_type"="mysql"
);
```
Parameter Description:
| Parameter | Description |
| ---------------- | ----------------------------- |
| **type** | "jdbc", Required flag of Resource Type 。|
| **user** | Username used to access the external database。|
| **password** | Password information corresponding to the user。|
| **jdbc_url** | The URL protocol of JDBC, including database type, IP address, port number and database name, has different formats for different database protocols. for example mysql: "jdbc:mysql://127.0.0.1:3306/test"。|
| **driver_class** | The class name of the driver package for accessing the external database,for example mysql:com.mysql.jdbc.Driver. |
| **driver_url** | The package driver URL used to download and access external databases。http://IP:port/mysql-connector-java-5.1.47.jar . During the local test of one BE, the jar package can be placed in the local path, "driver_url"=“ file:///home/disk1/pathTo/mysql-connector-java-5.1.47.jar ", In case of multiple BE test, Must ensure that they have the same path information|
| **resource** | The resource name that depends on when creating the external table in Doris corresponds to the name when creating the resource in the previous step.|
| **table** | The table name mapped to the external database when creating the external table in Doris.|
| **table_type** | When creating an appearance in Doris, the table comes from that database. for example mysql,postgresql,sqlserver,oracle.|
>**Notice:**
>
>If you use the local path method, the driver jar package that the database depends on must be placed at the same local path on every FE and BE node.
<version since="1.2.1">
> After 1.2.1, you can put the driver in the `jdbc_drivers` directory of FE/BE and directly specify the file name, such as: `"driver_url" = "mysql-connector-java-5.1.47.jar"`. The system will automatically look for the file in the `jdbc_drivers` directory.
</version>
### Query usage
```
select * from mysql_table where k1 > 1000 and k3 ='term';
```
Because database keywords may be used as column names, field names and table names in the generated SQL statements are automatically quoted according to each database's conventions, e.g. MySQL (``), PostgreSQL (""), SQLServer ([]) and Oracle (""). This may make field names case-sensitive. You can check the escaped query statements issued to each database with `explain`.
### Data write
After the JDBC external table is created in Doris, data can be written directly with the `insert into` statement: query results from Doris can be written to the JDBC external table, or data can be imported from one JDBC table into another.
```
insert into mysql_table values(1, "doris");
insert into mysql_table select * from table;
```
#### Transaction
Doris writes data to external tables in batches. If the import is interrupted, previously written data may need to be rolled back, so the JDBC external table supports transactional writes. Transaction support is enabled via the session variable `enable_odbc_transcation` (ODBC transactions are also controlled by this variable).
```
set enable_odbc_transcation = true;
```
Transactions ensure the atomicity of writes to the JDBC external table, but they reduce write performance, so enable them only when needed.
#### 1.Mysql
| Mysql Version | Mysql JDBC Driver Version |
| ------------- | ------------------------------- |
| 8.0.30 | mysql-connector-java-5.1.47.jar |
#### 2.PostgreSQL
| PostgreSQL Version | PostgreSQL JDBC Driver Version |
| ------------------ | ------------------------------ |
| 14.5 | postgresql-42.5.0.jar |
```sql
CREATE EXTERNAL RESOURCE jdbc_pg
properties (
"type"="jdbc",
"user"="postgres",
"password"="123456",
"jdbc_url"="jdbc:postgresql://127.0.0.1:5442/postgres?currentSchema=doris_test",
"driver_url"="http://127.0.0.1:8881/postgresql-42.5.0.jar",
"driver_class"="org.postgresql.Driver"
);
CREATE EXTERNAL TABLE `ext_pg` (
`k1` int
) ENGINE=JDBC
PROPERTIES (
"resource" = "jdbc_pg",
"table" = "pg_tbl",
"table_type"="postgresql"
);
```
#### 3.SQLServer
| SQLserver Version | SQLserver JDBC Driver Version |
| ------------- | -------------------------- |
| 2022 | mssql-jdbc-11.2.0.jre8.jar |
#### 4.Oracle
| Oracle Version | Oracle JDBC Driver Version |
| ---------- | ------------------- |
| 11 | ojdbc6.jar |
At present, only this version has been tested, and other versions will be added after testing
#### 5.ClickHouse
| ClickHouse Version | ClickHouse JDBC Driver Version |
|-------------|---------------------------------------|
| 22 | clickhouse-jdbc-0.3.2-patch11-all.jar |
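As a sketch mirroring the PostgreSQL example above (the host, port, table name and driver URL are placeholders, and the driver class for the clickhouse-jdbc 0.3.2 driver is assumed to be `com.clickhouse.jdbc.ClickHouseDriver`):
```sql
CREATE EXTERNAL RESOURCE jdbc_ck
properties (
    "type"="jdbc",
    "user"="default",
    "password"="123456",
    "jdbc_url"="jdbc:clickhouse://127.0.0.1:8123/doris_test",
    "driver_url"="http://127.0.0.1:8881/clickhouse-jdbc-0.3.2-patch11-all.jar",
    "driver_class"="com.clickhouse.jdbc.ClickHouseDriver"
);
CREATE EXTERNAL TABLE `ext_ck` (
    `k1` int
) ENGINE=JDBC
PROPERTIES (
    "resource" = "jdbc_ck",
    "table" = "ck_tbl",
    "table_type"="clickhouse"
);
```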
## Type matching
There are different data types among different databases. Here is a list of the matching between the types in each database and the data types in Doris.
### MySQL
| MySQL | Doris |
| :------: | :------: |
| BOOLEAN | BOOLEAN |
| BIT(1) | BOOLEAN |
| TINYINT | TINYINT |
| SMALLINT | SMALLINT |
| INT | INT |
| BIGINT | BIGINT |
|BIGINT UNSIGNED|LARGEINT|
| VARCHAR | VARCHAR |
| DATE | DATE |
| FLOAT | FLOAT |
| DATETIME | DATETIME |
| DOUBLE | DOUBLE |
| DECIMAL | DECIMAL |
### PostgreSQL
| PostgreSQL | Doris |
| :--------------: | :------: |
| BOOLEAN | BOOLEAN |
| SMALLINT | SMALLINT |
| INT | INT |
| BIGINT | BIGINT |
| VARCHAR | VARCHAR |
| DATE | DATE |
| TIMESTAMP | DATETIME |
| REAL | FLOAT |
| FLOAT | DOUBLE |
| DECIMAL | DECIMAL |
### Oracle
| Oracle | Doris |
| :------: | :------: |
| VARCHAR | VARCHAR |
| DATE | DATETIME |
| SMALLINT | SMALLINT |
| INT | INT |
| REAL | DOUBLE |
| FLOAT | DOUBLE |
| NUMBER | DECIMAL |
### SQL server
| SQLServer | Doris |
| :-------: | :------: |
| BIT | BOOLEAN |
| TINYINT | TINYINT |
| SMALLINT | SMALLINT |
| INT | INT |
| BIGINT | BIGINT |
| VARCHAR | VARCHAR |
| DATE | DATE |
| DATETIME | DATETIME |
| REAL | FLOAT |
| FLOAT | DOUBLE |
| DECIMAL | DECIMAL |
### ClickHouse
| ClickHouse | Doris |
|:----------:|:--------:|
| BOOLEAN | BOOLEAN |
| CHAR | CHAR |
| VARCHAR | VARCHAR |
| STRING | STRING |
| DATE | DATE |
| Float32 | FLOAT |
| Float64 | DOUBLE |
| Int8 | TINYINT |
| Int16 | SMALLINT |
| Int32 | INT |
| Int64 | BIGINT |
| Int128 | LARGEINT |
| DATETIME | DATETIME |
| DECIMAL | DECIMAL |
**Note:**
- Some specific types in ClickHouse, such as UUID, IPv4, IPv6 and Enum8, can be mapped to Doris's Varchar/String type. However, IPv4 and IPv6 values are displayed with an extra leading `/`, which needs to be stripped with the `split_part` function.
- The ClickHouse Geo type Point cannot be mapped.
## Q&A
1. Are more databases supported besides MySQL, Oracle, PostgreSQL, SQL Server and ClickHouse?
At present, Doris only adapts to MySQL, Oracle, PostgreSQL, SQL Server and ClickHouse, and adaptation of other databases is planned. In principle, any database that supports JDBC access can be accessed through the JDBC external table. If you need to access other external tables, you are welcome to modify the code and contribute to Doris.
2. Emoji characters read from a MySQL external table appear garbled
When Doris makes a JDBC external table connection, MySQL's default utf8 encoding is utf8mb3, which cannot represent emoji that require 4-byte encoding. You need to set the encoding of the corresponding columns and of the server to utf8mb4, and do not configure characterEncoding in the JDBC URL when creating the MySQL external table (this attribute does not support utf8mb4; if a non-utf8mb4 encoding is configured, emoji cannot be written, so leave it unset).
3. When reading DateTime="0000:00:00 00:00:00" from a MySQL table, an error is reported: "CAUSED BY: DataReadException: Zero date value prohibited"
This is because JDBC's default handling of this illegal DateTime is to throw an exception. You can control this behavior through the parameter zeroDateTimeBehavior.
The optional values are EXCEPTION, CONVERT_TO_NULL and ROUND, which respectively throw an error, convert to NULL, and convert to "0001-01-01 00:00:00".
You can add it to the URL, e.g. "jdbc_url"="jdbc:mysql://IP:PORT/doris_test?zeroDateTimeBehavior=convertToNull"
4. Loading the driver class fails when reading a MySQL table or another external table
For example, the following exception:
failed to load driver class com.mysql.jdbc.driver in either of hikariconfig class loader
This is because driver_class was entered incorrectly when the resource was created. It is case sensitive, so for the preceding example it must be entered as `"driver_class" = "com.mysql.jdbc.Driver"`.
5. Communications link failure when reading MySQL
If the following error occurs:
```
ERROR 1105 (HY000): errCode = 2, detailMessage = PoolInitializationException: Failed to initialize pool: Communications link failure
The last packet successfully received from the server was 7 milliseconds ago. The last packet sent successfully to the server was 4 milliseconds ago.
CAUSED BY: CommunicationsException: Communications link failure
The last packet successfully received from the server was 7 milliseconds ago. The last packet sent successfully to the server was 4 milliseconds ago.
CAUSED BY: SSLHandshakeExcepti
```
You can check the be.out log of the BE.
If the following information is included:
```
WARN: Establishing SSL connection without server's identity verification is not recommended.
According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set.
For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'.
You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
```
You can add the JDBC connection string at the end of the jdbc_url where the resource is created `?useSSL=false` ,like `"jdbc_url" = "jdbc:mysql://127.0.0.1:3306/test?useSSL=false"`
```
# The configuration items can be modified globally.
# Modify the my.ini file in the MySQL directory (on Linux, the my.cnf file in the etc directory):
[client]
default-character-set=utf8mb4
[mysql]
# Set the MySQL default character set
default-character-set=utf8mb4
[mysqld]
# Set the MySQL server character set
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
init_connect='SET NAMES utf8mb4'
# Modify the type of the corresponding table and column
ALTER TABLE table_name MODIFY column_name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE table_name CHARSET=utf8mb4;
SET NAMES utf8mb4;
```


@ -1,907 +0,0 @@
---
{
"title": "Multi-Catalog",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Multi-Catalog
<version since="1.2.0">
Multi-Catalog is a feature introduced in Doris 1.2.0, which aims to make it easier to interface with external data sources to enhance Doris' data lake analysis and federated data query capabilities.
In previous versions of Doris, there were only two levels of user data: Database and Table. When connecting to an external data source, we could only connect at the Database or Table level. For example, a table in an external data source could be mapped through `create external table`, or a Database through `create external database`. If there are many Databases or Tables in the external data source, users have to map them manually one by one, which is cumbersome.
The new Multi-Catalog function adds a new layer of Catalog to the original metadata level, forming a three-layer metadata level of Catalog -> Database -> Table. Among them, Catalog can directly correspond to the external data source. Currently supported external data sources include:
1. Hive MetaStore: Connect to a Hive MetaStore, so that you can directly access Hive, Iceberg, Hudi and other data in it.
2. Elasticsearch: Connect to an ES cluster and directly access the tables and shards in it.
3. JDBC: Connect to the standard database access interface (JDBC) to access data of various databases. (currently only supports `jdbc:mysql`)
This function will be used as a supplement and enhancement to the previous external table connection method (External Table) to help users perform fast multi-catalog federated queries.
</version>
## Basic Concepts
1. Internal Catalog
Doris's original Database and Table will belong to Internal Catalog. Internal Catalog is the built-in default Catalog, which cannot be modified or deleted by the user.
2. External Catalog
An External Catalog can be created with the [CREATE CATALOG](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-CATALOG.md) command. After creation, you can view the created catalog through the [SHOW CATALOGS](../../sql-manual/sql-reference/Show-Statements/SHOW-CATALOGS.md) command.
3. Switch Catalog
After users log in to Doris, they enter the Internal Catalog by default, so the default usage is the same as the previous version. You can directly use `SHOW DATABASES`, `USE DB` and other commands to view and switch databases.
Users can switch the catalog through the [SWITCH](../../sql-manual/sql-reference/Utility-Statements/SWITCH.md) command. like:
````
SWITCH internal;
SWITCH hive_catalog;
````
After switching, you can directly view and switch the Database in the corresponding Catalog through commands such as `SHOW DATABASES`, `USE DB`. Doris will automatically sync the Database and Table in the Catalog. Users can view and access data in the External Catalog as they would with the Internal Catalog.
Currently, Doris only supports read-only access to data in the External Catalog.
4. Drop Catalog
Both Database and Table in External Catalog are read-only. However, the catalog can be deleted (Internal Catalog cannot be deleted). An External Catalog can be dropped via the [DROP CATALOG](../../sql-manual/sql-reference/Data-Definition-Statements/Drop/DROP-CATALOG) command.
This operation will only delete the mapping information of the catalog in Doris, and will not modify or change the contents of any external data source.
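A minimal sketch (the catalog name `hive_catalog` is a placeholder):
```sql
-- Only the mapping information in Doris is removed; the external data source itself is untouched.
DROP CATALOG hive_catalog;
```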
## Samples
### Connect Hive MetaStore(Hive/Iceberg/Hudi)
> 1. Hive version 2.3.7 and above is supported.
> 2. Iceberg currently only supports the V1 format; V2 will be supported soon.
> 3. Hudi currently only supports Snapshot Query on Copy On Write tables and Read Optimized Query on Merge On Read tables. Incremental Query and Snapshot Query on Merge On Read tables will be supported in the future.
> 4. Hive tables whose data is stored on Tencent CHDFS are supported and used in the same way as common Hive tables.
The following example is used to create a Catalog named hive to connect the specified Hive MetaStore, and provide the HDFS HA connection properties to access the corresponding files in HDFS.
**Create catalog through resource**
In versions later than `1.2.0`, it is recommended to create a catalog through a resource.
```sql
CREATE RESOURCE hms_resource PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hadoop.username' = 'hive',
'dfs.nameservices'='your-nameservice',
'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
// The properties in 'PROPERTIES' will overwrite the properties in "Resource"
CREATE CATALOG hive WITH RESOURCE hms_resource PROPERTIES(
'key' = 'value'
);
```
**Create catalog through properties**
In version `1.2.0`, a catalog is created directly through properties.
```sql
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hadoop.username' = 'hive',
'dfs.nameservices'='your-nameservice',
'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
```
If you want to connect to a Hive MetaStore with kerberos authentication, you can do like this:
```sql
-- 1.2.0+ Version
CREATE RESOURCE hms_resource PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hive.metastore.sasl.enabled' = 'true',
'dfs.nameservices'='your-nameservice',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',
'hadoop.security.authentication' = 'kerberos',
'hadoop.kerberos.keytab' = '/your-keytab-filepath/your.keytab',
'hadoop.kerberos.principal' = 'your-principal@YOUR.COM',
'yarn.resourcemanager.address' = 'your-rm-address:your-rm-port',
'yarn.resourcemanager.principal' = 'your-rm-principal/_HOST@YOUR.COM'
);
CREATE CATALOG hive WITH RESOURCE hms_resource;
-- 1.2.0 Version
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hadoop.kerberos.xxx' = 'xxx',
...
);
```
If you want to connect to Hadoop with KMS authentication, add the following configuration to the properties:
```
'dfs.encryption.key.provider.uri' = 'kms://http@kms_host:kms_port/kms'
```
Once created, you can view the catalog with the `SHOW CATALOGS` command:
```
mysql> SHOW CATALOGS;
+-----------+-------------+----------+
| CatalogId | CatalogName | Type |
+-----------+-------------+----------+
| 10024 | hive | hms |
| 0 | internal | internal |
+-----------+-------------+----------+
```
Switch to the hive catalog with the `SWITCH` command and view the databases in it:
```
mysql> SWITCH hive;
Query OK, 0 rows affected (0.00 sec)
mysql> SHOW DATABASES;
+-----------+
| Database |
+-----------+
| default |
| random |
| ssb100 |
| tpch1 |
| tpch100 |
| tpch1_orc |
+-----------+
```
Switch to the tpch100 database and view the tables in it:
```
mysql> USE tpch100;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> SHOW TABLES;
+-------------------+
| Tables_in_tpch100 |
+-------------------+
| customer |
| lineitem |
| nation |
| orders |
| part |
| partsupp |
| region |
| supplier |
+-------------------+
```
View schema of table lineitem:
```
mysql> DESC lineitem;
+-----------------+---------------+------+------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+---------------+------+------+---------+-------+
| l_shipdate | DATE | Yes | true | NULL | |
| l_orderkey | BIGINT | Yes | true | NULL | |
| l_linenumber | INT | Yes | true | NULL | |
| l_partkey | INT | Yes | true | NULL | |
| l_suppkey | INT | Yes | true | NULL | |
| l_quantity | DECIMAL(15,2) | Yes | true | NULL | |
| l_extendedprice | DECIMAL(15,2) | Yes | true | NULL | |
| l_discount | DECIMAL(15,2) | Yes | true | NULL | |
| l_tax | DECIMAL(15,2) | Yes | true | NULL | |
| l_returnflag | TEXT | Yes | true | NULL | |
| l_linestatus | TEXT | Yes | true | NULL | |
| l_commitdate | DATE | Yes | true | NULL | |
| l_receiptdate | DATE | Yes | true | NULL | |
| l_shipinstruct | TEXT | Yes | true | NULL | |
| l_shipmode | TEXT | Yes | true | NULL | |
| l_comment | TEXT | Yes | true | NULL | |
+-----------------+---------------+------+------+---------+-------+
```
Query:
```
mysql> SELECT l_shipdate, l_orderkey, l_partkey FROM lineitem limit 10;
+------------+------------+-----------+
| l_shipdate | l_orderkey | l_partkey |
+------------+------------+-----------+
| 1998-01-21 | 66374304 | 270146 |
| 1997-11-17 | 66374304 | 340557 |
| 1997-06-17 | 66374400 | 6839498 |
| 1997-08-21 | 66374400 | 11436870 |
| 1997-08-07 | 66374400 | 19473325 |
| 1997-06-16 | 66374400 | 8157699 |
| 1998-09-21 | 66374496 | 19892278 |
| 1998-08-07 | 66374496 | 9509408 |
| 1998-10-27 | 66374496 | 4608731 |
| 1998-07-14 | 66374592 | 13555929 |
+------------+------------+-----------+
```
You can also perform associated queries with tables in other data catalogs:
```
mysql> SELECT l.l_shipdate FROM hive.tpch100.lineitem l WHERE l.l_partkey IN (SELECT p_partkey FROM internal.db1.part) LIMIT 10;
+------------+
| l_shipdate |
+------------+
| 1993-02-16 |
| 1995-06-26 |
| 1995-08-19 |
| 1992-07-23 |
| 1998-05-23 |
| 1997-07-12 |
| 1994-03-06 |
| 1996-02-07 |
| 1997-06-01 |
| 1996-08-23 |
+------------+
```
Here we identify a table in a fully qualified way of `catalog.database.table`, such as: `internal.db1.part`.
`catalog` and `database` can be omitted; in that case the catalog and database selected by the most recent SWITCH and USE are used by default.
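For instance, assuming the sample objects above, the following two forms are equivalent once the catalog and database have been switched (a sketch; the table and column names come from the earlier examples):
```sql
-- Fully qualified form, independent of the current catalog and database
SELECT l_shipdate FROM hive.tpch100.lineitem LIMIT 1;

-- Short form, relying on the catalog and database selected by SWITCH and USE
SWITCH hive;
USE tpch100;
SELECT l_shipdate FROM lineitem LIMIT 1;
```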
The table data in the hive catalog can be inserted into an internal table in the internal catalog through the INSERT INTO command, thereby **importing data from the external data source**:
```
mysql> SWITCH internal;
Query OK, 0 rows affected (0.00 sec)
mysql> USE db1;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> INSERT INTO part SELECT * FROM hive.tpch100.part limit 1000;
Query OK, 1000 rows affected (0.28 sec)
{'label':'insert_212f67420c6444d5_9bfc184bf2e7edb8', 'status':'VISIBLE', 'txnId':'4'}
```
### Connect Elasticsearch
> 1. ES 5.x and later versions are supported.
> 2. In 5.x and 6.x, if an index has multiple types, only the first one is used by default.
The following example creates a Catalog connection named es to the specified ES and turns off node discovery.
```sql
-- 1.2.0+ Version
CREATE RESOURCE es_resource PROPERTIES (
"type"="es",
"hosts"="http://127.0.0.1:9200",
"nodes_discovery"="false"
);
CREATE CATALOG es WITH RESOURCE es_resource;
-- 1.2.0 Version
CREATE CATALOG es PROPERTIES (
"type"="es",
"elasticsearch.hosts"="http://127.0.0.1:9200",
"elasticsearch.nodes_discovery"="false"
);
```
Once created, you can view the catalog with the `SHOW CATALOGS` command:
```
mysql> SHOW CATALOGS;
+-----------+-------------+----------+
| CatalogId | CatalogName | Type |
+-----------+-------------+----------+
| 0 | internal | internal |
| 11003 | es | es |
+-----------+-------------+----------+
2 rows in set (0.02 sec)
```
Switch to the es catalog with the `SWITCH` command and view the databases in it (there is only one database, `default_db`, which contains all the indexes):
```
mysql> SWITCH es;
Query OK, 0 rows affected (0.00 sec)
mysql> SHOW DATABASES;
+------------+
| Database |
+------------+
| default_db |
+------------+
mysql> show tables;
+----------------------+
| Tables_in_default_db |
+----------------------+
| test |
| test2 |
+----------------------+
```
Query
```
mysql> select * from test;
+------------+-------------+--------+-------+
| test4 | test2 | test3 | test1 |
+------------+-------------+--------+-------+
| 2022-08-08 | hello world | 2.415 | test2 |
| 2022-08-08 | hello world | 3.1415 | test1 |
+------------+-------------+--------+-------+
```
#### Parameters:
Parameter | Description
---|---
**elasticsearch.hosts** | ES connection address; one or more nodes, or a load balancer address, are accepted
**elasticsearch.username** | username for ES
**elasticsearch.password** | password for the user
**elasticsearch.doc_value_scan** | whether to use ES/Lucene columnar storage (doc values) to fetch the values of query fields; the default is true
**elasticsearch.keyword_sniff** | whether to sniff the keyword sub-field of text fields in ES and query by keyword (the default is true; if false, the query matches the tokenized content)
**elasticsearch.nodes_discovery** | whether to enable ES node discovery; the default is true. In a network-isolated environment, set this to false so that only the specified nodes are connected
**elasticsearch.ssl** | whether the ES cluster enables HTTPS access mode; the current FE/BE implementation trusts all certificates
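As a hedged sketch combining several of the parameters above in the `1.2.0` properties form (the host and credentials are placeholders):
```sql
CREATE CATALOG es PROPERTIES (
    "type"="es",
    "elasticsearch.hosts"="http://127.0.0.1:9200",
    "elasticsearch.username"="elastic",
    "elasticsearch.password"="password",
    "elasticsearch.doc_value_scan"="true",
    "elasticsearch.keyword_sniff"="true"
);
```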
### Connect JDBC
The following example creates a Catalog connection named jdbc. This jdbc Catalog connects to the database specified by the 'jdbc.jdbc_url' parameter (`jdbc:mysql` in the example, so it connects to a MySQL database). Currently, the MySQL, PostgreSQL, ClickHouse and Oracle database types are supported.
**mysql catalog example**
```sql
-- 1.2.0+ Version
CREATE RESOURCE mysql_resource PROPERTIES (
"type"="jdbc",
"user"="root",
"password"="123456",
"jdbc_url" = "jdbc:mysql://127.0.0.1:3306/demo",
"driver_url" = "file:///path/to/mysql-connector-java-5.1.47.jar",
"driver_class" = "com.mysql.jdbc.Driver"
)
CREATE CATALOG jdbc WITH RESOURCE mysql_resource;
-- 1.2.0 Version
CREATE CATALOG jdbc PROPERTIES (
"type"="jdbc",
"jdbc.jdbc_url" = "jdbc:mysql://127.0.0.1:3306/demo",
...
)
```
**postgresql catalog example**
```sql
-- 1.2.0+ Version
CREATE RESOURCE pg_resource PROPERTIES (
"type"="jdbc",
"user"="postgres",
"password"="123456",
"jdbc_url" = "jdbc:postgresql://127.0.0.1:5449/demo",
"driver_url" = "file:///path/to/postgresql-42.5.1.jar",
"driver_class" = "org.postgresql.Driver"
);
CREATE CATALOG jdbc WITH RESOURCE pg_resource;
-- 1.2.0 Version
CREATE CATALOG jdbc PROPERTIES (
"type"="jdbc",
"jdbc.jdbc_url" = "jdbc:postgresql://127.0.0.1:5449/demo",
...
)
```
**CLICKHOUSE catalog example**
```sql
-- The first way
CREATE RESOURCE clickhouse_resource PROPERTIES (
"type"="jdbc",
"user"="default",
"password"="123456",
"jdbc_url" = "jdbc:clickhouse://127.0.0.1:8123/demo",
"driver_url" = "file:///path/to/clickhouse-jdbc-0.3.2-patch11-all.jar",
"driver_class" = "com.clickhouse.jdbc.ClickHouseDriver"
)
CREATE CATALOG jdbc WITH RESOURCE clickhouse_resource;
-- The second way, note: keys have 'jdbc' prefix in front.
CREATE CATALOG jdbc PROPERTIES (
"type"="jdbc",
"jdbc.jdbc_url" = "jdbc:clickhouse://127.0.0.1:8123/demo",
...
)
```
**oracle catalog example**
```sql
-- The first way
CREATE RESOURCE oracle_resource PROPERTIES (
"type"="jdbc",
"user"="doris",
"password"="123456",
"jdbc_url" = "jdbc:oracle:thin:@127.0.0.1:1521:helowin",
"driver_url" = "file:/path/to/ojdbc6.jar",
"driver_class" = "oracle.jdbc.driver.OracleDriver"
);
CREATE CATALOG jdbc WITH RESOURCE oracle_resource;
-- The second way, note: keys have 'jdbc' prefix in front.
CREATE CATALOG jdbc PROPERTIES (
"type"="jdbc",
"jdbc.user"="doris",
"jdbc.password"="123456",
"jdbc.jdbc_url" = "jdbc:oracle:thin:@127.0.0.1:1521:helowin",
"jdbc.driver_url" = "file:/path/to/ojdbc6.jar",
"jdbc.driver_class" = "oracle.jdbc.driver.OracleDriver"
);
```
The `jdbc.driver_url` can also point to a remote jar package:
```sql
CREATE RESOURCE mysql_resource PROPERTIES (
"type"="jdbc",
"user"="root",
"password"="123456",
"jdbc_url" = "jdbc:mysql://127.0.0.1:13396/demo",
"driver_url" = "https://path/jdbc_driver/mysql-connector-java-8.0.25.jar",
"driver_class" = "com.mysql.cj.jdbc.Driver"
)
CREATE CATALOG jdbc WITH RESOURCE mysql_resource;
```
If `jdbc.driver_url` is a remote jar package served over HTTP, Doris handles it as follows:
1. When only metadata is queried (e.g. `show catalogs/databases/tables`), FE uses this URL to load the driver class.
2. When tables in the JDBC Catalog are queried (e.g. `select ... from`), BE downloads the jar package to the local directory `be/lib/udf/` and loads it from that local path.
Once created, you can view the catalog with the `SHOW CATALOGS` command:
```sql
MySQL [(none)]> show catalogs;
+-----------+-------------+----------+
| CatalogId | CatalogName | Type |
+-----------+-------------+----------+
| 0 | internal | internal |
| 10480 | jdbc | jdbc |
+-----------+-------------+----------+
2 rows in set (0.02 sec)
```
> Note:
> 1. In the `postgresql catalog`, a Doris database corresponds to a schema in the PostgreSQL catalog specified in the `jdbc.jdbc_url` parameter, and the tables under this database correspond to the tables under that PostgreSQL schema.
> 2. In the `oracle catalog`, a Doris database corresponds to a user in Oracle, and the tables under this database correspond to the tables owned by that Oracle user.
Switch to the jdbc catalog with the `SWITCH` command and view the databases in it:
```sql
MySQL [(none)]> switch jdbc;
Query OK, 0 rows affected (0.02 sec)
MySQL [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| __db1 |
| _db1 |
| db1 |
| demo |
| information_schema |
| mysql |
| mysql_db_test |
| performance_schema |
| sys |
+--------------------+
9 rows in set (0.67 sec)
```
Show the tables under the `db1` database and query one table:
```sql
MySQL [demo]> use db1;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
MySQL [db1]> show tables;
+---------------+
| Tables_in_db1 |
+---------------+
| tbl1 |
+---------------+
1 row in set (0.00 sec)
MySQL [db1]> desc tbl1;
+-------+------+------+------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+------+---------+-------+
| k1 | INT | Yes | true | NULL | |
+-------+------+------+------+---------+-------+
1 row in set (0.00 sec)
MySQL [db1]> select * from tbl1;
+------+
| k1 |
+------+
| 1 |
| 2 |
| 3 |
| 4 |
+------+
4 rows in set (0.19 sec)
```
#### Parameters:
Parameter | Description
---|---
**jdbc.user** | The username used to connect to the database
**jdbc.password** | The password used to connect to the database
**jdbc.jdbc_url** | The identifier used to connect to the specified database
**jdbc.driver_url** | The url of JDBC driver package
**jdbc.driver_class** | The class of JDBC driver
### Connect Aliyun Data Lake Formation
> [What is Data Lake Formation](https://www.alibabacloud.com/product/datalake-formation)
1. Create hive-site.xml
Create hive-site.xml and put it in `fe/conf`.
```
<?xml version="1.0"?>
<configuration>
<!--Set to use dlf client-->
<property>
<name>hive.metastore.type</name>
<value>dlf</value>
</property>
<property>
<name>dlf.catalog.endpoint</name>
<value>dlf-vpc.cn-beijing.aliyuncs.com</value>
</property>
<property>
<name>dlf.catalog.region</name>
<value>cn-beijing</value>
</property>
<property>
<name>dlf.catalog.proxyMode</name>
<value>DLF_ONLY</value>
</property>
<property>
<name>dlf.catalog.uid</name>
<value>20000000000000000</value>
</property>
<property>
<name>dlf.catalog.accessKeyId</name>
<value>XXXXXXXXXXXXXXX</value>
</property>
<property>
<name>dlf.catalog.accessKeySecret</name>
<value>XXXXXXXXXXXXXXXXX</value>
</property>
</configuration>
```
* `dlf.catalog.endpoint`: DLF Endpoint. See: [Regions and endpoints of DLF](https://www.alibabacloud.com/help/en/data-lake-formation/latest/regions-and-endpoints)
* `dlf.catalog.region`: DLF Region. See: [Regions and endpoints of DLF](https://www.alibabacloud.com/help/en/data-lake-formation/latest/regions-and-endpoints)
* `dlf.catalog.uid`: Alibaba Cloud account ID, i.e. the "cloud account ID" shown in the personal information panel in the upper right corner of the Alibaba Cloud console.
* `dlf.catalog.accessKeyId`: AccessKey. See: [Alibaba Cloud Console](https://ram.console.aliyun.com/manage/ak).
* `dlf.catalog.accessKeySecret`: SecretKey. See: [Alibaba Cloud Console](https://ram.console.aliyun.com/manage/ak).
Other configuration items are fixed values and do not need to be changed.
2. Restart FE and create a catalog with the `CREATE CATALOG` statement.
The HMS resource will read and parse fe/conf/hive-site.xml.
```sql
-- 1.2.0+ Version
CREATE RESOURCE dlf_resource PROPERTIES (
"type"="hms",
"hive.metastore.uris" = "thrift://127.0.0.1:9083"
)
CREATE CATALOG dlf WITH RESOURCE dlf_resource;
-- 1.2.0 Version
CREATE CATALOG dlf PROPERTIES (
"type"="hms",
"hive.metastore.uris" = "thrift://127.0.0.1:9083"
)
```
Here `type` must be `hms`. The value of `hive.metastore.uris` is not actually used, but it still needs to be filled in using the standard Hive MetaStore thrift URI format.
After that, the metadata under DLF can be accessed like a normal Hive MetaStore.
## Column Type Mapping
After the user creates the catalog, Doris automatically synchronizes the databases and tables of the data catalog. For different data catalogs and table formats, Doris applies the following type mappings.
<version since="dev">
For types that cannot currently be mapped to Doris column types, such as map and struct, Doris maps the column to the UNSUPPORTED type. Queries involving UNSUPPORTED columns behave as follows:
Suppose the table schema after synchronization is:
```
k1 INT,
k2 INT,
k3 UNSUPPORTED,
k4 INT
```
```
select * from table; // Error: Unsupported type 'UNSUPPORTED_TYPE' in '`k3`
select * except(k3) from table; // Query OK.
select k1, k3 from table; // Error: Unsupported type 'UNSUPPORTED_TYPE' in '`k3`
select k1, k4 from table; // Query OK.
```
</version>
### Hive MetaStore
For Hive/Iceberg/Hudi
| HMS Type | Doris Type | Comment |
|---|---|---|
| boolean| boolean | |
| tinyint|tinyint | |
| smallint| smallint| |
| int| int | |
| bigint| bigint | |
| date| date| |
| timestamp| datetime| |
| float| float| |
| double| double| |
| `array<type>` | `array<type>`| Supports nested arrays, such as `array<array<int>>` |
| char| char | |
| varchar| varchar| |
| decimal| decimal | |
| other | string | The rest of the unsupported types are uniformly processed as string |
### Elasticsearch
| ES Type | Doris Type | Comment |
|---|---|---|
| boolean | boolean | |
| byte| tinyint| |
| short| smallint| |
| integer| int| |
| long| bigint| |
| unsigned_long| largeint | |
| float| float| |
| half_float| float| |
| double | double | |
| scaled_float| double | |
| date | date | |
| keyword | string | |
| text |string | |
| ip |string | |
| nested |string | |
| object |string | |
| array | | Coming soon |
|other| string ||
### JDBC
#### MYSQL
MYSQL Type | Doris Type | Comment |
|---|---|---|
| BOOLEAN | BOOLEAN | |
| TINYINT | TINYINT | |
| SMALLINT | SMALLINT | |
| MEDIUMINT | INT | |
| INT | INT | |
| BIGINT | BIGINT | |
| UNSIGNED TINYINT | SMALLINT | DORIS does not have the UNSIGNED data type, so expand the type|
| UNSIGNED MEDIUMINT | INT | DORIS does not have the UNSIGNED data type, so expand the type|
| UNSIGNED INT | BIGINT | DORIS does not have the UNSIGNED data type, so expand the type|
| UNSIGNED BIGINT | STRING | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| DECIMAL | DECIMAL | |
| DATE | DATE | |
| TIMESTAMP | DATETIME | |
| DATETIME | DATETIME | |
| YEAR | SMALLINT | |
| TIME | STRING | |
| CHAR | CHAR | |
| VARCHAR | STRING | |
| TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT, TINYBLOB, BLOB, MEDIUMBLOB, LONGBLOB, TINYSTRING, STRING, MEDIUMSTRING, LONGSTRING, BINARY, VARBINARY, JSON, SET, BIT | STRING | |
#### POSTGRESQL
POSTGRESQL Type | Doris Type | Comment |
|---|---|---|
| boolean | BOOLEAN | |
| smallint/int2 | SMALLINT | |
| integer/int4 | INT | |
| bigint/int8 | BIGINT | |
| decimal/numeric | DECIMAL | |
| real/float4 | FLOAT | |
| double precision | DOUBLE | |
| smallserial | SMALLINT | |
| serial | INT | |
| bigserial | BIGINT | |
| char | CHAR | |
| varchar/text | STRING | |
| timestamp | DATETIME | |
| date | DATE | |
| time | STRING | |
| interval | STRING | |
| point/line/lseg/box/path/polygon/circle | STRING | |
| cidr/inet/macaddr | STRING | |
| bit/bit(n)/bit varying(n) | STRING | `bit` type corresponds to the `STRING` type of DORIS. The data read is `true/false`, not `1/0` |
| uuid/jsonb | STRING | |
#### CLICKHOUSE
| ClickHouse Type | Doris Type | Comment |
|------------------------|------------|--------------------------------------------------------------------------------------------------------------------------------------|
| Bool | BOOLEAN | |
| String | STRING | |
| Date/Date32 | DATE | |
| DateTime/DateTime64 | DATETIME | Data that exceeds Doris's maximum DateTime accuracy is truncated |
| Float32 | FLOAT | |
| Float64 | DOUBLE | |
| Int8 | TINYINT | |
| Int16/UInt8 | SMALLINT | DORIS does not have the UNSIGNED data type, so expand the type |
| Int32/UInt16 | INT | DORIS does not have the UNSIGNED data type, so expand the type |
| Int64/Uint32 | BIGINT | DORIS does not have the UNSIGNED data type, so expand the type |
| Int128/UInt64 | LARGEINT | DORIS does not have the UNSIGNED data type, so expand the type |
| Int256/UInt128/UInt256 | STRING | Doris does not have a data type of this magnitude and is processed with STRING |
| DECIMAL | DECIMAL | Data that exceeds Doris's maximum Decimal precision is mapped to a STRING |
| Enum/IPv4/IPv6/UUID | STRING | In the display of IPv4 and IPv6, an extra `/` is displayed before the data, which needs to be processed by the `split_part` function |
#### ORACLE
ORACLE Type | Doris Type | Comment |
|---|---|---|
| number(p) / number(p,0) | | Doris will choose the corresponding doris type based on the p: p<3 -> TINYINT; p<5 -> SMALLINT; p<10 -> INT; p<19 -> BIGINT; p>19 -> LARGEINT |
| number(p,s) | DECIMAL | |
| decimal | DECIMAL | |
| float/real | DOUBLE | |
| DATE | DATETIME | |
| CHAR/NCHAR | CHAR | |
| VARCHAR2/NVARCHAR2 | VARCHAR | |
| LONG/ RAW/ LONG RAW/ INTERVAL | TEXT | |
## Privilege Management
Using Doris to access the databases and tables in the External Catalog is not controlled by the permissions of the external data source itself, but relies on Doris's own permission access management.
Doris's privilege management has been extended to the Catalog level. For details, please refer to the [privilege management](../../admin-manual/privilege-ldap/user-privilege.md) document.
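As a hedged sketch, and assuming the three-part `catalog.database.table` object form accepted by `GRANT`, read-only access to everything under a catalog might be granted like this (`hive` and `analyst` are placeholder names):
```sql
GRANT SELECT_PRIV ON hive.*.* TO 'analyst'@'%';
```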
## Metadata Refresh
### Manual Refresh
Metadata changes of external data sources, such as creating, dropping tables, adding or dropping columns, etc., will not be synchronized to Doris.
Users need to manually refresh metadata via the [REFRESH CATALOG](../../sql-manual/sql-reference/Utility-Statements/REFRESH.md) command.
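For example, after tables are added or dropped on the external side, the mapping can be refreshed manually (a minimal sketch; `hive` is the catalog created in the samples above):
```sql
REFRESH CATALOG hive;
```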
### Auto Refresh
#### Hive MetaStore(HMS) catalog
<version since="dev">
The FE node can detect Hive table metadata changes by periodically reading HMS notification events. Currently, the following events are supported:
</version>
1. CREATE DATABASE event: create the database under the corresponding catalog.
2. DROP DATABASE event: drop the database under the corresponding catalog.
3. ALTER DATABASE event: this event mainly changes the database's properties, comment and default storage location. Such changes do not affect Doris queries on the external catalog, so the event is currently ignored.
4. CREATE TABLE event: create the table under the corresponding database.
5. DROP TABLE event: drop the table under the corresponding database and invalidate its cache.
6. ALTER TABLE event: if it is a rename, drop the table with the old name and then create the table with the new name; otherwise, invalidate the table's cache.
7. ADD PARTITION event: add the partition to the cached partition list of the corresponding table.
8. DROP PARTITION event: remove the partition from the cached partition list of the corresponding table and invalidate its cache.
9. ALTER PARTITION event: if it is a rename, drop the partition with the old name and then create the partition with the new name; otherwise, invalidate the partition's cache.
10. When data import changes the underlying files, partitioned tables follow the ALTER PARTITION event logic and non-partitioned tables follow the ALTER TABLE event logic. (Note: if the file system is operated directly, bypassing the HMS, the HMS will not generate the corresponding event and Doris will not be aware of the change.)
The feature is controlled by the following parameters of fe:
1. enable_hms_events_incremental_sync: whether to enable automatic incremental metadata synchronization. The default value is false.
2. hms_events_polling_interval_ms: the interval between reading events, in milliseconds. The default value is 10000.
3. hms_events_batch_size_per_rpc: the maximum number of events read each time. The default value is 500.
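Putting the three parameters together, a minimal `fe.conf` sketch that enables the feature (the two interval values shown are simply the defaults listed above):
```
enable_hms_events_incremental_sync = true
hms_events_polling_interval_ms = 10000
hms_events_batch_size_per_rpc = 500
```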
If you want to use this feature, you also need to add the following configuration to the hive-site.xml of HMS and restart HMS:
```
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
<property>
<name>hive.metastore.dml.events</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.transactional.event.listeners</name>
<value>org.apache.hive.hcatalog.listener.DbNotificationListener</value>
</property>
```
##### Usage suggestions
Whether you are switching a previously created catalog to automatic refresh or creating a new catalog, you only need to set enable_hms_events_incremental_sync to true and restart the FE node. There is no need to manually refresh the metadata before or after restarting.
#### Other catalog
Not yet supported.
## Time Travel
### Iceberg
Each write to the iceberg table produces a new snapshot, whereas the read operation only reads the latest version of the snapshot.
Iceberg tables support the `FOR TIME AS OF` and `FOR VERSION AS OF` clauses to read historical data based on the snapshot creation time or the snapshot ID:
`SELECT * FROM db.tbl FOR TIME AS OF "2022-10-07 17:20:37";`
`SELECT * FROM db.tbl FOR VERSION AS OF 868895038966572;`
To get the snapshot ID and timestamp of an Iceberg table, use the [iceberg_meta](../../sql-manual/sql-functions/table-functions/iceber_meta.md) table-valued function.
## FAQ
### Iceberg
The following configuration solves the `failed to get schema for table xxx in db xxx` and `java.lang.UnsupportedOperationException: Storage schema reading not supported` problems when reading data from `Hive Metastore`.
- Place the Iceberg runtime jar in the Hive lib directory.
- Add the property `metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader` to hive-site.xml (see the snippet below).
Restart `Hive Metastore` after the configuration is complete.
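Expressed in `hive-site.xml` form, the property from the second bullet looks like this:
```
<property>
    <name>metastore.storage.schema.reader.impl</name>
    <value>org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader</value>
</property>
```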
### Kerberos
When connecting to `Hive Metastore` authenticated by `Kerberos` with doris 1.2.0, an exception message containing `GSS initiate failed` appears.
- Please update to version 1.2.2, or recompile FE version 1.2.1 using the latest docker development image.
### HDFS
`java.lang.VerifyError: xxx` error occurred while reading HDFS version 3.x.
- Update hadoop-related dependencies in `fe/pom.xml` to version 2.10.2, and recompile FE.

View File

@ -1,415 +0,0 @@
---
{
"title": "Doris On ODBC",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# ODBC External Table Of Doris
<version deprecated="1.2.0" comment="Please use the JDBC External table">
ODBC external tables of Doris provide access to external tables through the standard database access interface (ODBC). They eliminate tedious data import work and let Doris access all kinds of databases, so their data can be analyzed with Doris' OLAP capabilities.
1. Support various data sources to access Doris
2. Support Doris query with tables in various data sources to perform more complex analysis operations
3. Use insert into to write the query results executed by Doris to the external data source
This document mainly introduces the implementation principle and usage of this ODBC external table.
</version>
## Glossary
### Noun in Doris
* FE: Frontend, the front-end node of Doris. Responsible for metadata management and request access.
* BE: Backend, Doris's back-end node. Responsible for query execution and data storage.
## How To Use
### Create ODBC External Table
Refer to the specific table syntax:[CREATE TABLE](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md)
#### 1. Creating ODBC external table without resource
```
CREATE EXTERNAL TABLE `baseall_oracle` (
`k1` decimal(9, 3) NOT NULL COMMENT "",
`k2` char(10) NOT NULL COMMENT "",
`k3` datetime NOT NULL COMMENT "",
`k5` varchar(20) NOT NULL COMMENT "",
`k6` double NOT NULL COMMENT ""
) ENGINE=ODBC
COMMENT "ODBC"
PROPERTIES (
"host" = "192.168.0.1",
"port" = "8086",
"user" = "test",
"password" = "test",
"database" = "test",
"table" = "baseall",
"driver" = "Oracle 19 ODBC driver",
"type" = "oracle"
);
```
#### 2. Creating ODBC external table by resource (recommended)
```
CREATE EXTERNAL RESOURCE `oracle_odbc`
PROPERTIES (
"type" = "odbc_catalog",
"host" = "192.168.0.1",
"port" = "8086",
"user" = "test",
"password" = "test",
"database" = "test",
"odbc_type" = "oracle",
"driver" = "Oracle 19 ODBC driver"
);
CREATE EXTERNAL TABLE `baseall_oracle` (
`k1` decimal(9, 3) NOT NULL COMMENT "",
`k2` char(10) NOT NULL COMMENT "",
`k3` datetime NOT NULL COMMENT "",
`k5` varchar(20) NOT NULL COMMENT "",
`k6` double NOT NULL COMMENT ""
) ENGINE=ODBC
COMMENT "ODBC"
PROPERTIES (
"odbc_catalog_resource" = "oracle_odbc",
"database" = "test",
"table" = "baseall"
);
```
The following parameters are accepted by ODBC external table:
Parameter | Description
---|---
**host** | IP address of the external database
**port** | Port of the external database
**driver** | The ODBC driver name, which must be consistent with the driver name in be/conf/odbcinst.ini
**type** | The type of the external database; currently Oracle, MySQL, PostgreSQL and SQL Server are supported
**user** | The user name of the database
**password** | The password for the user
**charset** | The charset of the connection (not effective for SQL Server)
Remark:
In addition to adding the above parameters to `PROPERTIES`, you can also add parameters specific to each database's ODBC driver implementation, such as `sslverify` for MySQL, `ClientCharset` for SQL Server, etc.
>Notes:
>
>Because SQL Server 2017 and later enable security authentication by default, you need to add `"TrustServerCertificate"="Yes"` when defining the ODBC resource (see the sketch below).
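A hedged sketch of an ODBC resource for SQL Server with the setting above (the host, port, credentials, `odbc_type` value and driver name are placeholders/assumptions; the driver name must match the entry in be/conf/odbcinst.ini):
```sql
CREATE EXTERNAL RESOURCE `sqlserver_odbc`
PROPERTIES (
    "type" = "odbc_catalog",
    "host" = "192.168.0.1",
    "port" = "1433",
    "user" = "test",
    "password" = "test",
    "database" = "test",
    "odbc_type" = "sqlserver",
    "driver" = "SQL Server ODBC Driver",
    "TrustServerCertificate" = "Yes"
);
```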
##### Installation and configuration of ODBC driver
Each database provides its own ODBC driver. Install the corresponding ODBC driver library according to the official recommendation of each database.
After installing the ODBC driver, find the path of the driver library for the corresponding database and modify the be/conf/odbcinst.ini configuration accordingly:
```
[MySQL Driver]
Description = ODBC for MySQL
Driver = /usr/lib64/libmyodbc8w.so
FileUsage = 1
```
* `[]`: the name in brackets is the driver name. When creating an external table, the driver name must be consistent with the one in this configuration file.
* `Driver=`: set this according to the actual installation path of the driver on the BE. It is essentially the path of a dynamic library, so make sure the library's dependencies are satisfied.
**Remember, all BE nodes are required to have the same driver installed, the same installation path and the same be/conf/odbcinst.ini config.**
### Query usage
After the ODBC external table is created in Doris, it can be used like an ordinary Doris table, except that the Doris data models (rollup, pre-aggregation, materialized views, etc.) cannot be used.
```
select * from oracle_table where k1 > 1000 and k3 ='term' or k4 like '%doris'
```
### Data write
After the ODBC external table is created in Doris, data can be written directly with the `insert into` statement: Doris query results can be written to the ODBC external table, or data can be imported from one ODBC external table to another.
```
insert into oracle_table values(1, "doris");
insert into oracle_table select * from postgre_table;
```
#### Transaction
Doris writes data to the external table in batches. If the import is interrupted, the data already written may need to be rolled back. Therefore, the ODBC external table supports transactions when writing data. Transactions are enabled via the session variable `enable_odbc_transcation`:
```
set enable_odbc_transcation = true;
```
Transactions ensure the atomicity of writes to the ODBC external table, but they reduce write performance, so enable them only as appropriate.
## Database ODBC version correspondence
### Centos Operating System
The versions used are unixODBC 2.3.1, Doris 0.15 and CentOS 7.9, all installed using yum.
#### 1.mysql
| Mysql version | Mysql ODBC version |
| ------------- | ------------------ |
| 8.0.27 | 8.0.27, 8.026 |
| 5.7.36 | 5.3.11, 5.3.13 |
| 5.6.51 | 5.3.11, 5.3.13 |
| 5.5.62 | 5.3.11, 5.3.13 |
#### 2. PostgreSQL
PostgreSQL's yum source rpm package address:
````
https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
````
This contains all versions of PostgreSQL from 9.x to 14.x, including the corresponding ODBC version, which can be installed as needed.
| PostgreSQL Version | PostgreSQL ODBC Version |
| ------------------ | ---------------------------- |
| 12.9 | postgresql12-odbc-13.02.0000 |
| 13.5 | postgresql13-odbc-13.02.0000 |
| 14.1 | postgresql14-odbc-13.02.0000 |
| 9.6.24 | postgresql96-odbc-13.02.0000 |
| 10.6 | postgresql10-odbc-13.02.0000 |
| 11.6 | postgresql11-odbc-13.02.0000 |
#### 3. Oracle
| Oracle version | Oracle ODBC version |
| ------------------------------------------------------------ | ------------------------------------------ |
| Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production | oracle-instantclient19.13-odbc-19.13.0.0.0 |
| Oracle Database 12c Standard Edition Release 12.2.0.1.0 - 64bit Production | oracle-instantclient19.13-odbc-19.13.0.0.0 |
| Oracle Database 18c Enterprise Edition Release 18.0.0.0.0 - Production | oracle-instantclient19.13-odbc-19.13.0.0.0 |
| Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production | oracle-instantclient19.13-odbc-19.13.0.0.0 |
| Oracle Database 21c Enterprise Edition Release 21.0.0.0.0 - Production | oracle-instantclient19.13-odbc-19.13.0.0.0 |
Oracle ODBC driver version download address:
```
https://download.oracle.com/otn_software/linux/instantclient/1913000/oracle-instantclient19.13-sqlplus-19.13.0.0.0-2.x86_64.rpm
https://download.oracle.com/otn_software/linux/instantclient/1913000/oracle-instantclient19.13-devel-19.13.0.0.0-2.x86_64.rpm
https://download.oracle.com/otn_software/linux/instantclient/1913000/oracle-instantclient19.13-odbc-19.13.0.0.0-2.x86_64.rpm
https://download.oracle.com/otn_software/linux/instantclient/1913000/oracle-instantclient19.13-basic-19.13.0.0.0-2.x86_64.rpm
```
#### 4.SQLServer
| SQLServer version | SQLServer ODBC version |
| --------- | -------------- |
| SQL Server 2016 Enterprise | freetds-1.2.21 |
Currently only this version has been tested; other versions will be added after testing.
### Ubuntu Operating System
The versions used are unixODBC 2.3.4, Doris 0.15 and Ubuntu 20.04.
#### 1. Mysql
| Mysql version | Mysql ODBC version |
| ------------- | ------------------ |
| 8.0.27 | 8.0.11, 5.3.13 |
Currently only this version has been tested; other versions will be added after testing.
#### 2. PostgreSQL
| PostgreSQL Version | PostgreSQL ODBC Version |
| ------------------ | ----------------------- |
| 12.9 | psqlodbc-12.02.0000 |
For other versions, as long as you download the ODBC driver that matches the major version of the database, there is no problem. Test results for other versions on the Ubuntu system will continue to be added.
#### 3. Oracle
The Oracle database and ODBC driver correspondence is the same as for the CentOS operating system; the following method is used to install the rpm packages under Ubuntu.
To install rpm packages under Ubuntu, first install alien, a tool that converts rpm packages into deb packages:
````
sudo apt-get install alien
````
Then execute the installation of the above four packages
````
sudo alien -i oracle-instantclient19.13-basic-19.13.0.0.0-2.x86_64.rpm
sudo alien -i oracle-instantclient19.13-devel-19.13.0.0.0-2.x86_64.rpm
sudo alien -i oracle-instantclient19.13-odbc-19.13.0.0.0-2.x86_64.rpm
sudo alien -i oracle-instantclient19.13-sqlplus-19.13.0.0.0-2.x86_64.rpm
````
#### 4.SQLServer
| SQLServer version | SQLServer ODBC version |
| --------- | -------------- |
| SQL Server 2016 Enterprise | freetds-1.2.21 |
Currently only this version has been tested; other versions will be added after testing.
## Data type mapping
Different databases have different data types. The mapping between the types of each database and the data types of Doris is listed below.
### MySQL
| MySQL | Doris | Conversion rules |
| :------: | :----: | :-------------------------------: |
| BOOLEAN | BOOLEAN | |
| CHAR | CHAR | Only UTF8 encoding is supported |
| VARCHAR | VARCHAR | Only UTF8 encoding is supported |
| DATE | DATE | |
| FLOAT | FLOAT | |
| TINYINT | TINYINT | |
| SMALLINT | SMALLINT | |
| INT | INT | |
| BIGINT | BIGINT | |
| DOUBLE | DOUBLE | |
| DATE | DATE | |
| DATETIME | DATETIME | |
| DECIMAL | DECIMAL | |
### PostgreSQL
| PostgreSQL | Doris | Conversion rules |
| :------: | :----: | :-------------------------------: |
| BOOLEAN | BOOLEAN | |
| CHAR | CHAR | Only UTF8 encoding is supported |
| VARCHAR | VARCHAR | Only UTF8 encoding is supported |
| DATE | DATE | |
| REAL | FLOAT | |
| SMALLINT | SMALLINT | |
| INT | INT | |
| BIGINT | BIGINT | |
| DOUBLE | DOUBLE | |
| TIMESTAMP | DATETIME | |
| DECIMAL | DECIMAL | |
### Oracle
| Oracle | Doris | Conversion rules |
| :------: | :----: | :-------------------------------: |
| not supported | BOOLEAN | Oracle can replace BOOLEAN with NUMBER(1) |
| CHAR | CHAR | |
| VARCHAR | VARCHAR | |
| DATE | DATE | |
| FLOAT | FLOAT | |
| not supported | TINYINT | Oracle can use NUMBER instead |
| SMALLINT | SMALLINT | |
| INT | INT | |
| not supported | BIGINT | Oracle can use NUMBER instead |
| not supported | DOUBLE | Oracle can use NUMBER instead |
| DATE | DATE | |
| DATETIME | DATETIME | |
| NUMBER | DECIMAL | |
### SQLServer
| SQLServer | Doris | Conversion rules |
| :------: | :----: | :-------------------------------: |
| BOOLEAN | BOOLEAN | |
| CHAR | CHAR | Only UTF8 encoding is supported |
| VARCHAR | VARCHAR | Only UTF8 encoding is supported |
| DATE | DATE | |
| REAL | FLOAT | |
| TINYINT | TINYINT | |
| SMALLINT | SMALLINT | |
| INT | INT | |
| BIGINT | BIGINT | |
| FLOAT | DOUBLE | |
| DATETIME/DATETIME2 | DATETIME | |
| DECIMAL/NUMERIC | DECIMAL | |
## Best Practices
Synchronizing small amounts of data
For example, if a table in MySQL has 1 million rows and you want to synchronize it to Doris, you can map the data with an ODBC external table and then use [insert into](../../data-operate/import/import-way/insert-into-manual.md) to write it into Doris. If you need to synchronize large batches of data, the [insert into](../../data-operate/import/import-way/insert-into-manual.md) statement can be run in batches (deprecated).
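A minimal sketch of this pattern (`mysql_odbc_table` is a previously created ODBC external table and `doris_table` is an internal Doris table; both names are placeholders):
```sql
-- Pull the rows of the ODBC external table into an internal Doris table
INSERT INTO doris_table SELECT * FROM mysql_odbc_table;
```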
## Q&A
1. Relationship with the original external table of MySQL?
With ODBC external tables available, the original MySQL external table will be gradually deprecated. If you have not used MySQL external tables before, it is recommended to access newly added MySQL tables through ODBC external tables directly.
2. Besides MySQL, Oracle, SQLServer, PostgreSQL, can doris support more databases?
Currently, Doris only adapts to MySQL, Oracle, SQLServer, PostgreSQL. The adaptation of other databases is under planning. In principle, any database that supports ODBC access can be accessed through the ODBC external table. If you need to access other databases, you are welcome to modify the code and contribute to Doris.
3. When is it appropriate to use ODBC external tables?
Generally, ODBC external tables are suitable when the amount of external data is small, e.g. less than 1 million rows. Since external tables cannot benefit from the Doris storage engine and introduce additional network overhead, decide whether to query through external tables or import the data into Doris based on the actual query latency requirements.
4. Garbled code in Oracle access?
Add the following parameter to the BE startup script: `export NLS_LANG=AMERICAN_AMERICA.AL32UTF8`, then restart all BEs.
5. ANSI Driver or Unicode Driver?
Currently, ODBC supports both ANSI and Unicode driver forms, while Doris only supports Unicode driver. If you force the use of ANSI driver, the query results may be wrong.
6. Report Errors: `driver connect Err: 01000 [unixODBC][Driver Manager]Can't open lib 'Xxx' : file not found (0)`
The driver for the corresponding database is not installed on every BE, the correct path is not configured in be/conf/odbcinst.ini, or the driver name used when creating the table differs from the one in be/conf/odbcinst.ini.
7. Report Errors: `Fail to convert odbc value 'PALO ' TO INT on column:'A'`
Type conversion error: the mapped type of column `A` differs from the actual column type and needs to be modified.
8. BE crash occurs when using old MySQL table and ODBC external driver at the same time
This is a compatibility problem between the MySQL database ODBC driver and the MySQL lib that Doris already depends on. The recommended solutions are as follows:
* Method 1: replace the old MySQL external table with an ODBC external table, and recompile BE with the **WITH_MySQL** option disabled.
* Method 2: do not use the latest 8.x MySQL ODBC driver; use the 5.x MySQL ODBC driver instead.
9. Push down the filtering condition
ODBC external tables currently support predicate pushdown. MySQL external tables support pushdown of all filter conditions. For other databases, functions differ from Doris and pushing them down would cause the query to fail, so function calls are not pushed down except for MySQL. Whether Doris pushes down a given filter condition can be confirmed with the `explain` statement.
10. Report Errors: `driver connect Err: xxx`
The connection to the database failed. The `Err:` part describes the specific connection error. This is usually a configuration problem; check whether the IP address, port, or username/password are correct.
11. Garbled characters appear when reading or writing emoji in a MySQL ODBC external table
The default charset used by Doris when connecting to ODBC tables is utf8. Since the default utf8 in MySQL is utf8mb3, it cannot represent emoji, which require 4-byte encoding. Set `charset`=`utf8mb4` when creating the ODBC MySQL table, and emoji can then be read and written normally 😀 (see the sketch after this list).
12. Setting the charset for a SQL Server ODBC external table
Setting the `charset` parameter has no effect for a SQL Server ODBC table; set the `ClientCharset` parameter (for FreeTDS) instead to read and write data correctly, for example: "ClientCharset" = "UTF-8" (see the sketch after this list).
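As a sketch for the last two items (the table and column definitions are placeholders; the driver name must match be/conf/odbcinst.ini, and the `type` value is assumed to be `mysql`):
```sql
-- Item 11: a MySQL ODBC external table able to read and write 4-byte emoji
CREATE EXTERNAL TABLE `emoji_odbc_table` (
    `k1` varchar(20) NOT NULL COMMENT ""
) ENGINE=ODBC
COMMENT "ODBC"
PROPERTIES (
    "host" = "192.168.0.1",
    "port" = "3306",
    "user" = "test",
    "password" = "test",
    "database" = "test",
    "table" = "emoji_tbl",
    "driver" = "MySQL Driver",
    "type" = "mysql",
    "charset" = "utf8mb4"
);

-- Item 12: for a SQL Server ODBC table, use "ClientCharset" = "UTF-8" instead of "charset".
```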

View File

@ -0,0 +1,34 @@
---
{
"title": "Elasticsearch External Table",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Elasticsearch External Table
<version deprecated="1.2.2">
Please use [ES Catalog](../multi-catalog/es) to access ES.
</version>

View File

@ -0,0 +1,31 @@
---
{
"title": "File Analysis",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# File Analysis
TODO: translate

View File

@ -0,0 +1,34 @@
---
{
"title": "Hive External Table",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Hive External Table
<version deprecated="1.2.0">
Please use [Hive Catalog](../multi-catalog/hive) to access Hive.
</version>

View File

@ -0,0 +1,34 @@
---
{
"title": "JDBC External Table",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# JDBC External Table
<version deprecated="1.2.2">
Please use [JDBC Catalog](../multi-catalog/jdbc) to access JDBC data sources.
</version>

View File

@ -0,0 +1,34 @@
---
{
"title": "ODBC External Table",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# ODBC External Table
<version deprecated="1.2.0">
Please use [JDBC Catalog](../multi-catalog/jdbc) to access external tables.
</version>

View File

@ -0,0 +1,31 @@
---
{
"title": "Aliyun DLF",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Aliyun DLF
TODO: translate

View File

@ -0,0 +1,29 @@
---
{
"title": "Elasticsearch",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Elasticsearch
TODO: translate

View File

@ -0,0 +1,30 @@
---
{
"title": "FAQ",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# FAQ
TODO: translate

View File

@ -0,0 +1,29 @@
---
{
"title": "Hive",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Hive
TODO: translate

View File

@ -0,0 +1,30 @@
---
{
"title": "Hudi",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Hudi
TODO: translate

View File

@ -0,0 +1,30 @@
---
{
"title": "Iceberg",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Iceberg
TODO: translate

View File

@ -0,0 +1,30 @@
---
{
"title": "JDBC",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# JDBC
TODO: translate

View File

@ -0,0 +1,30 @@
---
{
"title": "Multi Catalog",
"language": "en"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Multi Catalog
TODO: translate

View File

@ -179,21 +179,39 @@
},
{
"type": "category",
"label": "Ecosystem",
"label": "Lakehouse",
"items": [
{
"type": "category",
"label": "Expansion table",
"label": "Multi Catalog",
"items": [
"ecosystem/external-table/multi-catalog",
"ecosystem/external-table/doris-on-es",
"ecosystem/external-table/hudi-external-table",
"ecosystem/external-table/iceberg-of-doris",
"ecosystem/external-table/odbc-of-doris",
"ecosystem/external-table/jdbc-of-doris",
"ecosystem/external-table/hive-of-doris"
"lakehouse/multi-catalog/multi-catalog",
"lakehouse/multi-catalog/hive",
"lakehouse/multi-catalog/iceberg",
"lakehouse/multi-catalog/hudi",
"lakehouse/multi-catalog/es",
"lakehouse/multi-catalog/jdbc",
"lakehouse/multi-catalog/dlf",
"lakehouse/multi-catalog/faq"
]
},
{
"type": "category",
"label": "External Table",
"items": [
"lakehouse/external-table/file",
"lakehouse/external-table/es",
"lakehouse/external-table/jdbc",
"lakehouse/external-table/odbc",
"lakehouse/external-table/hive"
]
}
]
},
{
"type": "category",
"label": "Ecosystem",
"items": [
"ecosystem/spark-doris-connector",
"ecosystem/flink-doris-connector",
"ecosystem/datax",
@ -202,6 +220,7 @@
"ecosystem/plugin-development-manual",
"ecosystem/audit-plugin",
"ecosystem/cloudcanal",
"ecosystem/hive-bitmap-udf",
{
"type": "category",
"label": "Doris Manager",

View File

@ -107,7 +107,7 @@ Spark load 任务的执行主要分为以下5个阶段。
## Hive Bitmap UDF
Spark 支持将 hive 生成的 bitmap 数据直接导入到 Doris。详见 [hive-bitmap-udf 文档](../../../ecosystem/external-table/hive-bitmap-udf)
Spark 支持将 hive 生成的 bitmap 数据直接导入到 Doris。详见 [hive-bitmap-udf 文档](../../../ecosystem/hive-bitmap-udf)
## 基本操作
@ -559,7 +559,7 @@ WITH RESOURCE 'spark0'
**hive binary(bitmap)类型列的导入**
适用于 doris 表聚合列的数据类型为 bitmap 类型,且数据源 hive 表中对应列的数据类型为 binary(通过 FE 中 spark-dpp 中的 `org.apache.doris.load.loadv2.dpp.BitmapValue` 类序列化)类型。 无需构建全局字典,在 load 命令中指定相应字段即可,格式为:`doris 字段名称= binary_bitmap( hive 表字段名称)` 同样,目前只有在上游数据源为hive表时才支持 binary( bitmap )类型的数据导入hive bitmap使用可参考 [hive-bitmap-udf](../../../ecosystem/external-table/hive-bitmap-udf) 。
适用于 doris 表聚合列的数据类型为 bitmap 类型,且数据源 hive 表中对应列的数据类型为 binary(通过 FE 中 spark-dpp 中的 `org.apache.doris.load.loadv2.dpp.BitmapValue` 类序列化)类型。 无需构建全局字典,在 load 命令中指定相应字段即可,格式为:`doris 字段名称= binary_bitmap( hive 表字段名称)` 同样,目前只有在上游数据源为hive表时才支持 binary( bitmap )类型的数据导入hive bitmap使用可参考 [hive-bitmap-udf](../../../ecosystem/hive-bitmap-udf) 。
### 查看导入

View File

@ -1,138 +0,0 @@
---
{
"title": "Doris Hudi external table",
"language": "zh-CN"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Hudi External Table of Doris
<version deprecated="1.2.0" comment="请使用 Multi-Catalog 功能访问 Hudi">
Hudi External Table of Doris 提供了 Doris 直接访问 Hudi 外部表的能力,外部表省去了繁琐的数据导入工作,并借助 Doris 本身的 OLAP 的能力来解决 Hudi 表的数据分析问题:
1. 支持 Hudi 数据源接入Doris
2. 支持 Doris 与 Hive数据源Hudi中的表联合查询,进行更加复杂的分析操作
本文档主要介绍该功能的使用方式和注意事项等。
</version>
## 名词解释
### Doris 相关
* FE:Frontend,Doris 的前端节点,负责元数据管理和请求接入
* BE:Backend,Doris 的后端节点,负责查询执行和数据存储
## 使用方法
### Doris 中创建 Hudi 的外表
可以通过以下两种方式在 Doris 中创建 Hudi 外表。建外表时无需声明表的列定义,Doris 可以在查询时从HiveMetaStore中获取列信息。
1. 创建一个单独的外表,用于挂载 Hudi 表。
具体相关语法,可以通过 [CREATE TABLE](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md) 查看。
```sql
-- 语法
CREATE [EXTERNAL] TABLE table_name
[(column_definition1[, column_definition2, ...])]
ENGINE = HUDI
[COMMENT "comment"]
PROPERTIES (
"hudi.database" = "hudi_db_in_hive_metastore",
"hudi.table" = "hudi_table_in_hive_metastore",
"hudi.hive.metastore.uris" = "thrift://127.0.0.1:9083"
);
-- 例子:挂载 HiveMetaStore 中 hudi_db_in_hive_metastore 下的 hudi_table_in_hive_metastore,挂载时不指定schema。
CREATE TABLE `t_hudi`
ENGINE = HUDI
PROPERTIES (
"hudi.database" = "hudi_db_in_hive_metastore",
"hudi.table" = "hudi_table_in_hive_metastore",
"hudi.hive.metastore.uris" = "thrift://127.0.0.1:9083"
);
-- 例子:挂载时指定schema
CREATE TABLE `t_hudi` (
`id` int NOT NULL COMMENT "id number",
`name` varchar(10) NOT NULL COMMENT "user name"
) ENGINE = HUDI
PROPERTIES (
"hudi.database" = "hudi_db_in_hive_metastore",
"hudi.table" = "hudi_table_in_hive_metastore",
"hudi.hive.metastore.uris" = "thrift://127.0.0.1:9083"
);
```
#### 参数说明:
- 外表列
- 可以不指定列名,这时查询时会从HiveMetaStore中获取列信息,推荐这种建表方式
- 指定列名时指定的列名要在 Hudi 表中存在
- ENGINE 需要指定为 HUDI
- PROPERTIES 属性:
- `hudi.hive.metastore.uris`:Hive Metastore 服务地址
- `hudi.database`:挂载 Hudi 对应的数据库名
- `hudi.table`:挂载 Hudi 对应的表名
### 展示表结构
展示表结构可以通过 [SHOW CREATE TABLE](../../sql-manual/sql-reference/Show-Statements/SHOW-CREATE-TABLE.md) 查看。
## 类型匹配
支持的 Hudi 列类型与 Doris 对应关系如下表:
| Hudi | Doris | 描述 |
| :------: | :----: | :-------------------------------: |
| BOOLEAN | BOOLEAN | |
| INTEGER | INT | |
| LONG | BIGINT | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| DATE | DATE | |
| TIMESTAMP | DATETIME | Timestamp 转成 Datetime 会损失精度 |
| STRING | STRING | |
| UUID | VARCHAR | 使用 VARCHAR 来代替 |
| DECIMAL | DECIMAL | |
| TIME | - | 不支持 |
| FIXED | - | 不支持 |
| BINARY | - | 不支持 |
| STRUCT | - | 不支持 |
| LIST | - | 不支持 |
| MAP | - | 不支持 |
**注意:**
- 当前默认支持的 Hudi 版本为 0.10.0,未在其他版本进行测试。后续会支持更多版本。
### 查询用法
完成在 Doris 中建立 Hudi 外表后,除了无法使用 Doris 中的数据模型(rollup、预聚合、物化视图等)外,与普通的 Doris OLAP 表并无区别
```sql
select * from t_hudi where k1 > 1000 and k3 ='term' or k4 like '%doris';
```

View File

@ -1,247 +0,0 @@
---
{
"title": "Doris On Iceberg",
"language": "zh-CN"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Iceberg External Table of Doris
<version deprecated="1.2.0" comment="请使用 Multi-Catalog 功能访问 Iceberg">
Iceberg External Table of Doris 提供了 Doris 直接访问 Iceberg 外部表的能力,外部表省去了繁琐的数据导入工作,并借助 Doris 本身的 OLAP 的能力来解决 Iceberg 表的数据分析问题:
1. 支持 Iceberg 数据源接入Doris
2. 支持 Doris 与 Iceberg 数据源中的表联合查询,进行更加复杂的分析操作
本文档主要介绍该功能的使用方式和注意事项等。
</version>
## 名词解释
### Doris 相关
* FE:Frontend,Doris 的前端节点,负责元数据管理和请求接入
* BE:Backend,Doris 的后端节点,负责查询执行和数据存储
## 使用方法
### Doris 中创建 Iceberg 的外表
可以通过以下两种方式在 Doris 中创建 Iceberg 外表。建外表时无需声明表的列定义,Doris 可以根据 Iceberg 中表的列定义自动转换。
1. 创建一个单独的外表,用于挂载 Iceberg 表。
具体相关语法,可以通过 [CREATE TABLE](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-TABLE.md) 查看。
```sql
-- 语法
CREATE [EXTERNAL] TABLE table_name
ENGINE = ICEBERG
[COMMENT "comment"]
PROPERTIES (
"iceberg.database" = "iceberg_db_name",
"iceberg.table" = "icberg_table_name",
"iceberg.hive.metastore.uris" = "thrift://192.168.0.1:9083",
"iceberg.catalog.type" = "HIVE_CATALOG"
);
-- 例子1:挂载 Iceberg 中 iceberg_db 下的 iceberg_table
CREATE TABLE `t_iceberg`
ENGINE = ICEBERG
PROPERTIES (
"iceberg.database" = "iceberg_db",
"iceberg.table" = "iceberg_table",
"iceberg.hive.metastore.uris" = "thrift://192.168.0.1:9083",
"iceberg.catalog.type" = "HIVE_CATALOG"
);
-- 例子2:挂载 Iceberg 中 iceberg_db 下的 iceberg_table,HDFS开启HA
CREATE TABLE `t_iceberg`
ENGINE = ICEBERG
PROPERTIES (
"iceberg.database" = "iceberg_db",
"iceberg.table" = "iceberg_table",
"iceberg.hive.metastore.uris" = "thrift://192.168.0.1:9083",
"iceberg.catalog.type" = "HIVE_CATALOG",
"dfs.nameservices"="HDFS8000463",
"dfs.ha.namenodes.HDFS8000463"="nn2,nn1",
"dfs.namenode.rpc-address.HDFS8000463.nn2"="172.21.16.5:4007",
"dfs.namenode.rpc-address.HDFS8000463.nn1"="172.21.16.26:4007",
"dfs.client.failover.proxy.provider.HDFS8000463"="org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
);
```
2. 创建一个 Iceberg 数据库,用于挂载远端对应 Iceberg 数据库,同时挂载该 database 下的所有 table。
具体相关语法,可以通过 [CREATE DATABASE](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-DATABASE.md) 查看。
```sql
-- 语法
CREATE DATABASE db_name
[COMMENT "comment"]
PROPERTIES (
"iceberg.database" = "iceberg_db_name",
"iceberg.hive.metastore.uris" = "thrift://192.168.0.1:9083",
"iceberg.catalog.type" = "HIVE_CATALOG"
);
-- 例子:挂载 Iceberg 中的 iceberg_db,同时挂载该 db 下的所有 table
CREATE DATABASE `iceberg_test_db`
PROPERTIES (
"iceberg.database" = "iceberg_db",
"iceberg.hive.metastore.uris" = "thrift://192.168.0.1:9083",
"iceberg.catalog.type" = "HIVE_CATALOG"
);
```
`iceberg_test_db` 中的建表进度可以通过 `HELP SHOW TABLE CREATION` 查看。
也可以根据自己的需求明确指定列定义来创建 Iceberg 外表。
1. 创建一个 Iceberg 外表
```sql
-- 语法
CREATE [EXTERNAL] TABLE table_name (
col_name col_type [NULL | NOT NULL] [COMMENT "comment"]
) ENGINE = ICEBERG
[COMMENT "comment"]
PROPERTIES (
"iceberg.database" = "iceberg_db_name",
"iceberg.table" = "icberg_table_name",
"iceberg.hive.metastore.uris" = "thrift://192.168.0.1:9083",
"iceberg.catalog.type" = "HIVE_CATALOG"
);
-- 例子1:挂载 Iceberg 中 iceberg_db 下的 iceberg_table
CREATE TABLE `t_iceberg` (
`id` int NOT NULL COMMENT "id number",
`name` varchar(10) NOT NULL COMMENT "user name"
) ENGINE = ICEBERG
PROPERTIES (
"iceberg.database" = "iceberg_db",
"iceberg.table" = "iceberg_table",
"iceberg.hive.metastore.uris" = "thrift://192.168.0.1:9083",
"iceberg.catalog.type" = "HIVE_CATALOG"
);
-- 例子2:挂载 Iceberg 中 iceberg_db 下的 iceberg_table,HDFS开启HA
CREATE TABLE `t_iceberg` (
`id` int NOT NULL COMMENT "id number",
`name` varchar(10) NOT NULL COMMENT "user name"
) ENGINE = ICEBERG
PROPERTIES (
"iceberg.database" = "iceberg_db",
"iceberg.table" = "iceberg_table",
"iceberg.hive.metastore.uris" = "thrift://192.168.0.1:9083",
"iceberg.catalog.type" = "HIVE_CATALOG",
"dfs.nameservices"="HDFS8000463",
"dfs.ha.namenodes.HDFS8000463"="nn2,nn1",
"dfs.namenode.rpc-address.HDFS8000463.nn2"="172.21.16.5:4007",
"dfs.namenode.rpc-address.HDFS8000463.nn1"="172.21.16.26:4007",
"dfs.client.failover.proxy.provider.HDFS8000463"="org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
);
```
#### 参数说明:
- 外表列
- 列名要与 Iceberg 表中的列一一对应
- 列的顺序需要与 Iceberg 表一致
- ENGINE 需要指定为 ICEBERG
- PROPERTIES 属性:
- `iceberg.hive.metastore.uris`:Hive Metastore 服务地址
- `iceberg.database`:挂载 Iceberg 对应的数据库名
- `iceberg.table`:挂载 Iceberg 对应的表名,挂载 Iceberg database 时无需指定。
- `iceberg.catalog.type`:Iceberg 中使用的 catalog 方式,默认为 `HIVE_CATALOG`,当前仅支持该方式,后续会支持更多的 Iceberg catalog 接入方式。
### 展示表结构
展示表结构可以通过 [SHOW CREATE TABLE](../../sql-manual/sql-reference/Show-Statements/SHOW-CREATE-TABLE.md) 查看。
### 同步挂载
当 Iceberg 表 Schema 发生变更时,可以通过 `REFRESH` 命令手动同步,该命令会将 Doris 中的 Iceberg 外表删除重建,具体帮助可以通过 `HELP REFRESH` 查看。
```sql
-- 同步 Iceberg 表
REFRESH TABLE t_iceberg;
-- 同步 Iceberg 数据库
REFRESH DATABASE iceberg_test_db;
```
## 类型匹配
支持的 Iceberg 列类型与 Doris 对应关系如下表:
| Iceberg | Doris | 描述 |
| :------: | :----: | :-------------------------------: |
| BOOLEAN | BOOLEAN | |
| INTEGER | INT | |
| LONG | BIGINT | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| DATE | DATE | |
| TIMESTAMP | DATETIME | Timestamp 转成 Datetime 会损失精度 |
| STRING | STRING | |
| UUID | VARCHAR | 使用 VARCHAR 来代替 |
| DECIMAL | DECIMAL | |
| TIME | - | 不支持 |
| FIXED | - | 不支持 |
| BINARY | - | 不支持 |
| STRUCT | - | 不支持 |
| LIST | - | 不支持 |
| MAP | - | 不支持 |
**注意:**
- Iceberg 表 Schema 变更**不会自动同步**,需要在 Doris 中通过 `REFRESH` 命令同步 Iceberg 外表或数据库。
- 当前默认支持的 Iceberg 版本为 0.12.0、0.13.2,未在其他版本进行测试。后续会支持更多版本。
### 查询用法
完成在 Doris 中建立 Iceberg 外表后,除了无法使用 Doris 中的数据模型(rollup、预聚合、物化视图等)外,与普通的 Doris OLAP 表并无区别
```sql
select * from t_iceberg where k1 > 1000 and k3 ='term' or k4 like '%doris';
```
## 相关系统配置
### FE配置
下面几个配置属于 Iceberg 外表系统级别的配置,可以通过修改 `fe.conf` 来配置,也可以通过 `ADMIN SET CONFIG` 来配置。
- `iceberg_table_creation_strict_mode`
创建 Iceberg 表默认开启 strict mode。
strict mode 是指对 Iceberg 表的列类型进行严格过滤,如果有 Doris 目前不支持的数据类型,则创建外表失败。
- `iceberg_table_creation_interval_second`
自动创建 Iceberg 表的后台任务执行间隔,默认为 10s。
- `max_iceberg_table_creation_record_size`
Iceberg 表创建记录保留的最大值,默认为 2000. 仅针对创建 Iceberg 数据库记录。
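上述配置项也可以通过如下方式在运行时修改(以下仅为示意;若某配置项不支持运行时修改,请在 fe.conf 中设置后重启 FE):
```sql
-- 将自动创建 Iceberg 表的后台任务执行间隔调整为 30 秒
ADMIN SET FRONTEND CONFIG ("iceberg_table_creation_interval_second" = "30");
```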

View File

@ -1,907 +0,0 @@
---
{
"title": "多源数据目录",
"language": "zh-CN"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# 多源数据目录
<version since="1.2.0">
多源数据目录(Multi-Catalog)是 Doris 1.2.0 版本中推出的功能,旨在能够更方便对接外部数据目录,以增强Doris的数据湖分析和联邦数据查询能力。
在之前的 Doris 版本中,用户数据只有两个层级:Database 和 Table。当我们需要连接一个外部数据目录时,我们只能在Database 或 Table 层级进行对接。比如通过 `create external table` 的方式创建一个外部数据目录中的表的映射,或通过 `create external database` 的方式映射一个外部数据目录中的 Database。 如果外部数据目录中的 Database 或 Table 非常多,则需要用户手动进行一一映射,使用体验不佳。
而新的 Multi-Catalog 功能在原有的元数据层级上,新增一层Catalog,构成 Catalog -> Database -> Table 的三层元数据层级。其中,Catalog 可以直接对应到外部数据目录。目前支持的外部数据目录包括:
1. Hive MetaStore:对接一个 Hive MetaStore,从而可以直接访问其中的 Hive、Iceberg、Hudi 等数据。
2. Elasticsearch:对接一个 ES 集群,并直接访问其中的表和分片。
3. JDBC: 对接数据库访问的标准接口(JDBC)来访问各式数据库的数据(当前支持的数据库类型见下文“连接JDBC”一节)。
该功能将作为之前外表连接方式(External Table)的补充和增强,帮助用户进行快速的多数据目录联邦查询。
</version>
## 基础概念
1. Internal Catalog
Doris 原有的 Database 和 Table 都将归属于 Internal Catalog。Internal Catalog 是内置的默认 Catalog,用户不可修改或删除。
2. External Catalog
可以通过 [CREATE CATALOG](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-CATALOG.md) 命令创建一个 External Catalog。创建后,可以通过 [SHOW CATALOGS](../../sql-manual/sql-reference/Show-Statements/SHOW-CATALOGS.md) 命令查看已创建的 Catalog。
3. 切换 Catalog
用户登录 Doris 后,默认进入 Internal Catalog,因此默认的使用和之前版本并无差别,可以直接使用 `SHOW DATABASES`、`USE DB` 等命令查看和切换数据库。
用户可以通过 [SWITCH](../../sql-manual/sql-reference/Utility-Statements/SWITCH.md) 命令切换 Catalog。如:
```
SWITCH internal;
SWITCH hive_catalog;
```
切换后,可以直接通过 `SHOW DATABASES`、`USE DB` 等命令查看和切换对应 Catalog 中的 Database。Doris 会自动同步 Catalog 中的 Database 和 Table。用户可以像使用 Internal Catalog 一样,对 External Catalog 中的数据进行查看和访问。
当前,Doris 只支持对 External Catalog 中的数据进行只读访问。
4. 删除 Catalog
External Catalog 中的 Database 和 Table 都是只读的。但是可以删除 Catalog(Internal Catalog无法删除)。可以通过 [DROP CATALOG](../../sql-manual/sql-reference/Data-Definition-Statements/Drop/DROP-CATALOG) 命令删除一个 External Catalog。
该操作仅会删除 Doris 中该 Catalog 的映射信息,并不会修改或变更任何外部数据目录的内容。
## 连接示例
### 连接 Hive MetaStore(Hive/Iceberg/Hudi)
> 1. hive 支持 2.3.7 以上版本。
> 2. Iceberg 目前仅支持 V1 版本,V2 版本即将支持。
> 3. Hudi 目前仅支持 Copy On Write 表的 Snapshot Query,以及 Merge On Read 表的 Read Optimized Query。后续将支持 Incremental Query 和 Merge On Read 表的 Snapshot Query。
> 4. 支持数据存储在腾讯 CHDFS 上的 hive 表,用法和普通 hive 一样。
以下示例,用于创建一个名为 hive 的 Catalog 连接指定的 Hive MetaStore,并提供了 HDFS HA 连接属性,用于访问对应的 HDFS 中的文件。
**通过 resource 创建 catalog**
`1.2.0` 以后的版本推荐通过 resource 创建 catalog,多个使用场景可以复用相同的 resource。
```sql
CREATE RESOURCE hms_resource PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hadoop.username' = 'hive',
'dfs.nameservices'='your-nameservice',
'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
// 在 PROPERTIES 中指定的配置,将会覆盖 Resource 中的配置。
CREATE CATALOG hive WITH RESOURCE hms_resource PROPERTIES(
'key' = 'value'
);
```
**通过 properties 创建 catalog**
`1.2.0` 版本通过 properties 创建 catalog
```sql
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hadoop.username' = 'hive',
'dfs.nameservices'='your-nameservice',
'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
```
如果需要连接开启了 Kerberos 认证的 Hive MetaStore,示例如下:
```sql
-- 1.2.0+ 版本
CREATE RESOURCE hms_resource PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hive.metastore.sasl.enabled' = 'true',
'dfs.nameservices'='your-nameservice',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',
'hadoop.security.authentication' = 'kerberos',
'hadoop.kerberos.keytab' = '/your-keytab-filepath/your.keytab',
'hadoop.kerberos.principal' = 'your-principal@YOUR.COM',
'yarn.resourcemanager.address' = 'your-rm-address:your-rm-port',
'yarn.resourcemanager.principal' = 'your-rm-principal/_HOST@YOUR.COM'
);
CREATE CATALOG hive WITH RESOURCE hms_resource;
-- 1.2.0 版本
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hadoop.kerberos.xxx' = 'xxx',
...
);
```
如果需要 hadoop KMS 认证,可以在properties中添加:
```
'dfs.encryption.key.provider.uri' = 'kms://http@kms_host:kms_port/kms'
```
创建后,可以通过 `SHOW CATALOGS` 命令查看 catalog:
```
mysql> SHOW CATALOGS;
+-----------+-------------+----------+
| CatalogId | CatalogName | Type |
+-----------+-------------+----------+
| 10024 | hive | hms |
| 0 | internal | internal |
+-----------+-------------+----------+
```
通过 `SWITCH` 命令切换到 hive catalog,并查看其中的数据库:
```
mysql> SWITCH hive;
Query OK, 0 rows affected (0.00 sec)
mysql> SHOW DATABASES;
+-----------+
| Database |
+-----------+
| default |
| random |
| ssb100 |
| tpch1 |
| tpch100 |
| tpch1_orc |
+-----------+
```
切换到 tpch100 数据库,并查看其中的表:
```
mysql> USE tpch100;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> SHOW TABLES;
+-------------------+
| Tables_in_tpch100 |
+-------------------+
| customer |
| lineitem |
| nation |
| orders |
| part |
| partsupp |
| region |
| supplier |
+-------------------+
```
查看 lineitem 表的schema:
```
mysql> DESC lineitem;
+-----------------+---------------+------+------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+---------------+------+------+---------+-------+
| l_shipdate | DATE | Yes | true | NULL | |
| l_orderkey | BIGINT | Yes | true | NULL | |
| l_linenumber | INT | Yes | true | NULL | |
| l_partkey | INT | Yes | true | NULL | |
| l_suppkey | INT | Yes | true | NULL | |
| l_quantity | DECIMAL(15,2) | Yes | true | NULL | |
| l_extendedprice | DECIMAL(15,2) | Yes | true | NULL | |
| l_discount | DECIMAL(15,2) | Yes | true | NULL | |
| l_tax | DECIMAL(15,2) | Yes | true | NULL | |
| l_returnflag | TEXT | Yes | true | NULL | |
| l_linestatus | TEXT | Yes | true | NULL | |
| l_commitdate | DATE | Yes | true | NULL | |
| l_receiptdate | DATE | Yes | true | NULL | |
| l_shipinstruct | TEXT | Yes | true | NULL | |
| l_shipmode | TEXT | Yes | true | NULL | |
| l_comment | TEXT | Yes | true | NULL | |
+-----------------+---------------+------+------+---------+-------+
```
查询示例:
```
mysql> SELECT l_shipdate, l_orderkey, l_partkey FROM lineitem limit 10;
+------------+------------+-----------+
| l_shipdate | l_orderkey | l_partkey |
+------------+------------+-----------+
| 1998-01-21 | 66374304 | 270146 |
| 1997-11-17 | 66374304 | 340557 |
| 1997-06-17 | 66374400 | 6839498 |
| 1997-08-21 | 66374400 | 11436870 |
| 1997-08-07 | 66374400 | 19473325 |
| 1997-06-16 | 66374400 | 8157699 |
| 1998-09-21 | 66374496 | 19892278 |
| 1998-08-07 | 66374496 | 9509408 |
| 1998-10-27 | 66374496 | 4608731 |
| 1998-07-14 | 66374592 | 13555929 |
+------------+------------+-----------+
```
也可以和其他数据目录中的表进行关联查询:
```
mysql> SELECT l.l_shipdate FROM hive.tpch100.lineitem l WHERE l.l_partkey IN (SELECT p_partkey FROM internal.db1.part) LIMIT 10;
+------------+
| l_shipdate |
+------------+
| 1993-02-16 |
| 1995-06-26 |
| 1995-08-19 |
| 1992-07-23 |
| 1998-05-23 |
| 1997-07-12 |
| 1994-03-06 |
| 1996-02-07 |
| 1997-06-01 |
| 1996-08-23 |
+------------+
```
这里我们通过 `catalog.database.table` 这种全限定的方式标识一张表,如:`internal.db1.part`。
其中 `catalog` 和 `database` 可以省略,缺省使用当前 SWITCH 和 USE 后切换的 catalog 和 database。
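例如,下面两条查询等价(假设当前已通过 SWITCH 和 USE 切换到 hive Catalog 下的 tpch100 数据库):
```
SELECT count(*) FROM hive.tpch100.lineitem;
SELECT count(*) FROM lineitem;
```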
可以通过 INSERT INTO 命令,将 hive catalog 中的表数据,插入到 internal catalog 中的内部表,从而达到**导入外部数据目录数据**的效果:
```
mysql> SWITCH internal;
Query OK, 0 rows affected (0.00 sec)
mysql> USE db1;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> INSERT INTO part SELECT * FROM hive.tpch100.part limit 1000;
Query OK, 1000 rows affected (0.28 sec)
{'label':'insert_212f67420c6444d5_9bfc184bf2e7edb8', 'status':'VISIBLE', 'txnId':'4'}
```
### 连接 Elasticsearch
> 1. 支持 5.x 及以上版本。
> 2. 5.x 和 6.x 中一个 index 中的多个 type 默认取第一个
以下示例,用于创建一个名为 es 的 Catalog 连接指定的 ES,并关闭节点发现功能。
```sql
-- 1.2.0+ 版本
CREATE RESOURCE es_resource PROPERTIES (
"type"="es",
"hosts"="http://127.0.0.1:9200",
"nodes_discovery"="false"
);
CREATE CATALOG es WITH RESOURCE es_resource;
-- 1.2.0 版本
CREATE CATALOG es PROPERTIES (
"type"="es",
"elasticsearch.hosts"="http://127.0.0.1:9200",
"elasticsearch.nodes_discovery"="false"
);
```
创建后,可以通过 `SHOW CATALOGS` 命令查看 catalog:
```
mysql> SHOW CATALOGS;
+-----------+-------------+----------+
| CatalogId | CatalogName | Type |
+-----------+-------------+----------+
| 0 | internal | internal |
| 11003 | es | es |
+-----------+-------------+----------+
2 rows in set (0.02 sec)
```
通过 `SWITCH` 命令切换到 es catalog,并查看其中的数据库(只有一个 default_db 关联所有 index)
```
mysql> SWITCH es;
Query OK, 0 rows affected (0.00 sec)
mysql> SHOW DATABASES;
+------------+
| Database |
+------------+
| default_db |
+------------+
mysql> show tables;
+----------------------+
| Tables_in_default_db |
+----------------------+
| test |
| test2 |
+----------------------+
```
查询示例
```
mysql> select * from test;
+------------+-------------+--------+-------+
| test4 | test2 | test3 | test1 |
+------------+-------------+--------+-------+
| 2022-08-08 | hello world | 2.415 | test2 |
| 2022-08-08 | hello world | 3.1415 | test1 |
+------------+-------------+--------+-------+
```
#### 参数说明:
参数 | 说明
---|---
**elasticsearch.hosts** | ES 地址,可以是一个或多个,也可以是 ES 的负载均衡地址
**elasticsearch.username** | ES 用户名
**elasticsearch.password** | 对应用户的密码信息
**elasticsearch.doc_value_scan** | 是否开启通过 ES/Lucene 列式存储获取查询字段的值,默认为 true
**elasticsearch.keyword_sniff** | 是否对 ES 中字符串分词类型 text.fields 进行探测,通过 keyword 进行查询(默认为 true,设置为 false 会按照分词后的内容匹配)
**elasticsearch.nodes_discovery** | 是否开启 ES 节点发现,默认为 true,在网络隔离环境下设置为 false,只连接指定节点
**elasticsearch.ssl** | ES 是否开启 https 访问模式,目前在 fe/be 实现方式为信任所有
### 连接阿里云 Data Lake Formation
> [什么是 Data Lake Formation](https://www.aliyun.com/product/bigdata/dlf)
1. 创建 hive-site.xml
创建 hive-site.xml 文件,并将其放置在 `fe/conf` 目录下。
```
<?xml version="1.0"?>
<configuration>
<!--Set to use dlf client-->
<property>
<name>hive.metastore.type</name>
<value>dlf</value>
</property>
<property>
<name>dlf.catalog.endpoint</name>
<value>dlf-vpc.cn-beijing.aliyuncs.com</value>
</property>
<property>
<name>dlf.catalog.region</name>
<value>cn-beijing</value>
</property>
<property>
<name>dlf.catalog.proxyMode</name>
<value>DLF_ONLY</value>
</property>
<property>
<name>dlf.catalog.uid</name>
<value>20000000000000000</value>
</property>
<property>
<name>dlf.catalog.accessKeyId</name>
<value>XXXXXXXXXXXXXXX</value>
</property>
<property>
<name>dlf.catalog.accessKeySecret</name>
<value>XXXXXXXXXXXXXXXXX</value>
</property>
</configuration>
```
* `dlf.catalog.endpoint`:DLF Endpoint,参阅:[DLF Region和Endpoint对照表](https://www.alibabacloud.com/help/zh/data-lake-formation/latest/regions-and-endpoints)
* `dlf.catalog.region`:DLF Region,参阅:[DLF Region和Endpoint对照表](https://www.alibabacloud.com/help/zh/data-lake-formation/latest/regions-and-endpoints)
* `dlf.catalog.uid`:阿里云账号。即阿里云控制台右上角个人信息的“云账号ID”。
* `dlf.catalog.accessKeyId`:AccessKey。可以在 [阿里云控制台](https://ram.console.aliyun.com/manage/ak) 中创建和管理。
* `dlf.catalog.accessKeySecret`:SecretKey。可以在 [阿里云控制台](https://ram.console.aliyun.com/manage/ak) 中创建和管理。
其他配置项为固定值,无需改动。
2. 重启 FE,并通过 `CREATE CATALOG` 语句创建 catalog。
HMS resource 会读取和解析 fe/conf/hive-site.xml
```sql
-- 1.2.0+ 版本
CREATE RESOURCE dlf_resource PROPERTIES (
"type"="hms",
"hive.metastore.uris" = "thrift://127.0.0.1:9083"
)
CREATE CATALOG dlf WITH RESOURCE dlf_resource;
-- 1.2.0 版本
CREATE CATALOG dlf PROPERTIES (
"type"="hms",
"hive.metastore.uris" = "thrift://127.0.0.1:9083"
)
```
其中 `type` 固定为 `hms`。 `hive.metastore.uris` 的值随意填写即可,实际不会使用。但需要按照标准 hive metastore thrift uri 格式填写。
之后,可以像正常的 Hive MetaStore 一样,访问 DLF 下的元数据。
### 连接JDBC
以下示例,用于创建一个名为 jdbc 的 Catalog, 通过jdbc 连接指定的Mysql。
jdbc Catalog 会根据 `jdbc.jdbc_url` 来连接指定的数据库(示例中是 `jdbc:mysql`,所以连接 MySQL 数据库),当前支持 MySQL、PostgreSQL、ClickHouse、Oracle 等数据库类型(见下方示例)。
**MYSQL catalog示例**
```sql
-- 1.2.0+ 版本
CREATE RESOURCE mysql_resource PROPERTIES (
"type"="jdbc",
"user"="root",
"password"="123456",
"jdbc_url" = "jdbc:mysql://127.0.0.1:3306/demo",
"driver_url" = "file:///path/to/mysql-connector-java-5.1.47.jar",
"driver_class" = "com.mysql.jdbc.Driver"
)
CREATE CATALOG jdbc WITH RESOURCE mysql_resource;
-- 1.2.0 版本
CREATE CATALOG jdbc PROPERTIES (
"type"="jdbc",
"jdbc.jdbc_url" = "jdbc:mysql://127.0.0.1:3306/demo",
...
)
```
**POSTGRESQL catalog示例**
```sql
-- 1.2.0+ 版本
CREATE RESOURCE pg_resource PROPERTIES (
"type"="jdbc",
"user"="postgres",
"password"="123456",
"jdbc_url" = "jdbc:postgresql://127.0.0.1:5449/demo",
"driver_url" = "file:///path/to/postgresql-42.5.1.jar",
"driver_class" = "org.postgresql.Driver"
);
CREATE CATALOG jdbc WITH RESOURCE pg_resource;
-- 1.2.0 版本
CREATE CATALOG jdbc PROPERTIES (
"type"="jdbc",
"jdbc.jdbc_url" = "jdbc:postgresql://127.0.0.1:5449/demo",
...
)
```
**CLICKHOUSE catalog示例**
```sql
-- 方式一
CREATE RESOURCE clickhouse_resource PROPERTIES (
"type"="jdbc",
"user"="default",
"password"="123456",
"jdbc_url" = "jdbc:clickhouse://127.0.0.1:8123/demo",
"driver_url" = "file:///path/to/clickhouse-jdbc-0.3.2-patch11-all.jar",
"driver_class" = "com.clickhouse.jdbc.ClickHouseDriver"
)
CREATE CATALOG jdbc WITH RESOURCE clickhouse_resource;
-- 方式二,注意有jdbc前缀
CREATE CATALOG jdbc PROPERTIES (
"type"="jdbc",
"jdbc.jdbc_url" = "jdbc:clickhouse://127.0.0.1:8123/demo",
...
)
```
**ORACLE catalog示例**
```sql
-- 方式一
CREATE RESOURCE oracle_resource PROPERTIES (
"type"="jdbc",
"user"="doris",
"password"="123456",
"jdbc_url" = "jdbc:oracle:thin:@127.0.0.1:1521:helowin",
"driver_url" = "file:/path/to/ojdbc6.jar",
"driver_class" = "oracle.jdbc.driver.OracleDriver"
);
CREATE CATALOG jdbc WITH RESOURCE oracle_resource;
-- 方式二,注意有jdbc前缀
CREATE CATALOG jdbc PROPERTIES (
"type"="jdbc",
"jdbc.user"="doris",
"jdbc.password"="123456",
"jdbc.jdbc_url" = "jdbc:oracle:thin:@127.0.0.1:1521:helowin",
"jdbc.driver_url" = "file:/path/to/ojdbc6.jar",
"jdbc.driver_class" = "oracle.jdbc.driver.OracleDriver"
);
```
其中`jdbc.driver_url`可以是远程jar包:
```sql
CREATE RESOURCE mysql_resource PROPERTIES (
"type"="jdbc",
"user"="root",
"password"="123456",
"jdbc_url" = "jdbc:mysql://127.0.0.1:13396/demo",
"driver_url" = "https://path/jdbc_driver/mysql-connector-java-8.0.25.jar",
"driver_class" = "com.mysql.cj.jdbc.Driver"
)
CREATE CATALOG jdbc WITH RESOURCE mysql_resource;
```
如果`jdbc.driver_url` 是http形式的远程jar包,Doris对其的处理方式为:
1. 只查询元数据,不查询表数据情况下(如 `show catalogs/database/tables` 等操作):FE会直接用这个url来加载驱动类,并进行MYSQL数据类型到Doris数据类型的转换。
2. 在对jdbc catalog中的表进行查询时(`select from`):BE会将该url指定jar包下载到`be/lib/udf/`目录下,查询时将直接用下载后的路径来加载jar包。
创建catalog后,可以通过 SHOW CATALOGS 命令查看 catalog:
```sql
MySQL [(none)]> show catalogs;
+-----------+-------------+----------+
| CatalogId | CatalogName | Type |
+-----------+-------------+----------+
| 0 | internal | internal |
| 10480 | jdbc | jdbc |
+-----------+-------------+----------+
2 rows in set (0.02 sec)
```
通过 SWITCH 命令切换到 jdbc catalog,并查看其中的数据库:
```sql
MySQL [(none)]> switch jdbc;
Query OK, 0 rows affected (0.02 sec)
MySQL [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| __db1 |
| _db1 |
| db1 |
| demo |
| information_schema |
| mysql |
| mysql_db_test |
| performance_schema |
| sys |
+--------------------+
9 rows in set (0.67 sec)
```
> 注意:
> 1. 在postgresql catalog中,doris的一个database对应于postgresql中指定catalog(`jdbc.jdbc_url`参数中指定的catalog)下的一个schema,database下的tables则对应于postgresql该schema下的tables。
> 2. 在oracle catalog中,doris的一个database对应于oracle中的一个user,database下的tables则对应于oracle该user下的有权限访问的tables。
查看`db1`数据库下的表,并查询:
```sql
MySQL [demo]> use db1;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
MySQL [db1]> show tables;
+---------------+
| Tables_in_db1 |
+---------------+
| tbl1 |
+---------------+
1 row in set (0.00 sec)
MySQL [db1]> desc tbl1;
+-------+------+------+------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+------+------+------+---------+-------+
| k1 | INT | Yes | true | NULL | |
+-------+------+------+------+---------+-------+
1 row in set (0.00 sec)
MySQL [db1]> select * from tbl1;
+------+
| k1 |
+------+
| 1 |
| 2 |
| 3 |
| 4 |
+------+
4 rows in set (0.19 sec)
```
#### 参数说明:
参数 | 说明
---|---
**jdbc.user** | 连接数据库使用的用户名
**jdbc.password** | 连接数据库使用的密码
**jdbc.jdbc_url** | 连接到指定数据库的标识符
**jdbc.driver_url** | jdbc驱动包的url
**jdbc.driver_class** | jdbc驱动类
## 列类型映射
用户创建 Catalog 后,Doris 会自动同步数据目录的数据库和表,针对不同的数据目录和数据表格式,Doris 会进行以下列映射关系。
<version since="dev">
对于当前无法映射到 Doris 列类型的外表类型(如 map、struct 等),Doris 会将列类型映射为 UNSUPPORTED 类型。对于 UNSUPPORTED 类型的查询,示例如下:
假设同步后的表 schema 为:
```
k1 INT,
k2 INT,
k3 UNSUPPORTED,
k4 INT
```
```
select * from table; // Error: Unsupported type 'UNSUPPORTED_TYPE' in '`k3`
select * except(k3) from table; // Query OK.
select k1, k3 from table; // Error: Unsupported type 'UNSUPPORTED_TYPE' in '`k3`
select k1, k4 from table; // Query OK.
```
</version>
### Hive MetaStore
适用于 Hive/Iceberg/Hudi
| HMS Type | Doris Type | Comment |
|---|---|---|
| boolean| boolean | |
| tinyint|tinyint | |
| smallint| smallint| |
| int| int | |
| bigint| bigint | |
| date| date| |
| timestamp| datetime| |
| float| float| |
| double| double| |
| `array<type>` | `array<type>`| 支持array嵌套,如 `array<array<int>>` |
| char| char | |
| varchar| varchar| |
| decimal| decimal | |
| other | string | 其余不支持类型统一按 string 处理 |
### Elasticsearch
| HMS Type | Doris Type | Comment |
|---|---|---|
| boolean | boolean | |
| byte| tinyint| |
| short| smallint| |
| integer| int| |
| long| bigint| |
| unsigned_long| largeint | |
| float| float| |
| half_float| float| |
| double | double | |
| scaled_float| double | |
| date | date | |
| keyword | string | |
| text |string | |
| ip |string | |
| nested |string | |
| object |string | |
| array | | 开发中 |
|other| string ||
### JDBC
#### MYSQL
MYSQL Type | Doris Type | Comment |
|---|---|---|
| BOOLEAN | BOOLEAN | |
| TINYINT | TINYINT | |
| SMALLINT | SMALLINT | |
| MEDIUMINT | INT | |
| INT | INT | |
| BIGINT | BIGINT | |
| UNSIGNED TINYINT | SMALLINT | Doris没有UNSIGNED数据类型,所以扩大一个数量级|
| UNSIGNED MEDIUMINT | INT | Doris没有UNSIGNED数据类型,所以扩大一个数量级|
| UNSIGNED INT | BIGINT |Doris没有UNSIGNED数据类型,所以扩大一个数量级 |
| UNSIGNED BIGINT | STRING | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| DECIMAL | DECIMAL | |
| DATE | DATE | |
| TIMESTAMP | DATETIME | |
| DATETIME | DATETIME | |
| YEAR | SMALLINT | |
| TIME | STRING | |
| CHAR | CHAR | |
| VARCHAR | STRING | |
| TINYTEXT、TEXT、MEDIUMTEXT、LONGTEXT、TINYBLOB、BLOB、MEDIUMBLOB、LONGBLOB、TINYSTRING、STRING、MEDIUMSTRING、LONGSTRING、BINARY、VARBINARY、JSON、SET、BIT | STRING | |
#### POSTGRESQL
POSTGRESQL Type | Doris Type | Comment |
|---|---|---|
| boolean | BOOLEAN | |
| smallint/int2 | SMALLINT | |
| integer/int4 | INT | |
| bigint/int8 | BIGINT | |
| decimal/numeric | DECIMAL | |
| real/float4 | FLOAT | |
| double precision | DOUBLE | |
| smallserial | SMALLINT | |
| serial | INT | |
| bigserial | BIGINT | |
| char | CHAR | |
| varchar/text | STRING | |
| timestamp | DATETIME | |
| date | DATE | |
| time | STRING | |
| interval | STRING | |
| point/line/lseg/box/path/polygon/circle | STRING | |
| cidr/inet/macaddr | STRING | |
| bit/bit(n)/bit varying(n) | STRING | `bit`类型映射为doris的`STRING`类型,读出的数据是`true/false`, 而不是`1/0` |
| uuid/jsonb | STRING | |
#### CLICKHOUSE
| ClickHouse Type | Doris Type | Comment |
|------------------------|------------|-----------------------------------------------------|
| Bool | BOOLEAN | |
| String | STRING | |
| Date/Date32 | DATE | |
| DateTime/DateTime64 | DATETIME | 对于超过了Doris最大的DateTime精度的数据,将截断处理 |
| Float32 | FLOAT | |
| Float64 | DOUBLE | |
| Int8 | TINYINT | |
| Int16/UInt8 | SMALLINT | Doris没有UNSIGNED数据类型,所以扩大一个数量级 |
| Int32/UInt16 | INT | Doris没有UNSIGNED数据类型,所以扩大一个数量级 |
| Int64/Uint32 | BIGINT | Doris没有UNSIGNED数据类型,所以扩大一个数量级 |
| Int128/UInt64 | LARGEINT | Doris没有UNSIGNED数据类型,所以扩大一个数量级 |
| Int256/UInt128/UInt256 | STRING | Doris没有这个数量级的数据类型,采用STRING处理 |
| DECIMAL | DECIMAL | 对于超过了Doris最大的Decimal精度的数据,将映射为STRING |
| Enum/IPv4/IPv6/UUID | STRING | 在显示上IPv4,IPv6会额外在数据最前面显示一个`/`,需要自己用`split_part`函数处理 |
#### ORACLE
ORACLE Type | Doris Type | Comment |
|---|---|---|
| number(p) / number(p,0) | | Doris会根据p的大小来选择对应的类型:p<3 -> TINYINT; p<5 -> SMALLINT; p<10 -> INT; p<19 -> BIGINT; p>=19 -> LARGEINT |
| number(p,s) | DECIMAL | |
| decimal | DECIMAL | |
| float/real | DOUBLE | |
| DATE | DATETIME | |
| CHAR/NCHAR | CHAR | |
| VARCHAR2/NVARCHAR2 | VARCHAR | |
| LONG/ RAW/ LONG RAW/ INTERVAL | TEXT | |
## 权限管理
使用 Doris 对 External Catalog 中库表进行访问,并不受外部数据目录自身的权限控制,而是依赖 Doris 自身的权限访问管理功能。
Doris 的权限管理功能提供了对 Catalog 层级的扩展,具体可参阅 [权限管理](../../admin-manual/privilege-ldap/user-privilege.md) 文档。
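例如,下面是一条示意性的授权语句,将 hive Catalog 下所有库表的查询权限授予某个用户(user1 为假设的用户名,具体语法请以 GRANT 文档为准):
```sql
GRANT SELECT_PRIV ON hive.*.* TO 'user1'@'%';
```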
## 元数据更新
### 手动刷新
外部数据源的元数据变动,如创建、删除表,加减列等操作,不会同步给 Doris。
用户通过 [REFRESH CATALOG](../../sql-manual/sql-reference/Utility-Statements/REFRESH.md) 命令手动刷新元数据。
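例如(以下为示意,hive 为上文示例中创建的 Catalog 名称,具体语法以 REFRESH 文档为准):
```sql
-- 刷新整个 Catalog 的元数据
REFRESH CATALOG hive;
-- 也可以只刷新某个库或某张表
REFRESH DATABASE hive.tpch100;
REFRESH TABLE hive.tpch100.lineitem;
```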
### 自动刷新
#### Hive MetaStore(HMS)数据目录
<version since="dev">
通过让FE节点定时读取HMS的notification event来感知Hive表元数据的变更情况,目前支持处理如下event:
</version>
1. CREATE DATABASE event:在对应数据目录下创建数据库。
2. DROP DATABASE event:在对应数据目录下删除数据库。
3. ALTER DATABASE event:此事件的影响主要有更改数据库的属性信息,注释及默认存储位置等,这些改变不影响doris对外部数据目录的查询操作,因此目前会忽略此event。
4. CREATE TABLE event:在对应数据库下创建表。
5. DROP TABLE event:在对应数据库下删除表,并失效表的缓存。
6. ALTER TABLE event:如果是重命名,先删除旧名字的表,再用新名字创建表,否则失效该表的缓存。
7. ADD PARTITION event:在对应表缓存的分区列表里添加分区。
8. DROP PARTITION event:在对应表缓存的分区列表里删除分区,并失效该分区的缓存。
9. ALTER PARTITION event:如果是重命名,先删除旧名字的分区,再用新名字创建分区,否则失效该分区的缓存。
10. 当导入数据导致文件变更时,分区表会走 ALTER PARTITION event 逻辑,非分区表会走 ALTER TABLE event 逻辑(注意:如果绕过 HMS 直接操作文件系统,HMS 不会生成对应事件,Doris 因此也无法感知)。
该特性被fe的如下参数控制:
1. enable_hms_events_incremental_sync:是否开启元数据自动增量同步功能,默认关闭。
2. hms_events_polling_interval_ms: 读取 event 的间隔时间,默认值为 10000,单位:毫秒。
3. hms_events_batch_size_per_rpc:每次读取 event 的最大数量,默认值为 500。
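例如,在 fe.conf 中开启该功能的一个配置示意如下(后两项为默认值,可按需调整):
```
enable_hms_events_incremental_sync = true
hms_events_polling_interval_ms = 10000
hms_events_batch_size_per_rpc = 500
```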
如果想使用该特性,需要更改HMS的hive-site.xml并重启HMS
```
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
<property>
<name>hive.metastore.dml.events</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.transactional.event.listeners</name>
<value>org.apache.hive.hcatalog.listener.DbNotificationListener</value>
</property>
```
##### 使用建议
无论是之前已经创建好的 catalog 想改为自动刷新,还是新创建的 catalog,都只需要把 enable_hms_events_incremental_sync 设置为 true 并重启 FE 节点,无需在重启前后再手动刷新元数据。
#### 其它数据目录
暂未支持。
## Time Travel
### Iceberg
每一次对iceberg表的写操作都会产生一个新的快照,而读操作只会读取最新版本的快照。
iceberg表可以使用`FOR TIME AS OF`和`FOR VERSION AS OF`语句,根据快照ID或者快照产生的时间读取历史版本的数据。
`SELECT * FROM db.tbl FOR TIME AS OF "2022-10-07 17:20:37";`
`SELECT * FROM db.tbl FOR VERSION AS OF 868895038966572;`
另外,查看 iceberg 表的 snapshot id 和 snapshot 创建的时间戳,可以使用 [iceberg_meta](../../sql-manual/sql-functions/table-functions/iceberg_meta.md) 表函数。
## 常见问题
### Iceberg
下面的配置用来解决Doris使用Hive客户端访问Hive Metastore时出现的`failed to get schema for table xxx in db xxx` 和 `java.lang.UnsupportedOperationException: Storage schema reading not supported`。
- 在hive的lib目录放上iceberg运行时有关的jar包。
- hive-site.xml配置`metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader`。
配置完成后需要重启Hive Metastore。
### Kerberos
1.2.0 版本连接 `Kerberos` 认证的 `Hive Metastore` 出现 `GSS initiate failed` 异常信息。
- 请更新到 1.2.2 版本,或使用最新的 docker 开发镜像重新编译 1.2.1 版本的 FE。
### HDFS
读取 HDFS 3.x 时出现 `java.lang.VerifyError: xxx` 错误。
- 更新 `fe/pom.xml` 中的 hadoop 相关依赖到 2.10.2 版本,并重新编译 FE。

View File

@ -1,6 +1,6 @@
---
{
"title": "Doris On ES",
"title": "Elasticsearch 外表",
"language": "zh-CN"
}
---
@ -24,7 +24,13 @@ specific language governing permissions and limitations
under the License.
-->
# Doris On ES
# Elasticsearch 外表
<version deprecated="1.2.2">
推荐使用 [ES Catalog](../multi-catalog/es) 功能访问 ES。
</version>
Doris-On-ES将Doris的分布式查询规划能力和ES(Elasticsearch)的全文检索能力相结合,提供更完善的OLAP分析场景解决方案:

View File

@ -0,0 +1,141 @@
---
{
"title": "文件分析",
"language": "zh-CN"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# 文件分析
<version since="1.2.0">
通过 Table Value Function 功能,Doris 可以直接将对象存储或 HDFS 上的文件作为 Table 进行查询分析。并且支持自动的列类型推断。
</version>
## 使用方式
更多使用方式可参阅 Table Value Function 文档:
* [S3](../../sql-manual/sql-functions/table-functions/s3):支持 S3 兼容的对象存储上的文件分析。
* [HDFS](../../sql-manual/sql-functions/table-functions/hdfs.md):支持 HDFS 上的文件分析。
这里我们通过 S3 Table Value Function 举例说明如何进行文件分析。
### 自动推断文件列类型
```
MySQL [(none)]> DESC FUNCTION s3(
"URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"ACCESS_KEY"= "minioadmin",
"SECRET_KEY" = "minioadmin",
"Format" = "parquet",
"use_path_style"="true");
+---------------+--------------+------+-------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+--------------+------+-------+---------+-------+
| p_partkey | INT | Yes | false | NULL | NONE |
| p_name | TEXT | Yes | false | NULL | NONE |
| p_mfgr | TEXT | Yes | false | NULL | NONE |
| p_brand | TEXT | Yes | false | NULL | NONE |
| p_type | TEXT | Yes | false | NULL | NONE |
| p_size | INT | Yes | false | NULL | NONE |
| p_container | TEXT | Yes | false | NULL | NONE |
| p_retailprice | DECIMAL(9,0) | Yes | false | NULL | NONE |
| p_comment | TEXT | Yes | false | NULL | NONE |
+---------------+--------------+------+-------+---------+-------+
```
这里我们定义了一个 S3 Table Value Function:
```
s3(
"URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"ACCESS_KEY"= "minioadmin",
"SECRET_KEY" = "minioadmin",
"Format" = "parquet",
"use_path_style"="true")
```
其中指定了文件的路径、连接信息、认证信息等。
之后,通过 `DESC FUNCTION` 语法可以查看这个文件的 Schema。
可以看到,对于 Parquet 文件,Doris 会根据文件内的元信息自动推断列类型。
目前支持对 Parquet、ORC、CSV、Json 格式进行分析和列类型推断。
### 查询分析
你可以使用任意的 SQL 语句对这个文件进行分析
```
SELECT * FROM s3(
"URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
"ACCESS_KEY"= "minioadmin",
"SECRET_KEY" = "minioadmin",
"Format" = "parquet",
"use_path_style"="true")
LIMIT 5;
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| p_partkey | p_name | p_mfgr | p_brand | p_type | p_size | p_container | p_retailprice | p_comment |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
| 1 | goldenrod lavender spring chocolate lace | Manufacturer#1 | Brand#13 | PROMO BURNISHED COPPER | 7 | JUMBO PKG | 901 | ly. slyly ironi |
| 2 | blush thistle blue yellow saddle | Manufacturer#1 | Brand#13 | LARGE BRUSHED BRASS | 1 | LG CASE | 902 | lar accounts amo |
| 3 | spring green yellow purple cornsilk | Manufacturer#4 | Brand#42 | STANDARD POLISHED BRASS | 21 | WRAP CASE | 903 | egular deposits hag |
| 4 | cornflower chocolate smoke green pink | Manufacturer#3 | Brand#34 | SMALL PLATED BRASS | 14 | MED DRUM | 904 | p furiously r |
| 5 | forest brown coral puff cream | Manufacturer#3 | Brand#32 | STANDARD POLISHED TIN | 15 | SM PKG | 905 | wake carefully |
+-----------+------------------------------------------+----------------+----------+-------------------------+--------+-------------+---------------+---------------------+
```
Table Value Function 可以出现在 SQL 中 Table 能出现的任意位置,如 CTE 的 WITH 子句中、FROM 子句中。
这样,你可以把文件当做一张普通的表进行任意分析。
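例如,下面的示意查询在 CTE 中复用前文的 S3 Table Value Function,对文件内容做聚合分析(连接参数沿用前文示例):
```
WITH parts AS (
    SELECT * FROM s3(
        "URI" = "http://127.0.0.1:9312/test2/test.snappy.parquet",
        "ACCESS_KEY"= "minioadmin",
        "SECRET_KEY" = "minioadmin",
        "Format" = "parquet",
        "use_path_style"="true")
)
SELECT p_brand, count(*) AS cnt
FROM parts
GROUP BY p_brand
ORDER BY cnt DESC
LIMIT 3;
```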
### 数据导入
配合 `INSERT INTO SELECT` 语法,我们可以方便将文件导入到 Doris 表中进行更快速的分析:
```
// 1. 创建doris内部表
CREATE TABLE IF NOT EXISTS test_table
(
id int,
name varchar(50),
age int
)
DISTRIBUTED BY HASH(id) BUCKETS 4
PROPERTIES("replication_num" = "1");
// 2. 使用 S3 Table Value Function 插入数据
INSERT INTO test_table (id,name,age)
SELECT cast(id as INT) as id, name, cast (age as INT) as age
FROM s3(
"uri" = "${uri}",
"ACCESS_KEY"= "${ak}",
"SECRET_KEY" = "${sk}",
"format" = "${format}",
"strip_outer_array" = "true",
"read_json_by_line" = "true",
"use_path_style" = "true");
```

View File

@ -1,6 +1,6 @@
---
{
"title": "Doris On Hive",
"title": "Hive 外表",
"language": "zh-CN"
}
---
@ -24,9 +24,13 @@ specific language governing permissions and limitations
under the License.
-->
# Hive External Table of Doris
# Hive 外表
<version deprecated="1.2.0" comment="请使用 Multi-Catalog 功能访问 Hive">
<version deprecated="1.2.0">
推荐使用 [Hive Catalog](../multi-catalog/hive) 访问 Hive。
</version>
Hive External Table of Doris 提供了 Doris 直接访问 Hive 外部表的能力,外部表省去了繁琐的数据导入工作,并借助 Doris 本身的 OLAP 的能力来解决 Hive 表的数据分析问题:
@ -37,8 +41,6 @@ Hive External Table of Doris 提供了 Doris 直接访问 Hive 外部表的能
本文档主要介绍该功能的使用方式和注意事项等。
</version>
## 名词解释
### Doris 相关

View File

@ -1,6 +1,6 @@
---
{
"title": "Doris On JDBC",
"title": "JDBC 外表",
"language": "zh-CN"
}
---
@ -24,7 +24,13 @@ specific language governing permissions and limitations
under the License.
-->
# JDBC External Table Of Doris
# JDBC 外表
<version deprecated="1.2.2">
推荐使用 [JDBC Catalog](../multi-catalog/jdbc) 访问 JDBC 外表。
</version>
<version since="1.2.0">
@ -260,70 +266,4 @@ PROPERTIES (
## Q&A
1. 除了MySQL,Oracle,PostgreSQL,SQLServer,ClickHouse是否能够支持更多的数据库
目前Doris只适配了MySQL,Oracle,PostgreSQL,SQLServer,ClickHouse.关于其他的数据库的适配工作正在规划之中,原则上来说任何支持JDBC访问的数据库都能通过JDBC外表来访问。如果您有访问其他外表的需求,欢迎修改代码并贡献给Doris。
2. 读写mysql外表的emoji表情出现乱码
Doris进行jdbc外表连接时,由于mysql之中默认的utf8编码为utf8mb3,无法表示需要4字节编码的emoji表情。这里需要在建立mysql外表时设置对应列的编码为utf8mb4,设置服务器编码为utf8mb4,JDBC Url中的characterEncoding不配置.(该属性不支持utf8mb4,配置了非utf8mb4将导致无法写入表情,因此要留空,不配置)
3. 读mysql外表时,DateTime="0000:00:00 00:00:00"异常报错: "CAUSED BY: DataReadException: Zero date value prohibited"
这是因为JDBC中对于该非法的DateTime默认处理为抛出异常,可以通过参数zeroDateTimeBehavior控制该行为.
可选参数为:EXCEPTION,CONVERT_TO_NULL,ROUND, 分别为异常报错,转为NULL值,转为"0001-01-01 00:00:00";
可在url中添加:"jdbc_url"="jdbc:mysql://IP:PORT/doris_test?zeroDateTimeBehavior=convertToNull"
4. 读取mysql外表或其他外表时,出现加载类失败
如以下异常:
failed to load driver class com.mysql.jdbc.driver in either of hikariconfig class loader
这是因为在创建resource时,填写的driver_class不正确,需要正确填写,如上方例子为大小写问题,应填写为 `"driver_class" = "com.mysql.jdbc.Driver"`
5. 读取mysql问题出现通信链路异常
如果出现如下报错:
```
ERROR 1105 (HY000): errCode = 2, detailMessage = PoolInitializationException: Failed to initialize pool: Communications link failure
The last packet successfully received from the server was 7 milliseconds ago. The last packet sent successfully to the server was 4 milliseconds ago.
CAUSED BY: CommunicationsException: Communications link failure
The last packet successfully received from the server was 7 milliseconds ago. The last packet sent successfully to the server was 4 milliseconds ago.
CAUSED BY: SSLHandshakeExcepti
```
可查看be的be.out日志
如果包含以下信息:
```
WARN: Establishing SSL connection without server's identity verification is not recommended.
According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set.
For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'.
You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
```
可在创建resource的jdbc_url把JDBC连接串最后增加 `?useSSL=false` ,如 `"jdbc_url" = "jdbc:mysql://127.0.0.1:3306/test?useSSL=false"`
```
可全局修改配置项
修改mysql目录下的my.ini文件(linux系统为etc目录下的my.cnf文件)
[client]
default-character-set=utf8mb4
[mysql]
设置mysql默认字符集
default-character-set=utf8mb4
[mysqld]
设置mysql字符集服务器
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
init_connect='SET NAMES utf8mb4
修改对应表与列的类型
ALTER TABLE table_name MODIFY colum_name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE table_name CHARSET=utf8mb4;
SET NAMES utf8mb4
```
请参考 [JDBC Catalog](../multi-catalog/jdbc) 中的 常见问题一节。

View File

@ -1,6 +1,6 @@
---
{
"title": "Doris On ODBC",
"title": "ODBC 外表",
"language": "zh-CN"
}
---
@ -24,9 +24,13 @@ specific language governing permissions and limitations
under the License.
-->
# ODBC External Table Of Doris
# ODBC 外表
<version deprecated="1.2.0" comment="请使用 JDBC 外表功能">
<version deprecated="1.2.0">
请使用 [JDBC Catalog](../multi-catalog/jdbc) 功能访问外表。
</version>
ODBC External Table Of Doris 提供了Doris通过数据库访问的标准接口(ODBC)来访问外部表,外部表省去了繁琐的数据导入工作,让Doris可以具有了访问各式数据库的能力,并借助Doris本身的OLAP的能力来解决外部表的数据分析问题:
@ -36,8 +40,6 @@ ODBC External Table Of Doris 提供了Doris通过数据库访问的标准接口(
本文档主要介绍该功能的实现原理、使用方式等。
</version>
## 名词解释
### Doris相关

View File

@ -0,0 +1,102 @@
---
{
"title": "阿里云 DLF",
"language": "zh-CN"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# 阿里云 DLF
阿里云 Data Lake Formation(DLF) 是阿里云上的统一元数据管理服务。兼容 Hive Metastore 协议。
> [什么是 Data Lake Formation](https://www.aliyun.com/product/bigdata/dlf)
因此我们也可以像访问 Hive Metastore 一样,连接并访问 DLF。
## 连接 DLF
1. 创建 hive-site.xml
创建 hive-site.xml 文件,并将其放置在 `fe/conf` 目录下。
```
<?xml version="1.0"?>
<configuration>
<!--Set to use dlf client-->
<property>
<name>hive.metastore.type</name>
<value>dlf</value>
</property>
<property>
<name>dlf.catalog.endpoint</name>
<value>dlf-vpc.cn-beijing.aliyuncs.com</value>
</property>
<property>
<name>dlf.catalog.region</name>
<value>cn-beijing</value>
</property>
<property>
<name>dlf.catalog.proxyMode</name>
<value>DLF_ONLY</value>
</property>
<property>
<name>dlf.catalog.uid</name>
<value>20000000000000000</value>
</property>
<property>
<name>dlf.catalog.accessKeyId</name>
<value>XXXXXXXXXXXXXXX</value>
</property>
<property>
<name>dlf.catalog.accessKeySecret</name>
<value>XXXXXXXXXXXXXXXXX</value>
</property>
</configuration>
```
* `dlf.catalog.endpoint`:DLF Endpoint,参阅:[DLF Region和Endpoint对照表](https://www.alibabacloud.com/help/zh/data-lake-formation/latest/regions-and-endpoints)
* `dlf.catalog.region`:DLF Region,参阅:[DLF Region和Endpoint对照表](https://www.alibabacloud.com/help/zh/data-lake-formation/latest/regions-and-endpoints)
* `dlf.catalog.uid`:阿里云账号。即阿里云控制台右上角个人信息的“云账号ID”。
* `dlf.catalog.accessKeyId`:AccessKey。可以在 [阿里云控制台](https://ram.console.aliyun.com/manage/ak) 中创建和管理。
* `dlf.catalog.accessKeySecret`:SecretKey。可以在 [阿里云控制台](https://ram.console.aliyun.com/manage/ak) 中创建和管理。
其他配置项为固定值,无需改动。
2. 重启 FE,并通过 `CREATE CATALOG` 语句创建 catalog。
Doris 会读取和解析 `fe/conf/hive-site.xml`。
```sql
CREATE CATALOG hive_with_dlf PROPERTIES (
"type"="hms",
"hive.metastore.uris" = "thrift://127.0.0.1:9083"
)
```
其中 `type` 固定为 `hms`。 `hive.metastore.uris` 的值随意填写即可,实际不会使用。但需要按照标准 hive metastore thrift uri 格式填写。
之后,可以像正常的 Hive MetaStore 一样,访问 DLF 下的元数据。
同 Hive Catalog 一样,支持访问 DLF 中 Hive/Iceberg/Hudi 的元数据信息。
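创建完成后,可以像其他 Catalog 一样切换并查看其中的库表,例如(示意,hive_with_dlf 为上文创建的 Catalog 名称):
```sql
SWITCH hive_with_dlf;
SHOW DATABASES;
```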

View File

@ -0,0 +1,436 @@
---
{
"title": "Elasticsearch",
"language": "zh-CN"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Elasticsearch
Elasticsearch Catalog 除了支持自动映射 ES 元数据外,也可以利用 Doris 的分布式查询规划能力和 ES(Elasticsearch) 的全文检索能力相结合,提供更完善的 OLAP 分析场景解决方案:
1. ES 中的多 index 分布式 Join 查询。
2. Doris 和 ES 中的表联合查询,更复杂的全文检索过滤。
## 使用限制
1. 支持 Elasticsearch 5.x 及以上版本。
## 创建 Catalog
```sql
CREATE CATALOG es PROPERTIES (
"type"="es",
"hosts"="http://127.0.0.1:9200"
);
```
因为 Elasticsearch 没有 Database 的概念,所以连接 ES 后,会自动生成一个唯一的 Database:`default_db`
并且在通过 SWITCH 命令切换到 ES Catalog 后,会自动切换到 `default_db`。无需再执行 `USE default_db` 命令。
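例如,创建上述名为 es 的 Catalog 后,可以直接切换并查询其中自动映射出的索引(以下为示意,test 为假设已存在于 ES 中的 index):
```sql
SWITCH es;
SHOW TABLES;
SELECT * FROM test LIMIT 10;
```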
### 参数说明
参数 | 是否必须 | 默认值 | 说明
--- | --- | --- | ---
`hosts` | 是 | | ES 地址,可以是一个或多个,也可以是 ES 的负载均衡地址 |
`username` | 否 | 空 | ES 用户名 |
`password` | 否 | 空 | 对应用户的密码信息 |
`doc_value_scan` | 否 | true | 是否开启通过 ES/Lucene 列式存储获取查询字段的值 |
`keyword_sniff` | 否 | true | 是否对 ES 中字符串分词类型 text.fields 进行探测,通过 keyword 进行查询。设置为 false 会按照分词后的内容匹配 |
`nodes_discovery` | 否 | true | 是否开启 ES 节点发现,默认为 true,在网络隔离环境下设置为 false,只连接指定节点 |
`ssl` | 否 | false | ES 是否开启 https 访问模式,目前在 fe/be 实现方式为信任所有 |
`mapping_es_id` | 否 | false | 是否映射 ES 索引中的 `_id` 字段 |
> 1. 认证方式目前仅支持 Http Basic 认证,并且需要确保该用户有访问 `/_cluster/state/`、`_nodes/http` 等路径以及 index 的读权限;若集群未开启安全认证,则无需设置用户名和密码。
>
> 2. 5.x 和 6.x 中一个 index 中的多个 type 默认取第一个。
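下面是一个带认证信息并显式指定可选参数的创建示意(用户名、密码等请替换为实际值):
```sql
CREATE CATALOG es PROPERTIES (
    "type"="es",
    "hosts"="http://127.0.0.1:9200",
    "username"="es_user",
    "password"="es_password",
    "doc_value_scan"="true",
    "keyword_sniff"="true",
    "nodes_discovery"="false",
    "ssl"="false"
);
```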
## 列类型映射
| ES Type | Doris Type | Comment |
|---|---|---|
|null| null||
| boolean | boolean | |
| byte| tinyint| |
| short| smallint| |
| integer| int| |
| long| bigint| |
| unsigned_long| largeint | |
| float| float| |
| half_float| float| |
| double | double | |
| scaled_float| double | |
| date | date | |
| keyword | string | |
| text |string | |
| ip |string | |
| nested |string | |
| object |string | |
|other| unsupported ||
## 最佳实践
### 过滤条件下推
ES Catalog 支持过滤条件的下推:过滤条件下推给 ES,这样只有真正满足条件的数据才会被返回,能够显著提高查询性能,并降低 Doris 和 Elasticsearch 的 CPU、内存、IO 使用量。
下面的操作符(Operators)会被优化成如下ES Query:
| SQL syntax | ES 5.x+ syntax |
|-------|:---:|
| = | term query|
| in | terms query |
| > , < , >= , <= | range query |
| and | bool.filter |
| or | bool.should |
| not | bool.must_not |
| not in | bool.must_not + terms query |
| is\_not\_null | exists query |
| is\_null | bool.must_not + exists query |
| esquery | ES原生json形式的QueryDSL |
### 启用列式扫描优化查询速度(enable\_docvalue\_scan=true)
设置 `"enable_docvalue_scan" = "true"`
开启后Doris从ES中获取数据会遵循以下两个原则:
* **尽力而为**: 自动探测要读取的字段是否开启列式存储(doc_value: true),如果获取的字段全部有列存,Doris会从列式存储中获取所有字段的值
* **自动降级**: 如果要获取的字段只要有一个字段没有列存,所有字段的值都会从行存`_source`中解析获取
**优势**
默认情况下,Doris On ES会从行存也就是`_source`中获取所需的所有列,`_source`的存储采用的行式+json的形式存储,在批量读取性能上要劣于列式存储,尤其在只需要少数列的情况下尤为明显,只获取少数列的情况下,docvalue的性能大约是_source性能的十几倍
**注意**
1. `text`类型的字段在ES中是没有列式存储,因此如果要获取的字段值有`text`类型字段会自动降级为从`_source`中获取
2. 在获取的字段数量过多的情况下(`>= 25`),从`docvalue`中获取字段值的性能会和从`_source`中获取字段值基本一样
### 探测keyword类型字段
设置 `"enable_keyword_sniff" = "true"`
在ES中可以不建立index直接进行数据导入,这时候ES会自动创建一个新的索引,针对字符串类型的字段ES会创建一个既有`text`类型的字段又有`keyword`类型的字段,这就是ES的multi fields特性,mapping如下:
```
"k4": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
```
对k4进行条件过滤时比如=,Doris On ES会将查询转换为ES的TermQuery
SQL过滤条件:
```
k4 = "Doris On ES"
```
转换成ES的query DSL为:
```
"term" : {
"k4": "Doris On ES"
}
```
因为 k4 字段本身的类型为`text`,在数据导入的时候就会根据 k4 设置的分词器(如果没有设置,就是 standard 分词器)进行分词处理,得到 doris、on、es 三个 Term,如下 ES analyze API 分析:
```
POST /_analyze
{
"analyzer": "standard",
"text": "Doris On ES"
}
```
分词的结果是:
```
{
"tokens": [
{
"token": "doris",
"start_offset": 0,
"end_offset": 5,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "on",
"start_offset": 6,
"end_offset": 8,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "es",
"start_offset": 9,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 2
}
]
}
```
查询时使用的是:
```
"term" : {
"k4": "Doris On ES"
}
```
`Doris On ES`这个term匹配不到词典中的任何term,不会返回任何结果,而启用`enable_keyword_sniff: true`会自动将`k4 = "Doris On ES"`转换成`k4.keyword = "Doris On ES"`来完全匹配SQL语义,转换后的ES query DSL为:
```
"term" : {
"k4.keyword": "Doris On ES"
}
```
`k4.keyword` 的类型是`keyword`,数据写入ES中是一个完整的term,所以可以匹配
### 开启节点自动发现, 默认为true(nodes\_discovery=true)
设置 `"nodes_discovery" = "true"`
当配置为 true 时,Doris 将从 ES 找到所有可用的相关数据节点(在上面分配的分片)。如果 ES 数据节点的地址无法被 Doris BE 访问(例如 ES 集群部署在与公共 Internet 隔离的内网,用户通过代理访问),则设置为 false,只连接指定的节点。
### ES集群是否开启https访问模式
设置 `"ssl" = "true"`
目前 fe/be 的实现方式为信任所有,这是临时解决方案,后续会使用真实的用户配置证书。
### 查询用法
在 Doris 中完成 ES 外表的建立后,除了无法使用 Doris 中的数据模型(rollup、预聚合、物化视图等)外,与使用普通 Doris 表并无区别。
#### 基本查询
```
select * from es_table where k1 > 1000 and k3 ='term' or k4 like 'fu*z_'
```
#### 扩展的 esquery(field, QueryDSL)
通过`esquery(field, QueryDSL)`函数将一些无法用 SQL 表述的 query,如 match_phrase、geo_shape 等,下推给 ES 进行过滤处理。`esquery`的第一个列名参数用于关联`index`,第二个参数是 ES 的基本`Query DSL`的 json 表述,使用花括号`{}`包含,json 的`root key`有且只能有一个,如 `match_phrase`、`geo_shape`、`bool`
`match_phrase` 查询:
```
select * from es_table where esquery(k4, '{
"match_phrase": {
"k4": "doris on es"
}
}');
```
`geo` 相关查询:
```
select * from es_table where esquery(k4, '{
"geo_shape": {
"location": {
"shape": {
"type": "envelope",
"coordinates": [
[
13,
53
],
[
14,
52
]
]
},
"relation": "within"
}
}
}');
```
`bool` 查询:
```
select * from es_table where esquery(k4, ' {
"bool": {
"must": [
{
"terms": {
"k1": [
11,
12
]
}
},
{
"terms": {
"k2": [
100
]
}
}
]
}
}');
```
### 时间类型字段使用建议
> 仅 ES 外表适用,ES Catalog 中自动映射日期类型为 Date 或 Datetime
在ES中,时间类型的字段使用十分灵活,但是在 ES 外表中如果对时间类型字段的类型设置不当,则会造成过滤条件无法下推
创建索引时对时间类型格式的设置做最大程度的格式兼容:
```
"dt": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}
```
在 Doris 中建立该字段时建议设置为`date`或`datetime`,也可以设置为`varchar`类型,使用如下 SQL 语句都可以直接将过滤条件下推至 ES:
```
select * from doe where k2 > '2020-06-21';
select * from doe where k2 < '2020-06-21 12:00:00';
select * from doe where k2 < 1593497011;
select * from doe where k2 < now();
select * from doe where k2 < date_format(now(), '%Y-%m-%d');
```
注意:
* 在ES中如果不对时间类型的字段设置`format`, 默认的时间类型字段格式为
```
strict_date_optional_time||epoch_millis
```
* 导入到ES的日期字段如果是时间戳需要转换成`ms`, ES内部处理时间戳都是按照`ms`进行处理的, 否则 ES 外表会出现显示错误
### 获取ES元数据字段 `_id`
在导入文档时如不指定 `_id`,ES 会给每个文档分配一个全局唯一的 `_id`,即主键。用户也可以在导入时为文档指定一个具有特殊业务意义的 `_id`;
如果需要在 ES 外表中获取该字段值,建表时可以增加类型为`varchar`的`_id`字段:
```
CREATE EXTERNAL TABLE `doe` (
`_id` varchar COMMENT "",
`city` varchar COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://127.0.0.1:8200",
"user" = "root",
"password" = "root",
"index" = "doe"
);
```
如果需要在 ES Catalog 中获取该字段值,请设置 `"mapping_es_id" = "true"`
注意:
1. `_id` 字段的过滤条件仅支持`=`和`in`两种
2. `_id` 字段必须为 `varchar` 类型
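在 ES Catalog 中使用 `_id` 字段的示意(假设已按上文开启 `mapping_es_id`,Catalog 名与索引名均为示例):
```sql
CREATE CATALOG es_with_id PROPERTIES (
    "type" = "es",
    "hosts" = "http://127.0.0.1:9200",
    "mapping_es_id" = "true"
);

SELECT _id, city FROM es_with_id.default_db.doe WHERE _id IN ('doc_1', 'doc_2');
```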
## 常见问题
1. 是否支持X-Pack认证的ES集群
支持所有使用HTTP Basic认证方式的ES集群
2. 一些查询比请求ES慢很多
是,比如_count相关的query等,ES内部会直接读取满足条件的文档个数相关的元数据,不需要对真实的数据进行过滤
3. 聚合操作是否可以下推
目前Doris On ES不支持聚合操作如sum, avg, min/max 等下推,计算方式是批量流式的从ES获取所有满足条件的文档,然后在Doris中进行计算
## 附录
### Doris 查询 ES 原理
```
+----------------------------------------------+
| |
| Doris +------------------+ |
| | FE +--------------+-------+
| | | Request Shard Location
| +--+-------------+-+ | |
| ^ ^ | |
| | | | |
| +-------------------+ +------------------+ | |
| | | | | | | | |
| | +----------+----+ | | +--+-----------+ | | |
| | | BE | | | | BE | | | |
| | +---------------+ | | +--------------+ | | |
+----------------------------------------------+ |
| | | | | | |
| | | | | | |
| HTTP SCROLL | | HTTP SCROLL | |
+-----------+---------------------+------------+ |
| | v | | v | | |
| | +------+--------+ | | +------+-------+ | | |
| | | | | | | | | | |
| | | DataNode | | | | DataNode +<-----------+
| | | | | | | | | | |
| | | +<--------------------------------+
| | +---------------+ | | |--------------| | | |
| +-------------------+ +------------------+ | |
| Same Physical Node | |
| | |
| +-----------------------+ | |
| | | | |
| | MasterNode +<-----------------+
| ES | | |
| +-----------------------+ |
+----------------------------------------------+
```
1. FE会请求建表指定的主机,获取所有节点的HTTP端口信息以及index的shard分布信息等,如果请求失败会顺序遍历host列表直至成功或完全失败
2. 查询时会根据FE得到的一些节点信息和index的元数据信息,生成查询计划并发给对应的BE节点
3. BE节点会根据`就近原则`即优先请求本地部署的ES节点,BE通过`HTTP Scroll`方式流式的从ES index的每个分片中并发的从`_source`或`docvalue`中获取数据
4. Doris计算完结果后,返回给用户

View File

@ -0,0 +1,48 @@
---
{
"title": "常见问题",
"language": "zh-CN"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# 常见问题
1. 通过 Hive Metastore 访问 Iceberg 表报错:`failed to get schema` 或 `Storage schema reading not supported`
在 Hive 的 lib/ 目录放上 `iceberg` 运行时有关的 jar 包。
`hive-site.xml` 配置:
```
metastore.storage.schema.reader.impl=org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader
```
配置完成后需要重启Hive Metastore。
2. 连接 Kerberos 认证的 Hive Metastore 报错:`GSS initiate failed`
1.2.1 之前的版本中,Doris 依赖的 libhdfs3 库没有开启 gsasl。请更新至 1.2.2 之后的版本。
3. 访问 HDFS 3.x 时报错:`java.lang.VerifyError: xxx`
1.2.1 之前的版本中,Doris 依赖的 Hadoop 版本为 2.8。需更新至 2.10.2。或更新 Doris 至 1.2.2 之后的版本。

View File

@ -0,0 +1,157 @@
---
{
"title": "Hive",
"language": "zh-CN"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Hive
通过连接 Hive Metastore,或者兼容 Hive Metastore 的元数据服务,Doris 可以自动获取 Hive 的库表信息,并进行数据查询。
除了 Hive 外,很多其他系统也会使用 Hive Metastore 存储元数据。所以通过 Hive Catalog,我们不仅能访问 Hive,也能访问使用 Hive Metastore 作为元数据存储的系统,如 Iceberg、Hudi 等。
## 使用限制
1. hive 支持 1/2/3 版本。
2. 支持 Managed Table 和 External Table。
3. 可以识别 Hive Metastore 中存储的 hive、iceberg、hudi 元数据。
## 创建 Catalog
```sql
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004'
);
```
除了 `type` 和 `hive.metastore.uris` 两个必须参数外,还可以通过更多参数来传递连接所需要的信息。
如提供 HDFS HA 信息,示例如下:
```sql
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hadoop.username' = 'hive',
'dfs.nameservices'='your-nameservice',
'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
```
同时提供 HDFS HA 信息和 Kerberos 认证信息,示例如下:
```sql
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hive.metastore.sasl.enabled' = 'true',
'dfs.nameservices'='your-nameservice',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',
'hadoop.security.authentication' = 'kerberos',
'hadoop.kerberos.keytab' = '/your-keytab-filepath/your.keytab',
'hadoop.kerberos.principal' = 'your-principal@YOUR.COM',
'yarn.resourcemanager.address' = 'your-rm-address:your-rm-port',
'yarn.resourcemanager.principal' = 'your-rm-principal/_HOST@YOUR.COM'
);
```
提供 Hadoop KMS 加密传输信息,示例如下:
```sql
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'dfs.encryption.key.provider.uri' = 'kms://http@kms_host:kms_port/kms'
);
```
在 1.2.1 版本之后,我们也可以将这些信息通过创建一个 Resource 统一存储,然后在创建 Catalog 时使用这个 Resource。示例如下:
```sql
# 1. 创建 Resource
CREATE RESOURCE hms_resource PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hadoop.username' = 'hive',
'dfs.nameservices'='your-nameservice',
'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
# 2. 创建 Catalog 并使用 Resource,这里的 Key Value 信息会覆盖 Resource 中的信息。
CREATE CATALOG hive WITH RESOURCE hms_resource PROPERTIES(
'key' = 'value'
);
```
我们也可以直接将 hive-site.xml 放到 FE 和 BE 的 conf 目录下,系统也会自动读取 hive-site.xml 中的信息。信息覆盖的规则如下:
* Resource 中的信息覆盖 hive-site.xml 中的信息。
* CREATE CATALOG PROPERTIES 中的信息覆盖 Resource 中的信息。
### Hive 版本
Doris 可以正确访问不同 Hive 版本中的 Hive Metastore。在默认情况下,Doris 会以 Hive 2.3 版本的兼容接口访问 Hive Metastore。你也可以在创建 Catalog 时指定 hive 的版本。如访问 Hive 1.1.0 版本:
```sql
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hive.version' = '1.1.0'
);
```
## 列类型映射
适用于 Hive/Iceberg/Hudi
| HMS Type | Doris Type | Comment |
|---|---|---|
| boolean| boolean | |
| tinyint|tinyint | |
| smallint| smallint| |
| int| int | |
| bigint| bigint | |
| date| date| |
| timestamp| datetime| |
| float| float| |
| double| double| |
| char| char | |
| varchar| varchar| |
| decimal| decimal | |
| `array<type>` | `array<type>`| 支持array嵌套,如 `array<array<int>>` |
| other | unsupported | |

View File

@ -0,0 +1,54 @@
---
{
"title": "Hudi",
"language": "zh-CN"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Hudi
## 使用限制
1. Hudi 目前仅支持 Copy On Write 表的 Snapshot Query,以及 Merge On Read 表的 Read Optimized Query。后续将支持 Incremental Query 和 Merge On Read 表的 Snapshot Query。
2. 目前仅支持 Hive Metastore 类型的 Catalog。所以使用方式和 Hive Catalog 基本一致。后续版本将支持其他类型的 Catalog。
## 创建 Catalog
和 Hive Catalog 基本一致,这里仅给出简单示例。其他示例可参阅 [Hive Catalog](./hive)。
```sql
CREATE CATALOG hudi PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hadoop.username' = 'hive',
'dfs.nameservices'='your-nameservice',
'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
```
## 列类型映射
和 Hive Catalog 一致,可参阅 [Hive Catalog](./hive) 中 **列类型映射** 一节。
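建好 Catalog 后,可以像查询 Hive 表一样查询 Hudi 表(以下库表名仅为示意):
```sql
SWITCH hudi;
USE hudi_db;
SELECT * FROM hudi_cow_tbl LIMIT 10;
```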

View File

@ -0,0 +1,75 @@
---
{
"title": "Iceberg",
"language": "zh-CN"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Iceberg
## 使用限制
1. 支持 Iceberg V1/V2 表格式。
2. V2 格式仅支持 Position Delete 方式,不支持 Equality Delete。
3. 目前仅支持 Hive Metastore 类型的 Catalog。所以使用方式和 Hive Catalog 基本一致。后续版本将支持其他类型的 Catalog。
## 创建 Catalog
和 Hive Catalog 基本一致,这里仅给出简单示例。其他示例可参阅 [Hive Catalog](./hive)。
```sql
CREATE CATALOG iceberg PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004',
'hadoop.username' = 'hive',
'dfs.nameservices'='your-nameservice',
'dfs.ha.namenodes.your-nameservice'='nn1,nn2',
'dfs.namenode.rpc-address.your-nameservice.nn1'='172.21.0.2:4007',
'dfs.namenode.rpc-address.your-nameservice.nn2'='172.21.0.3:4007',
'dfs.client.failover.proxy.provider.your-nameservice'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
```
## 列类型映射
和 Hive Catalog 一致,可参阅 [Hive Catalog](./hive) 中 **列类型映射** 一节。
## Time Travel
<version since="dev">
支持读取 Iceberg 表指定的 Snapshot。
</version>
每一次对iceberg表的写操作都会产生一个新的快照。
默认情况下,读取请求只会读取最新版本的快照。
可以使用 `FOR TIME AS OF` 和 `FOR VERSION AS OF` 语句,根据快照 ID 或者快照产生的时间读取历史版本的数据。示例如下:
`SELECT * FROM iceberg_tbl FOR TIME AS OF "2022-10-07 17:20:37";`
`SELECT * FROM iceberg_tbl FOR VERSION AS OF 868895038966572;`
另外,可以使用 [iceberg_meta](../../sql-manual/sql-functions/table-functions/iceberg_meta.md) 表函数查询指定表的 snapshot 信息。
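下面是一个查询 snapshot 信息的示意(具体参数名以 iceberg_meta 表函数文档为准):
```sql
SELECT * FROM iceberg_meta(
    "table" = "iceberg.db1.iceberg_tbl",
    "query_type" = "snapshots"
);
```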

View File

@ -0,0 +1,292 @@
---
{
"title": "JDBC",
"language": "zh-CN"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# JDBC
JDBC Catalog 通过标准 JDBC 协议,连接其他数据源。
## 使用限制
1. 支持 MySQL、PostgreSQL、Oracle、Clickhouse
## 创建 Catalog
1. MySQL
```sql
CREATE CATALOG jdbc_mysql PROPERTIES (
"type"="jdbc",
"user"="root",
"password"="123456",
"jdbc_url" = "jdbc:mysql://127.0.0.1:3306/demo",
"driver_url" = "mysql-connector-java-5.1.47.jar",
"driver_class" = "com.mysql.jdbc.Driver"
)
```
2. PostgreSQL
```sql
CREATE CATALOG jdbc_postgresql PROPERTIES (
"type"="jdbc",
"user"="root",
"password"="123456",
"jdbc_url" = "jdbc:postgresql://127.0.0.1:5449/demo",
"driver_url" = "postgresql-42.5.1.jar",
"driver_class" = "org.postgresql.Driver"
);
```
映射 PostgreSQL 时,Doris 的一个 Database 对应于 PostgreSQL 中指定 Catalog(如示例中 `jdbc_url` 参数中 "demo")下的一个 Schema。而 Doris 的 Database 下的 Table 则对应于 PostgreSQL 中,Schema 下的 Tables。即映射关系如下:
|Doris | PostgreSQL |
|---|---|
| Catalog | Database |
| Database | Schema |
| Table | Table |
3. Oracle
```sql
CREATE CATALOG jdbc_oracle PROPERTIES (
"type"="jdbc",
"user"="root",
"password"="123456",
"jdbc_url" = "jdbc:oracle:thin:@127.0.0.1:1521:helowin",
"driver_url" = "ojdbc6.jar",
"driver_class" = "oracle.jdbc.driver.OracleDriver"
);
```
映射 Oracle 时,Doris 的一个 Database 对应于 Oracle 中的一个 User(如示例中 `jdbc_url` 参数中 "helowin")。而 Doris 的 Database 下的 Table 则对应于 Oracle 中,该 User 下的有权限访问的 Table。即映射关系如下:
|Doris | Oracle |
|---|---|
| Catalog | Database |
| Database | User |
| Table | Table |
4. Clickhouse
```sql
CREATE CATALOG jdbc_clickhouse PROPERTIES (
"type"="jdbc",
"user"="root",
"password"="123456",
"jdbc_url" = "jdbc:clickhouse://127.0.0.1:8123/demo",
"driver_url" = "clickhouse-jdbc-0.3.2-patch11-all.jar",
"driver_class" = "com.clickhouse.jdbc.ClickHouseDriver"
);
```
### 参数说明
参数 | 是否必须 | 默认值 | 说明
--- | --- | --- | ---
`user` | 是 | | 对应数据库的用户名 |
`password` | 是 | | 对应数据库的密码 |
`jdbc_url` | 是 | | JDBC 连接串 |
`driver_url` | 是 | | JDBC Driver Jar 包名称* |
`driver_class` | 是 | | JDBC Driver Class 名称 |
> `driver_url` 可以通过以下三种方式指定:
>
> 1. 文件名。如 `mysql-connector-java-5.1.47.jar`。需将 Jar 包预先存放在 FE 和 BE 部署目录的 `jdbc_drivers/` 目录下。系统会自动在这个目录下寻找。该目录的位置,也可以由 fe.conf 和 be.conf 中的 `jdbc_drivers_dir` 配置修改。
>
> 2. 本地绝对路径。如 `file:///path/to/mysql-connector-java-5.1.47.jar`。需将 Jar 包预先存放在所有 FE/BE 节点指定的路径下。
>
> 3. Http 地址。如:`https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/jdbc_driver/mysql-connector-java-5.1.47.jar`。系统会从这个 http 地址下载 Driver 文件。仅支持无认证的 http 服务。
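建好 Catalog 后,可以按 `catalog.database.table` 的三层结构直接查询外部库表(以下库表名仅为示意):
```sql
SWITCH jdbc_mysql;
SHOW DATABASES;
SELECT * FROM jdbc_mysql.demo.tbl1 LIMIT 10;
```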
## 列类型映射
### MySQL
| MYSQL Type | Doris Type | Comment |
|---|---|---|
| BOOLEAN | BOOLEAN | |
| TINYINT | TINYINT | |
| SMALLINT | SMALLINT | |
| MEDIUMINT | INT | |
| INT | INT | |
| BIGINT | BIGINT | |
| UNSIGNED TINYINT | SMALLINT | Doris没有UNSIGNED数据类型,所以扩大一个数量级|
| UNSIGNED MEDIUMINT | INT | Doris没有UNSIGNED数据类型,所以扩大一个数量级|
| UNSIGNED INT | BIGINT |Doris没有UNSIGNED数据类型,所以扩大一个数量级 |
| UNSIGNED BIGINT | STRING | |
| FLOAT | FLOAT | |
| DOUBLE | DOUBLE | |
| DECIMAL | DECIMAL | |
| DATE | DATE | |
| TIMESTAMP | DATETIME | |
| DATETIME | DATETIME | |
| YEAR | SMALLINT | |
| TIME | STRING | |
| CHAR | CHAR | |
| VARCHAR | STRING | |
| TINYTEXT、TEXT、MEDIUMTEXT、LONGTEXT、TINYBLOB、BLOB、MEDIUMBLOB、LONGBLOB、TINYSTRING、STRING、MEDIUMSTRING、LONGSTRING、BINARY、VARBINARY、JSON、SET、BIT | STRING | |
|Other| UNSUPPORTED |
### PostgreSQL
POSTGRESQL Type | Doris Type | Comment |
|---|---|---|
| boolean | BOOLEAN | |
| smallint/int2 | SMALLINT | |
| integer/int4 | INT | |
| bigint/int8 | BIGINT | |
| decimal/numeric | DECIMAL | |
| real/float4 | FLOAT | |
| double precision | DOUBLE | |
| smallserial | SMALLINT | |
| serial | INT | |
| bigserial | BIGINT | |
| char | CHAR | |
| varchar/text | STRING | |
| timestamp | DATETIME | |
| date | DATE | |
| time | STRING | |
| interval | STRING | |
| point/line/lseg/box/path/polygon/circle | STRING | |
| cidr/inet/macaddr | STRING | |
| bit/bit(n)/bit varying(n) | STRING | `bit`类型映射为doris的`STRING`类型,读出的数据是`true/false`, 而不是`1/0` |
| uuid/jsonb | STRING | |
|Other| UNSUPPORTED |
### Oracle
ORACLE Type | Doris Type | Comment |
|---|---|---|
| number(p) / number(p,0) | | Doris会根据p的大小来选择对应的类型:`p < 3` -> `TINYINT`; `p < 5` -> `SMALLINT`; `p < 10` -> `INT`; `p < 19` -> `BIGINT`; `p > 19` -> `LARGEINT` |
| number(p,s) | DECIMAL | |
| decimal | DECIMAL | |
| float/real | DOUBLE | |
| DATE | DATETIME | |
| CHAR/NCHAR | STRING | |
| VARCHAR2/NVARCHAR2 | STRING | |
| LONG/ RAW/ LONG RAW/ INTERVAL | STRING | |
|Other| UNSUPPORTED |
### Clickhouse
| ClickHouse Type | Doris Type | Comment |
|------------------------|------------|-----------------------------------------------------|
| Bool | BOOLEAN | |
| String | STRING | |
| Date/Date32 | DATE | |
| DateTime/DateTime64 | DATETIME | 对于超过了Doris最大的DateTime精度的数据,将截断处理 |
| Float32 | FLOAT | |
| Float64 | DOUBLE | |
| Int8 | TINYINT | |
| Int16/UInt8 | SMALLINT | Doris没有UNSIGNED数据类型,所以扩大一个数量级 |
| Int32/UInt16 | INT | Doris没有UNSIGNED数据类型,所以扩大一个数量级 |
| Int64/Uint32 | BIGINT | Doris没有UNSIGNED数据类型,所以扩大一个数量级 |
| Int128/UInt64 | LARGEINT | Doris没有UNSIGNED数据类型,所以扩大一个数量级 |
| Int256/UInt128/UInt256 | STRING | Doris没有这个数量级的数据类型,采用STRING处理 |
| DECIMAL | DECIMAL | 对于超过了Doris最大的Decimal精度的数据,将映射为STRING |
| Enum/IPv4/IPv6/UUID | STRING | 在显示上IPv4,IPv6会额外在数据最前面显示一个`/`,需要自己用`split_part`函数处理 |
|Other| UNSUPPORTED |
## 常见问题
1. 除了 MySQL,Oracle,PostgreSQL,SQLServer,ClickHouse 是否能够支持更多的数据库
目前Doris只适配了 MySQL,Oracle,PostgreSQL,SQLServer,ClickHouse. 关于其他的数据库的适配工作正在规划之中,原则上来说任何支持JDBC访问的数据库都能通过JDBC外表来访问。如果您有访问其他外表的需求,欢迎修改代码并贡献给Doris。
2. 读写 MySQL外表的emoji表情出现乱码
Doris 进行 JDBC 外表连接时,由于 MySQL 中默认的 utf8 编码为 utf8mb3,无法表示需要 4 字节编码的 emoji 表情。需要在建立 MySQL 外表时将对应列的编码设置为 utf8mb4,将服务器编码设置为 utf8mb4,且 JDBC Url 中的 characterEncoding 不要配置(该属性不支持 utf8mb4,配置了非 utf8mb4 的值将导致无法写入表情,因此留空不配置即可)。
可全局修改配置项
```
修改mysql目录下的my.ini文件(linux系统为etc目录下的my.cnf文件)
[client]
default-character-set=utf8mb4
[mysql]
设置mysql默认字符集
default-character-set=utf8mb4
[mysqld]
设置mysql字符集服务器
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
init_connect='SET NAMES utf8mb4'
修改对应表与列的类型
ALTER TABLE table_name MODIFY column_name VARCHAR(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE table_name CHARSET=utf8mb4;
SET NAMES utf8mb4
```
3. 读 MySQL 外表时,DateTime="0000:00:00 00:00:00"异常报错: "CAUSED BY: DataReadException: Zero date value prohibited"
这是因为JDBC中对于该非法的DateTime默认处理为抛出异常,可以通过参数 `zeroDateTimeBehavior`控制该行为。
可选参数为: `EXCEPTION`,`CONVERT_TO_NULL`,`ROUND`, 分别为:异常报错,转为NULL值,转为 "0001-01-01 00:00:00";
可在url中添加: `"jdbc_url"="jdbc:mysql://IP:PORT/doris_test?zeroDateTimeBehavior=convertToNull"`
4. 读取 MySQL 外表或其他外表时,出现加载类失败
如以下异常:
```
failed to load driver class com.mysql.jdbc.driver in either of hikariconfig class loader
```
这是因为在创建 Catalog 或 Resource 时填写的 driver_class 不正确。如上方例子即为大小写问题,应正确填写为 `"driver_class" = "com.mysql.jdbc.Driver"`
5. 读取 MySQL 问题出现通信链路异常
如果出现如下报错:
```
ERROR 1105 (HY000): errCode = 2, detailMessage = PoolInitializationException: Failed to initialize pool: Communications link failure
The last packet successfully received from the server was 7 milliseconds ago. The last packet sent successfully to the server was 4 milliseconds ago.
CAUSED BY: CommunicationsException: Communications link failure
The last packet successfully received from the server was 7 milliseconds ago. The last packet sent successfully to the server was 4 milliseconds ago.
CAUSED BY: SSLHandshakeExcepti
```
可查看be的be.out日志
如果包含以下信息:
```
WARN: Establishing SSL connection without server's identity verification is not recommended.
According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set.
For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'.
You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
```
可在创建 Catalog 的 `jdbc_url` 把JDBC连接串最后增加 `?useSSL=false` ,如 `"jdbc_url" = "jdbc:mysql://127.0.0.1:3306/test?useSSL=false"`

View File

@ -0,0 +1,432 @@
---
{
"title": "多源数据目录",
"language": "zh-CN"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# 多源数据目录
<version since="1.2.0">
多源数据目录(Multi-Catalog)是 Doris 1.2.0 版本中推出的功能,旨在能够更方便对接外部数据目录,以增强Doris的数据湖分析和联邦数据查询能力。
在之前的 Doris 版本中,用户数据只有两个层级:Database 和 Table。当我们需要连接一个外部数据目录时,我们只能在Database 或 Table 层级进行对接。比如通过 `create external table` 的方式创建一个外部数据目录中的表的映射,或通过 `create external database` 的方式映射一个外部数据目录中的 Database。 如果外部数据目录中的 Database 或 Table 非常多,则需要用户手动进行一一映射,使用体验不佳。
而新的 Multi-Catalog 功能在原有的元数据层级上,新增一层Catalog,构成 Catalog -> Database -> Table 的三层元数据层级。其中,Catalog 可以直接对应到外部数据目录。目前支持的外部数据目录包括:
1. Hive
2. Iceberg
3. Hudi
4. Elasticsearch
5. JDBC: 对接数据库访问的标准接口(JDBC)来访问各式数据库的数据。
该功能将作为之前外表连接方式(External Table)的补充和增强,帮助用户进行快速的多数据目录联邦查询。
</version>
## 基础概念
1. Internal Catalog
Doris 原有的 Database 和 Table 都将归属于 Internal Catalog。Internal Catalog 是内置的默认 Catalog,用户不可修改或删除。
2. External Catalog
可以通过 [CREATE CATALOG](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-CATALOG.md) 命令创建一个 External Catalog。创建后,可以通过 [SHOW CATALOGS](../../sql-manual/sql-reference/Show-Statements/SHOW-CATALOGS.md) 命令查看已创建的 Catalog。
3. 切换 Catalog
用户登录 Doris 后,默认进入 Internal Catalog,因此默认的使用和之前版本并无差别,可以直接使用 `SHOW DATABASES`、`USE DB` 等命令查看和切换数据库。
用户可以通过 [SWITCH](../../sql-manual/sql-reference/Utility-Statements/SWITCH.md) 命令切换 Catalog。如:
```
SWITCH internal;
SWITCH hive_catalog;
```
切换后,可以直接通过 `SHOW DATABASES`、`USE DB` 等命令查看和切换对应 Catalog 中的 Database。Doris 会自动同步 Catalog 中的 Database 和 Table。用户可以像使用 Internal Catalog 一样,对 External Catalog 中的数据进行查看和访问。
当前,Doris 只支持对 External Catalog 中的数据进行只读访问。
4. 删除 Catalog
External Catalog 中的 Database 和 Table 都是只读的。但是可以删除 Catalog(Internal Catalog无法删除)。可以通过 [DROP CATALOG](../../sql-manual/sql-reference/Data-Definition-Statements/Drop/DROP-CATALOG) 命令删除一个 External Catalog。
该操作仅会删除 Doris 中该 Catalog 的映射信息,并不会修改或变更任何外部数据目录的内容。
5. Resource
Resource 是一组配置的集合。用户可以通过 [CREATE RESOURCE](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-RESOURCE) 命令创建一个 Resource。之后可以在创建 Catalog 时使用这个 Resource。
一个 Resource 可以被多个 Catalog 使用,以复用其中的配置。
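下面是一个复用 Resource 的示意(Resource 名、地址均为示例值):
```sql
-- 创建一个包含连接信息的 Resource
CREATE RESOURCE hms_res PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://172.21.0.1:7004'
);

-- 多个 Catalog 复用同一个 Resource 中的配置
CREATE CATALOG hive_c1 WITH RESOURCE hms_res;
CREATE CATALOG hive_c2 WITH RESOURCE hms_res;
```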
## 连接示例
### 连接 Hive
这里我们通过连接一个 Hive 集群说明如何使用 Catalog 功能。
更多关于 Hive 的说明,请参阅:[Hive Catalog](./hive)
1. 创建 Catalog
```sql
CREATE CATALOG hive PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://172.21.0.1:7004'
);
```
> [CREATE CATALOG 语法帮助](../../sql-manual/sql-reference/Data-Definition-Statements/Create/CREATE-CATALOG)
2. 查看 Catalog
创建后,可以通过 `SHOW CATALOGS` 命令查看 catalog:
```
mysql> SHOW CATALOGS;
+-----------+-------------+----------+
| CatalogId | CatalogName | Type |
+-----------+-------------+----------+
| 10024 | hive | hms |
| 0 | internal | internal |
+-----------+-------------+----------+
```
> [SHOW CATALOGS 语法帮助](../../sql-manual/sql-reference/Show-Statements/SHOW-CATALOGS)
> 可以通过 [SHOW CREATE CATALOG](../../sql-manual/sql-reference/Show-Statements/SHOW-CREATE-CATALOG) 查看创建 Catalog 的语句。
> 可以通过 [ALTER CATALOG](../../sql-manual/sql-reference/Data-Definition-Statements/Alter/ALTER-CATALOG) 修改 Catalog 的属性。
3. 切换 Catalog
通过 `SWITCH` 命令切换到 hive catalog,并查看其中的数据库:
```
mysql> SWITCH hive;
Query OK, 0 rows affected (0.00 sec)
mysql> SHOW DATABASES;
+-----------+
| Database |
+-----------+
| default |
| random |
| ssb100 |
| tpch1 |
| tpch100 |
| tpch1_orc |
+-----------+
```
> [SWITCH 语法帮助](../../sql-manual/sql-reference/Utility-Statements/SWITCH.md)
4. 使用 Catalog
切换到 Catalog 后,则可以正常使用内部数据源的功能。
如切换到 tpch100 数据库,并查看其中的表:
```
mysql> USE tpch100;
Database changed
mysql> SHOW TABLES;
+-------------------+
| Tables_in_tpch100 |
+-------------------+
| customer |
| lineitem |
| nation |
| orders |
| part |
| partsupp |
| region |
| supplier |
+-------------------+
```
查看 lineitem 表的schema:
```
mysql> DESC lineitem;
+-----------------+---------------+------+------+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+---------------+------+------+---------+-------+
| l_shipdate | DATE | Yes | true | NULL | |
| l_orderkey | BIGINT | Yes | true | NULL | |
| l_linenumber | INT | Yes | true | NULL | |
| l_partkey | INT | Yes | true | NULL | |
| l_suppkey | INT | Yes | true | NULL | |
| l_quantity | DECIMAL(15,2) | Yes | true | NULL | |
| l_extendedprice | DECIMAL(15,2) | Yes | true | NULL | |
| l_discount | DECIMAL(15,2) | Yes | true | NULL | |
| l_tax | DECIMAL(15,2) | Yes | true | NULL | |
| l_returnflag | TEXT | Yes | true | NULL | |
| l_linestatus | TEXT | Yes | true | NULL | |
| l_commitdate | DATE | Yes | true | NULL | |
| l_receiptdate | DATE | Yes | true | NULL | |
| l_shipinstruct | TEXT | Yes | true | NULL | |
| l_shipmode | TEXT | Yes | true | NULL | |
| l_comment | TEXT | Yes | true | NULL | |
+-----------------+---------------+------+------+---------+-------+
```
查询示例:
```
mysql> SELECT l_shipdate, l_orderkey, l_partkey FROM lineitem limit 10;
+------------+------------+-----------+
| l_shipdate | l_orderkey | l_partkey |
+------------+------------+-----------+
| 1998-01-21 | 66374304 | 270146 |
| 1997-11-17 | 66374304 | 340557 |
| 1997-06-17 | 66374400 | 6839498 |
| 1997-08-21 | 66374400 | 11436870 |
| 1997-08-07 | 66374400 | 19473325 |
| 1997-06-16 | 66374400 | 8157699 |
| 1998-09-21 | 66374496 | 19892278 |
| 1998-08-07 | 66374496 | 9509408 |
| 1998-10-27 | 66374496 | 4608731 |
| 1998-07-14 | 66374592 | 13555929 |
+------------+------------+-----------+
```
也可以和其他数据目录中的表进行关联查询:
```
mysql> SELECT l.l_shipdate FROM hive.tpch100.lineitem l WHERE l.l_partkey IN (SELECT p_partkey FROM internal.db1.part) LIMIT 10;
+------------+
| l_shipdate |
+------------+
| 1993-02-16 |
| 1995-06-26 |
| 1995-08-19 |
| 1992-07-23 |
| 1998-05-23 |
| 1997-07-12 |
| 1994-03-06 |
| 1996-02-07 |
| 1997-06-01 |
| 1996-08-23 |
+------------+
```
这里我们通过 `catalog.database.table` 这种全限定的方式标识一张表,如:`internal.db1.part`。
其中 `catalog` 和 `database` 可以省略,缺省使用当前 SWITCH 和 USE 后切换的 catalog 和 database。
可以通过 INSERT INTO 命令,将 hive catalog 中的表数据,插入到 internal catalog 中的内部表,从而达到**导入外部数据目录数据**的效果:
```
mysql> SWITCH internal;
Query OK, 0 rows affected (0.00 sec)
mysql> USE db1;
Database changed
mysql> INSERT INTO part SELECT * FROM hive.tpch100.part limit 1000;
Query OK, 1000 rows affected (0.28 sec)
{'label':'insert_212f67420c6444d5_9bfc184bf2e7edb8', 'status':'VISIBLE', 'txnId':'4'}
```
### 连接 Iceberg
详见 [Iceberg Catalog](./iceberg)
### 连接 Hudi
详见 [Hudi Catalog](./hudi)
### 连接 Elasticsearch
详见 [Elasticsearch Catalog](./elasticsearch)
### 连接 JDBC
详见 [JDBC Catalog](./jdbc)
### 连接阿里云 Data Lake Formation
> [什么是 Data Lake Formation](https://www.aliyun.com/product/bigdata/dlf)
1. 创建 hive-site.xml
创建 hive-site.xml 文件,并将其放置在 `fe/conf` 目录下。
```
<?xml version="1.0"?>
<configuration>
<!--Set to use dlf client-->
<property>
<name>hive.metastore.type</name>
<value>dlf</value>
</property>
<property>
<name>dlf.catalog.endpoint</name>
<value>dlf-vpc.cn-beijing.aliyuncs.com</value>
</property>
<property>
<name>dlf.catalog.region</name>
<value>cn-beijing</value>
</property>
<property>
<name>dlf.catalog.proxyMode</name>
<value>DLF_ONLY</value>
</property>
<property>
<name>dlf.catalog.uid</name>
<value>20000000000000000</value>
</property>
<property>
<name>dlf.catalog.accessKeyId</name>
<value>XXXXXXXXXXXXXXX</value>
</property>
<property>
<name>dlf.catalog.accessKeySecret</name>
<value>XXXXXXXXXXXXXXXXX</value>
</property>
</configuration>
```
* `dlf.catalog.endpoint`:DLF Endpoint,参阅:[DLF Region和Endpoint对照表](https://www.alibabacloud.com/help/zh/data-lake-formation/latest/regions-and-endpoints)
* `dlf.catalog.region`:DLF Region,参阅:[DLF Region和Endpoint对照表](https://www.alibabacloud.com/help/zh/data-lake-formation/latest/regions-and-endpoints)
* `dlf.catalog.uid`:阿里云账号。即阿里云控制台右上角个人信息的“云账号ID”。
* `dlf.catalog.accessKeyId`:AccessKey。可以在 [阿里云控制台](https://ram.console.aliyun.com/manage/ak) 中创建和管理。
* `dlf.catalog.accessKeySecret`:SecretKey。可以在 [阿里云控制台](https://ram.console.aliyun.com/manage/ak) 中创建和管理。
其他配置项为固定值,无需改动。
2. 重启 FE,并通过 `CREATE CATALOG` 语句创建 catalog。
HMS resource 会读取和解析 fe/conf/hive-site.xml
```sql
-- 1.2.1+ 版本
CREATE RESOURCE dlf_resource PROPERTIES (
"type"="hms",
"hive.metastore.uris" = "thrift://127.0.0.1:9083"
);
CREATE CATALOG dlf WITH RESOURCE dlf_resource;
-- 1.2.0 版本
CREATE CATALOG dlf PROPERTIES (
"type"="hms",
"hive.metastore.uris" = "thrift://127.0.0.1:9083"
)
```
其中 `type` 固定为 `hms`。 `hive.metastore.uris` 的值随意填写即可,实际不会使用。但需要按照标准 hive metastore thrift uri 格式填写。
之后,可以像正常的 Hive MetaStore 一样,访问 DLF 下的元数据。
## 列类型映射
用户创建 Catalog 后,Doris 会自动同步数据目录的数据库和表,针对不同的数据目录和数据表格式,Doris 会进行以下列映射关系。
<version since="dev">
对于当前无法映射到 Doris 列类型的外表类型,如 map,struct 等。Doris 会将列类型映射为 UNSUPPORTED 类型。对于 UNSUPPORTED 类型的查询,示例如下:
假设同步后的表 schema 为:
```
k1 INT,
k2 INT,
k3 UNSUPPORTED,
k4 INT
```
```
select * from table; // Error: Unsupported type 'UNSUPPORTED_TYPE' in '`k3`
select * except(k3) from table; // Query OK.
select k1, k3 from table; // Error: Unsupported type 'UNSUPPORTED_TYPE' in '`k3`
select k1, k4 from table; // Query OK.
```
</version>
不同的数据源的列映射规则,请参阅不同数据源的文档。
## 权限管理
使用 Doris 对 External Catalog 中库表进行访问,并不受外部数据目录自身的权限控制,而是依赖 Doris 自身的权限访问管理功能。
Doris 的权限管理功能提供了对 Catalog 层级的扩展,具体可参阅 [权限管理](../../admin-manual/privilege-ldap/user-privilege.md) 文档。
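例如,可以参照如下方式将某个 External Catalog 下的库表查询权限授予指定用户(仅为示意,具体语法以权限管理文档为准):
```sql
GRANT SELECT_PRIV ON hive.db1.* TO 'user1'@'%';
```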
## 元数据更新
### 手动刷新
默认情况下,外部数据源的元数据变动,如创建、删除表,加减列等操作,不会同步给 Doris。
用户需要通过 [REFRESH CATALOG](../../sql-manual/sql-reference/Utility-Statements/REFRESH.md) 命令手动刷新元数据。
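例如,可以参照如下方式手动刷新名为 hive 的 Catalog(仅为示意,具体语法以 REFRESH 文档为准):
```sql
REFRESH CATALOG hive;
```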
### 自动刷新
<version since="dev">
自动刷新目前仅支持 Hive Metastore 元数据服务。通过让 FE 节点定时读取 HMS 的 notification event 来感知 Hive 表元数据的变更情况,目前支持处理如下event:
</version>
1. CREATE DATABASE event:在对应数据目录下创建数据库。
2. DROP DATABASE event:在对应数据目录下删除数据库。
3. ALTER DATABASE event:此事件的影响主要有更改数据库的属性信息,注释及默认存储位置等,这些改变不影响doris对外部数据目录的查询操作,因此目前会忽略此event。
4. CREATE TABLE event:在对应数据库下创建表。
5. DROP TABLE event:在对应数据库下删除表,并失效表的缓存。
6. ALTER TABLE event:如果是重命名,先删除旧名字的表,再用新名字创建表,否则失效该表的缓存。
7. ADD PARTITION event:在对应表缓存的分区列表里添加分区。
8. DROP PARTITION event:在对应表缓存的分区列表里删除分区,并失效该分区的缓存。
9. ALTER PARTITION event:如果是重命名,先删除旧名字的分区,再用新名字创建分区,否则失效该分区的缓存。
10. 当导入数据导致文件变更时,分区表会走 ALTER PARTITION event 逻辑,非分区表会走 ALTER TABLE event 逻辑(注意:如果绕过 HMS 直接操作文件系统,HMS 不会生成对应事件,Doris 因此也无法感知)。
该特性被fe的如下参数控制:
1. `enable_hms_events_incremental_sync`: 是否开启元数据自动增量同步功能,默认关闭。
2. `hms_events_polling_interval_ms`: 读取 event 的间隔时间,默认值为 10000,单位:毫秒。
3. `hms_events_batch_size_per_rpc`: 每次读取 event 的最大数量,默认值为 500。
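如需开启该功能,可以参照如下方式在 fe.conf 中配置上述参数并重启 FE(取值仅为示意):
```
enable_hms_events_incremental_sync = true
hms_events_polling_interval_ms = 10000
hms_events_batch_size_per_rpc = 500
```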
如果想使用该特性,需要更改HMS的 hive-site.xml 并重启HMS
```
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
<property>
<name>hive.metastore.dml.events</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.transactional.event.listeners</name>
<value>org.apache.hive.hcatalog.listener.DbNotificationListener</value>
</property>
```
> 使用建议
>
> 无论是之前已经创建好的 Catalog 现在想改为自动刷新,还是新创建的 Catalog,都只需要把 `enable_hms_events_incremental_sync` 设置为 true 并重启 FE 节点,无需在重启前后再手动刷新元数据。