Files
doris/docs/zh-CN/sql-reference/sql-statements/Data Manipulation/OUTFILE.md

210 lines
8.4 KiB
Markdown

---
{
"title": "OUTFILE",
"language": "zh-CN"
}
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# OUTFILE
## description
该语句用于使用 `SELECT INTO OUTFILE` 命令将查询结果的导出为文件。目前支持通过 Broker 进程, 通过 S3 协议, 或直接通过 HDFS 协议,导出到远端存储,如 HDFS,S3,BOS,COS(腾讯云)上。
语法:
query_stmt
INTO OUTFILE "file_path"
[format_as]
[properties]
1. file_path
file_path 指向文件存储的路径以及文件前缀。如 `hdfs://path/to/my_file_`
最终的文件名将由 `my_file_`,文件序号以及文件格式后缀组成。其中文件序号由0开始,数量为文件被分割的数量。如:
my_file_abcdefg_0.csv
my_file_abcdefg_1.csv
my_file_abcdegf_2.csv
2. format_as
FORMAT AS CSV
指定导出格式。默认为 CSV。
3. properties
指定相关属性。目前支持通过 Broker 进程, 或通过 S3 协议进行导出。
语法:
[PROPERTIES ("key"="value", ...)]
支持如下属性:
column_separator: 列分隔符
line_delimiter: 行分隔符
max_file_size: 单个文件大小限制,如果结果超过这个值,将切割成多个文件。
Broker 相关属性需加前缀 `broker.`:
broker.name: broker名称
broker.hadoop.security.authentication: 指定认证方式为 kerberos
broker.kerberos_principal: 指定 kerberos 的 principal
broker.kerberos_keytab: 指定 kerberos 的 keytab 文件路径。该文件必须为 Broker 进程所在服务器上的文件的绝对路径。并且可以被 Broker 进程访问
HDFS 相关属性需加前缀 `hdfs.`:
hdfs.fs.defaultFS: namenode 地址和端口
hdfs.hdfs_user: hdfs 用户名
S3 协议则直接执行 S3 协议配置即可:
AWS_ENDPOINT
AWS_ACCESS_KEY
AWS_SECRET_KEY
AWS_REGION
## example
1. 使用 broker 方式导出,将简单查询结果导出到文件 `hdfs://path/to/result.txt`。指定导出格式为 CSV。使用 `my_broker` 并设置 kerberos 认证信息。指定列分隔符为 `,`,行分隔符为 `\n`
SELECT * FROM tbl
INTO OUTFILE "hdfs://path/to/result_"
FORMAT AS CSV
PROPERTIES
(
"broker.name" = "my_broker",
"broker.hadoop.security.authentication" = "kerberos",
"broker.kerberos_principal" = "doris@YOUR.COM",
"broker.kerberos_keytab" = "/home/doris/my.keytab",
"column_separator" = ",",
"line_delimiter" = "\n",
"max_file_size" = "100MB"
);
最终生成文件如如果不大于 100MB,则为:`result_0.csv`
如果大于 100MB,则可能为 `result_0.csv, result_1.csv, ...`
2. 将简单查询结果导出到文件 `hdfs://path/to/result.parquet`。指定导出格式为 PARQUET。使用 `my_broker` 并设置 kerberos 认证信息。
SELECT c1, c2, c3 FROM tbl
INTO OUTFILE "hdfs://path/to/result_"
FORMAT AS PARQUET
PROPERTIES
(
"broker.name" = "my_broker",
"broker.hadoop.security.authentication" = "kerberos",
"broker.kerberos_principal" = "doris@YOUR.COM",
"broker.kerberos_keytab" = "/home/doris/my.keytab",
"schema"="required,int32,c1;required,byte_array,c2;required,byte_array,c2"
);
查询结果导出到parquet文件需要明确指定`schema`
3. 将 CTE 语句的查询结果导出到文件 `hdfs://path/to/result.txt`。默认导出格式为 CSV。使用 `my_broker` 并设置 hdfs 高可用信息。使用默认的行列分隔符。
WITH
x1 AS
(SELECT k1, k2 FROM tbl1),
x2 AS
(SELECT k3 FROM tbl2)
SELEC k1 FROM x1 UNION SELECT k3 FROM x2
INTO OUTFILE "hdfs://path/to/result_"
PROPERTIES
(
"broker.name" = "my_broker",
"broker.username"="user",
"broker.password"="passwd",
"broker.dfs.nameservices" = "my_ha",
"broker.dfs.ha.namenodes.my_ha" = "my_namenode1, my_namenode2",
"broker.dfs.namenode.rpc-address.my_ha.my_namenode1" = "nn1_host:rpc_port",
"broker.dfs.namenode.rpc-address.my_ha.my_namenode2" = "nn2_host:rpc_port",
"broker.dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
);
最终生成文件如如果不大于 1GB,则为:`result_0.csv`
如果大于 1GB,则可能为 `result_0.csv, result_1.csv, ...`
4. 将 UNION 语句的查询结果导出到文件 `bos://bucket/result.txt`。指定导出格式为 PARQUET。使用 `my_broker` 并设置 hdfs 高可用信息。PARQUET 格式无需指定列分割符。
导出完成后,生成一个标识文件。
SELECT k1 FROM tbl1 UNION SELECT k2 FROM tbl1
INTO OUTFILE "bos://bucket/result_"
FORMAT AS PARQUET
PROPERTIES
(
"broker.name" = "my_broker",
"broker.bos_endpoint" = "http://bj.bcebos.com",
"broker.bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
"broker.bos_secret_accesskey" = "yyyyyyyyyyyyyyyyyyyyyyyyyy",
"schema"="required,int32,k1;required,byte_array,k2"
);
5. 将 select 语句的查询结果导出到文件 `cos://${bucket_name}/path/result.txt`。指定导出格式为 csv。
导出完成后,生成一个标识文件。
select k1,k2,v1 from tbl1 limit 100000
into outfile "s3a://my_bucket/export/my_file_"
FORMAT AS CSV
PROPERTIES
(
"broker.name" = "hdfs_broker",
"broker.fs.s3a.access.key" = "xxx",
"broker.fs.s3a.secret.key" = "xxxx",
"broker.fs.s3a.endpoint" = "https://cos.xxxxxx.myqcloud.com/",
"column_separator" = ",",
"line_delimiter" = "\n",
"max_file_size" = "1024MB",
"success_file_name" = "SUCCESS"
)
最终生成文件如如果不大于 1GB,则为:`my_file_0.csv`
如果大于 1GB,则可能为 `my_file_0.csv, result_1.csv, ...`
在cos上验证
1. 不存在的path会自动创建
2. access.key/secret.key/endpoint需要和cos的同学确认。尤其是endpoint的值,不需要填写bucket_name。
6. 使用 s3 协议导出到 bos,并且并发导出开启。
set enable_parallel_outfile = true;
select k1 from tb1 limit 1000
into outfile "s3://my_bucket/export/my_file_"
format as csv
properties
(
"AWS_ENDPOINT" = "http://s3.bd.bcebos.com",
"AWS_ACCESS_KEY" = "xxxx",
"AWS_SECRET_KEY" = "xxx",
"AWS_REGION" = "bd"
)
最终生成的文件前缀为 `my_file_{fragment_instance_id}_`
7. 使用 s3 协议导出到 bos,并且并发导出 session 变量开启。
注意:但由于查询语句带了一个顶层的排序节点,所以这个查询即使开启并发导出的 session 变量,也是无法并发导出的。
set enable_parallel_outfile = true;
select k1 from tb1 order by k1 limit 1000
into outfile "s3://my_bucket/export/my_file_"
format as csv
properties
(
"AWS_ENDPOINT" = "http://s3.bd.bcebos.com",
"AWS_ACCESS_KEY" = "xxxx",
"AWS_SECRET_KEY" = "xxx",
"AWS_REGION" = "bd"
)
8. 使用 hdfs 方式导出,将简单查询结果导出到文件 `hdfs://path/to/result.txt`。指定导出格式为 CSV,用户名为work。指定列分隔符为 `,`,行分隔符为 `\n`
SELECT * FROM tbl
INTO OUTFILE "hdfs://path/to/result_"
FORMAT AS CSV
PROPERTIES
(
"hdfs.fs.defaultFS" = "hdfs://ip:port",
"hdfs.hdfs_user" = "work"
);
最终生成文件如如果不大于 100MB,则为:`result_0.csv`
如果大于 100MB,则可能为 `result_0.csv, result_1.csv, ...`
## keyword
OUTFILE