[fix](Docs) Modify documents about SELECT INTO OUTFILE and EXPORT (#24641)
@ -311,4 +311,4 @@
      "message": "版本发布",
      "description": "The label for category Release notes in sidebar docs"
    }
  }
}
@ -26,85 +26,29 @@ under the License.
|
||||
|
||||
# Export Overview
|
||||
|
||||
Export is a function provided by Doris to export data. This function can export user-specified table or partition data in text format to remote storage through Broker process, such as HDFS / Object storage (supports S3 protocol) etc.
|
||||
`Export` is a feature provided by Doris for asynchronously exporting data. It allows the user to export the data of specified tables or partitions, in a specified file format, through the Broker process, the S3 protocol, or the HDFS protocol, to remote storage such as object storage or HDFS.
|
||||
|
||||
Currently, `EXPORT` supports exporting Doris local tables / views / external tables and supports exporting to file formats including parquet, orc, csv, csv_with_names, and csv_with_names_and_types.
|
||||
|
||||
This document mainly introduces the basic principles, usage, best practices and precautions of Export.
|
||||
|
||||
## Glossary
|
||||
|
||||
* FE: Frontend, the front-end node of Doris. Responsible for metadata management and request access.
|
||||
* BE: Backend, Doris's back-end node. Responsible for query execution and data storage.
|
||||
* Broker: Doris accesses files on remote storage through the Broker process.
* Tablet: A data shard. A table is divided into multiple tablets.
|
||||
|
||||
## Principles
|
||||
|
||||
After the user submits an Export job. Doris counts all Tablets involved in this job. These tablets are then grouped to generate a special query plan for each group. The query plan reads the data on the included tablet and then writes the data to the specified path of the remote storage through Broker. It can also be directly exported to the remote storage that supports S3 protocol through S3 protocol.
|
||||
After a user submits an `Export Job`, Doris calculates all the tablets involved in the job. Then, based on the `parallelism` parameter specified by the user, these tablets are grouped. Each thread is responsible for one group of tablets and generates multiple `SELECT INTO OUTFILE` query plans. Each query plan reads the data from its tablets and writes it to the specified path in remote storage through the S3 protocol, the HDFS protocol, or the Broker.
|
||||
|
||||
The overall mode of dispatch is as follows:
|
||||
|
||||
```
|
||||
+--------+
|
||||
| Client |
|
||||
+---+----+
|
||||
| 1. Submit Job
|
||||
|
|
||||
+---v--------------------+
|
||||
| FE |
|
||||
| |
|
||||
| +-------------------+ |
|
||||
| | ExportPendingTask | |
|
||||
| +-------------------+ |
|
||||
| | 2. Generate Tasks
|
||||
| +--------------------+ |
|
||||
| | ExportExportingTask | |
|
||||
| +--------------------+ |
|
||||
| |
|
||||
| +-----------+ | +----+ +------+ +---------+
|
||||
| | QueryPlan +----------------> BE +--->Broker+---> |
|
||||
| +-----------+ | +----+ +------+ | Remote |
|
||||
| +-----------+ | +----+ +------+ | Storage |
|
||||
| | QueryPlan +----------------> BE +--->Broker+---> |
|
||||
| +-----------+ | +----+ +------+ +---------+
|
||||
+------------------------+ 3. Execute Tasks
|
||||
|
||||
```
|
||||
The overall execution process is as follows:
|
||||
|
||||
1. The user submits an Export job to FE.
|
||||
2. FE's Export scheduler performs an Export job in two stages:
|
||||
1. PENDING: FE generates Export Pending Task, sends snapshot command to BE, and takes a snapshot of all Tablets involved. And generate multiple query plans.
|
||||
2. EXPORTING: FE generates Export ExportingTask and starts executing the query plan.
|
||||
|
||||
### Query Plan Splitting
|
||||
|
||||
The Export job generates multiple query plans, each of which scans a portion of the Tablet. The number of Tablets scanned by each query plan is specified by the FE configuration parameter `export_tablet_num_per_task`, which defaults to 5. That is, assuming a total of 100 Tablets, 20 query plans will be generated. Users can also specify this number by the job attribute `tablet_num_per_task`, when submitting a job.
|
||||
|
||||
Multiple query plans for a job are executed sequentially.
|
||||
|
||||
### Query Plan Execution
|
||||
|
||||
A query plan scans multiple tablets, organizes the read data in rows, batches every 1024 rows, and writes them to remote storage through the Broker.
|
||||
|
||||
The query plan will automatically retry three times if it encounters errors. If a query plan fails three retries, the entire job fails.
|
||||
|
||||
Doris will first create a temporary directory named `doris_export_tmp_12345` (where `12345` is the job id) in the specified remote storage path. The exported data is first written to this temporary directory. Each query plan generates a file with an example file name:
|
||||
|
||||
`export-data-c69fcf2b6db5420f-a96b94c1ff8bccef-1561453713822`
|
||||
|
||||
Among them, `c69fcf2b6db5420f-a96b94c1ff8bccef` is the query ID of the query plan, and `1561453713822` is the timestamp when the file was generated.
|
||||
|
||||
When all data is exported, Doris will rename these files to the user-specified path.
|
||||
|
||||
### Broker Parameter
|
||||
|
||||
Export needs to use the Broker process to access remote storage. Different brokers need to provide different parameters. For details, please refer to [Broker documentation](../../advanced/broker.md)
|
||||
2. FE calculates all the tablets to be exported and groups them based on the `parallelism` parameter. Each group generates multiple `SELECT INTO OUTFILE` query plans based on the `maximum_number_of_export_partitions` parameter.
|
||||
|
||||
3. Based on the `parallelism` parameter, an equal number of `ExportTaskExecutor`s are generated. Each `ExportTaskExecutor` is handled by one thread, and the threads are scheduled and executed by FE's `Job scheduler` framework.
|
||||
4. FE's `Job scheduler` schedules and executes the `ExportTaskExecutor`, and each `ExportTaskExecutor` serially executes the multiple `SELECT INTO OUTFILE` query plans it is responsible for.
|
||||
|
||||
## Start Export
|
||||
|
||||
For detailed usage of Export, please refer to [SHOW EXPORT](../../sql-manual/sql-reference/Show-Statements/SHOW-EXPORT.md).
|
||||
For detailed usage of Export, please refer to [EXPORT](../../sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/EXPORT.md).
|
||||
|
||||
Export's detailed commands can be passed through `HELP EXPORT;` Examples are as follows:
|
||||
The detailed EXPORT command can be viewed via `HELP EXPORT;` in the MySQL client. Examples are as follows:
|
||||
|
||||
### Export to HDFS
|
||||
|
||||
@ -118,8 +62,7 @@ PROPERTIES
|
||||
"label" = "mylabel",
|
||||
"column_separator"=",",
|
||||
"columns" = "col1,col2",
|
||||
"exec_mem_limit"="2147483648",
    "timeout" = "3600",
    "parallelism" = "3"
|
||||
)
|
||||
WITH BROKER "hdfs"
|
||||
(
|
||||
@ -132,24 +75,23 @@ WITH BROKER "hdfs"
|
||||
* `column_separator`: Column separator. The default is `\t`. Supports invisible characters, such as '\x07'.
* `columns`: Columns to be exported, separated by commas. If this parameter is not specified, all columns of the table are exported by default.
* `line_delimiter`: Line separator. The default is `\n`. Supports invisible characters, such as '\x07'.
* `exec_mem_limit`: The memory usage limit of a single query plan on a single BE in the Export job. Default 2GB. Unit: bytes.
* `timeout`: Job timeout. Default 2 hours. Unit: seconds.
* `tablet_num_per_task`: The maximum number of tablets allocated per query plan. The default is 5.
* `parallelism`: The number of threads used to export concurrently (3 in this example).
|
||||
|
||||
### Export to Object Storage (Supports S3 Protocol)
|
||||
|
||||
```sql
|
||||
EXPORT TABLE test TO "s3://bucket/path/to/export/dir/" WITH S3 (
|
||||
"AWS_ENDPOINT" = "http://host",
|
||||
"AWS_ACCESS_KEY" = "AK",
|
||||
"AWS_SECRET_KEY"="SK",
|
||||
"AWS_REGION" = "region"
|
||||
);
|
||||
EXPORT TABLE test TO "s3://bucket/path/to/export/dir/"
|
||||
WITH S3 (
|
||||
"s3.endpoint" = "http://host",
|
||||
"s3.access_key" = "AK",
|
||||
"s3.secret_key"="SK",
|
||||
"s3.region" = "region"
|
||||
);
|
||||
```
|
||||
|
||||
- `AWS_ACCESS_KEY`/`AWS_SECRET_KEY`:Is your key to access the object storage API.
|
||||
- `AWS_ENDPOINT`:Endpoint indicates the access domain name of object storage external services.
|
||||
- `AWS_REGION`:Region indicates the region where the object storage data center is located.
|
||||
- `s3.access_key`/`s3.secret_key`: Your access key and secret key for the object storage API.
- `s3.endpoint`: The access domain name of the object storage service.
- `s3.region`: The region where the object storage data center is located.
|
||||
|
||||
### View Export Status
|
||||
|
||||
@ -161,13 +103,23 @@ mysql> show EXPORT\G;
|
||||
JobId: 14008
|
||||
State: FINISHED
|
||||
Progress: 100%
|
||||
TaskInfo: {"partitions":["*"],"exec mem limit":2147483648,"column separator":",","line delimiter":"\n","tablet num":1,"broker":"hdfs","coord num":1,"db":"default_cluster:db1","tbl":"tbl3"}
|
||||
TaskInfo: {"partitions":[],"max_file_size":"","delete_existing_files":"","columns":"","format":"csv","column_separator":"\t","line_delimiter":"\n","db":"default_cluster:demo","tbl":"student4","tablet_num":30}
|
||||
Path: hdfs://host/path/to/export/
|
||||
CreateTime: 2019-06-25 17:08:24
|
||||
StartTime: 2019-06-25 17:08:28
|
||||
FinishTime: 2019-06-25 17:08:34
|
||||
Timeout: 3600
|
||||
ErrorMsg: NULL
|
||||
OutfileInfo: [
|
||||
[
|
||||
{
|
||||
"fileNumber": "1",
|
||||
"totalRows": "4",
|
||||
"fileSize": "34bytes",
|
||||
"url": "file:///127.0.0.1/Users/fangtiewei/tmp_data/export/f1ab7dcc31744152-bbb4cda2f5c88eac_"
|
||||
}
|
||||
]
|
||||
]
|
||||
1 row in set (0.01 sec)
|
||||
```
|
||||
|
||||
@ -178,21 +130,24 @@ FinishTime: 2019-06-25 17:08:34
|
||||
* EXPORTING: Data Export
|
||||
* FINISHED: Operation Successful
|
||||
* CANCELLED: Job Failure
|
||||
* Progress: Work progress. The schedule is based on the query plan. Assuming a total of 10 query plans have been completed, the progress will be 30%.
|
||||
* Progress: Work progress. The schedule is based on the query plan. Assuming there are 10 threads in total and 3 have been completed, the progress will be 30%.
|
||||
* TaskInfo: Job information in Json format:
|
||||
* db: database name
|
||||
* tbl: Table name
|
||||
* partitions: Specify the exported partition. `*` Represents all partitions.
|
||||
* exec MEM limit: query plan memory usage limit. Unit bytes.
|
||||
* partitions: The exported partitions. An empty list represents all partitions.
|
||||
* column separator: The column separator for the exported file.
|
||||
* line delimiter: The line separator for the exported file.
|
||||
* tablet num: The total number of tablets involved.
|
||||
* Broker: The name of the broker used.
|
||||
* Coord num: Number of query plans.
|
||||
* max_file_size: The maximum size of an export file.
|
||||
* delete_existing_files: Whether to delete existing files and directories in the specified export directory.
|
||||
* columns: Specifies the column names to be exported. Empty values represent exporting all columns.
|
||||
* format: The file format for export.
|
||||
* Path: Export path on remote storage.
|
||||
* CreateTime/StartTime/FinishTime: Creation time, start scheduling time and end time of jobs.
|
||||
* Timeout: Job timeout. The unit is seconds. This time is calculated from CreateTime.
|
||||
* Error Msg: If there is an error in the job, the cause of the error is shown here.
|
||||
* OutfileInfo:If the export job is successful, specific `SELECT INTO OUTFILE` result information will be displayed here.
|
||||
|
||||
### Cancel Export Job
|
||||
|
||||
@ -208,36 +163,50 @@ WHERE LABEL like "%example%";
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Splitting Query Plans
|
||||
### Concurrent Export
|
||||
|
||||
How many query plans need to be executed for an Export job depends on the total number of Tablets and how many Tablets can be allocated for a query plan at most. Since multiple query plans are executed serially, the execution time of jobs can be reduced if more fragments are processed by one query plan. However, if the query plan fails (e.g., the RPC fails to call Broker, the remote storage jitters, etc.), too many tablets can lead to a higher retry cost of a query plan. Therefore, it is necessary to arrange the number of query plans and the number of fragments to be scanned for each query plan in order to balance the execution time and the success rate of execution. It is generally recommended that the amount of data scanned by a query plan be within 3-5 GB (the size and number of tables in a table can be viewed by `SHOW TABLETS FROM tbl_name;`statement.
|
||||
An Export job can be configured with the `parallelism` parameter to concurrently export data. The `parallelism` parameter specifies the number of threads to execute the `EXPORT Job`. Each thread is responsible for exporting a subset of the total tablets.
|
||||
|
||||
The underlying execution logic of an `Export Job` is actually the `SELECT INTO OUTFILE` statement. Each of the threads specified by the `parallelism` parameter executes an independent `SELECT INTO OUTFILE` statement.
|
||||
|
||||
The specific logic for splitting an `Export Job` into multiple `SELECT INTO OUTFILE` statements is to evenly distribute all the tablets of the table among the parallel threads. For example:
|
||||
|
||||
- If num(tablets) = 40 and parallelism = 3, then the three threads will be responsible for 14, 13, and 13 tablets, respectively.
|
||||
- If num(tablets) = 2 and parallelism = 3, then Doris automatically sets the parallelism to 2, and each thread is responsible for one tablet.
|
||||
|
||||
When the number of tablets a thread is responsible for exceeds the `maximum_tablets_of_outfile_in_export` value (default is 10, and it can be modified by adding the `maximum_tablets_of_outfile_in_export` parameter in fe.conf), the thread splits the tablets it is responsible for into multiple `SELECT INTO OUTFILE` statements. For example:
|
||||
|
||||
- If a thread is responsible for 14 tablets and `maximum_tablets_of_outfile_in_export = 10`, then the thread will be responsible for two `SELECT INTO OUTFILE` statements. The first `SELECT INTO OUTFILE` statement exports 10 tablets, and the second `SELECT INTO OUTFILE` statement exports 4 tablets. The two `SELECT INTO OUTFILE` statements are executed serially by this thread.
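As a minimal sketch (the database, table, path, and broker credentials below are placeholders), a concurrent export with three threads can be submitted like this:

```sql
EXPORT TABLE db1.tbl1
TO "hdfs://host/path/to/export/"
PROPERTIES
(
    "format" = "csv",
    "parallelism" = "3"
)
WITH BROKER "hdfs"
(
    "username" = "user",
    "password" = "passwd"
);
```

If the table had 40 tablets, each of the three threads would then run its own `SELECT INTO OUTFILE` statements over roughly a third of the tablets, as described above.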
### exec\_mem\_limit
|
||||
|
||||
Usually, a query plan for an Export job has only two parts `scan`- `export`, and does not involve computing logic that requires too much memory. So usually the default memory limit of 2GB can satisfy the requirement. But in some scenarios, such as a query plan, too many Tablets need to be scanned on the same BE, or too many data versions of Tablets, may lead to insufficient memory. At this point, larger memory needs to be set through this parameter, such as 4 GB, 8 GB, etc.
|
||||
The query plan for an `Export Job` typically involves only `scanning and exporting`, and does not involve compute logic that requires a lot of memory. Therefore, the default memory limit of 2GB is usually sufficient to meet the requirements.
|
||||
|
||||
However, in certain scenarios, such as a query plan that requires scanning too many tablets on the same BE, or when there are too many data versions of tablets, it may result in insufficient memory. In these cases, you can adjust the session variable `exec_mem_limit` to increase the memory usage limit.
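For example, assuming 8GB is an appropriate limit for the workload, the session variable can be raised before submitting the job:

```sql
-- raise the per-BE memory limit for this session to 8GB (value is in bytes), then submit the Export job
SET exec_mem_limit = 8589934592;
```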
## Notes
|
||||
|
||||
* It is not recommended to export large amounts of data at one time. The maximum amount of exported data recommended by an Export job is tens of GB. Excessive export results in more junk files and higher retry costs.
|
||||
* If the amount of table data is too large, it is recommended to export it by partition.
|
||||
* During the operation of the Export job, if FE restarts or cuts the master, the Export job will fail, requiring the user to resubmit.
|
||||
* If the Export job fails, the `__doris_export_tmp_xxx` temporary directory generated in the remote storage and the generated files will not be deleted, requiring the user to delete them manually.
|
||||
* If the Export job runs successfully, the `__doris_export_tmp_xxx` directory generated in the remote storage may be retained or cleared according to the file system semantics of the remote storage. For example, in object storage (supporting the S3 protocol), after removing the last file in a directory through rename operation, the directory will also be deleted. If the directory is not cleared, the user can clear it manually.
|
||||
* When the Export runs successfully or fails, the FE reboots or cuts, then some information of the jobs displayed by `SHOW EXPORT` will be lost and cannot be viewed.
|
||||
* Export jobs only export data from Base tables, not Rollup Index.
|
||||
* If the Export job fails, the temporary files and directory generated in the remote storage will not be deleted, requiring the user to delete them manually.
|
||||
* Export jobs scan data and occupy IO resources, which may affect the query latency of the system.
|
||||
* The Export job can export data from `Doris Base tables`, `View`, and `External tables`, but not from `Rollup Index`.
|
||||
* When using the EXPORT command, please ensure that the target path exists, otherwise the export may fail.
|
||||
* When concurrent export is enabled, please configure the thread count and parallelism appropriately to fully utilize system resources and avoid performance bottlenecks.
|
||||
* When exporting to a local file, pay attention to file permissions and the path, ensure that you have sufficient permissions to write, and follow the appropriate file system path.
|
||||
* It is possible to monitor progress and performance metrics in real-time during the export process to identify issues promptly and make optimal adjustments.
|
||||
* It is recommended to verify the integrity and accuracy of the exported data after the export operation is completed to ensure the quality and integrity of the data.
|
||||
|
||||
## Relevant configuration
|
||||
|
||||
### FE
|
||||
|
||||
* `export_checker_interval_second`: Scheduling interval of the Export job scheduler, default is 5 seconds. Setting this parameter requires restarting FE.
* `export_running_job_num_limit`: Limit on the number of running Export jobs. If exceeded, the job will wait in the PENDING state. The default is 5, and it can be adjusted at runtime.
* `export_task_default_timeout_second`: Default timeout of an Export job. The default is 2 hours. It can be adjusted at runtime.
|
||||
* `export_tablet_num_per_task`: The maximum number of fragments that a query plan is responsible for. The default is 5.
|
||||
* `label`: The label of this Export job. Doris will generate a label for an Export job if this param is not set.
|
||||
* `maximum_tablets_of_outfile_in_export`: The maximum number of tablets allowed for an OutFile statement in an ExportExecutorTask.
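For illustration, the FE-side settings above would be adjusted in `fe.conf` roughly as follows (the values shown simply restate the defaults mentioned above; `export_checker_interval_second` requires an FE restart to take effect):

```
export_checker_interval_second = 5
export_running_job_num_limit = 5
export_task_default_timeout_second = 7200
export_tablet_num_per_task = 5
maximum_tablets_of_outfile_in_export = 10
```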
## More Help
|
||||
|
||||
For more detailed syntax and best practices used by Export, please refer to the [Export](../../sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/EXPORT.md) command manual, you can also You can enter `HELP EXPORT` at the command line of the MySql client for more help.
|
||||
For more detailed syntax and best practices used by Export, please refer to the [Export](../../sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/EXPORT.md) command manual. You can also enter `HELP EXPORT` at the command line of the MySQL client for more help.
|
||||
|
||||
The underlying implementation of the `EXPORT` command is the `SELECT INTO OUTFILE` statement. For more information about SELECT INTO OUTFILE, please refer to [Export Query Result](./outfile.md) and [SELECT INTO OUTFILE](../..//sql-manual/sql-reference/Data-Manipulation-Statements/OUTFILE.md).
|
||||
|
||||
|
||||
@ -28,6 +28,8 @@ under the License.
|
||||
|
||||
This document describes how to use the [SELECT INTO OUTFILE](../../sql-manual/sql-reference/Data-Manipulation-Statements/OUTFILE.md) command to export query results.
|
||||
|
||||
`SELECT INTO OUTFILE` is a synchronous command, which means that the operation is completed when the command returns. It also returns a row of results to show the execution result of the export.
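As a minimal sketch of the command shape (the table, path, and broker name here are placeholders; concrete, runnable examples follow below):

```sql
SELECT * FROM tbl1
INTO OUTFILE "hdfs://path/to/result_"
FORMAT AS CSV
PROPERTIES
(
    "broker.name" = "my_broker",
    "column_separator" = ",",
    "line_delimiter" = "\n"
);
```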
## Examples
|
||||
|
||||
### Export to HDFS
|
||||
|
||||
@ -32,19 +32,21 @@ EXPORT
|
||||
|
||||
### Description
|
||||
|
||||
This statement is used to export the data of the specified table to the specified location.
|
||||
The `EXPORT` command is used to export the data of a specified table as files to a designated location. Currently, it supports exporting to remote storage such as HDFS, S3, BOS, and COS (Tencent Cloud) through the Broker process, the S3 protocol, or the HDFS protocol.
|
||||
|
||||
This is an asynchronous operation that returns if the task is submitted successfully. After execution, you can use the [SHOW EXPORT](../../Show-Statements/SHOW-EXPORT.md) command to view the progress.
|
||||
`EXPORT` is an asynchronous operation: the command submits an `EXPORT JOB` to Doris and returns immediately once the job has been submitted successfully. After execution, you can use [SHOW EXPORT](../../Show-Statements/SHOW-EXPORT.md) to view the progress.
|
||||
|
||||
```sql
|
||||
EXPORT TABLE table_name
|
||||
[PARTITION (p1[,p2])]
|
||||
[WHERE]
|
||||
TO export_path
|
||||
[opt_properties]
|
||||
WITH BROKER/S3/HDFS
|
||||
[broker_properties];
|
||||
```
|
||||
**grammar**
|
||||
|
||||
```sql
|
||||
EXPORT TABLE table_name
|
||||
[PARTITION (p1[,p2])]
|
||||
[WHERE]
|
||||
TO export_path
|
||||
[opt_properties]
|
||||
WITH BROKER/S3/HDFS
|
||||
[broker_properties];
|
||||
```
|
||||
|
||||
**principle**
|
||||
|
||||
@ -54,7 +56,7 @@ The bottom layer of the `Export` statement actually executes the `select...outfi
|
||||
|
||||
- `table_name`
|
||||
|
||||
The table name of the table currently being exported. Only the export of Doris local table data is supported.
|
||||
The table name of the table currently being exported. Exporting data from Doris local tables, views, and external tables is supported.
|
||||
|
||||
- `partition`
|
||||
|
||||
@ -75,13 +77,14 @@ The bottom layer of the `Export` statement actually executes the `select...outfi
|
||||
The following parameters can be specified:
|
||||
|
||||
- `label`: This parameter is optional, specifies the label of the export task. If this parameter is not specified, the system randomly assigns a label to the export task.
|
||||
- `column_separator`: Specifies the exported column separator, default is \t. Only single byte is supported.
|
||||
- `line_delimiter`: Specifies the line delimiter for export, the default is \n. Only single byte is supported.
|
||||
- `column_separator`: Specifies the column separator for the export. The default is `\t`. Multi-byte separators are supported. This parameter is only used for the CSV file format.
- `line_delimiter`: Specifies the line delimiter for the export. The default is `\n`. Multi-byte delimiters are supported. This parameter is only used for the CSV file format.
|
||||
- `timeout`: The timeout period of the export job, the default is 2 hours, the unit is seconds.
|
||||
- `columns`: Specifies the columns of the table to be exported.
- `format`: Specifies the file format. Supported: parquet, orc, csv, csv_with_names, csv_with_names_and_types. The default is the csv format.
- `parallelism`: The concurrency of the `export` job, the default is `1`. The export job is split into `parallelism` `select..outfile..` statements that execute concurrently. (If the value of `parallelism` is greater than the number of tablets in the table, the system automatically sets `parallelism` to the number of tablets, that is, each `select..outfile..` statement is responsible for one tablet.)
- `delete_existing_files`: Default `false`. If specified as true, all files and directories under the directory specified by file_path are deleted first, and then the data is exported to that directory. For example: "file_path" = "/user/tmp" deletes all files and directories under "/user/"; "file_path" = "/user/tmp/" deletes all files and directories under "/user/tmp/".
- `max_file_size`: The size limit of a single file in the export job. If the result file exceeds this value, it is split into multiple files. The valid range of `max_file_size` is [5MB, 2GB], with a default value of 1GB. (When exporting to the ORC file format, the actual size of the split files will be a multiple of 64MB; for example, if max_file_size is specified as 5MB, the actual split size will be 64MB; if max_file_size is specified as 65MB, the actual split size will be 128MB.)
|
||||
|
||||
> Note that to use the `delete_existing_files` parameter, you also need to add the configuration `enable_delete_existing_files = true` to the fe.conf file and restart the FE. Only then will the `delete_existing_files` parameter take effect. Setting `delete_existing_files = true` is a dangerous operation and it is recommended to only use it in a testing environment.
|
||||
|
||||
@ -143,7 +146,7 @@ The bottom layer of the `Export` statement actually executes the `select...outfi
|
||||
|
||||
#### export to local
|
||||
|
||||
Export data to the local file system needs to add `enable_outfile_to_local = true` to the fe.conf and restart the Fe.
|
||||
> Export data to the local file system needs to add `enable_outfile_to_local = true` to the fe.conf and restart the Fe.
|
||||
|
||||
1. You can export the `test` table to a local store. Export csv format file by default.
|
||||
|
||||
@ -248,78 +251,66 @@ Before exporting data, all files and directories in the `/home/user/` directory
|
||||
|
||||
#### export with S3
|
||||
|
||||
8. Export all data from the `testTbl` table to S3 using invisible character '\x07' as a delimiter for columns and rows.If you want to export data to minio, you also need to specify use_path_style=true.
|
||||
1. Export all data from the `testTbl` table to S3, using the invisible character '\x07' as the delimiter for columns and rows. If you want to export data to MinIO, you also need to specify use_path_style=true.
|
||||
|
||||
```sql
|
||||
EXPORT TABLE testTbl TO "s3://bucket/a/b/c"
|
||||
PROPERTIES (
|
||||
"column_separator"="\\x07",
|
||||
"line_delimiter" = "\\x07"
|
||||
) WITH s3 (
|
||||
"AWS_ENDPOINT" = "xxxxx",
|
||||
"AWS_ACCESS_KEY" = "xxxxx",
|
||||
"AWS_SECRET_KEY"="xxxx",
|
||||
"AWS_REGION" = "xxxxx"
|
||||
)
|
||||
```
|
||||
```sql
|
||||
EXPORT TABLE testTbl TO "s3://bucket/a/b/c"
|
||||
PROPERTIES (
|
||||
"column_separator"="\\x07",
|
||||
"line_delimiter" = "\\x07"
|
||||
) WITH s3 (
|
||||
"s3.endpoint" = "xxxxx",
|
||||
"s3.region" = "xxxxx",
|
||||
"s3.secret_key"="xxxx",
|
||||
"s3.access_key" = "xxxxx"
|
||||
)
|
||||
```
|
||||
|
||||
```sql
|
||||
EXPORT TABLE minio_test TO "s3://bucket/a/b/c"
|
||||
PROPERTIES (
|
||||
"column_separator"="\\x07",
|
||||
"line_delimiter" = "\\x07"
|
||||
) WITH s3 (
|
||||
"AWS_ENDPOINT" = "xxxxx",
|
||||
"AWS_ACCESS_KEY" = "xxxxx",
|
||||
"AWS_SECRET_KEY"="xxxx",
|
||||
"AWS_REGION" = "xxxxx",
|
||||
"use_path_style" = "true"
|
||||
)
|
||||
```
|
||||
|
||||
9. Export all data in the test table to HDFS in the format of parquet, limit the size of a single file to 1024MB, and reserve all files in the specified directory.
|
||||
2. Export all data in the test table to HDFS in the format of parquet, limit the size of a single file to 1024MB, and reserve all files in the specified directory.
|
||||
|
||||
#### export with HDFS
|
||||
1. Export all data from the `test` table to HDFS in `Parquet` format, with a limit of 512MB for the size of a single file in the export job, and retain all files under the specified directory.
|
||||
|
||||
```sql
|
||||
EXPORT TABLE test TO "hdfs://hdfs_host:port/a/b/c/"
|
||||
PROPERTIES(
|
||||
"format" = "parquet",
|
||||
"max_file_size" = "1024MB",
|
||||
"delete_existing_files" = "false"
|
||||
)
|
||||
with HDFS (
|
||||
"fs.defaultFS"="hdfs://hdfs_host:port",
|
||||
"hadoop.username" = "hadoop"
|
||||
);
|
||||
```
|
||||
```sql
|
||||
EXPORT TABLE test TO "hdfs://hdfs_host:port/a/b/c/"
|
||||
PROPERTIES(
|
||||
"format" = "parquet",
|
||||
"max_file_size" = "512MB",
|
||||
"delete_existing_files" = "false"
|
||||
)
|
||||
with HDFS (
|
||||
"fs.defaultFS"="hdfs://hdfs_host:port",
|
||||
"hadoop.username" = "hadoop"
|
||||
);
|
||||
```
|
||||
|
||||
#### export with Broker
|
||||
You need to first start the broker process and add it to the FE before proceeding.
|
||||
1. Export the `test` table to hdfs
|
||||
```sql
|
||||
EXPORT TABLE test TO "hdfs://hdfs_host:port/a/b/c"
|
||||
WITH BROKER "broker_name"
|
||||
(
|
||||
"username"="xxx",
|
||||
"password"="yyy"
|
||||
);
|
||||
```
|
||||
|
||||
```sql
|
||||
EXPORT TABLE test TO "hdfs://hdfs_host:port/a/b/c"
|
||||
WITH BROKER "broker_name"
|
||||
(
|
||||
"username"="xxx",
|
||||
"password"="yyy"
|
||||
);
|
||||
```
|
||||
|
||||
2. Export partitions 'p1' and 'p2' from the 'testTbl' table to HDFS using ',' as the column delimiter and specifying a label.
|
||||
|
||||
```sql
|
||||
EXPORT TABLE testTbl PARTITION (p1,p2) TO "hdfs://hdfs_host:port/a/b/c"
|
||||
PROPERTIES (
|
||||
"label" = "mylabel",
|
||||
"column_separator"=","
|
||||
)
|
||||
WITH BROKER "broker_name"
|
||||
(
|
||||
"username"="xxx",
|
||||
"password"="yyy"
|
||||
);
|
||||
```
|
||||
```sql
|
||||
EXPORT TABLE testTbl PARTITION (p1,p2) TO "hdfs://hdfs_host:port/a/b/c"
|
||||
PROPERTIES (
|
||||
"label" = "mylabel",
|
||||
"column_separator"=","
|
||||
)
|
||||
WITH BROKER "broker_name"
|
||||
(
|
||||
"username"="xxx",
|
||||
"password"="yyy"
|
||||
);
|
||||
```
|
||||
|
||||
3. Export all data from the 'testTbl' table to HDFS using the non-visible character '\x07' as the column and row delimiter.
|
||||
|
||||
@ -342,29 +333,39 @@ WITH BROKER "broker_name"
|
||||
|
||||
### Best Practice
|
||||
|
||||
- Splitting of subtasks
|
||||
#### Concurrent Export
|
||||
|
||||
An Export job will be split into multiple subtasks (execution plans) to execute. How many query plans need to be executed depends on how many tablets there are in total, and how many tablets can be allocated to a query plan.
|
||||
An Export job can be configured with the `parallelism` parameter to concurrently export data. The `parallelism` parameter specifies the number of threads to execute the `EXPORT Job`. Each thread is responsible for exporting a subset of the total tablets.
|
||||
|
||||
Because multiple query plans are executed serially, the execution time of the job can be reduced if one query plan handles more shards.
|
||||
The underlying execution logic of an `Export Job` is actually the `SELECT INTO OUTFILE` statement. Each of the threads specified by the `parallelism` parameter executes an independent `SELECT INTO OUTFILE` statement.
|
||||
|
||||
However, if there is an error in the query plan (such as the failure of the RPC calling the broker, the jitter in the remote storage, etc.), too many Tablets will lead to a higher retry cost of a query plan.
|
||||
The specific logic for splitting an `Export Job` into multiple `SELECT INTO OUTFILE` statements is to evenly distribute all the tablets of the table among the parallel threads. For example:
|
||||
|
||||
Therefore, it is necessary to reasonably arrange the number of query plans and the number of shards that each query plan needs to scan, so as to balance the execution time and the execution success rate.
|
||||
- If num(tablets) = 40 and parallelism = 3, then the three threads will be responsible for 14, 13, and 13 tablets, respectively.
|
||||
- If num(tablets) = 2 and parallelism = 3, then Doris automatically sets the parallelism to 2, and each thread is responsible for one tablet.
|
||||
|
||||
It is generally recommended that the amount of data scanned by a query plan is within 3-5 GB.
|
||||
When the number of tablets a thread is responsible for exceeds the `maximum_tablets_of_outfile_in_export` value (default is 10, and it can be modified by adding the `maximum_tablets_of_outfile_in_export` parameter in fe.conf), the thread splits the tablets it is responsible for into multiple `SELECT INTO OUTFILE` statements. For example:
|
||||
|
||||
- If a thread is responsible for 14 tablets and `maximum_tablets_of_outfile_in_export = 10`, then the thread will be responsible for two `SELECT INTO OUTFILE` statements. The first `SELECT INTO OUTFILE` statement exports 10 tablets, and the second `SELECT INTO OUTFILE` statement exports 4 tablets. The two `SELECT INTO OUTFILE` statements are executed serially by this thread.
|
||||
|
||||
#### memory limit
|
||||
|
||||
Usually, the query plan of an Export job has only two parts of `scan-export`, which does not involve calculation logic that requires too much memory. So usually the default memory limit of 2GB suffices.
|
||||
The query plan for an `Export Job` typically involves only `scanning and exporting`, and does not involve compute logic that requires a lot of memory. Therefore, the default memory limit of 2GB is usually sufficient to meet the requirements.
|
||||
|
||||
However, in some scenarios, such as a query plan, too many Tablets need to be scanned on the same BE, or too many Tablet data versions may cause insufficient memory. At this point, you need to set a larger memory, such as 4GB, 8GB, etc., through the `exec_mem_limit` parameter.
|
||||
However, in certain scenarios, such as a query plan that requires scanning too many tablets on the same BE, or when there are too many data versions of tablets, it may result in insufficient memory. In these cases, you can adjust the session variable `exec_mem_limit` to increase the memory usage limit.
|
||||
|
||||
#### Precautions
|
||||
|
||||
- Exporting a large amount of data at one time is not recommended. The maximum recommended export data volume for an Export job is several tens of GB. An overly large export results in more junk files and higher retry costs. If the amount of table data is too large, it is recommended to export by partition.
|
||||
|
||||
- If the Export job fails, the generated files will not be deleted, and the user needs to delete it manually.
|
||||
- The Export job only exports the data of the Base table, not the data of the materialized view.
|
||||
|
||||
- The Export job only exports the data of the Base table / View / External table, not the data of the materialized view.
|
||||
|
||||
- The export job scans data and occupies IO resources, which may affect the query latency of the system.
|
||||
|
||||
- Currently, the `Export Job` only performs a simple check on whether the tablet versions are consistent. It is recommended not to import data during the execution of the `Export Job`.
|
||||
|
||||
- The maximum number of partitions that an `Export job` allows is 2000. You can add a parameter to the fe.conf `maximum_number_of_export_partitions` and restart FE to modify the setting.
|
||||
|
||||
- The timeout time for the `EXPORT` command is the same as the timeout time for queries. It can be set using `SET query_timeout=xxx`, where xxx represents the desired timeout value.
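For instance, if the default timeout is too short for a large table, it can be raised for the current session before submitting the command (the value below is only an illustrative example):

```sql
-- allow up to 2 hours (in seconds) before the command times out
SET query_timeout = 7200;
```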
@ -31,18 +31,18 @@ OURFILE
|
||||
|
||||
### description
|
||||
|
||||
This statement is used to export query results to a file using the `SELECT INTO OUTFILE` command. Currently, it supports exporting to remote storage, such as HDFS, S3, BOS, COS (Tencent Cloud), through the Broker process, through the S3 protocol, or directly through the HDFS protocol.
|
||||
This statement is used to export query results to a file using the `SELECT INTO OUTFILE` command. Currently, it supports exporting to remote storage, such as HDFS, S3, BOS, COS (Tencent Cloud), through the Broker process, S3 protocol, or HDFS protocol.
|
||||
|
||||
grammar:
|
||||
**grammar:**
|
||||
|
||||
````
|
||||
```sql
|
||||
query_stmt
|
||||
INTO OUTFILE "file_path"
|
||||
[format_as]
|
||||
[properties]
|
||||
````
|
||||
```
|
||||
|
||||
illustrate:
|
||||
**illustrate:**
|
||||
|
||||
1. file_path
|
||||
|
||||
@ -68,7 +68,7 @@ illustrate:
|
||||
|
||||
3. properties
|
||||
|
||||
Specify related properties. Currently exporting via the Broker process, or via the S3 protocol is supported.
|
||||
Specify related properties. Currently exporting via the Broker process, S3 protocol, or HDFS protocol is supported.
|
||||
|
||||
```
|
||||
grammar:
|
||||
@ -80,7 +80,7 @@ illustrate:
|
||||
line_delimiter: the line delimiter, only used for the CSV format. Multi-byte delimiters are supported starting in version 1.2, such as: "\\x01", "abc".
max_file_size: the size limit of a single file. If the result exceeds this value, it is split into multiple files. The valid range of max_file_size is [5MB, 2GB] and the default is 1GB. (When the file format is ORC, the actual split size will be a multiple of 64MB; for example, if max_file_size = 5MB is specified, 64MB is actually used as the split size; if max_file_size = 65MB is specified, 128MB is actually used as the split size.)
delete_existing_files: default `false`. If specified as true, all files and directories under the directory specified by file_path are deleted first, and then the data is exported to that directory. For example: "file_path" = "/user/tmp" deletes all files and directories under "/user/"; "file_path" = "/user/tmp/" deletes all files and directories under "/user/tmp/".
|
||||
file_suffix: Specify the suffix of the export file
|
||||
file_suffix: Specify the suffix of the export file. If this parameter is not specified, the default suffix for the file format will be used.
|
||||
|
||||
Broker related properties need to be prefixed with `broker.`:
|
||||
broker.name: broker name
|
||||
@ -108,11 +108,27 @@ illustrate:
|
||||
s3.secret_key
|
||||
s3.region
|
||||
use_path_style: (optional) default false. The S3 SDK uses the virtual-hosted style by default. However, some object storage systems may not enable or support virtual-hosted style access. In that case, we can add the use_path_style parameter to force path style access.
|
||||
|
||||
```
|
||||
|
||||
> Note that to use the `delete_existing_files` parameter, you also need to add the configuration `enable_delete_existing_files = true` to the fe.conf file and restart the FE. Only then will the `delete_existing_files` parameter take effect. Setting `delete_existing_files = true` is a dangerous operation and it is recommended to only use it in a testing environment.
|
||||
|
||||
4. Data Types for Export
|
||||
|
||||
All file formats support the export of basic data types, while only csv/orc/csv_with_names/csv_with_names_and_types currently support the export of complex data types (ARRAY/MAP/STRUCT). Nested complex data types are not supported.
|
||||
|
||||
5. Concurrent Export
|
||||
|
||||
Setting the session variable `set enable_parallel_outfile = true;` enables concurrent export using outfile. For detailed usage, see [Export Query Result](../../../data-operate/export/outfile.md).
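A minimal sketch, assuming an S3 target with placeholder endpoint and credentials:

```sql
-- enable concurrent export for this session; the result set may be written out in parallel as multiple files
SET enable_parallel_outfile = true;
SELECT * FROM tbl1
INTO OUTFILE "s3://bucket/path/to/result_"
FORMAT AS CSV
PROPERTIES
(
    "s3.endpoint" = "http://host",
    "s3.access_key" = "AK",
    "s3.secret_key" = "SK",
    "s3.region" = "region"
);
```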
6. Export to Local
|
||||
|
||||
To export to a local file, you need to configure `enable_outfile_to_local=true` in fe.conf.
|
||||
|
||||
```sql
|
||||
select * from tbl1 limit 10
|
||||
INTO OUTFILE "file:///home/work/path/result_";
|
||||
```
|
||||
|
||||
### example
|
||||
|
||||
1. Use the broker method to export, and export the simple query results to the file `hdfs://path/to/result.txt`. Specifies that the export format is CSV. Use `my_broker` and set kerberos authentication information. Specify the column separator as `,` and the row separator as `\n`.
|
||||
@ -147,13 +163,10 @@ illustrate:
|
||||
"broker.name" = "my_broker",
|
||||
"broker.hadoop.security.authentication" = "kerberos",
|
||||
"broker.kerberos_principal" = "doris@YOUR.COM",
|
||||
"broker.kerberos_keytab" = "/home/doris/my.keytab",
|
||||
"schema"="required,int32,c1;required,byte_array,c2;required,byte_array,c2"
|
||||
"broker.kerberos_keytab" = "/home/doris/my.keytab"
|
||||
);
|
||||
````
|
||||
|
||||
Exporting query results to parquet files requires explicit `schema`.
|
||||
|
||||
3. Export the query result of the CTE statement to the file `hdfs://path/to/result.txt`. The default export format is CSV. Use `my_broker` and set hdfs high availability information. Use the default row and column separators.
|
||||
|
||||
```sql
|
||||
@ -192,8 +205,7 @@ illustrate:
|
||||
"broker.name" = "my_broker",
|
||||
"broker.bos_endpoint" = "http://bj.bcebos.com",
|
||||
"broker.bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxxx",
|
||||
"broker.bos_secret_accesskey" = "yyyyyyyyyyyyyyyyyyyyyyyyy",
|
||||
"schema"="required,int32,k1;required,byte_array,k2"
|
||||
"broker.bos_secret_accesskey" = "yyyyyyyyyyyyyyyyyyyyyyyyy"
|
||||
);
|
||||
````
|
||||
|
||||
@ -339,3 +351,7 @@ OUTFILE
|
||||
4. Results Integrity Guarantee
|
||||
|
||||
This command is a synchronous command, so the connection may be disconnected during execution, making it impossible to know whether the export finished normally or whether the exported data is complete. In this case, you can use the `success_file_name` parameter to request that a success-marker file be generated in the directory after the task succeeds. You can use this file to determine whether the export ended normally.
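A hedged sketch, assuming a broker export where a marker file named `SUCCESS` should appear in the target directory once the export completes (table, path, and broker name are placeholders):

```sql
SELECT * FROM tbl1
INTO OUTFILE "hdfs://path/to/result_"
FORMAT AS CSV
PROPERTIES
(
    "broker.name" = "my_broker",
    "success_file_name" = "SUCCESS"
);
```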
5. Other Points to Note
|
||||
|
||||
See [Export Query Result](../../../data-operate/export/outfile.md)
|
||||
@ -26,75 +26,27 @@ under the License.
|
||||
|
||||
# 数据导出
|
||||
|
||||
数据导出(Export)是 Doris 提供的一种将数据导出的功能。该功能可以将用户指定的表或分区的数据,以文本的格式,通过 Broker 进程导出到远端存储上,如 HDFS / 对象存储(支持S3协议) 等。
|
||||
异步导出(Export)是 Doris 提供的一种将数据异步导出的功能。该功能可以将用户指定的表或分区的数据,以指定的文件格式,通过 Broker 进程或 S3协议/HDFS协议 导出到远端存储上,如 对象存储 / HDFS 等。
|
||||
|
||||
当前,EXPORT支持导出 Doris本地表 / View视图 / 外表,支持导出到 parquet / orc / csv / csv_with_names / csv_with_names_and_types 文件格式。
|
||||
|
||||
本文档主要介绍 Export 的基本原理、使用方式、最佳实践以及注意事项。
|
||||
|
||||
## 原理
|
||||
|
||||
用户提交一个 Export 作业后。Doris 会统计这个作业涉及的所有 Tablet。然后对这些 Tablet 进行分组,每组生成一个特殊的查询计划。该查询计划会读取所包含的 Tablet 上的数据,然后通过 Broker 将数据写到远端存储指定的路径中,也可以通过S3协议直接导出到支持S3协议的远端存储上。
|
||||
用户提交一个 Export 作业后。Doris 会统计这个作业涉及的所有 Tablet。然后根据`parallelism`参数(由用户指定)对这些 Tablet 进行分组。每个线程负责一组tablets,生成若干个`SELECT INTO OUTFILE`查询计划。该查询计划会读取所包含的 Tablet 上的数据,然后通过 S3协议 / HDFS协议 / Broker 将数据写到远端存储指定的路径中。
|
||||
|
||||
总体的调度方式如下:
|
||||
|
||||
```
|
||||
+--------+
|
||||
| Client |
|
||||
+---+----+
|
||||
| 1. Submit Job
|
||||
|
|
||||
+---v--------------------+
|
||||
| FE |
|
||||
| |
|
||||
| +-------------------+ |
|
||||
| | ExportPendingTask | |
|
||||
| +-------------------+ |
|
||||
| | 2. Generate Tasks
|
||||
| +--------------------+ |
|
||||
| | ExportExportingTask | |
|
||||
| +--------------------+ |
|
||||
| |
|
||||
| +-----------+ | +----+ +------+ +---------+
|
||||
| | QueryPlan +----------------> BE +--->Broker+---> |
|
||||
| +-----------+ | +----+ +------+ | Remote |
|
||||
| +-----------+ | +----+ +------+ | Storage |
|
||||
| | QueryPlan +----------------> BE +--->Broker+---> |
|
||||
| +-----------+ | +----+ +------+ +---------+
|
||||
+------------------------+ 3. Execute Tasks
|
||||
|
||||
```
|
||||
总体的执行流程如下:
|
||||
|
||||
1. 用户提交一个 Export 作业到 FE。
|
||||
2. FE 的 Export 调度器会通过两阶段来执行一个 Export 作业:
|
||||
1. PENDING:FE 生成 ExportPendingTask,向 BE 发送 snapshot 命令,对所有涉及到的 Tablet 做一个快照。并生成多个查询计划。
|
||||
2. EXPORTING:FE 生成 ExportExportingTask,开始执行查询计划。
|
||||
|
||||
### 查询计划拆分
|
||||
|
||||
Export 作业会生成多个查询计划,每个查询计划负责扫描一部分 Tablet。每个查询计划扫描的 Tablet 个数由 FE 配置参数 `export_tablet_num_per_task` 指定,默认为 5。即假设一共 100 个 Tablet,则会生成 20 个查询计划。用户也可以在提交作业时,通过作业属性 `tablet_num_per_task` 指定这个数值。
|
||||
|
||||
一个作业的多个查询计划顺序执行。
|
||||
|
||||
### 查询计划执行
|
||||
|
||||
一个查询计划扫描多个分片,将读取的数据以行的形式组织,每 1024 行为一个 batch,调用 Broker 写入到远端存储上。
|
||||
|
||||
查询计划遇到错误会整体自动重试 3 次。如果一个查询计划重试 3 次依然失败,则整个作业失败。
|
||||
|
||||
Doris 会首先在指定的远端存储的路径中,建立一个名为 `__doris_export_tmp_12345` 的临时目录(其中 `12345` 为作业 id)。导出的数据首先会写入这个临时目录。每个查询计划会生成一个文件,文件名示例:
|
||||
|
||||
`export-data-c69fcf2b6db5420f-a96b94c1ff8bccef-1561453713822`
|
||||
|
||||
其中 `c69fcf2b6db5420f-a96b94c1ff8bccef` 为查询计划的 query id。`1561453713822` 为文件生成的时间戳。
|
||||
|
||||
当所有数据都导出后,Doris 会将这些文件 rename 到用户指定的路径中。
|
||||
|
||||
### Broker 参数
|
||||
|
||||
Export 需要借助 Broker 进程访问远端存储,不同的 Broker 需要提供不同的参数,具体请参阅 [Broker文档](../../advanced/broker.md)
|
||||
2. FE会统计要导出的所有Tablets,然后根据`parallelism`参数将所有Tablets分组,每一组再根据`maximum_number_of_export_partitions`参数生成若干个`SELECT INTO OUTFILE`查询计划
|
||||
3. 根据`parallelism`参数,生成相同个数的`ExportTaskExecutor`,每一个`ExportTaskExecutor`由一个线程负责,线程由FE的Job 调度框架去调度执行。
|
||||
4. FE的Job调度器会去调度`ExportTaskExecutor`并执行,每一个`ExportTaskExecutor`会串行地去执行由它负责的若干个`SELECT INTO OUTFILE`查询计划。
|
||||
|
||||
## 开始导出
|
||||
|
||||
Export 的详细用法可参考 [SHOW EXPORT](../../sql-manual/sql-reference/Show-Statements/SHOW-EXPORT.md) 。
|
||||
Export 的详细用法可参考 [EXPORT](../../sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/EXPORT.md) 。
|
||||
|
||||
|
||||
### 导出到HDFS
|
||||
|
||||
@ -108,8 +60,7 @@ PROPERTIES
|
||||
"label" = "mylabel",
|
||||
"column_separator"=",",
|
||||
"columns" = "col1,col2",
|
||||
"exec_mem_limit"="2147483648",
    "timeout" = "3600",
    "parallelism" = "3"
|
||||
)
|
||||
WITH BROKER "hdfs"
|
||||
(
|
||||
@ -122,27 +73,25 @@ WITH BROKER "hdfs"
|
||||
* `column_separator`:列分隔符。默认为 `\t`。支持不可见字符,比如 '\x07'。
|
||||
* `columns`:要导出的列,使用英文状态逗号隔开,如果不填这个参数默认是导出表的所有列。
|
||||
* `line_delimiter`:行分隔符。默认为 `\n`。支持不可见字符,比如 '\x07'。
|
||||
* `exec_mem_limit`: 表示 Export 作业中,一个查询计划在单个 BE 上的内存使用限制。默认 2GB。单位字节。
|
||||
* `timeout`:作业超时时间。默认 2小时。单位秒。
|
||||
* `tablet_num_per_task`:每个查询计划分配的最大分片数。默认为 5。
|
||||
* `parallelism`:并发3个线程去导出。
|
||||
|
||||
### 导出到对象存储
|
||||
|
||||
通过s3 协议直接将数据导出到指定的存储.
|
||||
|
||||
```sql
|
||||
|
||||
EXPORT TABLE test TO "s3://bucket/path/to/export/dir/" WITH S3 (
|
||||
"AWS_ENDPOINT" = "http://host",
|
||||
"AWS_ACCESS_KEY" = "AK",
|
||||
"AWS_SECRET_KEY"="SK",
|
||||
"AWS_REGION" = "region"
|
||||
);
|
||||
EXPORT TABLE test TO "s3://bucket/path/to/export/dir/"
|
||||
WITH S3 (
|
||||
"s3.endpoint" = "http://host",
|
||||
"s3.access_key" = "AK",
|
||||
"s3.secret_key"="SK",
|
||||
"s3.region" = "region"
|
||||
);
|
||||
```
|
||||
|
||||
- `AWS_ACCESS_KEY`/`AWS_SECRET_KEY`:是您访问对象存储的ACCESS_KEY/SECRET_KEY
|
||||
- `AWS_ENDPOINT`:Endpoint表示对象存储对外服务的访问域名.
|
||||
- `AWS_REGION`:表示对象存储数据中心所在的地域.
|
||||
- `s3.access_key`/`s3.secret_key`:是您访问对象存储的ACCESS_KEY/SECRET_KEY
|
||||
- `s3.endpoint`:Endpoint表示对象存储对外服务的访问域名.
|
||||
- `s3.region`:表示对象存储数据中心所在的地域.
|
||||
|
||||
|
||||
### 查看导出状态
|
||||
@ -155,13 +104,23 @@ mysql> show EXPORT\G;
|
||||
JobId: 14008
|
||||
State: FINISHED
|
||||
Progress: 100%
|
||||
TaskInfo: {"partitions":["*"],"exec mem limit":2147483648,"column separator":",","line delimiter":"\n","tablet num":1,"broker":"hdfs","coord num":1,"db":"default_cluster:db1","tbl":"tbl3"}
|
||||
TaskInfo: {"partitions":[],"max_file_size":"","delete_existing_files":"","columns":"","format":"csv","column_separator":"\t","line_delimiter":"\n","db":"default_cluster:demo","tbl":"student4","tablet_num":30}
|
||||
Path: hdfs://host/path/to/export/
|
||||
CreateTime: 2019-06-25 17:08:24
|
||||
StartTime: 2019-06-25 17:08:28
|
||||
FinishTime: 2019-06-25 17:08:34
|
||||
Timeout: 3600
|
||||
ErrorMsg: NULL
|
||||
OutfileInfo: [
|
||||
[
|
||||
{
|
||||
"fileNumber": "1",
|
||||
"totalRows": "4",
|
||||
"fileSize": "34bytes",
|
||||
"url": "file:///127.0.0.1/Users/fangtiewei/tmp_data/export/f1ab7dcc31744152-bbb4cda2f5c88eac_"
|
||||
}
|
||||
]
|
||||
]
|
||||
1 row in set (0.01 sec)
|
||||
```
|
||||
|
||||
@ -171,21 +130,25 @@ FinishTime: 2019-06-25 17:08:34
|
||||
* EXPORTING:数据导出中
|
||||
* FINISHED:作业成功
|
||||
* CANCELLED:作业失败
|
||||
* Progress:作业进度。该进度以查询计划为单位。假设一共 10 个查询计划,当前已完成 3 个,则进度为 30%。
|
||||
* Progress:作业进度。该进度以查询计划为单位。假设一共 10 个线程,当前已完成 3 个,则进度为 30%。
|
||||
* TaskInfo:以 Json 格式展示的作业信息:
|
||||
* db:数据库名
|
||||
* tbl:表名
|
||||
* partitions:指定导出的分区。`*` 表示所有分区。
|
||||
* exec mem limit:查询计划内存使用限制。单位字节。
|
||||
* column separator:导出文件的列分隔符。
|
||||
* line delimiter:导出文件的行分隔符。
|
||||
* partitions:指定导出的分区。`空`列表 表示所有分区。
|
||||
* column_separator:导出文件的列分隔符。
|
||||
* line_delimiter:导出文件的行分隔符。
|
||||
* tablet num:涉及的总 Tablet 数量。
|
||||
* broker:使用的 broker 的名称。
|
||||
* coord num:查询计划的个数。
|
||||
* max_file_size:一个导出文件的最大大小。
|
||||
* delete_existing_files:是否删除导出目录下已存在的文件及目录。
|
||||
* columns:指定需要导出的列名,空值代表导出所有列。
|
||||
* format:导出的文件格式
|
||||
* Path:远端存储上的导出路径。
|
||||
* CreateTime/StartTime/FinishTime:作业的创建时间、开始调度时间和结束时间。
|
||||
* Timeout:作业超时时间。单位是秒。该时间从 CreateTime 开始计算。
|
||||
* ErrorMsg:如果作业出现错误,这里会显示错误原因。
|
||||
* OutfileInfo:如果作业导出成功,这里会显示具体的`SELECT INTO OUTFILE`结果信息。
|
||||
|
||||
### 取消导出任务
|
||||
|
||||
@ -201,35 +164,50 @@ WHERE LABEL like "%example%";
|
||||
|
||||
## 最佳实践
|
||||
|
||||
### 查询计划的拆分
|
||||
### 并发导出
|
||||
|
||||
一个 Export 作业有多少查询计划需要执行,取决于总共有多少 Tablet,以及一个查询计划最多可以分配多少个 Tablet。因为多个查询计划是串行执行的,所以如果让一个查询计划处理更多的分片,则可以减少作业的执行时间。但如果查询计划出错(比如调用 Broker 的 RPC 失败,远端存储出现抖动等),过多的 Tablet 会导致一个查询计划的重试成本变高。所以需要合理安排查询计划的个数以及每个查询计划所需要扫描的分片数,在执行时间和执行成功率之间做出平衡。一般建议一个查询计划扫描的数据量在 3-5 GB内(一个表的 Tablet 的大小以及个数可以通过 `SHOW TABLETS FROM tbl_name;` 语句查看。)。
|
||||
一个 Export 作业可以设置`parallelism`参数来并发导出数据。`parallelism`参数实际就是指定执行 EXPORT 作业的线程数量。每一个线程会负责导出表的部分Tablets。
|
||||
|
||||
一个 Export 作业的底层执行逻辑实际上是`SELECT INTO OUTFILE`语句,`parallelism`参数设置的每一个线程都会去执行独立的`SELECT INTO OUTFILE`语句。
|
||||
|
||||
Export 作业拆分成多个`SELECT INTO OUTFILE`的具体逻辑是:将该表的所有tablets平均的分给所有parallel线程,如:
|
||||
- num(tablets) = 40, parallelism = 3,则这3个线程各自负责的tablets数量分别为 14,13,13个。
|
||||
- num(tablets) = 2, parallelism = 3,则Doris会自动将parallelism设置为2,每一个线程负责一个tablets。
|
||||
|
||||
当一个线程负责的 tablets 超过 `maximum_tablets_of_outfile_in_export` 数值(默认为10,可在fe.conf中添加`maximum_tablets_of_outfile_in_export`参数来修改该值)时,该线程就会把它负责的 tablets 拆分为多个`SELECT INTO OUTFILE`语句,如:
|
||||
- 一个线程负责的tablets数量分别为 14,`maximum_tablets_of_outfile_in_export = 10`,则该线程负责两个`SELECT INTO OUTFILE`语句,第一个`SELECT INTO OUTFILE`语句导出10个tablets,第二个`SELECT INTO OUTFILE`语句导出4个tablets,两个`SELECT INTO OUTFILE`语句由该线程串行执行。
|
||||
|
||||
|
||||
当所要导出的数据量很大时,可以考虑适当调大`parallelism`参数来增加并发导出。若机器核数紧张,无法再增加`parallelism` 而导出表的Tablets又较多 时,可以考虑调大`maximum_tablets_of_outfile_in_export`来增加一个`SELECT INTO OUTFILE`语句负责的tablets数量,也可以加快导出速度。
|
||||
|
||||
### exec\_mem\_limit
|
||||
|
||||
通常一个 Export 作业的查询计划只有 `扫描`-`导出` 两部分,不涉及需要太多内存的计算逻辑。所以通常 2GB 的默认内存限制可以满足需求。但在某些场景下,比如一个查询计划,在同一个 BE 上需要扫描的 Tablet 过多,或者 Tablet 的数据版本过多时,可能会导致内存不足。此时需要通过这个参数设置更大的内存,比如 4GB、8GB 等。
|
||||
通常一个 Export 作业的查询计划只有 `扫描-导出` 两部分,不涉及需要太多内存的计算逻辑。所以通常 2GB 的默认内存限制可以满足需求。
|
||||
|
||||
但在某些场景下,比如一个查询计划,在同一个 BE 上需要扫描的 Tablet 过多,或者 Tablet 的数据版本过多时,可能会导致内存不足。可以调整session变量`exec_mem_limit`来调大内存使用限制。
|
||||
|
||||
## 注意事项
|
||||
|
||||
* 不建议一次性导出大量数据。一个 Export 作业建议的导出数据量最大在几十 GB。过大的导出会导致更多的垃圾文件和更高的重试成本。
|
||||
* 如果表数据量过大,建议按照分区导出。
|
||||
* 在 Export 作业运行过程中,如果 FE 发生重启或切主,则 Export 作业会失败,需要用户重新提交。
|
||||
* 如果 Export 作业运行失败,在远端存储中产生的 `__doris_export_tmp_xxx` 临时目录,以及已经生成的文件不会被删除,需要用户手动删除。
|
||||
* 如果 Export 作业运行成功,在远端存储中产生的 `__doris_export_tmp_xxx` 目录,根据远端存储的文件系统语义,可能会保留,也可能会被清除。比如对象存储(支持S3协议)中,通过 rename 操作将一个目录中的最后一个文件移走后,该目录也会被删除。如果该目录没有被清除,用户可以手动清除。
|
||||
* 当 Export 运行完成后(成功或失败),FE 发生重启或切主,则 [SHOW EXPORT](../../sql-manual/sql-reference/Show-Statements/SHOW-EXPORT.md) 展示的作业的部分信息会丢失,无法查看。
|
||||
* Export 作业只会导出 Base 表的数据,不会导出 Rollup Index 的数据。
|
||||
* 如果 Export 作业运行失败,已经生成的文件不会被删除,需要用户手动删除。
|
||||
* Export 作业可以导出 Base 表 / View视图表 / 外表 的数据,不会导出 Rollup Index 的数据。
|
||||
* Export 作业会扫描数据,占用 IO 资源,可能会影响系统的查询延迟。
|
||||
* 在使用EXPORT命令时,请确保目标路径是已存在的目录,否则导出可能会失败。
|
||||
* 在并发导出时,请注意合理地配置线程数量和并行度,以充分利用系统资源并避免性能瓶颈。
|
||||
* 导出到本地文件时,要注意文件权限和路径,确保有足够的权限进行写操作,并遵循适当的文件系统路径。
|
||||
* 在导出过程中,可以实时监控进度和性能指标,以便及时发现问题并进行优化调整。
|
||||
* 导出操作完成后,建议验证导出的数据是否完整和正确,以确保数据的质量和完整性。
|
||||
|
||||
## 相关配置
|
||||
|
||||
### FE
|
||||
|
||||
* `export_checker_interval_second`:Export 作业调度器的调度间隔,默认为 5 秒。设置该参数需重启 FE。
|
||||
* `export_running_job_num_limit`:正在运行的 Export 作业数量限制。如果超过,则作业将等待并处于 PENDING 状态。默认为 5,可以运行时调整。
|
||||
* `export_task_default_timeout_second`:Export 作业默认超时时间。默认为 2 小时。可以运行时调整。
|
||||
* `export_tablet_num_per_task`:一个查询计划负责的最大分片数。默认为 5。
|
||||
* `label`:用户手动指定的 EXPORT 任务 label ,如果不指定会自动生成一个 label 。
|
||||
* `maximum_tablets_of_outfile_in_export`:ExportExecutorTask任务中一个OutFile语句允许的最大tablets数量。
|
||||
|
||||
## 更多帮助
|
||||
|
||||
关于 Export 使用的更多详细语法及最佳实践,请参阅 [Export](../../sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/EXPORT.md) 命令手册,你也可以在 MySql 客户端命令行下输入 `HELP EXPORT` 获取更多帮助信息。
|
||||
关于 EXPORT 使用的更多详细语法及最佳实践,请参阅 [Export](../../sql-manual/sql-reference/Data-Manipulation-Statements/Manipulation/EXPORT.md) 命令手册,你也可以在 MySql 客户端命令行下输入 `HELP EXPORT` 获取更多帮助信息。
|
||||
|
||||
EXPORT 命令底层实现是`SELECT INTO OUTFILE`语句,有关`SELECT INTO OUTFILE`可以参阅[同步导出](./outfile.md) 和 [SELECT INTO OUTFILE](../..//sql-manual/sql-reference/Data-Manipulation-Statements/OUTFILE.md)命令手册。
|
||||
|
||||
@ -28,6 +28,8 @@ under the License.
|
||||
|
||||
本文档介绍如何使用 [SELECT INTO OUTFILE](../../sql-manual/sql-reference/Data-Manipulation-Statements/OUTFILE.md) 命令进行查询结果的导出操作。
|
||||
|
||||
`SELECT INTO OUTFILE` 是一个同步命令,命令返回即表示操作结束,同时会返回一行结果来展示导出的执行结果。
|
||||
|
||||
## 示例
|
||||
|
||||
### 导出到HDFS
|
||||
@ -150,7 +152,7 @@ ERROR 1064 (HY000): errCode = 2, detailMessage = Open broker writer failed ...
|
||||
* 导出命令不会检查文件及文件路径是否存在。是否会自动创建路径、或是否会覆盖已存在文件,完全由远端存储系统的语义决定。
|
||||
* 如果在导出过程中出现错误,可能会有导出文件残留在远端存储系统上。Doris 不会清理这些文件。需要用户手动清理。
|
||||
* 导出命令的超时时间同查询的超时时间。可以通过 `SET query_timeout=xxx` 进行设置。
|
||||
* 对于结果集为空的查询,依然会产生一个大小为0的文件。
|
||||
* 对于结果集为空的查询,依然会产生一个文件。
|
||||
* 文件切分会保证一行数据完整的存储在单一文件中。因此文件的大小并不严格等于 `max_file_size`。
|
||||
* 对于部分输出为非可见字符的函数,如 BITMAP、HLL 类型,输出为 `\N`,即 NULL。
|
||||
* 目前部分地理信息函数,如 `ST_Point` 的输出类型为 VARCHAR,但实际输出值为经过编码的二进制字符。当前这些函数会输出乱码。对于地理函数,请使用 `ST_AsText` 进行输出。
|
||||
|
||||
@ -32,32 +32,33 @@ EXPORT
|
||||
|
||||
### Description
|
||||
|
||||
该语句用于将指定表的数据导出到指定位置。
|
||||
`EXPORT` 命令用于将指定表的数据导出为文件到指定位置。目前支持通过 Broker 进程, S3 协议或HDFS 协议,导出到远端存储,如 HDFS,S3,BOS,COS(腾讯云)上。
|
||||
|
||||
这是一个异步操作,任务提交成功则返回。执行后可使用 [SHOW EXPORT](../../Show-Statements/SHOW-EXPORT.md) 命令查看进度。
|
||||
`EXPORT`是一个异步操作,该命令会提交一个`EXPORT JOB`到Doris,任务提交成功立即返回。执行后可使用 [SHOW EXPORT](../../Show-Statements/SHOW-EXPORT.md) 命令查看进度。
|
||||
|
||||
语法:
|
||||
|
||||
```sql
|
||||
EXPORT TABLE table_name
|
||||
[PARTITION (p1[,p2])]
|
||||
[WHERE]
|
||||
TO export_path
|
||||
[opt_properties]
|
||||
WITH BROKER/S3/HDFS
|
||||
[broker_properties];
|
||||
```
|
||||
|
||||
```sql
|
||||
EXPORT TABLE table_name
|
||||
[PARTITION (p1[,p2])]
|
||||
[WHERE]
|
||||
TO export_path
|
||||
[opt_properties]
|
||||
WITH BROKER/S3/HDFS
|
||||
[broker_properties];
|
||||
```
|
||||
|
||||
**原理**
|
||||
Export语句底层实际执行的是`select...outfile..`语句,Export任务会根据`parallelism`参数的值来分解为多个`select...outfile..`语句并发地去执行,每一个`select...outfile..`负责导出部份tablets数据。
|
||||
|
||||
**说明**:
|
||||
|
||||
- `table_name`
|
||||
|
||||
当前要导出的表的表名。仅支持 Doris 本地表数据的导出。
|
||||
当前要导出的表的表名。支持 Doris 本地表、视图View、Catalog外表数据的导出。
|
||||
|
||||
- `partition`
|
||||
|
||||
可以只导出指定表的某些指定分区
|
||||
可以只导出指定表的某些指定分区,只对Doris本地表有效。
|
||||
|
||||
- `export_path`
|
||||
|
||||
@ -73,14 +74,20 @@ Under the hood, the Export statement executes `select...outfile..` statements

The following parameters can be specified:

- `label`: Optional. Specifies the label of this Export job. If not specified, the system assigns a random label.
- `column_separator`: Specifies the column separator of the exported data, default `\t`. Only single-byte separators are supported.
- `line_delimiter`: Specifies the line delimiter of the exported data, default `\n`. Only single-byte delimiters are supported.
- `columns`: Specifies certain columns of the exported table.
- `timeout`: The timeout of the export job, in seconds; the default is 2 hours.
- `format`: The file format of the export job. Supported formats: parquet, orc, csv, csv_with_names, csv_with_names_and_types. Default is csv.
- `max_file_size`: The size limit of a single file in the export job. If the result exceeds this value, it is split into multiple files.
- `parallelism`: The concurrency of the export job, default `1`. The export job is split into `parallelism` `select..outfile..` statements that are executed concurrently. (If parallelism is greater than the number of tablets of the table, the system automatically sets parallelism to the number of tablets, i.e. each `select..outfile..` statement handles one tablet.)
- `label`: Optional. Specifies the label of this Export job. If not specified, the system generates a random label.

- `column_separator`: Specifies the column separator of the exported data, default `\t`. Multi-byte separators are supported. This parameter is only used for the csv file format.

- `line_delimiter`: Specifies the line delimiter of the exported data, default `\n`. Multi-byte delimiters are supported. This parameter is only used for the csv file format.

- `columns`: Specifies certain columns of the exported table.

- `format`: Specifies the file format of the export job. Supported formats: parquet, orc, csv, csv_with_names, csv_with_names_and_types. Default is csv.

- `max_file_size`: The size limit of a single file in the export job. If the result exceeds this value, it is split into multiple files. The valid range of `max_file_size` is [5MB, 2GB], and the default is 1GB. (When exporting to the orc file format, the actual split size is a multiple of 64MB; for example, if max_file_size = 5MB is specified, files are actually split at 64MB; if max_file_size = 65MB, files are split at 128MB.)

- `parallelism`: The concurrency of the export job, default `1`. The export job starts `parallelism` threads to execute `select into outfile` statements (see the sketch below). (If parallelism is greater than the number of tablets of the table, the system automatically sets parallelism to the number of tablets, i.e. each `select into outfile` statement handles one tablet.)

- `delete_existing_files`: Default false. If set to true, all files under the directory specified by `export_path` are deleted first, and data is then exported to that directory. For example, with "export_path" = "/user/tmp", all files and directories under "/user/" are deleted; with "export_path" = "/user/tmp/", all files and directories under "/user/tmp/" are deleted.

> Note: To use the delete_existing_files parameter, you also need to add the configuration `enable_delete_existing_files = true` to fe.conf and restart the FE; only then does delete_existing_files take effect. delete_existing_files = true is a dangerous operation and is recommended only in test environments.
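As an illustration of how these properties combine, here is a minimal hypothetical sketch; the table name, partitions, paths, label, and the HDFS connection values are placeholders and are not taken from the manual:

```sql
-- Hypothetical example: export two partitions as parquet, splitting files
-- at 512MB and running four export threads.
EXPORT TABLE example_tbl PARTITION (p1, p2)
TO "hdfs://hdfs_host:port/a/b/c/"
PROPERTIES
(
    "label" = "export_example_tbl",
    "format" = "parquet",
    "max_file_size" = "512MB",
    "parallelism" = "4"
)
WITH HDFS
(
    "fs.defaultFS" = "hdfs://hdfs_host:port",
    "hadoop.username" = "hadoop_user"
);
```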
@ -146,7 +153,7 @@ Under the hood, the Export statement executes `select...outfile..` statements

### Example

#### export data to local

To export data to the local file system, add `enable_outfile_to_local=true` to fe.conf and restart the FE.
> To export data to the local file system, add `enable_outfile_to_local=true` to fe.conf and restart the FE.

1. Export all data in the test table to local storage; the default export format is csv.

```sql
@ -241,7 +248,7 @@ When exporting data, Export first deletes all files and directories under the `/home/user/` directory

#### export with S3

8. Export all data in the s3_test table to s3, using the invisible character "\x07" as the column and row separator. If the data needs to be exported to minio, use_path_style=true must also be specified.
1. Export all data in the s3_test table to s3, using the invisible character "\x07" as the column and row separator. If the data needs to be exported to minio, use_path_style=true must also be specified.

```sql
EXPORT TABLE s3_test TO "s3://bucket/a/b/c"
@ -249,36 +256,22 @@ PROPERTIES (
"column_separator"="\\x07",
"line_delimiter" = "\\x07"
) WITH s3 (
"AWS_ENDPOINT" = "xxxxx",
"AWS_ACCESS_KEY" = "xxxxx",
"AWS_SECRET_KEY"="xxxx",
"AWS_REGION" = "xxxxx"
)
```

```sql
EXPORT TABLE minio_test TO "s3://bucket/a/b/c"
PROPERTIES (
"column_separator"="\\x07",
"line_delimiter" = "\\x07"
) WITH s3 (
"AWS_ENDPOINT" = "xxxxx",
"AWS_ACCESS_KEY" = "xxxxx",
"AWS_SECRET_KEY"="xxxx",
"AWS_REGION" = "xxxxx",
"use_path_style" = "true"
"s3.endpoint" = "xxxxx",
"s3.region" = "xxxxx",
"s3.secret_key"="xxxx",
"s3.access_key" = "xxxxx"
)
```
#### export with HDFS

9. Export all data in the test table to HDFS. The export file format is parquet, the single-file size limit of the export job is 1024MB, and all existing files under the specified directory are retained.
1. Export all data in the test table to HDFS. The export file format is parquet, the single-file size limit of the export job is 512MB, and all existing files under the specified directory are retained.

```sql
EXPORT TABLE test TO "hdfs://hdfs_host:port/a/b/c/"
PROPERTIES(
"format" = "parquet",
"max_file_size" = "1024MB",
"max_file_size" = "512MB",
"delete_existing_files" = "false"
)
with HDFS (
@ -335,29 +328,38 @@ WITH BROKER "broker_name"

### Best Practice

#### Splitting into subtasks
#### Concurrent execution

An Export job is split into multiple subtasks (query plans) for execution. How many query plans need to be executed depends on the total number of tablets and on how many tablets one query plan can be assigned at most.
An Export job can set the `parallelism` parameter to export data concurrently. The `parallelism` parameter specifies the number of threads that execute the EXPORT job; each thread is responsible for exporting part of the table's tablets.

Because multiple query plans are executed serially, letting one query plan handle more shards can reduce the execution time of the job.
The underlying execution logic of an Export job is the `SELECT INTO OUTFILE` statement; each of the threads configured by the `parallelism` parameter executes its own independent `SELECT INTO OUTFILE` statements.

However, if a query plan fails (for example, the RPC call to the Broker fails, or the remote storage jitters), too many tablets in one query plan raise the retry cost of that query plan.
The specific logic for splitting an Export job into multiple `SELECT INTO OUTFILE` statements is to distribute all tablets of the table evenly across the parallel threads, for example:
- num(tablets) = 40, parallelism = 3: the three threads handle 14, 13, and 13 tablets respectively.
- num(tablets) = 2, parallelism = 3: Doris automatically sets parallelism to 2, and each thread handles one tablet.

Therefore, the number of query plans and the number of shards scanned by each query plan need to be arranged reasonably, to balance execution time against the success rate.
When the number of tablets handled by one thread exceeds the value of `maximum_tablets_of_outfile_in_export` (default 10; it can be changed by adding the `maximum_tablets_of_outfile_in_export` parameter to fe.conf), the thread is split into multiple `SELECT INTO OUTFILE` statements, for example:
- If a thread is responsible for 14 tablets and `maximum_tablets_of_outfile_in_export = 10`, the thread handles two `SELECT INTO OUTFILE` statements: the first exports 10 tablets and the second exports 4 tablets, and the two statements are executed serially by that thread.

It is generally recommended that the amount of data scanned by one query plan stays within 3-5 GB.

When the amount of data to export is large, consider increasing the `parallelism` parameter appropriately to raise export concurrency. If CPU cores are tight and `parallelism` cannot be increased further while the table to export has many tablets, consider increasing `maximum_tablets_of_outfile_in_export` to raise the number of tablets handled by one `SELECT INTO OUTFILE` statement, which can also speed up the export.
#### Memory limit

The query plan of an Export job usually has only two parts, scan and export, and does not involve compute logic that requires much memory. So the default memory limit of 2GB usually meets the requirement.

In some scenarios, however, such as a query plan that needs to scan too many tablets on the same BE, or tablets with too many data versions, memory may be insufficient. In that case a larger memory limit needs to be set through the `exec_mem_limit` parameter, such as 4GB or 8GB.
In some scenarios, however, such as a query plan that needs to scan too many tablets on the same BE, or tablets with too many data versions, memory may be insufficient. You can adjust the session variable `exec_mem_limit` to raise the memory limit.
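For example, a minimal sketch of raising the limit for the current session before submitting the Export job (the 8GB value is only an illustration):

```sql
-- Raise the per-query memory limit for this session to 8GB (value in bytes)
-- before submitting the EXPORT job.
SET exec_mem_limit = 8589934592;
```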
#### Notes

- Exporting a large amount of data in one job is not recommended. The recommended maximum export volume for one Export job is a few dozen GB. An overly large export leads to more garbage files and higher retry costs. If the table data volume is too large, it is recommended to export by partition (see the sketch after this list).

- If an Export job fails, the files that have already been generated are not deleted and need to be removed manually by the user.
- An Export job only exports the data of the Base table; it does not export the data of materialized views.

- An Export job scans data and occupies IO resources, which may affect the query latency of the system.

- Currently, the export only performs a simple check of whether tablet versions are consistent. It is recommended not to load data into the table while the export is running.

- The maximum number of partitions that one Export Job is allowed to export is 2000. You can add the parameter `maximum_number_of_export_partitions` to fe.conf and restart the FE to change this setting.

- The timeout of the `EXPORT` command is the same as the query timeout and can be set with SET query_timeout=xxx.
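As a sketch of the partition-by-partition approach recommended in the first note above (the table, partition, path, and HDFS connection values are hypothetical placeholders):

```sql
-- Hypothetical example: export one partition per job instead of the whole table at once.
EXPORT TABLE big_tbl PARTITION (p202309)
TO "hdfs://hdfs_host:port/export/big_tbl/p202309/"
PROPERTIES
(
    "format" = "csv",
    "parallelism" = "2"
)
WITH HDFS
(
    "fs.defaultFS" = "hdfs://hdfs_host:port",
    "hadoop.username" = "hadoop_user"
);
```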
@ -32,7 +32,7 @@ OUTFILE

### description

This statement is used to export query results as files with the `SELECT INTO OUTFILE` command. It currently supports exporting through the Broker process, through the S3 protocol, or directly through the HDFS protocol, to remote storage such as HDFS, S3, BOS, and COS (Tencent Cloud).
The `SELECT INTO OUTFILE` command is used to export query results as files. It currently supports exporting through the Broker process, the S3 protocol, or the HDFS protocol, to remote storage such as HDFS, S3, BOS, and COS (Tencent Cloud).

Syntax:
@ -52,7 +52,7 @@ INTO OUTFILE "file_path"
```
file_path points to the path where the files are stored, together with the file prefix, e.g. `hdfs://path/to/my_file_`.

The final file names consist of `my_file_`, a file sequence number, and the file format suffix. The sequence number starts from 0, and its count equals the number of files the result was split into. For example:
my_file_abcdefg_0.csv
my_file_abcdefg_1.csv
my_file_abcdefg_2.csv
@ -73,20 +73,20 @@ INTO OUTFILE "file_path"

3. properties

```
Specifies the relevant properties. Currently exporting through the Broker process or through the S3 protocol is supported.
Specifies the relevant properties. Currently exporting through the Broker process or through the S3/HDFS protocol is supported.

Syntax:
[PROPERTIES ("key"="value", ...)]
The following properties are supported:

File-related properties
column_separator: column separator, only supported for the csv format. Multi-byte separators are supported starting from version 1.2, e.g. "\\x01", "abc".
line_delimiter: line delimiter, only supported for the csv format. Multi-byte delimiters are supported starting from version 1.2, e.g. "\\x01", "abc".
File-related properties:
column_separator: column separator, only used for csv-related formats. Multi-byte separators are supported starting from version 1.2, e.g. "\\x01", "abc".
line_delimiter: line delimiter, only used for csv-related formats. Multi-byte delimiters are supported starting from version 1.2, e.g. "\\x01", "abc".
max_file_size: the size limit of a single file. If the result exceeds this value, it is split into multiple files. The valid range of max_file_size is [5MB, 2GB], and the default is 1GB. (When exporting to the orc file format, the actual split size is a multiple of 64MB; e.g. if max_file_size = 5MB is specified, files are actually split at 64MB; if max_file_size = 65MB, files are split at 128MB.)
delete_existing_files: default false. If set to true, all files under the directory specified by file_path are deleted first, and data is then exported to that directory. For example, with "file_path" = "/user/tmp", all files and directories under "/user/" are deleted; with "file_path" = "/user/tmp/", all files and directories under "/user/tmp/" are deleted
file_suffix: specifies the suffix of the exported files
delete_existing_files: default false. If set to true, all files under the directory specified by file_path are deleted first, and data is then exported to that directory. For example, with "file_path" = "/user/tmp", all files and directories under "/user/" are deleted; with "file_path" = "/user/tmp/", all files and directories under "/user/tmp/" are deleted.
file_suffix: specifies the suffix of the exported files. If this parameter is not specified, the default suffix of the file format is used.

Broker-related properties must be prefixed with `broker.`:
broker.name: broker name
broker.hadoop.security.authentication: specifies the authentication method as kerberos
broker.kerberos_principal: specifies the principal for kerberos
@ -114,7 +114,24 @@ INTO OUTFILE "file_path"
use_path_style: (optional) default false. The S3 SDK uses the virtual-hosted style by default. However, some object storage systems may not enable or support virtual-hosted style access; in this case, the use_path_style parameter can be added to force path style access.
```
> Note: To use the delete_existing_files parameter, you also need to add the configuration `enable_delete_existing_files = true` to fe.conf and restart the FE; only then does delete_existing_files take effect. delete_existing_files = true is a dangerous operation and is recommended only in test environments.
> Note: If you want to use the delete_existing_files parameter, you also need to add the configuration `enable_delete_existing_files = true` to fe.conf and restart the FE; only then does delete_existing_files take effect. delete_existing_files = true is a dangerous operation and is recommended only in test environments.

4. Exported data types

All file types support exporting basic data types. For complex data types (ARRAY/MAP/STRUCT), currently only csv/orc/csv_with_names/csv_with_names_and_types support exporting complex types, and nested complex types are not supported.

5. Concurrent export

Setting the session variable ```set enable_parallel_outfile = true;``` enables concurrent outfile export (a sketch follows after this list). For detailed usage, see [Export Query Result](../../../data-operate/export/outfile.md).

6. Export to local

To export to local files, `enable_outfile_to_local=true` must first be configured in fe.conf.

```sql
select * from tbl1 limit 10
INTO OUTFILE "file:///home/work/path/result_";
```
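As referenced in note 5, here is a minimal sketch of a concurrent export to a local path that also uses the file-related properties from note 3. The table and path reuse the placeholder names above, and whether parallel export actually takes effect still depends on the query plan:

```sql
-- Enable concurrent outfile export for this session, then export with
-- a file size limit and an explicit file suffix.
SET enable_parallel_outfile = true;

SELECT * FROM tbl1
INTO OUTFILE "file:///home/work/path/result_"
FORMAT AS CSV
PROPERTIES
(
    "max_file_size" = "256MB",
    "file_suffix" = "csv"
);
```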
### example

@ -150,13 +167,10 @@ INTO OUTFILE "file_path"
"broker.name" = "my_broker",
"broker.hadoop.security.authentication" = "kerberos",
"broker.kerberos_principal" = "doris@YOUR.COM",
"broker.kerberos_keytab" = "/home/doris/my.keytab",
"schema"="required,int32,c1;required,byte_array,c2;required,byte_array,c2"
"broker.kerberos_keytab" = "/home/doris/my.keytab"
);
```

Exporting query results to a parquet file requires explicitly specifying the `schema`.

3. Export the query result of a CTE statement to the file `hdfs://path/to/result.txt`. The default export format is CSV. Use `my_broker` and set the HDFS high-availability information. Use the default row and column separators.

```sql
@ -195,8 +209,7 @@ INTO OUTFILE "file_path"
"broker.name" = "my_broker",
"broker.bos_endpoint" = "http://bj.bcebos.com",
"broker.bos_accesskey" = "xxxxxxxxxxxxxxxxxxxxxxxxxx",
"broker.bos_secret_accesskey" = "yyyyyyyyyyyyyyyyyyyyyyyyyy",
"schema"="required,int32,k1;required,byte_array,k2"
"broker.bos_secret_accesskey" = "yyyyyyyyyyyyyyyyyyyyyyyyyy"
);
```
@ -342,3 +355,7 @@ INTO OUTFILE "file_path"

4. Guaranteeing result completeness

This command is synchronous, so the connection may be dropped while it is still executing, making it impossible to know whether the export ended normally or whether the data is complete. In this case, you can use the `success_file_name` parameter to require that a success-marker file be generated in the directory after the job succeeds. Users can use this file to determine whether the export ended normally (a sketch follows below).

5. Other notes

See [Export Query Result](../../../data-operate/export/outfile.md)
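A minimal sketch of using `success_file_name` as described in note 4; the table, path, and broker name are hypothetical placeholders:

```sql
-- After the export succeeds, a marker file named "SUCCESS" is written to
-- the target directory; the client can check for it to confirm completion.
SELECT * FROM demo_tbl
INTO OUTFILE "hdfs://path/to/result_"
FORMAT AS CSV
PROPERTIES
(
    "broker.name" = "my_broker",
    "success_file_name" = "SUCCESS"
);
```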