From 0f10f840757d4d143faf0fb9b376ebefa5ca217f Mon Sep 17 00:00:00 2001 From: smallhibiscus <844981280@qq.com> Date: Sat, 9 Apr 2022 19:02:16 +0800 Subject: [PATCH] [refactor][doc] Add update-delete documentation (#8821) --- .../import/import-way/binlog-load-manual.md | 31 +- .../import/import-way/insert-into-manual.md | 231 +++++++ .../import/import-way/routine-load-manual.md | 308 +++++++++ .../import/import-way/s3-load-manual.md | 69 +- .../import/import-way/spark-load-manual.md | 604 ++++++++++++++++++ .../import/import-way/stream-load-manual.md | 348 +++++++++- .../update-delete/batch-delete-manual.md | 101 +-- .../update-delete/delete-manual.md | 126 ++-- .../update-delete/sequence-column-manual.md | 62 +- .../en/data-operate/update-delete/update.md | 109 ++-- .../import/import-way/binlog-load-manual.md | 27 +- .../import/import-way/insert-into-manual.md | 34 +- .../import/import-way/routine-load-manual.md | 376 +++++++++++ .../import/import-way/s3-load-manual.md | 69 ++ .../import/import-way/spark-load-manual.md | 551 ++++++++++++++++ .../import/import-way/stream-load-manual.md | 365 +++++++++++ .../update-delete/batch-delete-manual.md | 219 ++++++- .../update-delete/delete-manual.md | 131 +++- .../update-delete/sequence-column-manual.md | 209 +++++- .../data-operate/update-delete/update.md | 92 ++- 20 files changed, 3826 insertions(+), 236 deletions(-) diff --git a/new-docs/en/data-operate/import/import-way/binlog-load-manual.md b/new-docs/en/data-operate/import/import-way/binlog-load-manual.md index a13e52000d..843a9e5d6b 100644 --- a/new-docs/en/data-operate/import/import-way/binlog-load-manual.md +++ b/new-docs/en/data-operate/import/import-way/binlog-load-manual.md @@ -33,17 +33,6 @@ The Binlog Load feature enables Doris to incrementally synchronize update operat * Filter query * Temporarily incompatible with DDL statements -## Glossary -* FE: Frontend, the front-end node of Doris. Responsible for metadata management and request access. -* BE: Backend, the backend node of Doris. Responsible for query execution and data storage. -* Canal: Alibaba's open source MySQL binlog parsing tool. Support incremental data subscription & consumption. -* Batch: A batch of data sent by canal to the client with a globally unique self-incrementing ID. -* SyncJob: A data synchronization job submitted by the user. -* Receiver: Responsible for subscribing to and receiving data from canal. -* Consumer: Responsible for distributing the data received by the Receiver to each channel. -* Channel: The channel that receives the data distributed by Consumer, it creates tasks for sending data, and controls the begining, committing and aborting of transaction in one table. -* Task: Task created by channel, sends data to Be when executing. - ## Principle In the design of phase one, Binlog Load needs to rely on canal as an intermediate medium, so that canal can be pretended to be a slave node to get and parse the binlog on the MySQL master node, and then Doris can get the parsed data on the canal. This process mainly involves mysql, canal and Doris. The overall data flow is as follows: @@ -159,6 +148,7 @@ Binlog log supports two main formats (in addition to mixed based mode): Binlog will record the data change information of each row and all columns of the master node, and the slave node will copy and execute the change of each row to the local node. + The first format only writes the executed SQL statements. Although the log volume will be small, it has the following disadvantages: 1. 
The actual data of each row is not recorded @@ -276,7 +266,7 @@ After downloading, please follow the steps below to complete the deployment. 2013-02-05 22:50:45.810 [main] INFO c.a.otter.canal.instance.spring.CanalInstanceWithSpring - start successful.... ``` -### Principle Description +### Canal End Description By faking its own MySQL dump protocol, canal disguises itself as a slave node, get and parses the binlog of the master node. @@ -402,7 +392,8 @@ The detailed syntax of creating a SyncJob can be viewd in `help create sync job` ### Show Job Status -Specific commands and examples for showing job status can be found in `help show sync job;` command. + +Specific commands and examples for viewing job status can be viewed through the [SHOW SYNC JOB](../../../sql-manual/sql-reference-v2/show/SHOW-SYNC-JOB.html) command. The parameters in the result set have the following meanings: @@ -454,6 +445,10 @@ Users can control the status of jobs through `stop/pause/resume` commands. You can use `HELP STOP SYNC JOB;`, `HELP PAUSE SYNC JOB`; And `HELP RESUME SYNC JOB;` commands to view help and examples. +## Case Combat + +[How to use Apache Doris Binlog Load and examples](https://doris.apache.org/zh-CN/article/articles/doris-binlog-load.html) + ## Related Parameters ### Canal configuration @@ -474,7 +469,7 @@ You can use `HELP STOP SYNC JOB;`, `HELP PAUSE SYNC JOB`; And `HELP RESUME SYNC The default space occupied by an event at the canal end, default value is 1024 bytes. This value multiplied by `canal.instance.memory.buffer.size` is equal to the maximum space of the store. For example, if the queue length of the store is 16384, the space of the store is 16MB. However, the actual size of an event is not actually equal to this value, but is determined by the number of rows of data in the event and the length of each row of data. For example, the insert event of a table with only two columns is only 30 bytes, but the delete event may reach thousands of bytes. This is because the number of rows of delete event is usually more than that of insert event. - + ### Fe configuration The following configuration belongs to the system level configuration of SyncJob. The configuration value can be modified in configuration file fe.conf. @@ -508,7 +503,7 @@ The following configuration belongs to the system level configuration of SyncJob 1. Will modifying the table structure affect data synchronization? Yes. The SyncJob cannot prohibit `alter table` operation. -When the table's schema changes, if the column mapping cannot match, the job may be suspended incorrectly. It is recommended to reduce such problems by explicitly specifying the column mapping relationship in the data synchronization job, or by adding nullable columns or columns with default values. + When the table's schema changes, if the column mapping cannot match, the job may be suspended incorrectly. It is recommended to reduce such problems by explicitly specifying the column mapping relationship in the data synchronization job, or by adding nullable columns or columns with default values. 2. Will the SyncJob continue to run after the database is deleted? @@ -520,4 +515,8 @@ When the table's schema changes, if the column mapping cannot match, the job may 4. Why is the precision of floating-point type different between MySQL and Doris during data synchronization? - The precision of Doris floating-point type is different from that of MySQL. You can choose to use decimal type instead. 
\ No newline at end of file + The precision of Doris floating-point type is different from that of MySQL. You can choose to use decimal type instead. + +## More Help + +For more detailed syntax and best practices used by Binlog Load, see [Binlog Load](../../../sql-manual/sql-reference-v2/Data-Manipulation-Statements/Load/BINLOG- LOAD.html) command manual, you can also enter `HELP BINLOG` in the MySql client command line for more help information. \ No newline at end of file diff --git a/new-docs/en/data-operate/import/import-way/insert-into-manual.md b/new-docs/en/data-operate/import/import-way/insert-into-manual.md index fd608eb649..9502c04418 100644 --- a/new-docs/en/data-operate/import/import-way/insert-into-manual.md +++ b/new-docs/en/data-operate/import/import-way/insert-into-manual.md @@ -25,3 +25,234 @@ under the License. --> # Insert Into + +The use of Insert Into statements is similar to that of Insert Into statements in databases such as MySQL. But in Doris, all data writing is a separate import job. So Insert Into is also introduced here as an import method. + +The main Insert Into command contains the following two kinds; + +- INSERT INTO tbl SELECT ... +- INSERT INTO tbl (col1, col2, ...) VALUES (1, 2, ...), (1,3, ...); + +The second command is for Demo only, not in a test or production environment. + +## Import operations and load results + +The Insert Into command needs to be submitted through MySQL protocol. Creating an import request returns the import result synchronously. + +The following are examples of the use of two Insert Intos: + +```sql +INSERT INTO tbl2 WITH LABEL label1 SELECT * FROM tbl3; +INSERT INTO tbl1 VALUES ("qweasdzxcqweasdzxc"), ("a"); +``` + +> Note: When you need to use `CTE(Common Table Expressions)` as the query part in an insert operation, you must specify the `WITH LABEL` and column list parts. Example: +> +> ```sql +> INSERT INTO tbl1 WITH LABEL label1 +> WITH cte1 AS (SELECT * FROM tbl1), cte2 AS (SELECT * FROM tbl2) +> SELECT k1 FROM cte1 JOIN cte2 WHERE cte1.k1 = 1; +> +> +> INSERT INTO tbl1 (k1) +> WITH cte1 AS (SELECT * FROM tbl1), cte2 AS (SELECT * FROM tbl2) +> SELECT k1 FROM cte1 JOIN cte2 WHERE cte1.k1 = 1; +> ``` + +For specific parameter description, you can refer to [INSERT INTO](../../../sql-manual/sql-reference-v2/Data-Manipulation-Statements/Manipulation/INSERT.html) command or execute `HELP INSERT ` to see its help documentation for better use of this import method. + + +Insert Into itself is a SQL command, and the return result is divided into the following types according to the different execution results: + +1. Result set is empty + + If the result set of the insert corresponding SELECT statement is empty, it is returned as follows: + + ```text + mysql> insert into tbl1 select * from empty_tbl; + Query OK, 0 rows affected (0.02 sec) + ``` + + `Query OK` indicates successful execution. `0 rows affected` means that no data was loaded. + +2. The result set is not empty + + In the case where the result set is not empty. The returned results are divided into the following situations: + + 1. 
Insert is successful and data is visible: + + ```text + mysql> insert into tbl1 select * from tbl2; + Query OK, 4 rows affected (0.38 sec) + {'label': 'insert_8510c568-9eda-4173-9e36-6adc7d35291c', 'status': 'visible', 'txnId': '4005'} + + mysql> insert into tbl1 with label my_label1 select * from tbl2; + Query OK, 4 rows affected (0.38 sec) + {'label': 'my_label1', 'status': 'visible', 'txnId': '4005'} + + mysql> insert into tbl1 select * from tbl2; + Query OK, 2 rows affected, 2 warnings (0.31 sec) + {'label': 'insert_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status': 'visible', 'txnId': '4005'} + + mysql> insert into tbl1 select * from tbl2; + Query OK, 2 rows affected, 2 warnings (0.31 sec) + {'label': 'insert_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status': 'committed', 'txnId': '4005'} + ``` + + `Query OK` indicates successful execution. `4 rows affected` means that a total of 4 rows of data were imported. `2 warnings` indicates the number of lines to be filtered. + + Also returns a json string: + + ```text + {'label': 'my_label1', 'status': 'visible', 'txnId': '4005'} + {'label': 'insert_f0747f0e-7a35-46e2-affa-13a235f4020d', 'status': 'committed', 'txnId': '4005'} + {'label': 'my_label1', 'status': 'visible', 'txnId': '4005', 'err': 'some other error'} + ``` + + `label` is a user-specified label or an automatically generated label. Label is the ID of this Insert Into load job. Each load job has a label that is unique within a single database. + + `status` indicates whether the loaded data is visible. If visible, show `visible`, if not, show`committed`. + + `txnId` is the id of the load transaction corresponding to this insert. + + The `err` field displays some other unexpected errors. + + When user need to view the filtered rows, the user can use the following statement + + ```text + show load where label = "xxx"; + ``` + + The URL in the returned result can be used to query the wrong data. For details, see the following **View Error Lines** Summary. **"Data is not visible" is a temporary status, this batch of data must be visible eventually** + + You can view the visible status of this batch of data with the following statement: + + ```text + show transaction where id = 4005; + ``` + + If the `TransactionStatus` column in the returned result is `visible`, the data is visible. + + 2. Insert fails + + Execution failure indicates that no data was successfully loaded, and returns as follows: + + ```text + mysql> insert into tbl1 select * from tbl2 where k1 = "a"; + ERROR 1064 (HY000): all partitions have no load data. Url: http://10.74.167.16:8042/api/_load_error_log?file=__shard_2/error_log_insert_stmt_ba8bb9e158e4879-ae8de8507c0bf8a2_ba8bb9e158e4879_ae8de850e8de850 + ``` + + Where `ERROR 1064 (HY000): all partitions have no load data` shows the reason for the failure. The latter url can be used to query the wrong data. For details, see the following **View Error Lines** Summary. + +**In summary, the correct processing logic for the results returned by the insert operation should be:** + +1. If the returned result is `ERROR 1064 (HY000)`, it means that the import failed. +2. If the returned result is `Query OK`, it means the execution was successful. + 1. If `rows affected` is 0, the result set is empty and no data is loaded. + 2. If`rows affected` is greater than 0: + 1. If `status` is`committed`, the data is not yet visible. You need to check the status through the `show transaction` statement until `visible`. + 2. If `status` is`visible`, the data is loaded successfully. + 3. 
If `warnings` is greater than 0, it means that some data is filtered. You can get the url through the `show load` statement to see the filtered rows. + +In the previous section, we described how to follow up on the results of insert operations. However, it is difficult to get the json string of the returned result in some mysql libraries. Therefore, Doris also provides the `SHOW LAST INSERT` command to explicitly retrieve the results of the last insert operation. + +After executing an insert operation, you can execute `SHOW LAST INSERT` on the same session connection. This command returns the result of the most recent insert operation, e.g. + +```sql +mysql> show last insert\G +*************************** 1. row *************************** + TransactionId: 64067 + Label: insert_ba8f33aea9544866-8ed77e2844d0cc9b + Database: default_cluster:db1 + Table: t1 +TransactionStatus: VISIBLE + LoadedRows: 2 + FilteredRows: 0 +``` + +This command returns the insert results and the details of the corresponding transaction. Therefore, you can continue to execute the `show last insert` command after each insert operation to get the insert results. + +> Note: This command will only return the results of the last insert operation within the same session connection. If the connection is broken or replaced with a new one, the empty set will be returned. + +## Relevant System Configuration + +### FE configuration + +- time out + + The timeout time of the import task (in seconds) will be cancelled by the system if the import task is not completed within the set timeout time, and will become CANCELLED. + + At present, Insert Into does not support custom import timeout time. All Insert Into imports have a uniform timeout time. The default timeout time is 1 hour. If the imported source file cannot complete the import within the specified time, the parameter `insert_load_default_timeout_second` of FE needs to be adjusted. + + At the same time, the Insert Into statement receives the restriction of the Session variable `query_timeout`. You can increase the timeout time by `SET query_timeout = xxx;` in seconds. + +### Session Variables + +- enable_insert_strict + + The Insert Into import itself cannot control the tolerable error rate of the import. Users can only use the Session parameter `enable_insert_strict`. When this parameter is set to false, it indicates that at least one data has been imported correctly, and then it returns successfully. When this parameter is set to true, the import fails if there is a data error. The default is false. It can be set by `SET enable_insert_strict = true;`. + +- query u timeout + + Insert Into itself is also an SQL command, so the Insert Into statement is also restricted by the Session variable `query_timeout`. You can increase the timeout time by `SET query_timeout = xxx;` in seconds. + +## Best Practices + +### Application scenarios + +1. Users want to import only a few false data to verify the functionality of Doris system. The grammar of INSERT INTO VALUES is suitable at this time. +2. Users want to convert the data already in the Doris table into ETL and import it into a new Doris table, which is suitable for using INSERT INTO SELECT grammar. +3. Users can create an external table, such as MySQL external table mapping a table in MySQL system. Or create Broker external tables to map data files on HDFS. Then the data from the external table is imported into the Doris table for storage through the INSERT INTO SELECT grammar. 
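For instance, scenario 3 can be sketched as follows. This is only an illustration under assumed names: `ext_store_sales` stands for a MySQL or Broker external table that has already been created in Doris, and `doris_store_sales` for the target Doris table; adjust the column list, label and filter to your own schema.

```sql
-- Minimal sketch of scenario 3 (external table to Doris table).
-- `ext_store_sales` and `doris_store_sales` are hypothetical names; the
-- external table must already be mapped in Doris (e.g. ENGINE = mysql or a
-- Broker external table) before this statement is run.
INSERT INTO doris_store_sales WITH LABEL label_ext_sales_001
SELECT id, total, user_id, sale_timestamp
FROM ext_store_sales
WHERE region = "bj";
```

Since this runs as a normal Insert Into job, the timeout and `enable_insert_strict` settings described above apply to it as well.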
+ +### Data volume + +Insert Into has no limitation on the amount of data, and large data imports can also be supported. However, Insert Into has a default timeout time, and the amount of imported data estimated by users is too large, so it is necessary to modify the system's Insert Into import timeout time. + +```text +Import data volume = 36G or less than 3600s*10M/s +Among them, 10M/s is the maximum import speed limit. Users need to calculate the average import speed according to the current cluster situation to replace 10M/s in the formula. +``` + +### Complete examples + +Users have a table store sales in the database sales. Users create a table called bj store sales in the database sales. Users want to import the data recorded in the store sales into the new table bj store sales. The amount of data imported is about 10G. + +```text +large sales scheme +(id, total, user_id, sale_timestamp, region) + +Order large sales schedule: +(id, total, user_id, sale_timestamp) +``` + +Cluster situation: The average import speed of current user cluster is about 5M/s + +- Step1: Determine whether you want to modify the default timeout of Insert Into + + ```text + Calculate the approximate time of import + 10G / 5M /s = 2000s + + Modify FE configuration + insert_load_default_timeout_second = 2000 + ``` + +- Step2: Create Import Tasks + + Since users want to ETL data from a table and import it into the target table, they should use the Insert in query\stmt mode to import it. + + ```text + INSERT INTO bj_store_sales SELECT id, total, user_id, sale_timestamp FROM store_sales where region = "bj"; + ``` + +## Common Questions + +- View the wrong line + + Because Insert Into can't control the error rate, it can only tolerate or ignore the error data completely by `enable_insert_strict`. So if `enable_insert_strict` is set to true, Insert Into may fail. If `enable_insert_strict` is set to false, then only some qualified data may be imported. However, in either case, Doris is currently unable to provide the ability to view substandard data rows. Therefore, the user cannot view the specific import error through the Insert Into statement. + + The causes of errors are usually: source data column length exceeds destination data column length, column type mismatch, partition mismatch, column order mismatch, etc. When it's still impossible to check for problems. At present, it is only recommended that the SELECT command in the Insert Into statement be run to export the data to a file, and then import the file through Stream load to see the specific errors. + +## more help + +For more detailed syntax and best practices used by insert into, see [insert](../../../sql-manual/sql-reference-v2/Data-Manipulation-Statements/Manipulation/INSERT.html) command manual, you can also enter `HELP INSERT` in the MySql client command line for more help information. diff --git a/new-docs/en/data-operate/import/import-way/routine-load-manual.md b/new-docs/en/data-operate/import/import-way/routine-load-manual.md index 48dee27811..9d6e845490 100644 --- a/new-docs/en/data-operate/import/import-way/routine-load-manual.md +++ b/new-docs/en/data-operate/import/import-way/routine-load-manual.md @@ -26,3 +26,311 @@ under the License. # Routine Load +The Routine Load feature provides users with a way to automatically load data from a specified data source. + +This document describes the implementation principles, usage, and best practices of this feature. + +## Glossary + +* RoutineLoadJob: A routine load job submitted by the user. 
+* JobScheduler: A routine load job scheduler for scheduling and dividing a RoutineLoadJob into multiple Tasks. +* Task: RoutineLoadJob is divided by JobScheduler according to the rules. +* TaskScheduler: Task Scheduler. Used to schedule the execution of a Task. + +## Principle + +``` + +---------+ + | Client | + +----+----+ + | ++-----------------------------+ +| FE | | +| +-----------v------------+ | +| | | | +| | Routine Load Job | | +| | | | +| +---+--------+--------+--+ | +| | | | | +| +---v--+ +---v--+ +---v--+ | +| | task | | task | | task | | +| +--+---+ +---+--+ +---+--+ | +| | | | | ++-----------------------------+ + | | | + v v v + +---+--+ +--+---+ ++-----+ + | BE | | BE | | BE | + +------+ +------+ +------+ + +``` + +As shown above, the client submits a routine load job to FE. + +FE splits an load job into several Tasks via JobScheduler. Each Task is responsible for loading a specified portion of the data. The Task is assigned by the TaskScheduler to the specified BE. + +On the BE, a Task is treated as a normal load task and loaded via the Stream Load load mechanism. After the load is complete, report to FE. + +The JobScheduler in the FE continues to generate subsequent new Tasks based on the reported results, or retry the failed Task. + +The entire routine load job completes the uninterrupted load of data by continuously generating new Tasks. + +## Kafka Routine load + +Currently we only support routine load from the Kafka system. This section details Kafka's routine use and best practices. + +### Usage restrictions + +1. Support unauthenticated Kafka access and Kafka clusters certified by SSL. +2. The supported message format is csv text or json format. Each message is a line in csv format, and the end of the line does not contain a ** line break. +3. Kafka 0.10.0.0 (inclusive) or above is supported by default. If you want to use Kafka versions below 0.10.0.0 (0.9.0, 0.8.2, 0.8.1, 0.8.0), you need to modify the configuration of be, set the value of kafka_broker_version_fallback to be the older version, or directly set the value of property.broker.version.fallback to the old version when creating routine load. The cost of the old version is that some of the new features of routine load may not be available, such as setting the offset of the kafka partition by time. + +### Create a routine load task + +The detailed syntax for creating a routine load task can be connected to Doris and execute `HELP ROUTINE LOAD;` to see the syntax help. Here is a detailed description of the precautions when creating a job. + +* columns_mapping + + `columns_mapping` is mainly used to specify the column structure of the table structure and message, as well as the conversion of some columns. If not specified, Doris will default to the columns in the message and the columns of the table structure in a one-to-one correspondence. Although under normal circumstances, if the source data is exactly one-to-one, normal data load can be performed without specifying. However, we still strongly recommend that users ** explicitly specify column mappings**. This way, when the table structure changes (such as adding a nullable column), or the source file changes (such as adding a column), the load task can continue. Otherwise, after the above changes occur, the load will report an error because the column mapping relationship is no longer one-to-one. + + In `columns_mapping` we can also use some built-in functions for column conversion. 
But you need to pay attention to the actual column type corresponding to the function parameters. for example: + + Suppose the user needs to load a table containing only a column of `k1` with a column type of `int`. And you need to convert the null value in the source file to 0. This feature can be implemented with the `ifnull` function. The correct way to use is as follows: + + `COLUMNS (xx, k1=ifnull(xx, "3"))` + + Note that we use `"3"` instead of `3`, although `k1` is of type `int`. Because the column type in the source data is `varchar` for the load task, the `xx` virtual column is also of type `varchar`. So we need to use `"3"` to match the match, otherwise the `ifnull` function can't find the function signature with the parameter `(varchar, int)`, and an error will occur. + + As another example, suppose the user needs to load a table containing only a column of `k1` with a column type of `int`. And you need to process the corresponding column in the source file: convert the negative number to a positive number and the positive number to 100. This function can be implemented with the `case when` function. The correct wording should be as follows: + + `COLUMNS (xx, k1 = case when xx < 0 then cast(-xx as varchar) else cast((xx + '100') as varchar) end)` + + Note that we need to convert all the parameters in `case when` to varchar in order to get the desired result. + +* where_predicates + + The type of the column in `where_predicates` is already the actual column type, so there is no need to cast to the varchar type as `columns_mapping`. Write according to the actual column type. + +* desired\_concurrent\_number + + `desired_concurrent_number` is used to specify the degree of concurrency expected for a routine job. That is, a job, at most how many tasks are executing at the same time. For Kafka load, the current actual concurrency is calculated as follows: + + ``` + Min(partition num, desired_concurrent_number, Config.max_routine_load_task_concurrrent_num) + ``` + + Where `Config.max_routine_load_task_concurrrent_num` is a default maximum concurrency limit for the system. This is a FE configuration that can be adjusted by changing the configuration. The default is 5. + + Where partition num refers to the number of partitions for the Kafka topic subscribed to. + +* max\_batch\_interval/max\_batch\_rows/max\_batch\_size + + These three parameters are used to control the execution time of a single task. If any of the thresholds is reached, the task ends. Where `max_batch_rows` is used to record the number of rows of data read from Kafka. `max_batch_size` is used to record the amount of data read from Kafka in bytes. The current consumption rate for a task is approximately 5-10MB/s. + + So assume a row of data 500B, the user wants to be a task every 100MB or 10 seconds. The expected processing time for 100MB is 10-20 seconds, and the corresponding number of rows is about 200000 rows. Then a reasonable configuration is: + + ``` + "max_batch_interval" = "10", + "max_batch_rows" = "200000", + "max_batch_size" = "104857600" + ``` + + The parameters in the above example are also the default parameters for these configurations. + +* max\_error\_number + + `max_error_number` is used to control the error rate. When the error rate is too high, the job will automatically pause. Because the entire job is stream-oriented, and because of the borderless nature of the data stream, we can't calculate the error rate with an error ratio like other load tasks. 
So here is a new way of calculating to calculate the proportion of errors in the data stream. + + We have set up a sampling window. The size of the window is `max_batch_rows * 10`. Within a sampling window, if the number of error lines exceeds `max_error_number`, the job is suspended. If it is not exceeded, the next window restarts counting the number of error lines. + + We assume that `max_batch_rows` is 200000 and the window size is 2000000. Let `max_error_number` be 20000, that is, the user expects an error behavior of 20000 for every 2000000 lines. That is, the error rate is 1%. But because not every batch of tasks consumes 200000 rows, the actual range of the window is [2000000, 2200000], which is 10% statistical error. + + The error line does not include rows that are filtered out by the where condition. But include rows that do not have a partition in the corresponding Doris table. + +* data\_source\_properties + + The specific Kafka partition can be specified in `data_source_properties`. If not specified, all partitions of the subscribed topic are consumed by default. + + Note that when partition is explicitly specified, the load job will no longer dynamically detect changes to Kafka partition. If not specified, the partitions that need to be consumed are dynamically adjusted based on changes in the kafka partition. + +* strict\_mode + + Routine load load can turn on strict mode mode. The way to open it is to add ```"strict_mode" = "true"``` to job\_properties. The default strict mode is off. + + The strict mode mode means strict filtering of column type conversions during the load process. The strict filtering strategy is as follows: + + 1. For column type conversion, if strict mode is true, the wrong data will be filtered. The error data here refers to the fact that the original data is not null, and the result is a null value after participating in the column type conversion. + + 2. When a loaded column is generated by a function transformation, strict mode has no effect on it. + + 3. For a column type loaded with a range limit, if the original data can pass the type conversion normally, but cannot pass the range limit, strict mode will not affect it. For example, if the type is decimal(1,0) and the original data is 10, it is eligible for type conversion but not for column declarations. This data strict has no effect on it. + +* merge\_type + The type of data merging supports three types: APPEND, DELETE, and MERGE. APPEND is the default value, which means that all this batch of data needs to be appended to the existing data. DELETE means to delete all rows with the same key as this batch of data. MERGE semantics Need to be used in conjunction with the delete condition, which means that the data that meets the delete condition is processed according to DELETE semantics and the rest is processed according to APPEND semantics + +**strict mode and load relationship of source data** + +Here is an example of a column type of TinyInt. 
+ +> Note: When a column in a table allows a null value to be loaded + +|source data | source data example | string to int | strict_mode | result| +|------------|---------------------|-----------------|--------------------|---------| +|null | \N | N/A | true or false | NULL| +|not null | aaa or 2000 | NULL | true | invalid data(filtered)| +|not null | aaa | NULL | false | NULL| +|not null | 1 | 1 | true or false | correct data| + +Here the column type is Decimal(1,0) + +> Note: When a column in a table allows a null value to be loaded + +|source data | source data example | string to int | strict_mode | result| +|------------|---------------------|-----------------|--------------------|--------| +|null | \N | N/A | true or false | NULL| +|not null | aaa | NULL | true | invalid data(filtered)| +|not null | aaa | NULL | false | NULL| +|not null | 1 or 10 | 1 | true or false | correct data| + +> Note: 10 Although it is a value that is out of range, because its type meets the requirements of decimal, strict mode has no effect on it. 10 will eventually be filtered in other ETL processing flows. But it will not be filtered by strict mode. + +**Accessing SSL-certified Kafka clusters** + +Accessing the SSL-certified Kafka cluster requires the user to provide a certificate file (ca.pem) for authenticating the Kafka Broker public key. If the Kafka cluster has both client authentication enabled, you will also need to provide the client's public key (client.pem), key file (client.key), and key password. The files needed here need to be uploaded to Doris via the `CREAE FILE` command, ** and the catalog name is `kafka`**. See `HELP CREATE FILE;` for specific help on the `CREATE FILE` command. Here is an example: + +1. Upload file + + ``` + CREATE FILE "ca.pem" PROPERTIES("url" = "https://example_url/kafka-key/ca.pem", "catalog" = "kafka"); + CREATE FILE "client.key" PROPERTIES("url" = "https://example_urlkafka-key/client.key", "catalog" = "kafka"); + CREATE FILE "client.pem" PROPERTIES("url" = "https://example_url/kafka-key/client.pem", "catalog" = "kafka"); + ``` + +2. Create a routine load job + + ``` + CREATE ROUTINE LOAD db1.job1 on tbl1 + PROPERTIES + ( + "desired_concurrent_number"="1" + ) + FROM KAFKA + ( + "kafka_broker_list"= "broker1:9091,broker2:9091", + "kafka_topic" = "my_topic", + "property.security.protocol" = "ssl", + "property.ssl.ca.location" = "FILE:ca.pem", + "property.ssl.certificate.location" = "FILE:client.pem", + "property.ssl.key.location" = "FILE:client.key", + "property.ssl.key.password" = "abcdefg" + ); + ``` + +> Doris accesses Kafka clusters via Kafka's C++ API `librdkafka`. The parameters supported by `librdkafka` can be found. +> +> + +### Viewing the status of the load job + +Specific commands and examples for viewing the status of the ** job** can be viewed with the `HELP SHOW ROUTINE LOAD;` command. + +Specific commands and examples for viewing the **Task** status can be viewed with the `HELP SHOW ROUTINE LOAD TASK;` command. + +You can only view tasks that are currently running, and tasks that have ended and are not started cannot be viewed. + +### Alter job + +Users can modify jobs that have been created. Specific instructions can be viewed through the `HELP ALTER ROUTINE LOAD;` command. Or refer to [ALTER ROUTINE LOAD](../../sql-reference/sql-statements/Data%20Manipulation/alter-routine-load.md). + +### Job Control + +The user can control the stop, pause and restart of the job by the three commands `STOP/PAUSE/RESUME`. 
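For example, assuming a routine load job named `db1.job1` (the same name used in the creation example above), the control statements are typically written as in the following sketch:

```sql
-- Pause the job: consumption stops, but the job and its progress are retained.
PAUSE ROUTINE LOAD FOR db1.job1;

-- Resume a paused job so that it continues consuming from where it left off.
RESUME ROUTINE LOAD FOR db1.job1;

-- Stop the job permanently; a stopped job is cleaned up by the FE and cannot be resumed.
STOP ROUTINE LOAD FOR db1.job1;
```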
You can view help and examples with the three commands `HELP STOP ROUTINE LOAD;`, `HELP PAUSE ROUTINE LOAD;` and `HELP RESUME ROUTINE LOAD;`. + +## other instructions + +1. The relationship between a routine load job and an ALTER TABLE operation + + * Routine load does not block SCHEMA CHANGE and ROLLUP operations. Note, however, that if the column mappings are not matched after SCHEMA CHANGE is completed, the job's erroneous data will spike and eventually cause the job to pause. It is recommended to reduce this type of problem by explicitly specifying column mappings in routine load jobs and by adding Nullable columns or columns with Default values. + * Deleting a Partition of a table may cause the loaded data to fail to find the corresponding Partition and the job will be paused. + +2. Relationship between routine load jobs and other load jobs (LOAD, DELETE, INSERT) + + * Routine load does not conflict with other LOAD jobs and INSERT operations. + * When performing a DELETE operation, the corresponding table partition cannot have any load tasks being executed. Therefore, before performing the DELETE operation, you may need to pause the routine load job and wait for the delivered task to complete before you can execute DELETE. + +3. Relationship between routine load jobs and DROP DATABASE/TABLE operations + + When the corresponding database or table is deleted, the job will automatically CANCEL. + +4. The relationship between the kafka type routine load job and kafka topic + + When the user creates a routine load declaration, the `kafka_topic` does not exist in the kafka cluster. + + * If the broker of the user kafka cluster has `auto.create.topics.enable = true` set, `kafka_topic` will be automatically created first, and the number of partitions created automatically will be in the kafka cluster** of the user side. The broker is configured with `num.partitions`. The routine job will continue to read the data of the topic continuously. + * If the broker of the user kafka cluster has `auto.create.topics.enable = false` set, topic will not be created automatically, and the routine will be paused before any data is read, with the status `PAUSED`. + + So, if the user wants to be automatically created by the routine when the kafka topic does not exist, just set the broker in the kafka cluster** of the user's side to set auto.create.topics.enable = true` . + +5. Problems that may occur in the some environment + In some environments, there are isolation measures for network segment and domain name resolution. So should pay attention to: + 1. The broker list specified in the routine load task must be accessible on the doris environment. + 2. If `advertised.listeners` is configured in kafka, The addresses in `advertised.listeners` need to be accessible on the doris environment. + +6. About specified Partition and Offset + + Doris supports specifying Partition and Offset to start consumption. The new version also supports the consumption function at a specified time point. The configuration relationship of the corresponding parameters is explained here. + + There are three relevant parameters: + + * `kafka_partitions`: Specify the list of partitions to be consumed, such as: "0, 1, 2, 3". + * `kafka_offsets`: Specify the starting offset of each partition, which must correspond to the number of `kafka_partitions` lists. Such as: "1000, 1000, 2000, 2000" + * `property.kafka_default_offset`: Specify the default starting offset of the partition. 
+ + When creating an routine load job, these three parameters can have the following combinations: + + | Combinations | `kafka_partitions` | `kafka_offsets` | `property.kafka_default_offset` | Behavior | + |---|---|---|---|---| + |1| No | No | No | The system will automatically find all the partitions corresponding to the topic and start consumption from OFFSET_END | + |2| No | No | Yes | The system will automatically find all the partitions corresponding to the topic and start consumption from the position specified by the default offset | + |3| Yes | No | No | The system will start consumption from the OFFSET_END of the specified partition | + |4| Yes | Yes | No | The system will start consumption from the specified offset of the specified partition | + |5| Yes | No | Yes | The system will start consumption from the specified partition and the location specified by the default offset | + + 7. The difference between STOP and PAUSE + + the FE will automatically clean up stopped ROUTINE LOAD,while paused ROUTINE LOAD can be resumed + +## Related parameters + +Some system configuration parameters can affect the use of routine loads. + +1. max\_routine\_load\_task\_concurrent\_num + + The FE configuration item, which defaults to 5, can be modified at runtime. This parameter limits the maximum number of subtask concurrency for a routine load job. It is recommended to maintain the default value. If the setting is too large, it may cause too many concurrent tasks and occupy cluster resources. + +2. max\_routine_load\_task\_num\_per\_be + + The FE configuration item, which defaults to 5, can be modified at runtime. This parameter limits the number of subtasks that can be executed concurrently by each BE node. It is recommended to maintain the default value. If the setting is too large, it may cause too many concurrent tasks and occupy cluster resources. + +3. max\_routine\_load\_job\_num + + The FE configuration item, which defaults to 100, can be modified at runtime. This parameter limits the total number of routine load jobs, including NEED_SCHEDULED, RUNNING, PAUSE. After the overtime, you cannot submit a new assignment. + +4. max\_consumer\_num\_per\_group + + BE configuration item, the default is 3. This parameter indicates that up to several consumers are generated in a subtask for data consumption. For a Kafka data source, a consumer may consume one or more kafka partitions. Suppose a task needs to consume 6 kafka partitions, it will generate 3 consumers, and each consumer consumes 2 partitions. If there are only 2 partitions, only 2 consumers will be generated, and each consumer will consume 1 partition. + +5. push\_write\_mbytes\_per\_sec + + BE configuration item. The default is 10, which is 10MB/s. This parameter is to load common parameters, not limited to routine load jobs. This parameter limits the speed at which loaded data is written to disk. For high-performance storage devices such as SSDs, this speed limit can be appropriately increased. + +6. max\_tolerable\_backend\_down\_num + FE configuration item, the default is 0. Under certain conditions, Doris can reschedule PAUSED tasks, that becomes RUNNING?This parameter is 0, which means that rescheduling is allowed only when all BE nodes are in alive state. + +7. period\_of\_auto\_resume\_min + FE configuration item, the default is 5 mins. Doris reschedules will only try at most 3 times in the 5 minute period. If all 3 times fail, the current task will be locked, and auto-scheduling will not be performed. 
However, manual intervention can be performed. + +## More Help + +For more detailed syntax used by **Routine load**, you can enter `HELP ROUTINE LOAD` on the Mysql client command line for more help. + diff --git a/new-docs/en/data-operate/import/import-way/s3-load-manual.md b/new-docs/en/data-operate/import/import-way/s3-load-manual.md index 2e70130cc0..815c2ba2fa 100644 --- a/new-docs/en/data-operate/import/import-way/s3-load-manual.md +++ b/new-docs/en/data-operate/import/import-way/s3-load-manual.md @@ -1,7 +1,7 @@ --- { "title": "S3 Load", -"language": "zh-CN" +"language": "en" } --- @@ -25,3 +25,70 @@ under the License. --> # S3 Load + +Starting from version 0.14, Doris supports the direct import of data from online storage systems that support the S3 protocol through the S3 protocol. + +This document mainly introduces how to import data stored in AWS S3. It also supports the import of other object storage systems that support the S3 protocol, such as Baidu Cloud’s BOS, Alibaba Cloud’s OSS and Tencent Cloud’s COS, etc. + +## Applicable scenarios + +- Source data in S3 protocol accessible storage systems, such as S3, BOS. +- Data volumes range from tens to hundreds of GB. + +## Preparing + +1. Standard AK and SK First, you need to find or regenerate AWS `Access keys`, you can find the generation method in `My Security Credentials` of AWS console, as shown in the following figure: [AK_SK](https://doris.apache.org/images/aws_ak_sk.png) Select `Create New Access Key` and pay attention to save and generate AK and SK. +2. Prepare REGION and ENDPOINT REGION can be selected when creating the bucket or can be viewed in the bucket list. ENDPOINT can be found through REGION on the following page [AWS Documentation](https://docs.aws.amazon.com/general/latest/gr/s3.html#s3_region) + +Other cloud storage systems can find relevant information compatible with S3 in corresponding documents + +## Start Loading + +Like Broker Load just replace `WITH BROKER broker_name ()` with + +```text + WITH S3 + ( + "AWS_ENDPOINT" = "AWS_ENDPOINT", + "AWS_ACCESS_KEY" = "AWS_ACCESS_KEY", + "AWS_SECRET_KEY"="AWS_SECRET_KEY", + "AWS_REGION" = "AWS_REGION" + ) +``` + +example: + +```sql + LOAD LABEL example_db.exmpale_label_1 + ( + DATA INFILE("s3://your_bucket_name/your_file.txt") + INTO TABLE load_test + COLUMNS TERMINATED BY "," + ) + WITH S3 + ( + "AWS_ENDPOINT" = "AWS_ENDPOINT", + "AWS_ACCESS_KEY" = "AWS_ACCESS_KEY", + "AWS_SECRET_KEY"="AWS_SECRET_KEY", + "AWS_REGION" = "AWS_REGION" + ) + PROPERTIES + ( + "timeout" = "3600" + ); +``` + +## FAQ + +S3 SDK uses virtual-hosted style by default. However, some object storage systems may not be enabled or support virtual-hosted style access. At this time, we can add the `use_path_style` parameter to force the use of path style: + +```text + WITH S3 + ( + "AWS_ENDPOINT" = "AWS_ENDPOINT", + "AWS_ACCESS_KEY" = "AWS_ACCESS_KEY", + "AWS_SECRET_KEY"="AWS_SECRET_KEY", + "AWS_REGION" = "AWS_REGION", + "use_path_style" = "true" + ) +``` diff --git a/new-docs/en/data-operate/import/import-way/spark-load-manual.md b/new-docs/en/data-operate/import/import-way/spark-load-manual.md index 9990d43a73..a536efae9e 100644 --- a/new-docs/en/data-operate/import/import-way/spark-load-manual.md +++ b/new-docs/en/data-operate/import/import-way/spark-load-manual.md @@ -26,3 +26,607 @@ under the License. # Spark Load +Spark load realizes the preprocessing of load data by spark, improves the performance of loading large amount of Doris data and saves the computing resources of Doris cluster. 
It is mainly used for the scene of initial migration and large amount of data imported into Doris. + +Spark load is an asynchronous load method. Users need to create spark type load job by MySQL protocol and view the load results by `show load`. + +## Applicable scenarios + +* The source data is in a file storage system that spark can access, such as HDFS. + +* The data volume ranges from tens of GB to TB. + +## Explanation of terms + +1. Spark ETL: in the load process, it is mainly responsible for ETL of data, including global dictionary construction (bitmap type), partition, sorting, aggregation, etc. + +2. Broker: broker is an independent stateless process. It encapsulates the file system interface and provides the ability of Doris to read the files in the remote storage system. + +3. Global dictionary: it stores the data structure from the original value to the coded value. The original value can be any data type, while the encoded value is an integer. The global dictionary is mainly used in the scene of precise de duplication precomputation. + +## Basic principles + +### Basic process + +The user submits spark type load job by MySQL client, Fe records metadata and returns that the user submitted successfully. + +The implementation of spark load task is mainly divided into the following five stages. + + +1. Fe schedules and submits ETL tasks to spark cluster for execution. + +2. Spark cluster executes ETL to complete the preprocessing of load data. It includes global dictionary building (bitmap type), partitioning, sorting, aggregation, etc. + +3. After the ETL task is completed, Fe obtains the data path of each partition that has been preprocessed, and schedules the related be to execute the push task. + +4. Be reads data through broker and converts it into Doris underlying storage format. + +5. Fe schedule the effective version and complete the load job. + +``` + + + | 0. User create spark load job + +----v----+ + | FE |---------------------------------+ + +----+----+ | + | 3. FE send push tasks | + | 5. FE publish version | + +------------+------------+ | + | | | | ++---v---+ +---v---+ +---v---+ | +| BE | | BE | | BE | |1. FE submit Spark ETL job ++---^---+ +---^---+ +---^---+ | + |4. BE push with broker | | ++---+---+ +---+---+ +---+---+ | +|Broker | |Broker | |Broker | | ++---^---+ +---^---+ +---^---+ | + | | | | ++---+------------+------------+---+ 2.ETL +-------------v---------------+ +| HDFS +-------> Spark cluster | +| <-------+ | ++---------------------------------+ +-----------------------------+ + +``` + +## Global dictionary + +### Applicable scenarios + +At present, the bitmap column in Doris is implemented using the class library '`roaingbitmap`', while the input data type of '`roaringbitmap`' can only be integer. Therefore, if you want to pre calculate the bitmap column in the import process, you need to convert the type of input data to integer. + +In the existing Doris import process, the data structure of global dictionary is implemented based on hive table, which stores the mapping from original value to encoded value. + +### Build process + +1. Read the data from the upstream data source and generate a hive temporary table, which is recorded as `hive_table`. + +2. Extract the de duplicated values of the fields to be de duplicated from the `hive_table`, and generate a new hive table, which is marked as `distinct_value_table`. + +3. Create a new global dictionary table named `dict_table`; one column is the original value, and the other is the encoded value. + +4. 
Left join the `distinct_value_table` and `dict_table`, calculate the new de duplication value set, and then code this set with window function. At this time, the original value of the de duplication column will have one more column of encoded value. Finally, the data of these two columns will be written back to `dict_table`. + +5. Join the `dict_table` with the `hive_table` to replace the original value in the `hive_table` with the integer encoded value. + +6. `hive_table` will be read by the next data preprocessing process and imported into Doris after calculation. + +## Data preprocessing (DPP) + +### Basic process + +1. Read data from the data source. The upstream data source can be HDFS file or hive table. + +2. Map the read data, calculate the expression, and generate the bucket field `bucket_id` according to the partition information. + +3. Generate rolluptree according to rollup metadata of Doris table. + +4. Traverse rolluptree to perform hierarchical aggregation. The rollup of the next level can be calculated from the rollup of the previous level. + +5. After each aggregation calculation, the data will be calculated according to the `bucket_id`is divided into buckets and then written into HDFS. + +6. Subsequent brokers will pull the files in HDFS and import them into Doris be. + +## Hive Bitmap UDF + +Spark supports loading hive-generated bitmap data directly into Doris, see [hive-bitmap-udf documentation](../../ecosystem/external-table/hive-bitmap-udf.html) + +## Basic operation + +### Configure ETL cluster + +As an external computing resource, spark is used to complete ETL work in Doris. In the future, there may be other external resources that will be used in Doris, such as spark / GPU for query, HDFS / S3 for external storage, MapReduce for ETL, etc. Therefore, we introduce resource management to manage these external resources used by Doris. + +Before submitting the spark import task, you need to configure the spark cluster that performs the ETL task. + +Grammar: + +```sql +-- create spark resource +CREATE EXTERNAL RESOURCE resource_name +PROPERTIES +( + type = spark, + spark_conf_key = spark_conf_value, + working_dir = path, + broker = broker_name, + broker.property_key = property_value +) + +-- drop spark resource +DROP RESOURCE resource_name + +-- show resources +SHOW RESOURCES +SHOW PROC "/resources" + +-- privileges +GRANT USAGE_PRIV ON RESOURCE resource_name TO user_identity +GRANT USAGE_PRIV ON RESOURCE resource_name TO ROLE role_name + +REVOKE USAGE_PRIV ON RESOURCE resource_name FROM user_identity +REVOKE USAGE_PRIV ON RESOURCE resource_name FROM ROLE role_name +``` + +**Create resource** + +`resource_name` is the name of the spark resource configured in Doris. + +`Properties` are the parameters related to spark resources, as follows: + +- `type`: resource type, required. Currently, only spark is supported. + +- Spark related parameters are as follows: + + - `spark.master`: required, yarn is supported at present, `spark://host:port`. + + - `spark.submit.deployMode`: the deployment mode of Spark Program. It is required and supports cluster and client. + + - `spark.hadoop.yarn.resourcemanager.address`: required when master is yarn. + + - `spark.hadoop.fs.defaultfs`: required when master is yarn. + + - Other parameters are optional, refer to `http://spark.apache.org/docs/latest/configuration.html` + +- `working_dir`: directory used by ETL. Spark is required when used as an ETL resource. For example: `hdfs://host :port/tmp/doris`. + +- `broker`: the name of the broker. 
Spark is required when used as an ETL resource. You need to use the 'alter system add broker' command to complete the configuration in advance. + +- `broker.property_key`: the authentication information that the broker needs to specify when reading the intermediate file generated by ETL. + +Example: + +```sql +-- yarn cluster 模式 +CREATE EXTERNAL RESOURCE "spark0" +PROPERTIES +( + "type" = "spark", + "spark.master" = "yarn", + "spark.submit.deployMode" = "cluster", + "spark.jars" = "xxx.jar,yyy.jar", + "spark.files" = "/tmp/aaa,/tmp/bbb", + "spark.executor.memory" = "1g", + "spark.yarn.queue" = "queue0", + "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999", + "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000", + "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris", + "broker" = "broker0", + "broker.username" = "user0", + "broker.password" = "password0" +); + +-- spark standalone client 模式 +CREATE EXTERNAL RESOURCE "spark1" +PROPERTIES +( + "type" = "spark", + "spark.master" = "spark://127.0.0.1:7777", + "spark.submit.deployMode" = "client", + "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris", + "broker" = "broker1" +); +``` + +**Show resources** + +Ordinary accounts can only see the resources that they have `USAGE_PRIV` to use. + +The root and admin accounts can see all the resources. + +**Resource privilege** + +Resource permissions are managed by grant revoke. Currently, only `USAGE_PRIV` permission is supported. + +You can use the `USAGE_PRIV` permission is given to a user or a role, and the role is used the same as before. + +```sql +-- Grant permission to the spark0 resource to user user0 + +GRANT USAGE_PRIV ON RESOURCE "spark0" TO "user0"@"%"; + + +-- Grant permission to the spark0 resource to role ROLE0 + +GRANT USAGE_PRIV ON RESOURCE "spark0" TO ROLE "role0"; + + +-- Grant permission to all resources to user user0 + +GRANT USAGE_PRIV ON RESOURCE * TO "user0"@"%"; + + +-- Grant permission to all resources to role ROLE0 + +GRANT USAGE_PRIV ON RESOURCE * TO ROLE "role0"; + + +-- Revoke the spark0 resource permission of user user0 + +REVOKE USAGE_PRIV ON RESOURCE "spark0" FROM "user0"@"%"; + +``` + +### Configure spark client + +The Fe submits the spark task by executing the spark submit command. Therefore, it is necessary to configure the spark client for Fe. It is recommended to use the official version of spark 2 above 2.4.3, [download spark here](https://archive.apache.org/dist/spark/). After downloading, please follow the steps to complete the following configuration. + +**Configure SPARK_HOME environment variable** + +Place the spark client on the same machine as Fe and configure `spark_home_default_dir` in the `fe.conf`. This configuration item defaults to the `fe/lib/spark2x` path. This config cannot be empty. + +**Configure spark dependencies** + +Package all jar packages in jars folder under spark client root path into a zip file, and configure `spark_resource_patj` in `fe.conf` as this zip file's path. + +When the spark load task is submitted, this zip file will be uploaded to the remote repository, and the default repository path will be hung in `working_dir/{cluster_ID}` directory named as `__spark_repository__{resource_name}`, which indicates that a resource in the cluster corresponds to a remote warehouse. 
The directory structure of the remote warehouse is as follows: + +``` +__spark_repository__spark0/ + |-__archive_1.0.0/ + | |-__lib_990325d2c0d1d5e45bf675e54e44fb16_spark-dpp-1.0.0-jar-with-dependencies.jar + | |-__lib_7670c29daf535efe3c9b923f778f61fc_spark-2x.zip + |-__archive_1.1.0/ + | |-__lib_64d5696f99c379af2bee28c1c84271d5_spark-dpp-1.1.0-jar-with-dependencies.jar + | |-__lib_1bbb74bb6b264a270bc7fca3e964160f_spark-2x.zip + |-__archive_1.2.0/ + | |-... +``` + +In addition to spark dependency (named by `spark-2x.zip` by default), Fe will also upload DPP's dependency package to the remote repository. If all the dependency files submitted by spark load already exist in the remote repository, then there is no need to upload dependency, saving the time of repeatedly uploading a large number of files each time. + +### Configure yarn client + +The Fe obtains the running application status and kills the application by executing the yarn command. Therefore, you need to configure the yarn client for Fe. It is recommended to use the official version of Hadoop above 2.5.2, [download hadoop](https://archive.apache.org/dist/hadoop/common/). After downloading, please follow the steps to complete the following configuration. + +**Configure the yarn client path** + +Place the downloaded yarn client in the same machine as Fe, and configure `yarn_client_path` in the `fe.conf` as the executable file of yarn, which is set as the `fe/lib/yarn-client/hadoop/bin/yarn` by default. + +(optional) when Fe obtains the application status or kills the application through the yarn client, the configuration files required for executing the yarn command will be generated by default in the `lib/yarn-config` path in the Fe root directory. This path can be configured by configuring `yarn-config-dir` in the `fe.conf`. The currently generated configuration yarn config files include `core-site.xml` and `yarn-site.xml`. + +### Create Load + +Grammar: + +```sql +LOAD LABEL load_label + (data_desc, ...) + WITH RESOURCE resource_name resource_properties + [PROPERTIES (key1=value1, ... )] + +* load_label: + db_name.label_name + +* data_desc: + DATA INFILE ('file_path', ...) + [NEGATIVE] + INTO TABLE tbl_name + [PARTITION (p1, p2)] + [COLUMNS TERMINATED BY separator ] + [(col1, ...)] + [SET (k1=f1(xx), k2=f2(xx))] + [WHERE predicate] + +* resource_properties: + (key2=value2, ...) 
+``` + +Example 1: when the upstream data source is HDFS file + +```sql +LOAD LABEL db1.label1 +( + DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file1") + INTO TABLE tbl1 + COLUMNS TERMINATED BY "," + (tmp_c1,tmp_c2) + SET + ( + id=tmp_c2, + name=tmp_c1 + ), + DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file2") + INTO TABLE tbl2 + COLUMNS TERMINATED BY "," + (col1, col2) + where col1 > 1 +) +WITH RESOURCE 'spark0' +( + "spark.executor.memory" = "2g", + "spark.shuffle.compress" = "true" +) +PROPERTIES +( + "timeout" = "3600" +); + +``` + +Example 2: when the upstream data source is hive table + +```sql +step 1:新建hive外部表 +CREATE EXTERNAL TABLE hive_t1 +( + k1 INT, + K2 SMALLINT, + k3 varchar(50), + uuid varchar(100) +) +ENGINE=hive +properties +( +"database" = "tmp", +"table" = "t1", +"hive.metastore.uris" = "thrift://0.0.0.0:8080" +); + +step 2: 提交load命令 +LOAD LABEL db1.label1 +( + DATA FROM TABLE hive_t1 + INTO TABLE tbl1 + (k1,k2,k3) + SET + ( + uuid=bitmap_dict(uuid) + ) +) +WITH RESOURCE 'spark0' +( + "spark.executor.memory" = "2g", + "spark.shuffle.compress" = "true" +) +PROPERTIES +( + "timeout" = "3600" +); + +``` + +Example 3: when the upstream data source is hive binary type table + +```sql +step 1: create hive external table +CREATE EXTERNAL TABLE hive_t1 +( + k1 INT, + K2 SMALLINT, + k3 varchar(50), + uuid varchar(100) +) +ENGINE=hive +properties +( +"database" = "tmp", +"table" = "t1", +"hive.metastore.uris" = "thrift://0.0.0.0:8080" +); + +step 2: submit load command +LOAD LABEL db1.label1 +( + DATA FROM TABLE hive_t1 + INTO TABLE tbl1 + (k1,k2,k3) + SET + ( + uuid=binary_bitmap(uuid) + ) +) +WITH RESOURCE 'spark0' +( + "spark.executor.memory" = "2g", + "spark.shuffle.compress" = "true" +) +PROPERTIES +( + "timeout" = "3600" +); + +``` + +You can view the details syntax about creating load by input `help spark load`. This paper mainly introduces the parameter meaning and precautions in the creation and load syntax of spark load. + +**Label** + +Identification of the import task. Each import task has a unique label within a single database. The specific rules are consistent with `broker load`. + +**Data description parameters** + +Currently, the supported data sources are CSV and hive table. Other rules are consistent with `broker load`. + +**Load job parameters** + +Load job parameters mainly refer to the `opt_properties` in the spark load. Load job parameters are applied to the entire load job. The rules are consistent with `broker load`. + +**Spark resource parameters** + +Spark resources need to be configured into the Doris system in advance, and users should be given `USAGE_PRIV`. Spark load can only be used after priv permission. + +When users have temporary requirements, such as adding resources for tasks and modifying spark configs, you can set them here. The settings only take effect for this task and do not affect the existing configuration in the Doris cluster. + +```sql +WITH RESOURCE 'spark0' +( + "spark.driver.memory" = "1g", + "spark.executor.memory" = "3g" +) +``` + +**Load when data source is hive table** + +At present, if you want to use hive table as a data source in the import process, you need to create an external table of type hive, + +Then you can specify the table name of the external table when submitting the Load command. + +**Load process to build global dictionary** + +The data type applicable to the aggregate columns of the Doris table is of type bitmap. + +In the load command, you can specify the field to build a global dictionary. 
The format is: '```doris field name=bitmap_dict(hive_table field name)``` + +It should be noted that the construction of global dictionary is supported only when the upstream data source is hive table. + +**Load when data source is hive binary type table** + +The data type applicable to the aggregate column of the doris table is bitmap type, and the data type of the corresponding column in the hive table of the data source is binary (through the org.apache.doris.load.loadv2.dpp.BitmapValue (FE spark-dpp) class serialized) type. + +There is no need to build a global dictionary, just specify the corresponding field in the load command, the format is: ```doris field name=binary_bitmap (hive table field name)``` + +Similarly, the binary (bitmap) type of data import is currently only supported when the upstream data source is a hive table. + +### Show Load + +Spark load is asynchronous just like broker load, so the user must create the load label record and use label in the **show load command to view the load result**. The show load command is common in all load types. The specific syntax can be viewed by executing help show load. + +Example: + +```mysql +mysql> show load order by createtime desc limit 1\G +*************************** 1. row *************************** + JobId: 76391 + Label: label1 + State: FINISHED + Progress: ETL:100%; LOAD:100% + Type: SPARK + EtlInfo: unselected.rows=4; dpp.abnorm.ALL=15; dpp.norm.ALL=28133376 + TaskInfo: cluster:cluster0; timeout(s):10800; max_filter_ratio:5.0E-5 + ErrorMsg: N/A + CreateTime: 2019-07-27 11:46:42 + EtlStartTime: 2019-07-27 11:46:44 + EtlFinishTime: 2019-07-27 11:49:44 + LoadStartTime: 2019-07-27 11:49:44 +LoadFinishTime: 2019-07-27 11:50:16 + URL: http://1.1.1.1:8089/proxy/application_1586619723848_0035/ + JobDetails: {"ScannedRows":28133395,"TaskNumber":1,"FileNumber":1,"FileSize":200000} +``` + +Refer to broker load for the meaning of parameters in the returned result set. The differences are as follows: + ++ State + +The current phase of the load job. After the job is submitted, the status is pending. After the spark ETL is submitted, the status changes to ETL. After ETL is completed, Fe schedules be to execute push operation, and the status changes to finished after the push is completed and the version takes effect. + +There are two final stages of the load job: cancelled and finished. When the load job is in these two stages, the load is completed. Among them, cancelled is load failure, finished is load success. + ++ Progress + +Progress description of the load job. There are two kinds of progress: ETL and load, corresponding to the two stages of the load process, ETL and loading. + +The progress range of load is 0 ~ 100%. + +```Load progress = the number of tables that have completed all replica imports / the total number of tables in this import task * 100%``` + +**If all load tables are loaded, the progress of load is 99%**, the load enters the final effective stage. After the whole load is completed, the load progress will be changed to 100%. + +The load progress is not linear. Therefore, if the progress does not change over a period of time, it does not mean that the load is not in execution. + ++ Type + +Type of load job. Spark load is spark. + ++ CreateTime/EtlStartTime/EtlFinishTime/LoadStartTime/LoadFinishTime + +These values represent the creation time of the load, the start time of the ETL phase, the completion time of the ETL phase, the start time of the loading phase, and the completion time of the entire load job. 
+
++ JobDetails
+
+Displays the detailed running status of the job, which is updated when the ETL ends. It includes the number of loaded files, the total size (bytes), the number of subtasks, the number of processed original rows, etc.
+
+```{"ScannedRows":139264,"TaskNumber":1,"FileNumber":1,"FileSize":940754064}```
+
++ URL
+
+Copy this URL into a browser to jump to the web interface of the corresponding application.
+
+### View spark launcher commit log
+
+Sometimes users need to view the detailed logs generated during the spark submission process. The logs are saved in `log/spark_launcher_log` under the Fe root directory and are named `spark_launcher_{load_job_id}_{label}.log`. The logs are kept in this directory for a period of time; when the load information in the Fe metadata is cleaned up, the corresponding log is also cleaned. The default retention time is 3 days.
+
+### Cancel Load
+
+When the spark load job status is not cancelled or finished, it can be manually cancelled by the user. When cancelling, you need to specify the label of the load job to be cancelled. The syntax of the cancel load command can be viewed by executing `help cancel load`.
+
+## Related system configuration
+
+### FE configuration
+
+The following configuration belongs to the system-level configuration of spark load, that is, the configuration that applies to all spark load import tasks. The configuration values can be modified in `fe.conf`.
+
++ `enable_spark_load`
+
+Enables spark load and resource creation. The default value is false, which means the feature is turned off.
+
++ `spark_load_default_timeout_second`
+
+The default timeout for tasks is 259200 seconds (3 days).
+
++ `spark_home_default_dir`
+
+Spark client path (`fe/lib/spark2x`).
+
++ `spark_resource_path`
+
+The path of the packaged spark dependency file (empty by default).
+
++ `spark_launcher_log_dir`
+
+The directory where the spark client's commit log is stored (`fe/log/spark_launcher_log`).
+
++ `yarn_client_path`
+
+The path of the yarn binary executable file (`fe/lib/yarn-client/hadoop/bin/yarn`).
+
++ `yarn_config_dir`
+
+The path where the yarn configuration files are generated (`fe/lib/yarn-config`).
+
+## Best practices
+
+### Application scenarios
+
+The most suitable scenario for spark load is that the raw data is in a file system (HDFS) and the amount of data is from tens of GB to TB. Stream load or broker load is recommended for smaller amounts of data.
+
+## FAQ
+
+* When using spark load, the `HADOOP_CONF_DIR` environment variable is not set in `spark-env.sh`.
+
+If the `HADOOP_CONF_DIR` environment variable is not set, the error `When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment` will be reported.
+
+* When using spark load, `spark_home_default_dir` is not specified correctly.
+
+The spark-submit command is used when submitting a spark job. If `spark_home_default_dir` is set incorrectly, an error `Cannot run program 'xxx/bin/spark_submit', error = 2, no such file or directory` will be reported.
+
+* When using spark load, `spark_resource_path` does not point to the packaged zip file.
+
+If `spark_resource_path` is not set correctly, an error `file XXX/jars/spark-2x.zip does not exist` will be reported.
+
+* When using spark load, `yarn_client_path` does not point to an executable yarn file.
+
+If `yarn_client_path` is not set correctly, an error `yarn client does not exist in path: XXX/yarn-client/hadoop/bin/yarn` will be reported.
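+
+Most of the errors above trace back to the FE configuration items listed in this section. A minimal `fe.conf` sketch is shown below for orientation only; the paths are illustrative placeholders and must be adjusted to where the spark and yarn clients are actually deployed:
+
+```
+# fe.conf (illustrative values only)
+enable_spark_load = true
+spark_home_default_dir = /opt/doris/fe/lib/spark2x
+spark_resource_path = /opt/doris/fe/lib/spark2x/jars/spark-2x.zip
+yarn_client_path = /opt/doris/fe/lib/yarn-client/hadoop/bin/yarn
+spark_load_default_timeout_second = 259200
+```
+
+After editing these items in `fe.conf`, the FE generally needs to be restarted for them to take effect.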
+ +## More Help + +For more detailed syntax used by **Spark Load**, you can enter `HELP SPARK LOAD` on the Mysql client command line for more help. diff --git a/new-docs/en/data-operate/import/import-way/stream-load-manual.md b/new-docs/en/data-operate/import/import-way/stream-load-manual.md index 7b6598d310..be48651550 100644 --- a/new-docs/en/data-operate/import/import-way/stream-load-manual.md +++ b/new-docs/en/data-operate/import/import-way/stream-load-manual.md @@ -26,5 +26,351 @@ under the License. # Stream load - +Stream load is a synchronous way of importing. Users import local files or data streams into Doris by sending HTTP protocol requests. Stream load synchronously executes the import and returns the import result. Users can directly determine whether the import is successful by the return body of the request. +Stream load is mainly suitable for importing local files or data from data streams through procedures. + +## Basic Principles + +The following figure shows the main flow of Stream load, omitting some import details. + +``` + ^ + + | | + | | 1A. User submit load to FE + | | + | +--v-----------+ + | | FE | +5. Return result to user | +--+-----------+ + | | + | | 2. Redirect to BE + | | + | +--v-----------+ + +---+Coordinator BE| 1B. User submit load to BE + +-+-----+----+-+ + | | | + +-----+ | +-----+ + | | | 3. Distrbute data + | | | + +-v-+ +-v-+ +-v-+ + |BE | |BE | |BE | + +---+ +---+ +---+ +``` + +In Stream load, Doris selects a node as the Coordinator node. This node is responsible for receiving data and distributing data to other data nodes. + +Users submit import commands through HTTP protocol. If submitted to FE, FE forwards the request to a BE via the HTTP redirect instruction. Users can also submit import commands directly to a specified BE. + +The final result of the import is returned to the user by Coordinator BE. + +## Support data format + +Currently Stream Load supports two data formats: CSV (text) and JSON + +## Basic operations +### Create Load + +Stream load submits and transfers data through HTTP protocol. Here, the `curl` command shows how to submit an import. + +Users can also operate through other HTTP clients. + +``` +curl --location-trusted -u user:passwd [-H ""...] -T data.file -XPUT http://fe_host:http_port/api/{db}/{table}/_stream_load + +The properties supported in the header are described in "Load Parameters" below +The format is: - H "key1: value1" +``` + +Examples: + +``` +curl --location-trusted -u root -T date -H "label:123" http://abc.com:8030/api/test/date/_stream_load +``` +The detailed syntax for creating imports helps to execute ``HELP STREAM LOAD`` view. The following section focuses on the significance of creating some parameters of Stream load. + +**Signature parameters** + ++ user/passwd + + Stream load uses the HTTP protocol to create the imported protocol and signs it through the Basic Access authentication. The Doris system verifies user identity and import permissions based on signatures. + +**Load Parameters** + +Stream load uses HTTP protocol, so all parameters related to import tasks are set in the header. The significance of some parameters of the import task parameters of Stream load is mainly introduced below. + ++ label + + Identity of import task. Each import task has a unique label inside a single database. Label is a user-defined name in the import command. With this label, users can view the execution of the corresponding import task. 
+ + Another function of label is to prevent users from importing the same data repeatedly. **It is strongly recommended that users use the same label for the same batch of data. This way, repeated requests for the same batch of data will only be accepted once, guaranteeing at-Most-Once** + + When the corresponding import operation state of label is CANCELLED, the label can be used again. + + ++ column_separator + + Used to specify the column separator in the load file. The default is `\t`. If it is an invisible character, you need to add `\x` as a prefix and hexadecimal to indicate the separator. + + For example, the separator `\x01` of the hive file needs to be specified as `-H "column_separator:\x01"`. + + You can use a combination of multiple characters as the column separator. + ++ line_delimiter + + Used to specify the line delimiter in the load file. The default is `\n`. + + You can use a combination of multiple characters as the column separator. + ++ max\_filter\_ratio + + The maximum tolerance rate of the import task is 0 by default, and the range of values is 0-1. When the import error rate exceeds this value, the import fails. + + If the user wishes to ignore the wrong row, the import can be successful by setting this parameter greater than 0. + + The calculation formula is as follows: + + ``` (dpp.abnorm.ALL / (dpp.abnorm.ALL + dpp.norm.ALL ) ) > max_filter_ratio ``` + + ``` dpp.abnorm.ALL``` denotes the number of rows whose data quality is not up to standard. Such as type mismatch, column mismatch, length mismatch and so on. + + ``` dpp.norm.ALL ``` refers to the number of correct data in the import process. The correct amount of data for the import task can be queried by the ``SHOW LOAD` command. + +The number of rows in the original file = `dpp.abnorm.ALL + dpp.norm.ALL` + ++ where + + Import the filter conditions specified by the task. Stream load supports filtering of where statements specified for raw data. The filtered data will not be imported or participated in the calculation of filter ratio, but will be counted as `num_rows_unselected`. + ++ partition + + Partition information for tables to be imported will not be imported if the data to be imported does not belong to the specified Partition. These data will be included in `dpp.abnorm.ALL`. + ++ columns + + The function transformation configuration of data to be imported includes the sequence change of columns and the expression transformation, in which the expression transformation method is consistent with the query statement. + + ``` + Examples of column order transformation: There are three columns of original data (src_c1,src_c2,src_c3), and there are also three columns (dst_c1,dst_c2,dst_c3) in the doris table at present. 
+    when the first column src_c1 of the original file corresponds to the dst_c1 column of the target table, while the second column src_c2 of the original file corresponds to the dst_c2 column of the target table and the third column src_c3 of the original file corresponds to the dst_c3 column of the target table, it is written as follows:
+    columns: dst_c1, dst_c2, dst_c3
+
+    when the first column src_c1 of the original file corresponds to the dst_c2 column of the target table, while the second column src_c2 of the original file corresponds to the dst_c3 column of the target table and the third column src_c3 of the original file corresponds to the dst_c1 column of the target table, it is written as follows:
+    columns: dst_c2, dst_c3, dst_c1
+
+    Example of expression transformation: There are two columns in the original file and two columns in the target table (c1, c2). However, both columns in the original file need to be transformed by functions to correspond to the two columns in the target table.
+    columns: tmp_c1, tmp_c2, c1 = year(tmp_c1), c2 = month(tmp_c2)
+    tmp_* is a placeholder, representing the two original columns in the original file.
+    ```
+
++ exec\_mem\_limit
+
+    Memory limit. The default is 2GB. The unit is bytes.
+
++ merge\_type
+
+    The type of data merging. Three types are supported: APPEND, DELETE, and MERGE. APPEND is the default value, which means that all of this batch of data needs to be appended to the existing data. DELETE means to delete all rows with the same key as this batch of data. The MERGE semantics need to be used in conjunction with the delete condition, which means that the data that meets the delete condition is processed according to DELETE semantics and the rest is processed according to APPEND semantics.
+
++ two\_phase\_commit
+
+    Stream load supports the two-phase commit mode. The mode can be enabled by declaring ```two_phase_commit=true``` in the HTTP header. This mode is disabled by default.
+    The two-phase commit mode means that during Stream load, after the data is written, a message is returned to the client; at this point the data is invisible and the transaction status is PRECOMMITTED. The data becomes visible only after the client triggers COMMIT.
+
+    1. The user can invoke the following interface to trigger a commit operation for the transaction:
+    ```
+    curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:commit" http://fe_host:http_port/api/{db}/_stream_load_2pc
+    ```
+    or
+    ```
+    curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:commit" http://be_host:webserver_port/api/{db}/_stream_load_2pc
+    ```
+
+    2. The user can invoke the following interface to trigger an abort operation for the transaction:
+    ```
+    curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:abort" http://fe_host:http_port/api/{db}/_stream_load_2pc
+    ```
+    or
+    ```
+    curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:abort" http://be_host:webserver_port/api/{db}/_stream_load_2pc
+    ```
+
+### Return results
+
+Since Stream load is a synchronous import method, the import result is directly returned to the user by the return value of the import request.
+ +Examples: + +``` +{ + "TxnId": 1003, + "Label": "b6f3bc78-0d2c-45d9-9e4c-faa0a0149bee", + "Status": "Success", + "ExistingJobStatus": "FINISHED", // optional + "Message": "OK", + "NumberTotalRows": 1000000, + "NumberLoadedRows": 1000000, + "NumberFilteredRows": 1, + "NumberUnselectedRows": 0, + "LoadBytes": 40888898, + "LoadTimeMs": 2144, + "BeginTxnTimeMs": 1, + "StreamLoadPutTimeMs": 2, + "ReadDataTimeMs": 325, + "WriteDataTimeMs": 1933, + "CommitAndPublishTimeMs": 106, + "ErrorURL": "http://192.168.1.1:8042/api/_load_error_log?file=__shard_0/error_log_insert_stmt_db18266d4d9b4ee5-abb00ddd64bdf005_db18266d4d9b4ee5_abb00ddd64bdf005" +} +``` + +The following main explanations are given for the Stream load import result parameters: + ++ TxnId: The imported transaction ID. Users do not perceive. + ++ Label: Import Label. User specified or automatically generated by the system. + ++ Status: Import completion status. + + "Success": Indicates successful import. + + "Publish Timeout": This state also indicates that the import has been completed, except that the data may be delayed and visible without retrying. + + "Label Already Exists": Label duplicate, need to be replaced Label. + + "Fail": Import failed. + ++ ExistingJobStatus: The state of the load job corresponding to the existing Label. + + This field is displayed only when the status is "Label Already Exists". The user can know the status of the load job corresponding to Label through this state. "RUNNING" means that the job is still executing, and "FINISHED" means that the job is successful. + ++ Message: Import error messages. + ++ NumberTotalRows: Number of rows imported for total processing. + ++ NumberLoadedRows: Number of rows successfully imported. + ++ NumberFilteredRows: Number of rows that do not qualify for data quality. + ++ NumberUnselectedRows: Number of rows filtered by where condition. + ++ LoadBytes: Number of bytes imported. + ++ LoadTimeMs: Import completion time. Unit milliseconds. + ++ BeginTxnTimeMs: The time cost for RPC to Fe to begin a transaction, Unit milliseconds. + ++ StreamLoadPutTimeMs: The time cost for RPC to Fe to get a stream load plan, Unit milliseconds. + ++ ReadDataTimeMs: Read data time, Unit milliseconds. + ++ WriteDataTimeMs: Write data time, Unit milliseconds. + ++ CommitAndPublishTimeMs: The time cost for RPC to Fe to commit and publish a transaction, Unit milliseconds. + ++ ErrorURL: If you have data quality problems, visit this URL to see specific error lines. + +> Note: Since Stream load is a synchronous import mode, import information will not be recorded in Doris system. Users cannot see Stream load asynchronously by looking at import commands. You need to listen for the return value of the create import request to get the import result. + +### Cancel Load + +Users can't cancel Stream load manually. Stream load will be cancelled automatically by the system after a timeout or import error. + +## Relevant System Configuration + +### FE configuration + ++ stream\_load\_default\_timeout\_second + + The timeout time of the import task (in seconds) will be cancelled by the system if the import task is not completed within the set timeout time, and will become CANCELLED. + + At present, Stream load does not support custom import timeout time. All Stream load import timeout time is uniform. The default timeout time is 300 seconds. If the imported source file can no longer complete the import within the specified time, the FE parameter ```stream_load_default_timeout_second``` needs to be adjusted. 
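+
+For example, a rough sketch of raising the global default timeout (the value 1800 below is only an illustration, not a recommendation):
+
+```
+FE conf
+stream_load_default_timeout_second = 1800
+```
+
+If the FE supports changing this item at runtime, it can also be adjusted with `ADMIN SET FRONTEND CONFIG ("stream_load_default_timeout_second" = "1800");` from the MySQL client; otherwise restart the FE after editing `fe.conf`.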
+
+### BE configuration
+
++ streaming\_load\_max\_mb
+
+  The maximum import size of Stream load, 10G by default, in MB. If the user's original file exceeds this value, the BE parameter ```streaming_load_max_mb``` needs to be adjusted.
+
+## Best Practices
+
+### Application scenarios
+
+The most appropriate scenario for using Stream load is that the original file is in memory or on disk. Secondly, since Stream load is a synchronous import method, users can also use this import method if they want to obtain the import results synchronously.
+
+### Data volume
+
+Since Stream load imports and distributes data through a single coordinator BE, the recommended amount of imported data is between 1G and 10G. Since the default maximum Stream load import data volume is 10G, the BE configuration ```streaming_load_max_mb``` needs to be modified if files exceeding 10G are to be imported.
+
+```
+For example, the size of the file to be imported is 15G
+Modify the BE configuration streaming_load_max_mb to 16000
+```
+
+The default timeout of Stream load is 300 seconds. Based on Doris's current maximum import speed limit, the default import task timeout needs to be modified for files larger than about 3G.
+
+```
+Import task timeout = import data volume / 10M/s (the specific average import speed needs to be calculated by the user based on the cluster conditions)
+For example, import a 10G file
+Timeout = 1000s = 10G / 10M/s
+```
+
+### Complete examples
+
+Data situation: the data to be imported is in the local disk path /home/store_sales of the machine that sends the import request; the data size is about 15G, and it is to be imported into the table store\_sales of the database bj_sales.
+
+Cluster situation: The concurrency of Stream load is not affected by cluster size.
+
++ Step 1: Does the import file size exceed the default maximum import size of 10G?
+
+  ```
+  BE conf
+  streaming_load_max_mb = 16000
+  ```
+
++ Step 2: Calculate whether the approximate import time exceeds the default timeout value
+
+  ```
+  Import time = 15000 / 10 = 1500s
+  This exceeds the default timeout, so the FE configuration needs to be modified
+  stream_load_default_timeout_second = 1500
+  ```
+
++ Step 3: Create the import task
+
+  ```
+  curl --location-trusted -u user:password -T /home/store_sales -H "label:abc" http://abc.com:8000/api/bj_sales/store_sales/_stream_load
+  ```
+
+## Common Questions
+
+* Label Already Exists
+
+  The Label duplication checking steps of Stream load are as follows:
+
+  1. Is there a Label conflict with an import job that already exists from another import method?
+
+  Because imported Labels in the Doris system do not distinguish between import methods, other import methods may have used the same Label.
+
+  Through `SHOW LOAD WHERE LABEL = "xxx"`, where xxx is the duplicated Label string, check whether there is already a FINISHED import whose Label is the same as the one created by the user.
+
+  2. Are Stream loads submitted repeatedly for the same job?
+
+  Since Stream load creates import tasks via the HTTP protocol, HTTP clients in various languages usually have their own request retry logic. After receiving the first request, the Doris system has already started to process the Stream load, but because the result is not returned to the client in time, the client may retry and create the same request. At this point, the Doris system is already processing the first request, so the second request will report Label Already Exists.
+
+  To sort out the possible causes mentioned above: search the FE Master's log with the Label to see whether there are two `redirect load action to destination=` entries for the same Label. If so, the request was submitted repeatedly by the client side.
+
+  It is recommended that the user calculate the approximate import time based on the amount of data in the current request, and change the request timeout on the client side to a value greater than the import timeout, to avoid the client submitting the request multiple times.
+
+  3. Connection reset exception
+
+  In community version 0.14.0 and earlier, a connection reset exception may occur after HTTP V2 is enabled, because the built-in web container is Tomcat, and Tomcat has a problem with its implementation of 307 (Temporary Redirect). When Stream load is used to import a large amount of data, a connect reset exception occurs: Tomcat starts the data transmission before the 307 redirect, which results in the BE receiving the data request without authentication information. Later, changing the built-in container to Jetty solved this problem. If you encounter this problem, please upgrade your Doris or disable HTTP V2 (`enable_http_server_v2=false`).
+
+  After the upgrade, also upgrade the HTTP client of your program to version `4.5.13` by introducing the following dependency in your pom.xml file:
+
+  ```xml
+  <dependency>
+      <groupId>org.apache.httpcomponents</groupId>
+      <artifactId>httpclient</artifactId>
+      <version>4.5.13</version>
+  </dependency>
+  ```
+
+## More Help
+
+For more detailed syntax used by **Stream Load**, you can enter `HELP STREAM LOAD` on the Mysql client command line for more help.
diff --git a/new-docs/en/data-operate/update-delete/batch-delete-manual.md b/new-docs/en/data-operate/update-delete/batch-delete-manual.md
index 1efc2bcb05..540ffe22ce 100644
--- a/new-docs/en/data-operate/update-delete/batch-delete-manual.md
+++ b/new-docs/en/data-operate/update-delete/batch-delete-manual.md
@@ -33,41 +33,51 @@ There are three ways to merge data import:
 2. DELETE: delete all rows with the same key column value as the imported data
 3. MERGE: APPEND or DELETE according to DELETE ON decision
 
-## Principle
+## Fundamental
 
 This is achieved by adding a hidden column `__DORIS_DELETE_SIGN__`, because we are only doing batch deletion on the unique model, so we only need to add a hidden column whose type is bool and the aggregate function is replace. In be, the various aggregation write processes are the same as normal columns, and there are two read schemes:
 
-Remove `__DORIS_DELETE_SIGN__` when fe encounters extensions such as *, and add the condition of `__DORIS_DELETE_SIGN__ != true` by default
-When be reads, a column is added for judgment, and the condition is used to determine whether to delete.
+Remove `__DORIS_DELETE_SIGN__` when fe encounters extensions such as *, and add the condition of `__DORIS_DELETE_SIGN__ != true` by default; when be reads, a column is added for judgment, and the condition is used to determine whether to delete.
 
 ### Import
 
-When importing, set the value of the hidden column to the value of the `DELETE ON` expression during fe parsing. The other aggregation behaviors are the same as the replace aggregation column
+When importing, set the value of the hidden column to the value of the `DELETE ON` expression during fe parsing. The other aggregation behaviors are the same as the replace aggregation column.
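+
+To see the effect of this hidden column after an import, hidden columns can be exposed in a session and queried directly. A minimal sketch (the table and column names are borrowed from the usage example later in this document):
+
+```sql
+-- Illustrative only: expose hidden columns and inspect the delete sign written by the load
+SET show_hidden_columns = true;
+SELECT siteid, citycode, username, pv, __DORIS_DELETE_SIGN__ FROM table1;
+-- With show_hidden_columns off, FE rewrites normal queries to add the filter
+--   WHERE __DORIS_DELETE_SIGN__ != true
+```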
### Read -When reading, add the condition of `__DORIS_DELETE_SIGN__ != true` to all olapScanNodes with hidden columns, be does not perceive this process and executes normally +When reading, add the condition of `__DORIS_DELETE_SIGN__ != true` to all olapScanNodes with hidden columns, be does not perceive this process and executes normally. ### Cumulative Compaction -In Cumulative Compaction, hidden columns are treated as normal columns, and the compaction logic remains unchanged +In Cumulative Compaction, hidden columns are treated as normal columns, and the compaction logic remains unchanged. ### Base Compaction -In Base Compaction, delete the rows marked for deletion to reduce the space occupied by data +In Base Compaction, delete the rows marked for deletion to reduce the space occupied by data. -### Syntax -The import syntax design is mainly to add a column mapping that specifies the field of the delete mark column, and this column needs to be added to the imported data. The method of setting each import method is as follows +## Enable bulk delete support -#### stream load +There are two ways of enabling batch delete support: -The wording of stream load adds a field to set the delete mark column in the columns field in the header. Example +1. By adding `enable_batch_delete_by_default=true` in the fe configuration file, all newly created tables after restarting fe support batch deletion, this option defaults to false + +2. For tables that have not changed the above fe configuration or for existing tables that do not support the bulk delete function, you can use the following statement: + `ALTER TABLE tablename ENABLE FEATURE "BATCH_DELETE"` to enable the batch delete. + +If you want to determine whether a table supports batch delete, you can set a session variable to display the hidden columns `SET show_hidden_columns=true`, and then use `desc tablename`, if there is a `__DORIS_DELETE_SIGN__` column in the output, it is supported, if not, it is not supported + +### Syntax Description +The syntax design of the import is mainly to add a column mapping that specifies the field of the delete marker column, and it is necessary to add a column to the imported data. The syntax of various import methods is as follows: + +#### Stream Load + +The writing method of `Stream Load` adds a field to set the delete label column in the columns field in the header. Example `-H "columns: k1, k2, label_c3" -H "merge_type: [MERGE|APPEND|DELETE]" -H "delete: label_c3=1"` -#### broker load +#### Broker Load -Set the field to delete the mark column at `PROPERTIES` +The writing method of `Broker Load` sets the field of the delete marker column at `PROPERTIES`, the syntax is as follows: -``` +```sql LOAD LABEL db1.label1 ( [MERGE|APPEND|DELETE] DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file1") @@ -79,8 +89,7 @@ LOAD LABEL db1.label1 id=tmp_c2, name=tmp_c1, ) - [DELETE ON label=true] - + [DELETE ON label_c3=true] ) WITH BROKER'broker' ( @@ -90,16 +99,15 @@ WITH BROKER'broker' PROPERTIES ( "timeout" = "3600" - ); ``` -#### routine load +#### Routine Load -Routine load adds a mapping to the `columns` field. The mapping method is the same as above, the example is as follows +The writing method of `Routine Load` adds a mapping to the `columns` field. The mapping method is the same as above. 
The syntax is as follows: -``` +```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl [WITH MERGE|APPEND|DELETE] COLUMNS(k1, k2, k3, v1, v2, label), @@ -122,34 +130,46 @@ Routine load adds a mapping to the `columns` field. The mapping method is the sa ); ``` -## Enable bulk delete support -There are two ways of enabling batch delete support: -1. By adding `enable_batch_delete_by_default=true` in the fe configuration file, all newly created tables after restarting fe support batch deletion, this option defaults to false - -2. For tables that have not changed the above fe configuration or for existing tables that do not support the bulk delete function, you can use the following statement: -`ALTER TABLE tablename ENABLE FEATURE "BATCH_DELETE"` to enable the batch delete. - -If you want to determine whether a table supports batch delete, you can set a session variable to display the hidden columns `SET show_hidden_columns=true`, and then use `desc tablename`, if there is a `__DORIS_DELETE_SIGN__` column in the output, it is supported, if not, it is not supported ## Note -1. Since import operations other than stream load may be executed out of order inside doris, if it is not stream load when importing using the `MERGE` method, it needs to be used with load sequence. For the specific syntax, please refer to the sequence column related documents -2. `DELETE ON` condition can only be used with MERGE +1. Since import operations other than stream load may be executed out of order inside doris, if it is not stream load when importing using the `MERGE` method, it needs to be used with load sequence. For the specific syntax, please refer to the [sequence](sequence-column-manual.html) column related documents +2. `DELETE ON` condition can only be used with MERGE. ## Usage example -Let's take stream load as an example to show how to use it -1. Import data normally: + +### Check if bulk delete support is enabled + +```sql +mysql> SET show_hidden_columns=true; +Query OK, 0 rows affected (0.00 sec) + +mysql> DESC test; ++-----------------------+--------------+------+-------+---------+---------+ +| Field | Type | Null | Key | Default | Extra | ++-----------------------+--------------+------+-------+---------+---------+ +| name | VARCHAR(100) | No | true | NULL | | +| gender | VARCHAR(10) | Yes | false | NULL | REPLACE | +| age | INT | Yes | false | NULL | REPLACE | +| __DORIS_DELETE_SIGN__ | TINYINT | No | false | 0 | REPLACE | ++-----------------------+--------------+------+-------+---------+---------+ +4 rows in set (0.00 sec) ``` + +### Stream Load usage example + +1. Import data normally: +```shell curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -H "merge_type: APPEND" -T ~/table1_data http://127.0.0.1: 8130/api/test/table1/_stream_load ``` The APPEND condition can be omitted, which has the same effect as the following statement: -``` +```shell curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -T ~/table1_data http://127.0.0.1:8130/api/test/table1 /_stream_load ``` 2. 
Delete all data with the same key as the imported data -``` +```shell curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -H "merge_type: DELETE" -T ~/table1_data http://127.0.0.1: 8130/api/test/table1/_stream_load ``` Before load: -``` +```sql +--------+----------+----------+------+ | siteid | citycode | username | pv | +--------+----------+----------+------+ @@ -161,9 +181,9 @@ Before load: Load data: ``` 3,2,tom,0 -``` -After load: ``` +After load: +```sql +--------+----------+----------+------+ | siteid | citycode | username | pv | +--------+----------+----------+------+ @@ -172,11 +192,11 @@ After load: +--------+----------+----------+------+ ``` 3. Import the same row as the key column of the row with `site_id=1` -``` +```shell curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -H "merge_type: MERGE" -H "delete: siteid=1" -T ~/ table1_data http://127.0.0.1:8130/api/test/table1/_stream_load ``` Before load: -``` +```sql +--------+----------+----------+------+ | siteid | citycode | username | pv | +--------+----------+----------+------+ @@ -192,7 +212,7 @@ Load data: 1,1,jim,2 ``` After load: -``` +```sql +--------+----------+----------+------+ | siteid | citycode | username | pv | +--------+----------+----------+------+ @@ -202,3 +222,4 @@ After load: | 5 | 3 | helen | 3 | +--------+----------+----------+------+ ``` + diff --git a/new-docs/en/data-operate/update-delete/delete-manual.md b/new-docs/en/data-operate/update-delete/delete-manual.md index 4067bff76e..fb9ac65c46 100644 --- a/new-docs/en/data-operate/update-delete/delete-manual.md +++ b/new-docs/en/data-operate/update-delete/delete-manual.md @@ -27,49 +27,12 @@ under the License. # Delete -Unlike other import methods, delete is a synchronization process. Similar to insert into, all delete operations are an independent import job in Doris. Generally, delete statements need to specify tables, partitions and delete conditions to tell which data to be deleted, and the data on base index and rollup index will be deleted at the same time. +Delete is different from other import methods. It is a synchronization process, similar to Insert into. All Delete operations are an independent import job in Doris. Generally, the Delete statement needs to specify the table and partition and delete conditions to filter the data to be deleted. , and will delete the data of the base table and the rollup table at the same time. ## Syntax -The delete statement's syntax is as follows: - -``` -DELETE FROM table_name [PARTITION partition_name] -WHERE -column_name1 op value[ AND column_name2 op value ...]; -``` - -example 1: - -``` -DELETE FROM my_table PARTITION p1 WHERE k1 = 3; -``` - -example 2: - -``` -DELETE FROM my_table PARTITION p1 WHERE k1 < 3 AND k2 = "abc"; -``` - -The following describes the parameters used in the delete statement: - -* PARTITION - - The target partition of the delete statement. If not specified, the table must be a single partition table, otherwise it cannot be deleted - -* WHERE - - The condition of the delete statement. All delete statements must specify a where condition. - -Explanation: - -1. The type of `OP` in the WHERE condition can only include `=, >, <, >=, <=, !=, in, not in`. -2. The column in the WHERE condition can only be the `key` column. -3. Cannot delete when the `key` column does not exist in any rollup table. -4. Each condition in WHERE condition can only be connected by `and`. 
If you want `or`, you are suggested to write these conditions into two delete statements. -5. If the specified table is a range or list partitioned table, `PARTITION` must be specified unless the table is a single partition table,. -6. Unlike the insert into command, delete statement cannot specify `label` manually. You can view the concept of `label` in [Insert Into](./insert-into-manual.md) +Please refer to the official website for the [DELETE](../../sql-manual/sql-reference-v2/Data-Manipulation-Statements/Manipulation/DELETE.html) syntax of the delete operation. ## Delete Result @@ -88,35 +51,34 @@ The delete command is an SQL command, and the returned results are synchronous. 2. Submitted successfully, but not visible - The transaction submission of Doris is divided into two steps: submission and publish version. Only after the publish version step is completed, the result will be visible to the user. If it has been submitted successfully, then it can be considered that the publish version step will eventually success. Doris will try to wait for publishing for a period of time after submitting. If it has timed out, even if the publishing version has not been completed, it will return to the user in priority and prompt the user that the submission has been completed but not visible. If delete has been committed and executed, but has not been published and visible, the following results will be returned. - - ``` - mysql> delete from test_tbl PARTITION p1 where k1 = 1; - Query OK, 0 rows affected (0.04 sec) - {'label':'delete_e7830c72-eb14-4cb9-bbb6-eebd4511d251', 'status':'COMMITTED', 'txnId':'4005', 'err':'delete job is committed but may be taking effect later' } - ``` - - The result will return a JSON string at the same time: - - `affected rows`: Indicates the row affected by this deletion. Since the deletion of Doris is currently a logical deletion, the value is always 0. - - `label`: The label generated automatically to be the signature of the delete jobs. Each job has a unique label within a single database. - - `status`: Indicates whether the data deletion is visible. If it is visible, `visible` will be displayed. If it is not visible, `committed` will be displayed. +~~~text +The transaction submission of Doris is divided into two steps: submission and publish version. Only after the publish version step is completed, the result will be visible to the user. If it has been submitted successfully, then it can be considered that the publish version step will eventually success. Doris will try to wait for publishing for a period of time after submitting. If it has timed out, even if the publishing version has not been completed, it will return to the user in priority and prompt the user that the submission has been completed but not visible. If delete has been committed and executed, but has not been published and visible, the following results will be returned. +``` +mysql> delete from test_tbl PARTITION p1 where k1 = 1; +Query OK, 0 rows affected (0.04 sec) +{'label':'delete_e7830c72-eb14-4cb9-bbb6-eebd4511d251', 'status':'COMMITTED', 'txnId':'4005', 'err':'delete job is committed but may be taking effect later' } +``` -​ + The result will return a JSON string at the same time: - `txnId`: The transaction ID corresponding to the delete job - - `err`: Field will display some details of this deletion +`affected rows`: Indicates the row affected by this deletion. Since the deletion of Doris is currently a logical deletion, the value is always 0. 
+ +`label`: The label generated automatically to be the signature of the delete jobs. Each job has a unique label within a single database. + +`status`: Indicates whether the data deletion is visible. If it is visible, `visible` will be displayed. If it is not visible, `committed` will be displayed. + +`txnId`: The transaction ID corresponding to the delete job + +`err`: Field will display some details of this deletion +~~~ 3. Commit failed, transaction cancelled If the delete statement is not submitted successfully, it will be automatically aborted by Doris and the following results will be returned - ``` + ```sql mysql> delete from test_tbl partition p1 where k1 > 80; ERROR 1064 (HY000): errCode = 2, detailMessage = {错误原因} ``` @@ -126,7 +88,7 @@ The delete command is an SQL command, and the returned results are synchronous. A timeout deletion will return the timeout and unfinished replicas displayed as ` (tablet = replica)` - ``` + ```sql mysql> delete from test_tbl partition p1 where k1 > 80; ERROR 1064 (HY000): errCode = 2, detailMessage = failed to delete replicas from job: 4005, Unfinished replicas:10000=60000, 10001=60000, 10002=60000 ``` @@ -140,9 +102,7 @@ The delete command is an SQL command, and the returned results are synchronous. 1. If `status` is `committed`, the data deletion is committed and will be eventually invisible. Users can wait for a while and then use the `show delete` command to view the results. 2. If `status` is `visible`, the data have been deleted successfully. -## Relevant Configuration - -### FE configuration +## Delete operation related FE configuration **TIMEOUT configuration** @@ -174,23 +134,31 @@ In general, Doris's deletion timeout is limited from 30 seconds to 5 minutes. Th ## Show delete history -1. The user can view the deletion completed in history through the show delete statement. +The user can view the deletion completed in history through the show delete statement. - Syntax +Syntax - ``` - SHOW DELETE [FROM db_name] - ``` +``` +SHOW DELETE [FROM db_name] +``` - example +example - ``` - mysql> show delete from test_db; - +-----------+---------------+---------------------+-----------------+----------+ - | TableName | PartitionName | CreateTime | DeleteCondition | State | - +-----------+---------------+---------------------+-----------------+----------+ - | empty_tbl | p3 | 2020-04-15 23:09:35 | k1 EQ "1" | FINISHED | - | test_tbl | p4 | 2020-04-15 23:09:53 | k1 GT "80" | FINISHED | - +-----------+---------------+---------------------+-----------------+----------+ - 2 rows in set (0.00 sec) - ``` \ No newline at end of file +```sql +mysql> show delete from test_db; ++-----------+---------------+---------------------+-----------------+----------+ +| TableName | PartitionName | CreateTime | DeleteCondition | State | ++-----------+---------------+---------------------+-----------------+----------+ +| empty_tbl | p3 | 2020-04-15 23:09:35 | k1 EQ "1" | FINISHED | +| test_tbl | p4 | 2020-04-15 23:09:53 | k1 GT "80" | FINISHED | ++-----------+---------------+---------------------+-----------------+----------+ +2 rows in set (0.00 sec) +``` + +### Note + +Unlike the Insert into command, delete cannot specify `label` manually. For the concept of label, see the [Insert Into](../import/import-way/insert-into-manual.html) documentation. 
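+
+For quick reference, a typical delete looks like the following; the table, partition, and condition are placeholders, and the full syntax is described in the DELETE manual linked below:
+
+```sql
+-- Delete the rows in partition p1 of my_table whose key column k1 equals 3
+DELETE FROM my_table PARTITION p1 WHERE k1 = 3;
+```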
+ +## More Help + +For more detailed syntax used by **delete**, see the [delete](../../sql-manual/sql-reference-v2/Data-Manipulation-Statements/Manipulation/DELETE.html) command manual, You can also enter `HELP DELETE` in the Mysql client command line to get more help information \ No newline at end of file diff --git a/new-docs/en/data-operate/update-delete/sequence-column-manual.md b/new-docs/en/data-operate/update-delete/sequence-column-manual.md index aeb62e3621..ef44f1b561 100644 --- a/new-docs/en/data-operate/update-delete/sequence-column-manual.md +++ b/new-docs/en/data-operate/update-delete/sequence-column-manual.md @@ -25,58 +25,61 @@ under the License. --> # Sequence Column -The Sequence Column currently only supports the Uniq model. The Uniq model is mainly for scenarios requiring a unique primary key, which can guarantee the uniqueness constraint of the primary key. However, due to the use of REPLACE aggregation, the replacement sequence is not guaranteed for data imported in the same batch, which can be described in detail [here](../../getting-started/data-model-rollup.md). If the order of substitution is not guaranteed, then the specific data that is finally imported into the table cannot be determined, and there is uncertainty. +The sequence column currently only supports the Uniq model. The Uniq model is mainly aimed at scenarios that require a unique primary key, which can guarantee the uniqueness constraint of the primary key. However, due to the REPLACE aggregation method, the replacement order of data imported in the same batch is not guaranteed. See [Data Model](../../data-table/data-model.md). If the replacement order cannot be guaranteed, the specific data finally imported into the table cannot be determined, and there is uncertainty. -To solve this problem, Doris supported a sequence column by allowing the user to specify the sequence column when importing. Under the same key column, columns of the REPLACE aggregate type will be replaced according to the value of the sequence column, larger values can be replaced with smaller values, and vice versa. In this method, the order is determined by the user, and the user controls the replacement order. +In order to solve this problem, Doris supports the sequence column. The user specifies the sequence column when importing. Under the same key column, the REPLACE aggregation type column will be replaced according to the value of the sequence column. The larger value can replace the smaller value, and vice versa. Cannot be replaced. This method leaves the determination of the order to the user, who controls the replacement order. -## Principle +## Applicable scene -Implemented by adding a hidden column `__DORIS_SEQUENCE_COL__`, the type of the column is specified by the user while create the table, determines the specific value of the column on import, and replaces the REPLACE column with that value. +Sequence columns can only be used under the Uniq data model. + +## Fundamental + +By adding a hidden column `__DORIS_SEQUENCE_COL__`, the type of the column is specified by the user when creating the table, the specific value of the column is determined during import, and the REPLACE column is replaced according to this value. ### Create Table -When you create the Uniq table, a hidden column `__DORIS_SEQUENCE_COL__` is automatically added, depending on the type specified by the user +When creating a Uniq table, a hidden column `__DORIS_SEQUENCE_COL__` will be automatically added according to the user-specified type. 
### Import -When importing, fe sets the value of the hidden column during parsing to the value of the 'order by' expression (Broker Load and routine Load), or the value of the 'function_column.sequence_col' expression (stream load), and the value column will be replaced according to this value. The value of the hidden column `__DORIS_SEQUENCE_COL__` can be set as a column in the source data or in the table structure. +When importing, fe sets the value of the hidden column to the value of the `order by` expression (broker load and routine load), or the value of the `function_column.sequence_col` expression (stream load) during the parsing process, the value column will be Replace with this value. The value of the hidden column `__DORIS_SEQUENCE_COL__` can be set to either a column in the data source or a column in the table structure. ### Read -The request with the value column needs to read the additional column of `__DORIS_SEQUENCE_COL__`, which is used as a basis for the order of replacement aggregation function replacement under the same key column, with the larger value replacing the smaller value and not the reverse. +When the request contains the value column, the `__DORIS_SEQUENCE_COL__` column needs to be additionally read. This column is used as the basis for the replacement order of the REPLACE aggregate function under the same key column. The larger value can replace the smaller value, otherwise it cannot be replaced. ### Cumulative Compaction -Cumulative Compaction works in the same way as the reading process +The principle is the same as that of the reading process during Cumulative Compaction. ### Base Compaction -Base Compaction works in the same way as the reading process +The principle is the same as the reading process during Base Compaction. ### Syntax -The syntax aspect of the table construction adds a property to the property identifying the type of `__DORIS_SEQUENCE_COL__`. -The syntax design aspect of the import is primarily the addition of a mapping from the sequence column to other columns, the settings of each import mode are described below +When the Sequence column creates a table, an attribute is added to the property, which is used to identify the type import of `__DORIS_SEQUENCE_COL__`. The grammar design is mainly to add a mapping from the sequence column to other columns. 
The settings of each seed method will be described below introduce #### Create Table When you create the Uniq table, you can specify the sequence column type -``` +```text PROPERTIES ( "function_column.sequence_type" = 'Date', ); ``` The sequence_type is used to specify the type of the sequence column, which can be integral and time -#### stream load +#### Stream Load The syntax of the stream load is to add the mapping of hidden columns corresponding to source_sequence in the 'function_column.sequence_col' field in the header, for example -``` +```shell curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "function_column.sequence_col: source_sequence" -T testData http://host:port/api/testDb/testTbl/_stream_load ``` -#### broker load +#### Broker Load Set the source_sequence field for the hidden column map at `ORDER BY` -``` +```sql LOAD LABEL db1.label1 ( DATA INFILE("hdfs://host:port/user/data/*/test.txt") @@ -97,11 +100,11 @@ PROPERTIES ``` -#### routine load +#### Routine Load The mapping method is the same as above, as shown below -``` +```sql CREATE ROUTINE LOAD example_db.test1 ON example_tbl [WITH MERGE|APPEND|DELETE] COLUMNS(k1, k2, source_sequence, v1, v2), @@ -125,17 +128,16 @@ The mapping method is the same as above, as shown below ``` ## Enable sequence column support -If `function_column.sequence_type` is set when creating a new table, then the sequence column will be supported. -For a table that does not support sequence column, use the following statement if you would like to use this feature: -`ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")` to enable. -If you want to determine if a table supports sequence column, you can set the session variable to display the hidden column `SET show_hidden_columns=true`, followed by `desc Tablename`, if the output contains the column `__DORIS_SEQUENCE_COL__`, it is supported, if not, it is not supported +If `function_column.sequence_type` is set when creating a new table, the new table will support sequence column. For a table that does not support sequence column, if you want to use this function, you can use the following statement: `ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")` to enable. + + If you are not sure whether a table supports sequence column, you can display hidden columns by setting a session variable `SET show_hidden_columns=true`, then use `desc tablename`, if there is a `__DORIS_SEQUENCE_COL__` column in the output, it is supported, if not, it is not supported . ## Usage example Let's take the stream Load as an example to show how to use it 1. Create a table that supports sequence column. 
The table structure is shown below -``` +```sql MySQL > desc test_table; +-------------+--------------+------+-------+---------+---------+ | Field | Type | Null | Key | Default | Extra | @@ -160,11 +162,11 @@ Import the following data 1 2020-02-22 1 2020-02-22 b ``` Take the Stream Load as an example here and map the sequence column to the modify_date column -``` +```shell curl --location-trusted -u root: -H "function_column.sequence_col: modify_date" -T testData http://host:port/api/test/test_table/_stream_load ``` The results is -``` +```sql MySQL > select * from test_table; +---------+------------+----------+-------------+---------+ | user_id | date | group_id | modify_date | keyword | @@ -177,12 +179,12 @@ In this import, the c is eventually retained in the keyword column because the v 3. Guarantee of substitution order After the above steps are completed, import the following data -``` +```text 1 2020-02-22 1 2020-02-22 a 1 2020-02-22 1 2020-02-23 b ``` Query data -``` +```sql MySQL [test]> select * from test_table; +---------+------------+----------+-------------+---------+ | user_id | date | group_id | modify_date | keyword | @@ -192,12 +194,13 @@ MySQL [test]> select * from test_table; ``` Because the sequence column for the newly imported data are all smaller than the values already in the table, they cannot be replaced Try importing the following data again + ``` 1 2020-02-22 1 2020-02-22 a 1 2020-02-22 1 2020-03-23 w ``` Query data -``` +```sql MySQL [test]> select * from test_table; +---------+------------+----------+-------------+---------+ | user_id | date | group_id | modify_date | keyword | @@ -205,4 +208,5 @@ MySQL [test]> select * from test_table; | 1 | 2020-02-22 | 1 | 2020-03-23 | w | +---------+------------+----------+-------------+---------+ ``` -At this point, you can replace the original data in the table \ No newline at end of file +At this point, you can replace the original data in the table + diff --git a/new-docs/en/data-operate/update-delete/update.md b/new-docs/en/data-operate/update-delete/update.md index b3efc45ece..6105e1503a 100644 --- a/new-docs/en/data-operate/update-delete/update.md +++ b/new-docs/en/data-operate/update-delete/update.md @@ -1,9 +1,7 @@ ---- -{ +## { "title": "update", "language": "en" } ---- Spark cluster | +| <-------+ | ++---------------------------------+ +-----------------------------+ +``` + +## 全局字典 + +### 适用场景 + +目前Doris中Bitmap列是使用类库`Roaringbitmap`实现的,而`Roaringbitmap`的输入数据类型只能是整型,因此如果要在导入流程中实现对于Bitmap列的预计算,那么就需要将输入数据的类型转换成整型。 + +在Doris现有的导入流程中,全局字典的数据结构是基于Hive表实现的,保存了原始值到编码值的映射。 + +### 构建流程 + +1. 读取上游数据源的数据,生成一张hive临时表,记为`hive_table`。 +2. 从`hive_table`中抽取待去重字段的去重值,生成一张新的hive表,记为`distinct_value_table`。 +3. 新建一张全局字典表,记为`dict_table`;一列为原始值,一列为编码后的值。 +4. 将`distinct_value_table`与`dict_table`做left join,计算出新增的去重值集合,然后对这个集合使用窗口函数进行编码,此时去重列原始值就多了一列编码后的值,最后将这两列的数据写回`dict_table`。 +5. 将`dict_table`与`hive_table`做join,完成`hive_table`中原始值替换成整型编码值的工作。 +6. `hive_table`会被下一步数据预处理的流程所读取,经过计算后导入到Doris中。 + +## 数据预处理(DPP) + +### 基本流程 + +1. 从数据源读取数据,上游数据源可以是HDFS文件,也可以是Hive表。 +2. 对读取到的数据进行字段映射,表达式计算以及根据分区信息生成分桶字段`bucket_id`。 +3. 根据Doris表的rollup元数据生成RollupTree。 +4. 遍历RollupTree,进行分层的聚合操作,下一个层级的rollup可以由上一个层的rollup计算得来。 +5. 每次完成聚合计算后,会对数据根据`bucket_id`进行分桶然后写入HDFS中。 +6. 
后续broker会拉取HDFS中的文件然后导入Doris Be中。 + +## Hive Bitmap UDF + +Spark 支持将 hive 生成的 bitmap 数据直接导入到 Doris。详见 [hive-bitmap-udf](../../../ecosystem/external-table/hive-bitmap-udf.html) 文档。 + +## 基本操作 + +### 配置ETL集群 + +Spark作为一种外部计算资源在Doris中用来完成ETL工作,未来可能还有其他的外部资源会加入到Doris中使用,如Spark/GPU用于查询,HDFS/S3用于外部存储,MapReduce用于ETL等,因此我们引入resource management来管理Doris使用的这些外部资源。 + +提交 Spark 导入任务之前,需要配置执行 ETL 任务的 Spark 集群。 + +```sql +-- create spark resource +CREATE EXTERNAL RESOURCE resource_name +PROPERTIES +( + type = spark, + spark_conf_key = spark_conf_value, + working_dir = path, + broker = broker_name, + broker.property_key = property_value +) + +-- drop spark resource +DROP RESOURCE resource_name + +-- show resources +SHOW RESOURCES +SHOW PROC "/resources" + +-- privileges +GRANT USAGE_PRIV ON RESOURCE resource_name TO user_identity +GRANT USAGE_PRIV ON RESOURCE resource_name TO ROLE role_name + +REVOKE USAGE_PRIV ON RESOURCE resource_name FROM user_identity +REVOKE USAGE_PRIV ON RESOURCE resource_name FROM ROLE role_name +``` + +**创建资源** + +`resource_name` 为 Doris 中配置的 Spark 资源的名字。 + +`PROPERTIES` 是 Spark 资源相关参数,如下: + +- `type`:资源类型,必填,目前仅支持 spark。 +- Spark 相关参数如下: + - `spark.master`: 必填,目前支持yarn,spark://host:port。 + - `spark.submit.deployMode`: Spark 程序的部署模式,必填,支持 cluster,client 两种。 + - `spark.hadoop.yarn.resourcemanager.address`: master为yarn时必填。 + - `spark.hadoop.fs.defaultFS`: master为yarn时必填。 + - 其他参数为可选,参考http://spark.apache.org/docs/latest/configuration.html +- `working_dir`: ETL 使用的目录。spark作为ETL资源使用时必填。例如:hdfs://host:port/tmp/doris。 +- `broker`: broker 名字。spark作为ETL资源使用时必填。需要使用`ALTER SYSTEM ADD BROKER` 命令提前完成配置。 + - `broker.property_key`: broker读取ETL生成的中间文件时需要指定的认证信息等。 + +示例: + +```sql +-- yarn cluster 模式 +CREATE EXTERNAL RESOURCE "spark0" +PROPERTIES +( + "type" = "spark", + "spark.master" = "yarn", + "spark.submit.deployMode" = "cluster", + "spark.jars" = "xxx.jar,yyy.jar", + "spark.files" = "/tmp/aaa,/tmp/bbb", + "spark.executor.memory" = "1g", + "spark.yarn.queue" = "queue0", + "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999", + "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000", + "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris", + "broker" = "broker0", + "broker.username" = "user0", + "broker.password" = "password0" +); + +-- spark standalone client 模式 +CREATE EXTERNAL RESOURCE "spark1" +PROPERTIES +( + "type" = "spark", + "spark.master" = "spark://127.0.0.1:7777", + "spark.submit.deployMode" = "client", + "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris", + "broker" = "broker1" +); +``` + +**查看资源** + +普通账户只能看到自己有USAGE_PRIV使用权限的资源。 + +root和admin账户可以看到所有的资源。 + +**资源权限** + +资源权限通过GRANT REVOKE来管理,目前仅支持USAGE_PRIV使用权限。 + +可以将USAGE_PRIV权限赋予某个用户或者某个角色,角色的使用与之前一致。 + +```sql +-- 授予spark0资源的使用权限给用户user0 +GRANT USAGE_PRIV ON RESOURCE "spark0" TO "user0"@"%"; + +-- 授予spark0资源的使用权限给角色role0 +GRANT USAGE_PRIV ON RESOURCE "spark0" TO ROLE "role0"; + +-- 授予所有资源的使用权限给用户user0 +GRANT USAGE_PRIV ON RESOURCE * TO "user0"@"%"; + +-- 授予所有资源的使用权限给角色role0 +GRANT USAGE_PRIV ON RESOURCE * TO ROLE "role0"; + +-- 撤销用户user0的spark0资源使用权限 +REVOKE USAGE_PRIV ON RESOURCE "spark0" FROM "user0"@"%"; +``` + +### 配置SPARK客户端 + +FE底层通过执行spark-submit的命令去提交spark任务,因此需要为FE配置spark客户端,建议使用2.4.5或以上的spark2官方版本,[spark下载地址 ](https://archive.apache.org/dist/spark/),下载完成后,请按步骤完成以下配置。 + +**配置 SPARK_HOME 环境变量** + +将spark客户端放在FE同一台机器上的目录下,并在FE的配置文件配置`spark_home_default_dir`项指向此目录,此配置项默认为FE根目录下的 `lib/spark2x`路径,此项不可为空。 + +**配置 SPARK 依赖包** + 
+将spark客户端下的jars文件夹内所有jar包归档打包成一个zip文件,并在FE的配置文件配置`spark_resource_path`项指向此zip文件,若此配置项为空,则FE会尝试寻找FE根目录下的`lib/spark2x/jars/spark-2x.zip`文件,若没有找到则会报文件不存在的错误。 + +当提交spark load任务时,会将归档好的依赖文件上传至远端仓库,默认仓库路径挂在`working_dir/{cluster_id}`目录下,并以`__spark_repository__{resource_name}`命名,表示集群内的一个resource对应一个远端仓库,远端仓库目录结构参考如下: + +```text +__spark_repository__spark0/ + |-__archive_1.0.0/ + | |-__lib_990325d2c0d1d5e45bf675e54e44fb16_spark-dpp-1.0.0-jar-with-dependencies.jar + | |-__lib_7670c29daf535efe3c9b923f778f61fc_spark-2x.zip + |-__archive_1.1.0/ + | |-__lib_64d5696f99c379af2bee28c1c84271d5_spark-dpp-1.1.0-jar-with-dependencies.jar + | |-__lib_1bbb74bb6b264a270bc7fca3e964160f_spark-2x.zip + |-__archive_1.2.0/ + | |-... +``` + +除了spark依赖(默认以`spark-2x.zip`命名),FE还会上传DPP的依赖包至远端仓库,若此次spark load提交的所有依赖文件都已存在远端仓库,那么就不需要在上传依赖,省下原来每次重复上传大量文件的时间。 + +### 配置 YARN 客户端 + +FE底层通过执行yarn命令去获取正在运行的application的状态以及杀死application,因此需要为FE配置yarn客户端,建议使用2.5.2或以上的hadoop2官方版本,[hadoop下载地址](https://archive.apache.org/dist/hadoop/common/) ,下载完成后,请按步骤完成以下配置。 + +**配置 YARN 可执行文件路径** + +将下载好的yarn客户端放在FE同一台机器的目录下,并在FE配置文件配置`yarn_client_path`项指向yarn的二进制可执行文件,默认为FE根目录下的`lib/yarn-client/hadoop/bin/yarn`路径。 + +(可选) 当FE通过yarn客户端去获取application的状态或者杀死application时,默认会在FE根目录下的`lib/yarn-config`路径下生成执行yarn命令所需的配置文件,此路径可通过在FE配置文件配置`yarn_config_dir`项修改,目前生成的配置文件包括`core-site.xml`和`yarn-site.xml`。 + +### 创建导入 + +语法: + +```sql +LOAD LABEL load_label + (data_desc, ...) + WITH RESOURCE resource_name + [resource_properties] + [PROPERTIES (key1=value1, ... )] + +* load_label: + db_name.label_name + +* data_desc: + DATA INFILE ('file_path', ...) + [NEGATIVE] + INTO TABLE tbl_name + [PARTITION (p1, p2)] + [COLUMNS TERMINATED BY separator ] + [(col1, ...)] + [COLUMNS FROM PATH AS (col2, ...)] + [SET (k1=f1(xx), k2=f2(xx))] + [WHERE predicate] + + DATA FROM TABLE hive_external_tbl + [NEGATIVE] + INTO TABLE tbl_name + [PARTITION (p1, p2)] + [SET (k1=f1(xx), k2=f2(xx))] + [WHERE predicate] + +* resource_properties: + (key2=value2, ...) 
+``` + +示例1:上游数据源为hdfs文件的情况 + +```sql +LOAD LABEL db1.label1 +( + DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file1") + INTO TABLE tbl1 + COLUMNS TERMINATED BY "," + (tmp_c1,tmp_c2) + SET + ( + id=tmp_c2, + name=tmp_c1 + ), + DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file2") + INTO TABLE tbl2 + COLUMNS TERMINATED BY "," + (col1, col2) + where col1 > 1 +) +WITH RESOURCE 'spark0' +( + "spark.executor.memory" = "2g", + "spark.shuffle.compress" = "true" +) +PROPERTIES +( + "timeout" = "3600" +); +``` + +示例2:上游数据源是hive表的情况 + +```sql +step 1:新建hive外部表 +CREATE EXTERNAL TABLE hive_t1 +( + k1 INT, + K2 SMALLINT, + k3 varchar(50), + uuid varchar(100) +) +ENGINE=hive +properties +( +"database" = "tmp", +"table" = "t1", +"hive.metastore.uris" = "thrift://0.0.0.0:8080" +); + +step 2: 提交load命令,要求导入的 doris 表中的列必须在 hive 外部表中存在。 +LOAD LABEL db1.label1 +( + DATA FROM TABLE hive_t1 + INTO TABLE tbl1 + SET + ( + uuid=bitmap_dict(uuid) + ) +) +WITH RESOURCE 'spark0' +( + "spark.executor.memory" = "2g", + "spark.shuffle.compress" = "true" +) +PROPERTIES +( + "timeout" = "3600" +); +``` + +示例3:上游数据源是hive binary类型情况 + +```sql +step 1:新建hive外部表 +CREATE EXTERNAL TABLE hive_t1 +( + k1 INT, + K2 SMALLINT, + k3 varchar(50), + uuid varchar(100) //hive中的类型为binary +) +ENGINE=hive +properties +( +"database" = "tmp", +"table" = "t1", +"hive.metastore.uris" = "thrift://0.0.0.0:8080" +); + +step 2: 提交load命令,要求导入的 doris 表中的列必须在 hive 外部表中存在。 +LOAD LABEL db1.label1 +( + DATA FROM TABLE hive_t1 + INTO TABLE tbl1 + SET + ( + uuid=binary_bitmap(uuid) + ) +) +WITH RESOURCE 'spark0' +( + "spark.executor.memory" = "2g", + "spark.shuffle.compress" = "true" +) +PROPERTIES +( + "timeout" = "3600" +); +``` + +创建导入的详细语法执行 `HELP SPARK LOAD` 查看语法帮助。这里主要介绍 Spark load 的创建导入语法中参数意义和注意事项。 + +**Label** + +导入任务的标识。每个导入任务,都有一个在单 database 内部唯一的 Label。具体规则与 [`Broker Load`](broker-load-manual.html) 一致。 + +**数据描述类参数** + +目前支持的数据源有CSV和hive table。其他规则与 [`Broker Load`](broker-load-manual.html) 一致。 + +**导入作业参数** + +导入作业参数主要指的是 Spark load 创建导入语句中的属于 `opt_properties`部分的参数。导入作业参数是作用于整个导入作业的。规则与 [`Broker Load`](broker-load-manual.html) 一致。 + +**Spark资源参数** + +Spark资源需要提前配置到 Doris系统中并且赋予用户USAGE_PRIV权限后才能使用 Spark load。 + +当用户有临时性的需求,比如增加任务使用的资源而修改 Spark configs,可以在这里设置,设置仅对本次任务生效,并不影响 Doris 集群中已有的配置。 + +```sql +WITH RESOURCE 'spark0' +( + "spark.driver.memory" = "1g", + "spark.executor.memory" = "3g" +) +``` + +**数据源为hive表时的导入** + +目前如果期望在导入流程中将hive表作为数据源,那么需要先新建一张类型为hive的外部表, 然后提交导入命令时指定外部表的表名即可。 + +**导入流程构建全局字典** + +适用于doris表聚合列的数据类型为bitmap类型。 在load命令中指定需要构建全局字典的字段即可,格式为:`doris字段名称=bitmap_dict(hive表字段名称)` 需要注意的是目前只有在上游数据源为hive表时才支持全局字典的构建。 + +**hive binary(bitmap)类型列的导入** + +适用于doris表聚合列的数据类型为bitmap类型,且数据源hive表中对应列的数据类型为binary(通过FE中spark-dpp中的org.apache.doris.load.loadv2.dpp.BitmapValue类序列化)类型。 无需构建全局字典,在load命令中指定相应字段即可,格式为:`doris字段名称=binary_bitmap(hive表字段名称)` 同样,目前只有在上游数据源为hive表时才支持binary(bitmap)类型的数据导入。 + +### 查看导入 + +Spark Load 导入方式同 Broker load 一样都是异步的,所以用户必须将创建导入的 Label 记录,并且在**查看导入命令中使用 Label 来查看导入结果**。查看导入命令在所有导入方式中是通用的,具体语法可执行 `HELP SHOW LOAD` 查看。 + +示例: + +```sql +mysql> show load order by createtime desc limit 1\G +*************************** 1. 
row *************************** + JobId: 76391 + Label: label1 + State: FINISHED + Progress: ETL:100%; LOAD:100% + Type: SPARK + EtlInfo: unselected.rows=4; dpp.abnorm.ALL=15; dpp.norm.ALL=28133376 + TaskInfo: cluster:cluster0; timeout(s):10800; max_filter_ratio:5.0E-5 + ErrorMsg: N/A + CreateTime: 2019-07-27 11:46:42 + EtlStartTime: 2019-07-27 11:46:44 + EtlFinishTime: 2019-07-27 11:49:44 + LoadStartTime: 2019-07-27 11:49:44 +LoadFinishTime: 2019-07-27 11:50:16 + URL: http://1.1.1.1:8089/proxy/application_1586619723848_0035/ + JobDetails: {"ScannedRows":28133395,"TaskNumber":1,"FileNumber":1,"FileSize":200000} +``` + +返回结果集中参数意义可以参考 [Broker Load](broker-load-manual.html)。不同点如下: + +- State + + 导入任务当前所处的阶段。任务提交之后状态为 PENDING,提交 Spark ETL 之后状态变为 ETL,ETL 完成之后 FE 调度 BE 执行 push 操作状态变为 LOADING,push 完成并且版本生效后状态变为 FINISHED。 + + 导入任务的最终阶段有两个:CANCELLED 和 FINISHED,当 Load job 处于这两个阶段时导入完成。其中 CANCELLED 为导入失败,FINISHED 为导入成功。 + +- Progress + + 导入任务的进度描述。分为两种进度:ETL 和 LOAD,对应了导入流程的两个阶段 ETL 和 LOADING。 + + LOAD 的进度范围为:0~100%。 + + `LOAD 进度 = 当前已完成所有replica导入的tablet个数 / 本次导入任务的总tablet个数 * 100%` + + **如果所有导入表均完成导入,此时 LOAD 的进度为 99%** 导入进入到最后生效阶段,整个导入完成后,LOAD 的进度才会改为 100%。 + + 导入进度并不是线性的。所以如果一段时间内进度没有变化,并不代表导入没有在执行。 + +- Type + + 导入任务的类型。Spark load 为 SPARK。 + +- CreateTime/EtlStartTime/EtlFinishTime/LoadStartTime/LoadFinishTime + + 这几个值分别代表导入创建的时间,ETL 阶段开始的时间,ETL 阶段完成的时间,LOADING 阶段开始的时间和整个导入任务完成的时间。 + +- JobDetails + + 显示一些作业的详细运行状态,ETL 结束的时候更新。包括导入文件的个数、总大小(字节)、子任务个数、已处理的原始行数等。 + + `{"ScannedRows":139264,"TaskNumber":1,"FileNumber":1,"FileSize":940754064}` + +- URL + + 可复制输入到浏览器,跳转至相应application的web界面 + +### 查看 spark launcher 提交日志 + +有时用户需要查看spark任务提交过程中产生的详细日志,日志默认保存在FE根目录下`log/spark_launcher_log`路径下,并以`spark_launcher_{load_job_id}_{label}.log`命名,日志会在此目录下保存一段时间,当FE元数据中的导入信息被清理时,相应的日志也会被清理,默认保存时间为3天。 + +### 取消导入 + +当 Spark Load 作业状态不为 CANCELLED 或 FINISHED 时,可以被用户手动取消。取消时需要指定待取消导入任务的 Label 。取消导入命令语法可执行 `HELP CANCEL LOAD`查看。 + +## 相关系统配置 + +### FE配置 + +下面配置属于 Spark load 的系统级别配置,也就是作用于所有 Spark load 导入任务的配置。主要通过修改 `fe.conf`来调整配置值。 + +- `enable_spark_load` + + 开启 Spark load 和创建 resource 功能。默认为 false,关闭此功能。 + +- `spark_load_default_timeout_second` + + 任务默认超时时间为259200秒(3天)。 + +- `spark_home_default_dir` + + spark客户端路径 (`fe/lib/spark2x`) 。 + +- `spark_resource_path` + + 打包好的spark依赖文件路径(默认为空)。 + +- `spark_launcher_log_dir` + + spark客户端的提交日志存放的目录(`fe/log/spark_launcher_log`)。 + +- `yarn_client_path` + + yarn二进制可执行文件路径 (`fe/lib/yarn-client/hadoop/bin/yarn`) 。 + +- `yarn_config_dir` + + yarn配置文件生成路径 (`fe/lib/yarn-config`) 。 + +## 最佳实践 + +### 应用场景 + +使用 Spark Load 最适合的场景就是原始数据在文件系统(HDFS)中,数据量在 几十 GB 到 TB 级别。小数据量还是建议使用 [Stream Load](stream-load-manual.html) 或者 [Broker Load](broker-load-manual.html)。 + +## 常见问题 + +- 使用Spark Load时没有在spark客户端的spark-env.sh配置`HADOOP_CONF_DIR`环境变量。 + +如果`HADOOP_CONF_DIR`环境变量没有设置,会报 `When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.` 错误。 + +- 使用Spark Load时`spark_home_default_dir`配置项没有指定spark客户端根目录。 + +提交Spark job时用到spark-submit命令,如果`spark_home_default_dir`设置错误,会报 `Cannot run program "xxx/bin/spark-submit": error=2, No such file or directory` 错误。 + +- 使用Spark load时`spark_resource_path`配置项没有指向打包好的zip文件。 + +如果`spark_resource_path`没有设置正确,会报`File xxx/jars/spark-2x.zip does not exist` 错误。 + +- 使用Spark load时`yarn_client_path`配置项没有指定yarn的可执行文件。 + +如果`yarn_client_path`没有设置正确,会报`yarn client does not exist in path: xxx/yarn-client/hadoop/bin/yarn` 错误 + +## 更多帮助 + +关于**Spark Load** 使用的更多详细语法,可以在Mysql客户端命令行下输入 `HELP SPARK LOAD` 获取更多帮助信息。 diff --git 
a/new-docs/zh-CN/data-operate/import/import-way/stream-load-manual.md b/new-docs/zh-CN/data-operate/import/import-way/stream-load-manual.md index 91f9ada417..505f0aeec8 100644 --- a/new-docs/zh-CN/data-operate/import/import-way/stream-load-manual.md +++ b/new-docs/zh-CN/data-operate/import/import-way/stream-load-manual.md @@ -26,5 +26,370 @@ under the License. # Stream load +Stream load 是一个同步的导入方式,用户通过发送 HTTP 协议发送请求将本地文件或数据流导入到 Doris 中。Stream load 同步执行导入并返回导入结果。用户可直接通过请求的返回体判断本次导入是否成功。 +Stream load 主要适用于导入本地文件,或通过程序导入数据流中的数据。 + +## 基本原理 + +下图展示了 Stream load 的主要流程,省略了一些导入细节。 + +```text + ^ + + | | + | | 1A. User submit load to FE + | | + | +--v-----------+ + | | FE | +5. Return result to user | +--+-----------+ + | | + | | 2. Redirect to BE + | | + | +--v-----------+ + +---+Coordinator BE| 1B. User submit load to BE + +-+-----+----+-+ + | | | + +-----+ | +-----+ + | | | 3. Distrbute data + | | | + +-v-+ +-v-+ +-v-+ + |BE | |BE | |BE | + +---+ +---+ +---+ +``` + +Stream load 中,Doris 会选定一个节点作为 Coordinator 节点。该节点负责接数据并分发数据到其他数据节点。 + +用户通过 HTTP 协议提交导入命令。如果提交到 FE,则 FE 会通过 HTTP redirect 指令将请求转发给某一个 BE。用户也可以直接提交导入命令给某一指定 BE。 + +导入的最终结果由 Coordinator BE 返回给用户。 + +## 支持数据格式 + +目前 Stream Load 支持两个数据格式:CSV(文本) 和 JSON + +## 基本操作 + +### 创建导入 + +Stream Load 通过 HTTP 协议提交和传输数据。这里通过 `curl` 命令展示如何提交导入。 + +用户也可以通过其他 HTTP client 进行操作。 + +```shell +curl --location-trusted -u user:passwd [-H ""...] -T data.file -XPUT http://fe_host:http_port/api/{db}/{table}/_stream_load + +# Header 中支持属性见下面的 ‘导入任务参数’ 说明 +# 格式为: -H "key1:value1" +``` + +示例: + +```shell +curl --location-trusted -u root -T date -H "label:123" http://abc.com:8030/api/test/date/_stream_load +``` + +创建导入的详细语法帮助执行 `HELP STREAM LOAD` 查看, 下面主要介绍创建 Stream Load 的部分参数意义。 + +**签名参数** + +- user/passwd + + Stream load 由于创建导入的协议使用的是 HTTP 协议,通过 Basic access authentication 进行签名。Doris 系统会根据签名验证用户身份和导入权限。 + +**导入任务参数** + +Stream Load 由于使用的是 HTTP 协议,所以所有导入任务有关的参数均设置在 Header 中。下面主要介绍了 Stream Load 导入任务参数的部分参数意义。 + +- label + + 导入任务的标识。每个导入任务,都有一个在单 database 内部唯一的 label。label 是用户在导入命令中自定义的名称。通过这个 label,用户可以查看对应导入任务的执行情况。 + + label 的另一个作用,是防止用户重复导入相同的数据。**强烈推荐用户同一批次数据使用相同的 label。这样同一批次数据的重复请求只会被接受一次,保证了 At-Most-Once** + + 当 label 对应的导入作业状态为 CANCELLED 时,该 label 可以再次被使用。 + +- column_separator + + 用于指定导入文件中的列分隔符,默认为\t。如果是不可见字符,则需要加\x作为前缀,使用十六进制来表示分隔符。 + + 如hive文件的分隔符\x01,需要指定为-H "column_separator:\x01"。 + + 可以使用多个字符的组合作为列分隔符。 + +- line_delimiter + + 用于指定导入文件中的换行符,默认为\n。 + + 可以使用做多个字符的组合作为换行符。 + +- max_filter_ratio + + 导入任务的最大容忍率,默认为0容忍,取值范围是0~1。当导入的错误率超过该值,则导入失败。 + + 如果用户希望忽略错误的行,可以通过设置这个参数大于 0,来保证导入可以成功。 + + 计算公式为: + + `(dpp.abnorm.ALL / (dpp.abnorm.ALL + dpp.norm.ALL ) ) > max_filter_ratio` + + `dpp.abnorm.ALL` 表示数据质量不合格的行数。如类型不匹配,列数不匹配,长度不匹配等等。 + + `dpp.norm.ALL` 指的是导入过程中正确数据的条数。可以通过 `SHOW LOAD` 命令查询导入任务的正确数据量。 + + 原始文件的行数 = `dpp.abnorm.ALL + dpp.norm.ALL` + +- where + + 导入任务指定的过滤条件。Stream load 支持对原始数据指定 where 语句进行过滤。被过滤的数据将不会被导入,也不会参与 filter ratio 的计算,但会被计入`num_rows_unselected`。 + +- partition + + 待导入表的 Partition 信息,如果待导入数据不属于指定的 Partition 则不会被导入。这些数据将计入 `dpp.abnorm.ALL` + +- columns + + 待导入数据的函数变换配置,目前 Stream load 支持的函数变换方法包含列的顺序变化以及表达式变换,其中表达式变换的方法与查询语句的一致。 + + ```text + 列顺序变换例子:原始数据有三列(src_c1,src_c2,src_c3), 目前doris表也有三列(dst_c1,dst_c2,dst_c3) + + 如果原始表的src_c1列对应目标表dst_c1列,原始表的src_c2列对应目标表dst_c2列,原始表的src_c3列对应目标表dst_c3列,则写法如下: + columns: dst_c1, dst_c2, dst_c3 + + 如果原始表的src_c1列对应目标表dst_c2列,原始表的src_c2列对应目标表dst_c3列,原始表的src_c3列对应目标表dst_c1列,则写法如下: + columns: dst_c2, dst_c3, dst_c1 + + 表达式变换例子:原始文件有两列,目标表也有两列(c1,c2)但是原始文件的两列均需要经过函数变换才能对应目标表的两列,则写法如下: + columns: tmp_c1, 
tmp_c2, c1 = year(tmp_c1), c2 = month(tmp_c2) + 其中 tmp_*是一个占位符,代表的是原始文件中的两个原始列。 + ``` + +- exec_mem_limit + + 导入内存限制。默认为 2GB,单位为字节。 + +- strict_mode + + Stream Load 导入可以开启 strict mode 模式。开启方式为在 HEADER 中声明 `strict_mode=true` 。默认的 strict mode 为关闭。 + + strict mode 模式的意思是:对于导入过程中的列类型转换进行严格过滤。严格过滤的策略如下: + + 1. 对于列类型转换来说,如果 strict mode 为true,则错误的数据将被 filter。这里的错误数据是指:原始数据并不为空值,在参与列类型转换后结果为空值的这一类数据。 + 2. 对于导入的某列由函数变换生成时,strict mode 对其不产生影响。 + 3. 对于导入的某列类型包含范围限制的,如果原始数据能正常通过类型转换,但无法通过范围限制的,strict mode 对其也不产生影响。例如:如果类型是 decimal(1,0), 原始数据为 10,则属于可以通过类型转换但不在列声明的范围内。这种数据 strict 对其不产生影响。 + +- merge_type 数据的合并类型,一共支持三种类型APPEND、DELETE、MERGE 其中,APPEND是默认值,表示这批数据全部需要追加到现有数据中,DELETE 表示删除与这批数据key相同的所有行,MERGE 语义 需要与delete 条件联合使用,表示满足delete 条件的数据按照DELETE 语义处理其余的按照APPEND 语义处理 + +- two_phase_commit + + Stream load 导入可以开启两阶段事务提交模式。开启方式为在 HEADER 中声明 `two_phase_commit=true` 。默认的两阶段批量事务提交为关闭。 两阶段批量事务提交模式的意思是:Stream load过程中,数据写入完成即会返回信息给用户,此时数据不可见,事务状态为PRECOMMITTED,用户手动触发commit操作之后,数据才可见。 + + 1. 用户可以调用如下接口对stream load事务触发commit操作: + + ```shell + curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:commit" http://fe_host:http_port/api/{db}/_stream_load_2pc + ``` + + 或 + + ```shell + curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:commit" http://be_host:webserver_port/api/{db}/_stream_load_2pc + ``` + + 1. 用户可以调用如下接口对stream load事务触发abort操作: + + ```shell + curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:abort" http://fe_host:http_port/api/{db}/_stream_load_2pc + ``` + + 或 + + ```shell + curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:abort" http://be_host:webserver_port/api/{db}/_stream_load_2pc + ``` + +### 返回结果 + +由于 Stream load 是一种同步的导入方式,所以导入的结果会通过创建导入的返回值直接返回给用户。 + +示例: + +```text +{ + "TxnId": 1003, + "Label": "b6f3bc78-0d2c-45d9-9e4c-faa0a0149bee", + "Status": "Success", + "ExistingJobStatus": "FINISHED", // optional + "Message": "OK", + "NumberTotalRows": 1000000, + "NumberLoadedRows": 1000000, + "NumberFilteredRows": 1, + "NumberUnselectedRows": 0, + "LoadBytes": 40888898, + "LoadTimeMs": 2144, + "BeginTxnTimeMs": 1, + "StreamLoadPutTimeMs": 2, + "ReadDataTimeMs": 325, + "WriteDataTimeMs": 1933, + "CommitAndPublishTimeMs": 106, + "ErrorURL": "http://192.168.1.1:8042/api/_load_error_log?file=__shard_0/error_log_insert_stmt_db18266d4d9b4ee5-abb00ddd64bdf005_db18266d4d9b4ee5_abb00ddd64bdf005" +} +``` + +下面主要解释了 Stream load 导入结果参数: + +- TxnId:导入的事务ID。用户可不感知。 + +- Label:导入 Label。由用户指定或系统自动生成。 + +- Status:导入完成状态。 + + "Success":表示导入成功。 + + "Publish Timeout":该状态也表示导入已经完成,只是数据可能会延迟可见,无需重试。 + + "Label Already Exists":Label 重复,需更换 Label。 + + "Fail":导入失败。 + +- ExistingJobStatus:已存在的 Label 对应的导入作业的状态。 + + 这个字段只有在当 Status 为 "Label Already Exists" 时才会显示。用户可以通过这个状态,知晓已存在 Label 对应的导入作业的状态。"RUNNING" 表示作业还在执行,"FINISHED" 表示作业成功。 + +- Message:导入错误信息。 + +- NumberTotalRows:导入总处理的行数。 + +- NumberLoadedRows:成功导入的行数。 + +- NumberFilteredRows:数据质量不合格的行数。 + +- NumberUnselectedRows:被 where 条件过滤的行数。 + +- LoadBytes:导入的字节数。 + +- LoadTimeMs:导入完成时间。单位毫秒。 + +- BeginTxnTimeMs:向Fe请求开始一个事务所花费的时间,单位毫秒。 + +- StreamLoadPutTimeMs:向Fe请求获取导入数据执行计划所花费的时间,单位毫秒。 + +- ReadDataTimeMs:读取数据所花费的时间,单位毫秒。 + +- WriteDataTimeMs:执行写入数据操作所花费的时间,单位毫秒。 + +- CommitAndPublishTimeMs:向Fe请求提交并且发布事务所花费的时间,单位毫秒。 + +- ErrorURL:如果有数据质量问题,通过访问这个 URL 查看具体错误行。 + +> 注意:由于 Stream load 是同步的导入方式,所以并不会在 Doris 系统中记录导入信息,用户无法异步的通过查看导入命令看到 Stream load。使用时需监听创建导入请求的返回值获取导入结果。 + +### 取消导入 + +用户无法手动取消 Stream Load,Stream Load 在超时或者导入错误后会被系统自动取消。 + +## 相关系统配置 + +### 
FE配置 + +- stream_load_default_timeout_second + + 导入任务的超时时间(以秒为单位),导入任务在设定的 timeout 时间内未完成则会被系统取消,变成 CANCELLED。 + + 默认的 timeout 时间为 600 秒。如果导入的源文件无法在规定时间内完成导入,用户可以在 stream load 请求中设置单独的超时时间。 + + 或者调整 FE 的参数`stream_load_default_timeout_second` 来设置全局的默认超时时间。 + +### BE配置 + +- streaming_load_max_mb + + Stream load 的最大导入大小,默认为 10G,单位是 MB。如果用户的原始文件超过这个值,则需要调整 BE 的参数 `streaming_load_max_mb`。 + +## 最佳实践 + +### 应用场景 + +使用 Stream load 的最合适场景就是原始文件在内存中或者在磁盘中。其次,由于 Stream load 是一种同步的导入方式,所以用户如果希望用同步方式获取导入结果,也可以使用这种导入。 + +### 数据量 + +由于 Stream load 的原理是由 BE 发起的导入并分发数据,建议的导入数据量在 1G 到 10G 之间。由于默认的最大 Stream load 导入数据量为 10G,所以如果要导入超过 10G 的文件需要修改 BE 的配置 `streaming_load_max_mb` + +```text +比如:待导入文件大小为15G +修改 BE 配置 streaming_load_max_mb 为 16000 即可。 +``` + +Stream load 的默认超时为 300秒,按照 Doris 目前最大的导入限速来看,约超过 3G 的文件就需要修改导入任务默认超时时间了。 + +```text +导入任务超时时间 = 导入数据量 / 10M/s (具体的平均导入速度需要用户根据自己的集群情况计算) +例如:导入一个 10G 的文件 +timeout = 1000s 等于 10G / 10M/s +``` + +### 完整例子 + +数据情况: 数据在发送导入请求端的本地磁盘路径 /home/store_sales 中,导入的数据量约为 15G,希望导入到数据库 bj_sales 的表 store_sales 中。 + +集群情况:Stream load 的并发数不受集群大小影响。 + +- step1: 导入文件大小是否超过默认的最大导入大小10G + + ```text + 修改 BE conf + streaming_load_max_mb = 16000 + ``` + +- step2: 计算大概的导入时间是否超过默认 timeout 值 + + ```text + 导入时间 ≈ 15000 / 10 = 1500s + 超过了默认的 timeout 时间,需要修改 FE 的配置 + stream_load_default_timeout_second = 1500 + ``` + +- step3:创建导入任务 + + ```shell + curl --location-trusted -u user:password -T /home/store_sales -H "label:abc" http://abc.com:8000/api/bj_sales/store_sales/_stream_load + ``` + +## 常见问题 + +- Label Already Exists + + Stream load 的 Label 重复排查步骤如下: + + 1. 是否和其他导入方式已经存在的导入 Label 冲突: + + 由于 Doris 系统中导入的 Label 不区分导入方式,所以存在其他导入方式使用了相同 Label 的问题。 + + 通过 `SHOW LOAD WHERE LABEL = “xxx”`,其中 xxx 为重复的 Label 字符串,查看是否已经存在一个 FINISHED 导入的 Label 和用户申请创建的 Label 相同。 + + 2. 是否 Stream load 同一个作业被重复提交了 + + 由于 Stream load 是 HTTP 协议提交创建导入任务,一般各个语言的 HTTP Client 均会自带请求重试逻辑。Doris 系统在接受到第一个请求后,已经开始操作 Stream load,但是由于没有及时返回给 Client 端结果, Client 端会发生再次重试创建请求的情况。这时候 Doris 系统由于已经在操作第一个请求,所以第二个请求已经就会被报 Label Already Exists 的情况。 + + 排查上述可能的方法:使用 Label 搜索 FE Master 的日志,看是否存在同一个 Label 出现了两次 `redirect load action to destination=` 的情况。如果有就说明,请求被 Client 端重复提交了。 + + 建议用户根据当前请求的数据量,计算出大致导入的时间,并根据导入超时时间,将Client 端的请求超时间改成大于导入超时时间的值,避免请求被 Client 端多次提交。 + + 3. Connection reset 异常 + + 在社区版 0.14.0 及之前的版本在启用Http V2之后出现connection reset异常,因为Web 容器内置的是tomcat,Tomcat 在 307 (Temporary Redirect) 是有坑的,对这个协议实现是有问题的,所有在使用Stream load 导入大数据量的情况下会出现connect reset异常,这个是因为tomcat在做307跳转之前就开始了数据传输,这样就造成了BE收到的数据请求的时候缺少了认证信息,之后将内置容器改成了Jetty解决了这个问题,如果你遇到这个问题,请升级你的Doris或者禁用Http V2(`enable_http_server_v2=false`)。 + + 升级以后同时升级你程序的http client 版本到 `4.5.13`,在你的pom.xml文件中引入下面的依赖 + + ```xml + + org.apache.httpcomponents + httpclient + 4.5.13 + + ``` +## 更多帮助 + +关于 Stream Load 使用的更多详细语法及最佳实践,请参阅 [Stream Load](../../../sql-manual/sql-reference-v2/Data-Manipulation-Statements/Load/STREAM-LOAD.html) 命令手册,你也可以在 MySql 客户端命令行下输入 `HELP STREAM LOAD` 获取更多帮助信息。 diff --git a/new-docs/zh-CN/data-operate/update-delete/batch-delete-manual.md b/new-docs/zh-CN/data-operate/update-delete/batch-delete-manual.md index e0a4e1968d..450ef6d72a 100644 --- a/new-docs/zh-CN/data-operate/update-delete/batch-delete-manual.md +++ b/new-docs/zh-CN/data-operate/update-delete/batch-delete-manual.md @@ -24,4 +24,221 @@ specific language governing permissions and limitations under the License. 
--> -# 批量删除 \ No newline at end of file +# 批量删除 + +目前Doris 支持`Broker Load`,` Routine Load`, `Stream Load` 等多种导入方式,对于数据的删除目前只能通过delete语句进行删除,使用delete 语句的方式删除时,每执行一次delete 都会生成一个新的数据版本,如果频繁删除会严重影响查询性能,并且在使用delete方式删除时,是通过生成一个空的rowset来记录删除条件实现,每次读取都要对删除条件进行过滤,同样在条件较多时会对性能造成影响。对比其他的系统,greenplum 的实现方式更像是传统数据库产品,snowflake 通过merge 语法实现。 + +对于类似于cdc数据导入的场景,数据中insert和delete一般是穿插出现的,面对这种场景我们目前的导入方式也无法满足,即使我们能够分离出insert和delete虽然可以解决导入的问题,但是仍然解决不了删除的问题。使用批量删除功能可以解决这些个别场景的需求。数据导入有三种合并方式: + +1. APPEND: 数据全部追加到现有数据中; +2. DELETE: 删除所有与导入数据key 列值相同的行; +3. MERGE: 根据 DELETE ON 的决定 APPEND 还是 DELETE。 + +## 基本原理 + +通过增加一个隐藏列`__DORIS_DELETE_SIGN__`实现,因为我们只是在unique 模型上做批量删除,因此只需要增加一个类型为bool 聚合函数为replace 的隐藏列即可。在be 各种聚合写入流程都和正常列一样,读取方案有两个: + +在fe遇到 * 等扩展时去掉`__DORIS_DELETE_SIGN__`,并且默认加上 `__DORIS_DELETE_SIGN__ != true` 的条件, be 读取时都会加上一列进行判断,通过条件确定是否删除。 + +### 导入 + +导入时在fe 解析时将隐藏列的值设置成 `DELETE ON` 表达式的值,其他的聚合行为和replace的聚合列相同。 + +### 读取 + +读取时在所有存在隐藏列的olapScanNode上增加`__DORIS_DELETE_SIGN__ != true` 的条件,be 不感知这一过程,正常执行。 + +### Cumulative Compaction + +Cumulative Compaction 时将隐藏列看作正常的列处理,Compaction逻辑没有变化。 + +### Base Compaction + +Base Compaction 时要将标记为删除的行的删掉,以减少数据占用的空间。 + +## 启用批量删除支持 + +启用批量删除支持有一下两种形式: + +1. 通过在fe 配置文件中增加`enable_batch_delete_by_default=true` 重启fe 后新建表的都支持批量删除,此选项默认为false; +2. 对于没有更改上述fe 配置或对于以存在的不支持批量删除功能的表,可以使用如下语句: `ALTER TABLE tablename ENABLE FEATURE "BATCH_DELETE"` 来启用批量删除。本操作本质上是一个schema change 操作,操作立即返回,可以通过`show alter table column` 来确认操作是否完成。 + +那么如何确定一个表是否支持批量删除,可以通过 设置一个session variable 来显示隐藏列 `SET show_hidden_columns=true` ,之后使用`desc tablename`,如果输出中有`__DORIS_DELETE_SIGN__` 列则支持,如果没有则不支持。 + +## 语法说明 + +导入的语法设计方面主要是增加一个指定删除标记列的字段的colum映射,并且需要在导入的数据中增加一列,各种导入方式设置的语法如下 + +### Stream Load + +`Stream Load` 的写法在header 中的 columns 字段增加一个设置删除标记列的字段, 示例 `-H "columns: k1, k2, label_c3" -H "merge_type: [MERGE|APPEND|DELETE]" -H "delete: label_c3=1"`。 + +### Broker Load + +`Broker Load` 的写法在 `PROPERTIES` 处设置删除标记列的字段,语法如下: + +```text +LOAD LABEL db1.label1 +( + [MERGE|APPEND|DELETE] DATA INFILE("hdfs://abc.com:8888/user/palo/test/ml/file1") + INTO TABLE tbl1 + COLUMNS TERMINATED BY "," + (tmp_c1,tmp_c2, label_c3) + SET + ( + id=tmp_c2, + name=tmp_c1, + ) + [DELETE ON label_c3=true] +) +WITH BROKER 'broker' +( + "username"="user", + "password"="pass" +) +PROPERTIES +( + "timeout" = "3600" +); +``` + +### Routine Load + +`Routine Load`的写法在 `columns`字段增加映射,映射方式同上,语法如下: + +```text + CREATE ROUTINE LOAD example_db.test1 ON example_tbl + [WITH MERGE|APPEND|DELETE] + COLUMNS(k1, k2, k3, v1, v2, label), + WHERE k1 > 100 and k2 like "%doris%" + [DELETE ON label=true] + PROPERTIES + ( + "desired_concurrent_number"="3", + "max_batch_interval" = "20", + "max_batch_rows" = "300000", + "max_batch_size" = "209715200", + "strict_mode" = "false" + ) + FROM KAFKA + ( + "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092", + "kafka_topic" = "my_topic", + "kafka_partitions" = "0,1,2,3", + "kafka_offsets" = "101,0,0,200" + ); +``` + +## 注意事项 + +1. 由于除`Stream Load` 外的导入操作在doris 内部有可能乱序执行,因此在使用`MERGE` 方式导入时如果不是`Stream Load`,需要与 load sequence 一起使用,具体的 语法可以参照[`sequence`](sequence-column-manual.html)列 相关的文档; +2. 
`DELETE ON` 条件只能与 MERGE 一起使用。 + +## 使用示例 + +### 查看是否启用批量删除支持 + +```text +mysql> SET show_hidden_columns=true; +Query OK, 0 rows affected (0.00 sec) + +mysql> DESC test; ++-----------------------+--------------+------+-------+---------+---------+ +| Field | Type | Null | Key | Default | Extra | ++-----------------------+--------------+------+-------+---------+---------+ +| name | VARCHAR(100) | No | true | NULL | | +| gender | VARCHAR(10) | Yes | false | NULL | REPLACE | +| age | INT | Yes | false | NULL | REPLACE | +| __DORIS_DELETE_SIGN__ | TINYINT | No | false | 0 | REPLACE | ++-----------------------+--------------+------+-------+---------+---------+ +4 rows in set (0.00 sec) +``` + +### Stream Load使用示例 + +1. 正常导入数据: + +```text +curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -H "merge_type: APPEND" -T ~/table1_data http://127.0.0.1:8130/api/test/table1/_stream_load +``` + +其中的APPEND 条件可以省略,与下面的语句效果相同: + +```text +curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -T ~/table1_data http://127.0.0.1:8130/api/test/table1/_stream_load +``` + +2. 将与导入数据key 相同的数据全部删除 + +```text +curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -H "merge_type: DELETE" -T ~/table1_data http://127.0.0.1:8130/api/test/table1/_stream_load +``` + +假设导入表中原有数据为: + +```text ++--------+----------+----------+------+ +| siteid | citycode | username | pv | ++--------+----------+----------+------+ +| 3 | 2 | tom | 2 | +| 4 | 3 | bush | 3 | +| 5 | 3 | helen | 3 | ++--------+----------+----------+------+ +``` + +导入数据为: + +```text +3,2,tom,0 +``` + +导入后数据变成: + +```text ++--------+----------+----------+------+ +| siteid | citycode | username | pv | ++--------+----------+----------+------+ +| 4 | 3 | bush | 3 | +| 5 | 3 | helen | 3 | ++--------+----------+----------+------+ +``` + +3. 将导入数据中与`site_id=1` 的行的key列相同的行 + +```text +curl --location-trusted -u root: -H "column_separator:," -H "columns: siteid, citycode, username, pv" -H "merge_type: MERGE" -H "delete: siteid=1" -T ~/table1_data http://127.0.0.1:8130/api/test/table1/_stream_load +``` + +假设导入前数据为: + +```text ++--------+----------+----------+------+ +| siteid | citycode | username | pv | ++--------+----------+----------+------+ +| 4 | 3 | bush | 3 | +| 5 | 3 | helen | 3 | +| 1 | 1 | jim | 2 | ++--------+----------+----------+------+ +``` + +导入数据为: + +```text +2,1,grace,2 +3,2,tom,2 +1,1,jim,2 +``` + +导入后为: + +```text ++--------+----------+----------+------+ +| siteid | citycode | username | pv | ++--------+----------+----------+------+ +| 4 | 3 | bush | 3 | +| 2 | 1 | grace | 2 | +| 3 | 2 | tom | 2 | +| 5 | 3 | helen | 3 | ++--------+----------+----------+------+ +``` + diff --git a/new-docs/zh-CN/data-operate/update-delete/delete-manual.md b/new-docs/zh-CN/data-operate/update-delete/delete-manual.md index 79bee0d4b0..0d6efe2ace 100644 --- a/new-docs/zh-CN/data-operate/update-delete/delete-manual.md +++ b/new-docs/zh-CN/data-operate/update-delete/delete-manual.md @@ -24,4 +24,133 @@ specific language governing permissions and limitations under the License. 
--> -# Delete 操作 \ No newline at end of file +# Delete 操作 + +Delete不同于其他导入方式,它是一个同步过程,与Insert into相似,所有的Delete操作在Doris中是一个独立的导入作业,一般Delete语句需要指定表和分区以及删除的条件来筛选要删除的数据,并将会同时删除base表和rollup表的数据。 + +## 语法 + +delete操作的语法详见官网 [DELETE](../../sql-manual/sql-reference-v2/Data-Manipulation-Statements/Manipulation/DELETE.html) 语法。 + +## 返回结果 + +Delete命令是一个SQL命令,返回结果是同步的,分为以下几种: + +1. 执行成功 + + 如果Delete顺利执行完成并可见,将返回下列结果,`Query OK`表示成功 + + ```text + mysql> delete from test_tbl PARTITION p1 where k1 = 1; + Query OK, 0 rows affected (0.04 sec) + {'label':'delete_e7830c72-eb14-4cb9-bbb6-eebd4511d251', 'status':'VISIBLE', 'txnId':'4005'} + ``` + +2. 提交成功,但未可见 + + Doris的事务提交分为两步:提交和发布版本,只有完成了发布版本步骤,结果才对用户是可见的。若已经提交成功了,那么就可以认为最终一定会发布成功,Doris会尝试在提交完后等待发布一段时间,如果超时后即使发布版本还未完成也会优先返回给用户,提示用户提交已经完成。若如果Delete已经提交并执行,但是仍未发布版本和可见,将返回下列结果 + + ```text + mysql> delete from test_tbl PARTITION p1 where k1 = 1; + Query OK, 0 rows affected (0.04 sec) + {'label':'delete_e7830c72-eb14-4cb9-bbb6-eebd4511d251', 'status':'COMMITTED', 'txnId':'4005', 'err':'delete job is committed but may be taking effect later' } + ``` + + 结果会同时返回一个json字符串: + + `affected rows`:表示此次删除影响的行,由于Doris的删除目前是逻辑删除,因此对于这个值是恒为0; + + `label`:自动生成的 label,是该导入作业的标识。每个导入作业,都有一个在单 database 内部唯一的 Label; + + `status`:表示数据删除是否可见,如果可见则显示`VISIBLE`,如果不可见则显示`COMMITTED`; + + `txnId`:这个Delete job对应的事务id; + + `err`:字段会显示一些本次删除的详细信息。 + +3. 提交失败,事务取消 + + 如果Delete语句没有提交成功,将会被Doris自动中止,返回下列结果 + + ```text + mysql> delete from test_tbl partition p1 where k1 > 80; + ERROR 1064 (HY000): errCode = 2, detailMessage = {错误原因} + ``` + + 示例: + + 比如说一个超时的删除,将会返回timeout时间和未完成的`(tablet=replica)` + + ```text + mysql> delete from test_tbl partition p1 where k1 > 80; + ERROR 1064 (HY000): errCode = 2, detailMessage = failed to delete replicas from job: 4005, Unfinished replicas:10000=60000, 10001=60000, 10002=60000 + ``` + + **综上,对于Delete操作返回结果的正确处理逻辑为:** + + 1. 如果返回结果为`ERROR 1064 (HY000)`,则表示删除失败; + 2. 
如果返回结果为`Query OK`,则表示删除执行成功; + - 如果`status`为`COMMITTED`,表示数据仍不可见,用户可以稍等一段时间再用`show delete`命令查看结果; + - 如果`status`为`VISIBLE`,表示数据删除成功。 + +## Delete操作相关FE配置 + +**TIMEOUT配置** + +总体来说,Doris的删除作业的超时时间限制在30秒到5分钟时间内,具体时间可通过下面配置项调整 + +- `tablet_delete_timeout_second` + + delete自身的超时时间是可受指定分区下tablet的数量弹性改变的,此项配置为平均一个tablet所贡献的timeout时间,默认值为2。 + + 假设此次删除所指定分区下有5个tablet,那么可提供给delete的timeout时间为10秒,由于低于最低超时时间30秒,因此最终超时时间为30秒。 + +- `load_straggler_wait_second` + + 如果用户预估的数据量确实比较大,使得5分钟的上限不足时,用户可以通过此项调整timeout上限,默认值为300。 + + **TIMEOUT的具体计算规则为(秒)** + + `TIMEOUT = MIN(load_straggler_wait_second, MAX(30, tablet_delete_timeout_second * tablet_num))` + +- `query_timeout` + + 因为delete本身是一个SQL命令,因此删除语句也会受session限制,timeout还受Session中的`query_timeout`值影响,可以通过`SET query_timeout = xxx`来增加超时时间,单位是秒。 + +**IN谓词配置** + +- `max_allowed_in_element_num_of_delete` + + 如果用户在使用in谓词时需要占用的元素比较多,用户可以通过此项调整允许携带的元素上限,默认值为1024。 + +## 查看历史记录 + +用户可以通过show delete语句查看历史上已执行完成的删除记录。 + +语法如下 + +```text +SHOW DELETE [FROM db_name] +``` + +使用示例 + +```text +mysql> show delete from test_db; ++-----------+---------------+---------------------+-----------------+----------+ +| TableName | PartitionName | CreateTime | DeleteCondition | State | ++-----------+---------------+---------------------+-----------------+----------+ +| empty_tbl | p3 | 2020-04-15 23:09:35 | k1 EQ "1" | FINISHED | +| test_tbl | p4 | 2020-04-15 23:09:53 | k1 GT "80" | FINISHED | ++-----------+---------------+---------------------+-----------------+----------+ +2 rows in set (0.00 sec) +``` + +## 注意事项 + +- 不同于 Insert into 命令,delete 不能手动指定`label`,有关 label 的概念可以查看[Insert Into](../import/import-way/insert-into-manual.html) 文档。 + +## 更多帮助 + +关于 **delete** 使用的更多详细语法,请参阅 [delete](../../sql-manual/sql-reference-v2/Data-Manipulation-Statements/Manipulation/DELETE.html) 命令手册,也可以在Mysql客户端命令行下输入 `HELP DELETE` 获取更多帮助信息。 + diff --git a/new-docs/zh-CN/data-operate/update-delete/sequence-column-manual.md b/new-docs/zh-CN/data-operate/update-delete/sequence-column-manual.md index 9128d8217e..aba3eee4ac 100644 --- a/new-docs/zh-CN/data-operate/update-delete/sequence-column-manual.md +++ b/new-docs/zh-CN/data-operate/update-delete/sequence-column-manual.md @@ -24,4 +24,211 @@ specific language governing permissions and limitations under the License. 
--> -# sequence 列 \ No newline at end of file +# sequence 列 + +sequence列目前只支持Uniq模型,Uniq模型主要针对需要唯一主键的场景,可以保证主键唯一性约束,但是由于使用REPLACE聚合方式,在同一批次中导入的数据,替换顺序不做保证,详细介绍可以参考[数据模型](../../data-table/data-model.md)。替换顺序无法保证则无法确定最终导入到表中的具体数据,存在了不确定性。 + +为了解决这个问题,Doris支持了sequence列,通过用户在导入时指定sequence列,相同key列下,REPLACE聚合类型的列将按照sequence列的值进行替换,较大值可以替换较小值,反之则无法替换。该方法将顺序的确定交给了用户,由用户控制替换顺序。 + +## 适用场景 + +Sequence列只能在Uniq数据模型下使用。 + +## 基本原理 + +通过增加一个隐藏列`__DORIS_SEQUENCE_COL__`实现,该列的类型由用户在建表时指定,在导入时确定该列具体值,并依据该值对REPLACE列进行替换。 + +### 建表 + +创建Uniq表时,将按照用户指定类型自动添加一个隐藏列`__DORIS_SEQUENCE_COL__`。 + +### 导入 + +导入时,fe在解析的过程中将隐藏列的值设置成 `order by` 表达式的值(broker load和routine load),或者`function_column.sequence_col`表达式的值(stream load),value列将按照该值进行替换。隐藏列`__DORIS_SEQUENCE_COL__`的值既可以设置为数据源中一列,也可以是表结构中的一列。 + +### 读取 + +请求包含value列时需要额外读取`__DORIS_SEQUENCE_COL__`列,该列用于在相同key列下,REPLACE聚合函数替换顺序的依据,较大值可以替换较小值,反之则不能替换。 + +### Cumulative Compaction + +Cumulative Compaction 时和读取过程原理相同。 + +### Base Compaction + +Base Compaction 时读取过程原理相同。 + +## 使用语法 + +Sequence列建表时在property中增加了一个属性,用来标识`__DORIS_SEQUENCE_COL__`的类型 导入的语法设计方面主要是增加一个从sequence列的到其他column的映射,各个导种方式设置的将在下面分别介绍 + +**建表** + +创建Uniq表时,可以指定sequence列类型 + +```text +PROPERTIES ( + "function_column.sequence_type" = 'Date', +); +``` + +sequence_type用来指定sequence列的类型,可以为整型和时间类型。 + +**Stream Load** + +stream load 的写法是在header中的`function_column.sequence_col`字段添加隐藏列对应的source_sequence的映射, 示例 + +```text +curl --location-trusted -u root -H "columns: k1,k2,source_sequence,v1,v2" -H "function_column.sequence_col: source_sequence" -T testData http://host:port/api/testDb/testTbl/_stream_load +``` + +**Broker Load** + +在`ORDER BY` 处设置隐藏列映射的source_sequence字段 + +```text +LOAD LABEL db1.label1 +( + DATA INFILE("hdfs://host:port/user/data/*/test.txt") + INTO TABLE `tbl1` + COLUMNS TERMINATED BY "," + (k1,k2,source_sequence,v1,v2) + ORDER BY source_sequence +) +WITH BROKER 'broker' +( + "username"="user", + "password"="pass" +) +PROPERTIES +( + "timeout" = "3600" +); +``` + +**Routine Load** + +映射方式同上,示例如下 + +```text + CREATE ROUTINE LOAD example_db.test1 ON example_tbl + [WITH MERGE|APPEND|DELETE] + COLUMNS(k1, k2, source_sequence, v1, v2), + WHERE k1 > 100 and k2 like "%doris%" + [ORDER BY source_sequence] + PROPERTIES + ( + "desired_concurrent_number"="3", + "max_batch_interval" = "20", + "max_batch_rows" = "300000", + "max_batch_size" = "209715200", + "strict_mode" = "false" + ) + FROM KAFKA + ( + "kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092", + "kafka_topic" = "my_topic", + "kafka_partitions" = "0,1,2,3", + "kafka_offsets" = "101,0,0,200" + ); +``` + +## 启用sequence column支持 + +在新建表时如果设置了`function_column.sequence_type` ,则新建表将支持sequence column。 对于一个不支持sequence column的表,如果想要使用该功能,可以使用如下语句: `ALTER TABLE example_db.my_table ENABLE FEATURE "SEQUENCE_LOAD" WITH PROPERTIES ("function_column.sequence_type" = "Date")` 来启用。 如果不确定一个表是否支持sequence column,可以通过设置一个session variable来显示隐藏列 `SET show_hidden_columns=true` ,之后使用`desc tablename`,如果输出中有`__DORIS_SEQUENCE_COL__` 列则支持,如果没有则不支持。 + +## 使用示例 + +下面以Stream Load为例为示例来展示使用方式: + +1. 
创建支持sequence column的表 + +表结构如下: + +```text +MySQL > desc test_table; ++-------------+--------------+------+-------+---------+---------+ +| Field | Type | Null | Key | Default | Extra | ++-------------+--------------+------+-------+---------+---------+ +| user_id | BIGINT | No | true | NULL | | +| date | DATE | No | true | NULL | | +| group_id | BIGINT | No | true | NULL | | +| modify_date | DATE | No | false | NULL | REPLACE | +| keyword | VARCHAR(128) | No | false | NULL | REPLACE | ++-------------+--------------+------+-------+---------+---------+ +``` + +2. 正常导入数据: + +导入如下数据 + +```text +1 2020-02-22 1 2020-02-22 a +1 2020-02-22 1 2020-02-22 b +1 2020-02-22 1 2020-03-05 c +1 2020-02-22 1 2020-02-26 d +1 2020-02-22 1 2020-02-22 e +1 2020-02-22 1 2020-02-22 b +``` + +此处以stream load为例, 将sequence column映射为modify_date列 + +```text +curl --location-trusted -u root: -H "function_column.sequence_col: modify_date" -T testData http://host:port/api/test/test_table/_stream_load +``` + +结果为 + +```text +MySQL > select * from test_table; ++---------+------------+----------+-------------+---------+ +| user_id | date | group_id | modify_date | keyword | ++---------+------------+----------+-------------+---------+ +| 1 | 2020-02-22 | 1 | 2020-03-05 | c | ++---------+------------+----------+-------------+---------+ +``` + +在这次导入中,因sequence column的值(也就是modify_date中的值)中'2020-03-05'为最大值,所以keyword列中最终保留了c。 + +3. 替换顺序的保证 + +上述步骤完成后,接着导入如下数据 + +```text +1 2020-02-22 1 2020-02-22 a +1 2020-02-22 1 2020-02-23 b +``` + +查询数据 + +```text +MySQL [test]> select * from test_table; ++---------+------------+----------+-------------+---------+ +| user_id | date | group_id | modify_date | keyword | ++---------+------------+----------+-------------+---------+ +| 1 | 2020-02-22 | 1 | 2020-03-05 | c | ++---------+------------+----------+-------------+---------+ +``` + +由于新导入的数据的sequence column都小于表中已有的值,无法替换 再尝试导入如下数据 + +```text +1 2020-02-22 1 2020-02-22 a +1 2020-02-22 1 2020-03-23 w +``` + +查询数据 + +```text +MySQL [test]> select * from test_table; ++---------+------------+----------+-------------+---------+ +| user_id | date | group_id | modify_date | keyword | ++---------+------------+----------+-------------+---------+ +| 1 | 2020-02-22 | 1 | 2020-03-23 | w | ++---------+------------+----------+-------------+---------+ +``` + +此时就可以替换表中原有的数据 + + + diff --git a/new-docs/zh-CN/data-operate/update-delete/update.md b/new-docs/zh-CN/data-operate/update-delete/update.md index 0a15b6a6fd..88d61c194b 100644 --- a/new-docs/zh-CN/data-operate/update-delete/update.md +++ b/new-docs/zh-CN/data-operate/update-delete/update.md @@ -24,4 +24,94 @@ specific language governing permissions and limitations under the License. 
--> -# 数据更新 \ No newline at end of file +# 数据更新 + +本文主要讲述如果我们需要修改或更新Doris中的数据,如何使用UPDATE命令来操作。数据更新对Doris的版本有限制,只能在Doris **Version 0.15.x +** 才可以使用。 + +## 适用场景 + +- 对满足某些条件的行,修改他的取值; +- 点更新,小范围更新,待更新的行最好是整个表的非常小的一部分; +- update 命令只能在 Unique 数据模型的表中执行。 + +## 基本原理 + +利用查询引擎自身的 where 过滤逻辑,从待更新表中筛选出需要被更新的行。再利用 Unique 模型自带的 Value 列新数据替换旧数据的逻辑,将待更新的行变更后,再重新插入到表中,从而实现行级别更新。 + +### 同步 + +Update 语法在Doris中是一个同步语法,即 Update 语句执行成功,更新操作也就完成了,数据是可见的。 + +### 性能 + +Update 语句的性能和待更新的行数以及 condition 的检索效率密切相关。 + +- 待更新的行数:待更新的行数越多,Update 语句的速度就会越慢。这和导入的原理是一致的。 Doris 的更新比较合适偶发更新的场景,比如修改个别行的值。 Doris 并不适合大批量的修改数据。大批量修改会使得 Update 语句运行时间很久。 +- condition 的检索效率:Doris 的 Update 实现原理是先将满足 condition 的行读取处理,所以如果 condition 的检索效率高,则 Update 的速度也会快。 condition 列最好能命中索引或者分区分桶裁剪,这样 Doris 就不需要扫全表,可以快速定位到需要更新的行,从而提升更新效率。 **强烈不推荐 condition 列中包含 UNIQUE 模型的 value 列**。 + +### 并发控制 + +默认情况下,并不允许同一时间对同一张表并发进行多个 Update 操作。 + +主要原因是,Doris 目前支持的是行更新,这意味着,即使用户声明的是 `SET v2 = 1`,实际上,其他所有的 Value 列也会被覆盖一遍(尽管值没有变化)。 + +这就会存在一个问题,如果同时有两个 Update 操作对同一行进行更新,那么其行为可能是不确定的,也就是可能存在脏数据。 + +但在实际应用中,如果用户自己可以保证即使并发更新,也不会同时对同一行进行操作的话,就可以手动打开并发限制。通过修改 FE 配置 `enable_concurrent_update`,当配置值为 true 时,则对更新并发无限制。 + +## 使用风险 + +由于 Doris 目前支持的是行更新,并且采用的是读取后再写入的两步操作,则如果 Update 语句和其他导入或 Delete 语句刚好修改的是同一行时,存在不确定的数据结果。 + +所以用户在使用的时候,一定要注意***用户侧自己***进行 Update 语句和其他 DML 语句的并发控制。 + +## 使用示例 + +假设 Doris 中存在一张订单表,其中 订单id 是 Key 列,订单状态,订单金额是 Value 列。数据状态如下: + +| 订单id | 订单金额 | 订单状态 | +| ------ | -------- | -------- | +| 1 | 100 | 待付款 | + +```sql ++----------+--------------+--------------+ +| order_id | order_amount | order_status | ++----------+--------------+--------------+ +| 1 | 100 | 待付款 | ++----------+--------------+--------------+ +1 row in set (0.01 sec) +``` + +这时候,用户点击付款后,Doris 系统需要将订单id 为 '1' 的订单状态变更为 '待发货',就需要用到 Update 功能。 + +```sql +mysql> UPDATE test_order SET order_status = '待发货' WHERE order_id = 1; +Query OK, 1 row affected (0.11 sec) +{'label':'update_20ae22daf0354fe0-b5aceeaaddc666c5', 'status':'VISIBLE', 'txnId':'33', 'queryId':'20ae22daf0354fe0-b5aceeaaddc666c5'} +``` + +更新后结果如下 + +```sql ++----------+--------------+--------------+ +| order_id | order_amount | order_status | ++----------+--------------+--------------+ +| 1 | 100 | 待发货 | ++----------+--------------+--------------+ +1 row in set (0.01 sec) +``` + +用户执行 UPDATE 命令后,系统会进行如下三步: + +- 第一步:读取满足 WHERE 订单id=1 的行 (1,100,'待付款') + +- 第二步:变更该行的订单状态,从'待付款'改为'待发货' (1,100,'待发货') + +- 第三步:将更新后的行再插入回表中,从而达到更新的效果。 + + |订单id | 订单金额| 订单状态| |---|---|---| | 1 | 100| 待付款 | | 1 | 100 | 待发货 | 由于表 test_order 是 UNIQUE 模型,所以相同 Key 的行,之后后者才会生效,所以最终效果如下: |订单id | 订单金额| 订单状态| |---|---|---| | 1 | 100 | 待发货 | + +## 更多帮助 + +关于 **数据更新** 使用的更多详细语法,请参阅 [update](../../sql-manual/sql-reference-v2/Data-Manipulation-Statements/Manipulation/UPDATE.html) 命令手册,也可以在Mysql客户端命令行下输入 `HELP UPDATE` 获取更多帮助信息。 +
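To make the order example above easier to reproduce end to end, the following is a hedged sketch. Only the `UPDATE` statement is taken from the document; the table name `test_order`, the column types, and the single-bucket, single-replica settings are assumptions for a small test environment.

```sql
-- Hypothetical setup for the order example; only the UPDATE statement below
-- mirrors the document, the DDL and test data are assumptions.
CREATE TABLE test_order
(
    order_id     BIGINT NOT NULL,
    order_amount INT,
    order_status VARCHAR(32)
)
UNIQUE KEY(order_id)
DISTRIBUTED BY HASH(order_id) BUCKETS 1
PROPERTIES ("replication_num" = "1");

-- Seed the single row from the example: order 1, amount 100, status "待付款".
INSERT INTO test_order VALUES (1, 100, '待付款');

-- Row-level update: the matching row is read, its value column is rewritten,
-- and the row is re-inserted; the UNIQUE key keeps only the latest version.
UPDATE test_order SET order_status = '待发货' WHERE order_id = 1;

-- Expect a single row: (1, 100, '待发货').
SELECT * FROM test_order;
```

Because `test_order` is a UNIQUE model table, the final `SELECT` should return only the updated version of the row.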