Support setting timezone for stream load and routine load (#1831)
# ROUTINE LOAD

## description

The Routine Load function allows users to submit a resident load job that continuously reads data from a specified data source and loads it into Doris. Currently, only CSV-format text data can be loaded from Kafka, either without authentication or via SSL authentication.

Syntax:

```
CREATE ROUTINE LOAD [db.]job_name ON tbl_name
[load_properties]
[job_properties]
FROM data_source
[data_source_properties]
```

1. [db.]job_name

The name of the load job. Within the same database, only one job with the same name can be running.

2. tbl_name

Specifies the name of the table to be loaded.

3. load_properties

Used to describe the load data. Syntax:

```
[column_separator],
[columns_mapping],
[where_predicates],
[partitions]
```

1. column_separator:

Specifies the column separator, for example:

`COLUMNS TERMINATED BY ","`

The default is: `\t`

2. columns_mapping:

Specifies the mapping of the columns in the source data and defines how derived columns are generated.

1. Mapped columns:

Specify, in order, which columns in the source data correspond to which columns in the destination table. For columns you want to skip, specify a column name that does not exist.

Suppose the destination table has three columns k1, k2, v1, and the source data has four columns, of which columns 1, 2, and 4 correspond to k2, k1, and v1 respectively. It is written as follows:

`COLUMNS (k2, k1, xxx, v1)`

Here xxx is a non-existent column name used to skip the third column in the source data.

2. Derived columns:

A column expressed in the form col_name = expr is called a derived column, that is, the value of the corresponding column in the destination table is computed by expr.

Derived columns are usually placed after the mapped columns. Although this is not mandatory, Doris always parses the mapped columns first and then the derived columns.

Continuing the previous example, suppose the destination table also has a fourth column v2, which is generated by the sum of k1 and k2. It can be written as follows:

`COLUMNS (k2, k1, xxx, v1, v2 = k1 + k2);`

3. where_predicates

Used to specify filter conditions to filter out unwanted rows. Filter columns can be either mapped columns or derived columns.

For example, if we only want to load rows with k1 greater than 100 and k2 equal to 1000, write it as follows:

`WHERE k1 > 100 and k2 = 1000`

4. partitions

Specifies which partitions of the destination table to load into. If not specified, data is automatically loaded into the corresponding partitions.

Example:

`PARTITION(p1, p2, p3)`

4. job_properties

General parameters that control the routine load job.

Syntax:

```
PROPERTIES (
"key1" = "val1",
"key2" = "val2"
)
```

Currently we support the following parameters:

1. `desired_concurrent_number`

The desired concurrency. A routine load job is split into multiple subtasks. This parameter specifies how many subtasks of a job can run simultaneously. Must be greater than 0. The default is 3.

This is not the actual concurrency; the actual concurrency also takes the number of nodes in the cluster, the cluster load, and the data source into account.

Example:

`"desired_concurrent_number" = "3"`

2. `max_batch_interval/max_batch_rows/max_batch_size`

These three parameters represent:

1) The maximum execution time of each subtask, in seconds. The range is 5 to 60. The default is 10.

2) The maximum number of rows read per subtask. Must be greater than or equal to 200000. The default is 200000.

3) The maximum number of bytes read per subtask. The unit is bytes and the range is 100MB to 1GB. The default is 100MB.

These three parameters are used to control the execution time and throughput of a subtask. When any one of them reaches its threshold, the subtask ends.

Example:

```
"max_batch_interval" = "20",
"max_batch_rows" = "300000",
"max_batch_size" = "209715200"
```

3. `max_error_number`

The maximum number of error rows allowed within the sampling window. Must be greater than or equal to 0. The default is 0, which means no error rows are allowed.

The sampling window is max_batch_rows * 10. If the number of error rows within the sampling window exceeds max_error_number, the routine job is paused and manual intervention is required to check for data quality problems.

Rows filtered out by the where condition are not counted as error rows.
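
For example, to tolerate up to 1000 error rows within the sampling window (the value 1000 here is purely illustrative):

`"max_error_number" = "1000"`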

4. `strict_mode`

Whether to enable strict mode. The default is enabled. When enabled, if the column type conversion of non-null source data results in NULL, the row is filtered out. Specify it as `"strict_mode" = "true"`.

5. `timezone`

Specifies the time zone used by the load job. By default it uses the time zone of the session variable. This parameter affects the results of all time-zone-related functions involved in the load.
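
For example, to run the job in the Africa/Abidjan time zone (the same value used in example 2 below):

`"timezone" = "Africa/Abidjan"`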

5. `data_source`

The type of data source. Currently supported:

KAFKA

6. `data_source_properties`

Specifies information about the data source.

Syntax:

```
(
"key1" = "val1",
"key2" = "val2"
)
```

1. KAFKA data source

1. `kafka_broker_list`

Kafka broker connection information, in the format host:port. Multiple brokers are separated by commas.

Example:

`"kafka_broker_list" = "broker1:9092,broker2:9092"`

2. `kafka_topic`

Specifies the Kafka topic to subscribe to.

Example:

`"kafka_topic" = "my_topic"`

3. `kafka_partitions/kafka_offsets`

Specifies the Kafka partitions to subscribe to, and the corresponding starting offset for each partition.

The offset can be a specific value greater than or equal to 0, or:

1) OFFSET_BEGINNING: subscribe from the position where data is available.

2) OFFSET_END: subscribe from the end.

If not specified, all partitions under the topic are subscribed to from OFFSET_END by default.

Example:

```
"kafka_partitions" = "0,1,2,3",
"kafka_offsets" = "101,0,OFFSET_BEGINNING,OFFSET_END"
```

4. property

Specifies custom Kafka parameters.

The function is equivalent to the "--property" parameter in the kafka shell.

When the value of a parameter is a file, the keyword "FILE:" must be added before the value.

For information on how to create a file, see "HELP CREATE FILE;"

For more supported custom parameters, see the client-side configuration items in the official CONFIGURATION documentation of librdkafka.

Example:

```
"property.client.id" = "12345",
"property.ssl.ca.location" = "FILE:ca.pem"
```

1. When connecting to Kafka using SSL, you need to specify the following parameters:

```
"property.security.protocol" = "ssl",
"property.ssl.ca.location" = "FILE:ca.pem",
"property.ssl.certificate.location" = "FILE:client.pem",
"property.ssl.key.location" = "FILE:client.key",
"property.ssl.key.password" = "abcdefg"
```

Among them:

"property.security.protocol" and "property.ssl.ca.location" are required, to indicate that the connection method is SSL and to give the location of the CA certificate.

If client authentication is enabled on the Kafka server, you also need to set:

```
"property.ssl.certificate.location"
"property.ssl.key.location"
"property.ssl.key.password"
```

These specify the client's public key, private key, and the password of the private key, respectively.

2. Specify the default starting offset for Kafka partitions

If kafka_partitions/kafka_offsets is not specified, all partitions are consumed by default, and kafka_default_offsets can be used to specify the starting offset. The default is OFFSET_END, which starts subscribing from the end.

Values:

1) OFFSET_BEGINNING: subscribe from the position where data is available.

2) OFFSET_END: subscribe from the end.

Example:

`"property.kafka_default_offsets" = "OFFSET_BEGINNING"`

7. load data format sample

Integer class (TINYINT/SMALLINT/INT/BIGINT/LARGEINT): 1, 1000, 1234

Floating point class (FLOAT/DOUBLE/DECIMAL): 1.1, 0.23, .356

Date class (DATE/DATETIME): 2017-10-03, 2017-06-13 12:34:03

String class (CHAR/VARCHAR) (without quotes): I am a student, a

NULL value: \N
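
For illustration only, assuming `COLUMNS TERMINATED BY ","` and a hypothetical table with an integer, a decimal, a date, a string, and a nullable column, one row of source data combining the sample values above might look like:

```
1,2.3,2017-10-03,I am a student,\N
```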

## example

1. Create a Kafka routine load task named test1 for example_tbl of example_db. The load task is in strict mode.

```
CREATE ROUTINE LOAD example_db.test1 ON example_tbl
COLUMNS(k1, k2, k3, v1, v2, v3 = k1 * 100),
WHERE k1 > 100 and k2 like "%doris%"
PROPERTIES
(
"desired_concurrent_number"="3",
"max_batch_interval" = "20",
"max_batch_rows" = "300000",
"max_batch_size" = "209715200",
"strict_mode" = "false"
)
FROM KAFKA
(
"kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
"kafka_topic" = "my_topic",
"kafka_partitions" = "0,1,2,3",
"kafka_offsets" = "101,0,0,200"
);
```

2. Load data from a Kafka cluster via SSL authentication, and also set the client.id parameter. The load task is in non-strict mode and the time zone is Africa/Abidjan.

```
CREATE ROUTINE LOAD example_db.test1 ON example_tbl
COLUMNS(k1, k2, k3, v1, v2, v3 = k1 * 100),
WHERE k1 > 100 and k2 like "%doris%"
PROPERTIES
(
"desired_concurrent_number"="3",
"max_batch_interval" = "20",
"max_batch_rows" = "300000",
"max_batch_size" = "209715200",
"strict_mode" = "false",
"timezone" = "Africa/Abidjan"
)
FROM KAFKA
(
"kafka_broker_list" = "broker1:9092,broker2:9092,broker3:9092",
"kafka_topic" = "my_topic",
"property.security.protocol" = "ssl",
"property.ssl.ca.location" = "FILE:ca.pem",
"property.ssl.certificate.location" = "FILE:client.pem",
"property.ssl.key.location" = "FILE:client.key",
"property.ssl.key.password" = "abcdefg",
"property.client.id" = "my_client_id"
);
```

## keyword

CREATE, ROUTINE, LOAD
# STREAM LOAD

## description

NAME

load data to table in streaming

SYNOPSIS

curl --location-trusted -u user:passwd [-H ""...] -T data.file -XPUT http://fe_host:http_port/api/{db}/{table}/_stream_load

DESCRIPTION

This statement is used to load data into the specified table. The difference from a normal load is that this load method is synchronous.

This type of load still guarantees the atomicity of a batch of load tasks: either all of the data is loaded successfully, or all of it fails.

This operation also updates the data of the rollup tables associated with this base table.

This is a synchronous operation that returns the load result to the user after the entire data load is completed.

Currently, HTTP chunked and non-chunked uploads are supported. For non-chunked mode, Content-Length must be used to indicate the length of the uploaded content, which ensures data integrity.

In addition, users should preferably set the Expect header field to 100-continue, which avoids unnecessary data transmission in certain error scenarios.

OPTIONS

Users can pass in the load parameters through the HTTP headers.

`label`

A label for this load. Data with the same label cannot be loaded more than once. Users can avoid duplicate data loads by specifying a label.

Currently Palo retains the most recent successful labels for 30 minutes.
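
For example, a label can be passed as an HTTP header; the value 123 below matches the one used in the examples at the end of this document:

`-H "label:123"`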

`column_separator`

Used to specify the column separator in the load file. The default is `\t`. If the separator is an invisible character, it must be prefixed with `\x` and expressed in hexadecimal.

For example, the separator `\x01` of a hive file needs to be specified as `-H "column_separator:\x01"`.

`columns`

Used to specify the correspondence between the columns in the load file and the columns in the table. If the columns in the source file correspond exactly to the contents of the table, there is no need to specify this field. If the source file does not correspond to the table schema, this field is required to perform data conversion. There are two forms of columns. One corresponds directly to a field in the load file and is represented by the field name. The other is a derived column, with the syntax `column_name = expression`. A few examples:

Example 1: The table has three columns "c1, c2, c3", and the three columns in the source file correspond to "c3, c2, c1" in order; then you need to specify `-H "columns: c3, c2, c1"`.

Example 2: The table has three columns "c1, c2, c3". The first three columns in the source file correspond in order, but the file has one extra column; then you need to specify `-H "columns: c1, c2, c3, xxx"`.

The last column can be given any name as a placeholder.

Example 3: The table has three columns "year, month, day", and the source file has only one time column in the format "2018-06-01 01:02:03". Then you can specify `-H "columns: col, year = year(col), month=month(col), day=day(col)"` to complete the load.

`where`

Used to filter the data. If the user needs to filter out unwanted data, this can be done by setting this option.

Example 1: To load only rows where the k1 column equals 20180601, specify `-H "where: k1 = 20180601"` when loading.

`max_filter_ratio`

The maximum proportion of data that can be filtered out (for reasons such as irregular data). The default is zero tolerance. Irregular data does not include rows that are filtered out by the where condition.
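
For example, to allow up to 20% of the rows to be filtered out, as in example 3 below:

`-H "max_filter_ratio:0.2"`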

`partitions`

Used to specify the partitions targeted by this load. If the user can determine the partitions corresponding to the data, it is recommended to specify them. Data that does not fall into these partitions will be filtered out.

For example, to load into the p1 and p2 partitions: `-H "partitions: p1, p2"`

`timeout`

Specifies the timeout for the load, in seconds. The default is 600 seconds. The allowed range is 1 second to 259200 seconds.
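
For example, to set a 100-second timeout, as in example 1 below:

`-H "timeout:100"`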

`strict_mode`

Specifies whether strict mode is enabled for this load. It is enabled by default. It can be disabled with `-H "strict_mode: false"`.

`timezone`

Specifies the time zone used for this load. The default is UTC+8 (the East Eight zone). This parameter affects the results of all time-zone-related functions involved in the load.
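
For example, to perform the load in the Africa/Abidjan time zone, as in example 8 below:

`-H "timezone: Africa/Abidjan"`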

RETURN VALUES

After the load is completed, the related content of this load is returned in JSON format. It currently includes the following fields:

* `Status`: the load status.

* Success: the load succeeded and the data is visible.

* Publish Timeout: the load job has been successfully committed, but for some reason the data is not immediately visible. The user can treat the load as successful and does not need to retry.

* Label Already Exists: the label is already occupied by another job, which either loaded successfully or is still loading. The user needs to use the get label state command to determine the subsequent operation.

* Other: the load failed; the user can specify the label to retry the job.

* `Message`: a detailed description of the load status. On failure, the specific failure reason is returned.

* `NumberTotalRows`: the total number of rows read from the data stream.

* `NumberLoadedRows`: the number of data rows loaded this time; only valid when Status is Success.

* `NumberFilteredRows`: the number of rows filtered out by this load, i.e. rows with unqualified data quality.

* `NumberUnselectedRows`: the number of rows filtered out by the where condition in this load.

* `LoadBytes`: the amount of source file data loaded this time.

* `LoadTimeMs`: the time spent on this load.

* `ErrorURL`: the specific content of the filtered data; only the first 1000 items are retained.

ERRORS

You can view the load error details with the following statement:

```SHOW LOAD WARNINGS ON 'url'```

where url is the URL given by ErrorURL.

## example

1. Load the data from the local file 'testData' into the table 'testTbl' in the database 'testDb', and use a label for deduplication. Specify a timeout of 100 seconds.

```curl --location-trusted -u root -H "label:123" -H "timeout:100" -T testData http://host:port/api/testDb/testTbl/_stream_load```

2. Load the data from the local file 'testData' into the table 'testTbl' in the database 'testDb', use a label for deduplication, and load only rows with k1 equal to 20180601.

```curl --location-trusted -u root -H "label:123" -H "where: k1=20180601" -T testData http://host:port/api/testDb/testTbl/_stream_load```

3. Load the data from the local file 'testData' into the table 'testTbl' in the database 'testDb', allowing a 20% error rate (the user is in default_cluster).

```curl --location-trusted -u root -H "label:123" -H "max_filter_ratio:0.2" -T testData http://host:port/api/testDb/testTbl/_stream_load```

4. Load the data from the local file 'testData' into the table 'testTbl' in the database 'testDb', allowing a 20% error rate and specifying the column names of the file (the user is in default_cluster).

```curl --location-trusted -u root -H "label:123" -H "max_filter_ratio:0.2" -H "columns: k2, k1, v1" -T testData http://host:port/api/testDb/testTbl/_stream_load```

5. Load the data from the local file 'testData' into the p1 and p2 partitions of the table 'testTbl' in the database 'testDb', allowing a 20% error rate.

```curl --location-trusted -u root -H "label:123" -H "max_filter_ratio:0.2" -H "partitions: p1, p2" -T testData http://host:port/api/testDb/testTbl/_stream_load```

6. Load using streaming mode (the user is in default_cluster).

```seq 1 10 | awk '{OFS="\t"}{print $1, $1 * 10}' | curl --location-trusted -u root -T - http://host:port/api/testDb/testTbl/_stream_load```

7. Load into a table containing HLL columns, which can be columns in the table or columns in the data used to generate the HLL columns.

```curl --location-trusted -u root -H "columns: k1, k2, v1=hll_hash(k1)" -T testData http://host:port/api/testDb/testTbl/_stream_load```

8. Load data with strict mode filtering and set the time zone to Africa/Abidjan.

```curl --location-trusted -u root -H "strict_mode: true" -H "timezone: Africa/Abidjan" -T testData http://host:port/api/testDb/testTbl/_stream_load```

9. Load into a table with a `BITMAP_UNION` aggregate column, which can be a column in the table or a column in the data used to generate the `BITMAP_UNION` column.

```curl --location-trusted -u root -H "columns: k1, k2, v1=to_bitmap(k1)" -T testData http://host:port/api/testDb/testTbl/_stream_load```

## keyword

STREAM, LOAD