Support setting timezone for stream load and routine load (#1831)

This commit is contained in:
Mingyu Chen
2019-09-20 07:55:05 +08:00
committed by ZHAO Chun
parent 7bf02d0ae7
commit e8da855cd2
11 changed files with 423 additions and 228 deletions


# STREAM LOAD
## description
NAME
stream-load: load data to table in streaming

SYNOPSIS
curl --location-trusted -u user:passwd [-H ""...] -T data.file -XPUT http://fe_host:http_port/api/{db}/{table}/_stream_load
DESCRIPTION
This statement is used to load data into the specified table. Unlike a normal load, this load method is synchronous.
This type of load still guarantees the atomicity of a batch of load tasks: either all of the data is loaded successfully, or all of it fails.
This operation also updates the data of the rollup tables associated with this base table.
This is a synchronous operation that returns the result to the user after the entire data load has completed.
Currently, both HTTP chunked and non-chunked uploads are supported. For non-chunked mode, Content-Length must be set to indicate the length of the uploaded content, which ensures data integrity.
In addition, it is recommended that the user set the Expect header field to 100-continue, which avoids unnecessary data transmission in certain error scenarios.
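For illustration, a minimal sketch of a non-chunked upload that sets the recommended Expect header explicitly (host, port, and file name are placeholders; curl computes Content-Length for `-T` uploads on its own):

```
# Non-chunked upload; curl fills in Content-Length for -T automatically.
# The Expect header is added explicitly here, as recommended above.
curl --location-trusted -u user:passwd \
    -H "Expect: 100-continue" \
    -T data.file -XPUT \
    http://fe_host:http_port/api/{db}/{table}/_stream_load
```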
OPTIONS
Users can pass in load parameters through the HTTP headers.

`label`
A label for this load. Data with the same label cannot be loaded more than once, so users can avoid loading duplicate data by specifying a label.
Currently Palo retains the most recent successful labels for 30 minutes.

`column_separator`
Specifies the column separator in the load file. The default is `\t`. If the separator is an invisible character, prefix it with `\x` and express it in hexadecimal.
For example, the separator `\x01` of a Hive file is specified as `-H "column_separator:\x01"`.

`columns`
Specifies the mapping between columns in the load file and columns in the table. If the columns in the source file correspond exactly to the table schema, this field does not need to be specified. If the source file does not match the table schema, this field is required to perform the data conversion. There are two forms of columns: one corresponds directly to a field in the load file and is expressed by the field name; the other is a derived column, with the syntax `column_name = expression`. A few examples help to understand.
Example 1: The table has three columns "c1, c2, c3", and the three columns in the source file correspond to "c3, c2, c1" in order; then you need to specify `-H "columns: c3, c2, c1"`.
Example 2: The table has three columns "c1, c2, c3", and the first three columns in the source file correspond in order, but there is one extra column; then you need to specify `-H "columns: c1, c2, c3, xxx"`. The last column can be given any name as a placeholder.
Example 3: The table has three columns "year, month, day", and the source file has only one time column in the format "2018-06-01 01:02:03"; then you can specify `-H "columns: col, year = year(col), month=month(col), day=day(col)"` to complete the load.

`where`
Used to filter out part of the data. If the user needs to discard unwanted rows, this option can be used.
Example 1: To load only rows where the k1 column equals 20180601, specify `-H "where: k1 = 20180601"` when loading.

`max_filter_ratio`
The maximum proportion of data that may be filtered out (for reasons such as non-conforming data). The default is zero tolerance. Non-conforming data does not include rows filtered out by the where condition.

`partitions`
Specifies the partitions targeted by this load. If the user can determine which partitions the data belongs to, it is recommended to specify this item. Data that does not belong to these partitions will be filtered out.
For example, to load into the p1 and p2 partitions: `-H "partitions: p1, p2"`.

`timeout`
Specifies the timeout of the load, in seconds. The default is 600 seconds. The allowed range is 1 second to 259200 seconds.

`strict_mode`
Specifies whether strict mode is enabled for this load. The default is enabled. It can be disabled with `-H "strict_mode: false"`.

`timezone`
Specifies the time zone used for this load. The default is the East Eight zone (UTC+8). This parameter affects the results of all time-zone-related functions involved in the load.
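As an illustration of how these headers combine, here is a hypothetical request that sets several of them at once (host, port, table names, and filter values are placeholders, not a prescribed configuration):

```
# Hypothetical request combining several of the headers above:
# dedup label, derived columns, row filter, error tolerance,
# target partitions, timeout, strict mode, and time zone.
curl --location-trusted -u root \
    -H "label:123" \
    -H "columns: col, year=year(col), month=month(col), day=day(col)" \
    -H "where: year >= 2018" \
    -H "max_filter_ratio:0.2" \
    -H "partitions: p1, p2" \
    -H "timeout:100" \
    -H "strict_mode: true" \
    -H "timezone: Africa/Abidjan" \
    -T testData http://host:port/api/testDb/testTbl/_stream_load
```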
RETURN VALUES
After the load completes, the result of this load is returned in JSON format. It currently includes the following fields:
* `Status`: the load status.
    * Success: the load succeeded and the data is visible.
    * Publish Timeout: the load job has been successfully committed, but for some reason the data is not immediately visible. The user can treat the load as successful and does not need to retry it.
    * Label Already Exists: the label is already occupied by another job, which either loaded successfully or is still loading. The user needs to use the get label state command to determine the subsequent operation.
    * Other: the load failed; the user can specify a label to retry the job.
* `Message`: a detailed description of the load status. On failure, it returns the specific reason for the failure.
* `NumberTotalRows`: the total number of rows read from the data stream.
* `NumberLoadedRows`: the number of rows loaded this time; valid only when Status is Success.
* `NumberFilteredRows`: the number of rows filtered out by this load, i.e. rows with unqualified data quality.
* `NumberUnselectedRows`: the number of rows filtered out by the where condition in this load.
* `LoadBytes`: the amount of source file data loaded this time.
* `LoadTimeMs`: the time spent on this load.
* `ErrorURL`: the specific content of the filtered data; only the first 1000 items are retained.
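A minimal sketch of inspecting these fields from the command line; the file name resp.json and the use of `jq` are assumptions for illustration only:

```
# Save the JSON result of a stream load and print a few of the fields above.
curl --location-trusted -u root -T testData -o resp.json \
    http://host:port/api/testDb/testTbl/_stream_load
jq '{Status, Message, NumberLoadedRows, NumberFilteredRows}' resp.json
```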
ERRORS
You can view the load error details with the following statement:
```SHOW LOAD WARNINGS ON 'url'```
where url is the url given by ErrorURL.
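Alternatively, a hypothetical way to fetch the filtered rows directly, assuming the response was saved to resp.json as in the sketch above and `jq` is available:

```
# Extract the error URL from the load response and fetch the filtered rows.
ERROR_URL=$(jq -r '.ErrorURL' resp.json)
curl "$ERROR_URL"
```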
## example
1. Load the data from the local file 'testData' into the table 'testTbl' in the database 'testDb', use a label for deduplication, and specify a timeout of 100 seconds.
    ```curl --location-trusted -u root -H "label:123" -H "timeout:100" -T testData http://host:port/api/testDb/testTbl/_stream_load```
2. Load the data from the local file 'testData' into the table 'testTbl' in the database 'testDb', use a label for deduplication, and load only the data whose k1 equals 20180601.
    ```curl --location-trusted -u root -H "label:123" -H "where: k1=20180601" -T testData http://host:port/api/testDb/testTbl/_stream_load```
3. Load the data from the local file 'testData' into the table 'testTbl' in the database 'testDb', allowing a 20% error rate (the user is in default_cluster).
    ```curl --location-trusted -u root -H "label:123" -H "max_filter_ratio:0.2" -T testData http://host:port/api/testDb/testTbl/_stream_load```
4. Load the data from the local file 'testData' into the table 'testTbl' in the database 'testDb', allow a 20% error rate, and specify the column names of the file (the user is in default_cluster).
    ```curl --location-trusted -u root -H "label:123" -H "max_filter_ratio:0.2" -H "columns: k2, k1, v1" -T testData http://host:port/api/testDb/testTbl/_stream_load```
5. Load the data from the local file 'testData' into the p1 and p2 partitions of the table 'testTbl' in the database 'testDb', allowing a 20% error rate.
    ```curl --location-trusted -u root -H "label:123" -H "max_filter_ratio:0.2" -H "partitions: p1, p2" -T testData http://host:port/api/testDb/testTbl/_stream_load```
6. Load using streaming mode (the user is in default_cluster).
    ```seq 1 10 | awk '{OFS="\t"}{print $1, $1 * 10}' | curl --location-trusted -u root -T - http://host:port/api/testDb/testTbl/_stream_load```
7. Load into a table containing HLL columns; the HLL columns can be generated from columns in the table or from columns in the data.
    ```curl --location-trusted -u root -H "columns: k1, k2, v1=hll_hash(k1)" -T testData http://host:port/api/testDb/testTbl/_stream_load```
8. Load data with strict mode filtering and set the time zone to Africa/Abidjan.
    ```curl --location-trusted -u root -H "strict_mode: true" -H "timezone: Africa/Abidjan" -T testData http://host:port/api/testDb/testTbl/_stream_load```
9. Load into a table with a `BITMAP_UNION` aggregate column; the column can be generated from a column in the table or from a column in the data.
    ```curl --location-trusted -u root -H "columns: k1, k2, v1=to_bitmap(k1)" -T testData http://host:port/api/testDb/testTbl/_stream_load```
## keyword
STREAM, LOAD