This CL mainly changes: 1. Reorganized the code logic to limit the supported json format to two, and the import behavior is more consistent. 2. Modified the statistical behavior of the number of error rows when loading in json format, so that the error rows can be counted correctly. 3. See `load-json-format.md` to get details of loading json format.
9.7 KiB
title, language
| title | language |
|---|---|
| Load Json Format Data | en |
Load Json Format Data
Doris supports data load in Json format since version 0.12.
Supported Load Methods
Currently only the following load methods support data import in Json format:
- Stream Load
- Routine Load
For specific instructions on the above load methods, please refer to the relevant documentation. This document mainly introduces the instructions for using Json in these load methods.
Supported Json Format
Currently, only the following two Json formats are supported:
-
Multi-line data represented by Array
Json format with Array as the root node. Each element in the Array represents a row of data to be loaded, usually an Object. Examples are as follows:
[ { "id": 123, "city" : "beijing"}, { "id": 456, "city" : "shanghai"}, ... ][ { "id": 123, "city" : { "name" : "beijing", "region" : "haidian"}}, { "id": 456, "city" : { "name" : "beijing", "region" : "chaoyang"}}, ... ]This method is usually used for the Stream Load method to represent multiple rows of data in a batch of load data.
This method must be used in conjunction with setting
stripe_outer_array=true. Doris will expand the array when parsing, and then parse each Object in turn as a row of data. -
Single row of data represented by Object
Json format with Object as the root node. The entire Object represents a row of data to be loaded. Examples are as follows:
{ "id": 123, "city" : "beijing"}{ "id": 123, "city" : { "name" : "beijing", "region" : "haidian" }}This method is usually used for the Routine Load method, such as representing a message in Kafka, that is, a row of data.
Json Path
Doris supports extracting the data specified in Json through Json Path.
Note: Because for Array type data, Doris will first expand the array, and finally perform single-line processing according to the Object format. Therefore, the examples after this document will be illustrated with Json data in single Object format.
-
Json Path is not specified
If Json Path is not specified, Doris will use the column names in the table to find the elements in Object by default. Examples are as follows:
The table contains two columns:
id,cityJson data is as follows:
{ "id": 123, "city" : "beijing"}Then Doris will use
id,cityto match, and get the final data123andbeijing.If the Json data is as follows:
{ "id": 123, "name" : "beijing"}Then use
id,cityto match and get the final data123andnull. -
Json Path is specified
Specify a set of Json Path in the form of a Json data. Each element in the array represents a column to be extracted. Examples are as follows:
["$.id", "$.name"]["$.id.sub_id", "$.name[0]", "$.city[0]"]Doris will use the specified Json Path for data matching and extraction.
-
Match non-primitive types
The values that the previous example finally matched are all primitive types, such as Integer, String, and so on. Doris currently does not support complex types, such as Array, Map, etc. So when a non-primitive type is matched, Doris will convert the type to a Json format string and load it as a string type. Examples are as follows:
{ "id": 123, "city" : { "name" : "beijing", "region" : "haidian" }}The Json Path is
["$.city"]. Then the matched elements are:{ "name" : "beijing", "region" : "haidian" }This element will be converted into a string for subsequent load operations:
"{'name':'beijing','region':'haidian'}" -
Match failed
When the match fails,
nullwill be returned. Examples are as follows:Json data is:
{ "id": 123, "name" : "beijing"}The Json Path is
["$.id", "$.info"]. Then the matched elements are123andnull.Doris currently does not distinguish between the null value represented in the Json data and the null value generated when the match fails. Suppose the Json data is:
{ "id": 123, "name" : null }Then use the following two Json Path will get the same result:
123andnull.["$.id", "$.name"]["$.id", "$.info"] -
Complete match failed
In order to prevent misoperation caused by some parameter setting errors. When Doris tries to match a row of data, if all columns fail to match, it will be considered a error row. Suppose the Json data is:
{ "id": 123, "city" : "beijing" }If Json Path is incorrectly written as (or when Json Path is not specified, the columns in the table do not contain
idandcity):["$.ad", "$.infa"]Will result in a complete match failure, the line will be marked as an error row, instead of producing
null, null.
Examples
Stream Load
Because of the indivisible nature of the Json format, when using Stream Load to load a Json format file, the file content will be fully loaded into memory before processing. Therefore, if the file is too large, it may occupy more memory.
Suppose the table structure is:
id INT NOT NULL,
city VARHCAR NULL,
code INT NULL
-
Load single-line data 1
{"id": 100, "city": "beijing", "code" : 1}-
Not specify Json Path
curl --location-trusted -u user:passwd -H "format: json" -T data.json http://localhost:8030/api/db1/tbl1/_stream_loadResults:
100 beijing 1 -
Specify Json Path
curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.city\",\"$.code\"]" -T data.json http://localhost:8030/api/db1/tbl1/_stream_loadResults:
100 beijing 1
-
-
Load sigle-line data 2
{"id": 100, "content": {"city": "beijing", "code" : 1}}-
Specify Json Path
curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.content.city\",\"$.content.code\"]" -T data.json http://localhost:8030/api/db1/tbl1/_stream_loadResults:
100 beijing 1
-
-
Load multi-line data
[ {"id": 100, "city": "beijing", "code" : 1}, {"id": 101, "city": "shanghai"}, {"id": 102, "city": "tianjin", "code" : 3}, {"id": 103, "city": "chongqing", "code" : 4}, {"id": 104, "city": ["zhejiang", "guangzhou"], "code" : 5}, { "id": 105, "city": { "order1": ["guangzhou"] }, "code" : 6 } ]-
Specify Json Path
curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.city\",\"$.code\"]" -H "strip_outer_array: true" -T data.json http://localhost:8030/api/db1/tbl1/_stream_loadResults:
100 beijing 1 101 shanghai NULL 102 tianjin 3 103 chongqing 4 104 ["zhejiang","guangzhou"] 5 105 {"order1":["guangzhou"]} 6
-
-
Convert load data
The data is still the multi-row data in Example 3. Now you need to add 1 to the
codecolumn in the loaded data and load it.curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.city\",\"$.code\"]" -H "strip_outer_array: true" -H "columns: id, city, tmpc, code=tmpc+1" -T data.json http://localhost:8030/api/db1/tbl1/_stream_loadResults:
100 beijing 2 101 shanghai NULL 102 tianjin 4 103 chongqing 5 104 ["zhejiang","guangzhou"] 6 105 {"order1":["guangzhou"]} 7
Routine Load
Routine Load processes Json data the same as Stream Load. I will not repeat them here.
For the Kafka data source, the content of each Massage is treated as a complete Json data. If multiple rows of data expressed in Array format in a Massage are loaded, multiple rows will be loaded, and Kafka's offset will only increase by 1. If an Array format Json represents multiple rows of data, but because the Json format error causes the parsing Json to fail, the error row will only increase by 1 (because the parsing fails, in fact, Doris cannot determine how many rows of data it contains, and can only add one row of errors rows record).