Files
doris/docs/en/administrator-guide/load-data/load-json-format.md
Mingyu Chen c3d9feed75 [Load][Json] Refactor json load logic to make it more reasonable (#4020)
This CL mainly changes:

1. Reorganized the code logic to limit the supported json format to two, and the import behavior is more consistent.
2. Modified the statistical behavior of the number of error rows when loading in json format, so that the error rows can be counted correctly.
3. See `load-json-format.md` to get details of loading json format.
2020-07-07 23:07:28 +08:00

9.7 KiB

title, language
title language
Load Json Format Data en

Load Json Format Data

Doris supports data load in Json format since version 0.12.

Supported Load Methods

Currently only the following load methods support data import in Json format:

  • Stream Load
  • Routine Load

For specific instructions on the above load methods, please refer to the relevant documentation. This document mainly introduces the instructions for using Json in these load methods.

Supported Json Format

Currently, only the following two Json formats are supported:

  1. Multi-line data represented by Array

    Json format with Array as the root node. Each element in the Array represents a row of data to be loaded, usually an Object. Examples are as follows:

    [
        { "id": 123, "city" : "beijing"},
        { "id": 456, "city" : "shanghai"},
        ...
    ]
    
    [
        { "id": 123, "city" : { "name" : "beijing", "region" : "haidian"}},
        { "id": 456, "city" : { "name" : "beijing", "region" : "chaoyang"}},
        ...
    ]
    

    This method is usually used for the Stream Load method to represent multiple rows of data in a batch of load data.

    This method must be used in conjunction with setting stripe_outer_array=true. Doris will expand the array when parsing, and then parse each Object in turn as a row of data.

  2. Single row of data represented by Object

    Json format with Object as the root node. The entire Object represents a row of data to be loaded. Examples are as follows:

    { "id": 123, "city" : "beijing"}
    
    { "id": 123, "city" : { "name" : "beijing", "region" : "haidian" }}
    

    This method is usually used for the Routine Load method, such as representing a message in Kafka, that is, a row of data.

Json Path

Doris supports extracting the data specified in Json through Json Path.

Note: Because for Array type data, Doris will first expand the array, and finally perform single-line processing according to the Object format. Therefore, the examples after this document will be illustrated with Json data in single Object format.

  • Json Path is not specified

    If Json Path is not specified, Doris will use the column names in the table to find the elements in Object by default. Examples are as follows:

    The table contains two columns: id, city

    Json data is as follows:

    { "id": 123, "city" : "beijing"}
    

    Then Doris will use id, city to match, and get the final data 123 and beijing.

    If the Json data is as follows:

    { "id": 123, "name" : "beijing"}
    

    Then use id, city to match and get the final data 123 and null.

  • Json Path is specified

    Specify a set of Json Path in the form of a Json data. Each element in the array represents a column to be extracted. Examples are as follows:

    ["$.id", "$.name"]
    
    ["$.id.sub_id", "$.name[0]", "$.city[0]"]
    

    Doris will use the specified Json Path for data matching and extraction.

  • Match non-primitive types

    The values that the previous example finally matched are all primitive types, such as Integer, String, and so on. Doris currently does not support complex types, such as Array, Map, etc. So when a non-primitive type is matched, Doris will convert the type to a Json format string and load it as a string type. Examples are as follows:

    { "id": 123, "city" : { "name" : "beijing", "region" : "haidian" }}
    

    The Json Path is ["$.city"]. Then the matched elements are:

    { "name" : "beijing", "region" : "haidian" }
    

    This element will be converted into a string for subsequent load operations:

    "{'name':'beijing','region':'haidian'}"
    
  • Match failed

    When the match fails, null will be returned. Examples are as follows:

    Json data is:

    { "id": 123, "name" : "beijing"}
    

    The Json Path is ["$.id", "$.info"]. Then the matched elements are 123 and null.

    Doris currently does not distinguish between the null value represented in the Json data and the null value generated when the match fails. Suppose the Json data is:

    { "id": 123, "name" : null }
    

    Then use the following two Json Path will get the same result: 123 and null.

    ["$.id", "$.name"]
    
    ["$.id", "$.info"]
    
  • Complete match failed

    In order to prevent misoperation caused by some parameter setting errors. When Doris tries to match a row of data, if all columns fail to match, it will be considered a error row. Suppose the Json data is:

    { "id": 123, "city" : "beijing" }
    

    If Json Path is incorrectly written as (or when Json Path is not specified, the columns in the table do not contain id and city):

    ["$.ad", "$.infa"]
    

    Will result in a complete match failure, the line will be marked as an error row, instead of producing null, null.

Examples

Stream Load

Because of the indivisible nature of the Json format, when using Stream Load to load a Json format file, the file content will be fully loaded into memory before processing. Therefore, if the file is too large, it may occupy more memory.

Suppose the table structure is:

id      INT     NOT NULL,
city    VARHCAR NULL,
code    INT     NULL
  1. Load single-line data 1

    {"id": 100, "city": "beijing", "code" : 1}
    
    • Not specify Json Path

      curl --location-trusted -u user:passwd -H "format: json" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load
      

      Results:

      100     beijing     1
      
    • Specify Json Path

      curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.city\",\"$.code\"]" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load
      

      Results:

      100     beijing     1
      
  2. Load sigle-line data 2

    {"id": 100, "content": {"city": "beijing", "code" : 1}}
    
    • Specify Json Path

      curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.content.city\",\"$.content.code\"]" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load
      

      Results:

      100     beijing     1
      
  3. Load multi-line data

    [
        {"id": 100, "city": "beijing", "code" : 1},
        {"id": 101, "city": "shanghai"},
        {"id": 102, "city": "tianjin", "code" : 3},
        {"id": 103, "city": "chongqing", "code" : 4},
        {"id": 104, "city": ["zhejiang", "guangzhou"], "code" : 5},
        {
            "id": 105,
            "city": {
                "order1": ["guangzhou"]
            }, 
            "code" : 6
        }
    ]
    
    • Specify Json Path

      curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.city\",\"$.code\"]" -H "strip_outer_array: true" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load
      

      Results:

      100     beijing                     1
      101     shanghai                    NULL
      102     tianjin                     3
      103     chongqing                   4
      104     ["zhejiang","guangzhou"]    5
      105     {"order1":["guangzhou"]}    6
      
  4. Convert load data

    The data is still the multi-row data in Example 3. Now you need to add 1 to the code column in the loaded data and load it.

    curl --location-trusted -u user:passwd -H "format: json" -H "jsonpaths: [\"$.id\",\"$.city\",\"$.code\"]" -H "strip_outer_array: true" -H "columns: id, city, tmpc, code=tmpc+1" -T data.json http://localhost:8030/api/db1/tbl1/_stream_load
    

    Results:

    100     beijing                     2
    101     shanghai                    NULL
    102     tianjin                     4
    103     chongqing                   5
    104     ["zhejiang","guangzhou"]    6
    105     {"order1":["guangzhou"]}    7
    

Routine Load

Routine Load processes Json data the same as Stream Load. I will not repeat them here.

For the Kafka data source, the content of each Massage is treated as a complete Json data. If multiple rows of data expressed in Array format in a Massage are loaded, multiple rows will be loaded, and Kafka's offset will only increase by 1. If an Array format Json represents multiple rows of data, but because the Json format error causes the parsing Json to fail, the error row will only increase by 1 (because the parsing fails, in fact, Doris cannot determine how many rows of data it contains, and can only add one row of errors rows record).