diff --git a/docs/en/docs/data-table/dynamic-schema-table.md b/docs/en/docs/data-table/dynamic-schema-table.md new file mode 100644 index 0000000000..0b21f6f572 --- /dev/null +++ b/docs/en/docs/data-table/dynamic-schema-table.md @@ -0,0 +1,129 @@ +--- +{ + "title": "dynamie schema table", + "language": "zh-CN" +} +--- + + + +# Dynamic Table +A dynamic schema table is a special kind of table which schema expands automatically with the import procedure. Currently, this feature is mainly used for importing semi-structured data such as JSON. Because JSON is self-describing, we can extract the schema information from the original document and infer the final type information. This special table can reduce manual schema change operations and easily import semi-structured data and automatically expand its schema. + + + +## Terminology +- Schema change, changing the structure of the table, such as adding columns, reducing columns, changing column types +- Static column, column specified during table creation, such as partition columns, primary key columns +- Dynamic column, columns automatically recognized and added during import + +## Create dynamic table + +```sql +CREATE DATABASE test_dynamic_table; + +-- Create table and specify static column types, import will automatically convert to the type of static column +-- Choose random bucketing +CREATE TABLE IF NOT EXISTS test_dynamic_table ( + qid bigint, + `answers.date` array, + `title` string, + ... -- ... Identifying a table as a dynamic table and its syntax for dynamic tables. + ) +DUPLICATE KEY(`qid`) +DISTRIBUTED BY RANDOM BUCKETS 5 +properties("replication_num" = "1"); + +-- Three Columns are added to the table by default, and their types are specified +mysql> DESC test_dynamic_table; ++--------------+-----------------+------+-------+---------+-------+ +| Field | Type | Null | Key | Default | Extra | ++--------------+-----------------+------+-------+---------+-------+ +| qid | BIGINT | Yes | true | NULL | | +| answers.date | ARRAY | Yes | false | NULL | NONE | +| user | TEXT | Yes | false | NULL | NONE | ++--------------+-----------------+------+-------+---------+-------+ +3 rows in set (0.00 sec) +``` + +## Importing data + +``` sql +-- example1.json +'{ + "title": "Display Progress Bar at the Time of Processing", + "qid": "1000000", + "answers": [ + {"date": "2009-06-16T09:55:57.320", "user": "Micha\u0142 Niklas (22595)"}, + {"date": "2009-06-17T12:34:22.643", "user": "Jack Njiri (77153)"} + ], + "tag": ["vb6", "progress-bar"], + "user": "Jash", + "creationdate": "2009-06-16T07:28:42.770" +}' + +curl -X PUT -T example1.json --location-trusted -u root: -H "read_json_by_line:false" -H "format:json" http://127.0.0.1:8147/api/regression_test_dynamic_table/test_dynamic_table/_stream_load + +-- Added five new columns: `title`, `answers.user`, `tag`, `title`, `creationdate` +-- The types of the three columns: `qid`, `answers.date`, `user` remain the same as with the table was created +-- The default value of the new array type is an empty array [] +mysql> DESC test_dynamic_table; ++--------------+-----------------+------+-------+---------+-------+ +| Field | Type | Null | Key | Default | Extra | ++--------------+-----------------+------+-------+---------+-------+ +| qid | BIGINT | Yes | true | NULL | | +| answers.date | ARRAY | Yes | false | NULL | NONE | +| title | TEXT | Yes | false | NULL | NONE | +| answers.user | ARRAY | No | false | [] | NONE | +| tag | ARRAY | No | false | [] | NONE | +| user | TEXT | Yes | false | NULL | NONE | +| creationdate | TEXT | Yes | false | NULL | NONE | +| date | TEXT | Yes | false | NULL | NONE | ++--------------+-----------------+------+-------+---------+-------+ + +-- Batch import data +-- Specifying -H "read_json_by_line:true", parsing JSON line by line +curl -X PUT -T example_batch.json --location-trusted -u root: -H "read_json_by_line:true" -H "format:json" http://127.0.0.1:8147/api/regression_test_dynamic_table/test_dynamic_table/_stream_load + +-- Specifying -H "strip_outer_array:true", parsing the entire file as a JSON array, each element in the array is the same, more efficient parsing way +curl -X PUT -T example_batch_array.json --location-trusted -u root: -H "strip_outer_array:true" -H "format:json" http://127.0.0.1:8147/api/regression_test_dynamic_table/test_dynamic_table/_stream_load +``` +For a dynamic table, you can also use S3load or Routine load, with similar usage. + + +## Adding Index to Dynamic Columns +```sql +-- Create an inverted index on the title column, using English parsing. +CREATE INDEX title_idx ON test_dynamic_table (`title`) using inverted PROPERTIES("parser"="english") +``` + +## Type conflict resolution + +In the first batch import, the unified type will be automatically inferred and used as the final Column type, so it is recommended to keep the Column type consistent, for example: +``` +{"id" : 123} +{"id" : "123"} +-- The type will finally be inferred as Text type, and if {"id" : 123} is imported later, the type will automatically be converted to String type + +For types that cannot be unified, such as: +{"id" : [123]} +{"id" : 123} +-- Importing will result in an error." +``` \ No newline at end of file diff --git a/docs/sidebars.json b/docs/sidebars.json index 3a36f83e84..183f69800e 100644 --- a/docs/sidebars.json +++ b/docs/sidebars.json @@ -49,6 +49,7 @@ "data-table/basic-usage", "data-table/hit-the-rollup", "data-table/best-practice", + "data-table/dynamic-schema-table", { "type": "category", "label": "Index", diff --git a/docs/zh-CN/docs/data-table/dynamic-schema-table.md b/docs/zh-CN/docs/data-table/dynamic-schema-table.md new file mode 100644 index 0000000000..5a3f891cae --- /dev/null +++ b/docs/zh-CN/docs/data-table/dynamic-schema-table.md @@ -0,0 +1,128 @@ +--- +{ + "title": "动态schema表", + "language": "zh-CN" +} +--- + + + +# 动态表 + +动态schema表是一种特殊的表,其schema随着导入自动进行扩展。目前该功能,主要用于半结构数据,例如JSON等的导入、自动列生成。因为JSON是类型自描述的,所以我们可以从原始文档中提取schema信息,推断最终类型信息。这种特殊的表可以减少人工schema change的操作,并轻松导入半结构数据并自动扩展其schema。 + +## 名词解释 +- schema change, 改变表的结构, 例如增加列、减少列, 修改列类型 +- 静态列, 在建表时指定的列, 例如分区列、主键列 +- 动态列, 随着导入自动识别并增加的列 +- + +## 建表 + +```sql +CREATE DATABASE test_dynamic_table; + +-- 建表, 并指定静态列类型, 导入遇到对应列会自动转换成静态列的类型 +-- 选择随机分桶方式 +CREATE TABLE IF NOT EXISTS test_dynamic_table ( + qid bigint, + `answers.date` array, + `title` string, + ... -- ...标识该表是动态表, 是动态表的语法 + ) +DUPLICATE KEY(`qid`) +DISTRIBUTED BY RANDOM BUCKETS 5 +properties("replication_num" = "1"); + +-- 可以看到三列Column在表中默认添加, 类型是指定类型 +mysql> DESC test_dynamic_table; ++--------------+-----------------+------+-------+---------+-------+ +| Field | Type | Null | Key | Default | Extra | ++--------------+-----------------+------+-------+---------+-------+ +| qid | BIGINT | Yes | true | NULL | | +| answers.date | ARRAY | Yes | false | NULL | NONE | +| user | TEXT | Yes | false | NULL | NONE | ++--------------+-----------------+------+-------+---------+-------+ +3 rows in set (0.00 sec) +``` + +## 导入数据 + +``` sql +-- example1.json +'{ + "title": "Display Progress Bar at the Time of Processing", + "qid": "1000000", + "answers": [ + {"date": "2009-06-16T09:55:57.320", "user": "Micha\u0142 Niklas (22595)"}, + {"date": "2009-06-17T12:34:22.643", "user": "Jack Njiri (77153)"} + ], + "tag": ["vb6", "progress-bar"], + "user": "Jash", + "creationdate": "2009-06-16T07:28:42.770" +}' + +curl -X PUT -T example1.json --location-trusted -u root: -H "read_json_by_line:false" -H "format:json" http://127.0.0.1:8147/api/regression_test_dynamic_table/test_dynamic_table/_stream_load + +-- 新增 title,answers.user, tag, title, creationdate 五列 +-- 且 qid,answers.date,user三列类型与建表时保持一致 +-- 新增数组类型默认Default值为空数组[] +mysql> DESC test_dynamic_table; ++--------------+-----------------+------+-------+---------+-------+ +| Field | Type | Null | Key | Default | Extra | ++--------------+-----------------+------+-------+---------+-------+ +| qid | BIGINT | Yes | true | NULL | | +| answers.date | ARRAY | Yes | false | NULL | NONE | +| title | TEXT | Yes | false | NULL | NONE | +| answers.user | ARRAY | No | false | [] | NONE | +| tag | ARRAY | No | false | [] | NONE | +| user | TEXT | Yes | false | NULL | NONE | +| creationdate | TEXT | Yes | false | NULL | NONE | +| date | TEXT | Yes | false | NULL | NONE | ++--------------+-----------------+------+-------+---------+-------+ + +-- 批量导入数据 + +-- 指定 -H "read_json_by_line:true", 逐行解析JSON +curl -X PUT -T example_batch.json --location-trusted -u root: -H "read_json_by_line:true" -H "format:json" http://127.0.0.1:8147/api/regression_test_dynamic_table/test_dynamic_table/_stream_load + +-- 指定 -H "strip_outer_array:true", 整个文件当做一个JSON array解析, array中的每个元素是一行, 解析效率更高效 +curl -X PUT -T example_batch_array.json --location-trusted -u root: -H "strip_outer_array:true" -H "format:json" http://127.0.0.1:8147/api/regression_test_dynamic_table/test_dynamic_table/_stream_load +``` +对于dynamic table, 你也可以使用S3load或者Routine load, 使用方式类似 + +## 对动态列增加索引 +```sql +-- 将在titile列上新建倒排索引, 并按照english分词 +CREATE INDEX title_idx ON test_dynamic_table (`title`) using inverted PROPERTIES("parser"="english") +``` + +## 类型冲突 +在第一批导入会自动推断出统一的类型, 并以此作为最终的Column类型,所以建议保持Column类型的一致, 例如 +``` +{"id" : 123} +{"id" : "123"} +-- 类型会被最终推断为Text类型, 如果在后续导入{"id" : 123}则类型会被自动转成String类型 + +对于无法统一的类型, 例如 +{"id" : [123]} +{"id" : 123} +-- 导入将会报错 +```