Commit Graph

14 Commits

Author SHA1 Message Date
f207036cad [Spark load][Document] Add docs about spark and yarn client for spark load (#4489)
Add docs about spark and yarn client for spark load
2020-09-02 10:52:49 +08:00
wyb ffe696d17c [Doc] Add spark load sql statement doc and update manual (#4463)
1. Add the SQL statement doc in DML
2. Update the spark load manual
2020-08-30 21:09:17 +08:00
174c9f89ea [DOCS] Add batch delete docs (#4435)
Update documents for batch delete (#4051)
2020-08-28 09:24:07 +08:00
1410d4e623 [Doc] Add in predicate support content in delete-manual.md (#4404)
Add in predicate support content in delete-manual.md
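A minimal sketch of the documented syntax (the table, partition, and values here are hypothetical):

```sql
-- delete all rows whose key matches any value in the IN list
DELETE FROM example_table PARTITION p1
WHERE k1 IN ("a", "b", "c");
```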
2020-08-24 21:52:28 +08:00
05fa55047e [Doc][Json Load] Improve json data format load documents (#4337)
Also add detailed explanations of the JsonPath and Columns parameters
2020-08-13 23:39:57 +08:00
237c0807a4 [RoutineLoad] Support modify routine load job (#4158)
Support the ALTER ROUTINE LOAD statement, for example:

```sql
ALTER ROUTINE LOAD FOR db1.label1
PROPERTIES
(
    "desired_concurrent_number" = "3",
    "max_batch_interval" = "5",
    "max_batch_rows" = "300000",
    "max_batch_size" = "209715200",
    "strict_mode" = "false",
    "timezone" = "+08:00"
)
```

Details can be found in `alter-routine-load.md`
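As a usage sketch (assuming, as the Doris docs describe, that a routine load job must be paused before it can be altered), a typical sequence would be:

```sql
-- pause the job, alter it, then resume it (job name from the example above)
PAUSE ROUTINE LOAD FOR db1.label1;

ALTER ROUTINE LOAD FOR db1.label1
PROPERTIES ("desired_concurrent_number" = "3");

RESUME ROUTINE LOAD FOR db1.label1;
```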
2020-08-06 23:11:02 +08:00
3f31866169 [Bug][Load][Json] #4124 Load json format with stream load failed (#4217)
Stream load should read all the data completely before parsing the JSON. Also add a new BE config, streaming_load_max_batch_read_mb, to limit the data size when loading JSON data.

Fix the bug of loading an empty JSON array [].

Add docs to explain certain cases of loading JSON format data.

Fix: #4124
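As a hedged illustration, the new config would be set in `be.conf`; the value below is made up for the example, not a documented default:

```
# cap on the amount of data a single JSON stream load batch may read, in MB
streaming_load_max_batch_read_mb = 100
```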
2020-08-04 12:55:53 +08:00
c3d9feed75 [Load][Json] Refactor json load logic to make it more reasonable (#4020)
This CL mainly changes:

1. Reorganized the code logic to limit the supported JSON formats to two, making the import behavior more consistent.
2. Modified how the number of error rows is counted when loading JSON format data, so that error rows can be counted correctly.
3. See `load-json-format.md` for details on loading JSON format data.
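For illustration, the two supported layouts are presumably a single JSON object (one row) and an array of objects (multiple rows); see `load-json-format.md` for the authoritative description. Sample data:

```
{"k1": 1, "k2": "a"}
[{"k1": 1, "k2": "a"}, {"k1": 2, "k2": "b"}]
```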
2020-07-07 23:07:28 +08:00
210ee9664f [SparkLoad]add user doc for build global dict (#3938)
Describe the global dictionary and how to use it in spark load
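A hedged sketch of what such a load might look like; the `bitmap_dict` mapping function and all object names below are assumptions drawn from the spark load docs, not from this commit:

```sql
-- hypothetical: populate a BITMAP column through the global dictionary
LOAD LABEL example_db.dict_label
(
  DATA INFILE ("hdfs://127.0.0.1:10000/tmp/file1")
  INTO TABLE example_table
  SET (uuid = bitmap_dict(uuid))
)
WITH RESOURCE "spark0";
```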
2020-06-30 19:12:35 +08:00
de91037d8c [Doc]Add some routine load docs (#3796)
Add some documentation about using routine load in the cloud environment
2020-06-10 22:57:00 +08:00
01c1de1870 [Load] Add more metric to trace the time cost in stream load and make brpc_num_threads configurable (#3703) 2020-06-04 13:37:28 +08:00
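As an illustration, the new knob would live in `be.conf` (the value is illustrative; check your version's default):

```
# number of bRPC worker threads used by the BE
brpc_num_threads = 64
```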
wyb 4978bd6c81 [Spark load] Add resource manager (#3418)
1. User interface:

1.1 Spark resource management

Spark is used as an external computing resource in Doris to do ETL work. In the future, other external resources may be used in Doris as well, for example, MapReduce for ETL, Spark/GPU for queries, HDFS/S3 for external storage. We introduce resource management to manage these external resources used by Doris.

```sql
-- create spark resource
CREATE EXTERNAL RESOURCE resource_name
PROPERTIES
(
  type = spark,
  spark_conf_key = spark_conf_value,
  working_dir = path,
  broker = broker_name,
  broker.property_key = property_value
)

-- drop spark resource
DROP RESOURCE resource_name

-- show resources
SHOW RESOURCES
SHOW PROC "/resources"

-- privileges
GRANT USAGE_PRIV ON RESOURCE resource_name TO user_identity
GRANT USAGE_PRIV ON RESOURCE resource_name TO ROLE role_name

REVOKE USAGE_PRIV ON RESOURCE resource_name FROM user_identity
REVOKE USAGE_PRIV ON RESOURCE resource_name FROM ROLE role_name
```

- CREATE EXTERNAL RESOURCE:

FOR user_name is optional. If specified, the external resource belongs to that user. If not, the external resource belongs to the system and is available to all users.

PROPERTIES:
1. type: resource type. Only spark is supported now.
2. spark configuration: follows the standard Spark configuration format; refer to https://spark.apache.org/docs/latest/configuration.html.
3. working_dir: optional, used to store intermediate results of the Spark ETL.
4. broker: optional, used in Spark ETL. The ETL intermediate results need to be read through the broker when they are pushed into the BE.

Example: 

```sql
CREATE EXTERNAL RESOURCE "spark0"
PROPERTIES
(
  "type" = "spark",
  "spark.master" = "yarn",
  "spark.submit.deployMode" = "cluster",
  "spark.jars" = "xxx.jar,yyy.jar",
  "spark.files" = "/tmp/aaa,/tmp/bbb",
  "spark.yarn.queue" = "queue0",
  "spark.executor.memory" = "1g",
  "spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999",
  "spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000",
  "working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",
  "broker" = "broker0",
  "broker.username" = "user0",
  "broker.password" = "password0"
)
```

- SHOW RESOURCES:
General users can only see their own resources.
Admin and root users can see all resources.

1.2 Create spark load job

```sql
LOAD LABEL db_name.label_name 
(
  DATA INFILE ("/tmp/file1") INTO TABLE table_name, ...
)
WITH RESOURCE resource_name
[(key1 = value1, ...)]
[PROPERTIES (key2 = value2, ... )]
```

Example:

```sql
LOAD LABEL example_db.test_label 
(
  DATA INFILE ("hdfs://127.0.0.1:10000/tmp/file1") INTO TABLE example_table
)
WITH RESOURCE "spark0"
(
  "spark.executor.memory" = "1g",
  "spark.files" = "/tmp/aaa,/tmp/bbb"
)
PROPERTIES ("timeout" = "3600")
```

The Spark configurations in the load statement can temporarily override the existing configurations in the resource.

#3010
2020-05-26 18:21:21 +08:00
dbfe8a067f [Doc] Add docs of max_running_txn_num_per_db (#3657)
Change-Id: Ibdbc19a5558b0eb3f6a5fc4ef630de255b408a92
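As a hedged illustration, this FE config can presumably be inspected and changed at runtime like other frontend configs (the value is illustrative):

```sql
ADMIN SHOW FRONTEND CONFIG LIKE "max_running_txn_num_per_db";
ADMIN SET FRONTEND CONFIG ("max_running_txn_num_per_db" = "200");
```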
2020-05-22 10:22:11 +08:00
432965e360 [Enhancement] Rebuild documents with Vuepress (#3408) (#3414) 2020-04-29 09:14:31 +08:00