1. At present, JSON-format stream load writes use `read_json_by_line` and `fuzzy_parse`, which degrades write performance. This is changed to write with `strip_outer_array` and `fuzzy_parse`, making writes roughly 3x faster.
2. Add CSV writing with the column separator set to `\x01` and the row separator set to `\x02`; this is roughly 5x faster than before. A sketch of the corresponding stream load headers follows.
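For reference, a minimal sketch of the stream load request headers implied by the two modes above, assuming Apache HttpClient and the standard `_stream_load` endpoint; the host, database, table, and authentication handling are placeholders:

```java
import org.apache.http.client.methods.HttpPut;

public class StreamLoadHeaders {
    // Sketch only: host, database, table, and auth handling are placeholders.
    static HttpPut jsonModeRequest() {
        HttpPut put = new HttpPut("http://FE_IP:8030/api/db/table/_stream_load");
        put.setHeader("format", "json");
        put.setHeader("strip_outer_array", "true"); // each batch is one outer JSON array
        put.setHeader("fuzzy_parse", "true");       // speeds up JSON parsing on the backend
        return put;
    }

    static HttpPut csvModeRequest() {
        HttpPut put = new HttpPut("http://FE_IP:8030/api/db/table/_stream_load");
        // Invisible separators avoid collisions with real column data.
        put.setHeader("column_separator", "\\x01");
        put.setHeader("line_delimiter", "\\x02");
        return put;
    }
}
```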
* 1. Remove importing into Doris via CSV in DataxWriter; only JSON is supported.
2. Format the DataxWriter code.
3. Optimize exception handling and reduce repeated output of exception logs.
4. Update the DataxWriter documentation.
* Delete DorisCsvCodec.java
Delete the unused file extension/DataX/doriswriter/src/main/java/com/alibaba/datax/plugin/writer/doriswriter/DorisCsvCodec.java
* 1. Remove the `format` config key.
2. Optimize the serialization code in the DorisJsonCodec class. A rough sketch of the codec's job follows.
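As an illustration of what such a codec does (not the actual DorisJsonCodec implementation), the class serializes one row into a JSON object keyed by field name; the sketch below assumes Jackson and a hypothetical `fieldNames` list:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonRowCodec {
    private static final ObjectMapper MAPPER = new ObjectMapper();
    private final List<String> fieldNames; // hypothetical: column names in table order

    public JsonRowCodec(List<String> fieldNames) {
        this.fieldNames = fieldNames;
    }

    // Serialize one row's values into a JSON object keyed by field name.
    public String codec(List<Object> rowValues) throws Exception {
        Map<String, Object> row = new LinkedHashMap<>(fieldNames.size());
        for (int i = 0; i < fieldNames.size(); i++) {
            row.put(fieldNames.get(i), rowValues.get(i));
        }
        return MAPPER.writeValueAsString(row);
    }
}
```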
The previous DataSourceFunction inherited from RichSourceFunction,
so no matter how high the Flink parallelism was set, the parallelism of DataSourceFunction was always 1.
It is now changed to RichParallelSourceFunction,
and when Flink runs with multiple parallel subtasks, the Doris partitions are assigned across them.
For example, with dorisPartitions.size = 10 and flink.parallelism = 4,
the work is split as follows (see the sketch after this list):
task0: dorisPartitions[0],[4],[8]
task1: dorisPartitions[1],[5],[9]
task2: dorisPartitions[2],[6]
task3: dorisPartitions[3],[7]
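A minimal sketch of that round-robin assignment, using the standard RichParallelSourceFunction runtime context; `List<Object>` stands in for the connector's actual partition type:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

// Sketch only: List<Object> stands in for the connector's partition type.
public abstract class ParallelDorisSource<T> extends RichParallelSourceFunction<T> {
    private final List<Object> dorisPartitions;   // all partitions read from Doris
    private transient List<Object> taskPartitions; // partitions owned by this subtask

    protected ParallelDorisSource(List<Object> dorisPartitions) {
        this.dorisPartitions = dorisPartitions;
    }

    @Override
    public void open(Configuration parameters) {
        int subtask = getRuntimeContext().getIndexOfThisSubtask();
        int parallelism = getRuntimeContext().getNumberOfParallelSubtasks();
        taskPartitions = new ArrayList<>();
        // Round-robin: subtask i takes partitions i, i+p, i+2p, ...
        // e.g. with 10 partitions and parallelism 4, task0 gets [0],[4],[8].
        for (int i = subtask; i < dorisPartitions.size(); i += parallelism) {
            taskPartitions.add(dorisPartitions.get(i));
        }
    }
}
```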
Add the `sink.batch.size` and `sink.max-retries` options to the `Doris Spark-connector`,
consistent with the `flink-connector` options.
e.g.:
```scala
df.write
  .format("doris")
  // specify the maximum number of rows in a single flush
  .option("sink.batch.size", 2048)
  // specify the number of retries after a write failure
  .option("sink.max-retries", 3)
  .save()
```
The jar files compiled for the Flink and Spark connectors now carry the corresponding Flink or Spark version
and the Scala version used at compile time, so that users can tell whether the version numbers match when using them.
Example output file name: doris-spark-1.0.0-spark-3.2.0_2.12.jar
Doris should provide an HTTP API that returns the backends list for connectors to submit stream loads,
without privilege checking, so that common users can use it. A sketch of how a connector might call such an API follows.
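A minimal sketch from the connector side, assuming a hypothetical `/api/backends` endpoint on the FE that returns the list as JSON; the path and response shape are assumptions, not a confirmed interface:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class BackendsFetcher {
    // Sketch only: the /api/backends path and JSON response shape are assumed.
    public static String fetchBackends(String feHost) throws Exception {
        URL url = new URL("http://" + feHost + "/api/backends");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line);
            }
            return body.toString(); // e.g. a JSON list of backend host/port entries
        } finally {
            conn.disconnect();
        }
    }
}
```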
1. By default, the Spark connector must write values for all fields of the `Doris` table.
With this feature, the user can specify a subset of the fields to write, and even the order in which they are written.
e.g.:
Suppose a table named `student` with three columns (name, gender, age),
created with the following SQL:
```sql
CREATE TABLE student (name VARCHAR(255), gender VARCHAR(10), age INT)
DUPLICATE KEY (name)
DISTRIBUTED BY HASH(name) BUCKETS 2;
```
Now, suppose we only want to write values to two columns: name and gender.
The code is as follows:
```scala
// columns in (gender, name) order, matching doris.write.field below
val df = spark.createDataFrame(Seq(
  ("m", "zhangsan"),
  ("f", "lisi"),
  ("m", "wangwu")
))
df.write
  .format("doris")
  .option("doris.fenodes", dorisFeNodes)
  .option("doris.table.identifier", dorisTable)
  .option("user", dorisUser)
  .option("password", dorisPwd)
  // specify the fields to write, and their order
  .option("doris.write.field", "gender,name")
  .save()
```
1. Simplify the use of the Flink connector, like other stream sinks, via GenericDorisSinkFunction.
2. Add use cases for the Flink connector.
## Use case
```java
env.fromElements("{\"longitude\": \"116.405419\", \"city\": \"北京\", \"latitude\": \"39.916927\"}")
   .addSink(
       DorisSink.sink(
           DorisOptions.builder()
               .setFenodes("FE_IP:8030")
               .setTableIdentifier("db.table")
               .setUsername("root")
               .setPassword("")
               .build()));
```
* [Bug]: fix NullPointerException thrown when data is null
* [Bug]: distinguish between null and empty string
* [Feature]: flink-connector supports stream load parameters (see the sketch at the end)
* [Fix]: code style
* [Fix]: support JSON-format import and use HttpClient for stream load
* [Fix]: remove System.out output
* [Fix]: upgrade HttpClient version
* [Doc]: add JSON-format import doc
Co-authored-by: wudi <wud3@shuhaisc.com>
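As a rough illustration of the stream-load-parameters feature above, a sketch of passing stream load headers through the Flink connector's execution options; the `DorisExecutionOptions` package path and builder method names are assumptions and may differ by connector version:

```java
import java.util.Properties;
import org.apache.doris.flink.cfg.DorisExecutionOptions;

public class StreamLoadPropsExample {
    // Sketch only: builder/method names are assumed and may differ by version.
    static DorisExecutionOptions withJsonStreamLoad() {
        Properties props = new Properties();
        props.setProperty("format", "json");            // forwarded as a stream load header
        props.setProperty("strip_outer_array", "true"); // batch is one outer JSON array
        return DorisExecutionOptions.builder()
                .setStreamLoadProp(props)
                .build();
    }
}
```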