Commit Graph

47 Commits

SHA1 Message Date
b54a12ef11 [Build] Compile and output the jar file with the Spark/Flink version and Scala version (#7051)
The jar files compiled from the Flink and Spark connectors now carry the corresponding Flink/Spark
version and Scala version at compile time, so that users can tell whether the versions match when using them.

Example of an output file name: doris-spark-1.0.0-spark-3.2.0_2.12.jar
2021-11-09 10:02:08 +08:00
29838f07da [HTTP][API] Add backends info API for spark/flink connector (#6984)
Doris should provide an HTTP API that returns the backends list for connectors to use when submitting
stream loads, without privilege checking, so that common users can use it.
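A minimal sketch of calling such an API, assuming the endpoint is exposed at /api/backends on the FE HTTP port (the path and response handling are assumptions, not confirmed by this commit):
```scala
// Hypothetical endpoint: the commit only says FE returns a backends list
// over HTTP without privilege checking.
import scala.io.Source

object BackendsApiExample {
  def main(args: Array[String]): Unit = {
    val feHost = "127.0.0.1" // placeholder FE host
    val feHttpPort = 8030    // default FE HTTP port
    val url = s"http://$feHost:$feHttpPort/api/backends"
    val body = Source.fromURL(url).mkString // expected: backend list as JSON
    println(body)
  }
}
```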
2021-11-05 09:43:06 +08:00
d19a971582 [Revert] Revert RestService.java (#6994) 2021-11-04 12:13:18 +08:00
f39a5bc1d0 [Feature] Spark connector supports specifying the fields to write (#6973)
1. By default, the Spark connector writes all field values to the `Doris` table.
With this feature, the user can write only a subset of the fields, and even specify the order in which they are written.

e.g.:
I have a table named `student` with three columns (name, gender, age),
created with the following SQL:
```sql
create table student (
    name varchar(255),
    gender varchar(10),
    age int
)
duplicate key (name)
distributed by hash(name) buckets 2;
```
Now I just want to write values to two of the columns: name and gender.
The code is as follows:
```scala
    // Column names should match the Doris columns being written
    // ("gender,name" below); .toDF names them accordingly.
    val df = spark.createDataFrame(Seq(
      ("m", "zhangsan"),
      ("f", "lisi"),
      ("m", "wangwu")
    )).toDF("gender", "name")
    df.write
      .format("doris")
      .option("doris.fenodes", dorisFeNodes)
      .option("doris.table.identifier", dorisTable)
      .option("user", dorisUser)
      .option("password", dorisPwd)
      // specify the fields to write, and their order
      .option("doris.write.field", "gender,name")
      .save()
```
2021-11-02 16:35:29 +08:00
466cd5dd09 [Optimize] Spark connector supports multiple Spark versions: 2.1.x/2.3.x/2.4.x/3.x (#6956)
* Spark connector supports multiple Spark versions: 2.1.x/2.3.x/2.4.x/3.x
Co-authored-by: wei.zhao <wei.zhao@aispeech.com>
2021-10-29 17:06:05 +08:00
1f65de1a5d Fix spark connector build error (#6948)
pom.xml error
2021-10-29 14:59:05 +08:00
addfff74c4 Support using characters like \x01 as the flink-doris-sink column & line delimiter (#6937)
* support using characters like \x01 in the flink-doris-sink column & line delimiter (see the sketch below)

* extend imports

* add docs
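A minimal sketch of passing such delimiters, assuming the connector's DorisExecutionOptions.setStreamLoadProp entry point for stream load properties (the property names and API shape are assumptions):
```scala
// The connector parses escape sequences like \x01 into the actual byte.
import java.util.Properties
import org.apache.doris.flink.cfg.DorisExecutionOptions

object DelimiterOptions {
  def build(): DorisExecutionOptions = {
    val props = new Properties()
    props.setProperty("column_separator", "\\x01") // 0x01 as the column separator
    props.setProperty("line_delimiter", "\\x02")   // 0x02 as the line delimiter
    DorisExecutionOptions.builder()
      .setStreamLoadProp(props)
      .build()
  }
}
```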
2021-10-29 13:56:52 +08:00
ebb4c282b1 [Flink] Simplify the use of flink connector (#6892)
1. Simplify the use of the Flink connector, making it behave like other stream sinks, via GenericDorisSinkFunction.
2. Add use cases for the Flink connector.

## Use case
```java
env.fromElements("{\"longitude\": \"116.405419\", \"city\": \"北京\", \"latitude\": \"39.916927\"}")
   .addSink(
       DorisSink.sink(
           DorisOptions.builder()
               .setFenodes("FE_IP:8030")
               .setTableIdentifier("db.table")
               .setUsername("root")
               .setPassword("")
               .build()));
```
2021-10-23 18:10:47 +08:00
6029082c2a [Flink][Bug] Fix potential NPE when cancel DorisSourceFunction (#6838)
Fix potential NPE of `scalaValueReader` when cancelling DorisSourceFunction.
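A minimal sketch of the kind of guard such a fix typically adds; the field type and structure here are hypothetical, not the connector's actual code:
```scala
// cancel() can race open(), so read the volatile field once and null-check it.
class DorisSourceFunctionSketch {
  @volatile private var scalaValueReader: AutoCloseable = _

  def cancel(): Unit = {
    val reader = scalaValueReader
    if (reader != null) { // avoids the NPE when cancel() arrives before open() finishes
      reader.close()
    }
  }
}
```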
2021-10-23 16:45:24 +08:00
24d38614a0 [Dependency] Upgrade thirdparty libs (#6766)
Upgrade the following dependencies:

libevent -> 2.1.12
OpenSSL 1.0.2k -> 1.1.1l
thrift 0.9.3 -> 0.13.0
protobuf 3.5.1 -> 3.14.0
gflags 2.2.0 -> 2.2.2
glog 0.3.3 -> 0.4.0
googletest 1.8.0 -> 1.10.0
snappy 1.1.7 -> 1.1.8
gperftools 2.7 -> 2.9.1
lz4 1.7.5 -> 1.9.3
curl 7.54.1 -> 7.79.0
re2 2017-05-01 -> 2021-02-02
zstd 1.3.7 -> 1.5.0
brotli 1.0.7 -> 1.0.9
flatbuffers 1.10.0 -> 2.0.0
apache-arrow 0.15.1 -> 5.0.0
CRoaring 0.2.60 -> 0.3.4
orc 1.5.8 -> 1.6.6
libdivide 4.0.0 -> 5.0
brpc 0.97 -> 1.0.0-rc02
librdkafka 1.7.0 -> 1.8.0

After this PR, Doris should be compiled with build-env:1.4.0.
2021-10-15 13:03:04 +08:00
237a8ae948 [Feature] support spark connector sink data using sql (#6796)
Co-authored-by: wei.zhao <wei.zhao@aispeech.com>
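A minimal sketch of what the SQL path could look like, assuming the `doris` data source name used elsewhere in this log; the view name, option keys, and values are illustrative placeholders, and `spark` is an active SparkSession:
```scala
spark.sql(
  """CREATE TEMPORARY VIEW spark_doris
    |USING doris
    |OPTIONS(
    |  "table.identifier" = "db.student",
    |  "fenodes" = "FE_IP:8030",
    |  "user" = "root",
    |  "password" = ""
    |)""".stripMargin)

// Sink data through SQL instead of the DataFrame API:
spark.sql("INSERT INTO spark_doris VALUES ('zhangsan', 'm', 20)")
```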
2021-10-09 15:47:36 +08:00
8d471007a6 [Feature] support spark connector sink stream data to doris (#6761)
* [Feature] support spark connector sink stream data to doris (see the sketch below)

* [Doc] Add spark-connector batch/stream writing instructions

* add license and remove meaningless blanks code

Co-authored-by: wei.zhao <wei.zhao@aispeech.com>
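A minimal sketch of a streaming write, assuming the same `doris` format and option names as the batch example above; the Kafka source, checkpoint path, and credentials are placeholders:
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("doris-stream-sink").getOrCreate()

spark.readStream
  .format("kafka") // placeholder source; any streaming source works
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("doris")
  .option("checkpointLocation", "/tmp/doris-checkpoint")
  .option("doris.table.identifier", "db.table")
  .option("doris.fenodes", "FE_IP:8030")
  .option("user", "root")
  .option("password", "")
  .start()
  .awaitTermination()
```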
2021-09-28 17:46:19 +08:00
df5ba6b5a2 [Fix] Flink connector supports JSON import and uses httpclient for stream load (#6740)
* [Bug]: fix NullPointerException thrown when data is null

* [Bug]: distinguish between null and the empty string

* [Feature]: flink-connector supports stream load parameters

* [Fix]: code style

* [Fix]: support JSON format import and use httpclient for stream load (see the sketch below)

* [Fix]: remove System.out

* [Fix]: upgrade httpclient version

* [Doc]: add JSON format import doc

Co-authored-by: wudi <wud3@shuhaisc.com>
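A minimal sketch of enabling the JSON import path, assuming the same DorisExecutionOptions entry point as above; `format` and `strip_outer_array` are Doris stream load parameters:
```scala
import java.util.Properties
import org.apache.doris.flink.cfg.DorisExecutionOptions

object JsonLoadOptions {
  def build(): DorisExecutionOptions = {
    val props = new Properties()
    props.setProperty("format", "json")            // load rows as JSON
    props.setProperty("strip_outer_array", "true") // rows arrive wrapped in a JSON array
    DorisExecutionOptions.builder()
      .setStreamLoadProp(props)
      .build()
  }
}
```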
2021-09-28 17:37:03 +08:00
68529d20f3 [Flink] Fix bug of flink doris connector (#6655)
Flink-Doris-Connector does not support Flink 1.13; refactor the Doris sink format
to use RowData.FieldGetter instead of GenericRowData.
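A minimal sketch of the RowData.FieldGetter approach: one getter per column is built from the logical types, so the sink works for any RowData implementation instead of casting to GenericRowData (the helper is illustrative, not the connector's actual code):
```scala
import org.apache.flink.table.data.RowData
import org.apache.flink.table.types.logical.RowType

object FieldGetterSketch {
  // Build one getter per column from the logical row type, once, up front.
  def fieldGetters(rowType: RowType): Array[RowData.FieldGetter] =
    (0 until rowType.getFieldCount)
      .map(i => RowData.createFieldGetter(rowType.getTypeAt(i), i))
      .toArray

  // At write time this works for any RowData implementation:
  def valueAt(getters: Array[RowData.FieldGetter], row: RowData, i: Int): AnyRef =
    getters(i).getFieldOrNull(row)
}
```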
2021-09-24 21:38:35 +08:00
a7b8d110a0 Spark 2.x and 3.x version compilation instructions (#6503)
Spark 2.x and 3.x version compilation instructions
2021-08-27 10:55:29 +08:00
4ff6eb55d0 [FlinkConnector] Make flink datastream source parameterized (#6473)
Make the Flink DataStream source parameterized as List<?> instead of Object.
2021-08-22 22:03:32 +08:00
4ea2fcefbc [Improve] The connector supports Spark 3.0 and Flink 1.13 (#6449)
Modify the flink/spark compilation documentation
2021-08-18 15:57:50 +08:00
2f90aaab8e [Doc] flink/spark connector: add sources/javadoc plugins (#6435)
spark-doris-connector/flink-doris-connector: add plugins to generate javadoc and sources jars,
so they are easy to distribute and debug.
2021-08-16 22:41:24 +08:00
b13e512a65 [Feature] Support spark connector sink data to Doris (#6256)
Support the Spark connector writing a DataFrame to Doris.
2021-08-16 22:40:43 +08:00
1a5b03167a [Doc] Add document for datax and sample codes (#6389)
Add documents for DataX in the extension catalog.
Add documents for samples in the best-practice catalog.
2021-08-11 11:51:13 +08:00
929b33ac0a [DataX] doriswriter support csv (#6373)
Make DataX's doriswriter support the CSV format. CSV is simpler and faster than
JSON when the data is simple.

add property format: csv/json
add property column_separator: takes effect when format is csv, for example "\x01", "^", etc.
2021-08-10 10:14:21 +08:00
d9fc1bf3ca [Feature]: Flink-connector supports stream load parameters (#6243)
Flink-connector supports stream load parameters.
#6199
2021-08-09 22:12:46 +08:00
8fe5c75877 [DataX] Refactor doriswriter (#6188)
1. Use `read_json_by_line` to load data
2. Use FE http server as the target host of stream load
2021-07-13 11:36:40 +08:00
fcd31f29b6 [Bug][Flink] Fix flink-connector throwing NullPointerException when data is null (#6165) 2021-07-08 09:55:50 +08:00
c33321ff42 [Feature][DataX] Implementation Datax doriswriter plugin (#6107) 2021-07-08 09:33:02 +08:00
b69ebc3ec4 [Extension] Add DataX doriswriter extension directory (#6111)
This CL only adds the script for building the DataX development environment.
2021-06-30 09:55:19 +08:00
28e7d01ef7 [FlinkConnector] Support time interval for flink connector (#5934) 2021-06-30 09:27:12 +08:00
04cc6eaadc [Log] Fix a mistake in DorisDynamicOutputFormat.java (#5963)
Fix a mistake in DorisDynamicOutputFormat.java.
2021-06-06 22:06:57 +08:00
7ef9aa13d4 [Bug] Modify the Spark and Flink Doris connectors' requests to FE; fix the POST method problem (#5788)
Modify the requests that the Spark and Flink Doris connectors send to FE, fixing the POST method
problem: the method should be the same as the one used when sending the original request.
2021-05-19 09:28:21 +08:00
7eea811f6b [Feature] Flink Doris Connector (#5372) (#5375) 2021-04-23 09:43:48 +08:00
39136011c2 [Spark-Doris-Connector][Bug-Fix] Resolve deserialize exception when Spark Doris Connector is in async deserialize mode (#5336)
Resolve a deserialize exception when the Spark Doris Connector is in async deserialize mode.
Co-authored-by: lanhuajian <lanhuajian@sankuai.com>
2021-03-04 17:48:59 +08:00
5781d67afe Fix file licenses (#5414)
Add license to files
For Doris 0.14
2021-02-24 16:37:17 +08:00
9c022e3764 [Bug] Spark doris connector HTTP v2 authentication fails, and the HTTP v2 interface returns nested JSON (#5366)
1. Deal with the inconsistent data formats returned by HTTP v1 and v2
2. Deal with user authentication failure
2021-02-07 09:28:55 +08:00
56d0cc3f54 [Spark on Doris] fix the encode of varchar when convertArrowToRowBatch (#5202)
`convertArrowToRowBatch` used the default charset to encode Strings.
Set it to UTF_8, because we use `arrow::utf8` on the Backends.
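A minimal sketch of the idea behind the fix, decoding with an explicit charset instead of the platform default:
```scala
import java.nio.charset.StandardCharsets

object Utf8Decode {
  // new String(bytes) would use the platform default charset; be explicit.
  def decode(bytes: Array[Byte]): String =
    new String(bytes, StandardCharsets.UTF_8)
}
```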
2021-01-10 20:48:46 +08:00
86d235a76a [Extension] Logstash Doris output plugin (#3800)
This plugin outputs data from Logstash to Doris.
It uses the HTTP protocol to interact with the Doris FE HTTP interface
and loads data through Doris's stream load.
2020-06-11 08:54:51 +08:00
4cbcae1574 [Spark on Doris] Shade and provide the thrift lib in spark-doris-connector (#3631)
Mainly changes:
1. Shade and provide the thrift lib in spark-doris-connector
2. Add a `build.sh` for spark-doris-connector
3. Move the README.md of spark-doris-connector to `docs/`
4. Change the line delimiter of `fe/src/test/java/org/apache/doris/analysis/AggregateTest.java`
2020-05-19 14:20:21 +08:00
c9c58342b2 [License] Add License to codes (#3272) 2020-04-07 16:35:13 +08:00
16b61b62f5 [Spark] Support convert Arrow data to RowBatch asynchronously in Spark-Doris-Connector (#3186)
Currently, in the Spark-Doris-Connector, when Spark iterates over each row of data,
it must synchronously convert the Arrow-format data into the row format required by Spark.
To speed up this conversion, we add an asynchronous thread in the Connector,
which is responsible for fetching the Arrow-format data from BE and converting it into the row
format required by Spark's computation.

In our test environment, the Doris cluster used 1 FE and 7 BEs (32C+128G). When using the Spark-Doris-Connector
to query a table containing 67 columns, the original query returned 69 million rows in about 2.5 min;
after the improvement, this dropped to about 1.6 min, a reduction of about 30%.
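A minimal sketch of the described design: a background thread converts fetched batches to rows and feeds a bounded queue that the Spark-side iterator drains; all names are illustrative placeholders, not the connector's real API:
```scala
import java.util.concurrent.{ArrayBlockingQueue, TimeUnit}

// fetchBatch stands in for "pull the next Arrow batch from BE and convert it";
// it returns None when BE is exhausted.
class AsyncRowBatchReader[Row <: AnyRef](fetchBatch: () => Option[Seq[Row]]) {
  private val queue = new ArrayBlockingQueue[Row](1024) // bounded: applies backpressure
  @volatile private var done = false

  private val converter = new Thread(() => {
    Iterator.continually(fetchBatch()).takeWhile(_.isDefined)
      .foreach(_.get.foreach(queue.put)) // convert and enqueue off the query thread
    done = true
  })
  converter.setDaemon(true)
  converter.start()

  // The Spark iterator calls this; it blocks briefly while the producer catches up.
  def next(): Option[Row] = {
    var row = queue.poll(100, TimeUnit.MILLISECONDS)
    while (row == null && !(done && queue.isEmpty)) {
      row = queue.poll(100, TimeUnit.MILLISECONDS)
    }
    Option(row)
  }
}
```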
2020-03-26 21:34:37 +08:00
e20d905d70 Remove unused KUDU codes (#3175)
KUDU tables stopped being supported a long time ago; remove the code related to them.
2020-03-24 13:54:05 +08:00
1550401d4b Support param exec_mem_limit for spark-doris-connector (#2775) 2020-01-18 00:14:39 +08:00
8ea5907252 Update arrow's version to 0.15.1 and shaded it in spark-doris-connector (#2769) 2020-01-15 21:08:34 +08:00
18a11f5663 Convert from arrow to rowbatch (#2723)
For #2722
In our test environment, the Doris cluster used 1 FE and 7 BEs (32C+128G). When using the Spark-Doris-Connector
to query a table containing 67 columns, the query took about 1 hour to return 69 million rows of data.
After the improvement, the same query took 2.5 minutes; query performance improved significantly.
2020-01-10 14:11:15 +08:00
feda66f99f Return an error to users when a Spark on Doris query fails (#2531) 2019-12-30 21:58:13 +08:00
435fdd236e Fix NPE in spark-doris-connector when the query is complex (#2503) 2019-12-19 14:53:29 +08:00
48f559600f Fix a bug when Spark on Doris runs for a long time (#2485) 2019-12-18 13:08:21 +08:00
0e84a88c1a Fix document bugs in spark-doris-connector (#2275) 2019-11-22 18:05:36 +08:00
732c473043 Add spark-doris-connector extension (#2228) 2019-11-22 15:38:05 +08:00