Commit Graph

16 Commits

Author SHA1 Message Date
a2ceea5951 [refactor](jni) unified jni framework for java udaf (#25591)
Follow https://github.com/apache/doris/pull/25302, and use the unified jni framework to refactor java udaf.
This PR has removed the old interfaces to run java udf/udaf. Thanks to the ease of use of the new framework, the core code for modifying UDAF does not exceed 100 lines, and the logic is similar to that of UDF.
2023-10-20 16:13:40 +08:00
47689fd452 [refactor](jni) unified jni framework for java udf (#25302)
Use the unified jni framework to refactor java udf.
The unified jni framework takes VectorTable as the container to transform data between c++ and java, and hide the details of data format conversion.
In addition, the unified framework supports complex and nested types.
The performance of basic types remains consistent, with a 30% improvement in string types and an order of magnitude improvement in complex types.
2023-10-18 09:27:54 +08:00
522faa8cd2 [fix](jni) the offset in map type is int64 (#25394)
The offset in map type column is int64, but #24810 has put as int32, causing error like:
2023-10-13 14:23:17 +08:00
26818de9c8 [feature](jni) support complex types in jni framework (#24810)
Support complex types in jni framework, and successfully run end-to-end on hudi.
### How to Use
Other scanners only need to implement three interfaces in `ColumnValue`:
```
// Get array elements and append into values
void unpackArray(List<ColumnValue> values);

// Get map key array&value array, and append into keys&values
void unpackMap(List<ColumnValue> keys, List<ColumnValue> values);

// Get the struct fields specified by `structFieldIndex`, and append into values
void unpackStruct(List<Integer> structFieldIndex, List<ColumnValue> values);
```
Developers can take `HudiColumnValue` as an example.
2023-09-27 14:47:41 +08:00
ee56783629 [fix](Java UDF) Do not use enum as the data type for JavaUdfDataType. (#24460) 2023-09-19 14:06:02 +08:00
96c4471b4a [feature](udf) udf array/map support decimal and update doc (#23560)
* update

* decimal

* update table name

* remove log

* add log
2023-08-31 07:44:18 +08:00
fa6110accd [fix](catalog)paimon support more data type (#22899) 2023-08-14 13:48:33 +08:00
209f36f1bf [fix](multi-catalog)fix jdbc loader (#22814) 2023-08-11 14:36:19 +08:00
919bfd73f1 [improvement](multi-catalog)add scanner isolation class loader (#22247)
Add scanner isolation class loader to make each plugin non-conflicting.
The BE will get scanner classes by JNI call and use JniClassLoader load them.
In the last version,we always get canner classes from the system class path by default,
so it cannot isolate the classes for each scanner
2023-08-10 10:02:46 +08:00
47c2cc5c74 [vectorized](udf) java udf support with return map type (#22300) 2023-07-29 12:52:27 +08:00
0f439bb1ca [vectorized](udf) java udf support map type (#22059) 2023-07-25 11:56:20 +08:00
9adbca685a [opt](hudi) use spark bundle to read hudi data (#21260)
Use spark-bundle to read hudi data instead of using hive-bundle to read hudi data.

**Advantage** for using spark-bundle to read hudi data:
1. The performance of spark-bundle is more than twice that of hive-bundle
2. spark-bundle using `UnsafeRow` can reduce data copying and GC time of the jvm
3. spark-bundle support `Time Travel`, `Incremental Read`, and `Schema Change`, these functions can be quickly ported to Doris

**Disadvantage** for using spark-bundle to read hudi data:
1. More dependencies make hudi-dependency.jar very cumbersome(from 138M -> 300M)
2. spark-bundle only provides `RDD` interface and cannot be used directly
2023-07-04 17:04:49 +08:00
a6b51ec19a [Feature](avro) Support Apache Avro file format (#19990)
support read avro file by hdfs() or s3() .
```sql
select * from s3(
         "uri" = "http://127.0.0.1:9312/test2/person.avro",
         "ACCESS_KEY" = "ak",
         "SECRET_KEY" = "sk",
         "FORMAT" = "avro");
+--------+--------------+-------------+-----------------+
| name   | boolean_type | double_type | long_type       |
+--------+--------------+-------------+-----------------+
| Alyssa |            1 |     10.0012 | 100000000221133 |
| Ben    |            0 |    5555.999 |      4009990000 |
| lisi   |            0 | 5992225.999 |      9099933330 |
+--------+--------------+-------------+-----------------+

select * from hdfs(
                "uri" = "hdfs://127.0.0.1:9000/input/person2.avro",
                "fs.defaultFS" = "hdfs://127.0.0.1:9000",
                "hadoop.username" = "doris",
                "format" = "avro");
+--------+--------------+-------------+-----------+
| name   | boolean_type | double_type | long_type |
+--------+--------------+-------------+-----------+
| Alyssa |            1 |  8888.99999 |  89898989 |
+--------+--------------+-------------+-----------+
```

current avro reader only support common data type, the complex data types will be supported later.
2023-06-28 21:15:35 +08:00
ef17289925 [feature](jni) add jni metrics and attach to BE profile automatically (#21004)
Add JNI metrics, for example:
```
-  HudiJniScanner:  0ns
  -  FillBlockTime:  31.29ms
  -  GetRecordReaderTime:  1m5s
  -  JavaScanTime:  35s991ms
  -  OpenScannerTime:  1m6s
```
Add three common performance metrics for JNI scanner:
1. `OpenScannerTime`: Time to init and open JNI scanner
2. `JavaScanTime`: Time to scan data and insert into vector table in java side
3. `FillBlockTime`: Time to convert java vector table to c++ block

And support user defined metrics in java side, for example: `OpenScannerTime` is a long time for the open process, we want to determine which sub-process takes too much time, so we add `GetRecordReaderTime` in java side.
The user defined metrics in java side can be attached to BE profile automatically.
2023-06-21 11:19:02 +08:00
923f7edad0 [opt](hudi) using native reader to read the base file with no log file (#20988)
Two optimizations:
1. Insert string bytes directly to remove decoding&encoding process.
2. Use native reader to read the hudi base file if it has no log file. Use `explain` to show how many splits are read natively.
2023-06-20 11:20:21 +08:00
57656b2459 [Enhancement](java-udf) java-udf module split to sub modules (#20185)
The java-udf module has become increasingly large and difficult to manage, making it inconvenient to package and use as needed. It needs to be split into multiple sub-modules, such as : java-commom、java-udf、jdbc-scanner、hudi-scanner、 paimon-scanner.

Co-authored-by: lexluo <lexluo@tencent.com>
2023-06-13 09:41:22 +08:00