doris

Author	SHA1	Message	Date
Ashin Gau	a2ceea5951	[refactor](jni) unified jni framework for java udaf (#25591) Follows https://github.com/apache/doris/pull/25302 and uses the unified JNI framework to refactor Java UDAF. This PR removes the old interfaces for running Java UDF/UDAF. Thanks to the ease of use of the new framework, the core code changes for UDAF do not exceed 100 lines, and the logic is similar to that of UDF.	2023-10-20 16:13:40 +08:00
Ashin Gau	47689fd452	[refactor](jni) unified jni framework for java udf (#25302) Use the unified JNI framework to refactor Java UDF. The unified JNI framework uses VectorTable as the container to transfer data between C++ and Java, and hides the details of data format conversion. In addition, the unified framework supports complex and nested types. Performance for basic types remains unchanged, with a 30% improvement for string types and an order-of-magnitude improvement for complex types.	2023-10-18 09:27:54 +08:00
Ashin Gau	522faa8cd2	[fix](jni) the offset in map type is int64 (#25394) The offset in a map type column is int64, but #24810 wrote it as int32, causing an error like:	2023-10-13 14:23:17 +08:00
Ashin Gau	26818de9c8	[feature](jni) support complex types in jni framework (#24810) Support complex types in the JNI framework, verified end-to-end on Hudi. ### How to Use Other scanners only need to implement three interfaces in `ColumnValue`: ``` // Get array elements and append into values void unpackArray(List<ColumnValue> values); // Get map key array&value array, and append into keys&values void unpackMap(List<ColumnValue> keys, List<ColumnValue> values); // Get the struct fields specified by `structFieldIndex`, and append into values void unpackStruct(List<Integer> structFieldIndex, List<ColumnValue> values); ``` Developers can take `HudiColumnValue` as an example.	2023-09-27 14:47:41 +08:00
Mryange	ee56783629	[fix](Java UDF) Do not use enum as the data type for JavaUdfDataType. (#24460)	2023-09-19 14:06:02 +08:00
Mryange	96c4471b4a	[feature](udf) udf array/map support decimal and update doc (#23560) * update * decimal * update table name * remove log * add log	2023-08-31 07:44:18 +08:00
zhangdong	fa6110accd	[fix](catalog)paimon support more data type (#22899)	2023-08-14 13:48:33 +08:00
slothever	209f36f1bf	[fix](multi-catalog)fix jdbc loader (#22814)	2023-08-11 14:36:19 +08:00
slothever	919bfd73f1	[improvement](multi-catalog)add scanner isolation class loader (#22247) Add a scanner isolation class loader so that plugins do not conflict with each other. The BE gets scanner classes through a JNI call and uses JniClassLoader to load them. In the previous version, scanner classes were always loaded from the system class path by default, so the classes of each scanner could not be isolated.	2023-08-10 10:02:46 +08:00
Mryange	47c2cc5c74	[vectorized](udf) java udf support with return map type (#22300)	2023-07-29 12:52:27 +08:00
Mryange	0f439bb1ca	[vectorized](udf) java udf support map type (#22059)	2023-07-25 11:56:20 +08:00
Ashin Gau	9adbca685a	[opt](hudi) use spark bundle to read hudi data (#21260) Use spark-bundle instead of hive-bundle to read Hudi data. Advantages of using spark-bundle to read Hudi data: 1. The performance of spark-bundle is more than twice that of hive-bundle. 2. spark-bundle uses `UnsafeRow`, which can reduce data copying and JVM GC time. 3. spark-bundle supports `Time Travel`, `Incremental Read`, and `Schema Change`, and these features can be quickly ported to Doris. Disadvantages of using spark-bundle to read Hudi data: 1. More dependencies make hudi-dependency.jar very cumbersome (from 138M to 300M). 2. spark-bundle only provides the `RDD` interface and cannot be used directly.	2023-07-04 17:04:49 +08:00
DongLiang-0	a6b51ec19a	[Feature](avro) Support Apache Avro file format (#19990) Support reading Avro files via hdfs() or s3(). ```sql select * from s3( "uri" = "http://127.0.0.1:9312/test2/person.avro", "ACCESS_KEY" = "ak", "SECRET_KEY" = "sk", "FORMAT" = "avro"); +--------+--------------+-------------+-----------------+ \| name \| boolean_type \| double_type \| long_type \| +--------+--------------+-------------+-----------------+ \| Alyssa \| 1 \| 10.0012 \| 100000000221133 \| \| Ben \| 0 \| 5555.999 \| 4009990000 \| \| lisi \| 0 \| 5992225.999 \| 9099933330 \| +--------+--------------+-------------+-----------------+ select * from hdfs( "uri" = "hdfs://127.0.0.1:9000/input/person2.avro", "fs.defaultFS" = "hdfs://127.0.0.1:9000", "hadoop.username" = "doris", "format" = "avro"); +--------+--------------+-------------+-----------+ \| name \| boolean_type \| double_type \| long_type \| +--------+--------------+-------------+-----------+ \| Alyssa \| 1 \| 8888.99999 \| 89898989 \| +--------+--------------+-------------+-----------+ ``` The current Avro reader supports only common data types; complex data types will be supported later.	2023-06-28 21:15:35 +08:00
Ashin Gau	ef17289925	[feature](jni) add jni metrics and attach to BE profile automatically (#21004) Add JNI metrics, for example: ``` - HudiJniScanner: 0ns - FillBlockTime: 31.29ms - GetRecordReaderTime: 1m5s - JavaScanTime: 35s991ms - OpenScannerTime: 1m6s ``` Add three common performance metrics for the JNI scanner: 1. `OpenScannerTime`: Time to initialize and open the JNI scanner 2. `JavaScanTime`: Time to scan data and insert it into the vector table on the Java side 3. `FillBlockTime`: Time to convert the Java vector table to a C++ block. User-defined metrics on the Java side are also supported. For example, when `OpenScannerTime` is long and we want to determine which sub-step of the open process takes too much time, we can add `GetRecordReaderTime` on the Java side. User-defined metrics on the Java side are attached to the BE profile automatically.	2023-06-21 11:19:02 +08:00
Ashin Gau	923f7edad0	[opt](hudi) using native reader to read the base file with no log file (#20988) Two optimizations: 1. Insert string bytes directly to skip the decoding and encoding step. 2. Use the native reader to read the Hudi base file if it has no log file. Use `explain` to show how many splits are read natively.	2023-06-20 11:20:21 +08:00
lexluo09	57656b2459	[Enhancement](java-udf) java-udf module split to sub modules (#20185) The java-udf module has become increasingly large and difficult to manage, making it inconvenient to package and use as needed. It needs to be split into multiple sub-modules, such as: java-common, java-udf, jdbc-scanner, hudi-scanner, paimon-scanner. Co-authored-by: lexluo <lexluo@tencent.com>	2023-06-13 09:41:22 +08:00

16 Commits
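
The message of commit 26818de9c8 lists three `ColumnValue` methods that a scanner implements to expose complex types. As a rough illustration of that contract, here is a minimal sketch. It assumes a trimmed-down `ColumnValue` interface containing only those three methods, and a hypothetical `SimpleColumnValue` class backed by plain Java collections. Neither is the actual Doris code (the real interface has more methods, and `HudiColumnValue` is the reference implementation named in the commit).

```java
import java.util.List;
import java.util.Map;

// Hypothetical, trimmed-down stand-in for the ColumnValue interface;
// only the three methods quoted in the commit message are declared.
interface ColumnValue {
    // Get array elements and append into values
    void unpackArray(List<ColumnValue> values);

    // Get map key array and value array, and append into keys and values
    void unpackMap(List<ColumnValue> keys, List<ColumnValue> values);

    // Get the struct fields specified by structFieldIndex, and append into values
    void unpackStruct(List<Integer> structFieldIndex, List<ColumnValue> values);
}

// Hypothetical in-memory implementation: arrays and structs are java.util.List,
// maps are java.util.Map. A real scanner would wrap its own record format.
final class SimpleColumnValue implements ColumnValue {
    private final Object value;

    SimpleColumnValue(Object value) {
        this.value = value;
    }

    // Accessor for the wrapped object, used here only for inspection.
    Object raw() {
        return value;
    }

    @Override
    public void unpackArray(List<ColumnValue> values) {
        for (Object element : (List<?>) value) {
            values.add(new SimpleColumnValue(element));
        }
    }

    @Override
    public void unpackMap(List<ColumnValue> keys, List<ColumnValue> values) {
        // Keys and values are appended in the same iteration order,
        // so position i in keys pairs with position i in values.
        for (Map.Entry<?, ?> entry : ((Map<?, ?>) value).entrySet()) {
            keys.add(new SimpleColumnValue(entry.getKey()));
            values.add(new SimpleColumnValue(entry.getValue()));
        }
    }

    @Override
    public void unpackStruct(List<Integer> structFieldIndex, List<ColumnValue> values) {
        // Only the requested struct fields are materialized, in the requested order.
        List<?> fields = (List<?>) value;
        for (int index : structFieldIndex) {
            values.add(new SimpleColumnValue(fields.get(index)));
        }
    }
}
```

Because each unpacked element is itself a `ColumnValue`, nested types such as an array of maps can be handled by calling the same three methods recursively on the elements.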