Commit Graph

62 Commits

Author SHA1 Message Date
22bf2889e5 [feature](tvf)(jni-avro)jni-avro scanner add complex data types (#26236)
Support avro's enum, record, union data types
2023-11-09 13:58:49 +08:00
809510f8b2 [bug](udf) Fix method invoking (#26131) 2023-10-31 11:46:14 +08:00
267c11207b [feature](paimon)paimon catalog supports complex types (#25364) 2023-10-23 17:32:13 +08:00
a2ceea5951 [refactor](jni) unified jni framework for java udaf (#25591)
Follow https://github.com/apache/doris/pull/25302, and use the unified jni framework to refactor java udaf.
This PR has removed the old interfaces to run java udf/udaf. Thanks to the ease of use of the new framework, the core code for modifying UDAF does not exceed 100 lines, and the logic is similar to that of UDF.
2023-10-20 16:13:40 +08:00
47689fd452 [refactor](jni) unified jni framework for java udf (#25302)
Use the unified jni framework to refactor java udf.
The unified jni framework takes VectorTable as the container to transform data between c++ and java, and hide the details of data format conversion.
In addition, the unified framework supports complex and nested types.
The performance of basic types remains consistent, with a 30% improvement in string types and an order of magnitude improvement in complex types.
2023-10-18 09:27:54 +08:00
18c2a13e09 [fix](multi-catalog)fix maxcompute partition filter and session creation (#24911)
add maxcompute partition support
fix maxcompute partition filter
modify maxcompute session create method
2023-10-17 22:36:10 +08:00
ce18f1148a [improvement](catalog)compatible with paimon 0.5 (#24985)
compatible with paimon 0.5
add p0 for paimon,need set enablePaimonTest=true
2023-10-17 22:07:13 +08:00
522faa8cd2 [fix](jni) the offset in map type is int64 (#25394)
The offset in map type column is int64, but #24810 has put as int32, causing error like:
2023-10-13 14:23:17 +08:00
4e8cde127c [Enhance](catalog)add table cache in paimon jni (#25014)
- fix get old schema after refresh paimon table
- add table cache in paimon jni
2023-10-08 10:36:18 +08:00
26818de9c8 [feature](jni) support complex types in jni framework (#24810)
Support complex types in jni framework, and successfully run end-to-end on hudi.
### How to Use
Other scanners only need to implement three interfaces in `ColumnValue`:
```
// Get array elements and append into values
void unpackArray(List<ColumnValue> values);

// Get map key array&value array, and append into keys&values
void unpackMap(List<ColumnValue> keys, List<ColumnValue> values);

// Get the struct fields specified by `structFieldIndex`, and append into values
void unpackStruct(List<Integer> structFieldIndex, List<ColumnValue> values);
```
Developers can take `HudiColumnValue` as an example.
2023-09-27 14:47:41 +08:00
1f8e0b48bc [fix](S3)delete main function because hardcoded ip is not safe (#24872) 2023-09-26 10:49:16 +08:00
c832e018d0 [Dependence](Fe)Upgrade Fe dependencies (#24606)
* be scanner
- Upgrade avro to 1.11.2
fe
- Upgrade quartz to 2.5.0-rc1
- Upgrade maxcompute to 0.45-2-publish
- Binding  avro-ipc  to 1.11.2

* Binding hbase  version to 2.5.5
binding nimbusds version to 9.35
2023-09-22 10:14:42 +08:00
ee56783629 [fix](Java UDF) Do not use enum as the data type for JavaUdfDataType. (#24460) 2023-09-19 14:06:02 +08:00
4816ca6679 [fix](multi-catalog)fix mc decimal type parse, fix wrong obj location (#24242)
1. mc decimal type need parse correctly by arrow vector method
2. fix wrong obj location if use oss,obs,cosn

Will add test case in another PR
2023-09-15 17:44:56 +08:00
dbfacdc4af [improvement](jdbc catalog) Optimize Loop Performance by Caching isNebula Method Result (#24260) 2023-09-13 21:40:28 +08:00
fca34ec337 [fix](multi-catalog)support bit type and hidden mc secret key (#24124)
support max compute bit type and mask mc secret key
bool type will use bit arrow vector
should mask secret key: close #24019
2023-09-12 10:36:48 +08:00
6e28d878b5 [fix](hudi) compatible with hudi spark configuration and support skip merge (#24067)
Fix three bugs:
1. Hudi slice maybe has log files only, so `new Path(filePath)`  will throw errors.
2. Hive column names are lowercase only, so match column names in ignore-case-mode.
3.  Compatible with [Spark Datasource Configs](https://hudi.apache.org/docs/configurations/#Read-Options), so users can add `hoodie.datasource.merge.type=skip_merge` in catalog properties to skip merge logs files.
2023-09-11 19:54:59 +08:00
f85da7d942 [improvement](jdbc) add profile for jdbc read and convert phase (#23962)
Add 2 metrics in jdbc scan node profile:
- `CallJniNextTime`: call get next from jdbc result set
- `ConvertBatchTime`: call convert jobject to columm block

Also fix a potential concurrency issue when init jdbc connection cache pool
2023-09-10 21:42:06 +08:00
13c9c41c1f [opt](hudi) reduce the memory usage of avro reader (#23745)
1. Reduce the number of threads reading avro logs and keep the readers in a fixed thread pool.
2. Regularly cleaning the cached resolvers in the thread local map by reflection.
2023-09-05 23:59:23 +08:00
228f0ac5bb [Feature](Multi-Catalog) support query doris bitmap column in external jdbc catalog (#23021) 2023-09-02 12:46:33 +08:00
96c4471b4a [feature](udf) udf array/map support decimal and update doc (#23560)
* update

* decimal

* update table name

* remove log

* add log
2023-08-31 07:44:18 +08:00
aef162ad4c [test](log) add some log in udf function when thrown exception (#23651)
[test](log) add some log in udf function when thrown exception (#23651)
2023-08-30 14:16:05 +08:00
f66f161017 [fix](multi-catalog)fix hive table with cosn location issue (#23409)
Sometimes, the partitions of a hive table may on different storage, eg, some is on HDFS, others on object storage(cos, etc).
This PR mainly changes:

1. Fix the bug of accessing files via cosn.
2. Add a new field `fs_name` in TFileRangeDesc
    This is because, when accessing a file, the BE will get a hdfs client from hdfs client cache, and different file in one query
request may have different fs name, eg, some of are `hdfs://`, some of are `cosn://`, so we need to specify fs name
for each file, otherwise, it may return error:

`reason: IllegalArgumentException: Wrong FS: cosn://doris-build-1308700295/xxxx, expected: hdfs://[172.xxxx:4007](http://172.xxxxx:4007/)`
2023-08-26 00:16:00 +08:00
5ba505ebf4 [fix](multi-catalog)fix avro and jdbc scanner dependency (#23015)
add preload-extensions module, put all conflict dependencies to pom.xml in preload-extensions
2023-08-20 19:28:17 +08:00
221e7bdd17 [test](jdbc external) fix mysql and pg external regression test (#22998) 2023-08-16 10:44:47 +08:00
fa6110accd [fix](catalog)paimon support more data type (#22899) 2023-08-14 13:48:33 +08:00
a089fe3e43 [Improve](jni-avro)Reduce the volume of the avro-scanner-jar package (#22276)
The avro-scanner-jar package is reduced from 204M to 160M.

Hadoop-related dependencies in the original avro pom are directly packaged into a jar package, resulting in a jar volume of 200M. Now since there is already a hadoop jar package environment in be lib, it can be directly referenced.
2023-08-11 17:26:14 +08:00
db69457576 [fix](avro)Fix S3 TVF avro format reading failure (#22199)
This pr fixes two issues:

1. when using s3 TVF to query files in AVRO format, due to the change of `TFileType`, the originally queried `FILE_S3 ` becomes `FILE_LOCAL`, causing the query failed.
2. currently, both parameters `s3.virtual.key` and `s3.virtual.bucket` are removed. A new `S3Utils`  in jni-avro to parse the bucket and key of s3.
The purpose of doing this operation is mainly to unify the parameters of s3.
2023-08-11 17:22:48 +08:00
209f36f1bf [fix](multi-catalog)fix jdbc loader (#22814) 2023-08-11 14:36:19 +08:00
919bfd73f1 [improvement](multi-catalog)add scanner isolation class loader (#22247)
Add scanner isolation class loader to make each plugin non-conflicting.
The BE will get scanner classes by JNI call and use JniClassLoader load them.
In the last version,we always get canner classes from the system class path by default,
so it cannot isolate the classes for each scanner
2023-08-10 10:02:46 +08:00
768088c95e [refactor](udaf) refactor call udaf function and support map type in return (#22508) 2023-08-09 22:44:07 +08:00
ddd90855a9 [vectorized](udaf) java udaf support with map type (#22397)
[vectorized](udaf) java udaf support with map type (#22397)
* test
* remove some unused
* update
* add case
2023-08-02 15:03:44 +08:00
47c2cc5c74 [vectorized](udf) java udf support with return map type (#22300) 2023-07-29 12:52:27 +08:00
6f1c03c766 [fix](jdbc_catalog) fix int and bigint in mysql view when use doris catalog (#22251) 2023-07-27 16:50:42 +08:00
4f6a3c5bf0 [feature](catalog) support clob type in oracle jdbc catalog (#21532) 2023-07-27 15:49:15 +08:00
619a2857e1 [improvement](jdbc catalog) improve mysql jdbc catalog read bytea`s types & else improve (#22233) 2023-07-27 10:18:37 +08:00
4c4f08f805 [fix](hudi) the required fields are empty if only reading partition columns (#22187)
1. If only read the partition columns, the `JniConnector` will produce empty required fields, so `HudiJniScanner` should read the "_hoodie_record_key" field at least to know how many rows in current hoodie split. Even if the `JniConnector` doesn't read this field, the call of `releaseTable` in `JniConnector` will reclaim the resource.

2. To prevent BE failure and exit, `JniConnector` should call release methods after `HudiJniScanner` is initialized. It should be noted that `VectorTable` is created lazily in `JniScanner`,  so we don't need to reclaim the resource when `HudiJniScanner` is failed to initialize.

## Remaining works
Other jni readers like `paimon` and `maxcompute` may encounter the same problems, the jni reader need to handle this abnormal situation on its own, and currently this fix can only ensure that BE will not exit.
2023-07-26 10:59:45 +08:00
9abf32324b [improvement](jdbc) add timestamp put to datev2 (#21680) 2023-07-26 09:10:34 +08:00
e8f4323e0f [Fix](jdbcCatalog) fix typo of some variable #22214 2023-07-26 08:34:45 +08:00
3414d1a61f [fix](hudi) table schema is not the same as parquet schema (#22186)
Upgrade hudi version from 0.13.0 to 0.13.1, and keep the hudi version of jni scanner the same as that of FE.
This may fix the bug of the table schema is not same as parquet schema.
2023-07-26 00:29:53 +08:00
cf677b327b [fix](jdbc catalog) Fixed mappings with type errors for bool and tinyint(1) (#22089)
First of all, mysql does not have a boolean type, its boolean type is actually tinyint(1), in the previous logic, We force tinyint(1) to be a boolean by passing tinyInt1isBit=true, which causes an error if tinyint(1) is not a 0 or 1, Therefore, we need to match tinyint(1) according to tinyint instead of boolean, and this change will not affect the correctness of where k = 1 or where k = true queries
2023-07-25 22:45:22 +08:00
999fbdc802 [improvement](jdbc) add new type 'object' of int (#21681) 2023-07-25 21:29:46 +08:00
0f439bb1ca [vectorized](udf) java udf support map type (#22059) 2023-07-25 11:56:20 +08:00
7fcf702081 [improvement](multi catalog)paimon support filesystem metastore (#21910)
1.support filesystem metastore

2.support predicate and project when split

3.fix partition table query error

todo: Now you need to manually put paimon-s3-0.4.0-incubating.jar in be/lib/java_extensions when use s3 filesystem

doc pr: #21966
2023-07-24 22:02:57 +08:00
7d488688b4 [fix](multi-catalog)fix minio default region and throw minio error msg, support s3 bucket root path (#21994)
1. check minio region, set default region if user region is not provided, and throw minio error msg
2. support read root path s3://bucket1
3. fix max compute public access
2023-07-20 20:48:55 +08:00
c07e2ada43 [imporve](udaf) refactor java-udaf executor by using for loop (#21713)
refactor java-udaf executor by using for loop
2023-07-14 11:37:19 +08:00
4158253799 [feature](hudi) support hudi time travel in external table (#21739)
Support hudi time travel in external table:
```
select * from hudi_table for time as of '20230712221248';
```
PR(https://github.com/apache/doris/pull/15418) supports to take timestamp or version as the snapshot ID in iceberg, but hudi only has timestamp as the snapshot ID. Therefore, when querying hudi table with `for version as of`, error will be thrown like:
```
ERROR 1105 (HY000): errCode = 2, detailMessage = Hudi table only supports timestamp as snapshot ID
```
The supported formats of timestamp in hudi are: 'yyyy-MM-dd HH:mm:ss[.SSS]' or 'yyyy-MM-dd' or 'yyyyMMddHHmmss[SSS]', which is consistent with the [time-travel-query.](https://hudi.apache.org/docs/quick-start-guide#time-travel-query)

## Partitioning Strategies
Before this PR, hudi's partitions need to be synchronized to hive through [hive-sync-tool](https://hudi.apache.org/docs/syncing_metastore/#hive-sync-tool), or by setting very complex synchronization parameters in [spark conf](https://hudi.apache.org/docs/syncing_metastore/#sync-template). These processes are exceptionally complex and unnecessary, unless you want to query hudi data through hive.

In addition, partitions are changed in time travel. We cannot guarantee the correctness of time travel through partition synchronization.

So this PR directly obtain partitions by reading hudi meta information. Caching and updating table partition information through hudi instant timestamp, and reusing Doris' partition pruning.
2023-07-13 22:30:07 +08:00
0be349e250 [feature](jdbc) Support jdbc catalog to read json types (#21341) 2023-07-10 16:21:00 +08:00
bb985cd9a1 [refactor](udf) refactor java-udf execute method by using for loop (#21388) 2023-07-07 11:43:11 +08:00
0084b9fd9a [fix](hudi) scala can't call Properties.putAll in jdk11 (#21494) 2023-07-05 10:53:09 +08:00