Commit Graph

58 Commits

Author SHA1 Message Date
c78341b728 [improvement](spark-load) support datev2 and datetimev2 #21839 2023-07-24 09:07:53 +08:00
365afb5389 [fix](sparkdpp) Hive table properties not take effect when create spark session (#21881)
When creating a Hive external table for Spark loading, the Hive external table includes related information such as the Hive Metastore. However, when submitting a job, it is required to have the hive-site.xml file in the Spark conf directory; otherwise, the Spark job may fail with an error message indicating that the corresponding Hive table cannot be found.

The SparkEtlJob.initSparkConfigs method sets the properties of the external table into the Spark conf. However, at this point, the Spark session has already been created, and the Hive-related parameters will not take effect. To ensure that the Spark Hive catalog properly loads Hive tables, you need to set the Hive-related parameters before creating the Spark session.

Co-authored-by: zhangshixin <zhangshixin@youzan.com>
2023-07-20 14:36:00 +08:00
4418eb36a3 [Fix](multi-catalog) Fix some hive partition issues. (#19513)
Fix some hive partition issues.
1. Fix be will crash when using hive partitions field of `date`, `timestamp`, `decimal` type.
2. Fix hdfs uri decode error when using `timestamp` partition filed which will cause some url-encoding for special chars, such as `%3A` will encode `:`.
2023-05-11 07:49:46 +08:00
5459cd9c30 [Improve](fe)Upgrade dependencies and optimize jar package management (#18882)
bind netty-version to 4.1.89-final
bind jettison to 1.5.4
upgrade hadoop version to 3.3.5
upgrade range-plugins-common to 2.4.0
bind bcprov-jdk15on to 2.4.0
upgrade and bind woodstox to 6.5.1
upgrade and bind kerby to 2.0.3
upgrade hudi to 0.13.0
upgrade parquet to 1.13.0
upgrade maven-source-plugin to 3.2.1
upgrade maven-assembly-plugin to 3.3.0
upgrade maven-javadoc-plugin to 3.3.2
upgrade maven-shade-plugin to 3.3.4
upgrade maven-clean-plugin to 3.1.0
Remove meaningless plugins
Optimize doris maven path
Unify the Java modules for management in fe
2023-05-04 10:07:37 +08:00
75fd4b70fa [improve](fe)Optimize fe binary package packaging (#18554) 2023-04-12 12:58:45 +08:00
b5b595519a [fix](log) use logger to replace printStackTrace() (#17382)
Use Logger to replace printStackTrace to better locate problems.
2023-03-03 14:51:30 +08:00
726427b795 [refactor](fe) refactor and upgrade dependency tree of FE and support AWS glue catalog (#16046)
1. Spark dpp
 
	Move `DppResult` and `EtlJobConfig` to sparkdpp package in `fe-common` module.
	So taht `fe-core` is longer depends on `spark-dpp` module, so that the `spark-dpp.jar`
	will not be moved into `fe/lib`, which reduce the size of FE output.
	
2. Modify start_fe.sh

	Modify the CLASSPATH to make sure that doris-fe.jar is at front, so that
	when loading classes with same qualified name, it will be got from doris-fe.jar firstly.
	
3. Upgrade hadoop and hive version

	hadoop: 2.10.2 -> 3.3.3
	hive: 2.3.7 -> 3.1.3
	
4. Override the IHiveMetastoreClient implementations from dependency

	`ProxyMetaStoreClient.java` for Aliyun DLF.
	`HiveMetaStoreClient.java` for origin Apache Hive metastore.

	Because I need to modified some of their method to make them compatible with
	different version of Hive.
	
5. Exclude some unused dependencies to reduce the size of FE output

	Now it is only 370MB (Before is 600MB)
	
6. Upgrade aws-java-sdk version to 1.12.31

7. Support AWS Glue Data Catalog

8. Remove HudiScanNode(no longer support)
2023-01-20 14:42:16 +08:00
650136c32e [Enhancement](fe): replace assertTrue(X.equals(X)) with assertEquals (#15356) 2022-12-27 00:37:24 +08:00
d48abd91df [deps](fe)upgrade deps version (#15262)
upgrade hadoop version to 2.10.2
jackson-databind to 2.14.1
2022-12-24 22:18:10 +08:00
c5f9fd5619 [fix](spark load)partition column is not duplicate key, spark load IndexOutOfBounds error (#14661)
* [fix](spark load)partition column is not duplicate key,spark load IndexOutOfBoundsException error

Co-authored-by: 张放(vivianv.zhang) <vivianv.zhang@huolala.cn>
2022-11-29 15:21:21 +08:00
91bd76a902 [enhancement](FE) use forEach() to replace stream().forEach() (#14039) 2022-11-21 15:40:43 +08:00
7fedfdcf6a [fix](spark load)The where condition does not take effect when spark load loads the file (#13803) 2022-11-01 23:01:45 +08:00
60d5e4dfce [improvement](spark-load) support parquet and orc file (#13438)
Add support for parquet/orc in SparkDpp.java
Fixed sparkDpp checkstyle issue
2022-10-20 08:59:22 +08:00
7aae98eb71 [fix](comment) sparkload comment mislead which file types it support (#12982) 2022-09-29 20:23:57 +08:00
f0cde35ea6 [performance improvement] Spark Load, SparkDpp processRDDAggregate performance improvement (#12186)
Co-authored-by: hourong <hourong@zhihu.com>
2022-08-31 09:14:13 +08:00
915d8989c5 [feature](spark-load)Spark load supports string type data import (#11927) 2022-08-22 08:56:59 +08:00
976e7685db [minor](*): remove redundant log and unused code. (#11620) 2022-08-10 19:28:04 +08:00
e769597fd2 [Improvement] (datetime) support microsecond for date literal (#10917)
* [Improvement] (datetime) support microsecond for date literal

* remove joda dependency
2022-07-18 21:39:39 +08:00
3b46242483 [feature-wip] Optimize Decimal type (#10794)
* [feature-wip](decimalv3) support decimalv3

* [feature-wip] Optimize Decimal type

Co-authored-by: liaoxin <liaoxinbit@126.com>
2022-07-14 10:50:50 +08:00
ca94867b4e [Feature-wip] add date v2 type (#9916) 2022-06-26 16:07:56 +08:00
b7b78ae707 [style](fe)the last step of fe CheckStyle (#10134)
1. fix all checkstyle warning
2. change all checkstyle rules to error
3. remove some java doc rules
    a. RequireEmptyLineBeforeBlockTagGroup
    b. JavadocStyle
    c. JavadocParagraph
4. suppress some rules for old codes
    a. all java doc rules only affect on Nereids
    b. DeclarationOrder only affect on Nereids
    c. OverloadMethodsDeclarationOrder only affect on Nereids
    d. VariableDeclarationUsageDistance only affect on Nereids
    e. suppress OneTopLevelClass on org/apache/doris/load/loadv2/dpp/ColumnParser.java
    f. suppress OneTopLevelClass on org/apache/doris/load/loadv2/dpp/SparkRDDAggregator.java
    g. suppress LineLength on org/apache/doris/catalog/FunctionSet.java
    h. suppress LineLength on org/apache/doris/common/ErrorCode.java
2022-06-17 21:02:45 +08:00
e701c057dc [style](fe) wrap and whitespace rules (#9764)
change below rules' severity to error and fix original code error:

- EmptyBlock
- EmptyCatchBlock
- LeftCurly
- RightCurly
- IllegalTokenText
- MultipleVariableDeclarations
- OneStatementPerLine
- StringLiteralEquality
- UnusedLocalVariable
- Indentation
- OuterTypeFilename
- MethodParamPad
- GenericWhitespace
- NoWhitespaceBefore
- OperatorWrap
- ParenPad
- WhitespaceAfter
- WhitespaceAround
2022-05-26 16:56:20 +08:00
77297bb7ee Fix some typos in fe/. (#9682) 2022-05-23 12:11:01 +08:00
c048b1f0f9 [fix](sparkload): fix min_value will be negative number when maxGlobalDictValue exceeds integer range (#9436) 2022-05-19 23:56:24 +08:00
235d586f11 [style](fe) code correct rules and name rules (#9670)
* [style](fe) code correct rules and name rules

* revert some change according to comments
2022-05-19 16:36:03 +08:00
8a0097cfb9 [style](java) format fe code with some check rules (#9460)
Issue Number: close #9403 

set below rules' severity to error and format code according check info.
a. Merge conflicts unresolved
b. Avoid using corresponding octal or Unicode escape
c. Avoid Escaped Unicode Characters
d. No Line Wrap
e. Package Name
f. Type Name
g. Annotation Location
h. Interface Type Parameter
i. CatchParameterName
j. Pattern Variable Name
k. Record Component Name
l. Record Type Parameter Name
m. Method Type Parameter Name
n. Redundant Import
o. Custom Import Order
p. Unused Imports
q. Avoid Star Import
r. tab character in file
s. Newline At End Of File
t. Trailing whitespace found
2022-05-12 20:14:38 +08:00
d1b85d51a0 [code style](fe) Include test sources (#9366)
Include test sources, we also need to check them.
2022-05-09 09:40:44 +08:00
1746f61388 [refactor](test) Refactor FE unit test framework that starts a FE server. (#9388)
Currently, we use `UtFrameUtils` to start a FE server in the FE unit test. 
Each test class has to do some initialization and clean up stuff with the JUnit4
`@BeforeClass` and `@AfterClass` annotation. It's redundant and boring.
Besides, almost all the APIs in `UtFrameUtils` has a `ConnectContext` parameter, which is not easy to use.

This PR proposes to use an inherit-manner, i.e., wrap all the common logic in base class `TestWithFeService`,
leveraging the 
JUnit5 `@BeforeAll` and `@AfterAll` annotation to narrow down the setup and cleanup lifecycle to each test class instance.
At the same time, the derived concrete test class could directly use utility methods inherited from the base class,
without calling a util class and passing a `ConnectContext` argument.

`UtFrameUtils` and `DorisAssert`  are marked as deprecated. We could remove these two classes
if this refactor works well for a time.
2022-05-07 21:28:42 +08:00
c5941fd166 [FE Code Style][sub] Adjust some check rules (#9345)
Adjust `RedundantImport`,`UnusedImports`,`EmptyStatement`,`NewlineAtEndOfFile`,`UpperEll`, `AvoidStarImport`, `MissingOverride` rules.
2022-05-04 23:34:55 +08:00
784681f106 [FE Code Style][step 0]add github action to check incremental code in pr (#9328)
1. add rules to checkstyle
2. add github action to check incremental code in pr
2022-05-01 17:30:29 +08:00
62b38d7a75 [fix](spark load) fix getHashValue of string type is always zero in spark load. (#9136)
Buffer flip is used incorrectly.
When the hash key is string type, the hash value is always zero.
The reason is that the buffer of string type is obtained by wrap, which is not needed to flip.
If we do so, the buffer limit for read will be zero.
2022-04-26 10:14:21 +08:00
39c0fec680 [fix] fix bug when partition_id exceeds integer range in spark load (#9073) 2022-04-20 14:50:55 +08:00
50a59f3f86 [license] Organize third-party dependent licenses for bianry releases (#8350) 2022-03-07 23:18:58 +08:00
4bdeef3b64 [chore][fix][doc](fe-plugin)(mysqldump) fix build auditlog plugin error (#7804)
1. fix problems when build fe_plugins
2. format
3. add docs about dump data using mysql dump
2022-01-26 09:11:23 +08:00
738d2d2e07 [refactor] update parent pom version and optimize build scripts (#7548) 2022-01-05 10:45:11 +08:00
2872dbfeb8 [refactor] Standardize the writing of pom files, prepare for deployment to maven (#7477) 2021-12-30 10:16:37 +08:00
382351b0ee [fix](ut) Fix run fe ut failed, be ut memory leak and build thirdparty failed (#7377) 2021-12-15 11:00:20 +08:00
926540c561 [feature] Support return bitmp/hll data in select statement (#7276)
Support return bitmp/hll data in select statement, this can be used when set show_object_data=true;
2021-12-15 09:48:27 +08:00
d420ff0afd display current load bytes to show load progress, (#7134)
this value may greate than the file size when loading
parquert or orc file, will less than file size when loading
csv file.
2021-11-24 10:08:32 +08:00
e9282205f1 [feat-opt](spark-load) support bitmap binary data from hive in spark load (#6883)
Support to load the binary data of bitmap value from Hive into Doris.
fix #6461
2021-11-20 21:38:38 +08:00
35da149ebe [SparkDpp]Add not() and xor() methods to bitmapValue (#6885)
Add not() and xor() methods to bitmapValue
2021-11-12 10:38:15 +08:00
ea17682d1f [Typo] Correct misspellings in SparkDpp (#6789)
Correct misspellings in SparkDpp
2021-10-10 23:07:39 +08:00
6ac0ab6b29 fix(sparkload): bitmap deep copy in or operator (#6480)
* fix(sparkload): bitmap deep copy in `or` operator

fix multi rollup hold the same Ref of bitmapvalue which may be updated repeatedly.

* fix(sparkload): bitmap deep copy in `or` operator

fix multi rollup hold the same Ref of bitmapvalue which may be updated repeatedly.

Co-authored-by: weixiang <weixiang06@meituan.com>
2021-09-02 12:15:02 +08:00
52f39e3fde [Bug][SparkLoad]: bitmap value in or operator in spark load should be deep copied (#6453)
fix multi rollup hold the same Ref of bitmapvalue which may be updated repeatedly.
fix #6452
2021-08-19 14:17:31 +08:00
60ac4a9660 [Bug][SparkLoad] Fix bucket_hash_value for bool value (#6284)
Co-authored-by: weixiang <weixiang06@meituan.com>
2021-07-27 13:38:42 +08:00
ba84eacb8c (#6009) fix bucket key distribute error when using spark load (#6087) 2021-06-29 12:30:08 +08:00
9f706848b9 [Bug] Fix somg bugs about Spark Load (#5701)
The distinct count result of bitmap/hll column may be incorrect in the spark load mode.

Fix some bugs in spark load to solve the above problem.
1. FE is big end but BE is little end. BitmapValues should be transfered to little end in FE's serialization
2. BitmapUnionAggregator/HllUnionAggregator ignore `null` value
3. Make sure encodeVarint64 in FE is consistent with BE

Co-authored-by: weixiang <weixiang06@meituan.com>
2021-05-07 11:18:23 +08:00
18c2553ef8 [FE][Bug] Update Spark version to fix a security issue (#5593)
Fix CVE-2020-9480: Apache Spark RCE vulnerability in
auth-enabled standalone master
https://spark.apache.org/security.html#CVE-2020-9480
2021-04-06 11:02:04 +08:00
d8202ca9cc [Enhancement] move common codes from fe-core to fe-common and remove log4j1 (#5317) (#5318)
The io related codes may be used by new modules, so It's better to move them to fe-common.

The modification to fe-core is frequent, but there are many generated java files by thrift
will slow down the compilation, so It's better to move thrift generation process to fe-common.

Currently both log4j1 and log4j2 are used, which leads to logs are written to wrong files.
Our modification will remove log4j1 from dependency, use slf4j + slf4j -> log4j2 instead.
2021-02-04 13:41:03 +08:00
41ef9ccda9 (#5224)some little fix for spark load (#5233)
* (#5224)some little fix for spark load

* 1 use yyyy-MM-dd instead of YYYY-MM-DD
2 unify lower case for bitmap column name
2021-01-27 11:16:59 +08:00