When a Hive external table is created for Spark Load, it already carries the related Hive Metastore information. However, when submitting a job, the hive-site.xml file must be present in the Spark conf directory; otherwise the Spark job may fail with an error saying that the corresponding Hive table cannot be found.
The SparkEtlJob.initSparkConfigs method sets the external table's properties into the Spark conf. However, at that point the Spark session has already been created, so the Hive-related parameters do not take effect. To ensure that the Spark Hive catalog can load Hive tables, the Hive-related parameters need to be set before the Spark session is created, as sketched below.
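A minimal sketch, assuming a standard Spark-on-Hive setup, of setting the Hive-related parameters before the session is created; the metastore address and class name here are illustrative, not the actual SparkEtlJob code:

```java
import org.apache.spark.sql.SparkSession;

public class HiveCatalogInit {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-etl-job")
                // Hypothetical metastore address; passed to the builder BEFORE
                // getOrCreate(), so the Hive catalog is initialized with it.
                // Setting it on spark.conf() after the session exists is too late.
                .config("hive.metastore.uris", "thrift://metastore-host:9083")
                .enableHiveSupport()
                .getOrCreate();

        // The Hive catalog can now resolve external tables.
        spark.sql("SHOW TABLES").show();
    }
}
```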
Co-authored-by: zhangshixin <zhangshixin@youzan.com>
Fix some Hive partition issues.
1. Fix a BE crash when a Hive partition field is of `date`, `timestamp`, or `decimal` type.
2. Fix an HDFS URI decoding error with `timestamp` partition fields: special characters in the partition value are percent-encoded in the path (for example, `:` becomes `%3A`) and must be decoded correctly.
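A minimal, hypothetical Java illustration of the percent-encoding involved (not taken from the actual fix, and assuming Java 10+ for the Charset overload of `URLDecoder.decode`):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class PartitionPathDecode {
    public static void main(String[] args) {
        // A timestamp partition value as it may appear in a percent-encoded HDFS path.
        String encoded = "dt=2022-01-01 00%3A00%3A00";
        // Decoding restores the ':' characters before the value is parsed.
        String decoded = URLDecoder.decode(encoded, StandardCharsets.UTF_8);
        System.out.println(decoded); // dt=2022-01-01 00:00:00
    }
}
```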
1. Spark dpp
Move `DppResult` and `EtlJobConfig` to the sparkdpp package in the `fe-common` module,
so that `fe-core` no longer depends on the `spark-dpp` module and `spark-dpp.jar`
does not need to be copied into `fe/lib`, which reduces the size of the FE output.
2. Modify start_fe.sh
Modify the CLASSPATH to make sure that doris-fe.jar comes first, so that
when classes with the same qualified name exist in multiple jars, they are loaded from doris-fe.jar first.
3. Upgrade hadoop and hive version
hadoop: 2.10.2 -> 3.3.3
hive: 2.3.7 -> 3.1.3
4. Override the IHiveMetastoreClient implementations from the dependencies:
`ProxyMetaStoreClient.java` for Aliyun DLF.
`HiveMetaStoreClient.java` for the original Apache Hive metastore.
I need to modify some of their methods to make them compatible with
different versions of Hive.
5. Exclude some unused dependencies to reduce the size of the FE output.
The output is now only 370MB (previously 600MB).
6. Upgrade aws-java-sdk version to 1.12.31
7. Support AWS Glue Data Catalog
8. Remove HudiScanNode (no longer supported)
1. fix all checkstyle warnings
2. change all checkstyle rules to error
3. remove some java doc rules
a. RequireEmptyLineBeforeBlockTagGroup
b. JavadocStyle
c. JavadocParagraph
4. suppress some rules for old code
a. all javadoc rules only apply to Nereids
b. DeclarationOrder only applies to Nereids
c. OverloadMethodsDeclarationOrder only applies to Nereids
d. VariableDeclarationUsageDistance only applies to Nereids
e. suppress OneTopLevelClass on org/apache/doris/load/loadv2/dpp/ColumnParser.java
f. suppress OneTopLevelClass on org/apache/doris/load/loadv2/dpp/SparkRDDAggregator.java
g. suppress LineLength on org/apache/doris/catalog/FunctionSet.java
h. suppress LineLength on org/apache/doris/common/ErrorCode.java
Issue Number: close #9403
Set the severity of the rules below to error and format the code according to the check output.
a. Merge conflicts unresolved
b. Avoid using corresponding octal or Unicode escape
c. Avoid Escaped Unicode Characters
d. No Line Wrap
e. Package Name
f. Type Name
g. Annotation Location
h. Interface Type Parameter
i. CatchParameterName
j. Pattern Variable Name
k. Record Component Name
l. Record Type Parameter Name
m. Method Type Parameter Name
n. Redundant Import
o. Custom Import Order
p. Unused Imports
q. Avoid Star Import
r. tab character in file
s. Newline At End Of File
t. Trailing whitespace found
Buffer flip() is used incorrectly.
When the hash key is a string type, the hash value is always zero.
The reason is that the string's buffer is obtained via wrap(), which does not need to be flipped.
If we flip it anyway, the buffer's limit for reading becomes zero.
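A minimal standalone illustration of the wrap()/flip() behavior described above (not the actual hashing code):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class FlipDemo {
    public static void main(String[] args) {
        byte[] key = "hash-key".getBytes(StandardCharsets.UTF_8);

        // wrap() already leaves position = 0 and limit = capacity,
        // so the buffer is ready for reading; no flip() is needed.
        ByteBuffer wrapped = ByteBuffer.wrap(key);
        System.out.println(wrapped.remaining()); // 8

        // Calling flip() on a freshly wrapped buffer sets limit = position = 0,
        // so a subsequent read sees zero bytes and the hash is always the same.
        wrapped.flip();
        System.out.println(wrapped.remaining()); // 0

        // flip() is only needed after writing into a buffer:
        ByteBuffer written = ByteBuffer.allocate(16);
        written.put(key);
        written.flip(); // now position = 0, limit = number of bytes written
        System.out.println(written.remaining()); // 8
    }
}
```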
* fix(sparkload): bitmap deep copy in `or` operator
Fix multiple rollups holding the same reference to a BitmapValue, which may otherwise be updated repeatedly (see the sketch after this commit message).
Co-authored-by: weixiang <weixiang06@meituan.com>
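A minimal sketch of the aliasing problem and the deep-copy fix, illustrated with the RoaringBitmap library rather than Doris's own BitmapValue class (class and method names in the real fix may differ):

```java
import org.roaringbitmap.RoaringBitmap;

public class BitmapOrCopy {
    public static void main(String[] args) {
        RoaringBitmap base = RoaringBitmap.bitmapOf(1, 2, 3);

        // Buggy pattern: two rollups keep the SAME reference, so an in-place
        // or() performed for one rollup is visible to the other.
        RoaringBitmap rollupA = base;
        RoaringBitmap rollupB = base;
        rollupA.or(RoaringBitmap.bitmapOf(4));
        System.out.println(rollupB.contains(4)); // true -- unintended sharing

        // Fixed pattern: deep-copy before the destructive or(),
        // so each rollup owns its bitmap.
        RoaringBitmap fresh = RoaringBitmap.bitmapOf(1, 2, 3);
        RoaringBitmap copyA = fresh.clone();
        RoaringBitmap copyB = fresh.clone();
        copyA.or(RoaringBitmap.bitmapOf(4));
        System.out.println(copyB.contains(4)); // false
    }
}
```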
The distinct count result of bitmap/hll columns may be incorrect in Spark Load mode.
Fix some bugs in Spark Load to solve this problem:
1. FE is big-endian but BE is little-endian; BitmapValues should be converted to little-endian in FE's serialization.
2. BitmapUnionAggregator/HllUnionAggregator ignore `null` values.
3. Make sure encodeVarint64 in FE is consistent with BE (see the sketch after this commit message).
Co-authored-by: weixiang <weixiang06@meituan.com>
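To illustrate what keeping the varint64 encoding consistent means, here is a minimal sketch of the standard LEB128-style scheme (7 payload bits per byte, continuation bit in the high bit); this only illustrates the format and is not a claim about the exact byte layout Doris BE expects:

```java
import java.io.ByteArrayOutputStream;

public class Varint64 {
    // Unsigned LEB128 varint encoding: emit 7 bits at a time, low bits first,
    // setting the high bit on every byte except the last.
    static byte[] encodeVarint64(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : encodeVarint64(300)) {
            System.out.printf("%02x ", b); // ac 02
        }
        System.out.println();
    }
}
```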
The I/O related code may be used by new modules, so it is better to move it to fe-common.
fe-core is modified frequently, but the many Java files generated from Thrift
slow down its compilation, so it is better to move the Thrift generation to fe-common.
Currently both log4j1 and log4j2 are used, which causes logs to be written to the wrong files.
This change removes log4j1 from the dependencies and uses the slf4j API bound to log4j2 instead (see the sketch below).
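A minimal sketch of what "slf4j -> log4j2" means in practice: code logs only through the slf4j facade, and the log4j-slf4j-impl binding routes those calls to log4j2 (the class here is illustrative, not Doris code):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoadLogger {
    // Only the slf4j API appears in application code; with the log4j-slf4j-impl
    // binding on the classpath, all output lands in the log4j2-managed files.
    private static final Logger LOG = LoggerFactory.getLogger(LoadLogger.class);

    public static void main(String[] args) {
        LOG.info("spark load job {} started", "job-1");
    }
}
```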
When we use Spark Load from a Hive table, the loadDataFromHiveTable function
reads the whole Hive table and then filters the data in process().
If the Hive table has many partitions and a lot of historical data, the load costs too much time and too many resources.
So we can do the filtering in loadDataFromHiveTable when reading from the Hive table, as sketched below.
Co-authored-by: 杜安明 <anming.du@mihoyo.com>
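A minimal sketch of filtering at read time instead of in process(); the table, column, and date range are hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HivePartitionPrune {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-load")
                .enableHiveSupport()
                .getOrCreate();

        // Instead of reading the whole table and filtering later, push the
        // partition predicate into the read so only the needed partitions
        // are scanned.
        Dataset<Row> src = spark.sql(
                "SELECT * FROM db.hive_src_table "
                        + "WHERE dt BETWEEN '2022-01-01' AND '2022-01-07'");

        System.out.println(src.count());
    }
}
```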
For [Spark Load]:
1. Support decimal and largeint.
2. Add validation logic for char/varchar/decimal (see the sketch after this list).
3. Check data loaded from Hive with strict mode.
4. Support decimal/date/datetime aggregators.
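A minimal sketch of the kind of char/varchar and decimal validation described in point 2; the helper names and rules shown here are hypothetical, not the actual spark-dpp code:

```java
import java.math.BigDecimal;
import java.nio.charset.StandardCharsets;

public class ColumnValidator {
    // char/varchar: reject values longer than the column length (in bytes).
    static boolean validVarchar(String value, int maxLenBytes) {
        return value.getBytes(StandardCharsets.UTF_8).length <= maxLenBytes;
    }

    // decimal: reject values whose precision or scale exceeds the column definition.
    static boolean validDecimal(BigDecimal value, int precision, int scale) {
        return value.precision() <= precision && value.scale() <= scale;
    }

    public static void main(String[] args) {
        System.out.println(validVarchar("abc", 2));                       // false: 3 bytes > 2
        System.out.println(validDecimal(new BigDecimal("12.345"), 5, 2)); // false: scale 3 > 2
    }
}
```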
1. Fix writing the DPP result when DPP throws an exception.
2. Boolean values: true, false (ignoring case), 0, 1 (see the sketch after this list).
3. Fix the wrong destination column used in the source data check.
4. Support `*` in the source file path.
5. If the job state is cancelled or finished, submitPushTasks would throw an "all partitions have no load data" exception,
because tableToLoadPartitions was already cleaned up.
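A minimal sketch of the boolean literal handling described in point 2; the helper name and error handling are hypothetical:

```java
public class BooleanParse {
    // Accepted literals: "true"/"false" ignoring case, plus "0" and "1";
    // anything else is treated as an invalid value.
    static Boolean parseBoolean(String raw) {
        if (raw == null) {
            return null;
        }
        String v = raw.trim();
        if (v.equalsIgnoreCase("true") || v.equals("1")) {
            return Boolean.TRUE;
        }
        if (v.equalsIgnoreCase("false") || v.equals("0")) {
            return Boolean.FALSE;
        }
        return null; // invalid value, handled as a conversion error
    }

    public static void main(String[] args) {
        System.out.println(parseBoolean("TRUE")); // true
        System.out.println(parseBoolean("0"));    // false
        System.out.println(parseBoolean("yes"));  // null
    }
}
```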
#3433
This CL mainly changes:
1. Add 2 new FE modules
1. fe-common
Holds common classes for other modules; currently only `jmockit`.
2. spark-dpp
The Spark DPP application for Spark Load. I moved all DPP-related classes, including unit tests, into this module.
2. Change the `build.sh`
Add a new param `--spark-dpp` to compile the `spark-dpp` module alone, while `--fe` compiles all FE modules.
The output of the `spark-dpp` module is `spark-dpp-1.0.0-jar-with-dependencies.jar`, which will be installed to `output/fe/spark-dpp/`.
3. Fix some bugs of Spark Load