When a Hive external table is created for Spark Load, it already carries the related Hive Metastore information. However, when submitting a job, the hive-site.xml file must be present in the Spark conf directory; otherwise the Spark job may fail with an error saying that the corresponding Hive table cannot be found.
The SparkEtlJob.initSparkConfigs method sets the external table's properties into the Spark conf. However, at that point the Spark session has already been created, so the Hive-related parameters do not take effect. To ensure that the Spark Hive catalog can load Hive tables, the Hive-related parameters need to be set before the Spark session is created, as sketched below.
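A minimal sketch, assuming a standard Spark-on-Hive setup, of setting the Hive-related parameters before the session is created; the metastore address and class name here are illustrative, not the actual SparkEtlJob code:

```java
import org.apache.spark.sql.SparkSession;

public class HiveCatalogInit {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-etl-job")
                // Hypothetical metastore address; passed to the builder BEFORE
                // getOrCreate(), so the Hive catalog is initialized with it.
                // Setting it on spark.conf() after the session exists is too late.
                .config("hive.metastore.uris", "thrift://metastore-host:9083")
                .enableHiveSupport()
                .getOrCreate();

        // The Hive catalog can now resolve external tables.
        spark.sql("SHOW TABLES").show();
    }
}
```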
Co-authored-by: zhangshixin <zhangshixin@youzan.com>
Fix some Hive partition issues.
1. Fix a BE crash when a Hive partition field is of `date`, `timestamp`, or `decimal` type.
2. Fix an HDFS URI decoding error with `timestamp` partition fields: special characters in the partition value are percent-encoded in the path (for example, `:` becomes `%3A`) and must be decoded correctly.
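A minimal, hypothetical Java illustration of the percent-encoding involved (not taken from the actual fix, and assuming Java 10+ for the Charset overload of `URLDecoder.decode`):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class PartitionPathDecode {
    public static void main(String[] args) {
        // A timestamp partition value as it may appear in a percent-encoded HDFS path.
        String encoded = "dt=2022-01-01 00%3A00%3A00";
        // Decoding restores the ':' characters before the value is parsed.
        String decoded = URLDecoder.decode(encoded, StandardCharsets.UTF_8);
        System.out.println(decoded); // dt=2022-01-01 00:00:00
    }
}
```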
1. Spark dpp
Move `DppResult` and `EtlJobConfig` to the sparkdpp package in the `fe-common` module,
so that `fe-core` no longer depends on the `spark-dpp` module and `spark-dpp.jar`
does not need to be copied into `fe/lib`, which reduces the size of the FE output.
2. Modify start_fe.sh
Modify the CLASSPATH to make sure that doris-fe.jar comes first, so that
when classes with the same qualified name exist in multiple jars, they are loaded from doris-fe.jar first.
3. Upgrade hadoop and hive version
hadoop: 2.10.2 -> 3.3.3
hive: 2.3.7 -> 3.1.3
4. Override the IHiveMetastoreClient implementations from the dependencies:
`ProxyMetaStoreClient.java` for Aliyun DLF.
`HiveMetaStoreClient.java` for the original Apache Hive metastore.
I need to modify some of their methods to make them compatible with
different versions of Hive.
5. Exclude some unused dependencies to reduce the size of the FE output.
The output is now only 370MB (previously 600MB).
6. Upgrade aws-java-sdk version to 1.12.31
7. Support AWS Glue Data Catalog
8. Remove HudiScanNode (no longer supported)
1. fix all checkstyle warnings
2. change all checkstyle rules to error
3. remove some java doc rules
a. RequireEmptyLineBeforeBlockTagGroup
b. JavadocStyle
c. JavadocParagraph
4. suppress some rules for old code
a. all javadoc rules only apply to Nereids
b. DeclarationOrder only applies to Nereids
c. OverloadMethodsDeclarationOrder only applies to Nereids
d. VariableDeclarationUsageDistance only applies to Nereids
e. suppress OneTopLevelClass on org/apache/doris/load/loadv2/dpp/ColumnParser.java
f. suppress OneTopLevelClass on org/apache/doris/load/loadv2/dpp/SparkRDDAggregator.java
g. suppress LineLength on org/apache/doris/catalog/FunctionSet.java
h. suppress LineLength on org/apache/doris/common/ErrorCode.java
Issue Number: close #9403
Set the severity of the rules below to error and format the code according to the check output.
a. Merge conflicts unresolved
b. Avoid using corresponding octal or Unicode escape
c. Avoid Escaped Unicode Characters
d. No Line Wrap
e. Package Name
f. Type Name
g. Annotation Location
h. Interface Type Parameter
i. CatchParameterName
j. Pattern Variable Name
k. Record Component Name
l. Record Type Parameter Name
m. Method Type Parameter Name
n. Redundant Import
o. Custom Import Order
p. Unused Imports
q. Avoid Star Import
r. tab character in file
s. Newline At End Of File
t. Trailing whitespace found
Buffer flip() is used incorrectly.
When the hash key is a string type, the hash value is always zero.
The reason is that the string's buffer is obtained via wrap(), which does not need to be flipped.
If we flip it anyway, the buffer's limit for reading becomes zero.
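A minimal standalone illustration of the wrap()/flip() behavior described above (not the actual hashing code):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class FlipDemo {
    public static void main(String[] args) {
        byte[] key = "hash-key".getBytes(StandardCharsets.UTF_8);

        // wrap() already leaves position = 0 and limit = capacity,
        // so the buffer is ready for reading; no flip() is needed.
        ByteBuffer wrapped = ByteBuffer.wrap(key);
        System.out.println(wrapped.remaining()); // 8

        // Calling flip() on a freshly wrapped buffer sets limit = position = 0,
        // so a subsequent read sees zero bytes and the hash is always the same.
        wrapped.flip();
        System.out.println(wrapped.remaining()); // 0

        // flip() is only needed after writing into a buffer:
        ByteBuffer written = ByteBuffer.allocate(16);
        written.put(key);
        written.flip(); // now position = 0, limit = number of bytes written
        System.out.println(written.remaining()); // 8
    }
}
```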
* fix(sparkload): bitmap deep copy in `or` operator
Fix multiple rollups holding the same reference to a BitmapValue, which may otherwise be updated repeatedly (see the sketch after this commit message).
Co-authored-by: weixiang <weixiang06@meituan.com>
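A minimal sketch of the aliasing problem and the deep-copy fix, illustrated with the RoaringBitmap library rather than Doris's own BitmapValue class (class and method names in the real fix may differ):

```java
import org.roaringbitmap.RoaringBitmap;

public class BitmapOrCopy {
    public static void main(String[] args) {
        RoaringBitmap base = RoaringBitmap.bitmapOf(1, 2, 3);

        // Buggy pattern: two rollups keep the SAME reference, so an in-place
        // or() performed for one rollup is visible to the other.
        RoaringBitmap rollupA = base;
        RoaringBitmap rollupB = base;
        rollupA.or(RoaringBitmap.bitmapOf(4));
        System.out.println(rollupB.contains(4)); // true -- unintended sharing

        // Fixed pattern: deep-copy before the destructive or(),
        // so each rollup owns its bitmap.
        RoaringBitmap fresh = RoaringBitmap.bitmapOf(1, 2, 3);
        RoaringBitmap copyA = fresh.clone();
        RoaringBitmap copyB = fresh.clone();
        copyA.or(RoaringBitmap.bitmapOf(4));
        System.out.println(copyB.contains(4)); // false
    }
}
```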
The distinct count result of bitmap/hll columns may be incorrect in Spark Load mode.
Fix some bugs in Spark Load to solve this problem:
1. FE is big-endian but BE is little-endian; BitmapValues should be converted to little-endian in FE's serialization.
2. BitmapUnionAggregator/HllUnionAggregator ignore `null` values.
3. Make sure encodeVarint64 in FE is consistent with BE (see the sketch after this commit message).
Co-authored-by: weixiang <weixiang06@meituan.com>
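To illustrate what keeping the varint64 encoding consistent means, here is a minimal sketch of the standard LEB128-style scheme (7 payload bits per byte, continuation bit in the high bit); this only illustrates the format and is not a claim about the exact byte layout Doris BE expects:

```java
import java.io.ByteArrayOutputStream;

public class Varint64 {
    // Unsigned LEB128 varint encoding: emit 7 bits at a time, low bits first,
    // setting the high bit on every byte except the last.
    static byte[] encodeVarint64(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        for (byte b : encodeVarint64(300)) {
            System.out.printf("%02x ", b); // ac 02
        }
        System.out.println();
    }
}
```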
The I/O related code may be used by new modules, so it is better to move it to fe-common.
fe-core is modified frequently, but the many Java files generated from Thrift
slow down its compilation, so it is better to move the Thrift generation to fe-common.
Currently both log4j1 and log4j2 are used, which causes logs to be written to the wrong files.
This change removes log4j1 from the dependencies and uses the slf4j API bound to log4j2 instead (see the sketch below).
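A minimal sketch of what "slf4j -> log4j2" means in practice: code logs only through the slf4j facade, and the log4j-slf4j-impl binding routes those calls to log4j2 (the class here is illustrative, not Doris code):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoadLogger {
    // Only the slf4j API appears in application code; with the log4j-slf4j-impl
    // binding on the classpath, all output lands in the log4j2-managed files.
    private static final Logger LOG = LoggerFactory.getLogger(LoadLogger.class);

    public static void main(String[] args) {
        LOG.info("spark load job {} started", "job-1");
    }
}
```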
When we use Spark Load from a Hive table, the loadDataFromHiveTable function
reads the whole Hive table and then filters the data in process().
If the Hive table has many partitions and a lot of historical data, the load costs too much time and too many resources.
So we can do the filtering in loadDataFromHiveTable when reading from the Hive table, as sketched below.
Co-authored-by: 杜安明 <anming.du@mihoyo.com>
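A minimal sketch of filtering at read time instead of in process(); the table, column, and date range are hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class HivePartitionPrune {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-load")
                .enableHiveSupport()
                .getOrCreate();

        // Instead of reading the whole table and filtering later, push the
        // partition predicate into the read so only the needed partitions
        // are scanned.
        Dataset<Row> src = spark.sql(
                "SELECT * FROM db.hive_src_table "
                        + "WHERE dt BETWEEN '2022-01-01' AND '2022-01-07'");

        System.out.println(src.count());
    }
}
```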
For [Spark Load]:
1. Support decimal and largeint.
2. Add validation logic for char/varchar/decimal (see the sketch after this list).
3. Check data loaded from Hive with strict mode.
4. Support decimal/date/datetime aggregators.
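A minimal sketch of the kind of char/varchar and decimal validation described in point 2; the helper names and rules shown here are hypothetical, not the actual spark-dpp code:

```java
import java.math.BigDecimal;
import java.nio.charset.StandardCharsets;

public class ColumnValidator {
    // char/varchar: reject values longer than the column length (in bytes).
    static boolean validVarchar(String value, int maxLenBytes) {
        return value.getBytes(StandardCharsets.UTF_8).length <= maxLenBytes;
    }

    // decimal: reject values whose precision or scale exceeds the column definition.
    static boolean validDecimal(BigDecimal value, int precision, int scale) {
        return value.precision() <= precision && value.scale() <= scale;
    }

    public static void main(String[] args) {
        System.out.println(validVarchar("abc", 2));                       // false: 3 bytes > 2
        System.out.println(validDecimal(new BigDecimal("12.345"), 5, 2)); // false: scale 3 > 2
    }
}
```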
1. Fix writing the DPP result when DPP throws an exception.
2. Boolean values: true, false (ignoring case), 0, 1 (see the sketch after this list).
3. Fix the wrong destination column used in the source data check.
4. Support `*` in the source file path.
5. If the job state is cancelled or finished, submitPushTasks would throw an "all partitions have no load data" exception,
because tableToLoadPartitions was already cleaned up.
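A minimal sketch of the boolean literal handling described in point 2; the helper name and error handling are hypothetical:

```java
public class BooleanParse {
    // Accepted literals: "true"/"false" ignoring case, plus "0" and "1";
    // anything else is treated as an invalid value.
    static Boolean parseBoolean(String raw) {
        if (raw == null) {
            return null;
        }
        String v = raw.trim();
        if (v.equalsIgnoreCase("true") || v.equals("1")) {
            return Boolean.TRUE;
        }
        if (v.equalsIgnoreCase("false") || v.equals("0")) {
            return Boolean.FALSE;
        }
        return null; // invalid value, handled as a conversion error
    }

    public static void main(String[] args) {
        System.out.println(parseBoolean("TRUE")); // true
        System.out.println(parseBoolean("0"));    // false
        System.out.println(parseBoolean("yes"));  // null
    }
}
```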
#3433
This CL mainly changes:
1. Add 2 new FE modules
1. fe-common
Holds common classes for other modules; currently only `jmockit`.
2. spark-dpp
The Spark DPP application for Spark Load. I moved all DPP-related classes, including unit tests, into this module.
2. Change the `build.sh`
Add a new param `--spark-dpp` to compile the `spark-dpp` module alone, while `--fe` compiles all FE modules.
The output of the `spark-dpp` module is `spark-dpp-1.0.0-jar-with-dependencies.jar`, which will be installed to `output/fe/spark-dpp/`.
3. Fix some bugs of Spark Load