doris

Author	SHA1	Message	Date
daidai	a4e415ab09	[feature](hive)Support hive tables after alter type. (#25138 ) 1.Reconstruct the logic of decode to read parquet. The parquet reader first reads the data according to the parquet physical type, and then performs a type conversion. 2.Support hive alter table.	2023-11-02 00:24:21 +08:00
huanghaibin	082bcd820b	[feature](insert) Support wal for group commit insert (#23053 )	2023-09-26 14:46:24 +08:00
AlexYue	042cf2a1bf	[enhancement](ut) add ut for buffered reader (#18667 )	2023-04-16 18:08:22 +08:00
carlvinhust2012	eab0af7afe	[optimization](array-type) optimize the export precision of floating point numbers (#14261 ) Co-authored-by: hucheng01 <hucheng01@baidu.com>	2022-11-18 18:24:11 +08:00
Ashin Gau	1cc9eeeb1a	[feature-wip](parquet-reader) read and generate array column (#12166 ) Read and generate parquet array column. D=1, R=0 represents an empty array. An empty array is not a null value, so the NullMap for this row is false, and the offset for this row is [offset_start, offset_end) with `offset_start == offset_end`, where offset_end is the start offset of the next row, so there is no value in the nested primitive column. D=0, R=0 represents a null array, and the NullMap for this row is true.	2022-08-31 17:08:12 +08:00
Ashin Gau	0b5bb565a7	[feature-wip](parquet-reader) parquet dictionary decoder (#11981 ) Parse parquet data with dictionary encoding. Using the PLAIN_DICTIONARY enum value is deprecated in the Parquet 2.0 specification. Prefer using RLE_DICTIONARY in a data page and PLAIN in a dictionary page for Parquet 2.0+ files. refer: https://github.com/apache/parquet-format/blob/master/Encodings.md	2022-08-26 19:24:37 +08:00
Ashin Gau	6d925054de	[feature-wip](parquet-reader) decode parquet time & datetime & decimal (#11845 ) 1. Spark can set the timestamp precision by the following configuration: spark.sql.parquet.outputTimestampType = INT96(NANOS), TIMESTAMP_MICROS, TIMESTAMP_MILLIS. DATETIME V1 only keeps second precision; DATETIME V2 keeps microsecond precision. 2. If using DECIMAL V2, the BE saves the value as decimal128 and keeps the precision of the decimal as (precision=27, scale=9). DECIMAL V3 can maintain the right precision of the decimal.	2022-08-22 10:15:35 +08:00
lihangyu	01383c3217	[Enhancement](stream-load-json) using simdjson to parse json (#11665 ) Currently we use rapidjson to parse JSON documents. It is fast, but not as fast as simdjson. simdjson also has a parsing front-end called simdjson::ondemand, which parses JSON as fields are accessed and can strip the field token from the original document. Using this feature we can reduce the cost of string copies (e.g. we convert everything to a string literal in _write_data_to_column by sprintf, and I saw a hotspot in this function in the flame graph; simdjson::to_json_string strips the token (a string piece) as a std::string_view, which is exactly what we need). Second, in _set_column_value we can iterate through the JSON document with for (auto field: object_val) {xxx}, which is much faster than looking up a field by its name, like objectValue.FindMember("k1"). The third optimization is the at_pointer interface simdjson provides, which can directly get a JSON field from the original document.	2022-08-16 14:49:50 +08:00
Ashin Gau	8f5aed27ec	[feature-wip](parquet-reader)read and decode parquet physical type (#11637 ) # Proposed changes Read and decode parquet physical type. 1. The encoding type of boolean is bit-packing; this PR introduces the bit-packing implementation from Impala. 2. Create a parquet file including all the primitive types supported by hive. ## Remaining Problems 1. At present, only physical types are decoded, and there are no corresponding conversion methods to Doris logical types. 2. Decimal / Timestamp / Date types are not yet parsed or processed. 3. Int_8 / Int_16 are stored as Int_32; how to resolve these types is still open.	2022-08-11 10:17:32 +08:00
Ashin Gau	37d1180cca	[feature-wip](parquet-reader)decode parquet data (#11536 )	2022-08-08 12:44:06 +08:00
Ashin Gau	44a1a20e65	[feature-wip](parquet-reader)parse parquet schema (#11381 ) Analyze schema elements in parquet FileMetaData, and generate the hierarchy of nested fields. For example: 1. primitive type ``` // thrift: optional int32 <column-name>; // sql definition: <column-name> int32; ``` 2. nested type ``` // thrift: optional group <column-name> (LIST) { repeated group bag { optional group array_element (LIST) { repeated group bag { optional int32 array_element } } } } // sql definition: <column-name> array<array<int32>> ```	2022-08-02 10:56:13 +08:00
Mingyu Chen	cef3cbc53a	[Bug] Fix bug that the last column may be null when using multibytes separator (#5534 )	2021-03-23 09:35:30 +08:00
HappenLee	b954dfd82d	[Bug] Fix the bug that Largeint and Decimal json load failed. (#4983 ) Use the json load param "num_as_string" to enable the kParseNumbersAsStringsFlag flag when parsing json data.	2020-12-06 08:49:30 +08:00
HangyuanLiu	3a4a38c2fc	[Bug] Fix orc decimal (#4097 ) The result may be wrong when ORC load reads a negative decimal value. When loading a negative decimal that has leading zeros, the result is wrong, e.g. for -0.0014 the orc result is -14(precision ... 0)	2020-07-16 22:36:52 +08:00
xy720	c50a310f8f	[optimize] Optimize spark load/broker load reading parquet format file (#3878 ) Add BufferedReader for reading parquet file via broker	2020-06-23 13:42:22 +08:00
worker24h	ef8fd1fcbe	[Load] Support load json-data into Doris by RoutineLoad or StreamLoad (#3553 ) Doris supports loading json-data by RoutineLoad or StreamLoad	2020-05-21 13:00:49 +08:00
HangyuanLiu	1d296e907d	Fix orc load timestamp bug (#3047 ) The timestamp value loaded from an orc file is wrong: the value has an offset relative to hive and spark. Because the time zone of orc's timestamp is stored inside orc's stripe information, the timestamp obtained here is an offset timestamp, so parsing the timestamp with UTC gives the actual datetime literal.	2020-03-06 18:03:27 +08:00
HangyuanLiu	e23d735bac	Fix decimal bug in orc load (#2984 )	2020-02-26 10:58:18 +08:00
HangyuanLiu	43583e7bd2	Fix orc load bug (#2912 )	2020-02-16 19:14:42 +08:00
HangyuanLiu	a36193dfab	Support decimal and timestamp type in orc load (#2759 )	2020-01-15 07:40:30 +08:00
HangyuanLiu	2326b478b6	Support load orc format in Apache Doris (#2554 ) Support load orc format in Apache Doris	2020-01-07 14:22:43 +08:00
worker24h	7eab12a40e	Support reading Parquet file when loading data (#1173 )	2019-07-01 18:39:27 +08:00
Mingyu Chen	da308da17c	Fix bug that empty stream load return unexpected error msg (#1052 )	2019-04-28 09:36:19 +08:00
ZHAO Chun	11307b23c8	Fix bug: stream load ignore last line with no-newline (#785 ) #783	2019-03-21 19:18:22 +08:00
cyongli	e2311f656e	baidu palo	2017-08-11 17:51:21 +08:00

25 Commits
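
The array-column commit (#12166) above describes how definition (D) and repetition (R) levels map to a NullMap and row offsets. The following is a minimal Python sketch of that mapping, not the Doris implementation. It assumes a single-level list with required elements, so D=0 is a null array, D=1 an empty array, and D=2 a present element. The function name `assemble_array_column` and the input layout are hypothetical.

```python
from typing import Iterable, List, Tuple


def assemble_array_column(
    levels: Iterable[Tuple[int, int]], values: Iterable[int]
) -> Tuple[List[bool], List[int], List[int]]:
    """Build (null_map, offsets, nested_values) from (D, R) level pairs.

    Assumed schema (hypothetical): optional LIST of required int32, so
    D=0 -> null array, D=1 -> empty array, D=2 -> element present.
    Row i covers nested_values[offsets[i]:offsets[i + 1]].
    """
    null_map: List[bool] = []
    offsets: List[int] = [0]
    nested: List[int] = []
    value_iter = iter(values)

    for d, r in levels:
        if r == 0:
            # R=0 starts a new row; it begins where the previous row ended.
            null_map.append(d == 0)
            offsets.append(len(nested))
        if d == 2:
            # Only fully defined entries carry a value in the nested column.
            nested.append(next(value_iter))
            offsets[-1] = len(nested)

    return null_map, offsets, nested


# Rows: [1, 2], [], NULL, [3]
levels = [(2, 0), (2, 1), (1, 0), (0, 0), (2, 0)]
null_map, offsets, nested = assemble_array_column(levels, [1, 2, 3])
print(null_map, offsets, nested)
```

In this sketch the empty array and the null array both get `offset_start == offset_end`; only the NullMap entry tells them apart, which matches the behaviour the commit message describes.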