Commit Graph

117 Commits

Author SHA1 Message Date
e923acef1b branch-2.1: [Fix](JsonReader) Fix the issue where the null bitmap of the JSON reader was not initialized when the JSON path is specified as '$.' #52211 (#52268)
Cherry-picked from #52211

Co-authored-by: lihangyu <lihangyu@selectdb.com>
2025-06-28 14:21:38 +08:00
8356141e03 branch-2.1: [enhancement](case) add cases for mow table load empty file #49843 (#49858)
Cherry-picked from #49843

Co-authored-by: MoanasDaddyXu <xujianxu@selectdb.com>
2025-04-08 14:04:30 +08:00
c9381b0285 [fix](load) Fix import failure when the stream load parameter specifies Transfer-Encoding:chunked (#48196) (#48503)
pick from master  #48196
2025-03-04 10:12:54 +08:00
547e88b1ee branch-2.1: [fix](csv reader) fix core dump when parsing csv with enclose #45485 (#45889)
Cherry-picked from #45485

Co-authored-by: hui lai <laihui@selectdb.com>
2024-12-25 12:09:20 +08:00
6dddd4c499 [function](cast)Make string casting to integers more like MySQL's behavior (#38847) (#41541)
https://github.com/apache/doris/pull/38847
## Proposed changes

There are two issues here. First, the results of casting are
inconsistent between FE and BE.
```
FE
mysql [(none)]>select cast('3.000' as int); 
+----------------------+
| cast('3.000' as INT) |
+----------------------+
|                    3 |
+----------------------+

mysql [(none)]>set debug_skip_fold_constant = true;

BE
mysql [(none)]>select cast('3.000' as int);
+----------------------+
| cast('3.000' as INT) |
+----------------------+
|                 NULL |
+----------------------+
```
The second issue is that casting on BE converts '3.0' to NULL. Here, the
casting logic for FE and BE has been unified.
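
For reference, a minimal sketch of the unified behavior (illustrative only, not taken from this change's tests): after the fix, both FE constant folding and BE evaluation are expected to return the integer value instead of NULL.
```
mysql [(none)]>select cast('3.000' as int), cast('3.0' as int);
+----------------------+--------------------+
| cast('3.000' as INT) | cast('3.0' as INT) |
+----------------------+--------------------+
|                    3 |                  3 |
+----------------------+--------------------+
```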


---------

Co-authored-by: Xinyi Zou <zouxinyi02@gmail.com>
2024-10-11 09:32:00 +08:00
a45dc8796a [fix](Nereids) fix wrong simplification of decimal comparison when casting to a smaller scale (#41151) (#41618)
pick from master #41151
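
Illustrative only (table and column names are hypothetical): the kind of predicate affected is a comparison where a decimal value is cast to a smaller scale before comparing.
```
-- price is assumed to be DECIMAL(10, 3); the cast narrows the scale to 1
SELECT * FROM orders WHERE CAST(price AS DECIMAL(10, 1)) = 12.5;
```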
2024-10-09 23:03:01 +08:00
677435cef8 [Pick](Branch-2.1) pick json reader fix and support specify $. as column (#39271)
#39206
#38213
2024-08-13 17:44:45 +08:00
338fa32303 [pick](simdjson) fix simdjson with object array when jsonroot is not empty (#38633)
## Proposed changes
backport: https://github.com/apache/doris/pull/38490
2024-08-01 11:04:54 +08:00
d9fd419e47 [Fix](JsonReader) fix json with duplicate key entry that may result in out-of-bound exception (#38147)
#38146
2024-07-19 22:53:02 +08:00
9ff129b630 [fix](stream_load) fix stream load failure caused by column names that are keywords (#35822) (#37890)
#35938 #35822 
Make the following keywords non-reserved so they can be used as plain identifiers (a usage sketch follows the list):
KW_SQL,
KW_CACHE,
KW_COLOCATE,
KW_COMPRESS_TYPE,
KW_DORIS_INTERNAL_TABLE_ID,
KW_HOTSPOT,
KW_PRIVILEGES,
KW_RECENT,
KW_STAGES,
KW_WARM,
KW_UP,
KW_CONVERT_LSC
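
A minimal, hypothetical sketch (table and column names invented for illustration) of what non-reserved status allows: these words can be used directly as column names, so column lists sent with a stream load no longer fail to parse.
```
CREATE TABLE kw_demo (
    recent INT,
    warm   INT,
    up     INT
)
DISTRIBUTED BY HASH(recent) BUCKETS 1
PROPERTIES ("replication_num" = "1");
```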


---------

Co-authored-by: caiconghui1 <caiconghui1@jd.com>
2024-07-16 20:20:34 +08:00
c66df8d9e6 [branch-2.1](load) fix no error url if no partition can be found (#36831) (#37401)
## Proposed changes

pick #36831

before
```
Stream load result: {
    "TxnId": 2014,
    "Label": "83ba46bd-280c-4e22-b581-4eb126fd49cf",
    "Comment": "",
    "TwoPhaseCommit": "false",
    "Status": "Fail",
    "Message": "[DATA_QUALITY_ERROR]Encountered unqualified data, stop processing",
    "NumberTotalRows": 1,
    "NumberLoadedRows": 1,
    "NumberFilteredRows": 0,
    "NumberUnselectedRows": 0,
    "LoadBytes": 1669,
    "LoadTimeMs": 58,
    "BeginTxnTimeMs": 0,
    "StreamLoadPutTimeMs": 10,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 47,
    "CommitAndPublishTimeMs": 0
}
```

after
```
Stream load result: {
    "TxnId": 2014,
    "Label": "83ba46bd-280c-4e22-b581-4eb126fd49cf",
    "Comment": "",
    "TwoPhaseCommit": "false",
    "Status": "Fail",
    "Message": "[DATA_QUALITY_ERROR]too many filtered rows",
    "NumberTotalRows": 1,
    "NumberLoadedRows": 0,
    "NumberFilteredRows": 1,
    "NumberUnselectedRows": 0,
    "LoadBytes": 1669,
    "LoadTimeMs": 58,
    "BeginTxnTimeMs": 0,
    "StreamLoadPutTimeMs": 10,
    "ReadDataTimeMs": 0,
    "WriteDataTimeMs": 47,
    "CommitAndPublishTimeMs": 0,
    "ErrorURL": "http://XXXX:8040/api/_load_error_log?file=__shard_4/error_log_insert_stmt_c6461270125a615b-2873833fb48d56a3_c6461270125a615b_2873833fb48d56a3"
}
```

2024-07-08 10:41:33 +08:00
ceef9ee123 [feature](serde) support presto compatible output format (#37039) (#37253)
bp #37039
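
A rough sketch of the intended usage (the session variable name below is an assumption, not confirmed by this log):
```
-- assumption: the output dialect is controlled by a session variable named serde_dialect
set serde_dialect = "presto";
select array(1, 2, 3);
```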
2024-07-04 13:56:05 +08:00
a6a84b8ecc [improvement](stream load)(cherry-pick) support hll_from_base64 for stream load column mapping (#36819)
picked from https://github.com/apache/doris/pull/35923
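
In a stream load, this allows a column mapping such as `col=hll_from_base64(src_col)` to decode base64-encoded HLL data. The query below is only an illustrative round trip and assumes a companion hll_to_base64 function is available:
```
-- expected to return 1 if both base64 helpers are available
select hll_cardinality(hll_from_base64(hll_to_base64(hll_hash(1))));
```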
2024-06-26 20:12:40 +08:00
cbaff8a700 [fix](nereids)change the decimal's precision and scale for cast(xx as decimal) (#36540)
pick from master #36316

The data type of the expression cast(xx as decimal) may be decimalv3 or decimalv2,
depending on the enable_decimal_conversion value in the FE conf file. If
enable_decimal_conversion is true, the data type is decimalv3(9, 0), but it was
decimalv3(38, 9) in 2.0 releases. So this PR changes the data type to match the
2.0 releases and keep the behavior consistent.
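
A sketch of the restored behavior (result type noted in a comment; not taken from this change's tests): with the unparameterized decimal target resolving to decimal(38, 9) again, the fractional part is kept.
```
-- target type resolves to decimalv3(38, 9) as in 2.0, not decimalv3(9, 0)
select cast('123.456' as decimal);
```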
2024-06-20 17:46:11 +08:00
48d4601ee3 [regression-test](load) add a json case for keys like $.tag.[a.b] (#35134) 2024-05-31 22:45:09 +08:00
8da260ee0d [fix](hdfs)read 'fs.defaultFS' from core-site.xml for hdfs load which has no default fs (#34217) (#34372)
bp #34217
Co-authored-by: slothever <18522955+wsjz@users.noreply.github.com>
2024-05-01 00:31:49 +08:00
cd1c9edd71 [fix](pipeline-load) fix no error url when data quality error and total rows is negative (#34072) (#34204)
Co-authored-by: HHoflittlefish777 <77738092+HHoflittlefish777@users.noreply.github.com>
2024-04-27 18:19:08 +08:00
55d5ed9ab6 [test](streamload) add load empty file regression test (#34110) 2024-04-26 07:42:09 +08:00
080c07ad87 [bug](random distribution) fix data loss and incorrect data in random distribution table #33962 2024-04-24 17:13:50 +08:00
5a5063be20 [bug](fix) heap use after free when json parse failed (#33955) 2024-04-22 22:33:24 +08:00
36a70ba1e7 [Fix](Csv-Reader)Fix the issue of BE core dump caused by improper configuration of column_separator and line_delimiter. (#33693) 2024-04-20 20:06:48 +08:00
2648a92594 [FIX](load)fix load with split-by-string (#33713) 2024-04-17 23:42:14 +08:00
b85bf3b6b0 [test](cast) add test for stream load cast (#33189) 2024-04-10 15:26:09 +08:00
b0b5f84e40 [feature](load) support compressed JSON format data for broker load (#30809) 2024-04-10 14:20:53 +08:00
8bd101129a [behavior change](output) change float output format (#32049) 2024-03-21 14:07:22 +08:00
278b232e76 [Bug](json reader) object should stop processing when encountering an error (#31159)
If a DATA_QUALITY_ERROR is encountered, we should stop processing this document. Otherwise there will be UB in simdjson.
2024-02-21 13:53:32 +08:00
0d32aeeaf6 [improvement](load) Enable lzo & Remove dependency on Markus F.X.J. Oberhumer's lzo library (#30573)
Issue Number: close #29406

1. increase the lzop version to 0x1040,
    set to 0x1040 only to allow decompressing lzo files compressed by a higher version of lzop,
    no change to the decompressing logic,
    strictly, 0x1040 should have the "F_H_FILTER" feature,
    but it is mainly for audio and image data, so we do not support it.
2. use orc::lzoDecompress() instead of lzo1x_decompress_safe() to decompress lzo data
3. use crc32c::Extend() instead of lzo_crc32()
4. use olap_adler32() instead of lzo_adler32()
5. thus, remove the dependency on Markus F.X.J. Oberhumer's lzo library
6. remove DORIS_WITH_LZO, so lzo files are supported by stream load and broker load by default
7. add some regression tests
2024-02-05 22:00:24 +08:00
17cf4ab2c1 [case](regression) streamload publish timeout (#29457)
Co-authored-by: qinhao <qinhao@newland.com.cn>
2024-01-07 19:50:16 +08:00
8a169b9906 [case](regression) Test enable pipeline load (#28172)
Co-authored-by: qinhao <qinhao@newland.com.cn>
2023-12-28 10:49:19 +08:00
172f68480b [Enhancement](load) Limit the number of incorrect data drops and add documents (#27727)
In the load process, if there are problems with the original data, we store the error data in an error_log file on disk for subsequent debugging. However, if there is a lot of error data, it will occupy a lot of disk space. Now we want to limit the amount of error data that is saved to disk.

Be familiar with the usage of Doris' import function and its internal implementation process
Add a new BE configuration item load_error_log_limit_bytes (default value 200MB)
Use the newly added threshold to limit the amount of data that RuntimeState::append_error_msg_to_file writes to disk
Write regression cases for testing and verification

Co-authored-by: xy720 <22125576+xy720@users.noreply.github.com>
2023-12-22 10:43:18 +08:00
4f5821407f [case]Load data with load_parallelism=any > 1 and stream load with compress type (#27306) 2023-12-13 18:41:14 +08:00
3f3843fd4f [fix](load) fix loaded rows error (#28061) 2023-12-07 14:43:51 +08:00
605257ccb7 [Enhancement](group commit) Add regression case for wal limit (#27949) 2023-12-06 14:23:50 +08:00
358d73a0ae [FIX](complextype) fix empty quote with complex type (#27942) 2023-12-05 12:25:26 +08:00
97d36b4f38 [fix](csv_reader) fix trim_double_quotes behavior change (#27882) 2023-12-03 22:57:55 +08:00
f4afcae452 [case](regression) Stream load 2pc exceptions (#27804)
Co-authored-by: qinhao <qinhao@newland.com.cn>
2023-12-01 22:27:40 +08:00
498d27c905 [improve](json_reader) add prompt when all fields are null (#27630) 2023-11-29 18:26:42 +08:00
13b26ee920 [Fix](core) Fix wal space back pressure core and add regression test (#27311) 2023-11-27 15:10:26 +08:00
42c32c584b [case](regression) test invalid jsonpaths (#27359)
Co-authored-by: qinhao <qinhao@newland.com.cn>
2023-11-23 10:16:34 +08:00
c0f22e8feb [FIX](complextype) fix struct nested complex collection type and add regression test (#26973) 2023-11-20 22:29:12 +08:00
7164b7cebb [case](regression) add read json by line test case (#26670)
Co-authored-by: qinhao <qinhao@newland.com.cn>
2023-11-20 17:33:40 +08:00
0ece18d6cd [FIX](regression-test) fix test_map_nested_array csv file for id (#27105) 2023-11-17 04:20:02 -06:00
c7d961cb11 [regression test](stream load) add case for strict_mode=true and max_filter_ratio=0.5 (#27125) 2023-11-17 13:39:01 +08:00
54989175fb [case] Load json data with enable_simdjson_reader=false (#26601) 2023-11-16 14:40:59 +08:00
230d8af777 [regression test](temporary_partitions) add case for temporary_partitions #27063 2023-11-16 11:49:37 +08:00
f1169d3c58 [regression-test](TRIM_DOUBLE_QUOTES) add case for TRIM_DOUBLE_QUOTES (#26998) 2023-11-14 23:52:40 +08:00
cd7ad99de0 [improvement](regression-test) add chunked transfer json test (#26902) 2023-11-14 10:31:30 +08:00
4ecaa921f9 [regression-test](num_as_string) test num_as_string (#26842) 2023-11-13 21:48:13 +08:00
607a5d25f1 [feature](streamload) support HTTP request with chunked transfer (#26520) 2023-11-08 10:07:05 +08:00
3e10e5af39 [Fix](Serde) Fix content displayed by complex types in MySQL Client (#25946)
This pr makes three changes to the display of complex types:
1. NULL values in complex types are displayed as `null`, not `NULL`
2. the struct type is displayed as "column_name": column_value
3. time types such as `datetime` and `date` are displayed with double quotes in complex types, like
    `{1, "2023-10-26 12:12:12"}`

This pr also does a code refactor:
1. nesting_level is made a member variable of the `DataTypeSerDe`, rather than a parameter of its methods.

What's more, this pr fixes a bug where fileSize was not correct, introduced by pr #25854
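
An illustrative sketch of the intended rendering in the MySQL client (output shown as a comment; exact formatting may differ): struct fields appear as "name": value and datetime values are double-quoted.
```
select named_struct('id', 1, 'ts', cast('2023-10-26 12:12:12' as datetime));
-- expected style of output: {"id": 1, "ts": "2023-10-26 12:12:12"}
```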
2023-11-01 23:48:55 +08:00