Commit Graph

900 Commits

Author SHA1 Message Date
61d2f37bdc [fix](jdbc catalog) fix string type insert into odbc table (#22961) 2023-08-15 20:09:38 +08:00
9b2323b7fd [Pipeline](exec) support async writer in pipeline query engine (#22901) 2023-08-15 17:32:53 +08:00
b49dc8042d [feature](load) refactor CSV reading process during scanning, and support enclose and escape for stream load (#22539)
## Proposed changes

Refactor thoughts: close #22383
Descriptions about `enclose` and `escape`: #22385

## Further comments

2023-08-09: 
Unfortunately, experiments show that the original way of parsing plain CSV is faster. Therefore, the refactor is applied only to the enclose-related code; the plain CSV parser keeps the original logic.

Some performance fallback is unavoidable anyway. From the `CSV reader`'s perspective, the real weak point may be the write-column behavior, as the flame graph shows.
 
Trimming the escape character will be enabled after fix #22411 is merged.

Cases to be discussed:

1. When an incomplete enclose appears at the beginning of large-scale data, the line delimiter is unreachable until EOF. Will the buffer become extremely large?
2. What if an infinitely long line occurs? Essentially, case 1 is equivalent to this.

This PR only supports stream load, as a trial, to avoid too many unrelated changes. Docs will be added when `enclose` and `escape` are available for all kinds of load.
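As context for case 1 above, a minimal sketch of why an unterminated enclose makes the line delimiter unreachable; the function and parameter names here are hypothetical, not the actual reader code:

```cpp
#include <cstddef>

// Scan for the line delimiter while honoring enclose and escape. If an
// enclose is never terminated, in_enclose stays true and no delimiter is
// ever accepted, so the caller keeps buffering until EOF -- the
// unbounded-buffer risk described in case 1.
bool find_line_delimiter(const char* buf, size_t len, char enclose,
                         char escape, char line_delim, size_t* pos) {
    bool in_enclose = false;
    for (size_t i = 0; i < len; ++i) {
        const char c = buf[i];
        if (c == escape) {
            ++i; // skip the escaped character
        } else if (c == enclose) {
            in_enclose = !in_enclose; // toggle the enclosed state
        } else if (c == line_delim && !in_enclose) {
            *pos = i;
            return true;
        }
    }
    return false; // not found yet: the caller must read and buffer more data
}
```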
2023-08-15 09:23:53 +08:00
a1223218f3 [pipeline](exec) Support shared scan in jdbc and odbc scan node (#22826)
Support shared scan in the jdbc and odbc scan nodes to improve execution performance
2023-08-10 18:34:45 +08:00
Pxl
56392e21ae [Bug](decimalv3) fix decimalv3 keyrange set to a wrong number #22818 2023-08-10 18:15:40 +08:00
f1db6bd8c1 [feature](hive) add support for the struct and map column types on the textfile format of hive tables (#22347)
1. Add support for the struct and map column types on the textfile format of hive tables.
2. Optimize the code for the array column type.

```mysql
+------+------------------------------------+
| id   | perf                               |
+------+------------------------------------+
| 1    | {"key1":"value1", "key2":"value2"} |
| 1    | {"key1":"value1", "key2":"value2"} |
| 2    | {"name":"John", "age":"30"}        |
+------+------------------------------------+
```

```mysql
+---------+------------------+
| column1 | column2          |
+---------+------------------+
|       1 | {10, "data1", 1} |
|       2 | {20, "data2", 0} |
|       3 | {30, "data3", 1} |
+---------+------------------+
```
Summary of supported complex types (delimiters can be assigned):

1. array<primitive_type> and array<array<...>>
2. map<primitive_type, primitive_type>
3. struct<primitive_type, primitive_type, ...>
2023-08-10 13:47:58 +08:00
Pxl
591aee528d [Bug](exchange) change BlockSerializer from unique_ptr to object (#22653)
change BlockSerializer from unique_ptr to object
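A minimal sketch of the change; the surrounding `Channel` type is hypothetical. Holding the serializer by value removes one heap allocation and one pointer indirection per channel:

```cpp
#include <memory>

struct BlockSerializer { /* serialization state ... */ };

// Before: every channel paid an extra allocation plus a pointer chase.
struct ChannelBefore {
    std::unique_ptr<BlockSerializer> serializer;
};

// After: the serializer lives inline in its owning object.
struct ChannelAfter {
    BlockSerializer serializer;
};
```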
2023-08-07 14:47:21 +08:00
96f42ca20a [fix](memory) Independent count exec node memory profile (#22598)
Independently count exec node memory in the memory profile; follow-up to #22582
2023-08-06 10:56:31 +08:00
bad8237850 [BugFix](Es Catalog) fix bug where the es catalog returns an error when querying partial columns (#22423)
Bug:
When the value of some ES columns is empty, querying these values in doc_values mode returns an error.

Reason:
In doc_values mode these values are empty arrays, so we need to determine whether the array is empty.
2023-08-04 11:28:30 +08:00
86e6f5d039 [FIX](decimal) fix decimal precision (#22364)
Previously, parsing a decimal from a string was handled incorrectly:
if the precision of the given string was bigger than the defined decimal precision, we returned an overflow error. But an overflow error should be returned only when the digit (integer) part is longer than the type's digit length, and we should detect that while traversing the given string into the decimal value.
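A hedged sketch of the intended check, assuming "digit part" means the integer part and that its allowed length is precision minus scale; the names are illustrative:

```cpp
#include <cctype>

// Report overflow only when the integer part of the literal has more digits
// than the type allows (precision - scale), not merely because the whole
// string is longer than the precision.
bool integer_part_overflows(const char* s, int precision, int scale) {
    int digits = 0;
    for (; *s != '\0' && *s != '.'; ++s) {
        if (std::isdigit(static_cast<unsigned char>(*s))) ++digits;
    }
    return digits > precision - scale;
}
```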
2023-08-03 21:13:58 +08:00
e991f607d5 [fix](string-column) fix unescape length error (#22411) 2023-08-02 12:18:05 +08:00
b8399148ef [fix](DOE) es catalog not working with pipeline, datetimev2, array and esquery (#22046) 2023-08-01 21:45:16 +08:00
1c6246f7ee [improve](agg) support distinct agg node (#22169)
select c_name from customer union select c_name from customer
This SQL uses an agg node to get the distinct rows of c_name,
so there is no need to wait until all data has been inserted into the hash map:
a row can be output as soon as it has been successfully inserted into the hash map.
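A hedged sketch of the streaming-distinct idea, with standard-library containers standing in for the real hash map and blocks:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Emit each key as soon as its insertion succeeds, instead of waiting until
// the whole input has been hashed.
std::vector<std::string> streaming_distinct(const std::vector<std::string>& input) {
    std::unordered_set<std::string> seen;
    std::vector<std::string> output;
    for (const auto& key : input) {
        if (seen.insert(key).second) { // second == true: key is newly inserted
            output.push_back(key);
        }
    }
    return output;
}
```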
2023-07-28 13:54:10 +08:00
7d5d416b25 [Fix](EsCatalog) fix be core when querying tables of the Es catalog with null fields (#22279) 2023-07-28 09:53:55 +08:00
8caa5a9ba4 [Fix](multi-catalog) Fix null partitions error in iceberg tables. (#22185)
### Issue
When a table has null partitions, it throws the error
`Failed to fill partition column: t_int=null`

### Resolution
- Fix the null partitions error in iceberg tables by replacing the null partition value with '\N'.
- Add regression test for hive null partition.
2023-07-27 23:57:35 +08:00
23e7423748 [pipeline](refactor) refactor pipeline task scheduling logic (#22028) 2023-07-25 17:18:26 +08:00
d0219062ef [refactor](be) use std::move to improve performance of push_back #22056 2023-07-24 08:51:28 +08:00
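A minimal illustration of the pattern in the commit above, with illustrative container and element types:

```cpp
#include <string>
#include <utility>
#include <vector>

int main() {
    std::vector<std::string> rows;
    std::string cell(1024, 'x');
    // rows.push_back(cell);         // copies the 1 KB buffer
    rows.push_back(std::move(cell)); // steals the buffer, no copy
}
```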
6875ef4b8b [refactor](mem_reuse) refactor mem_reuse in MutableBlock (#21564) 2023-07-20 22:53:19 +08:00
d86c67863d Remove unused code (#21735) 2023-07-12 14:48:13 +08:00
ff42cd9b49 [feature](hive) add support for reading the array type in the textfile format of hive tables (#21514) 2023-07-11 22:37:48 +08:00
Pxl
ca71048f7f [Chore](status) avoid empty error msg on status (#21454)
avoid empty error msg on status
2023-07-11 13:48:16 +08:00
7b403bff62 [feature](partial update)support insert new rows in non-strict mode partial update with nullable unmentioned columns (#21623)
1. Expand the semantics of the variable strict_mode to control the behavior of stream load: if strict_mode is true, the stream load can only update existing rows; if strict_mode is false, the stream load can insert new rows whose keys are not present in the table.
2. When inserting a new row in a non-strict-mode stream load, the unmentioned columns should have a default value or be nullable.
2023-07-11 09:38:56 +08:00
b86dd11a7d [fix](pipeline) refactor olap table sink close (#20771)
For pipeline, the olap table sink close is divided into three stages: try_close() --> pending_finish() --> close().
Only after all node channels are done or canceled will pending_finish() return false and close() start.
This avoids blocking the pipeline on close().

In close(), the index channel's intolerable-failure status is checked after each node channel failure;
if there is an intolerable failure, the close is terminated early and all node channels are canceled to avoid meaningless blocking.
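A hedged sketch of the staged close from the driver's point of view; the class and the yield hook are hypothetical, but the three method names follow the description above:

```cpp
// Hypothetical, simplified sink: not the actual Doris class.
struct OlapTableSink {
    int in_flight_channels = 0;
    void try_close() { /* ask every node channel to finish, without blocking */ }
    bool pending_finish() const { return in_flight_channels > 0; }
    void close() { /* all channels are done or canceled; cheap final cleanup */ }
};

// Between try_close() and close(), the task keeps yielding so other pipeline
// tasks can run instead of blocking a thread on the sink.
template <typename YieldFn>
void drive_close(OlapTableSink& sink, YieldFn yield_to_scheduler) {
    sink.try_close();
    while (sink.pending_finish()) {
        yield_to_scheduler();
    }
    sink.close();
}
```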
2023-07-04 11:27:51 +08:00
Pxl
3727483c06 [Chore](build) update ldb_toolchain to v0.18 (#20802)
* update ldb_toolchain to v0.18
2023-06-14 18:38:35 +08:00
31a4f96f01 [refactor](exprcontext) move close to the expr context's destructor (#20747)
The close method does nothing, but I am not sure we can remove it. So I call it from the destructor instead, which removes many, many explicit calls.
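A minimal, hypothetical sketch of the idea: calling close() from the destructor lets RAII replace the many explicit call sites:

```cpp
struct ExprContext {
    ~ExprContext() { close(); } // close() now runs automatically on destruction

    void close() {
        if (_closed) return; // idempotent: safe if a caller already closed
        _closed = true;
        // release any runtime state here ...
    }

private:
    bool _closed = false;
};
```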
2023-06-14 18:01:07 +08:00
51bbf17786 [Refactor](Profile) Add and refactor the join profile (#20693) 2023-06-13 09:06:51 +08:00
93b53cf2f4 [improvement](exception-safe) make node/sink create and prepare exception safe (#20551) 2023-06-09 21:06:59 +08:00
b60860c5e5 [refactor](profile) refactor the join profile when it uses a shared hash table (#20391)
In the join node, if it is a broadcast join
with a shared hash table, some counters/timers about building the hash table are useless,
so we add those counters/timers to a faker profile, and they will not be displayed in the web profile.
2023-06-09 08:59:49 +08:00
Pxl
a15a0b9193 [Chore](build) use file(GLOB_RECURSE xxx CONFIGURE_DEPENDS) to replace set() lists of cpp files (#20461)
Use file(GLOB_RECURSE xxx CONFIGURE_DEPENDS) to replace set() lists of cpp files.
2023-06-08 19:36:21 +08:00
576288cc89 [Profile](exec) Remove useless profile entries in the pipeline exec engine (#20337) 2023-06-02 11:39:11 +08:00
56fa38de1d [Enhancement](JDBC Catalog) refactor jdbc catalog insert logic (#19950)
This PR refactors the old way of writing data to the JDBC External Table & JDBC Catalog, mainly including the following tasks:
1. Continue the work of @BePPPower's PR #18594: change the logic from splicing INSERT SQL statements to operating on off-heap memory and writing data with preparedStatement.set.
2. Add support for writing the largeint type, mainly adapting to java.math.BigInteger, which uses binary operations.
3. Delete the SQL-splicing logic from the JDBC External Table & JDBC Catalog write code.

ToDo: binary types, like bit, binary, blob...

Finally, special thanks to @BePPPower and @AshinGau for their work.

Co-authored-by: Tiewei Fang <43782773+BePPPower@users.noreply.github.com>
2023-05-30 22:03:39 +08:00
4475a69c57 [Fix](multi-catalog) Fix q03 in the text_external_brown regression test by correctly handling text converter parsing errors. (#20190)
Issue Number: close #20189

Fix `q03` in the `text_external_brown` regression test by correctly handling text converter parsing errors.
2023-05-30 15:08:28 +08:00
9f8de89659 [refactor](exec) replace the single pointer with an array of 'conjuncts' in ExecNode (#19758)
Refactoring the filtering conditions in the current ExecNode from an expression tree to an array can simplify the process of adding runtime filters. It eliminates the need for complex merge operations and removes the requirement for the frontend to combine expressions into a single entity.

By representing the filtering conditions as an array, each condition can be treated individually, making it easier to add runtime filters without the need for complex merging logic. The array can store the individual conditions, and the runtime filter logic can iterate through the array to apply the filters as needed.

This refactoring simplifies the codebase, improves readability, and reduces the complexity associated with handling filtering conditions and adding runtime filters. It separates the conditions into discrete entities, enabling more straightforward manipulation and management within the execution node.
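A hedged sketch of the resulting evaluation loop, with types simplified to plain predicates:

```cpp
#include <functional>
#include <vector>

struct Row; // stand-in for a real input row

using Conjunct = std::function<bool(const Row&)>;

// Each conjunct is evaluated independently; adding a runtime filter is just
// appending one more entry to the vector -- no expression-tree merge needed.
bool passes_all(const std::vector<Conjunct>& conjuncts, const Row& row) {
    for (const auto& conjunct : conjuncts) {
        if (!conjunct(row)) return false;
    }
    return true;
}
```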
2023-05-29 11:47:31 +08:00
9458a24cd7 [fix](multi-catalog) values in sqlserver should be enclosed by single quotes (#19971)
Fix errors when inserting string/date/datetime values into SQLServer:

ERROR 1105 (HY000): errCode = 2, detailMessage = (172.21.0.101)[INTERNAL_ERROR]UdfRuntimeException: JDBC executor sql has error:
CAUSED BY: SQLServerException: Invalid column name '2021-10-30'.
When double quotes are used to enclose string values, the value is parsed as a column name, so we should enclose string values in single quotes.
2023-05-26 20:04:45 +08:00
317338913c [Bug](topn) Fix topn fetch to set the real default value (#20074)
1. Before this PR, if a rowset did not contain a column that should be read for the related SlotDescriptor, we called `insert_default` on the column, but that is not the real default value. Information about the real default value should be provided by the frontend side.

2. Support fetch when light schema change is not enabled, but disable it for the AGG and UNIQUE MOR models.
2023-05-26 16:06:55 +08:00
53ae24912f [vectorized](feature) support partition sort node (#19708) 2023-05-25 11:22:02 +08:00
53ba46e404 [Fix][Refactor] Fix the 'member call on null pointer of type doris::TextConverter' error in the ubsan env and refactor the text converter. (#19849)
Fix the 'member call on null pointer of type doris::TextConverter' error in the ubsan env and refactor the text converter.
2023-05-22 21:00:19 +08:00
481e9aebdb [Refactor](spark load) remove parquet scanner (#19251) 2023-05-18 19:19:13 +08:00
ef0657c072 [Bug](pipeline) regression test failure during resource release caused a DCHECK failure (#19783)
A regression test failure during resource release caused a DCHECK failure.
2023-05-18 18:57:25 +08:00
fd4fa5c64e [Optimize](row store) optimize serialization and deserialization (#19691)
1. Get each column's DataTypeSerDe in advance, to avoid fetching a temporary DataTypeSerDe while iterating over each column.
2. Iterating over the original row once is enough for deserialization, achieved by introducing a map that records the index of each column's unique id.
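A hedged sketch of the two optimizations as a data structure; the names are hypothetical:

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct DataTypeSerDe; // resolved once per column type

struct RowStoreReader {
    // 1. Every column's serde is resolved up front, instead of constructing
    //    a temporary serde for each column on every row.
    std::vector<DataTypeSerDe*> serdes;
    // 2. Column unique id -> position, so a single pass over the stored row
    //    is enough to route each value to the right output column.
    std::unordered_map<int32_t, size_t> col_uid_to_index;
};
```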
2023-05-18 16:22:38 +08:00
fe42e52851 [pipeline](CTE) Support multi stream data sink in pipeline (#19519) 2023-05-18 10:34:37 +08:00
1d05feea1b [Feature](Nereids) add executable function to support fold constant for functions (#18209)
1. Add date-time functions for constant folding in Nereids.
This is the list of executable date-time functions Nereids supports so far:
- now()
- now(int)
- current_timestamp()
- current_timestamp(int)
- localtime()
- localtimestamp()
- curdate()
- current_date()
- curtime()
- current_time()
- date_{add/sub}(),{years/months/days/hours/minutes/seconds}_{add/sub}()
- datediff()
- {date/datev2}()
- {year/quarter/month/day/hour/minute/second}()
- dayof{year/month/week}()
- date_format()
- date_trunc()
- from_days()
- last_day()
- to_monday()
- from_unixtime()
- unix_timestamp()
- utc_timestamp()
- to_date()
- to_days()
- str_to_date()
- makedate()

2. Solved problems:
- enable datev2/datetimev2 by default.
- refactor Nereids foldConstantOnFE and support folding nested expressions.
- separate the executable functions into multiple files for easier reading and easier addition of new functions.
2023-05-17 21:26:31 +08:00
30c4f25cb3 [fix](multi-catalog) verify the precision of datetime types for each data source (#19544)
Fix three bugs of timestampv2 precision:
1. The Hive catalog doesn't set the precision of timestampv2 and can't get the precision from the hive metastore, so set the largest precision for timestampv2;
2. The Jdbc catalog uses datetimev1 to parse timestamps and then converts to timestampv2, so the precision is lost;
3. TVF doesn't use the precision from the metadata of the file format.
2023-05-17 20:50:15 +08:00
272a7565b8 [improvement](tracing) Remove useless span levels from be side tracing (#19665)
1. Replace one span per exec node method with one span per exec node;
2. Fix some problems with tracing in the pipeline engine.
2023-05-17 19:04:52 +08:00
1462e44162 [Bug](topn) fix rowid fetcher merge with empty block (#19712) 2023-05-17 10:56:32 +08:00
2bdfaac609 [fix](ubsan) fix ubsan errors (#19658)
Fix ubsan errors:

doris/be/src/util/string_parser.hpp:275:58: runtime error: signed integer overflow: 2147483647 + 1 cannot be represented in type 'int'

doris/be/src/vec/functions/functions_comparison.h:214:51: runtime error: addition of unsigned offset to 0x7fea6c6b7010 overflowed to 0x7fea6c6b700c

doris/be/src/vec/functions/multiply.cpp:67:50: runtime error: signed integer overflow: 1295699415680000000 * 0x0000000000015401d0a4cd4890a77700 cannot be represented in type '__int128

doris/be/src/vec/aggregate_functions/aggregate_function_percentile_approx.h:445:73: runtime error: addition of unsigned offset to 0x7feca3343d10 overflowed to 0x7feca3343d08 

doris/be/src/exec/schema_scanner/schema_tables_scanner.cpp:330:24: run
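A hedged illustration of the usual remedy for the signed-overflow reports; this is the generic pattern, not the exact patch:

```cpp
#include <climits>

// Signed overflow is undefined behavior, so v + 1 must not be evaluated in
// int when v == INT_MAX. Widen first, or guard before adding.
long long safe_increment(int v) {
    return static_cast<long long>(v) + 1; // the addition happens in long long
}

bool would_overflow_in_int(int v) {
    return v == INT_MAX; // check before ever computing v + 1 in int
}
```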
2023-05-17 09:32:03 +08:00
e22f5891d2 [WIP](row store) two phase opt read row store (#18654) 2023-05-16 13:21:58 +08:00
92bf485abd [Bug] Fix doris pipeline shared scan and top n opt (#19599) 2023-05-15 10:00:44 +08:00
8ef9212ddc [enhancement](exceptionsafe) force check exec node method's return value (#19538) 2023-05-12 10:21:00 +08:00
1d421a26d9 [bugfix](memory) merge block may fail to allocate (#19507) 2023-05-11 10:42:47 +08:00