Commit Graph

2085 Commits

Author SHA1 Message Date
527293aa41 [refactor](dynamic table) remove dynamic table (#23298) 2023-08-23 14:15:14 +08:00
ba882dea21 [pipelineX](dependency) Build DAG between pipelines (#23355) 2023-08-23 13:21:32 +08:00
14296ee87f [fix](window_function) wrong order by range (#23346) 2023-08-23 11:23:00 +08:00
Pxl
8ed4045df9 [Chore](primitive-type) remove VecPrimitiveTypeTraits (#22842) 2023-08-23 08:37:40 +08:00
Pxl
e6d20f842c [Bug](compile) fix compile failure on function case (#23335) 2023-08-22 22:10:53 +08:00
5c2fae7ce5 [pipeline](exec) Refactor the table sink code and remove useless code (#23223)
Refactor the table sink code and remove useless code
2023-08-22 20:42:14 +08:00
Pxl
1a1f86486d [Improvement](function) opt for case when (#23068)
opt for case when
2023-08-22 18:31:40 +08:00
0b51e6d8e1 [refactor](FunctionArrayIndex) make the code simpler 2023-08-22 17:48:59 +08:00
9d2e23b1aa [fix](parquet) A row of complex type may be stored across more pages (#23277)
A row of a complex type may be stored across two (or more) pages, and the parameter `align_rows` indicates whether the reader should read the remaining values of the last row from the previous page (see the sketch below).
2023-08-22 14:47:10 +08:00
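A minimal standalone sketch of the `align_rows` idea described in the commit above. This is not Doris's actual reader code; `Page`, `read_page`, and the rep-level scheme are hypothetical stand-ins, assuming rep level 0 opens a new top-level row and nonzero levels continue it:

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

struct Page {
    std::vector<int> values;
    std::vector<int> rep_levels;  // parallel to values; 0 opens a new row
};

void read_page(const Page& page, bool align_rows,
               std::vector<std::vector<int>>& rows) {
    for (size_t i = 0; i < page.values.size(); ++i) {
        if (page.rep_levels[i] == 0) {
            rows.emplace_back();  // a new top-level row starts here
        } else if (rows.empty() || (i == 0 && !align_rows)) {
            // leading continuation values belong to the previous page's
            // last row; without align_rows they would be dropped
            continue;
        }
        rows.back().push_back(page.values[i]);
    }
}

int main() {
    // Row [1,2,3] spans both pages; row [4] lives entirely in page 2.
    Page p1{{1, 2}, {0, 1}};
    Page p2{{3, 4}, {1, 0}};
    std::vector<std::vector<int>> rows;
    read_page(p1, /*align_rows=*/false, rows);  // first page: nothing to align
    read_page(p2, /*align_rows=*/true, rows);   // finish the spanning row first
    assert(rows.size() == 2 && rows[0].size() == 3);
    printf("rows=%zu, first row has %zu values\n", rows.size(), rows[0].size());
}
```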
5ff7b57fc1 [fix](parquet) parquet reader confuses logical/physical/slot id of columns (#23198)
`ParquetReader` confuses the logical/physical/slot ids of columns. When only scalar types are read, nothing goes wrong, but when complex types are read, `RowGroup` and `PageIndex` get wrong statistics. Therefore, if a query contains complex types and pushed-down predicates, the result set is probably incorrect (see the sketch below).
2023-08-22 13:35:29 +08:00
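A hypothetical illustration of the three indexes the commit above distinguishes (none of this is Doris's real code). For scalar-only schemas the three tend to coincide, which hides the bug; a complex type expands to several physical leaf columns, so they diverge, and a statistics lookup by the wrong index reads another column's min/max:

```cpp
#include <cstdio>
#include <map>
#include <utility>

struct ColumnIds {
    int slot_id;      // position in the query's tuple descriptor
    int logical_id;   // position in the table schema
    int physical_id;  // leaf column index inside the parquet file
};

int main() {
    // Table: c0 INT, c1 ARRAY<INT>, c2 INT. The array occupies extra leaf
    // columns, shifting the physical index of every later column.
    ColumnIds c2{/*slot_id=*/2, /*logical_id=*/2, /*physical_id=*/3};

    // Row-group min/max statistics are keyed by the physical leaf index.
    std::map<int, std::pair<int, int>> min_max = {
        {0, {1, 10}}, {1, {0, 0}}, {2, {0, 0}}, {3, {100, 200}}};

    auto wrong = min_max[c2.logical_id];   // buggy: lands on a leaf of c1
    auto right = min_max[c2.physical_id];  // fixed: the real stats of c2
    printf("wrong=[%d,%d] right=[%d,%d]\n",
           wrong.first, wrong.second, right.first, right.second);
}
```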
12075f9853 [pipelineX](projection) Support projection and blocking agg (#23256) 2023-08-21 22:23:02 +08:00
dcd6c3c022 [pipelineX](refactor) propose a new pipeline execution model (#22562) 2023-08-21 15:38:45 +08:00
d4694167a8 [Enhancement](chore) Some Status-related enhancements (#23072) 2023-08-21 14:14:38 +08:00
37b49f60b7 [refactor](conf) add be conf for partition topn partitions threshold (#23220)
add be conf for partition topn partitions threshold
2023-08-21 10:52:41 +08:00
33dfa0c454 [Improve](serde) support text serde for nested type-array/map (#22738)
Nested array/map types are not currently supported,
so this PR aims to:
1. add a format option for string conversion of a defined datatype, to stay consistent with the original from_string behavior
2. support arrays and maps that nest arrays and maps (see the sketch below)
2023-08-21 10:32:28 +08:00
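A rough sketch of what text serde for nested types amounts to: the serializer recurses, so an ARRAY<ARRAY<INT>> (or arrays of maps) renders inner values in their own text form instead of being rejected. `Value` and `to_text` are invented for illustration and do not mirror Doris's serde classes:

```cpp
#include <cstdio>
#include <string>
#include <vector>

struct Value {
    std::string scalar;           // used when children is empty
    std::vector<Value> children;  // array elements
};

std::string to_text(const Value& v) {
    if (v.children.empty()) return v.scalar;
    std::string out = "[";
    for (size_t i = 0; i < v.children.size(); ++i) {
        if (i) out += ", ";
        out += to_text(v.children[i]);  // recursion handles any nesting depth
    }
    return out + "]";
}

int main() {
    // ARRAY<ARRAY<INT>> example: [[1, 2], [3]]
    Value one{"1", {}}, two{"2", {}}, three{"3", {}};
    Value inner1{"", {one, two}};
    Value inner2{"", {three}};
    Value v{"", {inner1, inner2}};
    printf("%s\n", to_text(v).c_str());  // prints [[1, 2], [3]]
}
```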
0967d7ec04 [improvement](agg) Do not serialize bitmap to string (#23172) 2023-08-21 10:10:15 +08:00
Pxl
a11e0e3bc4 [Bug](agg) fix many QUANTILE_UNION problems (#23181)
fix many QUANTILE_UNION problems
2023-08-21 10:04:27 +08:00
4bf055c818 [fix](parquet) the key column of map type in parquet may be nullable (#23180)
Fix errors when reading a map type with a nullable key column in a parquet file. `ParquetReader` is able to read a nullable key column, but a check was added to prevent doing so. Unfortunately, this check's error was not thrown correctly, causing the BE to crash and leaving meaningless error logs in be.out (a sketch of the fix's shape follows after the trace):
```
...
11# doris::vectorized::ParquetReader::get_columns(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, doris::TypeDescriptor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, doris::TypeDescriptor> > >*, std::unordered_set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*) at /root/doris/be/src/vec/exec/format/parquet/vparquet_reader.cpp:508
12# doris::vectorized::VFileScanner::_get_next_reader() in /root/yun_you_external/output/be/lib/doris_be
13# doris::vectorized::VFileScanner::_get_block_impl(doris::RuntimeState*, doris::vectorized::Block*, bool*) at /root/doris/be/src/vec/exec/scan/vfile_scanner.cpp:241
...
```
2023-08-20 22:59:18 +08:00
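The commit above is about error propagation rather than new functionality, so here is only the shape of the fix, sketched with invented types (`Status`, `MapType`, and `check_map_schema` are not Doris's real classes): an unsupported schema should surface as an error status to the caller instead of crashing the BE later:

```cpp
#include <cstdio>
#include <string>
#include <utility>

struct Status {
    bool ok;
    std::string msg;
    static Status OK() { return {true, ""}; }
    static Status Error(std::string m) { return {false, std::move(m)}; }
};

struct MapType {
    bool key_nullable;
};

Status check_map_schema(const MapType& t) {
    if (t.key_nullable) {
        // Returned, not DCHECKed: an unreadable file yields a query error
        // rather than a BE crash with an opaque stack in be.out.
        return Status::Error("map key column is nullable; not supported");
    }
    return Status::OK();
}

int main() {
    Status s = check_map_schema(MapType{/*key_nullable=*/true});
    printf("%s\n", s.ok ? "ok" : s.msg.c_str());
}
```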
433a6103ab [Enhancement](scanner) allocate blocks in scanner_context on demand and free them on close (#23182)
Introduced #19389, removed #20785
2023-08-19 12:13:24 +08:00
0838ff4bf4 [fix](Outfile) fix bug that the fileSize is not correct when outfile is completed (#22951) 2023-08-18 22:31:44 +08:00
419e922a69 [fix](json) Fix the bug that reading json files does not stop (#23062)
* [fix](json) Fix the bug that reading json files does not stop
2023-08-18 18:23:19 +08:00
Pxl
477961dc21 [Chore](agg) refactor of hash map (#22958)
refactor of hash map
2023-08-18 17:59:30 +08:00
3d4ec1ac88 [pipeline](exec) support async writer in jdbc sink in pipeline query engine (#23144)
support async writer in jdbc sink in pipeline query engine
2023-08-18 17:07:57 +08:00
1c3cc77a54 [fix](function) to_bitmap parameter parsing failure returns null instead of bitmap_empty (#21236)
* [fix](function) to_bitmap parameter parsing failure returns null instead of bitmap_empty

* add ut

* fix nereids

* fix regression-test
2023-08-18 14:37:49 +08:00
795006ea3d [fix](multi-catalog) conversion of compatible numerical types (#23113)
Hive supports schema change but doesn't rewrite existing parquet files, so the physical type in a parquet file may not match the logical type of the table schema (see the sketch below).
2023-08-18 14:05:33 +08:00
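A minimal sketch of the compatible-numeric conversion described above, assuming the common Hive case of widening INT to BIGINT; the function name is hypothetical:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// After a Hive schema change such as INT -> BIGINT, old parquet files
// still store int32, so the reader widens the physical values to the
// table's logical type instead of failing on the type mismatch.
std::vector<int64_t> widen_int32_to_int64(const std::vector<int32_t>& physical) {
    std::vector<int64_t> logical;
    logical.reserve(physical.size());
    for (int32_t v : physical) logical.push_back(static_cast<int64_t>(v));
    return logical;
}

int main() {
    std::vector<int32_t> file_col = {1, -2, 2147483647};
    std::vector<int64_t> table_col = widen_int32_to_int64(file_col);
    printf("last=%lld\n", static_cast<long long>(table_col.back()));
}
```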
a5ca6cadd6 [Improvement] Optimize count operation for iceberg (#22923)
Iceberg has its own metadata, which includes count statistics for table data. If the table does not contain equality deletes, we can get the row count of the current table directly from the count statistics (see the sketch below).
2023-08-18 09:57:51 +08:00
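A hypothetical sketch of the optimization: each data file in an Iceberg snapshot carries a record count, so count(*) can be summed from metadata, but only when no equality-delete files might remove rows. The structs below are stand-ins, not Iceberg's real API:

```cpp
#include <cstdint>
#include <cstdio>
#include <optional>
#include <vector>

struct DataFile { int64_t record_count; };
struct Snapshot {
    std::vector<DataFile> data_files;
    bool has_equality_deletes;
};

std::optional<int64_t> count_from_metadata(const Snapshot& s) {
    if (s.has_equality_deletes) return std::nullopt;  // must scan instead
    int64_t total = 0;
    for (const auto& f : s.data_files) total += f.record_count;
    return total;
}

int main() {
    Snapshot s{{{100}, {250}}, /*has_equality_deletes=*/false};
    if (auto n = count_from_metadata(s)) {
        printf("count=%lld\n", static_cast<long long>(*n));
    }
}
```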
314f5a5143 [Fix](orc-reader) Fix filling partition or missing column used incorrect row count. (#23096)
[Fix](orc-reader) Fix filling partition or missing column used incorrect row count.

`_row_reader->nextBatch` returns the number of rows read. When orc lazy materialization is turned on, the number of rows read includes filtered rows, so the caller must look at `numElements` in the row batch to determine how many rows survived filtering and should be filled into the block.

In this case, filling partition or missing columns used the incorrect row count, which crashed the BE on `filter.size() != offsets.size()` in the filter-column step (see the sketch below).

When orc lazy materialization is turned off, add `_convert_dict_cols_to_string_cols(block, nullptr)` if `(block->rows() == 0)`.
2023-08-17 23:26:11 +08:00
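A small sketch of the row-count pitfall fixed above, with invented stand-ins for the ORC batch (`RowBatch`, `fill_partition_column`): partition and missing columns must be sized by the surviving row count, not by the count `nextBatch` returned:

```cpp
#include <cstdio>
#include <vector>

struct RowBatch {
    size_t numElements;  // rows remaining after the lazy-materialization filter
};

// Fill a constant partition column for exactly `rows` output rows.
std::vector<int> fill_partition_column(int value, size_t rows) {
    return std::vector<int>(rows, value);
}

int main() {
    size_t rows_read = 1024;              // what nextBatch() reported (includes filtered rows)
    RowBatch batch{/*numElements=*/700};  // what actually survived filtering

    auto wrong = fill_partition_column(7, rows_read);          // 1024 values: size mismatch later
    auto right = fill_partition_column(7, batch.numElements);  // 700 values: matches the block
    printf("wrong=%zu right=%zu\n", wrong.size(), right.size());
}
```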
57568ba472 [fix](be)shouldn't use arena to alloc memory for SingleValueDataString (#23075)
* [fix](be)shouldn't use arena to alloc memory for SingleValueDataString

* format code
2023-08-17 22:18:09 +08:00
c5c984b79b [refactor](bitmap) using template to reduce duplicate code (#23060)
* [refactor](bitmap) support for batch value insertion

* fix values was not filled for int8 and int16
2023-08-17 18:14:29 +08:00
b252c49071 [fix](hash join) fix heap-use-after-free of HashJoinNode (#23094) 2023-08-17 16:29:47 +08:00
e289e03a1a [fix](executor)fix no return with old type in time_round 2023-08-17 15:34:26 +08:00
Pxl
cf1865a1c8 [Bug](scan) fix core dump due to store_path_map (#23084)
fix core dump due to store_path_map
2023-08-17 15:24:43 +08:00
8b51da0523 [Fix](load) fix partition null pointer exception (#22965) 2023-08-17 14:09:47 +08:00
343a6dc29d [improvement](hash join) Return result early if probe side has no data (#23044) 2023-08-17 09:17:09 +08:00
390c52f73a [Improve](complex-type) update for array/map element_at with nested complex type with local tvf (#22927) 2023-08-16 20:47:36 +08:00
Pxl
d5df3bae25 [Bug](exchange) fix DCHECK failure when VDataStreamRecvr receives an empty block (#22992)
fix DCHECK failure when VDataStreamRecvr receives an empty block
2023-08-16 10:21:19 +08:00
f191736bfe [bug](shuffle) Fix DCHECK failure if exchange node has limit (#22993) 2023-08-15 19:14:37 +08:00
9b2323b7fd [Pipeline](exec) support async writer in pipeline query engine (#22901) 2023-08-15 17:32:53 +08:00
50f66b1246 [fix](pipeline) fix bug of datastream sender when doing BUCKET_SHFFULE_HASH_PARTITIONED shuffle (#22988)
This issue was introduced by #22765; if #22765 is picked to 2.0, this PR also needs to be picked.

When the shuffle type is BUCKET_SHFFULE_HASH_PARTITIONED, data from multiple buckets may be sent to the same channel, so sending eos too early may cause data loss (see the sketch below).
2023-08-15 17:30:27 +08:00
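A hypothetical sketch of the fix's idea, not Doris's sender code: a channel shared by several buckets may only be closed (eos) after every bucket routed to it has finished, not after the first one:

```cpp
#include <cstdio>

struct Channel {
    int pending_buckets = 0;
    bool eos_sent = false;
};

void finish_bucket(Channel& ch) {
    if (--ch.pending_buckets == 0 && !ch.eos_sent) {
        ch.eos_sent = true;  // safe: no bucket can still send rows here
        printf("eos sent\n");
    }
}

int main() {
    Channel ch;
    ch.pending_buckets = 3;  // e.g. buckets 0, 4, 8 all hash to this channel
    finish_bucket(ch);       // too early to close: two buckets still pending
    finish_bucket(ch);
    finish_bucket(ch);       // last bucket done -> eos
}
```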
f1864d9fcf [fix](function) fix str_to_date with specific format #22981 2023-08-15 15:30:48 +08:00
9b42093742 [feature](agg) Make 'map_agg' support array type as value (#22945) 2023-08-15 14:44:50 +08:00
c2ff940947 [refactor](parquet)change decimal type export as fixed-len-byte on parquet write (#22792)
Before, the parquet writer exported decimals as byte-array (byte-binary),
but those fields could not be imported into Hive.
Now decimals are exported as fixed-len-byte-array so they can be imported into Hive directly (see the sketch below).
2023-08-15 13:17:50 +08:00
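A sketch of the encoding this commit switches to, assuming the standard parquet convention: FIXED_LEN_BYTE_ARRAY decimals store the unscaled value as a big-endian two's-complement integer padded to the type length. The function here is illustrative, not Doris's writer:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<uint8_t> encode_decimal_flba(int64_t unscaled, size_t type_len) {
    std::vector<uint8_t> out(type_len);
    for (size_t i = 0; i < type_len; ++i) {
        // lowest byte goes last: big-endian, sign bits fill the padding
        out[type_len - 1 - i] = static_cast<uint8_t>(unscaled >> (8 * i));
    }
    return out;
}

int main() {
    // DECIMAL(9,2) fits in 4 bytes: 12345.67 -> unscaled 1234567
    auto bytes = encode_decimal_flba(1234567, 4);
    for (uint8_t b : bytes) printf("%02x ", b);
    printf("\n");  // prints: 00 12 d6 87
}
```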
94bf8fb3c5 [performance](executor) optimize time_round function only one arg (#22855) 2023-08-15 13:16:42 +08:00
d431a35721 [Fix](inverted index) fix non-index match function core (#22959) 2023-08-15 11:27:12 +08:00
xy
b5ea3454a6 [Bug](aggregation)fix for map_agg when columns[1] is nullable (#22932)
In the map_agg handler function, add a check on columns[1]->is_nullable()
2023-08-15 11:26:03 +08:00
Pxl
3f55d5d4d5 [Chore](execution) change some log fatal and dcheck to exception (#22890)
change some log fatal and dcheck to exception
2023-08-15 10:45:00 +08:00
8318dfa9a3 [fix](datastream sender) fix wrong result of BUCKET_SHFFULE_HASH_PARTITIONED shuffle (#22973)
fix wrong result of BUCKET_SHFFULE_HASH_PARTITIONED shuffle
2023-08-15 10:21:14 +08:00
911bd0e818 [bug](if) fix if function not handling const nullable values (#22823)
fix if function not handling const nullable values
2023-08-15 10:16:48 +08:00
b49dc8042d [feature](load) refactor CSV reading process during scanning, and support enclose and escape for stream load (#22539)
## Proposed changes

Refactor thoughts: close #22383
Descriptions about `enclose` and `escape`: #22385

## Further comments

2023-08-09: 
It's a pity, but experiments show that the original way of parsing plain CSV is faster. Therefore, the refactor is applied only to the enclose-related code; the plain CSV parser keeps the original logic.

Some performance fallback is unavoidable anyway. From the `CSV reader`'s perspective, the real weak point may be the column-writing behavior, as the flame graph shows.
 
Trimming of escape characters will be enabled after fix #22411 is merged.

Cases should be discussed: 

1. When an incomplete enclose appears at the beginning of large-scale data, the line delimiter will be unreachable until EOF; will the buffer become extremely large?
2. What if an infinitely long line occurs? Essentially, case 1 is equivalent to this.

This PR supports only stream load as a trial, to avoid too many unrelated changes. Docs will be added when `enclose` and `escape` are available for all kinds of load (a toy enclose-parsing sketch follows below).
2023-08-15 09:23:53 +08:00
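A toy sketch of enclose-aware splitting as described in the commit above; Doris's real CSV reader works on raw byte buffers with line splitters, so this only shows the state machine's shape (escape handling omitted):

```cpp
#include <cstdio>
#include <string>
#include <vector>

// A column separator inside an enclosed span is data, not a delimiter.
std::vector<std::string> split_line(const std::string& line,
                                    char sep, char enclose) {
    std::vector<std::string> fields(1);
    bool in_enclose = false;
    for (char c : line) {
        if (c == enclose) {
            in_enclose = !in_enclose;  // toggle enclosure state
        } else if (c == sep && !in_enclose) {
            fields.emplace_back();     // separator at top level: new field
        } else {
            fields.back().push_back(c);
        }
    }
    return fields;
}

int main() {
    for (const auto& f : split_line("1,\"a,b\",3", ',', '"')) {
        printf("[%s]\n", f.c_str());   // prints: [1] [a,b] [3]
    }
}
```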
7bc98748cf [fix](datastream sender) fix wrong result of broadcast join; fix wrong result of pipeline (#22942)
Fix bug of #22765
Close #22924
2023-08-14 18:59:19 +08:00