197 Commits

Author SHA1 Message Date
e1a1a04c2f [Enhancement](Doe) Be query es use fe generate dsl. (#11840) 2022-08-18 10:31:17 +08:00
cfb90b39c7 (vec-stream-load-json) simdjson throw exception lead to core dump (#11880)
When config::enable_simdjson_parser=true in vectorized stream load, invalid JSON input such as '{ "a', or input whose fields are all null such as '{}', may cause the simdjson library to throw an unhandled exception such as `Objects and arrays can only be iterated when they are first encountered`, which leads to a core dump. We should handle these cases.

Signed-off-by: eldenmoon <15605149486@163.com>
2022-08-18 10:27:34 +08:00
50ef6e35be [enhancement](RowDescriptor) enhance tuple_idx check during runtime (#11835) 2022-08-17 17:50:48 +08:00
3a49156e30 [performance] (vectorization)optimize In Expr (#11826)
Co-authored-by: Wang Bo <wangbo36@meituan.com>
2022-08-17 10:46:37 +08:00
f39f57636b [feature-wip](parquet-reader) update column read model and add page index (#11601) 2022-08-16 15:04:07 +08:00
01383c3217 [Enhancement](stream-load-json) using simdjson to parse json (#11665)
Currently we use rapidjson to parse JSON documents. It is fast, but not as fast as simdjson. simdjson also has a parsing front end called simdjson::ondemand, which parses JSON lazily as fields are accessed and can strip the field token from the original document. This enables three optimizations. First, it reduces the cost of string copies: we convert everything to a string literal in _write_data_to_column with sprintf, and the flame graph shows a hotspot in this function; simdjson::to_json_string returns the token (a string piece) as a std::string_view, which is exactly what we need. Second, in _set_column_value we can iterate through the JSON document with for (auto field: object_val) {xxx}, which is much faster than looking up each field by name, as in objectValue.FindMember("k1"). Third, the at_pointer interface that simdjson provides can fetch a JSON field directly from the original document.
2022-08-16 14:49:50 +08:00
4be6e70f1c [fix](query) fix orderby keys limit return less or no result (#11757)
The bug is caused by using _num_rows_read for the limit check. _num_rows_read is the count of rows read from storage, but some of those rows may be removed by filter_block for the WHERE predicate.

Add _num_rows_return, the number of rows remaining after filter_block applies the WHERE predicate, to count the rows actually returned.
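
The difference between the two counters can be illustrated with a minimal sketch. It is not the Doris reader code: `Block`, `ScanResult`, and `scan_with_limit` are hypothetical names, and rows are plain integers. The limit is checked against the rows that survive the filter, not the rows read.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a block of rows; each row is just an int here.
using Block = std::vector<int>;

struct ScanResult {
    std::vector<int> rows;            // rows handed back to the caller
    std::size_t num_rows_read = 0;    // rows pulled from storage, before filtering
    std::size_t num_rows_return = 0;  // rows that passed the WHERE-style filter
};

// Reads rows block by block, applies `keep` as the filter, and stops once
// `limit` rows have passed the filter.
template <typename Pred>
ScanResult scan_with_limit(const std::vector<Block>& blocks, Pred keep, std::size_t limit) {
    ScanResult result;
    for (const Block& block : blocks) {
        for (int value : block) {
            // Checking num_rows_read here instead would stop too early
            // whenever the filter drops rows.
            if (result.num_rows_return >= limit) {
                return result;
            }
            ++result.num_rows_read;
            if (keep(value)) {
                result.rows.push_back(value);
                ++result.num_rows_return;
            }
        }
    }
    return result;
}
```

With blocks {1,2,3,4} and {5,6,7,8}, a filter that keeps even values, and a limit of 3, this sketch reads six rows to return three; a check on the read counter would have stopped after three rows read and returned only one.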
2022-08-16 14:31:47 +08:00
288b440b14 [improvement](vectorized) Improve count distinct performance by using fastunion (#11516)
Improve count distinct performance by using fastunion.
Tests on real user data show a 10-40% performance improvement.
2022-08-16 12:18:46 +08:00
5104982614 [enhancement](tracing) append the profile counter to trace. (#11458)
1. append the profile counter and infos to span attributes.
2. output traceid to audit log.
2022-08-15 21:36:38 +08:00
0b9bfd15b7 [feature-wip](parquet-reader) parquet physical type to doris logical type (#11769)
Two improvements have been added:
1. Translate parquet physical type into doris logical type.
2. Decode parquet column chunk into doris ColumnPtr, and add unit tests to show how to use related API.
2022-08-15 16:08:11 +08:00
1c4927eac3 [fix](core)fix bug for status not init(#11730) 2022-08-12 17:42:37 +08:00
15abafee71 [Bug](runtime filters) support late-arrival runtime filters (#11599) 2022-08-12 11:55:15 +08:00
0ab43c51e8 [Feature](unique-key-merge-on-write) some fix on delete bitmap usage (#11623) 2022-08-12 11:54:31 +08:00
7d97aa194b [feature-wip](datev2) Support to use datev2 as partition column (#11618) 2022-08-12 11:54:01 +08:00
9b9ed1aef1 [data lake](arrow scanner)Fix file arrow scanner column index out of range core. (#11691) 2022-08-12 11:34:29 +08:00
9950501fdf [fix](profile) close eof scanner before transfer done (#11705)
We should close EOF scanners before the transfer is done; otherwise
they are not closed until the scan node is closed. Because the plan is
closed after it finishes, the query profile would lose the stats from
scanners closed by scannode::close, e.g. SegmentTotalNum in the profile
is lower than expected.
2022-08-12 11:28:43 +08:00
5d66839035 [feature-wip](unique-key-merge-on-write) push down runtime filter on unique key with merge on write table (#11695) 2022-08-11 22:50:13 +08:00
8f5aed27ec [feature-wip](parquet-reader)read and decode parquet physical type (#11637)
# Proposed changes

Read and decode parquet physical type.
1. Booleans are encoded with bit-packing; this PR introduces the bit-packing implementation from Impala.
2. Create a parquet file that includes all the primitive types supported by Hive.
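
As an illustration of the bit-packing mentioned in item 1, here is a minimal sketch of decoding booleans packed eight per byte, least significant bit first, as in Parquet's PLAIN boolean encoding. It is a simple per-bit loop, not the Impala-derived implementation from this PR, and `decode_bit_packed_bools` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Decodes up to `num_values` booleans from `bytes`, where value i lives in
// bit (i % 8) of byte (i / 8), counting from the least significant bit.
// Stops early if the buffer is shorter than num_values requires.
std::vector<bool> decode_bit_packed_bools(const std::vector<std::uint8_t>& bytes,
                                          std::size_t num_values) {
    std::vector<bool> out;
    out.reserve(num_values);
    for (std::size_t i = 0; i < num_values && i / 8 < bytes.size(); ++i) {
        const unsigned bit = (bytes[i / 8] >> (i % 8)) & 1u;
        out.push_back(bit != 0);
    }
    return out;
}
```

A production decoder would typically unpack whole bytes or words at a time; the loop above only shows the bit layout.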

## Remaining Problems
1. At present, only physical types are decoded; there is no mapping or conversion to Doris logical types yet.
2. Decimal / Timestamp / Date types are not yet parsed or processed.
3. Int_8 / Int_16 are stored as Int_32; how to resolve these types is still open.
2022-08-11 10:17:32 +08:00
70b39475cf [fix](scanner) delete predicates might be inconsistent with rowset readers (#11598) 2022-08-10 19:40:54 +08:00
c8418d13b5 [improvement](config)Use session variable to replace configuration for 'enable_function_pushdown' (#11641) 2022-08-10 19:25:02 +08:00
0291f84a9e [fix](like-predicate) Add missing functions in LikeColumnPredicate (#11631) 2022-08-10 15:03:14 +08:00
01e4522612 [fix]collect_list/collect_set without GROUP BY for NOT NULL column (#11529)
Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>
2022-08-09 20:49:37 +08:00
f9b151744d optimize topn query if order by columns is prefix of sort keys of table (#10694)
* [feature](planner): push limit to olapscan when meet sort.

* if olap_scan_node's sort_info is set, push sort_limit, read_orderby_key
and read_orderby_key_reverse for olap scanner

* There is a common query pattern to find the latest time-series data,
 e.g. SELECT * from t_log WHERE t>t1 AND t<t2 ORDER BY t DESC LIMIT 100

If the ORDER BY columns are a prefix of the table's sort key, the query
can be greatly optimized to read far less data instead of reading all
data between t1 and t2.

By leveraging the same order of ORDER BY columns and sort key of table,
just read the LIMIT N rows for each related segment and merge N rows.

1. set read_orderby_key to true for read_params and _reader_context
   if olap_scan_node's sort info is set.
2. set read_orderby_key_reverse to true for read_params and _reader_context
   if is_asc_order is false.
3. rowset reader force merge read segments if read_orderby_key is true.
4. block reader and tablet reader force merge read rowsets if read_orderby_key is true.

5. for ORDER BY DESC, read and compare in reverse order
5.1 segment iterator read backward using a new BackwardBitmapRangeIterator and
    reverse the result block before return to caller.
5.2 VCollectIterator::LevelIteratorComparator, VMergeIteratorContext return
    opposite result for _is_reverse order in its compare function.
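
The core idea (read at most N rows from each already-sorted segment, merge them, and read backward for DESC) can be sketched as follows. This is a simplified illustration, not the Doris iterators: segments are plain sorted vectors of integers, and `top_n_from_sorted_segments` is a hypothetical name.

```cpp
#include <cstddef>
#include <queue>
#include <tuple>
#include <vector>

// Each segment is sorted ascending (the table's sort key order). Returns the
// first `n` rows in the requested order while consuming at most `n` rows from
// any single segment.
std::vector<int> top_n_from_sorted_segments(const std::vector<std::vector<int>>& segments,
                                            std::size_t n, bool descending) {
    // Cursor = (current value, segment index, rows already consumed from it).
    using Cursor = std::tuple<int, std::size_t, std::size_t>;
    // priority_queue keeps the element for which `lower_priority` is never
    // true on top: the smallest value for ASC, the largest for DESC.
    auto lower_priority = [descending](const Cursor& a, const Cursor& b) {
        return descending ? std::get<0>(a) < std::get<0>(b)
                          : std::get<0>(a) > std::get<0>(b);
    };
    std::priority_queue<Cursor, std::vector<Cursor>, decltype(lower_priority)> heap(lower_priority);

    // For DESC the segment is walked from its last row toward its first.
    auto value_at = [&segments, descending](std::size_t seg, std::size_t k) {
        const std::vector<int>& s = segments[seg];
        return descending ? s[s.size() - 1 - k] : s[k];
    };

    for (std::size_t seg = 0; seg < segments.size(); ++seg) {
        if (!segments[seg].empty()) {
            heap.emplace(value_at(seg, 0), seg, std::size_t{0});
        }
    }

    std::vector<int> out;
    while (out.size() < n && !heap.empty()) {
        auto [value, seg, k] = heap.top();
        heap.pop();
        out.push_back(value);
        // Advance this segment's cursor, but never past n rows per segment.
        if (k + 1 < segments[seg].size() && k + 1 < n) {
            heap.emplace(value_at(seg, k + 1), seg, k + 1);
        }
    }
    return out;
}
```

In this sketch the total work is bounded by N rows per segment, however many rows fall between t1 and t2.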

Co-authored-by: jackwener <jakevingoo@gmail.com>
2022-08-09 09:08:44 +08:00
ed7f7dead9 [Refactor](push-down predicate) Derive push-down predicate from vconjuncts (#11468)
* [Refactor](push-down predicate) Derive push-down predicate from vconjuncts
2022-08-08 19:19:26 +08:00
9349746987 [Fix](stream-load-json) fix VJsonReader::_write_data_to_column invalid column type cast when meet null (#11564)
column_ptr will be a non-nullable column pointer after `column_ptr = &nullable_column->get_nested_column()`,
so we should not cast column_ptr to ColumnNullable anymore.
2022-08-08 15:57:39 +08:00
37d1180cca [feature-wip](parquet-reader)decode parquet data (#11536) 2022-08-08 12:44:06 +08:00
Pxl
2cd3bf80dc [bugfix](schema change)fix core dump on vectorized_alter_table (#11538) 2022-08-08 10:45:28 +08:00
e8a344b683 [feature-wip](parquet-reader) add predicate filter and column reader (#11488) 2022-08-08 10:21:24 +08:00
95753ec868 [feature](parquet-reader) add group filter util (#11533)
* [feature-wip](parquet-reader) add group filter util

Co-authored-by: jinzhe <jinzhe@selectdb.com>
2022-08-05 14:02:48 +08:00
321107cb40 [refactor](schema change) Using tablet schema shared ptr instead of raw ptr (#11475)
* Using tabletschema shared ptr instead of raw ptrs


Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-08-05 11:04:38 +08:00
6eb8ac0ebf [feature-wip][multi-catalog]Support caseSensitive field name in file scan node (#11310)
* Implement case sensitivity in file scan node
2022-08-05 08:03:16 +08:00
092a394782 [improvement](agg)limit the output of agg node (#11461)
* [improvement](agg)limit the output of agg node
2022-08-05 07:53:55 +08:00
aed0282046 [feature-wip](parquet-reader)get compressed parquet page data (#11493) 2022-08-04 17:44:52 +08:00
Pxl
ec3c911f97 [Feature][Materialized-View] support materialized view on vectorized engine (#10792) 2022-08-04 14:07:48 +08:00
ecbf87d77b [bugfix](memtracker)fix exceed memory limit log (#11485) 2022-08-04 10:22:20 +08:00
1b4d6a620a (feature-wip)[parquet-reader] support page index serde (#11415) 2022-08-03 10:36:06 +08:00
842a5b8e24 [refactor](agg) Abstract the hash operation into a method (#11399) 2022-08-02 17:27:19 +08:00
38ffe685b5 [Bug](ODBC) fix vectorized null value error report in odbc scan node (#11420)
* [Bug](ODBC) fix vectorized null value error report in odbc scan node

Co-authored-by: lihaopeng <lihaopeng@baidu.com>
2022-08-02 15:44:12 +08:00
44a1a20e65 [feature-wip](parquet-reader)parse parquet schema (#11381)
Analyze schema elements in parquet FileMetaData, and generate the hierarchy of nested fields.
For example:
1. primitive type
```
// thrift:
optional int32 <column-name>;
// sql definition:
<column-name> int32;
```
2. nested type
```
// thrift:
optional group <column-name> (LIST) {
  repeated group bag {
    optional group array_element (LIST) {
      repeated group bag {
        optional int32 array_element
      }
    }
  }
}
// sql definition:
<column-name> array<array<int32>>
```
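
Parquet stores the schema in FileMetaData as a flat list of elements in depth-first order, each carrying its number of children. A minimal sketch of rebuilding the hierarchy from that list is shown below. It is not the Doris parser: `FlatElement`, `FieldNode`, `build_field_tree`, and `tree_depth` are hypothetical names, and only the name and child count are modeled.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Simplified stand-in for a Thrift SchemaElement: only the fields needed here.
struct FlatElement {
    std::string name;
    int num_children;  // 0 for primitive (leaf) fields
};

// One node of the reconstructed field hierarchy.
struct FieldNode {
    std::string name;
    std::vector<FieldNode> children;
};

// Consumes elements starting at `pos` (depth-first order) and returns the
// subtree rooted there; `pos` ends up just past that subtree.
FieldNode build_field_tree(const std::vector<FlatElement>& flat, std::size_t& pos) {
    const FlatElement& element = flat.at(pos++);
    FieldNode node{element.name, {}};
    for (int i = 0; i < element.num_children; ++i) {
        node.children.push_back(build_field_tree(flat, pos));
    }
    return node;
}

// Number of levels in the subtree, counting the node itself.
std::size_t tree_depth(const FieldNode& node) {
    std::size_t deepest_child = 0;
    for (const FieldNode& child : node.children) {
        deepest_child = std::max(deepest_child, tree_depth(child));
    }
    return 1 + deepest_child;
}
```

For the nested example above, the flat list would be the root, the column, then bag, array_element, bag, array_element, each with one child except the final primitive leaf.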
2022-08-02 10:56:13 +08:00
1cf57a985d [fix] Fix the query result error caused by the grouping sets statemen… (#11316)
* [fix] Fix the query result error caused by the grouping sets statement grouping as an expression
2022-08-01 13:52:18 +08:00
4f5e1601df [bug](scanner) Improve limit query performance on olapScannode and avoid infinite loop (#11301)
1. Fix a bug where querying a table with large columns may cause an infinite loop.
2. Optimize limit queries: when the limit value is relatively small, reduce scanner parallelism to avoid unnecessary resource consumption. This increases the number of similar queries the system can serve concurrently and improves query speed by more than 60%.
2022-08-01 13:50:12 +08:00
b35daf0a04 [improvement](light-schema-change) Support tablet schema cache (#11131) 2022-08-01 12:18:00 +08:00
0325fa436e [fix](agg)Add field of 'is_first_phase' in TAggregationNode (#11321) 2022-08-01 11:49:50 +08:00
d360974dce [improvement](agg)Use phmap::flat_hash_set in AggregateFunctionUniq (#11363)
This reverts commit 688b55053dd1fc5113343a6f565ad732ddd9612a.
2022-08-01 10:36:11 +08:00
688b55053d Revert "[improvement]Use phmap::flat_hash_set in AggregateFunctionUniq (#11257)" (#11356)
This reverts commit a7199fb98e18b925664b38460b667d04cbee8e01.
2022-07-30 23:15:36 +08:00
1f30e563a7 [refactor][vectorized] refactor first/last value agg functions (#10661)
* refactor first and last
[refactor][vectorized] refactor first/last value agg functions

* add some change

* remove first/last about always nullable

* remove always nullable and register it

* refactor value remove bool null flag

* refactor win first last to ptr and pos
2022-07-30 18:38:56 +08:00
18864ab7fe weak relationship between MemTracker and MemTrackerLimiter (#11347) 2022-07-30 18:33:54 +08:00
d6f937cb01 (performance)[scanner] Isolate local and remote queries using different scanner… (#11006) 2022-07-29 19:14:46 +08:00
84ce2a1e98 [feature-wip](multi-catalog)(fix) partition value error when a block contains multiple splits (#11260)
`FileArrowScanner::get_next` returns a block when it is full, so a block may contain multiple
splits for small files or cross two splits for large files.
However, a block can only be filled with the partition values from one file. Different splits may
come from different files, which causes incorrect embedded partition values.
2022-07-29 18:48:59 +08:00
a7199fb98e [improvement]Use phmap::flat_hash_set in AggregateFunctionUniq (#11257) 2022-07-29 16:55:22 +08:00