Commit Graph

67 Commits

Author SHA1 Message Date
71f167ac51 [fix](sort) fix nullable column sorting incorrectly (#13125) 2022-10-14 12:45:04 +08:00
Pxl
bdcb600f3d [Bug](load) fix core dump on big block load (#13014) 2022-10-10 12:38:32 +08:00
1ba9e4b568 [Improvement](sort) Reuse memory in sort node (#12921) 2022-09-28 09:44:35 +08:00
Pxl
9607f60845 [Feature](serialize) move block_data_version to fe heart beat (#12667)
Move block_data_version from be config to fe heart beat
2022-09-27 18:25:54 +08:00
35076431ab [fix](column)fix get_shrinked_column misspell (#12961)
Fix misspell
2022-09-26 17:32:03 +08:00
59699a4321 [feature](JSON datatype)Support JSON datatype (#10322)
Add `JSON` datatype, following features are implemented by this PR:
1. `CREATE` tables with `JSON` type columns
2. `INSERT` values containing `JSON` type value stored in `String`, which is represented as binary format(AKA `JSONB`) at BE 
3. `SELECT` JSON columns

Detail design refers [DSIP-016: Support JSON type](https://cwiki.apache.org/confluence/display/DORIS/DSIP-016%3A+Support+JSON+type)

* add JSONB data storage format type

* fix JsonLiteral resolve bug

* add DataTypeJson case in data_type_factory

* add JSON syntax check in FE

* add operators for jsonb_document, currently not support comparison between any JSON type value

* add ColumnJson and DataTypeJson

* add JsonField to store JsonValue

* add JsonValue to convert String JSON to BINARY JSON and JsonLiteral case for vliteral

* add push_json for MysqlResultWriter

* JSON column need no zone_map_index

* Revert "JSON column need no zone_map_index"

This reverts commit f71d1ce1ded9dbae44a5d58abcec338816b70d79.

* add JSON writer and reader, ignore zone-map for JSON column

* add json_to_string for DataTypeJson

* add olap_data_convertor for JSON type

* add some enum

* add OLAP_FIELD_TYPE_JSON type, FieldTypeTraits for it and corresponding cases or functions

* fix column_json offsets overflow bug, format code

* remove useless TODOs, add CmpType cases for JSON type

* add license header

* format license

* format be codes

* resolve rebase master conflicts

* fix bugs for CREATE and meta related code

* refactor JsonValue constructors, add fe JSON cases and fix some bugs, reformat codes

* modification be codes along code review advice

* fix rebase conflicts with master

* add unit test for json_value and column_json

* fix rebase error

* rename json to jsonb

* fix some data convert bugs, set Mysql type to JSON
2022-09-25 14:06:49 +08:00
3cfaae0031 [Improvement](sort) Use heap sort to optimize sort node (#12700) 2022-09-21 10:01:52 +08:00
b136d80e1a [enhancement](compress) reuse compression ctx and buffer (#12573)
Reuse compression ctx and buffer.
Use a global instance for every compression algorithm, and use a
thread saft buffer pool to reuse compression buffer, pool size is equal
to max parallel thread num in compression, and this will not be too large.

Test shows this feature increase 5% of data import and compaction.

Co-authored-by: yixiutt <yixiu@selectdb.com>
2022-09-15 10:59:46 +08:00
Pxl
0ead048b93 [Enhancement](column) remove ColumnString terminating zero and add a data_version for pblock (#12456)
1. remove ColumnString terminating zero
    2. add a data_version for pblock
    3. change EncryptionMode to enum class
2022-09-14 21:25:22 +08:00
56b2fc43d4 [enhancement](array-type) shrink column suffix zero for type ARRAY<CHAR> (#12443)
In compute level, CHAR type will shrink suffix zeros.
To keep the logic the same as CHAR type, we also shrink for ARRAY or ARRAY<ARRAY> types.

Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>
2022-09-13 23:24:48 +08:00
Pxl
2306e46658 [Enhancement](compaction) reduce VMergeIterator copy block (#12316)
This pr change make VMergeIterator support return row reference to instead copy a full block.
2022-09-13 16:19:34 +08:00
66491ec137 [Improvement](sort) improve partial sort algorithm (#12349)
* [Improvement](sort) improve partial sort algorithm
2022-09-09 15:44:18 +08:00
b8e38b9167 [Bug](load) block call clear_column_data may have ref not equal 1 (#12350) 2022-09-05 20:40:40 +08:00
7b352c93ff [improvement](sink) avoid frequent allocation and deallocation when serializing block (#12310) 2022-09-05 12:23:43 +08:00
90fb3b7783 [Improvement](load) accelerate tablet sink (#12174) 2022-09-01 10:08:09 +08:00
573e5476dd [Opt](load) Speed up the vectorized load (#12146)
* [Opt](load) Speed up the vectorized load
2022-08-31 16:23:36 +08:00
Pxl
620d33a763 [Enchancement](optimize) set result_size_hint to filter_block (#11972) 2022-08-25 11:42:52 +08:00
c22d097b59 [improvement](compress) Support compress/decompress block with lz4 (#11955) 2022-08-22 17:35:43 +08:00
dc8f64b3e3 [improvement](agg) Serialize the fixed-length aggregation results with corresponding columns instead of ColumnString (#11801) 2022-08-22 10:12:06 +08:00
01bd7f224b [bugifx](compaction) fix filter_delete if schema has sequence column (#11909)
introduced in #11721. Use last column as delete sign, but if sequence column
exist, it's wrong.

Co-authored-by: yixiutt <yixiu@selectdb.com>
2022-08-19 14:56:06 +08:00
b8a33d2629 [Improvement](load) turn enable_vectorized_load on by default (#11833) 2022-08-18 14:43:09 +08:00
01383c3217 [Enhancement](stream-load-json) using simdjson to parse json (#11665)
Currently we use rapidjson to parse json document, It's fast but not fast enough compare to simdjson.And I found that the simdjson has a parsing front-end called simdjson::ondemand which will parse json when accessing fields and could strip the field token from the original document, using this feature we could reduce the cost of string copy(eg. we convert everthing to a string literal in _write_data_to_column by sprintf, I saw a hotspot from the flamegrame in this function, using simdjson::to_json_string will strip the token(a string piece) which is std::string_view and this is exactly we need).And second in _set_column_value we could iterate through the json document by for (auto field: object_val) {xxx}, this is much faster than looking up a field by it's field name like objectValue.FindMember("k1").The third optimization is the at_pointer interface simdjson provided, this could directly get the json field from original document.
2022-08-16 14:49:50 +08:00
f9b151744d optimize topn query if order by columns is prefix of sort keys of table (#10694)
* [feature](planner): push limit to olapscan when meet sort.

* if olap_scan_node's sort_info is set, push sort_limit, read_orderby_key
and read_orderby_key_reverse for olap scanner

* There is a common query pattern to find latest time serials data.
 eg. SELECT * from t_log WHERE t>t1 AND t<t2 ORDER BY t DESC LIMIT 100

If the ORDER BY columns is the prefix of the sort key of table, it can
be greatly optimized to read much fewer data instead of read all data
between t1 and t2.

By leveraging the same order of ORDER BY columns and sort key of table,
just read the LIMIT N rows for each related segment and merge N rows.

1. set read_orderby_key to true for read_params and _reader_context
   if olap_scan_node's sort info is set.
2. set read_orderby_key_reverse to true for read_params and _reader_context
   if is_asc_order is false.
3. rowset reader force merge read segments if read_orderby_key is true.
4. block reader and tablet reader force merge read rowsets if read_orderby_key is true.

5. for ORDER BY DESC, read and compare in reverse order
5.1 segment iterator read backward using a new BackwardBitmapRangeIterator and
    reverse the result block before return to caller.
5.2 VCollectIterator::LevelIteratorComparator, VMergeIteratorContext return
    opposite result for _is_reverse order in its compare function.

Co-authored-by: jackwener <jakevingoo@gmail.com>
2022-08-09 09:08:44 +08:00
Pxl
6e98ebba27 [Vectorized] Support sort combinator (#10469) 2022-07-23 17:58:31 +08:00
babab5d535 [feature-wip] support datetimev2 (#11085) 2022-07-23 16:07:59 +08:00
7e3fc0d321 [enhancement](vec) Support outer join for vectorized exec engine (#11068)
Hash join node adds three new attributes.
The following will take an SQL as an example to illustrate the meaning of these three attributes

```
select t1. a from t1 left join t2 on t1. a=t2. b;
```
1. vOutputTupleDesc:Tuple2(a'')

2. vIntermediateTupleDescList: Tuple1(a', b'<nullable>)

2. vSrcToOutputSMap: <Tuple1(a'), Tuple2(a'')>

The slot in intermediatetuple corresponds to the slot in output tuple one by one through the expr calculation of the left child in vsrctooutputsmap.

This code mainly merges the contents of two PRs:
1.  [fix](vectorized) Support outer join for vectorized exec engine (https://github.com/apache/doris/pull/10323)
2. [Fix](Join) Fix the bug of outer join function under vectorization #9954

The following is the specific description of the first PR
In a vectorized scenario, the query plan will generate a new tuple for the join node.
This tuple mainly describes the output schema of the join node.
Adding this tuple mainly solves the problem that the input schema of the join node is different from the output schema.
For example:
1. The case where the null side column caused by outer join is converted to nullable.
2. The projection of the outer tuple.

The following is the specific description of the second PR
This pr mainly fixes the following problems:
1. Solve the query combined with inline view and outer join. After adding a tuple to the join operator, the position of the `tupleisnull` function is inconsistent with the row storage. Currently the vectorized `tupleisnull` will be calculated in the HashJoinNode.computeOutputTuple() function.
2. Column nullable property error problem. At present, once the outer join occurs, the column on the null-side side will be planned to be nullable in the semantic parsing stage.

For example:
```
select * from (select a as k1 from test) tmp right join b on tmp.k1=b.k1
```
At this time, the nullable property of column k1 in the `tmp` inline view should be true.

In the vectorized code, the virtual `tableRef` of tmp will be used in constructing the output tuple of HashJoinNode (specifically, the function HashJoinNode.computeOutputTuple()). So the **correctness** of the column nullable property of this tableRef is very important.
In the above case, since the tmp table needs to perform a right join with the b table, as a null-side tmp side, it is necessary to change the column attributes involved in the tmp table to nullable.

In non-vectorized code, since the virtual tableRef tmp is not used at all, it uses the `TupleIsNull` function in `outputsmp` to ensure data correctness.
That is to say, the a column of the original table test is still non-null, and it does not affect the correctness of the result.

The vectorized nullable attribute requirements are very strict.
Outer join will change the nullable attribute of the join column, thereby changing the nullable attribute of the column in the upper operator layer by layer.
Since FE has no mechanism to modify the nullable attribute in the upper operator tuple layer by layer after the analyzer.
So at present, we can only preset the attributes before the lower join as nullable in the analyzer stage in advance, so as to avoid the problem.
(At the same time, be also wrote some evasive code in order to deal with the problem of null to non-null.)

Co-authored-by: EmmyMiao87
Co-authored-by: HappenLee
Co-authored-by: morrySnow

Co-authored-by: EmmyMiao87 <522274284@qq.com>
2022-07-21 23:39:25 +08:00
c9f86bc7e2 [refactor] Refactoring Status static methods to format message using fmt(#9533) 2022-07-02 18:58:23 +08:00
97996c9275 [fix](Insert) fix 5 concurrent "insert...select..." OOM (#10501)
* [hotfix](dev-1.0.1) 5 concurrent insert...select... OOM

Co-authored-by: minghong <minghong.zhou@163.com>
Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-07-01 15:29:26 +08:00
Pxl
a9d23ce337 [refactor] remove collator (#10518) 2022-07-01 10:35:32 +08:00
06e436b7cc [bugfix]dump_one_line failed to dump last column (#10522)
Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>
2022-07-01 10:33:49 +08:00
ca94867b4e [Feature-wip] add date v2 type (#9916) 2022-06-26 16:07:56 +08:00
eebfbd0c91 Revert "[fix](vectorized) Support outer join for vectorized exec engine (#10323)" (#10424)
This reverts commit 2cc670dba697a330358ae7d485d856e4b457c679.
2022-06-25 22:18:08 +08:00
2cc670dba6 [fix](vectorized) Support outer join for vectorized exec engine (#10323)
In a vectorized scenario, the query plan will generate a new tuple for the join node.
This tuple mainly describes the output schema of the join node.
Adding this tuple mainly solves the problem that the input schema of the join node is different from the output schema.
For example:
1. The case where the null side column caused by outer join is converted to nullable.
2. The projection of the outer tuple.
2022-06-24 08:59:30 +08:00
5e47b03595 [feature-wip](array-type) Add array aggregation functions (#10108) 2022-06-17 11:07:49 +08:00
d58e00c49c [fix](brpc) Embed serialized request into the attachment and transmit it through http brpc (#9803)
When the length of `Tuple/Block data` is greater than 2G, serialize the protoBuf request and embed the
`Tuple/Block data` into the controller attachment and transmit it through http brpc.

This is to avoid errors when the length of the protoBuf request exceeds 2G:
`Bad request, error_text=[E1003]Fail to compress request`.

In #7164, `Tuple/Block data` was put into attachment and sent via default `baidu_std brpc`,
but when the attachment exceeds 2G, it will be truncated. There is no 2G limit for sending via `http brpc`.

Also, in #7921, consider putting `Tuple/Block data` into attachment transport by default, as this theoretically
reduces one serialization and improves performance. However, the test found that the performance did not improve,
but the memory peak increased due to the addition of a memory copy.
2022-06-13 20:41:48 +08:00
c426c2e4b1 [Vectorized-Load] Support vectorized load table with materialized view (#9923)
* [Vectorized-Load] Support vectorized load table with materialized view

* fix ut

Co-authored-by: lihaopeng <lihaopeng@baidu.com>
2022-06-02 14:59:01 +08:00
f377c26bf7 [refactor][be] Optimize headers (#9708) 2022-05-30 16:12:10 +08:00
2a11a4ab99 [feature-wip][array-type] Support more sub types. (#9466)
Please refer to #9465
2022-05-26 08:41:34 +08:00
73e31a2179 [stream-load-vec]: memtable flush only if necessary after aggregated (#9459)
Co-authored-by: weixiang <weixiang06@meituan.com>
2022-05-25 21:12:24 +08:00
8470543144 [Improvement] fix typo (#9743) 2022-05-25 19:29:01 +08:00
f5bef328fe [fix] disable transfer data large than 2GB by brpc (#9770)
because of brpc and protobuf cannot transfer data large than 2GB, if large than 2GB will overflow, so add a check before send
2022-05-25 18:41:13 +08:00
8fa677b59c [Refactor][Bug-Fix][Load Vec] Refactor code of basescanner and vjson/vparquet/vbroker scanner (#9666)
* [Refactor][Bug-Fix][Load Vec] Refactor code of basescanner and vjson/vparquet/vbroker scanner
1. fix bug of vjson scanner not support `range_from_file_path`
2. fix bug of vjson/vbrocker scanner core dump by src/dest slot nullable is different
3. fix bug of vparquest filter_block reference of column in not 1
4. refactor code to simple all the code

It only changed vectorized load, not original row based load.

Co-authored-by: lihaopeng <lihaopeng@baidu.com>
2022-05-20 11:43:03 +08:00
718a51a388 [refactor][style] Use clang-format to sort includes (#9483) 2022-05-10 21:25:35 +08:00
c9961c9bb9 [style] clang-format all c++ code (#9305)
- sh build-support/clang-format.sh  to  clang-format all c++ code
2022-04-29 16:14:22 +08:00
d330bc3806 [Vectorized](stream-load-vec) Support stream load in vectorized engine (#8709) (#9280)
Implement vectorized stream load.
Added fe configuration option `enable_vectorized_load` to enable vectorized stream load.

    Co-authored-by: tengjp@outlook.com
    Co-authored-by: mrhhsg@gmail.com
    Co-authored-by: minghong.zhou@163.com
    Co-authored-by: HappenLee <happenlee@hotmail.com>
    Co-authored-by: zhoubintao <35688959+zbtzbtzbt@users.noreply.github.com>
2022-04-29 09:50:51 +08:00
Pxl
951c2a90eb [fix](Lateral-View)(Vectorized) core dump on lateral-view with nullable column (#9191) 2022-04-26 10:24:11 +08:00
ae680b4248 [UDF] support RPC udaf part 1: support create RPC udaf in fe (#8510) 2022-04-21 17:38:58 +08:00
Pxl
dda7604e16 [Bug][Storage-vectorized] fix code dump on outer join with not nullable column (#9112) 2022-04-21 11:02:04 +08:00
0bf72caf68 [Bug][Vectorized] Fix UB when doing ORDER BY. (#9023) 2022-04-15 14:02:29 +08:00
6ed59bb98b [refactor](code_style) remove useless inline #8933
1.Member functions defined in a class are inline by default (implicitly), and do not need to be added
2.inline is a keyword used for implementation, which has no effect when placed before the function declaration
2022-04-10 18:29:55 +08:00