[feature](vectorization) Support Vectorized Exec Engine In Doris (#7785)

# Proposed changes

Issue Number: close #6238

    Co-authored-by: HappenLee <happenlee@hotmail.com>
    Co-authored-by: stdpain <34912776+stdpain@users.noreply.github.com>
    Co-authored-by: Zhengguo Yang <yangzhgg@gmail.com>
    Co-authored-by: wangbo <506340561@qq.com>
    Co-authored-by: emmymiao87 <522274284@qq.com>
    Co-authored-by: Pxl <952130278@qq.com>
    Co-authored-by: zhangstar333 <87313068+zhangstar333@users.noreply.github.com>
    Co-authored-by: thinker <zchw100@qq.com>
    Co-authored-by: Zeno Yang <1521564989@qq.com>
    Co-authored-by: Wang Shuo <wangshuo128@gmail.com>
    Co-authored-by: zhoubintao <35688959+zbtzbtzbt@users.noreply.github.com>
    Co-authored-by: Gabriel <gabrielleebuaa@gmail.com>
    Co-authored-by: xinghuayu007 <1450306854@qq.com>
    Co-authored-by: weizuo93 <weizuo@apache.org>
    Co-authored-by: yiguolei <guoleiyi@tencent.com>
    Co-authored-by: anneji-dev <85534151+anneji-dev@users.noreply.github.com>
    Co-authored-by: awakeljw <993007281@qq.com>
    Co-authored-by: taberylyang <95272637+taberylyang@users.noreply.github.com>
    Co-authored-by: Cui Kaifeng <48012748+azurenake@users.noreply.github.com>


## Problem Summary:

### 1. Some code from ClickHouse

**ClickHouse is an excellent implementation of a vectorized execution engine,
so we have referenced and learned a lot from it in terms of
data structures and function implementations.
This work is based on ClickHouse v19.16.2.2, and we would like to thank the ClickHouse community and developers.**

The following comment has been added to code taken from ClickHouse, e.g.:
// This file is copied from
// https://github.com/ClickHouse/ClickHouse/blob/master/src/Interpreters/AggregationCommon.h
// and modified by Doris

### 2. Supported exec nodes and queries:
* vaggregation_node
* vanalytic_eval_node
* vassert_num_rows_node
* vblocking_join_node
* vcross_join_node
* vempty_set_node
* ves_http_scan_node
* vexcept_node
* vexchange_node
* vintersect_node
* vmysql_scan_node
* vodbc_scan_node
* volap_scan_node
* vrepeat_node
* vschema_scan_node
* vselect_node
* vset_operation_node
* vsort_node
* vunion_node
* vhash_join_node

The vectorized exec engine can run the SSB and TPC-H standard query test sets and 70% of the TPC-DS standard query test set.

### 3. Data Model

The vectorized exec engine supports **Dup/Agg/Unq** tables, and the block reader is vectorized.
Segment-level vectorization is a work in progress.

### 4. How to use

1. Set the session variable `set enable_vectorized_engine = true;` (required)
2. Set the session variable `set batch_size = 4096;` (recommended)

### 5. Some differences from the original exec engine

https://github.com/doris-vectorized/doris-vectorized/issues/294

## Checklist(Required)

1. Does it affect the original behavior: (No)
2. Have unit tests been added: (Yes)
3. Has documentation been added or modified: (No)
4. Does it need to update dependencies: (No)
5. Are there any changes that cannot be rolled back: (Yes)
This commit is contained in:
HappenLee
2022-01-18 10:07:15 +08:00
committed by GitHub
parent ebc27a40d7
commit e1d7233e9c
498 changed files with 69593 additions and 479 deletions


@@ -0,0 +1,471 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
// This file is copied from
// https://github.com/ClickHouse/ClickHouse/blob/master/src/Functions/FunctionBitmap.h
// and modified by Doris
#include "util/string_parser.hpp"
#include "vec/functions/function_totype.h"
#include "vec/functions/function_const.h"
#include "vec/functions/simple_function_factory.h"
#include "vec/functions/function_string.h"
#include "gutil/strings/numbers.h"
#include "gutil/strings/split.h"
namespace doris::vectorized {
struct BitmapEmpty {
static constexpr auto name = "bitmap_empty";
using ReturnColVec = ColumnBitmap;
static DataTypePtr get_return_type() {
return std::make_shared<DataTypeBitMap>();
}
static auto init_value() {
return BitmapValue{};
}
};
struct NameToBitmap {
static constexpr auto name = "to_bitmap";
};
struct ToBitmapImpl {
using ReturnType = DataTypeBitMap;
static constexpr auto TYPE_INDEX = TypeIndex::String;
using Type = String;
using ReturnColumnType = ColumnBitmap;
static Status vector(const ColumnString::Chars& data, const ColumnString::Offsets& offsets,
std::vector<BitmapValue>& res) {
auto size = offsets.size();
res.reserve(size);
for (size_t i = 0; i < size; ++i) {
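// Row i spans [offsets[i - 1], offsets[i]); offsets[-1] is assumed to read as 0
// (left-padded array, as in ClickHouse). The trailing "- 1" skips the terminating zero byte.
const char* raw_str = reinterpret_cast<const char*>(&data[offsets[i - 1]]);
size_t str_size = offsets[i] - offsets[i - 1] - 1;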
StringParser::ParseResult parse_result = StringParser::PARSE_SUCCESS;
uint64_t int_value = StringParser::string_to_unsigned_int<uint64_t>(raw_str, str_size,
&parse_result);
// TODO: the check below causes problems for to_bitmap(null); rethink how to solve the
// null problem
// if (UNLIKELY(parse_result != StringParser::PARSE_SUCCESS)) {
// return Status::RuntimeError(
// fmt::format("The input: {:.{}} is not valid, to_bitmap only support bigint "
// "value from 0 to 18446744073709551615 currently",
// raw_str, str_size));
// }
res.emplace_back();
res.back().add(int_value);
}
return Status::OK();
}
};
struct NameBitmapFromString {
static constexpr auto name = "bitmap_from_string";
};
struct BitmapFromString {
using ReturnType = DataTypeBitMap;
static constexpr auto TYPE_INDEX = TypeIndex::String;
using Type = String;
using ReturnColumnType = ColumnBitmap;
static Status vector(const ColumnString::Chars& data, const ColumnString::Offsets& offsets,
std::vector<BitmapValue>& res) {
auto size = offsets.size();
res.reserve(size);
std::vector<uint64_t> bits;
for (size_t i = 0; i < size; ++i) {
const char* raw_str = reinterpret_cast<const char*>(&data[offsets[i - 1]]);
int str_size = offsets[i] - offsets[i - 1] - 1;
if (SplitStringAndParse({raw_str, str_size}, ",", &safe_strtou64, &bits)) {
res.emplace_back(bits);
} else {
res.emplace_back();
}
bits.clear();
}
return Status::OK();
}
};
struct NameBitmapHash {
static constexpr auto name = "bitmap_hash";
};
struct BitmapHash {
using ReturnType = DataTypeBitMap;
static constexpr auto TYPE_INDEX = TypeIndex::String;
using Type = String;
using ReturnColumnType = ColumnBitmap;
static Status vector(const ColumnString::Chars& data, const ColumnString::Offsets& offsets,
std::vector<BitmapValue>& res) {
auto size = offsets.size();
res.reserve(size);
for (size_t i = 0; i < size; ++i) {
const char* raw_str = reinterpret_cast<const char*>(&data[offsets[i - 1]]);
size_t str_size = offsets[i] - offsets[i - 1] - 1;
uint32_t hash_value =
HashUtil::murmur_hash3_32(raw_str, str_size, HashUtil::MURMUR3_32_SEED);
res.emplace_back();
res.back().add(hash_value);
}
return Status::OK();
}
};
struct NameBitmapCount {
static constexpr auto name = "bitmap_count";
};
struct BitmapCount {
using ReturnType = DataTypeInt64;
static constexpr auto TYPE_INDEX = TypeIndex::BitMap;
using Type = DataTypeBitMap::FieldType;
using ReturnColumnType = ColumnVector<Int64>;
using ReturnColumnContainer = ColumnVector<Int64>::Container;
static Status vector(const std::vector<BitmapValue>& data, ReturnColumnContainer& res) {
size_t size = data.size();
res.reserve(size);
for (size_t i = 0; i < size; ++i) {
res.push_back(data[i].cardinality());
}
return Status::OK();
}
};
struct NameBitmapAnd {
static constexpr auto name = "bitmap_and";
};
template <typename LeftDataType, typename RightDataType>
struct BitmapAnd {
using ResultDataType = DataTypeBitMap;
using T0 = typename LeftDataType::FieldType;
using T1 = typename RightDataType::FieldType;
using TData = std::vector<BitmapValue>;
static Status vector_vector(const TData& lvec, const TData& rvec, TData& res) {
size_t size = lvec.size();
for (size_t i = 0; i < size; ++i) {
res[i] = lvec[i];
res[i] &= rvec[i];
}
return Status::OK();
}
};
struct NameBitmapOr {
static constexpr auto name = "bitmap_or";
};
template <typename LeftDataType, typename RightDataType>
struct BitmapOr {
using ResultDataType = DataTypeBitMap;
using T0 = typename LeftDataType::FieldType;
using T1 = typename RightDataType::FieldType;
using TData = std::vector<BitmapValue>;
static Status vector_vector(const TData& lvec, const TData& rvec, TData& res) {
size_t size = lvec.size();
for (size_t i = 0; i < size; ++i) {
res[i] = lvec[i];
res[i] |= rvec[i];
}
return Status::OK();
}
};
struct NameBitmapXor {
static constexpr auto name = "bitmap_xor";
};
template <typename LeftDataType, typename RightDataType>
struct BitmapXor {
using ResultDataType = DataTypeBitMap;
using T0 = typename LeftDataType::FieldType;
using T1 = typename RightDataType::FieldType;
using TData = std::vector<BitmapValue>;
static Status vector_vector(const TData& lvec, const TData& rvec, TData& res) {
size_t size = lvec.size();
for (size_t i = 0; i < size; ++i) {
res[i] = lvec[i];
res[i] ^= rvec[i];
}
return Status::OK();
}
};
struct NameBitmapNot {
static constexpr auto name = "bitmap_not";
};
template <typename LeftDataType, typename RightDataType>
struct BitmapNot {
using ResultDataType = DataTypeBitMap;
using T0 = typename LeftDataType::FieldType;
using T1 = typename RightDataType::FieldType;
using TData = std::vector<BitmapValue>;
static Status vector_vector(const TData& lvec, const TData& rvec, TData& res) {
size_t size = lvec.size();
for (size_t i = 0; i < size; ++i) {
res[i] = lvec[i];
res[i] -= rvec[i];
}
return Status::OK();
}
};
struct NameBitmapContains {
static constexpr auto name = "bitmap_contains";
};
template <typename LeftDataType, typename RightDataType>
struct BitmapContains {
using ResultDataType = DataTypeUInt8;
using T0 = typename LeftDataType::FieldType;
using T1 = typename RightDataType::FieldType;
using LTData = std::vector<BitmapValue>;
using RTData = typename ColumnVector<T1>::Container;
using ResTData = typename ColumnVector<UInt8>::Container;
static Status vector_vector(const LTData& lvec, const RTData& rvec, ResTData& res) {
size_t size = lvec.size();
for (size_t i = 0; i < size; ++i) {
res[i] = lvec[i].contains(rvec[i]);
}
return Status::OK();
}
};
struct NameBitmapHasAny {
static constexpr auto name = "bitmap_has_any";
};
template <typename LeftDataType, typename RightDataType>
struct BitmapHasAny {
using ResultDataType = DataTypeUInt8;
using T0 = typename LeftDataType::FieldType;
using T1 = typename RightDataType::FieldType;
using TData = std::vector<BitmapValue>;
using ResTData = typename ColumnVector<UInt8>::Container;
static Status vector_vector(const TData& lvec, const TData& rvec, ResTData& res) {
size_t size = lvec.size();
for (size_t i = 0; i < size; ++i) {
BitmapValue bitmap = lvec[i];
bitmap &= rvec[i];
res[i] = bitmap.cardinality() != 0;
}
return Status::OK();
}
};
struct NameBitmapMin {
static constexpr auto name = "bitmap_min";
};
struct BitmapMin {
using ReturnType = DataTypeInt64;
static constexpr auto TYPE_INDEX = TypeIndex::BitMap;
using Type = DataTypeBitMap::FieldType;
using ReturnColumnType = ColumnVector<Int64>;
using ReturnColumnContainer = ColumnVector<Int64>::Container;
static Status vector(const std::vector<BitmapValue>& data, ReturnColumnContainer& res) {
size_t size = data.size();
res.reserve(size);
for (size_t i = 0; i < size; ++i) {
auto min = const_cast<std::vector<BitmapValue>&>(data)[i].minimum();
res.push_back(min.val);
}
return Status::OK();
}
};
struct NameBitmapMax {
static constexpr auto name = "bitmap_max";
};
struct BitmapMax {
using ReturnType = DataTypeInt64;
static constexpr auto TYPE_INDEX = TypeIndex::BitMap;
using Type = DataTypeBitMap::FieldType;
using ReturnColumnType = ColumnVector<Int64>;
using ReturnColumnContainer = ColumnVector<Int64>::Container;
static Status vector(const std::vector<BitmapValue>& data, ReturnColumnContainer& res) {
size_t size = data.size();
res.reserve(size);
for (size_t i = 0; i < size; ++i) {
auto max = const_cast<std::vector<BitmapValue>&>(data)[i].maximum();
res.push_back(max.val);
}
return Status::OK();
}
};
struct NameBitmapToString {
static constexpr auto name = "bitmap_to_string";
};
struct BitmapToString {
using ReturnType = DataTypeString;
static constexpr auto TYPE_INDEX = TypeIndex::BitMap;
using Type = DataTypeBitMap::FieldType;
using ReturnColumnType = ColumnString;
using Chars = ColumnString::Chars;
using Offsets = ColumnString::Offsets;
static Status vector(const std::vector<BitmapValue>& data, Chars& chars, Offsets& offsets) {
size_t size = data.size();
offsets.resize(size);
chars.reserve(size);
for (size_t i = 0; i < size; ++i) {
StringOP::push_value_string(data[i].to_string(), i, chars, offsets);
}
return Status::OK();
}
};
struct NameBitmapAndCount {
static constexpr auto name = "bitmap_and_count";
};
template <typename LeftDataType, typename RightDataType>
struct BitmapAndCount {
using ResultDataType = DataTypeInt64;
using T0 = typename LeftDataType::FieldType;
using T1 = typename RightDataType::FieldType;
using TData = std::vector<BitmapValue>;
using ResTData = typename ColumnVector<Int64>::Container;
static Status vector_vector(const TData& lvec, const TData& rvec, ResTData& res) {
size_t size = lvec.size();
BitmapValue val;
for (size_t i = 0; i < size; ++i) {
val |= lvec[i];
val &= rvec[i];
res[i] = val.cardinality();
val.clear();
}
return Status::OK();
}
};
struct NameBitmapOrCount {
static constexpr auto name = "bitmap_or_count";
};
template <typename LeftDataType, typename RightDataType>
struct BitmapOrCount {
using ResultDataType = DataTypeInt64;
using T0 = typename LeftDataType::FieldType;
using T1 = typename RightDataType::FieldType;
using TData = std::vector<BitmapValue>;
using ResTData = typename ColumnVector<Int64>::Container;
static Status vector_vector(const TData& lvec, const TData& rvec, ResTData& res) {
size_t size = lvec.size();
BitmapValue val;
for (size_t i = 0; i < size; ++i) {
val |= lvec[i];
val |= rvec[i];
res[i] = val.cardinality();
val.clear();
}
return Status::OK();
}
};
struct NameBitmapXorCount {
static constexpr auto name = "bitmap_xor_count";
};
template <typename LeftDataType, typename RightDataType>
struct BitmapXorCount {
using ResultDataType = DataTypeInt64;
using T0 = typename LeftDataType::FieldType;
using T1 = typename RightDataType::FieldType;
using TData = std::vector<BitmapValue>;
using ResTData = typename ColumnVector<Int64>::Container;
static Status vector_vector(const TData& lvec, const TData& rvec, ResTData& res) {
size_t size = lvec.size();
BitmapValue val;
for (size_t i = 0; i < size; ++i) {
val |= lvec[i];
val ^= rvec[i];
res[i] = val.cardinality();
val.clear();
}
return Status::OK();
}
};
using FunctionBitmapEmpty = FunctionConst<BitmapEmpty, false>;
using FunctionToBitmap = FunctionUnaryToType<ToBitmapImpl, NameToBitmap>;
using FunctionBitmapFromString = FunctionUnaryToType<BitmapFromString, NameBitmapFromString>;
using FunctionBitmapHash = FunctionUnaryToType<BitmapHash, NameBitmapHash>;
using FunctionBitmapCount = FunctionUnaryToType<BitmapCount, NameBitmapCount>;
using FunctionBitmapAndCount =
FunctionBinaryToType<DataTypeBitMap, DataTypeBitMap, BitmapAndCount, NameBitmapAndCount>;
using FunctionBitmapOrCount =
FunctionBinaryToType<DataTypeBitMap, DataTypeBitMap, BitmapOrCount, NameBitmapOrCount>;
using FunctionBitmapXorCount =
FunctionBinaryToType<DataTypeBitMap, DataTypeBitMap, BitmapXorCount, NameBitmapXorCount>;
using FunctionBitmapMin = FunctionUnaryToType<BitmapMin, NameBitmapMin>;
using FunctionBitmapMax = FunctionUnaryToType<BitmapMax, NameBitmapMax>;
using FunctionBitmapToString = FunctionUnaryToType<BitmapToString, NameBitmapToString>;
using FunctionBitmapAnd =
FunctionBinaryToType<DataTypeBitMap, DataTypeBitMap, BitmapAnd, NameBitmapAnd>;
using FunctionBitmapOr =
FunctionBinaryToType<DataTypeBitMap, DataTypeBitMap, BitmapOr, NameBitmapOr>;
using FunctionBitmapXor =
FunctionBinaryToType<DataTypeBitMap, DataTypeBitMap, BitmapXor, NameBitmapXor>;
using FunctionBitmapNot =
FunctionBinaryToType<DataTypeBitMap, DataTypeBitMap, BitmapNot, NameBitmapNot>;
using FunctionBitmapContains =
FunctionBinaryToType<DataTypeBitMap, DataTypeInt64, BitmapContains, NameBitmapContains>;
using FunctionBitmapHasAny =
FunctionBinaryToType<DataTypeBitMap, DataTypeBitMap, BitmapHasAny, NameBitmapHasAny>;
void register_function_bitmap(SimpleFunctionFactory& factory) {
factory.register_function<FunctionBitmapEmpty>();
factory.register_function<FunctionToBitmap>();
factory.register_function<FunctionBitmapFromString>();
factory.register_function<FunctionBitmapHash>();
factory.register_function<FunctionBitmapCount>();
factory.register_function<FunctionBitmapAndCount>();
factory.register_function<FunctionBitmapOrCount>();
factory.register_function<FunctionBitmapXorCount>();
factory.register_function<FunctionBitmapMin>();
factory.register_function<FunctionBitmapMax>();
factory.register_function<FunctionBitmapToString>();
factory.register_function<FunctionBitmapAnd>();
factory.register_function<FunctionBitmapOr>();
factory.register_function<FunctionBitmapXor>();
factory.register_function<FunctionBitmapNot>();
factory.register_function<FunctionBitmapContains>();
factory.register_function<FunctionBitmapHasAny>();
}
} // namespace doris::vectorized
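
A note on the string-column loops in this file (`ToBitmapImpl`, `BitmapFromString`, `BitmapHash`): they all walk a flat byte buffer plus an array of end offsets, and compute each row's length as `offsets[i] - offsets[i - 1] - 1`. The `- 1` drops the terminating zero byte stored after every string. The `i - 1` index at `i == 0` appears to rely on the ClickHouse-style left-padded array, where the element at index -1 reads as 0. The sketch below reproduces that layout with plain standard-library types to make the arithmetic easier to follow. Everything in it is a hypothetical stand-in, not Doris code: `Chars`, `Offsets`, `Bitmap` (a `std::set` in place of `BitmapValue`), `append_string`, and `parse_csv_u64` (in place of `SplitStringAndParse` with `safe_strtou64`, whose exact edge-case behavior is not reproduced here).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-ins for ColumnString::Chars, ColumnString::Offsets and BitmapValue.
using Chars = std::vector<uint8_t>;
using Offsets = std::vector<uint64_t>;
using Bitmap = std::set<uint64_t>;

// Append one string plus its terminating zero byte and record the end offset.
void append_string(Chars& chars, Offsets& offsets, const std::string& s) {
    chars.insert(chars.end(), s.begin(), s.end());
    chars.push_back(0);
    offsets.push_back(chars.size());
}

// Parse "1,2,3" into out. Returns false on an empty or non-numeric token.
// Overflow is not checked in this sketch.
bool parse_csv_u64(const char* str, size_t len, std::vector<uint64_t>& out) {
    size_t pos = 0;
    while (pos <= len) {
        size_t end = pos;
        uint64_t value = 0;
        while (end < len && str[end] != ',') {
            if (str[end] < '0' || str[end] > '9') return false;
            value = value * 10 + static_cast<uint64_t>(str[end] - '0');
            ++end;
        }
        if (end == pos) return false; // empty token, e.g. "" or "1,,2"
        out.push_back(value);
        pos = end + 1; // skip the comma
    }
    return true;
}

// Row-wise conversion with the same loop shape as BitmapFromString::vector.
std::vector<Bitmap> bitmap_from_string(const Chars& chars, const Offsets& offsets) {
    std::vector<Bitmap> res;
    res.reserve(offsets.size());
    std::vector<uint64_t> bits;
    for (size_t i = 0; i < offsets.size(); ++i) {
        // std::vector has no element at index -1, so row 0 starts at 0 explicitly.
        uint64_t begin = (i == 0) ? 0 : offsets[i - 1];
        size_t len = offsets[i] - begin - 1; // drop the terminating zero byte
        const char* raw = reinterpret_cast<const char*>(chars.data() + begin);
        bits.clear();
        if (parse_csv_u64(raw, len, bits)) {
            res.emplace_back(bits.begin(), bits.end());
        } else {
            res.emplace_back(); // invalid input yields an empty bitmap
        }
    }
    return res;
}
```

Appending the rows `"1,2,3"`, `"x,1"` and `"42"` with `append_string` and passing the buffers to `bitmap_from_string` gives three bitmaps: the first holds 1, 2 and 3, the second is empty because the row does not parse, and the third holds 42. This matches the fallback to an empty bitmap in `BitmapFromString::vector`.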