Also fix BE UTs:
1. Fix the scheme_change_test memory leak.
2. Fix mem_pool_test.
Do not use `DEFAULT_PADDING_SIZE = 0x10` in mem_pool when running UTs (see the sketch after this list).
3. Remove plugin_test.
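A minimal sketch of the padding change for item 2, assuming the pool is compiled with a test macro such as `BE_TEST` (the macro name and the exact constants are illustrative, not necessarily the actual Doris code):
```cpp
// Sketch only: disable mem_pool padding in unit-test builds so the tests
// (and ASAN) see exact allocation sizes. BE_TEST is an assumed test macro.
#include <cstddef>

#ifdef BE_TEST
static constexpr size_t DEFAULT_PADDING_SIZE = 0;
#else
static constexpr size_t DEFAULT_PADDING_SIZE = 0x10;
#endif
```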
* Make ASAN poisoning work as much as possible.
Before this patch, a use-after-poison was reported as an unknown-crash, like below:
==19305==ERROR: AddressSanitizer: unknown-crash on address
0x625000137013 at pc 0x561c44bcf6b8 bp 0x7ffb75a00910 sp 0x7ffb75a000b8
After this patch, the same use-after-poison is reported correctly, like below:
==17782==ERROR: AddressSanitizer: use-after-poison on address
0x625000137033 at pc 0x55633c8f56b8 bp 0x7ff3dc437930 sp 0x7ff3dc43
Before this patch, a false memory-usage report could also be produced, like below:
==33080==AddressSanitizer CHECK failed: ../../../../src/libsanitizer/
asan/asan_allocator.cpp:189 "((old)) == ((kAllocBegMagic))"
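As a rough illustration of how the poisoning is expected to behave (a sketch, not the actual Doris MemPool code), ASAN's manual poisoning interface marks the unused part of a pool chunk as off-limits, so a stray read is reported as use-after-poison instead of an unknown crash:
```cpp
// Sketch only, not the Doris MemPool implementation.
#include <sanitizer/asan_interface.h>
#include <cstddef>
#include <cstdlib>

struct PoolChunk {
    char* data;
    size_t size;
    size_t used = 0;
};

PoolChunk pool_new_chunk(size_t size) {
    PoolChunk chunk{static_cast<char*>(std::malloc(size)), size};
    // Poison the whole chunk up front; bytes are unpoisoned only when handed out.
    ASAN_POISON_MEMORY_REGION(chunk.data, chunk.size);
    return chunk;
}

char* pool_allocate(PoolChunk& chunk, size_t bytes) {
    if (chunk.used + bytes > chunk.size) return nullptr;
    char* ptr = chunk.data + chunk.used;
    chunk.used += bytes;
    // Unpoison only the returned region; reading past it (or reading bytes that
    // were never allocated) now triggers a use-after-poison report.
    ASAN_UNPOISON_MEMORY_REGION(ptr, bytes);
    return ptr;
}
```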
Support a per-disk thread pool for scanners, to prevent a few disks with high I/O util from dragging down the performance of the whole scanner pool.
Key points:
1. Each disk has its own thread pool for scanners.
2. Whenever one disk's thread pool runs out of local work, it can steal tasks from the other disks' pools, in round-robin order (see the sketch below).
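A condensed sketch of the scheduling idea (illustrative only, not the actual scanner scheduler): each disk owns its own task queue, and a worker whose local queue is empty probes the other disks' queues in round-robin order:
```cpp
// Sketch only: one scan-task queue per disk; an idle worker steals tasks
// from the other disks' queues, probing them round-robin.
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

using ScanTask = std::function<void()>;

struct DiskQueue {
    std::mutex mu;
    std::deque<ScanTask> tasks;
};

class PerDiskScannerPool {
public:
    explicit PerDiskScannerPool(size_t num_disks) : _queues(num_disks) {}

    void submit(size_t disk_idx, ScanTask task) {
        std::lock_guard<std::mutex> lock(_queues[disk_idx].mu);
        _queues[disk_idx].tasks.push_back(std::move(task));
    }

    // Called by a worker bound to disk `disk_idx`: prefer local work,
    // otherwise steal from the other disks in round-robin order.
    std::optional<ScanTask> next_task(size_t disk_idx) {
        if (auto task = pop(disk_idx)) return task;
        for (size_t i = 1; i < _queues.size(); ++i) {
            if (auto task = pop((disk_idx + i) % _queues.size())) return task;
        }
        return std::nullopt;
    }

private:
    std::optional<ScanTask> pop(size_t idx) {
        std::lock_guard<std::mutex> lock(_queues[idx].mu);
        if (_queues[idx].tasks.empty()) return std::nullopt;
        ScanTask task = std::move(_queues[idx].tasks.front());
        _queues[idx].tasks.pop_front();
        return task;
    }

    std::vector<DiskQueue> _queues;
};
```
Each disk's worker threads loop on `next_task`, so scanners on a disk with high I/O util cannot starve the scanners of the other disks, while idle disks still help drain the backlog.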
Performance testing:
vec version: 25% faster than a single thread pool in a high-I/O-util disk test case.
normal version: 8% faster than a single thread pool in a high-I/O-util disk test case.
1. Fix a BE crash caused by destruction order. (close#8058)
2. Add a new BE config `compaction_task_num_per_fast_disk`.
This config specifies the maximum number of concurrent compaction tasks on a fast disk (typically SSD),
so that high-speed disks can run more compaction tasks at the same time
and compact data as soon as possible (see the sketch after this list).
3. Avoid frequently selecting unqualified tablets for compaction.
4. Lower some log levels to reduce the BE log size.
5. Modify some clone logic to handle errors correctly.
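A rough sketch of how the new config from item 2 would be consulted when picking the compaction task limit for a data directory (illustrative only; the generic `compaction_task_num_per_disk` counterpart and the medium check are assumptions, not necessarily the actual Doris code):
```cpp
// Sketch only: compaction_task_num_per_fast_disk is the config added by this
// change; the slow-disk counterpart and the medium enum are illustrative.
#include <cstdint>

enum class StorageMedium { HDD, SSD };

struct CompactionConfig {
    int32_t compaction_task_num_per_disk = 2;       // assumed existing default
    int32_t compaction_task_num_per_fast_disk = 4;  // example value for SSD
};

int32_t max_compaction_tasks_on(StorageMedium medium, const CompactionConfig& cfg) {
    // Fast disks (typically SSD) can sustain more concurrent compaction I/O,
    // so they get a higher limit and data is compacted sooner.
    return medium == StorageMedium::SSD ? cfg.compaction_task_num_per_fast_disk
                                        : cfg.compaction_task_num_per_disk;
}
```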
Two-phase batch commit means:
during a Stream Load, a message is returned to the client after the data is written;
at this point the data is invisible and the transaction status is PRECOMMITTED.
The data becomes visible only after the client triggers COMMIT.
1. The user can invoke the following interface to trigger a commit of the transaction:
curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:commit" \
http://fe_host:http_port/api/{db}/_stream_load_2pc
or
curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:commit" \
http://be_host:webserver_port/api/{db}/_stream_load_2pc
2. The user can invoke the following interface to trigger an abort of the transaction:
curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:abort" \
http://fe_host:http_port/api/{db}/_stream_load_2pc
or
curl -X PUT --location-trusted -u user:passwd -H "txn_id:txnId" -H "txn_operation:abort" \
http://be_host:webserver_port/api/{db}/_stream_load_2pc
1. Set both `tuple_offsets` and `new_tuple_offsets` in PRowBatch for compatibility.
2. Set the FE config `repair_slow_replica` to false by default,
to avoid impacting the load process after upgrading.
E.g., if there are only 2 replicas and one has a high version count, that replica
would be marked bad after the upgrade, and the load process would stop
because only 1 replica is alive.
3. Fix a bug where NodeChannel may be blocked at `close_wait()`
because we forgot to set the `add_batch_finish` flag after the last RPC finished.
4. Fix an NPE in RoutineLoadScheduler.
Support implementing UDFs through the gRPC protocol. This brings several benefits:
1. The UDF implementation language is not limited to C++; users can implement UDFs in any language they are familiar with.
2. UDFs are decoupled from Doris: a UDF cannot cause a Doris coredump, UDF computing resources are separated from Doris, and the Doris service is not affected.
However, an RPC UDF has a fixed per-call overhead, so it is much slower than a C++ UDF, especially when the amount of data is large.
Create the function like:
```
CREATE FUNCTION rpc_add(INT, INT) RETURNS INT PROPERTIES (
"SYMBOL"="add_int",
"OBJECT_FILE"="127.0.0.1:9999",
"TYPE"="RPC"
);
```
The function service needs to implement the `check_fn` and `fn_call` methods.
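For illustration, a server for the `rpc_add` function above might look roughly like the following; the generated header, the service name `PFunctionService`, and the request/response message names are assumptions based on a typical gRPC setup, not the exact Doris proto:
```cpp
// Sketch only: proto, service, and message names are assumptions, not the
// exact Doris function service definition.
#include <grpcpp/grpcpp.h>
#include "function_service.grpc.pb.h"  // assumed generated header

class AddIntService final : public PFunctionService::Service {
public:
    // check_fn: verify that the requested symbol and signature are served here,
    // e.g. the "add_int" symbol declared in CREATE FUNCTION.
    grpc::Status check_fn(grpc::ServerContext*, const PCheckFunctionRequest*,
                          PCheckFunctionResponse*) override {
        return grpc::Status::OK;
    }

    // fn_call: compute the result column for a batch of input rows; for
    // add_int(INT, INT) this is the element-wise sum of the two argument columns.
    grpc::Status fn_call(grpc::ServerContext*, const PFunctionCallRequest*,
                         PFunctionCallResponse*) override {
        return grpc::Status::OK;
    }
};

int main() {
    AddIntService service;
    grpc::ServerBuilder builder;
    // Listen on the address given as OBJECT_FILE in CREATE FUNCTION.
    builder.AddListeningPort("0.0.0.0:9999", grpc::InsecureServerCredentials());
    builder.RegisterService(&service);
    auto server = builder.BuildAndStart();
    server->Wait();
}
```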
Note:
THIS IS AN EXPERIMENTAL FEATURE, THE INTERFACE AND DATA STRUCTURE MAY BE CHANGED IN FUTURE !!!
This PR mainly changes:
1. Fix a bug when `transfer_data_by_brpc_attachment` is enabled.
In `data_stream_sender`, we send one serialized PRowBatch to multiple Channels.
With `transfer_data_by_brpc_attachment` enabled, we mistakenly cleared the data in PRowBatch
after sending it to the first Channel.
As a result, the subsequent Channels could not receive the correct data, causing an error.
So I use a separate buffer, instead of `tuple_data` in PRowBatch, to store the serialized data
and reuse it across multiple channels.
2. Fix a bug where the offsets in a serialized row batch may overflow.
Use int64 instead of int32 offsets; for compatibility, add a new field `new_tuple_offsets` to PRowBatch (see the sketch below).
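A rough sketch of the compatibility handling (the helper names are made up; the PRowBatch accessors follow the usual protobuf-generated C++ API for repeated fields, with namespace qualifiers omitted):
```cpp
// Sketch only: helpers are illustrative; the generated header path is assumed.
#include <cstdint>
#include <vector>
#include "gen_cpp/data.pb.h"  // assumed location of the generated PRowBatch

void write_offsets(const std::vector<int64_t>& offsets, PRowBatch* batch) {
    for (int64_t off : offsets) {
        batch->add_new_tuple_offsets(off);  // new int64 field, cannot overflow
        // Keep filling the legacy int32 field so old receivers that only read
        // tuple_offsets continue to work (values beyond INT32_MAX truncate there).
        batch->add_tuple_offsets(static_cast<int32_t>(off));
    }
}

int64_t read_offset(const PRowBatch& batch, int i) {
    // Prefer the new field; batches from old senders only carry the int32 field.
    return batch.new_tuple_offsets_size() > 0 ? batch.new_tuple_offsets(i)
                                              : batch.tuple_offsets(i);
}
```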
The tuple ids of the empty set node must be exactly the same as the tuple ids of the original root node.
In the issue, we found that once the tree containing the root node has a window function,
the tuple ids of the empty set node cannot be calculated correctly.
This PR mostly fixes that problem.
To calculate the correct tuple ids, instead of using the tuple ids obtained from
`SelectStmt.getMaterializedTupleIds()` as before,
we now directly use the tuple ids of the original root node.
Although we tried to fix#7929 by modifying `SelectStmt.getMaterializedTupleIds()`,
that approach cannot obtain the tuple of the last correct window function,
so we construct the tuple ids of empty set nodes in a different way.
Currently, if we encounter a problem with a replica of a tablet during the load process,
such as a write error, an RPC error, -235, etc., the entire load job fails,
which significantly reduces Doris' fault tolerance.
This PR mainly changes:
1. Refine the judgment of failed replicas in the load process, so that the failure of a minority of replicas does not affect the normal completion of the load job (e.g., with 3 replicas, a load can still finish when a single replica fails, as long as a majority succeed).
2. Fix a bug introduced by #7754 that may cause a BE coredump.
This PR mainly changes:
1. Help cancel the load job ASAP when unqualified data is encountered.
The solution is described in #6318.
Also replace some std::stringstream usage with fmt::memory_buffer to avoid performance issues (see the sketch after this list).
2. Fix an NPE bug when creating a user with an empty host.
3. Fix compile warnings after rebasing on master (vectorization).
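For reference, a minimal sketch of the std::stringstream to fmt::memory_buffer swap mentioned in item 1 (the message text and variable names are made up):
```cpp
#include <fmt/format.h>
#include <cstdint>
#include <iterator>
#include <string>

// Building the error message with fmt::memory_buffer avoids the locale and
// allocation overhead of std::stringstream on hot error-reporting paths.
std::string build_cancel_msg(int64_t tablet_id, const std::string& reason) {
    fmt::memory_buffer buf;
    fmt::format_to(std::back_inserter(buf),
                   "cancel load job, tablet={}, reason={}", tablet_id, reason);
    return fmt::to_string(buf);
}
```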
# Proposed changes
Issue Number: close#6238
Co-authored-by: HappenLee <happenlee@hotmail.com>
Co-authored-by: stdpain <34912776+stdpain@users.noreply.github.com>
Co-authored-by: Zhengguo Yang <yangzhgg@gmail.com>
Co-authored-by: wangbo <506340561@qq.com>
Co-authored-by: emmymiao87 <522274284@qq.com>
Co-authored-by: Pxl <952130278@qq.com>
Co-authored-by: zhangstar333 <87313068+zhangstar333@users.noreply.github.com>
Co-authored-by: thinker <zchw100@qq.com>
Co-authored-by: Zeno Yang <1521564989@qq.com>
Co-authored-by: Wang Shuo <wangshuo128@gmail.com>
Co-authored-by: zhoubintao <35688959+zbtzbtzbt@users.noreply.github.com>
Co-authored-by: Gabriel <gabrielleebuaa@gmail.com>
Co-authored-by: xinghuayu007 <1450306854@qq.com>
Co-authored-by: weizuo93 <weizuo@apache.org>
Co-authored-by: yiguolei <guoleiyi@tencent.com>
Co-authored-by: anneji-dev <85534151+anneji-dev@users.noreply.github.com>
Co-authored-by: awakeljw <993007281@qq.com>
Co-authored-by: taberylyang <95272637+taberylyang@users.noreply.github.com>
Co-authored-by: Cui Kaifeng <48012748+azurenake@users.noreply.github.com>
## Problem Summary:
### 1. Some code from ClickHouse
**ClickHouse is an excellent vectorized execution engine implementation,
and we have referenced and learned a lot from its data structures
and function implementations.
Our work is based on ClickHouse v19.16.2.2, and we would like to thank the ClickHouse community and its developers.**
The following comment has been added to the code taken from ClickHouse, e.g.:
// This file is copied from
// https://github.com/ClickHouse/ClickHouse/blob/master/src/Interpreters/AggregationCommon.h
// and modified by Doris
### 2. Supported exec nodes and queries:
* vaggregation_node
* vanalytic_eval_node
* vassert_num_rows_node
* vblocking_join_node
* vcross_join_node
* vempty_set_node
* ves_http_scan_node
* vexcept_node
* vexchange_node
* vintersect_node
* vmysql_scan_node
* vodbc_scan_node
* volap_scan_node
* vrepeat_node
* vschema_scan_node
* vselect_node
* vset_operation_node
* vsort_node
* vunion_node
* vhash_join_node
You can run the SSB/TPC-H query sets and about 70% of the TPC-DS standard query set on the new exec engine.
### 3. Data Model
The vectorized exec engine supports **Dup/Agg/Unq** tables, and the Block Reader is vectorized.
Vectorized segment reading is a work in progress.
### 4. How to use
1. Set the session variable `set enable_vectorized_engine = true;` (required)
2. Set the session variable `set batch_size = 4096;` (recommended)
### 5. Some differences from the original exec engine
https://github.com/doris-vectorized/doris-vectorized/issues/294
## Checklist(Required)
1. Does it affect the original behavior: (No)
2. Have unit tests been added: (Yes)
3. Has documentation been added or modified: (No)
4. Does it need to update dependencies: (No)
5. Are there any changes that cannot be rolled back: (Yes)
If a load task has a relatively short timeout, we need to ensure that
each RPC of this task is not blocked for a long time.
An RPC is usually blocked for one of two reasons:
1. Handling "memory exceeds limit" inside the RPC.
If the system finds that the memory occupied by loads exceeds the threshold,
it selects the load channel that occupies the most memory and flushes its memtables.
This is done inside the RPC and may be time-consuming.
2. Closing the load channel.
When the load channel receives the last batch, it ends the task
and waits synchronously for all memtable flushes to finish. This process is also time-consuming.
Therefore, this PR solves the problem as follows (see the sketch after this list):
1. Use the timeout to determine whether a load task is high priority.
If the timeout of a load task is relatively short, we mark it as a high-priority task.
2. Do not process "memory exceeds limit" for high-priority tasks.
3. Use a separate flush thread to flush memtables for high-priority tasks.
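A condensed sketch of these three points (the threshold value and all names are illustrative, not the actual Doris config or code):
```cpp
// Sketch only: threshold, names, and the executor interface are assumptions.
#include <chrono>
#include <functional>
#include <utility>

struct LoadTask {
    std::chrono::seconds timeout;
    // 1. A relatively short timeout marks the task as high priority.
    bool high_priority() const { return timeout <= std::chrono::minutes(2); }
};

// 2. High-priority tasks skip the "memory exceeds limit" handling, which would
//    otherwise flush the biggest load channel inside the RPC.
void maybe_reduce_memory(const LoadTask& task, const std::function<void()>& reduce_mem) {
    if (task.high_priority()) return;
    reduce_mem();
}

// 3. High-priority memtable flushes go to a dedicated flush pool so they are
//    never queued behind the flushes of large, long-running loads.
template <typename Executor>
void schedule_flush(const LoadTask& task, std::function<void()> flush,
                    Executor& normal_pool, Executor& high_prio_pool) {
    (task.high_priority() ? high_prio_pool : normal_pool).submit(std::move(flush));
}
```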
Support merging IN predicates when there is a remote target (e.g. shuffle hash join).
Remove the code that implicitly converts an IN predicate to a Bloom filter when there is a remote target.
Close related #7546
If the load result set is empty, or all the load data is filtered out by the `where` condition,
the load no longer fails with the message `all partitions have no load data`, but returns success directly.
First, we need to add a parameter to describe whether the data is local or remote.
Then, we need to support some basic functions for operating on remote storage.