Support implementing UDFs through the gRPC protocol. This brings several benefits:
1. The UDF implementation language is not limited to C++; users can implement UDFs in any language they are familiar with.
2. UDFs are decoupled from Doris: a misbehaving UDF cannot cause a Doris coredump, and its computing resources are separated from Doris, so the Doris service is not affected.
However, an RPC UDF has a fixed per-call overhead, so it is much slower than a C++ UDF, especially when the amount of data is large.
Create a function like:
```
CREATE FUNCTION rpc_add(INT, INT) RETURNS INT PROPERTIES (
"SYMBOL"="add_int",
"OBJECT_FILE"="127.0.0.1:9999",
"TYPE"="RPC"
);
```
The function service needs to implement the `check_fn` and `fn_call` methods, as sketched below.
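A minimal sketch of such a service, assuming a brpc-style protobuf service; the service and message names here (`PFunctionService`, `PCheckFunctionRequest`, and so on) are illustrative assumptions, not the exact protocol definitions:
```cpp
#include <brpc/server.h>

#include "function_service.pb.h"  // hypothetical generated header

class AddIntServiceImpl : public PFunctionService {
public:
    void check_fn(google::protobuf::RpcController* controller,
                  const PCheckFunctionRequest* request,
                  PCheckFunctionResponse* response,
                  google::protobuf::Closure* done) override {
        brpc::ClosureGuard done_guard(done);
        // Called when the function is resolved: verify that the symbol
        // ("add_int") exists and the declared argument/return types match.
    }

    void fn_call(google::protobuf::RpcController* controller,
                 const PFunctionCallRequest* request,
                 PFunctionCallResponse* response,
                 google::protobuf::Closure* done) override {
        brpc::ClosureGuard done_guard(done);
        // Called per batch: read the two INT argument columns from the
        // request, add them element-wise, and fill the result column.
    }
};
```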
Note:
THIS IS AN EXPERIMENTAL FEATURE, THE INTERFACE AND DATA STRUCTURES MAY BE CHANGED IN THE FUTURE !!!
This PR mainly changes:
1. Fix a bug when `transfer_data_by_brpc_attachment` is enabled.
In `data_stream_sender`, a serialized PRowBatch is sent to multiple channels.
When `transfer_data_by_brpc_attachment` is enabled, the data in the PRowBatch was mistakenly cleared
after sending the PRowBatch to the first channel.
As a result, the following channels could not receive the correct data, causing an error.
The fix uses a separate buffer, instead of `tuple_data` in PRowBatch, to store the serialized data
and reuses it across the channels (see the sketch after this list).
2. Fix a bug where the offsets in a serialized row batch may overflow.
Use int64 instead of int32 offsets. For compatibility, a new field `new_tuple_offsets` is added to PRowBatch.
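A minimal sketch of the fix in item 1, with simplified, partly hypothetical names: the serialized bytes live in a buffer owned by the sender and are reused for every channel, instead of being moved out of the PRowBatch after the first send.
```cpp
// Sketch only; method and member names are illustrative, not Doris code.
Status DataStreamSender::_send_to_all_channels(RowBatch* batch) {
    PRowBatch pb_batch;
    std::string serialized;  // sender-owned buffer, reused by every channel
    RETURN_IF_ERROR(batch->serialize(&pb_batch, &serialized));
    for (auto* channel : _channels) {
        // The channel copies `serialized` into its brpc attachment; the
        // buffer is never cleared until all channels have been served.
        RETURN_IF_ERROR(channel->send_batch(pb_batch, serialized));
    }
    return Status::OK();
}
```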
The tuple ids of the empty set node must be exactly the same as the tuple ids of the original root node.
In the issue, we found that once the tree containing the root node has a window function,
the tuple ids of the empty set node cannot be calculated correctly.
This PR mostly fixes that problem.
To calculate the correct tuple ids, instead of obtaining them from `SelectStmt.getMaterializedTupleIds()`
as in the past, we now directly use the tuple ids of the original root node.
Although we tried to fix #7929 by modifying `SelectStmt.getMaterializedTupleIds()`,
that approach cannot obtain the tuple of the last window function correctly,
so we construct the tuple ids of empty set nodes in a different way.
Currently, if we encounter a problem with a replica of a tablet during the load process,
such as a write error, an RPC error, or error -235, the entire load job fails,
which significantly reduces Doris' fault tolerance.
This PR mainly changes:
1. Refine the judgment of failed replicas in the load process, so that the failure of a few replicas does not prevent the load job from completing normally (see the sketch after this list).
2. Fix a bug introduced by #7754 that may cause a BE coredump.
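A minimal sketch of the refined judgment in item 1; the names and the exact rule here are assumptions, but the idea is a majority quorum per tablet:
```cpp
// A tablet write is considered successful as long as a majority (quorum)
// of its replicas succeeded, so a few failed replicas no longer fail the
// whole load job. Illustrative code, not the actual Doris implementation.
bool tablet_write_succeeded(int total_replicas, int failed_replicas) {
    int success_replicas = total_replicas - failed_replicas;
    return success_replicas >= total_replicas / 2 + 1;  // e.g. 2 of 3
}
```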
This PR mainly changes:
1. Cancel the load job as soon as possible when unqualified data is encountered.
The solution is described in #6318.
Some `std::stringstream` usages are also replaced with `fmt::memory_buffer` to avoid performance issues (see the example after this list).
2. Fix an NPE when creating a user with an empty host.
3. Fix compile warnings after rebasing on master (vectorization).
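The `std::stringstream` to `fmt::memory_buffer` replacement mentioned in item 1 looks roughly like this (a generic example, not a verbatim excerpt from the PR):
```cpp
#include <fmt/format.h>

#include <cstdint>
#include <iterator>
#include <string>

// fmt::memory_buffer appends formatted text without the locale and
// virtual-call overhead of iostreams, which matters on per-row hot paths.
std::string build_error_msg(const std::string& column, int64_t row) {
    fmt::memory_buffer buf;
    fmt::format_to(std::back_inserter(buf),
                   "unqualified data in column {} at row {}", column, row);
    return fmt::to_string(buf);
}
```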
# Proposed changes
Issue Number: close #6238
Co-authored-by: HappenLee <happenlee@hotmail.com>
Co-authored-by: stdpain <34912776+stdpain@users.noreply.github.com>
Co-authored-by: Zhengguo Yang <yangzhgg@gmail.com>
Co-authored-by: wangbo <506340561@qq.com>
Co-authored-by: emmymiao87 <522274284@qq.com>
Co-authored-by: Pxl <952130278@qq.com>
Co-authored-by: zhangstar333 <87313068+zhangstar333@users.noreply.github.com>
Co-authored-by: thinker <zchw100@qq.com>
Co-authored-by: Zeno Yang <1521564989@qq.com>
Co-authored-by: Wang Shuo <wangshuo128@gmail.com>
Co-authored-by: zhoubintao <35688959+zbtzbtzbt@users.noreply.github.com>
Co-authored-by: Gabriel <gabrielleebuaa@gmail.com>
Co-authored-by: xinghuayu007 <1450306854@qq.com>
Co-authored-by: weizuo93 <weizuo@apache.org>
Co-authored-by: yiguolei <guoleiyi@tencent.com>
Co-authored-by: anneji-dev <85534151+anneji-dev@users.noreply.github.com>
Co-authored-by: awakeljw <993007281@qq.com>
Co-authored-by: taberylyang <95272637+taberylyang@users.noreply.github.com>
Co-authored-by: Cui Kaifeng <48012748+azurenake@users.noreply.github.com>
## Problem Summary:
### 1. Some code from ClickHouse
**ClickHouse is an excellent vectorized execution engine database,
so we have referenced and learned a lot from its excellent implementation in terms of
data structures and function implementations.
Our work is based on ClickHouse v19.16.2.2, and we would like to thank the ClickHouse community and developers.**
The following comment has been added to the code copied from ClickHouse, e.g.:
// This file is copied from
// https://github.com/ClickHouse/ClickHouse/blob/master/src/Interpreters/AggregationCommon.h
// and modified by Doris
### 2. Supported exec nodes and queries:
* vaggregation_node
* vanalytic_eval_node
* vassert_num_rows_node
* vblocking_join_node
* vcross_join_node
* vempty_set_node
* ves_http_scan_node
* vexcept_node
* vexchange_node
* vintersect_node
* vmysql_scan_node
* vodbc_scan_node
* volap_scan_node
* vrepeat_node
* vschema_scan_node
* vselect_node
* vset_operation_node
* vsort_node
* vunion_node
* vhash_join_node
You can run the SSB/TPC-H query sets and about 70% of the TPC-DS standard query set on the vectorized exec engine.
### 3. Data Model
The vectorized exec engine supports **Dup/Agg/Unq** tables, and the Block Reader is vectorized.
Vectorization of the segment layer is a work in progress.
### 4. How to use
1. Set the session variable `set enable_vectorized_engine = true;` (required)
2. Set the session variable `set batch_size = 4096;` (recommended)
### 5. Some differences from the original exec engine
https://github.com/doris-vectorized/doris-vectorized/issues/294
## Checklist (Required)
1. Does it affect the original behavior: (No)
2. Have unit tests been added: (Yes)
3. Has documentation been added or modified: (No)
4. Does it need to update dependencies: (No)
5. Are there any changes that cannot be rolled back: (Yes)
If a load task has a relatively short timeout, we need to ensure that
each RPC of the task does not block for a long time.
An RPC usually blocks for two reasons:
1. Handling "memory exceeds limit" in the RPC.
If the system finds that the memory occupied by loads exceeds the threshold,
it selects the load channel occupying the most memory and flushes its memtables.
This is done inside the RPC and may be time consuming.
2. Closing the load channel.
When the load channel receives the last batch, it ends the task.
It waits synchronously for all memtable flushes to finish, which is also time consuming.
Therefore, this PR solves the problem as follows (sketched after the list):
1. Use the timeout to determine whether a load task is high priority.
If the timeout of a load task is relatively short, it is marked as a high-priority task.
2. Do not process "memory exceeds limit" for high-priority tasks.
3. Use a separate flush thread to flush memtables for high-priority tasks.
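A sketch of the three rules combined; the names and the threshold value are assumptions, not the actual Doris code:
```cpp
#include <cstdint>

// Rule 1: loads with a short timeout are marked high priority.
struct LoadChannel {
    static constexpr int64_t kHighPriorityTimeoutS = 60;  // assumed threshold
    int64_t timeout_s;
    bool is_high_priority() const { return timeout_s <= kHighPriorityTimeoutS; }
};

// Rule 2: high-priority channels are exempt from "memory exceeds limit"
// handling inside the RPC, so their RPCs never block on a flush.
void handle_memory_exceeds_limit(LoadChannel* channel) {
    if (channel->is_high_priority()) return;
    // ... otherwise pick the channel using the most memory and flush it ...
}

// Rule 3 (not shown): high-priority memtable flushes are submitted to a
// dedicated flush thread pool, so they are not queued behind ordinary loads.
```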
Support merging IN predicates when there is a remote target (e.g. shuffle hash join).
Remove the code that implicitly converts an IN predicate to a Bloom filter when there is a remote target.
Close related #7546
If the load result set is empty, or the load data is entirely filtered out by the `where` condition,
the load no longer fails with the message `all partitions have no load data`, but returns success directly.
First, we need to add a parameter that describes whether the data is local or remote.
Then, we need to support some basic functions for operating on remote storage.
1. Refactor the scheduling logic of broker load. See #7367 for details.
2. Fix a bug where loadedBytes in the SHOW LOAD result was wrong.
3. Remove the LoadTimeoutChecker thread.
Now PENDING load jobs have no timeout; the timeout of a load job
starts when the pending load task is scheduled.
4. Fix a bug where a loading task was never submitted to the pool.
The logic of BlockedPolicy was wrong: we must make sure the task is submitted to the pool,
or a RejectedExecutionException is thrown.
5. The transaction of a load job now begins in the pending task, instead of when the job is submitted.
1. Delete useless variables.
2. Add the const modifier to read-only functions.
3. Delete empty destructors; the compiler generates them automatically. See the rule of three/five/zero:
[https://en.cppreference.com/w/cpp/language/rule_of_three]
4. Add the override keyword (instead of repeating the virtual keyword) to subclass virtual functions.
override lets the compiler check the signature and improves safety, which is why C++11 introduced it (see the example after this list).
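A minimal illustration of item 4, using a hypothetical node class:
```cpp
struct Status;

class ExecNode {
public:
    virtual ~ExecNode() = default;
    virtual Status open() = 0;
};

class ScanNode : public ExecNode {
public:
    // With override, a signature mismatch (say, an accidental `open() const`)
    // becomes a compile error instead of silently declaring a new virtual.
    Status open() override;
};
```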
1. Fix some memory leaks.
2. Remove redundant and invalid code.
3. Fix some buggy code to reduce extra memory copies and avoid returning null pointers for strings.
4. Rework the naming to make the structure clearer.
Code refactor: improve code readability and avoid const_cast.
1. Make loops simpler and clearer with range-based loops, which are safer than the old loop style.
2. Iterate `_row_desc.tuple_descriptors()` by index instead of mixing index and iterator.
3. Add a new function `To cast_to(From from)`; this union-based cast between two types replaces reinterpret_cast and is more readable (see the sketch after this list).
4. Avoid using the same variable name in nested loops; it is dangerous.
5. Add the const keyword to member functions, following the CppCoreGuidelines.
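A sketch of the `cast_to` helper from item 3 (the exact implementation in the PR may differ):
```cpp
// Both union members share storage, so writing `from` and reading the other
// member reinterprets the bytes without the pointer gymnastics of
// reinterpret_cast.
template <typename To, typename From>
To cast_to(From from) {
    static_assert(sizeof(To) == sizeof(From), "types must have the same size");
    union {
        From from_value;
        To to_value;
    } u;
    u.from_value = from;
    return u.to_value;
}

// e.g. reinterpret the bit pattern of a double as an int64_t:
// int64_t bits = cast_to<int64_t>(3.14);
```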
The ptr and len of a tuple's string slots are not assigned appropriately on the send side, so the receive side may crash in some situations.
Detailed description:
On the send side, when we call RowBatch::serialize(PRowBatch* output_batch) to pack a RowBatch, Tuple::deep_copy()
is called. Only string slots that are not null get proper ptr and len values; null string slots keep their original
contents, so the ptr member points to random memory and the len member may hold an unexpected value.
On the receive side, unpacking is done by RowBatch::RowBatch(const RowDescriptor&, const PRowBatch&, ...). In this
function, each string slot's offset is converted to a valid string_val->ptr whether the slot is null or not.
But some business logic depends on string_val->len == 0; for example, AggregateFuncTraits::init() and HyperLogLog::deserialize()
return correctly only if slice.size <= 0. So if string_val->len is set to 0 on the send side, everything is fine; otherwise the server
may crash.
From a networking point of view, we should make sure correct data is transferred: it is the sender's responsibility to set the data to proper
values, and it should not presume how the receive side will use it (see the sketch below).
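A simplified sketch of the send-side fix (the slot and tuple accessors follow the Doris/Impala style, but this is not a verbatim excerpt):
```cpp
// Inside the serialize/deep-copy path: give null string slots a defined
// {nullptr, 0} value so receivers that test string_val->len == 0 (e.g.
// HyperLogLog::deserialize()) behave correctly.
if (tuple->is_null(slot_desc->null_indicator_offset())) {
    StringValue* sv = tuple->get_string_slot(slot_desc->tuple_offset());
    sv->ptr = nullptr;
    sv->len = 0;
}
```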
Transfer the RowBatch in a protobuf request as a controller attachment
when the maximum length of the RowBatch in the protobuf request is exceeded.
This avoids reaching the upper limit of the protobuf request length (2 GB),
and performance is expected to improve.
This is because the constant MAX_PHYSICAL_PACKET_LENGTH in the FE should be 2^24 - 1,
but it was set to 2^24 - 2 by mistake.
2. Fix a bug where bitmap_to_string may fail when the result is larger than 2 GB.
1. Setting `_report_thread_active` to false does not need to be protected by `_report_thread_lock`, because
`_report_thread_active` is a bool, and writing it is thread-safe as long as its size is no larger than a machine word.
2. The report_profile thread may terminate early: in report_profile(), the `while (_report_thread_active)` loop may
exit if `_report_thread_active` is still false, because the thread calling open() may be scheduled out between
`_report_thread_started_cv.wait(l)` and `_report_thread_active = true`. We should not assume how much time elapses
between two schedulings of a thread (see the sketch after this list).
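One standard way to avoid the early-termination race in item 2 (not necessarily the exact change in this PR) is to wait with a predicate, which re-checks the flag after every wakeup, so the notify cannot be lost regardless of how the two threads are scheduled:
```cpp
// In open(): a bare wait() can miss a notify issued before the wait began;
// waiting with a predicate re-checks _report_thread_active under the lock.
std::unique_lock<std::mutex> l(_report_thread_lock);
_report_thread_started_cv.wait(l, [this] { return _report_thread_active; });
```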
Now a minidump file is created when the BE crashes.
Users can also manually trigger a minidump by sending SIGUSR1 to the BE process.
More details can be found in the minidump.md document.
BitShufflePageDecoder reuses the memory for storing decode results, allocating it directly from the
`ChunkAllocator`, which improves performance to a certain extent.
In the case of #6285, total time consumption is reduced by 13.5%, and the share of time spent in `~Reader()`
drops from 17.65% to 1.53%. Memory allocation is also unified under `ChunkAllocator` for centralized
management, which helps subsequent memory optimization.
This avoids the memory waste caused by `MemPool`, since a chunk can be freed at any time, but the
performance is lower than allocating from `MemPool`. Our guess is that, without `MemPool`'s secondary
allocation of large chunks, a large number of small chunks are requested directly from `ChunkAllocator`, and more
time is spent on the locks in `pop_free_chunk` and `push_free_chunk` (though this is not proven by the flame graphs
of BE's CPU and lock contention).
1. Replace all boost::shared_ptr with std::shared_ptr
2. Replace all boost::scoped_ptr with std::unique_ptr
3. Replace all boost::scoped_array with std::unique_ptr<T[]>
4. Replace all boost::thread with std::thread