doris

Author	SHA1	Message	Date
zhangstar333	53ae24912f	[vectorized](feature) support partition sort node (#19708 )	2023-05-25 11:22:02 +08:00
Xinyi Zou	cf7a74f6ec	[fix](memory) query check cancel while waiting for memory in Allocator, and optimize log (#19967 ) After the query check process memory exceed limit in Allocator, it will wait up to 5s. Before, Allocator will not check whether the query is canceled while waiting for memory, this causes the query to not end quickly.	2023-05-24 11:08:48 +08:00
Xinyi Zou	068a32bc49	[Improvement](memory) faststring use Allocator #19762 After the outer catch exception, faststring resize reserve build may throw a memory alloc failure exception from the Allocator. Currently page body compress will catch memory alloc failure exception	2023-05-18 15:00:49 +08:00
Gabriel	8fd1eb0d1e	[minor](hash table) parameterize hash table (#19653 )	2023-05-17 09:58:26 +08:00
Xinyi Zou	16f5d3d5b3	[Improvement](memory) new page use Allocator (#19472 )	2023-05-16 19:09:17 +08:00
Pxl	4eb2604789	[Bug](function) fix function define of Retention inconsist and change some static_cast to assert cast (#19455 ) 1. fix function define of `Retention` inconsist, this function return tinyint on `FE` and return uint8 on `BE` 2. make assert_cast support cast to derived 3. change some static cast to assert cast 4. support sum(bool)/avg(bool)	2023-05-15 11:50:02 +08:00
Xinyi Zou	58cb404661	[fix](memory) Allocator throws Exception instead of std::bad_alloc (#19285 ) W0505 01:31:25.840227 1727715 scanner_scheduler.cpp:340] Scan thread read VScanner failed: [MEM_LIMIT_EXCEEDED]PreCatch error code:11, [E11] Allocator sys memory check failed: Cannot alloc:16384, consuming tracker:<Orphan>, exec node:<>, process memory used 5.87 GB exceed limit 5.64 GB or sys mem available 252.17 GB less than low water mark 1.60 GB, failed alloc size 16.00 KB. @ 0x555c19e0cca8 doris::Exception::Exception() @ 0x555c1c3e0c3f Allocator<>::sys_memory_check() @ 0x555c1c3e1052 Allocator<>::memory_check() @ 0x555c19e0a645 Allocator<>::alloc() @ 0x555c1c34508b COWHelper<>::create<>() @ 0x555c1e23f574 doris::vectorized::ConvertThroughParsing<>::execute<>() @ 0x555c1e23f209 doris::vectorized::FunctionConvertFromString<>::execute_impl() @ 0x555c1e23f4aa doris::vectorized::FunctionConvertFromString<>::execute_impl() @ 0x555c1e15ac29 doris::vectorized::PreparedFunctionImpl::execute_without_low_cardinality_columns() @ 0x555c1e15ac56 doris::vectorized::PreparedFunctionImpl::execute() @ 0x555c1e245276 _ZNSt17_Function_handlerIFN5doris6StatusEPNS0_15FunctionContextERNS0_10vectorized5BlockERKSt6vectorImSaImEEmmEZNKS4_12FunctionCast14create_wrapperINS4_14DataTypeNumberIiEEEESt8functionISC_ERKSt10shared_ptrIKNS4_9IDataTypeEEPKT_bEUlS3_S6_SB_mmE_E9_M_invokeERKSt9_Any_dataOS3_S6_SB_OmSY_ @ 0x555c1e2a9341 _ZZNK5doris10vectorized12FunctionCast23prepare_remove_nullableEPNS_15FunctionContextERKSt10shared_ptrIKNS0_9IDataTypeEES9_bENKUlS3_RNS0_5BlockERKSt6vectorImSaImEEmmE_clES3_SB_SG_mm @ 0x555c1e2a8d42 _ZNSt17_Function_handlerIFN5doris6StatusEPNS0_15FunctionContextERNS0_10vectorized5BlockERKSt6vectorImSaImEEmmEZNKS4_12FunctionCast23prepare_remove_nullableES3_RKSt10shared_ptrIKNS4_9IDataTypeEESJ_bEUlS3_S6_SB_mmE_E9_M_invokeERKSt9_Any_dataOS3_S6_SB_OmSQ_ @ 0x555c1e20e42b doris::vectorized::PreparedFunctionCast::execute_impl() @ 0x555c1e15ac29 
doris::vectorized::PreparedFunctionImpl::execute_without_low_cardinality_columns() @ 0x555c1e15ac56 doris::vectorized::PreparedFunctionImpl::execute() @ 0x555c1d63e960 doris::vectorized::IFunctionBase::execute() @ 0x555c1d628700 doris::vectorized::VCastExpr::execute() @ 0x555c1d6163e5 doris::vectorized::VExprContext::execute() @ 0x555c20a83fe1 doris::vectorized::VFileScanner::_convert_to_output_block() @ 0x555c20a809af doris::vectorized::VFileScanner::_get_block_impl() @ 0x555c209b9bc4 doris::vectorized::VScanner::get_block() @ 0x555c209b1a50 doris::vectorized::ScannerScheduler::_scanner_scan() @ 0x555c209b2ac1 _ZNSt17_Function_handlerIFvvEZZN5doris10vectorized16ScannerScheduler18_schedule_scannersEPNS2_14ScannerContextEENK3$_0clEvEUlvE1_E9_M_invokeERKSt9_Any_data @ 0x555c1a8378cf doris::ThreadPool::dispatch_thread() @ 0x555c1a830fac doris::Thread::supervise_thread() @ 0x7f461faa117a start_thread @ 0x7f462033bdf3 __GI___clone @ (nil) (unknown)	2023-05-05 18:01:48 +08:00
Xinyi Zou	e17a171a3c	[fix](vertical_compaction) Fix continuous_agg_count PODArray wrong boundary judgment #19187	2023-05-04 14:50:30 +08:00
Xinyi Zou	1379d7f3e0	[fix](memory) mmap threshold can be modified in conf, Increase to 128M	2023-04-28 18:17:22 +08:00
Pxl	ec517a53a8	[Chore](build) upgrade clang-format version to 16 && move thrift to fe-common (#19155 ) upgrade clang-format version to 16 move thrift to fe-common fix core dump on pipeline engine when operator canceled and not prepared	2023-04-28 14:14:51 +08:00
yiguolei	8d7a9fd21b	[refactor](exceptionsafe) add factory creator to some class (#18978 ) make vexprecontext,vexpr,function,query context,runtimestate thread safe. --------- Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-04-24 10:32:11 +08:00
Jerry Hu	0c95d760fe	[fix](fixed_hashtable) The incorrect implementation of copy constructor (#18921 )	2023-04-24 08:36:52 +08:00
yiguolei	63a76ed115	[refactor](exceptionsafe) disallow call new method explicitly (#18830 ) disallow call new method explicitly force to use create_shared or create_unique to use shared ptr placement new is allowed reference https://abseil.io/tips/42 to add factory method to all class. I think we should follow this guide because if throw exception in new method, the program will terminate. --------- Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-04-21 09:13:24 +08:00
Jerry Hu	c4e469c82c	[feature](agg) Support spill to disk in aggregation (#18051 )	2023-04-20 18:59:08 +08:00
Adonis Ling	e412dd12e8	[chore](build) Use include-what-you-use to optimize includes (PART II) (#18761 ) Currently, there are some useless includes in the codebase. We can use a tool named include-what-you-use to optimize these includes. By using a strict include-what-you-use policy, we can get lots of benefits from it.	2023-04-19 23:11:48 +08:00
Xinyi Zou	79c446c89f	[enhancement](exception) Column filter/replicate supports exception safety (#18503 )	2023-04-18 19:23:09 +08:00
amory	564446e52f	[Refact](type system) refact serde for type system and pb serde impl (#18627 )	2023-04-18 14:13:56 +08:00
Jerry Hu	3de4d64657	[chore](hashtable) Use doris' Allocator to replace std::allocator in phmap (#18735 )	2023-04-18 09:58:28 +08:00
zclllyybb	092d81f88a	[BugFix](functions) fix multi_search_all_positions #18682	2023-04-17 08:32:57 +08:00
Xinyi Zou	c704351273	[enhancement](memory) Refactor memory limit exceeded behavior (#18590 ) No check mem tracker limit and no cancel task in mem hook, only in Allocator. This helps in clearer analysis of memory issues and reduces performance loss. PODArray/hash table/arena memory allocation will use Allocator. Optimize mem limit exceeded log printing Optimize compilation time	2023-04-14 10:42:35 +08:00
Zhengguo Yang	4335c9998f	[chore](ARM) Add some vectorization compatibility code on aarch64 (#18553 ) update sse2noen to support more sse code on arm cpus	2023-04-13 10:15:33 +08:00
ZhangYu0123	5efafefeda	[refactor](string) remove volnitsky search algorithm (#18474 )	2023-04-10 10:56:07 +08:00
Pxl	c9b4eaea76	[Chore](storage) change FieldType to enum class #18500	2023-04-10 08:53:44 +08:00
ZhangYu0123	b627088e8c	[Optimization](String) Optimize q20 q21 q22 q23 LIKE_SUBSTRING (like '%xxx%') (#18309 ) Optimize q20, q21, q22, q23 LIKE_SUBSTRING (like '%xxxx%'). Idea is from clickhouse stringsearcher: Stringsearcher is about 10%~20% faster than volnitsky algorithm when needle size is less than 10 using two chars at beginning search in SIMD . Stringsearcher is faster than volnitsky algorithm, when needle size is less than 21. The changes are as follows: Using first two chars of needle at beginning search. We can compare two chars of needle and [n:n+17) chars in haystack in SIMD in one loop. Filter efficiency will be higher. When env support SIMD, we use stringsearcher. Test result in clickbench: q20 is about 15% up. q20: SELECT COUNT() FROM hits WHERE URL LIKE '%google%'; q21, q22 is about 1%~5% up. q21: SELECT SearchPhrase, MIN(URL), COUNT() AS c FROM hits WHERE URL LIKE '%google%' AND SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10; q22: SELECT SearchPhrase, MIN(URL), MIN(Title), COUNT() AS c, COUNT(DISTINCT UserID) FROM hits WHERE Title LIKE '%Google%' AND URL NOT LIKE '%.google.%' AND SearchPhrase <> '' GROUP BY SearchPhrase ORDER BY c DESC LIMIT 10; q23 is about 30%~40% up and not stable. q23: SELECT FROM hits WHERE URL LIKE '%google%' ORDER BY EventTime LIMIT 10;	2023-04-03 18:09:15 +08:00
yiguolei	a77921d767	[refactor](typesystem) remove unused rpc common file and using function rpc (#18270 ) rpc common is duplicate, all its method is included in function rpc. So that I remove it. get_field_type is never used, remove it. --------- Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-03-31 18:13:25 +08:00
Jerry Hu	22a705543b	[fix](string_ref) Incorrect result caused by the improperly comparing of StringRef on macOS with Apple silicon or using non-avx2 #18264 On macOS systems with Apple silicon, the '==' operator of StringRef uses string_compare, which takes StringRef as a C-String with null-terminated chars.	2023-03-31 15:11:11 +08:00
zclllyybb	f800ba8f4c	[Exec](opt) Optimize function call for const columns (#18212 )	2023-03-31 11:36:21 +08:00
Xinyi Zou	e5793249cd	[opt](hashtable) Modify default filled strategy to 75% (#18242 )	2023-03-31 09:28:11 +08:00
lihangyu	e0f6083e73	[refactor](dynamic table) add `get_type_as_tprimitive_type` and `get_type_as_primitive_type` in IDataType to get `PrimitiveType` and `TPrimitiveType` (#18260 )	2023-03-31 09:03:06 +08:00
Xinyi Zou	d9fe5f7b67	[enhancement](memory) Remove MemPool and replace it with Arena (#17820 ) Arena can replace MemPool in most scenarios. Except for memory reuse, MemPool supports reuse of previous memory chunks after clear, but Arena does not. Some comparisons between MemPool and Arena: 1. Expansion Arena is less than 128M index 2 alloc chunk; more than 128M memory, allocate 128M * n > `size`, n is equal to the minimum value that satisfies the expression; MemPool less than 512K index 2 alloc chunk, greater than 512K memory, separately apply for a `size` length chunk After Arena applied for a chunk larger than 128M last time, the minimum chunk applied for after that is 128M. Does this seem to be a waste of memory? MemPool is also similar. After the chunk of 512K was applied for last time, the minimum chunk of subsequent applications is 512K. 2. Alignment MemPool defaults to 16 alignment, because memtable and other places that use int128 require 16 alignment; Arena has no default alignment; 3. Memory reuse Arena only supports `rollback`, which reuses the memory of the current chunk, usually the memory requested last time. MemPool supports clear(), all chunks can be reused; or call ReturnPartialAllocation() to roll back the last requested memory; if the last chunk has no memory, search for the most free chunk for allocation 4. Realloc Arena supports realloc contiguous memory; it also supports realloc contiguous memory from any position at the time of the last allocation. The difference between `alloc_continue` and `realloc` is: 1. Alloc_continue does not need to specify the old size, but the default old size = head->pos - range_start 2. alloc_continue supports expansion from range_start when additional_bytes is between head and pos, which is equivalent to reusing a part of memory, while realloc completely allocates a new memory MemPool does not support realloc, but supports transferring or absorbing chunks between two MemPools 5. 
check mem limit MemPool checks the mem limit, and Arena checks at the Allocator layer. 6. Support for ASAN Arena does something extra 7. Error handling MemPool supports returning the error message of application failure directly through `Status`, and Arena throws Exception. Tests that Arena can consider 1. After the last applied chunk is larger than 128M, the minimum applied chunk is 128M, which seems to waste memory; 2. Support clear, memory multiplexing; 3. Increase the large list, alloc the memory larger than 128M, and the size is equal to `size`, so as to avoid the current chunk not being fully used, which is wasteful. 4. In some cases, it may be possible to allocate backwards to find chunks t	2023-03-29 20:56:49 +08:00
Xinyi Zou	990479e177	[refactor](memory) Query waits for memory free in Allocator, after memory exceed limit. (#18075 ) After the memory exceeds the limit, the previous query waited for memory free in the mem hook, and changed it to wait in the Allocator. more controllable and safe	2023-03-27 09:06:03 +08:00
TengJianPing	78abb40fdc	[improvement](string) throw exception instead of log fatal if string column exceed total size limit (#17989 ) Throw exception instead of log fatal if string column exceed total size limit, so that we can catch it and let query fail, instead of causing be exit.	2023-03-27 08:55:26 +08:00
Xinyi Zou	5846b3fc54	[fix](memory) Remove PODArray peak allocated memory tracking #18010 #11740 , solved the problem that the query memory statistics are higher than the actual physical memory, because PODArray does not have memset 0 when allocating memory, and the query mem tracker is virtual memory. But in extreme cases, such as csv load, PODArray frequent insert will cause performance problems. So revert part of #11740 and part of #12820. The accuracy of the query mem tracker, there is currently no feedback, no further attention.	2023-03-26 09:45:10 +08:00
yiguolei	7ae51c856e	[refactor](unify exception) unify exception definition and error code (#18006 ) * [refactor](unify exception) unify exception definition and error code --------- Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-03-25 12:41:07 +08:00
lihangyu	043f77200f	[Bug](dynamic-table) Fix column alignment logic and support filtering null values when slot is not null (#17842 ) Before this PR, when null values were encountered in columns specified as `NOT NULL`, they were not filtered; this behavior does not match the original load behavior. Second, the column alignment logic has a bug: ``` template <typename ColumnInserterFn> void align_variant_by_name_and_type(ColumnObject& dst, const ColumnObject& src, size_t row_cnt, ColumnInserterFn inserter) { CHECK(dst.is_finalized() && src.is_finalized()); // Use rows() here instead of size(), since size() will check_consistency // but we could not check_consistency since num_rows will be upgraded even // if src and dst is empty, we just increase the num_rows of dst and fill // num_rows of default values when meet new data size_t num_rows = dst.rows(); ```	2023-03-17 16:53:30 +08:00
lihangyu	9b7596f1c6	[Feature](Dynamic schema table) step1 support schema change expression (#17494 ) 1. introduce a new type `VARIANT` to encapsulate dynamically generated columns, hiding the details of the types and names of newly generated columns 2. introduce a new expression `SchemaChangeExpr` that performs schema changes, for extensibility	2023-03-13 15:12:42 +08:00
Pxl	16fc3a0e22	[Chore](compile) remove some unused static on inline function to reduce compile time (#17603 ) remove some unused static on inline function to reduce compile time	2023-03-13 11:11:59 +08:00
Pxl	e2ac06d6d6	[Chore](execution) change PipelineTaskState to enum class && remove some row-based code (#17300 ) 1. change PipelineTaskState to enum class 2. remove some row-based code on FoldConstantExecutor::_get_result 3. reduce memcpy on minmax runtime filter function(Now we can guarantee that the input data is aligned) 4. add Wunused-template check, and remove some unused function, change some static function to inline function.	2023-03-08 12:41:15 +08:00
yiguolei	4692d6764c	[refactor](remove string val) remove string val structure, it is same with string ref (#17461 ) remove stringval, decimalv2val, bigintval	2023-03-08 10:42:20 +08:00
yiguolei	9477c48ef8	[refactor](functioncontext) remove duplicate type definition in function context (#17421 ) remove duplicate type definition in function context remove unused method in function context not need stale state in vexpr context because vexpr is stateless and function context saves state and they are cloned. remove useless slot_size in all tuple or slot descriptor. remove doris_udf namespace, it is useless. remove some unused macro definitions. init v_conjuncts in vscanner, not need write the same code in every scanner. using unique ptr to manage function context since it could only belong to a single expr context. Issue Number: close #xxx --------- Co-authored-by: yiguolei <yiguolei@gmail.com>	2023-03-06 16:07:09 +08:00
ZhaoChangle	e82b827bc8	[optimize](vectorization)Optimize to_string's performance. (#17076 )	2023-03-03 10:35:59 +08:00
HappenLee	3e40467ce6	[Bug](vec) Fix chinese pinyin order by (#17152 ) bug: some chinese word not sort by pinyin in GBK coding CREATE TABLE `test_convert` ( `a` varchar(100) NULL ) ENGINE=OLAP DUPLICATE KEY(`a`) DISTRIBUTED BY HASH(`a`) BUCKETS 3 PROPERTIES ( "replication_allocation" = "tag.location.default: 1" ); insert into test_convert values("b"), ("a"), ("c"), ("睿"), ("多"), ("丝"); Query OK, 6 rows affected (0.03 sec) {'label':'insert_ca73a6acc2194d5b_888218a3949355a6', 'status':'VISIBLE', 'txnId':'18068'} mysql [test]>select * from test_convert; +------+ \| a \| +------+ \| a \| \| c \| \| 丝 \| \| b \| \| 多 \| \| 睿 \| +------+ 6 rows in set (0.01 sec) mysql [test]>select * from test_convert order by convert(a using gbk); +------+ \| a \| +------+ \| a \| \| b \| \| c \| \| 多 \| \| 丝 \| \| 睿 \| +------+ 6 rows in set (0.01 sec)	2023-02-28 14:29:56 +08:00
TengJianPing	aab8dad191	[fix](sort) fix bug of sort (#17151 ) The logic of topn and full sort is wrong when there are both offsets and limits, the offset is not considered when doing the max heap optimization, which will lead to wrong result.	2023-02-27 10:55:12 +08:00
HappenLee	a8a5cbb403	[Opt](Hash) Deduce virtual function call is null at in single nullable column (#16650 )	2023-02-14 08:44:12 +08:00
lihangyu	36955a6769	[regression-test](dynamic-table) add regression test for dynamic table (#16656 )	2023-02-14 00:03:19 +08:00
lihangyu	37d1519316	[WIP](dynamic-table) support dynamic schema table (#16335 ) Issue Number: close #16351 A dynamic schema table is a special type of table whose schema changes during the loading procedure. This feature is implemented mainly for semi-structured data such as JSON: since JSON is self-describing, we can extract schema info from the original documents and infer the final type information. This special table reduces manual schema change operations, makes it easy to import semi-structured data, and extends its schema automatically.	2023-02-11 13:37:50 +08:00
yiguolei	d390e63a03	[enhancement](stream receiver) make stream receiver exception safe (#16412 ) make stream receiver exception safe change get_block(block*) to get_block(block , bool* eos) unify stream semantic	2023-02-07 12:44:20 +08:00
lihangyu	f94a78ab4a	[Fix](topn) fix wrong nullable cast for RowId column and use heapsorter for two phase read (#16399 ) convert_nullable_flags does not contain nullable info for the RowID column, but valid_column_ids contains the RowID column, so the nullable flag will be undefined for the RowID column	2023-02-03 20:49:45 +08:00
Pxl	5e4bb98900	[Chore](build) enable -Wpedantic and update lowest gcc version to 11.1 (#16290 ) enable -Wpedantic and update lowest gcc version to 11.1	2023-02-03 11:28:48 +08:00
TengJianPing	a7b030778a	[fix](sort) fix heap-use-after-free error if sort with limit and is spilled (#16267 )	2023-01-31 09:59:03 +08:00
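
Note on aab8dad191 (#17151): the message says the max heap optimization did not consider the offset when both an offset and a limit are present. The sketch below is not the Doris implementation; it only illustrates, under that reading, why a bounded heap has to retain offset + limit rows. The function name `top_n_with_offset` and the restriction to numeric keys are assumptions made for this illustration.

```python
import heapq


def top_n_with_offset(rows, limit, offset=0):
    """Return the rows at positions [offset, offset + limit) of the ascending order.

    Hypothetical helper for illustration; works on numeric keys only,
    because the max-heap is emulated by negating values.
    """
    if limit <= 0:
        return []
    # The heap must retain offset + limit rows. Retaining only `limit` rows
    # would discard rows that are still needed once the offset is skipped.
    keep = offset + limit
    heap = []  # stores negated values, so -heap[0] is the largest retained row
    for value in rows:
        if len(heap) < keep:
            heapq.heappush(heap, -value)
        elif value < -heap[0]:
            # The new row is smaller than the largest retained row: swap them.
            heapq.heapreplace(heap, -value)
    smallest = sorted(-v for v in heap)
    # Skip the first `offset` rows of the retained prefix.
    return smallest[offset:]
```

With a heap sized to `limit` alone, the retained rows would be the smallest `limit` values, and skipping `offset` of them afterwards would return too few rows, which matches the kind of wrong result the commit message describes.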


144 Commits