Early Design Documentation: https://shimo.im/docs/DT6JXDRkdTvdyV3G
Implement a new way of memory statistics based on the TCMalloc New/Delete Hook,
MemTracker, and TLS; the goal is that every new/delete/malloc/free in the
BE process can be counted.
Modify the implementation of MemTracker:
1. Remove a lot of useless logic to simplify the implementation;
2. Add MemTrackerTaskPool as the ancestor of all query and import trackers; it is used to track the local memory usage of all executing tasks;
3. Add a consume/release cache: a consume/release is triggered only when the accumulated memory exceeds the parameter mem_tracker_consume_min_size_bytes (see the sketch after this list);
4. Add a new memory leak detection mode (experimental feature): when a MemTracker is destructed with remaining consumption outside the specified range, an exception is thrown, and the exact consumption values are exposed via HTTP; controlled by the parameter memory_leak_detection;
5. Add a virtual MemTracker whose consume/release is not synced to its parent; once the TCMalloc hook is introduced for recording memory, it will be used to track specified memory independently;
6. Modify the GC logic: register the buffers cached in DiskIoMgr as a GC function, with more GC functions to be added later;
7. Change the global root node from Root MemTracker to Process MemTracker, and remove the Process MemTracker in exec_env;
8. Modify the macro that checks whether memory has reached its limit, the parameters and default behavior for creating a MemTracker, and the error message format in mem_limit_exceeded; extend and apply transfer_to; remove Metric from MemTracker; etc.
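A minimal sketch of how such a consume/release cache can work. The helper name `cached_consume` and the threshold constant are illustrative assumptions, not the actual Doris declarations:

```cpp
#include <atomic>
#include <cstdint>
#include <cstdlib>

// Stand-in for the real MemTracker; only the shared counter matters here.
struct MemTracker {
    std::atomic<int64_t> consumption{0};
    void consume(int64_t bytes) { consumption.fetch_add(bytes); }
};

// Corresponds to the mem_tracker_consume_min_size_bytes parameter
// (the value here is an arbitrary example).
constexpr int64_t kConsumeMinSizeBytes = 2 * 1024 * 1024;

// Per-thread cache: small deltas accumulate locally and are flushed to
// the shared tracker only once the threshold is exceeded, so the atomic
// counter is not touched on every small allocation or free.
thread_local int64_t tls_untracked_bytes = 0;

// Hypothetical helper; releases pass a negative byte count.
void cached_consume(MemTracker* tracker, int64_t bytes) {
    tls_untracked_bytes += bytes;
    if (std::abs(tls_untracked_bytes) >= kConsumeMinSizeBytes) {
        tracker->consume(tls_untracked_bytes);
        tls_untracked_bytes = 0;
    }
}
```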
Modify where MemTracker is used:
1. Add a MemPool constructor that creates a temporary tracker, avoiding a lot of redundant code;
2. Add trackers for global objects such as ChunkAllocator and StorageEngine;
3. Add more fine-grained trackers, such as in ExprContext;
4. Remove FragmentMemTracker from RuntimeState (the PlanFragmentExecutor mem_tracker, previously used to track scan memory independently) and replace it with _scanner_mem_tracker in OlapScanNode;
5. Stop recording MemTracker in ReservationTracker; ReservationTracker will be removed later;
# Proposed changes
Issue Number: close #6238
Co-authored-by: HappenLee <happenlee@hotmail.com>
Co-authored-by: stdpain <34912776+stdpain@users.noreply.github.com>
Co-authored-by: Zhengguo Yang <yangzhgg@gmail.com>
Co-authored-by: wangbo <506340561@qq.com>
Co-authored-by: emmymiao87 <522274284@qq.com>
Co-authored-by: Pxl <952130278@qq.com>
Co-authored-by: zhangstar333 <87313068+zhangstar333@users.noreply.github.com>
Co-authored-by: thinker <zchw100@qq.com>
Co-authored-by: Zeno Yang <1521564989@qq.com>
Co-authored-by: Wang Shuo <wangshuo128@gmail.com>
Co-authored-by: zhoubintao <35688959+zbtzbtzbt@users.noreply.github.com>
Co-authored-by: Gabriel <gabrielleebuaa@gmail.com>
Co-authored-by: xinghuayu007 <1450306854@qq.com>
Co-authored-by: weizuo93 <weizuo@apache.org>
Co-authored-by: yiguolei <guoleiyi@tencent.com>
Co-authored-by: anneji-dev <85534151+anneji-dev@users.noreply.github.com>
Co-authored-by: awakeljw <993007281@qq.com>
Co-authored-by: taberylyang <95272637+taberylyang@users.noreply.github.com>
Co-authored-by: Cui Kaifeng <48012748+azurenake@users.noreply.github.com>
## Problem Summary:
### 1. Some code from ClickHouse
**ClickHouse is an excellent database with a vectorized execution engine,
so we have referenced and learned a lot from its excellent data structures
and function implementations.
Our work is based on ClickHouse v19.16.2.2, and we would like to thank the ClickHouse community and developers.**
The following comment has been added to the code copied from ClickHouse, e.g.:
// This file is copied from
// https://github.com/ClickHouse/ClickHouse/blob/master/src/Interpreters/AggregationCommon.h
// and modified by Doris
### 2. Supported exec nodes and queries:
* vaggregation_node
* vanalytic_eval_node
* vassert_num_rows_node
* vblocking_join_node
* vcross_join_node
* vempty_set_node
* ves_http_scan_node
* vexcept_node
* vexchange_node
* vintersect_node
* vmysql_scan_node
* vodbc_scan_node
* volap_scan_node
* vrepeat_node
* vschema_scan_node
* vselect_node
* vset_operation_node
* vsort_node
* vunion_node
* vhash_join_node
You can run the SSB/TPC-H query sets and about 70% of the TPC-DS standard query set on the vectorized exec engine.
### 3. Data Model
The vectorized exec engine supports **Dup/Agg/Unq** tables, and the Block Reader is vectorized.
Vectorization at the Segment level is a work in progress.
### 4. How to use
1. Set the session variable `set enable_vectorized_engine = true;` (required)
2. Set the session variable `set batch_size = 4096;` (recommended)
### 5. Some differences from the original exec engine
https://github.com/doris-vectorized/doris-vectorized/issues/294
## Checklist(Required)
1. Does it affect the original behavior: (No)
2. Has unit tests been added: (Yes)
3. Has document been added or modified: (No)
4. Does it need to update dependencies: (No)
5. Are there any changes that cannot be rolled back: (Yes)
First, we need a parameter to describe whether the data is local or remote.
Then, we need to support some basic functions for operating on remote storage.
1. Replace all boost::shared_ptr with std::shared_ptr
2. Replace all boost::scoped_ptr with std::unique_ptr
3. Replace all boost::scoped_array with std::unique_ptr<T[]>
4. Replace all boost::thread with std::thread (see the sketch after this list)
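The four replacements in miniature (`Foo` and `worker` are placeholder names):

```cpp
#include <memory>
#include <thread>

struct Foo {};
void worker() {}

int main() {
    std::shared_ptr<Foo> shared = std::make_shared<Foo>(); // was boost::shared_ptr
    std::unique_ptr<Foo> owned = std::make_unique<Foo>();  // was boost::scoped_ptr
    std::unique_ptr<char[]> buf(new char[1024]);           // was boost::scoped_array
    std::thread t(worker);                                 // was boost::thread
    t.join();
    return 0;
}
```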
Remove some of the dynamic_casts to reduce the overhead of type conversion;
this probably reduces the CPU consumption of Parquet file import by about 10% (sketched below).
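The idea, with hypothetical reader classes standing in for the actual Parquet import code:

```cpp
#include <cassert>

struct BaseReader {
    virtual ~BaseReader() = default;
};
struct Int32Reader : BaseReader {
    int read_value() { return 42; }
};

// Before: a dynamic_cast on the per-value hot path pays for an RTTI
// lookup on every call.
int read_slow(BaseReader* base) {
    return dynamic_cast<Int32Reader*>(base)->read_value();
}

// After: when the concrete type is already guaranteed by the caller
// (e.g. the reader was selected once from the column's declared type),
// static_cast removes the RTTI overhead entirely.
int read_fast(BaseReader* base) {
    assert(dynamic_cast<Int32Reader*>(base) != nullptr); // debug-only sanity check
    return static_cast<Int32Reader*>(base)->read_value();
}
```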
This is part of the array type support and is not yet fully complete.
The following functions are implemented:
1. FE array type support and implementation of array functions, including array syntax analysis and planning
2. Support importing array type data through insert into
3. Support selecting array type data
4. The array type is only supported on the value columns of duplicate tables
This PR merges some code from #4655, #4650, #4644, #4643, #4623, and #2979.
* [Enhance] Make MemTracker more accurate (#5515)
This PR is mainly about:
1. Improve the readability of MemTrackers' names
2. Add the MemTracker of:
* Load
* Compaction
* SchemaChange
* StoragePageCache
* TabletManager
3. Change SchemaChange to a singleton (a sketch follows)
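A Meyers-style singleton is the usual way to express this in C++; a minimal sketch, not the actual SchemaChange interface:

```cpp
class SchemaChangeHandler {
public:
    static SchemaChangeHandler& instance() {
        // Function-local static: constructed once, thread-safe since C++11.
        static SchemaChangeHandler s_instance;
        return s_instance;
    }
    SchemaChangeHandler(const SchemaChangeHandler&) = delete;
    SchemaChangeHandler& operator=(const SchemaChangeHandler&) = delete;

private:
    SchemaChangeHandler() = default;
};
```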
* Revise some code per code review
* Change the name of mem_tracker
* Keep reader_context alive for the same lifetime as rowset_reader in schema change
* Change VLOG notice to LOG(WARNING) in schema change
If a table contains very large fields, each page may hold only one row,
and every such row also gets a zone map index entry.
This causes the stored data to expand to three times the original size,
and it also takes more memory to read those segments.
Therefore, we need to disable the creation of zone map indexes for segments with too few rows (a sketch of the guard follows the issue link below).
#4995
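A sketch of the resulting guard, with a hypothetical threshold name and value (the actual parameter in the PR may differ):

```cpp
#include <cstdint>

// Hypothetical threshold: below this row count, a zone map would cost
// more in storage and read-time memory than it saves in pruning.
constexpr uint32_t kZoneMapMinRowCount = 64;

bool should_build_zone_map(uint32_t num_rows) {
    // With very large fields a page may hold a single row, so per-page
    // zone map entries can triple the stored size for little benefit.
    return num_rows >= kZoneMapMinRowCount;
}
```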
**Implementation of Separated Page Cache**
- Add config `index_page_cache_ratio` to set the ratio of capacity given to the index page cache (see the sketch after this list)
- Change the members of StoragePageCache to maintain two types of cache
- Change the interface of StoragePageCache to select the type of cache
- Change the usage of the page cache in read_and_decompress_page in page_io.cpp
  - add the page type as an argument
  - check whether the current page type is available in StoragePageCache (covers the ratio == 0 or 1 cases)
- Add the type as an argument in callers of read_and_decompress_page
- Update the unit tests
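The capacity split implied by `index_page_cache_ratio`, sketched with assumed names rather than the actual StoragePageCache code:

```cpp
#include <cstdint>

struct PageCacheCapacities {
    int64_t index_capacity;
    int64_t data_capacity;
};

PageCacheCapacities split_capacity(int64_t total, double index_page_cache_ratio) {
    // ratio == 0 disables the index page cache and ratio == 1 disables the
    // data page cache, which is why callers must check that the wanted
    // cache type is available before using it.
    int64_t index_capacity = static_cast<int64_t>(total * index_page_cache_ratio);
    return {index_capacity, total - index_capacity};
}
```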
There are some long loops and sleeps in the unit tests, so running all of
them takes a very long time, especially in TSAN mode.
This patch speeds up the unit tests by shortening long loops and sleeps;
on my environment all unit tests finish in 1 minute, which is useful
for basic functional testing.
You can switch back to the original slow tests with a new environment
variable `DORIS_ALLOW_SLOW_TESTS` (sketched below). For example, you can enable them with:
export DORIS_ALLOW_SLOW_TESTS=1
and disable them again with:
export DORIS_ALLOW_SLOW_TESTS=0
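One way a test can honor the switch; the helper names are assumptions, not the actual test utilities:

```cpp
#include <cstdlib>
#include <cstring>

// True when DORIS_ALLOW_SLOW_TESTS is set to anything other than "0".
bool allow_slow_tests() {
    const char* v = std::getenv("DORIS_ALLOW_SLOW_TESTS");
    return v != nullptr && std::strcmp(v, "0") != 0;
}

// Long loops pick their bound from the switch: the default run stays
// fast, while DORIS_ALLOW_SLOW_TESTS=1 restores the original coverage.
int test_loop_count() {
    return allow_slow_tests() ? 100000 : 100;
}
```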
This CL includes:
* Change the column metadata to a tree structure.
* Refactor segment_v2.ColumnReader and segment_v2.ColumnWriter to support complex types.
* Implement reading and writing of the array type.
When we get different columns' row ranges from column_delete_conditions, we should use a union operation instead of an intersection operation to get the final row ranges.
The root cause is that we lost the relationship between the delete conditions that belong to the same delete statement.
Base data:
```
k1, k2
1, 2
1, 3
case 1:
delete from tbl where k1=1 and k2=2;
case 2:
delete from tbl where k1=1;
delete from tbl where k2=2;
```
We treated the above two cases as the same, which is incorrect: case 1 deletes only the row matching k1=1 AND k2=2, while case 2 deletes the union of the rows matching k1=1 and the rows matching k2=2.
So we need to process the delete conditions of each delete statement separately, as the sketch below shows.
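The fix in miniature, using plain row-id sets in place of the real row-range classes (all names are illustrative):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

using RowIds = std::set<uint32_t>; // rows matched by one delete condition

// Conditions inside a single delete statement are ANDed, so their row
// sets are intersected (case 1: k1=1 AND k2=2 only matches row (1,2)).
RowIds rows_of_statement(const std::vector<RowIds>& conditions) {
    if (conditions.empty()) return {};
    RowIds result = conditions.front();
    for (size_t i = 1; i < conditions.size(); ++i) {
        RowIds tmp;
        std::set_intersection(result.begin(), result.end(),
                              conditions[i].begin(), conditions[i].end(),
                              std::inserter(tmp, tmp.begin()));
        result = std::move(tmp);
    }
    return result;
}

// Separate delete statements are independent, so their row sets must be
// UNIONed (case 2: k1=1 matches both rows, k2=2 matches row (1,2)).
RowIds rows_deleted(const std::vector<std::vector<RowIds>>& statements) {
    RowIds result;
    for (const auto& stmt : statements) {
        RowIds rows = rows_of_statement(stmt);
        result.insert(rows.begin(), rows.end());
    }
    return result;
}
```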
Earlier we introduced `BlockManager` to separate data access logic from
underlying file read and write logic.
This CL further unifies all `SegmentV2` data access to the `BlockManager`,
removes the previous `FileManager` class, and moves the file cache to the `FileBlockManager`.
This CL contains no logical changes.
After this CL, all user table data is read through the `WritableBlock` and `ReadableBlock`
returned by the `BlockManager`, and no file operations are performed directly.
Fixes #2892
IMPORTANT NOTICE: this CL makes incompatible changes to V2 storage format, developers need to create new tables for test.
This CL refactors the metadata and page format for segment_v2 in order to
* make it easy to extend existing page type
* make it easy to add new page type while not sacrificing code reuse
* make it possible to use SIMD to speed up page decoding
Here is a summary of the main code changes:
* Page and index metadata is redesigned, please see `segment_v2.proto`
* The new class `PageIO` is the single place for reading and writing all pages. This removes lots of duplicated code. `PageCompressor` and `PageDecompressor` are now useless and removed.
* The type of value ordinal is changed from `rowid_t` to the 64-bit `ordinal_t` (shown below); this affects the ordinal index as well.
* A column's ordinal index is now implemented by IndexPage, the same as IndexedColumn.
* Zone map index is now implemented by IndexedColumn
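The ordinal change in isolation; the exact typedef spellings here are assumptions:

```cpp
#include <cstdint>

typedef uint32_t rowid_t;   // old: 32-bit row id, capped at 2^32 values
typedef uint64_t ordinal_t; // new: 64-bit value ordinal used by the ordinal index
```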
1. When reading a column data page:
   - for compaction, schema change, and checksum, we do not use the page cache;
   - for queries, if config::disable_storage_page_cache is false, we use the page cache.
2. When reading a column index page:
   - if config::disable_storage_page_cache is false, we use the page cache.
(These rules are folded into two predicates in the sketch below.)
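A sketch with assumed enum and function names; the real code threads these options through the read path:

```cpp
enum class ReaderType { QUERY, COMPACTION, SCHEMA_CHANGE, CHECK_SUM };

// Data pages: only queries use the cache, and only when the global
// switch config::disable_storage_page_cache is off.
bool use_cache_for_data_page(ReaderType type, bool disable_storage_page_cache) {
    return type == ReaderType::QUERY && !disable_storage_page_cache;
}

// Index pages: cached for every reader type unless globally disabled.
bool use_cache_for_index_page(bool disable_storage_page_cache) {
    return !disable_storage_page_cache;
}
```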
[STORAGE][SEGMENTV2]
When base compaction involves a delete rowset with more than two
conditions, the rows_del_filtered statistic is wrong and the compaction
fails because of the row-count check.
[STORAGE][SEGMENTV2]
Use a block split bloom filter (a minimal sketch follows this list):
- build a bloom filter against each data page
- add distinct values to the bloom filter
- add an ordinal index to the bloom filter index
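A minimal block split Bloom filter for reference. The salt constants are the standard split-block values; the class shape is an illustration, not the actual Doris implementation:

```cpp
#include <array>
#include <cstdint>
#include <vector>

class SplitBlockBloomFilter {
public:
    explicit SplitBlockBloomFilter(size_t num_blocks) : _blocks(num_blocks) {}

    void insert(uint64_t hash) {
        // Upper hash bits pick a 32-byte block; lower bits pick the bits.
        std::array<uint32_t, 8>& block = _blocks[(hash >> 32) % _blocks.size()];
        uint32_t key = static_cast<uint32_t>(hash);
        for (int i = 0; i < 8; ++i) {
            // Each 32-bit lane of the selected block gets exactly one bit set.
            block[i] |= 1U << ((key * kSalts[i]) >> 27);
        }
    }

    bool may_contain(uint64_t hash) const {
        const std::array<uint32_t, 8>& block = _blocks[(hash >> 32) % _blocks.size()];
        uint32_t key = static_cast<uint32_t>(hash);
        for (int i = 0; i < 8; ++i) {
            if ((block[i] & (1U << ((key * kSalts[i]) >> 27))) == 0) return false;
        }
        return true; // possibly present: false positives allowed, no false negatives
    }

private:
    // Standard split-block Bloom filter salts.
    static constexpr uint32_t kSalts[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU,
                                           0xa2b7289dU, 0x705495c7U, 0x2df1424bU,
                                           0x9efc4947U, 0x5c6bfb31U};
    std::vector<std::array<uint32_t, 8>> _blocks;
};
```

Because every probe touches a single cache-line-sized block, lookups stay cheap even for large filters, which is the point of the split-block layout.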
[Storage][SegmentV2]
Currently `segment_v2::Segment::open` eagerly initializes all column readers, regardless of whether a column is queried or not. Initializing a `segment_v2::ColumnReader` incurs additional I/O to read the ordinal index and zone map index, so it should be delayed until the reader is actually needed (see the sketch below).
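A sketch of the lazy pattern; member names are assumptions, while the real change lives in `segment_v2::Segment`:

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>

struct ColumnReader {
    explicit ColumnReader(uint32_t column_id) {
        // The ordinal index and zone map index are read here, i.e. this
        // I/O now happens only for columns a query actually touches.
        (void)column_id;
    }
};

class Segment {
public:
    // Created on first use instead of eagerly in Segment::open().
    ColumnReader* column_reader(uint32_t column_id) {
        std::lock_guard<std::mutex> guard(_lock);
        auto it = _readers.find(column_id);
        if (it == _readers.end()) {
            it = _readers.emplace(column_id,
                                  std::make_unique<ColumnReader>(column_id)).first;
        }
        return it->second.get();
    }

private:
    std::mutex _lock;
    std::map<uint32_t, std::unique_ptr<ColumnReader>> _readers;
};
```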