doris

Author	SHA1	Message	Date
wangbo	726eaa68ea	[fix](vectorization) Vectorization decimal arithmetic inconsistent (#8626 )	2022-03-28 10:12:39 +08:00
yinzhijian	f96bc62573	[feature](balance) Support balance between disks on a single BE (#8553 ) Currently, a Doris cluster can be balanced while the disks of a single backend are not. For example, backend A has two disks, disk1 and disk2; disk1's usage is 98%, but disk2's usage is only 40%. disk1 cannot take more data, so only one disk of backend A can accept new data, the available write throughput of backend A is only half of its capacity, and this cannot currently be resolved through load or partition rebalancing. So we introduce a disk rebalancer. Unlike the other rebalancers (load or partition), which take care of cluster-wide data balancing, the disk rebalancer takes care of backend-wide data balancing. [For more details see #8550](https://github.com/apache/incubator-doris/issues/8550)	2022-03-28 10:03:21 +08:00
Pxl	02612c7ec0	[Refactor] Remove unused file (#8657 )	2022-03-27 01:41:06 +08:00
yiguolei	aeee738af0	Revert "[Refactor][agent_task] Remove etl mgr and etl job pool from be (#8635 )" (#8666 ) This reverts commit 6bc982c37436acf288f566cf10e084731b80fa44.	2022-03-25 18:32:50 +08:00
zbtzbtzbt	e285d09157	[Enhancement](load) speed up stream load for duplicate table, use template for faster get_type_info. (#8500 )	2022-03-25 15:18:43 +08:00
yiguolei	6bc982c374	[Refactor][agent_task] Remove etl mgr and etl job pool from be (#8635 )	2022-03-25 15:17:39 +08:00
dataroaring	8b4e57287f	Row num is more accurate than column num in data_types (#8628 )	2022-03-25 14:38:27 +08:00
Zhengguo Yang	cfb57be731	[api-change] add soft limit of String type length (#8567 ) 1. add a config string_type_soft_limit to set a soft limit on the max length of the String type 2. disable using String type in Key column, partition column and distribution column 3. remove String type alias BLOB for future use	2022-03-25 09:28:41 +08:00
caiconghui	c69dd54116	[refactor](mutex) Use std::mutex to replace Mutex and refactor some lock logic (#8452 )	2022-03-24 14:50:02 +08:00
Xinyi Zou	aaaaae53b5	[feature] (memory) Switch TLS mem tracker to separate more detailed memory usage (#8605 ) In pr #8476, all memory usage of a process is recorded in the process mem tracker, and all memory usage of a query is recorded in the query mem tracker, and it is still necessary to manually call `transfer to` to track the cached memory size. We hope to separate out more detailed memory usage based on Hook TCMalloc new/delete + TLS mem tracker. In this pr, the more detailed mem tracker is switched to TLS, which automatically and accurately counts more detailed memory usage than before.	2022-03-24 14:29:34 +08:00
HappenLee	5f606c9d57	[fix] Fix coredump of stddev function (#8543 ) This is only a temporary fix; its performance is not ideal. Eventually, we need to rework the `stddev` functions and delete the `insert_to_null_default ()` interface.	2022-03-24 11:39:29 +08:00
Pxl	2760bcbcc1	[fix] fix core dump on deep_copy_tuple when data is null (#8620 )	2022-03-24 09:15:38 +08:00
Mingyu Chen	a58e56f0b4	[fix](load) fix another bug that BE may crash when calling `mark_as_failed` (#8607 ) Same as #8501	2022-03-24 09:13:54 +08:00
Pxl	7fc22c2456	[fix][vectorized] fix core on get_predicate_column_ptr && fix double copy on _read_columns_by_rowids (#8581 )	2022-03-24 09:12:42 +08:00
spaces-x	bea9a7ba4f	[feature] Support pre-aggregation for quantile type (#8234 ) Add a new column-type to speed up the approximation of quantiles. 1. The new column-type is named `quantile_state` with fixed aggregation function `quantile_union`, which stores the intermediate results of pre-aggregated approximation calculations for quantiles. 2. support pre-aggregation of new column-type and quantile_state related functions.	2022-03-24 09:11:34 +08:00
HappenLee	36c85d2f06	[fix][vectorized] Fix bug of left semi/anti with other join conjunct (#8596 )	2022-03-23 10:34:47 +08:00
HappenLee	92feb9c6c8	[fix] Fix error crc32 method to cal uint128 and int128 (#8577 )	2022-03-23 10:33:32 +08:00
Gabriel	b89e4c7bba	[feature-wip](java-udf) support java UDF with fixed-length input and output (#8516 ) This feature is proposed in [DSIP-1](https://cwiki.apache.org/confluence/display/DORIS/DSIP-001%3A+Java+UDF). This PR supports Java UDFs with fixed-length input and output. Phase I in DSIP-1 is done after this PR. To support Java UDF efficiently, I use no data copy in JNI call and all compute operations are off-heap in Java. To achieve that, I use a UdfExecutor instead. For users, a UDF class must have a public evaluate method.	2022-03-23 10:32:50 +08:00
camby	9f0b93e3c6	[feature-wip](array-type) Fix conflict while merge array-type branch (#8594 )	2022-03-22 16:35:30 +08:00
Adonis Ling	2580da4f72	[feature-wip](array-type) Support insertion for vectorized engine. (#8494 ) (#8590 ) Please refer to #8493	2022-03-22 15:48:13 +08:00
camby	71ce3c4a6e	[feature-wip](array-type) Add codes and UT for array_contains and array_position functions (#8401 ) (#8589 ) array_contains function Usage example: 1. create table with ARRAY column, and insert some data: ``` > select * from array_test; +------+------+--------+ \| k1 \| k2 \| k3 \| +------+------+--------+ \| 1 \| 2 \| [1, 2] \| \| 2 \| 3 \| NULL \| \| 4 \| NULL \| [] \| \| 3 \| NULL \| NULL \| +------+------+--------+ ``` 2. enable vectorized: ``` > set enable_vectorized_engine=true; ``` 3. select with array_contains: ``` > select k1,array_contains(k3,1) from array_test; +------+-------------------------+ \| k1 \| array_contains(`k3`, 1) \| +------+-------------------------+ \| 3 \| NULL \| \| 1 \| 1 \| \| 2 \| NULL \| \| 4 \| 0 \| +------+-------------------------+ ``` 4. also we can use array_contains in where condition ``` > select * from array_test where array_contains(k3,1); +------+------+--------+ \| k1 \| k2 \| k3 \| +------+------+--------+ \| 1 \| 2 \| [1, 2] \| +------+------+--------+ ``` 5. array_position usage example ``` > select k1,k3,array_position(k3,2) from array_test; +------+--------+-------------------------+ \| k1 \| k3 \| array_position(`k3`, 2) \| +------+--------+-------------------------+ \| 3 \| NULL \| NULL \| \| 1 \| [1, 2] \| 2 \| \| 2 \| NULL \| NULL \| \| 4 \| [] \| 0 \| +------+--------+-------------------------+ ```	2022-03-22 15:42:40 +08:00
Adonis Ling	a9f51b5b65	[feature-wip](array-type) Fix compilation error. (#8422 ) (#8587 )	2022-03-22 15:31:16 +08:00
Adonis Ling	b638c07533	[feature-wip](array-type) Support nested array insertion. (#8305 ) (#8586 ) Please refer to #8304 .	2022-03-22 15:28:26 +08:00
Adonis Ling	e44038caf3	[feature-wip](array-type) Array data can be loaded in stream load. (#8368 ) (#8585 ) Please refer to #8367 .	2022-03-22 15:25:40 +08:00
camby	a498463ab5	[feature-wip](array-type)support select ARRAY data type on vectorized engine (#8217 ) (#8584 ) Usage Example: 1. create table for test; ``` `CREATE TABLE `array_test` ( `k1` tinyint(4) NOT NULL COMMENT "", `k2` smallint(6) NULL COMMENT "", `k3` ARRAY<int(11)> NULL COMMENT "" ) ENGINE=OLAP DUPLICATE KEY(`k1`) COMMENT "OLAP" DISTRIBUTED BY HASH(`k1`) BUCKETS 5 PROPERTIES ( "replication_allocation" = "tag.location.default: 1", "in_memory" = "false", "storage_format" = "V2" );` ``` 2. insert some data ``` `insert into array_test values(1, 2, [1, 2]);` `insert into array_test values(2, 3, null);` `insert into array_test values(3, null, null);` `insert into array_test values(4, null, []);` ``` 3. open vectorized `set enable_vectorized_engine=true;` 4. query array data `select * from array_test;` +------+------+--------+ \| k1 \| k2 \| k3 \| +------+------+--------+ \| 4 \| NULL \| [] \| \| 2 \| 3 \| NULL \| \| 1 \| 2 \| [1, 2] \| \| 3 \| NULL \| NULL \| +------+------+--------+ 4 rows in set (0.061 sec) Code changes include: 1. add column_array, data_type_array codes; 2. codes about data_type creation by Field, TabletColumn, TypeDescriptor, PColumnMeta move to DataTypeFactory; 3. support create data_type for ARRAY data type; 4. RowBlockV2::convert_to_vec_block support ARRAY data type; 5. VMysqlResultWriter::append_block support ARRAY data type; 6. vectorized::Block serialize and deserialize support ARRAY data type;	2022-03-22 15:21:44 +08:00
Adonis Ling	38ec3cbbdf	[feature-wip](array-type) Support ArrayLiteral in SQL. (#8089 ) (#8582 ) Please refer to #8074	2022-03-22 15:07:06 +08:00
Adonis Ling	cf0a9fd177	[feature-wip](array-type) Create table with nested array type. (#8003 ) (#8575 ) ``` create table array_type_table(k1 INT, k2 Array<Array<int>>) duplicate key (k1) distributed by hash(k1) buckets 1 properties('replication_num' = '1'); ```	2022-03-22 15:03:32 +08:00
Pxl	be3d203289	[feature][vectorized] support table function explode_numbers() (#8509 )	2022-03-22 11:38:00 +08:00
yiguolei	989e03ddf9	[improvement] Improve sig handler (#8545 ) * Refactor glog's default signal handler Co-authored-by: Zhengguo Yang <780531911@qq.com>	2022-03-22 10:40:31 +08:00
caiconghui	905b9a6289	[fix](lru_cache) fix heap-use-after-free problem for lru cache(#8569 )	2022-03-21 21:23:43 +08:00
Mingyu Chen	04004021b5	[chore] Separate debugging information from BE binaries (#8544 ) Currently, the compiled output of BE mainly consists of two binaries: palo_be and meta_tool, which are both around 1.6G in size. However, the debug information is only needed for debugging purposes. So I separate the debug info from binaries. After BE is built, the debug info file will be saved in `be/lib/debug_info/` dir. The size of `palo_be` and `meta_tool` decreases to about 100MB. This is optional and disabled by default. To enable it, use: `STRIP_DEBUG_INFO=ON sh build.sh`	2022-03-21 16:33:01 +08:00
Zhengguo Yang	7c1c2b1d17	[chore] fix compile error when use clang as compiler and a be ut problem (#8554 )	2022-03-21 15:38:59 +08:00
yiguolei	337d174c14	[Refactor](schema_change) Remove tablet instances since tablet id is unique between base tablet and new schema change tablet (#8486 )	2022-03-21 12:43:54 +08:00
minghong	c772020db4	[fix] fix bug in WindowFunctionLastData::data, it keeps the first data not the last. (#8536 ) WindowFunctionLastData::add should keep the last value, but current implementation keeps the first one. Obviously, this code is copied from WindowFunctionFirstData::add.	2022-03-21 09:51:56 +08:00
Pxl	fc3ad371c8	[fix](vec) fix regexp_replace get wrong result on clang (#8505 )	2022-03-20 23:11:24 +08:00
Xinyi Zou	eeae516e37	[Feature](Memory) Hook TCMalloc new/delete automatically counts to MemTracker (#8476 ) Early Design Documentation: https://shimo.im/docs/DT6JXDRkdTvdyV3G Implement a new way of memory statistics based on TCMalloc New/Delete Hook, MemTracker and TLS, and it is expected that all memory new/delete/malloc/free of the BE process can be counted.	2022-03-20 23:06:54 +08:00
ZenoYang	2ec0b81030	[improvement](storage) Low cardinality string optimization in storage layer (#8318 ) Low cardinality string optimization in storage layer	2022-03-20 23:04:25 +08:00
Zhengguo Yang	58a4c70fd4	[fix] fix String type compaction or agg may crash when string is null (#8515 )	2022-03-18 11:27:28 +08:00
morrySnow	4da1718147	[fix] memory leak in ResourceTls (#8517 )	2022-03-18 09:42:19 +08:00
yinzhijian	94991864f5	[fix] Fix bug that __set_ missing for thrift optional fields in be (#8507 )	2022-03-18 09:41:06 +08:00
Zhengguo Yang	035ca5240f	[fix] Fix may coredump when check if all rowset is beta-rowset of a tablet (#8503 ) core dump like ``` * Aborted at 1647468467 (unix time) try "date -d @1647468467" if you are using GNU date * PC: @ 0x5555576940b0 doris::OlapScanNode::start_scan_thread() * SIGSEGV (@0x84) received by PID 39139 (TID 0x7ffee8388700) from PID 132; stack trace: * @ 0x555558926212 google::(anonymous namespace)::FailureSignalHandler() @ 0x7ffff753d400 (unknown) @ 0x5555576940b0 doris::OlapScanNode::start_scan_thread() @ 0x555557696e1b doris::OlapScanNode::start_scan() @ 0x55555769737d doris::OlapScanNode::get_next() @ 0x5555570784f5 doris::PlanFragmentExecutor::get_next_internal() @ 0x55555707d24c doris::PlanFragmentExecutor::open_internal() @ 0x55555707e72f doris::PlanFragmentExecutor::open() @ 0x555556ffab95 doris::FragmentExecState::execute() @ 0x555556fff0ed doris::FragmentMgr::_exec_actual() @ 0x5555570088ec std::_Function_handler<>::_M_invoke() @ 0x55555719a099 doris::ThreadPool::dispatch_thread() @ 0x555557193a8f doris::Thread::supervise_thread() @ 0x7ffff72f2ea5 start_thread @ 0x7ffff76058dd __clone @ 0x0 (unknown) ```	2022-03-18 09:39:13 +08:00
Mingyu Chen	b07b840b76	[fix](load) fix bug that BE may crash when calling `mark_as_failed` (#8501 ) 1. The methods in the IndexChannel are called back in the RpcClosure in the NodeChannel. However, this callback may occur after the whole task is finished (e.g. due to network latency), and by that time the IndexChannel may have been destructed, so we should not call the IndexChannel methods anymore, otherwise the BE will crash. Therefore, we use the `_is_closed` variable and `_closed_lock` to ensure that the RPC callback function will not call the IndexChannel's method after the NodeChannel is closed. 2. Do not add IndexChannel to the ObjectPool, because destructing an IndexChannel may invoke the destructor of NodeChannel, which may be time-consuming (it waits for the RPC to finish). The ObjectPool holds a SpinLock while destroying objects, so this may keep the CPU busy.	2022-03-18 09:38:16 +08:00
dataroaring	25cdd0be1a	[refactor] CalcPageLenForRow return void rather than always Status::Ok (#8490 ) Thus we can remove branches depending on CalcPageLenForRow.	2022-03-18 09:34:49 +08:00
Pxl	a8af8d2981	[fix](vectorized) fix core dump on get_json_string and add some ut (#8496 )	2022-03-17 10:08:31 +08:00
Zhengguo Yang	848acec584	[chore](dependency) update Croaring for good performance (#8492 ) update Croaring for good performance, according to RoaringBitmap/CRoaring#320	2022-03-17 10:07:55 +08:00
ZenoYang	b537e06ecd	[improvement](vectorized) Make bloom filter predicate run short-circuit logic (#8484 ) The current BloomFilter runs vectorization predicate evaluate, but `evaluate_vec` interface is not implemented, so the RuntimeFilter does not play a role after it is pushed down to the storage layer. And BF predicate computation cannot be automatically vectorized, thus making BloomFilter run short-circuit logic. For SSB Q2.1, `enable_storage_vectorization = true;` ``` test before impl: - Total: 36s164ms - RowsVectorPredFiltered: 0 - RealRuntimeFilterType: bloomfilter - HasPushDownToEngine: true test after impl: - Total: 2s345ms - RowsVectorPredFiltered: 595.247102M (595247102) - RealRuntimeFilterType: bloomfilter - HasPushDownToEngine: true ```	2022-03-17 10:07:30 +08:00
Pxl	a824c3e489	[feature](vectorized) support lateral view (#8448 )	2022-03-17 10:04:24 +08:00
wangbo	b8e6c3a00c	[fix] fix bitmap wrong result (#8478 ) Fix a bug where bitmap queries return wrong results, even for the simplest query, such as ``` CREATE TABLE `pv_bitmap_fix2` ( `dt` int(11) NULL COMMENT "", `page` varchar(10) NULL COMMENT "", `user_id_bitmap` bitmap BITMAP_UNION NULL COMMENT "" ) ENGINE=OLAP AGGREGATE KEY(`dt`, `page`) COMMENT "OLAP" DISTRIBUTED BY HASH(`dt`) BUCKETS 2 PROPERTIES ( "replication_allocation" = "tag.location.default: 1", "in_memory" = "false", "storage_format" = "V2" ) Insert any hundreds of rows of data select count(distinct user_id_bitmap) from pv_bitmap_fix2 the result is wrong ``` This is a bug in the vectorization of the storage layer.	2022-03-16 11:39:41 +08:00
HappenLee	d39c021d71	[fix] min function of not null varchar column get error result (#8479 )	2022-03-16 11:38:55 +08:00
camby	3ba4de0d27	[fix](ut) fix some UT compile or run failed cases (#8489 )	2022-03-16 11:38:35 +08:00

1788 Commits