doris

Author	SHA1	Message	Date
EmmyMiao87	25e475898e	[Bug] Fix the error result when assert num rows node is used (#3436 ) Before this commit, the child.open() function was not called. If the assert num rows node has a child that processes data in its open function, the assert num rows node fetches no data from the child, so the result is empty (incorrect). This error only appears in an inner subquery that has an aggregation function. For example: `select * from table where k1=(select k1 from (select avg(k1) from table) a);` The first level of subquery returns a non-scalar value, so the assert num rows node is needed. The second level of subquery has an aggregation function, so the child of the assert node is an aggregate node. However, if the open stage of the aggregate node is not called, the get next state of the aggregate node returns an empty set, so the result is wrong. Fixed #3435.	2020-04-30 14:15:50 +08:00
WingC	0430714ca9	Remove redundant call function _wait_in_flight_packet() (#3399 ) The function `_wait_in_flight_packet` has been called in `_send_cur_batch`. No need to call twice.	2020-04-27 20:45:25 +08:00
HappenLee	4eb27bc7e3	[Profile] Make running profile clearer and more intuitive to improve usability (#3365 ) (#3383 ) This CL mainly made the following modifications: 1. Delete Invalid method in Running Profile Class. 2. Move Memlimit Counter from blockmgr to fragment and add PeakMemUsage Counter 3. Fix the bug of buffer pool memlimit counter 4. Call compute_time_in_profile() before pretty_print() to show the _local_time_percent without child running profile 5. Add TransferThread ThreadToken count in AveThreadToken Counter	2020-04-24 21:38:55 +08:00
yangzhg	a58bc1957e	Fix expect may produce incorrect values (#3381 )	2020-04-23 09:35:41 +08:00
HangyuanLiu	ad6698cd31	[Performance] Use Google/CCTZ to replace boost at timezone function (#3300 ) NOTICE: the thirdparty dependency needs to be upgraded to add libcctz.	2020-04-23 09:26:04 +08:00
Yingchun Lai	4a7a88ede1	[LSAN] Fix some memory leak detected by LSAN (#3326 )	2020-04-22 22:59:44 +08:00
Yunfeng,Wu	b60aabda11	[Doris On ES] Pushdown some castexpr predicate to ES (#3351 ) Process castexpr predicates such as k (float) > 2.0 or k (int) > 3.2. Doris on ES should skip the Doris-native cast transformation on every row's column value; instead, we push this `cast semantic` down to Elasticsearch. In this `predicate` situation, doing so should decrease the amount of data transmitted. k1 is float: ``` k1 >= 5 ``` push-down filter: ``` {"range":{"k1":{"gte":"5.000000"}}} ``` k2 is int: ``` k2 > 3.2 ``` push-down filter: ``` {"range":{"k2":{"gte":"3.2"}}} ```	2020-04-21 08:34:20 +08:00
令狐少侠	688927918c	[Doris on ES] Fix bug: when Doris and ES type not match (#3315 )	2020-04-14 20:15:13 +08:00
Yunfeng,Wu	a467c6f81f	[ES Connector] Add field context for string field keyword type (#3305 ) This PR is just a transitional approach, but it is better to move the predicates transformation from Doris BE to Doris BE; in this way, Doris BE is responsible for fetching data from ES. Add an `enable_keyword_sniff` configuration item for creating an External Elasticsearch Table. It defaults to true and sniffs the `keyword` type on the `text analyzed` field, returning the `json_path` that substitutes for the original column name. ``` CREATE EXTERNAL TABLE `test` ( `k1` varchar(20) COMMENT "", `create_time` datetime COMMENT "" ) ENGINE=ELASTICSEARCH PROPERTIES ( "hosts" = "http://10.74.167.16:8200", "user" = "root", "password" = "root", "index" = "test", "type" = "doc", "enable_keyword_sniff" = "true" ); ``` Note: `enable_keyword_sniff` defaults to "true". Run this SQL: ``` select * from test where k1 = "wu yun feng" ``` Output predicate DSL: ``` {"term":{"k1.keyword":"wu yun feng"}} ``` In this PR, I also remove the Elasticsearch version detection logic because it is useless for now; it may be needed in the future.	2020-04-13 23:07:33 +08:00
Yunfeng,Wu	614a76beea	[Doris on ES] Support compound_and predicate push down to Elasticsearch (#3277 ) Related issue: https://github.com/apache/incubator-doris/issues/3248 SQL: ``` select * from test where (k2 = 6 and k3 = 1) or (k2 = 2 and k3 =3 and k4 = 'beijing'); ``` Output filter: ``` ((#k2:[6 TO 6] #k3:[1 TO 1]) (#(#k2:[2 TO 2] #k3:[3 TO 3]) #k4:beijing))~1 ``` SQL: ``` select * from test where (k2 = 6 or k3 = 7) or (k2 = 2 and k3 =3 and (k4 = 'beijing' or k4 = 'zhaochun')); ``` Output filter: ``` (k2:[6 TO 6] k3:[7 TO 7] (#(#k2:[2 TO 2] #k3:[3 TO 3]) #((k4:beijing k4:zhaochun)~1)))~1 ``` SQL: ``` select * from test where (k2 = 6 or k3 = 7) or (k2 = 2 and abs(k3) =3 and (k4 = 'beijing' or k4 = 'zhaochun')); ``` Output filter (`abs` cannot be pushed down to ES, so Doris on ES does not process this scenario): ``` match_all ```	2020-04-08 21:09:39 +08:00
HuangWei	2ed184e06a	Add config: tablet writer open rpc timeout (#3258 )	2020-04-03 16:43:56 +08:00
yangzhg	63cee94c5c	Fix output results that may be incorrect when using intersect and except statements (#3228 ) Output results may be incorrect when using intersect and except statements.	2020-04-01 20:58:43 +08:00
lichaoyong	6a9a62901f	Fix bug of memory limit when group by varchar columns. (#3242 ) select date_format(k10, '%Y%m%d') as myk10 from baseall group by myk10; The result of the date_format function in the query above is stored in the MemPool during query execution. If the query handles millions of rows, it consumes a lot of memory. The MemPool should be cleared at intervals.	2020-04-01 18:48:18 +08:00
HuangWei	5f9359d618	Use SleepFor() instead of usleep() (#3211 )	2020-03-29 14:18:19 +08:00
lichaoyong	e20d905d70	Remove unused KUDU codes (#3175 ) KUDU tables have not been supported for a long time. Remove the code related to them.	2020-03-24 13:54:05 +08:00
HangyuanLiu	d4c1938b5c	Open datetime min value limit (#3158 ) The min_value of datetime in olap/type.h is 0000-01-01 00:00:00, so we don't need to restrict the datetime minimum in tablet_sink.	2020-03-24 10:52:57 +08:00
EmmyMiao87	dff3c0d57e	Revert "Remove deep copy when doing hash table EvalRow (#3171 )" (#3173 )	2020-03-23 15:29:46 +08:00
wyb	dd8d748c55	Remove deep copy when doing hash table EvalRow (#3171 ) remove varchar column deep copy in partitioned hash table EvalRow function	2020-03-21 09:52:49 +08:00
yangzhg	d29ed84b6a	[Bug] Fix bug that right semi/anti join is not right (#3167 ) This bug is introduced by PR: #3148. right semi/anti join can not use `insert_unique` in build phase of join.	2020-03-20 20:58:55 +08:00
HappenLee	2dc995df7b	[CodeStyle] Rename new_partition_aggregation_node and new_partitioned_hash_table (#3166 )	2020-03-20 19:59:01 +08:00
HappenLee	5a8fcd263f	[CodeStyle] Delete obsolete code of partition_aggregation_node and partitioned_hash_table (#3162 )	2020-03-20 16:25:29 +08:00
Yingchun Lai	c08d6e4708	[tablet meta] Do some refactor on TabletMeta (#3136 ) Remove the return value of some functions that always return OLAP_SUCCESS; optimize some loops.	2020-03-20 15:03:22 +08:00
lichaoyong	2d3dbc2c42	Revert "[CodeStyle] Del obsolete code of partition_aggregation_node (#3154 )" (#3160 ) This reverts commit dae013d797c1c2c9e54246d5ace4bdd90b297d43.	2020-03-20 14:47:25 +08:00
lichaoyong	5f004cb009	Revert "[CodeStyle] Remove unused PartitionedHashTable (#3156 )" (#3159 ) This reverts commit d3fd44f0a2fe076d2c62851babc162fcebe4d63b.	2020-03-20 14:42:40 +08:00
lichaoyong	d3fd44f0a2	[CodeStyle] Remove unused PartitionedHashTable (#3156 )	2020-03-20 12:19:08 +08:00
HappenLee	dae013d797	[CodeStyle] Del obsolete code of partition_aggregation_node (#3154 )	2020-03-20 11:33:55 +08:00
yangzhg	f0db9272dd	[Performance] Improve performance of hash join in some cases (#3148 ) Improve the performance of hash join when the build table has too many duplicated rows, which cause hash table collisions and slow down probe performance. In this PR, when the join type is semi join or anti join, we build a hash table without duplicated rows. benchmark: dataset: tpcds dataset `store_sales` and `catalog_sales` ``` mysql> select count() from catalog_sales; +----------+ \| count() \| +----------+ \| 14401261 \| +----------+ 1 row in set (0.44 sec) mysql> select count(distinct cs_bill_cdemo_sk) from catalog_sales; +------------------------------------+ \| count(DISTINCT `cs_bill_cdemo_sk`) \| +------------------------------------+ \| 1085080 \| +------------------------------------+ 1 row in set (2.46 sec) mysql> select count() from store_sales; +----------+ \| count() \| +----------+ \| 28800991 \| +----------+ 1 row in set (0.84 sec) mysql> select count(distinct ss_addr_sk) from store_sales; +------------------------------+ \| count(DISTINCT `ss_addr_sk`) \| +------------------------------+ \| 249978 \| +------------------------------+ 1 row in set (2.57 sec) ``` test queries: query1: `select count() from (select store_sales.ss_addr_sk from store_sales left semi join catalog_sales on catalog_sales.cs_bill_cdemo_sk = store_sales.ss_addr_sk) a;` query2: `select count() from (select catalog_sales.cs_bill_cdemo_sk from catalog_sales left semi join store_sales on catalog_sales.cs_bill_cdemo_sk = store_sales.ss_addr_sk) a;` benchmark result: \|\|query1\|query2\| \|:--:\|:--:\|:--:\| \|before\|14.76 sec\|3 min 16.52 sec\| \|after\|12.64 sec\|10.34 sec\|	2020-03-20 10:31:14 +08:00
lichaoyong	b286f4271b	Remove unused PreAggregtionNode (#3151 )	2020-03-20 09:19:47 +08:00
HangyuanLiu	d01b58bff6	Support 64 bit timestamp in from_unixtime (#3069 ) Support 64 bit timestamp in from_unixtime	2020-03-17 17:30:42 +08:00
yangzhg	0959abc1dc	[ExceptNode] Implement except node (#3056 ) implement except node, support statement like: ``` select a from t1 except select b from t2 ```	2020-03-17 10:54:40 +08:00
HuangWei	a80e9bf229	Fix broker scan node mem limit check (#3123 )	2020-03-16 20:36:46 +08:00
yangzhg	dc07182bd4	[Intersect] Implements intersect node (#3034 ) Implement the intersect node; it now supports statements like `select a from t intersect select b from t1 intersect select 1;`	2020-03-09 10:52:55 +08:00
HangyuanLiu	1d296e907d	Fix orc load timestamp bug (#3047 ) The timestamp value loaded from an orc file is wrong: it has an offset relative to hive and spark. Because the time zone of orc's timestamp is stored inside orc's stripe information, the timestamp obtained here is an offset timestamp, so parsing the timestamp with UTC yields the actual datetime literal.	2020-03-06 18:03:27 +08:00
yangzhg	3b5a0b6060	[TPCDS] Implement the planner for set operation (#2957 ) Implement intersect and except planner. This CL does not implement intersect and except node in execution level.	2020-02-27 16:03:31 +08:00
HangyuanLiu	e23d735bac	Fix decimal bug in orc load (#2984 )	2020-02-26 10:58:18 +08:00
trueeyu	a340bc7a00	Remove unused LLVM related codes of directory:be/src/runtime (#2910 ) (#2985 ) Remove unused LLVM related codes of directory (step 4):be/src/runtime (#2910) There is a lot of LLVM-related code in the code base, but it is not really used. Higher versions of GCC are not compatible with the LLVM 3.4.2 version currently used by Doris. This PR deletes all LLVM-related code in the directory be/src/runtime.	2020-02-25 13:47:20 +08:00
Mingyu Chen	8eb413fa69	[Bug][RoutineLoad] Fix bug that routine Load encounter "label already used" exception (#2959 ) This CL modifies 2 things: 1. When a routine load task fails to be submitted, it is not put back into the task queue. 2. The rpc timeout when executing a routine load task in BE is set to the `query_timeout` of the task plan. ISSUE: #2964	2020-02-22 22:01:14 +08:00
Mingyu Chen	35b09ecd66	[JDK] Support OpenJDK (#2804 ) Support compiling and running the Frontend process and Broker process with OpenJDK. OpenJDK 13 is tested.	2020-02-20 23:47:02 +08:00
trueeyu	839ec45197	Remove llvm relative code from be/src/exec (#2955 ) Remove unused LLVM related codes of directory:be/src/exec (#2910) There is a lot of LLVM-related code in the code base, but it is not really used. Higher versions of GCC are not compatible with the LLVM 3.4.2 version currently used by Doris. This PR deletes all LLVM-related code in the directory be/src/exec.	2020-02-20 20:43:26 +08:00
lichaoyong	1cf0fb9117	Use ThreadPool to refactor MemTableFlushExecutor (#2931 ) 1. MemTableFlushExecutor maintains a ThreadPool to receive FlushTasks. 2. FlushToken is used to separate tasks from different tablets. Every DeltaWriter of a tablet constructs a FlushToken; tasks within a FlushToken are handled serially, and tasks in different FlushTokens are handled concurrently. 3. I have removed the thread limit on data_dir, because I/O is not the main time consumer of the Flush thread. Most of the time is consumed by CPU decoding and compression.	2020-02-18 18:39:04 +08:00
HangyuanLiu	43583e7bd2	Fix orc load bug (#2912 )	2020-02-16 19:14:42 +08:00
kangkaisen	6c33f80544	Add disable_storage_page_cache config (#2890 ) 1. When reading a column data page: for compaction, schema_change, and check_sum, we don't use the page cache; for queries, if config::disable_storage_page_cache is false, we use the page cache. 2. When reading a column index page: if config::disable_storage_page_cache is false, we use the page cache.	2020-02-16 19:13:30 +08:00
令狐少侠	fd492e3b6f	[Doris on ES] Support escape character (#2865 )	2020-02-13 11:32:48 +08:00
LingBin	3c539aac54	[Refactor] Some tiny refactor on streaming-load related code (#2891 ) Mainly contains the following modifications: 1. Use `std::unique_ptr` to replace some naked pointers 2. Modify some methods from member-method to local-static-function 3. Modify some methods do not need to be public to private 4. Some formatting changes: such as wrapping lines that are too long 5. Remove some useless variables 6. Add or modify some comments for easier understanding No functional changes in this patch.	2020-02-13 10:42:52 +08:00
yangzhg	3e160aeb66	[GroupingSet] fix a bug when using grouping set without all column in a grouping set item (#2877 ) Fix a bug where using grouping sets without all columns in a grouping set item produces wrong values. Fix the grouping function check not working in the group by clause.	2020-02-12 21:50:12 +08:00
yangzhg	502fa2eb50	[GroupingSet] Fix core when using grouping sets in large data (#2858 ) dst_tuples memory size to Allocate is wrong	2020-02-07 21:40:29 +08:00
Yunfeng,Wu	b35e8153c0	[Doris on Es] Fix lte and gte error expression (#2851 ) LE should be LTE; GE should be GTE.	2020-02-06 20:52:14 +08:00
Lijia Liu	99ad56d1bf	Support bitmap index for more type (#2630 ) For #2589 1. date(uint24_t)/datetime(int64_t)/largeint(int128_t) use frame of reference code as dict. 2. decimal(decimal12_t) also uses frame of reference code as dict. 3. float/double use bitshuffle code as dict.	2020-01-31 21:09:29 +08:00
HangyuanLiu	64e99f29e6	Fix parquet arrow read batch bug (#2812 ) Fix parquet arrow read batch bug #2811 The original code determined the number of rows in the batch from the number of rows in the parquet RowGroup. But now a batch takes 65535 lines, so when the parquet row count is greater than 65535, the batch row count doesn't match the RowGroup row count. The code uses the field "_current_line_of_group" as an array position, which can index past the end of the array and crash BE.	2020-01-21 10:57:56 +08:00
yangzhg	fc55423032	[SQL] Support Grouping Sets, Rollup and Cube to extend group by statement Support Grouping Sets, Rollup and Cube to extend group by statement support GROUPING SETS syntax ``` SELECT a, b, SUM( c ) FROM tab1 GROUP BY GROUPING SETS ( (a, b), (a), (b), ( ) ); ``` cube or rollup like ``` SELECT a, b,c, SUM( d ) FROM tab1 GROUP BY ROLLUP\|CUBE(a,b,c) ``` [ADD] support grouping functions in expr like grouping(a) + grouping(b) (#2039) [FIX] fix analyzer error in window function(#2039)	2020-01-17 16:24:02 +08:00
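
The semi/anti join change described in #3148 above can be sketched in a few lines. This is a minimal Python illustration of the idea only, not Doris code: the function names (`build_unique`, `left_semi_join`, `left_anti_join`) and the sample rows are hypothetical, and a Python `set` stands in for the BE hash table. Because a left semi or anti join only asks whether a matching key exists, the build side can drop duplicate keys.

```python
def build_unique(build_rows, build_key):
    """Build phase: keep one entry per distinct join key (duplicates dropped)."""
    return {build_key(row) for row in build_rows}


def left_semi_join(probe_rows, build_rows, probe_key, build_key):
    """Return probe rows that have at least one match on the build side."""
    keys = build_unique(build_rows, build_key)
    return [row for row in probe_rows if probe_key(row) in keys]


def left_anti_join(probe_rows, build_rows, probe_key, build_key):
    """Return probe rows that have no match on the build side."""
    keys = build_unique(build_rows, build_key)
    return [row for row in probe_rows if probe_key(row) not in keys]


# Hypothetical sample data; the column names follow the benchmark queries above.
store_sales = [{"ss_addr_sk": 1}, {"ss_addr_sk": 2}, {"ss_addr_sk": 3}]
catalog_sales = [
    {"cs_bill_cdemo_sk": 1},
    {"cs_bill_cdemo_sk": 1},  # duplicate key: stored only once in the build set
    {"cs_bill_cdemo_sk": 3},
]

semi = left_semi_join(
    store_sales, catalog_sales,
    probe_key=lambda r: r["ss_addr_sk"],
    build_key=lambda r: r["cs_bill_cdemo_sk"],
)
anti = left_anti_join(
    store_sales, catalog_sales,
    probe_key=lambda r: r["ss_addr_sk"],
    build_key=lambda r: r["cs_bill_cdemo_sk"],
)
```

As the follow-up fix #3167 in the log notes, this shortcut does not carry over to right semi/anti joins, where the build phase cannot use `insert_unique`.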


214 Commits