doris

Author	SHA1	Message	Date
Yunfeng,Wu	8c608bbad5	[Doris On ES] Skip function_call expr when processing predicates (#3813 ) Fixed #3801. Do not push down function_call expressions such as split_xxx when processing predicates; Doris BE is responsible for evaluating these predicates. All rows in the table: ``` +------+------+------+------------+------------+ \| k1 \| k2 \| k3 \| UpdateTime \| ArriveTime \| +------+------+------+------------+------------+ \| NULL \| NULL \| kkk1 \| 123456789 \| NULL \| \| kkk1 \| NULL \| NULL \| 123456789 \| NULL \| \| NULL \| kkk2 \| NULL \| 123456789 \| NULL \| +------+------+------+------------+------------+ ``` The following predicates cannot be pushed down to ES: ``` SQL 1: mysql> select * from (select split_part(k1, "1", 1) as kk from case_replay_for_milimin) t where t.kk is not null; +------+ \| kk \| +------+ \| kkk \| +------+ 1 row in set (0.02 sec) SQL 2: mysql> select * from (select split_part(k1, "1", 1) as kk from case_replay_for_milimin) t where t.kk > 'a'; +------+ \| kk \| +------+ \| kkk \| +------+ SQL 3: mysql> select * from (select split_part(k1, "1", 1) as kk from case_replay_for_milimin) t where t.kk > '2'; +------+ \| kk \| +------+ \| kkk \| +------+ 1 row in set (0.03 sec) ```	2020-06-10 11:22:53 +08:00
Yunfeng,Wu	484e7de3c5	[Doris On ES] Fix bug for sparse docvalue context and remove the mistaken usage of total (#3751 ) Another PR, https://github.com/apache/incubator-doris/pull/3513 (https://github.com/apache/incubator-doris/issues/3479), tried to resolve the `inner hits node is not an array` error. That error occurs when a query (one batch-size request) runs against a new segment that lacks the field while filter_path only takes `hits.hits.fields` and `hits.hits._source` into account, which yields a response with no inner hits node: ``` { "_scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAHaUWY1ExUVd0ZWlRY2", "hits": { "total": 1 } } ``` Unfortunately, that PR introduced another serious problem: results are inconsistent across different batch_size values because `total` is misused. To avoid both problems, we simply add `hits.hits._score` to filter_path when `docvalue_mode` is true; `_score` is always `null`, and it populates the inner hits node: ``` { "_scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAHaUWY1ExUVd0ZWlRY2", "hits": { "total": 1, "hits": [ { "_score": null } ] } } ``` Related issue: https://github.com/apache/incubator-doris/issues/3752	2020-06-04 16:31:18 +08:00
Mingyu Chen	27046c5b61	[Enhancement] Improve the performance of queries with IN predicates (#3694 ) This CL mainly changes: 1. Add a new BE config `max_pushdown_conditions_per_column` to limit the number of conditions on a single column that can be pushed down to the storage engine. 2. Add 2 new session variables `max_scan_key_num` and `doris_max_scan_key_num`, which can be set at session level and override the config value in BE.	2020-06-04 11:39:00 +08:00
worker24h	e76f712bb3	[Bug] Fix data loading error in JSON load	2020-05-28 17:28:33 +08:00
lichaoyong	1cc78fe69b	[Enhancement] Convert metrics to JSON format (#3635 ) Add a JSON format for existing metrics like this: ``` { "tags": { "metric":"thread_pool", "name":"thrift-server-pool", "type":"active_thread_num" }, "unit":"number", "value":3 } ``` I add a new JsonMetricVisitor to handle the transformation. It does not modify the existing PrometheusMetricVisitor and SimpleCoreMetricVisitor. This change also: 1. Adds a unit item to describe each metric better. 2. Divides cloning tablet statistics by database. 3. Uses white space to replace newlines in audit.log.	2020-05-27 08:49:30 +08:00
HappenLee	f4c03fe8e2	1. Delete the Sort Node code that is no longer used. (#3666 ) 2. Optimize quick sort with find_the_median to reduce its recursion depth.	2020-05-26 10:20:57 +08:00
Mingyu Chen	3ffc447b38	[OUTFILE] Support `INTO OUTFILE` to export query results (#3584 ) This CL mainly changes: 1. Support the `SELECT INTO OUTFILE` command. 2. Support exporting query results to a file via Broker. 3. Support the CSV export format with a specified column separator and line delimiter.	2020-05-25 21:24:56 +08:00
HuangWei	fb02bb5cd9	[Load] Fix mem limit in NodeChannel (#3643 )	2020-05-22 09:11:59 +08:00
worker24h	4f79036a7e	Add error code into error message (#3645 )	2020-05-21 19:14:35 +08:00
worker24h	ef8fd1fcbe	[Load] Support loading JSON data into Doris by RoutineLoad or StreamLoad (#3553 ) Doris supports loading JSON data by RoutineLoad or StreamLoad	2020-05-21 13:00:49 +08:00
令狐少侠	8018b1c348	[Doris on ES] Fix bug where LIKE is not translated correctly (#3602 ) Why this happened: in the current implementation, a wildcard is translated into DSL only if it is not the first character. Thus, when the SQL is written like '%abc', the translation does not run. How it is fixed: now the translation is triggered by the character '?' or '*'. If it is the first character, translate it directly; otherwise, check whether the preceding character is an escape to determine whether to translate.	2020-05-19 17:06:46 +08:00
HappenLee	7bf926eba8	[Profile] Improve the running profile 1. Delete invalid counter in Data_Stream_Sender. (#3598) 2. Add counters for PartitionHashTable of PartitionAggregationNode: * Hash Probe Method * Row processed by Aggregation * HashFilledBuckets: number of filled buckets in aggregation * HTResize: number of hash table resizes * HashProbe: number of hash table probes * HashFailedProbe: number of failed hash table probes * HashTravelLength: total travel length for probes * HashCollisions: number of hash collisions 3. Delete some unnecessary code in PartitionHashTable via templates	2020-05-16 21:35:30 +08:00
令狐少侠	5a57ecca15	[Doris On ES] Fix bug of query failure in doc_value_mode when fields have no value (#3513 ) #3479 Here I try to explain the cause of the problem and how to fix it. The cause of the problem: take the case in issue (#3479 ) as an example. The general results are as follows: ``` GET table/_doc/_search {"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100} { "took": 6, "timed_out": false, "_shards": { …… }, "hits": { "total": 3, "max_score": null, "hits": [ { "_index": "table", "_score": null, "sort": [ 0 ] }, { "_index": "table", "_score": null, "fields": { "k1": [ "kkk1" ] }, "sort": [ 0 ] }, { "_index": "table", "_score": null, "sort": [ 0 ] } ] } } ``` But in Doris on ES, BE fetches data in parallel on all shards and uses `filter_path` to reduce the network cost. The process is as follows: ``` GET table/_doc/_search?preference=_shards:1&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields {"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100} { "hits": { "total": 0 } } GET table/_doc/_search?preference=_shards:2&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields {"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100} { "hits": { "total": 1 } } GET table/_doc/_search?preference=_shards:3&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields {"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100} { "hits": { "total": 1, "hits": [ { "fields": { "k1": [ "kkk1" ] } } ] } } ``` The scan worker on BE that processes the result of shard 2 will fail. The reasons are as follows: 1. "filter_path" causes the hits.hits object to be absent. 2. In the current implementation, if there are some data rows (total > 0), the hits.hits object must be an array. How to fix it, two methods: 1. Modify "filter_path" to contain the hits. Pros: the fix is very simple. Cons: more network cost. 2. Deal with the case where fields are missing in a batch. Pros: no loss of performance. Cons: the code is more complex. Performance comes first, so I use method 2. Design: 1. Add a variable "_doc_value_mode" to class "EsScrollParser" to indicate whether the data processed by this parser is in doc_value_mode. 2. "_doc_value_mode" is passed along ESScollReader <- ESScanner <- ScrollQueryBuilder::build(), which determines whether the DSL enables doc_value_mode. 3. When hits.hits of the response from ES is empty and total > 0, we know there are data rows but the corresponding fields do not exist. EsScrollParser will use "_doc_value_mode" and _total to construct _total rows whose fields are assigned 'NULL'.	2020-05-11 15:34:12 +08:00
yangzhg	f90da72078	[Planner]Enhance AssertNumRowsNode (#3485 ) Enhance AssertNumRowsNode to support equal, less than, greater than,... assert conditions	2020-05-08 12:49:48 +08:00
yangzhg	c85d847b1e	[CompileBug] fix a compile error (#3502 ) NodeChannel::mark_close() missing `return`	2020-05-07 23:01:46 +08:00
HuangWei	94539e7120	Non blocking OlapTableSink (#3143 ) Implementation Notes NodeChannel _cur_batch -> _pending_batches: when _cur_batch is filled up, move it to _pending_batches. add_row() just produces batches. try_send_and_fetch_status() tries to consume one pending batch. If there is an in-flight packet, skip sending in this round. So we can add one sender thread to be in charge of try_send for all node channels. IndexChannel init(), open() stay the same. Use for_each_node_channel() to expose the detailed changes of NodeChannel. (It's easier to read & modify.) Sender thread: see func OlapTableSink::_send_batch_process(). Why use polling? If we used wait/notify, it would notify when a new batch is generated. We can't skip sending this batch, because it won't notify for the same batch again. So wait/notify can't simply avoid blocking, and I chose polling. It's wasteful to call try_send() continuously, but it's difficult to set a suitable polling interval. Thus, I add std::this_thread::yield() to give up the time slice and give priority to other processes/threads (if there are other processes/threads waiting in the queue).	2020-05-07 10:43:41 +08:00
Yingchun Lai	b58b1b3953	[metrics] Make DorisMetrics to be a real singleton (#3417 )	2020-05-04 09:20:53 +08:00
HappenLee	d0fe7e4d94	[Profile] Make running profile clearer and more intuitive to improve usability (#3405 ) This CL mainly made the following modifications: 1. Delete invalid MemoryUsed counter and add PeakMemUsage in each exec node and datastreamsender 2. Add indent in child exec node profile to make it easy to see the relationship between exec nodes 3. Delete _is_result_order, which is no longer supported, in olap_scan_node.h and olap_scan_node.cpp 4. Add scan_disk method to olap_scanner to fix the counter _num_disks_accessed_counter 5. Now we do not use buffer pool to read and write disk, so annotation eadio counter and 6. Delete the MemUsed counter in exec node.	2020-04-30 14:57:21 +08:00
EmmyMiao87	25e475898e	[Bug] Fix the error result when assert num rows node is used (#3436 ) Before this commit, the child.open() function was not called. If the assert num rows node has a child that processes data in its open function, the assert num rows node fetches no data from the child, so the result is empty (incorrect). This error only appears in an inner subquery that has an aggregation function. For example: `select * from table where k1=(select k1 from (select avg(k1) from table) a);` The first level of subquery returns a non-scalar value, so the assert num rows node is needed. The second level of subquery has an aggregation function, so the child of the assert node is an aggregate node. However, if the open stage of the aggregate node is not called, the get next stage of the aggregate node returns an empty set. So the result is wrong. Fixed #3435.	2020-04-30 14:15:50 +08:00
WingC	0430714ca9	Remove redundant call function _wait_in_flight_packet() (#3399 ) The function `_wait_in_flight_packet` has been called in `_send_cur_batch`. No need to call twice.	2020-04-27 20:45:25 +08:00
HappenLee	4eb27bc7e3	[Profile] Make running profile clearer and more intuitive to improve usability (#3365 ) (#3383 ) This CL mainly made the following modifications: 1. Delete Invalid method in Running Profile Class. 2. Move Memlimit Counter from blockmgr to fragment and add PeakMemUsage Counter 3. Fix the bug of buffer pool memlimit counter 4. Call compute_time_in_profile() before pretty_print() to show the _local_time_percent without child running profile 5. Add TransferThread ThreadToken count in AveThreadToken Counter	2020-04-24 21:38:55 +08:00
yangzhg	a58bc1957e	Fix bug where except may produce incorrect values (#3381 )	2020-04-23 09:35:41 +08:00
HangyuanLiu	ad6698cd31	[Performance] Use Google/CCTZ to replace boost in timezone functions (#3300 ) NOTICE: the third-party dependencies need to be upgraded to add libcctz.	2020-04-23 09:26:04 +08:00
Yingchun Lai	4a7a88ede1	[LSAN] Fix some memory leak detected by LSAN (#3326 )	2020-04-22 22:59:44 +08:00
Yunfeng,Wu	b60aabda11	[Doris On ES] Pushdown some castexpr predicate to ES (#3351 ) Process castexpr, such as: k (float) > 2.0, k(int) > 3.2. Doris On ES should skip the Doris native cast transformation of every row's column value and instead push this `cast semantic` down to Elasticsearch. I believe that for this kind of `predicate`, this decreases the amount of data transmitted. k1 is float: ``` k1 >= 5 ``` push-down filter: ``` {"range":{"k1":{"gte":"5.000000"}}} ``` k2 is int: ``` k2 > 3.2 ``` push-down filter: ``` {"range":{"k2":{"gte":"3.2"}}} ```	2020-04-21 08:34:20 +08:00
令狐少侠	688927918c	[Doris on ES] Fix bug: when Doris and ES type not match (#3315 )	2020-04-14 20:15:13 +08:00
Yunfeng,Wu	a467c6f81f	[ES Connector] Add field context for string field keyword type (#3305 ) This PR is just a transitional solution; it would be better to move the predicate transformation from Doris BE to Doris FE, so that Doris BE is only responsible for fetching data from ES. Add an `enable_keyword_sniff` configuration item for creating an External Elasticsearch Table. It defaults to true and sniffs the `keyword` type on the `text analyzed` field, returning the `json_path` that substitutes for the original column name. ``` CREATE EXTERNAL TABLE `test` ( `k1` varchar(20) COMMENT "", `create_time` datetime COMMENT "" ) ENGINE=ELASTICSEARCH PROPERTIES ( "hosts" = "http://10.74.167.16:8200", "user" = "root", "password" = "root", "index" = "test", "type" = "doc", "enable_keyword_sniff" = "true" ); ``` Note: `enable_keyword_sniff` defaults to "true". Run this SQL: ``` select * from test where k1 = "wu yun feng" ``` Output predicate DSL: ``` {"term":{"k1.keyword":"wu yun feng"}} ``` In this PR I also remove the Elasticsearch version detection logic, which is useless for now; it may be needed in the future.	2020-04-13 23:07:33 +08:00
Yunfeng,Wu	614a76beea	[Doris on ES] Support compound_and predicate push down to Elasticsearch (#3277 ) Related issue: https://github.com/apache/incubator-doris/issues/3248 SQL: ``` select * from test where (k2 = 6 and k3 = 1) or (k2 = 2 and k3 =3 and k4 = 'beijing'); ``` Output filter: ``` ((#k2:[6 TO 6] #k3:[1 TO 1]) (#(#k2:[2 TO 2] #k3:[3 TO 3]) #k4:beijing))~1 ``` SQL: ``` select * from test where (k2 = 6 or k3 = 7) or (k2 = 2 and k3 =3 and (k4 = 'beijing' or k4 = 'zhaochun')); ``` Output filter: ``` (k2:[6 TO 6] k3:[7 TO 7] (#(#k2:[2 TO 2] #k3:[3 TO 3]) #((k4:beijing k4:zhaochun)~1)))~1 ``` SQL: ``` select * from test where (k2 = 6 or k3 = 7) or (k2 = 2 and abs(k3) =3 and (k4 = 'beijing' or k4 = 'zhaochun')); ``` Output filter (`abs` cannot be pushed down to ES, so Doris on ES does not process this scenario): ``` match_all ```	2020-04-08 21:09:39 +08:00
HuangWei	2ed184e06a	Add config: tablet writer open rpc timeout (#3258 )	2020-04-03 16:43:56 +08:00
yangzhg	63cee94c5c	Fix output results that may be incorrect when using intersect and except statements (#3228 )	2020-04-01 20:58:43 +08:00
lichaoyong	6a9a62901f	Fix bug of memory limit when grouping by varchar columns. (#3242 ) select date_format(k10, '%Y%m%d') as myk10 from baseall group by myk10; The results of the date_format function in the query above are stored in MemPool during query execution. If the query handles millions of rows, it consumes a lot of memory. The MemPool should be cleared at intervals.	2020-04-01 18:48:18 +08:00
HuangWei	5f9359d618	Use SleepFor() instead of usleep() (#3211 )	2020-03-29 14:18:19 +08:00
lichaoyong	e20d905d70	Remove unused KUDU codes (#3175 ) KUDU tables have not been supported for a long time. Remove the code related to them.	2020-03-24 13:54:05 +08:00
HangyuanLiu	d4c1938b5c	Open datetime min value limit (#3158 ) The min_value of datetime in olap/type.h is 0000-01-01 00:00:00, so we don't need to restrict the datetime min in tablet_sink	2020-03-24 10:52:57 +08:00
EmmyMiao87	dff3c0d57e	Revert "Remove deep copy when doing hash table EvalRow (#3171 )" (#3173 )	2020-03-23 15:29:46 +08:00
wyb	dd8d748c55	Remove deep copy when doing hash table EvalRow (#3171 ) remove varchar column deep copy in partitioned hash table EvalRow function	2020-03-21 09:52:49 +08:00
yangzhg	d29ed84b6a	[Bug] Fix bug that right semi/anti join is not right (#3167 ) This bug is introduced by PR: #3148. right semi/anti join can not use `insert_unique` in build phase of join.	2020-03-20 20:58:55 +08:00
HappenLee	2dc995df7b	[CodeStyle] Rename new_partition_aggregation_node and new_partitioned_hash_table (#3166 )	2020-03-20 19:59:01 +08:00
HappenLee	5a8fcd263f	[CodeStyle] Delete obsolete code of partition_aggregation_node and partitioned_hash_table (#3162 )	2020-03-20 16:25:29 +08:00
Yingchun Lai	c08d6e4708	[tablet meta] Do some refactoring on TabletMeta (#3136 ) Remove the return values of some functions that always return OLAP_SUCCESS; optimize some loops	2020-03-20 15:03:22 +08:00
lichaoyong	2d3dbc2c42	Revert "[CodeStyle] Del obsolete code of partition_aggregation_node (#3154 )" (#3160 ) This reverts commit dae013d797c1c2c9e54246d5ace4bdd90b297d43.	2020-03-20 14:47:25 +08:00
lichaoyong	5f004cb009	Revert "[CodeStyle] Remove unused PartitionedHashTable (#3156 )" (#3159 ) This reverts commit d3fd44f0a2fe076d2c62851babc162fcebe4d63b.	2020-03-20 14:42:40 +08:00
lichaoyong	d3fd44f0a2	[CodeStyle] Remove unused PartitionedHashTable (#3156 )	2020-03-20 12:19:08 +08:00
HappenLee	dae013d797	[CodeStyle] Del obsolete code of partition_aggregation_node (#3154 )	2020-03-20 11:33:55 +08:00
yangzhg	f0db9272dd	[Performance] Improve performance of hash join in some cases (#3148 ) Improve the performance of hash join when the build table has too many duplicated rows, which cause hash table collisions and slow down probe performance. In this PR, when the join type is semi join or anti join, we build a hash table without duplicated rows. benchmark: dataset: tpcds dataset `store_sales` and `catalog_sales` ``` mysql> select count() from catalog_sales; +----------+ \| count() \| +----------+ \| 14401261 \| +----------+ 1 row in set (0.44 sec) mysql> select count(distinct cs_bill_cdemo_sk) from catalog_sales; +------------------------------------+ \| count(DISTINCT `cs_bill_cdemo_sk`) \| +------------------------------------+ \| 1085080 \| +------------------------------------+ 1 row in set (2.46 sec) mysql> select count() from store_sales; +----------+ \| count() \| +----------+ \| 28800991 \| +----------+ 1 row in set (0.84 sec) mysql> select count(distinct ss_addr_sk) from store_sales; +------------------------------+ \| count(DISTINCT `ss_addr_sk`) \| +------------------------------+ \| 249978 \| +------------------------------+ 1 row in set (2.57 sec) ``` test queries: query1: `select count() from (select store_sales.ss_addr_sk from store_sales left semi join catalog_sales on catalog_sales.cs_bill_cdemo_sk = store_sales.ss_addr_sk) a;` query2: `select count() from (select catalog_sales.cs_bill_cdemo_sk from catalog_sales left semi join store_sales on catalog_sales.cs_bill_cdemo_sk = store_sales.ss_addr_sk) a;` benchmark result: \|\|query1\|query2\| \|:--:\|:--:\|:--:\| \|before\|14.76 sec\|3 min 16.52 sec\| \|after\|12.64 sec\|10.34 sec\|	2020-03-20 10:31:14 +08:00
lichaoyong	b286f4271b	Remove unused PreAggregtionNode (#3151 )	2020-03-20 09:19:47 +08:00
HangyuanLiu	d01b58bff6	Support 64 bit timestamp in from_unixtime (#3069 )	2020-03-17 17:30:42 +08:00
yangzhg	0959abc1dc	[ExceptNode] Implement except node (#3056 ) implement except node, support statement like: ``` select a from t1 except select b from t2 ```	2020-03-17 10:54:40 +08:00
HuangWei	a80e9bf229	Fix broker scan node mem limit check (#3123 )	2020-03-16 20:36:46 +08:00
yangzhg	dc07182bd4	[Intersect] Implement intersect node (#3034 ) Implement the intersect node; it now supports statements like `select a from t intersect select b from t1 intersect select 1;`	2020-03-09 10:52:55 +08:00

532 Commits