doris

Author	SHA1	Message	Date
Yunfeng,Wu	484e7de3c5	[Doirs On ES] fix bug for sparse docvalue context and remove the mistake usage of total (#3751 ) The other PR : https://github.com/apache/incubator-doris/pull/3513 (https://github.com/apache/incubator-doris/issues/3479) try to resolved the `inner hits node is not an array` because when a query( batch-size) run against new segment without this field, as-well the filter_path just only take `hits.hits.fields` 、`hits.hits._source` into account, this would appear an null inner hits node: ``` { "_scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAHaUWY1ExUVd0ZWlRY2", "hits": { "total": 1 } } ``` Unfortunately this PR introduce another serious inconsistent result with different batch_size because of misusing the `total`. To avoid this two problem, we just add `hits.hits._score` to filter_path when `docvalue_mode` is true, `_score` would always `null` , and populate the inner hits node: ``` { "_scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAHaUWY1ExUVd0ZWlRY2", "hits": { "total": 1, "hits": [ { "_score": null } ] } } ``` related issue: https://github.com/apache/incubator-doris/issues/3752	2020-06-04 16:31:18 +08:00
令狐少侠	5a57ecca15	[Doris On ES]fix bug of query failed in doc_value_mode when fields have none value (#3513 ) #3479 Here I try to explain the cause of the problem and how to fix it. The Cause of The problem Take the case in issue(#3479 ) as an example: The general results are as follows: ``` GET table/_doc/_search {"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100} { "took": 6, "timed_out": false, "_shards": { …… }, "hits": { "total": 3, "max_score": null, "hits": [ { "_index": "table", "_score": null, "sort": [ 0 ] }, { "_index": "table", "_score": null, "fields": { "k1": [ "kkk1" ] }, "sort": [ 0 ] }, { "_index": "table", "_score": null, "sort": [ 0 ] } ] } } ``` But in Doris on ES，Be fetched data parallelly on all shards, and use `filter_path` to reduce the network cost. The process will be as follows: ``` GET table/_doc/_search?preference=_shards:1&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields {"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100} { "hits": { "total": 0 } } GET table/_doc/_search?preference=_shards:2&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields {"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100} { "hits": { "total": 1 } } GET table/_doc/_search?preference=_shards:3&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields {"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100} { "hits": { "total": 1, "hits": [ { "fields": { "k1": [ "kkk1" ] } } ] } } ``` Scan-Worker On BE which processed result of shard2 will failed. The reasons are as follows: 1. "filter_path" causes the hits.hits object not exist. 2. In the current implementation, if there are some data rows（total > 0）, the hits.hits. object must be an array How To Fix it Two Method: 1. modify "filter_path" to contain the hits. Pros: Fixed Code is very simple Cons: More network cost 2. Deal with the case where fields are missing in a batch. Pros: No loss of performance Cons: Code is more complex Performance first, I use Method2. Design 1. Add a variable "_doc_value_mode" into Class "EsScrollParser" to =indicate whether the data processed by this parser is doc_value_mode or not. 2. "_doc_value_mode" is passed from ESScollReader <- ESScanner <- ScrollQueryBuilder::build() that determines whether DSL is enable doc_value_mode 3. When hits.hits of response from ES is empty and total > 0. We know there are data lines, but the corresponding fields do not exist. EsScrollParser will use "_doc_value_mode" and _total to construct _total lines which fields are assigned with 'NULL'	2020-05-11 15:34:12 +08:00
Yunfeng,Wu	e41aef54f2	[Doirs-On-Elasticsearch] Accerlate first scroll search (#2575 ) Add terminate_after for the first scroll to avoid decompress all postings list	2019-12-27 14:05:21 +08:00
Yunfeng,Wu	5a3f71dd6b	Push limit to Elasticsearch external table (#2400 )	2019-12-07 21:13:44 +08:00
Yunfeng,Wu	0f00febd21	Optimize Doris On Elasticsearch performance (#2237 ) Pure DocValue optimization for doris-on-es Future todo: Today, for every tuple scan we check if pure_docvalue is enabled, this is not reasonable, should check pure_docvalue enabled for one whole scan outside, I will add this todo in future	2019-12-04 12:57:45 +08:00
Yunfeng,Wu	a465b38874	Enhance doris on es error message (#2297 ) Enhance doris on es error message and modify some field data transform error. For varchar/char type, sometimes elasticsearch user post some not-string value to Elasticsearch Index. because of reading value from _source, we can not process all json type and then just transfer the value to original string representation this may be a tricky, but we can workaround this issue	2019-11-26 18:32:25 +08:00
kangkaisen	1c229fbd92	Fix es_scan_reader_test in debug mode (#1905 )	2019-09-28 00:02:30 +08:00
Yunfeng,Wu	8034d83e20	Add scroll keepalive and http timeout configuration (#1731 )	2019-09-02 19:04:30 +08:00
ZHAO Chun	9d03ba236b	Uniform Status (#1317 )	2019-06-14 23:38:31 +08:00
lide	9c82d41981	Support Doris query ES by HTTP way (#925 )	2019-04-28 17:14:44 +08:00

10 Commits