Commit Graph

6 Commits

Author SHA1 Message Date
265c26f67d [Doris On ES] Add docvalue limitation for doc_values scan and enable doc_values scan default (#4055) 2020-07-10 18:37:36 +08:00
5a57ecca15 [Doris On ES]fix bug of query failed in doc_value_mode when fields have none value (#3513)
#3479 

Here I try to explain the cause of the problem and how to fix it.

**The Cause of The problem**
Take the case in issue(#3479 ) as an example:
The general results are as follows:
```
GET table/_doc/_search
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    ……
  },
  "hits": {
    "total": 3,
    "max_score": null,
    "hits": [
      {
        "_index": "table",
        "_score": null,
        "sort": [
          0
        ]
      },
      {
        "_index": "table",
        "_score": null,
        "fields": {
          "k1": [
            "kkk1"
          ]
        },
        "sort": [
          0
        ]
      },
      {
        "_index": "table",
        "_score": null,
        "sort": [
          0
        ]
      }
    ]
  }
}
```

But in Doris on ES,Be fetched data parallelly on all shards, and use `filter_path` to reduce the network cost. The process will be as follows:
```
GET table/_doc/_search?preference=_shards:1&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}

{
  "hits": {
    "total": 0
  }
}

GET table/_doc/_search?preference=_shards:2&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
  "hits": {
    "total": 1
  }
}

GET table/_doc/_search?preference=_shards:3&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
  "hits": {
    "total": 1,
    "hits": [
      {
        "fields": {
          "k1": [
            "kkk1"
          ]
        }
      }
    ]
  }
}
```
*Scan-Worker On BE which processed result of shard2  will failed.* 

**The reasons are as follows:**
1. "filter_path" causes the hits.hits object not exist.  
2. In the current implementation, if there are some data rows(total > 0), the hits.hits. object must be an array

**How To Fix it**

Two Method:
1. modify "filter_path" to contain the hits.  
Pros: Fixed Code is very simple
Cons: More network cost
2. Deal with the case where fields are missing in a batch. 
Pros: No loss of performance
Cons: Code is more complex 

Performance first, I use Method2.

**Design**
1. Add a variable "_doc_value_mode" into Class "EsScrollParser" to =indicate whether the data processed by this parser is doc_value_mode or not.
2. "_doc_value_mode" is passed from ESScollReader <- ESScanner <- ScrollQueryBuilder::build() that determines whether DSL is enable doc_value_mode
3. When hits.hits of response from ES is empty and total > 0. We know there are data lines, but the corresponding fields do not exist. EsScrollParser will use "_doc_value_mode"  and _total to construct _total lines which fields are assigned with 'NULL'
2020-05-11 15:34:12 +08:00
5a3f71dd6b Push limit to Elasticsearch external table (#2400) 2019-12-07 21:13:44 +08:00
0f00febd21 Optimize Doris On Elasticsearch performance (#2237)
Pure DocValue optimization for doris-on-es

Future todo:
Today, for every tuple scan we check if pure_docvalue is enabled, this is not reasonable,  should check pure_docvalue enabled for one whole scan outside,  I will add this todo in future
2019-12-04 12:57:45 +08:00
c6dfe83b6d Add particular log info for doris on es (#1711) 2019-08-27 22:16:28 +08:00
9c82d41981 Support Doris query ES by HTTP way (#925) 2019-04-28 17:14:44 +08:00