Commit Graph

10 Commits

Author SHA1 Message Date
484e7de3c5 [Doirs On ES] fix bug for sparse docvalue context and remove the mistake usage of total (#3751)
The other PR : https://github.com/apache/incubator-doris/pull/3513 (https://github.com/apache/incubator-doris/issues/3479) try to resolved the `inner hits node is not an array` because when a  query( batch-size) run against new segment without this field, as-well the filter_path just only take `hits.hits.fields` 、`hits.hits._source` into account, this would appear an null inner hits node:
```
{
   "_scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAHaUWY1ExUVd0ZWlRY2",
   "hits": {
      "total": 1
   }
}
```

Unfortunately this PR introduce another serious inconsistent result with different batch_size because of misusing the `total`.

To avoid this two problem,  we just add `hits.hits._score` to filter_path when `docvalue_mode` is true,   `_score`  would always `null` ,  and populate the inner hits node:

```
{
   "_scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAHaUWY1ExUVd0ZWlRY2",
   "hits": {
      "total": 1,
      "hits": [
         {
            "_score": null
         }
      ]
   }
}
```

related issue: https://github.com/apache/incubator-doris/issues/3752
2020-06-04 16:31:18 +08:00
5a57ecca15 [Doris On ES]fix bug of query failed in doc_value_mode when fields have none value (#3513)
#3479 

Here I try to explain the cause of the problem and how to fix it.

**The Cause of The problem**
Take the case in issue(#3479 ) as an example:
The general results are as follows:
```
GET table/_doc/_search
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    ……
  },
  "hits": {
    "total": 3,
    "max_score": null,
    "hits": [
      {
        "_index": "table",
        "_score": null,
        "sort": [
          0
        ]
      },
      {
        "_index": "table",
        "_score": null,
        "fields": {
          "k1": [
            "kkk1"
          ]
        },
        "sort": [
          0
        ]
      },
      {
        "_index": "table",
        "_score": null,
        "sort": [
          0
        ]
      }
    ]
  }
}
```

But in Doris on ES,Be fetched data parallelly on all shards, and use `filter_path` to reduce the network cost. The process will be as follows:
```
GET table/_doc/_search?preference=_shards:1&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}

{
  "hits": {
    "total": 0
  }
}

GET table/_doc/_search?preference=_shards:2&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
  "hits": {
    "total": 1
  }
}

GET table/_doc/_search?preference=_shards:3&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
  "hits": {
    "total": 1,
    "hits": [
      {
        "fields": {
          "k1": [
            "kkk1"
          ]
        }
      }
    ]
  }
}
```
*Scan-Worker On BE which processed result of shard2  will failed.* 

**The reasons are as follows:**
1. "filter_path" causes the hits.hits object not exist.  
2. In the current implementation, if there are some data rows(total > 0), the hits.hits. object must be an array

**How To Fix it**

Two Method:
1. modify "filter_path" to contain the hits.  
Pros: Fixed Code is very simple
Cons: More network cost
2. Deal with the case where fields are missing in a batch. 
Pros: No loss of performance
Cons: Code is more complex 

Performance first, I use Method2.

**Design**
1. Add a variable "_doc_value_mode" into Class "EsScrollParser" to =indicate whether the data processed by this parser is doc_value_mode or not.
2. "_doc_value_mode" is passed from ESScollReader <- ESScanner <- ScrollQueryBuilder::build() that determines whether DSL is enable doc_value_mode
3. When hits.hits of response from ES is empty and total > 0. We know there are data lines, but the corresponding fields do not exist. EsScrollParser will use "_doc_value_mode"  and _total to construct _total lines which fields are assigned with 'NULL'
2020-05-11 15:34:12 +08:00
e41aef54f2 [Doirs-On-Elasticsearch] Accerlate first scroll search (#2575)
Add terminate_after for the first scroll to avoid decompress all postings list
2019-12-27 14:05:21 +08:00
5a3f71dd6b Push limit to Elasticsearch external table (#2400) 2019-12-07 21:13:44 +08:00
0f00febd21 Optimize Doris On Elasticsearch performance (#2237)
Pure DocValue optimization for doris-on-es

Future todo:
Today, for every tuple scan we check if pure_docvalue is enabled, this is not reasonable,  should check pure_docvalue enabled for one whole scan outside,  I will add this todo in future
2019-12-04 12:57:45 +08:00
a465b38874 Enhance doris on es error message (#2297)
Enhance doris on es error message and modify some field data transform error.
For varchar/char type, sometimes elasticsearch user post some not-string value to Elasticsearch Index. because of reading value from _source, we can not process all json type and then just transfer the value to original string representation this may be a tricky, but we can workaround this issue
2019-11-26 18:32:25 +08:00
1c229fbd92 Fix es_scan_reader_test in debug mode (#1905) 2019-09-28 00:02:30 +08:00
8034d83e20 Add scroll keepalive and http timeout configuration (#1731) 2019-09-02 19:04:30 +08:00
9d03ba236b Uniform Status (#1317) 2019-06-14 23:38:31 +08:00
9c82d41981 Support Doris query ES by HTTP way (#925) 2019-04-28 17:14:44 +08:00