#3479
Here I try to explain the cause of the problem and how to fix it.
**The Cause of The problem**
Take the case in issue(#3479 ) as an example:
The general results are as follows:
```
GET table/_doc/_search
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
"took": 6,
"timed_out": false,
"_shards": {
……
},
"hits": {
"total": 3,
"max_score": null,
"hits": [
{
"_index": "table",
"_score": null,
"sort": [
0
]
},
{
"_index": "table",
"_score": null,
"fields": {
"k1": [
"kkk1"
]
},
"sort": [
0
]
},
{
"_index": "table",
"_score": null,
"sort": [
0
]
}
]
}
}
```
But in Doris on ES,Be fetched data parallelly on all shards, and use `filter_path` to reduce the network cost. The process will be as follows:
```
GET table/_doc/_search?preference=_shards:1&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
"hits": {
"total": 0
}
}
GET table/_doc/_search?preference=_shards:2&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
"hits": {
"total": 1
}
}
GET table/_doc/_search?preference=_shards:3&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
"hits": {
"total": 1,
"hits": [
{
"fields": {
"k1": [
"kkk1"
]
}
}
]
}
}
```
*Scan-Worker On BE which processed result of shard2 will failed.*
**The reasons are as follows:**
1. "filter_path" causes the hits.hits object not exist.
2. In the current implementation, if there are some data rows(total > 0), the hits.hits. object must be an array
**How To Fix it**
Two Method:
1. modify "filter_path" to contain the hits.
Pros: Fixed Code is very simple
Cons: More network cost
2. Deal with the case where fields are missing in a batch.
Pros: No loss of performance
Cons: Code is more complex
Performance first, I use Method2.
**Design**
1. Add a variable "_doc_value_mode" into Class "EsScrollParser" to =indicate whether the data processed by this parser is doc_value_mode or not.
2. "_doc_value_mode" is passed from ESScollReader <- ESScanner <- ScrollQueryBuilder::build() that determines whether DSL is enable doc_value_mode
3. When hits.hits of response from ES is empty and total > 0. We know there are data lines, but the corresponding fields do not exist. EsScrollParser will use "_doc_value_mode" and _total to construct _total lines which fields are assigned with 'NULL'
Pure DocValue optimization for doris-on-es
Future todo:
Today, for every tuple scan we check if pure_docvalue is enabled, this is not reasonable, should check pure_docvalue enabled for one whole scan outside, I will add this todo in future