Commit Graph

17 Commits

Author SHA1 Message Date
7eae3e280a [optimization] use inline to optimize ExprContext::get_value (#5385) 2021-02-16 22:35:14 +08:00
93a4c7efc1 [LOG] Standardize the use of VLOG in code (#5264)
At present, the use of VLOG in the code is quite inconsistent.
Some call sites inherit the VLOG_XX macro format from Impala, while others use the raw VLOG(number) format.
The VLOG(number) format has no unified specification, so this PR standardizes the use of VLOG.
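A minimal sketch of the two styles being unified, assuming glog-style VLOG; the macro names and level numbers below are illustrative, not the actual mappings in the Doris codebase:

```cpp
#include <glog/logging.h>

// Illustrative named-verbosity macros in the inherited Impala style; the real
// names and level numbers in the codebase may differ.
#define VLOG_CRITICAL VLOG(1)
#define VLOG_DEBUG VLOG(7)

void report_scan(int rows) {
    VLOG_DEBUG << "scanned " << rows << " rows";  // standardized: named macro
    VLOG(7) << "scanned " << rows << " rows";     // discouraged: bare numeric level
}
```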
2021-01-21 12:09:09 +08:00
6fedf5881b [CodeFormat] Clang-format cpp sources (#4965)
Clang-format all C++ source files.
2020-11-28 18:36:49 +08:00
7b2762b1b1 [Doris On ES][Bug-Fix] Cannot push down limit when some predicates cannot be processed by ES (#4768)
Cannot push down the limit when some predicates are not processed by ES. Fixed: #4761
2020-10-21 12:10:55 +08:00
09f97f8a05 [Refactor] Fix some BE typos, part 2 (#4747) 2020-10-20 09:28:57 +08:00
10f822eb43 [MemTracker] make all MemTrackers shared (#4135)
We make all MemTrackers shared in order to show real-time MemTracker consumption on the web.
As follows:
1. Nearly all raw MemTracker pointers become shared_ptr.
2. Use CreateTracker() to create a new MemTracker (so that it adds itself to its parent; see the sketch below).
3. RowBatch & MemPool still use raw MemTracker pointers; it is easy to ensure that the RowBatch & MemPool destructors execute before the MemTracker's destructor, so we leave that code unchanged.
4. A MemTracker can use a RuntimeProfile counter to calculate consumption, so that counter needs to be shared too. We add a shared counter pool to store the shared counters and leave the other RuntimeProfile counters unchanged.
Note that this PR does not change the MemTracker tree structure, so there are still some orphan trackers, e.g. RowBlockV2's MemTracker. If you find that a shared MemTracker consumes little memory but is too time-consuming, you can make it an orphan; then it is fine to use a raw pointer.
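A minimal sketch of the shared-tracker pattern from step 2, with hypothetical names; CreateTracker here is a free-standing illustration, not the actual Doris signature:

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Sketch only: a shared tracker registers itself with its parent on creation,
// so a web UI walking the tree can read live consumption at any time.
class MemTracker {
public:
    static std::shared_ptr<MemTracker> CreateTracker(
            std::string label, const std::shared_ptr<MemTracker>& parent) {
        auto t = std::shared_ptr<MemTracker>(new MemTracker(std::move(label), parent));
        if (parent != nullptr) parent->_children.push_back(t);  // weak: no ownership cycle
        return t;
    }
    void consume(int64_t bytes) {
        _consumption += bytes;
        if (_parent) _parent->consume(bytes);  // roll consumption up to the root
    }
    int64_t consumption() const { return _consumption; }

private:
    MemTracker(std::string label, std::shared_ptr<MemTracker> parent)
            : _label(std::move(label)), _parent(std::move(parent)) {}
    std::string _label;
    std::shared_ptr<MemTracker> _parent;
    std::vector<std::weak_ptr<MemTracker>> _children;
    int64_t _consumption = 0;
};
```

A child created via CreateTracker("scan", root) then reports consume() up to root, which is what lets the web page show per-node totals in real time.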
2020-07-31 21:57:21 +08:00
5a57ecca15 [Doris On ES] Fix bug of query failure in doc_value_mode when fields have no value (#3513)
#3479 

Here I try to explain the cause of the problem and how to fix it.

**The Cause of the Problem**
Take the case in issue #3479 as an example.
The general results are as follows:
```
GET table/_doc/_search
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    ……
  },
  "hits": {
    "total": 3,
    "max_score": null,
    "hits": [
      {
        "_index": "table",
        "_score": null,
        "sort": [
          0
        ]
      },
      {
        "_index": "table",
        "_score": null,
        "fields": {
          "k1": [
            "kkk1"
          ]
        },
        "sort": [
          0
        ]
      },
      {
        "_index": "table",
        "_score": null,
        "sort": [
          0
        ]
      }
    ]
  }
}
```

But in Doris on ES, the BE fetches data in parallel from all shards and uses `filter_path` to reduce the network cost. The process is as follows:
```
GET table/_doc/_search?preference=_shards:1&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}

{
  "hits": {
    "total": 0
  }
}

GET table/_doc/_search?preference=_shards:2&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
  "hits": {
    "total": 1
  }
}

GET table/_doc/_search?preference=_shards:3&filter_path=_scroll_id,hits.hits._source,hits.total,_id,hits.hits._source.fields,hits.hits.fields
{"query":{"match_all":{}},"stored_fields":"_none_","docvalue_fields":["k1"],"sort":["_doc"],"size":100}
{
  "hits": {
    "total": 1,
    "hits": [
      {
        "fields": {
          "k1": [
            "kkk1"
          ]
        }
      }
    ]
  }
}
```
*The scan worker on the BE that processes the result of shard2 will fail.*

**The reasons are as follows:**
1. `filter_path` causes the hits.hits object to be absent.
2. In the current implementation, if there are data rows (total > 0), the hits.hits object must be an array.

**How To Fix It**

Two methods:
1. Modify `filter_path` to include the hits.
Pros: the fix is very simple.
Cons: more network cost.
2. Handle the case where fields are missing in a batch.
Pros: no loss of performance.
Cons: the code is more complex.

Since performance comes first, I use method 2.

**Design**
1. Add a variable `_doc_value_mode` to class `EsScrollParser` to indicate whether the data processed by this parser is in doc_value_mode.
2. `_doc_value_mode` is passed down from ScrollQueryBuilder::build(), which determines whether the DSL enables doc_value_mode, through ESScanner to ESScrollReader.
3. When hits.hits of the response from ES is empty and total > 0, we know there are data rows but the corresponding fields do not exist. EsScrollParser uses `_doc_value_mode` and _total to construct _total rows whose fields are assigned NULL, as sketched below.
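A hedged sketch of step 3 over simplified types; the real parser works on rapidjson documents and Doris tuples, and `ScrollResponse`/`parse_batch` are hypothetical names:

```cpp
#include <string>
#include <vector>

// Hypothetical flattened view of one scroll response after filter_path.
struct ScrollResponse {
    int total = 0;                  // hits.total reported by ES
    std::vector<std::string> hits;  // one serialized row per entry; may be absent
};

// When doc_value_mode is on and filter_path stripped the field-less hits,
// emit `total` all-NULL rows so the batch row count matches what ES reported.
std::vector<std::string> parse_batch(const ScrollResponse& resp, bool doc_value_mode) {
    if (doc_value_mode && resp.hits.empty() && resp.total > 0) {
        return std::vector<std::string>(resp.total, "NULL");
    }
    return resp.hits;
}
```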
2020-05-11 15:34:12 +08:00
a467c6f81f [ES Connector] Add field context for string field keyword type (#3305)
This PR is just a transitional approach; it would be better to move the predicate transformation from the Doris BE to the Doris FE, so that the Doris BE is only responsible for fetching data from ES.

Add an `enable_keyword_sniff` configuration item for creating an external Elasticsearch table. It defaults to true, and will sniff for the `keyword` sub-field on an analyzed `text` field and return a `json_path` that substitutes for the original column name.

```
CREATE EXTERNAL TABLE `test` (
  `k1` varchar(20) COMMENT "",
  `create_time` datetime COMMENT ""
) ENGINE=ELASTICSEARCH
PROPERTIES (
"hosts" = "http://10.74.167.16:8200",
"user" = "root",
"password" = "root",
"index" = "test",
"type" = "doc",
"enable_keyword_sniff" = "true"
);
```
Note: `enable_keyword_sniff` defaults to "true".

Run this SQL:

```
select * from test where k1 = "wu yun feng"
```
 Output predicate DSL:

```
{"term":{"k1.keyword":"wu yun feng"}}
```
Also in this PR, I remove the Elasticsearch version detection logic, since for now it is useless; it may be needed again in the future.
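A hedged sketch of the column rewrite this sniffing enables, assuming a pre-fetched map of field name to type; `select_term_field` and the map lookup are illustrative, not the actual implementation (which inspects the `_mapping` response from ES):

```cpp
#include <map>
#include <string>

// An analyzed text field tokenizes "wu yun feng", so an exact term query
// against it would miss; rewriting to the un-analyzed keyword sub-field
// (e.g. {"term":{"k1.keyword":"wu yun feng"}}) restores exact matching.
std::string select_term_field(const std::string& col,
                              const std::map<std::string, std::string>& field_type,
                              bool enable_keyword_sniff) {
    auto it = field_type.find(col);
    bool analyzed_text = (it != field_type.end() && it->second == "text");
    if (enable_keyword_sniff && analyzed_text) {
        return col + ".keyword";  // assumes the mapping defines a keyword sub-field
    }
    return col;
}
```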
2020-04-13 23:07:33 +08:00
614a76beea [Doris on ES] Support compound_and predicate push down to Elasticsearch (#3277)
Related issue: https://github.com/apache/incubator-doris/issues/3248


SQL:

```
select * from test where (k2 = 6 and k3 = 1) or (k2 = 2 and k3 = 3 and k4 = 'beijing');
```

Output filter:

```
((#k2:[6 TO 6] #k3:[1 TO 1]) (#(#k2:[2 TO 2] #k3:[3 TO 3]) #k4:beijing))~1
```

SQL:

```
select * from test where (k2 = 6 or k3 = 7) or (k2 = 2 and k3 = 3 and (k4 = 'beijing' or k4 = 'zhaochun'));
```
Output filter:

```
(k2:[6 TO 6] k3:[7 TO 7] (#(#k2:[2 TO 2] #k3:[3 TO 3]) #((k4:beijing k4:zhaochun)~1)))~1
```

SQL:

```
select * from test where (k2 = 6 or k3 = 7) or (k2 = 2 and abs(k3) = 3 and (k4 = 'beijing' or k4 = 'zhaochun'));
```

Output filter (`abs` cannot be pushed down to ES, so Doris on ES does not push down this compound predicate):

```
match_all
```
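A hedged sketch of the translation rule these three examples imply: a compound predicate converts child by child into a bool query, and any non-pushable leaf (such as `abs(k3) = 3`) poisons the whole tree, degrading it to match_all. `Node` and `to_dsl` are illustrative names, not the Doris BE API:

```cpp
#include <string>
#include <vector>

// Illustrative predicate tree: AND/OR nodes over leaves that either carry a
// ready ES clause or are marked non-pushable (e.g. a function call like abs()).
struct Node {
    enum Kind { AND, OR, LEAF } kind;
    std::string leaf_dsl;  // e.g. R"({"term":{"k4":"beijing"}})"
    bool pushable = true;
    std::vector<Node> children;
};

// Returns false if any descendant cannot be pushed down; the caller then
// sends {"match_all":{}} and re-filters the rows on the Doris side.
bool to_dsl(const Node& n, std::string* out) {
    if (n.kind == Node::LEAF) {
        *out = n.leaf_dsl;
        return n.pushable;
    }
    std::string joined;
    for (const Node& c : n.children) {
        std::string part;
        if (!to_dsl(c, &part)) return false;  // one bad child poisons the branch
        if (!joined.empty()) joined += ",";
        joined += part;
    }
    const char* clause = (n.kind == Node::AND) ? "must" : "should";
    *out = std::string(R"({"bool":{")") + clause + R"(":[)" + joined + "]}}";
    return true;
}
```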
2020-04-08 21:09:39 +08:00
fd492e3b6f [Doris on ES] Support escape character (#2865) 2020-02-13 11:32:48 +08:00
d05768ffd4 Fix core dump when es_scanner_node exits (#2634) 2020-01-02 16:30:11 +08:00
5a3f71dd6b Push limit to Elasticsearch external table (#2400) 2019-12-07 21:13:44 +08:00
0f00febd21 Optimize Doris On Elasticsearch performance (#2237)
Pure DocValue optimization for Doris-on-ES.

Future todo:
Today we check whether pure_docvalue is enabled for every tuple scanned, which is not reasonable; the check should be made once for the whole scan, outside the per-tuple path, as sketched below. I will add this todo in the future.
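A minimal sketch of that todo, hoisting the flag check out of the per-tuple loop; `Tuple` and the materialize_* helpers are hypothetical stubs:

```cpp
#include <vector>

struct Tuple {};
// Hypothetical stubs standing in for the two materialization paths.
void materialize_from_docvalue(Tuple&) {}
void materialize_from_source(Tuple&) {}

// Branch once per scan batch instead of re-testing pure_docvalue per tuple.
void scan_batch(std::vector<Tuple>& batch, bool pure_docvalue) {
    if (pure_docvalue) {
        for (Tuple& t : batch) materialize_from_docvalue(t);
    } else {
        for (Tuple& t : batch) materialize_from_source(t);
    }
}
```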
2019-12-04 12:57:45 +08:00
c2de62d6a1 Collect scanner's status when es_http_scan_node close (#1861) 2019-09-25 12:20:13 +08:00
9d03ba236b Uniform Status (#1317) 2019-06-14 23:38:31 +08:00
79ab7f4413 Change label of broker load txn (#1134)
* Change label of broker load txn

1. Put the broker load label into the txn label.
2. Fix the bug of `label is already used`.
3. Fix the partition error of the new broker load.

* Fix count error in mini load and broker load

There are three params (num_rows_load_total, num_rows_load_filtered, num_rows_load_unselected) used to compute dpp.norm.ALL and dpp.abnorm.ALL.
num_rows_load_total is the number of rows in the source file.
num_rows_load_unselected is the number of rows in num_rows_load_total that do not satisfy the WHERE conjuncts.
num_rows_load_filtered is the number of rows (of insufficient quality) among (num_rows_load_total - num_rows_load_unselected), as sketched below.
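A hedged arithmetic sketch of how the three counters could combine, assuming dpp.norm.ALL counts rows that survive both the WHERE filter and the quality filter (an inference from the definitions above, not a confirmed Doris formula):

```cpp
#include <cstdint>
#include <iostream>

int main() {
    int64_t num_rows_load_total = 1000;      // rows in the source file
    int64_t num_rows_load_unselected = 100;  // rows failing the WHERE conjuncts
    int64_t num_rows_load_filtered = 50;     // low-quality rows among the remainder
    // Assumed relationships, derived from the definitions in the commit message:
    int64_t dpp_abnorm_all = num_rows_load_filtered;
    int64_t dpp_norm_all =
            num_rows_load_total - num_rows_load_unselected - num_rows_load_filtered;
    std::cout << "dpp.norm.ALL=" << dpp_norm_all        // 850
              << " dpp.abnorm.ALL=" << dpp_abnorm_all   // 50
              << std::endl;
    return 0;
}
```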
2019-05-10 16:53:46 +08:00
9c82d41981 Support Doris query ES by HTTP way (#925) 2019-04-28 17:14:44 +08:00