BE cannot exit gracefully because some threads run in endless loops. This patch makes the following optimizations:
- Use the well-encapsulated Thread and ThreadPool instead of std::thread and std::vector<std::thread>
- Use CountDownLatch in each thread's loop condition to avoid endless loops
- Introduce a new Daemon class for daemon work, such as tcmalloc_gc, memory_maintenance and calculate_metrics
- Decouple the statistics-type TaskWorkerPool from StorageEngine notification by submitting tasks to TaskWorkerPool's queue
- Reorder the stop and destruction of objects in main(), i.e. stop network services first, then internal services
- Use libevent in pthreads mode by calling evthread_use_pthreads(), so that EvHttpServer can exit gracefully with multiple threads
- Call brpc::Server's Stop() and ClearServices() explicitly
Sometimes we want to detect hotspots in a cluster, for example heavily scanned or heavily written tablets, but we have no insight into the tablets in the cluster.
This patch introduces tablet-level metrics to help achieve this. It currently supports 4 metrics on tablets: `query_scan_bytes`, `query_scan_rows`, `flush_bytes`, `flush_count`.
However, one BE may hold hundreds of thousands of tablets, so I added a parameter to the metrics HTTP request, and tablet-level metrics are not returned by default.
1. Support the convert(expr, target_type) function, which is the same as CastExpr.
2. Support cast(expr as signed/unsigned int).
This is just for compatibility; the signed/unsigned specification is meaningless.
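For illustration, a minimal hedged sketch of both forms, using arbitrary literals and the target types named above:
```
SELECT convert('1', signed int);   -- equivalent to a CAST, via the new convert() function
SELECT cast('1' AS signed int);    -- accepted for compatibility; signed/unsigned does not change the result
```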
1. Fix a core-dump bug caused by a wild pointer in PlanFragmentExecutor, fixes issue #4447.
2. Fix a core-dump bug caused by a wild pointer in JSON load, fixes issue #4452.
3. Change the declaration order of the ODBC type in Thrift for compatibility.
1. The base column of bitmap_union must be an integer type; largeint is not supported either.
2. The base column of hll_union cannot be decimal.
Check the error message of a const expr in Union Node.
If a user tries to insert a negative number into a bitmap materialized view, Doris throws an 'invalid input' exception.
The const value in Union Node is checked in this commit.
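A hedged sketch of the rejected pattern, with hypothetical table names; the target table is assumed to have a bitmap_union materialized view on its value column:
```
-- the constant -1 in the second UNION branch is a const expr in the Union Node
-- and is now rejected with 'invalid input' instead of being fed into the bitmap column
INSERT INTO tbl_with_bitmap_mv
SELECT k1, 1 FROM src
UNION ALL
SELECT k1, -1 FROM src;
```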
* Implement the grammar of batch delete #4051
* Process create table and alter table when the table has a delete sign column
* Support the syntax for enabling the delete column (see the sketch after this list)
* Automatically filter deleted data in select statements
* Automatically add the delete sign when creating a rollup table
TODO:
* Optimize the reading and compaction logic on the BE side, so that data marked as deleted is completely removed during base compaction
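A hedged sketch of enabling the delete column; the exact feature name and table name are illustrative assumptions, not taken from this entry:
```
-- assumption: the delete sign column is enabled per table via a feature switch
ALTER TABLE example_tbl ENABLE FEATURE "BATCH_DELETE";
-- after enabling, SELECT automatically filters rows whose delete sign is set
SELECT * FROM example_tbl;
```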
```
SELECT *
FROM
    (SELECT cs_order_number,
            cs_warehouse_sk
     FROM catalog_sales
     WHERE cs_order_number = 125005
       AND cs_warehouse_sk = 4) cs1
LEFT SEMI JOIN
    (SELECT cs_order_number,
            cs_warehouse_sk
     FROM catalog_sales
     WHERE cs_order_number = 125005) cs2
ON cs1.cs_order_number = cs2.cs_order_number
    AND cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk;
```
The above query has an equality predicate and a non-equality predicate.
If a non-equality predicate exists, the build table should be kept as it is, so the deduplication should be removed.
After PR #4135, if a mem tracker has a parent, it should be created by 'CreateTracker'.
So I removed the other unused constructors.
This also fixes the bug described in #4344.
Doris uses a HashTable to implement EXCEPT.
If a user sends A except B except C, Doris first evaluates A except B and then applies except C.
After A except B, the HashTable is rebuilt.
There was a bug here that incorrectly discarded some rows.
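A minimal example of the affected query shape, with hypothetical table names:
```
-- A except B is evaluated first and its HashTable is rebuilt
-- before applying "except C"; the rebuild previously dropped rows
SELECT k1 FROM a
EXCEPT
SELECT k1 FROM b
EXCEPT
SELECT k1 FROM c;
```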
Redesign metrics into 3 layers:
MetricRegistry - MetricEntity - Metric
MetricRegistry: the registration center.
MetricEntity: an entity registered on the MetricRegistry. Generally, a MetricRegistry can have several MetricEntities registered on it; each MetricEntity is an independent entity, such as the server, disk_devices, data_directories, thrift clients and servers, and so on.
Metric: a metric of an entity, such as fragment_requests_total on the server entity, disk_bytes_read on a disk_device entity, or thrift_opened_clients on a thrift_client entity.
MetricPrototype: the type of a metric. A MetricPrototype is a global variable and can be shared by the same metric across different MetricEntities.
Stream load should read all the data completely before parsing the JSON.
Also add a new BE config streaming_load_max_batch_read_mb
to limit the data size when loading JSON data.
Fix the bug of loading an empty JSON array [].
Add docs to explain certain cases of loading data in JSON format.
Fix: #4124
We make all MemTrackers shared, in order to show MemTracker real-time consumption on the web.
As follows:
1. Nearly all MemTracker raw pointers -> shared_ptr
2. Use CreateTracker() to create a new MemTracker (so that it adds itself to its parent)
3. RowBatch & MemPool still use raw MemTracker pointers; it is easy to ensure that the RowBatch & MemPool destructors run before the MemTracker's destructor, so we don't change that code.
4. MemTracker can use RuntimeProfile's counters to calculate consumption, so RuntimeProfile's counters need to be shared too. We add a shared counter pool to store the shared counters and don't change the other counters of RuntimeProfile.
Note that this PR doesn't change the MemTracker tree structure, so there are still some orphan trackers, e.g. RowBlockV2's MemTracker. If you find that some shared MemTrackers consume little memory but are too time-consuming, you can make them orphans; then it's fine to use the raw pointer.
The result may be wrong when loading a negative decimal value from ORC.
When loading a negative decimal that has leading zeros after the decimal point, the result is wrong.
e.g. for -0.0014, the ORC result is -14 (precision ... 0).
[Bug] Fix the memory-limit-exceeded problem when the analytic eval node needs to spill to disk with a low mem limit.
Also call clear_reservations to release the Analytic node's reservation in the block manager.
[Running Profile] Add a Spilled flag to the Running Profile when the analytic eval node and sort node spill to disk.
This CL mainly changes:
1. Reorganized the code logic to limit the supported JSON formats to two, making the import behavior more consistent.
2. Modified how the number of error rows is counted when loading in JSON format, so that error rows can be counted correctly.
3. See `load-json-format.md` for details on loading data in JSON format.
TPlanExecParams::volume_id is never used, so delete the print_volume_ids() function.
Fix logging, and log when PlanFragmentExecutor::open() returns an error.
Fix some comments.
fix: https://github.com/apache/incubator-doris/issues/3984
1. Add `conjunct.size` checking and `slot_desc nullptr` checking logic.
2. For historical reasons, the function predicates were added one by one; I just refactored the processing to make the logic for function predicate handling clearer.
Fix: #3946
CL:
1. Add a prepare phase for the `from_unixtime()`, `date_format()` and `convert_tz()` functions, to handle the format string once for all rows.
2. Find the cctz timezone when initializing the `runtime state`, so that we don't need to look up the timezone for each row.
3. Add a constant rewrite rule for `utc_timestamp()`.
4. Add a doc for `to_date()`.
5. Comment out `push_handler_test`; it cannot run in DEBUG mode and will be fixed later.
6. Remove `timezone_db.h/cpp` and add `timezone_utils.h/cpp`.
The performance is shown below:
11,000,000 rows
SQL1: `select count(from_unixtime(k1)) from tbl1;`
Before: 8.85s
After: 2.85s
SQL2: `select count(from_unixtime(k1, '%Y-%m-%d %H:%i:%s')) from tbl1 limit 1;`
Before: 10.73s
After: 4.85s
Formatting the date string still seems slow; we may need a further enhancement for it.
Replace some boost usages with std in OlapScanNode.
This refactor seems to solve the problem described in #3929,
because I found that BE crashed when calling `boost::condition_variable.notify_all()`.
After this upgrade, BE no longer crashes.
https://github.com/apache/incubator-doris/issues/3936
Doris On ES can obtain field values from `_source` or `docvalues`:
1. From `_source`, you get the original value as you ingested it; when ES processes indexing, the docvalues for a `date` field are converted to milliseconds.
2. From `docvalues`, before 6.4 you get a `millisecond timestamp` value; from 6.4 on (inclusive) you get the formatted `date` value, e.g. 2020-06-18T12:10:30.000Z. ES (>=6.4) provides a `format` parameter for the `docvalue` field request; support for this is coming soon in Doris On ES.
After this PR is merged into Doris, Doris On ES will only correctly support processing `millisecond` timestamps and string-format dates. If you provide a `seconds` timestamp, Doris On ES will process it incorrectly (it is divided by 1000 internally).
ES mapping:
```
{
  "timestamp_test": {
    "mappings": {
      "doc": {
        "properties": {
          "k1": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
          }
        }
      }
    }
  }
}
```
ES documents:
```
{
  "_index": "timestamp_test",
  "_type": "doc",
  "_id": "AXLbzdJY516Vuc7SL51m",
  "_score": 1,
  "_source": {
    "k1": "2020-6-25"
  }
},
{
  "_index": "timestamp_test",
  "_type": "doc",
  "_id": "AXLbzddn516Vuc7SL51n",
  "_score": 1,
  "_source": {
    "k1": 1592816393000 -> 2020/6/22 16:59:53
  }
}
```
Doris Table:
```
CREATE EXTERNAL TABLE `timestamp_source` (
    `k1` date NULL COMMENT ""
) ENGINE=ELASTICSEARCH
```
### enable_docvalue_scan = false
**For ES 5.5**:
```
mysql> select k1 from timestamp_source;
+------------+
| k1 |
+------------+
| 2020-06-25 |
| 2020-06-22 |
+------------+
```
**For ES 6.5 or above**:
```
mysql> select * from timestamp_source;
+------------+
| k1 |
+------------+
| 2020-06-25 |
| 2020-06-22 |
+------------+
```
### enable_docvalue_scan = true
**For ES 5.5**:
```
mysql> select k1 from timestamp_dv;
+------------+
| k1 |
+------------+
| 2020-06-25 |
| 2020-06-22 |
+------------+
```
**For ES 6.5 or above**:
```
mysql> select * from timestamp_dv;
+------------+
| k1 |
+------------+
| 2020-06-25 |
| 2020-06-22 |
+------------+
```
Prior to this PR, Doris On ES merged another PR, https://github.com/apache/incubator-doris/pull/3513, which misused the `total` node. After Doris On ES introduced `terminate_after` (https://github.com/apache/incubator-doris/issues/2576), the `total` document count is no longer computed, so relying on the `total` field is dangerous. We should rely on the actual document count by counting the `inner hits` node, which is what it is meant for. So we simply remove all `total` parsing and related logic from Doris On ES; this may also improve performance slightly because the `total` JSON node is ignored and skipped.
Main changes:
1. Fix the bug in `update_status(status)` of `PlanFragmentExecutor`.
2. When the FE Coordinator executes `execRemoteFragmentAsync()` and finds an RPC error, return a Future with an error code instead of throwing an exception.
3. Protect the `_status` in RuntimeState with a lock.
4. Move the `_runtime_profile` of RuntimeState before the `_obj_pool`, so that the profile is destructed after the object pool.
5. Remove the unused `ObjectPool` param from the RuntimeProfile constructor. If it were not removed, RuntimeProfile would depend on the `_obj_pool` in RuntimeState.
* 1. Add enable_spilling to the query options and support spilling to disk in Analytic_Eval_Node; FE can enable spilling via
set enable_spilling = true; (a usage sketch follows this list)
Now both the Sort Node and Analytic_Eval_Node can spill to disk.
2. Delete the merge_sorter code that we no longer use.
3. Replace buffered_tuple_stream with buffered_tuple_stream2 in Analytic_Eval_Node and support spilling to disk. Delete the useless buffered_block_mgr and buffered_tuple_stream code.
4. Add a DataStreamRecvr Profile. Move the counters belonging to DataStreamRecvr from the fragment to the DataStreamRecvr Profile to make the Running Profile clearer.
* Change some hints in the code
* Replace disable_spill with enable_spill, which is more compatible with FE
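A minimal usage sketch; the window query and table below are only illustrative:
```
-- enable spilling for the current session, as described in item 1
SET enable_spilling = true;
-- an analytic (window) query that can now spill to disk under a low memory limit
SELECT k1, sum(v1) OVER (PARTITION BY k2 ORDER BY k1) FROM tbl1;
```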
[Doris On ES] Skip function_call exprs when processing predicates
Fixes #3801
Do not push down function_call predicates such as split_xxx when processing predicates; Doris BE is responsible for processing these predicates.
All rows in table:
```
+------+------+------+------------+------------+
| k1 | k2 | k3 | UpdateTime | ArriveTime |
+------+------+------+------------+------------+
| NULL | NULL | kkk1 | 123456789 | NULL |
| kkk1 | NULL | NULL | 123456789 | NULL |
| NULL | kkk2 | NULL | 123456789 | NULL |
+------+------+------+------------+------------+
```
The following predicates cannot be pushed down to ES.
```
SQL 1:
mysql> select * from (select split_part(k1, "1", 1) as kk from case_replay_for_milimin) t where t.kk is not null;
+------+
| kk |
+------+
| kkk |
+------+
1 row in set (0.02 sec)
SQL 2:
mysql> select * from (select split_part(k1, "1", 1) as kk from case_replay_for_milimin) t where t.kk > 'a';
+------+
| kk |
+------+
| kkk |
+------+
SQL 3:
mysql> select * from (select split_part(k1, "1", 1) as kk from case_replay_for_milimin) t where t.kk > '2';
+------+
| kk |
+------+
| kkk |
+------+
1 row in set (0.03 sec)
```
Another PR, https://github.com/apache/incubator-doris/pull/3513 (https://github.com/apache/incubator-doris/issues/3479), tried to resolve the `inner hits node is not an array` error that occurs when a query (batch-size) runs against a new segment without this field. In addition, filter_path only took `hits.hits.fields` and `hits.hits._source` into account, which could produce a null inner hits node:
```
{
"_scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAHaUWY1ExUVd0ZWlRY2",
"hits": {
"total": 1
}
}
```
Unfortunately, that PR introduced another serious problem: inconsistent results with different batch_size values, because of the misuse of `total`.
To avoid both problems, we simply add `hits.hits._score` to filter_path when `docvalue_mode` is true; `_score` is always `null`, and this populates the inner hits node:
```
{
"_scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAHaUWY1ExUVd0ZWlRY2",
"hits": {
"total": 1,
"hits": [
{
"_score": null
}
]
}
}
```
Related issue: https://github.com/apache/incubator-doris/issues/3752
This CL mainly changes:
1. Add a new BE config `max_pushdown_conditions_per_column` to limit the number of conditions on a single column that can be pushed down to the storage engine.
2. Add 2 new session variables `max_scan_key_num` and `doris_max_scan_key_num`, which can be set at the session level and override the config value in BE (see the example below).
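For example, a hedged sketch of overriding the BE value at the session level; the value 48 is arbitrary:
```
-- session-level override of the scan key limit described above
SET doris_max_scan_key_num = 48;
```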
Add a JSON format for the existing metrics, like this:
```
{
  "tags": {
    "metric": "thread_pool",
    "name": "thrift-server-pool",
    "type": "active_thread_num"
  },
  "unit": "number",
  "value": 3
}
```
I added a new JsonMetricVisitor to handle the transformation.
It does not modify the existing PrometheusMetricVisitor and SimpleCoreMetricVisitor.
I also added:
1. A unit item to describe the metric better
2. Cloning tablet statistics, broken down by database
3. Use of white space instead of newlines in audit.log
This CL mainly changes:
1. Support the `SELECT INTO OUTFILE` command (see the sketch after this list).
2. Support exporting query results to a file via Broker.
3. Support the CSV export format with a specified column separator and line delimiter.
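A hedged sketch of the command shape; the table, broker name, path, and property keys below are illustrative assumptions:
```
-- export the query result as CSV files via a broker
SELECT k1, v1 FROM tbl1
INTO OUTFILE "hdfs://host:port/user/result_"
FORMAT AS CSV
PROPERTIES
(
    "broker.name" = "my_broker",
    "column_separator" = ",",
    "line_delimiter" = "\n"
);
```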