Both update_status and open_vectorized_internal call send_report and stop the report thread. Move the update_status code into the open method and remove the unnecessary send_report and stop_report_thread calls.
---------
Co-authored-by: yiguolei <yiguolei@gmail.com>
This PR adds support for the pre-aggregation hint. Users can use /*+PREAGGOPEN*/ to enable pre-aggregation for an OLAP table.
For example:
Let's say we have an aggregate-key table t (k1 int, k2 int, v1 int sum, v2 int sum). Pre-aggregation can be enabled by adding the hint to the query: select k1, v1 from t /*+PREAGGOPEN*/.
Column names in stream load and broker load are case sensitive; make them case insensitive. This is consistent with queries, because column names in query SQL are case insensitive.
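A minimal sketch of case-insensitive column matching, written in Python purely for illustration. The function `resolve_columns` and the sample schema are hypothetical and are not taken from the Doris code base.

```python
def resolve_columns(load_columns, table_columns):
    """Map user-supplied column names to schema names, ignoring case."""
    # Index the schema by lower-cased name so lookups ignore case.
    by_lower = {name.lower(): name for name in table_columns}
    resolved = []
    for col in load_columns:
        key = col.lower()
        if key not in by_lower:
            raise KeyError(f"unknown column: {col}")
        # Return the name exactly as the schema spells it.
        resolved.append(by_lower[key])
    return resolved
```

With a schema of `["k1", "k2", "V1"]`, the load columns `["K1", "v1"]` would resolve to the schema spellings of the same columns.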
* [fuzzy](test) fuzzy some session variables stably according to pull_request_id
* fuzzy enable_fold_constant_by_be
---------
Co-authored-by: stephen <hello_stephen@qq.com>
SHOW MTMV JOB/TASK used to list all jobs and tasks across all databases, regardless of the current database.
Now the current database is used to identify the MTMV tasks and jobs. Only a user who has not selected a database can list jobs and tasks across all databases.
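The visibility rule described above can be sketched as follows. This is an illustrative Python model only; `MtmvJob` and `visible_jobs` are hypothetical names, not Doris classes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MtmvJob:
    name: str
    db: str


def visible_jobs(jobs: List[MtmvJob], current_db: Optional[str]) -> List[MtmvJob]:
    """Return the jobs a SHOW statement would list for the session."""
    # No database selected: list jobs from every database.
    if current_db is None:
        return list(jobs)
    # Otherwise restrict the listing to the current database.
    return [job for job in jobs if job.db == current_db]
```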
Sometimes the profileContent of a ProfileElement is very large (more than 30MB), and such huge string objects may cause GC performance problems. But they are used only when profile-related RESTful APIs are invoked (such as /profile/{format}/{query_id}, /api/profile and so on), so they should be loaded lazily.
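A minimal sketch of the lazy-loading idea, in Python for illustration. Only the names ProfileElement and profileContent come from the description above; the `loader` callback and the method name are assumptions, and the real change is in the Java FE code.

```python
class ProfileElement:
    """Holds profile metadata and builds the large content string on demand."""

    def __init__(self, query_id, loader):
        self.query_id = query_id
        self._loader = loader    # hypothetical callback that builds the content
        self._content = None
        self._loaded = False

    def get_profile_content(self):
        # Build the (possibly very large) string only on first access,
        # for example when a profile REST API is invoked.
        if not self._loaded:
            self._content = self._loader(self.query_id)
            self._loaded = True
        return self._content
```

Until `get_profile_content` is called, no large string is held by the element, so queries whose profiles are never requested do not pay for it.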
Support refreshing an MTMV manually with the following SQL. It generates an MTMV task immediately.
```
REFRESH MATERIALIZED VIEW test_mv_view [complete];
```
You can use `show mtmv task` to show the latest task.
In this PR, I also try to clear the MTMV tasks when the MTMV is dropped, so that the test suite behaves correctly.
Support 4-phase aggregation.
Example:
`select count(distinct k1), sum(k2) from t`
Suppose t.k0 is the distribution key.
We get the following plan:
```
Agg(DISTINCT_GLOBAL)
|
Exchange(Gather)
|
Agg(DISTINCT_LOCAL)
|
Agg(GLOBAL)
|
Exchange(hash distribute by k1)
|
Agg(LOCAL)
|
scan
```
Limitations:
1. Only SQL with a single distinct aggregate is supported.
Not supported: `select count(distinct k1), count(distinct k2) from t`
2. Only distinct on a single column is supported.
Not supported: `select count(distinct k1, k2) from t`
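To make the four phases concrete, here is a small Python simulation of the plan for `select count(distinct k1), sum(k2) from t`. It is a sketch under simplifying assumptions: `four_phase_agg` and `num_partitions` are hypothetical, NULL handling is ignored, and each inner list stands for the rows scanned on one node.

```python
from collections import defaultdict


def four_phase_agg(nodes, num_partitions=2):
    """nodes: one list of (k1, k2) rows per scan node."""
    # Phase 1, Agg(LOCAL): each node groups by k1 and keeps a partial sum(k2).
    local = []
    for rows in nodes:
        partial = defaultdict(int)
        for k1, k2 in rows:
            partial[k1] += k2
        local.append(partial)

    # Exchange(hash distribute by k1): equal k1 values meet on one partition.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for partial in local:
        for k1, partial_sum in partial.items():
            partitions[hash(k1) % num_partitions][k1].append(partial_sum)

    # Phase 2, Agg(GLOBAL): merge partial sums; each k1 now exists exactly once.
    merged = [{k1: sum(sums) for k1, sums in p.items()} for p in partitions]

    # Phase 3, Agg(DISTINCT_LOCAL): per partition, count the k1 groups
    # and add up their sums.
    distinct_local = [(len(m), sum(m.values())) for m in merged]

    # Exchange(Gather) + Phase 4, Agg(DISTINCT_GLOBAL): combine all partitions.
    count_distinct = sum(c for c, _ in distinct_local)
    total = sum(s for _, s in distinct_local)
    return count_distinct, total
```

Because the exchange hashes on k1, no k1 value is counted on two partitions, which is why the per-partition counts can simply be added in the last phase.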
For a cached OlapScanNode, each call to `addScanRangeLocations` adds TScanRangeLocations to `result`.
So `result` can grow very large and make the loop in `getReplicaNumPerHost` a CPU hot spot.
Doris delays the execution of expressions for as long as it can, and the same applies to the expansion of constant expressions. Given the SQL below:
```sql
select i from (select 'abc' as i, sum(birth) as j from subquerytest2) as tmp
```
The aggregation would be eliminated, since its output is not required by the outer block, but the expansion of the constant expression would be done in the final result expr. Since the aggregate output has been eliminated, the expansion actually does nothing and finally causes an empty result.
To fix this, we materialize the result exprs in the inner block for such SQL. It may affect performance, but that is better than letting the system produce a wrong result.
1. For compatibility with the old optimizer, the sort and limit in the subquery do not take effect, so they are simply deleted.
```
select * from sub_query_correlated_subquery1 where sub_query_correlated_subquery1.k1 > (select sum(sub_query_correlated_subquery3.k3) a from sub_query_correlated_subquery3 where sub_query_correlated_subquery3.v2 = sub_query_correlated_subquery1.k2 order by a limit 1);
```
2. Adjust the position at which the subquery is unnested, so that the conjuncts in the filter are optimized before unnesting.
Supported:
```
SELECT DISTINCT k1 FROM sub_query_correlated_subquery1 i1 WHERE ((SELECT count(*) FROM sub_query_correlated_subquery1 WHERE ((k1 = i1.k1) AND (k2 = 2)) or ((k1 = i1.k1) AND (k2 = 1)) ) > 0);
```
The above can be supported because the common conjunct `k1 = i1.k1` is extracted from the disjunction, so the query can be converted into the following:
```
SELECT DISTINCT k1 FROM sub_query_correlated_subquery1 i1 WHERE ((SELECT count(*) FROM sub_query_correlated_subquery1 WHERE ((k1 = i1.k1) AND (k2 = 2 or k2 = 1)) ) > 0);
```
Not supported:
```
SELECT DISTINCT k1 FROM sub_query_correlated_subquery1 i1 WHERE ((SELECT count(*) FROM sub_query_correlated_subquery1 WHERE ((k1 = i1.k1) AND (k2 = 2)) or ((k2 = i1.k1) AND (k2 = 1)) ) > 0);
```
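One way to read the difference between the supported and unsupported queries is through common-conjunct extraction. The Python sketch below is an assumption about the general idea, not the optimizer's actual rule; `factor_common_conjuncts` is a hypothetical helper that treats predicates as plain strings.

```python
def factor_common_conjuncts(disjuncts):
    """disjuncts: list of conjunct lists, read as (a AND b) OR (c AND d) ...

    Returns (common, rest), read as: common AND (OR over rest).
    Predicates are compared as plain strings, which is enough for a sketch.
    """
    first, others = disjuncts[0], disjuncts[1:]
    # A conjunct is common only if it appears in every branch of the OR.
    common = [p for p in first if all(p in d for d in others)]
    # What remains in each branch after the common part is pulled out.
    rest = [[p for p in d if p not in common] for d in disjuncts]
    return common, rest
```

For the supported query, `k1 = i1.k1` appears in both branches and is pulled out, leaving only `k2 = 2 or k2 = 1` under the OR. For the unsupported query the two branches share no conjunct, so the correlated predicates stay inside the disjunction.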
1. When mapping columns from an external data source, use date/datetimev2 as the default type.
2. Check `is_cancelled` when reading data, to avoid an endless loop after the query is cancelled.
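The cancellation check in item 2 can be sketched as a read loop that tests a flag before every batch. This is illustrative Python only; `read_all`, `read_batch`, and the use of `threading.Event` are assumptions, and the real check lives in the BE reader.

```python
import threading


def read_all(read_batch, is_cancelled):
    """Read batches until end of data or until the query is cancelled.

    read_batch: hypothetical callable returning a list of rows, or None at EOF.
    is_cancelled: threading.Event set by whoever cancels the query.
    """
    rows = []
    # Without this check, a source that never reports EOF would loop forever.
    while not is_cancelled.is_set():
        batch = read_batch()
        if batch is None:
            break
        rows.extend(batch)
    return rows
```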
1. Use one rule to bind slots and functions and do type coercion, to fix type and nullable errors.
a. SUM(a1 + AVG(a2)) when a1 and a2 are TINYINT: before, the return type was SMALLINT; after this PR it returns the right type, DOUBLE.
2. Fix a runtime filter generator bug: runtime filters were bound to the wrong join conjuncts.
The performance of ClickBench Q30 is affected by batch_size:
| batch_size | 1024 | 4096 | 20480 |
| -- | -- | -- | -- |
| Q30 query time | 2.27 | 1.08 | 0.62 |
This is because the aggregation operator creates a new result block for each batch block, which is time-consuming since Q30 has 90 columns. A larger batch_size decreases the number of aggregation blocks and therefore improves performance.
The Doris internal reader reads at least 4064 rows even if batch_size < 4064, so this PR makes reading external tables behave the same as reading internal tables.
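A rough Python model of why batch size matters here. The 4064-row floor comes from the description above; the helper names and the assumption that one result block is created per batch are simplifications for illustration.

```python
import math

MIN_READ_ROWS = 4064  # floor used by the internal reader, per the text


def effective_batch_size(batch_size):
    # The reader returns at least MIN_READ_ROWS rows per batch.
    return max(batch_size, MIN_READ_ROWS)


def num_blocks(total_rows, batch_size):
    # Assumes one aggregation result block per batch (a simplification).
    return math.ceil(total_rows / effective_batch_size(batch_size))
```

Under this model, raising batch_size from 4096 to 20480 cuts the number of blocks by roughly a factor of five, while any value below 4064 behaves like 4064.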
Support setting the number of lines to skip when loading a CSV file with stream load.
Usage: `-H skip_lines:number`
```
curl --location-trusted -u root: -T test.csv -H skip_lines:5 -XPUT http://127.0.0.1:8030/api/testDb/testTbl/_stream_load
```
Skipping lines can also be used in MySQL load, as below:
```sql
LOAD DATA
LOCAL
INFILE '${mysql_load_skip_lines}'
INTO TABLE ${tableName}
COLUMNS TERMINATED BY ','
IGNORE 2 LINES
PROPERTIES ("auth" = "root:");
```
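The effect of skipping lines can be sketched with a small CSV reader in Python. `load_csv` is a hypothetical helper for illustration; it only mimics the behaviour of `skip_lines` / `IGNORE n LINES` and is not how Doris parses files.

```python
import csv
import io
from itertools import islice


def load_csv(text, skip_lines=0):
    """Parse CSV text, dropping the first skip_lines records."""
    reader = csv.reader(io.StringIO(text))
    # islice discards the leading records (for example header lines)
    # and yields the rest unchanged.
    return list(islice(reader, skip_lines, None))
```

With `skip_lines=2`, a file with two header lines followed by data rows yields only the data rows.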
Support master and follower changes across multiple FEs for MTMV.
This PR fixes the following issues:
1. Start the MTMV scheduler only on the master node; if the master changes to a follower, it stops the scheduler.
2. Fix a double meta write.
3. Rename some edit log functions and variables.
4. If an MV has both a PeriodicalJob and an immediate job, and the PeriodicalJob is about to be triggered right now, the scheduler ignores the immediate job.
5. Fix expired-time bugs, and make sure the cleanup happens on all FEs.
6. Change the cleanerScheduler interval from 1 day to 1 minute.