1. Previously, we used a BE table named `analysis_jobs` to persist the status of analyze jobs/tasks. This had several flaws; for example, if a BE crashed, the analyze job/task would fail, but its status could not be updated.
2. Support `DROP ANALYZE JOB [job_id]` to delete an analyze job.
3. Support `SHOW ANALYZE TASK STATUS [job_id]` to get the task status of a specific job.
4. Restrict when auto analyze can run: an auto analyze job can be executed again only after its last execution finished a while ago.
5. Support analyzing a whole DB.
Parallel scanning can cause read amplification. For example, `select * from xx limit 1` needs only one row of data, but because multiple tablets are scanned in parallel, more data is read than needed, which becomes a performance bottleneck in high-concurrency scenarios. This PR adds a SessionVariable to enforce serial scanning, which helps mitigate the issue.
The Nereids planner includes all column indexes in TFileScanRangeParams, which may make the column projection incorrect for
text format tables, because the csv reader uses the column index position to split a line and extra column indexes produce a
wrong split result. This PR resets the column indexes after projection and removes the useless ones.
Fix 3 bugs:
1. Failed to insert into a table with an mv.
```sql
create table t (
id int,
c1 int,
c2 int,
c3 int
) duplicate key(id)
distributed by hash(id) buckets 4;
create materialized view k12s3m as select id, sum(c1), max(c3) from t group by id;
insert into t select -4, -4, -4, 'd';
```
The insert raises an exception because the mv column is not handled. Now we add a target column for it, with its defineExpr as the value.
2. Failed to insert into a table when not all the columns are specified.
```sql
insert into t(c1, c2) select c1, c2 from t
```
Here t is t(id ukey, c1, c2, c3). The insert writes too much data; we fix it by changing the output partitions.
3. Failed to insert into a table with a complex select.
When the select statement has a join or agg, the insert fails; the fix is similar to the one for the 2nd bug.
Currently in regression-test, when a BE crashes, the suite-thread gets stuck because curl does not set a timeout.
To solve this, encapsulate the calls to BE in a function that sets the timeout uniformly, so the thread no longer gets stuck.
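The idea of one wrapper that applies a timeout to every external call can be sketched as follows. This is a minimal illustration in Python, not the regression framework's actual code; the function name `run_with_timeout` and the default of 10 seconds are assumptions made for the example.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=10.0):
    """Run an external command and return (ok, output).

    Hypothetical helper: every call goes through here, so the timeout is
    set in one place and the caller never blocks longer than timeout_s.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run before this is raised.
        return False, "timed out after %.1fs" % timeout_s
    return proc.returncode == 0, proc.stdout


if __name__ == "__main__":
    # Demo: a child process that sleeps longer than the timeout allows.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(slow, timeout_s=0.5))
```

When the wrapped command is curl itself, its own `--max-time` option can be passed as well, so that the limit is also enforced inside the child process.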
In some cases (or bugs), Doris may return a query result to jdbc that jdbc cannot recognize,
so it hangs. To fix this, add a timeout of 30 minutes to the jdbc connection.
1. make ColumnObject exception safe
2. introduce FlushContext and construct schema at memtable flush stage to make segment independent from dynamic schema
3. add more test cases
1. Use heap sort to find duplicated keys between segments and update the delete-bitmap. The old implementation traversed all keys in all segments, used each key to search for duplicates in earlier segments, and then marked them for deletion.
2. Trick: Each time the heap top is popped as a key1, the new heap top is key2, allowing for jumping directly from key1 to key2 instead of advancing iteratively.
3. Effect: This technique works well when there are many segments within the same rowset and the imported data is relatively ordered.
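The heap-based merge and the jump from key1 to key2 can be sketched as below. This is a simplified Python model, not the BE implementation: segments are plain sorted lists, keys are assumed unique within a segment, a segment with a larger index is assumed to be newer, and the function name `find_duplicate_rows` is hypothetical.

```python
import bisect
import heapq


def find_duplicate_rows(segments):
    """Return the set of (segment_id, row_id) that are shadowed by a newer segment.

    segments: list of sorted key lists, ordered from oldest to newest.
    """
    heap = []
    for seg_id, keys in enumerate(segments):
        if keys:
            # -seg_id makes the newest segment pop first among equal keys.
            heap.append((keys[0], -seg_id, 0))
    heapq.heapify(heap)

    deleted = set()
    last_key = object()  # sentinel that equals no real key
    while heap:
        key, neg_seg, row = heapq.heappop(heap)
        seg_id = -neg_seg
        if key == last_key:
            # A newer segment already produced this key: mark the older row.
            deleted.add((seg_id, row))
        else:
            last_key = key
        if not heap:
            # No other segment left, so the remaining keys cannot collide.
            break
        keys = segments[seg_id]
        key2 = heap[0][0]
        # Jump: keys of this segment below key2 exist in no other segment,
        # so seek straight to the first key >= key2 instead of stepping.
        nxt = bisect.bisect_left(keys, key2, row + 1)
        if nxt < len(keys):
            heapq.heappush(heap, (keys[nxt], neg_seg, nxt))
    return deleted
```

For example, `find_duplicate_rows([[1, 2, 3, 4, 100], [100]])` visits key 1, seeks directly to key 100 in the first segment, and marks only row 4 of segment 0 as deleted.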
The BlockReader captures rowsets and initializes the delete_handler in different places. If a base compaction happens in between, the reader may obtain an inconsistent delete handler. Therefore, place these two operations under the same lock.
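The reason for taking both snapshots under one lock can be sketched with a toy model. The class and method names below (`Tablet`, `compact`, `capture_read_source`) are hypothetical and only illustrate the locking pattern, not the actual BE classes.

```python
import threading


class Tablet:
    """Toy model: rowsets and delete predicates must be read as one unit."""

    def __init__(self):
        self._lock = threading.Lock()
        self.rowsets = ["rs-0", "rs-1"]
        self.delete_predicates = ["k1 > 10"]

    def compact(self, merged_rowset, remaining_predicates):
        # Compaction replaces the rowsets and the predicates together.
        with self._lock:
            self.rowsets = [merged_rowset]
            self.delete_predicates = list(remaining_predicates)

    def capture_read_source(self):
        # Both snapshots are taken under the same lock, so a compaction
        # cannot slip in between and leave the reader with a mismatched pair.
        with self._lock:
            return list(self.rowsets), list(self.delete_predicates)
```

If the two copies were taken under separate lock acquisitions, a compaction running between them could hand the reader the old rowsets together with the new delete predicates.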
When using Nereids, if a comparison operator is applied to a bitmap type, an analysis exception needs to be thrown.
For example:
select id from (select BITMAP_EMPTY() as c0 from expr_test) as ref0 where c0 = 1 order by id
Here c0 in ref0 is a bitmap type; this scenario is not supported right now.
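The kind of check involved can be sketched as a small type validation step. This is an illustrative Python model only; the names `AnalysisException`, `UNCOMPARABLE_TYPES` and `check_comparison` are assumptions for the example and do not mirror the Nereids code.

```python
class AnalysisException(Exception):
    """Raised when an expression fails semantic analysis (hypothetical)."""


# Types assumed here to have no comparison semantics.
UNCOMPARABLE_TYPES = {"BITMAP"}


def check_comparison(op, left_type, right_type):
    """Reject a comparison such as `c0 = 1` when either side is a bitmap."""
    for data_type in (left_type, right_type):
        if data_type.upper() in UNCOMPARABLE_TYPES:
            raise AnalysisException(
                "operator '%s' does not support type %s" % (op, data_type.upper())
            )
```

With this check, `check_comparison("=", "BITMAP", "TINYINT")` raises at analysis time instead of letting the query reach execution.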
Update in-filter usage in pipeline mode:
1. If the target is local, we use an in-or-bloom filter and let BE choose in or bloom according to the actual number of distinct values.
2. Set the default runtime_filter_max_in_num to 1024.
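The choice between the two filter kinds can be sketched as a threshold on the number of distinct values. This is a simplified Python model; the function name `choose_runtime_filter` is hypothetical, and treating the threshold as inclusive is an assumption, not something stated above.

```python
def choose_runtime_filter(distinct_count, max_in_num=1024):
    """Pick the filter kind from the observed number of distinct values.

    max_in_num plays the role of runtime_filter_max_in_num; whether the
    real boundary is inclusive is an assumption of this sketch.
    """
    if distinct_count <= max_in_num:
        return "in"      # small value set: an exact IN list is cheap
    return "bloom"       # large value set: fall back to a bloom filter
```

The decision is made on the BE side because the actual number of distinct values is only known once the build side has been read.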
Test on SSB 100g:
select lo_suppkey, count(distinct lo_linenumber) from lineorder group by lo_suppkey;
exec time: 4.388s
Then create a materialized view and run the same query again:
create materialized view customer_uv as select lo_suppkey, bitmap_union(to_bitmap(lo_linenumber)) from lineorder group by lo_suppkey;
select lo_suppkey, count(distinct lo_linenumber) from lineorder group by lo_suppkey;
exec time: 12.908s
With the patch, exec time: 5.790s