Optimization "select count(*) from table" stmtement , push down "count" type to BE.
support file type : parquet ,orc in hive .
1. 4kfiles , 60kwline num
before: 1 min 37.70 sec
after: 50.18 sec
2. 50files , 60kwline num
before: 1.12 sec
after: 0.82 sec
Enhance broadcast join cost calculation, by considering both the build side effort from building bigger hash table, and more probe side effort from bigger cost of ProbeWhenBuildSideOutput and ProbeWhenSearchHashTable, if parallel_fragment_exec_instance_num is more than 1.
Current solution gives a penalty factor on rightRowCount, and the factor is the total instance number to the power of 2.
Penalty on outputRows is not taken currently and will be refined in next generation cost model.
Also brings some update for shape checking:
update original control variable in shape file parallel_fragment_exec_instance_num to parallel_pipeline_task_num, if pipeline is enabled.
fix a be_number variable inactive issue.
consider sql:
```
SELECT *
FROM sub_query_correlated_subquery1 t1
WHERE coalesce(bitand(
cast(
(SELECT sum(k1)
FROM sub_query_correlated_subquery3 ) AS int),
cast(t1.k1 AS int)),
coalesce(t1.k1, t1.k2)) is NULL
ORDER BY t1.k1, t1.k2;
```
is Null conjunct is lost in SubqueryToApply rule. This pr fix it
After auto retry merged, it's hard to determine the execute times of doExecute method in compile time, and if the expected execute times in the expectation block is missed, unexpected invocation exception would be thrown, so just remove the expected execute times
select c_name from customer union select c_name from customer
this sql used agg node to get distinct row of c_name,
so it's no need to wait for inserted all data to hash map,
could output the data which it's inserted into hash map successed.
1. not use recursive parse to avoid stack overflow
2. To create a balanced tree instead of left deep tree
TODO: add expr_depth_limit to Nereids' parser
This file system cache key should contains `scheme://authority`, eg: `hdfs//nameservices1`.
Or it will encounter error:
```
Wrong FS: hdfs//abc/xxxx, expected: hdfs://def
```
Improve external table statistics collection, including log, observability and fix some bugs.
1. Add Running state for statistics job.
2. Add progress for show analyze job. (n/m tasks finished, n/m task failed and so on)
3. Add analyze time cost for show analyze task.
4. Make task failure message more clear.
5. Synchronize the job status updating code in updateTaskStatus.
6. Fix NPE in HMSAnalyzeTask. (Avoid refreshing statistics cache if the collection sql failed)
7. Return error message for with sync collection while timeout.
8. Log level improvement
9. Fix misuse of logCreateAnalysisJob for tasks.
current dphyper join reorder hasn't consider the join conjunct referencing only one side of the child. This is common case in outer join conjunct. So we need disable outer join reorder in dphyper until this problem is addressed.
problem:
1. create a iceberg_type catalog:
2. use iceberg catalog to specify verison
```
mysql> show catalog iceberg;
+----------------------+--------------------------+
| Key | Value |
+----------------------+--------------------------+
| type | iceberg |
| iceberg.catalog.type | hms |
| hive.metastore.uris | thrift://127.0.0.1:9083 |
| hadoop.username | hadoop |
| create_time | 2023-07-25 16:51:00.522 |
+----------------------+--------------------------+
5 rows in set (0.02 sec)
mysql> select * from iceberg.iceberg_db.tb1 FOR VERSION AS OF 8783036402036752909;
ERROR 5090 (42000): errCode = 2, detailMessage = Only iceberg/hudi external table supports time travel in current version
```
change:
Add `ICEBERG_EXTERNAL_TABLE` type for specify the version and time
* Improve db update binlog properties (binlog.enable = "true") with check
all table enable binlog
* Add more test_alter_database_property regression test
1. refactor in-predicate filter estimation
example: A in (1, 2, 3, 4)
after in-preidcate filter, A.stats.max<=4 and A.stats.min>=1
2. maintain minExpr and maxExpr in in-predicate stats derive
Currently, the new optimizer don't consider anything about partial update.
This PR add the ability to convert a delete statement to a partial update insert statement
for merge-on-write unique table
Support such grammar
ANALYZE TABLE test WITH CRON "* * * * * ?"
Such job would be scheduled as the cron expr specifie, but natively support minute-level schedule only