* If column stats are unknown, do not use the DPhyp optimizer.
With this change, TPC-DS query64 is optimized even when no stats are available.
At sf500, query64 improved from 15s to 7s on HDFS, and from 4s to 3.85s on OLAP tables.
In the previous logic, when restoring columns in predicates pushed down to JdbcScanNode based on the logical syntax tree, we added escape characters to avoid query errors caused by keywords such as `key`. However, only binary predicates were processed, which was incomplete. We should add escape characters to all columns that appear in a predicate to avoid errors with keywords or illegal characters.
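A hypothetical illustration (the table `t`, the column name `key`, and MySQL-style backtick quoting are assumptions for the example), showing why every column in a pushed-down predicate needs escaping, not only those in binary predicates:

```
-- Before: only columns in binary predicates were escaped, so a
-- non-binary predicate on a keyword-named column broke the query.
SELECT * FROM t WHERE `key` = 1 AND key IN (2, 3);  -- IN-list column unescaped

-- After: every column appearing in a predicate is escaped.
SELECT * FROM t WHERE `key` = 1 AND `key` IN (2, 3);
```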
Follow-up to #28890.
Make HttpSqlConverterPlugin and AuditLoader Doris' builtin plugins,
to make it simple for users to enable SQL dialect conversion and use the audit loader.
HttpSqlConverterPlugin
By default, nothing is changed.
There is a new global variable `sql_converter_service`, which defaults to empty. If set, the HttpSqlConverterPlugin will be enabled:
set global sql_converter_service = "http://127.0.0.1:5001/api/v1/convert"
AuditLoader
By default, nothing is changed.
There is a new global variable `enable_audit_plugin`, which defaults to false. If set to true, the audit loader plugin will be enabled.
Doris will create an `audit_log` table in `__internal_schema` at startup.
If `enable_audit_plugin` is true, audit logs will be inserted into the `audit_log` table.
3 other global variables relate to this plugin (see the example after the list):
audit_plugin_max_batch_interval_sec: the max interval at which the audit loader inserts a batch of audit logs.
audit_plugin_max_batch_bytes: the max batch size for one insert of audit logs.
audit_plugin_max_sql_length: the max length of a statement kept in the audit log.
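A minimal sketch of enabling and tuning the plugin (the variable names come from this change; the values are placeholders):

```
-- Enable the audit loader; audit logs start flowing into
-- __internal_schema.audit_log.
set global enable_audit_plugin = true;

-- Optional tuning (placeholder values):
set global audit_plugin_max_batch_interval_sec = 60;
set global audit_plugin_max_batch_bytes = 52428800;  -- 50 MB
set global audit_plugin_max_sql_length = 4096;
```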
Estimate column stats for `cast(col as XXXType)`.
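For instance (a hypothetical query shape; the table and column names are made up), the planner can now derive stats for the cast expression from the source column's stats instead of treating them as unknown:

```
-- Stats for cast(o_orderdate as datetime) are now estimated
-- from the column stats of o_orderdate.
select count(*) from orders
where cast(o_orderdate as datetime) < '1995-01-01 00:00:00';
```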
```
----- cast-est -----
query      cold   hot1   hot2   best hot (ms)
query4    41169  40335  40267  40267
query58     463    361    401    361
Total cold run time: 41632 ms
Total hot run time:  40628 ms

----- master -----
query      cold   hot1   hot2   best hot (ms)
query4    40624  40180  40299  40180
query58     487    389    420    389
Total cold run time: 41111 ms
Total hot run time:  40569 ms
```
When a varchar literal contains Chinese characters, the length of the varchar should not be the character count; it should be the actual length in bytes.
Chinese characters are encoded in Unicode, and one Chinese character occupies at most 4 bytes, so when a varchar literal contains Chinese we set the length to 4 * the character count.
For example:
```
CREATE MATERIALIZED VIEW test_varchar_literal_mv
BUILD IMMEDIATE REFRESH AUTO ON MANUAL
DISTRIBUTED BY RANDOM BUCKETS 2
PROPERTIES ('replication_num' = '1')
AS
select case when l_orderkey > 1 then "一二三四" else "五六七八" end as field_1 from lineitem;
```
The resulting definition of the materialized view is as follows ("一二三四" is 4 characters, and 4 characters * 4 bytes = 16, hence VARCHAR(16)):

```
mysql> desc test_varchar_literal_mv;
+---------+-------------+------+-------+---------+-------+
| Field   | Type        | Null | Key   | Default | Extra |
+---------+-------------+------+-------+---------+-------+
| field_1 | VARCHAR(16) | No   | false | NULL    | NONE  |
+---------+-------------+------+-------+---------+-------+
```
Taking the idea further from PR #24853:
Column statistics already analyzed by Spark and available in HMS can be reused. This PR proposes to reuse the analyzed stats from the external source when the `WITH SQL` clause of the analyze command is executed.
Spark analyzes and stores the statistics in table properties instead of HiveColumnStatistics. In this PR, we try to get the statistics from these properties and make them available to Doris.
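A minimal sketch of the intended usage (the catalog and table names are placeholders; `WITH SQL` is the clause referenced above):

```
-- Reuse the Spark-analyzed stats stored in the Hive table's
-- properties instead of re-scanning the data.
analyze table hive_catalog.tpch.lineitem with sql;
```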
If no matching table can be found, the task will loop forever and no data will be loaded. To avoid such invalid scheduled tasks, it is better to pause the job rather than keep running it.
Only the root user can operate on the `__internal_schema` database.
The scope of impact includes (see the sketch after the list):
create database
drop database
alter database
create table
drop table
alter table
truncate table
insert overwrite
insert
delete
update
load (not allowed even for root)
Also, delete now supports auth checks.
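A hypothetical illustration (run as a non-root user; the source table name is a placeholder):

```
-- Both statements should now fail with a permission error
-- for any user other than root.
drop table __internal_schema.audit_log;
insert into __internal_schema.audit_log select * from some_source_table;
```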
Fix CTE queries being wrongly rewritten by a materialized view when the query has a scalar aggregate but the view does not.
For example, the following query should not be rewritten by the materialized view:
```
// materialized view definition
def mv20_1 = """
    select
        l_shipmode,
        l_shipinstruct,
        sum(l_extendedprice),
        count(*)
    from lineitem
    left join orders
        on lineitem.L_ORDERKEY = orders.O_ORDERKEY
    group by
        l_shipmode,
        l_shipinstruct;
"""
```
```
// query sql
def query20_1 = """
    select
        sum(l_extendedprice),
        count(*)
    from lineitem
    left join orders
        on lineitem.L_ORDERKEY = orders.O_ORDERKEY
"""
```
Fix wrong predicate compensation.
For example, the following now returns the right result; previously it was handled wrongly:
```
// materialized view definition
def mv7_1 = """
    select l_shipdate, o_orderdate, l_partkey, l_suppkey
    from lineitem
    left join orders
        on lineitem.l_orderkey = orders.o_orderkey
    where l_shipdate = '2023-12-08' and o_orderdate = '2023-12-08';
"""
```
```
// query sql
def query7_1 = """
    select l_shipdate, o_orderdate, l_partkey, l_suppkey
    from (select * from lineitem where l_shipdate = '2023-10-17') t1
    left join orders
        on t1.l_orderkey = orders.o_orderkey;
"""
```
Also optimize some code usage and add more comments for methods.
Currently, union runtime filter (RF) push down only supports RFs from the parent join, not from an ancestor join.
This PR fixes the RF push-down check on project/distribute nodes so RFs from ancestor joins can pass through.
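A hypothetical query shape this affects (table names are made up): in the plan, project/distribute nodes sit between the join and the union, making the join an ancestor rather than the union's direct parent, yet its RF should still reach both union branches:

```
-- The RF built from `dim.k = v.k` should be pushed through the
-- intervening project/distribute nodes into the scans of t1 and t2.
select *
from dim
join (
    select k from t1
    union all
    select k from t2
) v on dim.k = v.k;
```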
Merging profiles requires the profiles themselves to be correct. However, when merging is used to troubleshoot correctness issues through profiles, the merge itself may error out.
Moreover, the try-catch does not catch exceptions related to profile merging, so if merging fails, even the normal (unmerged) profile cannot be obtained.
Add an `information_schema` database for every catalog.
This is useful when using BI tools to connect to Doris:
the tools can get meta info from `information_schema`.
This PR mainly changes:
1. There will be an `information_schema` db in each catalog.
2. Each `information_schema` db only stores the meta info of the catalog it belongs to.
3. For `information_schema`, the `TABLE_SCHEMA` column's value is the database name.
4. There is a new global variable `show_full_dbname_in_info_schema_db`, default false. If set to true,
the `TABLE_SCHEMA` column's value takes the form `ctl.db`, because:
when connecting to Doris, the `database` info in the connection URL will be `xxx?db=ctl.db`,
and then some BI tools will query `information_schema` with SQL like
`select * from information_schema.columns where TABLE_SCHEMA = "ctl.db"`,
so the value has to be formatted as `ctl.db`.
5. For example, the `information_schema.columns` table in the external catalog `doris` looks like:
```
mysql> select * from information_schema.columns limit 1\G
*************************** 1. row ***************************
TABLE_CATALOG: doris
TABLE_SCHEMA: doris.__internal_schema
TABLE_NAME: column_statistics
COLUMN_NAME: id
ORDINAL_POSITION: 1
COLUMN_DEFAULT: NULL
IS_NULLABLE: NO
DATA_TYPE: varchar
CHARACTER_MAXIMUM_LENGTH: 4096
CHARACTER_OCTET_LENGTH: 16384
NUMERIC_PRECISION: NULL
NUMERIC_SCALE: NULL
DATETIME_PRECISION: NULL
CHARACTER_SET_NAME: NULL
COLLATION_NAME: NULL
COLUMN_TYPE: varchar(4096)
COLUMN_KEY:
EXTRA:
PRIVILEGES:
COLUMN_COMMENT:
COLUMN_SIZE: 4096
DECIMAL_DIGITS: NULL
GENERATION_EXPRESSION: NULL
SRS_ID: NULL
```
6. Modify the behavior of:
- show tables
- show databases
- show columns
- show table status
The above statements may query the `information_schema` db when a `where` predicate follows them (see the sketch below).
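A minimal example of such a statement (the table name is a placeholder):

```
-- With a `where` predicate, this may be answered by querying the
-- `information_schema` db of the current catalog.
show table status where Name = 'lineitem';
```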
1. expand_runtime_filter_by_inner_join may create some redundant RFs (e.g., in TPC-H q5 and q9); we need to remove the redundant ones.
2. Hive: prune an RF if its target is only used as the probe side.
case:
```
MySQL root@127.0.0.1:test> select cast(12 as decimalv3(2,1))
+-----------------------------+
| cast(12 as DECIMALV3(2, 1)) |
+-----------------------------+
| 12.0 |
+-----------------------------+
```
The result above is wrong: 12 overflows DECIMALV3(2,1), whose maximum value is 9.9, so the cast should not silently return 12.0.
decimalv2 literals generate wrong results too, but those are not only
planner bugs; there are also bugs in the executor, so the executor
bug will be fixed in another PR.