For performance reason, we want to remove constant column from groupingExprs.
For example:
`select sum(T.A) from T group by T.B, 'xyz'` is equivalent to `select sum(T.A) from T group by T.B`
We can remove constant column `abc` from groupingExprs.
But there is an exception when all groupingExpr are constant
For example:
sql1: `select 'abc' from t group by 'abc'`
is not equivalent to
sql2: `select 'abc' from t`
sql3: `select 'abc', sum(a) from t group by 'abc'`
is not equivalent to
sql4: `select 1, sum(a) from t`
(when t is empty, sql3 returns 0 tuple, sql4 return 1 tuple)
We need to keep some constant columns if all groupingExpr are constant.
Consider sql5 `select a from (select "abc" as a, 'def' as b) T group by b, a;`
if the constant column `a` is in select list, this column should not be removed.
sql5 is transformed to
sql6 `select a from (select "abc" as a, 'def' as b) T group by a;`
Use FE cluster token to auth stream load.
This auth is only open for be, and fe auth still only support http basic auth.
I will use this auth for mysql load to build a no-auth stream load from fe to be.
And this will avoid double auth in mysql load.
More information to see the design doc.
Issue Number: close#16351
Dynamic schema table is a special type of table, it's schema change with loading procedure.Now we implemented this feature mainly for semi-structure data such as JSON, since JSON is schema self-described we could extract schema info from the original documents and inference the final type infomation.This speical table could reduce manual schema change operation and easily import semi-structure data and extends it's schema automatically.
Doris can use mysql-jdbc-jar to connect doris database, but doris has some data type that mysql without.
Such as DecimalV3 and Date/DatetimeV2
I add some case judgments in `Mysql Catalog` , so that Jdbc catalog can identify the data type of DORIS
When file cache enabled, running the same query for the second time may be still slow, for `FE` will assign the same
scan range into different backends among different queries, and the former cached data in `BE` will be useless if the scan range is changed.
So, this PR introduce consistent hash to assign the same scan range into the same backend among different queries.
when pg table have some unsupported column type like: point, polygon, jsonb......
jdbc catalog will convert it to string type in doris. but get result set in java is org.postgresql.util.PGobject
Some test need this pr: #16442
1. add limit threshold for topn runtime pushdown and key topn optimization
2. use unified session variable topn_opt_limit_threshold for all topn optimizations
3. add fuzzy support for topn_opt_limit_threshold
change implement of auth to rbac
each user has one default role which can not be drop;
if you grant priv to user,it will grant to default role ,
In the current pr, the user can still only have one role other than the default role, but in the future, the user and role will be many-to-many
rename PaloRole,PaloAuth,PaloPrivilege to Role,Auth,Privilege
The new Nereids planner do column projection after creating scan node. For ExternalFileScanNode, this may cause the columns in required_slots mismatch with the slots after projection. This pr is to reset the required_slots after projection.
This commit support:
1、Insert + select for struct/map type
2、Json stream load for struct type
3、m[key] function for map type
How to use:
Set the fe config to create table for struct and map type
1、admin set frontend config("enable_struct_type" = "true");
2、admin set frontend config("enable_map_type" = "true");
#16547
Co-authored-by: xy720 <xuyang25@baidu.com>
Co-authored-by: amory <wangqiannan@selectdb.com>
Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>
Co-authored-by: hucheng01 <hucheng01@baidu.com>
There were cooldownttl and cooldownttlms in StoragePolicy, it's so error-prone because they served nearly the same.
For example, the init function would only assign the ttl timestamp to cooldownttl, which would end up pushing cooldownttl 0 to be.
When JDBC client enable server side prepared statement, it will cache OlapScanNode and reuse it for performance, but each
time call `addScanRangeLocations` will add new item to `bucketSeq2locations`, so the `bucketSeq2locations` lead to a memleak if OlapScanNode
cached in memory
before this PR, Doris cannot process sql like that
```sql
CREATE TABLE `test_sq_dj1` (
`c1` int(11) NULL,
`c2` int(11) NULL,
`c3` int(11) NULL
) ENGINE=OLAP
DUPLICATE KEY(`c1`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`c1`) BUCKETS 3
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"in_memory" = "false",
"storage_format" = "V2",
"disable_auto_compaction" = "false"
);
CREATE TABLE `test_sq_dj2` (
`c1` int(11) NULL,
`c2` int(11) NULL,
`c3` int(11) NULL
) ENGINE=OLAP
DUPLICATE KEY(`c1`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`c1`) BUCKETS 3
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"in_memory" = "false",
"storage_format" = "V2",
"disable_auto_compaction" = "false"
);
insert into test_sq_dj1 values(1, 2, 3), (10, 20, 30), (100, 200, 300);
insert into test_sq_dj2 values(10, 20, 30);
-- core
SELECT * FROM test_sq_dj1 WHERE c1 IN (SELECT c1 FROM test_sq_dj2) OR c1 IN (SELECT c1 FROM test_sq_dj2) OR c1 < 10;
-- invalid slot
SELECT * FROM test_sq_dj1 WHERE c1 IN (SELECT c1 FROM test_sq_dj2) OR c1 IN (SELECT c2 FROM test_sq_dj2) OR c1 < 10;
```
there are two problems:
1. we should remove redundant sub-query in one conjuncts to avoid generate useless join node
2. when we have more than one sub-query in one disjunct. we should put the conjunct contains the disjunct at the top node of the set of mark join nodes. And pop up the mark slot to the top node.
RewriteBinaryPredicatesRule rewrite expression like
`cast(A decimal) > decimal` to `A > some_other_bigint`
in order to:
1. push down the rewrite predicate
2. avoid convert column A to decimal
We get the datatype of `A` by `expr0.getSrcSlotRef().getColumn().getType()`.
However, when A is result of a function from sub-query, this rule is not applicable.
For example:
```
select *
from (
select TIMESTAMPDIFF(MINUTE,startTime,endTime) AS timediff
from CNC_SliceSate) T
where timediff > 5.0;
```
we cannot push predicate down to OlapScan(CNC_SliceSate) to save effort.
adjust nullable on empty set should apply after unnested sub-query
some function should propagate nullable when args are datev2 or datetimev2
add back tpch sf0.1 nereids regression test
```
create table tbl1 (k1 varchar(100), k2 string) distributed by hash(k1) buckets 1 properties("replication_num" = "1");
insert into tbl1 values(1, "alice");
select cast(k1 as INT) as id from tbl1 order by id limit 2;
```
The above query could pass `checkEnableTwoPhaseRead` since the order by element is SlotRef but actually it's an function call expr