* [fix](Inbitmap) fix bitmap result error when the left expr of an in predicate is a constant
1. When the left expr of the in predicate is a constant, instead of generating a bitmap filter, rewrite the SQL to use `bitmap_contains`.
For example, "select k1, k2 from (select 2 k1, 11 k2) t where k1 in (select bitmap_col from bitmap_tbl)"
=> "select k1, k2 from (select 2 k1, 11 k2) t left semi join bitmap_tbl b on bitmap_contains(b.bitmap_col, t.k1)"
* add regression test
**Histogram statistics**
Currently Doris collects statistics but no histogram data, and by default the optimizer assumes that the distinct values of a column are evenly distributed. This assumption can be problematic when the data distribution is skewed, so this PR implements the collection of histogram statistics.
For skewed columns (columns whose data is unevenly distributed), histogram statistics enable the optimizer to generate more accurate cardinality estimates for filter or join predicates involving these columns, resulting in a more precise execution plan.
Histograms improve the execution plan mainly in two aspects: the selection of where conditions and the selection of join order. The selection principle for where conditions is relatively simple: the histogram is used to calculate the selectivity of each predicate, and the more selective filter is preferred.
Join order selection is based on estimating the number of rows in the join result. When the data in the join condition columns is unevenly distributed, histograms can greatly improve the accuracy of this row count estimate. In addition, if a bucket of one of the columns contains 0 rows, it can be marked and skipped directly in the subsequent join to improve efficiency.
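A hedged illustration, reusing the `dev_table` example further below: with an equal-height histogram on `login_time`, the optimizer can estimate the fraction of rows matching a range predicate from the bucket boundaries instead of assuming an even distribution.
```sql
-- with a histogram on login_time, the estimate for this range predicate comes from
-- the buckets overlapping the range rather than from the uniform-distribution assumption
SELECT count(*) FROM dev_table WHERE login_time >= '2022-09-24 00:00:00';
```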
---
Histogram statistics are mainly collected by the histogram aggregation function, which is used as follows:
**Syntax**
```SQL
histogram(expr)
```
> The histogram function is used to describe the distribution of the data. It uses an "equal height" bucketing strategy and divides the data into buckets according to the values. Each bucket is described with a few simple statistics, such as the number of values that fall into it. It is mainly used by the optimizer to estimate the selectivity of range queries.
**Example**
```
MySQL [test]> select histogram(login_time) from dev_table;
+------------------------------------------------------------------------------------------------------------------------------+
| histogram(`login_time`) |
+------------------------------------------------------------------------------------------------------------------------------+
| {"bucket_size":5,"buckets":[{"lower":"2022-09-21 17:30:29","upper":"2022-09-21 22:30:29","count":9,"pre_sum":0,"ndv":1},...]}|
+------------------------------------------------------------------------------------------------------------------------------+
```
**Description**
```JSON
{
    "bucket_size": 5,
    "buckets": [
        {
            "lower": "2022-09-21 17:30:29",
            "upper": "2022-09-21 22:30:29",
            "count": 9,
            "pre_sum": 0,
            "ndv": 1
        },
        {
            "lower": "2022-09-22 17:30:29",
            "upper": "2022-09-22 22:30:29",
            "count": 10,
            "pre_sum": 9,
            "ndv": 1
        },
        {
            "lower": "2022-09-23 17:30:29",
            "upper": "2022-09-23 22:30:29",
            "count": 9,
            "pre_sum": 19,
            "ndv": 1
        },
        {
            "lower": "2022-09-24 17:30:29",
            "upper": "2022-09-24 22:30:29",
            "count": 9,
            "pre_sum": 28,
            "ndv": 1
        },
        {
            "lower": "2022-09-25 17:30:29",
            "upper": "2022-09-25 22:30:29",
            "count": 9,
            "pre_sum": 37,
            "ndv": 1
        }
    ]
}
```
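Reading the fields above (as inferred from the example): `count` is the number of rows in a bucket, `pre_sum` is the cumulative count of all preceding buckets, and `ndv` is the number of distinct values within the bucket. For instance, an estimate for `login_time < '2022-09-23 17:30:29'` is roughly the third bucket's `pre_sum`, i.e. 9 + 10 = 19 rows.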
TODO:
- histogram function supports parameters and sampled statistics (covered in another PR)
- use histogram statistics in the optimizer
- add p0 regression tests
This PR adds the rewriting and matching logic for the bitmap_union column in a materialized index.
If a materialized index has a bitmap_union column, we try to rewrite count distinct or bitmap_union_count to use the bitmap_union column in the materialized index, as sketched below.
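A minimal sketch, with hypothetical table and column names, of the kind of rewrite this enables:
```sql
-- hypothetical materialized view containing a bitmap_union column
CREATE MATERIALIZED VIEW visit_uv AS
SELECT dt, bitmap_union(to_bitmap(user_id))
FROM visit_log
GROUP BY dt;

-- both queries can now be matched against the materialized index and rewritten
-- to aggregate over its bitmap_union column instead of the raw user_id values
SELECT dt, count(distinct user_id) FROM visit_log GROUP BY dt;
SELECT dt, bitmap_union_count(to_bitmap(user_id)) FROM visit_log GROUP BY dt;
```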
When we process a not-in subquery, if the column returned by the subquery is nullable, we need a NULL AWARE ANTI JOIN instead of an ANTI JOIN.
Doris already supports NULL AWARE ANTI JOIN since PR #13871.
Nereids needs to do the same.
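A minimal sketch of the affected query shape, with hypothetical table and column names:
```sql
-- if t2.c1 is nullable, a plain ANTI JOIN would lose the NULL semantics of NOT IN,
-- so the planner must generate a NULL AWARE ANTI JOIN here
SELECT * FROM t1 WHERE t1.k1 NOT IN (SELECT c1 FROM t2);
```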
If we switch to an external catalog and use a database that has the same name as a database in the internal catalog,
querying 'show data' will return the data info from the internal catalog.
In a previous PR (#14876) we compact equality predicates like "a=1 or a=2 or a=3" into "a in (1, 2, 3)".
This PR sets a lower bound on the number of equality predicates, COMPACT_EQUAL_TO_IN_PREDICATE_THRESHOLD (default is 2).
For performance reasons we collect the literals in a hashSet, like {1,2,3}, and hence the literals in the resulting in-predicate are in random order.
For regression tests that need a stable explain output, set COMPACT_EQUAL_TO_IN_PREDICATE_THRESHOLD to a large number to avoid the compaction rule.
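A sketch of how a regression test might do this, assuming the threshold is exposed as a session variable of the same name (an assumption, not confirmed by this description):
```sql
-- assumed session variable name; raising the threshold far above any realistic OR chain
-- keeps "a=1 or a=2 or a=3" from being rewritten, so the explain output stays stable
SET compact_equal_to_in_predicate_threshold = 100000;
```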
When running create table like on a JDBC table, it fails with an error like
'errCode = 2, detailMessage = Failed to execute CREATE TABLE LIKE baseall_mysql.
Reason: errCode = 2, detailMessage = property table_type must be set'
This PR fixes it.
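A hypothetical reproduction, assuming baseall_mysql is a JDBC external table:
```sql
-- baseall_mysql is assumed to be a JDBC external table; before this fix the
-- statement failed with "property table_type must be set"
CREATE TABLE baseall_mysql_copy LIKE baseall_mysql;
```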
This PR implements the SetOperation.
- Adapt the EliminateUnnecessaryProject rule to ensure that the project under a SetOperation is not deleted.
- Add predicate pushdown for SetOperation.
- Optimization: merge multiple SetOperations with the same type and the same qualifier (see the sketch after this list).
- Optimization: merge OneRowRelation and union.
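A minimal sketch, with hypothetical table names, of the query shapes the two merge optimizations target:
```sql
-- nested UNION ALLs with the same qualifier can be merged into a single SetOperation
SELECT k1 FROM t1
UNION ALL
SELECT k1 FROM t2
UNION ALL
SELECT k1 FROM t3;

-- constant selects (OneRowRelation) can be folded into the union they feed
SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT k1 FROM t1;
```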
* [hotfix](dev-1.0.1) fix colocate join bug in vec engine after introducing output tuple (#10651)
To support vectorized outer join, we introduced an output tuple for the hash join node,
but it breaks the check for colocate join.
To solve this problem, we map the output slot ids to the children's slot ids of the hash join node,
so that colocate join can be checked correctly.
* fix colocate join bug
* fix non vec colocate join issue
Co-authored-by: lichi <lichi@rateup.com.cn>
* add test cases
Co-authored-by: lichi <lichi@rateup.com.cn>
When executing show databases/tables/table status where xxx, the statement is rewritten into a SelectStmt that selects the result from
information_schema. Scanning the schema table needs catalog info, otherwise it may return
database or table info from multiple catalogs.
For example:
mysql> show databases where schema_name='test';
+----------+
| Database |
+----------+
| test |
| test |
+----------+
MySQL [internal.test]> show tables from test where table_name='test_dc';
+----------------+
| Tables_in_test |
+----------------+
| test_dc |
| test_dc |
+----------------+
Fix three bugs:
1. DataTypeFactory::create_data_type is missing the conversion for the binary type, so OrcReader fails.
2. ScalarType#createType is missing the conversion for the binary type, so ExternalFileTableValuedFunction fails.
3. fmt::format cannot generate the right format string and fails.
Add one metric to report the number of published transactions per db. Users can combine this metric with doris_fe_txn_num to get the relative transaction processing speed per db.
# Proposed changes
## refactor
- add AggregateExpression to hide the differences between AggregateFunctions before and after disassembly
- request `GATHER` physicalProperties for the query, because a query always collects its result to the coordinator; using `GATHER` may select a better plan
- refactor `NormalizeAggregate`
- remove some physical fields from `LogicalAggregate`, like `AggPhase` and `isDisassemble`
- remove `AggregateDisassemble` and `DistinctAggregateDisassemble`, and use `AggregateStrategies` to generate the various PhysicalHashAggregate alternatives, like `two phases aggregate` and `three phases aggregate`, so cascades can automatically select the lowest-cost alternative.
- move `PushAggregateToOlapScan` to `AggregateStrategies`
- separate the traverse and visit methods in FoldConstantRuleOnFE
- if an expression does not implement the visit method, the traverse method can handle it and rewrite the children by default
- if an expression implements the visit method, the user-defined traverse (which invokes accept/visit) returns quickly because the default visit method does not forward to the children, and the pre-processing in the traverse method is not skipped.
## new feature
- support `disable_nereids_rules` to skip some rules.
example:
1. create 1 bucket table `n`
```sql
CREATE TABLE `n` (
  `id` bigint(20) NOT NULL
) ENGINE=OLAP
DUPLICATE KEY(`id`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`id`) BUCKETS 1
PROPERTIES (
  "replication_allocation" = "tag.location.default: 1",
  "in_memory" = "false",
  "storage_format" = "V2",
  "disable_auto_compaction" = "false"
);
```
2. insert some rows into `n`
```sql
insert into n select * from numbers('number'='20000000')
```
3. query table `n`
```sql
SET enable_nereids_planner=true;
SET enable_vectorized_engine=true;
SET enable_fallback_to_original_planner=false;
explain plan select id from n group by id;
```
the result shows that the one-stage aggregate is used
```
| PhysicalHashAggregate ( aggPhase=LOCAL, aggMode=INPUT_TO_RESULT, groupByExpr=[id#0], outputExpr=[id#0], partitionExpr=Optional.empty, requestProperties=[GATHER], stats=(rows=1, width=1, penalty=2.0E7) ) |
| +--PhysicalProject ( projects=[id#0], stats=(rows=20000000, width=1, penalty=0.0) ) |
| +--PhysicalOlapScan ( qualified=default_cluster:test.n, output=[id#0, name#1], stats=(rows=20000000, width=1, penalty=0.0) ) |
```
4. disable one stage aggregate
```sql
explain plan select
/*+SET_VAR(disable_nereids_rules=DISASSEMBLE_ONE_PHASE_AGGREGATE_WITHOUT_DISTINCT)*/
id
from n
group by id
```
the result is a two-stage aggregate
```
| PhysicalHashAggregate ( aggPhase=GLOBAL, aggMode=BUFFER_TO_RESULT, groupByExpr=[id#0], outputExpr=[id#0], partitionExpr=Optional[[id#0]], requestProperties=[GATHER], stats=(rows=1, width=1, penalty=2.0E7) ) |
| +--PhysicalHashAggregate ( aggPhase=LOCAL, aggMode=INPUT_TO_BUFFER, groupByExpr=[id#0], outputExpr=[id#0], partitionExpr=Optional[[id#0]], requestProperties=[ANY], stats=(rows=1, width=1, penalty=2.0E7) ) |
| +--PhysicalProject ( projects=[id#0], stats=(rows=20000000, width=1, penalty=0.0) ) |
| +--PhysicalOlapScan ( qualified=default_cluster:test.n, output=[id#0, name#1], stats=(rows=20000000, width=1, penalty=0.0) ) |
```