Commit Graph

7545 Commits

Author SHA1 Message Date
b23a785775 [Fix](Variant) support materialized view for variant and accessing variant subcolumns (#30603)
* [Fix](Variant) support materialized view for variant and accessing variant subcolumns
1. fix schema change losing the path, which led to invalid data reads
2. support the element_at function on the BE side and use simdjson to parse data (see the sketch below)
3. fix multi-slot expressions
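A minimal sketch of the access paths involved, with hypothetical table, column, and key names; the BE-side element_at call follows this commit's description and is an assumption beyond it:
```
CREATE TABLE events (
    id BIGINT,
    payload VARIANT
) DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES ("replication_num" = "1");

-- subcolumn access, plus the element_at form this commit implements on the BE side
SELECT payload['user'], element_at(payload, 'user') FROM events;
```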
2024-02-16 10:12:23 +08:00
cb43ac8ab2 [feature](nereids) using rollup column stats (#30852) 2024-02-16 10:12:23 +08:00
f95d0cf802 [fix](Nereids) should not infer not null from mark join (#30897) 2024-02-16 10:12:23 +08:00
08508d65fd [feature-wip](plsql)(step1) Support PL-SQL (#30817)
# 1. Motivation
PL-SQL (stored procedures) is a collection of SQL statements that is defined and used similarly to a function. It supports conditional branches, loops, and other control statements; supports cursor processing of result sets; and allows business logic to be written in SQL.

Hive uses Hplsql to support PL-SQL and is largely compatible with Oracle, Impala, MySQL, Redshift, PostgreSQL, DB2, etc. We support PL-SQL in Doris based on Hplsql to achieve compatibility with the stored procedures of database systems such as Oracle and PostgreSQL.

Reference documentation:
Hive: http://mail.hplsql.org
Oracle: https://docs.oracle.com/en/database/oracle/oracle-database/21/lnpls/plsql-language-fundamentals.html#GUID-640DB3AA-15AF-4825-BD6C-1D4EB5AB7715
Mysql: https://dev.mysql.com/doc/refman/8.0/en/create-procedure.html

# 2. Implementation
Take the following case as an example to explain how Doris FE executes stored procedures over the MySQL protocol.
```
CREATE OR REPLACE PROCEDURE A(IN name STRING, OUT result INT)
BEGIN
          select count(*) from test;
          select count(*) into result from test where k = name;
END;

declare result INT default 0;
call A('xxx', result);
print result;
```
![image](https://github.com/apache/doris/assets/13197424/0b78e039-0350-4ef1-bef3-0ebbf90274cd)

1. Add procedure: persist the Procedure Name and Source (raw SQL) into Doris FE metadata.
2. Call procedure: extract the actual parameter Values and the Procedure Name from the Call Stmt. Use the Procedure Name to find the Source in the metadata, extract the Name and Type of each procedure parameter, and match them with the actual parameter Values to form complete variables <Name, Type, Value>.
3. Execute Doris Statement
     - Use the Doris Logical Plan Builder to parse the Doris Statement syntax in Source, replace parameter variables, remove the into-variable clause, and generate a Plan Tree that conforms to Doris syntax.
     - Use stmtExecutor to execute the SQL and encapsulate the query result set iterator into a QueryResult.
     - Output the query results to the MySQL Channel, or write them into Cursors, parameters, and variables.
     - Stored Programs compatible with the MySQL protocol support multiple statements.
4. Execute PL-SQL Statement
     - Use the Plsql Logical Plan Builder to parse and execute the PL-SQL Statement syntax in Source, including Loop, Cursor, IF, Declare, etc., largely reusing HplSQL (see the sketch below).
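A minimal HplSQL-style sketch of the control statements step 4 covers; the table, column, and variable names are hypothetical:
```
-- hypothetical table `test` with an int column `k`
DECLARE total INT DEFAULT 0;
DECLARE v INT;
DECLARE cur CURSOR FOR SELECT k FROM test;
OPEN cur;
FETCH cur INTO v;
WHILE SQLCODE = 0 THEN
    IF v > 10 THEN
        SET total = total + v;
    END IF;
    FETCH cur INTO v;
END WHILE;
CLOSE cur;
PRINT total;
```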

# 3. TODO
1. Support drop procedure.
2. Create procedure only in `PlSqlOperation`.
3. Doris Parser supports declaring variables.
4. Select Statement supports select into variable.
5. Handle parameters and fields with the same name.
6. If a Cursor exits halfway, will there be a memory leak?
7. Use getOriginSql(ctx) in the LogicalPlanBuilder syntax parsing to obtain the original SQL. Is there any problem with special characters?
8. Support complex types such as Map and Struct.
9. Test syntax such as Package.
10. Support UDFs.
11. In Oracle, create procedure must have AS or IS after the right parenthesis, but MySQL and Hive do not support AS or IS. Compatibility issues with Oracle will be discussed and resolved later.
12. Built-in functions require separate management.
13. Doris statement: add begin_transaction_stmt, end_transaction_stmt, commit_stmt, rollback_stmt.
14. Add plsql stmt: cmp_stmt, copy_from_local_stmt, copy_stmt, create_local_temp_table_stmt, merge_stmt.

# 4. Some questions
1. JDBC does not support executing stored procedures that return result sets; you can only select the execution results into a variable or write them into a table. When multiple result sets are returned, JDBC needs to use a prepareCall statement to execute; otherwise, when the Statement holding the returned result is finalized and an EOF Packet is sent, an error is reported.
2. Using PL-SQL Cursors to open multiple Query result set iterators at the same time means Doris BE will cache the intermediate state of these Queries (such as HashTables) and the query results until the result set iteration completes. If a Cursor is left unused for a long time, this wastes a lot of memory.
3. In plsql/Var.defineType(), the corresponding Plsql Var type is found through the MySQL type name string; the mapping between Doris types and Plsql Vars needs to be implemented.
4. Currently, PL-SQL Statements are forwarded to the Master FE for creation and computation, which may affect other services on Doris FE and is limited by the performance of Doris FE. Consider moving execution to Doris BE.
5. The format of the result returned by a Doris Statement is ```xxxx\n, xxxx\n, 2 rows affected (0.03 sec)```. PL-SQL uses Print to print variable values without formatting, and JDBC cannot easily obtain the real results.

# 5. Some thoughts
The above execution of Doris Statements reuses the Doris Logical Plan Builder for syntax parsing, parses top-down into a Plan Tree, and calls stmtExecutor for execution. PL-SQL operations such as replacing variables and removing Into Variable clauses are coupled into Doris syntax parsing. The advantage is that compatibility with Doris grammar is achieved with few changes; the disadvantage is that it intrudes on the Doris grammar parsing process.
HplSQL performs syntax parsing independently of Hive to implement variable substitution and other operations, and finally outputs SQL that conforms to Hive syntax. The figure below shows a simple syntax parsing flow for select; the parsing of where, expression, table name, join, agg, order, and other grammar must all be re-implemented. The advantage is complete independence from the original system, but the changes are too complicated.
![image](https://github.com/apache/doris/assets/13197424/7539e485-0161-44de-9100-1a01ebe6cc07)
2024-02-16 10:12:23 +08:00
c6cd6b125d [nereids] group by key elimination (#30774) 2024-02-16 10:12:23 +08:00
2cb46eed94 [Feature](auto-inc) Add start value for auto increment column (#30512) 2024-02-16 10:12:23 +08:00
5c2a4a80dd [fix](nereids) Fix use aggregate mv wrongly when rewrite query which only contains join (#30858)
the materialized view definition is as follows:
>            select 
>               o_orderdate, 
>               o_shippriority, 
>               o_comment, 
>               l_orderkey, 
>               o_orderkey, 
>               count(*) 
>             from 
>               orders 
>               left join lineitem on l_orderkey = o_orderkey
>               group by o_orderdate, 
>               o_shippriority, 
>               o_comment, 
>               l_orderkey;

the query below should be rewritten successfully using the above materialized view:
>             select 
>               o_orderdate, 
>               o_shippriority, 
>               o_comment, 
>               l_orderkey, 
>               ps_partkey, 
>               count(*) 
>             from 
>               orders
>               left join lineitem on l_orderkey = o_orderkey
>               left join partsupp on ps_partkey = l_orderkey
>               group by
>              o_orderdate, 
>              o_shippriority, 
>              o_comment, 
>              l_orderkey, 
>              ps_partkey;
2024-02-16 10:12:23 +08:00
2e4daa7006 [fix](Nereids): fix wrong case in TransposeSemiJoinLogicalJoinProject (#30874) 2024-02-16 10:12:23 +08:00
4701dd49c3 (selectdb-cloud) Use info level in recordCreatePartitionFailedMsg() due to intersection happens all the time (#30448) 2024-02-06 08:35:54 +08:00
bc2e8ac8f9 [fix](kerberos) fix kerberos ugi login method (#30766) 2024-02-06 08:35:54 +08:00
8e147f4c93 [BugFix](MultiCatalog) Fix oss file location is not available in iceberg hadoop catalog (#30761)
1. Create an iceberg hadoop catalog like below:
CREATE CATALOG iceberg_catalog PROPERTIES (
"warehouse" = "s3a://xxx/xxx",
"type" = "iceberg",
"s3.secret_key" = "*XXX",
"s3.region" = "region",
"s3.endpoint" = "http://xxx.jd.local",
"s3.bucket" = "xxx-test",
"s3.access_key" = "xxxxx",
"iceberg.catalog.type" = "hadoop",
"fs.s3a.impl" = "org.apache.hadoop.fs.s3a.S3AFileSystem",
"create_time" = "2024-02-02 11:15:28.570"
);

2. Run select * from iceberg_catalog.table limit 1;

This will get errCode = 2, detailMessage = Unknown file location nullnulls3a:/xxxx

Expected:
OK

This also needs to be backported to branch-2.0.
2024-02-06 08:35:54 +08:00
92226c986a [fix](catalog) fix date_sub/date_add func pushdown in jdbcscan (#30807) 2024-02-06 08:35:54 +08:00
06ed5780e4 [opt](catalog) cache the converted properties (#30668)
Converting properties may be a heavy operation, so we cache the result.
2024-02-06 08:35:54 +08:00
1ed24117ac [function](url_decode)add url_decode function (#30667) 2024-02-05 22:23:00 +08:00
4e8c94ef14 [config](move-memtable) set LOAD_STREAM_PER_NODE default to 2 (#30830) 2024-02-05 22:23:00 +08:00
bff3b04029 [fix](cosn) use s3 client to read cosn on BE side (#30835) 2024-02-05 22:22:59 +08:00
501ece3123 Collect index row count for MTMV. (#30855) 2024-02-05 22:00:36 +08:00
3a752b758a [fix](Nereids) colocate node attr lost after merge fragment (#30818) 2024-02-05 21:58:08 +08:00
a5d9004974 [fix](Nereids) physical property deriver on some node is not right (#30819) 2024-02-05 21:57:29 +08:00
fc762f426b [enhance](mtmv) mtmv disable hive auto refresh (#30775)
- If the `related table` is `hive`, do not refresh automatically
- If the `related table` is `hive`, the partition col is allowed to be `null`. Otherwise, it must be `not null`
- add more `ut`
2024-02-05 21:56:57 +08:00
d1bb63ed67 [fix](arrow-flight) Modify FE Arrow version to 15.0.0 #30824 2024-02-05 21:56:57 +08:00
48aaaa8005 [Enhancement](function) change function REPEAT nullable mode (#30743) 2024-02-04 22:21:36 +08:00
27f65f4463 [Feature](executor)Stream load support workload group (#30763)
* Stream load support workload group

* skip mysql load
2024-02-04 22:21:36 +08:00
25f6a733fe [fix](stats) keep threads in pool alive to maintain reasonable parallelism (#30451) 2024-02-04 22:21:16 +08:00
ccbcf879b5 [test](mtmv) Add materialized view availability regression test (#30769)
Add materialized view availability regression test

When the mv refresh_time is within the grace_period (unit: seconds), the materialized view will be used
for query rewrite regardless of whether the base table has been updated.
When the mv refresh_time is outside the grace_period, whether the base table has been updated will be
checked; if it has been updated, the materialized view will not be used for query rewrite.
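A minimal sketch of setting this property, with a hypothetical mv name; grace_period is in seconds, per the description above:
```
-- hypothetical mv name; within 60s of the last refresh the mv may serve
-- rewrites even if base tables changed; beyond it, staleness is checked first
ALTER MATERIALIZED VIEW mv1 SET ("grace_period" = "60");
```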
2024-02-04 22:21:16 +08:00
9e76592297 Support analyze materialized view. (#30540) 2024-02-04 22:21:16 +08:00
383850ef12 [Opt](multi-catalog) Opt split assignment to resolve uneven distribution. (#30390)
[Opt] (multi-catalog) Opt split assignment to resolve uneven distribution. Currently only for `FileQueryScanNode`.

Referring to the implementation of Trino:
- Local node soft affinity optimization: prefer the local replication node.
- Remote splits use a consistent-hash algorithm when the file cache is turned on; because consistent hashing may be uneven, the splits are re-adjusted so that the maximum and minimum split counts across hosts differ by at most `max_split_num_variance` splits.
- Remote splits use a round-robin algorithm when the file cache is turned off.
2024-02-04 14:28:38 +08:00
b275cb0f44 [feature](mtmv) mtmv support workload group (#29595)
MTMV supports controlling the resource usage of refresh tasks by setting the name of a workload group (a sketch follows).
About workload groups: https://doris.apache.org/zh-CN/docs/dev/admin-manual/workload-group
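A minimal sketch, assuming the property key is `workload_group`; all object names are hypothetical, and the exact clauses of the MTMV DDL may vary:
```
-- hypothetical names; binds the mv's refresh tasks to workload group g1
CREATE MATERIALIZED VIEW mv_sales
BUILD IMMEDIATE
REFRESH COMPLETE
ON SCHEDULE EVERY 1 HOUR
PROPERTIES ("workload_group" = "g1")
AS
SELECT k, count(*) FROM sales GROUP BY k;
```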
2024-02-04 14:28:38 +08:00
6442663735 [Function](exec) support atan2 math function (#30672)
Co-authored-by: Rohit Satardekar <rohitrs1983@gmail.com>
2024-02-04 14:28:38 +08:00
36b2712709 [chore](Nereids) turn on nereids dml when update to 2.1 (#30776) 2024-02-04 14:28:38 +08:00
3cc409b14f [bug](function) fix date_sub function failed when arg type is datev2 (#30443)
* [bug](function) fix date_sub function failed when arg type is datev2

* update
2024-02-04 14:28:38 +08:00
d749fc3d27 [improvement](binlog) Change BinlogConfig default TTL_SECONDS to 86400 (1day) (#30771)
* Change BinlogConfig default TTL_SECONDS to 86400 (1day)
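A minimal sketch (hypothetical table name) of setting this TTL per table through the binlog.ttl_seconds property:
```
-- hypothetical table name; 86400 seconds = 1 day, the new default
ALTER TABLE example_tbl SET ("binlog.ttl_seconds" = "86400");
```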

Signed-off-by: Jack Drogon <jack.xsuperman@gmail.com>

* Fix binlog.ttl_seconds in regression test

Signed-off-by: Jack Drogon <jack.xsuperman@gmail.com>

---------

Signed-off-by: Jack Drogon <jack.xsuperman@gmail.com>
2024-02-04 14:28:38 +08:00
5aed3abb8a [Fix](Nereids) Fix rewrite by materialized view fail when join input has agg (#30734)
The materialized view definition is as follows, and the query SQL is the same.
When the outer group by uses col1 from the inner group by, the query can be rewritten by the materialized view.

select
    t1.o_orderdate,
    t1.o_orderkey,
    t1.col1
from
    (
        select
            o_orderkey,
            o_custkey,
            o_orderstatus,
            o_orderdate,
            sum(o_shippriority) as col1
        from
            orders
        group by
            o_orderkey,
            o_custkey,
            o_orderstatus,
            o_orderdate
    ) as t1
    left join lineitem on lineitem.l_orderkey = t1.o_orderkey
group by
    t1.o_orderdate,
    t1.o_orderkey,
    t1.col1
2024-02-03 20:27:04 +08:00
4f0414d13e [fix](Nereids) date >= simplify to > by mistake (#30765) 2024-02-03 20:26:04 +08:00
d99bb51d36 [fix](legacy-planner) fixed loss of BetweenPredicate rewrite on reanalyze in legacy planner (29798) (#30328) 2024-02-03 20:26:04 +08:00
a3a73162e5 [Fix](Job)Fix One-Time type JOB parameter verification error (#30779) 2024-02-03 20:26:04 +08:00
8a0ea4b651 [enhancement](Nereids): datetime support microsecond overflow (#30744) 2024-02-03 20:26:04 +08:00
4f8730d092 [improvement](jdbc catalog) Optimize connection pool parameter settings (#30588)
This PR makes the following changes to the connection pool of the JDBC Catalog:
1. Set the maximum connection survival time; the default is 30 minutes.

-   One-half of the maximum survival time is the recycle threshold,
-   and one-tenth is the check interval for recycling connections.

2. Keepalive only takes effect on the connection pool on BE, and is triggered at one-fifth of the maximum survival time.
3. The maximum number of connections is changed from 100 to 10.
4. Add a connection cache recycling thread on BE, plus a parameter to control the recycling time; the default is 28800 (8 hours).
5. Add the CatalogID to the key of the connection pool cache for better isolation; requires a catalog refresh to take effect.
6. Upgrade the druid connection pool to version 1.2.20.
7. Add JdbcResource default parameter settings when upgrading the FE version, to avoid errors due to unset parameters.
2024-02-03 20:26:03 +08:00
ac681e8e8c [enhancement](binlog) Add show proc '/binlog' impl (#30770)
Signed-off-by: Jack Drogon <jack.xsuperman@gmail.com>
2024-02-03 20:26:03 +08:00
e413dbec91 [fix](nereids)need substitute agg function using agg node's output if it's in order by key (#30704) 2024-02-03 20:25:25 +08:00
Pxl
5687ca977d [Bug](java-udf) fix core dump when javaudf input 0 row block (#30720)
fix core dump when javaudf input 0 row block
2024-02-03 20:25:25 +08:00
Pxl
0f47f7f389 [Feature](runtime filter) normalize ignore runtime filter (#30152)
normalize ignore runtime filter
2024-02-03 20:24:39 +08:00
e5bdc369e2 [runtimefilter](nereids)push down RF into cte producer (#30568)
* push down RF into CTE producer
2024-02-03 20:24:39 +08:00
9889683ae3 [Feature](Job)STARTS and AT allow setting current_timestamp (#30593) 2024-02-03 20:24:39 +08:00
e21c9dca9c [fix](mtmv)compatibility metadata without refreshsnapshot #30735 2024-02-02 13:31:47 +08:00
94eedd8ea4 [Enhancement](function)make SUBSTRING_INDEX function DEPEND_ON_ARGUMENT (#30392) 2024-02-02 13:31:47 +08:00
9b7c6af581 [Fix](JDK17) Fixed that BE could not be started using JDK17 (#30286)
Issue Number: #30484

This is because hadoop-client-api relies on hadoop-common.
In the case of JDK17, it will still include hadoop-common.
2024-02-01 23:14:14 +08:00
318bd3f9de [Cherry-Pick][improvement](stmt) Add fuzzy matching of label in show transaction (#30725)
* Add fuzzy matching of label in show transaction

* fix
2024-02-01 23:04:06 +08:00
3315c16383 [enhance](function) refactor from_format_str and support more format (#30452) 2024-02-01 19:08:37 +08:00
fb0d712096 [fix](multi-catalog) access HMS needs ugiDoAs (#30595) 2024-02-01 19:08:37 +08:00