Commit Graph

7797 Commits

Author SHA1 Message Date
19cc65cc24 [fix](Nereids): fix bug of converting to NLJ. (#15290) 2022-12-23 19:33:45 +08:00
ede68e075d [fix](iceberg-v2) fix fe iceberg split, add regression case (#15299) 2022-12-23 19:33:00 +08:00
a98636a970 [bugfix](from_unixtime) fix timezone not work for from_unixtime (#15298)
* [bugfix](from_unixtime) fix timezone not work for from_unixtime
2022-12-23 19:05:09 +08:00
bfaaa2bd7c [feature](Nereids) support digital_masking function (#15252) 2022-12-23 18:59:08 +08:00
2f089be37e [feature](nereids) support bitAnd/ bitOr/ bitXor (#15261) 2022-12-23 18:39:39 +08:00
06d0035c02 [refactor](non-vec)remove schema change related non-vec code (#15313)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-12-23 18:33:04 +08:00
27d64964e6 [enhancement](Nereids) cast expression to the type with parameters (#14657) 2022-12-23 18:29:50 +08:00
ef3da105c9 [DOCS](refactor) refine en docs (#15244)
* Update basic-summary.md

* Update README.md
2022-12-23 16:47:51 +08:00
00fd5b1b1c [typo](doc) update Paxos spell mistake (#15171) 2022-12-23 16:47:12 +08:00
e7a077a81f [fix](jdbc catalog) fix bugs of jdbc catalog and table valued function (#15216)
* fix bugs

* add `desc function` test

* add test

* fix
2022-12-23 16:46:39 +08:00
e336178ef8 [Fix](multi catalog)Fix VFileScanner file not found status bug. #15226
The if condition to check NOT FOUND status for VFileScanner is incorrect, fix it.
2022-12-23 16:45:54 +08:00
b935fd0e7d [fix](fe)fix bug of the bucket shuffle join is not recognized (#15255)
* [fix](fe)fix bug of the bucket shuffle join is not recognized

* use broadcast join for empty table
2022-12-23 16:44:44 +08:00
1926239f09 [improvement](test) add --conf option for run-regression-test.sh for custom config file (#15287)
* add --conf option for run-regression-test.sh for custom config file

* fix shell check error
2022-12-23 16:43:18 +08:00
8a810cd554 [fix](bitmapfilter) fix core dump caused by bitmap filter (#15296)
Do not push down the bitmap filter to a non-integer column
2022-12-23 16:42:45 +08:00
8515a03ef9 [fix](compile) fix compile error caused by mysql_scan_node.cpp not being found when enabling WITH_MYSQL (#15277) 2022-12-23 16:25:28 +08:00
764b1db097 [fix](s3 outfile) Add the use_path_style parameter for s3 outfile (#15288)
Currently, `outfile` does not support the `use_path_style` parameter and uses `virtual-hosted style` by default;
however, some object storage services only support `path-style` access.

This PR adds the `use_path_style` parameter for s3 outfile, so that different object storage services can use different access modes.
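For illustration, the two addressing modes differ only in where the bucket name appears in the URL; a minimal sketch (hypothetical helper, not Doris code):

```python
# Illustration of the two S3 addressing modes (hypothetical helper, not Doris code).
def s3_object_url(endpoint: str, bucket: str, key: str, use_path_style: bool = False) -> str:
    """Build an S3 object URL in either addressing mode."""
    if use_path_style:
        # Path-style: the bucket appears in the URL path.
        return f"https://{endpoint}/{bucket}/{key}"
    # Virtual-hosted style: the bucket appears in the host name.
    return f"https://{bucket}.{endpoint}/{key}"
```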
2022-12-23 16:22:06 +08:00
4b7f279cf9 [Enhancement](Nereids) change expression to conjuncts in filter (#14807) 2022-12-23 15:31:40 +08:00
fe562bc3e7 [Bug](Agg) fix crash when encountering not supported agg function like last_value(bitmap) (#15257)
The former logic in aggregate_function_window.cpp would shut down the BE once it encountered an aggregate function with a complex type such as BITMAP. This PR prevents the crash and instead returns a concrete error message that tells the user the unsupported function signature.
2022-12-23 14:23:21 +08:00
cb295de981 [Bug](decimalv3) Fix wrong precision of DECIMALV3 (#15302)
* [Bug](decimalv3) Fix wrong precision of DECIMALV3

* update
2022-12-23 14:11:08 +08:00
b085ff49f0 [refactor](non-vec) delete non-vec data sink (#15283)
* [refactor](non-vec) delete non-vec data sink

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-12-23 14:10:47 +08:00
38530100d8 [fix](localgc) check gc only cache directory (#15238) 2022-12-23 10:40:55 +08:00
82fbfab77f [fix](union)the union node should not pass through children in some case (#15286)
The union node made children pass through under the wrong condition: if a child's materialized slots differ from the union node's, that child cannot be passed through.
2022-12-23 10:27:49 +08:00
020c47f528 [load](config) update max timeout (#15280) 2022-12-23 10:15:26 +08:00
Pxl
6b3721af23 [Bug](function) fix core dump on reverse() when big string input
Fix a core dump in reverse() when the input string is large.
2022-12-23 10:14:09 +08:00
09a22813e4 [feature](Nereids) support syntax SELECT DISTINCT (#15197)
Add a new rule 'ProjectWithDistinctToAggregate' to support "select distinct xx from table".
This rule checks the LogicalProject node's isDistinct property and replaces the LogicalProject node with a LogicalAggregate node.
So any rule that runs before it and creates a new LogicalProject node must make sure the isDistinct property is passed along correctly;
see rule BindSlotReference or BindFunction for an example.
2022-12-22 23:54:08 +08:00
83a99a0f8b [refactor](non-vec) Remove non vec code from be (#15278)
* [refactor](removecode) remove some non-vectorization
Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-12-22 23:28:30 +08:00
67647f0cf6 [fix](Nereids): fix bug of converting to NLJ. (#15268) 2022-12-22 23:05:39 +08:00
df5969ab58 [Feature] Support function roundBankers (#15154) 2022-12-22 22:53:09 +08:00
388df291af [pipeline](schedule) Add profile for except node and fix steal task problem (#15282) 2022-12-22 22:42:37 +08:00
e331e0420b [improvement](topn)add per scanner limit check for new scanner (#15231)
Optimize for key topn query like `SELECT * FROM store_sales ORDER BY ss_sold_date_sk, ss_sold_time_sk LIMIT 100` 
(ss_sold_date_sk, ss_sold_time_sk is a prefix of the table sort key).

Check the per-scanner limit and set eof to true to reduce the amount of data that needs to be read.
2022-12-22 22:39:31 +08:00
d38461616c [Pipeline](error msg) format error message (#15247) 2022-12-22 20:55:06 +08:00
1fdd4172bd [fix](Inbitmap) fix in bitmap result error when left expr is constant (#15271)
* [fix](Inbitmap) fix in bitmap result error when left expr is constant

1. When the left expr of the in predicate is a constant, instead of generating a bitmap filter, rewrite the SQL to use `bitmap_contains`.
  For example,"select k1, k2 from (select 2 k1, 11 k2) t where k1 in (select bitmap_col from bitmap_tbl)"
  => "select k1, k2 from (select 2 k1, 11 k2) t left semi join bitmap_tbl b on bitmap_contains(b.bitmap_col, t.k1)"

* add regression test
2022-12-22 19:25:09 +08:00
77c15729d4 [fix](memory) Fix too many repeat cause OOM (#15217) 2022-12-22 17:16:18 +08:00
6fb61b5bbc [enhancement] (streamload) allow table in url when do two-phase commit (#15246) (#15248)
Make it work even if the user provides (unnecessary) table info in the URL.
i.e. `curl -X PUT --location-trusted -u user:passwd -H "txn_id:18036" -H \
"txn_operation:commit" http://fe_host:http_port/api/{db}/{table}/_stream_load_2pc`
still works!

Signed-off-by: freemandealer <freeman.zhang1992@gmail.com>
2022-12-22 17:00:51 +08:00
754fceafaf [feature-wip](statistics) add aggregate function histogram and collect histogram statistics (#14910)
**Histogram statistics**

Currently Doris collects statistics but no histogram data, and by default the optimizer assumes that the distinct values of a column are evenly distributed. This assumption can be problematic when the data distribution is skewed, so this PR implements the collection of histogram statistics.

For skewed columns (columns whose data is unevenly distributed), histogram statistics enable the optimizer to produce more accurate cardinality estimates for filter or join predicates involving those columns, resulting in a more precise execution plan.

The histogram improves the execution plan mainly in two aspects: the selection of where conditions and the selection of join order. The where-condition case is relatively simple: the histogram is used to calculate the selectivity of each predicate, and the more selective filter is preferred.

The selection of join order is based on the estimated number of rows in the join result. When data in the join condition columns is unevenly distributed, a histogram can greatly improve the accuracy of that row-count estimate. In addition, if the row count of a bucket in one of the columns is 0, it can be marked so that the bucket is skipped entirely in the subsequent join, improving efficiency.
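As an illustration of the where-condition case, the selectivity of a range predicate can be estimated from equal-height buckets roughly as follows (a hypothetical sketch over numeric buckets, not the Doris implementation):

```python
# Hypothetical sketch of range-predicate selectivity estimation from an
# equal-height histogram over numeric buckets (not the Doris implementation).
def estimate_le_selectivity(histogram: dict, value: float) -> float:
    """Estimate the selectivity of `col <= value` from bucket boundaries."""
    total = sum(b["count"] for b in histogram["buckets"])
    covered = 0.0
    for b in histogram["buckets"]:
        if value >= b["upper"]:
            covered += b["count"]  # the whole bucket qualifies
        elif value >= b["lower"]:
            # assume a uniform distribution inside the bucket
            frac = (value - b["lower"]) / max(1, b["upper"] - b["lower"])
            covered += b["count"] * frac
    return covered / total
```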

---

Histogram statistics are mainly collected by the histogram aggregation function, which is used as follows:

**Syntax**

```SQL
histogram(expr)
```

> The histogram function is used to describe the distribution of the data. It uses an "equal height" bucketing strategy and divides the data into buckets according to the values. Each bucket is described with simple statistics, such as the number of values that fall into it. It is mainly used by the optimizer to estimate selectivity for range queries.

**Example**

```
MySQL [test]> select histogram(login_time) from dev_table;
+------------------------------------------------------------------------------------------------------------------------------+
| histogram(`login_time`)                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------+
| {"bucket_size":5,"buckets":[{"lower":"2022-09-21 17:30:29","upper":"2022-09-21 22:30:29","count":9,"pre_sum":0,"ndv":1},...]}|
+------------------------------------------------------------------------------------------------------------------------------+
```
**Description**

```JSON
{
    "bucket_size": 5, 
    "buckets": [
        {
            "lower": "2022-09-21 17:30:29", 
            "upper": "2022-09-21 22:30:29", 
            "count": 9, 
            "pre_sum": 0, 
            "ndv": 1
        }, 
        {
            "lower": "2022-09-22 17:30:29", 
            "upper": "2022-09-22 22:30:29", 
            "count": 10, 
            "pre_sum": 9, 
            "ndv": 1
        }, 
        {
            "lower": "2022-09-23 17:30:29", 
            "upper": "2022-09-23 22:30:29", 
            "count": 9, 
            "pre_sum": 19, 
            "ndv": 1
        }, 
        {
            "lower": "2022-09-24 17:30:29", 
            "upper": "2022-09-24 22:30:29", 
            "count": 9, 
            "pre_sum": 28, 
            "ndv": 1
        }, 
        {
            "lower": "2022-09-25 17:30:29", 
            "upper": "2022-09-25 22:30:29", 
            "count": 9, 
            "pre_sum": 37, 
            "ndv": 1
        }
    ]
}
```

TODO:
- histogram function supports parameters and sampled statistics (in another PR)
- use histogram statistics
- add p0 regression
2022-12-22 16:42:17 +08:00
d0a4a8e047 [Feature](Nereids) Push limit through union all. (#15272)
This PR pushes the limit through the union all into the child plans.
2022-12-22 14:46:47 +08:00
f8b368a85e [Feature](Nereids) Support bitmap for materialized index. (#14863)
This PR adds the rewriting and matching logic for the bitmap_union column in materialized index.

If a materialized index has a bitmap_union column, we try to rewrite count distinct or bitmap_union_count to use the bitmap_union column in the materialized index.
2022-12-22 14:40:25 +08:00
0fa4c78e84 [Improvement](external table) support hive external table which stores data on tencent chdfs (#15125) 2022-12-22 14:32:55 +08:00
a87f905a2d [Feature](Nereids) unnest subquery in 'not in' predicate into NULL AWARE ANTI JOIN (#15230)
When we process a 'not in' subquery, if the column returned by the subquery is nullable, we need a NULL AWARE ANTI JOIN instead of an ANTI JOIN.
Doris has supported NULL AWARE ANTI JOIN since PR #13871.
Nereids needs to do the same.
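The semantic difference comes from SQL's three-valued logic: if the subquery produces a NULL, `not in` must reject every row, which a plain anti join would not do. A minimal sketch of the null-aware semantics (illustrative only, not planner code):

```python
# Sketch of why 'not in' needs null-aware semantics under SQL three-valued
# logic (illustrative only, not Doris planner code).
def null_aware_anti_join(left_keys: list, right_values: list) -> list:
    """Return left keys satisfying `key NOT IN right_values`, SQL-style."""
    has_null = any(v is None for v in right_values)
    out = []
    for k in left_keys:
        if k is None or has_null:
            continue  # comparison result is UNKNOWN -> row is filtered out
        if k not in right_values:
            out.append(k)
    return out
```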
2022-12-22 14:13:47 +08:00
87756f5441 [regression](query) query with limit 0 regression test (#15245) 2022-12-22 14:06:44 +08:00
e9a201e0ec [refactor](non-vec) delete some non-vec exec node (#15239)
* [refactor](non-vec) delete some non-vec exec node
2022-12-22 14:05:51 +08:00
1520a4af6d [refactor](resource) use resource to create external catalog (#14978)
Use resource to create external catalog.
-- HMS
mysql> create resource hms_resource properties(
    -> "type"="hms",
    -> 'hive.metastore.uris' = 'thrift://172.21.0.44:7004',
    -> 'dfs.nameservices'='HANN',
    -> 'dfs.ha.namenodes.HANN'='nn1,nn2',
    -> 'dfs.namenode.rpc-address.HANN.nn1'='172.21.0.32:4007',
    -> 'dfs.namenode.rpc-address.HANN.nn2'='172.21.0.44:4007',
    -> 'dfs.client.failover.proxy.provider.HANN'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
    -> );

-- MYSQL
mysql> create resource mysql_resource properties (
    -> "type"="jdbc",
    -> "user"="root",
    -> "password"="123456",
    -> "jdbc_url" = "jdbc:mysql://127.0.0.1:3316/doris_test?useSSL=false",
    -> "driver_url" = "https://doris-community-test-1308700295.cos.ap-hongkong.myqcloud.com/jdbc_driver/mysql-connector-java-8.0.25.jar",
    -> "driver_class" = "com.mysql.cj.jdbc.Driver");

-- ES
mysql> create resource es_resource properties (
    -> "type"="es",
    -> "hosts"="http://127.0.0.1:29200",
    -> "nodes_discovery"="false",
    -> "enable_keyword_sniff"="true");
2022-12-22 13:45:55 +08:00
2bb4ea5dea [regression-test](icebergv2) add icebergv2 test case (#15187) 2022-12-22 13:45:07 +08:00
c9f26183b0 [feature-wip](MTMV) Support importing data to materialized view with multiple tables (#14944)
## Use Case

```SQL
create table t_user(
     event_day DATE,
     id bigint,
     username varchar(20)
)
DISTRIBUTED BY HASH(id) BUCKETS 10 
PROPERTIES (
   "replication_num" = "1"
 );
insert into  t_user values("2022-10-26",1,"clz");
insert into  t_user values("2022-10-28",2,"zhangsang");
insert into  t_user values("2022-10-29",3,"lisi");
create table t_user_pv(
    event_day DATE,
    id bigint,
    pv bigint
)
DISTRIBUTED BY HASH(id) BUCKETS 10 
PROPERTIES (
   "replication_num" = "1"
 );
insert into  t_user_pv  values("2022-10-26",1,200);
insert into  t_user_pv  values("2022-10-28",2,200);
insert into  t_user_pv  values("2022-10-28",3,300);

DROP MATERIALIZED VIEW  if exists multi_mv;
CREATE MATERIALIZED VIEW  multi_mv
BUILD IMMEDIATE 
REFRESH COMPLETE 
start with "2022-10-27 19:35:00"
next  60 second
KEY(username)   
DISTRIBUTED BY HASH (username)  buckets 1
PROPERTIES ('replication_num' = '1') 
AS 
select t_user.username, t_user_pv.pv  from t_user, t_user_pv where t_user.id=t_user_pv.id;
```
2022-12-22 11:46:41 +08:00
c81a3bfe1b [docs](compile)Add Windows compilation documentation (#15253)
Add Windows compilation documentation
2022-12-22 10:16:58 +08:00
fdcabf16b1 [fix](multi-catalog) fix show data on external catalog (#15227)
If we switch to an external catalog and use a database that has the same name as a database in the internal catalog,
the query 'show data' will get data info from the internal catalog.
2022-12-22 09:43:15 +08:00
7d49ddf50c [bugfix](thirdparty) patch simdjson to avoid conflict with odbc macro BOOL (#15223)
Fix the conflicting name BOOL between odbc sqltypes.h and simdjson element.h by changing BOOL to BOOLEAN in simdjson.

- thirdparty/installed/include/sqltypes.h

> #define	BOOL				int


- thirdparty/src/simdjson-1.0.2/include/simdjson/dom/element.h

> enum class element_type {
>   ARRAY = '[',     ///< dom::array
>   OBJECT = '{',    ///< dom::object
>   INT64 = 'l',     ///< int64_t
>   UINT64 = 'u',    ///< uint64_t: any integer that fits in uint64_t but *not* int64_t
>   DOUBLE = 'd',    ///< double: Any number with a "." or "e" that fits in double.
>   STRING = '"',    ///< std::string_view
>   BOOL = 't',      ///< bool
>   NULL_VALUE = 'n' ///< null
> };
>
2022-12-22 09:40:04 +08:00
b4f5b7a4c9 [fix](load) fix load failure caused by incorrect file format (#15222)
Issue Number: close #15221
2022-12-22 09:38:37 +08:00
cc995c4307 [fix](load) fix new_load_scan_node load finished but no data actually caused by wrong file size (#15211) 2022-12-22 09:28:00 +08:00
1cc79510c9 [enhancement](compaction) add delete_sign_index check before filter delete (#15190) 2022-12-22 09:26:37 +08:00