Commit Graph

3327 Commits

Author SHA1 Message Date
ec055e1acb [feature](new file reader) Integrate new file reader (#15175) 2022-12-26 08:55:52 +08:00
0e651365ca [profile](scanner) add per scanner running time profile (#15321)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-12-26 08:55:07 +08:00
a807978882 [refactor](non-vec) Remove rowbatch code from delta writer and some rowbatch related code (#15349)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-12-26 08:54:51 +08:00
b7768a928d [Improvement](S3) support access s3 via temporary security credentials (#15340) 2022-12-26 00:31:55 +08:00
e640f49b6d [refactor](non-vec) remove non vectorized predicate and row_block (#15348)
2022-12-25 21:45:00 +08:00
5cefd05869 [fix](multi-catalog) fix and optimize iceberg v2 reader (#15274)
Fix three bugs when reading Iceberg v2 tables:
1. The `delete position` in a `delete file` is the position of the deleted row within the entire file, but the `read range` in `RowGroupReader` is a position within the current row group. Therefore, the position of the first row of the current row group must be subtracted from the `delete position`.
2. When only the partition columns are read, `RowGroupReader` skips processing the `delete position`.
3. If the `delete position` entries delete all rows in a row group, the `read range` is empty, but the whole row group was still read in that case.

Optimize four performance issues:
1. Previously, `delete position` was converted to `delete range`, and then `delete range` and `read range` were merged into the final read ranges. This process is tedious and time-consuming; `delete position` and `read range` can be merged directly.
2. `delete position` is ordered within a `delete file`, so a merge sort can be used instead of an ordered set.
3. Initialize each `RowGroupReader` when it is read, instead of initializing all row groups when opening a `ParquetReader`, to save memory. The same applies to `IcebergReader`.
4. Change the recursive call of `_do_lazy_read` to a loop.
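
The position shift from the first bug fix and the direct merge from the first two optimizations can be illustrated with a minimal sketch. It is written in Python for brevity and is not the Doris C++ code; the function name `merge_read_ranges` and its parameters are hypothetical, and ranges are assumed to be half-open, sorted, and non-overlapping.

```python
from typing import Iterable, List, Tuple

Range = Tuple[int, int]  # half-open [start, end), relative to the row group


def merge_read_ranges(
    read_ranges: List[Range],
    delete_positions: Iterable[int],
    row_group_first_row: int,
) -> List[Range]:
    """Remove deleted rows from the read ranges in a single merge pass.

    delete_positions are sorted, file-level row positions. Each one is
    shifted by row_group_first_row before it is compared with a range.
    """
    result: List[Range] = []
    deletes = iter(delete_positions)
    pos = next(deletes, None)
    for start, end in read_ranges:
        cursor = start
        while pos is not None:
            local = pos - row_group_first_row
            if local >= end:
                break  # belongs to a later range; keep it for the next one
            if local >= cursor:
                if local > cursor:
                    result.append((cursor, local))
                cursor = local + 1
            # local < cursor: position lies before this range, skip it
            pos = next(deletes, None)
        if cursor < end:
            result.append((cursor, end))
    return result
```

If every row of a range is deleted, the function returns no range for it, which corresponds to the third bug fix: an empty result means nothing should be read.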
2022-12-24 16:02:07 +08:00
e72404c537 [fix](scan) fix that be may core dump when the predicates are all false (#15332) 2022-12-24 15:27:43 +08:00
06f71f2bca [pipeline](fix) Fix bugs to pass all regression cases (#15306)
* update

* update
2022-12-23 22:17:50 +08:00
a98636a970 [bugfix](from_unixtime) fix timezone not work for from_unixtime (#15298)
2022-12-23 19:05:09 +08:00
06d0035c02 [refactor](non-vec)remove schema change related non-vec code (#15313)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-12-23 18:33:04 +08:00
e336178ef8 [Fix](multi catalog)Fix VFileScanner file not found status bug. #15226
The `if` condition that checks the NOT FOUND status in VFileScanner was incorrect; fix it.
2022-12-23 16:45:54 +08:00
8a810cd554 [fix](bitmapfilter) fix core dump caused by bitmap filter (#15296)
Do not push down the bitmap filter to a non-integer column
2022-12-23 16:42:45 +08:00
8515a03ef9 [fix](compile) fix compile error caused by mysql_scan_node.cpp not being found when enabling WITH_MYSQL (#15277) 2022-12-23 16:25:28 +08:00
fe562bc3e7 [Bug](Agg) fix crash when encountering not supported agg function like last_value(bitmap) (#15257)
The former logic in aggregate_function_window.cpp would shut down BE when it encountered an aggregate function with a complex type such as BITMAP. This PR prevents the crash and returns a more concrete error message that tells the user the unsupported function signature.
2022-12-23 14:23:21 +08:00
b085ff49f0 [refactor](non-vec) delete non-vec data sink (#15283)
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2022-12-23 14:10:47 +08:00
38530100d8 [fix](localgc) check gc only cache directory (#15238) 2022-12-23 10:40:55 +08:00
Pxl
6b3721af23 [Bug](function) fix core dump on reverse() when big string input
2022-12-23 10:14:09 +08:00
83a99a0f8b [refactor](non-vec) Remove non vec code from be (#15278)
* [refactor](removecode) remove some non-vectorization
Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-12-22 23:28:30 +08:00
df5969ab58 [Feature] Support function roundBankers (#15154) 2022-12-22 22:53:09 +08:00
388df291af [pipeline](schedule) Add profile for except node and fix steal task problem (#15282) 2022-12-22 22:42:37 +08:00
e331e0420b [improvement](topn)add per scanner limit check for new scanner (#15231)
Optimize key top-n queries such as `SELECT * FROM store_sales ORDER BY ss_sold_date_sk, ss_sold_time_sk LIMIT 100`
(where ss_sold_date_sk, ss_sold_time_sk is a prefix of the table sort key).

Check the limit per scanner and set eof to true to reduce the amount of data that needs to be read.
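
A minimal sketch of a per-scanner limit check, assuming each scanner yields blocks of rows that are already ordered by the sort key prefix. It is written in Python for illustration only; `scan_with_limit` and its parameters are hypothetical names, not Doris APIs.

```python
from typing import Iterable, Iterator, List


def scan_with_limit(blocks: Iterable[List[tuple]], limit: int) -> Iterator[List[tuple]]:
    """Yield blocks from one scanner until `limit` rows have been returned."""
    if limit <= 0:
        return
    returned = 0
    for block in blocks:
        out = block[: limit - returned]  # trim the last block to the limit
        returned += len(out)
        yield out
        if returned >= limit:
            return  # treat as eof: do not read another block
```

Each scanner returns at most `limit` rows, so the node above it still has to merge the scanner outputs and apply the final limit.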
2022-12-22 22:39:31 +08:00
d38461616c [Pipeline](error msg) format error message (#15247) 2022-12-22 20:55:06 +08:00
77c15729d4 [fix](memory) Fix too many repeat cause OOM (#15217) 2022-12-22 17:16:18 +08:00
6fb61b5bbc [enhancement] (streamload) allow table in url when do two-phase commit (#15246) (#15248)
Make it work even if the user provides (unnecessary) table info in the URL.
For example, `curl -X PUT --location-trusted -u user:passwd -H "txn_id:18036" -H \
"txn_operation:commit" http://fe_host:http_port/api/{db}/{table}/_stream_load_2pc`
still works.

Signed-off-by: freemandealer <freeman.zhang1992@gmail.com>
2022-12-22 17:00:51 +08:00
754fceafaf [feature-wip](statistics) add aggregate function histogram and collect histogram statistics (#14910)
**Histogram statistics**

Currently Doris collects statistics but no histogram data, and by default the optimizer assumes that the distinct values of a column are evenly distributed. This assumption can be problematic when the data distribution is skewed, so this PR implements the collection of histogram statistics.

For skewed columns (columns whose data is unevenly distributed), histogram statistics enable the optimizer to generate more accurate cardinality estimates for filter or join predicates involving these columns, resulting in a more precise execution plan.

Histograms improve the execution plan mainly in two ways: the selection of where conditions and the selection of join order. The selection principle for where conditions is relatively simple: the histogram is used to calculate the selection rate of each predicate, and the filter with the higher selection rate is preferred.

The selection of join order is based on an estimate of the number of rows in the join result. When the data in the join condition columns is unevenly distributed, a histogram can greatly improve the accuracy of that estimate. In addition, if the number of rows in a bucket of one of the columns is 0, the bucket can be marked and skipped directly in the subsequent join process to improve efficiency.

---

Histogram statistics are mainly collected by the histogram aggregation function, which is used as follows:

**Syntax**

```SQL
histogram(expr)
```

> The histogram function is used to describe the distribution of the data. It uses an "equal height" bucketing strategy and divides the data into buckets according to the value of the data. It describes each bucket with some simple data, such as the number of values that fall in the bucket. It is mainly used by the optimizer to estimate range queries.
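
A minimal sketch of an equal-height bucketing strategy, written in Python for illustration and not taken from the Doris implementation. `build_histogram` and `max_buckets` are hypothetical names. The output fields mirror the ones shown in this commit message; treating `bucket_size` as the requested number of buckets, and keeping equal values in the same bucket, are assumptions.

```python
import math
from itertools import groupby
from typing import Dict, List


def build_histogram(values: List, max_buckets: int) -> Dict:
    """Split sorted values into buckets that hold roughly equal row counts."""
    ordered = sorted(values)
    if not ordered:
        return {"bucket_size": max_buckets, "buckets": []}
    target = math.ceil(len(ordered) / max_buckets)  # rows per bucket
    buckets: List[Dict] = []
    pre_sum = 0  # rows in all buckets before the current one
    current = None
    for value, group in groupby(ordered):
        n = len(list(group))
        if current is None or current["count"] >= target:
            if current is not None:
                pre_sum += current["count"]
            current = {"lower": value, "upper": value,
                       "count": 0, "pre_sum": pre_sum, "ndv": 0}
            buckets.append(current)
        current["upper"] = value
        current["count"] += n
        current["ndv"] += 1  # one more distinct value in this bucket
    return {"bucket_size": max_buckets, "buckets": buckets}
```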

**Example**

```
MySQL [test]> select histogram(login_time) from dev_table;
+------------------------------------------------------------------------------------------------------------------------------+
| histogram(`login_time`)                                                                                                      |
+------------------------------------------------------------------------------------------------------------------------------+
| {"bucket_size":5,"buckets":[{"lower":"2022-09-21 17:30:29","upper":"2022-09-21 22:30:29","count":9,"pre_sum":0,"ndv":1},...]}|
+------------------------------------------------------------------------------------------------------------------------------+
```
**Description**

```JSON
{
    "bucket_size": 5, 
    "buckets": [
        {
            "lower": "2022-09-21 17:30:29", 
            "upper": "2022-09-21 22:30:29", 
            "count": 9, 
            "pre_sum": 0, 
            "ndv": 1
        }, 
        {
            "lower": "2022-09-22 17:30:29", 
            "upper": "2022-09-22 22:30:29", 
            "count": 10, 
            "pre_sum": 9, 
            "ndv": 1
        }, 
        {
            "lower": "2022-09-23 17:30:29", 
            "upper": "2022-09-23 22:30:29", 
            "count": 9, 
            "pre_sum": 19, 
            "ndv": 1
        }, 
        {
            "lower": "2022-09-24 17:30:29", 
            "upper": "2022-09-24 22:30:29", 
            "count": 9, 
            "pre_sum": 28, 
            "ndv": 1
        }, 
        {
            "lower": "2022-09-25 17:30:29", 
            "upper": "2022-09-25 22:30:29", 
            "count": 9, 
            "pre_sum": 37, 
            "ndv": 1
        }
    ]
}
```
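
As a hedged illustration of how such a histogram could support a range estimate, the sketch below counts the rows in buckets that lie entirely at or below a value, using `pre_sum` and `count`. It is Python written for this example; `rows_at_or_below` is a hypothetical helper, it ignores any partially covered bucket (so it gives a lower bound), and it relies on the timestamp strings comparing correctly as text.

```python
from bisect import bisect_right
from typing import Dict


def rows_at_or_below(hist: Dict, value: str) -> int:
    """Rows in buckets whose upper bound is <= value (a lower-bound estimate)."""
    buckets = hist["buckets"]
    uppers = [b["upper"] for b in buckets]
    i = bisect_right(uppers, value)  # number of buckets fully covered
    if i == 0:
        return 0
    last = buckets[i - 1]
    return last["pre_sum"] + last["count"]


# The buckets from the description above, reduced to the fields used here.
hist = {
    "bucket_size": 5,
    "buckets": [
        {"upper": "2022-09-21 22:30:29", "count": 9, "pre_sum": 0},
        {"upper": "2022-09-22 22:30:29", "count": 10, "pre_sum": 9},
        {"upper": "2022-09-23 22:30:29", "count": 9, "pre_sum": 19},
        {"upper": "2022-09-24 22:30:29", "count": 9, "pre_sum": 28},
        {"upper": "2022-09-25 22:30:29", "count": 9, "pre_sum": 37},
    ],
}
total_rows = hist["buckets"][-1]["pre_sum"] + hist["buckets"][-1]["count"]
covered = rows_at_or_below(hist, "2022-09-23 00:00:00")
```

With the sample data, `covered` is 19 out of `total_rows` 46.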

TODO:
- support parameters and sampled statistics in the histogram function (covered by another PR)
- use histogram statistics
- add p0 regression
2022-12-22 16:42:17 +08:00
e9a201e0ec [refactor](non-vec) delete some non-vec exec node (#15239)
2022-12-22 14:05:51 +08:00
1cc79510c9 [enhancement](compaction) add delete_sign_index check before filter delete (#15190) 2022-12-22 09:26:37 +08:00
8ecf69b09b [pipeline](regression) nested loop join test get error result in pipeline engine and refactor the code for need more input data (#15208) 2022-12-21 19:03:51 +08:00
af54299b26 [Pipeline](projection) Support projection on pipeline engine (#15220) 2022-12-21 15:47:29 +08:00
a447121fc3 [fix](scanner scheduler) fix coredump of ScannerScheduler::_scanner_scan (#15199)
* fix
2022-12-21 15:44:47 +08:00
2445ac9520 [Bug](runtimefilter) Fix BE crash due to init failure (#15228) 2022-12-21 15:36:22 +08:00
5aefb793f9 [Bugfix](round) fix round function may coredump (#15203)
2022-12-21 14:36:10 +08:00
efdc73777a [enhancement](load) verify the number of rows between different replicas when load data to avoid data inconsistency (#15101)
It is very difficult to investigate the data inconsistency of multiple replicas.
When loading data, the number of rows between replicas is checked to avoid some data inconsistency problems.
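
A minimal sketch of such a check, assuming the row count written by each replica of a tablet is available after the load. It is Python for illustration only; `check_replica_row_counts` and the replica ids are hypothetical.

```python
from typing import Dict


def check_replica_row_counts(rows_per_replica: Dict[int, int]) -> None:
    """Raise if the replicas of one tablet report different row counts."""
    if len(set(rows_per_replica.values())) > 1:
        raise ValueError(f"replica row counts differ: {rows_per_replica}")
```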
2022-12-21 09:50:13 +08:00
732417258c [Bug](pipeline) Fix bugs to pass TPCDS cases (#15194) 2022-12-20 22:29:55 +08:00
2501198800 [Bug](compile) Fix compiling error (#15207) 2022-12-20 20:05:49 +08:00
821c12a456 [chore](BE) remove all useless segment group related code #15193
The segment group is no longer used in the current codebase, so remove all related code from Doris. For the related protobuf code, mark the fields as reserved to prevent any future use of them.
2022-12-20 17:11:47 +08:00
5cf21fa7d1 [feature](planner) mark join to support subquery in disjunction (#14579)
Co-authored-by: Gabriel <gabrielleebuaa@gmail.com>
2022-12-20 15:22:43 +08:00
9d48154cdc [minor](non-vec) delete unused interface in RowBatch (#15186) 2022-12-20 13:06:34 +08:00
a2d56af7d9 [profile](datasender) add more detail profile in data stream sender (#15176)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-12-20 12:07:34 +08:00
938f4f33d6 [Pipeline] Add MLFQ when schedule (#15124) 2022-12-20 11:49:15 +08:00
0c2911efb1 [enhancement](gc) sub_file_cache checks the directory files when gc (#15114)
* update
2022-12-20 10:50:11 +08:00
98cdeed6e0 [chore](routine load) remove deprecated property of librdkafka reconnect.backoff.jitter.ms #15172 2022-12-20 10:13:56 +08:00
40141a9c9c [opt](vectorized) opt the null map _has_null logic (#15181)
2022-12-20 10:01:54 +08:00
494eb895d3 [vectorized](pipeline) support union node operator (#15031) 2022-12-19 22:01:56 +08:00
7c67fa8651 [Bug](pipeline) fix bug of right anti join error result in pipeline (#15165) 2022-12-19 19:28:44 +08:00
0732f31e5d [Bug](pipeline) Fix bugs for scan node and join node (#15164)
* update
2022-12-19 15:59:29 +08:00
445ec9d02c [fix](counter) fix coredump caused by updating destroyed counter (#15160) 2022-12-19 14:35:03 +08:00
1597afcd67 [fix](mutil-catalog) fix get many same name db/table when show where (#15076)
When executing show databases/tables/table status where xxx, the statement is rewritten into a selectStmt that selects the result from
information_schema. It needs catalog info to scan the schema table; otherwise it may return duplicate
database or table entries from multiple catalogs.

For example:
mysql> show databases where schema_name='test';
+----------+
| Database |
+----------+
| test     |
| test     |
+----------+

MySQL [internal.test]> show tables from test where table_name='test_dc';
+----------------+
| Tables_in_test |
+----------------+
| test_dc        |
| test_dc        |
+----------------+
2022-12-19 14:27:48 +08:00
7730a88d11 [fix](multi-catalog) add support for orc binary type (#15141)
Fix three bugs:
1. DataTypeFactory::create_data_type is missing the conversion of the binary type, so OrcReader fails.
2. ScalarType#createType is missing the conversion of the binary type, so ExternalFileTableValuedFunction fails.
3. fmt::format can't generate the right format string and fails.
2022-12-19 14:24:12 +08:00
03ea2866b7 [fix](load) add to error tablets when delta writer failed to close (#15118)
The load should fail when the delta writers of all tablets fail to close on a single node,
but the result returned to the client is success.
The reason is that the committed tablets and error tablets are both empty, so publish succeeds.
Add the tablet to the error tablets when its delta writer fails to close, so that the transaction fails.
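
The bookkeeping described above can be sketched as follows. This is a simplified Python illustration, not the Doris code: `close_delta_writers` is a hypothetical helper, and each writer is modeled as a callable that raises when closing fails.

```python
from typing import Callable, Dict, List, Tuple


def close_delta_writers(
    writers: Dict[int, Callable[[], None]],
) -> Tuple[List[int], List[int]]:
    """Close every tablet's writer and record which tablets failed."""
    committed: List[int] = []
    error_tablets: List[int] = []
    for tablet_id, close in writers.items():
        try:
            close()
        except Exception:
            # Record the failure so the caller no longer sees two empty lists.
            error_tablets.append(tablet_id)
        else:
            committed.append(tablet_id)
    return committed, error_tablets
```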
2022-12-19 14:22:25 +08:00