Commit Graph

6608 Commits

Author SHA1 Message Date
7c2da89518 [docs](spark-load) set hadoop env (#12342)
(spark-load) set hadoop env
2022-09-06 16:41:38 +08:00
4e95b3afaf [test](nereids) add subquery regression Testing (#12372)
Adds regression tests for subqueries. Currently only correlated subqueries are covered; uncorrelated subqueries will be added after the project revision.
2022-09-06 16:37:17 +08:00
f1507f93ee [enhancement](chore)add single empty line rule to fe check style for Nereids (#12365) 2022-09-06 14:19:59 +08:00
4a55b504c0 [feature-wip](parquet-reader) bug fix, get the correct group reader (#12294)
Fix a failure to read the TPC-H lineitem table and a memory allocation error.
Co-authored-by: jinzhe <jinzhe@selectdb.com>
2022-09-06 13:59:35 +08:00
d7dedfadad [fix](nereids) fix dead loop in unnesting subquery rule (#12345)
2022-09-06 11:50:30 +08:00
cf5d194fe1 [enhancement](array-type) Split Array Offsets and String Offsets (#12341)
In older Doris versions, string offsets are 32-bit, which is not enough for the Array type.
If we change string offsets from 32-bit to 64-bit, there will be a problem when BEs are upgraded one by one, because strings with 32-bit offsets and strings with 64-bit offsets will exist at the same time.
As a result, we separate the code for Array offsets from the code for String offsets.
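
A minimal sketch of why 32-bit offsets fall short, assuming offsets are cumulative end positions stored as unsigned integers. The helper names and the workload are illustrative assumptions, not Doris code:

```python
import struct

UINT32_MAX = 2**32 - 1


def cumulative_offsets(lengths):
    """Build cumulative end offsets from per-row element counts."""
    offsets, total = [], 0
    for n in lengths:
        total += n
        offsets.append(total)
    return offsets


def fits_in_uint32(offset):
    """Return True if the offset can be stored in an unsigned 32-bit field."""
    return 0 <= offset <= UINT32_MAX


# Hypothetical workload: three rows holding 2**31 elements each.
offsets = cumulative_offsets([2**31, 2**31, 2**31])
overflowing = [o for o in offsets if not fits_in_uint32(o)]

# Every offset fits in an unsigned 64-bit field ("Q" is 8 bytes).
packed64 = struct.pack("<3Q", *offsets)
```

Only the first offset fits in 32 bits here; the second and third need a wider field, which is the case the commit describes for arrays.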
Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>
2022-09-06 11:18:27 +08:00
53b79d5a8c [Enhancement](restore) new add the property of reserve_replica to restore statement (#11942)
Add a new property called 'reserve_replica', which lets you restore
a table whose partitions keep the same replication num
as before the backup.

Co-authored-by: Stalary <stalary@163.com>
Co-authored-by: camby <104178625@qq.com>
2022-09-06 10:32:21 +08:00
2019cf9406 [regression](test) add tpcds sf1 unique test (#12268) 2022-09-06 10:12:00 +08:00
86fa0e38e2 [fix](join) hash join should use children's output tuple ids not output tableref ids (#12261) 2022-09-06 09:53:45 +08:00
f2aa87d797 Add ctas support config key type ut and doc. (#12327) 2022-09-06 09:16:02 +08:00
190717dbcc [enhancement](chore)add single space separator rule to fe check style (#12354)
Sometimes our code uses more than one space as a separator by mistake. This PR adds the CheckStyle rule SingleSpaceSeparator to check for that in Nereids.
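
A rough sketch of the kind of check this rule performs, written as a standalone regex rather than as the CheckStyle implementation. It is a simplification and an assumption about the rule's intent: it ignores leading indentation and does not special-case string literals or comments:

```python
import re

# Two or more spaces between two non-whitespace characters.
# The lookbehind keeps leading indentation from being flagged.
_MULTI_SPACE = re.compile(r"(?<=\S) {2,}(?=\S)")


def find_extra_spaces(line):
    """Return the 0-based column where each run of extra spaces starts."""
    return [m.start() for m in _MULTI_SPACE.finditer(line)]
```

For example, `find_extra_spaces("    int  x = 1;")` reports the run between `int` and `x`, while the same line with a single space reports nothing.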
2022-09-05 21:59:58 +08:00
b8e38b9167 [Bug](load) block call clear_column_data may have ref not equal 1 (#12350) 2022-09-05 20:40:40 +08:00
0deee72a63 About the modification of broker load specifying hdfs user name parameter (#12330)
2022-09-05 19:34:26 +08:00
a47eb55d7c [regression](load)split dataset to cover more situation (#12311) 2022-09-05 19:25:01 +08:00
e175a7ed63 [fix](memtracker) Fix the exceeded limit of the first query execution (#12332)
In some cases, when a user executes a query for the first time, a mem limit exceeded error is reported, and the query succeeds only on the second execution.

This is because, when the query is executed for the first time, the memory consumed by filling the page cache and other caches is recorded in the query mem tracker, which was intended to unify the behavior of multiple queries.

As a temporary solution, remove the scanner thread hook. Tested with ClickBench q13:

Before removing the scanner thread hook:
Page cache enabled: 3G for the first query, 3G for the tracker; 900M for the second query, 900M for the tracker.
Page cache disabled: 1.9G for the first query, 1.9G for the tracker; 900M for the second query, 900M for the tracker.
After removing the scanner thread hook and fixing the MemTrackerLimiter::cache_consume_local bug:
Page cache enabled: 2916M for the first query, 1147M for the tracker; 979M for the second query, 1144M for the tracker.
Page cache disabled: 1809M for the first query, 1147M for the tracker; 975M for the second query, 1145M for the tracker.
TODO: a better solution is to track storage-related memory separately in the scanner thread. Otherwise, it is impossible to know where process memory grows during a query.
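
The failure mode can be illustrated with a toy model. Everything here (`ToyTracker`, `run_query`, the page sizes and the limit) is a hypothetical stand-in and not the Doris MemTracker API; it only shows how charging cache fills to the query tracker makes the first run fail and the second succeed:

```python
class MemLimitExceeded(Exception):
    """Raised when a tracker would go over its limit."""


class ToyTracker:
    """Counts consumed bytes and enforces an optional limit."""

    def __init__(self, limit=None):
        self.limit = limit
        self.consumed = 0

    def consume(self, nbytes):
        if self.limit is not None and self.consumed + nbytes > self.limit:
            raise MemLimitExceeded(f"{self.consumed + nbytes} > {self.limit}")
        self.consumed += nbytes


def run_query(pages, cache, query_tracker, cache_tracker, work_bytes):
    """Scan pages, filling the cache on a miss, then do the query's own work."""
    for page, size in pages:
        if page not in cache:
            cache[page] = size
            cache_tracker.consume(size)  # who is charged for the cache fill?
    query_tracker.consume(work_bytes)


def succeeds(pages, cache, query_tracker, cache_tracker, work_bytes):
    try:
        run_query(pages, cache, query_tracker, cache_tracker, work_bytes)
        return True
    except MemLimitExceeded:
        return False


PAGES = [("p1", 1000), ("p2", 1000)]
LIMIT = 1500

# Cache fills charged to the query tracker: the first run exceeds the limit,
# the second run hits the warm cache and stays under it.
shared_cache = {}
first = ToyTracker(LIMIT)
first_ok = succeeds(PAGES, shared_cache, first, first, 900)
second = ToyTracker(LIMIT)
second_ok = succeeds(PAGES, shared_cache, second, second, 900)

# Cache fills charged to a separate, unlimited storage tracker.
separate_cache = {}
query = ToyTracker(LIMIT)
storage = ToyTracker()
separate_ok = succeeds(PAGES, separate_cache, query, storage, 900)
```

The last case corresponds loosely to the TODO: storage-related memory is tracked on its own, so the query tracker reflects only the query's own work.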
2022-09-05 19:22:46 +08:00
05f6e1b33d [fix](memtracker) Fix open query profile to print the complete mem limit exceed log #12339 2022-09-05 19:21:43 +08:00
38937c15d7 [typo](streamload) fix typo and remove useless method declaration #12343 2022-09-05 19:16:36 +08:00
698bae09b2 [fix](Nereids)get NPE and group not be optimized when add REWRITE rule to Cascades Optimizer (#12346)
Fix some bugs that occur when adding REWRITE rules to the Cascades Optimizer
- all rules should be marked as non-rewrite rules when used in the Cascades Optimizer
- the promise of IMPLEMENT rules should be larger than that of other rules, since we should do exploration first.
2022-09-05 19:11:48 +08:00
f466a072d8 fix bug: tpch-q12 invalid type (#12347)
In the old planner, Predicate sets its type in analyzeImpl(). However, analyzeImpl() is on the old planner path and is not called on the Nereids path, so the type is left invalid.

Because every predicate has type bool, we set the type in the constructor.
2022-09-05 19:09:27 +08:00
dadfd85c40 prune for agg with constant expr (#12274)
Currently, Nereids doesn't support queries whose aggregate functions have no slot reference, since all columns would be pruned, e.g.

SELECT COUNT(1) FROM t;

This PR keeps the column with the smallest amount of data when pruning columns in this situation.

Note that this PR ONLY handles aggregate functions; projections with no slot reference still need to be handled in the future.
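
A minimal sketch of the idea, assuming "smallest amount of data" can be approximated by a per-type byte width. The `TYPE_WIDTH` table and `prune_columns` are hypothetical names for illustration, not the Nereids rule:

```python
# Hypothetical per-type widths in bytes; a real planner would use catalog metadata.
TYPE_WIDTH = {"TINYINT": 1, "SMALLINT": 2, "INT": 4, "BIGINT": 8, "DOUBLE": 8}


def prune_columns(table_columns, referenced):
    """Keep referenced columns; if none are referenced, keep the narrowest one.

    table_columns is a list of (name, type) pairs, referenced a set of names.
    """
    kept = [(name, typ) for name, typ in table_columns if name in referenced]
    if kept:
        return kept
    # No slot reference (e.g. COUNT(1)): keep one cheap column so that the
    # scan still produces rows to count.
    return [min(table_columns, key=lambda col: TYPE_WIDTH[col[1]])]
```

With columns `id BIGINT`, `flag TINYINT`, and `amount DOUBLE` and no referenced slots, this keeps only `flag`.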
2022-09-05 19:09:00 +08:00
8bfb89c100 [feature-wip](array-type) Add some regression tests for nested array (#12322)
#11392 made the _input_block in each BetaRowsetReader sharable. However, for some types (e.g. nested arrays more than one level deep), the _column_vector_batches in RowBlockV2 can be nested, which means that there is a ColumnVectorBatch inside another ColumnVectorBatch. In this case, the data of the inner ColumnVectorBatch may be corrupted because the data of _input_block is shallow-copied to the _output_block.
2022-09-05 14:05:24 +08:00
3b104e334a [Bug](load) fix missing nullable info in stream load (#12302) 2022-09-05 13:41:28 +08:00
7b352c93ff [improvement](sink) avoid frequent allocation and deallocation when serializing block (#12310) 2022-09-05 12:23:43 +08:00
2398cd3bb6 [enhancement](Nereids)print slot name in explain string (#12272)
Currently, the explain string prints every expression as a slot id, e.g. `<slot 1>`.
This PR prints its name together with the slot id instead, e.g. `column_a[#1]`. In detail:
- print the qualified table name for OlapScanNode
- print the NamedExpression name with its SlotId instead of just the SlotId
- use "OlapScanNode" as the OlapScanNode's node name instead of the table name
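
The two output formats quoted above can be sketched as follows. The `Slot` class and function names are hypothetical stand-ins used only to show the string shapes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """Hypothetical stand-in for a named expression with a slot id."""
    name: str
    slot_id: int


def old_format(slot):
    """Slot id only, as printed before the change."""
    return f"<slot {slot.slot_id}>"


def new_format(slot):
    """Name followed by the slot id, as printed after the change."""
    return f"{slot.name}[#{slot.slot_id}]"
```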
2022-09-05 11:31:35 +08:00
e5f3f0e730 [typo](docs) mix of SSD and HDD disks should specify the storage directory only (#12309)
add notice of storage
2022-09-05 09:23:34 +08:00
74b6eaf44b [typo](docs)Replace table link fix (#12317) 2022-09-05 08:29:41 +08:00
7929500608 [typo](docs)The table_function calling reset() function should set _eos to false #12323 2022-09-05 08:29:19 +08:00
7f10fa9768 [fix](compile)compile error when use clang on aarch64 platform (#12319) 2022-09-05 08:28:51 +08:00
d5e5afe437 [Bug](function) disable LUT for yearweek (#12324) 2022-09-05 08:27:43 +08:00
ef37396b63 [fix](dbt)fix dbt incremental bug (#12280) 2022-09-04 16:40:40 +08:00
81664fd78c github workflow build docs check fix (#12318)
2022-09-03 21:32:43 +08:00
90a0baf5f8 [fix](array-type) Forbid ARRAY<NOT_NULL(T)> temporarily (#12262)
Currently, there are still many bugs related to ARRAY<NOT_NULL(T)>.

We have decided not to support ARRAY<NOT_NULL(T)> types in the first version, so all elements in an ARRAY are nullable.

Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>
2022-09-03 14:26:08 +08:00
3a30e12ffb update data-model, add error_code into DUPLICATE KEY (#12131) 2022-09-03 14:23:29 +08:00
34dd67f804 [feature](nereids) add weekOfYear to support ssb-flat benchmark (#12207)
Support the WeekOfYear function.
In the current implementation, WeekOfYear can be used in the WHERE clause but not in the SELECT clause.
2022-09-03 12:04:51 +08:00
62561834a8 [Feature](array-type) Support is-null-predicate for array type (#12237) 2022-09-03 11:37:57 +08:00
e7303c12c7 [Enhancement](array-type) Support Floating/Decimal type for array aggregation functions (#12271) 2022-09-03 09:55:56 +08:00
5d0b1868c2 [chore](docs)Add compile check for document format (#12300)
Add compile check for document format

Avoid document formatting issues that break the daily build and release of the official website,
so that we can find and fix problems in time and avoid repeated modifications.
Since the website compiler now lives in the doris-website repo, we pull the code from that repo, delete the documentation inside it, and copy in the documentation from doris master to run the compile check.
2022-09-03 09:44:20 +08:00
b154a1b45e [doc] fix some docs issue (#11101)
* fix some docs issue

* add -y for apt-get

Co-authored-by: chaow <941210239@qq.com>
2022-09-02 21:06:12 +08:00
c944496fb4 [chore](log) add cluster and tag message to exception (#12287) 2022-09-02 20:46:39 +08:00
0d33c713d1 [Bug](CTAS) Fix CTAS error for use agg column as first. (#12299)
* FIX: CTAS uses duplicate key by default.
2022-09-02 20:44:01 +08:00
1fd3490c56 remove duplicate "comments" (#12264) 2022-09-02 18:57:10 +08:00
7f7a3a7524 [feature](nereids) Convert subqueries into algebraic expressions and … (#11454)
1. Convert subqueries to Apply nodes.
2. Convert ApplyNode to an ordinary join.

### Detailed design:

There are currently three types of subquery expressions: scalarSubquery, inSubquery, and Exists. A scalarSubquery is one that returns 1 row and 1 column.

**Subquery replacement**

```
before:
scalarSubquery:  filter(t1.a = scalarSubquery(output b));
inSubquery:  filter(inSubquery);   inSubquery = (t1.a in select ***);
exists:  filter(exists);   exists = (select ***);

after:
scalarSubquery:  filter(t1.a = b);
inSubquery:  filter(True);
exists:  filter(True);
```

**Subquery Transformation Rules**

```
PushApplyUnderFilter
 * before:
 *             Apply
 *          /              \
 * Input(output:b)    Filter(Correlated predicate/UnCorrelated predicate)
 *
 * after:
 *          Filter(Correlated predicate)
 *                      |
 *                  Apply
 *                /            \
 *      Input(output:b)    Filter(UnCorrelated predicate)
```
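
The split that PushApplyUnderFilter relies on can be sketched as a partition of the filter's conjuncts. This is a simplified illustration, assuming a conjunct counts as correlated when it references at least one slot of the outer input; the tuple representation and `split_predicates` are hypothetical:

```python
def split_predicates(conjuncts, outer_slots):
    """Partition conjuncts by whether they reference any outer-input slot.

    Each conjunct is modelled as (text, set_of_referenced_slots).
    Returns (correlated, uncorrelated) lists of predicate text.
    """
    correlated, uncorrelated = [], []
    for text, slots in conjuncts:
        if slots & outer_slots:
            correlated.append(text)    # pulled above the Apply
        else:
            uncorrelated.append(text)  # stays in the subquery's Filter
    return correlated, uncorrelated
```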

```
PushApplyUnderProject
 * before:
 *            Apply
 *         /              \
 * Input(output:b)    Project(output:a)
 *
 * after:
 *          Project(b,(if the Subquery is Scalar add 'a' as the output column))
 *          /               \
 * Input(output:b)      Apply
```

```
ApplyPullFilterOnAgg
 * before:
 *             Apply
 *          /              \
 * Input(output:b)    agg(output:fn,c; group by:null)
 *                              |
 *              Filter(Correlated predicate(Input.e = this.f)/UnCorrelated predicate)
 *
 * after:
 *          Apply(Correlated predicate(Input.e = this.f))
 *         /              \
 * Input(output:b)    agg(output:fn,this.f; group by:this.f)
 *                              |
 *                    Filter(UnCorrelated predicate)
```

```
ApplyPullFilterOnProjectUnderAgg
 * before:
 *              apply
 *         /              \
 * Input(output:b)        agg
 *                         |
 *                  Project(output:a)
 *                         |
 *              Filter(correlated predicate(Input.e = this.f)/Unapply predicate)
 *                          |
 *                         child
 *
 * after:
 *              apply
 *         /              \
 * Input(output:b)        agg
 *                         |
 *              Filter(correlated predicate(Input.e = this.f)/Unapply predicate)
 *                         |
 *                  Project(output:a,this.f, Unapply predicate(slots))
 *                          |
 *                         child

```

```
ScalarToJoin
 * UnCorrelated -> CROSS_JOIN
 * Correlated -> LEFT_OUTER_JOIN
```

```
InToJoin
 * Not In -> LEFT_ANTI_JOIN
 * In -> LEFT_SEMI_JOIN
```

```
existsToJoin
 * Exists
 *    Correlated -> LEFT_SEMI_JOIN
 *      correlated                  LEFT_SEMI_JOIN(Correlated Predicate)
 *      /       \         -->       /           \
 *    input    queryPlan          input        queryPlan
 *
 *    UnCorrelated -> CROSS_JOIN(limit(1))
 *      uncorrelated                    CROSS_JOIN
 *      /           \          -->      /       \
 *    input        queryPlan          input    limit(1)
 *                                               |
 *                                             queryPlan
 *
 * Not Exists
 *    Correlated -> LEFT_ANTI_JOIN
 *      correlated                  LEFT_ANTI_JOIN(Correlated Predicate)
 *       /       \         -->       /           \
 *     input    queryPlan          input        queryPlan
 *
 *   UnCorrelated -> CROSS_JOIN(Count(*))
 *                                    Filter(count(*) = 0)
 *                                          |
 *         apply                       Cross_Join
 *      /       \         -->       /           \
 *    input    queryPlan          input       agg(output:count(*))
 *                                               |
 *                                             limit(1)
 *                                               |
 *                                             queryPlan
```
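
The ScalarToJoin, InToJoin, and existsToJoin mappings listed above can be condensed into one lookup. This is a sketch of the table only, with a hypothetical `join_type` function; the extra nodes for uncorrelated cases (`limit(1)`, or `count(*)` plus a filter for Not Exists) are described in the comments but not built:

```python
def join_type(kind, correlated, negated=False):
    """Return the join type for a subquery, following the rules listed above.

    kind is one of "scalar", "in", "exists"; negated means NOT IN / NOT EXISTS.
    """
    if kind == "scalar":
        return "LEFT_OUTER_JOIN" if correlated else "CROSS_JOIN"
    if kind == "in":
        return "LEFT_ANTI_JOIN" if negated else "LEFT_SEMI_JOIN"
    if kind == "exists":
        if correlated:
            return "LEFT_ANTI_JOIN" if negated else "LEFT_SEMI_JOIN"
        # Uncorrelated: a cross join over limit(1); for NOT EXISTS the plan
        # also aggregates count(*) and filters on count(*) = 0.
        return "CROSS_JOIN"
    raise ValueError(f"unknown subquery kind: {kind}")
```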
2022-09-02 17:34:19 +08:00
08c5e0b1e3 [chore](deps) strip debug info of thirdparty dependencies (#12284)
Strip debug info from the static libs of most thirdparty dependencies.
This significantly reduces the size of the thirdparty libs: 3.4G -> 1.6G
The doris_be binary size is also reduced: 1.5G -> 868M (clang build)
And after compression, the BE binary is only 195M with debug info!
2022-09-02 15:43:29 +08:00
64302ff4c9 [typo](docs)Sidebar fix (#12297)
* sidebar fix
2022-09-02 15:09:26 +08:00
81c5732dc7 [feature-wip](MTMV) Support creating materialized view for multiple tables (#11646)
Support creating materialized view for multiple tables.

Examples:

mysql> CREATE TABLE t1 (pk INT, v1 INT SUM) AGGREGATE KEY (pk) DISTRIBUTED BY hash (pk) PROPERTIES ('replication_num' = '1');
mysql> CREATE TABLE t2 (pk INT, v2 INT SUM) AGGREGATE KEY (pk) DISTRIBUTED BY hash (pk) PROPERTIES ('replication_num' = '1');
mysql> CREATE MATERIALIZED VIEW mv BUILD IMMEDIATE REFRESH COMPLETE KEY (mv_pk) DISTRIBUTED BY HASH (mv_pk) PROPERTIES ('replication_num' = '1') AS SELECT t1.pk as mv_pk FROM t1, t2 WHERE t1.pk = t2.pk;
2022-09-02 14:51:56 +08:00
Pxl
a8c8ebf5cf [Enhancement](compaction) empty string optimize for binary dict code (#12259)
Improve the performance of writing empty strings.
2022-09-02 14:25:19 +08:00
7a4173b497 [typo](docs)Fix admin copy table format (#12288)
2022-09-02 14:08:56 +08:00
202ad5c659 [feature-wip](parquet-reader) bug fix, the number of rows are different among columns in a block (#12228)
1. `ExprContext` is deleted in `ParquetReader::close()` without having been closed,
so the `DCHECK` in `~ExprContext()` fails. The lifetime of `ExprContext` is managed by the scan node,
so we should not delete its pointer in `ParquetReader::close()`.
2. `RowGroupReader::next_batch` updates `_read_rows` on every iteration of the column loop,
and does not ensure that the number of rows is equal across columns.
3. The skipped row ranges are stack variables, which are released when calling `ArrayColumnReader::read_column_data`, so we should copy them out.
2022-09-02 09:50:25 +08:00
3ce6bb548d doc_stream_load_format (#12144)
2022-09-02 09:22:10 +08:00
10c3e683dd [docs]update users numbers (#12248)
update users numbers
2022-09-02 09:21:36 +08:00