8276 Commits

Author SHA1 Message Date
d77ea64ae4 [typo](docs) Changing the Jump Address of SparkLoad in BrokerLoad (#12731) 2022-09-23 09:15:17 +08:00
617820b1f5 [Refactor](parquet) refactor parquet write to uniform and consistent logic (#12730) 2022-09-23 09:12:34 +08:00
0203b36cc4 [regressiontest](test_with)add with_case test (#12814) 2022-09-23 09:10:33 +08:00
84dd3edd0d [Bug](view) Show create view support comment #12838 2022-09-23 09:09:44 +08:00
8fcd8ed8b3 [chore](build) add option to disable -frecord-gcc-switches (#12846) 2022-09-22 15:38:14 +08:00
340784e294 [feature-wip](statistics) add statistics module related syntax (#12766)
This pull request implements part of the statistics work (#6370): it adds the syntax related to the statistics module. For now, the syntax for collecting statistics does not actually collect them; collection will be enabled once testing is stable.

- `ANALYZE` syntax (collect statistics)

```SQL
ANALYZE [[ db_name.tb_name ] [( column_name [, ...] )], ...] [PARTITIONS(...)] [ PROPERTIES(...) ]
```
> db_name.tb_name: collect table and column statistics from tb_name.
> column_name: collect column statistics from column_name.
> properties: properties of statistics jobs.

example:
```SQL
ANALYZE;  -- collect statistics for all tables in the current database
ANALYZE table1(pv, citycode);  -- collect pv and citycode statistics for table1
ANALYZE test.table2 PARTITIONS(partition1); -- collect statistics for partition1 of table2
```

- `SHOW ANALYZE` syntax (show statistics job info)

```SQL
SHOW ANALYZE
    [TABLE | ID]
    [
        WHERE
        [STATE = ["PENDING"|"SCHEDULING"|"RUNNING"|"FINISHED"|"FAILED"|"CANCELLED"]]
    ]
    [ORDER BY ...]
    [LIMIT limit][OFFSET offset];
```

- `SHOW TABLE STATS` syntax (show table statistics)

```SQL
SHOW TABLE STATS [ db_name.tb_name ]  
```

- `SHOW COLUMN STATS` syntax (show column statistics)

```SQL
SHOW COLUMN STATS [ db_name.tb_name ] 
```
2022-09-22 11:15:00 +08:00
3fa820ec50 [feature-wip](statistics) collect statistics by sql task (#12765)
This pull request implements part of the statistics work (#6370): it implements the sql-task, which collects statistics based on internal-query (#9983).

After the ANALYZE statement is parsed, statistical tasks are generated. These tasks include the meta-task (which gets statistics from metadata) and the sql-task (which gets statistics by SQL query). The sql-task obtains statistics such as row_count, the number of null values, and the maximum value by running SQL queries.

Statistical tasks also include a sampling sql-task, which will be implemented in the next PR.
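
As a rough illustration of the sql-task described above, the sketch below builds one query that returns the row count, the number of null values, and the maximum value of a column. The helper name `build_column_stats_sql` and the use of SQLite as a stand-in engine are assumptions made for this example; they are not taken from this PR.

```python
import sqlite3


def build_column_stats_sql(table, column):
    """Build one query returning row count, null count and max value.

    Hypothetical helper: identifiers are interpolated as-is, so they must
    come from trusted metadata, never from user input.
    """
    return (
        f"SELECT COUNT(*) AS row_count, "
        f"SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) AS null_count, "
        f"MAX({column}) AS max_value "
        f"FROM {table}"
    )


# SQLite is only a stand-in here so that the query can be executed locally.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (pv INTEGER)")
conn.executemany(
    "INSERT INTO table1 VALUES (?)", [(3,), (None,), (7,), (None,), (5,)]
)

stats = conn.execute(build_column_stats_sql("table1", "pv")).fetchone()
print(stats)  # (row_count, null_count, max_value)
```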
2022-09-22 11:13:35 +08:00
70ab9cb43e [feature](http) refactor version info and add new http api for get version info (#12513)
Refactor version info and add a new HTTP API for getting version info.
2022-09-22 10:53:04 +08:00
77e423042c (brpc) donot use pooled brpc (#12754)
It seems that pooled brpc does not release ports in a timely manner.
2022-09-22 10:00:26 +08:00
57b3c03371 [enhancement](like)pass data to like function in block not in row (#12825)
The like predicate performs better when it processes data by block rather than by row. Currently, only non-nullable columns are optimized; nullable columns will be handled later.

SELECT COUNT(*) FROM hits WHERE URL LIKE '%google%';
before: ~680ms
after: ~570ms
2022-09-22 09:59:30 +08:00
32551a7263 [bugfix](predicate column) data maybe wrong if not a single page (#12796)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-09-22 09:55:31 +08:00
6cd4c9ecb5 [bugfix](fe) Fix test_materialized_view_hll case npt. (#12829)
When light schema change is enabled, running the test_materialized_view_hll case throws a NullPointerException:
  java.lang.NullPointerException: null
      at org.apache.doris.analysis.SlotDescriptor.setColumn(SlotDescriptor.java:153)
      at org.apache.doris.planner.OlapScanNode.updateSlotUniqueId(OlapScanNode.java:399)
2022-09-22 09:50:53 +08:00
4b95b4e41d [feature-wip](file-scanner)Get column type from parquet schema (#12833)
Get schema from parquet reader.
The new VFileScanner needs to get the file schema (a map from column name to type) from the parquet file while processing a load job.
This PR sets the type information for parquet columns.
2022-09-22 09:35:37 +08:00
1ca6d559e4 [feature-wip](parquet-reader) refactor some arguments for parquet reader (#12771)
Refactor some arguments for the parquet reader:
1. Add a new parquet context to wrap reader arguments.
2. Reduce the number of arguments in function calls.
Co-authored-by: jinzhe <jinzhe@selectdb.com>
2022-09-22 09:34:01 +08:00
cbadbecd9a [fix](Nereids) anti join could not be reorder (#12827) 2022-09-22 09:19:12 +08:00
wxy
1ae7c4e307 [fix](LOAD statement): fix bug for toSql func of LoadStmt. (#12648) 2022-09-22 09:07:46 +08:00
c58e4ca03b [enhancement](Nereids) turn on all reorder rule that needed by zig-zag tree (#12767) 2022-09-22 02:35:31 +08:00
0dee640a3e [feature](Nereids): eliminate filter true and add checker. (#12821) 2022-09-22 02:31:11 +08:00
e21ffac419 [Improvement](dateformat) Improve efficiency for function date_format (#12811) 2022-09-21 22:38:16 +08:00
35f07ede26 [typo](docs)Changing the Jump Address of BrokerLoad in SparkLoad (#12735)
* [typo](docs)Changing the Jump Address of BrokerLoad in SparkLoad

Changing the Jump Address of BrokerLoad in SparkLoad

* Update spark-load-manual.md
2022-09-21 22:03:28 +08:00
b09cc95701 [typo](docs) fix get-starting doc err (#12777) 2022-09-21 21:58:41 +08:00
1c98c3a8f0 [fix](Nereids) GroupExpression never be optimize if it run with exploration job (#12815)
An exploration job only explores and never calls optimize, so a GroupExpression produced by an exploration-only job never goes through implementation.
2022-09-21 21:03:37 +08:00
fbdebe2424 [feature-wip](new-scan)Add load counter for VFileScanner (#12812)
The new scanner (VFileScanner) needs a counter to record two values in a load job:
1. The number of rows unselected by the pre-filter, and
2. The number of rows filtered out because of an unmatched schema or other errors.
This PR implements the counter.
2022-09-21 20:59:13 +08:00
c55d08fa2f [fix](memtracker) Refactor load channel mem tracker to improve accuracy (#12791)
Tracking through the mem hook cannot guarantee that a tracker's final consumption is 0, nor that memory allocations and frees are recorded in one-to-one correspondence.

Over the life cycle of a memtable, from insert to flush, the hook records more freed memory than allocated memory, so the tracker's consumption drops below 0.

To avoid a cumulative error in the upper load channel tracker, the memtable tracker's consumption is reset to zero in its destructor.
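
The effect of the reset can be sketched with a toy tracker hierarchy. Everything below is hypothetical (the `MemTracker` class, its `consume` and `reset_to_zero` methods, and the byte counts); it only illustrates how a per-memtable drift would accumulate in a parent tracker and how cancelling the leftover on destruction avoids that.

```python
class MemTracker:
    """Hypothetical tracker: consumption is forwarded to every ancestor."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.consumption = 0

    def consume(self, size):
        # A negative size records a free.
        tracker = self
        while tracker is not None:
            tracker.consumption += size
            tracker = tracker.parent

    def reset_to_zero(self):
        # Cancel whatever is left (positive or negative) here and upstream.
        self.consume(-self.consumption)


def simulate(reset_on_destroy):
    """Run three memtable life cycles and return the parent's consumption."""
    load_channel = MemTracker("load_channel")
    for i in range(3):
        memtable = MemTracker(f"memtable_{i}", parent=load_channel)
        memtable.consume(100)    # allocations seen by the hook
        memtable.consume(-120)   # the hook records more free than alloc
        if reset_on_destroy:
            memtable.reset_to_zero()  # what a destructor would do
    return load_channel.consumption


without_reset = simulate(reset_on_destroy=False)
with_reset = simulate(reset_on_destroy=True)
print(without_reset, with_reset)
```

Without the reset, each memtable leaves its negative drift in the parent; with it, the parent returns to zero after every memtable.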
2022-09-21 20:16:19 +08:00
b41eaa5ac0 [fix](memtracker) Introduce orphan mem tracker to verify memory tracking accuracy (#12794)
The mem hook consumes the orphan tracker by default. If a thread does not attach any other tracker, all of its consumption is passed to the process tracker through the orphan tracker.

At any moment, consumption of all other trackers + orphan tracker consumption = process tracker consumption.

Ideally, all threads are expected to attach to the specified tracker, so that "all memory has its own ownership", and the consumption of the orphan mem tracker is close to 0, but greater than 0.
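
The accounting identity above can be checked with a small model. The classes and names below (`MemTracker`, `ThreadContext`, `mem_hook`) are illustrative assumptions, not the actual Doris classes: every hooked allocation is charged to the thread's attached tracker, which defaults to the orphan tracker, and also to the process tracker.

```python
class MemTracker:
    def __init__(self, label):
        self.label = label
        self.consumption = 0


process_tracker = MemTracker("process")
orphan_tracker = MemTracker("orphan")
query_tracker = MemTracker("query")
load_tracker = MemTracker("load")


class ThreadContext:
    """Hypothetical per-thread state: the tracker the mem hook charges."""

    def __init__(self):
        self.attached = orphan_tracker  # default when nothing is attached

    def attach(self, tracker):
        self.attached = tracker

    def mem_hook(self, size):
        # Called on every alloc (size > 0) or free (size < 0).
        self.attached.consumption += size
        process_tracker.consumption += size


t1 = ThreadContext()
t1.attach(query_tracker)
t1.mem_hook(400)

t2 = ThreadContext()
t2.attach(load_tracker)
t2.mem_hook(250)

t3 = ThreadContext()  # never attaches, so its memory is "orphaned"
t3.mem_hook(16)

t1.mem_hook(-100)

others = query_tracker.consumption + load_tracker.consumption
print(others, orphan_tracker.consumption, process_tracker.consumption)
```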
2022-09-21 15:47:10 +08:00
8f4bb0f804 [improvement](agg) iterate aggregation data in memory written order (#12704)
Following the iteration order of the hash table will result in out-of-order access to aggregate states, which is very inefficient.
Traversing aggregate states in memory write order can significantly improve memory read efficiency.

Test
hash table items count: 3.35M

Before this optimization, inserting keys into the column takes 500ms.
With this optimization, it takes only 80ms.
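
A minimal sketch of the idea, assuming a layout where the hash table only maps each key to a position and the aggregate states live in a separate array in the order they were first written. The `AggTable` class is hypothetical, and Python cannot show the cache effect itself; the point is only that the output pass scans the state array sequentially instead of walking the hash table.

```python
class AggTable:
    """Hypothetical SUM aggregation table that keeps states in write order."""

    def __init__(self):
        self._index = {}   # key -> position in the arrays below
        self._keys = []    # keys in the order they were first inserted
        self._states = []  # aggregate states, stored in the same order

    def add(self, key, value):
        pos = self._index.get(key)
        if pos is None:
            # First time this key is seen: append a new state at the end.
            pos = len(self._states)
            self._index[key] = pos
            self._keys.append(key)
            self._states.append(0)
        self._states[pos] += value

    def results(self):
        # Sequential scan over the arrays; the hash table is not touched.
        return list(zip(self._keys, self._states))


table = AggTable()
for key, value in [("b", 1), ("a", 2), ("b", 3), ("c", 4), ("a", 5)]:
    table.add(key, value)

rows = table.results()
print(rows)
```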
2022-09-21 14:58:50 +08:00
27f7ae258d [Enhancement](load) optimize flush policy to avoid small segments #12706
Under the current policy, when the memory limit is exceeded, the load channel picks the tablets that consume the most memory. However, mem_consumption includes memory that is already being flushed: if a delta writer is flushing a full memtable (200MB by default), its current memtable might be very small. We should avoid flushing such a memtable, because doing so can generate a very small segment.
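
One possible shape of such a policy is sketched below. It is an assumption-laden illustration, not the code of this PR: the `DeltaWriter` fields, the `min_flush_bytes` threshold, and the selection loop are all hypothetical. The sketch ranks writers by the size of their active memtable instead of their total mem_consumption, and skips writers whose active memtable is too small to be worth flushing.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class DeltaWriter:
    tablet_id: int
    active_memtable_bytes: int  # memtable still receiving rows
    flushing_bytes: int         # memtables already handed to flush

    @property
    def mem_consumption(self):
        return self.active_memtable_bytes + self.flushing_bytes


def pick_writers_to_flush(writers, bytes_to_free, min_flush_bytes):
    """Pick writers by active memtable size, skipping tiny memtables."""
    candidates = sorted(
        (w for w in writers if w.active_memtable_bytes >= min_flush_bytes),
        key=lambda w: w.active_memtable_bytes,
        reverse=True,
    )
    picked, freed = [], 0
    for writer in candidates:
        if freed >= bytes_to_free:
            break
        picked.append(writer)
        freed += writer.active_memtable_bytes
    return picked


writers = [
    # Largest mem_consumption, but almost all of it is already flushing.
    DeltaWriter(1, active_memtable_bytes=2 * MB, flushing_bytes=200 * MB),
    DeltaWriter(2, active_memtable_bytes=150 * MB, flushing_bytes=0),
    DeltaWriter(3, active_memtable_bytes=90 * MB, flushing_bytes=0),
]

picked = pick_writers_to_flush(
    writers, bytes_to_free=200 * MB, min_flush_bytes=32 * MB
)
picked_ids = [w.tablet_id for w in picked]
print(picked_ids)
```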
2022-09-21 14:33:05 +08:00
ec2b3bf220 [feature-wip](new-scan)Refactor VFileScanner, support broker load, remove unused functions in VScanner base class. (#12793)
Refactor of scanners. Support broker load.
This PR is part of the scanner refactoring tasks. It provides support for broker load using the new VFileScanner.
Work is still in progress.
2022-09-21 12:49:56 +08:00
7b46e2400f [enhancement](Nereids) add all necessary PhysicalDistribute on Join's child to ensure get correct cost (#12483)
In an earlier PR, #11976, we added shuffle join and bucket shuffle support. However, if the distribution spec of the join's right child satisfied the join's requirement, we did not add a distribute on the right child; instead, it was added in the plan translator.
It is hard to calculate an accurate cost this way, since some distribute costs are not counted.
In this PR, we introduce a new shuffle type, BUCKET, and change the way enforcers are added so that every necessary distribute is added in the cost and enforcer job.
2022-09-21 12:18:37 +08:00
a7993755ae [typo](docs)rename doc file name (#12783)
Co-authored-by: chenjie <chenjie@cecdat.com>
2022-09-21 11:25:38 +08:00
52a0da1f5c [improve](Nereids): add check validator during post. (#12702) 2022-09-21 11:25:04 +08:00
b6e20db997 [fix](outfile) select OBJECT and HLL columns into outfile as null. (#12734) 2022-09-21 11:24:31 +08:00
632867c1c1 [Bug](datetimev2) Fix lost precision for datetimev2 (#12723) 2022-09-21 11:15:02 +08:00
3cfaae0031 [Improvement](sort) Use heap sort to optimize sort node (#12700) 2022-09-21 10:01:52 +08:00
a5643822de [feature-wip](unique-key-merge-on-write) fix calculate delete bitmap when has sequence column (#12789)
When the rowset has multiple segments with a sequence column, we should compare the sequence id with that of the previous segment.
2022-09-21 09:21:07 +08:00
bd4bfa8f00 [fix](memtracker) Fix thread mem tracker try consume accuracy #12782 2022-09-21 09:20:41 +08:00
c72a19f410 [BugFix](VExprContext) capture error status to prevent incorrect func call which causes coredump #12779 2022-09-21 09:20:16 +08:00
f1539761e8 [Bugfix](string_functions) rearrange code to avoid global buffer overflow in FindInSetOp::execute (#12677) 2022-09-21 09:19:38 +08:00
c5b6056b7a [fix](lateral_view) fix lateral view explode_split with temp table (#12643)
Problem description:

The following SQL returns a wrong result:
WITH example1 AS ( select 6 AS k1 ,'a,b,c' AS k2) select k1, e1 from example1 lateral view explode_split(k2, ',') tmp as e1;

Wrong result:

+------+------+
| k1   | e1   |
+------+------+
|    0 | a    |
|    0 | b    |
|    0 | c    |
+------+------+
Correct result should be:
+------+------+
| k1   | e1   |
+------+------+
|    6 | a    |
|    6 | b    |
|    6 | c    |
+------+------+
Why?
TableFunctionNode::outputSlotIds does not include column k1.

Co-authored-by: cambyzju <zhuxiaoli01@baidu.com>
2022-09-21 09:19:18 +08:00
b0b876f640 [typo](docs) vectorization needs to be turned off to use native udf #12739 2022-09-21 09:13:48 +08:00
11e0151445 [chore](build) add an option to disable strip thridparty libs (#12772) 2022-09-21 09:11:25 +08:00
7dfbb7c639 [chore](regression-test) add order by column in tpch_sf1_p1/tpch_sf1/nereids/q11.groovy (#12770) 2022-09-20 22:26:24 +08:00
d5486726de [Bug](date) Fix wrong result produced by date function (#12720) 2022-09-20 21:09:26 +08:00
cc072d35b7 [Bug](date) Fix wrong type in TimestampArithmeticExpr (#12727) 2022-09-20 21:08:48 +08:00
b550985df6 fix thirdparty builder (#12768) 2022-09-20 19:41:00 +08:00
e70c298e0c [Bugfix](mem) Fix memory limit check may overflow (#12776)
The bug arises because a subtraction that mixes signed and unsigned numbers may overflow when the true result is negative.
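
The failure mode can be reproduced by emulating 64-bit unsigned arithmetic. This is only a sketch: the function names and the numbers are made up, and the actual fix in this PR is not shown here. The buggy check computes `limit - used` as an unsigned value, which wraps to a huge number once `used` exceeds `limit`; the safe variant compares without subtracting.

```python
UINT64_MASK = (1 << 64) - 1


def sub_as_uint64(a, b):
    """Emulate a - b when the result is held in an unsigned 64-bit integer."""
    return (a - b) & UINT64_MASK


def has_room_buggy(limit, used, request):
    # When used > limit, the difference wraps around and the check passes.
    return sub_as_uint64(limit, used) >= request


def has_room_safe(limit, used, request):
    # No subtraction, so no intermediate value can go negative.
    # (In fixed-width code, used + request must itself not overflow.)
    return used + request <= limit


limit, used, request = 100, 150, 10
wrapped = sub_as_uint64(limit, used)
buggy = has_room_buggy(limit, used, request)
safe = has_room_safe(limit, used, request)
print(wrapped, buggy, safe)
```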
2022-09-20 18:18:23 +08:00
bb7206d461 [refactor](SimpleScheduler) refactor code for getting available backend in SimpleScheduler (#12710) 2022-09-20 18:08:29 +08:00
b837b2eb95 [feature-wip](parquet-reader) filter rows by page index (#12664)
# Proposed changes

[Parquet v1.11+ supports page skipping](https://github.com/apache/parquet-format/blob/master/PageIndex.md),
which helps the scanner reduce the amount of data that is scanned, decompressed, decoded, and inserted.
According to the performance FlameGraph, decompression takes up 20% of CPU time.
If a page can be filtered out as a whole, it does not need to be decompressed.

However, the row numbers between pages are not aligned. Columns containing predicates can be filtered at page granularity,
but other columns need to be skipped within pages, so non-predicate columns can only save the decoding and insertion time.

An array column needs the repetition level to align with other columns, so an array column can only save the decoding and insertion time.

## Explore
`OffsetIndex` in the column metadata can locate the page position.
Theoretically, a page can be completely skipped, including the time of reading from HDFS.
However, the average size of a page is around 500KB. Skipping a page requires calling `skip`.
The performance of `skip` is low when it is called frequently,
and may not be better than continuous reading of large blocks of data (such as 4MB).

If multiple consecutive pages are filtered, `skip` reading can be performed according to `OffsetIndex`.
However, for programming convenience and readability, the data of all pages is loaded and filtered in turn.
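
## Sketch
The page-granularity filtering of a predicate column can be sketched as follows. This is an illustration under assumptions, not the reader's actual code: the function name and inputs are hypothetical, with per-page min/max values standing in for what the Parquet `ColumnIndex` provides and the first row index of each page standing in for `OffsetIndex`. The function returns the merged row ranges that may still match a predicate `lo <= value <= hi`; other columns would then have to skip to these ranges inside their own pages.

```python
def candidate_row_ranges(first_row_indexes, num_rows, page_mins, page_maxs, lo, hi):
    """Return merged [start, end) row ranges of pages that may match [lo, hi]."""
    ranges = []
    for i, start in enumerate(first_row_indexes):
        # A page ends where the next page starts; the last one ends at num_rows.
        if i + 1 < len(first_row_indexes):
            end = first_row_indexes[i + 1]
        else:
            end = num_rows
        # The page's value range does not overlap [lo, hi]: skip the whole page.
        if page_maxs[i] < lo or page_mins[i] > hi:
            continue
        if ranges and ranges[-1][1] == start:
            # Adjacent to the previous surviving page: extend that range.
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((start, end))
    return ranges


# Four pages covering 500 rows, with made-up statistics.
first_row_indexes = [0, 100, 250, 400]
num_rows = 500
page_mins = [1, 40, 90, 200]
page_maxs = [39, 89, 199, 300]

ranges = candidate_row_ranges(
    first_row_indexes, num_rows, page_mins, page_maxs, lo=50, hi=120
)
print(ranges)
```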
2022-09-20 15:55:19 +08:00
47797ad7e8 [feature](Nereids) Push down not slot references expression of on clause (#11805)
Push down the expressions in the ON clause that are not slot references. For example:
select * from t1 join t2 on t1.a + 1 = t2.b + 2 and t1.a + 1 > 2

project()
+---join(t1.a + 1 = t2.b + 2 && t1.a + 1 > 2)
    |---scan(t1)
    +---scan(t2)

is transformed to

project()
+---join(c = d && c > 2)
    |---project(t1.a -> t1.a + 1)
    |   +---scan(t1)
    +---project(t2.b -> t2.b + 2)
        +---scan(t2)
2022-09-20 13:41:54 +08:00
d83eb13ac5 [enhancement](nereids) use Literal promotion to avoid unnecessary cast (#12663)
Instead of adding a cast function on a literal, we directly change the literal's type. This change saves cast execution time and memory.
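
A minimal sketch of this kind of type coercion is shown here, with hypothetical expression classes (`Literal`, `SlotRef`, `Cast`) and a hypothetical `coerce` function that are not the Nereids classes. A literal whose value fits the target type is rewritten to a literal of that type; anything else still gets a cast.

```python
from dataclasses import dataclass

# Value ranges of the integer types used in this sketch.
INT_RANGES = {
    "TINYINT": (-(2**7), 2**7 - 1),
    "SMALLINT": (-(2**15), 2**15 - 1),
    "INT": (-(2**31), 2**31 - 1),
    "BIGINT": (-(2**63), 2**63 - 1),
}


@dataclass(frozen=True)
class Literal:
    value: int
    type: str


@dataclass(frozen=True)
class SlotRef:
    name: str
    type: str


@dataclass(frozen=True)
class Cast:
    child: object
    target_type: str


def coerce(expr, target_type):
    """Change a literal's type directly when its value fits; otherwise cast."""
    if isinstance(expr, Literal):
        low, high = INT_RANGES[target_type]
        if low <= expr.value <= high:
            return Literal(expr.value, target_type)
    return Cast(expr, target_type)


promoted = coerce(Literal(0, "TINYINT"), "INT")
still_cast = coerce(SlotRef("l_orderkey", "INT"), "BIGINT")
too_big = coerce(Literal(300, "INT"), "TINYINT")
print(promoted, still_cast, too_big)
```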
For example:
In SQL:
"CASE WHEN l_orderkey > 0 THEN ...", 0 is a TinyIntLiteral.
Before this PR:
"CASE WHEN l_orderkey > CAST (TinyIntLiteral(0) AS INT)"
With this PR:
"CASE WHEN l_orderkey > IntegerLiteral(0)"
2022-09-20 11:15:47 +08:00