Commit Graph

6608 Commits

Author SHA1 Message Date
978dae267e [typo](docs)Optimized string and date function doc (#12949) 2022-09-26 09:26:12 +08:00
91134cff61 [typo](docs)Optimized date function doc order and add partial function doc #12878 2022-09-26 09:25:11 +08:00
acd5d67355 [feature-wip](new-scan)Add new odbc scanner and new odbc scan node (#12899) 2022-09-26 09:24:25 +08:00
56fc00cb53 [chore](config) increase minimum thread num of some thread pool (#12917)
Too small minimum thread num will cause additional overhead for creating and recycling threads.
2022-09-26 09:00:18 +08:00
32144ccda8 [Enhancement](debugging) Add more debug info for clang build (#12845) 2022-09-26 08:50:12 +08:00
7f2ea35b63 [enhancement](test) add brown cases to p2 (#12694) 2022-09-25 23:46:45 +08:00
60556070bb [enhancement](test) add github events cases to p2 (#12696) 2022-09-25 23:46:15 +08:00
692176ec07 [feature-wip](parquet-reader) pre read page data in advance to avoid frequent seek (#12898)
1. Fix the bug of file position in `HdfsFileReader`
2. Reserve enough buffer for `ColumnColumnReader` to read large continuous memory
2022-09-25 21:21:06 +08:00
380c3f42ab [Refactor](datev2) Update comments for datev2/datetimev2 (#12823) 2022-09-25 18:43:32 +08:00
f879a51ce9 [Improvement](dict) optimize dictionary column (#12852) 2022-09-25 18:29:10 +08:00
d8e8bc0e69 [Improvement](predicate) Replace for-loop by memcpy (#12867) 2022-09-25 18:27:59 +08:00
59699a4321 [feature](JSON datatype)Support JSON datatype (#10322)
Add `JSON` datatype, following features are implemented by this PR:
1. `CREATE` tables with `JSON` type columns
2. `INSERT` values containing `JSON` type value stored in `String`, which is represented as binary format(AKA `JSONB`) at BE 
3. `SELECT` JSON columns

Detail design refers [DSIP-016: Support JSON type](https://cwiki.apache.org/confluence/display/DORIS/DSIP-016%3A+Support+JSON+type)

* add JSONB data storage format type

* fix JsonLiteral resolve bug

* add DataTypeJson case in data_type_factory

* add JSON syntax check in FE

* add operators for jsonb_document, currently not support comparison between any JSON type value

* add ColumnJson and DataTypeJson

* add JsonField to store JsonValue

* add JsonValue to convert String JSON to BINARY JSON and JsonLiteral case for vliteral

* add push_json for MysqlResultWriter

* JSON column need no zone_map_index

* Revert "JSON column need no zone_map_index"

This reverts commit f71d1ce1ded9dbae44a5d58abcec338816b70d79.

* add JSON writer and reader, ignore zone-map for JSON column

* add json_to_string for DataTypeJson

* add olap_data_convertor for JSON type

* add some enum

* add OLAP_FIELD_TYPE_JSON type, FieldTypeTraits for it and corresponding cases or functions

* fix column_json offsets overflow bug, format code

* remove useless TODOs, add CmpType cases for JSON type

* add license header

* format license

* format be codes

* resolve rebase master conflicts

* fix bugs for CREATE and meta related code

* refactor JsonValue constructors, add fe JSON cases and fix some bugs, reformat codes

* modification be codes along code review advice

* fix rebase conflicts with master

* add unit test for json_value and column_json

* fix rebase error

* rename json to jsonb

* fix some data convert bugs, set Mysql type to JSON
2022-09-25 14:06:49 +08:00
57d5f69814 [fix](load) print detailed error message (#12938)
fix flush failure return message
2022-09-25 10:31:41 +08:00
dd6ed5a9a7 [fix](function)fix string split function buffer overflow (#12834) 2022-09-24 17:32:00 +08:00
f1a64ea09f [fix](new-scan)Fix new scanner load job bugs (#12903)
Fix bugs:
1. Fe need to send file format (e.g. parquet, orc ...) to be while processing load jobs using new scanner.
2. Try to get parquet file column type from SchemaElement.type before getting from Logical type and Converted type.
2022-09-24 17:21:19 +08:00
3bb920ba54 [Enhancement](load) Refine the load channel flush policy on mem limit (#12716)
1. Remove single load channel mem limit, only use load channel mgr mem limit
2. Default load channel mgr mem limit from 50% to 80%
3. load channel mgr add soft mem limit. When the soft limit is exceeded, other threads will not hang, only current thread triggers flush
4. When exceed load channel mgr mem limit, find a load channel with the largest mem usage, continue to find a tablet channel with the largest mem usage, and try to flush 1/3 of the mem usage of this tablet channel.
2022-09-24 10:01:13 +08:00
7b230e41a8 [bugfix](scanner) olap scanner compute is wrong (#12857)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-09-24 09:59:59 +08:00
d65756b504 [Bug](bucket shuffle) fix error bucket shuffle join plan in two same table (#12930) 2022-09-24 09:59:23 +08:00
34d6d36ff5 fix transfer to tracker (#12932)
~MemTrackerLimiter() repeated consumption of _untracked_mem, resulting in inaccurate process mem tracker.
2022-09-24 09:01:05 +08:00
943814a86f build extension docs failed fix (#12915)
build extension docs fix
2022-09-23 21:58:02 +08:00
1cb43b7f38 [fix](frontend) fix peerDependencies error (#12373)
```npm install``` problem with peer dependencies in the latest version of npm (v7+) 
Use ```npm install --legacy-peer-deps``` to fix it.

Reference: https://blog.npmjs.org/post/626173315965468672/npm-v7-series-beta-release-and-semver-major
2022-09-23 21:54:52 +08:00
9dc35ab534 [fix](streamload) set coord for streamLoad (#12744)
When a stream load is canceled, status is reported to coord.
2022-09-23 20:23:19 +08:00
7f5970d62f [fix](Nereids): add stats in plan. (#12790)
* [improve](Nereids): add stats for bestPlan and correct fix selectivity
2022-09-23 19:26:49 +08:00
5bfdfac387 [feature-wip](parquet-reader) add parquet reader profile (#12797)
Add profile for parquet reader. New counters:
- ParquetFilteredGroups: Filtered row groups by `RowGroup` min-max statistics
- ParquetReadGroups: The number of row groups to read
- ParquetFilteredRowsByGroup: The number of filtered rows by `RowGroup` min-max statistics
- ParquetFilteredRowsByPage: The number of filtered rows by page min-max statistics
- ParquetFilteredBytes: The filtered bytes by `RowGroup` min-max statistics
- ParquetReadBytes: The total bytes in `ParquetReadGroups`, may be further filtered If a page is skipped as a whole
## Result
```
┌──────────────────────────────────────────────────────┐
│[0: VFILE_SCAN_NODE]                                  │
│(Active: 1s29ms, non-child: 96.42)                    │
│  - Counters:                                         │
│      - BytesRead: 0.00                               │
│      - FileReadCalls: 1.826K (1826)                  │
│      - FileReadTime: 510.627ms                       │
│      - FileRemoteReadBytes: 65.23 MB                 │
│      - FileRemoteReadCalls: 1.146K (1146)            │
│      - FileRemoteReadRate: 128.29331970214844 MB/sec │
│      - FileRemoteReadTime: 508.469ms                 │
│      - NumDiskAccess: 0                              │
│      - NumScanners: 1                                │
│      - ParquetFilteredBytes: 0.00                    │
│      - ParquetFilteredGroups: 0                      │
│      - ParquetFilteredRowsByGroup: 0                 │
│      - ParquetFilteredRowsByPage: 6.600003M (6600003)│
│      - ParquetReadBytes: 2.13 GB                     │
│      - ParquetReadGroups: 20                         │
│      - PeakMemoryUsage: 0.00                         │
│      - PredicateFilteredRows: 3.399797M (3399797)    │
│      - PredicateFilteredTime: 133.302ms              │
│      - RowsRead: 3.399997M (3399997)                 │
│      - RowsReturned: 200                             │
│      - RowsReturnedRate: 194                         │
│      - TotalRawReadTime(*): 726.566ms                │
│      - TotalReadThroughput: 0.0 /sec                 │
│      - WaitScannerTime: 1s27ms                       │
└──────────────────────────────────────────────────────┘
```
2022-09-23 18:42:14 +08:00
f7e3ca29b5 [Opt](Vectorized) Support push down no grouping agg (#12803)
Support push down no grouping agg
2022-09-23 18:29:54 +08:00
a7d42b5d81 [fix](streamload&sink) release and allocate memory in the same tracker (#12820)
1. HttpServer threads allocate bytebuffer and put them into streamload pipe, but scanner thread release them with query tracker.
2. We can assume brpc allocate memory in doris thread.

Above problems leads to wrong result of memtracker.
2022-09-23 17:51:44 +08:00
bd12a49baf [feature](Nereids) enable bucket shuffle join on fragment without scan node (#12891)
In the past, with legacy planner, we could only do bucket shuffle join on the join node belonging to the fragment with at least one scan node.
But, bucket shuffle join should do on each join node that left child's data distribution satisfy join's demand.
In nereids, we have data distribution info on each node. So we could enable bucket shuffle join on fragment without scan node.
2022-09-23 15:01:50 +08:00
c100d24116 [enhancement](Nereids) remove unnecessary ExchangeNode under AssertNumRowsNode (#12841)
current, we always add exchange under AssertNumRowsNode. Nevertheless, if its child node's partition is unpartitioned, no need to add exchange at all.
2022-09-23 14:50:27 +08:00
892e53a15b [fix](test) fix a test failure problem after merging (#12902) 2022-09-23 14:22:29 +08:00
e28e30fe71 [Improvement](statistics) collect statistics in parallel and add test cases (#12839)
This PR mainly improves some functions of the statistics module(#6370):

1. when collecting partition statistics, filter empty partitions in advance and do not generate statistical tasks.
2. the old statistical update method may have problems when updating statistics in parallel, which has been solved.
3. optimize internal-query.
4. add test cases related to statistics.
5. modify some comments as prompted by CheckStyle.
2022-09-23 11:59:53 +08:00
bb36490d95 [test](Nereids) add TPC-H Q2 as regression test case (#12840) 2022-09-23 11:00:31 +08:00
b7eea72d1d [feature-wip](MTMV) Support showing and dropping materialized view for multiple tables (#12762)
Use cases:

mysql> CREATE TABLE t1 (pk INT, v1 INT SUM) AGGREGATE KEY (pk) DISTRIBUTED BY hash (pk) PROPERTIES ('replication_num' = '1');
Query OK, 0 rows affected (0.05 sec)

mysql> CREATE TABLE t2 (pk INT, v2 INT SUM) AGGREGATE KEY (pk) DISTRIBUTED BY hash (pk) PROPERTIES ('replication_num' = '1');
Query OK, 0 rows affected (0.01 sec)

mysql> CREATE MATERIALIZED VIEW mv BUILD IMMEDIATE REFRESH COMPLETE KEY (mv_pk) DISTRIBUTED BY HASH (mv_pk) PROPERTIES ('replication_num' = '1') AS SELECT t1.pk as mv_pk FROM t1, t2 WHERE t1.pk = t2.pk;
Query OK, 0 rows affected (0.02 sec)

mysql> SHOW TABLES;
+---------------+
| Tables_in_dev |
+---------------+
| mv            |
| t1            |
| t2            |
+---------------+
3 rows in set (0.00 sec)

mysql> SHOW CREATE TABLE mv;
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Materialized View | Create Materialized View                                                                                                                                                                                                                                                                                                                                                                                        |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| mv                | CREATE MATERIALIZED VIEW `mv`
BUILD IMMEDIATE REFRESH COMPLETE ON DEMAND
KEY(`mv_pk`)
DISTRIBUTED BY HASH(`mv_pk`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"in_memory" = "false",
"storage_format" = "V2",
"disable_auto_compaction" = "false"
)
AS SELECT `t1`.`pk` AS `mv_pk` FROM `default_cluster:dev`.`t1` , `default_cluster:dev`.`t2` WHERE `t1`.`pk` = `t2`.`pk`; |
+-------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

mysql> DROP MATERIALIZED VIEW mv;
Query OK, 0 rows affected (0.01 sec)
2022-09-23 10:36:40 +08:00
d77ea64ae4 [typo](docs) Changing the Jump Address of SparkLoad in BrokerLoad (#12731) 2022-09-23 09:15:17 +08:00
617820b1f5 [Refactor](parquet) refactor parquet write to uniform and consistent logic (#12730) 2022-09-23 09:12:34 +08:00
0203b36cc4 [regressiontest](test_with)add with_case test (#12814) 2022-09-23 09:10:33 +08:00
84dd3edd0d [Bug](view) Show create view support comment #12838 2022-09-23 09:09:44 +08:00
8fcd8ed8b3 [chore](build) add option to disable -frecord-gcc-switches (#12846) 2022-09-22 15:38:14 +08:00
340784e294 [feature-wip](statistics) add statistics module related syntax (#12766)
This pull request includes some implementations of the statistics(#6370), it adds statistics module related syntax. The current syntax for collecting statistics will not collect statistics (It will collect statistics until test is stable). 

- `ANALYZE` syntax(collect statistics)

```SQL
ANALYZE [[ db_name.tb_name ] [( column_name [, ...] )], ...] [PARTITIONS(...)] [ PROPERTIES(...) ]
```
> db_name.tb_name: collect table and column statistics from tb_name.
> column_name: collect column statistics from column_name.
> properties: properties of statistics jobs.

example:
```SQL
ANALYZE;  -- collect statistics for all tables in the current database
ANALYZE table1(pv, citycode);  -- collect pv and citycode statistics for table1
ANALYZE test.table2 PARTITIONS(partition1); -- collect statistics for partition1 of table2
```

- `SHOW ANALYZE` syntax(show statistics job info)

```SQL
SHOW ANALYZE
    [TABLE | ID]
    [
        WHERE
        [STATE = ["PENDING"|"SCHEDULING"|"RUNNING"|"FINISHED"|"FAILED"|"CANCELLED"]]
    ]
    [ORDER BY ...]
    [LIMIT limit][OFFSET offset];
```

- `SHOW TABLE STATS`syntax(show table statistics)

```SQL
SHOW TABLE STATS [ db_name.tb_name ]  
```

- `SHOW COLUMN STATS` syntax(show column statistics)

```SQL
SHOW COLUMN STATS [ db_name.tb_name ] 
```
2022-09-22 11:15:00 +08:00
3fa820ec50 [feature-wip](statistics) collect statistics by sql task (#12765)
This pull request includes some implementations of the statistics(#6370), it Implements sql-task to collect statistics based on internal-query(#9983).

After the ANALYZE statement is parsed, statistical tasks will be generated. The statistical tasks includes mata-task(get statistics from metadata) and sql-task(get statistics by sql query). For sql-task, it will get statistics such as the row_count, the number of null values, and the maximum value by SQL query.

For statistical tasks, also include sampling sql-task, which will be implemented in the next pr.
2022-09-22 11:13:35 +08:00
70ab9cb43e [feature](http) refactor version info and add new http api for get version info (#12513)
Refactor version info and add new http api for get version info
2022-09-22 10:53:04 +08:00
77e423042c (brpc) donot use pooled brpc (#12754)
It seems that pooled brpc does not release port timely.
2022-09-22 10:00:26 +08:00
57b3c03371 [enhancement](like)pass data to like function in block not in row (#12825)
The like predicate process data in block perform better than in row. Currently, only not null column is optimized, nullable column will be handled later.

SELECT COUNT(*) FROM hits WHERE URL LIKE '%google%';
before: ~680ms
after: ~570ms
2022-09-22 09:59:30 +08:00
32551a7263 [bugfix](predicate column) data maybe wrong if not a single page (#12796)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-09-22 09:55:31 +08:00
6cd4c9ecb5 [bugfix](fe) Fix test_materialized_view_hll case npt. (#12829)
when enable light schema change, run test_materialized_view_hll case throw NullPointerException.
  java.lang.NullPointerException: null
      at org.apache.doris.analysis.SlotDescriptor.setColumn(SlotDescriptor.java:153)
      at org.apache.doris.planner.OlapScanNode.updateSlotUniqueId(OlapScanNode.java:399)
2022-09-22 09:50:53 +08:00
4b95b4e41d [feature-wip](file-scanner)Get column type from parquet schema (#12833)
Get schema from parquet reader.
The new VFileScanner need to get file schema (column name to type map) from parquet file while processing load job, 
this pr is to set the type information for parquet columns.
2022-09-22 09:35:37 +08:00
1ca6d559e4 [feature-wip](parquet-reader) refactor some arguments for parquet reader (#12771)
refactor some arguments for parquet reader 
1. Add new parquet context to wrap reader arguments
2. Reduced some arguments for function call
Co-authored-by: jinzhe <jinzhe@selectdb.com>
2022-09-22 09:34:01 +08:00
cbadbecd9a [fix](Nereids) anti join could not be reorder (#12827) 2022-09-22 09:19:12 +08:00
wxy
1ae7c4e307 [fix](LOAD statement): fix bug for toSql func of LoadStmt. (#12648) 2022-09-22 09:07:46 +08:00
c58e4ca03b [enhancement](Nereids) turn on all reorder rule that needed by zig-zag tree (#12767) 2022-09-22 02:35:31 +08:00
0dee640a3e [feature](Nereids): eliminate filter true and add checker. (#12821) 2022-09-22 02:31:11 +08:00