Commit Graph

837 Commits

Author SHA1 Message Date
295d887cf5 [improvement](thread) set name for priority thread pool (#13552) 2022-10-26 09:32:15 +08:00
3006b258b0 [Improvement](bloomfilter) allocate memory for BF in open phase (#13494) 2022-10-21 17:37:26 +08:00
ccc04210d6 [feature](jsonb type) functions for cast from and to jsonb datatype (#13379) 2022-10-21 15:18:16 +08:00
9dc5dd382a [enhancement](memtracker) Fix Brpc mem count and refactored thread context macro (#13469) 2022-10-21 12:01:38 +08:00
Pxl
1892e8f66e [Enhancement](scanner) support split avg key range (#13166) 2022-10-20 14:53:16 +08:00
2b328eafbb [function](string_function) add new string function 'extract_url_parameter' (#13323) 2022-10-20 11:11:43 +08:00
4996eafe74 [bugfix](VecDateTimeValue) eat the value of microsecond in function from_date_format_str (#13446)
* [bugfix](VecDateTimeValue) eat the value of microsecond in function from_date_format_str

* add sql based regression test

Co-authored-by: xiaojunjie <xiaojunjie@baidu.com>
2022-10-20 09:02:33 +08:00
f329d33666 [chore](fix) Fix some spell errors in be's comments. #13452 2022-10-20 08:56:01 +08:00
c449028a5f [fix](year) fix year() results are not as expected (#13426)
fix `year()` results are not as expected
2022-10-19 11:28:00 +08:00
755a946516 [feature](jsonb) jsonb functions (#13366)
Issue Number: Step3 of DSIP-016: Support JSON type
2022-10-19 08:44:08 +08:00
6d322f85ac [improvement](compaction) delete num based compaction policy (#13409)
Co-authored-by: yixiutt <yixiu@selectdb.com>
2022-10-18 16:13:28 +08:00
125def5102 [enhancement](macOS M1) Support building from source on macOS (M1) (#13195)
# Proposed changes

This PR fixed lots of issues when building from source on macOS with Apple M1 chip.

## ATTENTION

The job for supporting macOS with Apple M1 chip is too big and there are lots of unresolved issues during runtime:
1. Some errors with memory tracker occur when BE (RELEASE) starts.
2. Some UT cases fail.
...

Temporarily, the following changes are made on macOS to start BE successfully.
1. Disable memory tracker.
2. Use tcmalloc instead of jemalloc.

This PR kicks off the job. Guys who are interested in this job can continue to fix these runtime issues.

## Use case

```shell
./build.sh -j 8 --be --clean

cd output/be/bin
ulimit -n 60000
./start_be.sh --daemon
```

## Something else

It takes around _**10+**_ minutes to build BE (with prebuilt third-parties) on macOS with M1 chip. We will improve the  development experience on macOS greatly when we finish the adaptation job.
2022-10-18 13:10:13 +08:00
f0dbbe5b46 [Bug](funciton) fix repeat coredump when step is to long (#13408) 2022-10-18 09:55:06 +08:00
87a6b1a13b [enhancement](memtracker) Fix bthread local consume mem tracker (#13368)
Previously, bthread_getspecific was called every time bthread local was used. In the test at #10823, it was found that frequent calls to bthread_getspecific had performance problems.

So a cache is implemented on pthread local based on the btls key, but the btls key cannot correctly sense bthread switching.

So, based on bthread_self to get the bthread id to implement the cache.
2022-10-17 18:31:07 +08:00
045bccdbea [Feature](Retention) support retention function (#13056) 2022-10-17 11:00:47 +08:00
a83eaddfcf [test](cache)Add remote cache ut (#13377) 2022-10-16 23:59:50 +08:00
1d5ba9cbcc [Improvement](like) Change like function to batch call (#13314) 2022-10-16 16:18:22 +08:00
4a5095f00d [cleanup](config) remove unused config push_write_mbytes_per_sec (#13290) 2022-10-12 15:58:04 +08:00
1bd14f1d82 [feature-wip](jsonb) jsonb parse function and load (#13129)
add function to parse json string to jsonb format and use it to support stream load.
2022-10-12 13:56:37 +08:00
16999ef02d [Vectorized][Function] support date_trunc and countequal function (#13039) 2022-10-12 10:01:09 +08:00
334708dc8c [fix](memory): avoid coredump when list pointer is null (#12919) 2022-10-11 16:00:23 +08:00
Pxl
bdcb600f3d [Bug](load) fix core dump on big block load (#13014) 2022-10-10 12:38:32 +08:00
dc2d33298b [chore](be config) remove config use_mmap_allocate_chunk #13196
This config is never used online and there exist bugs if enable this config. So that I remove this config and related tests.


Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-10-09 16:19:59 +08:00
Pxl
245490d6b7 [Enhancement](runtime filter) optimize for runtime filter (#12856)
optimize for runtime filter
2022-10-09 14:11:03 +08:00
d286aa7bf7 [fix](spark-load) no need to filter row group when doing spark load (#13116)
1. Fix issue #13115 
2. Modify the method of `get_next_block` or `GenericReader`, to return "read_rows" explicitly.
    Some columns in block may not be filled in reader, if the first column is not filled, use `block->rows()` can not return real row numbers.
3. Add more checks for broker load test cases.
2022-10-05 23:00:56 +08:00
026ffaf10d [feature-wip](parquet-reader) add detail profile for parquet reader (#13095)
Add more detail profile for ParquetReader:
ParquetColumnReadTime: the total time of reading parquet columns
ParquetDecodeDictTime: time to parse dictionary page
ParquetDecodeHeaderTime: time to parse page header
ParquetDecodeLevelTime: time to parse page's definition/repetition level
ParquetDecodeValueTime: time to decode page data into doris column
ParquetDecompressCount: counter of decompressing page data
ParquetDecompressTime: time to decompress page data
ParquetParseMetaTime: time to parse parquet meta data
2022-10-02 15:11:48 +08:00
e7f18e998a [chore](be-ut) Remove useless lines which cause compilation errors (#13053) 2022-09-30 11:26:25 +08:00
820ec435ce [feature-wip](parquet-reader) refactor parquet_predicate (#12896)
This change serves the  following purposes:
1.  use ScanPredicate instead of TCondition for external table, it can reuse old code branch.
2. simplify and delete some useless old code
3.  use ColumnValueRange to save predicate
2022-09-28 21:27:13 +08:00
d80b7b9689 [feature-wip](new-scan) support more load situation (#12953) 2022-09-27 21:48:32 +08:00
Pxl
9607f60845 [Feature](serialize) move block_data_version to fe heart beat (#12667)
Move block_data_version from be config to fe heart beat
2022-09-27 18:25:54 +08:00
1bb42a7bc0 [function](hash) add support of murmur_hash3_64 (#12923) 2022-09-26 14:23:37 +08:00
692176ec07 [feature-wip](parquet-reader) pre read page data in advance to avoid frequent seek (#12898)
1. Fix the bug of file position in `HdfsFileReader`
2. Reserve enough buffer for `ColumnColumnReader` to read large continuous memory
2022-09-25 21:21:06 +08:00
59699a4321 [feature](JSON datatype)Support JSON datatype (#10322)
Add `JSON` datatype, following features are implemented by this PR:
1. `CREATE` tables with `JSON` type columns
2. `INSERT` values containing `JSON` type value stored in `String`, which is represented as binary format(AKA `JSONB`) at BE 
3. `SELECT` JSON columns

Detail design refers [DSIP-016: Support JSON type](https://cwiki.apache.org/confluence/display/DORIS/DSIP-016%3A+Support+JSON+type)

* add JSONB data storage format type

* fix JsonLiteral resolve bug

* add DataTypeJson case in data_type_factory

* add JSON syntax check in FE

* add operators for jsonb_document, currently not support comparison between any JSON type value

* add ColumnJson and DataTypeJson

* add JsonField to store JsonValue

* add JsonValue to convert String JSON to BINARY JSON and JsonLiteral case for vliteral

* add push_json for MysqlResultWriter

* JSON column need no zone_map_index

* Revert "JSON column need no zone_map_index"

This reverts commit f71d1ce1ded9dbae44a5d58abcec338816b70d79.

* add JSON writer and reader, ignore zone-map for JSON column

* add json_to_string for DataTypeJson

* add olap_data_convertor for JSON type

* add some enum

* add OLAP_FIELD_TYPE_JSON type, FieldTypeTraits for it and corresponding cases or functions

* fix column_json offsets overflow bug, format code

* remove useless TODOs, add CmpType cases for JSON type

* add license header

* format license

* format be codes

* resolve rebase master conflicts

* fix bugs for CREATE and meta related code

* refactor JsonValue constructors, add fe JSON cases and fix some bugs, reformat codes

* modification be codes along code review advice

* fix rebase conflicts with master

* add unit test for json_value and column_json

* fix rebase error

* rename json to jsonb

* fix some data convert bugs, set Mysql type to JSON
2022-09-25 14:06:49 +08:00
5bfdfac387 [feature-wip](parquet-reader) add parquet reader profile (#12797)
Add profile for parquet reader. New counters:
- ParquetFilteredGroups: Filtered row groups by `RowGroup` min-max statistics
- ParquetReadGroups: The number of row groups to read
- ParquetFilteredRowsByGroup: The number of filtered rows by `RowGroup` min-max statistics
- ParquetFilteredRowsByPage: The number of filtered rows by page min-max statistics
- ParquetFilteredBytes: The filtered bytes by `RowGroup` min-max statistics
- ParquetReadBytes: The total bytes in `ParquetReadGroups`, may be further filtered If a page is skipped as a whole
## Result
```
┌──────────────────────────────────────────────────────┐
│[0: VFILE_SCAN_NODE]                                  │
│(Active: 1s29ms, non-child: 96.42)                    │
│  - Counters:                                         │
│      - BytesRead: 0.00                               │
│      - FileReadCalls: 1.826K (1826)                  │
│      - FileReadTime: 510.627ms                       │
│      - FileRemoteReadBytes: 65.23 MB                 │
│      - FileRemoteReadCalls: 1.146K (1146)            │
│      - FileRemoteReadRate: 128.29331970214844 MB/sec │
│      - FileRemoteReadTime: 508.469ms                 │
│      - NumDiskAccess: 0                              │
│      - NumScanners: 1                                │
│      - ParquetFilteredBytes: 0.00                    │
│      - ParquetFilteredGroups: 0                      │
│      - ParquetFilteredRowsByGroup: 0                 │
│      - ParquetFilteredRowsByPage: 6.600003M (6600003)│
│      - ParquetReadBytes: 2.13 GB                     │
│      - ParquetReadGroups: 20                         │
│      - PeakMemoryUsage: 0.00                         │
│      - PredicateFilteredRows: 3.399797M (3399797)    │
│      - PredicateFilteredTime: 133.302ms              │
│      - RowsRead: 3.399997M (3399997)                 │
│      - RowsReturned: 200                             │
│      - RowsReturnedRate: 194                         │
│      - TotalRawReadTime(*): 726.566ms                │
│      - TotalReadThroughput: 0.0 /sec                 │
│      - WaitScannerTime: 1s27ms                       │
└──────────────────────────────────────────────────────┘
```
2022-09-23 18:42:14 +08:00
f7e3ca29b5 [Opt](Vectorized) Support push down no grouping agg (#12803)
Support push down no grouping agg
2022-09-23 18:29:54 +08:00
1ca6d559e4 [feature-wip](parquet-reader) refactor some arguments for parquet reader (#12771)
refactor some arguments for parquet reader 
1. Add new parquet context to wrap reader arguments
2. Reduced some arguments for function call
Co-authored-by: jinzhe <jinzhe@selectdb.com>
2022-09-22 09:34:01 +08:00
b41eaa5ac0 [fix](memtracker) Introduce orphan mem tracker to verify memory tracking accuracy (#12794)
The mem hook consumes the orphan tracker by default. If the thread does not attach other trackers, by default all consumption will be passed to the process tracker through the orphan tracker.

In real time, consumption of all other trackers + orphan tracker consumption = process tracker consumption.

Ideally, all threads are expected to attach to the specified tracker, so that "all memory has its own ownership", and the consumption of the orphan mem tracker is close to 0, but greater than 0.
2022-09-21 15:47:10 +08:00
bd4bfa8f00 [fix](memtracker) Fix thread mem tracker try consume accuracy #12782 2022-09-21 09:20:41 +08:00
d435f0de41 [feature-wip](parquet-reader) add page index row range (#12652)
Add some utils and provide the candidate row range  (generated with skipped row range of each column) 
to read for page index filter
this version support binary operator filter

todo: 
- use context instead of structures in close() 
- process complex type filter
- use this instead of row group minmax filter
- refactor _eval_binary() for row group filter and page index filter
2022-09-20 10:36:19 +08:00
fb9e48a34a [fix](vstream load) Fix bug when load json with jsonpath (#12660) 2022-09-19 10:13:18 +08:00
e01986b8b9 [feature](light-schema-change) fix light-schema-change and add more cases (#12160)
Fix _delete_sign_idx and _seq_col_idx when append_column or build_schema when load.
Tablet schema cache support recycle when schema sptr use count equals 1.
Add a http interface for flink-connector to sync ddl.
Improve tablet->tablet_schema() by max_version_schema.
2022-09-17 11:29:36 +08:00
a97f63141e [fix](cast) Add validity check for date conversion for non-vectorization (#12608)
actual result
select cast("0.0000031417" as date);
+------------------------------+
| CAST('0.0000031417' AS DATE) |
+------------------------------+
| 2000-00-00 |
+------------------------------+

expect result
select cast("0.0000031417" as date);
+------------------------------+
| CAST('0.0000031417' AS DATE) |
+------------------------------+
| NULL |
+------------------------------+
2022-09-16 09:08:53 +08:00
b136d80e1a [enhancement](compress) reuse compression ctx and buffer (#12573)
Reuse compression ctx and buffer.
Use a global instance for every compression algorithm, and use a
thread saft buffer pool to reuse compression buffer, pool size is equal
to max parallel thread num in compression, and this will not be too large.

Test shows this feature increase 5% of data import and compaction.

Co-authored-by: yixiutt <yixiu@selectdb.com>
2022-09-15 10:59:46 +08:00
c5ad989065 [refactor](reader) refactor the interface of file reader (#12574)
Currently, Doris has a variety of readers for different file formats,
such as parquet reader, orc reader, csv reader, json reader and so on.

The interfaces of these readers are not unified, which makes it impossible to call them through a unified method.

In this PR, I added a `GenericReader` interface class, and other Readers will implement this interface class
to use the `get_next_block()` method.

This PR currently only modifies `arrow_reader` and `parquet reader`.
Other readers will be modified one by one in subsequent PRs.
2022-09-14 22:31:11 +08:00
Pxl
0ead048b93 [Enhancement](column) remove ColumnString terminating zero and add a data_version for pblock (#12456)
1. remove ColumnString terminating zero
    2. add a data_version for pblock
    3. change EncryptionMode to enum class
2022-09-14 21:25:22 +08:00
e33f4f90ae [fix](exec) Avoid query thread block on wait_for_start (#12411)
When FE send cancel rpc to BE, it does not notify the wait_for_start() thread, so that the fragment will be blocked and occupy the execution thread.
Add a max wait time for wait_for_start() thread. So that it will not block forever.
2022-09-13 08:57:37 +08:00
c8e9a32bb2 [Function](cbrt)Add cbrt function for doris (#12523)
Add cbrt function for doris
2022-09-12 19:58:45 +08:00
c3af60eff8 [fix](threadpool) threadpool schedules does not work right on concurr… (#12370)
* [fix](threadpool) threadpool schedules does not work right on concurrent token

Assuming there is a concurrent thread token whose concurrency is 2, and the 1st
submit on the token is submitted to threadpool while the 2nd is not submitted due
to busy. The token's active_threads is 1, then thread pool does not schedule the
token.

The patch fixes the problem.
2022-09-08 14:54:46 +08:00
26cf2d3742 [enhancement](array-type) avoid abuse of Offset and Offset64 #12378
We already separate Array Offset64 and String Offset(32bit) in PR: #12341

Now we limit: Offset inside IColumn, Offset64 only inside ColumnArray, to avoid abuse of them.
If we use the wrong one, it will compile failed.
2022-09-08 14:53:07 +08:00
09b45f2b71 [Function](ELT)Add elt function (#12321) 2022-09-07 15:21:08 +08:00