Commit Graph

317 Commits

Author SHA1 Message Date
e17c2416f0 [fix](join) fix be core dump when using right join with other join predicates (#13511) 2022-10-24 10:35:07 +08:00
3a3def447d [fix](csv-reader) fix bug that csv reader can not read text format hms table (#13515)
1. Missing field and line delimiter
2. When query external table with text(csv) format, we should pass the column position map to BE,
    otherwise the column order is wrong.

TODO:
1. For now, querying a csv file for a non-existent column returns null.
    It should instead return null or the default value of that column.
2. Add regression test after hive docker is ready.
2022-10-22 22:40:03 +08:00
8e19b13f18 [Improvement](runtimefilter) don't allocate memory if all targets are local (#13557) 2022-10-21 21:43:38 +08:00
3006b258b0 [Improvement](bloomfilter) allocate memory for BF in open phase (#13494) 2022-10-21 17:37:26 +08:00
5dde13fb7d [fix](scan)extend_scan_key should not change the range parameter (#13530)
* [fix](scan)extend_scan_key should not change the range parameter

* [fix](scan)new olap scan node has the same issue
2022-10-21 15:17:12 +08:00
d3f65aa746 [Improvement](join) remove unnecessary state for join (#13472) 2022-10-21 09:59:34 +08:00
32b1456b28 [feature-wip](array) remove array config and check array nested depth (#13428)
1. remove FE config `enable_array_type`
2. limit the nested depth of array in FE side.
3. Fix bug that when loading array from parquet, the decimal type is treated as bigint
4. Fix loading array from csv(vec-engine), handle null and "null"
5. Change the csv array loading behavior: if the array string format is invalid in csv, it will be converted to null.
6. Remove `check_array_format()`, because its logic is wrong and meaningless
7. Add stream load csv test cases and more parquet broker load tests
2022-10-20 15:52:31 +08:00
b5cd167713 [fix](hashjoin) fix coredump of hash join in ubsan build (#13479)
* [fix](hashjoin) fix coredump of hash join in ubsan build
2022-10-20 10:16:19 +08:00
f7c69ade18 [feature-wip](multi-catalog) implement predicate pushdown in native OrcReader (#13453)
# Proposed changes
Implement predicate pushdown in `OrcReader` by converting doris `ColumnValueRange` to orc `SearchArgument`.

## Remaining problems
1. ORC supports `not in`, which may affect bloom filters. However, the doris `ScanNode` does not push `not in` down to the file scanner.
2. ORC supports `is null`, and the row range has a `hasNull` identifier. However, `_contain_null` in `ColumnValueRange` is ambiguous: `_contain_null = true` only means that the value can be nullable, not that it equals null.
3. `DateTimeV2` loses microsecond precision in `ColumnValueRange`, which may cause filtering errors when a min-max value equals the predicate value.
4. `DateTimeV1` is not accurate enough; it is only stored to seconds.
5. ORC supports predicate pushdown for `float&double` types, but doris does not push down `float&double` predicates for precision reasons.
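
Remaining problem 3 can be illustrated with a small sketch. It assumes a plain min-max check for a `col <= value` predicate; the function name and the sample timestamps are hypothetical and are not taken from the Doris or ORC code.

```python
from datetime import datetime


def may_match_le(group_min: datetime, value: datetime) -> bool:
    """Min-max check for `col <= value`: a row group can only contain
    matching rows if its minimum is not greater than the value."""
    return group_min <= value


# Hypothetical predicate value with microsecond precision.
predicate_value = datetime(2022, 10, 20, 10, 0, 0, 500000)
# The same value after the microseconds are dropped.
truncated_value = predicate_value.replace(microsecond=0)

# A row group whose minimum equals the original predicate value.
group_min = predicate_value

keep_with_exact_value = may_match_le(group_min, predicate_value)
keep_with_truncated_value = may_match_le(group_min, truncated_value)
```

With the exact value the row group is kept. With the truncated value the check rejects the group even though it holds a row that satisfies the original predicate, which is the kind of filtering error described in item 3.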
2022-10-20 10:07:36 +08:00
f329d33666 [chore](fix) Fix some spell errors in be's comments. #13452 2022-10-20 08:56:01 +08:00
5423de68dd [refactor](new-scan) remove old file scan node (#13433)
All these files are not used anymore, can be removed.
2022-10-19 14:25:32 +08:00
21f233d7e7 [feature-wip](multi-catalog) use apache orc reader to read orc file (#13404)
Use apache orc to read orc file, and convert ColumnVectorBatch to doris block.
2022-10-18 13:47:56 +08:00
125def5102 [enhancement](macOS M1) Support building from source on macOS (M1) (#13195)
# Proposed changes

This PR fixed lots of issues when building from source on macOS with Apple M1 chip.

## ATTENTION

The job for supporting macOS with Apple M1 chip is too big and there are lots of unresolved issues during runtime:
1. Some errors with memory tracker occur when BE (RELEASE) starts.
2. Some UT cases fail.
...

Temporarily, the following changes are made on macOS to start BE successfully.
1. Disable memory tracker.
2. Use tcmalloc instead of jemalloc.

This PR kicks off the job. Guys who are interested in this job can continue to fix these runtime issues.

## Use case

```shell
./build.sh -j 8 --be --clean

cd output/be/bin
ulimit -n 60000
./start_be.sh --daemon
```

## Something else

It takes around _**10+**_ minutes to build BE (with prebuilt third-parties) on macOS with M1 chip. We will improve the development experience on macOS greatly when we finish the adaptation job.
2022-10-18 13:10:13 +08:00
cd3450bd9d [Improvement](join) optimize join probing phase (#13357) 2022-10-18 12:37:17 +08:00
dbf71ed3be [feature-wip](new-scan) Support stream load with csv in new scan framework (#13354)
1. Refactor the file reader creation in FileFactory, for simplicity.
    Previously, FileFactory had too many `create_file_reader` interfaces.
    They are now unified into two categories: the interface used by the previous BrokerScanNode,
    and the interface used by the new FileScanNode.
    The creation methods of readers that read `StreamLoadPipe` are also separated from those of readers that read files.

2. Modify the StreamLoadPlanner on FE side to support using ExternalFileScanNode

3. Now for generic reader, the file reader will be created inside the reader, not passed from the outside.

4. Add some test cases for csv stream load; the behavior is the same as the old broker scanner.
2022-10-17 23:33:41 +08:00
c114d87d13 [Enhancement](array-type) Tuple is null predicate support array type (#13307)
Issue Number: #12689
2022-10-17 18:50:56 +08:00
Pxl
632670a49c [Enhancement](function) refactor of date function (#13362)
refactor of date function
2022-10-16 14:31:26 +08:00
4bc33a54a1 [Fix](agg) fix bitmap agg core dump when phmap pointer assert alignment (#13381) 2022-10-15 10:39:23 +08:00
8218cfed40 [Bug](function) Fix constant predicate evaluation (#13346) 2022-10-15 01:05:29 +08:00
baf2689610 [Improvement](join) compute hash values by vectorized way (#13335) 2022-10-13 16:04:58 +08:00
3e84c04195 [Bug](predicate) fix nullptr in scan node (#13316) 2022-10-13 12:14:42 +08:00
dfe308f501 [Improvement](join) refine prefetch strategy (#13286) 2022-10-12 19:02:06 +08:00
4fc7a048d2 [feature-wip](parquet-reader) fix string test and support decimal64 (#13184)
1. Refactor arguments list of parquet min max filter, pass parquet type for  min max value parsing
2. Fix the filter of string min max

Co-authored-by: jinzhe <jinzhe@selectdb.com>
2022-10-12 16:52:28 +08:00
bb4414e303 [feature-wip](multi-catalog) optimize parquet profile & add null map timer (#13257)
Use indentation to make `ParquetReader`'s profile more readable
Add `ParquetReader.DecodeNullMapTime` to show the time of parsing `NullMap` for `NullableColumn`

```
VFILE_SCAN_NODE  (id=0):(Active:  279.62ms,  %  non-child:  85.83%)
    -  FileReadBytes:  2.36  MB
    -  FileReadCalls:  20
    -  FileReadTime:  5.686ms
    -  MaxScannerThreadNum:  1
    -  NewlyCreateFreeBlocksNum:  125
    -  NumScanners:  1
    -  ParquetReader:  0ns
        -  ColumnReadTime:  259.946ms
        -  DecodeDictTime:  0ns
        -  DecodeHeaderTime:  437.707us
        -  DecodeLevelTime:  30.101us
        -  DecodeNullMapTime:  53.295ms
        -  DecodeValueTime:  62.607ms
        -  DecompressCount:  511
        -  DecompressTime:  1.159ms
        -  FilteredBytes:  0.00  
        -  FilteredGroups:  0
        -  FilteredRowsByGroup:  0
        -  FilteredRowsByPage:  0
        -  ParseMetaTime:  22.517ms
        -  ReadBytes:  2.36  MB
        -  ReadGroups:  20
```
2022-10-12 16:51:06 +08:00
b7621e1615 [feature-wip](new-scan) support csv reader (#13282)
Issue Number: close #12574
This PR adds CsvReader, which implements the GenericReader interface to support reading csv format files.
2022-10-12 16:22:13 +08:00
df54c6b63a [enhancement](memtracker) Add independent and unique scanner mem tracker for each query (#13262) 2022-10-11 19:47:12 +08:00
1724a91f53 [Bug](predicate) Cover all const predicates in scan node (#13238)
For a vectorized expression that satisfies vexpr->is_constant(), a const column is expected to be returned.
But we still don't cover all predicates for const expressions.
For example, for the query SELECT col FROM tbl WHERE 'PROMOTION' LIKE 'AAA%', the LIKE predicate returns a ColumnVector that contains a single value.

This PR covers all const predicates in the scan node, whether they return a const column or not.
2022-10-11 15:49:53 +08:00
c1ce48ffe4 [fix](new-scann) scanner may be marked close twice (#13263) 2022-10-11 15:37:15 +08:00
Pxl
bdcb600f3d [Bug](load) fix core dump on big block load (#13014) 2022-10-10 12:38:32 +08:00
935ef5a598 [feature-wip](new-scan) Add new ES scanner and new ES scan node #13027 2022-10-10 09:56:38 +08:00
dd089259be [feature-wip](multi-catalog) Optimize the performance of boolean & dictionary decoding (#13212)
Generate vector for dictionary data.
Decode boolean values in batch.
2022-10-10 08:41:11 +08:00
Pxl
245490d6b7 [Enhancement](runtime filter) optimize for runtime filter (#12856)
optimize for runtime filter
2022-10-09 14:11:03 +08:00
b81a8789c3 [feature-wip](parquet-reader) optimize the performance of column conversion (#13122)
Convert Parquet column into doris column via batch method.
In the previous implementation, only numeric types can be converted in batches,
and other types can only be inserted one by one.
This process will generate repeated virtual function calls and container expansion.
2022-10-08 18:03:10 +08:00
5214e898d9 [fix](parquet-reader) skip date/datetime column predicate filter to avoid coredump (#13072)
Will be fixed later
Co-authored-by: jinzhe <jinzhe@selectdb.com>
2022-10-08 18:02:35 +08:00
cf2b93532b [fix](file-scanner) fix some logic about broker load with parquet with new file scanner (#13135)
Fix some logic about broker load using new file scanner, with parquet format:

1. If columns are specified in load stmt, but none of them are in parquet file,
    error will be thrown like `err: No columns found in file`. See `parquet_s3_case4`

2. If the first column of the table is not in the file, the resulting number of rows is wrong.
    See `parquet_s3_case8`

3. If column specified in `columns` in load stmt does not exist in file and table,
    error will be thrown like: `failed to find default value expr for slot: x1`. See `parquet_s3_case2`
2022-10-08 13:08:08 +08:00
b41748efa1 [feature-wip](new-scan)Add new jdbc scanner and new jdbc scan node (#12848)
Related pr: #11582
This PR adds the new jdbc scan node and scanner.
2022-10-07 09:55:17 +08:00
d286aa7bf7 [fix](spark-load) no need to filter row group when doing spark load (#13116)
1. Fix issue #13115 
2. Modify the `get_next_block` method of `GenericReader` to return "read_rows" explicitly.
    Some columns in the block may not be filled by the reader; if the first column is not filled, `block->rows()` cannot return the real number of rows.
3. Add more checks for broker load test cases.
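
The reasoning in item 2 can be sketched as follows. This is a minimal model, not the Doris implementation: the `Block` class, the dictionary-based file data, and the `batch_size` parameter are hypothetical stand-ins chosen only to show why a row count derived from the first column can be wrong.

```python
from typing import Dict, List


class Block:
    """Hypothetical stand-in for a block: an ordered set of named columns."""

    def __init__(self, column_names: List[str]) -> None:
        self.columns: Dict[str, list] = {name: [] for name in column_names}

    def rows(self) -> int:
        # Mirrors the pitfall described above: the row count is taken
        # from the first column only.
        first_column = next(iter(self.columns.values()))
        return len(first_column)


def get_next_block(block: Block, file_data: Dict[str, list], batch_size: int) -> int:
    """Fill the columns that exist in the file and return the rows read."""
    read_rows = 0
    for name, values in file_data.items():
        if name in block.columns:
            chunk = values[:batch_size]
            block.columns[name].extend(chunk)
            read_rows = max(read_rows, len(chunk))
    return read_rows


# The table has columns k1 and v1, but the file only provides v1.
block = Block(["k1", "v1"])
read_rows = get_next_block(block, {"v1": [10, 20, 30]}, batch_size=1024)
```

Here the reader reports three rows through its return value, while `block.rows()` still reports zero because the first column `k1` was never filled.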
2022-10-05 23:00:56 +08:00
7b75c2df54 [fix](BE) fix the stream load error when upgrade BE from 1.1.2 to master (#13058) 2022-10-05 12:13:26 +08:00
026ffaf10d [feature-wip](parquet-reader) add detail profile for parquet reader (#13095)
Add more detail profile for ParquetReader:
ParquetColumnReadTime: the total time of reading parquet columns
ParquetDecodeDictTime: time to parse dictionary page
ParquetDecodeHeaderTime: time to parse page header
ParquetDecodeLevelTime: time to parse page's definition/repetition level
ParquetDecodeValueTime: time to decode page data into doris column
ParquetDecompressCount: counter of decompressing page data
ParquetDecompressTime: time to decompress page data
ParquetParseMetaTime: time to parse parquet meta data
2022-10-02 15:11:48 +08:00
287ff50a6f [Bug](datev2) Fix compatible error between datev2 and date (#13024) 2022-09-29 18:01:55 +08:00
820ec435ce [feature-wip](parquet-reader) refactor parquet_predicate (#12896)
This change serves the following purposes:
1. Use ScanPredicate instead of TCondition for external tables, so that the old code branch can be reused.
2. Simplify and delete some useless old code.
3. Use ColumnValueRange to save predicates.
2022-09-28 21:27:13 +08:00
1ba9e4b568 [Improvement](sort) Reuse memory in sort node (#12921) 2022-09-28 09:44:35 +08:00
d80b7b9689 [feature-wip](new-scan) support more load situation (#12953) 2022-09-27 21:48:32 +08:00
5790d23624 [fix](transfer_thread) fix the loss of notification. (#12988) 2022-09-27 08:44:02 +08:00
Pxl
8731eea26e [Chore](clang) fix some build fail on clang15 (#12882)
remove unused variables
2022-09-26 23:13:28 +08:00
acd5d67355 [feature-wip](new-scan)Add new odbc scanner and new odbc scan node (#12899) 2022-09-26 09:24:25 +08:00
692176ec07 [feature-wip](parquet-reader) pre read page data in advance to avoid frequent seek (#12898)
1. Fix the bug of file position in `HdfsFileReader`
2. Reserve enough buffer for `ColumnColumnReader` to read large continuous memory
2022-09-25 21:21:06 +08:00
f1a64ea09f [fix](new-scan)Fix new scanner load job bugs (#12903)
Fix bugs:
1. FE needs to send the file format (e.g. parquet, orc ...) to BE while processing load jobs using the new scanner.
2. Try to get the parquet file column type from SchemaElement.type before falling back to the Logical type and Converted type.
2022-09-24 17:21:19 +08:00
7b230e41a8 [bugfix](scanner) olap scanner compute is wrong (#12857)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-09-24 09:59:59 +08:00
5bfdfac387 [feature-wip](parquet-reader) add parquet reader profile (#12797)
Add profile for parquet reader. New counters:
- ParquetFilteredGroups: Filtered row groups by `RowGroup` min-max statistics
- ParquetReadGroups: The number of row groups to read
- ParquetFilteredRowsByGroup: The number of filtered rows by `RowGroup` min-max statistics
- ParquetFilteredRowsByPage: The number of filtered rows by page min-max statistics
- ParquetFilteredBytes: The filtered bytes by `RowGroup` min-max statistics
- ParquetReadBytes: The total bytes in `ParquetReadGroups`, which may be further filtered if a page is skipped as a whole
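
How the row-group counters relate to each other can be sketched with a simple min-max filter. The counter names come from the list above; the `RowGroupStats` type, the integer column, and the range predicate are assumptions made for illustration, and page-level filtering (`ParquetFilteredRowsByPage`) is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group statistics for a single integer column."""
    min_value: int
    max_value: int
    num_rows: int
    total_bytes: int


def filter_row_groups(groups: List[RowGroupStats], lower: int, upper: int) -> Dict[str, int]:
    """Apply `lower <= col <= upper` to min-max statistics and count the outcome."""
    counters = {
        "ParquetFilteredGroups": 0,
        "ParquetReadGroups": 0,
        "ParquetFilteredRowsByGroup": 0,
        "ParquetFilteredBytes": 0,
        "ParquetReadBytes": 0,
    }
    for group in groups:
        # The group cannot match if its value range lies entirely outside the predicate range.
        if group.max_value < lower or group.min_value > upper:
            counters["ParquetFilteredGroups"] += 1
            counters["ParquetFilteredRowsByGroup"] += group.num_rows
            counters["ParquetFilteredBytes"] += group.total_bytes
        else:
            counters["ParquetReadGroups"] += 1
            counters["ParquetReadBytes"] += group.total_bytes
    return counters


groups = [
    RowGroupStats(min_value=0, max_value=9, num_rows=100, total_bytes=1000),
    RowGroupStats(min_value=10, max_value=19, num_rows=200, total_bytes=2000),
    RowGroupStats(min_value=20, max_value=29, num_rows=300, total_bytes=3000),
]
counters = filter_row_groups(groups, lower=12, upper=15)
```

In this example only the second group overlaps the predicate range, so two groups are counted as filtered and one as read.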
## Result
```
┌──────────────────────────────────────────────────────┐
│[0: VFILE_SCAN_NODE]                                  │
│(Active: 1s29ms, non-child: 96.42)                    │
│  - Counters:                                         │
│      - BytesRead: 0.00                               │
│      - FileReadCalls: 1.826K (1826)                  │
│      - FileReadTime: 510.627ms                       │
│      - FileRemoteReadBytes: 65.23 MB                 │
│      - FileRemoteReadCalls: 1.146K (1146)            │
│      - FileRemoteReadRate: 128.29331970214844 MB/sec │
│      - FileRemoteReadTime: 508.469ms                 │
│      - NumDiskAccess: 0                              │
│      - NumScanners: 1                                │
│      - ParquetFilteredBytes: 0.00                    │
│      - ParquetFilteredGroups: 0                      │
│      - ParquetFilteredRowsByGroup: 0                 │
│      - ParquetFilteredRowsByPage: 6.600003M (6600003)│
│      - ParquetReadBytes: 2.13 GB                     │
│      - ParquetReadGroups: 20                         │
│      - PeakMemoryUsage: 0.00                         │
│      - PredicateFilteredRows: 3.399797M (3399797)    │
│      - PredicateFilteredTime: 133.302ms              │
│      - RowsRead: 3.399997M (3399997)                 │
│      - RowsReturned: 200                             │
│      - RowsReturnedRate: 194                         │
│      - TotalRawReadTime(*): 726.566ms                │
│      - TotalReadThroughput: 0.0 /sec                 │
│      - WaitScannerTime: 1s27ms                       │
└──────────────────────────────────────────────────────┘
```
2022-09-23 18:42:14 +08:00