Commit Graph

71 Commits

Author SHA1 Message Date
7182f14645 [improvement][fix](multi-catalog) speed up list partition prune (#14268)
In previous implementation, when doing list partition prune, we need to generation `rangeToId`
every time we doing prune.
But `rangeToId` is actually a static data that should be create-once-use-every-where.

So for hive partition, I created the `rangeToId` and all other necessary data structures for partition prunning
in partition cache, so that we can use it directly.

In my test, the cost of partition prune for 10000 partitions reduce from 8s -> 0.2s.

Aslo add "partition" info in explain string for hive table.
```
|   0:VEXTERNAL_FILE_SCAN_NODE                           |
|      predicates: `nation` = '0024c95b'                 |
|      inputSplitNum=1, totalFileSize=4750, scanRanges=1 |
|      partition=1/10000                                 |
|      numNodes=1                                        |
|      limit: 10                                         |
```

Bug fix:
1. Fix bug that es scan node can not filter data
2. Fix bug that query es with predicate like `where substring(test2,2) = "ext2";` will fail at planner phase.
`Unexpected exception: org.apache.doris.analysis.FunctionCallExpr cannot be cast to org.apache.doris.analysis.SlotRef`

TODO:
1. Some problem when quering es version 8: ` Unexpected exception: Index: 0, Size: 0`, will be fixed later.
2022-11-17 08:30:03 +08:00
20634ab7e3 [feature-wip](multi-catalog) support partition&missing columns in parquet lazy read (#14264)
PR https://github.com/apache/doris/pull/13917 has supported lazy read for non-predicate columns in ParquetReader, 
but can't trigger lazy read when predicate columns are partition or missing columns.
This PR support such case, and fill partition and missing columns in `FileReader`.
2022-11-16 08:43:11 +08:00
6d2e6d85d3 [enhancement](be)release memory in Node's close() method (#14258)
* [enhancement](be)release memory in Node's close() method

* format code
2022-11-15 15:59:23 +08:00
d55faa7f6a [feature](remote)Only query can use local cache when reading remote files. (#13865)
When calling select on remote files, download cache files to local disk.
When calling alter table on remote files, read files directly from remote storage. So if tablet is too large, it will not take up too many local disk when creating local cache file.
2022-11-14 10:30:15 +08:00
376b4fda9f [fix](scankey) fix extended scan key errors. (#14200)
Issue Number: close #14199
2022-11-12 20:44:09 +08:00
035657c5a1 [typo](comment) Fix a lot of spell errors in be comments (#14208)
fix typos.
2022-11-12 16:06:15 +08:00
12652ebb0e [UDF](java udf) using config to enable java udf instead of macro at compile time (#14062)
* [UDF](java udf) useing config to enable java udf instead of macro at compile time
2022-11-11 09:03:52 +08:00
6bd5378f66 [feature-wip](multi-catalog) lazy read for ParquetReader (#13917)
Read predicate columns firstly, and use VExprContext(push-down predicates)
to generate the select vector, which is then applied to read the non-predicate columns.
The data in non-predicate columns may be skipped by select vector, so the value-decode-time can be reduced.
If a whole page can be skipped, the decompress-time can also be reduced.
2022-11-10 16:56:14 +08:00
Pxl
0e26f28bf2 [Enhancement](runtime-filter) enlarge runtime filter in predicate threshold (#13581)
enlarge runtime filter in predicate threshold
2022-11-10 15:48:46 +08:00
a73f4dfdc1 [fix](memtracker) Fix scanner thread ending after fragment thread causing mem tracker null pointer #14143 2022-11-10 15:42:53 +08:00
Pxl
794a551b0f [Enhancement][fix](profile)() modify some profiles (#14074)
1. add RemainedDownPredicates
2. fix core dump when _scan_ranges is empty
3. fix invalid memory access on vLiteral's debug_string()
4. enlarge mv test wait time
2022-11-09 21:59:28 +08:00
0b945fe361 [enhancement](memtracker) Refactor mem tracker hierarchy (#13585)
mem tracker can be logically divided into 4 layers: 1)process 2)type 3)query/load/compation task etc. 4)exec node etc.

type includes

enum Type {
        GLOBAL = 0,        // Life cycle is the same as the process, e.g. Cache and default Orphan
        QUERY = 1,         // Count the memory consumption of all Query tasks.
        LOAD = 2,          // Count the memory consumption of all Load tasks.
        COMPACTION = 3,    // Count the memory consumption of all Base and Cumulative tasks.
        SCHEMA_CHANGE = 4, // Count the memory consumption of all SchemaChange tasks.
        CLONE = 5, // Count the memory consumption of all EngineCloneTask. Note: Memory that does not contain make/release snapshots.
        BATCHLOAD = 6,  // Count the memory consumption of all EngineBatchLoadTask.
        CONSISTENCY = 7 // Count the memory consumption of all EngineChecksumTask.
    }
Object pointers are no longer saved between each layer, and the values of process and each type are periodically aggregated.

other fix:

In [fix](memtracker) Fix transmit_tracker null pointer because phamp is not thread safe #13528, I tried to separate the memory that was manually abandoned in the query from the orphan mem tracker. But in the actual test, the accuracy of this part of the memory cannot be guaranteed, so put it back to the orphan mem tracker again.
2022-11-08 09:52:33 +08:00
6ed443c7e8 [enhancement](profile) add instanceNum, tableIds to profile. (#13985) 2022-11-08 08:49:16 +08:00
95591ce49a [refactor](cv)wait on condition variable more gently (#12620) 2022-11-08 08:40:31 +08:00
37e4a1769d [fix](sequence) fix that update table core dump with sequence column (#13847)
* [fix](sequence) fix that update table core dump with sequence column

* update
2022-11-03 09:02:21 +08:00
7b4c2cabb4 [feature](new-scan) support transactional insert in new scan framework (#13858)
Support running transactional insert operation with new scan framework. eg:

admin set frontend config("enable_new_load_scan_node" = "true");
begin;
insert into tbl1 values(1,2);
insert into tbl1 values(3,4);
insert into tbl1 values(5,6);
commit;
Add some limitation to transactional insert

Do not support non-literal value in insert stmt
Fix some issue about array type:

Forbid cast other non-array type to NESTED array type, it may cause BE crash.
Add getStringValueForArray() method for Expr, to get valid string-formatted array type value.
Add useLocalSessionState=true in regression-test jdbc url
without this config, the jdbc driver will send some init cmd each time it connect to server, such as
select @@session.tx_read_only.
But when we use transactional insert, after begin command, Doris do not support any other type of
stmt except for insert, commit or rollback.
So adding this config to let the jdbc NOT send cmd when connecting.
2022-11-03 08:36:07 +08:00
Pxl
be124523f4 [enhancement](profile) add profile to show column predicates (#13862) 2022-11-02 09:07:26 +08:00
2fb218173e [improvement](scan) change the max thread num and num of free blocks in new scan (#13793)
1. 
In the previous implementation, the max thread num of olap scanner was set relatively small, such as 3.
which would slow down some of queries.
In this PR, I changed the max thread num  to a quarter of the scaner thread pool(default is 12),
which is less than the old scan node's max thread num, but larger than the previous implementation.
The upper limit of the max thread num of the old scan node is too high, which is not reasonable.

2.
Lower down the number of pre allocated free blocks.
2022-10-31 14:00:06 +08:00
3c95106d45 [Bug](jdbc) Fix memory leak for JDBC datasource (#13657) 2022-10-27 00:02:25 +08:00
c418bbd2d1 [feature-wip](new-scan) support Json reader (#13546)
Issue Number: close #12574
This pr adds `NewJsonReader` which implements GenericReader interface to support read json format file.

TODO:
1. modify `_scann_eof` later.
2. Rename `NewJsonReader` to `JsonReader` when `JsonReader` is deleted.
2022-10-26 12:52:21 +08:00
44c9163b3c [Fix](multi-catalog)Fix partition external table query bug. (#13535)
The index for external table columns from path is incorrect in new scanner. This is a fix for it.
e.g. In the next query, nation and city columns are from path
```
mysql> select nation, city, count(*) from parquet_two_part group by nation, city;
+--------+------------+----------+
| nation | city       | count(*) |
+--------+------------+----------+
| cn     | beijing    |  1199969 |
| cn     | shanghai   |  1199771 |
| jp     | tokyo      |   599715 |
| rus    | moscow     |   600659 |
| us     | chicago    |  1199805 |
| us     | washington |  1201296 |
+--------+------------+----------+
6 rows in set (0.39 sec)
```
2022-10-26 12:47:37 +08:00
295d887cf5 [improvement](thread) set name for priority thread pool (#13552) 2022-10-26 09:32:15 +08:00
3a3def447d [fix](csv-reader) fix bug that csv reader can not read text format hms table (#13515)
1. Missing field and line delimiter
2. When query external table with text(csv) format, we should pass the column position map to BE,
    otherwise the column order is wrong.

TODO:
1. For now, if we query csv file with non-exist column, it will return null.
    But it should return null or default value of that column.
2. Add regression test after hive docker is ready.
2022-10-22 22:40:03 +08:00
8e19b13f18 [Improvement](runtimefilter) don nott allocate memory if all targets are local (#13557) 2022-10-21 21:43:38 +08:00
3006b258b0 [Improvement](bloomfilter) allocate memory for BF in open phase (#13494) 2022-10-21 17:37:26 +08:00
5dde13fb7d [fix](scan)extend_scan_key should not change the range parameter (#13530)
* [fix](scan)extend_scan_key should not change the range parameter

* [fix](scan)new olap scan node has the same issue
2022-10-21 15:17:12 +08:00
32b1456b28 [feature-wip](array) remove array config and check array nested depth (#13428)
1. remove FE config `enable_array_type`
2. limit the nested depth of array in FE side.
3. Fix bug that when loading array from parquet, the decimal type is treated as bigint
4. Fix loading array from csv(vec-engine), handle null and "null"
5. Change the csv array loading behavior, if the array string format is invalid in csv, it will be converted to null. 
6. Remove `check_array_format()`, because it's logic is wrong and meaningless
7. Add stream load csv test cases and more parquet broker load tests
2022-10-20 15:52:31 +08:00
5423de68dd [refactor](new-scan) remove old file scan node (#13433)
All these files are not used anymore, can be removed.
2022-10-19 14:25:32 +08:00
21f233d7e7 [feature-wip](multi-catalog) use apache orc reader to read orc file (#13404)
Use apache orc to read orc file, and convert ColumnVectorBatch to doris block.
2022-10-18 13:47:56 +08:00
dbf71ed3be [feature-wip](new-scan) Support stream load with csv in new scan framework (#13354)
1. Refactor the file reader creation in FileFactory, for simplicity.
    Previously, FileFactory had too many `create_file_reader` interfaces.
    Now unified into two categories: the interface used by the previous BrokerScanNode,
    and the interface used by the new FileScanNode.
    And separate the creation methods of readers that read `StreamLoadPipe` and other readers that read files.

2. Modify the StreamLoadPlanner on FE side to support using ExternalFileScanNode

3. Now for generic reader, the file reader will be created inside the reader, not passed from the outside.

4. Add some test cases for csv stream load, the behavior is same as the old broker scanner.
2022-10-17 23:33:41 +08:00
8218cfed40 [Bug](function) Fix constant predicate evaluation (#13346) 2022-10-15 01:05:29 +08:00
3e84c04195 [Bug](predicate) fix nullptr in scan node (#13316) 2022-10-13 12:14:42 +08:00
b7621e1615 [feature-wip](new-scan) support csv reader (#13282)
Issue Number: close #12574
This pr adds CsvReader which implements GenericReader interface to support read csv format file.
2022-10-12 16:22:13 +08:00
df54c6b63a [enhancement](memtracker) Add independent and unique scanner mem tracker for each query (#13262) 2022-10-11 19:47:12 +08:00
1724a91f53 [Bug](predicate) Cover all const predicates in scan node (#13238)
For an vectorized expression which meets the condition vexpr->is_constant(), a const column is expected to return.
But now we still don't cover all predicates for const expression.
For example, for query SELECT col FROM tbl WHERE 'PROMOTION' LIKE 'AAA%', predicate like will return a ColumnVector which contains a single value.

This PR want to cover all const predicates in scan node whether it returns a constcolumn or not
2022-10-11 15:49:53 +08:00
c1ce48ffe4 [fix](new-scann) scanner may be marked close twice (#13263) 2022-10-11 15:37:15 +08:00
935ef5a598 [feature-wip](new-scan) Add new ES scanner and new ES scan node #13027 2022-10-10 09:56:38 +08:00
Pxl
245490d6b7 [Enhancement](runtime filter) optimize for runtime filter (#12856)
optimize for runtime filter
2022-10-09 14:11:03 +08:00
cf2b93532b [fix](file-scanner) fix some logic about broker load with parquet with new file scanner (#13135)
Fix some logic about broker load using new file scanner, with parquet format:

1. If columns are specified in load stmt, but none of them are in parquet file,
    error will be thrown like `err: No columns found in file`. See `parquet_s3_case4`

2. If the first column of table are not in table, the result number of rows is wrong.
    See `parquet_s3_case8`

3. If column specified in `columns` in load stmt does not exist in file and table,
    error will be thrown like: `failed to find default value expr for slot: x1`. See `parquet_s3_case2`
2022-10-08 13:08:08 +08:00
b41748efa1 [feature-wip](new-scan)Add new jdbc scanner and new jdbc scan node (#12848)
Related pr: #11582
This pr is the new jdbc scan node and scanner.
2022-10-07 09:55:17 +08:00
d286aa7bf7 [fix](spark-load) no need to filter row group when doing spark load (#13116)
1. Fix issue #13115 
2. Modify the method of `get_next_block` or `GenericReader`, to return "read_rows" explicitly.
    Some columns in block may not be filled in reader, if the first column is not filled, use `block->rows()` can not return real row numbers.
3. Add more checks for broker load test cases.
2022-10-05 23:00:56 +08:00
7b75c2df54 [fix](BE) fix the stream load error when upgrade BE from 1.1.2 to master (#13058) 2022-10-05 12:13:26 +08:00
026ffaf10d [feature-wip](parquet-reader) add detail profile for parquet reader (#13095)
Add more detail profile for ParquetReader:
ParquetColumnReadTime: the total time of reading parquet columns
ParquetDecodeDictTime: time to parse dictionary page
ParquetDecodeHeaderTime: time to parse page header
ParquetDecodeLevelTime: time to parse page's definition/repetition level
ParquetDecodeValueTime: time to decode page data into doris column
ParquetDecompressCount: counter of decompressing page data
ParquetDecompressTime: time to decompress page data
ParquetParseMetaTime: time to parse parquet meta data
2022-10-02 15:11:48 +08:00
287ff50a6f [Bug](datev2) Fix compatible error between datev2 and date (#13024) 2022-09-29 18:01:55 +08:00
820ec435ce [feature-wip](parquet-reader) refactor parquet_predicate (#12896)
This change serves the  following purposes:
1.  use ScanPredicate instead of TCondition for external table, it can reuse old code branch.
2. simplify and delete some useless old code
3.  use ColumnValueRange to save predicate
2022-09-28 21:27:13 +08:00
d80b7b9689 [feature-wip](new-scan) support more load situation (#12953) 2022-09-27 21:48:32 +08:00
Pxl
8731eea26e [Chore](clang) fix some build fail on clang15 (#12882)
remove unused variables
2022-09-26 23:13:28 +08:00
acd5d67355 [feature-wip](new-scan)Add new odbc scanner and new odbc scan node (#12899) 2022-09-26 09:24:25 +08:00
7b230e41a8 [bugfix](scanner) olap scanner compute is wrong (#12857)
Co-authored-by: yiguolei <yiguolei@gmail.com>
2022-09-24 09:59:59 +08:00
5bfdfac387 [feature-wip](parquet-reader) add parquet reader profile (#12797)
Add profile for parquet reader. New counters:
- ParquetFilteredGroups: Filtered row groups by `RowGroup` min-max statistics
- ParquetReadGroups: The number of row groups to read
- ParquetFilteredRowsByGroup: The number of filtered rows by `RowGroup` min-max statistics
- ParquetFilteredRowsByPage: The number of filtered rows by page min-max statistics
- ParquetFilteredBytes: The filtered bytes by `RowGroup` min-max statistics
- ParquetReadBytes: The total bytes in `ParquetReadGroups`, may be further filtered If a page is skipped as a whole
## Result
```
┌──────────────────────────────────────────────────────┐
│[0: VFILE_SCAN_NODE]                                  │
│(Active: 1s29ms, non-child: 96.42)                    │
│  - Counters:                                         │
│      - BytesRead: 0.00                               │
│      - FileReadCalls: 1.826K (1826)                  │
│      - FileReadTime: 510.627ms                       │
│      - FileRemoteReadBytes: 65.23 MB                 │
│      - FileRemoteReadCalls: 1.146K (1146)            │
│      - FileRemoteReadRate: 128.29331970214844 MB/sec │
│      - FileRemoteReadTime: 508.469ms                 │
│      - NumDiskAccess: 0                              │
│      - NumScanners: 1                                │
│      - ParquetFilteredBytes: 0.00                    │
│      - ParquetFilteredGroups: 0                      │
│      - ParquetFilteredRowsByGroup: 0                 │
│      - ParquetFilteredRowsByPage: 6.600003M (6600003)│
│      - ParquetReadBytes: 2.13 GB                     │
│      - ParquetReadGroups: 20                         │
│      - PeakMemoryUsage: 0.00                         │
│      - PredicateFilteredRows: 3.399797M (3399797)    │
│      - PredicateFilteredTime: 133.302ms              │
│      - RowsRead: 3.399997M (3399997)                 │
│      - RowsReturned: 200                             │
│      - RowsReturnedRate: 194                         │
│      - TotalRawReadTime(*): 726.566ms                │
│      - TotalReadThroughput: 0.0 /sec                 │
│      - WaitScannerTime: 1s27ms                       │
└──────────────────────────────────────────────────────┘
```
2022-09-23 18:42:14 +08:00