Commit Graph

111 Commits

Author SHA1 Message Date
1bfe0f0393 [feature](iceberg) support reading iceberg complex types, the iceberg.orc format and position deletes. (#33935) (#34256)
2024-04-29 14:40:12 +08:00
352617a34d [fix](scanner) cached blocks may be empty when VFileScanner return NOT_FOUND (#32745)
Cached blocks may be empty when VFileScanner returns NOT_FOUND. This feature was introduced by https://github.com/apache/doris/pull/15226. Move this function inside `VFileScanner`.
2024-03-27 10:01:05 +08:00
e99b33c274 [opt](file-meta-cache) reduce file meta cache size and disable cache for some cases (#32340)
The file meta cache on BE caches the metadata of external tables' files, such as parquet footers.
This cache is sized by entry count, not by memory consumption.
So if a cached object is big (e.g. a large parquet footer), the total memory consumption of this cache
can become large and cause OOM.

This PR mainly changes:

1. Add a new method `exceed_prune_limit()` to `CachePolicy`
    For `ObjLRUCache`, it always returns true so that a minor or full GC on BE will prune the cache each time.

2. Reduce the default capacity of the file meta cache, from 20000 to 1000

    Also change the default capacity of the hdfs file handle cache, from 20000 to 1000

3. Change the judgement of whether to enable the file meta cache when querying

    If the number of files to be read is larger than 1/3 of the file meta cache's capacity, the file meta cache
    will be disabled for this query, because the cache is useless when there are too many files.
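A minimal sketch of that 1/3 threshold check (the function and parameter names are illustrative, not the actual Doris code):

```
#include <cstddef>

// Hypothetical sketch: decide whether to use the file meta cache for a query.
// If the number of files to read exceeds 1/3 of the cache capacity, the cache
// would be useless, so it is disabled for this query.
bool should_use_file_meta_cache(std::size_t num_files_to_read,
                                std::size_t file_meta_cache_capacity) {
    return num_files_to_read * 3 <= file_meta_cache_capacity;
}
```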
2024-03-21 14:07:22 +08:00
2e564036ef [fix](profile) avoid updating profile in destructor (#32131)
Previously, the counters in `profile` might be updated when closing the file reader.
And the file reader may be closed while the object is being destructed.
But at that time, the `profile` object may already have been deleted, causing a null-pointer access and a BE crash.

This PR tries to fix this issue:

1. Remove the "profile counter update" logic from all `close()` method.

2. Add a new interface `ProfileCollector`

	It has 2 methods:
	
	- `collect_profile_at_runtime()`

		It can be called at runtime, e.g. in every `get_next_block()` method,
		so that the counters in the profile can be updated at runtime.
		
	- `collect_profile_before_close()`

		Should be called before the object calls `close()`, and it will only be called once.
		
3. Derive from `ProfileCollector`

	All classes that may update profile counters in their `close()` method should extend
	`ProfileCollector`, such as `GenericReader`, etc., and implement `collect_profile_before_close()`.
	
	`collect_profile_before_close()` will be called in `scanner->mark_to_need_to_close()`.
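A rough sketch of what such an interface could look like; the two public method names come from this commit message, while the guard flag and everything else are assumptions for illustration only:

```
// Hypothetical sketch of the ProfileCollector interface described above.
class ProfileCollector {
public:
    virtual ~ProfileCollector() = default;

    // May be called repeatedly at runtime, e.g. in every get_next_block().
    virtual void collect_profile_at_runtime() = 0;

    // Must be called before close(); the flag makes it effectively run once.
    void collect_profile_before_close() {
        if (_profile_collected) {
            return;
        }
        _profile_collected = true;
        _collect_profile_before_close_impl();
    }

protected:
    // Subclasses such as a reader would flush their counters here.
    virtual void _collect_profile_before_close_impl() = 0;

private:
    bool _profile_collected = false;
};
```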
2024-03-21 14:07:22 +08:00
6289f7e605 [Fix](multi-catalog) Fix truncate_char_or_varchar_column crash. (#30731) 2024-02-03 20:26:04 +08:00
7d037c12bf [bugfix](paimon)fix paimon testcases (#30514)
1. set default timezone
2. the `char` type is not supported for pushdown
2024-01-31 23:53:39 +08:00
ad2c13e009 [Optimize](kill-query)Support scanners exiting as soon as possible when a query is killed #29803 2024-01-12 13:58:19 +08:00
0b731800a0 [enhancement](group_commit) refactor wal manager code (#29560) 2024-01-07 18:54:41 +08:00
bfe65565d8 [feature](paimon)support native reader (#29339)
Support native reader for paimon.

Upgrade paimon 0.5 to 0.6 : apache/doris-shade#32
2024-01-04 14:31:48 +08:00
66fbb22ad7 [fix](group commit) Fix some wal problems on group commit (#28554) 2023-12-19 09:51:03 +08:00
ec91dd1129 [opt](vfilescanner) interrupt running parquet/orc readers when scannode is finished (#28223)
VScanNode::get_next checks whether the ScanNode has reached its limit condition and sends eos to the TaskScheduler, which then tries to close the ScanNode.
However, the ScanNode must wait for all running scanners to finish, so even if it has reached its limit condition, it can't be closed immediately.
This PR tries to interrupt the running readers and make the ScanNode end as soon as possible.
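One simple way to picture the interruption mechanism is a shared stop flag that the scan node sets and the reader polls between batches; this is only an illustrative sketch, not the actual VFileScanner/reader API:

```
#include <atomic>

// Illustrative sketch only: the scan node flips the flag once its limit is
// reached, and a running reader checks it between batches and returns early.
struct ScanContext {
    std::atomic<bool> should_stop{false};
};

void reader_loop(ScanContext& ctx) {
    while (!ctx.should_stop.load(std::memory_order_relaxed)) {
        // ... read and return the next parquet/orc batch ...
        break; // placeholder so this sketch terminates
    }
}
```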
2023-12-13 19:31:08 +08:00
54d062ddee [feature](stream load) (step one)Add arrow data type for stream load (#26709)
By using the Arrow data format, we can reduce the amount of data transferred in stream load and improve data import performance.
2023-12-06 23:29:46 +08:00
54b5d04ff9 [improve](csv_reader) handle csv reader error (#27892) 2023-12-02 10:05:02 +08:00
1706699e7e [fix](multi-catalog)support the max compute partition prune (#27154)
1. Max Compute partition prune
Previously, we only supported filtering MC partitions by '=', which can filter just one partition.
To support multiple partition filters and range operators ('>', '<', '>=', ...), partition prune should be supported.

2. add max compute row count cache and partitionValues cache

3. add max compute regression case
2023-12-01 22:28:26 +08:00
5d548935e0 [improvement](insert) support schema change and decommission for group commit (#26359) 2023-11-17 21:41:38 +08:00
52995c528e [fix](iceberg) iceberg uses a custom method to encode special characters in field names (#27108)
Fix two bugs:
1. Column name matching is case sensitive; change column names to lower case in FE for hive/iceberg/hudi.
2. Iceberg uses a custom method to encode special characters in column names. Decode the column name to match the right column in the parquet reader.
2023-11-17 18:38:55 +08:00
a5565f68b2 [Refactor](opentelemetry) Remove opentelemetry (#26605) 2023-11-09 18:05:34 +08:00
fa7a38b587 [fix](runtime filter) append late arrival runtime filters in vfilescanner (#25996)
`VFileScanner` will try to append late arrival runtime filters in each loop of `ScannerScheduler::_scanner_scan`.  However, `VFileScanner::_get_next_reader` only generates the `_push_down_conjuncts` in the first loop, so the late arrival runtime filters are ignored.
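A simplified sketch of the fix's idea: track how many runtime filters have already been pushed down and regenerate the pushed-down conjuncts whenever new filters arrive, not only on the first loop. Names here are illustrative, not the real VFileScanner members:

```
#include <cstddef>
#include <vector>

struct RuntimeFilter {};

// Illustrative sketch of the per-scanner bookkeeping.
struct ScannerState {
    std::vector<RuntimeFilter> arrived_filters; // filters received so far
    std::size_t applied_filter_count = 0;       // filters already pushed down
};

// True when late-arriving filters exist that the reader has not applied yet,
// so the pushed-down conjuncts should be regenerated for the next reader.
bool need_regenerate_push_down_conjuncts(const ScannerState& s) {
    return s.arrived_filters.size() > s.applied_filter_count;
}
```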
2023-11-07 09:50:35 +08:00
a4e415ab09 [feature](hive)Support hive tables after alter type. (#25138)
1. Restructure the decode logic for reading parquet. The parquet reader first reads the data according to the parquet physical type, and then performs a type conversion.

2. Support hive alter table.
2023-11-02 00:24:21 +08:00
0e0f8090f7 [refactor](text_convert)Use serde to replace text_convert. (#25543)
Remove text_convert and use serde to replace it.
2023-10-24 09:52:43 +08:00
f75ee49cb4 [chore](fmt) Replace stringstream with fmt (#25474)
Signed-off-by: Jack Drogon <jack.xsuperman@gmail.com>
2023-10-16 21:31:54 +08:00
be27d4d921 [fix](broker-load) fix use_count() issue when doing broker load in debug mode (#25288)
When executing broker load in ASAN mode, BE may crash with error:
```
F20231010 18:18:17.044978 185490 block.cpp:694] Check failed: d.column->use_count() == 1 (3 vs. 1)
*** Check failure stack trace: ***
    @     0x55e9d94c4e46  google::LogMessage::SendToLog()
    @     0x55e9d94c1410  google::LogMessage::Flush()
    @     0x55e9d94c5689  google::LogMessageFatal::~LogMessageFatal()
    @     0x55e9c509f80d  doris::vectorized::Block::clear_column_data()
    @     0x55e9b6c170b3  doris::PlanFragmentExecutor::get_vectorized_internal()
    @     0x55e9b6c147e6  doris::PlanFragmentExecutor::open_vectorized_internal()
    @     0x55e9b6c12d9a  doris::PlanFragmentExecutor::open()
    @     0x55e9b6c18426  doris::PlanFragmentExecutor::execute()
    @     0x55e9b6945cca  doris::FragmentMgr::_exec_actual()
    @     0x55e9b696456c  doris::FragmentMgr::exec_plan_fragment()::$_0::operator()()
```

It may happen when there is a column mapping like:
```
(k1,v2,v3,v4,v5,v6,v7,v8)
set (k2=v4,k3=v4,k4=v4)
```

in the load statement.

Case is covered by Baidu test cases
2023-10-12 17:04:29 +08:00
642e5cdb69 [Fix](Status) Make Status [[nodiscard]] and handle returned Status correctly (#23395) 2023-09-29 22:38:52 +08:00
430634367a [pipelineX](node)support file scan operator (#24924) 2023-09-27 22:10:43 +08:00
082bcd820b [feature](insert) Support wal for group commit insert (#23053) 2023-09-26 14:46:24 +08:00
40be6a0b05 [fix](hive) do not split compress data file and support lz4/snappy block codec (#23245)
1. Do not split compressed data files
Some data files in hive are compressed with gzip, deflate, etc.
These kinds of files cannot be split (see the sketch below).

2. Support lz4 block codec
For the hive scan node, use the lz4 block codec instead of the lz4 frame codec

3. Support snappy block codec
For hadoop snappy

4. Optimize the `count(*)` query of csv files
For a query like `select count(*) from tbl`, only the lines need to be split, not the columns.

Need to pick to branch-2.0 after this PR: #22304
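A minimal sketch of the "cannot split" decision for whole-file codecs; the function name and suffix list are assumptions used only for illustration, not the actual Doris code:

```
#include <string>

// Illustrative sketch: files compressed with a whole-file codec such as gzip
// or deflate cannot be split, so they are read as a single split. Block
// codecs (lz4 block, hadoop snappy) are handled by their own decoders.
bool is_splittable(const std::string& file_name) {
    auto ends_with = [&](const std::string& suffix) {
        return file_name.size() >= suffix.size() &&
               file_name.compare(file_name.size() - suffix.size(),
                                 suffix.size(), suffix) == 0;
    };
    return !(ends_with(".gz") || ends_with(".deflate"));
}
```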
2023-08-26 12:59:05 +08:00
527293aa41 [refactor](dynamic table) remove dynamic table (#23298) 2023-08-23 14:15:14 +08:00
d4694167a8 [Enhancement](chore) Some Status relevant enhancement (#23072) 2023-08-21 14:14:38 +08:00
a5ca6cadd6 [Improvement] Optimize count operation for iceberg (#22923)
Iceberg has its own metadata, which includes count statistics for table data. If the table does not contain equality deletes, we can get the row count of the current table directly from the count statistics.
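A sketch of that decision, with made-up field names standing in for whatever the Iceberg snapshot metadata actually exposes:

```
#include <cstdint>
#include <optional>

// Illustrative sketch: if the snapshot carries no equality deletes, answer
// count(*) from metadata; otherwise fall back to a normal scan.
struct SnapshotSummary {
    int64_t total_records = 0;
    int64_t equality_delete_count = 0;
};

std::optional<int64_t> count_from_metadata(const SnapshotSummary& s) {
    if (s.equality_delete_count > 0) {
        return std::nullopt; // must scan the data and delete files
    }
    return s.total_records;
}
```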
2023-08-18 09:57:51 +08:00
db69457576 [fix](avro)Fix S3 TVF avro format reading failure (#22199)
This pr fixes two issues:

1. When using the s3 TVF to query files in AVRO format, due to the change of `TFileType`, the originally queried `FILE_S3` becomes `FILE_LOCAL`, causing the query to fail.
2. Currently, both parameters `s3.virtual.key` and `s3.virtual.bucket` are removed. A new `S3Utils` in jni-avro parses the bucket and key of s3.
The purpose of this change is mainly to unify the s3 parameters.
2023-08-11 17:22:48 +08:00
f2658dc7bd [Feature](multi-catalog) Truncate char or varchar columns if size is smaller than file columns or not found in the file column schema. (#22318)
Truncate char or varchar columns if their declared size is smaller than the file columns, or if they are not found in the file column schema, controlled by the session variable `truncate_char_or_varchar_columns`.
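A minimal sketch of the truncation rule; the function and parameter names are illustrative, not the actual implementation:

```
#include <cstddef>
#include <string>

// Illustrative sketch: when truncate_char_or_varchar_columns is enabled, a
// char/varchar value read from the file is cut down to the length declared
// in the table schema.
std::string truncate_char_or_varchar(const std::string& value,
                                     std::size_t declared_len,
                                     bool truncate_enabled) {
    if (!truncate_enabled || value.size() <= declared_len) {
        return value;
    }
    return value.substr(0, declared_len);
}
```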
2023-08-10 14:37:20 +08:00
c9dc715c5d [fix](broker-load) fix error when using multi data description for same table in load stmt (#22666)
For load request, there are 2 tuples on scan node, input tuple and output tuple.
The input tuple is for reading file, and it will be converted to output tuple based on user specified column mappings.

And broker load supports different column mappings in different data descriptions for the same table (or partition).
So for each scanner, the output tuples are the same but the input tuples can be different.

The previous implementation saved the input tuple at the scan node level, causing different scanners to use the same input tuple,
which is incorrect.
This PR removes the input tuple from the scan node and saves it in each scanner.
2023-08-07 20:03:03 +08:00
1ed1b69485 [refactor](reader) move reader from vec/exec/scan to vec/exec/format (#22371)
These readers should be in vec/exec/format
2023-08-04 09:47:20 +08:00
4bc65aa921 [fix](load) Fix PrefetchBufferedReader crash caused by updating a counter with an invalid runtime profile (#22464) 2023-08-02 18:19:48 +08:00
89433f6a13 [fix](complex_type) throw error when reading complex types in broker/stream load (#22331)
Check whether there are complex types in the parquet/orc reader for broker/stream load. Broker/stream load casts any type to string type, and complex types would be cast wrongly. This is a temporary measure and will be replaced by TVF.
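A sketch of such a guard; the enum and function names are assumptions used only to illustrate the check:

```
#include <stdexcept>
#include <string>

// Illustrative sketch: broker/stream load casts every column to string, so a
// temporary guard rejects complex types (array/map/struct) up front instead
// of producing wrongly cast values.
enum class TypeKind { PRIMITIVE, ARRAY, MAP, STRUCT };

void check_type_supported(TypeKind kind, const std::string& col_name) {
    if (kind != TypeKind::PRIMITIVE) {
        throw std::runtime_error("complex type column '" + col_name +
                                 "' is not supported in broker/stream load");
    }
}
```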
2023-07-31 22:23:08 +08:00
ae8a26335c [opt](hive) opt select count(*) stmt: push down agg on parquet in hive (#22115)
Optimize the `select count(*) from table` statement by pushing down the "count" aggregation to BE (see the sketch below).
Supported file types: parquet and orc in hive.

1. 4k files, 600 million rows
    before:  1 min 37.70 sec
    after:   50.18 sec

2. 50 files, 600 million rows
    before: 1.12 sec
    after: 0.82 sec
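A sketch of what the pushed-down count amounts to on the BE side for parquet: summing row counts from footer metadata without decoding any column data. The struct and function names are illustrative assumptions:

```
#include <cstdint>
#include <vector>

// Illustrative sketch: answer a pushed-down count(*) from parquet row-group
// metadata in the footer instead of materializing any column values.
struct RowGroupMeta {
    int64_t num_rows = 0;
};

int64_t pushed_down_count(const std::vector<RowGroupMeta>& row_groups) {
    int64_t total = 0;
    for (const auto& rg : row_groups) {
        total += rg.num_rows;
    }
    return total;
}
```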
2023-07-29 00:31:01 +08:00
8caa5a9ba4 [Fix](mutli-catalog) Fix null partitions error in iceberg tables. (#22185)
### Issue
When a table has null partitions, it throws the error
`Failed to fill partition column: t_int=null`

### Resolution
- Fix the null partition error in iceberg tables by replacing the null partition value with '\N'.
- Add regression test for hive null partition.
2023-07-27 23:57:35 +08:00
bed940b7fc [fix](log) column index off-by-one error in scanner logs (#19747) 2023-07-21 18:30:01 +08:00
5fc0a84735 [improvement](catalog) reduce the size thrift params for external table query (#21771)
### 1
In the previous implementation, for each FileSplit there was a `TFileScanRange`, and each `TFileScanRange`
contained a list of `TFileRangeDesc` and a `TFileScanRangeParams`.
So if there are thousands of FileSplits, there will be thousands of `TFileScanRange`s, which makes the thrift
data sent to BE too large, resulting in:

1. the rpc of sending fragment may fail due to timeout
2. FE will OOM

For a certain query request, the `TFileScanRangeParams` is the common part and is the same for all `TFileScanRange`s.
So I moved it to `TExecPlanFragmentParams`.
After that, for each FileSplit there is only a list of `TFileRangeDesc`.

In my test, querying a hive table with 100000 partitions, the size of the thrift data was reduced from 151MB to 15MB,
and the above 2 issues are gone.
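A rough sketch of the structural change, using plain C++ structs as stand-ins for the generated Thrift types (the real definitions live in the project's .thrift files):

```
#include <vector>

// Stand-ins for the Thrift-generated types named in this commit message.
struct TFileRangeDesc {};
struct TFileScanRangeParams {};

// Before: every scan range carried its own copy of the common params.
struct TFileScanRange_Before {
    std::vector<TFileRangeDesc> ranges;
    TFileScanRangeParams params; // duplicated for every FileSplit
};

// After: the common params are hoisted to the fragment-level request and
// each scan range keeps only the per-split range descriptors.
struct TExecPlanFragmentParams_After {
    TFileScanRangeParams file_scan_params; // sent once per query
};
struct TFileScanRange_After {
    std::vector<TFileRangeDesc> ranges;
};
```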

### 2
Support disabling the file meta cache for parquet footers by setting `max_external_file_meta_cache_num` <= 0.
I found that for some wide tables the footer is too large (1MB after compaction, and much more after
deserialization to thrift), so it consumes too much BE memory when there are many files.

This will be optimized later; here I just add support for disabling this cache.
2023-07-17 13:37:02 +08:00
9ee7fa45d1 [Refactor](multi-catalog) Refactor to process split conjuncts for dict filter. (#21459)
Conjuncts are currently split, so refactor the source code to handle split conjuncts for dict filters.
2023-07-07 09:19:08 +08:00
a6b51ec19a [Feature](avro) Support Apache Avro file format (#19990)
Support reading avro files via hdfs() or s3().
```sql
select * from s3(
         "uri" = "http://127.0.0.1:9312/test2/person.avro",
         "ACCESS_KEY" = "ak",
         "SECRET_KEY" = "sk",
         "FORMAT" = "avro");
+--------+--------------+-------------+-----------------+
| name   | boolean_type | double_type | long_type       |
+--------+--------------+-------------+-----------------+
| Alyssa |            1 |     10.0012 | 100000000221133 |
| Ben    |            0 |    5555.999 |      4009990000 |
| lisi   |            0 | 5992225.999 |      9099933330 |
+--------+--------------+-------------+-----------------+

select * from hdfs(
                "uri" = "hdfs://127.0.0.1:9000/input/person2.avro",
                "fs.defaultFS" = "hdfs://127.0.0.1:9000",
                "hadoop.username" = "doris",
                "format" = "avro");
+--------+--------------+-------------+-----------+
| name   | boolean_type | double_type | long_type |
+--------+--------------+-------------+-----------+
| Alyssa |            1 |  8888.99999 |  89898989 |
+--------+--------------+-------------+-----------+
```

The current avro reader only supports common data types; complex data types will be supported later.
2023-06-28 21:15:35 +08:00
3dfeee3946 [fix](typesystem) fix wrong return type argument cause type check fail (#21082) 2023-06-22 00:04:46 +08:00
923f7edad0 [opt](hudi) using native reader to read the base file with no log file (#20988)
Two optimizations:
1. Insert string bytes directly to remove the decoding & encoding process.
2. Use native reader to read the hudi base file if it has no log file. Use `explain` to show how many splits are read natively.
2023-06-20 11:20:21 +08:00
b7a50a09fe [Opt](orc-reader) Optimize orc reader by dict filtering. (#20806)
Optimize the orc reader with dict filtering. It is similar to #17594.
Test result
**ssb-flat-100**: (3 nodes)
| Query | before opt | after opt |
| ----- | :--------: | --------: |
| Q1.1 | 1.239 | 1.145 |
| Q1.2 | 1.254 | 1.128 |
| Q1.3 | 1.931 | 1.644 |
| Q2.1 | 1.359 | 1.006 |
| Q2.2 | 1.229 | 0.674 |
| Q2.3 | 0.934 | 0.427 |
| Q3.1 | 2.226 | 1.712 |
| Q3.2 | 2.042 | 1.562 |
| Q3.3 | 1.631 | 1.021 |
| Q3.4 | 1.618 | 0.732 |
| Q4.1 | 2.294 | 1.858 |
| Q4.2 | 2.511 | 1.961 |
| Q4.3 | 1.736 | 1.446 |
| total | 22.004 | 16.316 |
2023-06-16 13:11:37 +08:00
31a4f96f01 [refactor](exprcontext) move close to expr context's destructor method (#20747)
The close method does nothing, but I am not sure we can remove it, so I moved it to the destructor method and removed many, many calls.
2023-06-14 18:01:07 +08:00
4b15185e25 [improvement](hdfs) add parquet footer cache and hdfs file handle cache (#20544)
1. Add hdfs file handle cache for hdfs file reader

    Copied from Impala, `https://github.com/apache/impala/blob/master/be/src/util/lru-multi-cache.h`. (Thanks to the Impala team)
    This is an LRU cache that can store multiple entries with the same key.
    The key is built from {file name + modification time}.
    The value is the hdfsFile pointer that points to a certain hdfs file.

    This cache avoids reopening the same hdfs file multiple times, which can save
    query time.

    Add a BE config `max_hdfs_file_handle_cache_num` to limit the max number
    of cached file handles, default is 20000.

2. Add file meta cache

	The file meta cache is an LRU cache. The key is {file name + modification time},
	and the value is the parsed file meta info of that file, which saves
	the time of re-parsing the file meta every time.
	Currently, it is only used for caching parquet file footers.
	
Tests show that when the cache is hit, `FileOpenTime` and `ParseFooterTime` are reduced to almost 0
in the query profile, which saves time when there are lots of files to read.
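Both caches key their entries by {file name + modification time}; a tiny sketch of such a key follows, with the exact format being an assumption rather than the real Doris code:

```
#include <cstdint>
#include <string>

// Illustrative sketch: include the modification time in the key so that a
// rewritten file never hits a stale cached handle or footer.
std::string make_file_cache_key(const std::string& file_name, int64_t mtime) {
    return file_name + "#" + std::to_string(mtime);
}
```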
2023-06-13 15:13:57 +08:00
73ad885e19 [Feature][Fix](multi-catalog) Implements transactional hive full acid tables. (#20679)
After supporting insert-only transactional hive tables in #19518 and #19419, this PR supports transactional hive full acid tables.

Support hive3 transactional hive full acid tables.
Hive2 transactional hive full acid tables need to run major compactions.
2023-06-13 08:55:16 +08:00
0b228b3414 [fix](load)Support load json data with default value (#20624)
* support json default value

---------

Co-authored-by: duanxujian <duanxujian@jd.com>
2023-06-12 14:51:31 +08:00
9a83d78dfe [Enhancement](hudi) support hudi mor table, step2 follow #19909 (#20570)
PR https://github.com/apache/doris/pull/19909 implemented the framework of the hudi reader for MOR tables. This PR completes all functions for reading MOR tables and enables end-to-end queries.
Key Implementations:
1. Use hudi meta information to generate the table schema, not the hive client.
2. Use the hive client to list hudi partitions, so it strongly depends on the sync tool (https://hudi.apache.org/docs/syncing_metastore/) which syncs hudi partitions into the hive metastore. However, we may get the hudi partitions directly from the .hoodie directory.
3. Remove `HudiHMSExternalCatalog`, because other catalogs like glue are compatible with the hive catalog.
4. Read the COW table natively from c++.
5. Hudi RecordReader uses ProcessBuilder to start a hotspot debugger process, which may get stuck when attaching to the original JNI process, so I use a tricky method to kill this useless process.
2023-06-10 12:25:53 +08:00
93b53cf2f4 [improvement](exception-safe) create and prepare node/sink support exception safe (#20551) 2023-06-09 21:06:59 +08:00