994 Commits

Author SHA1 Message Date
4dad7c94da [fix](orc) fix the count(*) pushdown issue in orc format (#24446)
Previously, when querying a hive table in orc format whose file is split,
the result of `select count(*)` could be a multiple of the real row count.

This is because the number of rows should be obtained after orc stripe pruning;
otherwise, it may return a wrong result.
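The idea can be sketched as follows. This is only an illustration, not the Doris implementation: the `Stripe` tuple, the `rows_for_split` helper, and the rule that a stripe belongs to the split containing its start offset are all assumptions made for the example.

```python
from typing import List, NamedTuple


class Stripe(NamedTuple):
    """Hypothetical description of one stripe in a file."""
    offset: int    # byte offset where the stripe starts
    length: int    # stripe length in bytes
    num_rows: int  # rows stored in the stripe


def rows_for_split(stripes: List[Stripe], split_start: int, split_len: int) -> int:
    """Count rows only from stripes that survive pruning for this split.

    Assumption: a stripe is assigned to the split that contains its start offset.
    """
    split_end = split_start + split_len
    return sum(s.num_rows for s in stripes if split_start <= s.offset < split_end)


stripes = [Stripe(0, 100, 10), Stripe(100, 100, 20), Stripe(200, 50, 5)]
splits = [(0, 128), (128, 122)]
total = sum(rows_for_split(stripes, start, length) for start, length in splits)
```

If every split instead reported the row count of the whole file, the sum over the two splits in this example would be twice the real number of rows.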
2023-09-16 09:57:39 +08:00
b9ddcbf729 [feature](merge-cloud) Rewrite code related to IOContext (#24269) 2023-09-15 19:57:58 +08:00
d24f3efd4a [pipelineX](profile) Phase 1: refactor pipelineX detailed profile (#24322) 2023-09-15 16:14:05 +08:00
9c681692bd Revert "[fix] fix http_stream retry mechanism (#23969)" (#24407)
This reverts commit 05e365ea137eb8c92b8e7eedc7d1435e83f065ae.
2023-09-15 10:07:53 +08:00
05e365ea13 [fix] fix http_stream retry mechanism (#23969)
Co-authored-by: yiguolei <676222867@qq.com>
2023-09-14 21:41:11 +08:00
Pxl
35c5d71549 [Improvement](join) some improvement of hash join (#23972)
some improvement of hash join
2023-09-14 17:55:35 +08:00
8e7f7c9566 [fix](profile) move probe time to pull and add LoopGenerateJoin time #24302 2023-09-14 16:41:01 +08:00
d8feca2530 [Enhancement]The page cache can be parameterized by the session variable of fe. (#23981) 2023-09-14 14:28:19 +08:00
c7ae2a7d22 [Refactor & Bugfix](static variables) move some static variables to exec_env (#24029) 2023-09-13 09:27:03 +08:00
d8ef9dda59 [feature](merge-cloud) Rewrite FS interface (#23953) 2023-09-12 19:20:25 +08:00
dbf509edc0 [Debug](scan) Add debug log for find p0 scan coredump in pipeline (#24202) 2023-09-12 12:17:44 +08:00
6e28d878b5 [fix](hudi) compatible with hudi spark configuration and support skip merge (#24067)
Fix three bugs:
1. A Hudi slice may contain log files only, in which case `new Path(filePath)` throws an error.
2. Hive column names are lowercase only, so column names are matched in ignore-case mode.
3. Be compatible with [Spark Datasource Configs](https://hudi.apache.org/docs/configurations/#Read-Options), so users can add `hoodie.datasource.merge.type=skip_merge` in catalog properties to skip merging log files.
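Item 2 can be illustrated with a minimal sketch. The helper name `match_columns` and the sample column names are hypothetical and only show what matching in ignore-case mode means here.

```python
from typing import Dict, List, Optional


def match_columns(requested: List[str], file_columns: List[str]) -> Dict[str, Optional[str]]:
    """Map each requested (lowercase) column to the file column with the same
    name ignoring case, or None when the file has no such column."""
    by_lower = {name.lower(): name for name in file_columns}
    return {col: by_lower.get(col.lower()) for col in requested}


mapping = match_columns(["id", "username", "missing"], ["ID", "userName"])
```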
2023-09-11 19:54:59 +08:00
dbb9365556 [Enhance](ip)optimize priority_network matching logic for be (#23795)
If the user has configured a wrong priority_network, fail startup directly, so that users do not mistakenly assume the configuration is correct.
If the user has not configured priority_network, select only the first IP from the IPv4 list, rather than selecting from all IPs, to avoid users' servers not supporting IPv4.
extends #23784
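A minimal sketch of the selection rule described above, using Python's standard `ipaddress` module. The function name `choose_be_ip` and its inputs are assumptions for illustration and do not mirror the BE code.

```python
import ipaddress
from typing import List


def choose_be_ip(host_ips: List[str], priority_networks: List[str]) -> str:
    """Pick the address a backend should bind to.

    - priority networks configured: return the first host IP inside one of
      them, and fail if none matches (a wrong configuration must not be
      silently ignored).
    - nothing configured: return the first IPv4 address of the host.
    """
    addrs = [ipaddress.ip_address(ip) for ip in host_ips]
    if priority_networks:
        nets = [ipaddress.ip_network(n, strict=False) for n in priority_networks]
        for addr in addrs:
            if any(addr.version == net.version and addr in net for net in nets):
                return str(addr)
        raise RuntimeError("no host IP matches the configured priority networks")
    for addr in addrs:
        if addr.version == 4:
            return str(addr)
    raise RuntimeError("no IPv4 address available")
```

For example, `choose_be_ip(["fe80::1", "10.0.0.5"], [])` picks the IPv4 address, while a configured network that matches no host IP raises instead of falling back.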
2023-09-11 18:32:31 +08:00
c94e47583c [fix](join) avoid DCHECK failed in '_filter_data_and_build_output' (#24162)
avoid DCHECK failed in '_filter_data_and_build_output'
2023-09-11 11:54:44 +08:00
9b3be0ba7a [Fix](multi-catalog) Do not throw exceptions when file not exists for external hive tables. (#23799)
A bug similar to #22140.

When executing a query with an hms catalog, the query may fail because some hdfs files do not exist. We should identify this kind of error and skip the missing files.

```
errCode = 2, detailMessage = (xxx.xxx.xxx.xxx)[CANCELLED][INTERNAL_ERROR]failed to init reader for file hdfs://xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc, err: [INTERNAL_ERROR]Init OrcReader failed. reason = Failed to read hdfs://xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc: [INTERNAL_ERROR]Read hdfs file failed. (BE: xxx.xxx.xxx.xxx) namenode:hdfs://xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc, err: (2), No such file or directory), reason: RemoteException: File does not exist: /xxx/dwd_tmp.db/check_dam_table_relation_record_day_data/part-00000-c4ee3118-ae94-4bf7-8c40-1f12da07a292-c000.snappy.orc at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86) 
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76) 
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:158) 
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1927) 
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:738) 
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:426) 
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) 
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) 
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876) 
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:422) 
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730) 
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
```
2023-09-10 21:55:09 +08:00
f85da7d942 [improvement](jdbc) add profile for jdbc read and convert phase (#23962)
Add 2 metrics to the jdbc scan node profile:
- `CallJniNextTime`: call get next from the jdbc result set
- `ConvertBatchTime`: call convert jobject to column block

Also fix a potential concurrency issue when initializing the jdbc connection cache pool.
2023-09-10 21:42:06 +08:00
262c669918 [fix](jdbc catalog) fix jdbc catalog creating json columns when reading json data (#24122) 2023-09-10 12:00:53 +08:00
93c1151f1a [fix](join) incorrect result of mark join (#24112) 2023-09-10 11:30:45 +08:00
f9a75b5c4f [feature](csv_serde)1.append csv serde for serialize to csv and deserialize from csv. 2.let csvReader use csv serde not text_converter. (#23352)
1. append csv serde for serialize to csv and deserialize from csv.
2. let csvReader use csv serde not text_converter.
2023-09-10 00:16:21 +08:00
03757d0672 [bug](explode) fix table node not implement alloc_resource function (#24031)
fix table node not implement alloc_resource function
2023-09-09 08:25:28 +08:00
0f0ffa3482 [Fix](Parquet Reader) fix parquet read issue (#24092) 2023-09-09 00:35:18 +08:00
76ca57cf21 [bug](join) fix outer join not add tuple is null column when build rows is 0 (#23974)
fix outer join not add tuple is null column when build rows is 0
2023-09-08 17:55:03 +08:00
Pxl
69868f18d6 [Bug](join) fix nested loop join some problems (#24034) 2023-09-08 17:40:41 +08:00
82dc970916 [feature](insert) Support group commit insert (#22829) 2023-09-08 15:51:03 +08:00
b73f345479 [fix](intersect) fix wrong result of intersect node (#24044)
Issue Number: close #24046
2023-09-08 10:27:37 +08:00
68acb8597b [fix](nested_loop_join) null value should be output in semi-anti join (#23971)
create table t1
        (k1 bigint, k2 bigint)
        ENGINE=OLAP
DUPLICATE KEY(k1, k2)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(k2) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"is_being_synced" = "false",
"storage_format" = "V2",
"light_schema_change" = "true",
"disable_auto_compaction" = "false",
"enable_single_replica_compaction" = "false"
);
create table t3
        (k1 bigint, k2 bigint)
        ENGINE=OLAP
DUPLICATE KEY(k1, k2)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(k2) BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"is_being_synced" = "false",
"storage_format" = "V2",
"light_schema_change" = "true",
"disable_auto_compaction" = "false",
"enable_single_replica_compaction" = "false"
);
Data:

insert into t1 values (1,null),(null,1),(1,2), (null,2),(1,3), (2,4), (2,5), (3,3), (3,4), (20,2), (22,3), (24,4),(null,null);
insert into t3 values (1,null),(null,1),(1,4), (1,2), (null,3), (2,4), (3,7), (3,9),(null,null),(5,1);
Query:

 select t1.* from t1 where not exists ( select k1 from t3 where t1.k2 < t3.k2 );
Result:

Empty set
Expect result:

+------+------+
| k1   | k2   |
+------+------+
| NULL | NULL |
|    1 | NULL |
+------+------+
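The expected result follows from SQL three-valued logic: `t1.k2 < t3.k2` is unknown when either side is NULL, so a row of t1 whose k2 is NULL never matches and must be kept by `not exists`. A minimal sketch of that semantics, with NULL modelled as `None` (the helper names are hypothetical, and this is not the nested loop join operator itself):

```python
from typing import List, Optional, Tuple

Row = Tuple[Optional[int], Optional[int]]


def sql_less_than(a: Optional[int], b: Optional[int]) -> Optional[bool]:
    """SQL comparison: the result is unknown (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a < b


def not_exists_anti_join(t1: List[Row], t3: List[Row]) -> List[Row]:
    """Keep a t1 row unless some t3 row makes `t1.k2 < t3.k2` evaluate to TRUE."""
    out = []
    for k1, k2 in t1:
        matched = any(sql_less_than(k2, r[1]) is True for r in t3)
        if not matched:
            out.append((k1, k2))
    return out


t1 = [(1, None), (None, 1), (1, 2), (None, 2), (1, 3), (2, 4), (2, 5), (3, 3),
      (3, 4), (20, 2), (22, 3), (24, 4), (None, None)]
t3 = [(1, None), (None, 1), (1, 4), (1, 2), (None, 3), (2, 4), (3, 7), (3, 9),
      (None, None), (5, 1)]
result = not_exists_anti_join(t1, t3)
```

With the data above, only the two t1 rows whose k2 is NULL survive, which is the expected result shown in the commit message.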
2023-09-08 09:28:55 +08:00
9bc7010639 fix topn be inoperative because Field == Null always return true (#23830)
`if (!new_top.is_null() && new_top != old_top)` is always false, since old_top is Null when initialized and Field == Null always returns true.

We add an old_top.is_null() check first to avoid the problem, and will then raise a more careful discussion about Field == Null semantics.
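The pitfall can be reproduced with a small model. The `Field` class below only mimics the comparison behaviour described in this message (equality with Null is always true), and the exact shape of the fixed condition is an assumption based on the description.

```python
class Field:
    """Toy stand-in for a nullable value; not the real Field type."""

    def __init__(self, value=None):
        self.value = value

    def is_null(self) -> bool:
        return self.value is None

    def __eq__(self, other) -> bool:
        # Mimics the described semantics: comparing with Null is always "equal".
        if self.is_null() or other.is_null():
            return True
        return self.value == other.value

    def __ne__(self, other) -> bool:
        return not self.__eq__(other)


def top_changed_buggy(new_top: Field, old_top: Field) -> bool:
    # old_top starts as Null, so `new_top != old_top` can never be true.
    return (not new_top.is_null()) and new_top != old_top


def top_changed_fixed(new_top: Field, old_top: Field) -> bool:
    # Check old_top.is_null() first so the first real value counts as a change.
    return (not new_top.is_null()) and (old_top.is_null() or new_top != old_top)
```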
2023-09-04 16:02:07 +08:00
Pxl
bb3fadc5d3 [Bug](materialized-view) fix mv not match because cast and alias name (#23580)
fix mv not match because cast and alias name
2023-09-04 12:46:33 +08:00
3317909141 [pipelineX](join) support nested loop join operator (#23756) 2023-09-04 10:08:22 +08:00
9da9409bd4 [refactor](join) improve join node output when build table rows is 0 (#23713) 2023-09-04 09:48:38 +08:00
347cceb530 [Feature](inverted index) push count on index down to scan node (#22687)
Co-authored-by: airborne12 <airborne12@gmail.com>
2023-09-02 22:24:43 +08:00
95488c4d93 [Fix](vscanner) remove TEMP column in block after filter (#23778) 2023-09-02 21:54:27 +08:00
6b56896a01 [chore](json reader) add original data to error message for tracing (#22803) 2023-09-02 20:15:18 +08:00
228f0ac5bb [Feature](Multi-Catalog) support query doris bitmap column in external jdbc catalog (#23021) 2023-09-02 12:46:33 +08:00
657e927d50 [fix](json)Fix the bug that read json file Out of bounds access (#23411) 2023-09-02 01:11:37 +08:00
eaf2a6a80e [fix](date) return right date value even if out of the range of date dictionary(#23664)
PR(https://github.com/apache/doris/pull/22360) and PR(https://github.com/apache/doris/pull/22384) optimized the performance of the date type. However, hive supports dates outside 1970~2038, which led to wrong date values in the tpcds benchmark.
How to fix:
1. Increase the dictionary range: 1900 ~ 2038
2. Dates outside 1900 ~ 2038 are regenerated.
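A rough sketch of the two-step fix, assuming dates arrive as days since 1970-01-01 and that the dictionary is a precomputed list covering 1900-01-01 through 2038-12-31 (the exact bounds and all names here are assumptions for illustration):

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)
DICT_START = date(1900, 1, 1)   # assumed lower bound of the dictionary
DICT_END = date(2038, 12, 31)   # assumed inclusive upper bound

# Index 0 of the dictionary corresponds to DICT_START.
_OFFSET = (DICT_START - EPOCH).days
DATE_DICT = [DICT_START + timedelta(days=i)
             for i in range((DICT_END - DICT_START).days + 1)]


def days_to_date(days_since_epoch: int) -> date:
    """Use the dictionary when the value is in range, otherwise compute the date."""
    idx = days_since_epoch - _OFFSET
    if 0 <= idx < len(DATE_DICT):
        return DATE_DICT[idx]                          # fast path: lookup
    return EPOCH + timedelta(days=days_since_epoch)    # slow path: regenerate
```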
2023-09-01 14:40:20 +08:00
65f41f71c1 [pipelineX](refactor) refine codes (#23726) 2023-09-01 07:57:35 +08:00
3a2c0d16f7 [fix](parquet) fix potential heap-use-after-free issue and cache issue (#23638)
1. When the file meta cache is disabled (by setting `max_external_file_meta_cache_num=0` in be.conf),
the parquet meta info is owned by the parquet reader and is released when calling `reader->close()`.

But the underlying file reader of this parquet reader is released after `reader->close()`,
which may cause a `heap-use-after-free` bug because some parts of the meta info may still be referenced by the file reader.

This PR fixes it by making sure that the meta info is released after the file reader is released.

2. Add modification time for file meta cache in BE, to avoid parquet read error like:
`Failed to deserialize parquet page header`
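One way to picture item 2 is a cache whose key includes the file's modification time, so that a file rewritten under the same path never reuses stale metadata. This is a sketch under that assumption; the class and parameter names are hypothetical and the BE cache may be organised differently.

```python
from typing import Callable, Dict, Tuple


class FileMetaCache:
    """Toy metadata cache keyed by (path, modification time)."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, int], object] = {}

    def get(self, path: str, mtime: int, load: Callable[[str], object]) -> object:
        key = (path, mtime)
        if key not in self._entries:
            # Either a new file or the same path with a newer mtime: reload.
            self._entries[key] = load(path)
        return self._entries[key]
```

A lookup with the same path but a different `mtime` misses the cache and triggers a fresh load, instead of returning the footer of the old file.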
2023-08-31 18:23:05 +08:00
d22290e548 [pipelineX](join) support hash join (#23689) 2023-08-31 13:01:26 +08:00
Pxl
f35ab37e1e [Bug](materialized-view) fix load db use analyzer to analyze different metaindex (#23673)
fix load db use analyzer to analyze different metaindex
2023-08-31 12:35:38 +08:00
f7caae08d5 [fix](union) should open/alloc_resource in sink operator instead of source (#23637) 2023-08-30 18:58:59 +08:00
94a8fa6bc9 [bug](function) fix explode_number function return wrong rows (#23603)
Previously, the result of the explode_number function was random when called with a const value,
because _cur_size was reset, so values could not be inserted into the column.
2023-08-29 19:02:49 +08:00
962221cb18 [test](log) add log for debug case failure (#23506) 2023-08-28 10:45:25 +08:00
40be6a0b05 [fix](hive) do not split compress data file and support lz4/snappy block codec (#23245)
1. Do not split compressed data files
Some data files in hive are compressed with gzip, deflate, etc.
These kinds of files cannot be split.

2. Support lz4 block codec
For the hive scan node, use the lz4 block codec instead of the lz4 frame codec.

3. Support snappy block codec
For hadoop snappy.

4. Optimize the `count(*)` query of csv files
For a query like `select count(*) from tbl`, only the lines need to be split; there is no need to split the columns.
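The first and last points can be sketched as below. Detecting compression by file extension, the extension set, and both helper names are assumptions made for the example, not the actual planner or reader code.

```python
import os
from typing import Iterable, List, Tuple

# Assumed extensions of compressed files that cannot be read from the middle.
NON_SPLITTABLE_EXTS = {".gz", ".deflate"}


def plan_splits(path: str, file_size: int, split_size: int) -> List[Tuple[int, int]]:
    """Return (start, length) ranges; a compressed file yields a single range."""
    ext = os.path.splitext(path)[1].lower()
    if ext in NON_SPLITTABLE_EXTS:
        return [(0, file_size)]
    return [(start, min(split_size, file_size - start))
            for start in range(0, file_size, split_size)]


def count_rows(lines: Iterable[str]) -> int:
    """count(*) only needs line boundaries, so no line is split into columns."""
    return sum(1 for _ in lines)
```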

Need to pick to branch-2.0 after this PR: #22304
2023-08-26 12:59:05 +08:00
f66f161017 [fix](multi-catalog)fix hive table with cosn location issue (#23409)
Sometimes the partitions of a hive table may be on different storage systems, e.g., some on HDFS and others on object storage (cos, etc).
This PR mainly changes:

1. Fix the bug of accessing files via cosn.
2. Add a new field `fs_name` in TFileRangeDesc
    This is because, when accessing a file, the BE gets an hdfs client from the hdfs client cache, and different files in one query request may have different fs names, e.g., some are `hdfs://` and some are `cosn://`. So we need to specify the fs name for each file; otherwise, it may return an error:

`reason: IllegalArgumentException: Wrong FS: cosn://doris-build-1308700295/xxxx, expected: hdfs://172.xxxx:4007`
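Why a per-file fs name matters can be shown with a small client cache keyed by it. Here the fs name is derived from the file URI with `urllib.parse`; in the PR it is carried in the new `fs_name` field, and the `FsClientCache` class and `connect` callback are hypothetical.

```python
from typing import Callable, Dict
from urllib.parse import urlparse


def fs_name_of(path: str) -> str:
    """Derive the filesystem name (scheme plus authority) from a file URI."""
    parsed = urlparse(path)
    return f"{parsed.scheme}://{parsed.netloc}"


class FsClientCache:
    """Toy client cache: one client per fs name, never shared across filesystems."""

    def __init__(self, connect: Callable[[str], object]) -> None:
        self._connect = connect
        self._clients: Dict[str, object] = {}

    def client_for(self, path: str) -> object:
        fs_name = fs_name_of(path)
        if fs_name not in self._clients:
            self._clients[fs_name] = self._connect(fs_name)
        return self._clients[fs_name]
```

With this keying, a `cosn://` file in the same query as `hdfs://` files gets its own client rather than the one created for the first file.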
2023-08-26 00:16:00 +08:00
8af1e7f27f [Fix](orc-reader) Fix incorrect result if null partition fields in orc file. (#23369)
Fix incorrect result if null partition fields in orc file. 

### Root Cause
Theoretically, the underlying files of a hive partition table should not contain partition fields. But we found that in some user scenarios, the partition fields exist in the underlying orc/parquet files with null values. As a result, predicates pushed down on the partition fields are evaluated against these null values and filter incorrectly.

### Solution
We handle this case by only reading non-partition fields. The parquet reader already handles it this way; this PR does the same for the orc reader.
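The solution amounts to dropping partition columns from the set read from the file and taking their values from the partition itself. A minimal sketch with hypothetical helper names and sample data:

```python
from typing import Dict, List


def columns_to_read(file_columns: List[str], partition_columns: List[str]) -> List[str]:
    """Read only non-partition fields, even if the file also stores partition fields."""
    partition_set = {c.lower() for c in partition_columns}
    return [c for c in file_columns if c.lower() not in partition_set]


def fill_row(file_row: Dict[str, object], partition_values: Dict[str, object]) -> Dict[str, object]:
    """Partition values come from the partition, not from the file."""
    return {**file_row, **partition_values}


cols = columns_to_read(["id", "name", "dt"], ["dt"])
row = fill_row({"id": 1, "name": "a"}, {"dt": "2023-08-26"})
```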
2023-08-26 00:13:11 +08:00
a3a951c71d [Fix](multi-catalog) Fix load string dict issue for transactional hive tables. (#23306)
Fix load string dict issue for transactional hive tables. The column name needs to be passed as 'row.column_name'.

apache/doris-thirdparty#112
2023-08-26 00:09:12 +08:00
29273771f7 [Fix](multi-catalog) Fix hive incorrect result by disable string dict filter if exprs contain null expr. (#23361)
Issue Number: close #21960

Fix hive incorrect result by disabling the string dict filter if exprs contain a null expr.
2023-08-25 21:16:43 +08:00
d331bfc513 [Performance](pipeline) support shared scan segment in mow (#23305) 2023-08-25 10:43:02 +08:00
Pxl
d9db3f5431 [Improvement](scan) Remove redundant predicates on scan node (#23374)
* Remove redundant predicates on scan node

* update

* fix
2023-08-25 10:41:37 +08:00