Commit Graph

120 Commits

Author SHA1 Message Date
44a1a20e65 [feature-wip](parquet-reader)parse parquet schema (#11381)
Analyze schema elements in parquet FileMetaData, and generate the hierarchy of nested fields.
For example:
1. primitive type
```
// thrift:
optional int32 <column-name>;
// sql definition:
<column-name> int32;
```
2. nested type
```
// thrift:
optional group <column-name> (LIST) {
  repeated group bag {
    optional group array_element (LIST) {
      repeated group bag {
        optional int32 array_element
      }
    }
  }
}
// sql definition:
<column-name> array<array<int32>>
```
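A minimal C++ sketch of how such a LIST-group traversal could produce the SQL type string; `FieldSchema` and `to_sql_type` are illustrative names, not the actual Doris reader code:

```cpp
// Illustrative sketch only: map a (LIST)-annotated group hierarchy to a
// SQL type string such as "array<array<int32>>".
#include <string>
#include <vector>

struct FieldSchema {
    std::string name;
    bool is_list = false;               // group annotated with (LIST)
    std::string primitive_type;         // e.g. "int32" for leaf fields
    std::vector<FieldSchema> children;  // the resolved element for lists
};

std::string to_sql_type(const FieldSchema& field) {
    if (field.is_list) {
        // The thrift schema wraps the element in "repeated group bag";
        // here children.front() stands for the already-resolved element.
        return "array<" + to_sql_type(field.children.front()) + ">";
    }
    return field.primitive_type;
}
```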
2022-08-02 10:56:13 +08:00
4960043f5e [enhancement] Refactor to improve the usability of MemTracker (step2) (#10823) 2022-07-21 17:11:28 +08:00
a044b5dcc5 [refactor](predicate) refactor predicates in scan node (#10701)
* [refactor](predicate) refactor predicates in scan node

* update
2022-07-11 09:21:01 +08:00
c358a43f35 [feature-wip] support parquet predicate push down (#10512) 2022-07-08 23:11:25 +08:00
139cd3d11a [Improvement] remove olap filters when use in key ranges (#10278) 2022-06-23 09:12:29 +08:00
Pxl
fd0bd395ac [Enhancement] Remove some unused include (#10035) 2022-06-17 10:47:25 +08:00
f7b5f36da4 [feature] Support read hive external table and outfile into HDFS that authenticated by kerberos (#9579)
At present, Doris can only access a Hadoop cluster with Kerberos authentication enabled through a broker; Doris BE itself
does not support access to Kerberos-authenticated HDFS files.

This PR solves that problem.

When creating a hive external table, users only need to specify the following properties to access HDFS data with Kerberos authentication enabled:

```sql
CREATE EXTERNAL TABLE t_hive (
k1 int NOT NULL COMMENT "",
k2 char(10) NOT NULL COMMENT "",
k3 datetime NOT NULL COMMENT "",
k5 varchar(20) NOT NULL COMMENT "",
k6 double NOT NULL COMMENT ""
) ENGINE=HIVE
COMMENT "HIVE"
PROPERTIES (
'hive.metastore.uris' = 'thrift://192.168.0.1:9083',
'database' = 'hive_db',
'table' = 'hive_table',
'dfs.nameservices'='hacluster',
'dfs.ha.namenodes.hacluster'='n1,n2',
'dfs.namenode.rpc-address.hacluster.n1'='192.168.0.1:8020',
'dfs.namenode.rpc-address.hacluster.n2'='192.168.0.2:8020',
'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',
'dfs.namenode.kerberos.principal'='hadoop/_HOST@REALM.COM',
'hadoop.security.authentication'='kerberos',
'hadoop.kerberos.principal'='doris_test@REALM.COM',
'hadoop.kerberos.keytab'='/path/to/doris_test.keytab'
);
```

If you want to `select into outfile` to HDFS with Kerberos authentication enabled, you can refer to the following SQL statement:

```sql
select * from test into outfile "hdfs://tmp/outfile1" 
format as csv
properties
(
'fs.defaultFS'='hdfs://hacluster/',
'dfs.nameservices'='hacluster',
'dfs.ha.namenodes.hacluster'='n1,n2',
'dfs.namenode.rpc-address.hacluster.n1'='192.168.0.1:8020',
'dfs.namenode.rpc-address.hacluster.n2'='192.168.0.2:8020',
'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',
'dfs.namenode.kerberos.principal'='hadoop/_HOST@REALM.COM',
'hadoop.security.authentication'='kerberos',
'hadoop.kerberos.principal'='doris_test@REALM.COM',
'hadoop.kerberos.keytab'='/path/to/doris_test.keytab'
);
```
2022-06-14 20:07:03 +08:00
94089b9192 [Refactor] Use file factory to replace create file reader/writer (#9505)
1. Simplify code logic and improve abstraction
2. Fix the memory leak of raw pointers

Co-authored-by: lihaopeng <lihaopeng@baidu.com>
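A hedged sketch of the factory idea, with illustrative reader classes (the actual FileFactory interface may differ): ownership is returned to the caller as a unique_ptr, which removes the raw-pointer leak class.

```cpp
#include <memory>
#include <string>

struct FileReader { virtual ~FileReader() = default; };
struct LocalFileReader : FileReader { explicit LocalFileReader(const std::string&) {} };
struct S3FileReader : FileReader { explicit S3FileReader(const std::string&) {} };

// The factory hides which concrete reader is built and hands back an
// owning pointer, so no caller ever deletes a raw FileReader*.
std::unique_ptr<FileReader> create_file_reader(const std::string& path) {
    if (path.rfind("s3://", 0) == 0) {
        return std::make_unique<S3FileReader>(path);
    }
    return std::make_unique<LocalFileReader>(path);
}
```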
2022-06-08 15:07:39 +08:00
cd105bee0a [refactor](es) Clean es tcp scannode and related thrift definitions (#9553)
PaloExternalSourcesService was designed for es_scan_node to use the TCP protocol.
But the ES TCP protocol requires deploying a TCP jar into the ES code. Both the ES version and the Lucene version
have been upgraded, and the TCP jar is no longer maintained.

Therefore, all the related code and thrift definitions are removed.
2022-05-14 10:03:55 +08:00
eec1dfde3a [feature] (vec) avoid converting lines to the src tuple for stream load in vectorized mode. (#9314)
Co-authored-by: xiepengcheng01 <xiepengcheng01@xafj-palo-rpm64.xafj.baidu.com>
2022-05-09 11:24:07 +08:00
5a44eeaf62 [refactor] Unify all unit tests into one binary file (#8958)
1. Solves the earlier problems that the unit test binaries were too large (1.7G+) and the unit test link time was too long
2. Unifies all unit tests into one binary, significantly reducing unit test execution time to less than 3 minutes
3. Temporarily disables stream_load_test.cpp, metrics_action_test.cpp, and load_channel_mgr_test.cpp, because part of their code will be re-implemented and they affect other tests
2022-04-12 15:30:40 +08:00
d51545a952 [fix](ut)(memory-leak) Fix be asan ut failed and hdfs file reader memory leak (#8905) 2022-04-08 00:07:00 +08:00
eb68dd0bb5 [fix](ut) Fix be ut not work for byte_buffer_test2 and json_scanner_with_jsonpath_test (#8791) 2022-04-01 10:12:47 +08:00
71d050d0bc [improvement][test](log) Add more error messages on HDFS connection failure, and a corresponding UT (#8755)
I hit a failure reading HDFS files in a broker load; the error message was unclear and
I spent a lot of time locating the problem.

```
W0330 11:08:01.093812 2755268 broker_scan_node.cpp:364] Scanner[0] process failed. status=connect failed.
W0330 11:08:01.097682 2018787 fragment_mgr.cpp:234] Got error while opening fragment 712ae2b848324cb6-94a83d646173c1e9: Internal error: connect failed.
W0330 11:08:01.097702 2018787 tablet_sink.cpp:148] connect failed.
```

We should add more information when connecting to HDFS fails.
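One way the extra context could look, as a sketch assuming fmt is available; the helper name and fields are illustrative, not the actual patch:

```cpp
#include <cerrno>
#include <cstring>
#include <string>

#include <fmt/format.h>

// Build a connect-failure message that names the namenode and the errno
// text instead of a bare "connect failed".
std::string hdfs_connect_error(const std::string& namenode, int port) {
    return fmt::format("connect to hdfs failed. namenode: {}:{}, error: {}",
                       namenode, port, std::strerror(errno));
}
```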
2022-03-31 13:56:25 +08:00
eeae516e37 [Feature](Memory) Hook TCMalloc new/delete to automatically count into MemTracker (#8476)
Early Design Documentation: https://shimo.im/docs/DT6JXDRkdTvdyV3G

Implement a new way of memory statistics based on TCMalloc new/delete hooks,
MemTracker, and TLS; it is expected that all memory new/delete/malloc/free
in the BE process can be counted.
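A rough sketch of the hook mechanism using the gperftools MallocHook API and a thread-local tracker pointer; the real Doris wiring (re-entrancy guards, attaching/detaching trackers per task) is more involved:

```cpp
#include <cstddef>

#include <gperftools/malloc_extension.h>
#include <gperftools/malloc_hook.h>

struct MemTracker {
    void consume(long bytes) { (void)bytes; /* update counters */ }
};

static thread_local MemTracker* tls_tracker = nullptr;

// NOTE: a real implementation must guard against re-entrancy,
// since consume() itself may allocate.
static void new_hook(const void* /*ptr*/, size_t size) {
    if (tls_tracker != nullptr) tls_tracker->consume(static_cast<long>(size));
}

static void delete_hook(const void* ptr) {
    if (tls_tracker != nullptr && ptr != nullptr) {
        // Size is not passed to the delete hook; ask the allocator.
        size_t size = MallocExtension::instance()->GetAllocatedSize(ptr);
        tls_tracker->consume(-static_cast<long>(size));
    }
}

void install_memory_hooks() {
    MallocHook::AddNewHook(&new_hook);
    MallocHook::AddDeleteHook(&delete_hook);
}
```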
2022-03-20 23:06:54 +08:00
e17aef9467 [refactor] refactor the implement of MemTracker, and related usage (#8322)
Modify the implementation of MemTracker:
1. Simplify a lot of useless logic;
2. Add MemTrackerTaskPool as the ancestor of all query and import trackers, used to track the local memory usage of all executing tasks;
3. Add a consume/release cache: a real consume/release is triggered only when the accumulated memory exceeds the parameter mem_tracker_consume_min_size_bytes (see the sketch below);
4. Add a new memory-leak detection mode (experimental feature): throw an exception when the remaining tracked value is outside the specified range when the MemTracker is destructed, and print the accurate value over HTTP; controlled by the parameter memory_leak_detection;
5. Add a virtual MemTracker whose consume/release is not synced to the parent; it will be used later, when the TCMalloc hook is introduced, to record specified memory independently;
6. Modify the GC logic: register the buffer cached in DiskIoMgr as a GC function, and add other GC functions later;
7. Change the global root node from Root MemTracker to Process MemTracker, and remove Process MemTracker from exec_env;
8. Modify the macro that detects whether memory has reached the upper limit, modify the parameters and default behavior of creating a MemTracker, modify the error message format in mem_limit_exceeded, extend and apply transfer_to, remove Metric from MemTracker, etc.

Modify where MemTracker is used:
1. MemPool adds a constructor that creates a temporary tracker, avoiding a lot of redundant code;
2. Add trackers for global objects such as ChunkAllocator and StorageEngine;
3. Add more fine-grained trackers such as ExprContext;
4. RuntimeState removes FragmentMemTracker (the PlanFragmentExecutor mem_tracker, previously used to independently track scan-process memory) and replaces it with _scanner_mem_tracker in OlapScanNode;
5. MemTracker is no longer recorded in ReservationTracker, and ReservationTracker will be removed later.
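A sketch of the consume/release cache from item 3 of the first list: accumulate small deltas locally and flush them to the shared tracker only when they cross the threshold (all names are illustrative):

```cpp
#include <atomic>
#include <cstdint>
#include <cstdlib>

struct MemTracker {
    std::atomic<int64_t> consumption{0};
    void consume(int64_t bytes) { consumption.fetch_add(bytes); }
};

constexpr int64_t kConsumeMinSizeBytes = 1 << 20;  // flush threshold

struct CachedConsumer {
    MemTracker* tracker;
    int64_t untracked = 0;  // locally accumulated, not yet in the tracker

    void consume(int64_t bytes) {
        untracked += bytes;
        // Touch the shared atomic only once the accumulated delta is large
        // enough, trading a little accuracy for far fewer atomic ops.
        if (std::abs(untracked) >= kConsumeMinSizeBytes) {
            tracker->consume(untracked);
            untracked = 0;
        }
    }
};
```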
2022-03-11 22:04:23 +08:00
e7c417505c [fix] handle hash table insert() failures that were previously ignored (#8207) 2022-03-03 22:33:05 +08:00
c66a9bf64b [fix](be-ut) fix unit test bug for tablet_info_test (#8253)
introduced from #8041
2022-02-27 10:44:20 +08:00
0726a43a2a [fix](be-ut) Fix unused-but-set-variable errors. (#8211) 2022-02-23 21:43:15 +08:00
50864aca7d [refactor] fix warnings when compiling with clang (#8069) 2022-02-19 11:29:02 +08:00
aea3e4e59b [refactor] Remove version hash from BE and related test in BE (#8027) 2022-02-14 09:29:27 +08:00
f8d086d87f [feature](rpc) (experimental) Support implementing UDFs through the GRPC protocol. (#7519)
Support implementing UDFs through the GRPC protocol. This brings several benefits:
1. The UDF implementation language is not limited to C++; users can implement UDFs in any language they are familiar with
2. UDFs are decoupled from Doris: a UDF cannot cause a Doris coredump, UDF computing resources are separated from Doris, and Doris services are not affected

However, an RPC UDF has a fixed overhead, so its performance is much slower than a C++ UDF, especially when the amount of data is large.

Create a function like:

```
CREATE FUNCTION rpc_add(INT, INT) RETURNS INT PROPERTIES (
  "SYMBOL"="add_int",
  "OBJECT_FILE"="127.0.0.1:9999",
  "TYPE"="RPC"
);
```
The function service needs to implement the `check_fn` and `fn_call` methods.
Note:
THIS IS AN EXPERIMENTAL FEATURE, THE INTERFACE AND DATA STRUCTURE MAY BE CHANGED IN FUTURE !!!
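A hypothetical server-side shape, assuming a proto with `check_fn` and `fn_call` RPCs has been compiled with protoc; every generated name below is an assumption, only the two method names come from this commit:

```cpp
#include <grpcpp/grpcpp.h>

#include "function_service.grpc.pb.h"  // hypothetical generated header

// Hypothetical implementation backing the rpc_add example above.
class AddIntServiceImpl final : public FunctionService::Service {
public:
    grpc::Status check_fn(grpc::ServerContext*, const CheckFnRequest*,
                          CheckFnResponse*) override {
        // Verify that symbol "add_int" exists with signature (INT, INT) -> INT.
        return grpc::Status::OK;
    }

    grpc::Status fn_call(grpc::ServerContext*, const FnCallRequest* request,
                         FnCallResponse* response) override {
        // Element-wise addition of the two INT argument columns goes here.
        (void)request; (void)response;
        return grpc::Status::OK;
    }
};
```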
2022-02-08 09:25:09 +08:00
5fc0a9f40d [improvement](Load) Cancel the load job ASAP when encountering unqualified data (#6319)
This PR mainly changes:

1. Cancel the load job as soon as possible when unqualified data is encountered.
    The solution is described in #6318.
    Also replace some std::stringstream with fmt::memory_buffer to avoid performance issues (see the sketch after this list).

2. Fix an NPE bug when creating a user with an empty host
3. Fix compile warnings after rebasing on the master (vectorization)
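For the std::stringstream to fmt::memory_buffer swap in item 1, a minimal sketch of the pattern (the function name is illustrative):

```cpp
#include <iterator>
#include <string>

#include <fmt/format.h>

// Append pieces into a stack-backed buffer and materialize once at the
// end, instead of paying std::stringstream's locale/allocation overhead.
std::string build_error_msg(int line_no, const std::string& reason) {
    fmt::memory_buffer buf;
    fmt::format_to(std::back_inserter(buf), "line {}: rejected, reason={}",
                   line_no, reason);
    return fmt::to_string(buf);
}
```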
2022-01-18 13:13:55 +08:00
4ed1846369 [fix](ut) Fix BE broker scanner unit test bug (#7486)
introduced from #7454
2021-12-26 10:30:37 +08:00
fc9e502b51 [improvement](brpc)(config) Support transfer RowBatch in Controller Attachment (#7164)
Transfer the RowBatch in the protobuf request as a controller attachment
when the maximum length of the RowBatch in the protobuf request is exceeded.
This avoids reaching the upper limit of the protobuf request length (2G),
and performance is expected to improve.
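A sketch of the attachment path, assuming a brpc::Controller is at hand; the helper and the serialized-bytes argument are illustrative:

```cpp
#include <string>

#include <brpc/controller.h>

// The attachment bypasses protobuf serialization entirely, so the 2GB
// protobuf message limit no longer applies to the row data itself.
void attach_row_batch(brpc::Controller* cntl,
                      const std::string& row_batch_bytes) {
    cntl->request_attachment().append(row_batch_bytes.data(),
                                      row_batch_bytes.size());
}
```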
2021-12-02 11:41:38 +08:00
6c4aeab06f [fix](broker-load) BE may crash when using preceding filter in broker or routine load (#7193)
The broker scan node has two tuple descriptors:
one is the dest tuple and the other is the src tuple.
The src tuple is used to read the lines of the original file,
and the dest tuple is used to save the converted lines.
The preceding filter is executed on the src tuple, so the src tuple
descriptor should be used to initialize the filter expression.
2021-11-30 22:04:05 +08:00
e2d3d0134e Add a method to get Doris's current memory usage (#6979)
Check all memory usage when TryConsume is called
2021-11-24 10:07:54 +08:00
6c6380969b [refactor] replace boost smart ptr with stl (#6856)
1. replace all boost::shared_ptr with std::shared_ptr
2. replace all boost::scoped_ptr with std::unique_ptr
3. replace all boost::scoped_array with std::unique_ptr<T[]>
4. replace all boost::thread with std::thread
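The replacements are mechanical; a before/after flavour with illustrative types:

```cpp
#include <memory>
#include <thread>

struct Foo { void run() {} };

void example() {
    // boost::scoped_ptr<Foo> p(new Foo());             // before
    std::unique_ptr<Foo> p = std::make_unique<Foo>();   // after

    // boost::scoped_array<int> a(new int[8]);          // before
    std::unique_ptr<int[]> a = std::make_unique<int[]>(8);  // after

    // boost::thread t(&Foo::run, p.get());             // before
    std::thread t(&Foo::run, p.get());                  // after
    t.join();
}
```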
2021-11-17 10:18:35 +08:00
088a16d33b Update Chinese comments (#6958)
* Modify Chinese comments (#6951)
2021-11-09 18:00:14 +08:00
0393c9b3b9 [Optimize] Support send batch parallelism for olap table sink (#6397)
* Support send batch parallelism for olap table sink

Co-authored-by: caiconghui <caiconghui@xiaomi.com>
2021-08-30 11:03:09 +08:00
9216735cfa [New Feature] Support Vectorization Execution Engine Interface For Doris (#6329)
1. FE vectorized plan code
2. Function register vec function
3. Diff function nullable type
4. New thirdparty code and new thrift struct
2021-08-11 14:54:06 +08:00
f26e3408b2 [Profile] Support show load profile for broker load job (#6214)
1. Add a new statement: `SHOW LOAD PROFILE "xxx";`
2. Improve the read performance of the ORC scanner
2021-07-27 13:37:34 +08:00
ed3ff470ce [ARRAY] Support array type load and select, not including access by index (#5980)
This is part of the array type support and has not been fully completed.
The following functions are implemented:
1. FE array type support and implementation of array functions, supporting array syntax analysis and planning
2. Support importing array type data through insert into
3. Support selecting array type data
4. The array type is only supported in the value columns of duplicate tables

This PR merges some code from #4655 #4650 #4644 #4643 #4623 #2979
2021-07-13 14:02:39 +08:00
dbfe8e4753 [enhancement] Optimize load CSV file memory allocate (#6174)
Optimize memory allocation when loading CSV files to avoid frequent allocations;
this may reduce load time by 40%-50% for tables with many columns
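A generic sketch of the idea, not the actual scanner code: keep one growable buffer per scanner and parse columns as views into it, so per-row allocations disappear once capacity stabilizes:

```cpp
#include <string>
#include <string_view>
#include <vector>

struct CsvRow {
    std::string buffer;                  // reused across rows, grows once
    std::vector<std::string_view> cols;  // views into `buffer`, no copies

    void reset(std::string_view line, char sep) {
        buffer.assign(line);             // one copy; capacity is retained
        cols.clear();
        size_t start = 0;
        while (true) {
            size_t pos = buffer.find(sep, start);
            if (pos == std::string::npos) {
                cols.emplace_back(buffer.data() + start, buffer.size() - start);
                break;
            }
            cols.emplace_back(buffer.data() + start, pos - start);
            start = pos + 1;
        }
    }
};
```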
2021-07-12 09:58:45 +08:00
739c0268ff [refactor] Remove decimal v1 related code from code base (#6079)
Remove all DECIMAL V1 type code; this is a part of #6073
2021-07-07 10:26:32 +08:00
149def9e42 [Feature] Support RuntimeFilter in Doris (BE Implement) (#6077)
1. support in/bloomfilter/minmax filters
2. support broadcast/shuffle/bucket-shuffle/colocate joins
3. optimize memory use and CPU cache misses while building the runtime filter
4. optimize memory use in left semi join (works well on tpcds-95)
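A toy illustration of the min/max variant in item 1: the join build side records the key range and ships it to the scan, which drops rows outside it (types simplified to int64_t, not the actual Doris filter classes):

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

struct MinMaxFilter {
    int64_t min = std::numeric_limits<int64_t>::max();
    int64_t max = std::numeric_limits<int64_t>::min();

    void insert(int64_t v) {      // called while building the hash table
        min = std::min(min, v);
        max = std::max(max, v);
    }
    bool find(int64_t v) const {  // called by the scan to skip rows early
        return v >= min && v <= max;
    }
};
```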
2021-07-04 20:59:05 +08:00
bde60280b8 [Optimize] use string_view instead of std::string in string function (#6010) 2021-06-16 09:40:13 +08:00
ba38973209 use virtual hosted-style requests to access the object store (#5894)
* use virtual hosted-style requests to access the object store
2021-05-27 15:52:07 +08:00
01a45e8691 add read buffer when use s3 reader (#5791) 2021-05-17 11:46:38 +08:00
11cce06962 [Feature] Support create history dynamic partition (#5703)
1. Add a new dynamic partition property `create_history_partition`.
    If set to true, Doris will create all partitions from `start` to `end`.

2. Add a new FE config `max_dynamic_partition_num`
    to limit the number of partitions created when creating one table.
2021-05-08 12:05:19 +08:00
de87f4ae84 [Feature] Add list partition support (#5529)
Add list partition support
2021-04-24 17:42:27 +08:00
be733cfa9c [Metrics] Add some large memtrackers' metric (#5614)
MemTracker reports memory consumption so we can find out which
module consumes more memory, but only as a current value. This patch
adds metrics for some large memory consumers, so we can see which
module consumes more memory over time; this is useful for
troubleshooting OOM problems and optimizing configs.
2021-04-21 09:15:04 +08:00
892fbf6ded Update s3_reader_test.cpp (#5658) 2021-04-15 10:59:30 +08:00
c4cc681d14 remove boost_foreach, using c++ foreach instead (#5611) 2021-04-15 10:52:29 +08:00
cef3cbc53a [Bug] Fix bug that the last column may be null when using a multi-byte separator (#5534) 2021-03-23 09:35:30 +08:00
a91888a68b [BUG] fix memory limit failure and optimize memory usage in join stage (#5514)
This patch works well on tpcds-1T query-24
2021-03-21 11:32:51 +08:00
e023ef5404 [Load] Support multi bytes LineDelimiter and ColumnSeparator (#5462)
* [Internal][Support Multibytes Separator] doris-1079
support multi-byte LineDelimiter and ColumnSeparator
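A sketch of splitting on a multi-byte separator with std::string_view::find instead of a single-byte memchr (illustrative helper, not the actual reader code):

```cpp
#include <string_view>
#include <vector>

// Split `line` on a separator that may be several bytes long,
// e.g. split("a||b||c", "||") -> {"a", "b", "c"}.
std::vector<std::string_view> split(std::string_view line,
                                    std::string_view sep) {
    std::vector<std::string_view> fields;
    size_t start = 0;
    while (true) {
        size_t pos = line.find(sep, start);
        if (pos == std::string_view::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + sep.size();  // skip the whole separator, not one byte
    }
    return fields;
}
```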
2021-03-09 09:35:39 +08:00
6ede4c6ec1 [Feature] Support backup,restore,load,export directly connect to s3 (#5399)
* [doris-1008] support backup and restore directly to cloud storage via aws s3 protocol

* [Internal][S3DirectAccess] Support backup, restore, load, export directly connecting to S3
1. Support load and export data from/to S3 directly.
2. Add a config to auto-convert broker access to S3 access when available

Change-Id: Iac96d4b3670776708bc96a119ff491db8cb4cde7

(cherry picked from commit 2f03832ca52221cc7436069b96c45c48c4bc7201)

* [Internal][S3DirectAccess] File path glob compatible with broker

Change-Id: Ie55e07a547aa22c6fa8d432ca926216c10384e68
(cherry picked from commit d4fb25544c0dc06d23e1ada571ec3f8edd4ba56f)

* [internal] [doris-1008] fix log4j class not found

Change-Id: I468176aca0d821383c74ee658d461aba9e7d5be3
(cherry picked from commit 029adaa9d6ded8503acbd6644c1519456f3db232)

* add poms

Co-authored-by: yangzhengguo01 <yangzhengguo01@baidu.com>
2021-02-22 16:07:56 +08:00
7eae3e280a [optimization] use inline to optimize ExprContext::get_value (#5385) 2021-02-16 22:35:14 +08:00
780900ac9c [Feature] Support preceding filter original data when loading (#5338)
Support conditional filtering of the original data in broker load and routine load,
e.g.:

```
LOAD LABEL `label1`
(
DATA INFILE ('bos://cmy-repo/1.csv')
INTO TABLE tbl2
COLUMNS TERMINATED BY '\t'
(event_day, product_id, ocpc_stage, user_id)
SET (
	ocpc_stage = ocpc_stage + 100
)
PRECEDING FILTER user_id = 1381035
WHERE ocpc_stage > 30
)
...
```
2021-02-07 22:37:48 +08:00