* [doris-1008] support backup and restore directly to cloud storage via aws s3 protocol
* Internal][S3DirectAccess] Support backup,restore,load,export directlyconnect to s3
1. Support load and export data from/to s3 directly.
2. Add a config to auto convert broker access to s3 acces when available
Change-Id: Iac96d4b3670776708bc96a119ff491db8cb4cde7
(cherry picked from commit 2f03832ca52221cc7436069b96c45c48c4bc7201)
* [Internal][S3DirectAccess] File path glob compatible with broker
Change-Id: Ie55e07a547aa22c6fa8d432ca926216c10384e68
(cherry picked from commit d4fb25544c0dc06d23e1ada571ec3f8edd4ba56f)
* [internal] [doris-1008] fix log4j class not found
Change-Id: I468176aca0d821383c74ee658d461aba9e7d5be3
(cherry picked from commit 029adaa9d6ded8503acbd6644c1519456f3db232)
* add poms
Co-authored-by: yangzhengguo01 <yangzhengguo01@baidu.com>
Support conditional filtering of original data in broker load and routine load
eg:
```
LOAD LABEL `label1`
(
DATA INFILE ('bos://cmy-repo/1.csv')
INTO TABLE tbl2
COLUMNS TERMINATED BY '\t'
(event_day, product_id, ocpc_stage, user_id)
SET (
ocpc_stage = ocpc_stage + 100
)
PRECEDING FILTER user_id = 1381035
WHERE ocpc_stage > 30
)
...
```
At present, the application of vlog in the code is quite confusing.
It is inherited from impala VLOG_XX format, and there is also VLOG(number) format.
VLOG(number) format does not have a unified specification, so this pr standardizes the use of VLOG
Stream load should read all the data completely before parsing the json.
And also add a new BE config streaming_load_max_batch_read_mb
to limit the data size when loading json data.
Fix the bug of loading empty json array []
Add doc to explain some certain case of loading json format data.
Fix: #4124
This CL mainly changes:
1. Reorganized the code logic to limit the supported json format to two, and the import behavior is more consistent.
2. Modified the statistical behavior of the number of error rows when loading in json format, so that the error rows can be counted correctly.
3. See `load-json-format.md` to get details of loading json format.
After checking, I found that broker load in 0.11 added num_of_columns_from_file parameter in thrift. This parameter does not consider compatibility in BE.
So broker load could cause BE crashed during the upgrade
Currently, we do not support parsing encoded/compressed columns in file path, eg: extract column k1 from file path /path/to/dir/k1=1/xxx.csv
This patch is able to parse columns from file path like in Spark(Partition Discovery).
This patch parse partition columns at BrokerScanNode.java and save parsing result of each file path as a property of TBrokerRangeDesc, then the broker reader of BE can read the value of specified partition column.
Currently, Doris only support UTF-8 encoded data. All data will be
shown to user in UTF-8 format. So if data loaded in Doris does not
UTF-8 encoded, user will see garbled data when querying.
I introduce a fast UTF-8 validator from
https://github.com/lemire/fastvalidate-utf-8
This validator is highly optimized that it only takes 0.7 CPU cycles
to validata a 64k string. And by testing 1GB data load to Doris, the
validator has no impact on performance.
This change adds a load property named strict_mode which is used to prohibit the incorrect data.
When it is set to false, the incorrect data will be loaded by NULL just like before.
When it is set to true, the incorrect data which belongs to a column without expr will be filtered.
The strict_mode is supported in broker load v2 now. It will be supported in stream load later.
1. Apache HDFS broker support HDFS HA and Hadoop kerberos authentication.
2. New Backup and Restore function. Use Fs Broker to backup your data to HDFS or restore them from HDFS.
3. Table-Level Privileges. Grant fine-grained privileges on table-level to specified user.
4. A lot of bugs fixed.
5. Performance improvement.