This CL mainly changes:
1. Add a new BE config `max_pushdown_conditions_per_column` to limit the number of conditions on a single column that can be pushed down to the storage engine.
2. Add 2 new session variables `max_scan_key_num` and `doris_max_scan_key_num`, which can be set at the session level and override the config values in BE (see the sketch below).
The problem is described in ISSUE #3678.
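For illustration only, a minimal sketch of adjusting the session-level limit (the variable names come from the description above; the exact override semantics between the session variables and the BE config may differ):
```sql
-- Sketch: raise the scan-key limit for the current session only.
-- The BE config max_pushdown_conditions_per_column itself is set in be.conf.
SET max_scan_key_num = 48;
SHOW VARIABLES LIKE '%scan_key%';
```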
This CL mainly changes the rules for creating dynamic partitions.
1. If the time unit is DAY, the logic remains unchanged.
2. If the time unit is WEEK, the logic changes as follows:
    1. The start day of each week can be set; the default is Monday, and any day from Monday to Sunday can be chosen.
    2. Assuming the start day is Tuesday, the range of a partition is from Tuesday of that week to Monday of the next week.
3. If the time unit is MONTH, the logic changes as follows:
    1. The start date of each month can be set; the default is the 1st, and any date from the 1st to the 28th can be chosen.
    2. Assuming the start date is the 2nd, the range of a partition is from the 2nd of that month to the 1st of the next month.
4. The `SHOW DYNAMIC PARTITION TABLES` statement adds a `StartOf` column showing the start day of the week or month.
A sketch is shown below; for full details, refer to the example in `dynamic-partition.md`.
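For illustration, a minimal sketch of a WEEK-based dynamic partition table, assuming the start day is configured through a property such as `dynamic_partition.start_day_of_week` (the authoritative property names and full example are in `dynamic-partition.md`):
```sql
CREATE TABLE example_db.dynamic_week_tbl
(
    k1 DATE,
    k2 INT
)
DUPLICATE KEY (k1)
PARTITION BY RANGE (k1) ()
DISTRIBUTED BY HASH (k1) BUCKETS 4
PROPERTIES
(
    "dynamic_partition.enable" = "true",
    "dynamic_partition.time_unit" = "WEEK",
    "dynamic_partition.start" = "-2",
    "dynamic_partition.end" = "2",
    "dynamic_partition.prefix" = "p",
    "dynamic_partition.buckets" = "4",
    -- 2 = Tuesday, so each partition covers Tuesday through the following Monday
    "dynamic_partition.start_day_of_week" = "2"
);
```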
TODO:
It would be better to also support the HOUR and YEAR time units, perhaps in a next PR.
FIX: #3678
1. User interface:
1.1 Spark resource management
Spark is used as an external computing resource in Doris to do ETL work. In the future, other external resources may also be used in Doris, for example, MapReduce for ETL, Spark/GPU for queries, and HDFS/S3 for external storage. We introduced resource management to manage these external resources used by Doris.
```sql
-- create spark resource
CREATE EXTERNAL RESOURCE resource_name
PROPERTIES
(
type = spark,
spark_conf_key = spark_conf_value,
working_dir = path,
broker = broker_name,
broker.property_key = property_value
)
-- drop spark resource
DROP RESOURCE resource_name
-- show resources
SHOW RESOURCES
SHOW PROC "/resources"
-- privileges
GRANT USAGE_PRIV ON RESOURCE resource_name TO user_identity
GRANT USAGE_PRIV ON RESOURCE resource_name TO ROLE role_name
REVOKE USAGE_PRIV ON RESOURCE resource_name FROM user_identity
REVOKE USAGE_PRIV ON RESOURCE resource_name FROM ROLE role_name
```
- CREATE EXTERNAL RESOURCE:
`FOR user_name` is optional. If specified, the external resource belongs to that user; if not, the resource belongs to the system and is available to all users.
PROPERTIES:
1. type: resource type. Only spark is supported now.
2. spark configuration: follows the standard Spark configuration format; refer to https://spark.apache.org/docs/latest/configuration.html.
3. working_dir: optional, used to store intermediate ETL results in Spark ETL.
4. broker: optional, used in Spark ETL. The intermediate ETL results need to be read through the broker when they are pushed to BE.
Example:
```sql
CREATE EXTERNAL RESOURCE "spark0"
PROPERTIES
(
"type" = "spark",
"spark.master" = "yarn",
"spark.submit.deployMode" = "cluster",
"spark.jars" = "xxx.jar,yyy.jar",
"spark.files" = "/tmp/aaa,/tmp/bbb",
"spark.yarn.queue" = "queue0",
"spark.executor.memory" = "1g",
"spark.hadoop.yarn.resourcemanager.address" = "127.0.0.1:9999",
"spark.hadoop.fs.defaultFS" = "hdfs://127.0.0.1:10000",
"working_dir" = "hdfs://127.0.0.1:10000/tmp/doris",
"broker" = "broker0",
"broker.username" = "user0",
"broker.password" = "password0"
)
```
- SHOW RESOURCES:
General users can only see their own resources.
Admin and root users can see all resources.
1.2 Create spark load job
```sql
LOAD LABEL db_name.label_name
(
DATA INFILE ("/tmp/file1") INTO TABLE table_name, ...
)
WITH RESOURCE resource_name
[(key1 = value1, ...)]
[PROPERTIES (key2 = value2, ... )]
```
Example:
```sql
LOAD LABEL example_db.test_label
(
DATA INFILE ("hdfs:/127.0.0.1:10000/tmp/file1") INTO TABLE example_table
)
WITH RESOURCE "spark0"
(
"spark.executor.memory" = "1g",
"spark.files" = "/tmp/aaa,/tmp/bbb"
)
PROPERTIES ("timeout" = "3600")
```
The Spark configurations in the load statement can temporarily override the existing configurations in the resource.
#3010
This CL mainly changes:
1. Support the `SELECT INTO OUTFILE` command (see the sketch below).
2. Support exporting query results to a file via Broker.
3. Support the CSV export format with a specified column separator and line delimiter.
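For illustration, a sketch of exporting a query result through a broker (the property keys are assumptions based on the description above; the broker name and paths reuse the examples elsewhere in this document):
```sql
SELECT k1, k2 FROM example_db.example_table
INTO OUTFILE "hdfs://127.0.0.1:10000/tmp/result_"
FORMAT AS CSV
PROPERTIES
(
    "broker.name" = "broker0",        -- broker used to write the file
    "broker.username" = "user0",
    "broker.password" = "password0",
    "column_separator" = ",",         -- CSV column separator
    "line_delimiter" = "\n"           -- CSV line delimiter
);
```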
* Support bitmap_intersect
Support the aggregate function `bitmap_intersect`, which is mainly used to take the intersection of grouped data.
The function `bitmap_intersect(expr)` calculates the intersection of bitmap columns and returns a bitmap object.
The definition is as follows:
FunctionName: bitmap_intersect
InputType: bitmap
OutputType: bitmap
The scenario is as follows:
Query which users satisfy all three tags a, b, and c at the same time.
```
select bitmap_to_string(bitmap_intersect(user_id)) from
(
select bitmap_union(user_id) user_id from bitmap_intersect_test
where tag in ('a', 'b', 'c')
group by tag
) a
```
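For context, a hypothetical table behind the query above (not part of the original PR): an aggregate-key table where `user_id` is a bitmap column aggregated with BITMAP_UNION.
```sql
CREATE TABLE bitmap_intersect_test
(
    tag VARCHAR(20),
    user_id BITMAP BITMAP_UNION
)
AGGREGATE KEY (tag)
DISTRIBUTED BY HASH (tag) BUCKETS 1;

-- Raw ids are converted to bitmaps when loading, e.g.:
INSERT INTO bitmap_intersect_test VALUES
('a', to_bitmap(1)), ('a', to_bitmap(2)),
('b', to_bitmap(1)), ('b', to_bitmap(3)),
('c', to_bitmap(1));
```
With this data, the query above returns 1, the only user carrying all three tags.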
Closed #3552.
* Add docs of bitmap_union and bitmap_intersect
* Support null of bitmap_intersect
This CL mainly changes:
1. Shade and provide the thrift lib in spark-doris-connector
2. Add a `build.sh` for spark-doris-connector
3. Move the README.md of spark-doris-connector to `docs/`
4. Change the line delimiter of `fe/src/test/java/org/apache/doris/analysis/AggregateTest.java`
Fix some formatting.
NOTICE (#3622):
This is a "revert of revert" pull request.
This PR is mainly used to combine PRs whose commits were scattered due to a wrong merge method into a single complete commit.
Add a new config `drop_backend_after_decommission` in FE. If this config
is false, the BE will not be dropped after the decommission operation finishes.
This new config tries to solve the problem described in ISSUE #3460 (see the sketch below).
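A minimal sketch, assuming the config is runtime-mutable on the master FE; otherwise it can be set in fe.conf:
```sql
-- Keep decommissioned BEs instead of dropping them automatically.
ADMIN SET FRONTEND CONFIG ("drop_backend_after_decommission" = "false");
```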
TODO:
This method will generate a lot of data migration, so it is only a temporary solution.
After that, we should try to solve the problem of data balancing within the BE.
This CL also adds documents for the FE and BE configurations.
These documents are incomplete and can be supplemented later.
Fix #3390
This CL adds more info to the `JobDetails` column of the `SHOW LOAD` result for Broker Load jobs.
For example:
```
{
"Unfinished backends": {
"9c3441027ff948a0-8287923329a2b6a7": [10002]
},
"All backends": {
"9c3441027ff948a0-8287923329a2b6a7": [10002, 10004, 10006]
},
"ScannedRows": 2390016,
"TaskNumber": 1,
"FileNumber": 1,
"FileSize": 1073741824
}
```
Two newly added keys:
`Unfinished backends` indicates the BEs whose tasks are not yet finished.
`All backends` indicates the BEs on which this job has tasks.
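For example, the two keys can be inspected through the `JobDetails` column (a usage sketch reusing the label from the load example above):
```sql
SHOW LOAD FROM example_db WHERE LABEL = "test_label";
```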
One more thing: the Backend Id is now passed along with the heartbeat message from FE to BE, so that
each BE can know its own Id.
This CL adds a new command to set the replication number of a table in one statement.
```
alter table test_tbl set ("replication_num" = "3");
```
It changes the replication number of an unpartitioned table,
and
```
alter table test_tbl set ("default.replication_num" = "3");
```
It changes the default replication number of the specified table.
If not reset, all queries coming from the same session will have the same isQuery field value.
This bug causes all entries in fe.audit.log to have IsQuery=true.
This CL also fixes another bug:
The resolved IPs of one user's domain should not appear in another user's white list. Fix #3380
HttpURLConnection can automatically redirect a stream load to BE, but there is no authorization
information in the HTTP request headers after the redirect.
HttpURLConnection probably removes the authorization info when following a redirect.
The solution is to set the followRedirects property to false on the connection object and perform the
redirect request manually.
#3364