Currently, there are some useless includes in the codebase. We can use a tool named include-what-you-use to optimize these includes. By using a strict include-what-you-use policy, we can get lots of benefits from it.
Added bprc stub cache check and reset api, used to test whether the bprc stub cache is available, and reset the bprc stub cache
add a config used for auto check and reset bprc stub
1 Avoid sending tasks if there is no data to be consumed
By fetching latest offset of partition before sending tasks.(Fix [Optimize] Avoid too many abort task in routine load job #6803 )
2 Add a preCheckNeedSchedule phase in update() of routine load.
To avoid taking write lock of job for long time when getting all kafka partitions from kafka server.
3 Upgrade librdkafka's version to 1.7.0 to fix a bug of "Local: Unknown partition"
See offsetsForTimes fails with 'Local: Unknown partition' edenhill/librdkafka#3295
4 Avoid unnecessary storage migration task if there is no that storage medium on BE.
Fix [Bug] Too many unnecessary storage migration tasks #6804
Use `commitAsync` to commit offset to kafka, instead of using `commitSync`, which may block for a long time.
Also assign a group.id to routine load if user not specified "property.group.id" property, so that all consumer of
this job will use same group.id instead of a random id for each consume task.
Support when creating a kafka routine load, start consumption from a specified point in time instead of a specific offset.
eg:
```
FROM KAFKA
(
"kafka_broker_list" = "broker1:9092,broker2:9092",
"kafka_topic" = "my_topic",
"property.kafka_default_offsets" = "2021-10-10 11:00:00"
);
or
FROM KAFKA
(
"kafka_broker_list" = "broker1:9092,broker2:9092",
"kafka_topic" = "my_topic",
"kafka_partitions" = "0,1,2",
"kafka_offsets" = "2021-10-10 11:00:00, 2021-10-10 11:00:00, 2021-10-10 12:00:00"
);
```
This PR also reconstructed the analysis method of properties when creating or altering
routine load jobs, and unified the analysis process in the `RoutineLoadDataSourceProperties` class.
At present, the application of vlog in the code is quite confusing.
It is inherited from impala VLOG_XX format, and there is also VLOG(number) format.
VLOG(number) format does not have a unified specification, so this pr standardizes the use of VLOG
Remove the default constructor for UniqueID
Add a gen_uid method in UniqueId. If need to generate a new uid, users should call this api explicitly.
Reuse boost random generator not generate a new one every time.
Commit kafka offset in routine load
Kafka will decide whether to delete data based on whether all consumer group is commit offset or not. If there is no commit offset, the kafka server disk may be full
1. get_json_xxx() now support using quoto to escape dot
2. Implement json_path_prepare() function to preprocess json_path
Performance of get_json_string() on 1000000 rows reduces from 2.27s to 0.27s
1. Use a data consumer group to share a single stream load pipe with multi data consumers. This will increase the consuming speed of Kafka messages, as well as reducing the task number of routine
load job.
Test results:
* 1 consumer, 1 partitions:
consume time: 4.469s, rows: 990140, bytes: 128737139. 221557 rows/s, 28M/s
* 1 consumer, 3 partitions:
consume time: 12.765s, rows: 2000143, bytes: 258631271. 156689 rows/s, 20M/s
blocking get time(us): 12268241, blocking put time(us): 1886431
* 3 consumers, 3 partitions:
consume time(all 3): 6.095s, rows: 2000503, bytes: 258631576. 328220 rows/s, 42M/s
blocking get time(us): 1041639, blocking put time(us): 10356581
The next 2 cases show that we can achieve higher speed by adding more consumers. But the bottle neck transfers from Kafka consumer to Doris ingestion, so 3 consumers in a group is enough.
I also add a Backend config `max_consumer_num_per_group` to change the number of consumers in a data consumer group, and default value is 3.
In my test(1 Backend, 2 tablets, 1 replicas), 1 routine load task can achieve 10M/s, which is same as raw stream load.
2. Add OFFSET_BEGINNING and OFFSET_END support for Kafka routine load
1. stream load executor will abort txn when no correct data in task
2. change txn label to DebugUtil.print(UUID) which is same as task id printed by be
3. change print uuid to hi-lo