Commit Graph

62 Commits

Author SHA1 Message Date
991dc7fc5c [fix][routine-load] fix bug that routine load cannot cancel task when append_data returns an error (#8457) 2022-03-14 10:18:14 +08:00
538df28737 [improvement](routine-load) Support routine load task succeed with empty data consumed (#8256) 2022-03-03 22:35:50 +08:00
c0e59e59aa [fix][refactor] fix bugs and refactor some code by lint (#7871)
1. Fix some `passedByValue` issues.
2. Fix some `dereferenceBeforeCheck` issues.
3. Fix some `uninitMemberVar` issues.
4. Fix some iterator `eraseDereference` issues.
5. Fix compile issue introduced from #7923 #7905 #7848
2022-02-01 14:31:14 +08:00
5f8d91257b [improvement](routine-load) Reduce the probability that routine load task RPCs time out (#7754)
If a load task has a relatively short timeout, then we need to ensure that
each RPC of this task does not get blocked for a long time.
An RPC is usually blocked for one of two reasons:

1. Handling "memory exceeds limit" in the RPC

    If the system finds that the memory occupied by the load exceeds the threshold,
    it will select the load channel that occupies the most memory and flush its memtable.
    This operation is done in the RPC, which may be time-consuming.

2. Closing the load channel

    When the load channel receives the last batch, it will end the task.
    It will wait synchronously for all memtable flushes to finish. This process is also time-consuming.

Therefore, this PR solves this problem by:

1. Using the timeout to determine whether a load task is high priority

    If the timeout of a load task is relatively short, we mark it as a high-priority task (see the sketch after this list).

2. Not processing "memory exceeds limit" for high-priority tasks.
3. Using a separate flush thread to flush memtables for high-priority tasks.
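A minimal sketch of the priority routing described above, using hypothetical names (`kHighPriorityTimeoutSec`, `classify_load_task`, `handle_memory_pressure`) rather than the actual Doris identifiers:

```
#include <cstdint>

constexpr int64_t kHighPriorityTimeoutSec = 60;  // illustrative threshold

enum class LoadPriority { NORMAL, HIGH };

LoadPriority classify_load_task(int64_t timeout_sec) {
    // A short timeout leaves no room for blocking RPCs, so treat the
    // task as high priority.
    return timeout_sec <= kHighPriorityTimeoutSec ? LoadPriority::HIGH
                                                  : LoadPriority::NORMAL;
}

void handle_memory_pressure(LoadPriority prio) {
    if (prio == LoadPriority::HIGH) {
        return;  // skip the expensive "memory exceeds limit" handling
    }
    // ... otherwise pick the largest load channel and flush its memtable ...
}

int main() {
    LoadPriority prio = classify_load_task(/*timeout_sec=*/20);
    handle_memory_pressure(prio);  // no-op for this high-priority task
    return 0;
}
```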
2022-01-16 10:41:31 +08:00
f3817829bb [fix] fix malloc and free mismatch issue (#7702)
The memory allocated by `malloc` should be freed by `free`.
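For illustration, the correct pairing in plain C++ (not code from this commit):

```
#include <cstdlib>

int main() {
    // Allocated with malloc, so it must be released with free;
    // releasing it with delete/delete[] is undefined behavior.
    char* buf = static_cast<char*>(std::malloc(1024));
    if (buf == nullptr) return 1;
    // ... use buf ...
    std::free(buf);   // correct pairing
    // delete[] buf;  // WRONG: mismatched allocator
    return 0;
}
```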
2022-01-14 09:32:33 +08:00
83f6eef506 [improvement](routine-load) Make routine load work with old kafka version (#7630)
Co-authored-by: caiconghui1 <caiconghui1@jd.com>
2022-01-10 17:30:24 +08:00
760fc02bfe Added brpc stub cache check and reset API, used to test whether the brpc stub cache is available, and to reset the brpc stub cache (#6916)
Add a config used to automatically check and reset the brpc stub.
2021-11-05 09:45:37 +08:00
5ef3f59928 [Optimize][RoutineLoad] Avoid sending tasks if there is no data to be consumed (#6805)
1. Avoid sending tasks if there is no data to be consumed,
by fetching the latest offset of each partition before sending tasks. (Fix [Optimize] Avoid too many abort task in routine load job #6803)

2. Add a preCheckNeedSchedule phase in update() of routine load,
to avoid holding the job's write lock for a long time when fetching all Kafka partitions from the Kafka server (see the sketch after this list).

3. Upgrade librdkafka to version 1.7.0 to fix a "Local: Unknown partition" bug.
See offsetsForTimes fails with 'Local: Unknown partition' edenhill/librdkafka#3295

4. Avoid unnecessary storage migration tasks if that storage medium does not exist on the BE.
Fix [Bug] Too many unnecessary storage migration tasks #6804
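A minimal sketch of the pre-check pattern from item 2, with placeholder names (`fetch_all_kafka_partitions`, `JobState`, `update_job`); it only illustrates the locking idea, not the actual FE code:

```
#include <mutex>
#include <vector>

std::mutex job_lock;

struct JobState {
    std::vector<int> partitions;
};

// Stand-in for the real metadata request to the Kafka server (slow).
std::vector<int> fetch_all_kafka_partitions() {
    return {0, 1, 2};
}

void update_job(JobState& job) {
    // Pre-check phase: no lock is held while talking to Kafka.
    std::vector<int> latest = fetch_all_kafka_partitions();

    // The critical section is now just a cheap in-memory update.
    std::lock_guard<std::mutex> guard(job_lock);
    job.partitions = std::move(latest);
}

int main() {
    JobState job;
    update_job(job);
    return 0;
}
```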
2021-10-13 11:39:01 +08:00
ad3c9390a2 [Bug] Fix bdbje getDatabaseNames() bug and scan node close bug (#6769)
1. This bug is introduced from #6582
2. Optimize the error log of the "Address already in use" error msg.
3. Add some document about compilation.
    1. Add a custom thirdparty download url.
    2. Add a custom com.alibaba maven jar package for DataX.
4. Fix bug that BE crash when closing scan node, introduced from #6622.
2021-09-29 11:11:28 +08:00
521fb15a9b [Bug] Fix some memory bugs (#6699)
1. Fix a memory leak in `collect_iterator.cpp` (Fix #6700)
2. Add a new BE config `max_segment_num_per_rowset` to limit the num of segment in new rowset.(Fix #6701)
3. Make the error msg of stream load more friendly.
2021-09-22 12:30:14 +08:00
fa290383dc [Doc] Modify README to add some statistical indicators (#6486)
1. Add license/total lines/release badges.
2. Add monthly active contributor and contributor growth graphs.
3. Fix a pom.xml bug.
4. Modify some routine load logs on the BE side.
2021-08-25 09:36:26 +08:00
2c208e932b [Bug][RoutineLoad] Avoid TOO_MANY_TASKS error (#6342)
Use `commitAsync` to commit offsets to Kafka instead of `commitSync`, which may block for a long time.
Also assign a group.id to the routine load job if the user did not specify the "property.group.id" property,
so that all consumers of this job will use the same group.id instead of a random id for each consume task.
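A minimal sketch of the two changes using librdkafka's C++ API; the broker address, group.id value, and error handling are illustrative:

```
#include <iostream>
#include <string>
#include <librdkafka/rdkafkacpp.h>

int main() {
    std::string errstr;
    RdKafka::Conf* conf = RdKafka::Conf::create(RdKafka::Conf::CONF_GLOBAL);
    conf->set("bootstrap.servers", "broker1:9092", errstr);
    // Fixed group.id for the whole job instead of a random one per task.
    conf->set("group.id", "doris_routine_load_job_10086", errstr);

    RdKafka::KafkaConsumer* consumer =
            RdKafka::KafkaConsumer::create(conf, errstr);
    if (!consumer) {
        std::cerr << "failed to create consumer: " << errstr << std::endl;
        return 1;
    }

    // ... subscribe and consume messages ...

    // commitAsync() returns immediately; commitSync() may block for a
    // long time and trigger TOO_MANY_TASKS on the FE side.
    consumer->commitAsync();

    consumer->close();
    delete consumer;
    delete conf;
    return 0;
}
```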
2021-08-03 11:59:06 +08:00
8b4721c941 [Bug] Fix kafka consumer reuse bug (#6007)
When judging whether a consumer can be reused, it is necessary to check whether the parameter content is equal (see the sketch below).
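An illustrative sketch of such a content check, with a hypothetical `ConsumerKey` standing in for the real cache key:

```
#include <map>
#include <string>

// A cached consumer can only be reused when broker list, topic and all
// custom properties match.
struct ConsumerKey {
    std::string brokers;
    std::string topic;
    std::map<std::string, std::string> properties;

    bool operator==(const ConsumerKey& other) const {
        // Comparing only the job id is not enough; the parameter
        // *content* must be equal.
        return brokers == other.brokers && topic == other.topic &&
               properties == other.properties;
    }
};

int main() {
    ConsumerKey cached{"broker1:9092", "my_topic", {{"property.group.id", "g1"}}};
    ConsumerKey wanted{"broker1:9092", "my_topic", {{"property.group.id", "g2"}}};
    return cached == wanted ? 1 : 0;  // properties differ: do not reuse
}
```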
2021-06-16 09:39:05 +08:00
07ad038870 [Feature][RoutineLoad] Support for consuming kafka from the point of time (#5832)
When creating a Kafka routine load job, support starting consumption from a specified point in time instead of from a specific offset.
e.g.:
```
FROM KAFKA
(
    "kafka_broker_list" = "broker1:9092,broker2:9092",
    "kafka_topic" = "my_topic",
    "property.kafka_default_offsets" = "2021-10-10 11:00:00"
);

or

FROM KAFKA
(
    "kafka_broker_list" = "broker1:9092,broker2:9092",
    "kafka_topic" = "my_topic",
    "kafka_partitions" = "0,1,2",
    "kafka_offsets" = "2021-10-10 11:00:00, 2021-10-10 11:00:00, 2021-10-10 12:00:00"
);
```

This PR also refactors the parsing of properties when creating or altering
routine load jobs, unifying the parsing process in the `RoutineLoadDataSourceProperties` class.
2021-05-22 23:37:53 +08:00
01a45e8691 Add read buffer when using S3 reader (#5791) 2021-05-17 11:46:38 +08:00
5fed34fcfe [optimize] provide a better defer operator (#5706) 2021-05-12 10:37:23 +08:00
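A defer operator of this kind is typically a small RAII wrapper that runs a callback when the scope ends; a minimal sketch (not the actual Doris implementation):

```
#include <cstdio>
#include <utility>

// Run a callback when the enclosing scope exits, on every exit path.
template <typename Fn>
class Defer {
public:
    explicit Defer(Fn fn) : _fn(std::move(fn)) {}
    ~Defer() { _fn(); }
    Defer(const Defer&) = delete;
    Defer& operator=(const Defer&) = delete;

private:
    Fn _fn;
};

int main() {
    FILE* fp = std::fopen("/tmp/demo.txt", "w");
    if (fp == nullptr) return 1;
    Defer close_file([fp] { std::fclose(fp); });  // runs on every exit path
    std::fputs("hello", fp);
    return 0;  // close_file's destructor closes fp here
}
```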
98e80aa65e [refactor] Replace boost::function with std::function (#5700)
2021-05-09 22:00:48 +08:00
a4f8194111 [Audit][Stream Load] Support audit function for stream load (#5452)
Record finished stream load job (both successful job and failed job) into audit log
so that we can see when the stream load job was executed and check the details of stream load jobs.
2021-04-21 16:36:12 +08:00
128752b4f9 [Routine load] Fix kafka load too many task bug (#5327) 2021-02-03 13:23:30 +08:00
93a4c7efc1 [LOG] Standardize the use of VLOG in code (#5264)
At present, the use of VLOG in the code is quite confusing.
Some of it is inherited from Impala's VLOG_XX format, and there is also the VLOG(number) format.
The VLOG(number) format does not follow a unified specification, so this PR standardizes the use of VLOG (see the sketch below).
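A sketch of what such a standardization can look like with glog; the level names and numbers are illustrative, not necessarily the ones chosen by this PR:

```
#include <glog/logging.h>

// Named verbosity levels replace scattered magic numbers.
#define VLOG_CRITICAL VLOG(1)   // rare, important diagnostics
#define VLOG_DEBUG    VLOG(7)   // routine debugging output
#define VLOG_TRACE    VLOG(10)  // very verbose tracing

void consume_batch(int rows) {
    VLOG_DEBUG << "consumed batch, rows=" << rows;
}

int main(int argc, char** argv) {
    google::InitGoogleLogging(argv[0]);
    consume_batch(1024);
    return 0;
}
```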
2021-01-21 12:09:09 +08:00
6fedf5881b [CodeFormat] Clang-format cpp sources (#4965)
Clang-format all c++ source files.
2020-11-28 18:36:49 +08:00
796f44beac [Bug] Fix bug that routine load blocked with TOO_MANY_TASKS error (#4861)
When receiving an empty message from Kafka, the load process quits abnormally.
Fix #4860
2020-11-12 10:05:10 +08:00
09f97f8a05 [Refactor] Fixes some be typo part 2 (#4747) 2020-10-20 09:28:57 +08:00
3438a746ac [Typo] Fix typo in metrics macros (#4739)
Just fix typos.
Rename DEFINE_GAUGE_METRIC_PROTOTYPE_5ARG(name, unit) to DEFINE_GAUGE_METRIC_PROTOTYPE_2ARG(name, unit).
Rename DEFINE_GAUGE_METRIC_PROTOTYPE_2ARG(name, unit), which defines core metrics, to DEFINE_GAUGE_CORE_METRIC_PROTOTYPE_2ARG(name, unit).
2020-10-15 19:56:43 +08:00
f431d8d94c [Enhance][Log] Make RPC error log more clear (#4702)
At present, when some RPC errors occur, the client cannot obtain the error information well.

This CL changes the RPC error returned to the client like this:

```
ERROR 1064 (HY000): errCode = 2, detailMessage = there is no scanNode Backend. [10002: in black list(A error 
occurred: errorCode=2001 errorMessage:Channel inactive error!)]

ERROR 1064 (HY000): failed to send brpc batch, error=The server is overcrowded, error_text=[E1011]The server is 
overcrowded @xx.xx.xx.xx:8060 [R1][E1011]The server is overcrowded @xx.xx.xx.xx:8060 [R2][E1011]The server is 
overcrowded @xx.xx.xx.xx:8060 [R3][E1011]The server is overcrowded @xx.xx.xx.xx:8060, client: yy.yy.yy.yy
```
2020-10-13 10:08:43 +08:00
b780df697a [refactor] Optimize threads usage mode in BE (#4440)
BE cannot exit gracefully because some threads run in endless
loops. This patch makes the following optimizations:
- Use the well-encapsulated Thread and ThreadPool instead of std::thread
  and std::vector<std::thread>
- Use CountDownLatch in a thread's loop condition to avoid an endless loop
  (see the sketch after this list)
- Introduce a new class Daemon for daemon works, like tcmalloc_gc,
  memory_maintenance and calculate_metrics
- Decouple statistics-type TaskWorkerPool from StorageEngine notification
  by submitting tasks to TaskWorkerPool's queue
- Reorder objects' stop and destruction in main(), i.e. stop network
  services first, then internal services
- Use libevent in pthreads mode, by calling evthread_use_pthreads(), so
  that EvHttpServer can exit gracefully with multiple threads
- Call brpc::Server's Stop() and ClearServices() explicitly
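A minimal sketch of the CountDownLatch pattern from the list above; the `CountDownLatch` here is a self-contained stand-in for the real class, and `tcmalloc_gc_loop` is illustrative:

```
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

class CountDownLatch {
public:
    explicit CountDownLatch(int count) : _count(count) {}

    void count_down() {
        std::lock_guard<std::mutex> l(_mu);
        if (_count > 0 && --_count == 0) _cv.notify_all();
    }

    // Returns true if the latch reached zero within the timeout.
    bool wait_for(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> l(_mu);
        return _cv.wait_for(l, timeout, [this] { return _count == 0; });
    }

private:
    std::mutex _mu;
    std::condition_variable _cv;
    int _count;
};

CountDownLatch stop_latch(1);

void tcmalloc_gc_loop() {
    // Instead of `while (true) { sleep(10); ... }`, wake up as soon as
    // the latch is counted down so the daemon can exit gracefully.
    while (!stop_latch.wait_for(std::chrono::seconds(10))) {
        // ... perform one round of gc work ...
    }
}

int main() {
    std::thread t(tcmalloc_gc_loop);
    // ... on shutdown:
    stop_latch.count_down();
    t.join();
    return 0;
}
```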
2020-09-06 20:19:14 +08:00
e71152132c [metrics] Redesign metrics to 3 layers (#4115)
Redesign metrics into 3 layers:
    MetricRegistry - MetricEntity - Metric
    MetricRegistry: the register center.
    MetricEntity: an entity registered on the MetricRegistry. Generally, several MetricEntities can be
        registered on a MetricRegistry; each MetricEntity is an independent entity, such as server,
        disk_devices, data_directories, thrift clients and servers, and so on.
    Metric: a metric of an entity, such as fragment_requests_total on the server entity, disk_bytes_read
        on a disk_device entity, or thrift_opened_clients on a thrift_client entity.
    MetricPrototype: the type of a metric. A MetricPrototype is a global variable and can be shared by
        the same metric across different MetricEntities.
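An illustrative sketch of how the four classes might relate; the names mirror the description above, but the code is not the actual Doris implementation:

```
#include <cstdint>
#include <map>
#include <memory>
#include <string>

struct MetricPrototype {            // global, shared across entities
    std::string name;               // e.g. "disk_bytes_read"
};

struct Metric {                     // a metric of one entity
    const MetricPrototype* proto;
    int64_t value = 0;
};

class MetricEntity {                // e.g. one disk_device
public:
    Metric* register_metric(const MetricPrototype* proto) {
        auto& m = _metrics[proto->name];
        m = std::make_unique<Metric>(Metric{proto});
        return m.get();
    }
private:
    std::map<std::string, std::unique_ptr<Metric>> _metrics;
};

class MetricRegistry {              // the register center
public:
    MetricEntity* register_entity(const std::string& name) {
        auto& e = _entities[name];
        e = std::make_unique<MetricEntity>();
        return e.get();
    }
private:
    std::map<std::string, std::unique_ptr<MetricEntity>> _entities;
};

int main() {
    static MetricPrototype disk_bytes_read{"disk_bytes_read"};
    MetricRegistry registry;
    MetricEntity* disk = registry.register_entity("disk_device.sda");
    Metric* m = disk->register_metric(&disk_bytes_read);
    m->value += 4096;  // updated on each read
    return 0;
}
```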
2020-08-08 11:23:01 +08:00
fdd65c50c4 [Bug] fix mem_tracker use-after-free & add UT for it (#3899) 2020-06-20 19:08:53 +08:00
ef8fd1fcbe [Load] Support load json-data into Doris by RoutineLoad or StreamLoad (#3553)
Doris supports loading JSON data via RoutineLoad or StreamLoad.
2020-05-21 13:00:49 +08:00
72f3082358 [Metrics] Add some metrics for container size in BE (#3246)
With these metrics we can observe the workload of BE, and also check
whether there is any problem in BE, such as a container growing
too large and leading to OOM.

This patch adds the following metrics:
```
Name                                   Description
rowset_count_generated_and_in_use      The total count of rowset id generated and in use since BE last start
unused_rowsets_count                   The total count of unused rowset waiting to be GC
broker_count                           The total count of brokers in management
data_stream_receiver_count             The total count of data stream receivers in management
fragment_endpoint_count                The total count of fragment endpoints of data stream in management, should always equal to data_stream_receiver_count
active_scan_context_count              The total count of active scan contexts
plan_fragment_count                    The total count of plan fragments in executing
load_channel_count                     The total count of load channels in management
result_buffer_block_count              The total count of result buffer blocks for queries, each block has a limited queue size (default 1024)
result_block_queue_count               The total count of queues for fragments, each queue has a limited size (default 20, by config::max_memory_sink_batch_count)
routine_load_task_count                The total count of routine load tasks in executing
small_file_cache_count                 The total count of cached small files' digest info
stream_load_pipe_count                 The total count of stream load pipes, each pipe has a limited buffer size (default 1M)
tablet_writer_count                    The total count of tablet writers
brpc_endpoint_stub_count               The total count of brpc endpoints
```
2020-04-25 16:13:39 +08:00
1cf0fb9117 Use ThreadPool to refactor MemTableFlushExecutor (#2931)
1. MemTableFlushExecutor maintains a ThreadPool to receive FlushTasks.
2. FlushToken is used to separate tasks from different tablets.
   Each tablet's DeltaWriter constructs a FlushToken;
   tasks within a FlushToken are handled serially, and tasks across
   FlushTokens are handled concurrently (see the sketch after this list).
3. I have removed the thread limit on data_dir, because I/O is not the main
   time consumer of the flush thread. Much of the time is consumed by CPU
   decoding and compression.
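A conceptual sketch of the token pattern from item 2; `SimplePool` stands in for the real flush ThreadPool, and the chaining keeps one token's tasks strictly serial:

```
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// A tiny stand-in for the shared flush ThreadPool.
class SimplePool {
public:
    explicit SimplePool(int n) {
        for (int i = 0; i < n; ++i) _workers.emplace_back([this] { loop(); });
    }
    ~SimplePool() {  // drains remaining tasks, then joins the workers
        { std::lock_guard<std::mutex> l(_mu); _stop = true; }
        _cv.notify_all();
        for (auto& w : _workers) w.join();
    }
    void submit(std::function<void()> fn) {
        { std::lock_guard<std::mutex> l(_mu); _tasks.push_back(std::move(fn)); }
        _cv.notify_one();
    }
private:
    void loop() {
        for (;;) {
            std::function<void()> fn;
            {
                std::unique_lock<std::mutex> l(_mu);
                _cv.wait(l, [this] { return _stop || !_tasks.empty(); });
                if (_stop && _tasks.empty()) return;
                fn = std::move(_tasks.front());
                _tasks.pop_front();
            }
            fn();
        }
    }
    std::mutex _mu;
    std::condition_variable _cv;
    std::deque<std::function<void()>> _tasks;
    std::vector<std::thread> _workers;
    bool _stop = false;
};

// Tasks of one token are chained so they run one at a time; tasks of
// different tokens dispatch to the pool independently and run concurrently.
class FlushToken {
public:
    explicit FlushToken(SimplePool* pool) : _pool(pool) {}
    void submit(std::function<void()> task) {
        std::lock_guard<std::mutex> l(_mu);
        _queue.push_back(std::move(task));
        if (!_running) { _running = true; schedule_next(); }
    }
private:
    void schedule_next() {  // caller holds _mu
        _pool->submit([this] {
            std::function<void()> task;
            {
                std::lock_guard<std::mutex> l(_mu);
                task = std::move(_queue.front());
                _queue.pop_front();
            }
            task();  // one memtable flush of this tablet
            std::lock_guard<std::mutex> l(_mu);
            if (_queue.empty()) _running = false;
            else schedule_next();  // chain the next serial task
        });
    }
    SimplePool* _pool;
    std::mutex _mu;
    std::deque<std::function<void()>> _queue;
    bool _running = false;
};

int main() {
    SimplePool* pool = new SimplePool(4);
    FlushToken tablet_a(pool), tablet_b(pool);
    for (int i = 0; i < 3; ++i) {
        tablet_a.submit([] { /* flush one memtable of tablet a */ });
        tablet_b.submit([] { /* flush one memtable of tablet b */ });
    }
    delete pool;  // drain and join before the tokens go out of scope
    return 0;
}
```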
2020-02-18 18:39:04 +08:00
3c539aac54 [Refactor] Some tiny refactor on streaming-load related code (#2891)
Mainly contains the following modifications:
1. Use `std::unique_ptr` to replace some naked pointers
2. Change some methods from member methods to local static functions
3. Make private some methods that do not need to be public
4. Some formatting changes, such as wrapping lines that are too long
5. Remove some useless variables
6. Add or modify some comments for easier understanding

No functional changes in this patch.
2020-02-13 10:42:52 +08:00
9c90b09a3f [Alter Table] No need to check whether table is stable when doing some kinds of alter operation (#2617)

Not all alter table operations require the table to be stable, such as rename or metadata modification.
2020-01-02 20:51:23 +08:00
569d0bb3af Replace all remaining boost::split() with strings::split() (#2302) 2019-11-26 22:22:14 +08:00
45df6aae08 Fix some routine load bugs (#2093)
Mainly fix the following issues:

1. A null pointer exception is raised when a database or table is dropped. The expected behavior is that the routine load job is stopped.

2. Memory leaks. Batch routine load task submission is no longer performed; each task is now submitted separately.

3. Unreasonable task timeout.
    Routine load tasks should not be queued in the BE thread pool for execution. A task sent to the BE should be executed immediately; otherwise the task on the FE side will time out first, which eventually leads to constant timeouts for all subsequent tasks.

4. All routine load jobs should be scheduled as soon as they are submitted, without waiting for an available BE slot. Otherwise, jobs submitted later may never be scheduled.
2019-10-31 21:53:03 +08:00
f852f50acb Improve unique id performance (#1911)
Remove the default constructor of UniqueId.
Add a gen_uid method to UniqueId. If a new uid is needed, users should call this API explicitly.
Reuse the boost random generator instead of creating a new one every time.
2019-09-29 18:20:02 +08:00
5a12a1d7df Fix compile error (#1780) 2019-09-10 23:48:42 +08:00
235cdb0ecd Commit kafka offset (#1734)
Commit kafka offset in routine load

Kafka decides whether to delete data based on whether all consumer groups have committed their offsets. If offsets are not committed, the Kafka server's disk may fill up.
2019-09-10 14:27:06 +08:00
9d03ba236b Uniform Status (#1317) 2019-06-14 23:38:31 +08:00
ff0dd0d2da Support SSL authentication with Kafka in routine load job (#1235) 2019-06-07 16:29:01 +08:00
180d8e5cbd Modify some thirdparties (#1228)
1. Change the Kafka Java client from 2.0.0 to 0.10.1.1, because a high-version client may not support a low-version server.
2. Enable SSL in librdkafka
2019-05-30 21:23:37 +08:00
9d19c6c315 Support arbitrary kafka properties (#1204) 2019-05-28 10:03:50 +08:00
722a9e71c7 Optimize json functions (#1177)
1. get_json_xxx() now supports using quotes to escape dots
2. Implement a json_path_prepare() function to preprocess the json_path (see the sketch below)

Performance of get_json_string() on 1,000,000 rows drops from 2.27s to 0.27s
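An illustrative sketch of the prepare-once idea behind json_path_prepare(); `parse_json_path` is a simplified stand-in for the real preprocessing:

```
#include <sstream>
#include <string>
#include <vector>

// Split "$.a.b" into segments once. (The real function also handles
// quoted segments that escape dots.)
std::vector<std::string> parse_json_path(const std::string& path) {
    std::vector<std::string> segments;
    std::stringstream ss(path);
    std::string seg;
    while (std::getline(ss, seg, '.')) {
        if (seg != "$") segments.push_back(seg);
    }
    return segments;
}

int main() {
    // Prepared once when the path argument is constant for the query...
    std::vector<std::string> prepared = parse_json_path("$.store.book");
    for (int row = 0; row < 1000000; ++row) {
        // ...then every row walks the pre-split segments instead of
        // re-parsing the path, which is where the speedup comes from.
        (void)prepared;
    }
    return 0;
}
```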
2019-05-21 09:13:12 +08:00
7f8a1bcdb6 Threadpool should be shut down before join() (#1171) 2019-05-17 19:10:22 +08:00
b2e63910a6 Fix bug that routine load task may be blocked due to premature destruction (#1166)
The data consumer group should wait for all data consumers to finish before returning.
2019-05-16 16:15:00 +08:00
cf1e7aa844 Add close tablet writer log (#1014) 2019-04-28 10:33:50 +08:00
2b4d02b2fa Add error load log url for routine load job (#938) 2019-04-28 10:33:50 +08:00
0cccb5cc9c Fix bugs of routine load job (#917)
1. An uninitialized counter causes endless data consuming.
2. Incorrect handling of null values in column mapping.
2019-04-28 10:33:50 +08:00
400d8a906f Optimize the consumer assignment of Kafka routine load job (#870)
1. Use a data consumer group to share a single stream load pipe among multiple data consumers. This increases the consuming speed of Kafka messages and reduces the number of tasks of a routine load job (see the sketch below).

Test results:

* 1 consumer, 1 partitions:
    consume time: 4.469s, rows: 990140, bytes: 128737139.  221557 rows/s, 28M/s
* 1 consumer, 3 partitions:
    consume time: 12.765s, rows: 2000143, bytes: 258631271. 156689 rows/s, 20M/s
    blocking get time(us): 12268241, blocking put time(us): 1886431
* 3 consumers, 3 partitions:
    consume time(all 3): 6.095s, rows: 2000503, bytes: 258631576. 328220 rows/s, 42M/s
    blocking get time(us): 1041639, blocking put time(us): 10356581

The last two cases show that we can achieve higher speed by adding more consumers, but the bottleneck shifts from the Kafka consumer to Doris ingestion, so 3 consumers in a group is enough.

I also add a Backend config `max_consumer_num_per_group` to change the number of consumers in a data consumer group; the default value is 3.

In my test (1 Backend, 2 tablets, 1 replica), 1 routine load task can achieve 10M/s, which is the same as raw stream load.

2. Add OFFSET_BEGINNING and OFFSET_END support for Kafka routine load
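A conceptual sketch of a data consumer group: several consumers feed one bounded pipe that a single load task drains; all names and sizes are illustrative:

```
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

class BoundedPipe {
public:
    explicit BoundedPipe(size_t cap) : _cap(cap) {}
    void put(std::string msg) {          // blocking put
        std::unique_lock<std::mutex> l(_mu);
        _not_full.wait(l, [this] { return _q.size() < _cap; });
        _q.push_back(std::move(msg));
        _not_empty.notify_one();
    }
    std::string get() {                  // blocking get
        std::unique_lock<std::mutex> l(_mu);
        _not_empty.wait(l, [this] { return !_q.empty(); });
        std::string msg = std::move(_q.front());
        _q.pop_front();
        _not_full.notify_one();
        return msg;
    }

private:
    std::mutex _mu;
    std::condition_variable _not_full, _not_empty;
    std::deque<std::string> _q;
    size_t _cap;
};

int main() {
    BoundedPipe pipe(1024);
    const int max_consumer_num_per_group = 3;  // mirrors the BE config
    std::vector<std::thread> consumers;
    for (int i = 0; i < max_consumer_num_per_group; ++i) {
        consumers.emplace_back([&pipe, i] {
            for (int n = 0; n < 10; ++n)
                pipe.put("partition-" + std::to_string(i) + " msg");
        });
    }
    // The single load task side drains the shared pipe.
    for (int n = 0; n < 30; ++n) pipe.get();
    for (auto& c : consumers) c.join();
    return 0;
}
```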
2019-04-28 10:33:50 +08:00
9d08be3c5f Add metrics for routine load (#795)
* Add metrics for routine load
* Limit the max number of routine load tasks on a backend to 10
* Fix bug that some partitions will not be assigned
2019-04-28 10:33:50 +08:00