183 lines
8.3 KiB
Go
183 lines
8.3 KiB
Go
// Copyright 2024 PingCAP, Inc.
|
|
//
|
|
// Licensed under the Apache License, Version 2.0 (the "License");
|
|
// you may not use this file except in compliance with the License.
|
|
// You may obtain a copy of the License at
|
|
//
|
|
// http://www.apache.org/licenses/LICENSE-2.0
|
|
//
|
|
// Unless required by applicable law or agreed to in writing, software
|
|
// distributed under the License is distributed on an "AS IS" BASIS,
|
|
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
// See the License for the specific language governing permissions and
|
|
// limitations under the License.
|
|
|
|
// Package framework contains all the codes related to DXF.
|
|
//
|
|
// The goal of the DXF is to implement unified scheduling and distributed execution
|
|
// of tasks, and to provide unified resource management capabilities for both
|
|
// overall and individual tasks, which better meets users' expectations for resource usage.
|
|
//
|
|
// DXF runs on all the nodes of the TiDB cluster, there is an owner node responsible
|
|
// for scheduling resources and tasks, and all other nodes are followers which are
|
|
// responsible for executing tasks.
|
|
//
|
|
// # The components that are unique to owner nodes:
|
|
//
|
|
// - scheduler manager: manages schedulers of tasks, and it also responsible for
|
|
// post-cleanup of tasks, such as moving the task to history table, cleaning up
|
|
// intermediate data files generated by global sort.
|
|
// - task schedulers: responsible for managing lifecycle and scheduling of the task.
|
|
// - node manager: manages nodes that can be used to run tasks.
|
|
// - balancer: responsible for balancing the load of subtasks on all nodes.
|
|
//
|
|
// # The components that are running on all nodes:
|
|
//
|
|
// - task executor manager: manages all task executors.
|
|
// - slot manager: manages the slots or resources that can be used to run tasks.
|
|
// - task executor: responsible for executing the task.
|
|
//
|
|
// # Resource abstraction
|
|
//
|
|
// To fully utilize the resources and avoid resource overuse, we abstract the
|
|
// resources in the TiDB cluster as slots. A slot is the minimum granularity of
|
|
// node resources. For each node, slot count = number of cores, and each slot
|
|
// represents:
|
|
//
|
|
// - one core on the node
|
|
// - 1/number-of-cores * total-memory-of-memory
|
|
// - 1/number-of-cores * total-disk-space-of-disk.
|
|
// Note: this factor is not considered during scheduling right now.
|
|
//
|
|
// For example, if a node has 16 cores, 128GB memory, and 1TB disk space, then one
|
|
// slot represents: 1 core, 8GB memory, and 64GB disk space.
|
|
//
|
|
// The maximum number of slots that can be used by a task is determined its concurrency
|
|
// and the target scope.
|
|
//
|
|
// To better describe the resources that a task can use, we define a stripe as a
|
|
// slot group which consists one slot on each node of the same target scope. As
|
|
// we don't allow subtasks of the same task run on some node concurrently, so
|
|
// the maximum resource that a task can use is task-concurrency stripes.
|
|
//
|
|
// # Service scope
|
|
//
|
|
// To isolate resources, and avoid DXF tasks from interfering with online transactions,
|
|
// each node in the cluster have a service scope and the default scope is empty.
|
|
//
|
|
// A DXF task can only run on the nodes with the same scope as the target scope
|
|
// of the task.
|
|
//
|
|
// Due to history reasons, there is a special service scope 'background'. When
|
|
// scheduling a task with empty target scope, the task will run on the nodes of
|
|
// the 'background' scope if such nodes exist, otherwise, the task will run on the
|
|
// nodes of same scope, i.e. empty.
|
|
//
|
|
// # Task abstraction
|
|
//
|
|
// A task is abstracted as multiple steps that runs in sequence, each step contains
|
|
// multiple sub-tasks that runs in parallel, such as:
|
|
//
|
|
// task
|
|
// ├── step1
|
|
// │ ├── subtask1
|
|
// │ ├── subtask2
|
|
// │ └── subtask3
|
|
// └── step2
|
|
// ├── subtask1
|
|
// ├── subtask2
|
|
// └── subtask3
|
|
//
|
|
// For the steps of specific task type, see step.go for more detail.
|
|
//
|
|
// # Task order
|
|
//
|
|
// As the resources are limited, we need to schedule tasks in a certain order.
|
|
//
|
|
// In DXF, we manage to run tasks of higher ranking first, and then run tasks of
|
|
// lower ranking. A task of higher ranking might preempt the resources of a task
|
|
// of lower ranking.
|
|
//
|
|
// Note, we use the word rank instead of priority as it's only part of fields that
|
|
// determine the order of tasks. Task rank is defined by:
|
|
//
|
|
// priority asc, create_time asc, id asc.
|
|
//
|
|
// # Task state machine
|
|
//
|
|
// Note: if a task fails during running, it will end with `reverted` state.
|
|
// The `failed` state is used to mean the framework cannot run the task, such as
|
|
// invalid task type, scheduler init error(fatal), etc.
|
|
//
|
|
// normal execution state transition:
|
|
//
|
|
// ┌──────┐
|
|
// │failed│
|
|
// └──────┘
|
|
// ▲
|
|
// ┌──┴────┐ ┌───────┐ ┌────────┐
|
|
// │pending├────►│running├────►│succeed │
|
|
// └──┬────┘ └──┬┬───┘ └────────┘
|
|
// │ ││ ┌─────────┐ ┌────────┐
|
|
// │ │└────────►│reverting├────►│reverted│
|
|
// │ ▼ └─────────┘ └────────┘
|
|
// │ ┌──────────┐ ▲
|
|
// └─────────►│cancelling├────┘
|
|
// └──────────┘
|
|
//
|
|
// Note: if ManualRecovery is enabled, when some subtask failed, the task will
|
|
// move to `awaiting-resolution` state, and manual operation is needed for the
|
|
// task to continue. This mechanism is used for debugging, some bug such as those
|
|
// on global-sort are harder to investigate without the intermediate files, or to
|
|
// manually recover from some error when importing large mount of data using
|
|
// global-sort where one round of import takes a lot of time, it might be more
|
|
// flexible and efficient than retrying the whole task.
|
|
//
|
|
// pause/resume state transition:
|
|
// as we don't know the state of the task before `paused`, so the state after
|
|
// `resuming` is always `running`.
|
|
//
|
|
// ┌───────┐
|
|
// │pending├──┐
|
|
// └───────┘ │ ┌───────┐ ┌──────┐
|
|
// ├────►│pausing├──────►│paused│
|
|
// ┌───────┐ │ └───────┘ └───┬──┘
|
|
// │running├──┘ │
|
|
// └───▲───┘ ┌────────┐ │
|
|
// └────────────┤resuming│◄─────────┘
|
|
// └────────┘
|
|
//
|
|
// modifying state transition:
|
|
//
|
|
// ┌───────┐
|
|
// │pending├──┐
|
|
// └───────┘ │
|
|
// ┌───────┐ │ ┌─────────┐
|
|
// │running├──┼────►│modifying├────► original state
|
|
// └───────┘ │ └─────────┘
|
|
// ┌───────┐ │
|
|
// │paused ├──┘
|
|
// └───────┘
|
|
//
|
|
// # Subtask state machine
|
|
//
|
|
// NOTE: `running` -> `pending` only happens when some node is taken as dead, so
|
|
// its running subtask is balanced to other node, and the subtask is idempotent,
|
|
// we do this to make the subtask can be scheduled to other node again, it's NOT
|
|
// a normal state transition.
|
|
//
|
|
// ┌──────────────┐
|
|
// │ ┌───┴──┐
|
|
// │ ┌───────►│paused│
|
|
// ▼ │ └──────┘
|
|
// ┌───────┐ ┌───┴───┐ ┌───────┐
|
|
// │pending├───►│running├───►│succeed│
|
|
// └───────┘ └┬──┬───┘ └───────┘
|
|
// ▲ │ │ ┌──────┐
|
|
// └────────┘ ├───────►│failed│
|
|
// │ └──────┘
|
|
// │ ┌────────┐
|
|
// └───────►│canceled│
|
|
// └────────┘
|
|
package framework
|