tidb/pkg/dxf/framework/doc.go

// Copyright 2024 PingCAP, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// Package framework contains all the codes related to DXF.
//
// The goal of the DXF is to implement unified scheduling and distributed execution
// of tasks, and to provide unified resource management capabilities for both
// overall and individual tasks, which better meets users' expectations for resource usage.
//
// DXF runs on all the nodes of the TiDB cluster, there is an owner node responsible
// for scheduling resources and tasks, and all other nodes are followers which are
// responsible for executing tasks.
//
// # The components that are unique to owner nodes:
//
//   - scheduler manager: manages schedulers of tasks, and it also responsible for
//     post-cleanup of tasks, such as moving the task to history table, cleaning up
//     intermediate data files generated by global sort.
//   - task schedulers: responsible for managing lifecycle and scheduling of the task.
//   - node manager: manages nodes that can be used to run tasks.
//   - balancer: responsible for balancing the load of subtasks on all nodes.
//
// # The components that are running on all nodes:
//
// - task executor manager: manages all task executors.
// - slot manager: manages the slots or resources that can be used to run tasks.
// - task executor: responsible for executing the task.
//
// # Resource abstraction
//
// To fully utilize the resources and avoid resource overuse, we abstract the
// resources in the TiDB cluster as slots. A slot is the minimum granularity of
// node resources. For each node, slot count = number of cores, and each slot
// represents:
//
//   - one core on the node
//   - 1/number-of-cores * total-memory-of-memory
//   - 1/number-of-cores * total-disk-space-of-disk.
//     Note: this factor is not considered during scheduling right now.
//
// For example, if a node has 16 cores, 128GB memory, and 1TB disk space, then one
// slot represents: 1 core, 8GB memory, and 64GB disk space.
//
// The maximum number of slots that can be used by a task is determined its concurrency
// and the target scope.
//
// To better describe the resources that a task can use, we define a stripe as a
// slot group which consists one slot on each node of the same target scope. As
// we don't allow subtasks of the same task run on some node concurrently, so
// the maximum resource that a task can use is task-concurrency stripes.
//
// # Service scope
//
// To isolate resources, and avoid DXF tasks from interfering with online transactions,
// each node in the cluster have a service scope and the default scope is empty.
//
// A DXF task can only run on the nodes with the same scope as the target scope
// of the task.
//
// Due to history reasons, there is a special service scope 'background'. When
// scheduling a task with empty target scope, the task will run on the nodes of
// the 'background' scope if such nodes exist, otherwise, the task will run on the
// nodes of same scope, i.e. empty.
//
// # Task abstraction
//
// A task is abstracted as multiple steps that runs in sequence, each step contains
// multiple sub-tasks that runs in parallel, such as:
//
//	task
//	├── step1
//	│   ├── subtask1
//	│   ├── subtask2
//	│   └── subtask3
//	└── step2
//	    ├── subtask1
//	    ├── subtask2
//	    └── subtask3
//
// For the steps of specific task type, see step.go for more detail.
//
// # Task order
//
// As the resources are limited, we need to schedule tasks in a certain order.
//
// In DXF, we manage to run tasks of higher ranking first, and then run tasks of
// lower ranking. A task of higher ranking might preempt the resources of a task
// of lower ranking.
//
// Note, we use the word rank instead of priority as it's only part of fields that
// determine the order of tasks. Task rank is defined by:
//
//	priority asc, create_time asc, id asc.
//
// # Task state machine
//
// Note: if a task fails during running, it will end with `reverted` state.
// The `failed` state is used to mean the framework cannot run the task, such as
// invalid task type, scheduler init error(fatal), etc.
//
// normal execution state transition:
//
//	┌──────┐
//	│failed│
//	└──────┘
//	   ▲
//	┌──┴────┐     ┌───────┐     ┌────────┐
//	│pending├────►│running├────►│succeed │
//	└──┬────┘     └──┬┬───┘     └────────┘
//	   │             ││         ┌─────────┐     ┌────────┐
//	   │             │└────────►│reverting├────►│reverted│
//	   │             ▼          └─────────┘     └────────┘
//	   │          ┌──────────┐    ▲
//	   └─────────►│cancelling├────┘
//	              └──────────┘
//
// Note: if ManualRecovery is enabled, when some subtask failed, the task will
// move to `awaiting-resolution` state, and manual operation is needed for the
// task to continue. This mechanism is used for debugging, some bug such as those
// on global-sort are harder to investigate without the intermediate files, or to
// manually recover from some error when importing large mount of data using
// global-sort where one round of import takes a lot of time, it might be more
// flexible and efficient than retrying the whole task.
//
// pause/resume state transition:
// as we don't know the state of the task before `paused`, so the state after
// `resuming` is always `running`.
//
//	┌───────┐
//	│pending├──┐
//	└───────┘  │     ┌───────┐       ┌──────┐
//	           ├────►│pausing├──────►│paused│
//	┌───────┐  │     └───────┘       └───┬──┘
//	│running├──┘                         │
//	└───▲───┘        ┌────────┐          │
//	    └────────────┤resuming│◄─────────┘
//	                 └────────┘
//
// modifying state transition:
//
//	┌───────┐
//	│pending├──┐
//	└───────┘  │
//	┌───────┐  │     ┌─────────┐
//	│running├──┼────►│modifying├────► original state
//	└───────┘  │     └─────────┘
//	┌───────┐  │
//	│paused ├──┘
//	└───────┘
//
// # Subtask state machine
//
// NOTE: `running` -> `pending` only happens when some node is taken as dead, so
// its running subtask is balanced to other node, and the subtask is idempotent,
// we do this to make the subtask can be scheduled to other node again, it's NOT
// a normal state transition.
//
//	               ┌──────────────┐
//	               │          ┌───┴──┐
//	               │ ┌───────►│paused│
//	               ▼ │        └──────┘
//	┌───────┐    ┌───┴───┐    ┌───────┐
//	│pending├───►│running├───►│succeed│
//	└───────┘    └┬──┬───┘    └───────┘
//	     ▲        │  │        ┌──────┐
//	     └────────┘  ├───────►│failed│
//	                 │        └──────┘
//	                 │        ┌────────┐
//	                 └───────►│canceled│
//	                          └────────┘
package framework