Files
tidb/pkg/dxf/framework/doc.go

183 lines
8.3 KiB
Go

// Copyright 2024 PingCAP, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// Package framework contains all the codes related to DXF.
//
// The goal of the DXF is to implement unified scheduling and distributed execution
// of tasks, and to provide unified resource management capabilities for both
// overall and individual tasks, which better meets users' expectations for resource usage.
//
// DXF runs on all the nodes of the TiDB cluster, there is an owner node responsible
// for scheduling resources and tasks, and all other nodes are followers which are
// responsible for executing tasks.
//
// # The components that are unique to owner nodes:
//
// - scheduler manager: manages schedulers of tasks, and it also responsible for
// post-cleanup of tasks, such as moving the task to history table, cleaning up
// intermediate data files generated by global sort.
// - task schedulers: responsible for managing lifecycle and scheduling of the task.
// - node manager: manages nodes that can be used to run tasks.
// - balancer: responsible for balancing the load of subtasks on all nodes.
//
// # The components that are running on all nodes:
//
// - task executor manager: manages all task executors.
// - slot manager: manages the slots or resources that can be used to run tasks.
// - task executor: responsible for executing the task.
//
// # Resource abstraction
//
// To fully utilize the resources and avoid resource overuse, we abstract the
// resources in the TiDB cluster as slots. A slot is the minimum granularity of
// node resources. For each node, slot count = number of cores, and each slot
// represents:
//
// - one core on the node
// - 1/number-of-cores * total-memory-of-memory
// - 1/number-of-cores * total-disk-space-of-disk.
// Note: this factor is not considered during scheduling right now.
//
// For example, if a node has 16 cores, 128GB memory, and 1TB disk space, then one
// slot represents: 1 core, 8GB memory, and 64GB disk space.
//
// The maximum number of slots that can be used by a task is determined its concurrency
// and the target scope.
//
// To better describe the resources that a task can use, we define a stripe as a
// slot group which consists one slot on each node of the same target scope. As
// we don't allow subtasks of the same task run on some node concurrently, so
// the maximum resource that a task can use is task-concurrency stripes.
//
// # Service scope
//
// To isolate resources, and avoid DXF tasks from interfering with online transactions,
// each node in the cluster have a service scope and the default scope is empty.
//
// A DXF task can only run on the nodes with the same scope as the target scope
// of the task.
//
// Due to history reasons, there is a special service scope 'background'. When
// scheduling a task with empty target scope, the task will run on the nodes of
// the 'background' scope if such nodes exist, otherwise, the task will run on the
// nodes of same scope, i.e. empty.
//
// # Task abstraction
//
// A task is abstracted as multiple steps that runs in sequence, each step contains
// multiple sub-tasks that runs in parallel, such as:
//
// task
// ├── step1
// │ ├── subtask1
// │ ├── subtask2
// │ └── subtask3
// └── step2
// ├── subtask1
// ├── subtask2
// └── subtask3
//
// For the steps of specific task type, see step.go for more detail.
//
// # Task order
//
// As the resources are limited, we need to schedule tasks in a certain order.
//
// In DXF, we manage to run tasks of higher ranking first, and then run tasks of
// lower ranking. A task of higher ranking might preempt the resources of a task
// of lower ranking.
//
// Note, we use the word rank instead of priority as it's only part of fields that
// determine the order of tasks. Task rank is defined by:
//
// priority asc, create_time asc, id asc.
//
// # Task state machine
//
// Note: if a task fails during running, it will end with `reverted` state.
// The `failed` state is used to mean the framework cannot run the task, such as
// invalid task type, scheduler init error(fatal), etc.
//
// normal execution state transition:
//
// ┌──────┐
// │failed│
// └──────┘
// ▲
// ┌──┴────┐ ┌───────┐ ┌────────┐
// │pending├────►│running├────►│succeed │
// └──┬────┘ └──┬┬───┘ └────────┘
// │ ││ ┌─────────┐ ┌────────┐
// │ │└────────►│reverting├────►│reverted│
// │ ▼ └─────────┘ └────────┘
// │ ┌──────────┐ ▲
// └─────────►│cancelling├────┘
// └──────────┘
//
// Note: if ManualRecovery is enabled, when some subtask failed, the task will
// move to `awaiting-resolution` state, and manual operation is needed for the
// task to continue. This mechanism is used for debugging, some bug such as those
// on global-sort are harder to investigate without the intermediate files, or to
// manually recover from some error when importing large mount of data using
// global-sort where one round of import takes a lot of time, it might be more
// flexible and efficient than retrying the whole task.
//
// pause/resume state transition:
// as we don't know the state of the task before `paused`, so the state after
// `resuming` is always `running`.
//
// ┌───────┐
// │pending├──┐
// └───────┘ │ ┌───────┐ ┌──────┐
// ├────►│pausing├──────►│paused│
// ┌───────┐ │ └───────┘ └───┬──┘
// │running├──┘ │
// └───▲───┘ ┌────────┐ │
// └────────────┤resuming│◄─────────┘
// └────────┘
//
// modifying state transition:
//
// ┌───────┐
// │pending├──┐
// └───────┘ │
// ┌───────┐ │ ┌─────────┐
// │running├──┼────►│modifying├────► original state
// └───────┘ │ └─────────┘
// ┌───────┐ │
// │paused ├──┘
// └───────┘
//
// # Subtask state machine
//
// NOTE: `running` -> `pending` only happens when some node is taken as dead, so
// its running subtask is balanced to other node, and the subtask is idempotent,
// we do this to make the subtask can be scheduled to other node again, it's NOT
// a normal state transition.
//
// ┌──────────────┐
// │ ┌───┴──┐
// │ ┌───────►│paused│
// ▼ │ └──────┘
// ┌───────┐ ┌───┴───┐ ┌───────┐
// │pending├───►│running├───►│succeed│
// └───────┘ └┬──┬───┘ └───────┘
// ▲ │ │ ┌──────┐
// └────────┘ ├───────►│failed│
// │ └──────┘
// │ ┌────────┐
// └───────►│canceled│
// └────────┘
package framework