// Copyright 2024 PingCAP, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// Package framework contains all the code related to DXF (Distributed
// eXecution Framework).
//
// The goal of DXF is to implement unified scheduling and distributed
// execution of tasks, and to provide unified resource management for both the
// cluster as a whole and individual tasks, which better matches users'
// expectations for resource usage.
//
// DXF runs on all nodes of the TiDB cluster: one owner node is responsible
// for scheduling resources and tasks, and all other nodes are followers that
// are responsible for executing tasks.
//
// # Components unique to the owner node
//
//   - scheduler manager: manages the schedulers of tasks; it is also
//     responsible for post-cleanup of tasks, such as moving a finished task
//     to the history table and cleaning up intermediate data files generated
//     by global sort.
//   - task scheduler: responsible for managing the lifecycle and scheduling
//     of a task.
//   - node manager: manages the nodes that can be used to run tasks.
//   - balancer: responsible for balancing the load of subtasks across all
//     nodes.
//
// # Components running on all nodes
//
//   - task executor manager: manages all task executors.
//   - slot manager: manages the slots, i.e. resources, that can be used to
//     run tasks.
//   - task executor: responsible for executing tasks.
//
// # Resource abstraction
//
// To fully utilize resources and avoid resource overuse, we abstract the
// resources in the TiDB cluster as slots. A slot is the minimum granularity
// of node resources. For each node, slot count = number of cores, and each
// slot represents:
//
//   - one core on the node
//   - 1/number-of-cores of the total memory of the node
//   - 1/number-of-cores of the total disk space of the node. Note: this
//     factor is not considered during scheduling right now.
//
// For example, if a node has 16 cores, 128GB memory, and 1TB disk space, then
// one slot represents 1 core, 8GB memory, and 64GB disk space.
//
// The maximum number of slots that can be used by a task is determined by its
// concurrency and its target scope.
//
// To better describe the resources that a task can use, we define a stripe as
// a slot group that consists of one slot on each node of the same target
// scope. As we don't allow subtasks of the same task to run concurrently on
// the same node, the maximum resource that a task can use is task-concurrency
// stripes. A sketch of this arithmetic follows the Service scope section
// below.
//
// # Service scope
//
// To isolate resources and prevent DXF tasks from interfering with online
// transactions, each node in the cluster has a service scope; the default
// scope is empty.
//
// A DXF task can only run on the nodes whose service scope is the same as the
// target scope of the task.
//
// For historical reasons, there is a special service scope 'background'. When
// scheduling a task with an empty target scope, the task runs on the nodes of
// the 'background' scope if such nodes exist; otherwise, it runs on the nodes
// of the same, i.e. empty, scope.
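//
// The following is a minimal, illustration-only sketch of the slot and stripe
// arithmetic above; the type and names are hypothetical and not part of the
// framework's API:
//
//	package main
//
//	import "fmt"
//
//	// nodeRes describes the total resources of one node (hypothetical type).
//	type nodeRes struct {
//		cores  int
//		memGB  int
//		diskGB int
//	}
//
//	func main() {
//		// A 16-core, 128GB-memory, 1TB-disk node has 16 slots.
//		n := nodeRes{cores: 16, memGB: 128, diskGB: 1024}
//		// One slot = 1 core plus a 1/cores share of memory and disk,
//		// i.e. 1 core, 8GB memory, and 64GB disk space here.
//		fmt.Printf("one slot = 1 core, %dGB memory, %dGB disk\n",
//			n.memGB/n.cores, n.diskGB/n.cores)
//		// A stripe is one slot on each node of the target scope. Since
//		// subtasks of one task never run concurrently on the same node,
//		// a task with concurrency c uses at most c stripes.
//		concurrency := 4
//		fmt.Printf("the task may use at most %d stripes\n", concurrency)
//	}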
//
// # Task abstraction
//
// A task is abstracted as multiple steps that run in sequence; each step
// contains multiple subtasks that run in parallel, such as:
//
//	task
//	├── step1
//	│   ├── subtask1
//	│   ├── subtask2
//	│   └── subtask3
//	└── step2
//	    ├── subtask1
//	    ├── subtask2
//	    └── subtask3
//
// For the steps of a specific task type, see step.go for more detail.
//
// # Task order
//
// As resources are limited, we need to schedule tasks in a certain order.
//
// In DXF, we run tasks of higher ranking first, and then tasks of lower
// ranking. A task of higher ranking might preempt the resources of a task of
// lower ranking.
//
// Note: we use the word rank instead of priority because priority is only one
// of the fields that determine the order of tasks. Task rank is defined by:
//
//	priority asc, create_time asc, id asc
//
// A minimal comparator sketch is given at the end of this file.
//
// # Task state machine
//
// Note: if a task fails during running, it ends in the `reverted` state. The
// `failed` state means that the framework cannot run the task at all, such as
// an invalid task type, a fatal scheduler init error, etc.
//
// normal execution state transition:
//
//	┌──────┐
//	│failed│
//	└──────┘
//	   ▲
//	┌──┴────┐     ┌───────┐     ┌────────┐
//	│pending├────►│running├────►│succeed │
//	└──┬────┘     └──┬┬───┘     └────────┘
//	   │             ││         ┌─────────┐     ┌────────┐
//	   │             │└────────►│reverting├────►│reverted│
//	   │             ▼          └─────────┘     └────────┘
//	   │          ┌──────────┐       ▲
//	   └─────────►│cancelling├───────┘
//	              └──────────┘
//
// Note: if ManualRecovery is enabled and some subtask fails, the task moves
// to the `awaiting-resolution` state, and manual operation is needed for the
// task to continue. This mechanism is used for debugging: some bugs, such as
// those in global sort, are harder to investigate without the intermediate
// files. It can also be used to manually recover from an error when importing
// a large amount of data using global sort, where one round of import takes a
// lot of time; this might be more flexible and efficient than retrying the
// whole task.
//
// pause/resume state transition: as we don't know the state of the task
// before `paused`, the state after `resuming` is always `running`.
//
//	┌───────┐
//	│pending├──┐
//	└───────┘  │     ┌───────┐      ┌──────┐
//	           ├────►│pausing├─────►│paused│
//	┌───────┐  │     └───────┘      └───┬──┘
//	│running├──┘                        │
//	└───▲───┘       ┌────────┐          │
//	    └───────────┤resuming│◄─────────┘
//	                └────────┘
//
// modifying state transition:
//
//	┌───────┐
//	│pending├──┐
//	└───────┘  │
//	┌───────┐  │     ┌─────────┐
//	│running├──┼────►│modifying├────► original state
//	└───────┘  │     └─────────┘
//	┌───────┐  │
//	│paused ├──┘
//	└───────┘
//
// # Subtask state machine
//
// NOTE: the `running` -> `pending` transition only happens when a node is
// taken as dead, so its running subtask is balanced to another node; as
// subtasks are idempotent, this makes the subtask schedulable on another node
// again. It is NOT a normal state transition.
//
//	    ┌─────────────────────────┐
//	    │                    ┌───┴──┐
//	    │       ┌───────────►│paused│
//	    ▼       │            └──────┘
//	┌───────┐   └───┴───┐    ┌───────┐
//	│pending├───►│running├───►│succeed│
//	└───────┘    └┬──┬───┘    └───────┘
//	    ▲         │  │        ┌──────┐
//	    └─────────┘  ├───────►│failed│
//	                 │        └──────┘
//	                 │        ┌────────┐
//	                 └───────►│canceled│
//	                          └────────┘
package framework
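
// The following is a minimal, illustration-only sketch of the task rank
// comparison described in the Task order section above; the type, fields, and
// helper are hypothetical, not the framework's actual definitions:
//
//	package main
//
//	import (
//		"fmt"
//		"sort"
//		"time"
//	)
//
//	// taskMeta holds only the fields that define task rank (hypothetical).
//	type taskMeta struct {
//		priority   int
//		createTime time.Time
//		id         int64
//	}
//
//	// rankLess orders tasks by: priority asc, create_time asc, id asc.
//	// Tasks that sort earlier are ranked higher and run first.
//	func rankLess(a, b taskMeta) bool {
//		if a.priority != b.priority {
//			return a.priority < b.priority
//		}
//		if !a.createTime.Equal(b.createTime) {
//			return a.createTime.Before(b.createTime)
//		}
//		return a.id < b.id
//	}
//
//	func main() {
//		now := time.Now()
//		tasks := []taskMeta{
//			{priority: 2, createTime: now, id: 3},
//			{priority: 1, createTime: now.Add(time.Second), id: 2},
//			{priority: 1, createTime: now, id: 1},
//		}
//		sort.Slice(tasks, func(i, j int) bool { return rankLess(tasks[i], tasks[j]) })
//		for _, t := range tasks {
//			fmt.Println(t.id) // prints 1, then 2, then 3
//		}
//	}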