diff --git a/index.html b/index.html index 3df8b43..82e8309 100644 --- a/index.html +++ b/index.html @@ -18,7 +18,7 @@

The Linux Kernel Module Programming Guide

Peter Jay Salzman, Michael Burian, Ori Pomerantz, Bob Mottram, Jim Huang

-
April 15, 2024
+
April 16, 2024
@@ -4403,10 +4403,26 @@ they will not be forgotten and will activate when the unlock happens, using the 60 61MODULE_DESCRIPTION("Spinlock example"); 62MODULE_LICENSE("GPL"); -

+

Taking 100% of a CPU’s resources comes with greater responsibility. Situations +where the kernel code monopolizes a CPU are called atomic contexts. Holding a +spinlock is one of those situations. Sleeping in atomic contexts may leave the system +hanging, as the occupied CPU devotes 100% of its resources to doing nothing +but sleeping. In worse cases the system may crash. Thus, sleeping in +atomic contexts is considered a bug in the kernel. Such bugs are sometimes called +“sleep-in-atomic-context” bugs. +

Note that sleeping here is not limited to calling the sleep functions explicitly. +If subsequent function calls eventually invoke a function that sleeps, it is +also considered sleeping. Thus, it is important to pay attention to functions +being used in atomic context. There’s no documentation recording all such +functions, but code comments may help. Sometimes you may find comments in +kernel source code stating that a function “may sleep”, “might sleep”, or +more explicitly “the caller should not hold a spinlock”. Those comments are +hints that a function may implicitly sleep and must not be called in atomic +contexts. +

12.3 Read and write locks

-

Read and write locks are specialised kinds of spinlocks so that you can exclusively +

Read and write locks are specialised kinds of spinlocks so that you can exclusively read from something or write to something. Like the earlier spinlocks example, the one below shows an "irq safe" situation in which if other functions were triggered from irqs which might also read and write to whatever you are concerned with @@ -4415,6 +4431,9 @@ anything done within the lock as short as possible so that it does not hang up the system and cause users to start revolting against the tyranny of your module.

+ + +

/*
 * example_rwlock.c
@@ -4471,19 +4490,16 @@ module.

MODULE_DESCRIPTION("Read/Write locks example");
MODULE_LICENSE("GPL");
-

Of course, if you know for sure that there are no functions triggered by irqs +

Of course, if you know for sure that there are no functions triggered by irqs which could possibly interfere with your logic then you can use the simpler read_lock(&myrwlock) and read_unlock(&myrwlock) or the corresponding write functions.

12.4 Atomic operations

-

If you are doing simple arithmetic: adding, subtracting or bitwise operations, then +

If you are doing simple arithmetic: adding, subtracting or bitwise operations, then there is another way in the multi-CPU and multi-hyperthreaded world to stop other parts of the system from messing with your mojo. By using atomic operations you - - - can be confident that your addition, subtraction or bit flip did actually happen and was not overwritten by some other shenanigans. An example is shown below. @@ -4564,7 +4580,7 @@ below. 73 74MODULE_DESCRIPTION("Atomic operations example"); 75MODULE_LICENSE("GPL"); -

Before the C11 standard adopted built-in atomic types, the kernel already +

Before the C11 standard adopted built-in atomic types, the kernel already provided a small set of atomic types by using tricky architecture-specific code. Implementing the atomic types with C11 atomics may allow the kernel to throw away the architecture-specific code and let the kernel code be more friendly to @@ -4577,25 +4593,25 @@ For further details, see:

  • Time to move to C11 atomics?
  • Atomic usage patterns in the kernel
  • -
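
As a rough userspace illustration of the C11 atomic types mentioned in this section, the sketch below uses `<stdatomic.h>` for an addition and two bitwise operations. It is not kernel code: the kernel has its own atomic types and helpers, and the names `counter`, `flags`, `counter_add`, `flags_set` and `flags_clear` are hypothetical, chosen only for this example.

```c
#include <stdatomic.h>

/* Hypothetical shared state; file-scope atomics start out as zero. */
static atomic_int counter;
static atomic_uint flags;

/* Atomically add n to the counter and return the new value. */
static int counter_add(int n)
{
    /* atomic_fetch_add() returns the value held before the addition. */
    return atomic_fetch_add(&counter, n) + n;
}

/* Atomically set the bits in mask and return the resulting flags. */
static unsigned int flags_set(unsigned int mask)
{
    return atomic_fetch_or(&flags, mask) | mask;
}

/* Atomically clear the bits in mask and return the resulting flags. */
static unsigned int flags_clear(unsigned int mask)
{
    return atomic_fetch_and(&flags, ~mask) & ~mask;
}
```

Each helper is a single read-modify-write step, so a concurrent caller cannot slip in between the read and the write of the same variable.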

    +

    13 Replacing Print Macros

    -

    + + + +

    13.1 Replacement

    -

    In Section 1.7, it was noted that the X Window System and kernel module +

    In Section 1.7, it was noted that the X Window System and kernel module programming are not conducive to integration. This remains valid during the development of kernel modules. However, in practical scenarios, the necessity emerges to relay messages to the tty (teletype) originating the module load command. -

    The term “tty” originates from teletype, which initially referred to a combined +

    The term “tty” originates from teletype, which initially referred to a combined keyboard-printer for Unix system communication. Today, it signifies a text stream abstraction employed by Unix programs, encompassing physical terminals, xterms in X displays, and network connections like SSH. - - - -

    To achieve this, the “current” pointer is leveraged to access the active task’s tty +

    To achieve this, the “current” pointer is leveraged to access the active task’s tty structure. Within this structure lies a pointer to a string write function, facilitating the string’s transmission to the tty.

    @@ -4674,16 +4690,16 @@ the string’s transmission to the tty. 72module_exit(print_string_exit); 73 74MODULE_LICENSE("GPL"); -

    +

    13.2 Flashing keyboard LEDs

    -

    In certain conditions, you may desire a simpler and more direct way to communicate +

    In certain conditions, you may desire a simpler and more direct way to communicate to the external world. Flashing keyboard LEDs can be such a solution: It is an immediate way to attract attention or to display a status condition. Keyboard LEDs are present on nearly all hardware, they are always visible, they do not need any setup, and their use is rather simple and non-intrusive, compared to writing to a tty or a file. -

    From v4.14 to v4.15, the timer API made a series of changes +

    From v4.14 to v4.15, the timer API made a series of changes to improve memory safety. A buffer overflow in the area of a timer_list structure may be able to overwrite the @@ -4699,6 +4715,9 @@ Thus, it is better to use a unique prototype to separate from the cluster that t unsigned long argument. The timer callback should be passed a pointer to the timer_list + + + structure rather than an unsigned long argument. Then, it wraps all the information the callback needs, including the timer_list @@ -4706,13 +4725,10 @@ Thus, it is better to use a unique prototype to separate from the cluster that t container_of macro instead of the unsigned long value. For more information see: Improving the kernel timers API. -
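
The callback pattern described above can be sketched in plain userspace C. This is only an illustration under stated assumptions: `struct fake_timer`, `struct blink_state`, `blink_cb` and `fire` are hypothetical stand-ins, and the `container_of` macro here is a simplified version without the type checking that the kernel's macro performs.

```c
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical, cut-down stand-in for struct timer_list. */
struct fake_timer {
    unsigned long expires;
    void (*function)(struct fake_timer *);
};

/* Hypothetical driver state that embeds its timer. */
struct blink_state {
    int led_on;
    struct fake_timer timer;
};

/* The callback receives only the timer pointer and recovers its owner. */
static void blink_cb(struct fake_timer *t)
{
    struct blink_state *st = container_of(t, struct blink_state, timer);

    st->led_on = !st->led_on;
}

/* Simulates the timer core invoking an expired timer's callback. */
static void fire(struct fake_timer *t)
{
    t->function(t);
}
```

Because the timer is embedded in the state structure, no separate `unsigned long` cookie has to be stored next to the function pointer.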

    Before Linux v4.14, setup_timer +

    Before Linux v4.14, setup_timer was used to initialize the timer and the timer_list structure looked like: - - -

    struct timer_list {
        unsigned long expires;
    @@ -4724,7 +4740,7 @@ Thus, it is better to use a unique prototype to separate from the cluster that t

    void setup_timer(struct timer_list *timer, void (*callback)(unsigned long),
                     unsigned long data);
    -

    Since Linux v4.14, timer_setup +

    Since Linux v4.14, timer_setup has been adopted and the kernel has been converting step by step to timer_setup from setup_timer @@ -4735,7 +4751,7 @@ Moreover, the timer_setup

    void timer_setup(struct timer_list *timer,
                     void (*callback)(struct timer_list *), unsigned int flags);
    -

    The setup_timer +

    The setup_timer function was then removed in v4.15. As a result, the timer_list structure changed to the following. @@ -4746,7 +4762,7 @@ Moreover, the timer_setup 4    u32 flags; 5    /* ... */ 6}; -

    The following source code illustrates a minimal kernel module which, when +

    The following source code illustrates a minimal kernel module which, when loaded, starts blinking the keyboard LEDs until it is unloaded.

    @@ -4835,36 +4851,36 @@ loaded, starts blinking the keyboard LEDs until it is unloaded. 83module_exit(kbleds_cleanup); 84 85MODULE_LICENSE("GPL"); -

    If none of the examples in this chapter fit your debugging needs, +

    If none of the examples in this chapter fit your debugging needs, there might yet be some other tricks to try. Ever wondered what CONFIG_LL_DEBUG in make menuconfig is good for? If you activate that you get low level access to the serial port. While this + + + might not sound very powerful by itself, you can patch kernel/printk.c or any other essential syscall to print ASCII characters, thus making it possible to trace virtually everything your code does over a serial line. If you find yourself porting the kernel to some new and formerly unsupported architecture, this is usually amongst the first things that should be implemented. Logging over a netconsole might also be worth a try. -

    While you have seen lots of stuff that can be used to aid debugging here, there are +

    While you have seen lots of stuff that can be used to aid debugging here, there are some things to be aware of. Debugging is almost always intrusive. Adding debug code can change the situation enough to make the bug seem to disappear. Thus, you should keep debug code to a minimum and make sure it does not show up in production code. - - - -

    +

    14 Scheduling Tasks

    -

    There are two main ways of running tasks: tasklets and work queues. Tasklets are a +

    There are two main ways of running tasks: tasklets and work queues. Tasklets are a quick and easy way of scheduling a single function to be run, for example when triggered from an interrupt, whereas work queues are more complicated but also better suited to running multiple things in a sequence. -

    +

    14.1 Tasklets

    -

    Here is an example tasklet module. The +

    Here is an example tasklet module. The tasklet_fn function runs for a few seconds. In the meantime, execution of the example_tasklet_init @@ -4916,7 +4932,7 @@ better suited to running multiple things in a sequence. 42 43MODULE_DESCRIPTION("Tasklet example"); 44MODULE_LICENSE("GPL"); -

    So with this example loaded dmesg +

    So with this example loaded dmesg should show: @@ -4928,23 +4944,23 @@ Example tasklet starts Example tasklet init continues... Example tasklet ends -

    Although tasklets are easy to use, they come with several drawbacks, and developers are +

    Although tasklets are easy to use, they come with several drawbacks, and developers are discussing getting rid of tasklets in the Linux kernel. The tasklet callback runs in atomic context, inside a software interrupt, meaning that it cannot sleep or access user-space data, so not all work can be done in a tasklet handler. Also, the kernel only allows one instance of any given tasklet to be running at any given time; multiple different tasklet callbacks can run in parallel. -

    In recent kernels, tasklets can be replaced by workqueues, timers, or threaded +

    In recent kernels, tasklets can be replaced by workqueues, timers, or threaded interrupts.1 While the removal of tasklets remains a longer-term goal, the current kernel contains more than a hundred uses of tasklets. Now developers are proceeding with the API changes and the macro DECLARE_TASKLET_OLD exists for compatibility. For further information, see https://lwn.net/Articles/830964/. -

    +

    14.2 Work queues

    -

    To add a task to the scheduler we can use a workqueue. The kernel then uses the +

    To add a task to the scheduler we can use a workqueue. The kernel then uses the Completely Fair Scheduler (CFS) to execute work within the queue.

    @@ -4981,36 +4997,36 @@ Completely Fair Scheduler (CFS) to execute work within the queue. 31 32MODULE_LICENSE("GPL"); 33MODULE_DESCRIPTION("Workqueue example"); -

    +

    15 Interrupt Handlers

    -

    +

    15.1 Interrupt Handlers

    -

    Except for the last chapter, everything we did in the kernel so far we have done as a +

    Except for the last chapter, everything we did in the kernel so far we have done as a response to a process asking for it, either by dealing with a special file, sending an ioctl() , or issuing a system call. But the job of the kernel is not just to respond to process requests. Another job, which is every bit as important, is to speak to the hardware connected to the machine. -

    There are two types of interaction between the CPU and the rest of the +

    There are two types of interaction between the CPU and the rest of the computer’s hardware. The first type is when the CPU gives orders to the hardware, the other is when the hardware needs to tell the CPU something. The second, called interrupts, is much harder to implement because it has to be dealt with when convenient for the hardware, not the CPU. Hardware devices typically have a very small amount of RAM, and if you do not read their information when available, it is lost. -

    Under Linux, hardware interrupts are called IRQs (Interrupt ReQuests). There +

    Under Linux, hardware interrupts are called IRQs (Interrupt ReQuests). There are two types of IRQs, short and long. A short IRQ is one which is expected to take a very short period of time, during which the rest of the machine will be blocked and no other interrupts will be handled. A long IRQ is one which can take longer, and during which other interrupts may occur (but not interrupts from the same device). If at all possible, it is better to declare an interrupt handler to be long. -

    When the CPU receives an interrupt, it stops whatever it is doing (unless it is +

    When the CPU receives an interrupt, it stops whatever it is doing (unless it is processing a more important interrupt, in which case it will deal with this one only when the more important one is done), saves certain parameters on the stack and calls the interrupt handler. This means that certain things are not allowed in the @@ -5022,10 +5038,10 @@ heavy work deferred from an interrupt handler. Historically, BH (Linux naming for Bottom Halves) statistically book-keeps the deferred functions. Softirq and its higher level abstraction, Tasklet, replace BH since Linux 2.3. -

    The way to implement this is to call +

    The way to implement this is to call request_irq() to get your interrupt handler called when the relevant IRQ is received. -

    In practice IRQ handling can be a bit more complex. Hardware is often designed +

    In practice IRQ handling can be a bit more complex. Hardware is often designed in a way that chains two interrupt controllers, so that all the IRQs from interrupt controller B are cascaded to a certain IRQ from interrupt controller A. Of course, that requires that the kernel finds out which IRQ it really was @@ -5042,11 +5058,11 @@ need to solve another truckload of problems. It is not enough to know if a certain IRQ has happened; it’s also important to know what CPU(s) it was for. People still interested in more details might want to refer to "APIC" now. -

    This function receives the IRQ number, the name of the function, flags, a name +

    This function receives the IRQ number, the name of the function, flags, a name for /proc/interrupts and a parameter to be passed to the interrupt handler. Usually there is a certain number of IRQs available. How many IRQs there are is hardware-dependent. -

    The flags can be used to specify the behavior of the IRQ. For example, use +

    The flags can be used to specify the behavior of the IRQ. For example, use IRQF_SHARED to indicate you are willing to share the IRQ with other interrupt handlers (usually because a number of hardware devices sit on the same IRQ); use the @@ -5060,16 +5076,16 @@ the SA only the IRQF flags are in use. This function will only succeed if there is not already a handler on this IRQ, or if you are both willing to share. -

    +

    15.2 Detecting button presses

    -

    Many popular single board computers, such as Raspberry Pi or Beagleboards, have a +

    Many popular single board computers, such as Raspberry Pi or Beagleboards, have a bunch of GPIO pins. Attaching buttons to those and then having a button press do something is a classic case in which you might need to use interrupts: instead of having the CPU waste time and battery power polling for a change in input state, it is better for the input to trigger the CPU to run a particular handling function. -

    Here is an example where buttons are connected to GPIO numbers 17 and 18 and +

    Here is an example where buttons are connected to GPIO numbers 17 and 18 and an LED is connected to GPIO 4. You can change those numbers to whatever is appropriate for your board.

    @@ -5219,17 +5235,17 @@ appropriate for your board. 143 144MODULE_LICENSE("GPL"); 145MODULE_DESCRIPTION("Handle some GPIO interrupts"); -

    +

    15.3 Bottom Half

    -

    Suppose you want to do a bunch of stuff inside of an interrupt routine. A common +

    Suppose you want to do a bunch of stuff inside of an interrupt routine. A common way to do that without rendering the interrupt unavailable for a significant duration is to combine it with a tasklet. This pushes the bulk of the work off into the scheduler. -

    The example below modifies the previous example to also run an additional task +

    The example below modifies the previous example to also run an additional task when an interrupt is triggered.

    @@ -5401,10 +5417,10 @@ when an interrupt is triggered. 166 167MODULE_LICENSE("GPL"); 168MODULE_DESCRIPTION("Interrupt with top and bottom half"); -

    +

    16 Virtual Input Device Driver

    -

    The input device driver is a module that provides a way to communicate +

    The input device driver is a module that provides a way to communicate with the interaction device via events. For example, the keyboard can send the press or release event to tell the kernel what we want to do. The input device driver will allocate a new input structure with @@ -5412,7 +5428,7 @@ do. The input device driver will allocate a new input structure with and sets up input bitfields, device id, version, etc. After that, it registers the device by calling input_register_device(). -

    Here is an example, vinput. It is an API that allows easy +

    Here is an example, vinput. It is an API that allows easy development of virtual input drivers. The driver needs to export a vinput_device() that contains the virtual device name and @@ -5431,13 +5447,13 @@ development of virtual input drivers. The driver needs to export a

  • the readback function: read()
  • -

    Then using vinput_register_device() +

    Then using vinput_register_device() and vinput_unregister_device() will add a new device to, or remove it from, the list of supported virtual input devices.

    int init(struct vinput *);
    -

    This function is passed a struct vinput +

    This function is passed a struct vinput already initialized with an allocated struct input_dev. The init() function is responsible for initializing the capabilities of the input device and registering @@ -5445,20 +5461,20 @@ it.

    int send(struct vinput *, char *, int);
    -

    This function will receive a user string to interpret and inject the event using the +

    This function will receive a user string to interpret and inject the event using the input_report_XXXX or input_event call. The string is already copied from user.

    int read(struct vinput *, char *, int);
    -

    This function is used for debugging and should fill the buffer parameter with the +

    This function is used for debugging and should fill the buffer parameter with the last event sent in the virtual input device format. The buffer will then be copied to user. -

    vinput devices are created and destroyed using sysfs. Event injection is done +

    vinput devices are created and destroyed using sysfs. Event injection is done through a /dev node. The device name will be used by the userland to export a new virtual input device. -

    The class_attribute +

    The class_attribute structure is similar to other attribute types we talked about in section 8:

    @@ -5469,7 +5485,7 @@ virtual input device. 5    ssize_t (*store)(struct class *class, struct class_attribute *attr, 6                    const char *buf, size_t count); 7}; -

    In vinput.c, the macro CLASS_ATTR_WO(export/unexport) +

    In vinput.c, the macro CLASS_ATTR_WO(export/unexport) defined in include/linux/device.h (in this case, device.h is included in include/linux/input.h) will generate the class_attribute structures which are named class_attr_export/unexport. Then, put them into @@ -5482,11 +5498,11 @@ will generate the class_attribute that should be assigned in vinput_class . Finally, call class_register(&vinput_class) to create attributes in sysfs. -

    To create a vinputX sysfs entry and /dev node: +

    To create a vinputX sysfs entry and /dev node:

    echo "vkbd" | sudo tee /sys/class/vinput/export
    -

    To unexport the device, just echo its id in unexport: +

    To unexport the device, just echo its id in unexport:

    echo "0" | sudo tee /sys/class/vinput/unexport
    @@ -5963,7 +5979,7 @@ will generate the class_attribute 416 417MODULE_LICENSE("GPL"); 418MODULE_DESCRIPTION("Emulate input events"); -

    Here the virtual keyboard is one example of using vinput. It supports all +

    Here the virtual keyboard is one example of using vinput. It supports all KEY_MAX keycodes. The injection format is the KEY_CODE as defined in include/linux/input.h. A positive value means @@ -5971,12 +5987,12 @@ will generate the class_attribute while a negative value is a KEY_RELEASE. The keyboard supports repetition when the key stays pressed for too long. The following demonstrates how the simulation works. -

    Simulate a key press on "g" ( KEY_G +

    Simulate a key press on "g" ( KEY_G = 34):

    echo "+34" | sudo tee /dev/vinput0
    -

    Simulate a key release on "g" ( KEY_G +

    Simulate a key release on "g" ( KEY_G = 34):

    @@ -6097,10 +6113,10 @@ following demonstrates how simulation work. 108 109MODULE_LICENSE("GPL"); 110MODULE_DESCRIPTION("Emulate keyboard input events through /dev/vinput"); -

    +

    17 Standardizing the interfaces: The Device Model

    -

    Up to this point we have seen all kinds of modules doing all kinds of things, but there +

    Up to this point we have seen all kinds of modules doing all kinds of things, but there was no consistency in their interfaces with the rest of the kernel. To impose some consistency, such that there is at minimum a standardized way to start, suspend and resume, a device model was added. An example is shown below, and you can @@ -6206,13 +6222,13 @@ functions. 96 97MODULE_LICENSE("GPL"); 98MODULE_DESCRIPTION("Linux Device Model example"); -

    +

    18 Optimizations

    -

    +

    18.1 Likely and Unlikely conditions

    -

    Sometimes you might want your code to run as quickly as possible, +

    Sometimes you might want your code to run as quickly as possible, especially if it is handling an interrupt or doing something which might cause noticeable latency. If your code contains boolean conditions and if you know that the conditions are almost always likely to evaluate as either @@ -6234,16 +6250,16 @@ to succeed. -

    When the unlikely +

    When the unlikely macro is used, the compiler alters its machine instruction output, so that it continues along the false branch and only jumps if the condition is true. That avoids flushing the processor pipeline. The opposite happens if you use the likely macro. -
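
To make the branch hint concrete, here is a small userspace sketch. The two macros follow the same shape as the kernel's definitions and rely on the GCC/Clang builtin `__builtin_expect`; the helper `sum_bytes` and its error convention are hypothetical, invented only for this illustration.

```c
#include <stddef.h>

/* Same shape as the kernel macros; requires GCC or Clang. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical helper: sums a buffer, treating a NULL pointer as a rare error. */
static long sum_bytes(const unsigned char *buf, size_t len)
{
    long total = 0;

    /* Hint that the error path is rarely taken. */
    if (unlikely(buf == NULL))
        return -1;

    for (size_t i = 0; i < len; i++)
        total += buf[i];

    return total;
}
```

The hint only influences how the compiler lays out the code; the function returns the same results with or without it.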

    +

    18.2 Static keys

    -

    Static keys allow us to enable or disable kernel code paths based on the runtime state +

    Static keys allow us to enable or disable kernel code paths based on the runtime state of a key. The APIs have been available since 2010 (most architectures are already supported) and use self-modifying code to eliminate the overhead of cache and branch prediction. The most typical use case of static keys is for performance-sensitive kernel @@ -6257,7 +6273,7 @@ Before we can use static keys in the kernel, we need to make sure that gcc suppo

    CONFIG_JUMP_LABEL=y
    CONFIG_HAVE_ARCH_JUMP_LABEL=y
    CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE=y
    -

    To declare a static key, we need to define a global variable using the +

    To declare a static key, we need to define a global variable using the DEFINE_STATIC_KEY_FALSE or DEFINE_STATIC_KEY_TRUE macro defined in include/linux/jump_label.h. This macro initializes the key with @@ -6267,7 +6283,7 @@ code:

    DEFINE_STATIC_KEY_FALSE(fkey);
    -

    Once the static key has been declared, we need to add branching code to the +

    Once the static key has been declared, we need to add branching code to the module that uses the static key. For example, the code includes a fastpath, where a no-op instruction will be generated at compile time as the key is initialized to false and the branch is unlikely to be taken. @@ -6277,7 +6293,7 @@ and the branch is unlikely to be taken. 2if (static_branch_unlikely(&fkey)) 3    pr_alert("do unlikely thing\n"); 4pr_info("fastpath 2\n"); -

    If the key is enabled at runtime by calling +

    If the key is enabled at runtime by calling static_branch_enable(&fkey) , the fastpath will be patched with an unconditional jump instruction to the slowpath @@ -6285,7 +6301,7 @@ and the branch is unlikely to be taken. code pr_alert , so the branch will always be taken until the key is disabled again. -

    The following kernel module, derived from chardev.c, demonstrates how the +

    The following kernel module, derived from chardev.c, demonstrates how the static key works.

    @@ -6484,59 +6500,59 @@ static key works. 193module_exit(chardev_exit); 194 195MODULE_LICENSE("GPL"); -

    To check the state of the static key, we can use the /dev/key_state +

    To check the state of the static key, we can use the /dev/key_state interface.

    cat /dev/key_state
    -

    This will display the current state of the key, which is disabled by default. -

    To change the state of the static key, we can perform a write operation on the +

    This will display the current state of the key, which is disabled by default. +

    To change the state of the static key, we can perform a write operation on the file:

    echo enable > /dev/key_state
    -

    This will enable the static key, causing the code path to switch from the fastpath +

    This will enable the static key, causing the code path to switch from the fastpath to the slowpath. -

    If the key is enabled or disabled at initialization and never changed, +

    If the key is enabled or disabled at initialization and never changed, we can declare the static key as read-only, which means that it can only be toggled in the module init function. To declare a read-only static key, we can use the DEFINE_STATIC_KEY_FALSE_RO or DEFINE_STATIC_KEY_TRUE_RO macro instead. Attempts to change the key at runtime will result in a page fault. For more information, see Static keys. -

    +

    19 Common Pitfalls

    -

    +

    19.1 Using standard libraries

    -

    You cannot do that. In a kernel module, you can only use kernel functions, which are +

    You cannot do that. In a kernel module, you can only use kernel functions, which are the functions you can see in /proc/kallsyms. -

    +

    19.2 Disabling interrupts

    -

    You might need to do this for a short time and that is OK, but if you do not enable +

    You might need to do this for a short time and that is OK, but if you do not enable them afterwards, your system will be stuck and you will have to power it off. -

    +

    20 Where To Go From Here?

    -

    For those deeply interested in kernel programming, kernelnewbies.org and the +

    For those deeply interested in kernel programming, kernelnewbies.org and the Documentation subdirectory within the kernel source code are highly recommended. Although the latter may not always be straightforward, it serves as a valuable initial step for further exploration. Echoing Linus Torvalds’ perspective, the most effective method to understand the kernel is through personal examination of the source code. -

    Contributions to this guide are welcome, especially if there are any significant +

    Contributions to this guide are welcome, especially if there are any significant inaccuracies identified. To contribute or report an issue, please initiate an issue at https://github.com/sysprog21/lkmpg. Pull requests are greatly appreciated. -

    Happy hacking! +

    Happy hacking!

    -

    1The goal of threaded interrupts is to push more of the work to separate threads, so that the +

    1The goal of threaded interrupts is to push more of the work to separate threads, so that the minimum needed for acknowledging an interrupt is reduced, and therefore the time spent handling the interrupt (where it can’t handle any other interrupts at the same time) is reduced. See https://lwn.net/Articles/302043/.

    diff --git a/lkmpg-for-ht.html b/lkmpg-for-ht.html index 3df8b43..82e8309 100644 --- a/lkmpg-for-ht.html +++ b/lkmpg-for-ht.html @@ -18,7 +18,7 @@

    The Linux Kernel Module Programming Guide

    Peter Jay Salzman, Michael Burian, Ori Pomerantz, Bob Mottram, Jim Huang

    -
    April 15, 2024
    +
    April 16, 2024
    @@ -4403,10 +4403,26 @@ they will not be forgotten and will activate when the unlock happens, using the 60 61MODULE_DESCRIPTION("Spinlock example"); 62MODULE_LICENSE("GPL"); -

    +

    Taking 100% of a CPU’s resources comes with greater responsibility. Situations +where the kernel code monopolizes a CPU are called atomic contexts. Holding a +spinlock is one of those situations. Sleeping in atomic contexts may leave the system +hanging, as the occupied CPU devotes 100% of its resources to doing nothing +but sleeping. In worse cases the system may crash. Thus, sleeping in +atomic contexts is considered a bug in the kernel. Such bugs are sometimes called +“sleep-in-atomic-context” bugs. +

    Note that sleeping here is not limited to calling the sleep functions explicitly. +If subsequent function calls eventually invoke a function that sleeps, it is +also considered sleeping. Thus, it is important to pay attention to functions +being used in atomic context. There’s no documentation recording all such +functions, but code comments may help. Sometimes you may find comments in +kernel source code stating that a function “may sleep”, “might sleep”, or +more explicitly “the caller should not hold a spinlock”. Those comments are +hints that a function may implicitly sleep and must not be called in atomic +contexts. +

    12.3 Read and write locks

    -

    Read and write locks are specialised kinds of spinlocks so that you can exclusively +

    Read and write locks are specialised kinds of spinlocks so that you can exclusively read from something or write to something. Like the earlier spinlocks example, the one below shows an "irq safe" situation in which if other functions were triggered from irqs which might also read and write to whatever you are concerned with @@ -4415,6 +4431,9 @@ anything done within the lock as short as possible so that it does not hang up the system and cause users to start revolting against the tyranny of your module.

    + + +

    /*
     * example_rwlock.c
    @@ -4471,19 +4490,16 @@ module.

    MODULE_DESCRIPTION("Read/Write locks example");
    MODULE_LICENSE("GPL");
    -

    Of course, if you know for sure that there are no functions triggered by irqs +

    Of course, if you know for sure that there are no functions triggered by irqs which could possibly interfere with your logic then you can use the simpler read_lock(&myrwlock) and read_unlock(&myrwlock) or the corresponding write functions.

    12.4 Atomic operations

    -

    If you are doing simple arithmetic: adding, subtracting or bitwise operations, then +

    If you are doing simple arithmetic: adding, subtracting or bitwise operations, then there is another way in the multi-CPU and multi-hyperthreaded world to stop other parts of the system from messing with your mojo. By using atomic operations you - - - can be confident that your addition, subtraction or bit flip did actually happen and was not overwritten by some other shenanigans. An example is shown below. @@ -4564,7 +4580,7 @@ below. 73 74MODULE_DESCRIPTION("Atomic operations example"); 75MODULE_LICENSE("GPL"); -

    Before the C11 standard adopts the built-in atomic types, the kernel already +

    Before the C11 standard adopts the built-in atomic types, the kernel already provided a small set of atomic types built on tricky architecture-specific code. Implementing the atomic types with C11 atomics may allow the kernel to throw away the architecture-specific code and make the kernel code more friendly to @@ -4577,25 +4593,25 @@ For further details, see:

  • Time to move to C11 atomics?
  • Atomic usage patterns in the kernel
  • -
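    For comparison, here is a minimal userspace sketch of the C11 atomics mentioned above. It is not kernel code and does not use the kernel's atomic_t API; the names counter, flags, counter_add, flags_set_bit and flags_clear_bit are hypothetical and exist only for illustration.

```c
#include <stdatomic.h>

/* Hypothetical shared state; static storage is zero-initialized. */
static atomic_int counter;
static atomic_uint flags;

/* Atomically add delta to the counter and return the new value. */
int counter_add(int delta)
{
    /* atomic_fetch_add returns the value held before the addition. */
    return atomic_fetch_add(&counter, delta) + delta;
}

/* Atomically set one bit and return the resulting word. */
unsigned int flags_set_bit(unsigned int bit)
{
    unsigned int mask = 1u << bit;

    return atomic_fetch_or(&flags, mask) | mask;
}

/* Atomically clear one bit and return the resulting word. */
unsigned int flags_clear_bit(unsigned int bit)
{
    unsigned int mask = 1u << bit;

    return atomic_fetch_and(&flags, ~mask) & ~mask;
}
```

    Each helper performs its read-modify-write as a single indivisible step, which is the same guarantee the kernel's atomic operations give for additions, subtractions and bit flips.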

    +

    13 Replacing Print Macros

    -

    + + + +

    13.1 Replacement

    -

    In Section 1.7, it was noted that the X Window System and kernel module +

    In Section 1.7, it was noted that the X Window System and kernel module programming are not conducive to integration. This remains valid during the development of kernel modules. However, in practical scenarios, the necessity emerges to relay messages to the tty (teletype) originating the module load command. -

    The term “tty” originates from teletype, which initially referred to a combined +

    The term “tty” originates from teletype, which initially referred to a combined keyboard-printer for Unix system communication. Today, it signifies a text stream abstraction employed by Unix programs, encompassing physical terminals, xterms in X displays, and network connections like SSH. - - - -

    To achieve this, the “current” pointer is leveraged to access the active task’s tty +

    To achieve this, the “current” pointer is leveraged to access the active task’s tty structure. Within this structure lies a pointer to a string write function, facilitating the string’s transmission to the tty.

    @@ -4674,16 +4690,16 @@ the string’s transmission to the tty. 72module_exit(print_string_exit); 73 74MODULE_LICENSE("GPL"); -

    +

    13.2 Flashing keyboard LEDs

    -

    In certain conditions, you may desire a simpler and more direct way to communicate +

    In certain conditions, you may desire a simpler and more direct way to communicate to the external world. Flashing keyboard LEDs can be such a solution: It is an immediate way to attract attention or to display a status condition. Keyboard LEDs are present on every hardware, they are always visible, they do not need any setup, and their use is rather simple and non-intrusive, compared to writing to a tty or a file. -

    From v4.14 to v4.15, the timer API made a series of changes +

    From v4.14 to v4.15, the timer API made a series of changes to improve memory safety. A buffer overflow in the area of a timer_list structure may be able to overwrite the @@ -4699,6 +4715,9 @@ Thus, it is better to use a unique prototype to separate from the cluster that t unsigned long argument. The timer callback should be passed a pointer to the timer_list + + + structure rather than an unsigned long argument. Then, it wraps all the information the callback needs, including the timer_list @@ -4706,13 +4725,10 @@ Thus, it is better to use a unique prototype to separate from the cluster that t container_of macro instead of the unsigned long value. For more information see: Improving the kernel timers API. -

    Before Linux v4.14, setup_timer +

    Before Linux v4.14, setup_timer was used to initialize the timer and the timer_list structure looked like: - - -

    1struct timer_list { 
     2    unsigned long expires; 
    @@ -4724,7 +4740,7 @@ Thus, it is better to use a unique prototype to separate from the cluster that t
     8 
     9void setup_timer(struct timer_list *timer, void (*callback)(unsigned long), 
     10                 unsigned long data);
    -

    Since Linux v4.14, timer_setup +

    Since Linux v4.14, timer_setup has been adopted, and the kernel was converted step by step to timer_setup from setup_timer @@ -4735,7 +4751,7 @@ Moreover, the timer_setup

    1void timer_setup(struct timer_list *timer, 
     2                 void (*callback)(struct timer_list *), unsigned int flags);
    -

    The setup_timer +

    The setup_timer was then removed in v4.15. As a result, the timer_list structure changed to the following. @@ -4746,7 +4762,7 @@ Moreover, the timer_setup 4    u32 flags; 5    /* ... */ 6}; -
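    The container_of technique described above can be sketched in userspace. This is an illustration under stated assumptions, not kernel code: the timer_list structure here is a reduced mock, the container_of macro is a simplified stand-in for the kernel's version, and blink_state and blink_callback are hypothetical names.

```c
#include <stddef.h>

/* Simplified userspace stand-in for the kernel's container_of macro. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Reduced mock of the kernel structure; only the shape matters here. */
struct timer_list {
    unsigned long expires;
    void (*function)(struct timer_list *);
};

/* Hypothetical driver state that embeds its timer. */
struct blink_state {
    int led_on;
    struct timer_list timer;
};

/* New-style callback: it receives the timer, not an unsigned long. */
void blink_callback(struct timer_list *t)
{
    /* Recover the enclosing structure from the embedded member. */
    struct blink_state *st = container_of(t, struct blink_state, timer);

    st->led_on = !st->led_on;
}
```

    Because the callback derives its context from the address of the timer itself, there is no separate data field that a buffer overflow could redirect independently of the function pointer.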

    The following source code illustrates a minimal kernel module which, when +

    The following source code illustrates a minimal kernel module which, when loaded, starts blinking the keyboard LEDs until it is unloaded.

    @@ -4835,36 +4851,36 @@ loaded, starts blinking the keyboard LEDs until it is unloaded. 83module_exit(kbleds_cleanup); 84 85MODULE_LICENSE("GPL"); -

    If none of the examples in this chapter fit your debugging needs, +

    If none of the examples in this chapter fit your debugging needs, there might yet be some other tricks to try. Ever wondered what CONFIG_LL_DEBUG in make menuconfig is good for? If you activate that you get low level access to the serial port. While this + + + might not sound very powerful by itself, you can patch kernel/printk.c or any other essential syscall to print ASCII characters, thus making it possible to trace virtually everything that your code does over a serial line. If you find yourself porting the kernel to some new and formerly unsupported architecture, this is usually amongst the first things that should be implemented. Logging over a netconsole might also be worth a try. -

    While you have seen lots of stuff that can be used to aid debugging here, there are +

    While you have seen lots of stuff that can be used to aid debugging here, there are some things to be aware of. Debugging is almost always intrusive. Adding debug code can change the situation enough to make the bug seem to disappear. Thus, you should keep debug code to a minimum and make sure it does not show up in production code. - - - -

    +

    14 Scheduling Tasks

    -

    There are two main ways of running tasks: tasklets and work queues. Tasklets are a +

    There are two main ways of running tasks: tasklets and work queues. Tasklets are a quick and easy way of scheduling a single function to be run, for example when triggered from an interrupt, whereas work queues are more complicated but also better suited to running multiple things in sequence. -

    +

    14.1 Tasklets

    -

    Here is an example tasklet module. The +

    Here is an example tasklet module. The tasklet_fn function runs for a few seconds. In the meantime, execution of the example_tasklet_init @@ -4916,7 +4932,7 @@ better suited to running multiple things in a sequence. 42 43MODULE_DESCRIPTION("Tasklet example"); 44MODULE_LICENSE("GPL"); -

    So with this example loaded dmesg +

    So with this example loaded dmesg should show: @@ -4928,23 +4944,23 @@ Example tasklet starts Example tasklet init continues... Example tasklet ends -

    Although tasklet is easy to use, it comes with several drawbacks, and developers are +

    Although tasklet is easy to use, it comes with several drawbacks, and developers are discussing getting rid of tasklets in the Linux kernel. The tasklet callback runs in atomic context, inside a software interrupt, meaning that it cannot sleep or access user-space data, so not all work can be done in a tasklet handler. Also, the kernel only allows one instance of any given tasklet to be running at any given time; multiple different tasklet callbacks can run in parallel. -

    In recent kernels, tasklets can be replaced by workqueues, timers, or threaded +

    In recent kernels, tasklets can be replaced by workqueues, timers, or threaded interrupts.1 While the removal of tasklets remains a longer-term goal, the current kernel contains more than a hundred uses of tasklets. Now developers are proceeding with the API changes and the macro DECLARE_TASKLET_OLD exists for compatibility. For further information, see https://lwn.net/Articles/830964/. -

    +

    14.2 Work queues

    -

    To add a task to the scheduler we can use a workqueue. The kernel then uses the +

    To add a task to the scheduler we can use a workqueue. The kernel then uses the Completely Fair Scheduler (CFS) to execute work within the queue.

    @@ -4981,36 +4997,36 @@ Completely Fair Scheduler (CFS) to execute work within the queue. 31 32MODULE_LICENSE("GPL"); 33MODULE_DESCRIPTION("Workqueue example"); -

    +

    15 Interrupt Handlers

    -

    +

    15.1 Interrupt Handlers

    -

    Except for the last chapter, everything we did in the kernel so far we have done as a +

    Except for the last chapter, everything we did in the kernel so far we have done as a response to a process asking for it, either by dealing with a special file, sending an ioctl() , or issuing a system call. But the job of the kernel is not just to respond to process requests. Another job, which is every bit as important, is to speak to the hardware connected to the machine. -

    There are two types of interaction between the CPU and the rest of the +

    There are two types of interaction between the CPU and the rest of the computer’s hardware. The first type is when the CPU gives orders to the hardware, the other is when the hardware needs to tell the CPU something. The second, called interrupts, is much harder to implement because it has to be dealt with when convenient for the hardware, not the CPU. Hardware devices typically have a very small amount of RAM, and if you do not read their information when available, it is lost. -

    Under Linux, hardware interrupts are called IRQ’s (Interrupt ReQuests). There +

    Under Linux, hardware interrupts are called IRQ’s (Interrupt ReQuests). There are two types of IRQ’s, short and long. A short IRQ is one which is expected to take a very short period of time, during which the rest of the machine will be blocked and no other interrupts will be handled. A long IRQ is one which can take longer, and during which other interrupts may occur (but not interrupts from the same device). If at all possible, it is better to declare an interrupt handler to be long. -

    When the CPU receives an interrupt, it stops whatever it is doing (unless it is +

    When the CPU receives an interrupt, it stops whatever it is doing (unless it is processing a more important interrupt, in which case it will deal with this one only when the more important one is done), saves certain parameters on the stack and calls the interrupt handler. This means that certain things are not allowed in the @@ -5022,10 +5038,10 @@ heavy work deferred from an interrupt handler. Historically, BH (Linux naming for Bottom Halves) statistically book-keeps the deferred functions. Softirq and its higher level abstraction, Tasklet, replace BH since Linux 2.3. -

    The way to implement this is to call +

    The way to implement this is to call request_irq() to get your interrupt handler called when the relevant IRQ is received. -

    In practice IRQ handling can be a bit more complex. Hardware is often designed +

    In practice IRQ handling can be a bit more complex. Hardware is often designed in a way that chains two interrupt controllers, so that all the IRQs from interrupt controller B are cascaded to a certain IRQ from interrupt controller A. Of course, that requires that the kernel finds out which IRQ it really was @@ -5042,11 +5058,11 @@ need to solve another truckload of problems. It is not enough to know if a certain IRQ has happened; it’s also important to know what CPU(s) it was for. People still interested in more details might want to refer to "APIC" now. -

    This function receives the IRQ number, the name of the function, flags, a name +

    This function receives the IRQ number, the name of the function, flags, a name for /proc/interrupts and a parameter to be passed to the interrupt handler. Usually there is a certain number of IRQs available. How many IRQs there are is hardware-dependent. -

    The flags can be used to specify the behavior of the IRQ. For example, use +

    The flags can be used to specify the behavior of the IRQ. For example, use IRQF_SHARED to indicate you are willing to share the IRQ with other interrupt handlers (usually because a number of hardware devices sit on the same IRQ); use the @@ -5060,16 +5076,16 @@ the SA only the IRQF flags are in use. This function will only succeed if there is not already a handler on this IRQ, or if you are both willing to share. -

    +

    15.2 Detecting button presses

    -

    Many popular single board computers, such as Raspberry Pi or Beagleboards, have a +

    Many popular single board computers, such as Raspberry Pi or Beagleboards, have a bunch of GPIO pins. Attaching buttons to those and then having a button press do something is a classic case in which you might need to use interrupts, so that instead of having the CPU waste time and battery power polling for a change in input state, it is better for the input to trigger the CPU to then run a particular handling function. -

    Here is an example where buttons are connected to GPIO numbers 17 and 18 and +

    Here is an example where buttons are connected to GPIO numbers 17 and 18 and an LED is connected to GPIO 4. You can change those numbers to whatever is appropriate for your board.

    @@ -5219,17 +5235,17 @@ appropriate for your board. 143 144MODULE_LICENSE("GPL"); 145MODULE_DESCRIPTION("Handle some GPIO interrupts"); -

    +

    15.3 Bottom Half

    -

    Suppose you want to do a bunch of stuff inside of an interrupt routine. A common +

    Suppose you want to do a bunch of stuff inside of an interrupt routine. A common way to do that without rendering the interrupt unavailable for a significant duration is to combine it with a tasklet. This pushes the bulk of the work off into the scheduler. -

    The example below modifies the previous example to also run an additional task +

    The example below modifies the previous example to also run an additional task when an interrupt is triggered.

    @@ -5401,10 +5417,10 @@ when an interrupt is triggered. 166 167MODULE_LICENSE("GPL"); 168MODULE_DESCRIPTION("Interrupt with top and bottom half"); -

    +

    16 Virtual Input Device Driver

    -

    The input device driver is a module that provides a way to communicate +

    The input device driver is a module that provides a way to communicate with the interaction device via the event. For example, the keyboard can send the press or release event to tell the kernel what we want to do. The input device driver will allocate a new input structure with @@ -5412,7 +5428,7 @@ do. The input device driver will allocate a new input structure with and sets up input bitfields, device id, version, etc. After that, registers it by calling input_register_device() . -

    Here is an example, vinput. It is an API to allow easy +

    Here is an example, vinput. It is an API to allow easy development of virtual input drivers. The driver needs to export a vinput_device() that contains the virtual device name and @@ -5431,13 +5447,13 @@ development of virtual input drivers. The driver needs to export a

  • the readback function: read()
  • -

    Then using vinput_register_device() +

    Then using vinput_register_device() and vinput_unregister_device() will add a device to, or remove it from, the list of supported virtual input devices.

    1int init(struct vinput *);
    -

    This function is passed a struct vinput +

    This function is passed a struct vinput already initialized with an allocated struct input_dev . The init() function is responsible for initializing the capabilities of the input device and registering @@ -5445,20 +5461,20 @@ it.

    1int send(struct vinput *, char *, int);
    -

    This function will receive a user string to interpret and inject the event using the +

    This function will receive a user string to interpret and inject the event using the input_report_XXXX or input_event call. The string is already copied from user.

    1int read(struct vinput *, char *, int);
    -

    This function is used for debugging and should fill the buffer parameter with the +

    This function is used for debugging and should fill the buffer parameter with the last event sent in the virtual input device format. The buffer will then be copied to user. -

    vinput devices are created and destroyed using sysfs. And, event injection is done +

    vinput devices are created and destroyed using sysfs. And, event injection is done through a /dev node. The device name will be used by the userland to export a new virtual input device. -

    The class_attribute +

    The class_attribute structure is similar to other attribute types we talked about in section 8:

    @@ -5469,7 +5485,7 @@ virtual input device. 5    ssize_t (*store)(struct class *class, struct class_attribute *attr, 6                    const char *buf, size_t count); 7}; -

    In vinput.c, the macro CLASS_ATTR_WO(export/unexport) +

    In vinput.c, the macro CLASS_ATTR_WO(export/unexport) defined in include/linux/device.h (in this case, device.h is included in include/linux/input.h) will generate the class_attribute structures which are named class_attr_export/unexport. Then, put them into @@ -5482,11 +5498,11 @@ will generate the class_attribute that should be assigned in vinput_class . Finally, call class_register(&vinput_class) to create attributes in sysfs. -

    To create a vinputX sysfs entry and /dev node. +

    To create a vinputX sysfs entry and /dev node.

    1echo "vkbd" | sudo tee /sys/class/vinput/export
    -

    To unexport the device, just echo its id in unexport: +

    To unexport the device, just echo its id in unexport:

    1echo "0" | sudo tee /sys/class/vinput/unexport
    @@ -5963,7 +5979,7 @@ will generate the class_attribute 416 417MODULE_LICENSE("GPL"); 418MODULE_DESCRIPTION("Emulate input events"); -

    Here the virtual keyboard is one example of using vinput. It supports all +

    Here the virtual keyboard is one example of using vinput. It supports all KEY_MAX keycodes. The injection format is the KEY_CODE as defined in include/linux/input.h. A positive value means @@ -5971,12 +5987,12 @@ will generate the class_attribute while a negative value is a KEY_RELEASE . The keyboard supports repetition when the key stays pressed for too long. The following demonstrates how the simulation works. -

    Simulate a key press on "g" ( KEY_G +

    Simulate a key press on "g" ( KEY_G = 34):

    1echo "+34" | sudo tee /dev/vinput0
    -

    Simulate a key release on "g" ( KEY_G +

    Simulate a key release on "g" ( KEY_G = 34):

    @@ -6097,10 +6113,10 @@ following demonstrates how simulation work. 108 109MODULE_LICENSE("GPL"); 110MODULE_DESCRIPTION("Emulate keyboard input events through /dev/vinput"); -
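    The injection format described above (a signed key code, where the sign selects press or release) can be sketched as a small userspace parser. This is an illustration only and is not the module's actual send() implementation: struct key_event, parse_key_event and SKETCH_KEY_MAX are hypothetical names, and the upper bound of 0x2ff is an assumption about the value of KEY_MAX.

```c
#include <errno.h>
#include <stdlib.h>

/* Assumed upper bound for key codes; stands in for KEY_MAX. */
#define SKETCH_KEY_MAX 0x2ff

/* Hypothetical result type: key code plus press/release state. */
struct key_event {
    int code;    /* key code, e.g. 34 for KEY_G */
    int pressed; /* 1 = press, 0 = release */
};

/*
 * Parse "+34" or "34" as a press and "-34" as a release.
 * Returns 0 on success and -1 on malformed or out-of-range input.
 */
int parse_key_event(const char *s, struct key_event *ev)
{
    char *end;
    long val;

    errno = 0;
    val = strtol(s, &end, 10);
    if (end == s || errno != 0)
        return -1;
    /* Tolerate the trailing newline that echo appends. */
    if (*end != '\0' && *end != '\n')
        return -1;
    /* Zero carries no sign, so this sketch rejects it. */
    if (val == 0 || val > SKETCH_KEY_MAX || val < -SKETCH_KEY_MAX)
        return -1;

    ev->pressed = val > 0;
    ev->code = (int)(val > 0 ? val : -val);
    return 0;
}
```

    Encoding press and release in the sign keeps the userland interface to a single number per event, at the cost of making key code 0 unrepresentable in this sketch.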

    +

    17 Standardizing the interfaces: The Device Model

    -

    Up to this point we have seen all kinds of modules doing all kinds of things, but there +

    Up to this point we have seen all kinds of modules doing all kinds of things, but there was no consistency in their interfaces with the rest of the kernel. To impose some consistency, such that there is at minimum a standardized way to start, suspend and resume, a device model was added. An example is shown below, and you can @@ -6206,13 +6222,13 @@ functions. 96 97MODULE_LICENSE("GPL"); 98MODULE_DESCRIPTION("Linux Device Model example"); -

    +

    18 Optimizations

    -

    +

    18.1 Likely and Unlikely conditions

    -

    Sometimes you might want your code to run as quickly as possible, +

    Sometimes you might want your code to run as quickly as possible, especially if it is handling an interrupt or doing something which might cause noticeable latency. If your code contains boolean conditions and if you know that the conditions are almost always likely to evaluate as either @@ -6234,16 +6250,16 @@ to succeed. -

    When the unlikely +

    When the unlikely macro is used, the compiler alters its machine instruction output, so that it continues along the false branch and only jumps if the condition is true. That avoids flushing the processor pipeline. The opposite happens if you use the likely macro. -
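    As a rough illustration of the macros discussed above, the following userspace sketch defines likely and unlikely in terms of the GCC and Clang builtin __builtin_expect. The helper sum_buffer is a hypothetical example, and the sketch assumes a compiler that provides the builtin.

```c
#include <stddef.h>

/* Userspace copies of the branch hint macros (GCC and Clang builtin). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Hypothetical helper: sum a buffer, treating a NULL pointer as the
 * rare error case so the compiler lays out the loop as the
 * fall-through path.
 */
long sum_buffer(const int *buf, int len)
{
    long total = 0;

    if (unlikely(buf == NULL))
        return -1;

    for (int i = 0; i < len; i++)
        total += buf[i];
    return total;
}
```

    The hint does not change what the function computes; it only influences how the compiler arranges the generated branches.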

    +

    18.2 Static keys

    -

    Static keys allow us to enable or disable kernel code paths based on the runtime state +

    Static keys allow us to enable or disable kernel code paths based on the runtime state of a key. The APIs have been available since 2010 (most architectures are already supported) and use self-modifying code to eliminate the overhead of cache and branch prediction. The most typical use case of static keys is for performance-sensitive kernel @@ -6257,7 +6273,7 @@ Before we can use static keys in the kernel, we need to make sure that gcc suppo

    1CONFIG_JUMP_LABEL=y 
     2CONFIG_HAVE_ARCH_JUMP_LABEL=y 
     3CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE=y
    -

    To declare a static key, we need to define a global variable using the +

    To declare a static key, we need to define a global variable using the DEFINE_STATIC_KEY_FALSE or DEFINE_STATIC_KEY_TRUE macro defined in include/linux/jump_label.h. This macro initializes the key with @@ -6267,7 +6283,7 @@ code:

    1DEFINE_STATIC_KEY_FALSE(fkey);
    -

    Once the static key has been declared, we need to add branching code to the +

    Once the static key has been declared, we need to add branching code to the module that uses the static key. For example, the code includes a fastpath, where a no-op instruction will be generated at compile time as the key is initialized to false and the branch is unlikely to be taken. @@ -6277,7 +6293,7 @@ and the branch is unlikely to be taken. 2if (static_branch_unlikely(&fkey)) 3    pr_alert("do unlikely thing\n"); 4pr_info("fastpath 2\n"); -

    If the key is enabled at runtime by calling +

    If the key is enabled at runtime by calling static_branch_enable(&fkey) , the fastpath will be patched with an unconditional jump instruction to the slowpath @@ -6285,7 +6301,7 @@ and the branch is unlikely to be taken. code pr_alert , so the branch will always be taken until the key is disabled again. -

    The following kernel module derived from chardev.c, demonstrates how the +

    The following kernel module derived from chardev.c, demonstrates how the static key works.

    @@ -6484,59 +6500,59 @@ static key works. 193module_exit(chardev_exit); 194 195MODULE_LICENSE("GPL"); -

    To check the state of the static key, we can use the /dev/key_state +

    To check the state of the static key, we can use the /dev/key_state interface.

    1cat /dev/key_state
    -

    This will display the current state of the key, which is disabled by default. -

    To change the state of the static key, we can perform a write operation on the +

    This will display the current state of the key, which is disabled by default. +

    To change the state of the static key, we can perform a write operation on the file:

    1echo enable > /dev/key_state
    -

    This will enable the static key, causing the code path to switch from the fastpath +

    This will enable the static key, causing the code path to switch from the fastpath to the slowpath. -

    In some cases, the key is enabled or disabled at initialization and never changed, +

    In some cases, the key is enabled or disabled at initialization and never changed, so we can declare the static key as read-only, which means that it can only be toggled in the module init function. To declare a read-only static key, we can use the DEFINE_STATIC_KEY_FALSE_RO or DEFINE_STATIC_KEY_TRUE_RO macro instead. Attempts to change the key at runtime will result in a page fault. For more information, see Static keys -

    +

    19 Common Pitfalls

    -

    +

    19.1 Using standard libraries

    -

    You can not do that. In a kernel module, you can only use kernel functions which are +

    You can not do that. In a kernel module, you can only use kernel functions which are the functions you can see in /proc/kallsyms. -

    +

    19.2 Disabling interrupts

    -

    You might need to do this for a short time and that is OK, but if you do not enable +

    You might need to do this for a short time and that is OK, but if you do not enable them afterwards, your system will be stuck and you will have to power it off. -

    +

    20 Where To Go From Here?

    -

    For those deeply interested in kernel programming, kernelnewbies.org and the +

    For those deeply interested in kernel programming, kernelnewbies.org and the Documentation subdirectory within the kernel source code are highly recommended. Although the latter may not always be straightforward, it serves as a valuable initial step for further exploration. Echoing Linus Torvalds’ perspective, the most effective method to understand the kernel is through personal examination of the source code. -

    Contributions to this guide are welcome, especially if there are any significant +

    Contributions to this guide are welcome, especially if there are any significant inaccuracies identified. To contribute or report an issue, please initiate an issue at https://github.com/sysprog21/lkmpg. Pull requests are greatly appreciated. -

    Happy hacking! +

    Happy hacking!

    -

    1The goal of threaded interrupts is to push more of the work to separate threads, so that the +

    1The goal of threaded interrupts is to push more of the work to separate threads, so that the minimum needed for acknowledging an interrupt is reduced, and therefore the time spent handling the interrupt (where it can’t handle any other interrupts at the same time) is reduced. See https://lwn.net/Articles/302043/.