Linux Kernel Design and Implementation Reading Notes

Monolithic and Microkernel Design

Monolithic: The kernel is a single binary image; components communicate by direct function calls within one large kernel address space, conceptually one large procedure.

Microkernel: Functionality is divided into multiple independent processes, each called a server. Only servers that genuinely require privileged services run in privileged mode; the rest run in user space. Servers communicate through message passing (IPC), which adds extra overhead.

Linux is a monolithic kernel, running all kernel operations in kernel mode with direct calls instead of message passing. It supports dynamic loading and unloading of parts of the kernel (modules), and the kernel does not distinguish between threads and processes — threads are simply processes that share certain resources.

Kernel Activity Scope

  1. Running in user space, executing user processes
  2. Running in kernel space, in process context, executing on behalf of a specific process
  3. Running in kernel space, in a specific interrupt context space, unrelated to any process, handling specific interrupts

Kernel Development Considerations

  • Lack of memory protection mechanisms like those in user space
  • Floating-point operations are hard to use (in user space the kernel manages FPU state transparently, trapping and switching modes as needed, but in kernel space the FPU state would have to be saved and restored manually, so the kernel generally avoids floating point)
  • Kernel stack space size is two pages: 8KB under 32-bit, 16KB under 64-bit
  • Linux kernel can be preempted

Processes

Process descriptors (task_struct) are stored in the task list, a circular doubly linked list. The descriptor contains the process's open files, address space, pending signals, and process state. Since 2.6, task_struct is allocated dynamically by the slab allocator, with a pointer to it stored in the thread_info structure at the bottom of the kernel stack.

On x86, accessing the current process descriptor means computing the thread_info address from the stack pointer (masking off the low bits, since thread_info sits at the stack's base) and following its task pointer.

Process Context

When executing system calls or triggering exceptions, execution enters kernel space. At this time, the kernel executes on behalf of the process and is in process context (kernel space).

Process Tree

All processes are descendants of the init process with PID 1. Each process’s process descriptor structure stores both a pointer to the parent process task and a pointer list to child processes.

fork()

Created using copy-on-write: parent and child share pages, and a page is copied only when one of them writes to it. If the child immediately calls exec(), the copying is avoided entirely, improving speed.

vfork() exists for historical reasons: the child shares the parent's address space outright, and the parent blocks until the child exits or calls exec(). With copy-on-write fork its benefit is now marginal.

Process Termination

On exit a process releases its system resources (open-file reference counts are decremented, the address space is released, and the exit code is handed to the kernel for the parent to retrieve), sends a signal to its parent, and reparents its children either to another thread in its thread group or to init. It then enters the zombie state: only the kernel stack, thread_info, and task_struct remain. The process descriptor is retained until the parent reaps it.

Orphan Processes

Prioritizes finding parents within the same thread group; if unsuccessful, assigns init as parent.

Threads

From the kernel’s perspective, threads are merely viewed as processes that share certain resources with others.

The essence of creating threads and processes is calling clone, just specifying some shared resources.

Therefore, when calling getpid, the process’s PID displays the thread group’s TGID!

Kernel Threads

Difference from ordinary threads: No independent address space, derived from kthreadd.

Process Scheduling Strategy

Processes can be categorized as I/O-bound or CPU-bound, with most GUI applications being I/O-bound.

Scheduling strategy design needs to balance between process response time and maximum system utilization.

Priority: Measured through nice values.

Higher nice values mean lower priority, reducing processor usage ratio. Lower nice values mean higher priority, increasing processor usage ratio.

2.5: O(1) scheduling strategy

2.6: Completely Fair Scheduler (CFS) Preemption timing is determined by the processor usage ratio consumed by the new executable program. If the usage ratio is smaller than the currently running process, the new process preempts the current process. The smaller the usage ratio, the more we can determine it’s an I/O-bound process, thus requiring higher response time and needing to preempt the current process when immediate response is required.

Uses relative nice values to determine processor usage ratio. If there are many processes, the default minimum CPU time allocated is 1ms.

In traditional Unix schedulers, every process records a remaining time slice, decremented on each system tick; when it reaches 0, the process is preempted by one whose slice has not yet expired. CFS replaces fixed time slices with proportional shares of the processor.

vruntime stores the process’s virtual runtime, recording how long a program has run and how much longer it should run, regularly updated by the system timer.

The runnable-process queue is a red-black tree keyed by vruntime. CFS uses it to quickly find the process with minimum vruntime, with the leftmost node cached in advance.

  • Process addition: when a process becomes runnable or is first created via fork
  • Process deletion: when a process blocks (becomes non-runnable) or terminates
  • Scheduling selection: pick the highest-priority process from the highest-priority scheduler class

Sleeping processes are placed in a wait queue (simple linked list implementation). After a process is awakened by a condition completion, it checks again whether conditions are satisfied: satisfied → exit loop, unsatisfied → call schedule again and repeat. For example, when IO data arrives, VFS is responsible for calling wake_up to awaken.

User Preemption

  1. Occurs when returning from system calls
  2. Occurs when returning to user space from interrupt handling

Kernel Preemption

As long as the kernel doesn’t hold locks, the kernel can be safely preempted for rescheduling and yielding CPU.

Real-time Scheduling

CFS discussions are limited to SCHED_NORMAL level processes. For special requirements, can be set to SCHED_FIFO and SCHED_RR. RR is essentially a time-sliced FIFO—the process cannot continue execution after its time slice expires. FIFO-level processes can continue execution while in executable state until blocking or explicitly yielding CPU.

System Calls

Purpose:

  1. Hardware abstraction interface
  2. Ensuring system stability and security

System calls are the only means for user space to access the kernel; aside from exceptions and traps, they are the only legitimate entry points into it.

Unix interface design: provide mechanism, not policy. In kernel space the return value is a long (for 64-bit compatibility) indicating success or failure; user space sees an int. On error, the C library writes the error code to the global errno variable, and a readable message can be printed with perror().

System call implementation: functions are marked with the asmlinkage directive so the compiler takes all parameters from the stack. Mechanism for notifying the kernel of a system call: the software interrupt int $0x80 (on x86).

System call number — eax
Parameters — up to five registers: ebx, ecx, edx, esi, edi
More than five parameters — a single register holds a pointer to a user-space block containing them
Return value — eax

In process context, the kernel can sleep (e.g., during system call blocking or explicit schedule calls) and can be preempted. Because the current process may be preempted by other processes, system calls must be reentrant.

Kernel Data Structures

Built-in data structures:

  • Linked lists — singly, doubly, and circular, convenient to traverse
  • Queues — producer-consumer byte queues (kfifo)
  • Mappings — implemented with binary search trees rather than hash tables, since trees keep O(log n) worst-case lookups while hash tables can degrade to linear time
  • Red-black trees — for large amounts of data requiring fast retrieval

Red-Black Trees

A self-balancing binary search tree maintaining semi-balanced structure with the following properties:

  1. Every node is either red or black
  2. Leaf nodes are black
  3. Leaf nodes contain no data
  4. Every non-leaf node has two children
  5. If a node is red, both its children are black
  6. Every path from a given node down to its leaves contains the same number of black nodes

Why can it guarantee that the depth of the deepest leaf node doesn’t exceed twice the depth of the shallowest leaf node?

A path can be all black, or alternate red and black. With equal black-node counts, the all-black path is the shortest and the alternating path the longest — at most twice as long.
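Written out, assuming every root-to-leaf path contains b black nodes:

```latex
h_{\min} = b \quad\text{(all-black path)}
\qquad
h_{\max} \le 2b \quad\text{(red nodes are never adjacent, so at most one red per black)}
\;\Rightarrow\;
h_{\max} \le 2\,h_{\min}
```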

Interrupts

Interrupts arrive asynchronously, unrelated to the processor clock, and can interrupt the kernel at almost any time. Interrupt handlers are called and run in a special interrupt context.

Interrupt context is also called atomic context. Code inside is non-blocking and non-preemptive, running until completion. Therefore, its handler must complete execution in a short time.

Why can’t interrupt routines sleep? Because an interrupt handler has no backing process context, there is nothing the scheduler could put back on the run queue — it cannot be rescheduled, so it must run to completion quickly.

Upper half of interrupt handling: acknowledge and service the interrupt (the registered handler) with interrupts disabled. Work that can wait is deferred to the lower half, which runs with interrupts enabled. Hardware provides status registers the handler can query.

Interrupt handlers are pre-registered callback functions in the kernel, different functions located in different driver programs.

  1. Install driver, initialize hardware
  2. Register interrupt handler
  3. Unregister interrupt handler
  4. Uninstall driver

Interrupt handlers need not be reentrant: while a handler runs, its interrupt line is masked on every processor, so the same handler is never nested. Interrupts on other lines remain unaffected.

Interrupt handlers run on the kernel stack of the interrupted process (or of the idle task when nothing is running) — two pages: 8KB on 32-bit, 16KB on 64-bit.

Lower Half of Interrupts

Time-consuming work that can tolerate being interrupted should be deferred to the lower half, but softirq- and tasklet-based lower halves still may not reschedule or sleep.

Three delayed kernel work mechanisms:

  1. Soft interrupts — Run in interrupt context. Statically allocated at compile time, suitable for high real-time and continuous tasks such as networking. Since soft interrupts may trigger themselves again during execution, it may cause shared data contention leading to lock competition. Generally can be resolved using per-processor data.

A registered soft interrupt runs only after being raised (marked pending). The upper half typically raises it, and the kernel checks for pending soft interrupts at defined points and executes the lower-half routines there.

  2. Tasklets — Run in interrupt context. Dynamically registered at runtime, convenient to use, with lower priority than soft interrupts. Implemented on top of soft interrupts, but unlike soft interrupts, two identical tasklets never execute simultaneously, though different tasklets can run on different processors. A tasklet always executes on the processor that scheduled it, making better use of the CPU cache.

  3. Work queues — The only interrupt lower half running in process context, at the cost of kernel-thread context-switch overhead.

Soft Interrupt Management

Since soft interrupts can re-raise themselves during execution for repeated runs, as in the network subsystem, a group of kernel threads (ksoftirqd, one per processor) monitors this to avoid starvation.

This monitoring thread processes soft interrupts whenever there’s a free processor, while ensuring user processes don’t starve.

Work Queues !!!

The common way to defer work is a queue, and the key point is that this approach allows rescheduling and sleeping. It is also the only lower-half mechanism that runs in process context.

Essentially an interface: hand tasks needing deferred execution to generic kernel threads, one worker thread per processor. Each thread repeatedly takes a work function off its queue (a linked list) and executes it. Examples of work needing this context: allocating large amounts of memory, acquiring semaphores, performing blocking I/O.

Synchronization Mechanisms

Lock data, not code.

Deadlock

The simplest deadlock is self-deadlock: if a thread attempts to acquire a lock it already holds, it waits forever for itself to release it (busy-waiting, in the case of a spinlock).

ABBA deadlock

Deadlock prevention:

  1. Lock in order
  2. Prevent starvation
  3. Don’t repeatedly request the same lock
  4. Design should strive for simplicity

Lock granularity: when lock contention is severe, coarse-grained locking limits scalability; when contention is light, fine-grained locking only adds overhead.

Synchronization Methods Provided by Kernel

  1. Atomic operations Implemented with inline assembly instructions, often defined as macros. Atomicity guarantees an operation is indivisible (mutual exclusion); read/write ordering is a separate concern, addressed by barriers
  2. Spinlocks Used across critical sections, threads without locks enter busy-waiting, holding spinlock time should be less than the cost of completing two context switches, suitable for short-term locking
  3. Semaphores A sleeping lock, when unavailable pushed to a wait queue, tasks in wait queue awakened and obtain semaphore when lock is released
  4. Mutex A sleeping lock, suitable for long-term locking scenarios
  5. Completion variables Similar to semaphores, let tasks wait on a variable enabling task synchronization
  6. Sequential locks (optimistic locks) Rely on a sequence counter (like version numbers in databases): the writer increments it before and after writing, so if a reader sees the same value before and after reading, the read was not interrupted by a write. Suited to many-reader/few-writer scenarios where writes take priority over reads

Barriers: read barriers and write barriers prevent the compiler and the processor from reordering reads and writes of shared data!

Timers

If the tick rate is k Hz, one tick is 1/k seconds. The default x86 tick rate is 100 Hz. A higher tick rate gives finer time resolution and precision: time-dependent tasks execute more accurately, preemption is more precise, and scheduling response is faster. The drawback is increased system overhead.

Real Time

Reading the xtime variable uses the sequential-lock read side: keep re-reading until the sequence count confirms no write modified it in between.

Timers

Timers are created dynamically, registering an expiry function. A timer generally runs promptly on timeout but may be pushed to the next clock tick, so timers are unsuited to hard real-time tasks.

Kernel uses linked lists to store timers, but to improve efficiency, kernel allocates timers with similar timeouts to groups, reducing system overhead for timeout checking.

Delayed Execution

  1. Busy waiting: Suitable when the delay is an integer multiple of ticks or precision isn't critical. The loop should explicitly allow the kernel to reschedule
  2. Short delays: Suitable for brief delays requiring precise timing. Precision finer than a tick is possible because the number of busy-loop iterations per second is known, so a delay can be computed in loop counts; precision can reach the microsecond level

Memory Management

The Memory Management Unit (MMU) manages memory in pages via page tables. From the virtual-memory perspective, the page is the smallest unit of memory; page size depends on architecture. Most 32-bit architectures support 4KB pages, and many 64-bit architectures support 8KB pages.

Properties of page struct page:

  1. flags indicates page status
  2. _count indicates page reference count — when count becomes -1, page has no references and can be reused by page allocator for new allocation
  3. virtual field stores page’s virtual address

This structure describes physical pages, not the data in them, and its association with contents is transient: the kernel uses it only to track what currently occupies each physical page. The kernel keeps one such structure for every page in the system; the space this costs is negligible compared to the memory being managed.

Memory Zones

Kernel doesn’t treat all pages equally, so needs zone division grouping pages with similar functions.

  1. ZONE_DMA: Pages used by DMA, physical memory < 16MB
  2. ZONE_NORMAL: Normally addressable pages, 16 ~ 896 MB
  3. ZONE_HIGHMEM: Dynamically mapped pages, > 896 MB

Kernel page allocation rules: When resources aren’t tight, normally mapped pages are allocated from ZONE_NORMAL pool; if resources are tight, can attempt allocation from ZONE_DMA. But DMA-purpose pages must be allocated from ZONE_DMA pool.

Note that x86-64 has no ZONE_HIGHMEM high memory zone.

Page Allocation and Release

Kernel interfaces for page allocation and release:

  • alloc_pages() — allocates 2^N physically contiguous pages; variants return zero-filled pages, etc.
  • kmalloc() — byte-granular allocation of physically contiguous kernel memory; higher performance
  • vmalloc() — memory contiguous in virtual addresses but not necessarily in physical memory, requiring page-table entries to be built. This is how user-space allocations (malloc) behave, and it causes more TLB thrashing.

Slab allocator serves as cache layer for general data structures, solving memory fragmentation issues caused by frequent memory allocation and release, handling all low-level alignment, coloring, allocation and release recovery under memory shortage.

Objects are grouped into caches: each object type has its own cache. kmalloc is built on this layer.

Each cache is divided into slabs, each consisting of one or more physically contiguous pages. A slab is full, partially full, or empty; when the kernel requests a new object of a given type, the allocator serves it first from partially full, then from empty slabs.

Slabs in the three states are linked with lists. An example of using a cache for rapid allocation appears in fork: fork_init() creates the cache that stores task_struct, and when a process terminates and its descriptor is released, the memory is returned to the cache for reuse.

Stack Space

The user-space stack is large, while the kernel stack is small and fixed at two pages: 8KB on 32-bit, 16KB on 64-bit. When the single-page kernel-stack option is enabled, interrupt handlers get their own stack instead of sharing the interrupted process's kernel stack.

Large static allocations on the kernel stack are dangerous: on overflow, thread_info at the stack's base is corrupted first, and arbitrary kernel data can follow. Dynamic allocation is the better solution and lets overflow be detected promptly.

On x86, physical memory above 896MB is mostly high memory: such pages cannot be permanently mapped into the kernel's address space and must be mapped into kernel virtual addresses temporarily when used.

Virtual File System (VFS)

Different filesystems coexist through VFS. Users can call open(), read(), and write() directly without caring which filesystem is underneath: filesystems are reached through the virtual interface, which is what makes this collaboration and generality possible.

write       -> sys_write -> specific filesystem write method -> physical medium
user space      VFS              specific filesystem

Unix files are ordered byte strings, first byte is file head, last byte is file tail, abstracted as byte streams.

VFS treats directories as regular files too: a directory file lists the files it contains. Unix separates a file's metadata from its data: per-file information is stored in the inode, while filesystem-wide control information is stored in the superblock.

VFS implementation adopts object-oriented design:

  • Superblock object — a specific mounted filesystem
  • Inode object — a specific file (or directory)
  • Dentry object — a directory entry, one component of a path
  • File object — a file as opened by a process

Every major object contains an operations object: a structure of function pointers that act on the parent object, serving as generic methods (C-style polymorphism). Where the generic implementations do not suffice, the concrete filesystem fills in its own functions.

Superblock Object

Every filesystem must implement a superblock object storing filesystem-specific information. It typically corresponds to a superblock or control block stored in a special disk sector; non-disk filesystems generate a superblock on the fly and keep it only in memory.

Inode Object

Contains all the information the kernel needs to operate on a file or directory; it is created in memory, for the filesystem's use, only when the file is accessed. An inode represents one file in the filesystem, including special files such as devices or pipes.

Inode mainly records file attributes and which blocks actually store file data, contains at least following information (excluding filename!):

  • File size, block numbers of actual content
  • Access mode rwx
  • Owner and group
  • Various timestamps

Filename stored in directory’s block.

Dentry Object

To facilitate directory lookup, VFS introduced dentries, making path string comparison simpler. Three states: in-use, unused, and negative. Objects can be saved to slab object cache after release.

Dentries use cache to enable fast path lookup, using hash tables and linked lists to build cache.

File Object

This is the object processes deal with directly: the in-memory representation of an opened file. A file object points to its dentry through a pointer field, and through the dentry to the inode, which records whether the file is dirty (modified).

Process of reading file based on file path:

  • Start from the top-level directory's inode and block, found via mount-point information
  • Check that read permission is allowed; if so, read the directory's block content to map the next path component to its inode number, and recurse
  • Finally read out the file's own blocks

Other Process-Related Data Structures

Each process has its own set of open files. The default open-file array size on 64-bit machines is 64; if a process exceeds 64, the kernel allocates a larger array and replaces the old array pointer with the new one.

Block I/O Layer

Block devices — random access, e.g., hard disks Character devices — sequential access, e.g., serial ports and keyboards

Block Data

Blocks are where file data actually lives; block size is fixed at format time. Every block has a number; common sizes are 1KB, 2KB, and 4KB.

Block Devices

A block device's smallest addressable unit is the sector, generally a power of two in size — most commonly 512 bytes. Blocks are a filesystem abstraction: filesystems are accessed in units of blocks, whose size must be an integer multiple of the sector size — usually 512B, 1KB, or 4KB — and no larger than one page.

When a block is loaded into memory it must correspond to a buffer, each buffer has a corresponding descriptor.

Request Queue and I/O Scheduling

Block devices keep pending block I/O requests in request queues (doubly linked lists). The I/O scheduler divides disk I/O among the pending requests by merging and sorting them in the queue to reduce seek time.

Merging: multiple requests are combined into one, so adjacent sectors are requested only once, improving performance by reducing seek operations.

Elevator scheduling algorithm — merging plus sorting: requests are always inserted into the queue in sector order, without ever checking how long a request has waited. Disadvantage: it can starve some requests.

Deadline scheduling algorithm: Solution for starvation problems, classifies each request (read queue, write queue, sorted queue) and determines timeout, expired requests prioritized for extraction from queue and sent to dispatch queue (all different class queues converge here for request merging).

Anticipatory I/O scheduling algorithm: Adds predictive heuristic capability on top of deadline scheduling, difference being after request submission doesn’t directly return to handle other requests, instead deliberately idles briefly, giving applications opportunity to submit requests (may enable another round of merging!).

Process Address Space

A process in 32-bit address space can address 4GB virtual memory.

Process can only access memory addresses within valid memory regions, each region has its own readable/writable/executable permission controls. If process accesses invalid memory regions, or accesses valid memory in incorrect ways, kernel terminates the process and returns segmentation fault information.

  • Code segment
  • Data segment — initialized global variables
  • BSS segment — uninitialized global variables
  • Process user stack
  • Memory-mapped files
  • Shared memory segments
  • Heap — anonymous memory mappings

The process address space is represented by a memory descriptor; each process has its own.

Thread essence: the parent shares its address space with the child by setting the CLONE_VM flag when calling clone() (which fork also uses, without that flag). Such processes are called threads; to the kernel, threads are merely processes sharing specific resources.

Kernel threads: have no process address space and no memory descriptor. They never access user-space memory; when one is needed, the kernel lends them the memory descriptor of the previously running process.

The mmap field links all memory-region objects into a list, while the mm_rb field holds the same objects in a red-black tree: the list serves full traversal, and the tree makes quickly locating the region containing a given address efficient.

Page Tables

Translating a virtual address to a physical address means splitting the virtual address into fields, each serving as an index into one level of page table; page-table entries point to the next-level table or to the final physical page.

Linux uses three-level page tables to complete address translation, multi-level page tables can significantly reduce storage space. Page table searching work done by hardware. Each process has its own page table, to improve parsing speed, uses TLB as hardware cache mapping virtual addresses to physical addresses.

Page Cache

The page cache is the kernel's disk cache, used to reduce disk I/O: disk data is cached in physical memory, so disk access becomes memory access. The cache expands into idle memory and shrinks to relieve memory pressure.

During read operations, first checks if in page cache, if present reads directly from memory.

Write operations also write to cache first, backend storage doesn’t update immediately, instead marking written pages in page cache as dirty and adding to dirty page list. Finally multiple writeback kernel threads write back to disk.

Cached pages are managed with a two-list strategy — an active list and an inactive list — approximating LRU/2. Pages on the inactive list are eligible to be swapped out and reclaimed.

The buffer cache maps disk blocks to pages in memory via buffers, reducing the I/O needed to locate specific blocks: writes go to the buffer first and are later flushed to the physical blocks.