The previous part was the first part of the chapter that describes the system call concepts in the Linux kernel. In the previous part we learned what a system call is in the Linux kernel, and in operating systems in general. This was introduced from a user-space perspective, and part of the write system call implementation was discussed. In this part we continue our look at system calls, starting with some theory before moving onto the Linux kernel code.
A user application does not usually invoke a system call directly. We did not write the Hello world! program like this:
int main(int argc, char **argv)
{
...
...
...
sys_write(fd1, buf, strlen(buf));
...
...
}
Instead, we use the wrapper provided by the C standard library, and it looks something like this:
#include <unistd.h>
int main(int argc, char **argv)
{
...
...
...
write(fd1, buf, strlen(buf));
...
...
}
But write is neither a direct system call nor a kernel function; it is a library wrapper. To make the actual system call, an application must fill the general purpose registers with the correct values in the correct order and execute the syscall instruction. In this part we will look at what happens in the Linux kernel when the processor encounters the syscall instruction.
From the previous part we know that the system call concept is very similar to an interrupt. Historically system calls were even implemented as software interrupts, and the syscall instruction behaves in a comparable way: when the processor executes a syscall instruction from a user application, control is transferred from user mode to an entry point in the kernel. Like the exception handlers, the system call handlers (the kernel C functions that implement each system call) are placed in the kernel code. But how does the Linux kernel find the address of the handler for a given system call? The Linux kernel contains a special table called the system call table. The system call table is represented by the sys_call_table array in the Linux kernel, which is defined in the arch/x86/entry/syscall_64.c source code file. Let's look at its implementation:
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
[0 ... __NR_syscall_max] = &sys_ni_syscall,
#include <asm/syscalls_64.h>
};
As we can see, sys_call_table is an array of __NR_syscall_max + 1 elements, where the __NR_syscall_max macro represents the maximum system call number for the given architecture. This book is about the x86_64 architecture, so in our case __NR_syscall_max is 547, which is the correct number at the time of writing (the current Linux kernel version is 5.0.0-rc7). We can see this macro in the header file generated by Kbuild during kernel compilation, include/generated/asm-offsets.h:
#define __NR_syscall_max 547
The system call numbers listed in the arch/x86/entry/syscalls/syscall_64.tbl file for x86_64 go up to the same value. There are two important topics here: the type of the sys_call_table array and the initialization of its elements. First of all, the type. sys_call_ptr_t represents a pointer to a system call handler. It is defined as a typedef for a pointer to a function that takes no arguments and returns nothing:
typedef void (*sys_call_ptr_t)(void);
The second thing is the initialization of the sys_call_table array. As we can see in the code above, every element of the array initially points to sys_ni_syscall. The sys_ni_syscall function represents not-implemented system calls. This is the correct initial behaviour, because at this point we only set up the storage for the pointers to the system call handlers; the real handlers are filled in afterwards. The implementation of sys_ni_syscall is pretty simple, it just returns a negative error code, -ENOSYS in our case:
asmlinkage long sys_ni_syscall(void)
{
return -ENOSYS;
}
The -ENOSYS
error tells us that:
ENOSYS Function not implemented (POSIX.1)
Also note the ... in the initialization of sys_call_table. It relies on a GCC extension to designated initializers that lets us initialize a whole range of elements with the same value. Designated initializers also allow elements to be initialized in any order, and a later initializer overrides an earlier one for the same index. As you can see, we include the asm/syscalls_64.h header at the end of the array. This header file is generated from the syscall table by the special script at arch/x86/entry/syscalls/syscalltbl.sh. The asm/syscalls_64.h header contains the following macro invocations:
__SYSCALL_COMMON(0, sys_read, sys_read)
__SYSCALL_COMMON(1, sys_write, sys_write)
__SYSCALL_COMMON(2, sys_open, sys_open)
__SYSCALL_COMMON(3, sys_close, sys_close)
__SYSCALL_COMMON(5, sys_newfstat, sys_newfstat)
...
...
...
The __SYSCALL_COMMON macro is defined in the same source code file as sys_call_table and expands to the __SYSCALL_64 macro, which in turn expands to an array element initializer:
#define __SYSCALL_COMMON(nr, sym, compat) __SYSCALL_64(nr, sym, compat)
#define __SYSCALL_64(nr, sym, compat) [nr] = sym,
So, after this, our sys_call_table
takes the following form:
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
[0 ... __NR_syscall_max] = &sys_ni_syscall,
[0] = sys_read,
[1] = sys_write,
[2] = sys_open,
...
...
...
};
After this, all elements that correspond to not-implemented system calls still contain the address of the sys_ni_syscall function, which just returns -ENOSYS as we saw above, and the other elements point to the corresponding sys_syscall_name functions.
At this point, we have filled the system call table and the Linux kernel knows where each system call handler is. But the Linux kernel does not call a sys_syscall_name function immediately after it is asked to handle a system call from a user space application. Remember the chapter about interrupts and interrupt handling. When the Linux kernel gets control to handle an interrupt, it has to do some preparations, like saving user space registers and switching to a new stack, before it calls the interrupt handler. The same is true for system call handling. These preparations come first, but before the Linux kernel can even start them, the system call entry point must be initialized, and only the Linux kernel knows how to do that. In the next paragraph we will see how the system call entry is initialized in the Linux kernel.
When a system call occurs, where are the first bytes of code that start to handle it? As we can read in the Intel manual (64-ia-32-architectures-software-developer-vol-2b-manual):
SYSCALL invokes an OS system-call handler at privilege level 0.
It does so by loading RIP from the IA32_LSTAR MSR
This means that we need to put the address of the system call entry into the IA32_LSTAR model specific register. This operation takes place during the Linux kernel initialization process. If you have read the fourth part of the chapter that describes interrupts and interrupt handling in the Linux kernel, you know that the Linux kernel calls the trap_init function during the initialization process. This function is defined in the arch/x86/kernel/traps.c source code file and initializes the non-early exception handlers such as divide error, coprocessor error, etc. Besides that, it calls the cpu_init function from the arch/x86/kernel/cpu/common.c source code file, which, besides initializing per-cpu state, calls the syscall_init function from the same source code file.
This function performs the initialization of the system call entry point. Let's look at its implementation. It does not take parameters and first of all it fills two model specific registers:
wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32);
wrmsrl(MSR_LSTAR, entry_SYSCALL_64);
The first model specific register, MSR_STAR, holds segment selectors. Bits 63:48 contain the user code segment selector. This is the base from which the sysret instruction, which returns from a system call to user code with the related privilege, loads the CS and SS segment registers. Bits 47:32 of MSR_STAR contain the kernel code segment selector, which is used as the base selector for the CS and SS segment registers when a user space application executes a system call. In the second line of code we fill the MSR_LSTAR register with the address of the entry_SYSCALL_64 symbol, which represents the system call entry. entry_SYSCALL_64 is defined in the arch/x86/entry/entry_64.S assembly file and contains the code that performs the preparations before a system call handler is executed (these are the preparations mentioned above). We will not consider entry_SYSCALL_64 now, but will return to it later in this chapter.
After we have set the entry point for system calls, we need to set the following model specific registers:
- MSR_CSTAR - target rip for the compatibility mode callers;
- MSR_IA32_SYSENTER_CS - target cs for the sysenter instruction;
- MSR_IA32_SYSENTER_ESP - target esp for the sysenter instruction;
- MSR_IA32_SYSENTER_EIP - target eip for the sysenter instruction.
The values of these model specific registers depend on the CONFIG_IA32_EMULATION kernel configuration option. If this option is enabled, legacy 32-bit programs can run under a 64-bit kernel. In that case, we fill these model specific registers with the entry point for system calls in compatibility mode:
wrmsrl(MSR_CSTAR, entry_SYSCALL_compat);
and with the kernel code segment; we put zero into the stack pointer and write the address of the entry_SYSENTER_compat symbol to the instruction pointer:
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
Otherwise, if the CONFIG_IA32_EMULATION kernel configuration option is disabled, we write the ignore_sysret symbol to MSR_CSTAR:
wrmsrl(MSR_CSTAR, ignore_sysret);
The ignore_sysret routine is defined in the arch/x86/entry/entry_64.S assembly file and just returns the -ENOSYS error code:
ENTRY(ignore_sysret)
mov $-ENOSYS, %eax
sysret
END(ignore_sysret)
We also need to fill the MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP and MSR_IA32_SYSENTER_EIP model specific registers, as we did in the previous code when the CONFIG_IA32_EMULATION kernel configuration option was enabled. In this case (when CONFIG_IA32_EMULATION is not set) we fill MSR_IA32_SYSENTER_ESP and MSR_IA32_SYSENTER_EIP with zero and put the invalid segment of the Global Descriptor Table into the MSR_IA32_SYSENTER_CS model specific register:
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
You can read more about the Global Descriptor Table in the second part of the chapter that describes the booting process of the Linux kernel.
At the end of the syscall_init function, we choose which bits of the flags register must be masked by writing the set of flags to the MSR_SYSCALL_MASK model specific register:
wrmsrl(MSR_SYSCALL_MASK,
X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
The processor clears these flags in the flags register every time the syscall instruction is executed. That's all, it is the end of the syscall_init function, and it means that the system call entry is ready to work. Now we can see what happens when a user application executes the syscall instruction.
As I already wrote, before the Linux kernel calls a system call or an interrupt handler, some preparations must be made. The idtentry macro performs the preparations required before an exception handler is executed, the interrupt macro performs the preparations required before an interrupt handler is called, and entry_SYSCALL_64 does the preparations required before a system call handler is executed.
entry_SYSCALL_64 is defined in the arch/x86/entry/entry_64.S assembly file and starts with the following macro:
SWAPGS_UNSAFE_STACK
This macro is defined in the arch/x86/include/asm/irqflags.h header file and expands to the swapgs instruction:
#define SWAPGS_UNSAFE_STACK swapgs
which exchanges the current GS base register value with the value contained in the MSR_KERNEL_GS_BASE model specific register. In other words, GS now refers to the kernel's per-cpu data instead of the user space value. After this we save the old stack pointer in the rsp_scratch per-cpu variable and set the stack pointer to the top of the kernel stack for the current processor:
movq %rsp, PER_CPU_VAR(rsp_scratch)
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
In the next step we push the stack segment and the old stack pointer to the stack:
pushq $__USER_DS
pushq PER_CPU_VAR(rsp_scratch)
After this we enable interrupts, because interrupts are off on entry, and save on the stack the flags, the code segment register, the return address, the general purpose registers (besides bp, bx and r12 to r15) and -ENOSYS for the not-implemented system call:
ENABLE_INTERRUPTS(CLBR_NONE)
pushq %r11
pushq $__USER_CS
pushq %rcx
pushq %rax
pushq %rdi
pushq %rsi
pushq %rdx
pushq %rcx
pushq $-ENOSYS
pushq %r8
pushq %r9
pushq %r10
pushq %r11
sub $(6*8), %rsp
When a system call occurs from the user's application, general purpose registers have the following state:
- rax - contains system call number;
- rcx - contains return address to the user space;
- r11 - contains register flags;
- rdi - contains first argument of a system call handler;
- rsi - contains second argument of a system call handler;
- rdx - contains third argument of a system call handler;
- r10 - contains fourth argument of a system call handler;
- r8 - contains fifth argument of a system call handler;
- r9 - contains sixth argument of a system call handler;
The other general purpose registers (rbp, rbx and r12 to r15) are callee-preserved in the C ABI. So we push the flags register onto the top of the stack, then the user code segment, the return address to user space, the system call number, the first three arguments, the -ENOSYS error code for the not-implemented system call and the other arguments.
In the next step we check the _TIF_WORK_SYSCALL_ENTRY flags in the current thread_info:
testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, SIZEOF_PTREGS)
jnz tracesys
The _TIF_WORK_SYSCALL_ENTRY macro is defined in the arch/x86/include/asm/thread_info.h header file and provides the set of thread information flags that are related to system call tracing:
#define _TIF_WORK_SYSCALL_ENTRY \
(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_EMU | _TIF_SYSCALL_AUDIT | \
_TIF_SECCOMP | _TIF_SINGLESTEP | _TIF_SYSCALL_TRACEPOINT | \
_TIF_NOHZ)
We will not consider debugging/tracing related stuff in this chapter, but will see it in a separate chapter devoted to debugging and tracing techniques in the Linux kernel. After the jump to the tracesys label, the next label is entry_SYSCALL_64_fastpath. In entry_SYSCALL_64_fastpath we check __SYSCALL_MASK, which is defined in the arch/x86/include/asm/unistd.h header file as:
# ifdef CONFIG_X86_X32_ABI
# define __SYSCALL_MASK (~(__X32_SYSCALL_BIT))
# else
# define __SYSCALL_MASK (~0)
# endif
where the __X32_SYSCALL_BIT
is
#define __X32_SYSCALL_BIT 0x40000000
As we can see, __SYSCALL_MASK depends on the CONFIG_X86_X32_ABI kernel configuration option and represents the mask for the x32 ABI (programs that use 32-bit pointers in 64-bit mode) in the 64-bit kernel.
So we check the value of __SYSCALL_MASK: if CONFIG_X86_X32_ABI is disabled, we compare the value of the rax register with the maximum syscall number (__NR_syscall_max); if CONFIG_X86_X32_ABI is enabled, we first apply __SYSCALL_MASK to the eax register, which clears the __X32_SYSCALL_BIT, and then do the same comparison:
#if __SYSCALL_MASK == ~0
cmpq $__NR_syscall_max, %rax
#else
andl $__SYSCALL_MASK, %eax
cmpl $__NR_syscall_max, %eax
#endif
After this we check the result of the comparison with the ja instruction, which jumps if both the CF and ZF flags are zero, that is, if the system call number is above __NR_syscall_max as an unsigned value:
ja 1f
If the system call number is valid, we move the fourth argument from r10 to rcx to comply with the x86_64 C ABI and execute the call instruction with the address of the system call handler:
movq %r10, %rcx
call *sys_call_table(, %rax, 8)
Note that sys_call_table is the array that we saw above in this part. As we already know, the rax general purpose register contains the system call number and each element of sys_call_table is 8 bytes. So we use the *sys_call_table(, %rax, 8) notation to find the correct offset in the sys_call_table array for the given system call handler.
That's all. We did all the required preparations and the system call handler was called for the given system call, for example sys_read, sys_write or another system call handler that is defined with the SYSCALL_DEFINE[N] macro in the Linux kernel code.
After a system call handler finishes its work, we return to arch/x86/entry/entry_64.S, right after the place where we called the system call handler:
call *sys_call_table(, %rax, 8)
The next step after we've returned from a system call handler is to put its return value onto the stack. We know that a system call returns the result to the user program in the general purpose rax register, so we move its value into the RAX slot on the stack after the system call handler has finished its work:
movq %rax, RAX(%rsp)
After this we can see the use of the LOCKDEP_SYS_EXIT macro from arch/x86/include/asm/irqflags.h:
LOCKDEP_SYS_EXIT
The implementation of this macro depends on the CONFIG_DEBUG_LOCK_ALLOC kernel configuration option that allows us to debug locks on exit from a system call. And again, we will not consider it in this chapter, but will return to it in a separate one. At the end of entry_SYSCALL_64 we restore all general purpose registers besides rcx and r11, because the rcx register must contain the return address to the application that made the system call and the r11 register must contain the old flags register. After the general purpose registers are restored, we fill rcx with the return address, r11 with the flags and rsp with the old stack pointer:
RESTORE_C_REGS_EXCEPT_RCX_R11
movq RIP(%rsp), %rcx
movq EFLAGS(%rsp), %r11
movq RSP(%rsp), %rsp
USERGS_SYSRET64
In the end we just use the USERGS_SYSRET64 macro. It expands to the swapgs instruction, which again exchanges the user GS and the kernel GS, followed by the sysretq instruction, which returns from the system call to user space:
#define USERGS_SYSRET64 \
swapgs; \
sysretq;
Now we know what occurs when a user application calls a system call. The full path of this process is as follows:
- User application contains code that fills general purpose registers with the values (system call number and arguments of this system call);
- Processor switches from user mode to kernel mode and starts execution of the system call entry - entry_SYSCALL_64;
- entry_SYSCALL_64 switches to the kernel stack and saves some general purpose registers, the old stack pointer, code segment, flags, etc. on the stack;
- entry_SYSCALL_64 checks the system call number in the rax register, looks up the system call handler in sys_call_table and calls it if the system call number is valid;
- If the system call number is not valid, jump to the exit from the system call;
- After the system call handler finishes its work, restore general purpose registers, old stack, flags and return address and exit from entry_SYSCALL_64 with the sysretq instruction.
That's all.
This is the end of the second part about the system calls concept in the Linux kernel. In the previous part we saw theory about this concept from the user application view. In this part we continued to dive into the stuff which is related to the system call concept and saw what the Linux kernel does when a system call occurs.
If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.
Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.