Seccomp and Seccomp-BPF are used to limit the system calls available to a Linux process. System call filtering isn’t a sandbox. It provides a clearly defined mechanism for minimizing the exposed kernel surface.
The initial implementation, also known as “mode 1 seccomp” or strict mode, allows only the ‘read’, ‘write’, ‘exit’ and ‘sigreturn’ syscalls, so a process can only read from and write to already opened files and then exit.
Example of how to enable seccomp:
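The listing below is a minimal sketch of strict mode, not a definitive implementation. The function name `run_strict_child` and its return codes are assumptions made for illustration. It forks a child so that the caller stays unrestricted, and it treats a refused `prctl` call as a separate outcome, since the kernel can reject strict mode with EINVAL, for example when the process already runs under a seccomp filter (common inside containers).

```c
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/seccomp.h>

/* Hypothetical helper: runs a child under strict-mode seccomp.
 * Returns 0 if the child entered strict mode, wrote, and exited cleanly,
 * 1 if the kernel refused strict mode, 2 if the write failed,
 * and -1 on fork/wait failure or if the child was killed. */
int run_strict_child(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        static const char msg[] = "running in strict mode\n";

        /* From here on only read, write, exit and sigreturn are allowed. */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(1);

        /* write() to an already opened descriptor is still permitted. */
        if (write(STDOUT_FILENO, msg, sizeof(msg) - 1) < 0)
            syscall(SYS_exit, 2);

        /* The C library's _exit() issues exit_group, which strict mode
         * forbids, so invoke the plain exit syscall directly. */
        syscall(SYS_exit, 0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `run_strict_child()` from `main` and checking the result is enough to try it out. Any other syscall in the child after the `prctl` call, such as `open`, would get it killed with SIGKILL.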
Seccomp is good for absolute restrictions, but a more fine-grained approach is required when locking down more complex applications. To solve this problem, Seccomp - Berkeley Packet Filter (Seccomp-BPF) was introduced.
A Seccomp-BPF program receives the following struct as its input argument. We can filter based on the system call number and on the arguments.
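For reference, the sketch below reproduces the layout of `struct seccomp_data` as declared in `<linux/seccomp.h>` in a comment. It also adds two helpers, `arg_lo_offset` and `arg_hi_offset`, which are hypothetical names introduced here. They reflect the fact that classic BPF loads 32-bit words, so each 64-bit argument has to be inspected as two halves whose offsets depend on byte order.

```c
#include <stddef.h>
#include <linux/seccomp.h>

/* Layout as declared in <linux/seccomp.h>:
 *
 *   struct seccomp_data {
 *       int   nr;                    system call number
 *       __u32 arch;                  AUDIT_ARCH_* value from <linux/audit.h>
 *       __u64 instruction_pointer;   instruction pointer at the time of the call
 *       __u64 args[6];               raw system call arguments
 *   };
 */

/* Byte offset of the low 32 bits of syscall argument n (0..5). */
unsigned int arg_lo_offset(unsigned int n)
{
    unsigned int base =
        (unsigned int)offsetof(struct seccomp_data, args) + 8u * n;
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return base;
#else
    return base + 4u;
#endif
}

/* Byte offset of the high 32 bits of syscall argument n (0..5). */
unsigned int arg_hi_offset(unsigned int n)
{
    unsigned int base =
        (unsigned int)offsetof(struct seccomp_data, args) + 8u * n;
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return base + 4u;
#else
    return base;
#endif
}
```

A filter would pass these offsets to a `BPF_LD | BPF_W | BPF_ABS` load, in the same way it uses `offsetof(struct seccomp_data, nr)` to load the syscall number.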
Seccomp uses Linux Socket Filtering, aka Berkeley Packet Filter (BPF), which defines the following structures:
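The two structures in question are `struct sock_filter` (one BPF instruction) and `struct sock_fprog` (a whole program), both declared in `<linux/filter.h>`. The sketch below shows their fields in a comment and builds the smallest possible program, a single instruction that allows every call. The names `allow_all` and `allow_all_prog` are assumptions used only for this illustration.

```c
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Declarations from <linux/filter.h>:
 *
 *   struct sock_filter {            one BPF instruction
 *       __u16 code;                 opcode
 *       __u8  jt;                   jump offset if the test is true
 *       __u8  jf;                   jump offset if the test is false
 *       __u32 k;                    generic operand
 *   };
 *
 *   struct sock_fprog {             a complete BPF program
 *       unsigned short len;         number of instructions
 *       struct sock_filter *filter; pointer to the instruction array
 *   };
 */

/* A one-instruction program: return SECCOMP_RET_ALLOW for every call. */
static struct sock_filter allow_all[] = {
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Wraps the instruction array in the descriptor the kernel expects. */
struct sock_fprog allow_all_prog(void)
{
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(allow_all) / sizeof(allow_all[0])),
        .filter = allow_all,
    };
    return prog;
}
```

The `BPF_STMT` and `BPF_JUMP` macros from the same header expand to `struct sock_filter` initializers, which is why filters are usually written as arrays of these macros.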
In seccomp, a BPF program operates on this data and returns a value that tells the kernel what to do next: allow the process to perform the call (SECCOMP_RET_ALLOW), kill it (SECCOMP_RET_KILL), or take another action, such as failing the call with an error code (SECCOMP_RET_ERRNO) or sending the process a signal (SECCOMP_RET_TRAP).
As per the seccomp documentation, the best practice is to always check the architecture through seccomp_data->arch before checking system call numbers. Syscall numbers differ between calling conventions, and if the numbers of different conventions overlap, the checks in a filter may be abused.
Example filter to allow the open syscall:
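What follows is a sketch under stated assumptions rather than the original listing. It allows `openat` instead of `open`, because aarch64 has no plain `open` syscall and recent glibc versions implement `open()` through `openat` anyway. It also allows `exit_group` so the process can terminate, and it answers every other call with EPERM instead of killing the process, which makes the effect easy to observe. The names `EXPECTED_ARCH`, `install_filter` and `demo_filter` are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#if defined(__x86_64__)
#define EXPECTED_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define EXPECTED_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* constant for this architecture"
#endif

/* Installs an allowlist filter on the calling thread; 0 on success. */
static int install_filter(void)
{
    struct sock_filter filter[] = {
        /* Load the architecture and kill on a mismatch. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, EXPECTED_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
        /* Load the syscall number and compare it with the allowlist. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat, 2, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 1, 0),
        /* Everything else fails with EPERM. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Required so an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Runs the filter in a forked child so the caller stays unrestricted.
 * Returns 0 if openat succeeded and getpid was denied with EPERM. */
int demo_filter(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (install_filter() != 0)
            _exit(1);
        /* Allowed by the filter. */
        if (syscall(SYS_openat, AT_FDCWD, "/", O_RDONLY | O_DIRECTORY) < 0)
            _exit(2);
        /* Not on the allowlist: expected to fail with EPERM. */
        errno = 0;
        if (syscall(SYS_getpid) != -1 || errno != EPERM)
            _exit(3);
        _exit(0); /* exit_group is on the allowlist */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A stricter variant would replace the `SECCOMP_RET_ERRNO | EPERM` line with `SECCOMP_RET_KILL`, at the cost of giving the process no chance to handle the denial.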
The prctl option PR_SET_NO_NEW_PRIVS prevents the process and its children from gaining more privileges than the parent has. It must be set before the call to prctl that installs the seccomp filter with the PR_SET_SECCOMP option, so that this call succeeds even when not running as root.