leah blogs: January 2022

It all started with a simple question: how can a Linux process determine whether it is the init process of a freshly booted system?

A dozen years ago, the Unix textbook answer to this would have been: well, if its process id (pid) is 1, then it is init by definition.

⁂

These days, things are not that simple anymore. Containerization creates situations where pid is 1, but the process runs, well, in a container. In Linux, this is realized by using a feature called “pid namespaces”. The clone(2) syscall can take the flag CLONE_NEWPID (“since Linux 2.6.24”), which puts the new process into a new pid namespace. This means that this process will have pid 1 inside the pid namespace, but outside (i.e. in the parent pid namespace), the process has a regular pid. Various Linux API transparently translate pids between these namespaces.

The pid namespaces form a hierarchy, and the one at the very top is called “initial pid namespace”.

You can use the tool unshare(1) to play with pid namespaces:

% unshare --fork --map-root-user --pid bash -c 'echo $$' 
1

This is a way to spawn (as a regular user!) a process that has pid 1, at least, that’s what it looks like to the process.

We can try to find some evidence that we’re a freshly booted init, but none of it is really conclusive:

Our user id is 0, we are root (necessary but not sufficient of course).
$TERM should be linux; trivial to override.
$BOOT_IMAGE is set, but this depends on the boot loader.
System uptime is “low”, but it takes the initrd boot time into account. Our non-root init could be spawned in a container at boot time.

There are also some indicators the process runs in a container using one of the popular solutions such as docker or podman:

The process has a lot of supplementary groups already.
If we were put inside a cgroup, reading /proc/1/cgroup will indicate it.
The file /.dockerenv exists.

But there are still situations, such as the unshare call above, where all of these things may not be true.

Therefore I tried to find the ultimate way to detect whether we are in the initial pid namespace.

I started to research this and quickly found the ioctl(2) NS_GET_PARENT which seemed to be useful: “Returns a file descriptor that refers to the parent namespace of the namespace referred to by fd.” However, it is useless for this purpose:

EPERM  The requested namespace is outside of the caller's
       namespace scope.  This error can occur if, for example,
       the owning user namespace is an ancestor of the caller's
       current user namespace.  It can also occur on attempts to
       obtain the parent of the initial user or PID namespace.

Of course, it makes a lot of sense that we cannot get a handle to the surrounding pid namespace, as this would make the encapsulation provided by namespaces futile. However, coalescing these two error conditions (namespace is outside the caller namespace, and namespace is initial pid namespace) doesn’t make our life easier.

So, we need to bring out bigger guns in. I searched the kernel source for occurrences of init_pid_ns, as this namespace is called in the Linux source code. There are not too many occurrences we can rely on. The taskstats module limits the TASKSTATS_CMD_ATTR_REGISTER_CPUMASK command to the initial pid namespace only, but to use this requires speaking the netlink interface, which is terrible. Also, the behavior could change in future versions.

One interesting, and viable approach, is this limitation of the reboot(2) syscall: only some LINUX_REBOOT_CMD_* commands are allowed to be sent inside a nested pid namespace. Now, we need to find a “harmless” command to call reboot(2) with to test this! (Obviously, only being able to suspend the machine from the initial pid namespace is not a very useful check…) There are two commands that do not do much harm: LINUX_REBOOT_CMD_CAD_{ON,OFF} will toggle the action that Ctrl-Alt-Delete performs. Unfortunately, it is impossible to read the state of this flag, making this test a destructive operation still. (But if you are pid 1, you may want to set it anyway, so you get pid namespace detection for free.)

⁂

So I kept looking for other ways until I realized there’s a quite natural property to check for, and that is to find out if there are kernel threads in the pid namespace. Kernel threads are spawned by the kernel in the initial pid namespace and help perform certain asynchronous actions the kernel has to do, subject to process scheduling. As far as I know, kernel threads never occur in a nested pid namespace, and at least the parent process of kernel threads, kthreadd, will always exist. Conveniently, it also always has pid 2.

Thus, we just need to figure out if pid 2 is a kernel thread! Note that just checking whether pid 2 exists is cheap, but racy: the container runtime could have spawned another process before we are scheduled to do the check, and this process will as well get pid 2 then.

Luckily, kernel threads have quite a few special properties, that are of different difficulty to check from a C program:

/proc/PID/cmdline is empty (not a good indicator, user space processes can clear it too).
kernel threads have parent pid 0 (requires parsing /proc/PID/stat, which everyone gets wrong the first time, or /proc/PID/status).
kernel threads have no Vm* data in /proc/PID/status.
kernel threads have the flag PF_KTHREAD set (requires parsing /proc/PID/stat again).
kernel threads have an empty symlink for /proc/PID/exe.

I decided to go with the latter. On Linux, empty symlinks are impossible to create as a user, so we just need to check that and we’re done, right?

On a regular file system, using lstat(2) would have filled st_size with the length of the symlink. But on a procfs, lstat is not to be trusted, and even non-empty symlinks have st_size equal to 0. We thus really need to use the readlink(2) syscall to read the link. After doing this, you will notice that it returns ENOENT… exactly the same as if pid 2 did not exist!

We therefore need another check, to verify that pid 2 does exist. Luckily, here a lstat on /proc/2/exe file is fine. It must return zero.

Note that you need to do these operations in exactly this order, else you are subject to race conditions again: the only reason this works is that if pid 2 is kthreadd, it will not have terminated before the lstat check (because it cannot terminate).

[Addendum 2023-09-17: vmann points out that this is still racy: a container can spawn a new pid 2 between the lstat and the readlink call. Please use one of the more complicated approaches mentioned above!]

Therefore, readlink(2) failing with ENOENT and lstat(2) succeeding is exactly the combination required to check pid 2 is kthreadd, which implies there are kernel threads in our pid namespace, which implies that we are in the initial namespace.

Phew, this went deeper than expected.

NP: David Bowie—Lazarus

leah blogs: January 2022

08jan2022 · How to check you're in the initial pid namespace?