It all started with a simple question: how can a Linux process determine whether it is the init process of a freshly booted system?
A dozen years ago, the Unix textbook answer to this would have been: well, if its process id (pid) is 1, then it is init by definition.
These days, things are not that simple anymore. Containerization
creates situations where pid is 1, but the process runs, well, in a
container. In Linux, this is realized by using a feature called “pid
namespaces”. The clone(2) syscall can take the flag CLONE_NEWPID
(“since Linux 2.6.24”), which puts the new process into a new pid
namespace. This means that this process will have pid 1 inside the
pid namespace, but outside (i.e. in the parent pid namespace), the
process has a regular pid. Various Linux APIs transparently translate
pids between these namespaces.
The pid namespaces form a hierarchy, and the one at the very top is called the “initial pid namespace”.
You can use the tool unshare(1) to play with pid namespaces:
% unshare --fork --map-root-user --pid bash -c 'echo $$'
1
This is a way to spawn (as a regular user!) a process that has pid 1, at least, that’s what it looks like to the process.
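For comparison, here is a rough C sketch of what that shell command does. The function name child_sees_pid_one is made up for this example, and unlike --map-root-user it does not write a uid map, so the child is not root inside the namespace. It also assumes unprivileged user namespaces are enabled on the system; if they are not, it just reports failure.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the first child forked in a fresh pid namespace saw
 * itself as pid 1, 0 if it saw another pid, and -1 if the namespace
 * could not be created (e.g. user namespaces are disabled). */
int child_sees_pid_one(void)
{
    int st;

    /* Unshare in a helper process so the caller's namespaces stay untouched. */
    pid_t helper = fork();
    if (helper < 0)
        return -1;
    if (helper == 0) {
        /* A new user namespace lets a regular user create a pid namespace. */
        if (unshare(CLONE_NEWUSER | CLONE_NEWPID) != 0)
            _exit(2);
        /* unshare() does not move the caller itself; only its next
         * child becomes pid 1 of the new namespace (hence --fork). */
        pid_t child = fork();
        if (child < 0)
            _exit(2);
        if (child == 0)
            _exit(getpid() == 1 ? 0 : 1);
        if (waitpid(child, &st, 0) < 0 || !WIFEXITED(st))
            _exit(2);
        _exit(WEXITSTATUS(st));
    }
    if (waitpid(helper, &st, 0) < 0 || !WIFEXITED(st))
        return -1;
    switch (WEXITSTATUS(st)) {
    case 0:  return 1;
    case 1:  return 0;
    default: return -1;
    }
}
```

From the helper's point of view (the parent pid namespace), the same child has an ordinary pid, which is what fork() returned there.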
We can try to find some evidence that we’re a freshly booted init, but none of it is really conclusive:
- Our user id is 0, we are root (necessary but not sufficient of course).
- $TERM should be linux; trivial to override.
- $BOOT_IMAGE is set, but this depends on the boot loader.
- System uptime is “low”, but it takes the initrd boot time into account. Our non-root init could be spawned in a container at boot time.
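To make the hints above concrete, this is roughly how one could collect them from C. The struct and function names are invented for this sketch, and what counts as a “low” uptime is left to the caller, since I don't know of a sensible universal threshold.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/sysinfo.h>
#include <unistd.h>

/* Hypothetical container for the (inconclusive) boot hints. */
struct boot_hints {
    int is_root;          /* user id is 0 */
    int term_is_linux;    /* $TERM is exactly "linux" */
    int has_boot_image;   /* $BOOT_IMAGE is set at all */
    long uptime_seconds;  /* from sysinfo(2), -1 if it failed */
};

struct boot_hints collect_boot_hints(void)
{
    struct boot_hints h;
    struct sysinfo si;
    const char *term = getenv("TERM");

    h.is_root = (getuid() == 0);
    h.term_is_linux = (term != NULL && strcmp(term, "linux") == 0);
    h.has_boot_image = (getenv("BOOT_IMAGE") != NULL);
    h.uptime_seconds = (sysinfo(&si) == 0) ? si.uptime : -1;
    return h;
}
```

Every field can be faked or can be legitimately different on a real boot, so this is only ever supporting evidence.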
There are also some indicators that the process runs in a container using one of the popular solutions such as docker or podman:
- The process has a lot of supplementary groups already.
- If we were put inside a cgroup, reading /proc/1/cgroup will indicate it.
- The file /.dockerenv exists.
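A sketch of these container checks in C could look as follows. All function names are made up for illustration, and treating “path is not /” as “we were put inside a cgroup” is my own simplification: with a cgroup namespace the path can read / even inside a container.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Number of supplementary groups of the caller, or -1 on error. */
int supplementary_group_count(void)
{
    return getgroups(0, NULL);
}

/* 1 if /.dockerenv exists, else 0. */
int has_dockerenv(void)
{
    return access("/.dockerenv", F_OK) == 0;
}

/* Takes one line of /proc/PID/cgroup ("ID:controllers:path").
 * Returns 1 if the path is not "/", 0 if it is "/", -1 if malformed. */
int cgroup_line_is_nonroot(const char *line)
{
    const char *p = strchr(line, ':');
    if (p == NULL)
        return -1;
    p = strchr(p + 1, ':');
    if (p == NULL)
        return -1;
    p++;
    size_t len = strcspn(p, "\n");
    return !(len == 1 && p[0] == '/');
}

/* 1 if any line of /proc/1/cgroup names a non-root cgroup,
 * 0 if none does, -1 if the file cannot be opened. */
int pid1_in_nonroot_cgroup(void)
{
    char line[512];
    int result = 0;
    FILE *f = fopen("/proc/1/cgroup", "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL)
        if (cgroup_line_is_nonroot(line) == 1)
            result = 1;
    fclose(f);
    return result;
}
```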
But there are still situations, such as the unshare call above, where none of these things may be true.
Therefore I tried to find the ultimate way to detect whether we are in the initial pid namespace.
I started to research this and quickly found the ioctl(2) NS_GET_PARENT, which seemed to be useful: “Returns a file descriptor that refers to the parent namespace of the namespace referred to by fd.” However, it is useless for this purpose:
EPERM The requested namespace is outside of the caller's
namespace scope. This error can occur if, for example,
the owning user namespace is an ancestor of the caller's
current user namespace. It can also occur on attempts to
obtain the parent of the initial user or PID namespace.
Of course, it makes a lot of sense that we cannot get a handle to the surrounding pid namespace, as this would make the encapsulation provided by namespaces futile. However, coalescing these two error conditions (namespace is outside the caller namespace, and namespace is initial pid namespace) doesn’t make our life easier.
So, we need to bring out the bigger guns. I searched the kernel source for occurrences of init_pid_ns, as this namespace is called in the Linux source code. There are not too many occurrences we can rely on. The taskstats module limits the TASKSTATS_CMD_ATTR_REGISTER_CPUMASK command to the initial pid namespace only, but using it requires speaking the netlink interface, which is terrible. Also, the behavior could change in future versions.
One interesting and viable approach is this limitation of the reboot(2) syscall: only some LINUX_REBOOT_CMD_* commands are allowed to be sent inside a nested pid namespace. Now, we need to find a “harmless” command to call reboot(2) with to test this! (Obviously, only being able to suspend the machine from the initial pid namespace is not a very useful check…) There are two commands that do not do much harm: LINUX_REBOOT_CMD_CAD_{ON,OFF} will toggle the action that Ctrl-Alt-Delete performs. Unfortunately, it is impossible to read the state of this flag, making this test a destructive operation still. (But if you are pid 1, you may want to set it anyway, so you get pid namespace detection for free.)
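A minimal sketch of this probe, with invented names. It uses the glibc reboot() wrapper, which accepts the LINUX_REBOOT_CMD_* values directly. That a refused command shows up as EINVAL is an assumption from my reading of the kernel source, and everything else (for example EPERM when we lack CAP_SYS_BOOT) is treated as “don't know”.

```c
#include <errno.h>
#include <linux/reboot.h>   /* LINUX_REBOOT_CMD_CAD_OFF */
#include <sys/reboot.h>     /* glibc reboot() wrapper */

enum pidns_guess { PIDNS_INITIAL, PIDNS_NESTED, PIDNS_UNKNOWN };

/* Interpret the outcome of reboot(LINUX_REBOOT_CMD_CAD_{ON,OFF}). */
enum pidns_guess classify_cad_result(int ret, int err)
{
    if (ret == 0)
        return PIDNS_INITIAL;   /* the command was accepted */
    if (err == EINVAL)
        return PIDNS_NESTED;    /* assumed: refused in a nested pid namespace */
    return PIDNS_UNKNOWN;       /* e.g. EPERM: not allowed to ask at all */
}

/* Destructive: disables the Ctrl-Alt-Delete reboot as a side effect
 * when it succeeds. Only call this if you want that setting anyway. */
enum pidns_guess probe_pidns_via_reboot(void)
{
    int ret = reboot(LINUX_REBOOT_CMD_CAD_OFF);
    return classify_cad_result(ret, errno);
}
```

Splitting the classification from the actual call keeps the side effect in one place, so the decision logic can be exercised without touching the machine.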
So I kept looking for other ways until I realized there’s a quite natural property to check for, and that is to find out if there are kernel threads in the pid namespace. Kernel threads are spawned by the kernel in the initial pid namespace and help perform certain asynchronous actions the kernel has to do, subject to process scheduling. As far as I know, kernel threads never occur in a nested pid namespace, and at least the parent process of kernel threads, kthreadd, will always exist. Conveniently, it also always has pid 2.
Thus, we just need to figure out if pid 2 is a kernel thread! Note that just checking whether pid 2 exists is cheap, but racy: the container runtime could have spawned another process before we are scheduled to do the check, and that process would then get pid 2.
Luckily, kernel threads have quite a few special properties, which differ in how hard they are to check from a C program:
- /proc/PID/cmdline is empty (not a good indicator, user space processes can clear it too).
- kernel threads have parent pid 0 (requires parsing /proc/PID/stat, which everyone gets wrong the first time, or /proc/PID/status).
- kernel threads have no Vm* data in /proc/PID/status.
- kernel threads have the flag PF_KTHREAD set (requires parsing /proc/PID/stat again).
- kernel threads have an empty symlink for /proc/PID/exe.
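Since parsing /proc/PID/stat is the part that usually goes wrong, here is a sketch of how I would approach it. The function names are invented, and the PF_KTHREAD value is copied from the kernel's include/linux/sched.h because it is not exported to user space, so treat it as an assumption about the running kernel. The trick is to search for the last closing parenthesis, because the command name in the second field may itself contain spaces and parentheses.

```c
#include <stdio.h>
#include <string.h>

#define PF_KTHREAD 0x00200000u  /* copied from the kernel's include/linux/sched.h */

/* Parse the contents of /proc/PID/stat; returns 0 on success, -1 on error. */
int parse_stat(const char *stat, int *ppid, unsigned int *flags)
{
    /* Skip "pid (comm)" by looking for the LAST ')' in the line. */
    const char *p = strrchr(stat, ')');
    if (p == NULL)
        return -1;
    /* What follows: state ppid pgrp session tty_nr tpgid flags ... */
    if (sscanf(p + 1, " %*c %d %*d %*d %*d %*d %u", ppid, flags) != 2)
        return -1;
    return 0;
}

/* 1 if pid is a kernel thread, 0 if it is a user space process,
 * -1 if /proc/PID/stat cannot be read or parsed (e.g. no such pid). */
int pid_is_kthread(int pid)
{
    char path[64], buf[1024];
    int ppid;
    unsigned int flags;

    snprintf(path, sizeof path, "/proc/%d/stat", pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';

    if (parse_stat(buf, &ppid, &flags) != 0)
        return -1;
    return (flags & PF_KTHREAD) != 0;
}
```

With this, pid_is_kthread(2) answers the question from a single file, and the parent pid check (ppid == 0) falls out of the same parse.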
I decided to go with the last one. On Linux, empty symlinks are impossible to create as a user, so we just need to check that and we’re done, right?
On a regular file system, using lstat(2) would have filled st_size with the length of the symlink. But on procfs, lstat is not to be trusted, and even non-empty symlinks have st_size equal to 0.
We thus really need to use the readlink(2) syscall to read the link. After doing this, you will notice that it returns ENOENT… exactly the same as if pid 2 did not exist!
We therefore need another check, to verify that pid 2 does exist. Luckily, an lstat on the /proc/2/exe file is fine here. It must return zero.
Note that you need to do these operations in exactly this order, else you are subject to race conditions again: the only reason this works is that if pid 2 is kthreadd, it will not have terminated before the lstat check (because it cannot terminate).
[Addendum 2023-09-17: vmann points out that this is still racy: a container can spawn a new pid 2 between the lstat and the readlink call. Please use one of the more complicated approaches mentioned above!]
Therefore, readlink(2) failing with ENOENT and lstat(2) succeeding is exactly the combination required to check that pid 2 is kthreadd, which implies there are kernel threads in our pid namespace, which implies that we are in the initial namespace.
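Put together, the check described here can be sketched like this. The function name is mine, and I am assuming the order readlink first, lstat second, as explained above. Keep the addendum in mind: a container can still sneak in a new pid 2 between the two calls, so this is an illustration of the idea, not a race-free test.

```c
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if pid 2 looks like kthreadd, 0 if it does not, and -1 if
 * we cannot tell (any unexpected error, such as a permission problem). */
int pid2_looks_like_kthreadd(void)
{
    char target[1];
    struct stat st;

    /* Step 1: readlink(2) must fail with ENOENT, i.e. there is no
     * executable behind the link. */
    if (readlink("/proc/2/exe", target, sizeof target) >= 0)
        return 0;             /* the link has a target: a user space process */
    if (errno != ENOENT)
        return -1;

    /* Step 2: pid 2 must nevertheless exist, so lstat(2) must succeed. */
    if (lstat("/proc/2/exe", &st) == 0)
        return 1;
    return errno == ENOENT ? 0 : -1;
}
```

The one-byte buffer is enough because we only care whether readlink succeeds, not what the target is.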
Phew, this went deeper than expected.
NP: David Bowie—Lazarus