In a previous post, I discussed how you can determine that you are pid 1, the init process, when the system is booting. Today, we’ll consider the end of the init process: system shutdown.
If you look into a book on Unix system administration, the classic way to manually turn off a Unix system contains a few steps:
- Bring the system to single-user mode (init 1 or shutdown).
- Unmount all filesystems except for /.
- Remount the root file system read-only.
- Run sync.
- Turn off the system.
What step 1 essentially does is kill all processes (except for pid 1) and spawn a new shell. (Unix doesn’t have a concept of “single-user mode” in kernel space.) This is necessary to stop all daemons in an orderly fashion, kill all remaining user processes, and close open files that would stop step 2 from progressing.
Step 3 is necessary to ensure the root file system is in a consistent state. Since we cannot unmount it (we still use it!), remounting it read-only is the best available way to ensure consistency.
Finally, we flush buffers, and then it’s safe to turn off the machine.
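Step 2 is the one I will not come back to below, so here is a minimal sketch of it in C. It assumes Linux with a libc that provides getmntent and umount, reads /proc/self/mounts, and unmounts in reverse table order so that nested mounts usually go first. The names unmount_order and unmount_all_but_root are made up for illustration, and the sketch neither special-cases pseudo file systems nor retries on EBUSY.

```c
#define _GNU_SOURCE
#include <mntent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mount.h>

#define MAX_MOUNTS 256

/* Read a mount table and store copies of all mount points except "/"
   in `out`, last-mounted first.  Returns the number of entries stored.
   The caller frees the strings. */
int unmount_order(FILE *tab, char **out, int max)
{
    struct mntent *m;
    int n = 0;

    while (n < max && (m = getmntent(tab)) != NULL) {
        if (strcmp(m->mnt_dir, "/") == 0)
            continue;                 /* the root is remounted, not unmounted */
        out[n] = strdup(m->mnt_dir);  /* getmntent reuses its buffer */
        if (out[n] == NULL)
            break;
        n++;
    }
    /* Reverse, so that /home/tmp is handled before /home. */
    for (int i = 0, j = n - 1; i < j; i++, j--) {
        char *t = out[i];
        out[i] = out[j];
        out[j] = t;
    }
    return n;
}

/* Try to unmount everything but "/".  Returns the number of failures,
   or -1 if the mount table could not be read. */
int unmount_all_but_root(void)
{
    char *dirs[MAX_MOUNTS];
    FILE *tab = setmntent("/proc/self/mounts", "r");
    int failed = 0;

    if (tab == NULL)
        return -1;
    int n = unmount_order(tab, dirs, MAX_MOUNTS);
    endmntent(tab);

    for (int i = 0; i < n; i++) {
        if (umount(dirs[i]) < 0) {
            perror(dirs[i]);
            failed++;
        }
        free(dirs[i]);
    }
    return failed;
}
```

Splitting the ordering from the actual umount calls is only there so the ordering can be tried on a fake mount table without touching the running system.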
Now, since this is not my first rodeo with writing custom init scripts, I’ve implemented these steps a bunch of times and found out some things which were not obvious.
So let’s see how some of this works in detail.
Killing all processes
This sounds easy, but is tricky to get right. If you are not pid 1,
using kill(-1, SIGTERM) will send SIGTERM to all processes except for
pid 1 and itself (on Linux and the BSDs).
You should then send a SIGCONT to all processes, so
stopped processes will wake up and handle the SIGTERM, too. Then you
usually wait a bit for their graceful shutdown and run kill(-1, SIGKILL)
to kill the rest. Only two processes, you and init, should
remain. The main problem with this is you don’t know when all
processes have exited after the first kill, so there’s necessarily a
delay.
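As a rough sketch, the sequence for a process that is not pid 1 could look like this in C. The function name terminate_all and its target parameter are my own invention for illustration; a real shutdown helper would pass -1, and the parameter only exists so the code can be tried on a single process group.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Ask everything addressed by `target` to exit, wait `grace_s` seconds,
   then kill whatever is left.  With target == -1 this reaches all
   processes except pid 1 and the caller (on Linux and the BSDs). */
int terminate_all(pid_t target, unsigned grace_s)
{
    if (kill(target, SIGTERM) < 0)
        return -1;

    /* Stopped processes only handle SIGTERM once they are continued. */
    kill(target, SIGCONT);

    /* We cannot tell when everybody has exited, so we just wait. */
    sleep(grace_s);

    /* May fail with ESRCH when nobody is left; that is fine. */
    kill(target, SIGKILL);
    return 0;
}
```

The fixed sleep is exactly the weakness described above: the caller is not the parent of these processes, so it has no cheap way to learn that they are gone.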
It is therefore better to let pid 1 do the killing and reaping. Again
we run kill(-1, SIGTERM); kill(-1, SIGCONT), and then do the usual
reaping an init process should do. When waitpid fails with ECHILD,
we know there’s no child left over. Otherwise, after some timeout, you fall
back to sending SIGKILL to the rest, and reap again.
(As a historical aside, Alan Cox pointed out that on Unix V7, wait(2) in pid 1 keeps waiting for itself, since the parent of pid 1 is pid 1 itself. However, all contemporary systems deal with this fine.)
In the real world, you still want a timeout here. A process could
be stuck in state D
and not respond to SIGKILL either. We still
want to power down at some point and not lock up shutdown due to this.
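Here is a minimal sketch of that kill-and-reap loop in C, with a timeout on both phases. It is not the code of any particular init: kill_and_reap, reap_all and the 20 ms polling interval are assumptions for illustration, and a real pid 1 would pass -1 as target and probably fold this into its existing SIGCHLD handling instead of polling.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static double monotonic_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Reap children until waitpid reports ECHILD (returns 0) or until
   `timeout_s` seconds have passed (returns -1). */
static int reap_all(double timeout_s)
{
    const struct timespec nap = { 0, 20 * 1000 * 1000 };  /* 20 ms */
    double deadline = monotonic_now() + timeout_s;

    for (;;) {
        pid_t pid = waitpid(-1, NULL, WNOHANG);
        if (pid > 0)
            continue;                 /* reaped one, look for more */
        if (pid < 0 && errno == ECHILD)
            return 0;                 /* no child left over */
        if (pid < 0 && errno != EINTR)
            return -1;
        if (monotonic_now() >= deadline)
            return -1;                /* somebody is still around */
        nanosleep(&nap, NULL);
    }
}

/* SIGTERM + SIGCONT, reap; on timeout SIGKILL and reap again.
   Returns 0 if everything was reaped, -1 if we gave up, for example
   because a process is stuck in state D. */
int kill_and_reap(pid_t target, double term_timeout_s, double kill_timeout_s)
{
    kill(target, SIGTERM);
    kill(target, SIGCONT);
    if (reap_all(term_timeout_s) == 0)
        return 0;

    kill(target, SIGKILL);
    return reap_all(kill_timeout_s);
}
```

When everybody exits quickly, this returns as soon as the last child is reaped, without sitting out the full grace period. A return value of -1 means shutdown should carry on anyway.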
Remounting the file system read-only
On Linux, you do this by calling mount -o remount,ro /
or the equivalent syscall mount("/", "/", "", MS_REMOUNT | MS_RDONLY, "").
This can fail when the “file system is still in use” with error code
EBUSY.
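Wrapped into a small C function, the syscall and its error handling might look like the sketch below. The name remount_readonly and the messages are hypothetical; the mount point is a parameter so that the function can be exercised on something other than the real root.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Remount `mountpoint` read-only.  Returns 0 on success and the
   negated errno value on failure. */
int remount_readonly(const char *mountpoint)
{
    if (mount(mountpoint, mountpoint, "", MS_REMOUNT | MS_RDONLY, "") == 0)
        return 0;

    int err = errno;
    if (err == EBUSY)
        fprintf(stderr, "%s: still in use, cannot remount read-only\n",
                mountpoint);
    else
        fprintf(stderr, "%s: remount failed: %s\n",
                mountpoint, strerror(err));
    return -err;
}
```

During shutdown this would be called as remount_readonly("/"), and a result of -EBUSY is the case the rest of this post is about.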
I ran into this EBUSY error a few times before, and lately a lot during development, and I finally tried to track down why it happens. Usually, it’s caused by some process that still has a file handle open, but at this point of shutdown, there’s nothing running anymore except for our init itself, so how can that fail?
At first I thought it was just some erratic behavior (race condition etc.), but then I realized I could trigger the error each time I updated init (which happens a lot when you are testing code…). However, when I didn’t update init, everything shut down fine!
Now, I update init in my testing VM like this:
scp leah@10.0.2.2:prj/.../init /bin/init- && mv /bin/init- /bin/init
We can’t overwrite /bin/init
directly, else we get ETXTBSY. So we
do the usual dance of atomically renaming the file into the
destination, similar to how package managers do it.
On an inode level, what does this do? Renaming the new file over /bin/init
decrements the st_nlink field of the old inode, usually to 0, which means the
old file is deleted. However, as the init binary is still running (of
course), the inode is kept alive. We can verify this:
# stat -L /proc/1/exe
File: /proc/1/exe
Size: 193032 Blocks: 384 IO Block: 4096 regular file
Device: 8,1 Inode: 137195 Links: 0
Access: (0755/-rwxr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2024-12-19 17:29:21.289000000 +0000
Modify: 2024-12-19 17:29:21.302000000 +0000
Change: 2024-12-20 15:18:14.480000000 +0000
Birth: 2024-12-19 17:29:21.289000000 +0000
The link count is zero indeed.
But this causes the file system to stay busy,
since it wants to delete the file when it is closed, so it cannot
be remounted read-only while there are open file handles to deleted files!
(Thanks to Simon Richter for explaining this.) This is
also the reason for the occasional shutdown issues I had using
runit: likely the runit binary was updated during the uptime.
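The same check as the stat call above can be done from C, for example if an init wants to warn before it attempts the remount. This is only a sketch: the function names are made up, and reading /proc/1/exe needs sufficient privileges.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Link count of the file `path` resolves to.  stat() follows symlinks,
   like stat -L does.  Returns -1 on error. */
long path_link_count(const char *path)
{
    struct stat st;
    if (stat(path, &st) < 0)
        return -1;
    return (long)st.st_nlink;
}

/* Link count of an already open file.  Returns -1 on error. */
long fd_link_count(int fd)
{
    struct stat st;
    if (fstat(fd, &st) < 0)
        return -1;
    return (long)st.st_nlink;
}

/* 1 if the running init binary has no name left in the file system,
   0 if it still has one, -1 if we could not find out. */
int init_binary_was_replaced(void)
{
    long n = path_link_count("/proc/1/exe");
    if (n < 0) {
        perror("/proc/1/exe");
        return -1;
    }
    return n == 0;
}
```

A file that is open but already unlinked reports a link count of 0 through fd_link_count as well, which is the state the old init binary is in after an update.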
I tried many ways to work around this (old posts may suggest we can
perhaps link /proc/1/exe
back into a file, but this behavior has been
forbidden in Linux since 2011), but ultimately I think this is a
policy problem and not one of pid 1 itself. I therefore suggest a
simple workaround that users of other init systems can use as well:
in the startup scripts, after the root filesystem is mounted writable,
we just make a backup link for the currently booted init:
ln -f /sbin/init /sbin/.init.old
This ensures that even when /sbin/init is overwritten, its link
count doesn’t drop to zero, so remounting read-only is not blocked
and the shutdown stays clean.
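An init that prefers to create the backup link itself instead of relying on the startup scripts could do roughly what ln -f does. The following C sketch assumes that removing the old backup and linking again is acceptable (it is not atomic); keep_backup_link is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Make `dst` a hard link to `src`, replacing an existing `dst`.
   Returns the resulting link count of `src` (at least 2), or -1. */
long keep_backup_link(const char *src, const char *dst)
{
    struct stat st;

    if (unlink(dst) < 0 && errno != ENOENT) {
        perror("unlink");
        return -1;
    }
    if (link(src, dst) < 0) {
        perror("link");
        return -1;
    }
    if (stat(src, &st) < 0) {
        perror("stat");
        return -1;
    }
    return (long)st.st_nlink;
}
```

Called as keep_backup_link("/sbin/init", "/sbin/.init.old") once the root file system is writable, this leaves the booted binary with a second name, so a later rename over /sbin/init only brings its link count down to 1.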
That’s it for now, let’s see what other surprises appear in the future.
NP: Laura Jane Grace—Punk Rock In Basements