leah blogs: How to properly shut down a Linux system

In a previous post, I discussed how you can determine that you are pid 1, the init process, when the system is booting. Today, we’ll consider the end of the init process: system shutdown.

If you look into a book on Unix system administration, the classic way to manually turn off a Unix system contains a few steps:

1. Bring the system to single-user mode (init 1 or shutdown).
2. Unmount all filesystems except for /.
3. Remount the root file system read-only.
4. Run sync.
5. Turn off the system.

What step 1 essentially does is kill all processes (except for pid 1) and spawn a new shell. (Unix doesn’t have a concept of “single-user mode” in kernel space.) This is necessary to stop all daemons in an orderly fashion, kill all remaining user processes, and close open files that would stop step 2 from progressing.

Step 3 is necessary to ensure the root file system is in a consistent state. Since we cannot unmount it (we still use it!), remounting it read-only is the best available way to ensure consistency.

Finally, we flush buffers, and then it’s safe to turn off the machine.

Now, since this is not my first rodeo with writing custom init scripts, I’ve implemented these steps a bunch of times and found out some things which were not obvious.

So let’s see how some of this works in detail.

Killing all processes

This sounds easy, but is tricky to get right. If you are not pid 1, using kill(-1, SIGTERM) will send SIGTERM to all processes except for pid 1 and the caller itself (on Linux and the BSDs). You should then send a SIGCONT to all processes, so stopped processes will wake up and handle the SIGTERM, too. Then you usually wait a bit for their graceful shutdown and run kill(-1, SIGKILL) to kill the rest. Only two processes, you and init, should remain. The main problem with this is that you don’t know when all processes have exited after the first kill, so there’s necessarily a delay.

It is therefore better to let pid 1 do the killing and reaping. Again we run kill(-1, SIGTERM); kill(-1, SIGCONT), and then do the usual reaping an init process should do. When waitpid fails with ECHILD, we know there’s no child left over. Otherwise, after some timeout, we fall back to sending SIGKILL to the rest, and reap again.

(As a historical aside, Alan Cox pointed out that on Unix V7, wait(2) in pid 1 keeps waiting for itself, since the parent of pid 1 is pid 1 itself. However all contemporary systems deal with this fine.)

In the real world, you still want a timeout here. A process could be stuck in state D and not respond to SIGKILL either. We still want to power down at some point and not lock up shutdown due to this.
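Here is a rough sketch of how pid 1 could combine the signalling, the reaping, and both timeouts. The names reap_until and kill_all_and_reap and the 10 ms polling interval are made up for illustration and not taken from any particular init; a real init would probably block on signals instead of polling.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Reap children until none are left or timeout_ms has elapsed.
 * Returns 0 once waitpid() reports ECHILD (nothing left to reap),
 * -1 if the deadline passed while children were still alive. */
int reap_until(long timeout_ms)
{
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (;;) {
        pid_t pid = waitpid(-1, NULL, WNOHANG);
        if (pid > 0)
            continue;                  /* reaped one, look for more */
        if (pid < 0 && errno == ECHILD)
            return 0;                  /* no children left */
        if (pid < 0 && errno != EINTR)
            return -1;                 /* unexpected error */

        clock_gettime(CLOCK_MONOTONIC, &now);
        long elapsed = (now.tv_sec - start.tv_sec) * 1000
                     + (now.tv_nsec - start.tv_nsec) / 1000000;
        if (elapsed >= timeout_ms)
            return -1;

        /* Children exist but none has exited yet: poll again shortly. */
        struct timespec nap = { 0, 10 * 1000 * 1000 };
        nanosleep(&nap, NULL);
    }
}

/* Hypothetical helper, meant to be called by pid 1 only:
 * TERM and CONT everything, reap, then KILL and reap again. */
int kill_all_and_reap(long term_timeout_ms, long kill_timeout_ms)
{
    kill(-1, SIGTERM);
    kill(-1, SIGCONT);    /* wake stopped processes so they see SIGTERM */
    if (reap_until(term_timeout_ms) == 0)
        return 0;

    kill(-1, SIGKILL);
    /* A process stuck in state D may never die, so give up eventually. */
    return reap_until(kill_timeout_ms);
}
```

A return value of -1 from kill_all_and_reap means something survived even SIGKILL; shutdown should log that and carry on rather than wait forever.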

Remounting the file system read-only

On Linux, you do this by calling mount -o remount,ro / or the equivalent syscall mount("/", "/", "", MS_REMOUNT | MS_RDONLY, ""). This can fail when the “file system is still in use” with error code EBUSY.
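As a minimal sketch, the syscall can be wrapped so that EBUSY is reported separately from other failures. The function name remount_ro and the messages are hypothetical; only the mount() call itself is the one described above.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/* Try to remount the file system at target read-only.
 * Returns 0 on success or a negative errno value on failure. */
int remount_ro(const char *target)
{
    if (mount(target, target, "", MS_REMOUNT | MS_RDONLY, "") == 0)
        return 0;

    int err = errno;
    if (err == EBUSY)
        fprintf(stderr, "%s: still in use, cannot remount read-only\n", target);
    else
        fprintf(stderr, "%s: remount failed: %s\n", target, strerror(err));
    return -err;
}
```

An init would call remount_ro("/") after the other file systems have been unmounted, and could still run sync() if the result is -EBUSY.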

I ran into this EBUSY error a few times before, and lately a lot during development, and I finally tried to track down why it happens. Usually, it’s caused by some process that still has a file handle open, but at this point of shutdown, there’s nothing running anymore except for our init itself, so how can that fail?

At first I thought it was just some erratic behavior (race condition etc.), but then I realized I could trigger the error each time I updated init (which happens a lot when you are testing code…). However, when I didn’t update init, everything shut down fine!

Now, I update init in my testing VM like this:

scp leah@10.0.2.2:prj/.../init /bin/init- && mv /bin/init- /bin/init

We can’t overwrite /bin/init directly, else we get ETXTBSY. So we do the usual dance of atomically renaming the file into the destination, similar to how package managers do it.

On an inode level, what does this do? Renaming the new file over /bin/init decrements the st_nlink field of the old inode, usually to 0, which means the old file is deleted. However, as the init binary is still running (of course), the inode is kept alive. We can verify this:

# stat -L /proc/1/exe
  File: /proc/1/exe
  Size: 193032    	Blocks: 384        IO Block: 4096   regular file
Device: 8,1	Inode: 137195      Links: 0
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2024-12-19 17:29:21.289000000 +0000
Modify: 2024-12-19 17:29:21.302000000 +0000
Change: 2024-12-20 15:18:14.480000000 +0000
 Birth: 2024-12-19 17:29:21.289000000 +0000

The link count is zero indeed.
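The same effect can be reproduced without touching init at all. This sketch holds a file open, renames another file over it, and then asks fstat for the link count of the inode that is still open; the function name nlink_after_replace is invented for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open path, rename replacement over it, and return the link count of
 * the inode that is still held open. Returns -1 on any error. */
long nlink_after_replace(const char *path, const char *replacement)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    long n = -1;
    /* After the rename, path names the new inode; fd still refers to the old one. */
    if (rename(replacement, path) == 0 && fstat(fd, &st) == 0)
        n = (long)st.st_nlink;

    close(fd);    /* only now can the old inode really be freed */
    return n;
}
```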

But this causes the file system to stay busy, since it wants to delete the file when it is closed, so it cannot be remounted read-only while there are open file handles to deleted files! (Thanks to Simon Richter for explaining this.) This is also the reason for the occasional shutdown issues I had using runit: most likely the runit binary was updated while the system was up.
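If you want init to at least say why the remount is going to fail, it can do the same check as the stat -L call above. This is only a sketch; exe_is_deleted and own_exe_is_deleted are hypothetical names, and it assumes /proc is still mounted at that point of shutdown.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the executable of pid has been unlinked (link count 0),
 * 0 if it still has a name, -1 if /proc/<pid>/exe cannot be examined. */
int exe_is_deleted(pid_t pid)
{
    char path[64];
    struct stat st;

    snprintf(path, sizeof path, "/proc/%ld/exe", (long)pid);
    if (stat(path, &st) < 0)    /* stat() follows the link, like stat -L */
        return -1;
    return st.st_nlink == 0;
}

/* Convenience wrapper for a process checking its own binary. */
int own_exe_is_deleted(void)
{
    return exe_is_deleted(getpid());
}
```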

I tried many ways to work around this (old posts may suggest we can perhaps link /proc/1/exe back into a file, but this behavior has been forbidden in Linux since 2011), but ultimately I think this is a policy problem and not one of pid 1 itself. I therefore suggest a simple workaround that users of other init systems can use as well: in the startup scripts, after the root filesystem is mounted writable, we just make a backup link for the currently booted init:

ln -f /sbin/init /sbin/.init.old

This ensures that even when /sbin/init is overwritten, the link count of the running binary doesn’t drop to zero, so remounting read-only is not blocked and the system can shut down cleanly.
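For an init that prefers to do this itself instead of in a startup script, the ln -f call corresponds roughly to an unlink followed by a link. The helper names keep_backup_link and link_count are assumptions for this sketch, and it has the same limitation as ln: both paths must be on the same file system.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Rough equivalent of "ln -f path backup": remove a stale backup,
 * then give the inode behind path a second name. Returns 0 or -1. */
int keep_backup_link(const char *path, const char *backup)
{
    if (unlink(backup) < 0 && errno != ENOENT) {
        perror(backup);
        return -1;
    }
    if (link(path, backup) < 0) {
        perror(backup);
        return -1;
    }
    return 0;
}

/* Link count of the inode that path currently names, or -1 on error. */
long link_count(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 ? (long)st.st_nlink : -1;
}
```

After keep_backup_link("/sbin/init", "/sbin/.init.old"), replacing /sbin/init leaves the old inode with one link instead of zero. The next boot drops that stale backup and links the new binary.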

That’s it for now, let’s see what other surprises appear in the future.

NP: Laura Jane Grace—Punk Rock In Basements

20dec2024 · How to properly shut down a Linux system
