leah blogs: Anatomy of a Ceph meltdown

Last week, the server farm of our LMU Student Council suffered a major outage lasting almost five days. As part of the administrator team there, I’d like to publish this post-mortem to share our experiences and the lessons we learned, so that situations like this can be avoided in the future.

First and foremost, a downtime spanning multiple days is completely unacceptable for a central service like this (and I really wish there had been a way to fix this quicker), but the nature of the issue made it really hard to find another solution or workaround. In theory it would have been possible to set up an emergency system restored from backups, but this would have blocked hardware that we need to ensure regular operation later. Also, setting up things from scratch is likely to introduce new issues, and our resources were tied up in the recovery. (Please remember that we are all unpaid volunteers who have our own studies and/or day jobs, and no one had more experience with Ceph than what you get from reading the manual.)

A quick word on our setup: We have three file servers with 12TB storage each, each providing three Ceph OSDs, a monitor, and an MDS (to provide CephFS to a shell server and the office machines). Connected to these are two virtualization hosts that run 24 virtual machines in total in QEMU/KVM. The file servers and virtualization hosts run Gentoo; most VMs run Debian, a few run Windows. The setup is very redundant: Ceph guarantees that any one file server can drop out without problems, and if one virtualization host goes down, we can start all machines on the other host (even if main memory gets a bit tight then).

Unfortunately, Ceph itself is a single point of failure: when Ceph goes down, no virtual machine works.

What follows is a log of the events:

2018-01-15: At night, trying to debug an issue related to CephFS, an administrator had to restart an MDS, which failed. Then they tried to restart an OSD, which failed too. This caused the Ceph cluster to start rebalancing. I was not involved yet; as far as I know no further action was taken.

2018-01-16: Trying to restart the OSD again, we noticed that ceph-osd crashed immediately. It turned out that all three systems had been updated a few times without restarting the OSD. No OSD could start anymore. We kept the last two OSD running (this turned out to be a mistake). The file servers, running Gentoo, also had a profile update done by another administrator. We came to the conclusion that we needed to rebuild world to get into a consistent state.

Ceph and glibc were built without debugging symbols, so all information we had came from the ceph-osd output of backtrace(3), which pointed to the functions parse_network and find_ip_in_subnet_list. These functions are run very early by Ceph during configuration file parsing. I looked into the code, and it was quite simple, and only used std::string and std::list, two interfaces that changed in the recent libstdc++ ABI change.

My working theory was now that the libstdc++ ABI change between GCC 4.2 and GCC 6.2 had triggered the bug.

After emerge world, which took several hours, all software was built on the new libstdc++ ABI.

ceph-osd still crashed.

Another theory was that tcmalloc was at fault, but a Ceph without tcmalloc failed as well.

We decided to build a debugging version of Ceph to inspect the issue more deeply. Compiling Ceph on Gentoo failed twice: (1) Building Ceph failed because Ceph tried to run git, which triggered a sandbox exception since we have a /.git directory in the root folder. This could be worked around by setting GIT_CEILING_DIRECTORIES. (2) Building Ceph with debugging symbols took more than 32 GB of disk space, so we first had to free up space for that.

2018-01-17: Debugging of Ceph intensified. It turned out the call to parse_network triggered data corruption in a std::list<std::string>, which caused the destructor of this data structure to segfault. Tracking down the exact place where this corruption happened turned out to be hard: gdb can print STL data structures, but to create watchpoints on certain addresses you need to reverse-engineer the actual memory layouts. (For a short time, we assumed the switch to short string optimization was at fault, but spelling out the IPv6 address didn’t help.) Finally I managed to set a watchpoint, and it turned out inet_pton(3) triggered an overflow, which corrupted the next variable on the stack, the list mentioned above. Googling some more turned up Ceph Bug #19371, which tells us that Ceph tried to parse an IPv6 address into a struct sockaddr, which only has space for an IPv4 address! This explained the data corruption. A fix was published in Ceph 10.2.8. We still ran Ceph 10.2.3, the version marked stable in Gentoo. (Up to this point, we had thought the quite old version of Ceph was not at fault, since it had run well before!)
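
To make the size mismatch concrete, here is a small Python sketch. It is not Ceph's code; it only uses Python's wrapper around inet_pton to show how many bytes the function produces per address family. The example addresses are documentation placeholders, not our real ones.

```python
import socket

# inet_pton converts a textual address into its packed binary form.
ipv4 = socket.inet_pton(socket.AF_INET, "192.0.2.1")
ipv6 = socket.inet_pton(socket.AF_INET6, "2001:db8::1")

print("IPv4 packed length:", len(ipv4))
print("IPv6 packed length:", len(ipv6))

# In C, inet_pton writes this many bytes to the destination pointer without
# knowing how large the destination is. A destination sized for an IPv4
# address is therefore overrun by this many bytes when AF_INET6 is used:
overrun = len(ipv6) - len(ipv4)
print("Bytes written past an IPv4-sized destination:", overrun)
```

In Python the result is a fresh bytes object, so nothing gets overwritten; in C the extra bytes land in whatever happens to follow the destination on the stack.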

We decided to update to Ceph 10.2.10.

The OSD still crashed, but for different reasons: first, the Gentoo init.d scripts were broken; second, Ceph now expects to run as user ceph (it ran as root before). We started ceph-osd as root again.

The OSDs started fine, so all OSD were restarted now. The MDS reported degradation and the storage itself was degraded a lot (this means the redundancy requirement was not met) and unbalanced.

Ceph started recovery, but for as yet unknown reasons the OSDs started to crash often and to consume vast amounts of RAM (3-5x as much as usual). This first drove the system into swapping, and then the cluster started to disconnect the OSDs because they were too slow to respond, which slowed down recovery even further.

We assume this is Ceph Bug #21761.

We reduced osd_map_cache trying to lower RAM usage, but we are not sure this had any effect.

We started adding more swap, this time on the SSDs that are normally meant to serve as Ceph cache. This made the situation a bit better: the OSDs started to crash later and were more responsive.

2018-01-18: Ceph recovery was still slow, so we looked for more information. The MDS was still degraded, and we did not know how to fix this. Reading the mailing list, we learned to set noout (we knew that one) and nodown to forcibly keep OSDs from dropping out of the cluster. We also learned to set noup to let the OSDs deal with their backlog, since the osdmap epochs were seriously out of sync (by up to 10000). After setting noup and letting the OSDs churn (this took several hours at high CPU load), the MDS was not degraded anymore! The system continued to balance and started backfilling.
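
For reference, these flags are toggled with `ceph osd set <flag>` and `ceph osd unset <flag>`. The following Python sketch only builds and prints the command lines; the helper name `osd_flag_commands` is made up for illustration, and whether and when to clear each flag again depends on the state of the cluster.

```python
def osd_flag_commands(flags, action="set"):
    """Return one argv list per flag for `ceph osd set|unset <flag>`.

    Hypothetical helper: it only builds the commands and runs nothing.
    """
    if action not in ("set", "unset"):
        raise ValueError("action must be 'set' or 'unset'")
    return [["ceph", "osd", action, flag] for flag in flags]


# Flags mentioned in the log entry above.
recovery_flags = ["noout", "nodown", "noup"]

for argv in osd_flag_commands(recovery_flags, "set"):
    print(" ".join(argv))

# Once the cluster has caught up, the same flags are cleared with "unset".
for argv in osd_flag_commands(recovery_flags, "unset"):
    print(" ".join(argv))
```

Keeping the commands as argv lists makes it easy to hand them to `subprocess.run` later, once you are sure you want to execute them.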

At some point we took individual OSDs (backed by XFS) down to chown their storage to ceph:ceph, which took several hours each.

OSD RAM usage normalized.

2018-01-19: Backfilling progressed slowly, so we increased osd_max_backfills and osd_recovery_threads. We set noscrub and nodeep-scrub to reduce non-recovery I/O. At some point later at night, the system went from HEALTH_ERR to HEALTH_WARN again!
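
Runtime options like these can be changed on running OSDs with `ceph tell osd.* injectargs`. The sketch below only builds such a command line in Python. The helper name `injectargs_command` is hypothetical, and the value 4 is a placeholder, not the value we actually used. Note that the backfill option is spelled osd_max_backfills in Ceph's configuration reference.

```python
def injectargs_command(options, target="osd.*"):
    """Build the argv for `ceph tell <target> injectargs '<args>'`.

    Hypothetical helper: option names are given in config-file spelling
    (with underscores) and converted to command-line spelling (dashes).
    """
    args = " ".join(
        "--{} {}".format(name.replace("_", "-"), value)
        for name, value in options.items()
    )
    return ["ceph", "tell", target, "injectargs", args]


# Placeholder value for illustration only.
argv = injectargs_command({"osd_max_backfills": 4})
print(argv)
```

Settings injected this way apply to the running daemons only; to survive a restart they also have to go into ceph.conf.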

2018-01-20: All OSDs went back to active+clean. Two things were stopping us from reaching HEALTH_OK: we needed to set require_jewel_osds and sortbitwise. Setting both was unproblematic and worked fine.

We started to bring up the first virtual machines again. This caused some minor fallout:

The LDAP server started fine, but did not bring up its IPv6 route (a Debian issue we hit before), so the mail server could not identify accounts. This was fixed quickly.
The mailing list server received a few mails to bigger mailing lists, and started to send them out all at once, which caused us to exceed quota at our upstream SMTP server (and the quota was too low, as it turned out later). This meant we had a backlog of over 5000 messages for several hours.

At the end of the day, all systems were operational again.

There is no evidence that data was lost during the downtime. It is possible that inbound mail was bounced at the gateway, and thus not delivered, but in this case the sender was notified of this fact. All other mail that was sent inbound was delivered when the mail server came back up.

Lessons learned:

If we notice something is going wrong with Ceph, we will not hesitate to shut down the cluster prematurely. It’s better to have 30 min downtime once, than a mess of this scale.
We should not update Ceph on all machines at once. After updating Ceph (or other critical parts of the system), we will check all services restart fine.
We will build glibc with debugging symbols. (I think this would have pointed me to inet_pton quicker and saved a few hours of debugging.)
We will track Ceph releases more closely, and generally trust upstream releases (I don’t know why Gentoo does not stabilize newer releases of Ceph, they fix significant bugs). (At some point I had proposed to run the OSD in a Debian chroot, but stretch contains Ceph 10.2.5 which was affected by the same bugs.)
We need to find a solution for the Debian IPv6 issue, which has bitten us a bit too often.
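
One possible way to catch daemons that were updated on disk but never restarted (the situation we found on 2018-01-16) is to look for mappings marked "(deleted)" in /proc/<pid>/maps. This is a Linux-specific sketch and not something we had in place; the function names are made up for illustration.

```python
def stale_mappings(maps_text):
    """Return paths of file-backed mappings that were deleted on disk.

    `maps_text` is the content of /proc/<pid>/maps. Linux appends
    " (deleted)" to the path when the mapped file has been removed or
    replaced, which is what happens when a package update swaps a library.
    """
    marker = " (deleted)"
    stale = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (optional).
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping without a path
        path = fields[5]
        if path.startswith("/") and path.endswith(marker):
            stale.add(path[: -len(marker)])
    return sorted(stale)


def stale_mappings_for_pid(pid):
    """Read /proc/<pid>/maps and report stale mappings (needs permission)."""
    with open("/proc/{}/maps".format(pid)) as f:
        return stale_mappings(f.read())
```

If this returns a non-empty list for a ceph-osd process, the daemon is still running code that no longer exists on disk, and a restart should be tested on one machine before the others are touched.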

NP: Light Bearer—Aggressor & Usurper

22jan2018 · Anatomy of a Ceph meltdown
