
Anatomy of a Ceph meltdown - panic
http://chneukirchen.org/blog/archive/2018/01/anatomy-of-a-ceph-meltdown.html
======
nimbius
Most of this post-mortem has nothing to do with Ceph or Gentoo, and everything
to do with competent system administration and change management. No one
appears to have stopped to consider the ramifications of unannounced upgrades
to the fileserver or OSDs. Once you're frantically rebuilding world, it's
over. The lessons learned completely ignore communications between sysadmins
and management or, god forbid, even customers. "Do everything whenever you
like" is not system administration.

>If we notice something is going wrong with Ceph, we will not hesitate to shut
down the cluster prematurely.

Once bitten, twice shy? How does this solve future problems in the HA
filesystem for your customers?

>We should not update Ceph on all machines at once.

HA/HPC 101. Rolling upgrades are fine: slow and steady, with lots of testing
and a documented rollback procedure if things go screwy. Large enough or
critical systems often have separate dev, test, and prod environments.
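
For Ceph specifically, a rolling upgrade sketch looks roughly like this
(treat it as an outline only; the package steps and systemd unit names vary
by distro and release):

    ceph osd set noout                 # don't rebalance while nodes bounce
    # upgrade and restart the mons first, one host at a time,
    # then each OSD host in turn:
    #   <upgrade the ceph packages on this host>
    systemctl restart ceph-osd.target
    ceph -s                            # wait for HEALTH_OK before the next host
    ceph osd unset noout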

>We will build glibc with debugging symbols.

You will plan your OS upgrades separately from your Ceph upgrades, with
properly communicated actions before, during, and after the events.

>We will track Ceph releases more closely

As opposed to... what? Being a sysadmin, I'm on more than a dozen release
mailing lists. I track known issues and patches for load balancers to make
sure I don't roll out, say, an F5 firmware with a known failure to properly
proxy HTTP/2.

~~~
saganus
To be fair, the author says right at the beginning:

"Please remember that we are all unpaid volunteers who have our own studies
and/or day jobs, and no one has had more experience with Ceph than what you
get from reading the manual."

So while you might have a valid point, it's hard to develop those skills if
you are not a proper sysadmin. If you are just an unpaid volunteer, it means
that most likely the environment you work in can't afford (or won't pay) for
actual sysadmins whose paid job is to know these things.

Maybe the discussion should be, why were there unpaid volunteers doing
sysadmin work instead of an actual sysadmin, but I guess that is a normal
occurrence in universities all over the world?

~~~
sekh60
In defense of Ceph, the documentation is fantastic and Ceph clusters have only
gotten easier to manage with every release. I am a homelabber, not a sysadmin,
and I run a cluster and have only run into one or two issues that the docs or
patch notes didn't cover. Also, the mailing list is incredibly friendly and
helpful. The IRC channel is not as useful in my experience though; like many,
it seems to have a ton of idlers.

One does need to keep up with the release notes, mainly for the past couple of
releases, especially the latest one, Luminous, which marked BlueStore stable
as a more performant alternative to FileStore.

~~~
saganus
I don't even think Ceph is at fault here, since they are not the ones making a
critical piece of infrastructure be the responsibility of unpaid volunteers.

Even if said volunteers were seasoned sysadmins, I don't think they should be
stressing over something so critical while being unpaid. Obviously you don't
always get to be paid in money. Sometimes you feel like doing it for the
greater good, or because you like the institution, or whatever the reason.
However I still think as a university you should not rely on unpaid staff to
handle critical systems because it's not a nice thing to do. I don't think
most people _enjoy_ stressing over a hobby.

------
bane
Ceph meltdowns seem to be the worst. I've seen two other projects struggle
with Ceph and eventually both gave up and moved on to something else. The main
problem seems to be a lack of documented usage guidelines and not much written
on what to do when something blows up... and things seemed to blow up
frequently with both Ceph systems, in many cases without any discernible error
logging.

The after-action reviews on both systems seemed to indicate that they had simply been
provisioned wrong, or had violated some other best practice, but that advice
never seemed to be written down anywhere and the documentation on the Ceph
site was woefully inadequate.

One team simply moved on to using HDFS and rewrote their entire approach to
assume HDFS and it's been pretty solid since. I think the other group moved to
using Gluster.

I really hope it's getting better, because the idea is really great, but I
wouldn't recommend it to anybody right now based on what I've seen.

~~~
api
There are lots of other options too like LizardFS:

[https://lizardfs.com](https://lizardfs.com)

Seems really nice and easy to deploy.

~~~
takeda
That one is derived from MooseFS
([https://moosefs.com/](https://moosefs.com/)).

What turned me away from LizardFS was that when they forked the project, the
very first thing they did was rewrite the code in C++. This makes it much
harder to merge future improvements, and it feels like they made some
decisions for the wrong reasons (the user doesn't care what language it is
written in).

MooseFS is quite good and has fairly good performance, but its weakness is
that the master is a single central point, and the open source version
doesn't have HA.

I know that LizardFS supposedly has HA (it didn't have it at the time I looked
into it), how is that implemented?

~~~
xyproto
MooseFS can have a second master that takes over if the first one should fail,
though.

~~~
takeda
I'm familiar with 1.x, which was very basic; glad they added that. The
previous approach with metadata loggers had the issue that before you could
convert one to a master you had to merge changelogs, which could take a while
when you have many files.

------
jakobdabo
I don't understand why people use Gentoo on production servers. Is it good
practice? I can see so many downsides.

~~~
sgtmas2006
Like what?

~~~
kitotik
It’s almost impossible to achieve any sort of reproducibility when all
packages are built from source.

Portage is a great tool, but I always thought it was better suited to building
packages and operating systems than to being an end-user package management
solution.

~~~
KirinDave
I've learned recently that even with Nix, building from source offers
something of a false hope of reproducibility. Many packages use the fetchgit
function, and fetchgit violates the contract.

~~~
Filligree
fetchgit does no such thing. To use it, you need to specify the git revision
_and_ the hash of the output; it's a constant-output derivation just like
fetchurl.
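
For illustration, a fetchgit call pins both the revision and the expected
output hash (the values below are placeholders, not from a real package):

    fetchgit {
      url = "https://example.org/some-repo.git";
      rev = "0123456789abcdef0123456789abcdef01234567";
      sha256 = "<expected output hash>";
    }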

Reproducibility is still a difficult problem, but that isn't one of the things
that makes it so, and Nix comes far closer than Gentoo.

~~~
KirinDave
[https://twitter.com/shlevy/status/956313877800566784?ref_src...](https://twitter.com/shlevy/status/956313877800566784?ref_src=twcamp%5Ecopy%7Ctwsrc%5Eandroid%7Ctwgr%5Ecopy%7Ctwcon%5E7090%7Ctwterm%5E0)

~~~
mst
I do hope somebody either makes that start warning (and then removes it), or
at least adds a check for it to a linter of some sort.

------
hpcjoe
I hate to be blunt, but the moral of this story is that you should not trade
good operational practice for some personal preferences. Distros generally
don't matter, until they do. Choices made at the distro level, like "hey,
let's build world!", don't mesh well with providing a stable and reliable
service.

You can run any distro you want, safely, with best practices. Which means you
aim to not make upstream changes which have a high probability of breaking
things ... like glibc ... and you focus upon maintaining a stable platform for
your service. You can do that with any distro. Any unix(-like) thing really.
There's no magic, just common sense.

The flip side is that the practices encouraged by various distros range from
glacial change (read: stability) to near-relativistic change (read: here be
dragons). Gentoo encourages the latter, and Debian/CentOS encourage the
former. This does not mean one is better than the other, just that with some
distros you have to pay more attention to maintaining operational discipline.

~~~
sitkack
I would have run Ceph in a VM, with the disks passed through, so I could
downgrade/crossgrade to different Ceph versions while having a rollback plan
that is the identity function.

------
perlgeek
The "restart after upgrade" lesson is one we learned the hard way too, though
typically at the OS level, not the application level.

We had cases where machines with 200+ days of uptime changed from one
operations team to another, and the new team did a reboot -- and found that
some NFS mounts, IP routes, or firewall rules had been added manually at run
time, without persistent configuration (like calling mount directly instead of
adding stuff to /etc/fstab). Of course, those things were lost after a reboot,
and had to be reconstructed somehow.
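
The classic example, assuming an NFS mount (server name and paths are made
up):

    # runtime-only change -- gone after a reboot:
    mount -t nfs filer:/export/home /mnt/home

    # persistent equivalent -- a line in /etc/fstab:
    filer:/export/home  /mnt/home  nfs  defaults  0  0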

And this is in an organization with several professional sysadmins per team,
not "just" a team of volunteers.

As for the choice of OS: it makes sense to use one that people are familiar
with, and that fits their style of operation. I don't want to pass any value
judgment on that topic.

All in all, impressive debugging work!

~~~
linsomniac
I used to be all about how long my uptime was.

Around a decade ago, I started hating big uptime. "That machine has been up
200 days? WAY past time for a reboot!" In my previous job I implemented a
policy of rebooting at least every 6 months, or whenever kernel or glibc/etc
updates were done.

This was very effective at ensuring that changes to systems were _ALWAYS_
reflected in boot scripts, etc... Way better than finding out during an
emergency power outage that all sorts of problems existed on a hundred
servers, with nobody able to remember a change that was made a year or more
ago.

~~~
grandinj
Even that is too long. Once a week at a scheduled time is best. That way if
anything goes wrong, it's relatively easy to find out what changed because
people's memory is still fresh.

------
darksim905
I have a dumb question when it comes to Ceph:

Whatever happened to the good ol' days of best practices? At least with
Microsoft products, say, something like SCCM, you have metrics that can tell
you what works & what doesn't work. Or, a software product on Microsoft's
platform like PDQ Inventory/PDQ Deploy. I know with a decent 4GB memory/60GB
HDD/1Gbps network, I can manage 40-50 desktop PCs fairly easily.

Do people not do baselines anymore? There's so much random documentation for
Ceph. It's poorly documented. There are no logical help switches for it. And
there's no "Oh, you should have x mons for this many OSDs"

It's just poor. Poor all around. :-(

~~~
nickvanw
If you're operating it in an environment where you need it to be up with
multiple 9s or you're losing money, you call Red Hat and pay for a consulting
engagement. As an open source piece of software, I think it's significantly
harder for Ceph to keep up with the level of public documentation that MSFT
can provide for some of their products.

They'll provide RH certified packages that have been tested in-depth on
certain OSes (not Gentoo), provide someone on-site to make sure you're
following their best practices around configuration, upgrade process, etc.

Ceph is a very complex distributed system and it's very difficult (IMO) to
nail down what the best practices are around hardware and resources - it
depends almost entirely on the workload and what you require from it. If you're
operating it solely for sequential RGW workloads, it looks _completely_
different than what you'd want for high performance RBD devices.

They do have some documentation online, to the extent that they can: separate
networks, 5 MON nodes (to avoid one going down and ruining your day), etc. I
just think it's significantly harder given that it's used on such a diverse
set of workloads, which is why they are more than happy to sell consulting for
this.

~~~
jclulow
I was forced by management at a previous gig to use a Red Hat consultant to
set up a Satellite server. In the end, I had to help him set it up after it
became
clear he wasn't able to do it on his own. It was a huge waste of money, and
I'd caution anybody to pay close attention to contract management if you're
going to get them in.

~~~
xorcist
Satellite is a product sold to management, not to people who do not mind the
command line.

That can be reflected a little bit in Red Hat technicians. Do not let that
affect your judgement of their technical and operational skills.

------
trengrj
Obviously, as others have mentioned, there were a lot of mistakes here around
engineering best practices: using a bleeding-edge, source-based distribution
for critical storage infrastructure, allowing drift of package versions, and
not testing releases.

I've done some work in Ceph and in my opinion it is too ambitious: block,
object, and file system storage, a custom raw partition format (BlueStore),
and a somewhat legacy C/C++ codebase together make the system fairly fragile.
I would only recommend it if you "know what you are doing" or are willing to
get some help from Red Hat.

The other comment I'd make is that 12TB unreplicated (assuming they are using
3x replication for Ceph) is actually not a huge amount of data, and in my
opinion a ZFS setup would be cheaper and more stable. It is not too
challenging to do an HA ZFS setup, and ZFS's mirroring, scrubbing, and
checksumming abilities make it very resilient. Copy-on-write and ZFS
snapshots are also great features that can save on disk usage.
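
A rough sketch of how little ceremony that takes (device and dataset names
are made up):

    # mirrored pool across two disks
    zpool create tank mirror /dev/sdb /dev/sdc
    # periodic integrity check of all data against checksums
    zpool scrub tank
    # cheap copy-on-write point-in-time snapshot
    zfs snapshot tank/vms@before-upgrade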

I feel the sweet spot for Ceph is probably 50TB-1PB, where you need both
block and object storage and are unwilling to use cloud solutions. Below
50TB, the overhead and risk of managing Ceph make it less practical than
traditional solutions. From the other direction, CERN tested Ceph up to 30PB,
but they had to make significant code changes and had Ceph committers on
their team. For comparison, Hadoop runs in clusters of up to 600PB (but you
will be accessing the data very differently).

[1]
[https://cds.cern.ch/record/2015206/files/CephScaleTestMarch2...](https://cds.cern.ch/record/2015206/files/CephScaleTestMarch2015.pdf)

------
rconti
I distinctly remember being dialed into a meeting by phone where someone was
explaining the project requirements of what would be our first Ceph
deployment. Over and over it was explained there was simply NO provision for
restarting nodes without the remaining nodes trying to rebuild for the
configured n+x redundancy. I completely disagreed and said it made no sense at
all, until I read up all I could on the matter. No provision for patching.
Period. If a node goes down, the cluster tries to rebuild.

I'm not sure what reality this design comes from, but it's not one I care to
inhabit.

~~~
antongribok
This is absolutely not true.

The normal flags you're looking for are: noout, noscrub, nodeep-scrub. Do your
maintenance. Unset those flags.

The noout flag is what tells Ceph to essentially not shuffle data when OSDs
go down. This is also referenced in the article.
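
Roughly, with the standard ceph CLI:

    ceph osd set noout
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # ... do the maintenance, reboot hosts, etc. ...
    ceph osd unset nodeep-scrub
    ceph osd unset noscrub
    ceph osd unset noout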

~~~
puzzle
Even that might be too rigid. A better way is to have a time-based setting.
For x < N seconds, ignore the missing node. N could be 600 or 900. During that
time, you have degraded performance, but the system is still serving. If you
exceed that time, then perhaps the node is gone for good, so you should
rebuild.

Think of the case where you have more than the three servers described in this
incident. You really don't want to disable cluster-wide automated repairs
because of some other kind of maintenance. Something else WILL happen during
your scheduled work and... ooops.

Why 600 or 900? Unavailability events in larger clusters mostly tend to last
only 10-15 minutes, for routine things such as regular node reboots, per page
2 of
[https://static.googleusercontent.com/media/research.google.c...](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/36737.pdf)

~~~
antongribok
Ceph already has this; the setting is called mon_osd_down_out_interval, and
the default is 300 seconds (you can easily change it).
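
For example, in ceph.conf (or injected at runtime; the exact injectargs form
can vary between releases):

    [mon]
        mon osd down out interval = 600

    # or at runtime:
    # ceph tell mon.* injectargs '--mon_osd_down_out_interval 600'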

The reason for "noout" is you generally want to minimize IO operations while
you're performing your maintenance. You should be closely monitoring your
cluster while performing maintenance, and if something else goes wrong, you
abort and unset noout, wait for it to finish rebuilding, and reassess.

EDIT: This used to be 300 in Jewel, and seems to have changed to 600 in a
later version.

~~~
puzzle
That's cool. Thanks!

I'd still argue that you want to provision N+2 capacity, so you can withstand
a planned event and an unplanned one at the same time, without having to go
through manual tweaking. I'd leave that for truly exceptional cases, such as
when the whole cluster is hit by some nasty bug or has entered a spiral that
calls for drastic measures.

------
etcet
Nice debugging skills, but I can't help thinking that's what you get for
running Gentoo in production. You should be using what everyone else uses
(i.e. Debian or CentOS) so they run into problems before you do.

------
mlosapio
I didn’t read past the first paragraph before I formed the opinion that
running all your VMs on a file system like this is a terrible idea.

If you can’t afford an enterprise SAN (I’m not even talking a NAS, I mean a
real Fibre Channel-based block store) for your virtual environment (and I get
it - many cannot), then just do yourself a favor and run on local disk.

The reduction in moving parts will pay dividends. I promise.

~~~
riffic
Agreed, and this post does a good job describing why:

[https://thornelabs.net/2014/06/14/do-not-use-shared-storage-...](https://thornelabs.net/2014/06/14/do-not-use-shared-storage-for-openstack-instances.html)

------
mbid
According to [https://www.fs.lmu.de/angebot](https://www.fs.lmu.de/angebot),
this server hosts

* a bunch of WordPress pages

* mailing lists

* mail server for council addresses, likely mostly unused

* git server

for a few student councils at LMU. Git is likely used only by the CS council,
maybe maths and physics.

Is it _really_ necessary to have 3 * 12TB for this, and all this complexity?
Fault tolerance? Gentoo? Would a single server running boring software, with
occasional backups, not have been more appropriate for something like this?
To me it looks like a few CS students tasked with managing digital
infrastructure went completely overboard.

------
jskrablin
Ceph is a very finicky beast and I'd advise anyone to stay away unless you're
willing to dedicate significant resources to mastering it. I've seen more
than one Ceph system go belly up during my days at IBM. IBM is throwing
significant resources at it and has people and procedures developed to tackle
the usual and less usual operations... and it still caused headaches.

It will make you hate your job, whether there's some kind of networking
problem (configuration, a bad NIC, or just plain ol' SoftLayer dropping
packets while support claims there are no problems), some kind of HW problem
(one bad OSD node in a semi-failed state will pretty much kill cluster
throughput), or just some bad disks causing it to rebalance in the middle of
high I/O time. It always made me nervous, even with the most basic "let's
swap the failed OSD" operations. The blog post itself is a classic showcase
of putting all your eggs in one basket. They'd be better off with a simpler
setup. Btw, running only one monitor node will make you cry when that single
monitor decides to take a break, which will happen sooner or later.

------
KirinDave
Another story where the root of the problem is code that lacks robust type
checking, allowing for subtle memory corruption.

------
nailer
For anyone else that doesn't know Ceph:
[http://docs.ceph.com/docs/master/](http://docs.ceph.com/docs/master/)

------
amq
What would be a more fool-proof distributed storage? I tried GlusterFS
briefly, but also heard some meltdown stories about it.

It almost seems like, if you need some persistent storage and don't have a
dedicated team, running plain NFS would be the safest thing.

~~~
wmf
That's what I've done. Hardware RAID + XFS + NFS + backups, done. It wasn't
HA, but at least I understood how to get it back up.

~~~
snark42
> That's what I've done. Hardware RAID + XFS + NFS + backups, done.

Even if this were HA, it doesn't scale to HPC-type needs. You end up with a
single-node bottleneck for all the data you need. You also would have some
sort of direct-attached storage all connected to a single node.

Lustre/Gluster/Ceph/GFS and maybe HDFS (ideally with commercial MapR type NFS
access) are probably the only viable options in the HPC use case.

If you have a lot of money to burn, Isilon/NFS is an option as well.

~~~
wmf
Yep. I haven't worked in an HPC/big data environment, but neither have the
people who wrote the original article. They have three OSDs and only two(?)
servers accessing them.

------
korethr
I'm not surprised to see comments here blaming them for using Gentoo in
production, even though I'd argue that wasn't fully the cause of their
outage. Hell, I use Gentoo on my personal boxen because I love its package
management, and even I winced when I read that. "Oh man, I hope they're
keeping on top of their system administration. Damn, nope, got bit by the
libstdc++ change, _and_ the recent profile change. This is gonna be painful."

~~~
hathawsh
I'm also a fan of Gentoo, but realistically, if they were running Debian or
Red Hat or a derivative, they would be able to use the recommended releases
directly from Ceph:

[http://docs.ceph.com/docs/master/install/get-packages/](http://docs.ceph.com/docs/master/install/get-packages/)

My interpretation of the writeup suggests that most of their problems would
have been avoided by running the latest supported release of Ceph, on a
supported distribution.

I ran Gentoo servers myself for a few years, but I had to give it up when I
realized I wasn't getting much benefit for all the extra effort I was putting
in. It was a great way to learn how free software packages interact with each
other, but it became a fairly significant time sink to rebuild the world every
so often.

------
ec109685
Unless they host their file system on public-facing IPs, it would be simpler
to use private IPv4 addresses instead of IPv6. With IPv6, you are running
your code against less hardened paths, and depending on the type of IP
address assigned to your server, behavior can change like this.
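
For Ceph that would just mean binding to private ranges in ceph.conf,
something like (the addresses are made up):

    [global]
        public network  = 10.0.0.0/24
        cluster network = 10.0.1.0/24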

------
gjs278
Everyone blaming Gentoo here is completely wrong. We had a very similar
meltdown where the OSDs were hosted on RHEL 7. They would not start. We had
MDS failures as well on RHEL.

Ceph is just a buggy piece of shit. It configures its options like a Windows
registry. At one point I literally built Ceph on my Gentoo system and ran the
MDS over the VPN. It was the only MDS that wasn't crashing. Nothing about
Gentoo is to blame here; if anything, it's more stable with Ceph.

~~~
CaptSpify
This has been my experience with Ceph as well. Hell, the documented setup
commands for Debian didn't even work out of the box.

I also agree with everyone that Gentoo was probably not a great choice, but
Ceph doesn't even seem production-ready unless you have a dedicated team just
to manage it.

~~~
cullenking
Part-time sysadmin here (founder, so biz duties, dev duties, PM and product
duties, so not a ton of time for sysadmin stuff) and I've had no major
problems with Ceph. I had some performance issues, but figured out how to
tune scrubs appropriately for my workload, and found out the hard way that
prosumer SSDs do not make good journal devices. I run a three-node, 18-OSD
Ceph cluster and it's been no problem. I may eat my words when I do a rolling
upgrade from Firefly to Jewel (multiple versions), but I'm not dreading that
process much.
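
For anyone curious, scrub tuning mostly comes down to a handful of OSD
options in ceph.conf, along these lines (illustrative values, not
recommendations, and option availability varies by release):

    [osd]
        osd max scrubs = 1
        osd scrub begin hour = 1
        osd scrub end hour = 6
        osd scrub sleep = 0.1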

I don't understand why people are complaining about docs here. The docs are
good. Read them, read config options, read until you understand what config
options actually do. Passively watch the mailing list. I think expecting to
run your own petabyte capable storage cluster without a little effort is a bit
misguided...

