>If we notice something is going wrong with Ceph, we will not hesitate to shut down the cluster prematurely.
Once bitten, twice shy? How does this solve future problems in the HA filesystem for your customer?
>We should not update Ceph on all machines at once.
HA/HPC 101: rolling upgrades are fine; slow and steady, with lots of testing and a documented rollback procedure if things go screwy. Large enough/critical systems often have dev, test, and prod environments.
>We will build glibc with debugging symbols.
you will plan your OS upgrades separate from your Ceph upgrades and with properly communicated actions before, during, and after the events.
>We will track Ceph releases more closely
As opposed to... what? Being a sysadmin, I'm on more than a dozen release mailing lists. I track load balancer known issues and patches to make sure I don't roll out, say, an F5 firmware with a known failure to properly proxy HTTP/2.
"Please remember that we are all unpaid volunteers who have our own studies and/or day jobs, and no one has had more experience with Ceph than what you get from reading the manual."
So while you might have a valid point, it's hard to develop those skills if you are not a proper sysadmin. If you are just an unpaid volunteer, it means that most likely the environment you work in can't afford (or won't pay) for actual sysadmins whose paid job is to know these things.
Maybe the discussion should be, why were there unpaid volunteers doing sysadmin work instead of an actual sysadmin, but I guess that is a normal occurrence in universities all over the world?
> First and foremost, having a multiple-day spanning downtime is completely unacceptable for a central service like this [...]
Which goes to your last point a bit. It's not a normal occurrence at all universities, but at some. If it was me, I'd turn this thing off repeatedly until the people who used it realized its value and tried to dedicate a bit of manpower to it.
One does need to keep up with the release notes, mainly for the past couple of releases, especially the latest release, Luminous, which marked BlueStore stable as a more performant alternative to FileStore.
Even if said volunteers were seasoned sysadmins, I don't think they should be stressing over something so critical while being unpaid. Obviously you don't always get to be paid in money. Sometimes you feel like doing it for the greater good, or because you like the institution, or whatever the reason. However, I still think that as a university you should not rely on unpaid staff to handle critical systems, because it's not a nice thing to do. I don't think most people enjoy stressing over a hobby.
Some people think because their app is stateless, rollback is trivial. But virtually every single piece of technology that has an indirect effect on your app does have state, and each one moves your app away from a known good state. Everything from firewall rules and network routes, to switch firmware, to DNS, to BIOS updates, to LDAP directory changes, compute and storage configuration changes, software updates, etc all affect whether you can roll back successfully.
Do you know when your cloud provider updates its software, and how it will affect your app? If they change something and don't notify you, and it causes a problem, at what point will you notice when their latest changes were? Can they even roll back their change, and how long will that take?
This was an example of one tier or service failing. But it could have easily been a firewall change, a DNS change, a time sync issue, etc. If you want to roll back to the last time your app worked, all of those need a rollback method, too.
Yep, while we always had a plan in release/upgrades for a fallback, everyone knew it would never happen.
We had enough testing in qa/preprod that no matter what happened, a fallback would cost much more time and money than any methods (manual or otherwise) to rectify the problem.
The after actions on both systems seemed to indicate that they had simply been provisioned wrong, or had violated some other best practice, but that advice never seemed to be written down anywhere and the documentation on the Ceph site was woefully inadequate.
One team simply moved on to using HDFS and rewrote their entire approach to assume HDFS and it's been pretty solid since. I think the other group moved to using Gluster.
I really hope it's getting better, because the idea is really great, but I wouldn't recommend it to anybody right now based on what I've seen.
Seems really nice and easy to deploy.
What turned me away from LizardFS was that when they forked the project, the very first thing they did was rewrite the code in C++. This makes it much harder to merge future improvements, and it feels like they made some decisions for the wrong reasons (the user doesn't care what language it is written in).
MooseFS is quite good and has fairly good performance, but its weakness is that its central point is its master, and the open source version doesn't have HA.
I know that LizardFS supposedly has HA (it didn't have it at the time I looked into it), how is that implemented?
I did not evaluate LizardFS, but back then it was freshly forked so it was pretty much the same thing.
In practice, many people don't understand that Gentoo is a meta-distribution and try to run it like a simple distribution, leading to arbitrary ABI breakage and chaos. If you want to run Gentoo on servers, you need to test all upgrades on build machines before rolling them out. You need to publish your own binary packages. You need to do a lot of the testing that Debian, Ubuntu, Red Hat, etc. would be doing for you.
I have some friends who set up their own hosting business based on Gentoo and they seem to be doing it right. Their servers are very stable. Testing updates is an ongoing expense, but I think it has worked out well for them.
> Compared with Debian or Red Hat, Gentoo is really a meta-distribution. It's a nice basis for creating your own distribution.
I do not want that on my production server. We're in the $SOME_BACKEND_SERVICE business, not in the distro creating business.
> In theory, Gentoo lets you mix the latest version of some packages with well-tested, old versions of other packages and generate a release perfected for your organization.
I want a hyper-stable base that is tested by the masses and, when possible, has some big org behind it that supports it (and is thus very conservative when changing stuff). On top of that base I might roll my own packages or include a fringe package repository. Never do I want the base to be different for everyone.
> If the Gentoo release of some package, Ceph for example, isn't as up to date as you want, you can easily update it yourself and publish a new release to your servers.
This is not too hard for any distro I've used.
> In practice, many people don't understand that Gentoo is a meta-distribution and try to run it like a simple distribution, leading to arbitrary ABI breakage and chaos. If you want to run Gentoo on servers, you need to test all upgrades on build machines before rolling them out. You need to publish your own binary packages. You need to do a lot of the testing that Debian, Ubuntu, Red Hat, etc. would be doing for you.
So may we conclude, as GP said, that running Gentoo on production servers is just not a good idea?
> I have some friends who set up their own hosting business based on Gentoo and they seem to be doing it right. Their servers are very stable.
This is a little too anecdotal for me :)
I've done my share of Gentoo, and this system is intriguing. But when I need stability, as I need for backend services, then it will not even be on the long list.
I found some similar-level intriguing features in NixOS. I'd probably still not run it for backend services, but I might be compelled to run it on the application layer in some cases.
I don't at all read your parent as asserting that all production deployments should make use of a meta-distribution like Gentoo, but instead describing the specialized features that are useful for some specific niches. If you have some business reason to want to run arbitrary combinations of different versions of different software together, or to more-easily manage distributing software to your platform with many different features that you sometimes want compiled out, or otherwise if a slow stable extremely-reliable distribution isn't handling your needs, you might want to consider using Gentoo instead of building all that infrastructure yourself.
That's not the case for you, or for me, and that's fine. You've described some of the reasons I also prefer a stable well-supported distribution and don't use Gentoo. My reading of your comment is saying that because you personally have no use for the benefits, therefore Gentoo is completely unsuitable for any production use ever. If I've misread your comment, and you were just commenting on your personal use cases without trying to speak to any general audience, I apologize for my mischaracterization.
But honestly, I think most people who run Gentoo do it for the learning experience and for fun. :-)
Yeah, nice one! Very customized flags required.
I was more talking about typical backend services, like Ceph or, say, a Riak or Postgres cluster.
Minimise your deviation from the stable stuff that everyone else is using.
But still, maybe there is some stuff with special processors that needs all to be compiled specifically with support for that type of thing, in order to make proper use of the underlying machine.
> I do not want that on my production server. We're in the $SOME_BACKEND_SERVICE business, not in the distro creating business.
A lot of companies know that Google and Facebook have their own distros. Ergo, to become as successful, they too must build their own distro.
This hasn't stopped Google from using a Debian variant internally as a desktop OS, so they all kind of have their place depending on what you intend to build.
No, you should have build hosts for that, publish binpkgs, and then install the binpkgs.
Gentoo's binpkgs do not require compiling on your production machines; compile them elsewhere.
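A minimal sketch of that workflow, with the hostname and package purely illustrative:

```shell
# On the build host: build a binary package (FEATURES="buildpkg" in
# make.conf does this automatically for every emerge)
emerge --buildpkg ceph

# Serve /var/cache/binpkgs over HTTP, then on each production box point
# Portage at it in make.conf:
#   PORTAGE_BINHOST="https://binhost.example.org/packages"
# and install only prebuilt packages, never compiling locally:
emerge --getbinpkg --usepkgonly ceph
```

The production machines then only ever unpack binaries that were built and tested elsewhere.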
> an update might require configuration file change, this doesn't happen in CentOS or Debian Stable
Debian has debconf, and Centos has a similar tool. When configuration files change in the debian world, they often prompt with those maintainer scripts.
Gentoo's dispatch-conf/etc-update model is quite nice because it also lets maintainers recommend new config defaults which the user may easily accept or reject.
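For the unfamiliar, that flow looks roughly like this on a Gentoo box:

```shell
# After an update, Portage stashes proposed config files as ._cfg0000_*
# next to the protected originals instead of overwriting them
emerge --update --deep @world

# Review each pending change interactively: dispatch-conf shows a diff
# and lets you accept, reject, or merge it, keeping a history of replaced
# files so you can back out a bad config change later
dispatch-conf
```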
You clearly don't know what you're talking about. Then again, many people who use gentoo don't use it as you would in production, with binpkg mirrors and well managed package unmasking/masking/etc.
The nice thing about Arch is that it's absolutely trivial to roll back a package, or even roll back all your packages to the ones published on any given date. So I was able to deal with all of these issues quite effectively, which is a credit to Arch, but it simply is the case that you will see more breakage on the leading edge.
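For reference, both rollbacks are documented Arch mechanisms; the package version and snapshot date below are illustrative:

```shell
# Roll a single package back from the local package cache
pacman -U /var/cache/pacman/pkg/openssl-1.1.0.g-1-x86_64.pkg.tar.xz

# Or pin the whole system to a snapshot date via the Arch Linux Archive:
# make this the only Server line in /etc/pacman.d/mirrorlist...
#   Server=https://archive.archlinux.org/repos/2018/01/15/$repo/os/$arch
# ...then downgrade everything to that date's package set:
pacman -Syyuu
```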
What you're describing there is bleeding edge, not rolling release. If you ran Ubuntu with the testing repos then you'd have experienced the same issues.
> The nice thing about Arch is that it's absolutely trivial to roll back a package, or even roll back all your packages to the ones published on any given date. So I was able to deal with all of these issues quite effectively, which is a credit to Arch, but it simply is the case that you will see more breakage on the [bleeding] edge.
Indeed you will on bleeding edge. That was my point. Most of the complaints people attribute to rolling release are actually problems with bleeding edge distros rather than rolling release. You've downvoted me only to reiterate the same point.
I'd covered that in my first post as well (did you even read it? :P)
> without a release cycle, what exactly would they be waiting for?
Rolling release distros still have release cycles. eg they often have testing repos where many packages will be trialled before they hit the main repositories. Much like you see happen with packages which get updated between major releases on non-rolling release distros.
The difference between rolling release and non-rolling release is a bit less pronounced these days, because even most non-rolling release distros now have easy upgrade paths from one major release version to the next. Heck, with the Debian / Ubuntu derived distros you can just update your apt config to point to the next release repos and apt will carry on as if you're on a rolling release distro. So I think the real difference between rolling release and non-rolling is simply that you're given greater assurances that you don't need to perform manual intervention during package upgrades outside of the major version upgrades, whereas with rolling release the risk of manual intervention can come without warning.
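The apt trick mentioned above is just a repo rename plus an upgrade; the release codenames here are illustrative:

```shell
# Point apt at the next release's repositories (e.g. xenial -> bionic)
sudo sed -i 's/xenial/bionic/g' /etc/apt/sources.list

# Refresh the package index and upgrade in place, rolling-release style
sudo apt update
sudo apt full-upgrade
```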
In terms of package stability, there's nothing stopping someone creating an Arch fork which runs packages a few months behind the main Arch repos. In fact I think that might have been the concept behind Arch Server - not that it ever took off.
So anyway, back to my original point: rolling release doesn't have to mean "unstable". It just means breaking changes don't all get held back until major milestones are reached. But that's often true for non-rolling release as well. It just so happens that most rolling release distros are also bleeding edge, and that is where the trope comes from. So saying rolling release can only be unstable is akin to saying Red Hat can only be bleeding edge if you only ever use Fedora.
I was going to say "guaranteed" but that's not quite true either; e.g. FreeBSD will happily perform breaking changes to applications (within a major release version) if you're not careful about which application versions are being installed.
This is also not quite true, but many distros will warn you about breaking changes in their news feed. The package manager itself gives strong clues too (e.g. Apache 2.2.x -> 2.4.x will obviously mean manually updating some Apache config files).
Software and firmware updates from Linux distros have recently bricked modern hardware several times. This should never have happened, because a period of testing is necessary to evaluate whether a given update will cause failures. With rolling release, you update and pray.
Portage is a great tool, but I always thought it was better suited to building packages and operating systems, as opposed to being an end-user package management solution.
Reproducibility is still a difficult problem, but that isn't one of them, and Nix comes far closer than Gentoo.
As for reproducibility when building from source, that depends on your practices. Is every box a special snowflake with its own USE flags and package set? Well then yes, good luck with reproducibility. Or are you standardizing your targets and building one package to deploy to them all?
It, of course, depends on the project; if the ./configure script drops the current date in source code, it'll be different, but that doesn't happen often.
You can run any distro you want, safely, with best practices. Which means you aim to not make upstream changes which have a high probability of breaking things ... like glibc ... and you focus upon maintaining a stable platform for your service. You can do that with any distro. Any unix(-like) thing really. There's no magic, just common sense.
The flip side is that some practices encouraged by various distros range between glacial change (read as stability), or near relativistic change (read as here be dragons). Gentoo encourages the latter, and Debian/CentOS encourage the former. This does not mean one is better than the other, just that you have to pay more attention with some distros, to maintaining operational discipline.
We had cases where machines with 200+ days of uptime changed from one operations team to another, and the new team did a reboot -- and found that some NFS mounts or ip routes or firewall routes had been added manually at run time, without persistent configuration (like calling mount directly instead of adding stuff to /etc/fstab). Of course, those things were lost after a reboot, and had to be reconstructed somehow.
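Concretely, the difference is between the ephemeral command and its persistent equivalent; hostname and paths here are illustrative:

```shell
# Ephemeral: works now, silently gone after the next reboot
mount -t nfs filer.example.org:/export/home /mnt/home

# Persistent: the same mount recorded in /etc/fstab so it survives reboots
#   filer.example.org:/export/home  /mnt/home  nfs  defaults,_netdev  0  0
```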
And this is in an organization with several professional sysadmins per team, not "just" a team of volunteers.
As for the choice of OS: it makes sense to use one that people are familiar with, and that fits their style of operation. I don't want to pass any value judgment on that topic.
All in all, impressive debugging work!
Around a decade ago, I started hating big uptime. "That machine has been up 200 days? WAY past time for a reboot!" In my previous job I implemented a policy of rebooting at least every 6 months, or whenever kernel or glibc/etc updates were done.
This was very effective at ensuring that changes to systems were ALWAYS reflected in boot scripts, etc... Way better than finding out that all sorts of problems existed on a hundred servers during an emergency when power went out, and nobody could remember the change that was made a year or more ago.
Whatever happened to the good ol' days of best practices? At least with Microsoft products, say, something like SCCM, you have metrics that can tell you what works & what doesn't work. Or, a software product on Microsoft's platform like PDQ Inventory/PDQ Deploy. I know with a decent 4GB memory/60GB HDD/1Gbps network, I can manage 40-50 desktop PCs fairly easily.
Do people not do baselines anymore? There's so much random documentation for Ceph. It's poorly documented. There are no logical help switches for it. And there's no "Oh, you should have x mons for this many OSDs"
It's just poor. Poor all around. :-(
They'll provide RH certified packages that have been tested in-depth on certain OSes (not Gentoo), provide someone on-site to make sure you're following their best practices around configuration, upgrade process, etc.
Ceph is a very complex distributed system and it's very difficult (IMO) to nail down what the best practices are around hardware and resources - it depends almost entirely on the workload, and requirements out of it. If you're operating it solely for sequential RGW workloads, it looks _completely_ different than what you'd want for high performance RBD devices.
They do have some documentation online, to the extent that they can: separate networks, 5 MON nodes (to avoid one going down and ruining your day), etc. I just think it's significantly harder given that it's used on such a diverse set of workloads, which is why they are more than happy to sell consulting for this.
That can be reflected a little bit in Red Hat technicians. Do not let that affect your judgement of their technical and operational skills.
I've done some work in Ceph and in my opinion it is too ambitious: block, object, and file system storage, custom raw partition format (Bluestore), along with a somewhat legacy C/C++ codebase makes the system fairly fragile. I would only recommend it if you "know what you are doing" or are willing to get some help from Red Hat.
The other comment I'd make is that 12TB unreplicated (assuming they are using 3x replication for Ceph) is actually not a huge amount of data, and in my opinion a ZFS setup would be cheaper and more stable. It is not too challenging to do an HA ZFS setup, and ZFS's mirroring, scrubbing, and checksumming abilities make it very resilient. Copy on write and ZFS snapshots are also great features that can save on disk usage.
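For scale, the whole resilience story for a setup like this fits in a handful of commands; pool, dataset, and device names below are illustrative:

```shell
# Create a pool of two mirrored vdevs (device names are illustrative)
zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd

# Periodically verify every checksum, repairing bad blocks from the
# mirror copy, and check the result
zpool scrub tank
zpool status tank

# Cheap point-in-time snapshots, e.g. taken before a risky upgrade,
# give you an actual rollback path
zfs snapshot tank/data@pre-upgrade
zfs rollback tank/data@pre-upgrade
```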
I feel the sweet spot for Ceph is probably 50TB - 1PB, where you need both block and object storage and are unwilling to use cloud solutions. Lower than 50TB, and the overhead and risk of managing Ceph make it less practical than traditional solutions. From the other direction, CERN did testing up to 30PB with Ceph, but they had to make significant code changes and had Ceph committers on their team. To compare, Hadoop runs in clusters of up to 600PB (but you will be accessing the data very differently).
I'm not sure what reality this design comes from, but it's not one I care to inhabit.
The normal flags you're looking for are: noout, noscrub, nodeep-scrub. Do your maintenance. Unset those flags.
The noout is what tells Ceph to essentially not shuffle data when OSDs go down. This is also referenced in the article.
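Sketched out, the maintenance window looks like this (these are the standard Ceph cluster flags):

```shell
# Before maintenance: stop Ceph from marking down OSDs "out" (which
# would trigger rebalancing) and pause scrubs while you work
ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub

# ...do the maintenance, reboot the node, wait for its OSDs to rejoin...

# Afterwards: clear the flags so normal recovery and scrubbing resume
ceph osd unset noout
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```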
Think of the case where you have more than the three servers described in this incident. You really don't want to disable cluster-wide automated repairs because of some other kind of maintenance. Something else WILL happen during your scheduled work and... ooops.
Why 600 or 900? Unavailability events in larger clusters mostly tend to last only 10-15 minutes, for routine things such as regular node reboots, per page 2 of https://static.googleusercontent.com/media/research.google.c...
The reason for "noout" is you generally want to minimize IO operations while you're performing your maintenance. You should be closely monitoring your cluster while performing maintenance, and if something else goes wrong, you abort and unset noout, wait for it to finish rebuilding, and reassess.
EDIT: This used to be 300 in Jewel, and seems to have changed to 600 in a later version.
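Assuming the 300/600-second timer being discussed here is `mon_osd_down_out_interval` (the time a down OSD may stay "down" before being marked "out" and triggering rebalancing), it can be tuned; the value 900 is illustrative:

```shell
# Newer releases with centralized config:
ceph config set mon mon_osd_down_out_interval 900

# Older releases: set it in ceph.conf instead
#   [mon]
#   mon osd down out interval = 900
```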
I'd still argue that you want to provision N+2 capacity, so you can withstand a planned event and an unplanned one at the same time, without having to go through manual tweaking. I'd leave that for truly exceptional cases, such as when the whole cluster is hit by some nasty bug or has entered a spiral that calls for drastic measures.
In an ideal world, we would have had multiple people to dedicate to building the kind of cluster they wanted to build. In the real world, after a dozen people left, I'm afraid they may well have ended up in a situation not much better than the University in question.
My memory is generally quite poor, but I vaguely recall this feature being present for as long as I've been familiar with Ceph. I obviously have no way of knowing what happened in that meeting you were in, and maybe the other people who were proposing using Ceph were not very familiar with performing maintenance on it, or maybe there was other constraint or use case they had in mind, but I'm quite confident in saying that the normal case for Ceph maintenance involves only a very marginal amount of data movement (to bring the temporarily-down OSD back up-to-date with changes that occurred while it was down).
Assuming that you run with 3 failure domains and only maintain one failure domain at a time, noout mostly gets the job done. What it doesn't do for you is save you from an actual failure in a different failure domain during maintenance. EC pools with k+(m>=2), or replication > 3, would cover this as well.
We've had mostly great success with noout + maintain failure domain at a time, wait for recovery, proceed to next failure domain, repeat until done. To the point where we've been comfortable leaving a lot of the babysitting & work to machines.
ceph is just a buggy piece of shit. it configures its options like a windows registry. at one point I literally built ceph on my gentoo system and ran the mds over the vpn. it was the only mds that wasn't crashing. nothing about gentoo is to blame here, it's more stable with ceph if anything.
It sounds to me like the choice of Gentoo was at least partially responsible for that 5 day outage.
With RHEL/CentOS, it's pretty good about keeping things stable from a software version perspective.
The downside is when you really do need (or just plain really want) newer versions of software. But at least then you can make the conscious decision to manually maintain a few packages, while the rest of the core stays stable.
I won't run anything but an LTS release in production. People always say "Well <my favorite distro> has release cycles every 2 years, and upgrading your systems that often should be fine." But I've been involved in OS upgrades that require 6+ months to do, and not infrequently.
Most reasonably complex services (in my experience) take man-months of effort to go from, say, 12.04 to 16.04. Start having 3 or 4 such services, and non-upgrade work that needs to be done, and you start running out of time in a 2 year release cycle to complete updates.
I also agree with everyone that gentoo was probably not a great choice, but ceph doesn't even seem production ready unless you have a dedicated team just to manage it.
I don't understand why people are complaining about docs here. The docs are good. Read them, read config options, read until you understand what config options actually do. Passively watch the mailing list. I think expecting to run your own petabyte capable storage cluster without a little effort is a bit misguided...
If you can’t afford an enterprise SAN (I’m not even talking a NAS, I mean a real fiberchannel-based block-store) for your virtual environment (and I get it - many cannot) then just do yourself a favor and run on local disk.
The reduction in moving parts will pay dividends. I promise.
* a bunch of wordpress pages
* mailing lists
* mail server for council addresses, likely mostly unused
* git server
for a few student councils at LMU. Git is likely used only by the CS council, maybe maths and physics.
Is it really necessary to have 3 * 12TB for this, and all this complexity? Fault tolerance? Gentoo? Would a single server running boring software and occasional backups not have been more appropriate for something like this? To me it looks like a few CS students tasked with managing digital infrastructure went completely overboard.
It will make you hate your job, whether it's some kind of networking problem (configuration, a bad NIC, or just plain ol' SoftLayer dropping packets while support claims there are no problems), some kind of HW problem (one bad OSD node in a semi-failed state will pretty much kill cluster throughput), or just some bad disks causing a rebalance in the middle of high I/O time. It always made me nervous, even with the most basic 'let's swap the failed OSD' operations. The blog post itself is a classic showcase of putting all your eggs in one basket. They'd be better off with a simpler setup. Btw, running only 1 monitor node will make you cry when that single monitor decides to take a break, which will happen sooner or later.
It almost seems like if you need some persistent storage running plain NFS would be the safest thing if you don't have a dedicated team.
> It turned out that all three systems had been updated a few times without restarting the OSD. No OSD could start anymore. We kept the last two OSD running (this turned out to be a mistake). The file servers, running Gentoo, also had a profile update done by another administrator.
If you don't have the resources to perform safe upgrades, the fool-proof option is to pay a vendor to run your storage for you. I agree that if you want to run a reliable distributed storage system, you need a professional sysadmin who has enough time to maintain it safely. I further agree that if you don't have any professionals who are funded to dedicate at least some time to this, you'll probably have a lot less failure by just running a big dedicated node providing NFS, iSCSI, SMB, or whatever.
I can't think of any software that I'd call fool-proof under "We've upgraded multiple versions without any testing, we have no systems in a known-good state, and we don't have any way to actually revert back to a known-good state".
It's //really// nice to actually have a budget so you can setup a testing environment and actually validate the things you're about to do to your production environment.
Even if this was HA, it doesn't scale to HPC type needs. You end up with a single node bottleneck for all data you need. You also would have some sort of direct attached storage all connected to a single node.
Lustre/Gluster/Ceph/GFS and maybe HDFS (ideally with commercial MapR type NFS access) are probably the only viable options in the HPC use case.
If you have a lot of money to burn Isilon/NFS is an option as well.
Looks like in this case you never actually needed what Ceph brings to the table: for a tremendous cost in complexity (and fragility, from personal anecdotal evidence), it lets your dataset and workload grow beyond the physical capabilities of a single node.
The HA argument in less demanding settings ends up being a cargo cult mostly. Your setup will probably produce higher reliability figures than running Ceph, assuming the lack of some serious engineering capabilities.
My interpretation of the writeup suggests that most of their problems would have been avoided by running the latest supported release of Ceph, on a supported distribution.
I ran Gentoo servers myself for a few years, but I had to give it up when I realized I wasn't getting much benefit for all the extra effort I was putting in. It was a great way to learn how free software packages interact with each other, but it became a fairly significant time sink to rebuild the world every so often.