Gandi goes into some detail on the recovery process and on ways to fix the issue in the future. But, apart from some hand-waving, they don't have any specifics about how they'll communicate expectations better with their customers in the future.
Imagine the counterfactual: Gandi's docs clearly communicate "this service has no backups, you can take a snapshot through this api, you're on your own." Of course customers with data loss would've complained, but, at the end of the day, the message from both Gandi and the community would've been "well, next time buy a service with backups?" Yet there's no explicit plan to improve documentation.
I have a bad feeling someone is going to read their write up and tweet at them, “Why didn’t you use -xyz switch, it fixes exactly this issue in 12 seconds”.
Indeed it appears that the option they needed existed, but only in a later version of ZFS than they were running, and part of the fix was moving the broken array to a system that could run a newer version of ZFS, which apparently was itself not trivial.
I have not read this post-mortem yet, but I can attest that this is a viable strategy.
As many know, rsync.net is built entirely on ZFS.
While we have never come close to a blown array (we use extremely conservatively configured raidz3 vdevs) what we have seen are weird corner cases where suddenly a 'zfs destroy' or even a common 'rm' deletion of hundreds of millions of files will either take forever (years) or will halt the (FreeBSD) system.
In one of these cases, after several days of degraded performance and intermittent outages, we did an alternate boot to a newer FreeBSD version with a newer, production, release version of ZFS, and the operation completed in a timely and graceful manner.
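For reference, the slow, system-halting destroy described here is the class of problem the async_destroy pool feature addressed in later OpenZFS releases (the deletion is logged and completed in the background, and resumes after a reboot or reimport). A minimal sketch of checking for it before kicking off a huge deletion; the pool name "tank" and the canned command output are invented for the demo:

```shell
# On a live system you would feed the helper real output from:
#   zpool get feature@async_destroy tank
supports_async_destroy() {
  # expects `zpool get` output on stdin; "enabled" or "active"
  # means large destroys complete asynchronously in the background
  grep -Eq 'feature@async_destroy[[:space:]]+(enabled|active)'
}

# canned output in the usual NAME PROPERTY VALUE SOURCE format:
echo "tank  feature@async_destroy  enabled  local" \
  | supports_async_destroy && echo "async destroy available"
```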
What we continue to learn, decade after decade, from UFS2 through to ZFS, is that extremely simple infrastructure configuration is resilient and fails in predictable and boring ways.
We could gain so much "efficiency" and save a lot of money if we did common sense things like bridge zpools across multiple JBODs or run larger vdevs, etc. - but then we'd find ourselves with fascinating failures instead of boring ones.
Issues occur from time to time, and I can assure you that these times are very stressful. I am grateful to rely on ZFS, because so far I have never lost any of my users' data (datasets are often around 10TiB).
Really, Gandi should have had backups from day one. If you're hosting data, you should always have backups ready and tested from day one.
What it might leave you hanging on for a long time is the phase before the actual transfer, while it builds and compares the file lists on both the sending and receiving sides, which hurts when you have big filesystems (hundreds of millions of files).
If you have a strategy to select beforehand which files to transfer (for example, from a DB that tracks what has been created or changed, fed directly by worker or production input), you have a good head start and can avoid running rsync over complete filesystems -- and rather run it on a selection, which is tiny compared to the complete project(s) most of the time.
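A sketch of that "select beforehand" approach, with all paths invented for the demo; the file list stands in for whatever a DB or job queue would produce, with paths relative to the source root:

```shell
# set up a tiny fake source tree
mkdir -p /tmp/rsync_demo/src/a /tmp/rsync_demo/dst
echo "hello" > /tmp/rsync_demo/src/a/changed.txt
echo "old"   > /tmp/rsync_demo/src/a/untouched.txt

# the pre-computed selection of files worth transferring
printf 'a/changed.txt\n' > /tmp/rsync_demo/list.txt

# --files-from skips the expensive full-tree walk on both sides and
# transfers only the listed paths (creating parent dirs as needed)
rsync -a --files-from=/tmp/rsync_demo/list.txt \
      /tmp/rsync_demo/src/ /tmp/rsync_demo/dst/
```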
If our internet (or Box's datacenter) were slow, we could easily collect data faster than we could send it to our collaborators.
An old high-rise is filled with tens of thousands of old second-hand server blades, floor after floor of equipment prolifically producing waste heat. A sure recipe for disaster?
A wrongly installed fuse on one phase of the building's supply made that phase burn out too early. I saw a picture of the breaker equipment; it looked archaeological. They fixed that.
However, the missing phase destroyed the compressor motors of their cooling systems. The temperature crept higher and higher, and they had to turn off whole floors of servers. When they believed they had fixed the problem, they turned servers back on row by row. Tenants then frantically tried to copy whatever they had on the servers, the half-repaired cooling system was overtaxed, and they had to turn off servers again.
Edit: made some details more specific.
Drive failures and HVAC failures due to bad power are not "black swan" events. These are very common problems for DCs, and a good design takes them into account.
However, a "bad" design is cheap, and hopefully the savings are passed on to the customer.
You can't really fault them for running a ZFS version so old that the feature they needed wasn't yet implemented, because the machine was literally part of the last batch to be upgraded. The root cause is just a random hardware failure that can't be anticipated.
Just bad luck. Beyond radically changing how their core infrastructure works, doesn't seem like there was a lot they could have done to prevent this. Kudos for releasing the post mortem though, at least they've been fairly honest and direct about it.
It's not like the backups have to be customer available, use them to increase availability and decrease MTTR. In this situation, even with a daily snapshot they could have had customers up and running with yesterday's data while they took their time recovering the old system and not moving boxes around and bypassing safeties for speed. How much did five days of panic cost them? Their customers? Their brand?
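A dry-run sketch of the kind of daily snapshot-plus-replica job being suggested; the dataset and host names ("tank/customers", "backup-host") and the snapshot names are placeholders, and with DRYRUN=1 the commands are only printed so the logic can be read without a live pool:

```shell
DRYRUN=1
# print the command instead of running it when DRYRUN=1
run() { if [ "${DRYRUN:-0}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

today=$(date +%F)
# 1. take today's snapshot of the customer dataset
run zfs snapshot "tank/customers@$today"
# 2. ship only the delta since the previous snapshot to a second box,
#    so yesterday's data is always ready to serve from elsewhere
run sh -c "zfs send -i tank/customers@yesterday tank/customers@$today \
           | ssh backup-host zfs recv -F backup/customers"
```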
I feel like they read something about how S3 has at least three copies of everything, and then did that locally with ZFS, instead of accounting for all the other failures that can happen that the S3 design accounts for.
You are right, there isn't a whole lot that could have been done without radically changing their infrastructure, but they're clearly at the scale and have the hardware available to make better choices than they have.
Intangibles. Whenever you go to talk to management or even co-workers about this stuff, they look at you like you are crazy. I think it is just human nature to not even think that something could go wrong, let alone make decisions based on this.
Today, yes. Two years ago?
Designing a system that is really robust against failures like this is a very difficult problem. You can see this in the complexity of systems like S3 and Google's Colossus. Colossus in particular is probably one of Google's single greatest competitive advantages, especially considering none of it is open sourced.
Comparing these guys to AWS/S3 is perhaps not entirely fair given the assumption that they have very different levels of resources. For a medium size shop and the constraints they've defined, I think this is a fair outcome of the situation. I agree though in that it could have been mitigated by making the decision to actually store backups.
While I did say S3, what I was really thinking about was Ceph. I don't think it's a silver bullet (almost certainly way more maintenance than a bunch of ZFS nodes), but if you're big enough to have multiple storage nodes with hundreds of customers each (and again, triply redundant disks), then you could have built around the eventual failure of a node with what you already have. I'm not expecting them to hit S3's eleven nines of durability; just taking a glance at what they have said about their design, basic changes to how they allocate what they already have would have avoided their problem in the first place.
I don't know what their exact situation looks like, or how they got into this situation. I see a post-mortem that says they spent 5 days trying desperately to recover customer data because they don't have backups, and they're not going to change anything about how they do things to eliminate the problem, even though it appears they have the raw storage capacity to have a backup. A sister comment says that brand damage, customer costs and recovery costs are just hypotheticals. They were, right up until this incident. Hopefully their internal postmortem has more details about what the costs were.
Clearly if they're trying to recover the customer data, it was important enough to the business to do so, and maybe it's time to re-evaluate 'no backups'.
If only there were a cloud storage provider that you could 'zfs send', over SSH, to ...
If only ...
From all of the cases I've read where people were not idiots (not using snapshots and overwriting a dataset...), it's by far the safest filesystem I've seen during my 12 years working with it, and I've yet to lose a single file.
Sure, performance can suffer and RAM is pricey, but safety of the data is more important.
Considering this is a hardware fault, I think Gandi.net did their best. However, they should offer clients optional ZFS-Replication as an extra measure.
The take-away here is clear: don't trust Gandi with anything you care about.
I don't know if I expect a postmortem to say "sorry", and I think you are being needlessly harsh. But I agree this level of service doesn't seem up to the current best in class, like Amazon etc. (which of course still has unexpected outages very occasionally, although a 5-day time to recovery would certainly be... unusual).
But this partially shows how much expectations/standards have risen in the past five to ten years. When an unacceptable, not-up-to-par level of reliability still involves no data loss, we're doing pretty well. And I think "don't trust Gandi with anything you care about" is probably an exaggerated response. But yes, they don't seem to be providing mega-cloud-service-provider levels of service.
See this thread for the support at the time:
They do so a bit here: https://news.gandi.net/en/2020/01/major-incident-on-our-host...
>We’re very sorry for this truly unfortunate incident and we offer our sincere apologies to anyone impacted.
I still think the fact that there was no data loss, and yet we're on the edge of calling it unacceptable incompetence, is worth noting as a sign of how far our expectations and standards have come. Which is good, of course.
> Hi Andrea. It is confirmed we have lost data and we are terribly sorry for that. However, please note that what happenend[sp] could happen to any web host.
Customers that were forced to migrate to a different webhost had to restore from whatever backups they had, and they lost data for sure. Even if Gandi ultimately recovered everything (and it's not completely clear if they did) at that point the customer data/databases have already been forked so it's too late.
> We managed to restore the data and bring services back online the morning of January 13.
Is that wrong? It's bad to lose data, it's even worse to tell people you didn't lose data in one place when you did, and tell them you did lose data in another.
What about any data that would have accumulated in those 5 days? This was storage for their IAAS and PAAS products, so anyone using those lost access for 5 days?
> The take-away here is clear: don't trust Gandi with anything you care about.
That's not the take-away. It's: back up anything you care about.
> "Snapshots allow you to create a backup copy of a volume"
They are doing a lot of preaching about backups when failing to do internal backups (not customer facing backups) of their own products.
They also failed to address their abysmal responses on Twitter that essentially belittled and poked fun at the affected users.
https://news.gandi.net/en/2020/01/major-incident-on-our-host... (linked in the Postmortem)
If you have a single point of failure for data and "snapshots" then you should explain that very clearly to customers. Moreover, as I understand it, competitors like AWS do not have such a single point of failure (ie: EBS Snapshots are on S3 and not EBS) so using the same terminology/workflow is going to cause confusion.
Case in point.
Am I reading this right? This works out to just ~3.8TiB.
So much drama over, basically, one HDD worth of data?
They probably thought their no-BS, morally-right stance was supposed to comfort me, and it's not like I host anything that would likely meet that criteria. But who's to say a blog post won't have cursing in it that they decide, at their sole discretion, is bad? Or speak up on abortion in a way they don't agree with? Or touch any of the other morally charged topics out there? I'm not hosting with the thought police, and it's always made me wonder how others felt comfortable with them.
Don't most modern service providers have a clause like that?
Are they using ECC RAM?
This would also mean that other data could be corrupted too; running ZFS without ECC RAM is frequently warned against.
It's even worse given that RAM is usually only tested when the computer is built.
I've had several cases where RAM becomes faulty a couple of years down the road.
Recently I had a very weird case where two sticks out of four went bad just from moving the computer from one corner of the room to another, without even opening the case.
If you want to learn from such an outage, you have to do a fault analysis that leads to parameters you can control.
Sure, there can be faulty hardware and software, but you are the ones selecting and running and monitoring them.
If recovery takes ages, you might want to practice recovery and improve your tooling.
And so on.
Blaming ZFS and faulty hardware and old software all cries "we didn't do anything wrong", so no improvements are in sight.