Postmortem of the failure of one hosting storage unit on Jan. 8, 2020 (gandi.net)
120 points by nachtigall 26 days ago | 77 comments



It's interesting -- they ID the actual cause of the problem up top, and then just zip right past it. The problem wasn't the hardware failure or the lack of backups; it was that customers expected them to have backups.

Gandi goes into some detail on the recovery process and on ways to fix the issue in the future. But, apart from some hand-waving, they don't have any specifics about how they'll communicate expectations better with their customers in the future.

Imagine the counterfactual: Gandi's docs clearly communicate "this service has no backups, you can take a snapshot through this api, you're on your own." Of course customers with data loss would've complained, but, at the end of the day, the message from both Gandi and the community would've been "well, next time buy a service with backups?" Yet there's no explicit plan to improve documentation.


People posted an excerpt of their manual; it explicitly called "snapshots" a "backup". Customers were surprised when they asked Gandi to restore the snapshots and only then were told that the snapshots were not backups...


Absolutely! And yet there's no explicit plan for how they'd reword the manual in the future. That is the root cause of the real problem here -- not that there weren't backups, but that there weren't backups while the manual said there were, and there's no plan to fix that!


I’m no ZFS expert, but it must have been incredibly stressful, if not mildly terrifying, going that far down the rabbit hole with customer data on the line.

I have a bad feeling someone is going to read their write up and tweet at them, “Why didn’t you use -xyz switch, it fixes exactly this issue in 12 seconds”.

Indeed it appears that the option they needed existed, but only in a later version of ZFS than they were running, and part of the fix was moving the broken array to a system that could run a newer version of ZFS, which apparently was itself not trivial.


"Indeed it appears that the option they needed existed, but only in a later version of ZFS than they were running, and part of the fix was moving the broken array to a system that could run a newer version of ZFS, which apparently was itself not trivial."

I have not read this post-mortem yet, but I can attest that this is a viable strategy.

As many know, rsync.net is built entirely on ZFS.

While we have never come close to a blown array (we use extremely conservatively configured raidz3 vdevs) what we have seen are weird corner cases where suddenly a 'zfs destroy' or even a common 'rm' deletion of hundreds of millions of files will either take forever (years) or will halt the (FreeBSD) system.

In one of these cases, after several days of degraded performance and intermittent outages, we did an alternate boot to a newer FreeBSD version with a newer, production, release version of ZFS, and the operation completed in a timely and graceful manner.

---

What we continue to learn, decade after decade, from UFS2 through to ZFS, is that extremely simple infrastructure configuration is resilient and fails in predictable and boring ways.

We could gain so much "efficiency" and save a lot of money if we did common sense things like bridge zpools across multiple JBODs or run larger vdevs, etc. - but then we'd find ourselves with fascinating failures instead of boring ones.


I don't have hundreds of customers, but I handle hundreds of TiB of data (for a science lab).

Issues occur from time to time, and I can assure you that those times are very stressful. I am grateful to rely on ZFS, because so far I have never lost anyone's data (datasets are often around 10 TiB).


No offsite backups? Backblaze B2 is exceedingly cheap for example.


The animation studio I worked for had almost a petabyte of data. It may be cheap to buy the storage, but transferring is costly. It's very easy to saturate an MPLS circuit with data; even rsync on a 10 Gbit internal connection takes a long while.

Really, Gandi should have had backups from day one. If you're hosting data, you should always have backups ready and tested on day one.


rsync worked quite well for transferring data (in the same situation as you describe), once we took care of some important bottlenecks (not running it over SSH, disabling compression on files that don't compress well, skipping full checksums, tuning TCP sockopts, ...).

What can leave you hanging for a long time is the phase before the actual transfer, while rsync builds and compares the file lists on both the sending and receiving side, which hurts when you have big filesystems (hundreds of millions of files).

If you have a strategy to select beforehand which files to transfer (for example, a DB which tracks what has been created or changed, fed directly from worker or production input), you have a good head start and can avoid running rsync on complete filesystems -- and instead run it on a selection, which is tiny compared to the complete project(s) most of the time.
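To illustrate, a rough sketch of that kind of invocation (hosts, module name, and file-list path are made up): an rsync daemon on the receiving side instead of SSH, whole-file transfers to skip the delta-transfer checksumming, compression enabled but skipped for already-compressed suffixes, and a pre-built file list so rsync never has to walk the whole tree:

    rsync -az --whole-file \
          --files-from=/tmp/changed-files.txt \
          --skip-compress=gz/jpg/mp4/zip \
          --sockopts=SO_SNDBUF=4194304,SO_RCVBUF=4194304 \
          /projects/ rsync://backup-host/projects/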


I’d be very curious if you evaluated the post-MPLS guys like Megaport for connectivity.


Hey, good business idea: backup storage in vans! :)


AWS beat you to it[0]. Not a van, but a 45 foot trailer.

[0] https://aws.amazon.com/snowmobile/


That's impressive!


My lab generates similar-sized data sets and the transfer, more than the at-rest storage, is tough.

If our internet (or Box's datacenter) were slow, we could easily collect data faster than we could send it to our collaborators.


We do have backups, but getting back your hundreds of TiB takes a really long time: you want to keep the data where it is already living.


Just bad luck. A different story: I had a cheap dedicated host in Atlanta. Their failure was epic. You get what you pay for.

An old high-rise filled with tens of thousands of old, second-hand server blades, floor after floor of equipment prolifically producing waste heat. A sure recipe for disaster?

Sure!

A wrongly installed fuse on one phase in the building made that phase burn out too early. I saw a picture of the breaker equipment; it looked archaeological. They fixed that.

However, the missing phase destroyed the compressor motors of their cooling systems. Temperature crept up higher and higher. They had to turn off whole floors of servers. When they believed they had fixed the problem, they turned servers back on row by row. Renters then frantically tried to copy what they had on the servers, the half-repaired cooling system was overtaxed, and they had to turn servers off again.

Edit: made some details more specific.


Not bad luck, bad design. Same for Gandi, and for the situation you just described.

Drive failures and HVAC failures due to bad power are not "black swan" events. These are very common problems for DCs, and a good design takes these problems into account.

However, a "bad" design is cheap, and hopefully the savings is passed to the customer.


Hopefully... My dedicated hosting was very cheap, I paid $20 a month.


It was Delimiter, wasn't it? They weren't a very good hosting organization to begin with, but you get what you pay for.


Right. When I relocated to a home with FTTH, I switched to self-hosting.


This is basically the stuff of nightmares.

You can't really fault them for the zfs version being so old the feature they needed wasn't yet implemented, because the machine was literally part of the last batch to be upgraded. The root cause is just some random hardware failure that can't be anticipated.

Just bad luck. Beyond radically changing how their core infrastructure works, doesn't seem like there was a lot they could have done to prevent this. Kudos for releasing the post mortem though, at least they've been fairly honest and direct about it.


There was one, obvious thing they could have done: backups. ZFS even makes it easy. They state they have triple redundancy on their servers. Pull half of the drives and have backups. ZFS supports streaming snapshots (maybe not on this particularly old system). It sounds like they have multiple ZFS servers per datacenter, so given 2 servers, they could use 50% of the storage on each node as a backup of the other node.
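For reference, cross-node replication in ZFS is only a couple of commands; a minimal sketch (pool, dataset, and host names made up), shipping a nightly snapshot from node A to a reserved slice of node B:

    # on node A: snapshot, then replicate to node B over SSH
    zfs snapshot tank/customers@2020-01-08
    zfs send tank/customers@2020-01-08 | \
        ssh node-b zfs receive -u backup/node-a/customers

    # following nights: send only the delta since the previous snapshot
    zfs send -i @2020-01-08 tank/customers@2020-01-09 | \
        ssh node-b zfs receive -u backup/node-a/customers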

It's not like the backups have to be customer available; use them to increase availability and decrease MTTR. In this situation, even with a daily snapshot they could have had customers up and running with yesterday's data while they took their time recovering the old system, instead of moving boxes around and bypassing safeties for speed. How much did five days of panic cost them? Their customers? Their brand?

I feel like they read something about how S3 has at least three copies of everything, and then did that locally with ZFS, instead of accounting for all the other failures that can happen that the S3 design accounts for.

You are right, there isn't a whole lot that could have been done without radically changing their infrastructure, but they're clearly at the scale and have the hardware available to make better choices than they have.


> How much did five days of panic cost them? Their customers? Their brand?

Intangibles. Whenever you go to talk to management or even co-workers about this stuff, they look at you like you are crazy. I think it is just human nature to not even think that something could go wrong, let alone make decisions based on this.


Oh, I think they recognise the damage to their brand.


> Oh, I think they recognise the damage to their brand.

Today, yes. Two years ago?


They do mention in the postmortem that they explicitly do not provide backups, and say so on their product page, but concede that this could perhaps have been communicated to customers more clearly.

Designing a really robust system to failures like this is a very difficult problem. You can see this in the complexity of systems like S3 and Google's Colossus[1]. Colossus in particular is probably one of Google's single greatest competitive advantages, especially considering none of it is open sourced[2].

Comparing these guys to AWS/S3 is perhaps not entirely fair given the assumption that they have very different levels of resources. For a medium-size shop and the constraints they've defined, I think this is a fair outcome of the situation. I agree, though, that it could have been mitigated by making the decision to actually store backups.

[1]https://www.wired.com/2012/07/google-colossus/

[2]https://cloud.google.com/files/storage_architecture_and_chal...


Not providing backups and not having backups aren't the same thing. If you have triply redundant local disks, you can probably afford to take half of them and use them as backups for other systems and achieve better availability results (I'm assuming it's not triply redundant for performance).

While I did say S3, what I was really thinking about was Ceph. I don't think it's a silver bullet (almost certainly way more maintenance than a bunch of ZFS nodes), but if you're big enough to have multiple storage nodes with hundreds of customers each (and again, triply redundant disks), then you could have built around the eventual failure of a node with what you already have. I'm not expecting them to hit S3's eleven nines of durability, just taking a glance at what they have said about their design and proposing that basic changes to how they allocate what they already have would have avoided their problem in the first place.

I don't know what their exact situation looks like, or how they got into this situation. I see a post-mortem that says they spent 5 days trying desperately to recover customer data because they don't have backups, and they're not going to change anything about how they do things to eliminate the problem, even though it appears they have the raw storage capacity to have a backup. A sister comment says that brand damage, customer costs and recovery costs are just hypotheticals. They were, right up until this incident. Hopefully their internal postmortem has more details about what the costs were.

Clearly if they're trying to recover the customer data, it was important enough to the business to do so, and maybe it's time to re-evaluate 'no backups'.


"Beyond radically changing how their core infrastructure works, doesn't seem like there was a lot they could have done to prevent this."

If only there were a cloud storage provider that you could 'zfs send', over SSH, to ...

If only ...


This actually makes me glad to use ZFS (FreeBSD and ZoL) on all servers; a broken RAID on a different filesystem could have meant complete data loss.

From all of the cases I've read where people were not idiots (not using snapshots and overwriting a dataset...), it's by far the safest filesystem I've seen in my 12 years working with it, and I've yet to lose a single file.

Sure, performance can suffer and RAM is pricey, but safety of the data is more important.

Considering this is a hardware fault, I think Gandi.net did their best. However, they should offer clients optional ZFS-Replication as an extra measure.


But would this replicate the broken metadata? And exactly how do they think it got there in the first place? TBH this sounds like a problem with an ancient version of Solaris that's been enhanced by a relatively small company and it's finally just bitten them.


They don't say they're sorry, because they're not. Instead they minimize their actions by: 1) stating how few customers were affected, 2) how it's not really their fault because it was a hardware error, 3) it's not really their fault because they had already planned to upgrade the server, 4) it's not really their fault the restore procedure took so long because they had to make backups first, 5) the restore took so long because spinning disks are slow, and they really had no way to know this in advance. And to top it all off, they point out they're not contractually obligated to provide working snapshots at all, so really it's the customers who are at fault here.

The take-away here is clear: don't trust Gandi with anything you care about.


No data was lost though, true?

I don't know if I expect a postmortem to say "sorry", and I think you are being needlessly harsh. But I agree this level of service doesn't seem up to current best in class, like Amazon etc. (which of course still have unexpected outages very occasionally, although a 5-day time to recovery would certainly be... unusual).

But this partially shows how much expectations/standards have risen in the past five to ten years. When an unacceptable, not-up-to-par level of reliability still involves no data loss, we're doing pretty well. And I think "don't trust Gandi with anything you care about" is probably an exaggerated response. But yes, they don't seem to be providing mega-cloud-service-provider levels of service.


Honestly, I too was looking for a straightforward 'We screwed up, sorry.' I wouldn't care nearly as much if they'd just had 5 days without snapshots. But the poor way they handled support deserves to be addressed in a postmortem.

See this thread for the support at the time:

https://twitter.com/andreaganduglia/status/12152827193300664...


> Honestly, I too was looking for a straightforward 'We screwed up, sorry.'

They do so a bit here: https://news.gandi.net/en/2020/01/major-incident-on-our-host...

>We’re very sorry for this truly unfortunate incident and we offer our sincere apologies to anyone impacted.


Fair enough. That probably is bad marketing if nothing else, and maybe something more. What you're saying about the major failure being in support/customer management, even more than the technical issue, seems potentially reasonable. (I am not a Gandi customer, so it's not personal for me.)

I still think the fact that there was no data loss, and we're still on the edge of calling it unacceptable incompetence, is worth noting, as to how far our expectations and standards have come. Which is good of course.


The linked twitter thread explicitly mentions data loss:

> Hi Andrea. It is confirmed we have lost data and we are terribly sorry for that. However, please note that what happenend[sp] could happen to any web host.

Customers that were forced to migrate to a different webhost had to restore from whatever backups they had, and they lost data for sure. Even if Gandi ultimately recovered everything (and it's not completely clear if they did) at that point the customer data/databases have already been forked so it's too late.


OK, that makes it even worse then. The postmortem linked above definitely says:

> We managed to restore the data and bring services back online the morning of January 13.

Is that wrong? It's bad to lose data, it's even worse to tell people you didn't lose data in one place when you did, and tell them you did lose data in another.


There was confusion with this. Originally they thought they had lost all data, which is why a lot of people went crazy at them via Twitter. They later said there might now be a chance to recover the data, and luckily they ended up finding a way to recover it.


> No data was lost though, true?

What about any data that would have accumulated in those 5 days? This was storage for their IAAS and PAAS products, so anyone using those lost access for 5 days?


Well, it's a technical post-mortem, not a love letter. I'm not affiliated with Gandi in any way, but I find the finger-pointing a bit too pedantic.

> The take-away here is clear: don't trust Gandi with anything you care about.

The take-away is not this one. It's: back up anything you care about.


They called snapshots backups in their web interface when viewing snapshots. From their docs:

> "Snapshots allow you to create a backup copy of a volume"

https://pbs.twimg.com/media/EN2UZ6TX4AAMe-H?format=png&name=...

They are doing a lot of preaching about backups while failing to keep internal backups (not customer-facing backups) of their own products.


As is said in the postmortem, they agree that they should have stated more clearly that backup availability was not contractually assured.


The postmortem says "we don’t provide a backup product for customers" while the docs describe the snapshots as a backup (see screenshot from my higher level comment). This is the disconnect for me that I'm sure is causing a lot of the frustration they are hearing from customers. They are not accepting that they sold the snapshots as a backup and this is disappointing in a postmortem where users are looking for empathy, acknowledgement, and a path forward.


I for one am glad they released a factual account and timeline of what went wrong. I don't see it as an attempt to minimize their actions. They even admit that they have no clear explanation of the original issue, when they could easily have committed to a stronger theory to make themselves look more competent. Overall I'd much rather read this than a massaged PR apology that keeps us in the dark of what actually happened.


This is the massaged PR “postmortem”. It’s basically a shoulder-shrug emoji and takes zero responsibility for the incident.

They also failed to address their abysmal responses on Twitter that essentially belittled and poked fun at the affected users.

E.g. https://news.ycombinator.com/item?id=22002258


>They don't say they're sorry,

>We’re very sorry for this truly unfortunate incident and we offer our sincere apologies to anyone impacted.

https://news.gandi.net/en/2020/01/major-incident-on-our-host... (linked in the Postmortem)


>But contractually, we don’t provide a backup product for customers. That may have not been explained clearly enough in our V5 documentation.

If you have a single point of failure for data and "snapshots", then you should explain that very clearly to customers. Moreover, as I understand it, competitors like AWS do not have such a single point of failure (i.e. EBS snapshots are stored on S3, not on EBS), so using the same terminology/workflow is going to cause confusion.


Are S3 and EBS not hosted on the same underlying storage subsystems?


I realize everyone here seems focused on the file system, but one of the things stressed by the OpenBSD project is that difficulty in upgrading is the root cause of the biggest problems.

Case in point.


> As disks are read at 3M/s, we estimate the duration of the operation to be up to 370 hours.

Am I reading this right? 3 MiB/s for 370 hours works out to just ~3.8 TiB.

So much drama over, basically, one HDD worth of data?


They said that in the end only 414 users were affected, and it was their simple hosting package. Honestly, I’m almost surprised it was that much.


If they had some spare SSD capacity lying around, they could have done a linear copy of the HDD to SSD and then done the import; that could have sped up the random-access scans.
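Something along those lines, as a sketch (device, path, and pool names made up): image each member disk onto SSD-backed storage sequentially, then import the pool read-only from the copies for the metadata scan:

    dd if=/dev/da0 of=/ssd/disk0.img bs=1M conv=noerror,sync
    dd if=/dev/da1 of=/ssd/disk1.img bs=1M conv=noerror,sync
    zpool import -d /ssd -o readonly=on data_pool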


Absolutely pathetic on their end.


Their services otherwise run flawlessly for me. I appreciate the transparency.


And now their postmortem blog post is down with a 503 error. Doesn’t exactly fill me with confidence about their abilities.


Their handling of the issue on Twitter was enough for me to decide to move my domain names away from them when their renewals are due.


Just a heads up, and forgive me if this is obvious, but you can move right away; the expiration date will be the same anyway. You certainly don't want a failed migration too close to the expiration date.


yes, good point!


The last I looked at them for either hosting or domain, they had a provision in their TOS that basically said they could terminate my account at any time if I did anything they felt was morally wrong. I emailed them and asked if what I was reading was true and they confirmed it. I never looked at them again.

They probably thought their no-BS morally-right stance was supposed to comfort me, and it's not like I host anything that would likely meet that criterion. But who's to say a blog post with cursing in it isn't something they decide, at their sole discretion, is bad? Or speaking up on abortion in a way they don't agree with? Or any of the other morally charged topics out there? I'm not hosting with the thought police, and it's always made me wonder how others felt comfortable with them.


> they had a provision in their TOS that basically said they could terminate my account at any time if I did anything they felt was morally wrong.

Don't most modern service providers have a clause like that?


Any recommendations? Preferred within the EU.


[flagged]


No they didn't, they were referencing the "shame" scene, and even included a gif, from Game of Thrones. In context, them saying "who do you want to see naked" is obviously not offering to send nudes, but who do you want to see punished. The worst part was the CEO getting on twitter and mocking the person who had originally complained about the outage on twitter.



Anyone not familiar with GoT could certainly have easily misinterpreted that though.


Once had a data loss on a test ZFS system with iSCSI on top. What I learnt from that is that you need to schedule scrubs of your ZFS pools regularly. It's always easy to be wise afterwards but harder to predict beforehand. Not sure if that would have helped here.
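For example (pool name made up), a monthly scrub from cron, or FreeBSD's periodic(8) hook:

    # crontab: scrub the pool at 03:00 on the 1st of every month
    0 3 1 * * /sbin/zpool scrub tank

    # or, on FreeBSD, in /etc/periodic.conf:
    daily_scrub_zfs_enable="YES"
    daily_scrub_zfs_default_threshold="35"  # days between scrubs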


> We think it may be due to a hardware problem linked to the server RAM.

Are they using ECC RAM?


Sounds like they didn't, and the metadata logs got corrupted.

This also means that other data could have been corrupted too; running ZFS without ECC RAM is frequently warned against.


Running any resilient storage system without ECC RAM is warned against, people just really make a big deal about it with ZFS. If your data in RAM is corrupted before it makes it to the hard drive, pretty much any file system is going to write corrupted data to the drive.


Indeed, ECC should be the _default_ these days!

It's made even worse by the fact that RAM is usually only tested when the computer is built.

I've had several cases where RAM became faulty a couple of years down the road.

Recently I had a very weird case where two sticks out of four went bad just from moving the computer from one corner to another, without even opening the case.



As a postmortem, this does not inspire confidence. It's a very technical piece, but doesn't even try to take a customer's perspective.

If you want to learn from such an outage, you have to do a fault analysis that leads to parameters you can control.

Sure, there can be faulty hardware and software, but you are the ones selecting and running and monitoring them.

If recovery takes ages, you might want to practice recovery and improve your tooling.

And so on.

Blaming ZFS and faulty hardware and old software all cries "we didn't do anything wrong", so no improvements in sight.


Being a Gandi customer must be terrifying, generally.


They used to have a great reputation in France, 15 to 20 years ago. It has gone downhill since: the company was sold and they started selling expensive and slow cloud services, a bit like Amazon but with less success.


Would it be possible to back up the precious metadata separately to mitigate the issue?


Sounds like the data was on one pool with a 3-disk mirror setup and nowhere else. This is 'RAID is not a backup' territory. Much better would have been to duplicate the volumes themselves somewhere else (e.g. using zfs send/receive to a different host) and ideally the contents of those volumes too.


I don't think it's a problem of the data corrupting while on disk; I think the problem occurred in RAM and the corruption was then written to disk.


After a glance I thought, 'Why a storage unit? Where do they get power, how do they cool it, it's not physically secure, etc.' Then: oh, that kind of storage unit. Yes, I'm dumb.



