
Postmortem of the failure of one hosting storage unit on Jan. 8, 2020 - nachtigall
https://news.gandi.net/en/2020/01/postmortem-of-the-failure-of-one-hosting-storage-unit-at-lu-bi1-on-january-8-2020/
======
kaizendad
It's interesting -- they ID the actual cause of the problem up top, and then
just zip right past it. The problem wasn't the hardware failure, or the lack
of backups, it was _that customers expected them to have backups_.

Gandi goes into some detail on the recovery process and on ways to fix the
issue in the future. But, apart from some hand-waving, they don't have any
specifics about how they'll communicate expectations better with their
customers in the future.

Imagine the counterfactual: Gandi's docs clearly communicate "this service has
no backups, you can take a snapshot through this api, you're on your own." Of
course customers with data loss would've complained, but, at the end of the
day, the message from both Gandi and the community would've been "well, next
time buy a service with backups?" Yet there's no explicit plan to improve
documentation.

~~~
speeder
People posted an excerpt of their manual; it explicitly called "snapshots" a
"backup". Customers were surprised when they asked Gandi to restore the
snapshots, and only then were they told that the snapshots were not backups...

~~~
kaizendad
Absolutely! And yet there's no explicit plan for how they'll reword the manual
in the future. _That_ is the root cause of the real problem here -- not just
that there weren't backups, but that there weren't backups while the manual
said there were, and there's no plan to fix that!

------
zaroth
I’m no ZFS expert, but it must have been incredibly stressful, if not mildly
terrifying, going that far down the rabbit hole with customer data on the
line.

I have a bad feeling someone is going to read their write up and tweet at
them, “Why didn’t you use -xyz switch, it fixes exactly this issue in 12
seconds”.

Indeed it appears that the option they needed existed, but only in a later
version of ZFS than they were running, and part of the fix was moving the
broken array to a system that could run a newer version of ZFS, which
apparently was itself not trivial.

~~~
piti_
I don't have hundreds of customers, but I handle hundreds of TiB of data (for
a science lab).

Issues occur from time to time, and I can assure you that those times are very
stressful. I am grateful to rely on ZFS, because so far I have never lost
anyone's data (datasets are often around 10 TiB).

~~~
toomuchtodo
No offsite backups? Backblaze B2, for example, is exceedingly cheap.

~~~
doublerabbit
The animation studio I worked for had almost a petabyte of data. It may be
cheap to buy the storage, but transferring it is costly. It's very easy to
saturate an MPLS circuit with data; even rsync on a 10Gbit internal connection
takes a long while.

Really, Gandi should have had backups from day one. If you're hosting data,
you should always have backups ready and tested from day one.

~~~
jotm
Hey, good business idea: backup storage in vans! :)

~~~
ficklepickle
AWS beat you to it[0]. Not a van, but a 45-foot trailer.

[0] [https://aws.amazon.com/snowmobile/](https://aws.amazon.com/snowmobile/)

~~~
jotm
That's impressive!

------
_nalply
Just bad luck. A different story: I had a cheap dedicated host in Atlanta.
Their failure was epic. You get what you pay for.

An old high-rise filled with tens of thousands of old, second-hand server
blades, floor after floor of equipment prolifically producing waste heat. A
sure recipe for disaster?

Sure!

A wrongly installed fuse on one phase of the building's power made that phase
burn out too early. I saw a picture of the breaker equipment; it looked
archaeological. They fixed that.

However, the missing phase had destroyed the compressor motors of their
cooling systems. The temperature crept higher and higher, and they had to turn
off whole floors of servers. When they believed the problem was fixed, they
turned servers back on row by row. Renters then frantically tried to copy what
they had on the servers, the half-repaired cooling system was overtaxed, and
they had to turn servers off again.

Edit: made some details more specific.

~~~
robbyt
Not bad luck, bad design. Same for Gandi, and for the situation you just
described.

Drive failures and HVAC failures due to bad power are not "black swan"
events. These are very common problems for DCs, and a good design takes these
problems into account.

However, a "bad" design is cheap, and hopefully the savings are passed on to
the customer.

~~~
_nalply
Hopefully... My dedicated hosting was very cheap, I paid $20 a month.

~~~
halfeatenpie
It was Delimiter, wasn't it? They weren't a very good hosting organization to
begin with, but you get what you pay for.

~~~
_nalply
Right. When I relocated to a home with FTTH, I switched to self-hosting.

------
ZeroCool2u
This is basically the stuff of nightmares.

You can't really fault them for the ZFS version being so old that the feature
they needed wasn't yet implemented, because the machine was literally part of
the last batch to be upgraded. The root cause is just some random hardware
failure that can't be anticipated.

Just bad luck. Beyond radically changing how their core infrastructure works,
doesn't seem like there was a lot they could have done to prevent this. Kudos
for releasing the post mortem though, at least they've been fairly honest and
direct about it.

~~~
escardin
There was one obvious thing they could have done: backups. ZFS even makes it
easy. They state they have triple redundancy on their servers; pull half of
those drives and use them for backups instead. ZFS supports streaming
snapshots (maybe not on this particularly old system). It sounds like they
have multiple ZFS servers per datacenter, so given two servers, they could use
50% of the storage on each node as a backup of the other node.
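
Something along these lines would do it (a minimal sketch; the pool, dataset,
and host names are made up, and it assumes a ZFS version new enough for
send/receive and SSH access between the nodes):

    # Snapshot everything under the (hypothetical) customer dataset
    zfs snapshot -r tank/customers@daily-2020-01-08
    # Stream the whole snapshot tree to the peer node's backup pool
    zfs send -R tank/customers@daily-2020-01-08 | \
        ssh node-b zfs receive -Fu backup/node-a-customers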

It's not like the backups have to be customer-facing; use them to increase
availability and decrease MTTR. In this situation, even with a daily snapshot
they could have had customers up and running with yesterday's data while they
took their time recovering the old system, instead of moving boxes around and
bypassing safeties for speed. How much did five days of panic cost them? Their
customers? Their brand?

I feel like they read something about how S3 has at least three copies of
everything, and then did that locally with ZFS, instead of accounting for all
the other failure modes that the S3 design accounts for.

You are right, there isn't a whole lot that could have been done without
radically changing their infrastructure, but they're clearly at the scale and
have the hardware available to make better choices than they have.

~~~
generalpass
> How much did five days of panic cost them? Their customers? Their brand?

Intangibles. Whenever you go to talk to management or even co-workers about
this stuff, they look at you like you are crazy. I think it is just human
nature to not even think that something could go wrong, let alone make
decisions based on this.

~~~
RantyDave
Oh, I think they recognise the damage to their brand.

~~~
generalpass
> Oh, I think they recognise the damage to their brand.

Today, yes. Two years ago?

------
tmikaeld
This actually makes me glad to use ZFS (FreeBSD and ZoL) on all servers; a
broken RAID on a different filesystem could have meant complete data loss.

From all of the cases I've read where people were not idiots (not using
snapshots and overwriting a dataset..), it's by far the safest filesystem I've
seen during my 12 years working with it, and I've yet to lose a single file.

Sure, performance can suffer and RAM is pricey, but safety of the data is more
important.

Considering this is a hardware fault, I think Gandi.net did their best.
However, they should offer clients optional ZFS replication as an extra
measure.

~~~
RantyDave
But would this replicate the broken metadata? And exactly how do they think it
got there in the first place? TBH this sounds like a problem with an ancient
version of Solaris that's been enhanced by a relatively small company and it's
finally just bitten them.

------
gizmo
They don't say they're sorry, because they're not. Instead they minimize their
actions by: 1) stating how few customers were affected, 2) how it's not really
their fault because it was a hardware error, 3) it's not really their fault
because they had already planned to upgrade the server, 4) it's not really
their fault the restore procedure took so long because they had to make
backups first, 5) the restore took so long because spinning disks are slow,
and they really had no way to know this in advance. And to top it all off they
point out they're not contractually obligated to provide working snapshots at
all, so really it's the customers who are at fault here.

The take-away here is clear: don't trust Gandi with anything you care about.

~~~
jrochkind1
No data was lost though, true?

I don't know if I expect a postmortem to say "sorry", and I think you are
being needlessly harsh. But I agree this level of service doesn't seem up to
the current best in class, like Amazon etc. (Which of course still have
unexpected outages very occasionally, although a 5-day time to recovery would
certainly be... unusual.)

But this partially shows how much expectations/standards have risen in the
past five to ten years. When an unacceptable, not-up-to-par level of
reliability still involves no data loss, we're doing pretty well. And I think
"don't trust Gandi with anything you care about" is probably an exaggerated
response. But yes, they don't seem to be providing a
mega-cloud-service-provider level of service.

~~~
FemmeAndroid
Honestly, I too was looking for a straightforward 'We screwed up, sorry.' I
wouldn't care nearly as much if they'd just had 5 days without snapshots. But
the way they handled support so poorly deserves to be addressed in a
postmortem.

See this thread for the support at the time:

[https://twitter.com/andreaganduglia/status/12152827193300664...](https://twitter.com/andreaganduglia/status/1215282719330066434)

~~~
jrochkind1
Fair enough. That probably is bad marketing if nothing else, and maybe
something more. What you're saying about the major failure being in
support/customer-management, even more than the technical issue, seems
potentially reasonable. (I am not a Gandi customer, so it's not personal for
me.)

I still think the fact that there was no data loss, and we're still on the
edge of calling it unacceptable incompetence, is worth noting, as to how far
our expectations and standards have come. Which is good of course.

~~~
gizmo
The linked twitter thread explicitly mentions data loss:

> Hi Andrea. It is confirmed we have lost data and we are terribly sorry for
> that. However, please note that what happenend[sp] could happen to any web
> host.

Customers that were forced to migrate to a different webhost had to restore
from whatever backups they had, and they lost data for sure. Even if Gandi
ultimately recovered everything (and it's not completely clear if they did) at
that point the customer data/databases have already been forked so it's too
late.

~~~
jrochkind1
OK, that makes it even worse then. The postmortem linked above definitely
says:

> We managed to restore the data and bring services back online the morning of
> January 13.

Is that wrong? It's bad to lose data, it's even worse to tell people you
didn't lose data in one place when you did, and tell them you did lose data in
another.

~~~
ialexpw
There was confusion about this. Originally they thought they had lost all the
data, which is why a lot of people went crazy at them via Twitter. They later
said there might be a chance to recover the data, and luckily they ended up
finding a way to recover it.

------
marcinzm
>But contractually, we don’t provide a backup product for customers. That may
have not been explained clearly enough in our V5 documentation.

If you have a single point of failure for data and "snapshots" then you should
explain that very clearly to customers. Moreover, as I understand it,
competitors like AWS do not have such a single point of failure (ie: EBS
Snapshots are on S3 and not EBS) so using the same terminology/workflow is
going to cause confusion.

~~~
AaronFriel
Are S3 and EBS not hosted on the same underlying storage subsystems?

------
generalpass
I realize everyone here seems focused on the file system, but one of the
things stressed by the OpenBSD project is that difficulty in upgrading is the
root cause of the biggest problems.

Case in point.

------
vahomu
> As disks are read at 3M/s, we estimate the duration of the operation to be
> up to 370 hours.

Am I reading this right? This works out to just ~3.8 TiB.

So much drama over, basically, one HDD's worth of data?
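
A rough check, reading the quoted 3M/s as 3 MiB/s:

    3 MiB/s × 370 h × 3600 s/h = 3,996,000 MiB ≈ 3.8 TiB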

~~~
robin_reala
They said that in the end only 414 users were affected, and it was their
simple hosting package. Honestly, I’m almost surprised it was that much.

------
loa_in_
Their services otherwise run flawlessly for me. I appreciate the transparency.

------
BrentOzar
And now their postmortem blog post is down with a 503 error. Doesn’t exactly
fill me with confidence about their abilities.

~~~
polyvisual
Their handling of the issue on Twitter was enough for me to decide to move my
domain names away from them when their renewal is due.

~~~
slig
Just a heads up, and forgive me if this is obvious, but you can move right
away; the expiration date will stay the same anyway. You certainly don't want
a failed migration too close to the expiration date.

~~~
polyvisual
yes, good point!

------
acd
I once had data loss on a test ZFS system with iSCSI on top. What I learnt
from that is that you need to schedule scrubs of your ZFS pools regularly.
It's always easy to be wise afterwards but harder to predict beforehand. Not
sure if that would have helped.
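
For anyone wanting to do the same, a crontab entry along these lines works
(the pool name "tank" is a placeholder):

    # Scrub the pool at 02:00 on the 1st of every month;
    # check progress/results afterwards with `zpool status tank`
    0 2 1 * * /sbin/zpool scrub tank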

------
iicc
> We think it may be due to a hardware problem linked to the server RAM.

Are they using ECC RAM?

~~~
tmikaeld
Sounds like they didn't, and the metadata logs got corrupted...

This also means that other data could be corrupted too; running ZFS without
ECC RAM is frequently warned against.

~~~
vel0city
Running any resilient storage system without ECC RAM is warned against, people
just really make a big deal about it with ZFS. If your data in RAM is
corrupted before it makes it to the hard drive, pretty much any file system is
going to write corrupted data to the drive.

~~~
tmikaeld
Indeed, ECC should be the _default_ these days!

It's even worse that RAM is usually only tested when the computer is built.

I've had several cases where RAM became faulty a couple of years down the
road.

Recently I had a very weird case where two sticks out of four went bad just
from moving the computer from one corner to another, without even opening the
case.

------
cdubzzz
Previously:
[https://news.ycombinator.com/item?id=22001822](https://news.ycombinator.com/item?id=22001822)

------
perlgeek
As a postmortem, this does not inspire confidence. It's a very technical
piece, but doesn't even try to take a customer's perspective.

If you want to learn from such an outage, you have to do a fault analysis that
leads to parameters you can control.

Sure, there can be faulty hardware and software, but you are the ones
selecting and running and monitoring them.

If recovery takes ages, you might want to practice recovery and improve your
tooling.

And so on.

Blaming ZFS, faulty hardware, and old software all cries "we didn't do
anything wrong", so there are no improvements in sight.

------
_eht
Being a Gandi customer must be terrifying, generally.

~~~
speedgoose
They used to have a great reputation in France about 15 to 20 years ago. It
has gone downhill since: the company was sold and they started selling
expensive, slow cloud services, a bit like Amazon but with less success.

------
gdm85
Would it be possible to back up the precious metadata separately to mitigate
the issue?

~~~
tomatocracy
Sounds like the data was on one pool with a 3-disk mirror setup and nowhere
else. This is 'RAID is not a backup' territory. Much better would have been to
duplicate the volumes themselves somewhere else (e.g. using zfs send/receive
to a different host) and ideally the contents of those volumes too.
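
For example, incremental replication keeps the ongoing transfer cost low after
the first full copy (the dataset and host names here are hypothetical):

    # Initial full copy (run once)
    zfs snapshot tank/vol@base
    zfs send tank/vol@base | ssh backup-host zfs receive backup/vol
    # Thereafter, only send the blocks changed since the last snapshot
    zfs snapshot tank/vol@today
    zfs send -i tank/vol@base tank/vol@today | \
        ssh backup-host zfs receive backup/vol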

------
dmh2000
After a glance I thought, 'Why a storage unit? Where do they get power? How do
they cool it? It's not physically secure, etc.' Then: oh, that kind of storage
unit. Yes, I'm dumb.

