
1.7 petabytes and 850M files lost, and how we survived it - beck5
https://csc.fi/web/blog/post/-/blogs/the-largest-unplanned-outage-in-years-and-how-we-survived-it
======
zimpenfish
"The directory is intended for temporary storage of results before staging
them into a more permanent location [...] During the three years that the
filesystem has been in operation, it has accumulated 1.7 Petabytes of data in
850 million objects."

There needs to be some law about how temporary directories always end up
containing vitally important data.

~~~
sevensor
What was interesting to me about this was that they had decided _not_ to
enforce a deletion policy on /wrk, because they had so much space and the
filesystem hadn't ever failed. But a rolling deletion policy would have gone a
long way toward containing the damage by encouraging users to move their data
to a filesystem optimized for reliability instead of availability. Still, I
appreciate the heroics involved in restoring the data.

~~~
ople
Author here. We had an automated deletion policy on our previous filesystems
but opted out this time: there are users with temporary files that they want
to persist on /wrk, and we have plenty of capacity. We definitely learned our
lesson, though. :)

~~~
Natanael_L
Better solution for the future: client-side scripts that push those files back
after every purge.

------
hga
Lots of fun; while backing up the filesystem prior to wiping and rebuilding
it, they didn't have the IOPS to do it in a reasonable time frame, so after
considering other options:

 _One obvious solution would be to use a ramdisk, a virtual disk that actually
resides in the memory of a node. The problem was that even our biggest system
had 1.5TB of memory while we needed at least 3TB.

As a workaround we created ramdisks on a number of Taito cluster compute
nodes, mounted them via iSCSI over the high-speed InfiniBand network to a
server and pooled them together to make a sufficiently large filesystem for
our needs._

A hack they weren't at all sure would work, but it did nicely.
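
In outline the trick is standard Linux plumbing: make a RAM-backed block
device on each node, export it over iSCSI, and stripe the imports together on
the receiving server. A minimal sketch of how such a pool could be assembled
(entirely a reconstruction, not CSC's actual procedure: the node names, IQNs,
tmpfs-backed targets, LIO/targetcli, open-iscsi, and mdadm striping are all
assumptions):

    # Hypothetical sketch of pooling compute-node RAM into one block device.
    # Assumes passwordless ssh, LIO (targetcli) on the compute nodes, and
    # open-iscsi plus mdadm on the receiving server. iSCSI portal/ACL setup
    # is omitted for brevity; none of this is from the article.
    import subprocess

    NODES = ["c001", "c002", "c003", "c004"]   # made-up compute node names
    SIZE_GB = 1024                             # RAM to donate per node

    def ssh(node, cmd):
        subprocess.run(["ssh", node, cmd], check=True)

    for i, node in enumerate(NODES):
        # Back each iSCSI target with a file on tmpfs, i.e. in RAM.
        ssh(node, f"mkdir -p /mnt/rd && mount -t tmpfs -o size={SIZE_GB}g tmpfs /mnt/rd")
        ssh(node, f"truncate -s {SIZE_GB}G /mnt/rd/disk.img")
        ssh(node, f"targetcli /backstores/fileio create rd{i} /mnt/rd/disk.img")
        ssh(node, f"targetcli /iscsi create iqn.2017-03.example:rd{i}")
        ssh(node, f"targetcli /iscsi/iqn.2017-03.example:rd{i}/tpg1/luns "
                  f"create /backstores/fileio/rd{i}")

    # On the server: log in to every target (over IPoIB on the InfiniBand
    # fabric), then stripe the imported block devices into one volume.
    for node in NODES:
        subprocess.run(["iscsiadm", "-m", "discovery", "-t", "sendtargets",
                        "-p", node], check=True)
        subprocess.run(["iscsiadm", "-m", "node", "-p", node, "--login"],
                       check=True)

    # RAID-0: capacity and IOPS add up, redundancy is zero -- acceptable
    # for a scratch pool that only has to outlive one copy operation.
    subprocess.run("mdadm --create /dev/md0 --level=0 --raid-devices=4 "
                   "/dev/sd[b-e] && mkfs.xfs /dev/md0", shell=True, check=True)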

~~~
powercf
Couldn't they add 1.5TB of swap to their 1.5TB-of-memory system and run a
ramdisk on that? I'm curious what performance would look like, but given 2-3k
IOPS for the on-disk solution and 20k IOPS for the in-memory one, I would
naively expect at least 11k IOPS for random access, which should have been
fast enough without the headache of clustering.

~~~
ople
Author here: We considered that, but as the access pattern was likely pretty
much random, the performance would have been terrible. Due to the break, we
had nearly 1,000 clustered servers sitting idle, so it was reasonably quick to
do the ramdisk trick.
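
To put numbers on "terrible": with serial random accesses you average
latencies, not IOPS, so a half-RAM/half-swap blend lands near a weighted
harmonic mean of the two rates rather than the arithmetic mean. A quick
sketch using the 2-3k and 20k IOPS figures from upthread:

    # Effective IOPS when random accesses are split between RAM-backed
    # pages (~20k IOPS) and disk-backed swap (~2.5k IOPS). Serial random
    # accesses combine like series resistors: average the latencies.
    ram_iops, disk_iops = 20_000, 2_500
    p_ram = 1.5 / 3.0            # 1.5TB of the 3TB working set fits in RAM

    latency = p_ram / ram_iops + (1 - p_ram) / disk_iops  # mean seconds/op
    print(f"{1 / latency:,.0f} IOPS")                     # ~4,400, not ~11k

And that is before counting page-fault overhead, so the real figure would
likely have been worse still.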

~~~
icefo
I'm sorry, but I don't understand something. What did you put on that big
ramdisk? The metadata?

~~~
ople
We copied the raw image file of the corrupted metadata filesystem (MDT in
Lustre lingo) to the ramdisk.

Then we mounted it via loopback and copied the files to tarballs. The bit that
was really slow on the spinning disk was reading the millions of files from
the metadata FS.

The basic process of the file-level backup is documented here:
https://build.hpdd.intel.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#dbdoclet.50438207_21638

~~~
garthk
For those still not quite getting it:

The first copy to RAM was a sequential image copy, thus not bottlenecked on
seeks despite spinning platters.

The second copy from RAM was a file copy with a lot of random I/O, but not
bottlenecked on seeks because it was reading from RAM.

Bulk writes tend to be more efficient. They might have made temporary
configuration changes to make that end faster, or not if they lacked the
appetite for the extra risk.
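
Rough numbers make the point. These are illustrative guesses (the IOPS figure
is the one quoted elsewhere in the thread; the sequential rate and image size
are assumed):

    # Sequential image copy is bandwidth-bound; walking 850M objects on
    # spinning disk is seek-bound. All numbers are illustrative guesses.
    image_tb  = 3                # assumed size of the MDT image
    seq_gb_s  = 1.0              # assumed sequential array throughput, GB/s
    objects   = 850_000_000      # files/objects to enumerate
    disk_iops = 2_500            # ~2-3k IOPS quoted elsewhere in the thread

    print(f"image copy : ~{image_tb * 1000 / seq_gb_s / 3600:.1f} h")  # ~0.8 h
    print(f"random walk: ~{objects / disk_iops / 3600:.0f} h")         # ~94 h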

------
ghubbard
Current HN Title: 1.7 petabytes and 850M files lost, and how we survived it.

Article title: The largest unplanned outage in years and how we survived it

Article overview: A month ago CSC's high-performance computing services
suffered the largest unplanned outage in years. In total approximately 1.7
petabytes and 850 million files were recovered.

Although technically correct, the HN title is misleading.

~~~
tnorthcutt
_1.7 petabytes and 850M files lost_ vs. _1.7 petabytes and 850 million files
were recovered_

Given that the latter statement is from the article, how is the former
"technically correct"?

~~~
biot
Imagine an article titled "One web server lost and how we survived it" that
simply said "Our load balancer automatically removed that server from the pool
and we let the other 15 web servers pick up the load. We didn't have to do
anything." This is different from "Oh crap, we only had one web server and we
absolutely had to go through a lengthy recovery process to get it back
online."

------
pinewurst
It should be noted that this is about a Lustre filesystem hosted on DDN
hardware. It's unclear whether the failed controller contributed to the
filesystem corruption, but Lustre is quite capable of accelerating local
entropy all by itself. It was designed/spec'd at LLNL as huge-file,
high-performance, short-term scratch/swap storage, and even after 15 years it
isn't especially reliable or fit for use outside that domain.

------
gnufx
I'm surprised that the copying bottleneck seems to have been entirely at the
target rather than the source. Is that because there were multiple copies of
the source?

I've had to employ the horrible hack of iSCSI from compute nodes, raided and
re-exported, but it's not what I'd have tried first. The article doesn't
mention the possibility of just spinning up a parallel filesystem on
compute-node local disks (assuming they have disks); I wonder if that was
ruled out. I don't have a good feel for the numbers, but I'd have tried
OrangeFS on a good number of nodes initially.

By the way, it's been pointed out that a RAM disk is relatively slow, at least
in the context of data rates rather than metadata:
http://mvapich.cse.ohio-state.edu/static/media/publications/slide/rajachan-hpdc13.pdf

~~~
ople
The reading of the metadata required quite a lot of random access. We were
fairly sure that if a high-end array and controller with fast disks was
struggling with it, a traditional clustered solution with slower node-local
disks would not fare any better. Thus we tried to find the solution that would
yield the highest possible IOPS.

~~~
gnufx
I misunderstood the bottleneck, not having had to do that. (Distributed
metadata for the parallel filesystem could actually be tuned to be memory
resident.)

------
ajford
Out of curiosity, why weren't they running the metadata drive in a mirrored
RAID? If you have PB of data, wouldn't it make sense to spend the ~$100 for a
second 3TB drive to mirror your metadata?

Or was the inode problem not a local disk problem but a problem in the Lustre
FS? I couldn't quite tell from the article.

~~~
pinewurst
It's almost a certainty that the MDS (metadata server) was situated on a
mirrored RAID (prob RAID10). I'm guessing that the RAID system itself
(software MDRAID or some HW array, DDN or something like a NetApp E-Series)
corrupted the data under the FS that the MDS used, which I'm also assuming was
XFS.

Lustre, for those who don't know it, is a cluster meta-filesystem, with
separate metadata and object servers, each sitting on top of host file
systems/RAID/storage.

~~~
ople
The metadata target (MDT) in the MDS is actually "ldiskfs", which is an
enhanced version of ext4. One possibility may be to use ZFS in the future, as
its support in Lustre seems to be quite stable now.

It seems pretty much impossible to find out the exact root cause in
retrospect, as the system had been running for a long time without apparent
issues. Any ideas are welcome, though.

------
beezle
I bookmarked this for whenever I think I'm having a really bad day...

~~~
ople
Hehe. In retrospect, the whole team was in fairly good spirits although the
situation was stressful. A lot of this was due to top management giving the
specialists the time and space to do their thing, and to the very
understanding response from the customers once we explained the situation.

