
Postmortem of last week's fileserver failure - mnemonik
http://github.com/blog/666-postmortem-of-last-week-s-fileserver-failure
======
bdr
Server-side data loss always sucks, but this would be much harder to recover
from with a non-distributed VCS. Another win for the D.

------
oomkiller
I absolutely love postmortems. I think it's because I love seeing the complex
ways our dear friend Murphy works. Also, it's really nice to be able to see
problems other people run into, so you can avoid them or plan for them
yourself.

------
steamboiler
I find candid postmortems indicative of a dependable service. Good work,
GitHub.

------
bcl
THIS is how you handle failure. You learn from it, and let others learn as
well.

------
sh1mmer
I love how up front those guys are.

------
Andys
This is a classic example of how the naive plan of 'Get two servers and
mirror them' can lead to downtime regardless of the best intentions of the
sysadmin.

~~~
Loic
Maybe the best comment about this issue.

The problem with the GitHub RAID approach + daily snapshot is that you end up
with only the equivalent of a daily backup, and you have no assurance of data
integrity.

Each of their file storage pairs runs RAID plus RAID over the network (DRBD).
RAID is not a backup, because when you get data corruption you just replicate
the corruption.

Imagine that part of their A server has issues which, with the current
FS+RAID setup, are simply ignored. You can run for months without an
integrity check, even with Git, because if you only add new objects and never
need to repack, no integrity check is ever performed on the old objects. So
the corruption spreads from A to B (RAID) and then to the backup (daily
snapshot). Then, when you need to check out at home because you lost your
hard drive, you are doomed: full data loss without a single warning. If you
are a single developer on your own private project, this scenario is easy to
reach.
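
As a sketch, a nightly job that runs `git fsck --full` over every hosted
repository would catch this kind of silent corruption before it reaches the
mirror and the snapshot. The storage path here is purely illustrative, not
GitHub's actual layout:

    #!/usr/bin/env python3
    # Hypothetical nightly integrity sweep: run `git fsck` over every bare
    # repository under a storage root, so corruption is caught before it is
    # replicated to the mirror and the daily snapshot.
    import os
    import subprocess
    import sys

    STORAGE_ROOT = "/srv/git"  # illustrative path, not GitHub's real layout

    corrupt = []
    for name in sorted(os.listdir(STORAGE_ROOT)):
        if not name.endswith(".git"):
            continue
        repo = os.path.join(STORAGE_ROOT, name)
        # `git fsck --full` re-hashes every object, so silent bit rot shows
        # up as a checksum mismatch even for old, untouched objects.
        result = subprocess.run(["git", "--git-dir", repo, "fsck", "--full"],
                                capture_output=True, text=True)
        if result.returncode != 0:
            corrupt.append((repo, result.stderr.strip()))

    for repo, err in corrupt:
        print(f"CORRUPT: {repo}\n{err}", file=sys.stderr)
    sys.exit(1 if corrupt else 0)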

I really, really like the way Fastmail does backups: they replay the IMAP
operations with checksums. With Git you can replay with checksums too; in
fact, it is built in, thanks to the DSCM nature of Git.

For the backup of my customers' git repositories, this is what I do: in the
post-update hook, I fire a git sync with another server. Git does everything
needed while staying CPU- and bandwidth-efficient. Thank you, git.

The easy way to do it:
<http://kerneltrap.org/mailarchive/git/2007/10/18/346839>
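
A minimal sketch of such a hook, assuming a bare repository with a
pre-configured remote named "backup" (both the remote name and the URL below
are illustrative):

    #!/usr/bin/env python3
    # Hypothetical post-update hook (hooks/post-update in a bare repository):
    # after every push, mirror all refs to a second server. The remote name
    # "backup" is an assumption and must be configured beforehand, e.g.:
    #   git remote add backup ssh://backup-host/srv/git/project.git
    import subprocess
    import sys

    result = subprocess.run(["git", "push", "--mirror", "backup"],
                            capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the failure in the push output rather than failing silently.
        sys.stderr.write("backup sync failed:\n" + result.stderr)
        sys.exit(result.returncode)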

------
jolan
> Prior to the failure, we had been seeing some anomalous behaviour from the A
> server in this pair. The machine had been taken offline and memory tests
> performed, but no issue was found.

Presumably the anomalous behavior was disk-related, and yet they only tested
memory.

Why not check the SMART status? Or, if the RAID driver doesn't support SMART
passthrough, move the drives to the SATA controller and boot off of a flash
drive.
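
For example, a small sweep with smartctl (from smartmontools) could flag a
failing drive; the device names below are assumptions, and the output check
assumes ATA-style drives:

    #!/usr/bin/env python3
    # Hypothetical SMART health sweep using smartctl from smartmontools.
    # Device names are illustrative; the script needs privileges to query
    # the drives, and ATA-style "PASSED" output is assumed.
    import subprocess
    import sys

    DRIVES = ["/dev/sda", "/dev/sdb"]  # assumed device names

    failing = False
    for drive in DRIVES:
        # `smartctl -H` asks the drive for its overall health self-assessment;
        # smartctl exits non-zero if the drive reports failure or the query
        # itself fails.
        result = subprocess.run(["smartctl", "-H", drive],
                                capture_output=True, text=True)
        if result.returncode != 0 or "PASSED" not in result.stdout:
            print(f"{drive}: SMART health check did not pass", file=sys.stderr)
            failing = True

    sys.exit(1 if failing else 0)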

Also, there are tools to generate lots of I/O and test for failures, e.g.:

<http://www.peereboom.us/iogen/>

------
tomjen3
Damn, that is sloppy. GitHub is, essentially, a glorified file storage place,
and they don't have a way to ensure that corruption doesn't happen?

