Happypenguin.org: Hard drive crash, a cautionary tale (happypenguin.org)
45 points by roschdal 2119 days ago | hide | past | web | 40 comments | favorite



For people who might check this link a few days/months later, here is the text as of Nov 8th:

Due to a hard drive failure which occurred at the same time as our backup system failed, the site is currently down. We expect to have full data recovery within the next few days. As of 9th Oct 2010 the data recovery is still ongoing. We have been told to expect a result early next week, so, fingers crossed, we are hoping to have service resumed on or around the 13th Oct.

As of 15th Oct 2010 the data recovery is still ongoing. Apparently the Western Digital HDD that died has corrupted its own firmware, making it hard to extract the data from the disc. As such, we are beginning the process of backup recovery from our most recent tier 2 backups, which are a few weeks out of date, but waiting on the tier 1 recovery is getting ridiculous. We plan on working over the weekend on this recovery, and hope to have more news for you shortly.

As of 22nd Oct 2010 data recovery is STILL ongoing. The Western Digital Caviar Green 2TB drive that failed has chemical degradation on the surface, making the data recovery much slower and harder. All going well, we will have the older backups making sense soon, and will be back up to speed.

As of Nov 4th, we have been unable to bring the site back up yet, due to the discovery that bringing the older backups online would leave a number of users unable to access their games due to copy protection issues. As such, rather than give ANY user a problem with their legally purchased game, we will be keeping the site down for the next few days while the final stages of data recovery are performed. All going well, we hope to be back up towards the start of next week.


As of Nov 8th, we have a report from the data recovery company of a complete recovery. We should be receiving the recovered data on Wednesday 10th and we can then begin the process of finally getting things back to normal around here.

As of Nov 12th, we have now received the data recovery files, but there has been a fair amount of filesystem damage, and it will take us some time to locate the files, rebuild the filesystem, and get everything back on its feet. We are still hoping for being back online next week, all going well.

As of Nov 15th, we have begun database reconstruction. As there was significant filesystem damage, this will take a while, as we need to manually sanity-check the database records, and there are a lot of them. We still hope to have things up and running this week.


The drive in question sounds like the 2TB bargain that New Egg regularly offers deals on: (http://www.newegg.com/Product/Product.aspx?Item=N82E16822136...)

I've got three of these in a JBOD server, running a couple of months without incident. They're cheap, quiet, and low-power. They're not high-performance, however, and not ideally suited to RAID applications. Also, given the new sector size you can get dreadful performance from them if you aren't careful in setting your partition boundaries.
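The alignment point is easy to check for yourself: these Advanced Format drives use 4096-byte physical sectors, so any partition whose starting LBA (counted in 512-byte logical sectors) isn't a multiple of 8 will straddle physical sectors on every write. A minimal sketch (the sector numbers are just examples):

```shell
# A start LBA is 4 KiB-aligned when (start * 512) % 4096 == 0,
# i.e. when the sector number is a multiple of 8.
check_alignment() {
    start=$1
    if [ $(( start * 512 % 4096 )) -eq 0 ]; then
        echo "sector $start: aligned"
    else
        echo "sector $start: MISALIGNED"
    fi
}

check_alignment 63      # old DOS-era default start sector: misaligned
check_alignment 2048    # modern 1 MiB default: aligned
```

On a live system you can feed this the values from `fdisk -l` or `cat /sys/block/sdX/sdX1/start`.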

Having suffered the consequences of ignoring data safety rules in the past, I now make a practice of having three copies of important data in at least two separate locations. I'm frankly astonished that there apparently wasn't a differential backup process at the company. I mean, a month's down time, WTF? These hard disks are pretty darn cheap backup media if you've got lots of data (e.g. video files), but after this item I'm wondering if biodiversity should be among my archive criteria.

That the drives in question had "chemical degradation on the surface making the data recovery much slower and harder" is a puzzler to me. Perhaps there's something about today's drive materials I'm unaware of, but my CYA radar went off on that choice of words: someone screwed up during the recovery process would be my first suspicion.


I've personally had nothing but trouble from the WD Green series. One of their energy-saving features is to spin down sooner, which, at least for my usage, just increased how often they spun back up.

I also have no idea how they set up their server, but backup issues aside, why is an e-commerce site with subscription plans running its database, media, and website off the same hard drive? The 2TB drive sounds fantastic for the game files, but there is no reason that should be tied to the database that stores user information.

Backing up your data in case of these issues is a very high priority, but you should also configure your servers so that a single failure can't take down every piece of your site.


Diversity is a good idea. I have a TimeCapsule which automatically backs up my Macbook once an hour, but I still put important files in my Dropbox folder so I have a 3rd offsite backup of them too.


Is there anything remotely as convenient as Dropbox, but with sane, client-side crypto? As far as I know, Dropbox encrypts the data, but on the server, which, as far as I'm concerned, is about as good as storing it in plaintext.

Alternatively/even better: is there a remote mirroring filesystem and/or block storage system which allows reasonable client-side crypto over a slow asymmetric (ADSL) link? I have a local backup server; it would be great if I could mirror that into a server in a datacentre somewhere. Has anyone tried this kind of thing with lessfs, dm-crypt and drbd?


I think this is what TarSnap http://www.tarsnap.com/ is meant to be for.


As far as I can tell, Tarsnap is only for archives; it doesn't give you random-access, mountable-as-a-filesystem style convenience. For backups, Tarsnap is definitely the correct architecture; for syncing, not so much.


Correct. Theoretically the Tarsnap client-server protocol could be used to synthesize a mountable filesystem, but that would add hard latency requirements -- one of the great advantages of Tarsnap's transactional archive-creation is that it can tolerate high latency and even requests failing with little impact.


You could combine encfs with Dropbox. Should work quite well as it uses one encrypted file per real file.
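A minimal sketch of that combination, assuming encfs is installed and Dropbox syncs ~/Dropbox (the directory names are just examples):

```shell
# Ciphertext lives inside the synced folder; plaintext is mounted
# locally via FUSE. The first run prompts for a password and
# creates the encrypted store.
encfs ~/Dropbox/encrypted ~/private

# Work with files in ~/private; Dropbox only ever sees the
# encrypted per-file counterparts in ~/Dropbox/encrypted.

# Unmount when done:
fusermount -u ~/private
```

Because encfs maps each real file to one encrypted file, Dropbox's per-file sync still works; the trade-off is that file count and approximate sizes remain visible.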


OK, I'll give that a go too, thanks!

EDIT: I guess one disadvantage of this approach is that the file system structure is exposed. Convenient though.



That looks pretty cool, thanks! Going to read up on the tech behind this and test it out.


Also: DRM, a cautionary tale (unless I misunderstood what they mean by that - I don't know what happypenguin.org actually offered):

"the older backups being brought online would end up with a number of users being unable to access their games due to copy protection issues"


I always knew happypenguin.org as a site that had reviews of linux games. DRM issues sound weird to my ears, but maybe they offered some Steam-like service for downloadable games.


They're a non-profit bringing happiness to penguins?


No, they offer a list of linux games, a forum for the discussion of linux games, and a store for buying linux games.


Ma.gnolia anyone? http://www.wired.com/epicenter/2009/01/magnolia-suffer/

When our backups or replication fail I usually say we are at Ma.gnolia threat level orange :)


Don't forget the big CouchSurfing (http://techcrunch.com/2006/06/29/couchsurfing-deletes-itself...) crash. They appear to have recovered from it though.


Offsite backup services created by HNers:

http://www.tarsnap.com/

http://www.haystacksoftware.com/arq/

https://spideroak.com/

(I co-founded SpiderOak in 2006.)


The problem with tarsnap is that it's run by one (extremely competent) guy. Unfortunately, if he dies/is disappeared/suffers a mental breakdown/etc your backups may well be toast and unrecoverable.


I have been at places that went through this same routine: backup after backup found to have failed, even though all of them had recently been checked with test restores.

Another thing that likes to happen: even if you have a good backup, the psychological stress means the first two or three restoration attempts are likely to accidentally erase the backups when someone types an incorrect command.


So true. I lost ~3h of work because I somehow screwed up the hg repo holding my MSc dissertation... on the hand-in day. Then, in a panic, I tried to restore it - and of course I swapped the rsync source and destination, overwriting my backups before I could stop the sync (it also ignored modification times - yay).

Fortunately I had another copy on Bitbucket. It wasn't the latest version, so I needed another day to redo all the document formatting, but I'm not sure how this would have ended otherwise.


I've had 5 years of graduate research literally walk out the door when someone decided to steal the workstation out of my office at the beginning of 2000. Thankfully everything was being backed up remotely but it took over a month before a successful restore from tape. I'm guessing it was the first time the department's backup system was put to an actual "test" and during that month it was questionable whether or not I'd get anything back.

Not the best thing to happen right before you're set to write your PhD dissertation.


I almost lost half my backups this year.

That panic is what got me: right when things don't need to get any worse, they do.

After all the frustration of suffering through slow-as-molasses copy operations (they had to be, because of checksumming), I was wondering to myself, 'how can I justify spending so much time on this?'...

Ultimately it paid off & a bad thing wasn't allowed to get worse because of it, and the third copy saved the day.

My infrastructure is more resilient now, and I'm less likely to make concessions with regard to redundancy later.

I can still feel overwhelmed when I think about all the time & potential progress I lost recovering, but when I read this article I totally understand & it takes the edge off some. I'm glad for them that all wasn't lost.

Some of the good that came from it all was that I found an app called SyncBackPro to pull backups from Windows machines into an offline rotation.

Unlike others I saw, it can implement logic that says: when this particular USB drive is mounted as drive foo, trigger job bar, then run this script and eject. Plus it does write checksumming, all for under $100.


This keeps happening ... make sure to backup, and then test your backup. An easy way to do this is to run your dev environment on a backup snapshot.
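A sketch of that practice, assuming nightly dumps land in /backups and the database is Postgres (the paths and database names are made up for illustration):

```shell
# Rebuild the dev database from last night's production backup
# every morning; if the restore breaks, you find out long before
# you actually need it in an emergency.
latest=$(ls -t /backups/prod-*.dump | head -n 1)

dropdb --if-exists devdb
createdb devdb
pg_restore --no-owner -d devdb "$latest"
```

Run it from cron and your daily development work becomes a continuous restore test.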


Having a mirrored RAID array would have helped here as well, as long as you don't mistake it for a backup solution.


And given how cheap hard drives are (and how likely they are to eventually fail) it seems like that would almost always be a good investment.


Indeed. My /home directory is on a 1TB 3-way RAID 1. Because I am that paranoid about disk failure.

Interestingly, all three disks failed, because Newegg does not know how to ship them properly. Never buying disks from them again. The good news is that only two failed at the same time, so there was no data loss. And the replacement disks from the manufacturer, Samsung, have been chugging along for months just fine.

RAID on your desktop is nice. RAID on your server is mandatory. The total cost was $300 for one fucking terabyte. Just do it :)


I've purchased many hard drives from NewEgg; those same hard drives have been through two moves now and they are still going strong. Not a single bad sector or anything along those lines.


OEM or retail? Retail disks have better packaging, but I bought OEM disks.

Mine came in plastic egg cartons with a few packing peanuts. I'm surprised they didn't shatter on their way across the country.


OEM, never retail, since retail is always more expensive. Yep, same plastic egg cartons with a ton of packing peanuts. Received eight 1TB drives.


I agree with that: the cost of a few extra mirror disks is negligible compared to the amount of time one loses rebuilding a machine (OS + applications + configuration) from scratch.

Yet, all too often I come across people who don't want RAID1 because they can save $60 by not installing that second disk.


And if you are using Linux LVM RAID for your root partition, you should practice failing a drive. It works, but it's best to find out how to deal with it before you have to do some panic googling.


Also, I strongly recommend setting up mdadm in daemon mode to send you an email on failure as you can go days or weeks without noticing a degraded array. It's really easy to set up in most distros, so do it now (and don't forget to test it).
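A sketch of both suggestions using mdadm (the device names are examples, and the failure drill should only be run on a test array):

```shell
# Practice a drive failure and recovery:
mdadm --manage /dev/md0 --fail /dev/sdb1      # mark the disk faulty
mdadm --manage /dev/md0 --remove /dev/sdb1    # pull it from the array
mdadm --manage /dev/md0 --add /dev/sdb1       # re-add it; resync begins
cat /proc/mdstat                              # watch the rebuild progress

# Email alerts on degradation: add your address to /etc/mdadm/mdadm.conf
#   MAILADDR admin@example.com
# then run the monitor as a daemon, and send a test alert:
mdadm --monitor --scan --daemonise
mdadm --monitor --scan --test --oneshot       # one test message per array
```

The `--test` run is the part people skip; a monitor that can't actually deliver mail is no better than none.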


If anything, it's better to have an over-zealous backup policy.


And the classic LeafyHost saga: http://arstechnica.com/civis/viewtopic.php?f=25&t=238085 (you might want to look for a summary; it's a long thread)

At least HappyPenguin sent their disk for real recovery...


Whoa, I just visited this page 4 hours before I saw it here on the HN front page. I used to visit it weekly back in 2003 and 2004 when I had much more spare time. I hope everything goes fine for them.

Anyway, where I work we use a controller with 4 attached SAS hard disks in RAID, and we also take backups with Bacula, just to stay safe.


There comes a point where "backup" does not mean another hard drive; it means another server (preferably running as a replica, so that all you have to do to recover is change a router entry).

My bet is that Yoda would encourage you to know when this is.


I read this, logged into the AWS console and grabbed a fresh snapshot of my EBS volumes. Problem solved.
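The console click-through has a scriptable equivalent, for anyone who would rather automate it (this assumes a configured AWS CLI, and the volume ID is a placeholder):

```shell
# Snapshot an EBS volume. For a crash-consistent image of a busy
# volume, quiesce or freeze the filesystem first (e.g. fsfreeze).
aws ec2 create-snapshot \
    --volume-id vol-0123456789abcdef0 \
    --description "manual backup of data volume"
```

Put it in cron and you no longer depend on remembering to read cautionary tales.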



