For people who might check this link a few days or months later, here is the text as of Nov 8th:
Due to a hard drive failure which occurred at the same time as our backup system failed, the site is currently down. We expect to have full data recovery within the next few days.
As of 9th Oct 2010 the data recovery is still ongoing. We have been told to expect a result early next week, so, fingers crossed, we are hoping to have service resumed on or around the 13th Oct.
As of 15th Oct 2010 the data recovery is still ongoing. Apparently the Western Digital HDD that died has corrupted its own firmware, making it hard to extract the data from the disk.
As such, we are beginning the process of backup recovery from our most recent tier 2 backups which are a few weeks out of date, but waiting on the tier 1 recovery is getting ridiculous. We plan on working over the weekend on this recovery, and hope to have more news for you shortly.
As of 22nd Oct 2010 data recovery is STILL ongoing. The Western Digital Caviar Green 2TB drive that failed has chemical degradation on the surface, making the data recovery much slower and harder.
All going well, we will have the older backups making sense soon, and will be back up to speed.
As of Nov 4th, we have been unable to bring the site back up yet, due to the discovery that bringing the older backups online would leave a number of users unable to access their games due to copy protection issues. Rather than give ANY user a problem with their legally purchased game, we will be keeping the site down for the next few days while the final stages of data recovery are performed. All going well, we hope to be back up towards the start of next week.
As of Nov 8th, we have a report from the data recovery company of a complete recovery. We should be receiving the recovered data on Wednesday 10th and we can then begin the process of finally getting things back to normal around here.
As of Nov 12th, we have now received the data recovery files, but there has been a fair amount of filesystem damage, and it will take us some time to locate the files, rebuild the filesystem, and get everything back on its feet. We are still hoping to be back online next week, all going well.
As of Nov 15th, we have begun database reconstruction. As there was significant filesystem damage, this will take a while, since we need to manually sanity check the database records, and there are a lot of them. We still hope to have things up and running this week.
I've got three of these in a JBOD server, running a couple of months without incident. They're cheap, quiet, and low-power. They're not high-performance, however, and not ideally suited to RAID applications. Also, given the new 4K (Advanced Format) sector size, you can get dreadful performance from them if you aren't careful in setting your partition boundaries.
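For context, the alignment problem comes from the drive exposing 512-byte logical sectors on top of 4 KiB physical ones: a partition whose start sector isn't divisible by 8 forces read-modify-write cycles on every write. A quick sanity check is easy to script (the sector numbers below are just illustrative):

```shell
# With 512-byte logical sectors, a partition is 4K-aligned when its
# start sector is divisible by 8 (8 * 512 = 4096 bytes).
check_alignment() {
    start="$1"  # partition start, in 512-byte sectors
    if [ $(( start % 8 )) -eq 0 ]; then
        echo "start sector $start: aligned"
    else
        echo "start sector $start: MISALIGNED"
    fi
}

check_alignment 2048   # modern partitioning default (1 MiB offset): aligned
check_alignment 63     # classic old fdisk default: misaligned on 4K drives
```

The start sector of an existing partition can be read from `/sys/block/<disk>/<partition>/start` or from `fdisk -l` output.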
Having suffered the consequences of ignoring data safety rules in the past, I now make a practice of having three copies of important data in at least two separate locations. I'm frankly astonished that there apparently wasn't a differential backup process at the company. I mean, a month's down time, WTF? These hard disks are pretty darn cheap backup media if you've got lots of data (e.g. video files), but after this item I'm wondering if biodiversity should be among my archive criteria.
That the drive in question had "chemical degradation on the surface making the data recovery much slower and harder" is a puzzler to me. Perhaps there's something about today's drive materials I'm unaware of, but my CYA radar went off at that choice of words: my first suspicion would be that someone screwed up during the recovery process.
I've personally had nothing but trouble from the WD Green series. One of their energy saving features is to spin down sooner, which at least for my usage, just increased how often they spun back up.
I also have no idea how they set up their server, but backup issues aside, why is an e-commerce site with subscription plans running its database, media, and website off the same hard drive? The 2TB drive sounds fantastic for the game files, but there is no reason it should share a drive with the database that stores user information.
Backing up your data against these failures is a very high priority, but configuring your servers so that a single failure can't take down every piece of your site should also be considered.
Is there anything remotely as convenient as Dropbox, but with sane, client-side crypto? As far as I know, Dropbox encrypts the data, but on the server, which, as far as I'm concerned, is about as good as storing it in plaintext.
Alternatively/even better: is there a remote mirroring filesystem and/or block storage system which allows reasonable client-side crypto over a slow asymmetric (ADSL) link? I have a local backup server, it would be great if I could mirror that into a server in a datacentre somewhere. Has anyone tried this kind of thing with lessfs, dm-crypt and drbd?
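Short of a proper encrypted mirroring filesystem, one low-tech approach is to encrypt locally before anything leaves the machine and mirror only the ciphertext. A minimal sketch with `openssl enc` (the filenames and passphrase file are placeholders, and this gives you batch encryption, not the mountable, random-access convenience asked about above):

```shell
# Stand-ins for the real archive and key material (placeholder names).
printf 'important data\n' > backup.tar
printf 'correct horse battery staple\n' > passphrase.txt

# Encrypt client-side; the remote end only ever sees ciphertext.
openssl enc -aes-256-cbc -pbkdf2 -salt \
    -in backup.tar -out backup.tar.enc \
    -pass file:passphrase.txt

# Mirror just the encrypted file, e.g.:
#   rsync backup.tar.enc user@remote:/backups/

# Decrypting later, back on the client:
openssl enc -d -aes-256-cbc -pbkdf2 \
    -in backup.tar.enc -out restored.tar \
    -pass file:passphrase.txt
```

Since only whole encrypted blobs cross the wire, this also behaves tolerably over a slow ADSL uplink, though you lose delta transfers: any change re-uploads the whole archive.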
As far as I can tell, Tarsnap is only for archives; it doesn't give you random-access, mountable-as-a-filesystem convenience. For backups, Tarsnap is definitely the correct architecture; for syncing, not so much.
Correct. Theoretically the Tarsnap client-server protocol could be used to synthesize a mountable filesystem, but that would add hard latency requirements -- one of the great advantages of Tarsnap's transactional archive-creation is that it can tolerate high latency and even requests failing with little impact.
Indeed. My /home directory is on a 1TB 3-way RAID 1. Because I am that paranoid about disk failure.
Interestingly, all three disks failed, because Newegg does not know how to ship them properly. Never buying disks from them again. The good news is that only two failed at the same time, so there was no data loss. And the replacement disks from the manufacturer, Samsung, have been chugging along for months just fine.
RAID on your desktop is nice. RAID on your server is mandatory. The total cost was $300 for one fucking terabyte. Just do it :)
Also, I strongly recommend setting up mdadm in daemon mode to send you an email on failure, as you can otherwise go days or weeks without noticing a degraded array. It's really easy to set up in most distros, so do it now (and don't forget to test it).
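For anyone who hasn't done this, the pieces are a `MAILADDR` line in mdadm's config and the monitor daemon; a sketch, assuming a Debian-style config path and a placeholder address (check your distro's defaults, as many start the monitor for you automatically once `MAILADDR` is set):

```shell
# In /etc/mdadm/mdadm.conf (some distros use /etc/mdadm.conf instead):
#   MAILADDR admin@example.com     <- placeholder; where alerts are sent

# Run the monitor as a daemon, watching every array found by --scan:
#   mdadm --monitor --scan --daemonise

# Verify mail delivery actually works by sending a test alert per array:
#   mdadm --monitor --scan --test --oneshot
```

The `--test` run is the part people skip; an alert path that has never delivered a single email is no better than no alert path.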
I have been at places that went through this same routine, with backup after backup found to have failed, even though all of them had recently been checked with test restores.
Another thing that likes to happen: even when you have a good backup, psychological stress makes it likely that the first two or three restoration attempts will accidentally erase the backups because someone types an incorrect command.
So true. I lost ~3h of work because I somehow screwed up the hg repo holding my MSc dissertation... on hand-in day. Then, in a panic, I tried to restore it; of course I swapped the rsync source and destination and overwrote my backups before I could stop the sync (I had also ignored modification times, yay).
Fortunately I had another copy on Bitbucket. Not the latest version, so I needed another day to redo all the document formatting... but I'm not sure how this would have ended otherwise.
I've had 5 years of graduate research literally walk out the door when someone stole the workstation from my office at the beginning of 2000. Thankfully everything was being backed up remotely, but it took over a month before a successful restore from tape. I'm guessing it was the first time the department's backup system was put to an actual "test", and during that month it was questionable whether or not I'd get anything back.
Not the best thing to happen right before you're set to write your PhD dissertation.
That panic is what got me, right when things don't need to get any worse, they do.
After all the frustration of suffering through slow-as-molasses copy operations, because they had to be done with checksumming, I kept wondering to myself, 'how can I justify spending so much time on this?'...
Ultimately it paid off & a bad thing wasn't allowed to get worse because of it, and the third copy saved the day.
My infrastructure is more resilient now, and I'm less likely to make concessions with regard to redundancy later.
I can still feel overwhelmed when I think about all the time & potential progress I lost recovering, but when I read this article I totally understand & it takes the edge off some. I'm glad for them that all wasn't lost.
Some of the good that came from it all was that I found an app called SyncBackPro to pull backups from Windows machines onto an offline rotation.
Unlike others I saw, it can implement logic that says: when this particular USB drive (identified by serial number) is mounted as drive foo, trigger job bar, then run this script and eject. Plus it does write checksumming, for under $100.
Whoa, I just visited this page 4 hours before I saw it here on the HN front page. I used to visit it weekly back in 2003 and 2004, when I had much more spare time. I hope everything goes well for them.
Anyway, where I work we use a controller with 4 attached SAS hard disks in RAID, and we also take backups with Bacula, just to stay safe.