Due to a hard drive failure which occurred at the same time as our backup system failed, the site is currently down. We expect to have full data recovery within the next few days.
As of 9th Oct 2010 the data recovery is still ongoing. We have been told to expect a result early next week, so crossed fingers, we are hoping to have service resumed on or around the 13th Oct.
As of 15th Oct 2010 the data recovery is still ongoing. Apparently the Western Digital HDD that died has corrupted its own firmware, making it hard to extract the data from the disc.
As such, we are beginning the process of recovering from our most recent tier 2 backups. These are a few weeks out of date, but waiting on the tier 1 recovery is getting ridiculous. We plan on working over the weekend on this recovery, and hope to have more news for you shortly.
As of 22nd Oct 2010 data recovery is STILL ongoing. The Western Digital Caviar Green 2TB drive that failed has chemical degradation on the surface, making the data recovery much slower and harder.
All going well, we will have the older backups making sense soon, and will be back up to speed.
As of Nov 4th, we have been unable to bring the site back up yet, due to the discovery that the older backups being brought online would end up with a number of users being unable to access their games due to copy protection issues. As such, rather than give ANY user a problem with their legally purchased game, we will be keeping the site down for the next few days while the final stages of data recovery are performed. All going well, we hope to be back up towards the start of next week.
As of Nov 12th, we have now received the data recovery files, but there has been a fair amount of filesystem damage, and it will take us some time to locate the files, rebuild the filesystem, and get everything back on its feet. We are still hoping to be back online next week, all going well.
As of Nov 15th, we have begun database reconstruction. As there was significant filesystem damage, this will take a while, as we need to manually sanity check the database records, and there are a lot of them. We still hope to have things up and running this week.
I've got three of these in a JBOD server, running a couple of months without incident. They're cheap, quiet, and low-power. They're not high-performance, however, and not ideally suited to RAID applications. Also, given the new sector size you can get dreadful performance from them if you aren't careful in setting your partition boundaries.
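To make the partition boundary point concrete, here is a minimal sketch of the alignment arithmetic. It assumes the drive exposes 512-byte logical sectors on top of 4096-byte physical sectors; the function names are made up for illustration, and you should check what your own drive actually reports before relying on these numbers.

```python
# Assumed sizes: 512-byte logical sectors emulated on 4096-byte physical sectors.
LOGICAL_SECTOR_BYTES = 512
PHYSICAL_SECTOR_BYTES = 4096


def is_aligned(start_sector: int,
               logical_bytes: int = LOGICAL_SECTOR_BYTES,
               physical_bytes: int = PHYSICAL_SECTOR_BYTES) -> bool:
    """Return True if a partition starting at this logical sector
    begins exactly on a physical sector boundary."""
    return (start_sector * logical_bytes) % physical_bytes == 0


def next_aligned_sector(start_sector: int,
                        logical_bytes: int = LOGICAL_SECTOR_BYTES,
                        physical_bytes: int = PHYSICAL_SECTOR_BYTES) -> int:
    """Round a logical start sector up to the next physical boundary."""
    per_physical = physical_bytes // logical_bytes
    # Ceiling division, then scale back to logical sectors.
    return -(-start_sector // per_physical) * per_physical


if __name__ == "__main__":
    for start in (63, 64, 2048):
        status = "aligned" if is_aligned(start) else "misaligned"
        print(f"start sector {start}: {status}")
```

A partition starting at logical sector 63, the traditional default in older partitioning tools, straddles physical sectors under these assumptions, so small writes turn into read-modify-write cycles. Starting at sector 64 or 2048 avoids that.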
Having suffered the consequences of ignoring data safety rules in the past, I now make a practice of having three copies of important data in at least two separate locations. I'm frankly astonished that there apparently wasn't a differential backup process at the company. I mean, a month's down time, WTF? These hard disks are pretty darn cheap backup media if you've got lots of data (e.g. video files), but after this item I'm wondering if biodiversity should be among my archive criteria.
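The "three copies in at least two locations" rule is easy to state as a check. The sketch below is only an illustration: the `BackupCopy` type, the labels, and the location names are hypothetical, and a real inventory would come from whatever your backup tooling reports.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BackupCopy:
    label: str      # hypothetical name for the copy, e.g. "nas"
    location: str   # hypothetical physical site, e.g. "home"


def meets_policy(copies: Iterable[BackupCopy],
                 min_copies: int = 3,
                 min_locations: int = 2) -> bool:
    """Return True if there are enough copies spread over enough sites."""
    copies = list(copies)
    locations = {c.location for c in copies}
    return len(copies) >= min_copies and len(locations) >= min_locations


if __name__ == "__main__":
    inventory = [
        BackupCopy("workstation", "home"),
        BackupCopy("nas", "home"),
        BackupCopy("offsite-disk", "office"),
    ]
    print("policy met:", meets_policy(inventory))
```

Adding a vendor or drive-model field and requiring more than one distinct value would cover the "biodiversity" criterion in the same way.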
That the drives in question had "chemical degradation on the surface making the data recovery much slower and harder" is a puzzler to me. Perhaps there's something about today's drive materials I'm unaware of, but my CYA radar went off on that choice of words: someone screwed up during the recovery process would be my first suspicion.
I also have no idea how they set up their server, but backup issues aside, why is an e-commerce site with subscription plans running its database, media and website off the same hard drive? The 2TB drive sounds fantastic for the game files, but there is no reason it should be tied to the database, which has user information stored on it.
Backing up your data in case of these issues is a very high priority, but you should also consider configuring your servers so that no single failure can take down every piece of your site.
Is there anything remotely as convenient as Dropbox, but with sane, client-side crypto? As far as I know, Dropbox encrypt the data, but on the server, which, as far as I'm concerned, is about as good as storing it in plaintext.
Alternatively/even better: is there a remote mirroring filesystem and/or block storage system which allows reasonable client-side crypto over a slow asymmetric (ADSL) link? I have a local backup server, it would be great if I could mirror that into a server in a datacentre somewhere. Has anyone tried this kind of thing with lessfs, dm-crypt and drbd?
EDIT: I guess one disadvantage of this approach is that the file system structure is exposed. Convenient though.
"older backups being brought online would end up with a number of users being unable to access their games due to copy protection issues"
When our backups or replication fail I usually say we are at Ma.gnolia threat level orange :)
(I co-founded SpiderOak in 2006.)
Another thing that likes to happen is that even if you have a good backup, psychological stress makes it likely that the first two or three restoration attempts will accidentally erase the backups because someone types an incorrect command.
Fortunately I had another copy on bitbucket. Not the latest version, so I needed another day to finish all document formatting.... but I'm not sure how this would end otherwise.
Not the best thing to happen right before you're set to write your PhD dissertation.
That panic is what got me: right when things don't need to get any worse, they do.
After all the frustration of suffering through slow-as-molasses copy operations because they had to be done with checksumming, wondering to myself 'how can I justify spending so much time on this?'...
Ultimately it paid off & a bad thing wasn't allowed to get worse because of it, and the third copy saved the day.
My infrastructure is more resilient now, and I'm less likely to make concessions with regard to redundancy later.
I can still feel overwhelmed when I think about all the time & potential progress I lost recovering, but when I read this article I totally understand & it takes the edge off some. I'm glad for them that all wasn't lost.
Some of the good that came from it all was that I found an app called SyncBackPro to pull backups from Windows machines onto an offline sequence.
Unlike others I saw, it can implement logic that says: when this serial USB storage is mounted on drive foo, trigger job bar, then run this script & eject. Plus it does write checksumming, for <$100.
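For anyone wondering what a checksummed copy boils down to, here is a rough sketch using only the Python standard library. It is not how SyncBackPro works internally (I don't know that); the function names are made up, and SHA-256 is simply one reasonable choice of hash.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(src, dst):
    """Copy src to dst, then re-read both files and compare their hashes.

    Returns the hex digest on success, raises OSError on a mismatch.
    """
    shutil.copy2(src, dst)  # copies file contents plus timestamps
    src_sum = sha256_of(src)
    dst_sum = sha256_of(dst)
    if src_sum != dst_sum:
        raise OSError(f"checksum mismatch copying {src} to {dst}")
    return dst_sum
```

The extra read passes are exactly why these copies feel slow. One caveat: reading the destination straight after writing may be served from the OS cache rather than the disk, so for removable media it is safer to verify again after remounting.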
Interestingly, all three disks failed, because Newegg does not know how to ship them properly. Never buying disks from them again. The good news is that only two failed at the same time, so there was no data loss. And the replacement disks from the manufacturer, Samsung, have been chugging along for months just fine.
RAID on your desktop is nice. RAID on your server is mandatory. The total cost was $300 for one fucking terabyte. Just do it :)
Mine came in plastic egg cartons with a few packing peanuts. I'm surprised they didn't shatter on their way across the country.
Yet, all too often I come across people who don't want RAID1 because they can save $60 by not installing that second disk.
At least HappyPenguin sent their disk for real recovery...
Anyway, where I work we use a controller with 4 attached SAS hard disks working in RAID, and we also make backups using Bacula just to remain safe.
My bet is that Yoda would encourage you to know when this is.