The reason we lost so much data was that we only do nightly backups. That seemed enough when we started. Now that HN is a bigger part of more people's lives, we'll make more of an effort to make it proof against this sort of problem.
As a quick example, for those still not sure what I mean: what used to be an article about Python 2/3...
...becomes a comment with numerous references for how to learn arduino hacking.
Of course the community is diverse and my view is not necessarily representative of any type of majority.
You've probably all seen it by now, but from @HNStatus: 
Server back up and seemingly stable. Now restoring our latest backup to recover from limited filesystem corruption.
Good thing they're just silly internet points :)
edit: to clarify "lost" karma
Just internet points, fortunately :)
 - https://news.ycombinator.com/bestcomments
It looks like you've already exceeded that now... Well done! The internet gods must favour you.
Commented back now. Hope the thread owner reads the comments and replies.
On a more general note: if you have backups but don't regularly test restoring them, then you really don't have backups! As an added bonus, regular restoration tests let you practice for the "real deal", and you learn how long the entire process will take.
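A rough sketch of what such a restore test could look like, assuming the backup has already been restored into a scratch directory: hash every file in the live tree and in the restored tree, then report anything missing or different. The function names (`verify_restore` and friends) are made up for illustration and don't come from any particular backup tool.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: Path) -> Dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(source: Path, restored: Path) -> List[str]:
    """Return relative paths that are missing or differ in the restored tree."""
    want = tree_digests(source)
    got = tree_digests(restored)
    return sorted(name for name in want if got.get(name) != want[name])
```

An empty list means every file in the source tree came back byte-identical. Wrapping the restore plus this check in `time.perf_counter()` calls would also tell you how long the whole exercise takes. Note that it only checks files present in the source, so files changed since the backup ran will show up as differences.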
We'll never need that old repo again.
The google cache got a few comments, but very few.
I really don't want credit, I just want the original question and the couple of responses to have the opportunity to see the light of day, despite the HN outage. So please vote up the original post!!!
Indeed, but the time and CPU it takes to do backups can be a very nasty trade off.
For that matter, as I write this, due to a mistake I made after an 85-minute power outage yesterday, I'm just now doing my daily incremental backup of my home machines to an LTO-4 tape drive. Keeping that drive fed fast enough to prevent "shoe-shining" took some effort: Bacula spools up to 100G at a time to a partition on a single 15K disk on a separate controller. But if I had an LTO-5 drive, from what I've heard there's no single disk in existence that can keep up with it (not counting SSDs, which are a very poor match for this use case).
I'd like to migrate to ZFS but have yet to. Still just running EXT4.
HN should be on a replicated data store like Riak. Losing a node or two shouldn't take the system down, or should at least run in a degraded state (read only) until hardware is restored.
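To make the "degraded state (read only)" idea concrete, here is a toy in-memory sketch. It is not Riak's API or anything HN runs; `ReplicatedStore` and its fields are hypothetical, and the majority rule for writes is just one assumed policy.

```python
class ReplicatedStore:
    """Toy store: writes need a majority of replicas up, reads need only one.

    Illustrative only: there is no repair of replicas that missed writes
    while they were down, so a returning replica may hold stale data.
    """

    def __init__(self, replica_count: int = 3) -> None:
        self.replicas = [{} for _ in range(replica_count)]
        self.alive = [True] * replica_count

    def _live(self):
        """Return the replicas currently marked as up."""
        return [r for r, up in zip(self.replicas, self.alive) if up]

    @property
    def read_only(self) -> bool:
        """True when half or fewer of the replicas are up."""
        return len(self._live()) <= len(self.replicas) // 2

    def put(self, key, value) -> None:
        """Write to every live replica, or refuse if degraded."""
        if self.read_only:
            raise RuntimeError("degraded: read-only until replicas return")
        for replica in self._live():
            replica[key] = value

    def get(self, key):
        """Return the value from the first live replica that has the key."""
        for replica in self._live():
            if key in replica:
                return replica[key]
        raise KeyError(key)
```

With three replicas, losing one still allows writes; losing two leaves the store readable but rejects `put` until a replica comes back.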
edit: the code has been public for a long time, and there is not a database to replicate. the site ran as a single server for years, and it is unlikely the front end caching has changed anything about the "database" components.
Since RAID failures actually are somewhat common, they are probably looking at a higher level replicated storage system now, a la DRBD, or some kind of distributed file system, a la Gluster.
Similarly, if you rm -rf a vital directory tree, RAID can ensure that it goes away reliably.
Filesystem corruption without hardware failure is far rarer in my experience. Have you seen an instance that wasn't a proverbial user error?
Back in ~2004 I watched IT spend a whole day recovering our 60-person startup's main Linux NFS server, due to a software bug in the storage driver. Had to rebuild the whole system from backups.
HN is persisted to flat files.
Maybe that's nothing new, but I just noticed it. Seems like a bug.
It doesn't seem to do anything though.
edited: "They're" --> "Their" (there/their/they're will be the death of me!)
I sure hope that isn't literal. I've heard of "grammar nazis" but that would be ridiculous. Stay safe!
(But I think the original thread was totally lost, I submitted it and it's not listed in my submission history.)
snapshot - http://oi40.tinypic.com/2mmbv5y.jpg
Appreciate the gift of perspective that has been given.
Appreciate the gift a new perspective gives you.
EDIT: (never mind, it was just cached)
Obviously I'm a fan of the site, etc, etc, but "important"? On what level?
I'm not even sure I'd call Facebook or Twitter important. Banking, yes. Weather warnings, yes. Things like that, sure. But I'm also pretty sure "important" is slightly over-egging it for dear HN.
(No offence PG xxxx)
Imagine Twitter or Facebook being down during Egyptian revolutions.
Up-time for those types of sites is probably important for retention. But not for significant world events. But then again, Twitter kinda proved that even for retention that's not very important, given how flaky it used to be.
Which is why I have a backup of it in case this happens again.
So https://news.ycombinator.com/news works, but https://news.ycombinator.com still redirects to "Sorry for the downtime. We hope to be back soon.".
On this basis, can't you shut down pretty much anything the majority use day-to-day?
Now I'm back at 1273.
I wasn't complaining, however. :)
Edit: clarification on motivation.
Speaking of which, I think you got the wrong impression here about the motivation of your fellow HNers who are simply curious. Your doubling down by snapping at people is really not the solution either.
Take a step back and look at this thread again. I think you misinterpreted the situation. Nobody's upset because of the outage. Isn't it understandable that people are giddy to find out what happened, though?
Had your criticism been that lukeqsee is behaving lame and trollish, you might have been received much better. Personally, I couldn't care less who gets the stupid points for actually asking the question - I believe getting points for posting stories is a bug anyway.
And the question itself is so simple and minimal, it doesn't really make sense to think of it in terms of being nice or not. It seems appropriate to me to just assume by default that it is being asked nicely and leave it at that.
(I would hope nobody misunderstands, but I am entirely serious.)
Edit: Udo edited his post, so I replied.
> lame and trollish
I partially agree. On one hand, it is both. On the other hand, like I said above, someone would post the question.
> I believe getting points for posting stories is a bug anyway.
I completely agree. Actually encouraging conversation and actively adding to the conversation is much better for the community. Perhaps 25% or 50% points for posts would level the field.
I'm curious - if you don't mind my asking: what are you thanking me for?
nhangen falsely critiqued my actions as complaining and then turned my own admission against me; you fairly assessed that (IMO). You also added to the conversation with your own assessment of the broad trend, i.e., posts are worth points.
I think all of those actions are worthy of thanking.
That's not at all what you initially commented on. Even if that was your intent, you simply said something completely different and unrelated. You were judged by what you actually said and when people challenged you about it you got aggressive. I'm really sorry to go on about this, but that's honestly what it looks like from here.
One of the reasons why I chose to comment on this is that it's a mistake I made as well once or twice (getting motivations of fellow users wrong and snapping at them), and I had the good fortune of people pointing that out to me.
1) We also have to keep our site and services running. Learning from other people's (bad) experiences is always welcome.
2) After an outage of this magnitude on our site and services, a post-mortem would be expected as part of the clean-up. It's not an extra demand, it's normal.
It's not about blame and anger; it's about learning and preventing.
There's no doubt that the whole time it was down, these questions were being asked -- by PG and co. as well as by the end users.
Not that we expect immediate and concrete answers, but it's certainly expected that the question will come up.
And it's pretty common for services to post a quick update stating the issue, cause, and postmortem at a high level, once service has been restored.
I know I try to do this every time my site goes down for even a couple minutes, let alone ~24 hours, and it has nowhere near the activity of HN.
Several HNers have the same question, thus the reason it is the top-voted new post.
I know it was on my mind, just out of curiosity, and was one of the first topics I looked for when I found service had been restored.
Your post might have done better if you had gone at it this way:
"Hey guys, PG and his minions are probably flat out on a couch somewhere, breathing many sighs of relief, and slinging back a well-earned Scotch (might be a cider). Could be a while until we hear from them."