Ask PG: Postmortem of the outage?
330 points by lukeqsee on Jan 7, 2014 | 132 comments
Clarification: This is not meant with any ill-will towards PG or any of the other individuals who help run HN. It is a simple request for a postmortem eventually. Perhaps it's an unneeded request, but I think a lot of HNers echo the sentiment.

I don't know the details. Nick Sivo is in charge of this stuff, and he'll post something about it. I know he thinks the root of the problem was a disk failure. The server got wedged, and when we rebooted, the file system was corrupted. I'm not sure exactly why it took so long to restore. I was out of town the whole time this was happening.

The reason we lost so much data was that we only do nightly backups. That seemed enough when we started. Now that HN is a bigger part of more people's lives, we'll make more of an effort to make it proof against this sort of problem.

Another point to make is that, even with a lot of data loss, it would be good in the future to at least skip a bunch of post identifiers (potentially even just saying "well, we certainly didn't use 100,000 of them, so I skipped up to 7100000"). That way, data that used to be there, and that may have been archived by various sources including Google Cache or the various reader clients people use on various devices, would not suddenly be swapped out for different posts, leading to widespread cache corruption (and further confusion or loss). I mean, even just for the sake of people who may have posted links to this content in various places: maybe they wrote an article or posted a tweet/comment somewhere referencing how great/horrible a post was, and now suddenly it is saying something entirely different and potentially quite awkward ;P.

As a quick example, for those still not sure what I mean: what used to be an article about Python 2/3...


...becomes a comment with numerous references for how to learn arduino hacking.
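A minimal sketch of the identifier-skipping idea, not HN's actual code: the function name and the default margin of 100,000 are assumptions taken from the figure quoted above. After a restore, the counter jumps past every ID that might have been issued between the backup and the crash.

```python
def next_safe_id(last_id_in_backup: int, safety_margin: int = 100_000) -> int:
    """Return the first ID to hand out after restoring from a backup.

    Every ID in (last_id_in_backup, last_id_in_backup + safety_margin] is
    treated as possibly used before the crash and is never reissued.
    """
    if safety_margin < 0:
        raise ValueError("safety_margin must be non-negative")
    return last_id_in_backup + safety_margin + 1


# Hypothetical numbers: if the backup's newest item were 7000000,
# the next item created after the restore would get 7100001.
print(next_safe_id(7_000_000))
```

Old links into the skipped range then lead to a missing item rather than to an unrelated post.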


The problem also happens with HN's own search: I tried to get the comments for the first entry of https://www.hnsearch.com/search#request/all&q=openstreetmap

www.hnsearch.com is down itself.

I'll post something more detailed tomorrow, but in terms of data loss, we went down at 2014-01-05 16:10:29 PST and restored a backup from 2014-01-05 01:00:00 PST.
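For anyone wanting the size of the lost window, it can be worked out directly from the two timestamps above. This is just arithmetic on the quoted figures; both times are PST, so no timezone handling is needed.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Both timestamps are quoted in PST, so naive datetimes are enough here.
went_down = datetime.strptime("2014-01-05 16:10:29", FMT)
backup_taken = datetime.strptime("2014-01-05 01:00:00", FMT)

lost_window = went_down - backup_taken
print(lost_window)  # 15:10:29
```

So roughly fifteen hours of activity fell between the last backup and the crash.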

I lied about tomorrow. The "post" of post-mortem requires it be over. Still not done changing things.

It's amazing to think that HN is still someone's "side project".

It seems to be valued much more by the community than by those who manage it. This is a problem IMO, as there's then an imbalance between the desire of the community to have an optimal tool and the desire of the management, which appears to be more to provide something akin to an MVP.

Of course the community is diverse and my view is not necessarily representative of any type of majority.

why? the value of hn isn't the software, but the community.

That's the point. It's still a side project, but the community has enormous value.

Thanks for the update. I don't think all that much was lost, and a day to restore from a disk failure ain't bad. Please thank Nick and whoever else worked overtime to get things back up and running!

Thanks for the update!

It seems all activity from the past two days has disappeared -- backup storage is something you never regret paying for.

You've probably all seen it by now, but from @HNStatus: [1]

  Server back up and seemingly stable. Now restoring our latest backup to recover from limited filesystem corruption.
[1] https://twitter.com/HNStatus/status/420179162138021888

Yeah, I lost about 200 karma (which was about 15% of my total) in the crash.

Good thing they're just silly internet points :)

I lost 25% of mine! A whole point!

I lost 50%

I hope at some point things reverse, and instead of accruing karma we shed it. When we reach zero--

Thank you for making my paltry 22 lost karma points look so, well, so paltry. ;-)

edit: to clarify "lost" karma

If this were reddit, you'd be getting tipped in Dogecoin as well.

Here, unfortunately, Internet play money is Serious Business.

Just gave you an internet point for having perspective.

And here I was wondering what I was being downvoted for. Turns out: nothing, a post just disappeared :)

Gave you a point so you can feel special again. Here's looking at you, kid.

Indeed, the first time I got something on the main page and my karma skyrocketed from 5 to 80-something, I lost it all.

Just internet points, fortunately :)

Well you're number 1 on bestcomments [0]. Apparently, this is the most effective way to rack up karma.

[0] - https://news.ycombinator.com/bestcomments

So, you had about 1333 points?

It looks like you've already exceeded that now... Well done! The internet gods must favour you.

I suspect the favour of the Internet Gods would be greater if the user had 1337 points ;)

I gave you back an Internet point.

I lost a few comments :(

I've now posted them again. Hope the thread owner reads them and replies.

Here you go, I gave you one back.

Yes I noticed this as well from (the lack of) my own comment activity. I don't comment that often but I had written something yesterday that has disappeared.

On a more general note, if anybody has backups but doesn't regularly test restoring them, then they really don't have backups! As an added bonus, regular restoration tests let you practice for the "real deal", and you learn how long the entire process will take.
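A rough sketch of what an automated restore test could look like, under loud assumptions: the backup is a plain directory copy, `shutil.copytree` stands in for whatever the real restore procedure is, and the source has not changed since the backup was taken. The function names are made up for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def tree_digest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            key = path.relative_to(root).as_posix()
            digests[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def restore_matches_source(source: Path, backup: Path) -> bool:
    """Restore `backup` into a scratch directory and compare it with `source`."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / "restored"
        shutil.copytree(backup, restored)  # stand-in for the real restore step
        return tree_digest(restored) == tree_digest(source)
```

On a live system the comparison would more realistically be against a checksum manifest written at backup time, since the source keeps changing. Timing the call also gives the "how long will a restore take" number.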

One of the nicest ways I've experienced of making sure your backups are good is to sync up your development machine with them occasionally. Obviously there are situations like HIPAA where you can't do that, but if you can, do. You'll catch problems with your backups long before you actually need to use them.

This. Sometime last century I had to restore an old codebase from a tape backup. Step one: find the correct drive...

We'll never need that old repo again.

IMO data loss is less a symptom of untested backups than it is of developer-managed systems. I wonder if ycombinator has a sysadmin (who isn't also a developer)?

That's a shame, I was really looking forward to the comments for the article below. Unfortunately I had it loaded, but hit Ctrl+R (like I sometimes do) and lost it forever. :/

The google cache got a few comments, but very few.




I missed the original posting of https://news.ycombinator.com/item?id=7015438 and it's right up my alley (now it's tomorrow morning's reading :) )

I've reposted a link to the original here - https://news.ycombinator.com/item?id=7015767

I really don't want credit, I just want the original question and the couple of responses to have the opportunity to see the light of day, despite the HN outage. So please vote up the original post!!!

"backup storage is something you never regret paying for"

Indeed, but the time and CPU it takes to do backups can be a very nasty trade off.

For that matter, as I write this, due to a mistake I made after an 85-minute power outage yesterday, I'm just now doing my daily incremental backup of my home machines to an LTO-4 tape drive. Keeping that drive fed fast enough to prevent "shoe-shining" took some effort: Bacula spools up to 100G at a time to a partition on a single 15K disk on a separate controller. But if I had an LTO-5 drive, from what I've heard there's no single disk in existence that can keep up with the drive (not counting SSDs, which are a very poor match for this use case).

My feed array to the LTO-5 drive is four 2TB (Hitachi) disks in RAID-10. Backup strategies, much like build systems, are some constant factor harder than they appear.

I'd like to migrate to ZFS but have yet to. Still just running EXT4.

HN should be on a replicated data store like Riak. Losing a node or two shouldn't take the system down, or should at least run in a degraded state (read only) until hardware is restored.

Were they not using raid or performing multiple database writes? A mechanical hard drive failure is pretty common and can be mitigated fairly easily.

RAID arrays fail all the time; the system has famously been one server, and the only visible recent scaling work has been front end caching.

edit: the code has been public for a long time, and there is not a database to replicate. the site ran as a single server for years, and it is unlikely the front end caching has changed anything about the "database" components.

Since RAID failures actually are somewhat common, they are probably looking at a higher level replicated storage system now, a la DRBD, or some kind of distributed file system, a la Gluster.

Doesn't RAID usually at least give some warning if you watch the syslogs? (Genuine question, I am not a sysadmin. We have Linux servers with Hetzner on software RAID 1, and a couple have had single-disk issues which we spotted straight away in Zenoss and had Hetzner replace the disk. Am I incorrect in thinking this is normal?)

RAID is a method for surviving hardware failure. If you have a software failure in, say, the VFS layer, RAID will happily accept the order to write garbage all over your inode trees and will carefully store and make sure that all the appropriate disks can return the same garbage every time. And yes, it should warn you when you need to replace a disk which is no longer returning the right garbage.

Similarly, if you rm -rf a vital directory tree, RAID can ensure that it goes away reliably.

yes you're right. so replies will now switch to how they don't stop you from deleting data, because... well, i have no idea why. it seems to just be a law of nature.

DRBD and Gluster are not any more resilient to filesystem corruption than a RAID device is. In this kind of case you hope for either real-time replicated storage on a completely separate physical host or very recent backups.

What are DRBD and Gluster if not real-time replicated storage on completely separate physical hosts?

Filesystem corruption without hardware failure is far rarer in my experience. Have you seen an instance that wasn't a proverbial user error?

You never ran reiserfs I see...

Back in ~2004 I watched IT spend a whole day recovering our 60-person startup's main Linux NFS server, due to a software bug in the storage driver. Had to rebuild the whole system from backups.

Yes, I have in fact, in a DRBD configuration. The bug was esoteric, but it happened and was not the result of user error. DRBD and Gluster both allow faults in the VFS layer to propagate to all replicas.

Gluster should by design I think avoid replicating filesystem metadata corruption (but would replicate internal metadata issues in files on top of the filesystem) but DRBD won't... At high volumes I still regularly break Gluster but it'd probably be OK for lower bandwidth/ops use. Not sure what the HN disk usage pattern is though.

IIRC GlusterFS was the thing that gave me multiple identically-named files in the same directory. Useless.

Or, I dunnoh... writing to S3? ;-)

Databases? What databases?

HN is persisted to flat files.

I guess he meant having two separate logs: one for production, and a secondary one serving as a journal. In this case you could restore the original data from backup and then replay the rest from the external log. That's the solution I'm using with really important data where I cannot afford any data loss, even if downtime is acceptable. Each commit is written to two separate systems, but the secondary system is only a journal that can be replayed.
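A toy sketch of that backup-plus-journal scheme, assuming records are JSON-serializable dicts with an `id` field and that the primary store is just an in-memory dict. The names `commit` and `recover` are illustrative, not anyone's real setup.

```python
import json
from pathlib import Path


def commit(record: dict, primary: dict, journal_path: Path) -> None:
    """Append the record to the journal first, then apply it to the primary store."""
    with journal_path.open("a", encoding="utf-8") as journal:
        journal.write(json.dumps(record) + "\n")
    primary[record["id"]] = record


def recover(backup: dict, journal_path: Path) -> dict:
    """Start from the backup and replay every journaled record on top of it."""
    store = dict(backup)
    with journal_path.open(encoding="utf-8") as journal:
        for line in journal:
            record = json.loads(line)
            store[record["id"]] = record  # last write wins, so replay is repeatable
    return store
```

For this to survive a disk failure, the journal would have to live on a different machine than the primary store and be flushed to stable storage on every commit; the sketch leaves both out.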

What's odd is that if you look at your submission history, you can up vote your own submissions.

Maybe that's nothing new, but I just noticed it. Seems like a bug.

> What's odd is that if you look at your submission history, you can up vote your own submissions.

It doesn't seem to do anything though.

Several comment threads that I was following when it went down are gone ("No such item."), although their original links are still valid.

Indeed you are right; the original links still work for me. Though the ones I checked so far look exactly the same as the Google Cache version, so I don't know what happened there.

During the outage, https://twitter.com/HNStatus went from somewhere around 300 to 1163 followers.

Earlier today I saw it had about 45 followers. I think it was a new account. (Please correct me if I'm wrong.)

I'm surprised HN is using Twitter instead of one of their investments, Statuspage.io.

Their first post is dated 28 July - https://twitter.com/HNStatus/status/361707202123268096

edited: "They're" --> "Their" (there/their/they're will be the death of me!)

> (there/their/they're will be the death of me!)

I sure hope that isn't literal. I've heard of "grammar nazis" but that would be ridiculous. Stay safe!

here's a tweet from July 28th, so it's not a new account https://twitter.com/HNStatus/status/361707202123268096

It has tweets from as far back as July of 2013.

I actually looked for a twitter account, but didn't know which would be official. Then jgrahamc retweeted hnstatus and I knew which to follow.

I was bummed that the conversation around openstreetmaps got killed in the middle of it, and now I do not see it on the front page. Does anyone have a link to that thread or did it disappear?

It's been resubmitted: https://news.ycombinator.com/item?id=7015502

(But I think the original thread was totally lost, I submitted it and it's not listed in my submission history.)

me too. i guess we can start over: https://news.ycombinator.com/item?id=7015502

It's no longer on the front page.

I find it interesting that this question is fresher (by a minute), has more points (67 v 42 at snapshot), and has more comments (18 v 10 at snapshot) than "HackerNews down, unwisely returning http 200 for outage message" but is ranked lower (2 v 1 at snapshot).

snapshot - http://oi40.tinypic.com/2mmbv5y.jpg

Self posts are penalized so they don't clog the front page for long.


Postmortem: it went down last night when people should have been going to sleep before their first day back at the job after holidays. It stayed down until the end of that day, with the last couple of days of vacation insanity erased.

Appreciate the gift of perspective that has been given.

Interesting that your perspective is locked into one side of the globe. ;) HN was down during the day my time, when we had already slept before returning to work. :)

Appreciate the gift a new perspective gives you.

I scoped it to a YC framing.

Would like to read it too. And it looks like right now is a good time to get just about anything on the front page. Front pretty much == new.

Are you actually able to see 'top'? I'm still getting the error.

EDIT: (never mind, it was just cached)

Yeah, the 200 response during the outage is playing with everyone, I think; you have to do a hard refresh on any URL you had visited during the downtime :/
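For illustration, here is roughly what an outage page that browsers and proxies won't cache as the real page could look like. It is only a sketch using Python's standard `http.server`. The one-hour retry hint is an arbitrary assumption, and the body text is the message the site showed. The point is the 503 status together with `Retry-After` and `Cache-Control: no-store`, instead of a 200.

```python
from http.server import BaseHTTPRequestHandler

OUTAGE_BODY = b"Sorry for the downtime. We hope to be back soon.\n"


def outage_response(retry_after_seconds: int = 3600):
    """Return (status, headers, body) for an outage page that should not be cached."""
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        "Content-Length": str(len(OUTAGE_BODY)),
        "Retry-After": str(retry_after_seconds),  # hint for clients and crawlers
        "Cache-Control": "no-store",  # keep the outage page out of caches
    }
    return 503, headers, OUTAGE_BODY


class OutageHandler(BaseHTTPRequestHandler):
    """Answer every GET with the 503 outage page."""

    def do_GET(self):
        status, headers, body = outage_response()
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, something like `http.server.HTTPServer(("", 8000), OutageHandler).serve_forever()` would serve it on port 8000.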

pg: I don't know how much you care to get back the data that was lost, but it seems like it's at least partially available in the hnsearch.com API: http://api.thriftdb.com/api.hnsearch.com/items/_search?prett...

I'm not an expert in internet architecture, but shouldn't a site this important be running on redundant servers? The irony of a tech site going down due to technical issue is making me grin, however. Glad to see it back :)



Obviously I'm a fan of the site, etc, etc, but "important"? On what level?

I'm not even sure I'd call Facebook or Twitter important. Banking, yes. Weather warnings, yes. Things like that, sure. But I'm also pretty sure "important" is slightly over-egging it for dear HN.

(No offence PG xxxx)

>>Im not even sure I'd call Facebook or Twitter important.

Imagine Twitter or Facebook being down during Egyptian revolutions.

IRC was the go-to before social networking. It's where I got up-to-the-minute updates as the events of 9/11 unfolded, despite being 800+ miles away. That's also when I realized TV news is obsolete.

They'd have just gone to Facebook or Twitter instead. If you see what I mean.

Uptime for those types of sites is probably important for retention, but not for significant world events. Then again, Twitter kinda proved that even for retention it's not very important, given how flaky it used to be.

I definitely find it important, from a career perspective.

Whaaa? I guess I can see that, in the sense that chatting and/or blowing off steam may have psychological benefits, but we're both just killing time here. Don't kid yourself.

I've learned a lot on HN, and I think learning is important, ergo I think HN is important.

Which is why I have a backup of it in case this happens again.

Important, absolutely – important as in critical service requiring 100% uptime? Not so much.

Are there any plans to release a new version of Arc, if one exists, or the server-side code (without the business-critical stuff)? I guess there have been lots of improvements since the last Arc release.

Despite the website being back online, the root URL still redirects to the error page (at the time of writing this).

So https://news.ycombinator.com/news works, but https://news.ycombinator.com still redirects to "Sorry for the downtime. We hope to be back soon.".

It's your browser cache.

Yes it was!

This must have been the most productive time for the tech industry in months.

No, I just kept wasting time reloading the HN home page or following notifications on Twitter!

I'm also interested in what the infrastructure of HN looks like. One of the tweets via @HNStatus seemed to imply that the site runs off of one application server.

HN is indeed running on a single (10 month old) server, it seems [1].

[1]: https://news.ycombinator.com/item?id=5229364

Social experiment

You may jest, but I once suggested something like that: https://news.ycombinator.com/item?id=2403880

I considered this as well (seriously), until I saw the lost data from the past day or so.

I wonder how much effort would be reasonable to improve the resilience of HN to this kind of issue, given that it's relatively rare and HN doesn't really lose money from downtime such as this.

Probably little. There's no ad revenue being lost, no business transactions that can't be completed, no life-saving information that can't be accessed. When you boil it down, it's a social/entertainment site, nothing anyone can't live without for a day or two.

>nothing anyone can't live without for a day or two //

On this basis can't you shut down pretty much anything the majority use day-to-day?

But AFAIK there was some data loss: some users lost their karma and their comments.

If the outage was due to something malicious I don't really expect to see a postmortem.

Do we get the karma we lost refunded somehow? I am certain I am missing around 30 points.


I am sure pg is going to write an essay on this :-)

It bothered me more than it (probably) should

Damn, I made it to 1300 karma!

Now I'm back at 1273.

Because it's not good enough that the site is back, we need to pile on and complain too...

Asking a technical site about a technical site outage isn't, in and of itself, complaining. It's an opportunity to learn.

I'm not complaining. I'm asking a legitimate question. A popular site has been down for a significant period of time; I think a postmortem will be insightful.

If you're asking, it means you know that PG authors them in most instances. The fact you felt the need to start a new thread, just minutes after the site is back, tells me you're anxious and overeager.

I don't disagree with your assessment. I think it's more of an opportunist running an (admittedly) self-serving social experiment of my own. And I was also curious (naturally) and well-aware someone would ask the question.

I wasn't complaining, however. :)

Edit: clarification on motivation.

Asking for a postmortem is not complaining.

The site has been up for minutes and already asking for a report. I'm sure that's the last thing on PG's mind at this point. It's just clueless and classless.

> It's just clueless and classless.

Speaking of which, I think you got the wrong impression here about the motivation of your fellow HNers who are simply curious. Your doubling down by snapping at people is really not the solution either.

Take a step back and look at this thread again. I think you misinterpreted the situation. Nobody's upset because of the outage. Isn't it understandable that people are giddy to find out what happened, though?

OP didn't even ask nicely, and if you read the follow up to my comment, you'll see that the OP was motivated by something other than curiosity.

On the surface it seems fortuitous that the OP confessed to ulterior motives for asking the question, however it's misleading to bring this up, since that was not your original criticism at all - instead you accused us collectively of piling on and complaining.

Had your criticism been that lukeqsee is behaving lame and trollish, you might have been received much better. Personally, I couldn't care less who gets the stupid points for actually asking the question - I believe getting points for posting stories is a bug anyway.

And the question itself is so simple and minimal, it doesn't really make sense to think of it in terms of being nice or not. It seems appropriate to me to just assume by default that it is being asked nicely and leave it at that.

Thank you, Udo.

(I would hope nobody misunderstands, but I am entirely serious.)

Edit: Udo edited his post, so I replied.

> lame and trollish

I partially agree. On one hand, it is both. On the other hand, like I said above, someone would post the question.

> I believe getting points for posting stories is a bug anyway.

I completely agree. Actually encouraging conversation and actively adding to the conversation is much better for the community. Perhaps 25% or 50% points for posts would level the field.

I didn't mean it as a personal attack, but I do stand by the assertion that it was a pretty lame thing to do.

I'm curious - if you don't mind my asking: what are you thanking me for?

I thanked you because you accurately and fairly assessed my actions. A large reason I post or comment on HN is to receive constructive criticism (which is why I admitted to existence of self-interest). You provided it, and that’s why I’m thanking you.

nhangen falsely critiqued my actions as complaining and then turned my own admission against me; you fairly assessed that (IMO). You also added to the conversation with your own assessment of the broad trend, i.e., posts are worth points.

I think all of those actions are worthy of thanking.

No, I actually read it as someone trying to get cheap points (or popularity) by being the first one to post it. I would have felt differently had the question been asked with tact, and I'm certain those with high karma know that the question was unnecessary. It just felt overzealous.

> No, I actually read it as someone trying to get cheap points (or popularity) by being the first one to post it.

That's not at all what you initially commented on. Even if that was your intent, you simply said something completely different and unrelated. You were judged by what you actually said and when people challenged you about it you got aggressive. I'm really sorry to go on about this, but that's honestly what it looks like from here.

One of the reasons why I chose to comment on this is that it's a mistake I made as well once or twice (getting motivations of fellow users wrong and snapping at them), and I had the good fortune of people pointing that out to me.

That's cool, I don't begrudge you for the sentiment.

I think you're getting "asking for a postmortem" confused with angrily raving about downtime. Or, you know, self-righteously admonishing others for asking such a question.

I am mildly interested in seeing a post-mortem at some stage, because

1) We also have to keep our site and services running. Learning from other people's (bad) experiences is always welcome.

2) After an outage of this magnitude on our site and services, a post-mortem would be expected as part of the clean-up. It's not an extra demand, it's normal.

It's not about blame and anger; it's about learning and preventing.

This is a bit silly. Anytime a (very active) social site goes down for a few minutes, people are interested in what happened, why, and what is being considered to prevent it going further.

There's no doubt that the whole time it was down, these questions were being asked -- both by PG and co. as well as the end users.

Not that we expect immediate and concrete answers, but it's certainly expected that the question will come up.

And it's pretty common for services to post a quick update stating the issue, cause, and postmortem at a high level, once service has been restored.

I know I try to do this every time my site goes down for even a couple minutes, let alone ~24 hours, and it has nowhere near the activity of HN.

Several HNers have the same question, thus the reason it is the top-voted new post.

I agree that knowing what happened is a legit curiosity, I'm just calling the op out for being a point sniffer.

I'm saying, though, that asking this alone doesn't make them a point sniffer. Somebody was bound to ask it within minutes of service being restored, because I'm sure everyone was curious about this after seeing HN go down for several hours.

I know it was on my mind, just out of curiosity, and was one of the first topics I looked for when I found service had been restored.

Who's complaining? I only see curiosity.

You know.... just to help you in future...

Your post might have done better if you had gone at it this way:

"Hey guys, PG and his minions are probably flat out on a couch somewhere, breathing many sighs of relief, and slinging back a well-earned Scotch (might be a cider). Could be a while until we hear from them."
