Bitbucket Downtime Post-Mortem

woogley · on Jan 12, 2012

Hmm. I had no idea they were down last week, but now I've learned their blog is down right now.

Cache link: http://webcache.googleusercontent.com/search?q=cache:http://...

srl · on Jan 12, 2012

Protip: if you're having load problems but think you've resolved them, try posting a story about them to HN.

Disclaimer: haven't read and can't read the article. EDIT: cache is here: http://webcache.googleusercontent.com/search?q=cache:http://...

(Totally OT: would it be correct to say "haven't and can't read"?)

aiiie · on Jan 13, 2012

bitbucket.org and blog.bitbucket.org are different applications running on different servers.

We recently migrated our WordPress instance to a new machine, but not all of the configuration was moved over. We've put caching back in place, so it should be back to normal.

antoko · on Jan 12, 2012

I think "can't read" would also include the case of "haven't read", making the "haven't read" component somewhat redundant.

Aesthetically it doesn't scan well - I think adding it in parens would be the way to go if you want to make that kind of emphasis.

Haven't read (and can't) ...

If the question is about whether it is grammatically correct then I wouldn't like to say - I never really paid much attention to those rules. :)

obeattie · on Jan 12, 2012

Seems to me a little investigation into the (many) ways rsyslog can be configured is in order here. It can be relatively simply configured to ship logs asynchronously, and reliably, to a remote log server, and buffer them on disk / in memory in the event the remote server is down. This is what we're doing, and it works very well.

http://rsyslog.com/doc/queues.html

http://rsyslog.com/doc/rsyslog_reliable_forwarding.html

(This isn't, however, to say the documentation on this stuff is easy to find / understand. The rsyslog documentation is definitely not the best.)

antoncohen · on Jan 12, 2012

Interesting, it had not occurred to me that a syslog server going down would take down the whole infrastructure. How long was the syslog server down for? rsyslog supports reliable forwarding, where it will buffer to memory when the server is down, then to disk when it runs out of configured memory [1].

[1] http://rsyslog.com/doc/rsyslog_reliable_forwarding.html
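To illustrate the memory-then-disk idea described above, here is a minimal Python sketch. It is not rsyslog's implementation or configuration; the class name `SpillBuffer`, its parameters, and the spill file path are all hypothetical and chosen only for illustration.

```python
import os
from collections import deque


class SpillBuffer:
    """Buffers log lines in memory and spills to a file once memory is full.

    Hypothetical illustration only; rsyslog's real queues are more involved.
    """

    def __init__(self, max_in_memory, spill_path):
        self._memory = deque()
        self._max_in_memory = max_in_memory
        self._spill_path = spill_path

    def add(self, line):
        # Prefer memory; fall back to appending to the spill file.
        if len(self._memory) < self._max_in_memory:
            self._memory.append(line)
        else:
            with open(self._spill_path, "a", encoding="utf-8") as f:
                f.write(line + "\n")

    def drain(self):
        """Return every buffered line (memory first, then disk) and reset."""
        lines = list(self._memory)
        self._memory.clear()
        if os.path.exists(self._spill_path):
            with open(self._spill_path, encoding="utf-8") as f:
                lines.extend(entry.rstrip("\n") for entry in f)
            os.remove(self._spill_path)
        return lines
```

While the log server is unreachable, the application calls `add` and never blocks on the network; once the server is back, `drain` hands the backlog to whatever does the forwarding.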

gtaylor · on Jan 12, 2012

General summary: Don't send syslogs over TCP.

colmmacc · on Jan 13, 2012

I think the lesson is not that TCP should not be used, but rather that service-affecting code-paths should not block inappropriately.

I disagree with the conversion to UDP, and consider it an anti-pattern. In general, whenever you have many front-ends sharing a back-end service (in this case syslog), some flow-control is necessary - and UDP with no flow control will one day come back to bite you.

Consider an attack or peak-load event on the front-ends: now, in addition to the direct capacity problems that it may induce, you also impact your back-end and control planes by flooding a network or service with uncontrollable UDP.

As others have noted, it would be better to use non-blocking TCP I/O with some bounded queuing for retries. It's also generally a good idea to use a LIFO queue, so that when the backend is restored after an outage, the recent data takes priority over old data (for logging information, the health of the live system is most important).
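A rough sketch of the bounded LIFO retry queue described above, in Python. The names `BoundedLifoBuffer` and `drain`, and the `send` callable, are assumptions made for illustration and do not come from any particular logging library.

```python
from collections import deque


class BoundedLifoBuffer:
    """Holds at most `capacity` pending log lines; the newest is sent first."""

    def __init__(self, capacity):
        # With maxlen set, appending on the right evicts the oldest entry
        # on the left, so memory use stays bounded during a long outage.
        self._items = deque(maxlen=capacity)

    def push(self, line):
        self._items.append(line)

    def pop_newest(self):
        # Returns None when nothing is pending.
        return self._items.pop() if self._items else None

    def __len__(self):
        return len(self._items)


def drain(buffer, send):
    """Send pending lines newest-first and stop at the first failure.

    `send` is a hypothetical callable that returns True on success.
    Returns the number of lines delivered.
    """
    sent = 0
    while True:
        line = buffer.pop_newest()
        if line is None:
            return sent
        if not send(line):
            buffer.push(line)  # keep it for the next retry
            return sent
        sent += 1
```

The trade-off is deliberate: old entries are dropped once the bound is reached, which costs history but keeps the front-end from stalling or running out of memory.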

ethereal · on Jan 12, 2012

Or at least, don't use blocking TCP calls if you do. Time out and fail gracefully...
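As a minimal sketch of "time out and fail gracefully", assuming a plain TCP log receiver that accepts newline-terminated lines: the function name `send_log_line` and the 0.5 second default are hypothetical choices, and only standard-library `socket` calls are used.

```python
import socket


def send_log_line(host, port, line, timeout=0.5):
    """Try to deliver one log line over TCP; return False instead of hanging."""
    try:
        # The timeout passed here covers the connect and is also set on the
        # returned socket, so sendall is subject to it as well.
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(line.encode("utf-8") + b"\n")
        return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False
```

The caller can then decide what to do with a `False` result (drop the line, queue it for retry, or write it locally) rather than having a request handler wait on a dead log server.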

xxiao · on Jan 12, 2012

your blog site is down too, probably for another post-mortem

jphackworth · on Jan 13, 2012

Sort of a silly thing to bring up, but at my previous job bitbucket was down so often we started calling it sh*tbucket. At my current gig we're on github and it seems far more reliable.