Bitbucket Downtime Post-Mortem (blog.bitbucket.org)
42 points by urbanjunkie on Jan 12, 2012 | 11 comments



Hmm. I had no idea they were down last week, but now I've learned their blog is down right now.

Cache link: http://webcache.googleusercontent.com/search?q=cache:http://...


Protip: if you're having load problems but think you've resolved them, try posting a story about them to HN.

Disclaimer: haven't read and can't read the article. EDIT: cache is here: http://webcache.googleusercontent.com/search?q=cache:http://...

(Totally OT: would it be correct to say "haven't and can't read"?)


bitbucket.org and blog.bitbucket.org are different applications running on different servers.

We recently migrated our WordPress instance to a new machine, but not all of the configuration was moved over. We've put caching back in place, so it should be back to normal.


I think "can't read" would also include the case of "haven't read", making the "haven't read" component somewhat redundant.

Aesthetically it doesn't scan well - I think adding it in parens would be the way to go if you want to make that kind of emphasis.

Haven't read (and can't) ...

If the question is about whether it is grammatically correct then I wouldn't like to say - I never really paid much attention to those rules. :)


Seems to me a little investigation into the (many) ways rsyslog can be configured is in order here. It can be configured fairly simply to ship logs asynchronously and reliably to a remote log server, and to buffer them on disk / in memory in the event the remote server is down. This is what we're doing, and it works very well.

http://rsyslog.com/doc/queues.html

http://rsyslog.com/doc/rsyslog_reliable_forwarding.html

(This isn't, however, to say the documentation on this stuff is easy to find / understand. The rsyslog documentation is definitely not the best.)


Interesting, it had not occurred to me that a syslog server going down would take down the whole infrastructure. How long was the syslog server down for? rsyslog supports reliable forwarding, where it will buffer to memory when the server is down, then to disk when it runs out of configured memory [1].

[1] http://rsyslog.com/doc/rsyslog_reliable_forwarding.html
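To make the memory-then-disk idea concrete, here is a rough Python sketch of the buffering pattern described above. It is not rsyslog's implementation or configuration; the `SpillBuffer` class, its `mem_limit` parameter, and the spill file path are all hypothetical names made up for illustration.

```python
import os
from collections import deque


class SpillBuffer:
    """Hold up to mem_limit lines in memory; append any overflow to a file.

    Illustrative only: a toy model of "buffer to memory, then to disk",
    not how rsyslog actually stores its queues.
    """

    def __init__(self, mem_limit, spill_path):
        self.mem_limit = mem_limit
        self.spill_path = spill_path
        self._mem = deque()

    def add(self, line):
        if len(self._mem) < self.mem_limit:
            self._mem.append(line)
        else:
            # Memory budget exhausted: fall back to appending on disk.
            with open(self.spill_path, "a", encoding="utf-8") as f:
                f.write(line + "\n")

    def drain(self):
        """Return every buffered line (memory first, then disk) and clear both."""
        lines = list(self._mem)
        self._mem.clear()
        if os.path.exists(self.spill_path):
            with open(self.spill_path, encoding="utf-8") as f:
                lines.extend(l.rstrip("\n") for l in f)
            os.remove(self.spill_path)
        return lines
```

The application would call `add()` while the remote server is unreachable and `drain()` once it is back, so the logging call itself never has to wait on the network.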


General summary: Don't send syslogs over TCP.


I think the lesson is not that TCP should not be used, but rather that service-affecting code-paths should not block inappropriately.

I disagree with the conversion to UDP, and consider it an anti-pattern. In general, whenever you have many front-ends sharing a back-end service (in this case syslog), some flow-control is necessary - and UDP with no flow control will one day come back to bite you.

Consider an attack or peak-load event on the front-ends: in addition to the direct capacity problems it may induce, you also impact your backend and control planes by flooding a network or service with uncontrollable UDP.

As others have noted, it would be better to use non-blocking TCP I/O with some bounded queuing for retries. It's also generally a good idea to use a LIFO queue, so that when the backend is restored after an outage, the recent data takes priority over old data (for logging information, the health of the live system is most important).
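A minimal sketch of the bounded LIFO retry queue, in Python. The names `BoundedLifoBuffer` and `drain`, and the `send` callback (assumed to return True on success), are hypothetical; the actual non-blocking TCP transport is left out.

```python
from collections import deque


class BoundedLifoBuffer:
    """Hold at most `capacity` pending log lines; the newest is retried first."""

    def __init__(self, capacity):
        # A deque with maxlen discards from the opposite end when full,
        # so appending on the right evicts the oldest entry on the left.
        self._items = deque(maxlen=capacity)
        self.dropped = 0

    def push(self, line):
        if len(self._items) == self._items.maxlen:
            self.dropped += 1  # the oldest entry is about to be evicted
        self._items.append(line)

    def pop_newest(self):
        return self._items.pop() if self._items else None

    def __len__(self):
        return len(self._items)


def drain(buffer, send):
    """Send buffered lines newest-first; stop at the first failed send."""
    sent = 0
    while True:
        line = buffer.pop_newest()
        if line is None:
            return sent
        if not send(line):
            buffer.push(line)  # put it back and retry on a later pass
            return sent
        sent += 1
```

Because the buffer is bounded, a long outage costs old log lines rather than memory or request latency, and the `dropped` counter records how much was lost.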


Or at least, don't use blocking TCP calls if you do. Timeout and fail gracefully . . .
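Something like this, as a rough Python sketch. The function name `send_log_line` and the 0.5 second default are made up for illustration, and opening a fresh connection per line is only for brevity.

```python
import socket


def send_log_line(host, port, line, timeout=0.5):
    """Try to send one log line over TCP.

    Returns True on success and False on any network error or timeout,
    so a dead log server cannot stall or crash the caller.
    """
    try:
        # The timeout applies to the connect and to the later sendall.
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(line.encode("utf-8") + b"\n")
        return True
    except OSError:  # covers refused connections and socket timeouts
        return False
```

The caller can then decide what to do with a False result (drop the line, queue it for retry) instead of hanging the request.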


Your blog site is down too, probably for another post-mortem.


Sort of a silly thing to bring up, but at my previous job bitbucket was down so often we started calling it sh*tbucket. At my current gig we're on github and it seems far more reliable.



