
Bitbucket Downtime Post-Mortem - urbanjunkie
http://blog.bitbucket.org/2012/01/12/follow-up-on-our-downtime-last-week/?utm_source=twitterfeed&utm_medium=twitter&utm_campaign=Feed%3A+Bitbucket+%28Bitbucket%29
======
woogley
Hmm. I had no idea they were down last week, but now I've learned their blog
is down right now.

Cache link:
[http://webcache.googleusercontent.com/search?q=cache:http://...](http://webcache.googleusercontent.com/search?q=cache:http://blog.bitbucket.org/2012/01/12/follow-up-on-our-downtime-last-week/&hl=en&client=firefox-a&hs=cW8&rls=org.mozilla:en-US:official&channel=np&strip=1)

------
srl
Protip: if you're having load problems but think you've resolved them, try
posting a story about them to HN.

Disclaimer: haven't read and can't read the article. EDIT: cache is here:
[http://webcache.googleusercontent.com/search?q=cache:http://...](http://webcache.googleusercontent.com/search?q=cache:http://blog.bitbucket.org/2012/01/12/follow-up-on-our-downtime-last-week/&hl=en&client=firefox-a&hs=cW8&rls=org.mozilla:en-US:official&channel=np&strip=1)

(Totally OT: would it be correct to say "haven't and can't read"?)

~~~
brodie
bitbucket.org and blog.bitbucket.org are different applications running on
different servers.

We recently migrated our WordPress instance to a new machine, but not all of
the configuration was moved over. We've put caching back in place, so it
should be back to normal.

------
obeattie
Seems to me a little investigation into the (many) ways rsyslog can be
configured is in order here. It can be relatively simply configured to ship
logs asynchronously, and reliably, to a remote log server, and buffer them on
disk / in memory in the event the remote server is down. This is what we're
doing, and it works very well.

<http://rsyslog.com/doc/queues.html>

<http://rsyslog.com/doc/rsyslog_reliable_forwarding.html>

(This isn't, however, to say the documentation on this stuff is easy to find /
understand. The rsyslog documentation is definitely not the best.)

------
antoncohen
Interesting, it had not occurred to me that a syslog server going down would
take down the whole infrastructure. How long was the syslog server down for?
rsyslog supports reliable forwarding, where it will buffer to memory when the
server is down, then to disk when it runs out of configured memory [1].

[1] <http://rsyslog.com/doc/rsyslog_reliable_forwarding.html>
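To make the memory-then-disk idea concrete, here is a minimal Python sketch of a spill buffer. It is only an illustration of the general pattern, not how rsyslog implements its queues; the class name `SpillBuffer`, the `memory_limit` parameter, and the JSON-lines spool file are all assumptions made for the example.

```python
import json
import os
from collections import deque


class SpillBuffer:
    """Hypothetical buffer: holds messages in memory, then spills to disk."""

    def __init__(self, memory_limit, spool_path):
        self._memory = deque()
        self._memory_limit = memory_limit
        self._spool_path = spool_path

    def put(self, message):
        # Use memory only while it has room and nothing has been spilled yet,
        # so that the spool file never holds messages older than memory does.
        spilled = os.path.exists(self._spool_path)
        if len(self._memory) < self._memory_limit and not spilled:
            self._memory.append(message)
            return
        # Memory is full (or a spool already exists): append to the disk spool.
        with open(self._spool_path, "a", encoding="utf-8") as spool:
            spool.write(json.dumps(message) + "\n")

    def drain(self):
        """Return every buffered message, oldest first, and empty the buffer."""
        messages = list(self._memory)
        self._memory.clear()
        if os.path.exists(self._spool_path):
            with open(self._spool_path, encoding="utf-8") as spool:
                messages.extend(json.loads(line) for line in spool)
            os.remove(self._spool_path)
        return messages
```

A forwarder would call `put()` while the remote server is unreachable and `drain()` once it is back. A real implementation would also need to bound the disk usage and handle partial writes, which this sketch ignores.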

------
gtaylor
General summary: Don't send syslogs over TCP.

~~~
colmmacc
I think the lesson is not that TCP should not be used, but rather that
service-affecting code-paths should not block inappropriately.

I disagree with the conversion to UDP, and consider it an anti-pattern. In
general, whenever you have many front-ends sharing a back-end service (in this
case syslog), some flow-control is necessary - and UDP with no flow control
will one day come back to bite you.

Consider an attack or peak-load event on the front-ends: in addition to the
direct capacity problems it may induce, you also impact your backend and
control planes by flooding a network or service with uncontrollable UDP.

As others have noted, it would be better to use non-blocking TCP I/O with some
bounded queuing for retries. It's also generally a good idea to use a LIFO
queue, so that when the backend is restored after an outage, the recent data
takes priority over old data (for logging information the health of the live
system is most important).
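As a rough sketch of the bounded LIFO idea, assuming a single-threaded sender: the class name `BoundedLifoBuffer` and the `send` callable are hypothetical, and the actual non-blocking socket handling is left out. The point is only that enqueueing never blocks the request path and that the newest records go out first after an outage.

```python
from collections import deque


class BoundedLifoBuffer:
    """Hypothetical bounded buffer for log records; newest records go first."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # A deque with maxlen discards from the left (oldest) on append.
        self._items = deque(maxlen=capacity)
        self.dropped = 0

    def __len__(self):
        return len(self._items)

    def offer(self, record):
        """Never blocks: when the buffer is full, the oldest record is lost."""
        if len(self._items) == self._items.maxlen:
            self.dropped += 1
        self._items.append(record)

    def drain(self, send):
        """Send newest-first; stop at the first failure and keep the rest.

        `send` is an assumed callable returning True on success.
        """
        sent = 0
        while self._items:
            record = self._items.pop()  # newest record
            if not send(record):
                self._items.append(record)  # put it back for the next retry
                break
            sent += 1
        return sent
```

The trade-off is explicit here: during a long outage the oldest records are dropped and counted in `dropped`, rather than stalling the code that serves requests.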

------
xxiao
your blog site is down too, probably for another post-mortem

------
jphackworth
Sort of a silly thing to bring up, but at my previous job bitbucket was down
so often we started calling it sh*tbucket. At my current gig we're on github
and it seems far more reliable.

