

Thoughts Evoked by CircleCI's July 2015 Outage - d23
http://blog.mihasya.com/2015/07/19/thoughts-evoked-by-circleci-outage.html?hn

======
Sanddancer
The biggest question I would add to this is: why does CircleCI have only one
ops person handling the day-to-day operations of all of these moving parts?
That is a considerable number of moving parts, and one person fighting to keep
everything from falling over leaves no room for other people, with other
operations experience, to look at solutions that could at least provide a bit
of breathing room. For example, there are a few ways of throttling based on
IP; limiting requests from GitHub to, say, a kilobyte per second of bandwidth
would have slowed that incoming tide enough to let the queue start to drain.
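
To make the throttling idea concrete, here is a minimal token-bucket sketch keyed by source IP. It is an illustration only, not anything CircleCI is known to run: the class names, the rate and capacity values, and the choice to do this in application code are all assumptions, and in practice this kind of limit is often applied at the proxy or network layer instead.

```python
import time


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate          # tokens (requests or bytes) added per second
        self.capacity = capacity  # largest burst allowed
        self.tokens = capacity    # start with a full bucket
        self.now = now            # injectable clock, handy for testing
        self.last = now()

    def allow(self, cost=1):
        current = self.now()
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = current - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = current
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerSourceLimiter:
    """Hypothetical helper: one independent bucket per source IP."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.now = now
        self.buckets = {}

    def allow(self, source_ip, cost=1):
        bucket = self.buckets.get(source_ip)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.now)
            self.buckets[source_ip] = bucket
        return bucket.allow(cost)
```

A request handler would call `limiter.allow(ip)` and reject or defer the request when it returns `False`, which gives the queue behind it time to drain.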

This is my biggest frustration with cloud-based services. Just because you
don't have physical systems to maintain doesn't mean you don't need people who
are comfortable climbing in and through the systems and network level of the
stack to ensure everything is working as well as it could be. A good, solid
ops department gives a different perspective, and its focus on other layers
means that developers don't have to worry about those pieces. Ops is not a
profanity, and just because your devices are virtualized doesn't mean you
don't need ops people for the rest of the things they do.

------
wcdolphin
I enjoyed this piece. One piece of hard-learned advice: add exponential
backoff on failure to your clients, now. It takes only a small amount of work
and will save you from the inevitable self-inflicted DDoS when your ingestion
endpoint hiccups and clients' buffered data creates a load so large that you
may not be able to recover.
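
A minimal sketch of what that could look like on the client side, assuming a hypothetical `send` callable that raises on failure. The function name, the defaults, and the use of "full jitter" (a random delay up to the exponential ceiling) are illustrative choices, not something taken from the article.

```python
import random
import time


def send_with_backoff(send, payload, max_attempts=8, base=0.5, cap=60.0,
                      sleep=time.sleep, rng=random.random):
    """Call send(payload), retrying failures with exponential backoff.

    `send` is a hypothetical callable that raises an exception on failure.
    """
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except Exception:  # real code should catch only transport errors
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the last error
            # The ceiling doubles on every failure until it reaches `cap`.
            ceiling = min(cap, base * 2 ** attempt)
            # Full jitter: wait a random fraction of the ceiling so that
            # many clients do not all retry at the same instant.
            sleep(rng() * ceiling)
```

The jitter matters as much as the doubling: without it, clients that failed together retry together, and the recovering endpoint sees the same spike again at each interval.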

------
yeukhon
One thing about CircleCI's post-mortem is that they don't really specify what
kind of DB they use. It sounds like they implemented something on their own.
(I am betting it is not Cassandra.)

~~~
kevinburke
They use Mongo, I'm pretty sure

~~~
boulos
Yeah, back in 2013 they were hit by the MongoHQ "admin account hacked" issue:
[http://blog.circleci.com/mongohq-security-incident-response/](http://blog.circleci.com/mongohq-security-incident-response/)

Edit: I'm not sure they're still using Mongo though.

