
Full technical details on Asana's worst outage - marcog1
https://blog.asana.com/2016/09/yesterdays-outage/
======
merb
> Initially the on-call engineers didn’t understand the severity of the
> problem

In every outage report I read, something like that happens. At least Asana
didn't blame the technology they were using.

~~~
babo
For me that was the great part of the post mortem, they identified the
response process itself as the root cause.

~~~
merb
Yep, that was what I was thinking as well.

------
katzgrau
These sorts of deeply apologetic and hyper-transparent post-mortems have become
commonplace, but sometimes I wonder how beneficial they are.

Customers appreciate transparency, but perhaps delving into the fine details
of the investigation (various hypotheses, overlooked warning signs, yada yada)
might actually end up leaving the customer more unsettled than they would have
been otherwise.

Today I learned that Asana had a bunch of bad deploys and put the icing on the
cake with one that resulted in an outage the next day.

This is coming from someone who runs an ad server - if that ad server goes
down it's damn near catastrophic for my customers and their customers. When we
do have a (rare) outage, I sweat it out, reassure customers that people are on
it, and give a brief, accurate, and high level explanation without getting
into the gruesome details.

I'm not saying my approach is best, but I do think trying to avoid scaring
people with your explanation is worth considering.

~~~
bognition
I work at a shop that does these kinds of post mortems. I find them highly
beneficial.

They require us to actually do the work of identifying the issues and writing
up what happened and why. I realize that a customer contract shouldn't be the
reason we do this, but human psychology is a funny thing. I can turn to my PM
and say "I have to do this, it's part of the contract" and they immediately
back off.

I agree it might not be the best solution but it's definitely better than not
doing them.

~~~
dogma1138
I think the OP didn't mean that these post mortems are not beneficial
internally; what he said was that disclosing all these details to the public
can be confusing and maybe counterproductive.

~~~
merb
I'm not sure what's better:

1. Describing the root cause and what you failed at.

2. Blaming the stuff you are using / other people (the clouds you use).

3. Saying nothing and trying to forget what happened.

What do you think is best?

~~~
themartorana
It's maybe a level-of-detail thing. "A bad deploy went unnoticed, causing a
cascading failure. We identified how that happened and have new checks in
place to prevent it in the future."

Two lines, with the same information that someone not very technically literate
would take away from the OP. I agree with being transparent, but I also
believe in not unnecessarily scaring and/or confusing customers.

(Pretty soon they'll just start outing individual engineers...)

~~~
marcog1
We will never out individuals. The person who committed the code was innocent.
We got him a fun gift as a sort of joke.

~~~
merb
Yep, the best things someone could do:

- train the people more

- help them get over it (some people can be really upset and lose confidence
after they've messed up)

------
madelinecameron
>And to make things even more confusing, our engineers were all using the
dogfooding version of Asana, which runs on different AWS-EC2 instances than
the production version

... That kind of defeats the purpose of "dogfooding". Sure, you're running the
same code (hopefully), but it doesn't give you the same experience.

~~~
marcog1
You want to replicate as much as possible, but if we ran canary on the same
machines we could have testing code bring down production. That's bad.

------
bArray
Was this incident really recorded minute by minute, or is that made up? I've
noticed a lot of companies that give this kind of detail like to give a
minute-by-minute report; I just don't understand how they get that accuracy.

~~~
gjtorikian
Oh, man. Most definitely that's real.

If you're working in Slack or chat, you've got a minimum of half a dozen
people typing and putting out suggestions and offering to investigate
something. That's all time-stamped. And even if you're not doing that in real
time, you may be using something like a GitHub issue to discuss the problem
via comments, which are also time-stamped.

At the moment of the incident, no one is going "Ah, it's 8:01, better write
down that I identified the problem." It's most likely "hey I think I got it
one sec" and then that works. Or doesn't. But hopefully it does.

~~~
jwatte
Yes, Slack and IRC timestamps are common. Ideally your shell and auditing
give you that for commands, too!

------
kctess5
I find it interesting that they didn't notice the overloading for so long.
Also that it took so long to roll back. Given that they reportedly roll out
twice a day, it seems like identifying a rollback target would be fairly
quick.

~~~
marcog1
This was the first time we had this class of outage. Many things were in a
very bad state, and many of these symptoms were more familiar to us. So we
spent time ruling them out before realising webserver CPU was closer to the
root cause than the other symptoms.

We roll back by reverting to a previous release on the load balancers, which
is usually pretty instant. The previous releases were bad and themselves
rolled back, which is a rare situation for us. So there was a bit of
scrambling to look into the chat logs to determine a safe (non-rolled back)
release we could roll back to. Then the high CPU caused our roll back to be
really, really slow. Then we still had old processes running the bad release,
and killing them on webservers with high CPU took a while to actually work.
Then it took a bit of time for load to come down on its own. All of this
took place within the 8:08-8:29 window reported in the post. And I'm still
simplifying a lot.
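
As an aside, a rough sketch of the kind of machine-readable release log that
would turn the "which release is safe?" scramble into a lookup instead of a
chat-log search (the log format and field names are made up for illustration,
not what we actually run):

    import json

    def last_good_release(release_log_path):
        """Return the newest release that was not itself rolled back.

        Assumes one JSON object per line, newest last, e.g.
        {"release": "2016-09-12-a", "rolled_back": false}
        """
        with open(release_log_path) as f:
            entries = [json.loads(line) for line in f if line.strip()]
        for entry in reversed(entries):
            if not entry["rolled_back"]:
                return entry["release"]
        raise RuntimeError("no known-good release to roll back to")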

~~~
tomjen3
What I don't get is why you didn't immediately see the relatively low CPU
usage on the database servers and the super high usage on the webservers in a
Nagios (or similar) dashboard.

~~~
lrascao
And apparently there were no alarms in place for this kind of thing.

~~~
babo
Apparently a lot of parts of the system were on alarm.

------
mathattack
Not a bad reaction. With all the reverts, is there a QA issue? Or too many
releases?
releases?

~~~
marcog1
When you do daily deployments, you can't QA each one much. You rely on
automated tests and internal users using the new code for a couple of hours
before the deployment. We were unlucky in this case with the number of bad
releases. Each was relatively minor, and ironically one was to fix a bug in
the code that caused this outage. We run a 5 whys for most of them.

~~~
Mtinie
> When you do daily deployments, you can't QA every one much.

In that case, should you be doing daily deployments to production?

------
zzzcpan
Strangely, there are no actual technical details in the report and the blame
is on the process, although most of the time there is some way to prevent
bugs from causing problems with better architecture.

~~~
jwatte
The detail was right there: debugging something in security caused massive
logging, which caused CPU bottlenecking.

Performance is the hardest thing to integration test for. Keeping careful
track of CPU/memory/network/disk load with automated alerts can help.

(Fancy systems like running a traffic replica can help, too, but at a much
higher cost.)
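
As a concrete (if simplistic) sketch of the kind of automated alert I mean,
assuming psutil is available and with made-up thresholds:

    import psutil

    CPU_ALERT_PCT = 85.0   # made-up threshold; tune per fleet
    MEM_ALERT_PCT = 90.0

    def check_host_load(alert):
        """Sample CPU and memory and call alert(kind, value) if either is hot."""
        cpu = psutil.cpu_percent(interval=5)      # averaged over 5 seconds
        mem = psutil.virtual_memory().percent
        if cpu > CPU_ALERT_PCT:
            alert("cpu", cpu)
        if mem > MEM_ALERT_PCT:
            alert("memory", mem)

    # e.g. check_host_load(lambda kind, value: print("ALERT", kind, value))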

~~~
marcog1
We actually have a traffic replica (dark client) setup for the new webserver
architecture we are gradually migrating to. It likely would have caught this
before deploying to users.

------
cookiecaper
Reading through this, it sounds like some basic monitoring would've quickly
allowed them to pinpoint the cause instead of wasting time with database
servers. All it would take is pulling up the charts in Munin or Datadog or
whatever and seeing "Oh, there's a big spike correlated with our deploy and
the server is redlining now, better roll that back". A bug or issue in the
recent deploy would logically be one of the first suspects in such a
circumstance. Don't know why they wasted 30-60 minutes on a red herring. The
correlation would be even more obvious if they took advantage of Datadog's
event stream and marked each deployment.

Additionally, CPU alarms on the web servers should've informed them that the
app was inaccessible because the web servers did not have sufficient resources
to serve requests. This can be alleviated _prior to_ pinpointing the cause by
a) spinning up more web servers and adding them to the load balancer; or b)
redirecting portions of the traffic to a static "try again later" page hosted
on a CDN or static-only server. This can be done at the DNS level.

Let this be a lesson to all of us. Have basic dashboards and alarming.
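
For the deploy-marking part, assuming the official datadog Python client, it
can be as little as posting an event from your deploy script (the keys and
tags here are placeholders):

    from datadog import initialize, api

    initialize(api_key="<API_KEY>", app_key="<APP_KEY>")

    def mark_deploy(release_id):
        """Post a deploy event so it lines up with the CPU graphs."""
        api.Event.create(
            title="Deployed " + release_id,
            text="Web tier deploy of release " + release_id,
            tags=["deploy", "release:" + release_id],
        )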

~~~
marcog1
We have very comprehensive dashboards. Getting the perfect ones that help in
all cases, while not being information overload (the problem here) and being
discoverable is a hard, iterative process.

~~~
cookiecaper
Yes, monitoring requires a lot of tuning until you find a sweet spot, but it
doesn't sound like this is something that would've been buried deep in the
annals of monitoring. CPU/load data on your web servers should be pretty
visible/accessible and one of the first graphs that gets pulled up (and your
alarms should've pointed out the issue anyway).

I'm not sure what you're using for dashboards but Datadog makes it pretty easy
to find this stuff. I'm not a Datadog shill and I actually am not a _huge_ fan
of the product, but it's what we use and it's been a big help over our
previous Munin installation.

Other process changes that could prevent this are good load testing in a staging
environment and getting your company using the real prod code on the real prod
infrastructure as its main/default install. A lot of the benefits of
"dogfooding" are lost if you're using alpha code on dev-only boxes (as you
state that you are in another comment).

As another commenter said, I'm not sure that postmortems like this are
valuable unless the problem was particularly complex/interesting. I'm sure
that a lot of people at Asana know how to fix this and that it's just a matter
of getting management to allow them to do so. I'm sure you owe your customers
an explanation of some sort, but I don't know if you need to get into details
that say "Yeah, it was just a pretty typical organizational failure, we really
should've known better". Everyone has those, but it's best not to publicize
them too much.

I'm not going to hold it against Asana because I've worked at a lot of
companies and I know how this goes, but when people come here and analyze the
cause, as a postmortem invites the readers to do, you seem a little defensive.
Perhaps it's best to keep the explanation more brief/vague when it's not a
complex failure.

------
qaq
This is "not that different" from getting a very high load spike. Do you guys
not have some autoscaling set up?

~~~
marcog1
We do, but it didn't help given that the cause of the high CPU was our logging
infrastructure (Amazon Kinesis) being overloaded by the webservers.

~~~
matt_wulfeck
Does Kinesis not support UDP syslog-style logging? Some of these old
technologies had the right idea: if you're sending too much data, drop the
packets on the floor instead of falling over!
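
You can approximate that behaviour at the application layer, too. A minimal
sketch with Python's stdlib logging: a bounded queue whose handler drops
records instead of blocking when the queue is full (the queue size is
arbitrary):

    import logging
    import queue
    from logging.handlers import QueueHandler, QueueListener

    log_queue = queue.Queue(maxsize=10000)  # arbitrary bound

    class DroppingQueueHandler(QueueHandler):
        def enqueue(self, record):
            try:
                self.queue.put_nowait(record)
            except queue.Full:
                pass  # shed load: drop the record on the floor

    listener = QueueListener(log_queue, logging.StreamHandler())
    listener.start()
    logging.getLogger().addHandler(DroppingQueueHandler(log_queue))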

------
jwatte
The real support for a frequent deployment system is in the immune system!
I've had good luck with a deployment immune system that rolls back if CPU or
other load jumps, even if it doesn't immediately cause user failure. (I.e.,
monitor crucial internals, not just user availability.)
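
A minimal sketch of what I mean, assuming psutil and leaving the
deploy/rollback hooks and thresholds as placeholders:

    import time
    import psutil

    def deploy_with_immune_system(deploy, rollback, watch_secs=300, max_jump=1.5):
        """Deploy, then watch CPU; roll back if load jumps well past the
        pre-deploy baseline, even if user-facing checks still look fine."""
        baseline = psutil.cpu_percent(interval=30)
        deploy()
        deadline = time.time() + watch_secs
        while time.time() < deadline:
            current = psutil.cpu_percent(interval=30)
            if current > max(baseline * max_jump, baseline + 20.0):
                rollback()
                raise RuntimeError("rolled back: CPU %.0f%% vs baseline %.0f%%"
                                   % (current, baseline))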

