
Stripe – Outage postmortem - byroot
https://support.stripe.com/questions/outage-postmortem-2015-10-08-utc
======
jorgeortiz85
Hi, I work in infrastructure at Stripe and I'm happy to provide more insight.
Several threads here have commented on our tooling and processes around index
changes. I can give a bit more detail about how that works.

We have a library that allows us to describe expected schemas and expected
indexes in application code. When application developers add or remove
expected indexes in application code, an automated task turns these into
alerts to database operators to run pre-defined tools that handle index
operations.
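
To give a rough idea (this is only an illustrative sketch, not our actual
internal library, and the names are made up), the descriptions live next to
the model code, something like:

    from collections import namedtuple

    # Hypothetical declarative index description, kept alongside the model code.
    IndexSpec = namedtuple("IndexSpec", ["collection", "fields", "unique"])

    EXPECTED_INDEXES = [
        IndexSpec(collection="charges",
                  fields=(("merchant_id", 1), ("created", -1)), unique=False),
        IndexSpec(collection="charges",
                  fields=(("charge_token", 1),), unique=True),
    ]

    def expected_indexes_for(collection):
        """Return the index descriptions the application expects for a collection."""
        return [spec for spec in EXPECTED_INDEXES if spec.collection == collection]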

In this situation, an application developer didn't add a new index description
or remove an index description, but rather modified an existing index
description. Our automated tooling handled this particular change erroneously:
instead of interpreting it as a single intention, it encoded it as two
separate operations (an addition and a removal).
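
Conceptually (again, just an illustrative sketch, not our real tooling), the
failure mode was equivalent to diffing the old and new descriptions as two
independent sets, which turns a single modification into an unlinked removal
plus addition:

    def plan_operations(old_specs, new_specs):
        """Naive diff of expected index descriptions (illustrative only).

        A *modified* description lands in both difference sets, so it comes
        out as two independent operations, with nothing tying them together
        or forcing the addition to happen before the removal.
        """
        old, new = set(old_specs), set(new_specs)
        additions = [("ADD", spec) for spec in new - old]
        removals = [("REMOVE", spec) for spec in old - new]
        return additions + removals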

Developers describe indices directly in the relevant application/model code to
ensure we always have the right indices available -- and in part to help avoid
situations like this. In addition, the tooling for adding and removing indexes
in production is restricted to a smaller set of people, both for security and
to provide an additional layer of review (also to help prevent situations like
this). Unfortunately, because of the bug above, the intent was not accurately
communicated. The operator saw two operations, not obviously linked to each
other, among several other alerts, and, well, the result followed.

There are some pretty obvious areas for tooling and process improvements here.
We've been investigating them over the last few days. For non-urgent
remediations, we have a custom of waiting at least a week after an incident
before conducting a full postmortem and determining remediations. This gives
us time to cool down after an incident and think clearly about our
remediations for the long-term. We'll be having these in-depth discussions,
and making decisions about the future of our tooling and processes, over the
next week.

~~~
eldavido
hi jorge

I'd actually applied to work at Stripe about two years ago; you guys turned me
down ;)

I was responsible for ops at a billion-device-scale mobile analytics company
for about 1.5 years. Your tooling is far superior to anything we produced. I
like the idea of a single source of truth describing the data model (code,
tables, query patterns, etc.) a lot, and doubly so that it's revision-
controlled and available right alongside the code.

I think it's far from decided though, how much to involve human operators in
processes like this. Judging from this answer, you seem to be on the extreme
end of "automate everything". How then, I'm curious, do you train/communicate
to developers what can be done safely vs. something that would cause i/o
bottlenecks, slowdown, or other potentially production-impacting effects? Can
you even predict these things accurately in advance? (Some of our worst
outages were caused by emergent phenomena that only manifested at production
scale, such as hitting packet throughput and network bandwidth limits on
memcached -- totally unforeseeable in a code-only test environment).

It sounds like you let developers request changes (a la "The Phoenix Project")
but ops is responsible for final approval of the change? That actually sounds
like a great system. Would love some elaboration on this.

In any case, great writeup and from one guy who's been there when the pager
goes off to another, sounds like the recovery went pretty smoothly.

~~~
toomuchtodo
Shouldn't developers understand how a database change is going to impact an
environment based on the code they've written?

~~~
BinaryIdiot
Yes, they very much should! But in my admittedly anecdotal experience, only
the best / most senior ever do. Almost every junior or mid-level developer
I've worked with (and a small handful of senior folks) not only has no idea
how changes like this would impact the larger environment, but won't even care
to look into it.

~~~
noir_lord
In part, though, that's because the tooling to do it easily absolutely
_sucks_. The impedance mismatch (an overused term, but apt here) between the
two parts of the system causes a lot of the underlying issues. Better tooling
is a large part of the solution, I think, but I've not seen anything that
would really help, and the surface area of a modern RDBMS is so large, without
even getting into vendor-specific stuff, that I'm not sure what that tooling
would even look like.

~~~
BinaryIdiot
That's certainly a great point! If there were a way to automatically test much
of this, I bet even the newest of engineers could stop this. Doing that is tough,
hmm...

~~~
noir_lord
I think the only way you could do it on top of an RDBMS is to use a strict
subset of features that are common across vendors (something that many ORMs
already do), which reduces the problem scope down to something manageable. The
issue then is that there would always be the temptation to use something
outside that subset and forgo the easier testing; fast forward a bit and you
have the same issue again.

It would be interesting to build an RDBMS that enforced that subset by simply
not allowing those features to be used/abused, while still supporting many of
the modern features (JSONB etc.), but that is _way_ beyond my area of
expertise.

------
Animats
What's so striking about this is that entire retail chains can be shut down by
a problem in some cloud server. Starbucks had a server outage in April which
caused stores to close.[1]

There's a trend towards "hosted POS", where point of sale systems have to talk
to the "cloud" to do anything, even handle cash. Until recently, most POS
systems were running off a server in the manager's office, which communicated
to servers and credit card systems elsewhere. A network outage didn't affect
cash transactions. Often, the systems could even process a credit card
transaction without external help, giving up real-time validation but still
logging the transaction for later processing.

There's a single point of failure being designed into retail here.

[1] [http://www.usatoday.com/story/tech/2015/04/24/starbucks-point-of-sale-cash-registers-down/26338505/](http://www.usatoday.com/story/tech/2015/04/24/starbucks-point-of-sale-cash-registers-down/26338505/)

~~~
thrownaway2424
When Blue Bottle Coffee switched to Square there was a noticeable decline in
the throughput at the cash register. It just takes the retail employee longer
to do anything on an iPad. Pretty much everything can be done faster on a real
cash register. There have also been the requisite outages, of course. Recently
I was at Blue Bottle and the Square terminal wasn't opening the cash drawer.
They were making change out of the tip jar :-/

~~~
Animats
There's something that a lot of retailers don't get - never put an obstacle in
front of the customer giving you their money. Don't let lines form at
checkouts. Don't clog up the counter with impulse-purchase stuff. Don't put
displays in the path of customers headed for checkout. Don't make customers
jump through hoops with loyalty cards and data entry. Don't do anything that
slows the checkout process.

There are expensive retail consultants who clean up stores and improve sales
by doing this.

Gap gets this right. Gap stores have big, clear counters, so you can bring up
lots of merchandise and have a place to put it. This increases sales per
customer. Gap is amazingly successful despite a rather blah product line.

~~~
jacobolus
On the other hand, having a line out the door seems to sometimes be good
marketing.

~~~
Animats
No. That went out decades ago. The Syufy chain of theaters (later Century)
used to try to create lines by not opening up enough ticket windows. Then came
video rental. Then came half-empty theaters.

Retailers no longer have the power to make customers wait. Consumers have too
many other buying options.

~~~
peoplelovelines
Wrong. Clearly you've never lived in NYC, where people apparently use lines as
a proxy for how good something is. Lots of other big cities too.

Also, box office revenue hasn't collapsed or anything, so I'm not sure what
you mean with your example. In fact, come to think of it, I see people lining
up for things multiple times every year at the theaters I frequent.

~~~
oalders
When I see those long lines in front of [insert trendy food item] stores I
think, "I'll have to try that one day when there's no lineup" but when that
day actually comes, I've already forgotten about the place. Some of us do
gauge popularity by lineups, but we also can't personally be bothered to stand
in line.

------
noir_lord
This is one of the best incident reports I think I've ever read: detailed,
honest, and without recrimination.

I'm reminded of something I read in a book a while back about aviation
disasters "It's not the first problem that kills you, it's the second".

In this post the described system for change management is at least as good as
any I've seen in production and yet a series of small problems got out of hand
quickly.

~~~
squiggy22
Totally agree. Came here to say pretty much the same thing. On a related note,
highly recommend the Checklist manifesto (book) for devops.

~~~
j_s
[http://amzn.com/B0030V0PEW](http://amzn.com/B0030V0PEW)

------
methehack
I appreciate the post-mortem and, of course, we've all been there. I have to
say, though, that the cause is a little surprising. That DDL needs to be
executed sequentially is pretty basic and known by, I'm sure, everyone in the
engineering and operations organizations at Stripe. It surprises me that an
engineering group that is obviously so clever and competent would come up with
a process that lost track of the required execution order of two pieces of
DDL. If process (like architecture) reflects organization, what does this mean
about the organization that came up with this process? It's not sloppy,
exactly. Is it overly-specialized? It reminds me of that despair.com poster
"Meetings: none of us is as dumb as all of us" in that a process was invented
that makes the group function below the competence level of any given member
of the group.

~~~
jlarocco
I was thinking the same thing.

If nothing else, I don't understand the role of "database operator" if they'll
just blindly delete a critical index without thinking about it. Shouldn't that
person have known better than anybody how critical the index was?

~~~
gms
There is an open secret you may not be aware of: every startup is a shitshow
on the inside.

~~~
xacaxulu
Also top 20 financial institutions, USG orgs and places that store your
healthcare and tax information ;-) I'd argue that a good number of startups
these days (especially ones borne out of larger organizations with lots of
combined experience) are way more capable of handling these issues with
finesse and speed.

------
wpietri
Nicely done. Good job avoiding retrospectively blaming people and instead
focusing on future system improvements.

(For those wondering why this is important, Sidney Dekker's "Field Guide to
Understanding Human Error" is a mindblowing book.)

~~~
jskulski
Agreed. It looks like this happened not because the application developer or
DBA made an error, but because the system didn't allow for ticket
dependencies.

~~~
wpietri
Definitely. But it would have been very easy to yell at the developer ("you
should have known not to do it that way") or the DBA ("why are you doing
tickets out of order? you know we have to do deletes last!").

Especially after a dramatic event, those are very easy reactions to have, and
they can sound very sensible.

------
siliconc0w
For schema/index changes I prefer migrations. These should be committed
alongside the code, tested, and promoted through environments. This largely
prevents dependency issues because the migrations are ordered.

Devops is like 80% dependency management. It's painful initially but you have
to crack down on manual changes to production - all production changes should
be defined in code, committed to git, tested, and flow through non-prod
environments first. (just had an outage yesterday that was essentially because
I failed to do this)
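
As a rough sketch of what I mean (file name, schema, and the migration runner
are hypothetical), a migration that replaces an index keeps both steps in one
ordered unit:

    # migrations/0042_replace_charges_merchant_index.py (hypothetical)
    #
    # Migrations are committed alongside the code and applied in filename
    # order, so the new index is always built before the old one is dropped.

    def up(db):
        # Build the replacement index first...
        db.execute("CREATE INDEX charges_merchant_created_idx "
                   "ON charges (merchant_id, created)")
        # ...and only then remove the index it supersedes.
        db.execute("DROP INDEX charges_merchant_idx")

    def down(db):
        db.execute("CREATE INDEX charges_merchant_idx ON charges (merchant_id)")
        db.execute("DROP INDEX charges_merchant_created_idx")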

~~~
rawnlq
Do people even test for perf issues in non-prod environments?

~~~
toomuchtodo
I've found it to be extremely difficult to replicate production workloads in
staging/development.

------
pcunite
>> the database operator didn’t have a way to check whether the index had
recently been used for a query.

Appreciate the disclosure. Stripe engineers have been so helpful to me and my
business. Keep up the good work.

------
leeleelee
What is the point of having a human "database operator" who carries out simple
tasks like deleting an index when it shows up in a work request log? Is this
how most companies structure their dev teams?

If you have a human there to perform tasks, then it would seem natural to
allow them "human advantages" such as the ability to communicate with the
person who created the request, or the ability to have some of their own
checks and balances before performing the index deletion (e.g. let's take a
look and see if this seems safe based on the current schema and codebase).

I am also surprised how easy it was for a single dev to make a request that
subsequently results in the modification of a production database.

~~~
thrownaway2424
Indeed, what is the point? If they have some separation of concerns between
dev and ops then it also makes sense to give ops some separate rights and
responsibilities, such as vetting a change like this by analyzing the
resulting query plans from a sample of production database queries. I'm sure
something like that will be in their internal postmortem action items under
"prevention".

------
vruiz
> At 00:08 UTC, our on-call engineer had been paged and had responded. At
> 00:10 UTC, we linked the API degradation to the removal of the index.

I'm sure Stripe has very good metrics of all their systems; nonetheless, that's
some rockstar level debugging skills.

~~~
protomyth
We had a utility I wrote at one of the places I worked where you ran it
against the database and it showed you all of the query plans running. It was
easy, within 15 seconds, to see some non-indexed query and what was executing
it.
We used to run it after new version deployments to see if we had query
problems. Tooling is very important.
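
Something similar can be thrown together in a few lines against Postgres, for
example (a sketch: it assumes psycopg2 and permission to read
pg_stat_activity, and EXPLAIN will fail on statements with bind parameters):

    import psycopg2

    def show_running_query_plans(dsn):
        """Print EXPLAIN output for every query currently executing."""
        conn = psycopg2.connect(dsn)
        conn.autocommit = True
        with conn.cursor() as cur:
            cur.execute(
                "SELECT pid, query FROM pg_stat_activity "
                "WHERE state = 'active' AND pid <> pg_backend_pid()"
            )
            for pid, query in cur.fetchall():
                print("-- pid %s: %s" % (pid, query))
                try:
                    cur.execute("EXPLAIN " + query)
                    for (line,) in cur.fetchall():
                        print("   " + line)
                except psycopg2.Error:
                    print("   (could not EXPLAIN this statement)")
        conn.close()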

~~~
perlgeek
As a stopgap solution, enabling the slow query log can also be very helpful.
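
For MySQL, for instance, it can be flipped on at runtime (thresholds here are
illustrative; this assumes the MySQLdb driver and an admin account):

    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="admin", passwd="...")
    cur = conn.cursor()
    cur.execute("SET GLOBAL slow_query_log = 'ON'")
    cur.execute("SET GLOBAL long_query_time = 0.5")  # log queries slower than 0.5s
    cur.close()
    conn.close()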

~~~
protomyth
Yep. A lot of databases have some form of system tables that can tell you a
lot about a running system with just some queries. I know Sybase and MS SQL
server have the tables to find all the current queries and what they are
doing. Ingres had a neat graph (seeing FSM should elicit a panic in most
DBAs) and a log monitor. Read up on what is part of the database and write
some tools.

------
nartz
Good post mortem, just some thoughts, would love to hear what others think.

Shouldn't the app developers be in charge of deploying their code and making
sure it's working? It seems odd that they pass off a migration like this to a
DBA to then go deploy randomly, or that they weren't around when it was
happening to monitor it.

Also, it seems like there should have been a script that can be run that
encapsulates the required dependencies, namely, don't drop an index before
building the new one maybe? This should at least minimize the amount of
context needed in a ticket.

Relying on fully correct context in tickets seems like it could be super error
prone.

------
perlgeek
It's a vicious circle: an index is missing, so requests take longer, so more
DB queries run at the same time, and the load on the DB server shoots through
the roof.

And then it needs to rebuild an index on top of the already above-average
load.

------
la6470
Why would someone delete an existing index without recreating it first? This
is common sense for a DBA, but not for our full-stack engineers who have to
know everything about everything.

------
ryporter
People seem to be focusing on the fact that the two operations should have
been linked, but I think they actually should have been further separated. The
old index presumably could have been left around for days, at the cost of some
performance, and only deleted once they were sure that the new index was
successfully being used by all production code.

While splitting these two changes may be silly in the case of a simple index
change, I think that it's a good general policy to only deploy the minimal set
of changes to a production database at once. On the production system I manage
(admittedly, much simpler, and many orders of magnitude smaller), I always
deploy changes in two stages, with the second stage generally deployed about a
week later, once I'm sure that only new code will be running against the
database.

This comment is not meant to second guess the Stripe developers (who produced
a great postmortem), but to suggest another possible remediation.

~~~
mst
I had a similar thought but if it's that critical an index it's entirely
possible they couldn't afford the write overhead to keep both copies around.

------
pbiggar
Very interesting.

In a continuous delivery environment like this, you generally want to test and
validate any change you make. Do any DBs support the idea of turning off an
index (but keeping it updated)? That way you could disable it, keep going for a
while, and then delete it once you're sure it's unused?

~~~
tempestn
In MySQL/MariaDB (and presumably other major databases) you can instruct any
given query to ignore an index, but it has to be done at the query level.
Perhaps the db interaction layer of the application code could be written such
that all select queries are built with a variable containing a list of indexes
to disable. You could then set that globally (or using dependency injection)
and any query using the index would stop using it, without needing to go
around and search for uses manually.
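
A sketch of what that db layer might look like (table and index names are
made up):

    # Indexes to disable globally; in practice this would likely come from
    # config or dependency injection rather than a module-level list.
    DISABLED_INDEXES = ["charges_merchant_idx"]

    def build_select(table, columns, where):
        """Build a SELECT that tells MySQL/MariaDB to ignore the disabled indexes."""
        hint = ""
        if DISABLED_INDEXES:
            hint = " IGNORE INDEX (%s)" % ", ".join(DISABLED_INDEXES)
        return "SELECT %s FROM %s%s WHERE %s" % (
            ", ".join(columns), table, hint, where)

    # build_select("charges", ["id", "amount"], "merchant_id = %s")
    # -> "SELECT id, amount FROM charges IGNORE INDEX (charges_merchant_idx)
    #     WHERE merchant_id = %s"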

------
coolrhymes
Great postmortem report. I am very surprised to see that the DB op simply
deleted the index without even considering the consequences. More importantly,
why weren't the two changes part of a single migration file? For example, in
Django, South migrations can include both the index removal and creation in
one file, executed together. Also, like some of the guys mentioned, why was
the update applied to your entire global cluster? Shouldn't it be incremental,
e.g. one availability zone at a time?
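
For illustration, with Django's built-in migration framework (app, table, and
index names are hypothetical, and the SQL is MySQL-flavored), both operations
can live in one file and run in a defined order:

    # myapp/migrations/0017_replace_charge_index.py (hypothetical)
    from django.db import migrations

    class Migration(migrations.Migration):
        dependencies = [("myapp", "0016_previous")]

        operations = [
            # Create the replacement index first...
            migrations.RunSQL(
                "CREATE INDEX charge_merchant_created_idx "
                "ON myapp_charge (merchant_id, created)",
                reverse_sql="DROP INDEX charge_merchant_created_idx ON myapp_charge",
            ),
            # ...then drop the one it supersedes, in the same migration.
            migrations.RunSQL(
                "DROP INDEX charge_merchant_idx ON myapp_charge",
                reverse_sql="CREATE INDEX charge_merchant_idx "
                            "ON myapp_charge (merchant_id)",
            ),
        ]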

------
t0mk
Would be cool to start the report with a tl;dr, containing just the essence of
the incident, something like

"Dev needed to update an index and at time .. submitted two tickets, for new
index creation and old index removal. At time .. an op processed the removal
ticket first, which caused an outage in service .. It was alerted at time ..
and the on-call op identified it at time .. He proceeded to ..."

Just for people who want to know what happened but don't care for the details.

------
radicalbyte
Is it an option for you to maintain a set of tests which simulate behavior on
the tables?

I've had great success doing this with data warehouses (i.e. star schema,
large tables with few writes). You run the tests after each index change on
acceptance. It caught a few errors.

For OLTP it's harder: you need to record some production workload and replay
it. At your scale it's easier said than done, though.

~~~
brown9-2
If the change that was introduced required creating a new index, it's unlikely
that a test would have thought to remove the now-outdated index, or to test
performance with the create-new/remove-old operations reversed.

------
matthewarkin
Woot, a postmortem from Stripe! I've been asking for these for over a year and
hopefully we'll see more from them in the future.

------
scurvy
"Quick code fixes" almost always make the problem worse and just cause stress
and anxiety. It's always better to simply tackle the root cause and fix that.

Unless you're disabling a feature, don't push "quick code fixes". You'll pay
for it later.

~~~
MichaelGG
Except this seems to be counterfactual? And the quick fixes were just
temporary until the index was finished building. Seems like a perfectly fine
thing to do to restore service during an outage, as long as they're rolled
back or reviewed later.

~~~
scurvy
> Seems like a perfectly fine thing to do to restore service during an outage,
> as long as they're rolled back or reviewed later.

Rolling back later seems to be forgotten in most cases (unless it's disabling a
feature). Then you end up with this weird behavior and code path that few
remember. "Oh yeah we did that when...." You usually only rediscover it after
that quick hack is a problem. This is decades of experience here.

------
protomyth
So, an engineer can just submit a production change that a database operator
will execute? What is the ceremony around this, and does Stripe employ DBAs
for production? What is the review process for a production change?

~~~
jewel
The change was to add a new index and then remove the old one. It's something
that would have passed review.

The cause was a defect in the tooling. The requests weren't tied together in
the way they were displayed and so the DBA removed the old index first.

~~~
protomyth
That's not a tooling defect, that's a defect in process and in personal
knowledge of the system. No DBA should remove an index without knowing what is
replacing
it or that code got deployed to make it obsolete. Checking if that index is
currently being used would be a minimum. If a DBA cannot stop a deploy you
have process problems.
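
On Postgres, for example, that check is a single query against the statistics
views (a sketch; Stripe's stack is different, but most databases expose
something equivalent):

    import psycopg2

    def index_scan_count(dsn, index_name):
        """Return how many scans an index has served since stats were last reset."""
        conn = psycopg2.connect(dsn)
        with conn.cursor() as cur:
            cur.execute(
                "SELECT idx_scan FROM pg_stat_user_indexes "
                "WHERE indexrelname = %s",
                (index_name,),
            )
            row = cur.fetchone()
        conn.close()
        return row[0] if row else None

    # Before dropping an index, confirm it really is idle: the count returned
    # by index_scan_count("dbname=prod", "charges_merchant_idx") should not
    # be climbing while the index is a candidate for removal.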

~~~
mst
The post mortem specifically noted that it looked unused, presumably because
the code in master now claimed to depend on the new index.

~~~
protomyth
I do not mean by looking at the source code. I mean by looking at the
production database that you are about to remove an index from. This is a
financial institution, and I am really struggling with this post mortem to
understand how they do not have more ceremony around production changes.
Checks and balances are important.

------
codegeek
ohh. Could this mean that a payment may have failed? I did have a failure of
a payment around this time but it said "Card declined". So I hope it really
was the card and not related to this incident.

~~~
matthewarkin
Failures caused by this should have returned an api_error, a card_error (like
card declined) would have been returned upstream from Stripe.

------
rdancer
"Outrage postmortem"

------
vostro_mf
Sigh, yet another example of hot shot teams using MongoDB just because it's
new and sexy. Existing, established tools such as Oracle and Postgres would
have offered lots of ways of avoiding such a problem.

~~~
dang
We detached this subthread from
[https://news.ycombinator.com/item?id=10366459](https://news.ycombinator.com/item?id=10366459)
and marked it off-topic.

------
depsypher
They clearly just need to add Devops. That'll fix everything.

~~~
depsypher
Wow, message received HN. Don't make fun of Devops! It doesn't seem that long
ago that it was hailed as the silver bullet that eliminates the "throw it over
the wall" mentality that causes issues exactly like this one.

This issue was caused by a failure in communication between team members. That
communication is just as important as good engineering.

~~~
kzhahou
Your comment was obviously meant as commentary on the DevOps trend, but in
fact devops is not so well-defined a trend as to make a good target for
sniping comments like this. I mean, people talk about "devops" today mostly
just to say "we must increase investment in our infrastructure." No one's saying
"devops will fix everything." So I don't think people downvoted you for
cutting on devops per se, as much as for not presenting any clear POV.

It's not like Node, where evangelists DO go around saying it's the best thing
ever and blindly ignore the problems with it. Now... there's a good target for
snippy remarks! :-P

