
Root cause analysis: significantly elevated error rates on 2019‑07‑10 - gr2020
https://stripe.com/rcas/2019-07-10
======
vjagrawal1984
In the face of so many outages from big companies, I wonder how
Visa/MasterCard is so resilient.

Is it because they are over the curve and don't make "any" changes to their
system, as opposed to other companies that are still maturing?

~~~
wallflower
Mainframes.

> Visa, for example, uses the mainframe to process billions of credit and
> debit card payments every year.

> According to some estimates, up to $3 trillion in daily commerce flows
> through mainframes.

[https://www.share.org/blog/mainframe-matters-how-mainframes-keep-the-financial-industry-up-and-running](https://www.share.org/blog/mainframe-matters-how-mainframes-keep-the-financial-industry-up-and-running)

[https://blog.syncsort.com/2018/06/mainframe/9-mainframe-statistics/](https://blog.syncsort.com/2018/06/mainframe/9-mainframe-statistics/)

[https://www.ibm.com/it-infrastructure/servers/mainframes](https://www.ibm.com/it-infrastructure/servers/mainframes)

~~~
andrewg
Specifically they run IBM zTPF on their mainframes, which is also used by
airlines. Some installations have uptimes measured in decades.

[https://www.ibm.com/it-infrastructure/z/transaction-processing-facility](https://www.ibm.com/it-infrastructure/z/transaction-processing-facility)

~~~
wereHamster
It's rarely the hardware that fails; it's more often the software. So I
wonder what the software that's running on mainframes does differently than
the software that's written for ordinary computers.

~~~
Terretta
> _So I wonder what the software that's running on mainframes does
> differently than the software that's written for ordinary computers._

Not change.

------
ssalazars
[2019-07-10 20:13 UTC] During our investigation into the root cause of the
first event, we identified a code path likely causing the bug in a new minor
version of the database’s election protocol. [2019-07-10 20:42 UTC] We rolled
back to a previous minor version of the election protocol and monitored the
rollout.

There's a 20-minute gap between the investigation and the rollback. Why did
they roll back if the service was back to normal? How could they decide on,
and document, the change within 20 minutes? Are they using CMs to document
changes in production? Were there enough engineers involved in the decision?
Clearly not all variables were considered.

To me, this demonstrates poor Operational Excellence values. Your first goal
is to mitigate the problem. Then, you need to analyze, understand, and
document the root cause. Rolling back was a poor decision, imo.

~~~
dps
(Stripe CTO here)

Thanks for the questions. We have testing procedures and deploy mechanisms
that enable us to ship hundreds of deploys a week safely, including many which
touch our infrastructure. For example, we do a fleetwide version rollout in
stages with a blue/green deploy for typical changes.
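
For illustration, a hypothetical sketch of the general staged-rollout pattern with a health gate between stages; the stage fractions, soak time, and function names here are illustrative, not the actual tooling:

    # Hypothetical sketch of a staged fleet rollout with a health gate between
    # stages. Names and thresholds are illustrative only.
    import time

    STAGES = [0.01, 0.10, 0.50, 1.00]   # fraction of the fleet per stage

    def healthy(error_rate: float, threshold: float = 0.001) -> bool:
        """Gate: proceed only while the observed error rate stays below threshold."""
        return error_rate < threshold

    def rollout(deploy_to_fraction, observe_error_rate, rollback):
        for fraction in STAGES:
            deploy_to_fraction(fraction)      # shift this slice of traffic to the "green" version
            time.sleep(300)                   # soak period before evaluating the stage
            if not healthy(observe_error_rate()):
                rollback()                    # revert the whole fleet to "blue"
                return False
        return True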

In this case, we identified a specific code path that we believed had a high
potential to cause a follow-up incident soon. The course of action was
reviewed by several engineers; however, we lacked an efficient way to fully
validate this change on the order of minutes. We're investing in building
tooling to increase robustness in rapid response mechanisms and to help
responding engineers understand the potential impact of configuration changes
or other remediation efforts they're pushing through an accelerated process.

I think our engineers’ approach was strong here, but our processes could have
been better. Our continuing remediation efforts are focused there.

~~~
tus88
> ship hundreds of deploys a week safely

That seems like a lot of change in a week, or does deploys mean something else
like customer websites being deployed?

~~~
tschwimmer
They very likely have continuous deployment, so each change could potentially
be released as a separate deploy. If the changes touch the data model, they
gotta run a migration. So hundreds seems reasonable to me.
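
For illustration, a minimal sketch of the kind of data-model migration a deploy might carry, assuming a MongoDB-style store; the collection and field names here are made up:

    # Hypothetical online migration: rename a field across existing documents.
    # Collection and field names are made up for illustration.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    db = client.example_app

    # Backfill in place; new application code reads/writes the new field name.
    result = db.payments.update_many(
        {"amount_cents": {"$exists": True}},
        {"$rename": {"amount_cents": "amount_minor_units"}},
    )
    print(f"migrated {result.modified_count} documents")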

------
laCour
"[Four days prior to the incident] Two nodes became stalled for yet-to-be-
determined reasons."

How did they not catch this? It's super surprising to me that they wouldn't
have monitors for this.

~~~
lethain
(Stripe infra lead here)

This was a focus in our after-action review. The nodes responded as healthy to
active checks while silently dropping updates on their replication lag;
together, this created the impression of a healthy node. The missing bit was
verifying the absence of lag updates. (Which we have now.)
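
To illustrate, here is a minimal sketch of a staleness check of this kind, i.e. alerting when a node stops reporting replication lag at all, not only when the reported lag is high; the data shape, threshold, and names are hypothetical:

    # Hypothetical staleness check: flag nodes that have not reported their
    # replication lag recently, regardless of the last reported value.
    import time

    MAX_SILENCE_SECONDS = 120  # how long a node may go without reporting lag

    def check_lag_reports(last_lag_report_at: dict[str, float]) -> list[str]:
        """Return node IDs whose most recent lag report is too old."""
        now = time.time()
        return [
            node
            for node, reported_at in last_lag_report_at.items()
            if now - reported_at > MAX_SILENCE_SECONDS
        ]

    # Example: node-b stopped reporting ten minutes ago and should trigger an alert.
    stale = check_lag_reports({"node-a": time.time() - 5, "node-b": time.time() - 600})
    print("stale lag reporters:", stale)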

~~~
throwaway3489
I am a curious and very amateur person, but do you think that if "100%" uptime
were your goal, this:

"[Three months prior to the incident] We upgraded our databases to a new minor
version that introduced a subtle, undetected fault in the database’s failover
system."

could have been prevented if you had stopped upgrading minor versions, i.e.
frozen on one specific version and not even applied security fixes, relying
instead on containing it as a "known" vulnerable database?

The reason I ask is that I heard of ATMs still running Windows XP or stuff
like that. But if it's not networked, could it be that that actually has
better uptime than anything you can do on Windows 7 or 10?

What I mean is, even though it is hilariously out of date to be using Windows
XP, still, by any measure it's had a billion device-days to expose its failure
modes.

When you upgrade to the latest minor version of a database, don't you
sacrifice the known bad for an unknown good?

Excuse my ignorance on this subject.

~~~
Jorsiem
How do you have an ATM that's not networked?

~~~
throwaway3491
Same user (sorry I guess I didn't enter my password carefully as I can't log
in.)

Well I mean they're not exactly on the Internet with an IP address and no
firewall, are they? (Or they would have been compromised already.)

Whatever it is, it must be separated off as an "insecure enclave".

So that's why I'm wondering about this technique. You don't just miss out on
security updates, you miss performance and architecture improvements, too, if
you stop upgrading.

But can that be the path toward 100% uptime? Known bad and out of date
configurations, carefully maintained in a brittle known state?

~~~
Operyl
Secure... enclave? I'm sorry, but I think you're throwing buzzwords around
hoping to hit a home run here.

~~~
nitrogen
No, it's a fair question. The word "enclave" has a general meaning in English
as a state surrounded entirely by another, or metaphorically a zone with some
degree of isolation from its surroundings.

So the legit question is, can insecure systems (e.g. ancient mainframes) be
wrapped by a security layer (WAF, etc.) to get better uptime than patching an
exposed system?

~~~
throwaway3491
yes, thank you.

------
zby
So the article identifies a software bug and a software/config bug as the root
cause. That sounds a bit shallow for such a high-visibility case - I was
expecting something like the
[https://en.wikipedia.org/wiki/5_Whys](https://en.wikipedia.org/wiki/5_Whys)
method, with subplots on why the bugs were not caught in testing. By the way, I
only clicked on it because I was hoping it would be an occasion to use the
methods from [http://bayes.cs.ucla.edu/WHY/](http://bayes.cs.ucla.edu/WHY/) -
alas, no - it was too shallow for that.

~~~
zbentley
It is likely that this RCA was shallow because it was intended for everyone--
including non-technical users, who (at least in my experience) tend to
misinterpret or get confused by deep technical or systemic failure analysis.

It would be excellent if Stripe published a truly technical RCA, perhaps for
distribution via their tech blog, so that folks like us could get a more
complete understanding and what-not-to-do lesson (if the failing systems were
based on non-proprietary technologies, that is).

~~~
throwawaydba
From reading the RCA, this should be the trinity of mysql + orchestrator +
vitess. If stripe can't get it right, there is no chance for the others.

------
gr2020
Anybody know what database they’re using?

~~~
conroy
MongoDB is the primary data store used at Stripe.

~~~
a13n
Really speaks volumes about how mature MongoDB has become considering how
solid Stripe's reliability is.

~~~
londons_explore
MongoDB is a really scary database to use at scale.

It doesn't shard nicely. Failovers have rather nasty semantics that can cause
nasty bugs in client-side code. Performance cliffs abound.
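
To illustrate the kind of client-side care failovers demand, a minimal sketch assuming pymongo; whether a write is safe to retry after an AutoReconnect depends on its idempotency, which is exactly where the subtle bugs creep in:

    # Hypothetical sketch: a write that may be interrupted by a replica-set
    # failover. An AutoReconnect can be raised after the server received the
    # write, so blindly retrying a non-idempotent operation can apply it twice.
    from pymongo import MongoClient
    from pymongo.errors import AutoReconnect

    client = MongoClient("mongodb://localhost:27017", retryWrites=True)
    charges = client.example_app.charges

    def record_charge(charge_id: str, amount: int) -> None:
        for attempt in range(3):
            try:
                # Upsert keyed on a unique ID so a retry cannot double-insert.
                charges.update_one(
                    {"_id": charge_id},
                    {"$set": {"amount": amount}},
                    upsert=True,
                )
                return
            except AutoReconnect:
                if attempt == 2:
                    raise  # give up after the election should have completed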

If your datastore is anything over 1TB, I'd be using Postgres or, if you can
manage it, something Bigtable-like.

------
segmondy
As I mentioned earlier, "human error often, configuration changes often, new
changes often."
[https://news.ycombinator.com/item?id=20406116](https://news.ycombinator.com/item?id=20406116)

------
chance_state
This reads like the marketing/PR teams wrote much of it. Compare to the
Cloudflare post-mortem from today:
[https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/](https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/)

~~~
dps
I'm Stripe's CTO and wrote a good deal of the RCA (with the help of others,
including a lot of the engineers who responded to the incident). If you've any
specific feedback on how to make this more useful, I'd love to hear it.

~~~
patio11
I work at Stripe, on the marketing team, and assisted a bit here. My last
major engineering work was writing the backend to a stock exchange.

If anyone on HN knows anyone who has the sort of interesting life story where
they both know what can cause a cluster election to fail and like writing
about that sort of thing, we would eagerly like to make their acquaintance.

~~~
luizfelberti
Maybe Kyle Kingsbury (aka @aphyr) is the person you are looking for?

[https://jepsen.io/services#consulting](https://jepsen.io/services#consulting)

~~~
wbronitsky
Kyle used to work at Stripe and left. I don’t think he would come back
unfortunately. That guy is absolutely amazing, especially with regards to
distributes DBs and writing about them

------
mual
Is this Stripe's first public RCA? Looking through their tweets, there do not
appear to be other RCAs for the same "elevated error rates". It seems hard to
conclude much from one RCA.

------
jacquesm
Why don't they call 'significantly elevated error rates' an 'outage' instead?

~~~
NikolaeVarius
Because "A substantial majority of API requests during these windows failed. "
implying that there was not a complete outage.

I don't understand why people demand the usage of incorrect language.

~~~
teraflop
In my mind, a "degradation" would be if some fraction of requests were
randomly failing, but they would be likely to eventually succeed if retried.
Or if the service itself was essentially accessible, but some non-essential
functionality was not working correctly.

On the other hand, if for a significant number of users the site was
completely unusable for some period of time, then I think it's fair to use the
word "outage". (Even if it's not a _complete_ outage affecting all users.)

I don't know whether other people would interpret these terms the same way I
do, nor do I think there's enough information in this blog post to determine
for sure which label is more accurate for this particular incident. So
personally, I'm not going to be too picky about the wording.

------
luminati
Since both companies' root cause analyses are currently trending on HN, it's
pretty apparent that Stripe's engineering culture has a long way to go to
catch up with Cloudflare's.

------
debt
"We identified that our rolled-back election protocol interacted poorly with a
recently-introduced configuration setting to trigger the second period of
degradation."

Damn, what a mess. Sounds like y'all are rolling out way too many changes too
quickly, with little to no time for integration testing.

It's a somewhat amateur move to assume you can just arbitrarily roll back
without consequence, without testing, etc.

One solution I don't see mentioned: don't upgrade to minor versions, ever. And
create a dependency matrix so that if you do roll back, you roll back all the
other things that depend on the thing you're rolling back as well.
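
Something like this sketch is what I mean; the component names are made up, and the point is just that rolling back one thing should take its transitive dependents with it:

    # Hypothetical sketch: given "X depends on Y" edges, compute everything that
    # must be rolled back together with a given component (its transitive dependents).
    DEPENDS_ON = {
        "api": ["election-protocol", "config-service"],
        "config-service": ["election-protocol"],
        "election-protocol": [],
    }

    def rollback_set(target: str) -> set[str]:
        """Return the target plus every component that (transitively) depends on it."""
        to_roll_back = {target}
        changed = True
        while changed:
            changed = False
            for component, deps in DEPENDS_ON.items():
                if component not in to_roll_back and to_roll_back & set(deps):
                    to_roll_back.add(component)
                    changed = True
        return to_roll_back

    # Rolling back the election protocol also pulls in everything built on top of it.
    print(rollback_set("election-protocol"))  # {'election-protocol', 'config-service', 'api'}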

~~~
cetico
Yes, this was very surprising. The system was working fine after the cluster
restart. There was no need for an emergency rollback.

Doing a large rollback based on a hunch seems like an overreaction.

It's totally normal for engineers to make these errors. That's fine. The
detail that's missing in this PM is what kind of operational culture,
procedures, and automation are in place to reduce operator errors.

Did the engineer making this decision have access to other team members to
review their plan of action? I believe that a group (2-3) of experienced
engineers sharing information in real-time and coordinating the response could
have reacted better.

Of course, I wasn't there so I could be completely off.

~~~
debt
"That's fine."

idk the suits have a very different viewpoint; 30 minutes of downtime for a
large financial system isn't fine. it can be very costly.

~~~
uponcoffee
I think the GP means that, as far as incidents go, if care is (or was) taken
to prevent them and to learn from them, then that's all one can really
reasonably ask for. The first incident falls under that heading and 'is fine'
in a 'life happens' sense.

The following incident comes across as reckless and avoidable, as there should
have been procedures to safely test the rollback (and perhaps there were, but
a perfect storm allowed it to fail in prod). Lacking details about how the
second incident came to be, or how such incidents will be prevented going
forward, places the second incident as 'not fine'.

This information is what the GP comment is asking for.

Compare this PM with Cloudflare's PM, where they detail how they tested rules,
what safeguards were in place, how the incident came to be, and how they
intend to prevent similar incidents; the impression given here is that they
will put up more fire alarms and fire extinguishers but do little fire
prevention.

