
Game day exercises at Stripe - gdb
https://stripe.com/blog/game-day-exercises-at-stripe
======
jblow
It is good that they are testing, but the mindset that this is a special
occasion seems very weird to me.

Rule #1 of programming is that if you didn't test it, it doesn't work. (It may
still not work for real after you test it, but at least it's got something.)

You can't claim to anyone, or even yourself, that you have some kind of fault-
tolerant system if you don't do this kind of test after every change.

~~~
sokoloff
We only started doing this about 3 years ago, after I heard John Allspaw speak
at Velocity. It has dramatically cut down on the "Oh, that system that we
thought was redundant was only redundant in the sense of an appendix, not in
the sense of kidneys..." moments. It was initially a hard sell to my ops team.
(I promised them air cover for any losses we incurred as a result of game-day
testing, of course.)

I agree with you that it should be common, but I'd suspect an honest survey of
the field would show that it's far from standard practice. I'd guess fewer
than 25% of companies do this in any meaningful way.

------
jfroma
Reminds me of Netflix's Chaos Monkey [1].

They shut down servers and components of the system to make sure it is
resilient to failures.

> The Simian Army is a suite of tools for keeping your cloud operating in top
> form. Chaos Monkey, the first member, is a resiliency tool that helps ensure
> that your applications can tolerate random instance failures

More interesting is that it is open source. [2]

[1] [http://techblog.netflix.com/2012/07/chaos-monkey-released-in...](http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html?m=1)

[2]
[https://github.com/Netflix/SimianArmy](https://github.com/Netflix/SimianArmy)

------
gatehouse
I've heard this (or similar tests) also referred to as "fire drills".

The first one I like to do is restoring the entire infrastructure from the
off-site backup. Once I know that works, I can sleep soundly.

------
meritt
So Stripe is the company that configured their master with no persistence,
killed it, and blamed Redis for losing their data? Annoying that they reported
the "issue" to Aphyr, decided to withhold their name, and proceeded to spread a
ton of FUD against Redis when the core issue was a misconfiguration on their
part. Hand-holding can only go so far.

~~~
gdb
The main point of the post is that if there's a configuration you expect to
work, you need to test it, independent of whether or not it's supported by the
author. I can think of a dozen other distributed system failures I've seen
that happened despite using stock configuration — this particular failure was
simply a recent example, and helps illustrate why game days are so valuable.

On the plus side, Antirez has said that this configuration will soon be
supported:
[https://news.ycombinator.com/item?id=8522630](https://news.ycombinator.com/item?id=8522630)
(thanks Antirez!), so future users will be able to run in this configuration
safely.

------
dkarapetyan
I like the part about three replicas of Redis destroying all the data when the
primary went down. If you're using Redis as more than a cache, you're doing it
wrong.

~~~
antirez
This is not what happened: the master, which had persistence turned off,
restarted with an empty data set, and the slaves replicated that empty data
set. Basically, Redis replication currently must be used with some form of
on-disk persistence turned on. However, after the introduction of diskless
replication ([http://antirez.com/news/81](http://antirez.com/news/81)) we now
have a good reason to properly support replication with persistence turned
off, so there is already feature work in progress to support the Stripe-like
use case:
[https://github.com/antirez/redis/issues/2087](https://github.com/antirez/redis/issues/2087)
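
For the time being, the safe setup is to keep some persistence enabled on the
master. A minimal redis.conf sketch (values are illustrative examples, not a
tuned recommendation):

```
# Keep at least one persistence mechanism on the master, so a fast
# restart reloads the old dataset instead of coming back empty.
save 900 1           # RDB: snapshot after 900s if at least 1 key changed
appendonly yes       # AOF: log every write for durable recovery
appendfsync everysec # fsync the AOF once per second
```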

~~~
abalone
If the promotion logic is wrong how would persistence have helped? Say the
primary disk fails. If it's still considered primary when it's brought back up
with a fresh disk, wouldn't you get the same empty-replication problem? (I
know nothing about redis, just wondering.)

~~~
seiji
In the case described, there is no promotion logic.

The replicas will try to reconnect to their original master forever unless
something else (like Sentinel) redirects them in an actual failover/promotion
setup.

So: the master had data, it died, it restarted with no data, and the replicas
immediately reconnected. If the master had had persistence enabled, it would
have reloaded the old dataset on startup and the replicas would have
re-downloaded everything. Since they are _replicas_ of the master, they will
always prefer the master's data over their own, even if the master is empty.

If you were in the strange case where the disk failed and you replaced it with
an empty disk (is that what you mean by "fresh disk"?), then it's the same as
starting with an empty dataset. That's not quite the same situation, though,
since there the server is intentionally started empty after a maintenance
action, rather than an already-populated process restarting empty because
there's no saved dataset to load on startup.

The "all replicas resync an empty dataset" is a logical consequence of the
configuration they enabled, but one without obvious repercussions without
either directly experiencing it or a longer multi-chain thought experiment.
(but, fixes for such things are already on the way—soon!)
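
The sequence above can be sketched as a toy simulation (plain Python, no real
Redis involved; `Master` and `Replica` are illustrative classes, not the Redis
API):

```python
# Toy model of the failure mode: replicas always mirror the master,
# and a master without persistence restarts empty.

class Master:
    def __init__(self, persistence=False):
        self.persistence = persistence
        self.data = {}   # in-memory dataset
        self.disk = {}   # stand-in for an RDB/AOF file

    def write(self, key, value):
        self.data[key] = value
        if self.persistence:
            self.disk[key] = value  # persisted alongside memory

    def crash_and_restart(self):
        # On restart, load whatever was persisted -- or nothing at all.
        self.data = dict(self.disk) if self.persistence else {}

class Replica:
    def __init__(self, master):
        self.master = master
        self.data = {}

    def sync(self):
        # The replica's only contract: be an exact copy of the master,
        # even if the master is now empty.
        self.data = dict(self.master.data)

master = Master(persistence=False)
replicas = [Replica(master) for _ in range(3)]

master.write("balance", 100)
for r in replicas:
    r.sync()                  # replicas now hold the data

master.crash_and_restart()    # no persistence: restarts empty
for r in replicas:
    r.sync()                  # replicas faithfully resync... nothing

print(all(r.data == {} for r in replicas))  # True: every copy is wiped
```

Flipping `persistence=True` in this model makes the dataset survive the
restart, which is the behavior seiji describes for a master with persistence
enabled.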

~~~
antirez
Just to add some more info:

Funny enough, what triggers this problem when master persistence is turned off
is the _lack_ of failover: if you are using Sentinel and the reboot happens
fast enough, Sentinel never gets the chance to fail over to a replica. So no
failure was sensed at all; the master just magically wiped its data set.

So from a distributed-systems point of view, if you analyze the combined
system of replicated Redis nodes plus Sentinel, the problem is that the system
is not designed to cope with nodes losing state on restarts.

However it is possible to improve it, and I'm doing it. Before diskless
replication, though, it was IMHO pretty useless to support persistence-less
operation in conjunction with replication, since in order to synchronize, the
slaves required the master to save an RDB file on disk _anyway_.
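
For reference, diskless replication is controlled by a pair of redis.conf
switches (names as introduced with the feature; defaults may differ across
versions):

```
# Serialize the RDB directly to replica sockets during a full sync,
# skipping the intermediate on-disk RDB file.
repl-diskless-sync yes
repl-diskless-sync-delay 5   # wait a few seconds so more replicas can share one sync
```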

------
kenrose
PagerDuty implemented a similar type of test called "Failure Fridays":

[https://blog.pagerduty.com/2013/11/failure-friday-at-pagerdu...](https://blog.pagerduty.com/2013/11/failure-friday-at-pagerduty/)

The lessons from forcing actual failures in production parts of your stack are
incredible.

------
midhir
How are they getting such low write times with Postgres relative to Redis with
comparable datasets and throughput? It's been a while since I used Postgres,
are they implying that they're using a particular add-on or that they've
optimised their queries for it?

~~~
nemothekid
No, their tests don't imply that they're getting better write times with Redis
across the board; they're measuring the 99th-percentile latency.

What they found is that in the worst case, Redis is worse than Postgres.

So for example, their Redis response times might look like {1ms, 1ms, 1ms,
9ms}, while their Postgres response times might look like {3ms, 4ms, 3ms,
4ms}. Looking at it this way you would expect Redis to perform better, but
that outlier is worrying. Stripe values consistent, predictable performance
over performance that varies.
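
To make the average-vs-tail point concrete with made-up numbers (these samples
are hypothetical, not Stripe's data):

```python
import statistics

# Hypothetical latency samples in milliseconds.
redis_ms    = [1, 1, 1, 1, 1, 1, 1, 1, 1, 9]   # fast on average, one outlier
postgres_ms = [3, 4, 3, 4, 3, 4, 3, 4, 3, 4]   # slower but consistent

def p99(samples):
    # Nearest-rank 99th percentile: the latency 99% of requests beat.
    ordered = sorted(samples)
    rank = max(1, round(0.99 * len(ordered)))
    return ordered[rank - 1]

print(statistics.mean(redis_ms), statistics.mean(postgres_ms))  # 1.8 3.5
print(p99(redis_ms), p99(postgres_ms))                          # 9 4
```

Redis wins on the mean (1.8ms vs 3.5ms) but loses at the 99th percentile (9ms
vs 4ms), which is exactly the trade-off the post is measuring.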

~~~
midhir
That's really interesting, thanks!

------
arthursilva
And still, every other day you hear people using cache as primary data store.

~~~
seiji
That's kind of an obtuse version of a middlebrow dismissal. You're positioning
yourself as "of course _I_ knew better all along! Bow before my foresight!"

In other anecdotal evidence, I've been using Redis as a persistent store since
2011 and haven't lost any data.

There's always the giant caveat: just because software says it does X doesn't
mean it does X until you have seen it actually happen. In this case,
restarting a zero-persistence master was bad because it goes
run -> die -> restart[empty]. Then the replicas immediately reconnect,
resync[empty], and are now completely up to date with the master. That is the
only contract Redis was asked to keep here: always be an exact copy of your
master. Since persistence was not requested, the replicas resync an empty
dataset.

In better news: fixes for _each_ of these issues are showing up in Redis Real
Soon Now.

~~~
arthursilva
That wasn't my intention at all.

Anyway. Redis is considerably mature, but it still has edge cases that will
literally void all your data. And that's definitely not okay when it's being
used as your source of truth.

------
zallarak
This is such a good idea. Thanks for the write-up / sharing.

------
misiti3780
what software are you using to plot those graphs?

~~~
scott_karana
Looks like graphite:

[https://www.google.com/search?q=graphite+graphs&tbm=isch](https://www.google.com/search?q=graphite+graphs&tbm=isch)

~~~
ihsw
To be explicitly pedantic -- graphite-web plots the graph, whereas graphite
manages time series data processing and storage.

~~~
ojilles
To increase the pedantry level: these are not plotted by graphite-web. In this
case graphite-web only spits out the source data as CSV or JSON; the actual
plotting is done by Grafana (note the stereotypical Grafana legends on the
screenshots).

