
Twilio incident and Redis - flyingyeti
http://antirez.com/news/60
======
zackmorris
What caught my attention was where Twilio said the connections from the
redis-slaves to the redis-master were timing out:

[http://www.twilio.com/blog/2013/07/billing-incident-post-mor...](http://www.twilio.com/blog/2013/07/billing-incident-post-mortem.html)

I think timeouts should be abolished for the vast majority of software today.

The usual reasoning goes something like this: for a TCP connection, if you
don't hear from the server for some period of time, you can assume that
something is "wrong" and drop the connection. The fallacy is that the TCP
connection is not really important to the shared state of the two devices. From
the very beginning (I'm talking 1970s!), devices should have been using tokens
to identify one another, regardless of communication status. The tokens could
be saved in nonvolatile memory on servers so that jobs could always continue
where they left off.

Instead we have a whole slew of nondeterministic pathological cases -exactly-
like the one that hit Twilio. If you take on the burden of timeouts, you end
up with dozens of places in your code (even more, potentially) where you just
don't know what to do if you lose communication.

If you don't take on the burden of timeouts, then you can just track each
connection and all it costs you is storage space, which is practically free
today and getting cheaper every year. With credentials from the client, you
don't even have to worry about duplicate connections. You can write your
client-server code deterministically and stick to the logic, and easily stress
test failure modes.
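
A minimal sketch of the idea (Python; all names are hypothetical): the client
presents a durable token with every request, and the server persists per-token
progress, so a dropped TCP connection just means reconnect and resume.

    import shelve
    
    # per-token job state lives in nonvolatile storage, so a dropped
    # connection never loses progress (hypothetical on-disk store)
    state = shelve.open("jobs.db")
    
    def handle_request(token, payload):
        job = state.get(token, {"progress": 0})
        job["progress"] += len(payload)  # continue where we left off
        state[token] = job
        state.sync()
        return job["progress"]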

~~~
donpdonp
This resonated with me because I've been building services with zeromq lately.
The zeromq bind and connect calls isolate the caller from managing
disconnects. Zeromq will reestablish a dropped connection and the client code
is none the wiser. Now I have some extra reasoning as to why this is a good
idea. Thanks!
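
For instance, a minimal pyzmq sketch (the endpoint and socket type are
illustrative):

    import zmq
    
    ctx = zmq.Context()
    sock = ctx.socket(zmq.PUSH)
    sock.connect("tcp://broker.example:5555")  # hypothetical endpoint
    
    # If the peer restarts, ZeroMQ reconnects behind the scenes; this
    # call just queues the message locally in the meantime.
    sock.send_string("job-42")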

~~~
dpe82
It's a good idea until the local send buffer fills up and starts silently
dropping data.

There's no free lunch.

~~~
tlrobinson
In theory don't programs already need to handle that case for slow/congested
connections?

~~~
dpe82
Of course, but zeromq makes it feel like you don't. It provides a message-
passing system: you pass messages in on one end and they magically pop out
the other. If the network is slow, congested, or down, it'll cache messages
locally until it can reconnect and send them. The problem is you don't know
what's going on, and if the local cache fills up it'll silently drop messages.
In some use cases that's perfectly fine; it depends on your application.

Zeromq is great for some things - it's super flexible and FAST. It's just not,
on its own, durable. If you need durability you have to build it on top.
Doable, just not free.
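
Concretely (a pyzmq sketch; the endpoint is hypothetical): the high-water mark
bounds that local cache, and for PUB sockets anything past it is dropped with
no error to the sender.

    import zmq
    
    ctx = zmq.Context()
    pub = ctx.socket(zmq.PUB)
    pub.setsockopt(zmq.SNDHWM, 1000)  # local buffer: ~1000 messages
    pub.connect("tcp://collector.example:5556")  # hypothetical endpoint
    
    # While the subscriber is unreachable, up to SNDHWM messages queue
    # locally; beyond that, PUB sockets silently drop new messages --
    # send() still returns as if nothing were wrong.
    pub.send_string("metric 42")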

~~~
300bps
You're absolutely right. It just masks the timeout from the client for as long
as it can and then fails just as ungracefully.

------
aidos
Very clear and thoughtful post from antirez, as ever.

It's worth reading his post on how persistence works in Redis (and other dbs).
It's very interesting and gives great insight into what goes on down in dbs
to try to keep our data safe - particularly for those of us who don't ever
interact with that layer directly.

[http://oldblog.antirez.com/post/redis-persistence-demystifie...](http://oldblog.antirez.com/post/redis-persistence-demystified.html)

------
eblume
It's good to see Twilio post this! That being said - yeah, I really am
concerned that Twilio is using an ephemeral database to store such important
data. Why not simply use Postgres? Is Twilio really making so many
transactions per second that Postgres won't scale?

~~~
RobSpectre
Totally agree. Need to clear up a developing misconception - Redis does not
serve as the primary store for the account balance of Twilio's customers. The
billing system uses a double-entry bookkeeping model common to many high-volume
designs, with Redis as the in-flight data store (e.g. when a call or SMS
message is created) and the transaction also stored independently in an RDBMS
post-flight (e.g. when a call or SMS message is completed).

Clearly however our implementation failed dangerously and did not recover in a
manner that meets our customers' expectations. Totally get how such a
misconception would occur from a cursory read of the incident report - just
need to be clear.
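
Roughly, the shape is this (illustrative only, not our actual code; all names
are hypothetical):

    import psycopg2
    import redis
    
    r = redis.Redis()
    pg = psycopg2.connect("dbname=billing")  # hypothetical DSN
    
    def call_created(account, call_id, price):
        # in-flight: cheap, high-frequency balance decrement in Redis
        r.hincrbyfloat("balances", account, -price)
    
    def call_completed(account, call_id, price):
        # post-flight: the durable ledger entry lands in the RDBMS
        with pg, pg.cursor() as cur:
            cur.execute(
                "INSERT INTO ledger (account, call_id, amount) "
                "VALUES (%s, %s, %s)",
                (account, call_id, -price))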

~~~
PommeDeTerre
Is there actually a legitimate performance or scalability need to incorporate
a NoSQL database in this case?

Ever since NoSQL databases started receiving a lot of hype a few years back,
I've witnessed a number of teams use them without any real justification.
They'll build unnecessarily complex systems using one or more NoSQL database
systems, all while a relational database would be more than sufficient for
their needs.

In several of these cases, some of the developers have been quite adamant that
these NoSQL databases are essential. Then we rip them out, usually because
they've been causing problems like the ones described in this case. It quickly
becomes obvious that they were never needed in the first place, and won't be
needed even in the face of significantly increased load.

~~~
threeseed
No offence but attitudes like this are the worst. If we all took your advice
we would all still be using punch cards or writing everything in assembler.
Sometimes you don't need a clear justification to use newer technologies.
Perhaps developers just want to experience the significant productivity gains
that come with using many of the NoSQL databases.

Also might be worth dropping the whole "SQL is better" insinuation. We have
seen some pretty major data loss bugs in PostgreSQL and MySQL recently.

~~~
PommeDeTerre
Billing systems are not to be taken lightly, namely because money is
inherently involved.

When developing such systems, it is irresponsible to use new, unproven
technologies without justification.

When developing such systems, it is irresponsible to trade off the reliability
and safety of the system for some "developer productivity".

Such irresponsibility is just not acceptable. Failures due to such
irresponsibility should not be tolerated, either.

~~~
taf2
Money can be refunded... we're not talking about a life - or a human life -
we're just talking about software. Reading how Twilio was using Redis, it
sounds very reasonable (fsync). I've seen MySQL servers melt under extreme
load running on very high-end servers, so if you're saying that somehow having
SQL in this case would have saved them... well... maybe you're right... but I
wasn't there. Were you?

I believe the engineers at Twilio are building, and have built, a platform to
handle a scale that is going to continue to reach further than many
traditional models will scale. It makes sense to me that they would need to
look at in-memory systems to push the limits. They are fsync'ing immediately,
so while reads come from memory, writes hit disk - that sounds pretty safe to
me... Also, a lot of people use Redis now, so I definitely wouldn't say it's
unproven; that is, as others have said, like saying we should still be using
typewriters to put ink on paper because they're "proven"... I'd answer that
with... "good luck with that".

And at the end of the day, "the shit will hit the fan" - and it did...

The team did great and we're all thankful.

~~~
PommeDeTerre
Just because human lives aren't necessarily at stake doesn't mean that poor
(in my opinion) software system designs should be tolerated.

And when we talk about "relational databases", we basically never mean MySQL.
I'm not even sure why you'd bother to bring it up, given its poor reputation,
its lack of proper functionality in some cases, and the many other (and far
better) alternatives.

As for scalability, another commenter here posted a link to a tweet
describing the scale in this case:
[https://twitter.com/dN0t/status/360119871318659074](https://twitter.com/dN0t/status/360119871318659074)

It clearly states "tens of thousands" of transactions per second. That is well
within the capabilities of the lower-end database systems on modern, low-end
server hardware. Twilio will require a huge amount of growth just to reach the
point where higher end database systems and hardware become appropriate, never
mind "a scale that is going to continue to reach further than many traditional
models will scale".

Running into significant billing issues due to a likely unnecessary use (in my
opinion) of a NoSQL database while still at a relatively small scale is not
something that a team should be commended for, and nobody should be "thankful"
that the incident did happen.

~~~
defen
The equivalent failure with PostgreSQL would be to update kernel.shmmax with a
call to sysctl, and then forget to update /etc/sysctl.conf. Then if the
machine reboots, PostgreSQL will fail to start because it can't allocate enough
memory. It mostly wasn't a Redis problem as far as I can tell. That said,
Twilio is probably at a big enough scale that they can afford to hire a
couple of full-time admins to deal with PostgreSQL.
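
The runtime change is "sysctl -w kernel.shmmax=<bytes>"; the part that's easy
to forget is persisting it, e.g. (illustrative value):

    # /etc/sysctl.conf -- without this line the larger limit is lost on
    # reboot, and PostgreSQL may fail to allocate its shared memory segment
    kernel.shmmax = 8589934592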

------
mountaineer
Here's the Twilio post-mortem thread on HN:
[https://news.ycombinator.com/item?id=6093954](https://news.ycombinator.com/item?id=6093954)

~~~
VladRussian2
Dog pile - reminds me of the FB outage a couple of years ago, when their
in-memory cache machines got simultaneously flushed by a software update and,
as a result, piled up on the MySQL databases for the refresh. Twilio's
prohibition of master restarts seems like a solution to a consequence only.

~~~
encoderer
That's a truly common problem. I experienced the same thing when I was working
at Formspring. We relied heavily on Redis and SimpleDB for caching, and when a
large portion of the cache was lost the site was pretty instantly DoSed. Not
fun at all.
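
One common mitigation is a regeneration lock, so a cold cache doesn't turn
into a thundering herd of identical DB queries. A sketch assuming redis-py
(names are hypothetical):

    import time
    import redis
    
    r = redis.Redis()
    
    def cached(key, regenerate, ttl=300):
        val = r.get(key)
        if val is not None:
            return val
        # only the client that wins the lock hits the database;
        # everyone else waits briefly and re-reads the cache
        if r.set("lock:" + key, "1", nx=True, ex=10):
            val = regenerate()
            r.set(key, val, ex=ttl)
            r.delete("lock:" + key)
            return val
        time.sleep(0.1)
        return cached(key, regenerate, ttl)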

------
mbillie1
I'm curious if you're using anything other than redis-cli to set the
master/slave relationships, and if you have any failover mechanism. I've used
corosync/pacemaker for a high-availability redis cluster, but without an awful
lot of confidence (we likely misconfigured it, to be fair).

Just "slaveof <masterip>" and other redis-cli commands? Or are you using any
automated process?

Or has anyone else got a great redis failover/HA solution that they'd care to
share?

(I apologize for this having nothing to do with Twilio; I'm just curious)

~~~
hijinks
The best thing out there is Redis Sentinel. It ships with 2.6; the issue I ran
into 6 months ago was that not a lot of drivers supported it yet.

[http://redis.io/topics/sentinel](http://redis.io/topics/sentinel)
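
A minimal sentinel.conf looks something like this (addresses and thresholds
are illustrative):

    # monitor a master named "mymaster"; 2 sentinels must agree that it
    # is down before a failover is started
    sentinel monitor mymaster 10.0.0.1 6379 2
    sentinel down-after-milliseconds mymaster 5000
    sentinel failover-timeout mymaster 60000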

------
mountaineer
Twilio definitely uses ec2, it's been an oft-highlighted choice in many
presentations and posts over the years.

\- [http://www.slideshare.net/twilio/twilio-voice-applications-w...](http://www.slideshare.net/twilio/twilio-voice-applications-with-amazon-aws-s3-and-ec2-presentation)

\- [http://www.twilio.com/engineering/2011/04/22/why-twilio-wasn...](http://www.twilio.com/engineering/2011/04/22/why-twilio-wasnt-affected-by-todays-aws-issues/)

------
Vitaly
Just like I commented on the original incident report post, I think systems
like Redis are not suitable to work as a db for payment processing and
transaction storage. Reading through the report I can't imagine something like
this happening with a payment system built around Postgres. Not unless you are
doing something incredibly stupid. And stupid those guys are not.

They are obviously bright guys meaning well, and yet they've designed and
implemented a payment system with such a bad failure mode.

I do understand that they have a LOT of billing events, and have to update
customer billable amounts for each of them. But instead of holding the
customer balances in Redis and doing payment processing on top of that, my
paranoia would most probably lead me to store only the 'amount to charge' in
Redis, updating it as frequently as needed, and to store customer balances and
transactions in an RDBMS, changing them only during an actual charge event.
This way, if the Redis data were to be lost, I'd undercharge my customers
rather than double- or triple-charge them. The failure mode becomes less
disastrous.
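
A sketch of that split, assuming redis-py (names are hypothetical):

    import redis
    
    r = redis.Redis()
    
    # per billing event: accumulate only the amount to charge (cheap, frequent)
    r.incrbyfloat("pending:cust42", 0.0075)
    
    # at charge time: atomically read-and-reset, then run the real
    # transaction (balance update + charge record) in the RDBMS
    amount = float(r.getset("pending:cust42", 0) or 0)
    if amount > 0:
        charge_and_record("cust42", amount)  # hypothetical RDBMS-side function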

~~~
ra
If you're running with AOF then Redis is perfectly fine for storing e.g. call
logs, where the transaction is naturally atomic within a single command.

In my experience problems occur because AOF isn't the default persistence
setting (snapshotting is the default in the Ubuntu apt package, at least). So
if Redis gets upgraded in an apt-get upgrade, the sysadmin needs to take care
not to override the AOF configuration.
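
The relevant redis.conf lines, worth pinning explicitly so an upgrade can't
quietly revert them:

    appendonly yes          # use the append-only file, not just snapshots
    appendfsync everysec    # or "always" to fsync on every write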

This is not unlike Postgres upgrading from 9.1 to 9.2, for example. Not
catastrophic, but boy it'll make your heart pump!

I think the best solution at the moment is to not use apt to manage Redis
updates, so that you have full control over the configuration.

------
MichaelGG
I do not understand why, when updating a balance from a CC transaction, you
wouldn't be using transactions.

    
    
      Start Transaction
      Update Balances
      Call CC Processor
      Commit
    

That would eliminate "the billing system charged customer credit cards to
increase account balances without being able to update the balances
themselves" -- you don't call a non-transactional CC processor until you've
actually been able to process the update in your own system (which you can
easily roll back).

If you're worried about Commits failing (due to not using pessimistic locking,
for instance), then separate it into two transactions. That way when you go to
process the CC the next time, you have a record stating there's already a
transaction in-flight.
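
A sketch of the two-transaction version, assuming psycopg2 (the schema and
names are hypothetical):

    import psycopg2
    
    conn = psycopg2.connect("dbname=billing")  # hypothetical DSN
    
    def charge(customer_id, amount):
        # transaction 1: record the charge as in-flight *before* touching
        # the processor, so a crash leaves evidence behind
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO charges (customer_id, amount, state) "
                "VALUES (%s, %s, 'in_flight') RETURNING id",
                (customer_id, amount))
            charge_id = cur.fetchone()[0]
    
        ok = call_cc_processor(customer_id, amount)  # hypothetical, non-transactional
    
        # transaction 2: settle the in-flight record and update the balance
        with conn, conn.cursor() as cur:
            cur.execute("UPDATE charges SET state = %s WHERE id = %s",
                        ("settled" if ok else "failed", charge_id))
            if ok:
                cur.execute("UPDATE balances SET amount = amount + %s "
                            "WHERE customer_id = %s", (amount, customer_id))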

For financial records, I'd expect a bit more care. Sounds like they had proper
records, but only as a backup/logging.

(Even for telecom, in which I work. There are fully ACID databases that have
no problems handling millions of transactions/sec. In-flight balance
information is trivial to handle.)

~~~
superuser2
What databases are those?

~~~
aphyr
These ones, for starters:
[http://www.tpc.org/tpcc/results/tpcc_perf_results.asp?result...](http://www.tpc.org/tpcc/results/tpcc_perf_results.asp?resulttype=undefined&version=5%&currencyID=1)

On good server hardware, Postgres will happily push 100-200K small
transactions per second. Naturally the definition of "transaction" varies, and
you'll see vastly different performance depending on contention, locality,
indices, etc. I'd expect logging independent events in a single table to be a
good deal faster than multi-table transactions, especially those involving
contended rows.

On EC2, the story is (naturally) a good deal less heartening. I think you'd be
hard-pressed to make it to 10K on TPC-B, given reports like
[http://www.palominodb.com/blog/2013/05/08/benchmarking-postg...](http://www.palominodb.com/blog/2013/05/08/benchmarking-postgres-aws-4000-piops-ebs-instances)

Granted, single-node performance may not be the question to ask here, because
this problem is readily shardable by customer ID, and operations can also be
buffered in local memory to some extent.
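
For example, a hypothetical shard router by customer ID:

    def shard_for(customer_id, connections):
        # pin each customer to a fixed node so per-customer updates stay
        # on a single shard (hash-mod routing; illustrative only)
        return connections[hash(customer_id) % len(connections)]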

Remember: if it fits in Redis, the problem requires, at _maximum_, a single
node's memory and a single core--and can tolerate network latencies.

~~~
ironchef
"Remember: if it fits in Redis, the problem requires, at maximum, a single
node's memory and a single core--and can tolerate network latencies." Not
entirely true. A number of places out there use sharded redis. In which case,
you're not limited by the memory or core in the same manner. A nit to pick,
but hoping to keep the conversation on point (as you alluded to with the
"shardable" comment)

~~~
aphyr
If you're using a sharded Redis, your problem is a collection of problems
satisfying the above law. What I'm getting at is that if performance is
_seriously_ a problem, few things beat a dedicated service with concurrent,
in-process memory--instead of pushing every operation to an external, fully
serialized heap.

