

Billing Incident Post-Mortem: Breakdown, Analysis and Root Cause - RobSpectre
http://www.twilio.com/blog/2013/07/billing-incident-post-mortem.html

======
pilif
_> 2:42 AM PDT / 7:42 UTC: Our on-call engineers restarted the redis-master to
address the high load._

Over the years I've learned that restarting something that's under high load
is not the correct solution to the problem.

By restarting the thing that's under load, you only increase the load further
when it comes back up. Recovery will take longer and things could fail even
more badly.

Of course there's the slim possibility that the thing under load has actually
crashed or entered some infinite loop, but over the years I've learned that
it's far more likely to be my fault than theirs: the abnormal load is usually
caused by a problem on my end rather than by a bug in the software that
appears to be misbehaving.

I know that just restarting is tempting, but I've seen so many times that
spending the extra effort to analyze the root cause is usually worth it.

~~~
RobSpectre
It is indeed. Proper procedure would be to pivot to a slave. We have
reconfigured all our Redis hosts to prohibit a restart on masters to prevent
exactly that temptation.

~~~
aphyr
I should mention that promoting a secondary to a primary in an asynchronously
replicated system can (and almost certainly will, under load) result in the
loss of committed transactions.

Redis has two resynchronization modes. One just recovers a small delta from
the primary, if it hasn't diverged that far. If the primary has accepted too
many writes, the secondary has to initiate a full resync, which is much more
expensive. Since Twilio's postmortem says their nodes initiate full resyncs, I
suspect that had Twilio promoted a secondary, they would have lost a
significant number of writes.
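
As a rough illustration (a sketch with redis-py and made-up hostnames, not
Twilio's tooling; replication offsets require Redis 2.8+): you can compare
offsets to see how far behind a secondary is before deciding whether
promoting it would drop acknowledged writes:

    import redis

    master = redis.StrictRedis(host='redis-master.example.com', port=6379)
    replica = redis.StrictRedis(host='redis-slave.example.com', port=6379)

    # Offsets are byte positions in the replication stream; any gap is data
    # the replica has not yet received.
    master_offset = master.info('replication')['master_repl_offset']
    replica_offset = replica.info('replication')['slave_repl_offset']
    lag = master_offset - replica_offset

    if lag == 0:
        replica.slaveof()  # SLAVEOF NO ONE: promote the replica to master
    else:
        print("replica is %d bytes behind; promoting now drops those writes"
              % lag)

Under sustained write load that gap rarely closes, which is exactly the
lost-write window described above.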

Probably best, under these kinds of scenarios, to design the system such that
lost writes don't cause overbilling. ;-)

------
wpietri
Good for them for being open about this. Solid incident reports like this make
me trust a vendor so much more. Not only do I have a good idea how they will
handle a big public failure next time, but it tells me a lot about how they're
handling the private issues, and therefore how robust their system is likely
to be.

One minor suggestion: the root cause of an incident is never a technical
problem. The technology is made by people, so you should always come back to
the human systems that make a company go. Not so you can blame people. It's
just the opposite: you want to find better ways to support people in the
future.

~~~
RobSpectre
Agree completely.

Ultimately the responsibility for getting this right is on us. We feel a
thorough, dispassionate accounting of the human and technical errors when we
come up short is part of living up to that responsibility.

~~~
porker
Yes, well done for how you've handled the situation. Very professional, very
honest. I liked Twilio before, but I'm even more impressed now.

------
ajtaylor
I've been looking forward to reading this. It never fails to amaze me how
these sorts of incidents are caused by a cascade of small, unrelated problems,
any one of which on its own would likely not have caused the final failure.

The post-mortem has plenty of detail without being wishy-washy in any way.
Just the facts, ma'am, ending with a sincere apology and steps to prevent a
recurrence. Well done!

~~~
RobSpectre
Very true - in this business, it nearly always is a waterfall of small errors
that produces large consequences.

Part of why I love being a software person - the devil lies literally in the
details.

~~~
ricardobeat
"literally"

[http://blog.fittothefinish.com/wp-content/uploads/2013/03/de...](http://blog.fittothefinish.com/wp-content/uploads/2013/03/devil-in-the-details.jpg)

------
CptCodeMonkey
Taking a page from MySQL replication tricks, have you thought about trying a
master -> inner slaves -> fan-out slaves topology, with one inner slave per
datacenter/region?

I know at least with 1.2.6 a slave can be a slave of a slave, but I never
measured the latency of a write at the master, through the inner peers, out to
the fan-out slaves. Admittedly a more complicated topology, but it would
prevent stampedes against the master instance and also make it easier to spin
up larger numbers of slaves without wiping out the entire platform.
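
In case it helps to picture it, here's a minimal sketch of wiring up that
topology with redis-py (hostnames are invented; this assumes chained slaveof
is supported on the version in use):

    import redis

    MASTER = ('redis-master.example.com', 6379)

    # One "inner" slave per region replicates directly from the master...
    inner_host, inner_port = 'inner.us-east.example.com', 6379
    redis.StrictRedis(host=inner_host, port=inner_port).slaveof(*MASTER)

    # ...and the fan-out slaves in that region replicate from the inner
    # slave, so a mass reconnect/resync hits the inner slave, not the master.
    for host in ('slave1.us-east.example.com', 'slave2.us-east.example.com'):
        redis.StrictRedis(host=host, port=6379).slaveof(inner_host, inner_port)

The trade-off is an extra replication hop, so writes show up on the leaf
slaves a little later.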

~~~
RobSpectre
I'm not sure I know anyone who has tried such a topology with Redis. Would be
interested to see how it affects writes as well.

------
pkteison
It sounds like, by deciding to make a charge based on what amounts to a
temporary cache of the balance, there has always been a race condition in the
billing code? I would think it would always have been necessary to
successfully update the balance before proceeding with the charge in order to
avoid double charges.

I can see how this sort of situation is hard to predict and test for, but that
means it needs to be designed and implemented very carefully. $0->charge card
just isn't a reasonable approach at all.
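
For what it's worth, the ordering described above might look something like
this rough sketch (redis-py against a local Redis, with a stubbed payment
call; none of this is Twilio's actual code):

    import redis

    r = redis.StrictRedis()

    # Atomically debit the balance and return the new value; error out if the
    # balance record is missing instead of treating it as $0.
    debit = r.register_script("""
    if redis.call('EXISTS', KEYS[1]) == 0 then
      return redis.error_reply('no balance record')
    end
    return redis.call('INCRBYFLOAT', KEYS[1], '-' .. ARGV[1])
    """)

    def charge_card(account):
        print("charging card for", account)  # hypothetical gateway stub

    def bill_usage(account, amount):
        try:
            new_balance = float(
                debit(keys=['balance:%s' % account], args=[amount]))
        except redis.RedisError:
            # Balance missing or unwritable: refuse to guess, and never fall
            # through to "balance is $0, charge the card".
            raise
        if new_balance <= 0:
            charge_card(account)  # recharge only after a confirmed debit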

------
statusgraph
Interesting to slave redis across datacenters. I wonder how well that works in
practice.

~~~
Xorlev
I'd imagine pretty well...until partition behavior occurs.

~~~
bsg75
[http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-an...](http://aphyr.com/posts/281-call-me-maybe-carly-rae-jepsen-and-the-perils-of-network-partitions)

------
chill1
I don't use Twilio, but this was an interesting write-up. After reading it, I
can totally understand how important it is to have safeguards and
redundancies baked into user balance information. I like how they even
mentioned (although a bit vaguely) how they plan to implement those additional
protections [1].

[1] "We are now introducing robust fail-safes, so that if billing balances
don’t exist or cannot be written, the system will not suspend accounts or
charge credit cards. Finally, we will be updating the billing system to
validate against our double-bookkeeping databases in real-time."
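
A purely speculative sketch of what that kind of real-time cross-check could
look like (sqlite stands in for the persistent double-bookkeeping store; key
and table names are invented):

    import redis
    import sqlite3

    r = redis.StrictRedis()
    ledger = sqlite3.connect('ledger.db')

    def safe_to_bill(account):
        # Only allow suspension or charging when both books exist and agree.
        cached = r.get('balance:%s' % account)
        if cached is None:
            return False  # balance missing: fail safe, take no action
        row = ledger.execute(
            'SELECT SUM(amount) FROM transactions WHERE account = ?',
            (account,)).fetchone()
        if row is None or row[0] is None:
            return False  # nothing in the ledger to validate against
        return abs(float(cached) - row[0]) < 0.01  # the books must agree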

~~~
RobSpectre
Truthfully the billing system is one of the most fascinating parts of our
stack to me. Twilio's utility pricing model generates an incredible volume of
minute transactions, which is a fascinating engineering problem to solve at
scale.

That solution ultimately came up short last week, but I think the foresight
behind a number of the features (double bookkeeping to an unrelated,
persistent datastore in particular) aided in the recovery.

------
richardv
Maybe I missed this bit of information, but can someone explain...

They use a multi-datacenter master-slave Redis cluster? What's the
relationship between the master and the slaves?

Are the slaves just for failover? Or are they read-only replicas?

How have they configured the writes and reads from their main application?
I'm just curious how their application routes reads and writes in such a
multi-database setup. (Are they using DNS for a manual failover?)

~~~
brown9-2
This is speculation, but it seems like in the event of a true loss of the
redis-master or its datacenter, someone would have to reconfigure one of the
slaves to become a master and reconfigure the applications that use it (or
perhaps they have this type of service discovery built in).

------
ahk
This seems to explain only a single faulty recharge, due to the customer
using the service with a zero balance and triggering the charge attempt. Why
would there be multiple recharge attempts? Was the customer using the service
multiple times in that period, triggering each of the recharge attempts, or
was the code re-trying the transaction? If it was the code, why would it
restart the transaction from the top instead of just the part that failed -
the balance update?

~~~
RobSpectre
I could have explained this better. Many usage events on the Twilio API
generate a billing transaction (e.g. call, SMS message, phone number purchase,
etc). When the transaction attempted to apply against a balance of zero and
the customer had auto-recharge enabled, the billing system would trigger a
charge attempt. With balances set to zero and read-only, each subsequent usage
event would trigger a charge attempt, resulting in the erroneous charges.
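
To make that loop concrete: one common mitigation (an illustration only, not
something the post-mortem says was implemented) is to deduplicate recharge
attempts per account, e.g. with a short-lived lock key:

    import redis

    r = redis.StrictRedis()

    def maybe_recharge(account):
        # Invented guard: allow at most one auto-recharge attempt per account
        # every 10 minutes, so a balance stuck at zero cannot turn every call
        # or SMS into a fresh card charge.
        if r.set('recharge-attempt:%s' % account, 1, nx=True, ex=600):
            charge_card(account)
        # else: a recent attempt already exists; skip instead of re-charging

    def charge_card(account):
        print("recharge attempt for", account)  # hypothetical gateway stub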

~~~
ahk
So, the users most affected were some of your most active/high traffic ones?
Ouch.

~~~
patio11
While it's unlikely that Appointment Reminder is one of Twilio's largest
accounts, we do have a number of customers in the Eastern Standard time zone,
and due to customer usage patterns we have a predictable spike in outgoing
calls and SMS messages which happened to coincide with the early-morning PST
Twilio problem.

Under normal circumstances, I have Twilio set up to bill us $500 if the
balance ever dips below $500. That $500 is our rebilling increment. Each SMS
message and phone call we made caused us to get charged our rebilling
increment. We hit $3,500 before our credit card company started rejecting
charges. I think that if they hadn't, we would likely have saturated our
credit line.

I'm thrilled with Twilio's response to this issue, and most other transient
issues I've had being a Twilio customer. The system is mostly rock-solid
reliable. I actually went to bed during the middle of this event (midnight,
Japan time) because a) my systems were reporting that our messages were
successfully going out (so no customer-visible downtime) and b) I had total
confidence in Twilio to take care of things. And they did.

------
Xorlev
The double-bookkeeping Twilio is using -- what's the downside to it? Is it
only updated per-minute with Redis doing the second-to-second usage and then
rolled off into MySQL?

I'm curious how that's working, if you don't mind the question.

~~~
RobSpectre
I'm not sure there is a "downside" to be had - it is an important requirement
for every billing solution of any scale. The term is borrowed from accounting
([http://en.wikipedia.org/wiki/Double-entry_bookkeeping_system](http://en.wikipedia.org/wiki/Double-entry_bookkeeping_system))
and means that each transaction gets recorded in two separate ledgers. Our
implementation at Twilio divides the two across stages of the transaction
lifecycle, as we allude to in the OP: in-memory for in-flight (when a call or
SMS message is initiated) and on disk for post-flight (when a call or SMS
message is complete).
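
Reading between the lines, a toy version of that split might look like this
(sqlite standing in for the persistent post-flight store; key and table names
are invented, not Twilio's):

    import time
    import redis
    import sqlite3

    r = redis.StrictRedis()
    ledger = sqlite3.connect('ledger.db')
    ledger.execute('CREATE TABLE IF NOT EXISTS transactions '
                   '(account TEXT, amount REAL, kind TEXT, ts REAL)')

    def start_call(account, estimated_cost):
        # In-flight: hold the pending charge in memory when the call begins.
        r.hincrbyfloat('inflight:%s' % account, 'reserved', estimated_cost)

    def end_call(account, estimated_cost, actual_cost):
        # Post-flight: release the in-memory reservation and write the final
        # amount to the durable ledger.
        r.hincrbyfloat('inflight:%s' % account, 'reserved', -estimated_cost)
        ledger.execute('INSERT INTO transactions VALUES (?, ?, ?, ?)',
                       (account, actual_cost, 'call', time.time()))
        ledger.commit()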

