Billing Incident Post-Mortem: Breakdown, Analysis and Root Cause (twilio.com)
77 points by RobSpectre on July 24, 2013 | 31 comments

> 2:42 AM PDT / 7:42 UTC: Our on-call engineers restarted the redis-master to address the high load.

Over the years I've learned that restarting something that's under high load is not the correct solution to the problem.

By restarting the thing under load, you will always further increase the load when the thing you've restarted is coming back. Recovery will take longer and stuff could fail even more badly.

Of course there's the slim possibility that the thing under load has actually crashed or entered some infinite loop, but over the years I've learned that it's far more likely to be my fault than their fault; the abnormal load is far more likely caused by a problem at my end than a bug in the software that's misbehaving.

I know that just restarting is tempting, but I've seen so many times that spending the extra effort to analyze the root cause is usually worth it.

It is indeed. Proper procedure would be to pivot to a slave. We have reconfigured all our Redis hosts to prohibit a restart on masters to prevent exactly that temptation.

I should mention that promoting a secondary to a primary in an asynchronously replicated system can (and almost certainly will, under load) result in the loss of committed transactions.

Redis has two resynchronization modes. One just recovers a small delta from the primary, if it hasn't diverged that far. If the primary has accepted too many writes, the secondary has to initiate a full resync, which is much more expensive. Since Twilio's postmortem says their nodes initiate full resyncs, I suspect that had Twilio promoted a secondary, they would have lost a significant number of writes.
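
To make the two modes concrete, here is a minimal sketch of the decision a primary faces when a secondary reconnects. It assumes a fixed-size replication backlog addressed by byte offsets; the function name and parameters are illustrative, not Redis internals.

```python
def choose_resync(primary_offset: int, replica_offset: int, backlog_size: int) -> str:
    """Return "partial" if the replica's gap still fits in the backlog, else "full".

    All names here are hypothetical; this only models the idea of a bounded
    replication backlog, not Redis's actual implementation.
    """
    if replica_offset > primary_offset:
        # The replica claims data the primary never had: histories diverged.
        return "full"
    gap = primary_offset - replica_offset
    # The backlog only retains the most recent `backlog_size` bytes of writes.
    return "partial" if gap <= backlog_size else "full"


# Under light load the replica is only slightly behind.
print(choose_resync(primary_offset=10_000, replica_offset=9_500, backlog_size=1_024))
# Under heavy load the primary has moved far past what the backlog retains.
print(choose_resync(primary_offset=10_000_000, replica_offset=9_500, backlog_size=1_024))
```

The second case is the one that matters here: the larger the gap, the more writes a promoted secondary would be missing.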

Probably best, under these kinds of scenarios, to design the system such that lost writes don't cause overbilling. ;-)

Yeah. I totally agree with your conclusions in your article. I was making a general remark because this doesn't just apply to you guys but to everyone else too :-)

Btw: When going to a slave, be careful because the state the slave is in might be out of date compared to the overloaded master at the point of moving over.

Yup. Came here to say the same thing as soon as I saw that. Never restart a service that is under high load. Block the network if you must to try to reduce the load, but don't restart.

Part of the problem of being under high load is that sometimes data does not get written out to disk like it should (because of the load on the OS), so restarting can easily corrupt things - especially if you restart the computer rather than the service.

The other takeaway is reboot all your servers monthly! (Or after any changes.) That way you can be sure they'll come back up properly. Reboot when you have time to work on it, so that you don't have an emergency at the worst possible time.

Good for them for being open about this. Solid incident reports like this make me trust a vendor so much more. Not only do I have a good idea how they will handle a big public failure next time, but it tells me a lot about how they're handling the private issues, and therefore how robust their system is likely to be.

One minor suggestion: the root cause of an incident is never a technical problem. The technology is made by people, so you should always come back to the human systems that make a company go. Not so you can blame people. It's just the opposite: you want to find better ways to support people in the future.

Agree completely.

Ultimately the responsibility for getting this right is on us. We feel a thorough, dispassionate accounting of the human and technical errors when we come up short is part of living up to that responsibility.

Yes, well done for how you've handled the situation. Very professional, very honest. I liked Twilio before, but I'm even more impressed now.

I've been looking forward to reading this. It never fails to amaze me how these sorts of incidents are caused by a cascade of small, unrelated problems, any one of which on its own would likely not have caused the end problem.

The post-mortem has plenty of detail without being wishy-washy in any way. Just the facts, ma'am, ending with a sincere apology and steps to prevent a recurrence. Well done!

Very true - in this business, it nearly always is a waterfall of small errors that produces large consequences.

Part of why I love being a software person - the devil lies literally in the details.

The only thing a bit wishy washy is the "only affected 1.4%" part. If the billing system was reporting $0 balances for everyone, then only 1.4% of Twilio's accounts have auto-recharge cards on file and made at least one call/text in a 1.5 hour window. A large portion of the other 98.6% could be abandoned/inactive/toy accounts or they'd have been affected too. The percentage of active accounts affected was probably much higher, for some reasonable definition of active.

That said, I'm being pedantic to mention it. I think the situation was very well handled.

Taking a page from something I learned from MySQL replication tricks, have you thought about trying a master -> inner slaves -> fan-out slaves topology, with one inner slave per datacenter/region?

I know at least with 1.2.6 a slave can be a slave of a slave, but I never measured the latency of a write at the master, through the inner peers, out to the fan-out slaves. Admittedly it's a more complicated topology, but it would circumvent the stampedes against a master instance and also make it easier to spin up larger numbers of slaves without wiping out the entire platform.

I'm not sure I know anyone who has tried such a topology with Redis. Would be interested to see how it affects writes as well.

It sounds like, by deciding to make a charge based on what amounts to a temporary cache of the balance, there has always been a race condition in the billing code? I would think it would have always been necessary to successfully update the balance before proceeding with the charge in order to avoid double charges.

I can see how this sort of situation is hard to predict and test for, but that means it needs to be designed and implemented very carefully. $0->charge card just isn't a reasonable approach at all.
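
One way to picture the ordering being argued for: record the intent to recharge first, and only charge the card if that write succeeded. This is a sketch under assumptions; `BalanceStore`, `ReadOnlyError`, and `charge_card` are hypothetical stand-ins, not Twilio's code.

```python
class ReadOnlyError(Exception):
    """Raised by the hypothetical store when it cannot accept writes."""


class BalanceStore:
    """In-memory stand-in for the balance datastore (hypothetical)."""

    def __init__(self, read_only: bool = False):
        self.read_only = read_only
        self.balances = {}
        self.pending = set()

    def mark_recharge_pending(self, account: str) -> bool:
        if self.read_only:
            raise ReadOnlyError(account)
        if account in self.pending:
            return False  # another recharge is already in flight
        self.pending.add(account)
        return True

    def credit(self, account: str, cents: int) -> None:
        if self.read_only:
            raise ReadOnlyError(account)
        self.balances[account] = self.balances.get(account, 0) + cents
        self.pending.discard(account)


def recharge(store, account, cents, charge_card) -> bool:
    """Charge the card only after the store has accepted the intent."""
    try:
        if not store.mark_recharge_pending(account):
            return False
    except ReadOnlyError:
        return False  # cannot record the intent, so do not touch the card
    charge_card(account, cents)
    store.credit(account, cents)
    return True


charges = []
ok = recharge(BalanceStore(read_only=True), "acct", 50_000,
              lambda account, cents: charges.append((account, cents)))
print(ok, charges)  # False []
```

This narrows the window rather than closing it: a store that goes read-only between the two writes would still need reconciliation afterwards.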

Interesting to slave redis across datacenters. I wonder how well that works in practice.

I'd imagine pretty well...until partition behavior occurs.

I don't use Twilio, but this was an interesting write-up. After reading it, I can totally understand how important it is to have safe-guards and redundancies baked into user balance information. I like how they even mentioned (although a bit vaguely) how they plan to implement those additional protections [1].

[1] "We are now introducing robust fail-safes, so that if billing balances don’t exist or cannot be written, the system will not suspend accounts or charge credit cards. Finally, we will be updating the billing system to validate against our double-bookkeeping databases in real-time."
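
A rough sketch of what such a fail-safe could look like, assuming the cached balance can come back as missing and that a second ledger is available to cross-check. The function, its parameters, and the tolerance are hypothetical; the post does not describe the actual implementation.

```python
from typing import Optional


def should_recharge(cached_cents: Optional[int],
                    ledger_cents: Optional[int],
                    threshold_cents: int,
                    tolerance_cents: int = 0) -> bool:
    """Decide whether auto-recharge may fire (hypothetical policy).

    Fails safe: any missing or disagreeing balance means "do nothing".
    """
    if cached_cents is None or ledger_cents is None:
        return False  # a missing balance is not the same as a zero balance
    if abs(cached_cents - ledger_cents) > tolerance_cents:
        return False  # the two ledgers disagree; leave it for reconciliation
    return cached_cents < threshold_cents


# Cache wrongly reports zero while the persistent ledger shows $600.
print(should_recharge(cached_cents=0, ledger_cents=60_000, threshold_cents=50_000))
```

A real system would need a non-zero tolerance, since in-flight usage makes the two numbers drift apart in normal operation.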

Truthfully the billing system is one of the most fascinating parts of our stack to me. Twilio's utility pricing model generates an incredible volume of minute transactions, which is a fascinating engineering problem to solve at scale.

That solution ultimately came up short last week, but I think the foresight behind a number of the features (double bookkeeping to an unrelated, persistent datastore in particular) aided in recovery.

This seems to only explain a single faulty recharge, due to the customer using the service with a zero balance and triggering the charge attempt. Why would there be multiple recharge attempts? Was the customer using the service in that period multiple times triggering each of the recharge attempts or was it the code re-trying the transaction? If it was the code why would it restart the transaction from the top instead of just the part that failed - the balance update?

I could have explained this better. Many usage events on the Twilio API generate a billing transaction (e.g. call, SMS message, phone number purchase, etc). When the transaction attempted to apply against a balance of zero and the customer had auto-recharge enabled, the billing system would trigger a charge attempt. With balances set to zero and read-only, each subsequent usage event would trigger a charge attempt, resulting in the erroneous charges.
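
The loop described above is easy to reproduce in a toy model. The sketch below assumes a balance that always reads as zero and silently drops writes; the class, function, and numbers are made up for illustration.

```python
class StuckBalance:
    """Toy balance that always reads zero and drops writes (hypothetical)."""

    def read(self) -> int:
        return 0

    def add(self, cents: int) -> None:
        pass  # the write never lands, mimicking a read-only store


def simulate(usage_events: int, recharge_cents: int) -> int:
    """Return the total charged to the card across the usage events."""
    balance = StuckBalance()
    charged = 0
    for _ in range(usage_events):
        if balance.read() <= 0:          # transaction applied against zero
            charged += recharge_cents    # auto-recharge hits the card
            balance.add(recharge_cents)  # ...but the balance never updates
    return charged


# Seven usage events with a $500 increment, in cents.
print(simulate(usage_events=7, recharge_cents=50_000))  # 350000
```

Every usage event costs one full recharge increment, so the total grows linearly with traffic until the card is declined.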

So, the users most affected were some of your most active/high traffic ones? Ouch.

While it's unlikely that Appointment Reminder is one of Twilio's largest accounts, we do have a number of customers in the Eastern Standard time zone, and due to customer usage patterns we have a predictable spike in outgoing calls and SMS messages which happened to coincide with the early-morning PST Twilio problem.

Under normal circumstances, I have Twilio set up to bill us $500 if the balance ever dips below $500. That $500 is our rebilling increment. Each SMS message and phone call we made caused us to get charged our rebilling increment. We hit $3,500 before our credit card company started rejecting charges. I think that if they hadn't, we would likely have saturated our credit line.

I'm thrilled with Twilio's response to this issue, and most other transient issues I've had being a Twilio customer. The system is mostly rock-solid reliable. I actually went to bed during the middle of this event (midnight, Japan time) because a) my systems were reporting that our messages were successfully going out (so no customer-visible downtime) and b) I had total confidence in Twilio to take care of things. And they did.

An incident like this is pretty painful for every customer affected. I hope they feel that this explanation, the prompt refund of the erroneous charges, and the credit show that it is a pain we very much share.

> ".. Twilio usage that resulted in a billing transaction (e.g. 1 cent for a SMS message or a phone call) triggered the billing system to attempt a recharge using the credit card associated with the customer’s account. This only affected accounts with auto-recharge enabled."

> "Consequently, the billing system charged customer credit cards to increase account balances without being able to update the balances themselves. This root cause produced the billing incident of customer credit cards being charged repeatedly."

Sounds like when an action required funds from a user's balance (which their system thought was 0), it attempted to recharge their balance by charging their credit card. And since the system also could not write to the database (to increase the balance) the balance remained at 0. Thus, the system kept thinking it had to recharge the balance again.

I would assume from the article (I have no experience with Twilio) that there's some "purchase X more SMSes when I've fewer than Y SMSes left" feature. So whenever the customer uses their API to send an SMS (or make a call), the software would detect that there are fewer than a configured amount of messages left and would bill the customer to top off the account.

The actual billing would go through ok and the master database would likely get updated, but the frontend that's doing the sending won't see the updated balance, causing it to purchase more credit.

Maybe I missed this bit of information, but can someone explain....

They use a multi datacenter master - slave redis cluster? What's the relationship between the master and slaves?

Are the slaves just failover? Or are they for read only?

How have they configured their writes and reads from their main application? I'm just curious how their application routes the reads/writes in such a multi-database setup. (Are they using DNS for a manual failover?)

This is speculation, but it seems like in the event of a true loss of the redis-master or its datacenter, someone would have to reconfigure one of the slaves to become a master and reconfigure the applications that use it (or perhaps they have this type of service discovery built-in).

The double-bookkeeping Twilio is using -- what's the downside to it? Is it only updated per-minute with Redis doing the second-to-second usage and then rolled off into MySQL?

I'm curious how that's working, if you don't mind the question.

I'm not sure there is a "downside" to be had - it is an important requirement to every billing solution of any scale at all. The term is borrowed from accounting (http://en.wikipedia.org/wiki/Double-entry_bookkeeping_system) and states that each transaction gets recorded in a different ledger. Our implementation at Twilio divides the two in stages of the transaction lifecycle that we allude to in the OP - in-memory for in-flight (when a call or SMS message is initiated) and disk for post-flight (when a call or SMS message is complete).
