Over the years I've learned that restarting something that's under high load is not the correct solution to the problem.
By restarting the thing under load, you only increase the load further when whatever you restarted comes back up. Recovery takes longer and things can fail even more badly.
Of course there's the slim possibility that the thing under load has actually crashed or entered an infinite loop, but over the years I've learned that it's far more likely to be my fault than theirs: the abnormal load is usually caused by a problem at my end rather than by a bug in the software that appears to be misbehaving.
I know that just restarting is tempting, but I've seen so many times that spending the extra effort to analyze the root cause is usually worth it.
Redis has two resynchronization modes. One recovers just a small delta from the primary, provided the secondary hasn't fallen too far behind. If the primary has accepted too many writes in the meantime, the secondary has to initiate a full resync, which is much more expensive. Since Twilio's postmortem says their nodes initiated full resyncs, I suspect that had Twilio promoted a secondary, they would have lost a significant number of writes.
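For anyone who wants to poke at this, here's a minimal sketch (Python with redis-py, hypothetical hostnames, and assuming a Redis new enough to have partial resync, i.e. 2.8+) of estimating whether a replica's gap still fits inside the master's replication backlog:

```python
import redis

# Hypothetical hosts; requires Redis >= 2.8 (PSYNC / replication backlog).
master = redis.Redis(host="redis-master.example.com", port=6379)
replica = redis.Redis(host="redis-replica.example.com", port=6379)

# How much history the master keeps around for partial resyncs.
backlog_bytes = int(master.config_get("repl-backlog-size")["repl-backlog-size"])

# How far the replica's offset trails the master's.
master_offset = master.info("replication")["master_repl_offset"]
replica_offset = replica.info("replication")["slave_repl_offset"]
lag_bytes = master_offset - replica_offset

if lag_bytes < backlog_bytes:
    print(f"{lag_bytes} bytes behind: a partial resync should cover it")
else:
    print(f"{lag_bytes} bytes behind: expect a full resync (RDB dump + transfer)")
```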
Probably best, under these kinds of scenarios, to design the system such that lost writes don't cause overbilling. ;-)
Btw: when failing over to a slave, be careful: the slave's state might be out of date compared to the overloaded master at the point you move over.
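If you do cut over, it's worth checking the slave's own view of replication first. A rough sketch, again with redis-py and a made-up hostname:

```python
import redis

# These fields come from INFO replication on the slave itself.
slave = redis.Redis(host="redis-slave.example.com", port=6379)
repl = slave.info("replication")

print("link status:", repl.get("master_link_status"))               # 'up' or 'down'
print("secs since last master I/O:", repl.get("master_last_io_seconds_ago"))
print("replication offset:", repl.get("slave_repl_offset"))
# If the link is down or the last I/O is old, the slave is missing recent
# writes, and promoting it means knowingly accepting that data loss.
```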
Part of the problem with being under high load is that sometimes data doesn't get written out to disk the way it should (because of the load on the OS), so restarting can easily corrupt things, especially if you restart the machine rather than just the service.
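On Linux you can at least see how much unflushed data the OS is sitting on before you pull the plug. A small sketch (assumes a Linux /proc filesystem):

```python
def pending_writeback_kb(path="/proc/meminfo"):
    """Return the Dirty and Writeback counters (kB) -- data the kernel has
    accepted but not yet flushed to disk. Big numbers under heavy load mean
    a hard reboot would simply throw that data away."""
    pending = {}
    with open(path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            if key in ("Dirty", "Writeback"):
                pending[key] = int(rest.split()[0])
    return pending

print(pending_writeback_kb())   # e.g. {'Dirty': 18732, 'Writeback': 0}
```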
The other takeaway is to reboot all your servers monthly (or after any changes), so you can be sure they'll come back up properly. Reboot when you have time to work on it, so that you don't have an emergency at the worst possible time.
One minor suggestion: the root cause of an incident is never a technical problem. The technology is made by people, so you should always come back to the human systems that make a company go. Not so you can blame people. It's just the opposite: you want to find better ways to support people in the future.
Ultimately the responsibility for getting this right is on us. We feel a thorough, dispassionate accounting of the human and technical errors when we come up short is part of living up to that responsibility.
The post-mortem has plenty of detail without being wishy-washy in any way. Just the facts, ma'am, ending with a sincere apology and steps to prevent a recurrence. Well done!
Part of why I love being a software person - the devil lies literally in the details.
That said, I'm being pedantic to mention it. I think the situation was very well handled.
I know that at least with 1.2.6 a slave can be a slave of a slave, but I never measured the latency of a write at the master, through the inner peers, out to the fan-out slaves. Admittedly a more complicated topology, but it would circumvent the stampedes against a master instance and also make it easier to spin up larger numbers of slaves without wiping out the entire platform.
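Something like this sketch is what I mean by a replication tree instead of a flat fan-out (hostnames are made up, and SLAVEOF is the old command name):

```python
import redis

MASTER = "redis-master.example.com"
INNER = ["redis-inner-1.example.com", "redis-inner-2.example.com"]
LEAVES = ["redis-leaf-1.example.com", "redis-leaf-2.example.com",
          "redis-leaf-3.example.com", "redis-leaf-4.example.com"]

def chain(node_host, upstream_host, port=6379):
    # Point node_host's replication at upstream_host (SLAVEOF upstream port).
    redis.Redis(host=node_host, port=port).slaveof(upstream_host, port)

# The inner slaves replicate straight from the master...
for host in INNER:
    chain(host, MASTER)

# ...and the fan-out slaves replicate from the inner slaves, so a mass
# reconnect hammers the inner tier instead of the master.
for i, host in enumerate(LEAVES):
    chain(host, INNER[i % len(INNER)])
```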
I can see how this sort of situation is hard to predict and test for, but that means it needs to be designed and implemented very carefully. $0->charge card just isn't a reasonable approach at all.
 "We are now introducing robust fail-safes, so that if billing balances don’t exist or cannot be written, the system will not suspend accounts or charge credit cards. Finally, we will be updating the billing system to validate against our double-bookkeeping databases in real-time."
That solution ultimately came up short last week, but I think the foresight behind a number of the features (double bookkeeping to an unrelated, persistent datastore in particular) aided in recovery.
Under normal circumstances, I have Twilio set up to bill us $500 if the balance ever dips below $500. That $500 is our rebilling increment. Each SMS message and phone call we made caused us to get charged our rebilling increment. We hit $3,500 before our credit card company started rejecting charges. I think that if they hadn't, we would likely have saturated our credit line.
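(Back-of-the-envelope, assuming every runaway charge was exactly the $500 increment: $3,500 / $500 = 7 charges before the card issuer started declining.)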
I'm thrilled with Twilio's response to this issue, and most other transient issues I've had being a Twilio customer. The system is mostly rock-solid reliable. I actually went to bed during the middle of this event (midnight, Japan time) because a) my systems were reporting that our messages were successfully going out (so no customer-visible downtime) and b) I had total confidence in Twilio to take care of things. And they did.
> "Consequently, the billing system charged customer credit cards to increase account balances without being able to update the balances themselves. This root cause produced the billing incident of customer credit cards being charged repeatedly."
Sounds like when an action required funds from a user's balance (which their system thought was 0), it attempted to recharge their balance by charging their credit card. And since the system also could not write to the database (to increase the balance) the balance remained at 0. Thus, the system kept thinking it had to recharge the balance again.
The actual billing would go through OK and the master database would likely get updated, but the frontend doing the sending wouldn't see the updated balance, causing it to purchase more credit.
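A stripped-down reconstruction of that loop (illustrative only, not Twilio's code), where the balance write is silently lost so every request charges the card again:

```python
class LossyStore:
    def read_balance(self, account): return 0        # degraded cache: always empty
    def write_balance(self, account, cents): pass    # write silently lost

class Gateway:
    def __init__(self): self.charges = []
    def charge(self, account, cents): self.charges.append(cents)   # this part works

store, gateway = LossyStore(), Gateway()
INCREMENT = 50_000   # a $500 rebilling increment, in cents

for _ in range(3):   # three API requests against the same account
    balance = store.read_balance("acct")
    if balance < INCREMENT:
        gateway.charge("acct", INCREMENT)                    # card is charged...
        store.write_balance("acct", balance + INCREMENT)     # ...but the write is lost

print(gateway.charges)   # [50000, 50000, 50000] -- one fresh charge per request
```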
They use a multi-datacenter master-slave Redis cluster? What's the relationship between the master and the slaves?
Are the slaves just for failover, or are they read-only replicas that serve reads?
How have they configured their writes and reads from their main application? I'm just curious how their application routes the reads/writes in such a multi-database setup. (Are they using DNS for a manual failover?)
I'm curious how that's working, if you don't mind the question.