I think timeouts should be abolished for the vast majority of software today.
The usual reasoning goes something like this: for a TCP connection, if you don't hear from the server for some period of time, you can assume that something is "wrong" and drop the connection. The fallacy is that the TCP connection itself is not what matters to the shared state of the two devices. From the very beginning (I'm talking 1970s!), devices should have been using tokens to identify one another, regardless of communication status. The tokens could be saved in nonvolatile memory on servers so that jobs could always continue where they left off.
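A minimal sketch of that token idea, assuming nothing about the wire protocol (all names here are made up): the server keys job state by a client-supplied token and persists it, so a client that vanishes for an hour can simply resume where it left off, no timeout anywhere.

```python
import json
import os
import tempfile

class TokenServer:
    """Hypothetical token-keyed job store; state survives restarts."""

    def __init__(self, path):
        self.path = path
        self.state = {}
        if os.path.exists(path):
            with open(path) as f:
                self.state = json.load(f)

    def _persist(self):
        # Write-then-rename so a crash never leaves a half-written file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def step(self, token):
        # Whenever the client reappears with its token, work continues
        # from the stored position -- no "connection" to time out.
        job = self.state.setdefault(token, {"progress": 0})
        job["progress"] += 1
        self._persist()
        return job["progress"]
```

A server restart (or a week of silence from the client) changes nothing: the next `step("client-A")` picks up at the stored progress.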
Instead we have a whole slew of nondeterministic pathological cases -exactly- like the one that hit Twilio. If you take on the burden of timeouts, you end up with dozens of places in your code (even more, potentially) where you just don't know what to do if you lose communication.
If you don't take on the burden of timeouts, then you can just track each connection and all it costs you is storage space, which is practically free today and getting cheaper every year. With credentials from the client, you don't even have to worry about duplicate connections. You can write your client-server code deterministically and stick to the logic, and easily stress test failure modes.
There's no free lunch.
Zeromq is great for some things - it's super flexible and FAST. It's just not, on its own, durable. If you need durability you have to build it on top. Doable, just not free.
You send a request to a server and no response comes back for a while. What do you do now? Wait forever?
The only insight I gained was that since billing isn't a realtime task, availability shouldn't matter in this case. IMHO that's why what happened was considered a "mistake".
Ideally, they should be able to structure their database so that a high number of transactions can be queued (up to the limit of virtual memory) and open connections shouldn't have to be maintained for each one. Then they could get rid of the notion of timeouts, as long as the master is up a high enough percentage of the time to handle the load. If it isn't, then even the clients would block when they fill up their virtual memory until the master catches up.
If they rewrite their code, they should start with the assumption that the master is normally down. They would have to solve the problem of two transactions conflicting on the client side. So for example, give each transaction a hash or digest that uniquely identifies it as a payment for a certain month. So if one client blocks for a day, and someone tries to start a payment on another client which also blocks, the master would resolve it when it comes back up by committing only one transaction for the hash and failing the other one.
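The dedup scheme described above can be sketched in a few lines (the names and structure are mine, not Twilio's): each payment gets a digest derived from its intent, and the master commits at most one transaction per digest, failing the other client's copy.

```python
import hashlib

def payment_id(customer, month):
    # Digest uniquely identifying "a payment for a certain month".
    return hashlib.sha256(f"{customer}:{month}".encode()).hexdigest()

class Master:
    """Toy master: commit-once-per-digest conflict resolution."""

    def __init__(self):
        self.committed = {}  # digest -> transaction

    def submit(self, txn):
        digest = txn["id"]
        if digest in self.committed:
            return "rejected-duplicate"  # the other client's copy won
        self.committed[digest] = txn
        return "committed"
```

If two blocked clients both queue a payment for the same customer and month, the master resolves it on recovery: one commits, one is rejected.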
If they get all that logic right, then they could fire off a payment and when the client server says it's queued, they could move on and not need a receipt for the commit. Although I just realized if the client server crashes, they have no guarantee. So they would have to start the transactions on 2 or more client servers, whichever number statistically ensures a high enough guarantee, or possibly all of the clients. That sounds like a lot of overhead at first until you realize that each transaction only takes a few k of memory to represent.
I think that possibly this points to a general solution. Give each transaction a unique ID based on its intent and context, and start it on enough client servers to provide a high enough statistical success rate. Make sure the clients store the pending transactions in nonvolatile memory so they can resume if the power goes out. I'm sure I've forgotten something but surely there is a way to solve this specific subset of problems so that companies could do a sanity check on themselves and avoid repeating history.
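The "high enough statistical success rate" part is just arithmetic: if each client independently loses a pending transaction with probability p, replicating it to k clients loses it with probability p^k. A quick sketch (the numbers below are purely illustrative):

```python
import math

def replicas_needed(p_loss, target):
    """Smallest k such that p_loss**k <= target, assuming
    independent failures (a big assumption in practice)."""
    return math.ceil(math.log(target) / math.log(p_loss))
```

For example, if a client loses a pending transaction 1% of the time and you want the overall loss probability below one in a billion, five replicas suffice: 0.01^5 = 10^-10. At a few KB per pending transaction, that replication overhead is small.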
It's worth reading his post on how persistence works in Redis (and other dbs). It's very interesting and gives great insights as to what goes on down in dbs to try to keep our data safe - particularly for those of us who don't ever interact with that layer directly.
Clearly however our implementation failed dangerously and did not recover in a manner that meets our customers' expectations. Totally get how such a misconception would occur from a cursory read of the incident report - just need to be clear.
Ever since NoSQL databases started receiving a lot of hype a few years back, I've witnessed a number of teams use them without any real justification. They'll build unnecessarily complex systems using one or more NoSQL database systems, all while a relational database would be more than sufficient for their needs.
In several of these cases, some of the developers have been quite adamant that these NoSQL databases are essential. Then we rip them out, usually because they've been causing problems like described in this case. It quickly becomes obvious that they were never needed in the first place, and won't be needed even in the face of significantly increased load.
Usually the most compelling reason not to use a traditional SQL database is that they're expensive to field in distributed environments and require specialist engineers to support when you get out into the weeds.
But I submit it's a category error to call Redis a "NoSQL database." It was also a category error to put data that needs to be redundant and durable into a storage system that is really specialized for high volume but low-reliability cache work.
I mean, seriously. Minimum viable PostgreSQL deploy in EC2 is pretty damn expensive, but you may not have enough cash on hand to buy your own hardware and pay to put it in a "real datacenter." If you can find something cheaper and you're not even seed funded yet because no one funds your social product until you have 100k users, you're probably more likely to actually make it to where you can afford a different database by just using Redis. It's called "technical debt" and we accrue it both bitterly and often willingly.
A lot of people use relational databases as a big key-value store, when really they should use a key-value store. It's not necessarily about scalability or performance. =)
And similarly, the attitude that a RDBMS is the right and proper data store, that all data should rightly prefer to live in an RDBMS aside from the loftiest heights of scalability.
Finally, the assertion that using a key-value or column-based datastore is going to lead to an "unnecessarily complex" design.
I'd not be surprised to find you disagree, but if we were working together on the same team, these are some things I'd try to work with you on.
Similarly, while it is wise not to lump all of the non-SQL/non-relational databases together, to say that the bulk of them share a common obsession with scalability is not much of a reach. After all, we're not talking about Neo4J, FramerD, MetaKit, etc. We're talking about young systems that were born out of frustration with scalability issues. It's true that they all have different tradeoffs, strengths and weaknesses, but one significant advantage of Postgres is that it does so well in such a variety of situations. Maybe it isn't perfect for every scenario, but it's pretty good for most scenarios. This is partly an inherent advantage of age, and partly the benefit of a sound theoretical foundation.

A big thing I dislike about the current crop of NoSQL databases is that you are expected to learn a great deal about their inner workings to make an informed decision about whether or not to use them. MongoDB really epitomizes this. But this is a pretty lazy complaint compared to random data loss and insane defaults (early days of Mongo), and I have to give you that.
Your position sounds quite reasonable compared to the one I've been primed by, but it's going to be a few more years before these databases mature to the point I'd consider one a reasonable first choice for a generic database. I don't draw the line quite as far as the GP—sometimes you have a special case and you need special tools—but I can certainly empathize with questioning the wisdom of making a distributed non-relational database the primary store for a low-load, classic RDBMS problem. But Twilio has come forward agreeing with that position and denying that they do that, so the spark for this flamewar is conjecture. While people have come forward to defend such a configuration, few seem to be endorsing it by using it themselves. This says something about the reasonableness of the idea.
Also might be worth dropping the whole "SQL is better" insinuation. We have seen some pretty major data loss bugs in PostgreSQL and MySQL recently.
When developing such systems, it is irresponsible to use new, unproven technologies without justification.
When developing such systems, it is irresponsible to trade off the reliability and safety of the system for some "developer productivity".
Such irresponsibility is just not acceptable. Failures due to such irresponsibility should not be tolerated, either.
I appreciate that you are very committed to reliability. I also have worked on (telecom) billing systems. However, what are you suggesting here? Should all Twilio's customers find another provider? (Do you have a suggestion?) Should they just not use a service like Twilio's? Should Twilio quadruple their prices so they can throw up enough hardware to do all this in postgres? Should we all fly over to Sicily to protest outside antirez's house?
Incidentally, in the late '90s I sat in on a few sales efforts in which Lucent pushed their "Datablitz" in-memory database for mediation and billing of call records. So, it's not as though nothing similar to this has ever been done.
I believe the engineers at Twilio are building and have built a platform to handle a scale that is going to continue to reach further than many traditional models will scale. It makes sense to me that they would need to look at in-memory systems to push the limits. They are fsync'ing immediately - so while reads come from memory, writes go to disk, and that sounds pretty safe to me... Also, a lot of people use Redis now, so I definitely wouldn't say it's unproven; that is, as others have said, like saying we should still be using typewriters to put ink on paper because they're "proven"... I'd answer that with "good luck with that".
And at the end of the day - "the shit will hit the fan" and it did...
the team did great and we're all thankful.
And when we talk about "relational databases", we basically never mean MySQL. I'm not even sure why you'd bother to bring it up, given its poor reputation, its lack of proper functionality in some cases, and the many other (and far better) alternatives.
As for scalability, some other commenter here posted a link to a twitter comment describing the scale in this case: https://twitter.com/dN0t/status/360119871318659074
It clearly states "tens of thousands" of transactions per second. That is well within the capabilities of the lower-end database systems on modern, low-end server hardware. Twilio will require a huge amount of growth just to reach the point where higher end database systems and hardware become appropriate, never mind "a scale that is going to continue to reach further than many traditional models will scale".
Running into significant billing issues due to a likely unnecessary use (in my opinion) of a NoSQL database while still at a relatively small scale is not something that a team should be commended for, and nobody should be "thankful" that the incident did happen.
Also, reading between the lines in the incident report made it sound as if there might have been multiple teams involved in the troubleshooting and not communicating perfectly. For example, were the redis admins informed that customers had been getting billed repeatedly when the decision was made to restart the billing system? Did they have access to the billing system logs which might have contained errors related to redis being read-only?
All in all, big props to Twilio for starting to get customer accounts credited back within 11 hours of the first trouble and even more for their wonderful open disclosure.
Redis has a reputation as an "ephemeral database", because it's usually used as one, but there's nothing inherently ephemeral about it. This isn't magic; it's just file I/O.
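To make that concrete, here's a toy append-only log in the spirit of Redis's AOF (not Redis's actual format): append each operation, fsync, and replay the file on startup. Losing power doesn't lose acknowledged writes.

```python
import os
import tempfile

class DurableCounter:
    """Toy durable counter backed by an append-only, fsync'd log."""

    def __init__(self, path):
        self.path = path
        self.value = 0
        if os.path.exists(path):
            # Replay the log to rebuild in-memory state.
            with open(path) as f:
                for line in f:
                    self.value += int(line)
        self.log = open(path, "a")

    def incr(self, n=1):
        self.log.write(f"{n}\n")
        self.log.flush()
        os.fsync(self.log.fileno())  # the write survives a crash from here on
        self.value += n
        return self.value
```

This is the whole trick: the "in-memory database" is just a cache of whatever the log says, and the log is ordinary file I/O.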
guessing by your username where you may be from, you may be aware that freshness... err... durability comes in only one grade - first grade :)
It's worth noting from the original Twilio post: "This cluster is configured with a single master and multiple slaves distributed across data-centers for resiliency in the event of a host or data-center failure". Ironically it may well be that not having the slaves at all could have prevented the outage.
I've seen other scenarios in the past where having a (mysql) master-slave arrangement has caused an outage. Root cause was bad configuration and a bug in the app code but had there been no slaves, there wouldn't have been an outage.
Just goes to show that it always pays to think carefully about the complexity you introduce into systems.
Postgres has had its failures under weird confluences of circumstances - and learnt from them a decade or more ago.
I would like to see Twilio stand by their stack choices, and invest time, effort and sponsorship in redis, so that in a decade or two people will be taking it on faith that redis is bulletproof (well, the combined postgredis datastore :-)
The new CONFIG REWRITE command makes it easier to avoid, but it would be nice if it were harder to get this wrong in the first place.
What it definitely isn't is fully ACIDic, though for some workloads you can get close if you use it very, very carefully.
Telco billing systems are (or at least used to be) one of the busiest uses of databases. When you're billing in increments of only a few cents per transaction (typical for outgoing SMS), the volume of transactions for any reasonable amount of money can be _huge_.
I don't know how soon Twilio are going to get within an order of magnitude or two of AT&T or Verizon, but I'd hazard a guess that their transactional billing database is _very_ busy by just about _anybody's_ standards.
Just "slaveof <masterip>" and other redis-cli commands? Or are you using any automated process?
Or has anyone else got a great redis failover/HA solution that they'd care to share?
(I apologize for this having nothing to do with Twilio; I'm just curious)
They are obviously bright guys meaning well, and yet they've designed and implemented a payment system with such a bad failure mode.
I do understand that they have a LOT of billing events, and have to update customer billable amounts for each of them. But instead of holding the customer balances in Redis and doing payment processing on top of that, my paranoia would most probably lead me to only store 'amount to charge' in Redis and update it as frequently as needed, and store customer balances and transactions in an RDBMS, changing them only during an actual charge event. This way, if Redis data were to be lost, I'd under-charge my customers rather than double- or triple-charge them. The failure mode becomes less disastrous.
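A toy model of that split (the class and names are mine): the volatile store only accumulates "amount to charge"; balances live in the durable store and only change at charge time. Losing the volatile store then means undercharging, never double-charging.

```python
class BillingSplit:
    """Sketch: pending accruals in a volatile store, balances durable."""

    def __init__(self):
        self.pending = {}    # stands in for Redis: accrued, uncharged amounts
        self.balances = {}   # stands in for the RDBMS: authoritative balances

    def accrue(self, customer, amount):
        # High-frequency path: just bump the pending amount.
        self.pending[customer] = self.pending.get(customer, 0) + amount

    def charge(self, customer):
        # Low-frequency path: the only place balances ever change.
        amount = self.pending.pop(customer, 0)
        self.balances[customer] = self.balances.get(customer, 0) + amount
        return amount

    def lose_volatile_store(self):
        # Worst case: forgotten accruals that never get billed.
        self.pending.clear()
```

The asymmetry is the point: a wiped cache costs the operator some revenue, but it can't generate a bogus charge against a customer.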
In my experience, problems have occurred because AOF isn't the default persistence setting (snapshotting is the default in the Ubuntu apt package, at least). So if Redis gets upgraded in an apt-get upgrade, the sysadmin needs to take care not to override the AOF configuration.
This is not unlike postgres upgrading from 9.1 to 9.2, for example. Not catastrophic, but boy it'll make your heart pump!
I think the best solution at the moment is to not use apt to manage Redis updates, so that you have full control over the configuration.
Call CC Processor
If you're worried about Commits failing (due to not using pessimistic locking, for instance), then separate it into two transactions. That way when you go to process the CC the next time, you have a record stating there's already a transaction in-flight.
For financial records, I'd expect a bit more care. Sounds like they had proper records, but only as a backup/logging.
(Even for telecom, in which I work. There are fully ACID databases that have no problems handling millions of transactions/sec. In-flight balance information is trivial to handle.)
Call CC Processor
Call CC Processor
OH SHIT APPLICATION PROCESS DIES (or network partition, etc etc)
DB connection is terminated
Updated balances roll back
But you still charged your customers' cards
And destroyed your primary record of it
Hope you didn't have any other work planned this week
And if the Update Balances call is a single SQL statement, you actually wouldn't use any transactions at all. You'd just issue the write and have it be written. No point adding the overhead of a transaction if it isn't doing anything.
On good server hardware, Postgres will happily push 100-200K small transactions per second. Naturally the definition of "transaction" varies, and you'll see vastly different performance depending on contention, locality, indices, etc. I'd expect logging independent events in a single table to be a good deal faster than multi-table transactions, especially those involving contended rows.
On EC2, the story is (naturally) a good deal less heartening. I think you'd be hard-pressed to make it to 10K on TPC-B, given reports like http://www.palominodb.com/blog/2013/05/08/benchmarking-postg...
Granted, single-node performance may not be the question to ask here, because this problem is readily shardable by customer ID, and operations can also be buffered in local memory to some extent.
Remember: if it fits in Redis, the problem requires, at maximum, a single node's memory and a single core--and can tolerate network latencies.
How many companies like Twilio are using an RDBMS for which this isn't the case? I mean, if people had enough money to buy an oracle cluster, they'd still probably buy something else?