Twilio incident and Redis (antirez.com)
188 points by flyingyeti on July 24, 2013 | 73 comments



What caught my attention was where Twilio said the redis-slaves were timing out to the redis-master:

http://www.twilio.com/blog/2013/07/billing-incident-post-mor...

I think timeouts should be abolished for the vast majority of software today.

The usual reasoning goes something like this: for a TCP connection, if you don't hear from the server for some period of time, you can assume that something is "wrong" and drop the connection. The fallacy is that the TCP connection itself is not really important to the shared state of the two devices. From the very beginning (I'm talking 1970s!), devices should have been using tokens to identify one another, regardless of communication status. The tokens could be saved in nonvolatile memory on servers so that jobs could always continue where they left off.

Instead we have a whole slew of nondeterministic pathological cases -exactly- like the one that hit Twilio. If you take on the burden of timeouts, you end up with dozens of places in your code (even more, potentially) where you just don't know what to do if you lose communication.

If you don't take on the burden of timeouts, then you can just track each connection and all it costs you is storage space, which is practically free today and getting cheaper every year. With credentials from the client, you don't even have to worry about duplicate connections. You can write your client-server code deterministically and stick to the logic, and easily stress test failure modes.
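
A minimal sketch of the token idea in Python (all names hypothetical): the server keys job progress by a client-supplied token held in nonvolatile storage, so a reconnecting client resumes where it left off instead of a timeout deciding its fate.

    # Hypothetical server-side handler; shelve stands in for real
    # nonvolatile storage on the server, keyed by the client's token.
    import shelve

    def handle_chunk(token, chunk_index, db_path="jobs.db"):
        with shelve.open(db_path) as jobs:
            state = jobs.get(token, {"next_chunk": 0})
            if chunk_index < state["next_chunk"]:
                # duplicate delivery after a reconnect: already applied, ignore
                return state["next_chunk"]
            state["next_chunk"] = chunk_index + 1
            jobs[token] = state  # persisted before acknowledging
            return state["next_chunk"]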


This resonated with me because I've been building services with zeromq lately. The zeromq bind and connect calls isolate the caller from managing disconnects. Zeromq will reestablish a dropped connection and the client code is none the wiser. Now I have some extra reasoning as to why this is a good idea. Thanks!


It's a good idea until the local send buffer fills up and starts silently dropping data.

There's no free lunch.


In theory don't programs already need to handle that case for slow/congested connections?


Of course, but zeromq makes it feel like you don't. It provides a message-passing system that you pass messages into on one end and they magically pop out the other. If the network is slow, congested, or down, it'll cache messages locally until it can reconnect and send them. The problem is you don't know what's going on, and if the local cache fills up it'll silently drop messages. In some use cases that's perfectly fine; it depends on your application.

Zeromq is great for some things - it's super flexible and FAST. It's just not, on its own, durable. If you need durability you have to build it on top. Doable, just not free.
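
A small pyzmq sketch of that failure mode (endpoint and counts made up): once the send high-water mark is reached while the subscriber is slow or gone, a PUB socket drops further messages with no error on the sending side.

    import zmq

    ctx = zmq.Context()
    pub = ctx.socket(zmq.PUB)
    pub.setsockopt(zmq.SNDHWM, 1000)  # buffer at most ~1000 outstanding messages
    pub.bind("tcp://*:5556")

    for i in range(1000000):
        # send_string() succeeds even while the peer is down; anything past
        # the high-water mark is silently discarded
        pub.send_string("event %d" % i)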


You're absolutely right. It just masks the timeout from the client for as long as it can and then fails just as ungracefully.


I don't really see how you are proposing we live without timeouts.

You send a request to a server and no response comes back for a while. What do you do now? Wait forever?


He's a big fan of the C in the CAP theorem, but not so much the A.


Man, I wrote two rather long responses and then resigned myself to the fact that you are right. This is a very difficult problem.

The only insight I gained was that since billing isn't a realtime task, availability shouldn't matter in this case. IMHO that's why what happened was considered a "mistake".

Some thoughts:

Ideally, they should be able to structure their database so that a high number of transactions can be queued (up to the limit of virtual memory) and open connections shouldn't have to be maintained for each one. Then they could get rid of the notion of timeouts, as long as the master is up a high enough percentage of the time to handle the load. If it isn't, then even the clients would block when they fill up their virtual memory until the master catches up.

If they rewrite their code, they should start with the assumption that the master is normally down. They would have to solve the problem of two transactions conflicting on the client side. So for example, give each transaction a hash or digest that uniquely identifies it as a payment for a certain month. So if one client blocks for a day, and someone tries to start a payment on another client which also blocks, the master would resolve it when it comes back up by committing only one transaction for the hash and failing the other one.

If they get all that logic right, then they could fire off a payment and when the client server says it's queued, they could move on and not need a receipt for the commit. Although I just realized if the client server crashes, they have no guarantee. So they would have to start the transactions on 2 or more client servers, whichever number statistically ensures a high enough guarantee, or possibly all of the clients. That sounds like a lot of overhead at first until you realize that each transaction only takes a few k of memory to represent.

I think that possibly this points to a general solution. Give each transaction a unique ID based on its intent and context, and start it on enough client servers to provide a high enough statistical success rate. Make sure the clients store the pending transactions in nonvolatile memory so they can resume if the power goes out. I'm sure I've forgotten something but surely there is a way to solve this specific subset of problems so that companies could do a sanity check on themselves and avoid repeating history.
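
A sketch of the unique-ID idea, assuming the ID is derived from the payment's intent and context (account plus billing month): two clients submitting the same payment derive the same ID, so the master commits one and fails the duplicate.

    import hashlib

    def transaction_id(account, month, amount_cents):
        # deterministic: same intent and context => same ID on every client
        raw = "%s|%s|%d" % (account, month, amount_cents)
        return hashlib.sha256(raw.encode()).hexdigest()

    committed = set()  # stand-in for the master's durable store

    def commit_once(account, month, amount_cents):
        txid = transaction_id(account, month, amount_cents)
        if txid in committed:
            return False  # another client already got this payment committed
        committed.add(txid)
        return True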


I've seen timeouts in similar situations where the IO was saturated (saving the RDB and writing to AOF at the same time).


Very clear and thoughtful post from antirez, as ever.

It's worth reading his post on how persistence works in Redis (and other dbs). It's very interesting and gives great insights as to what goes on down in dbs to try to keep our data safe - particularly for those of us who don't ever interact with that layer directly.

http://oldblog.antirez.com/post/redis-persistence-demystifie...


It's good to see Twilio post this! That being said - yeah, I really am concerned that Twilio is using an ephemeral database to store such important data. Why not simply use Postgres? Is Twilio really making so many transactions per second that Postgres won't scale?


Totally agree. Need to clear up a developing misconception - Redis does not serve as the primary store for the account balances of Twilio's customers. The billing system uses a double-entry bookkeeping model common to many high-volume designs, with Redis as the in-flight data store (e.g. when a call or SMS message is created) and the transaction also stored independently in an RDBMS post-flight (e.g. when a call or SMS message is completed).

Clearly however our implementation failed dangerously and did not recover in a manner that meets our customers' expectations. Totally get how such a misconception would occur from a cursory read of the incident report - just need to be clear.


Is there actually a legitimate performance or scalability need to incorporate a NoSQL database in this case?

Ever since NoSQL databases started receiving a lot of hype a few years back, I've witnessed a number of teams use them without any real justification. They'll build unnecessarily complex systems using one or more NoSQL database systems, all while a relational database would be more than sufficient for their needs.

In several of these cases, some of the developers have been quite adamant that these NoSQL databases are essential. Then we rip them out, usually because they've been causing problems like described in this case. It quickly becomes obvious that they were never needed in the first place, and won't be needed even in the face of significantly increased load.


> Is there actually a legitimate performance or scalability need to incorporate a NoSQL database in this case?

Usually the most compelling reason not to use a traditional SQL database is that they're expensive to field in distributed environments and require specialist engineers to support when you get out into the weeds.

But I submit it's a category error to call Redis a "NoSQL database." It was also a category error to put data that needs to be redundant and durable into a storage system that is really specialized for high volume but low-reliability cache work.

I mean, seriously. A minimum viable PostgreSQL deploy in EC2 is pretty damn expensive, but you may not have enough cash on hand to buy your own hardware and pay to put it in a "real datacenter." If you can find something cheaper, and you're not even seed funded yet because no one funds your social product until you have 100k users, you're probably more likely to actually make it to where you can afford a different database by just using Redis. It's called "technical debt" and we accrue it both bitterly and often willingly.


Some problems are better solved with key-value stores than with relational databases. It depends on the data model you need to map.

Lots of people use relational databases as a big key-value store, when really they should use a key-value store. It's not necessarily about scalability or performance. =)


Yes. And Redis provides an interesting model with very specific performance characteristics depending on how you choose to store your data in each case. If you haven't worked with Redis it might be hard to appreciate how it's a bit more than just a key-value store and how it's a different tool from a relational db.


I think you're making a number of mistakes in judgement here. Most importantly, you're conflating NoSQL with "not as reliable." As if there's no difference between Cassandra, HBase, Redis, etc. They're just "nosql" and "not needed" to you. That just seems highly ignorant of the relative merits of each database.

And similarly, the attitude that an RDBMS is the right and proper data store, and that all data should rightly prefer to live in an RDBMS aside from the loftiest heights of scalability.

Finally, the assertion that using a key-value or column-based datastore is going to lead to an "unnecessarily complex" design.

I'd not be surprised to find you disagree, but if we were working together on the same team, these are some things I'd try to work with you on.


I can't speak for the person you're replying to, but I would like to point out that RDBMSes are built on theory designed explicitly to handle general data problems. Situations for which the relational model is not a good fit are almost by definition special cases.

Similarly, while it is wise not to lump all of the non-SQL/non-relational databases together, to say that the bulk of them share a common obsession with scalability is not much of a reach. After all, we're not talking about Neo4J, FramerD, MetaKit, etc. We're talking about young systems that were born out of frustration with scalability issues. It's true that they all have different tradeoffs, strengths and weaknesses, but one significant advantage of Postgres is that it does so well in such a variety of situations. Maybe it isn't perfect for every scenario, but it's pretty good for most scenarios. This is partly an inherent advantage of age, and partly the benefit of a sound theoretical foundation. A big thing I dislike about the current crop of NoSQL databases is that you are expected to learn a great deal about their inner workings to make an informed decision about whether or not to use them. MongoDB really epitomizes this. But this is a pretty lazy complaint compared to random data loss and insane defaults (early days of Mongo), and I have to give you that.

Your position sounds quite reasonable compared to the one I've been primed by, but it's going to be a few more years before these databases mature to the point I'd consider one a reasonable first choice for a generic database. I don't draw the line quite as far as the GP—sometimes you have a special case and you need special tools—but I can certainly empathize with questioning the wisdom of making a distributed non-relational database the primary store for a low-load, classic RDBMS problem. But Twilio has come forward, agreed with that position, and denied that they do that, so the spark for this flamewar is conjecture. While people have come forward to defend such a configuration, few seem to be endorsing it by using it themselves. This says something about the reasonableness of the idea.


I use NoSQL stores for the same reason that I use a dynamically typed language (Python): I like the flexibility and agility. I don't use NoSQL stores for performance.


No offence, but attitudes like this are the worst. If we all took your advice, we would all still be using punch cards or writing everything in assembler. Sometimes you don't need a clear justification to use newer technologies. Perhaps developers just want to experience the significant developer productivity that comes with many of the NoSQL databases.

Also might be worth dropping the whole "SQL is better" insinuation. We have seen some pretty major data loss bugs in PostgreSQL and MySQL recently.


What major data loss bugs have you heard about "recently" in Postgres and MySQL? There was a significant security problem in Postgres that was recently fixed, but I haven't heard of any recent data loss bug. Even MySQL has been doing well on this point lately.


In this case the data was corrupted only by human mistake; it's not Redis's fault (even the author of the article wrote as much).


I'm not accusing Redis of losing data, I'm correcting a factual error about the reliability of MySQL and Postgres.


Billing systems are not to be taken lightly, namely because money is inherently involved.

When developing such systems, it is irresponsible to use new, unproven technologies without justification.

When developing such systems, it is irresponsible to trade off the reliability and safety of the system for some "developer productivity".

Such irresponsibility is just not acceptable. Failures due to such irresponsibility should not be tolerated, either.


> Failures due to such irresponsibility should not be tolerated, either.

I appreciate that you are very committed to reliability. I also have worked on (telecom) billing systems. However, what are you suggesting here? Should all Twilio's customers find another provider? (Do you have a suggestion?) Should they just not use a service like Twilio's? Should Twilio quadruple their prices so they can throw up enough hardware to do all this in postgres? Should we all fly over to Sicily to protest outside antirez's house?

Incidentally, in the late '90s I sat in on a few sales efforts in which Lucent pushed their "Datablitz" in-memory database for mediation and billing of call records. So, it's not as though nothing similar to this has ever been done.


Money can be refunded... we're not talking about a life here, we're just talking about software. Reading how Twilio was using Redis, it sounds very reasonable (fsync). I've seen MySQL servers melt under extreme load running on very high-end servers, so if you're saying that somehow having had SQL in this case would have saved them... well... maybe you're right... but I wasn't there. Were you?

I believe the engineers at Twilio are building, and have built, a platform to handle a scale that is going to continue to reach further than many traditional models will scale. It makes sense to me that they would need to look to in-memory systems to push the limits. They are fsync'ing immediately - so while it's in memory for reads, it's on disk for writes, which sounds pretty safe to me. Also, a lot of people use Redis now, so I definitely wouldn't say it's unproven; that is, as others have said, like saying we should still be using typewriters to put ink on paper because they're "proven"... I'd answer that with "good luck with that".

And at the end of the day - "the shit will hit the fan" and it did...

the team did great and we're all thankful.


Just because human lives aren't necessarily at stake it doesn't mean that poor (in my opinion) software system designs should be tolerated.

And when we talk about "relational databases", we basically never mean MySQL. I'm not even sure why you'd bother to bring it up, given its poor reputation, its lack of proper functionality in some cases, and the many other (and far better) alternatives.

As for scalability, some other commenter here posted a link to a twitter comment describing the scale in this case: https://twitter.com/dN0t/status/360119871318659074

It clearly states "tens of thousands" of transactions per second. That is well within the capabilities of the lower-end database systems on modern, low-end server hardware. Twilio will require a huge amount of growth just to reach the point where higher end database systems and hardware become appropriate, never mind "a scale that is going to continue to reach further than many traditional models will scale".

Running into significant billing issues due to a likely unnecessary use (in my opinion) of a NoSQL database while still at a relatively small scale is not something that a team should be commended for, and nobody should be "thankful" that the incident did happen.


The equivalent failure with PostgreSQL would be to update kernel.shmmax with a call to sysctl, and then forget to update /etc/sysctl.conf. Then if the machine reboots PostgreSQL will fail to start because it can't allocate enough memory. It mostly wasn't a redis problem as far as I can tell. That said, Twilio is probably at a big enough scale where they can afford to hire a couple full-time admins to deal with PostgreSQL.
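
Concretely, the failure mode is doing the first of these two steps and forgetting the second (value illustrative):

    sysctl -w kernel.shmmax=17179869184                      # takes effect now, lost on reboot
    echo "kernel.shmmax = 17179869184" >> /etc/sysctl.conf   # the step that gets forgotten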


Understood. But maybe billing isn't the best place to experiment?


How are NoSQL databases more productive than SQL databases?


Regardless, something about the Redis query returning a zero balance was triggering bad behavior in the billing system.

Also, reading between the lines in the incident report made it sound as if there might have been multiple teams involved in the troubleshooting and not communicating perfectly. For example, were the redis admins informed that customers had been getting billed repeatedly when the decision was made to restart the billing system? Did they have access to the billing system logs which might have contained errors related to redis being read-only?

All in all, big props to Twilio for starting to get customer accounts credited back within 11 hours of the first trouble and even more for their wonderful open disclosure.


But it does appear (if I understand correctly) that sometimes you used Redis as if it was the system of record rather than checking the actual one, hence the rebilling?


So what happens when there is a failed transaction mid-flight? It's already in redis but not in RDBMS. How do you rollback then?


Redis is not an ephemeral database when used in Twilio's configuration. When configured with an AOF and fsync set to 'always', Redis is as durable as, if not more durable than, Postgres and the like.


To expand on that a bit: when redis is in AOF mode, all write commands are appended to a journal file. When fsync is set to 'always', this journal file is fsynced on every write.

Redis has a reputation as an "ephemeral database", because it's usually used as one, but there's nothing inherently ephemeral about it. This isn't magic; it's just file I/O.
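
For reference, this is the pair of redis.conf directives being described:

    appendonly yes        # log every write command to the append-only file
    appendfsync always    # fsync the AOF after every write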


I understand how the AOF can be considered similar to a Postgres WAL file, but what makes you say it's even more durable?


Forced fsync on every write.


Like a WAL file.


(In its default configuration; there are some unsafe speedups you can do.)


The ability to additionally take RDB snapshots.


>Redis will be as durable if not more durable than postgres and the like

Guessing from your username where you may be from, you may be aware that freshness... err... durability comes in only one grade: first grade :)


Not Twilio; not ephemeral; Postgres also has durability issues in many configurations; your durability is still only as good as your kernel and disk controller; and TPS has very little to do with the decision to use Redis or Postgres.


This wasn't posted by Twilio, but the creator of Redis.


As others have said, Redis is resilient in this configuration and by no means ephemeral.

It's worth noting from the original Twilio post: "This cluster is configured with a single master and multiple slaves distributed across data-centers for resiliency in the event of a host or data-center failure". Ironically, it may well be that not having the slaves at all could have prevented the outage.

I've seen other scenarios in the past where having a (mysql) master-slave arrangement has caused an outage. Root cause was bad configuration and a bug in the app code but had there been no slaves, there wouldn't have been an outage.

Just goes to show that it always pays to think carefully about the complexity you introduce into systems.


Well, it's not "ephemeral". I don't consider it as durable as postgres either, but I am hard-pressed to find a reason why I think that, which is usually an indication of a faith-based, rather than fact-based, opinion.


Reasons: longevity

Postgres has had its failures under weird confluences of circumstances - and learnt from them a decade or more ago.

I would like to see Twilio stand by their stack choices and invest time, effort, and sponsorship in Redis, so that in a decade or two people will be taking it on faith that Redis is bulletproof (well, the combined "postgredis" datastore :-)


The append-only file just... well... appends every command to a file. That's pretty solid, especially on journaled filesystems, so I can't find any durability-related reason why it shouldn't be used.


The only durability "problems" I've ever been able to come up with for Redis involve human error as a key component, like what Twilio experienced here with the AOF/RDB mixup.

The new CONFIG REWRITE command makes it easier to avoid, but it would be nice if it were harder to get this wrong in the first place.

What it definitely isn't is fully ACIDic, though for some workloads you can get close if you use it very, very carefully.
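
For reference, this is the pattern the new CONFIG REWRITE command enables - persisting a runtime change back into redis.conf so a later restart doesn't silently revert it:

    redis-cli CONFIG SET appendfsync always   # changes the running instance only
    redis-cli CONFIG REWRITE                  # writes the running config back to redis.conf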


"Is Twilio really making so many transactions per second that Postgres won't scale?"

Telco billing systems are (or at least used to be) one of the busiest uses of databases. When you're billing in increments of only a few cents per transaction (typical for outgoing SMS), the volume of transactions for any reasonable amount of money can be _huge_.

I don't know how soon Twilio are going to get within an order of magnitude or two of AT&T or Verizon, but I'd hazard a guess that their transactional billing database is _very_ busy by just about _anybody's_ standards.


Twilio's write volume is on the order of tens of thousands of writes per second. https://twitter.com/dN0t/status/360119871318659074?p=p


But this issue seemed to be with processing payments vs. metering usage, quite a difference in volume.


I haven't re-read the article, but if I recall correctly the Redis in-memory data was being used to keep a working copy of the current account balance (not the canonical accounting/bookkeeping version), and it was the results of calculations based on the Redis-stored data that were triggering the payment processing. So I think it _was_ a problem with a "metering usage"-scale system, which cascaded into the payment processing system working as intended, but with erroneous data.


It has now been noted that Redis is not the primary store of this information.


Here's the Twilio post-mortem thread on HN: https://news.ycombinator.com/item?id=6093954


Dog pile - this reminds me of the FB outage a couple of years ago, when their in-memory cache machines got simultaneously flushed by a software update and, as a result, piled up on the MySQL databases for the refresh. Twilio's prohibition of master restarts seems like a solution to a consequence only.


That's a truly common problem. I experienced the same thing when I was working at Formspring. We relied heavily on Redis and SimpleDB for caching, and when a large portion of the cache was lost the site was pretty much instantly DoS'd. Not fun at all.


I'm curious if you're using anything other than redis-cli to set the master/slave relationships, and if you have any failover mechanism. I've used corosync/pacemaker for a high-availability redis cluster, but without an awful lot of confidence (we likely misconfigured it, to be fair).

Just "slaveof <masterip>" and other redis-cli commands? Or are you using any automated process?

Or has anyone else got a great redis failover/HA solution that they'd care to share?

(I apologize for this having nothing to do with Twilio; I'm just curious)


The best thing out there is Redis Sentinel. It ships with 2.6; the issue I ran into six months ago was that not a lot of drivers supported it yet.

http://redis.io/topics/sentinel
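
For the curious, a minimal sentinel.conf looks something like this (names and values illustrative):

    sentinel monitor mymaster 127.0.0.1 6379 2       # 2 sentinels must agree the master is down
    sentinel down-after-milliseconds mymaster 5000   # consider it down after 5s of silence
    sentinel failover-timeout mymaster 60000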


Twilio definitely uses EC2; it's been an oft-highlighted choice in many presentations and posts over the years.

- http://www.slideshare.net/twilio/twilio-voice-applications-w...

- http://www.twilio.com/engineering/2011/04/22/why-twilio-wasn...


Just like I commented on the original incident report post, I think systems like Redis are not suitable to work as a db for payment processing and transaction storage. Reading through the report I can't imagine something like this happening with a payment system built around Postgres. Not unless you are doing something incredibly stupid. And stupid those guys are not.

They are obviously bright guys meaning well, and yet they've designed and implemented a payment system with such a bad failure mode.

I do understand that they have a LOT of billing events and have to update customer billable amounts for each of them. But instead of holding the customer balances in Redis and doing payment processing on top of that, my paranoia would most probably lead me to store only an 'amount to charge' in Redis, updating it as frequently as needed, and to store customer balances and transactions in an RDBMS, changing them only during an actual charge event. This way, if the Redis data were lost, I'd under-charge my customers rather than over-, double-, or triple-charge them. The failure mode becomes far less disastrous.
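
A sketch of that split using redis-py (rdbms_commit is a hypothetical callback into the relational store):

    import redis

    r = redis.Redis()

    def record_usage(account, cents):
        # high-volume path: only the pending 'amount to charge' lives in Redis
        r.incrby("pending:%s" % account, cents)

    def charge(account, rdbms_commit):
        pending = int(r.get("pending:%s" % account) or 0)
        if pending:
            rdbms_commit(account, pending)            # balances live in the RDBMS
            r.decrby("pending:%s" % account, pending)
        # if the Redis data is lost, pending resets to zero: customers get
        # under-charged, never double-charged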


If you're running with AOF, then Redis is perfectly fine for storing e.g. call logs, where the transaction is naturally atomic within a single command.

In my experience, problems occur because AOF isn't the default persistence setting (snapshotting is the default in the Ubuntu apt package, at least). So if Redis gets upgraded in an apt-get upgrade, the sysadmin needs to take care not to override the AOF configuration.

This is not unlike postgres upgrading from 9.1 to 9.2, for example. Not catastrophic, but boy it'll make your heart pump!

I think the best solution at the moment is to not use apt to manage Redis updates, so that you have full control over the configuration.
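
One way to do that on Debian/Ubuntu (assuming the stock redis-server package) is to pin the package so a blanket upgrade can't touch it:

    apt-mark hold redis-server   # excluded from apt-get upgrade until unheld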


I do not understand why, when updating a balance from a CC transaction, you wouldn't be using transactions.

    Start Transaction
    Update Balances
    Call CC Processor
    Commit
That would eliminate "the billing system charged customer credit cards to increase account balances without being able to update the balances themselves" -- you don't go call a non-transactional CC processor until you've actually been able to process the update in your own system (which you can easily rollback).

If you're worried about Commits failing (due to not using pessimistic locking, for instance), then separate it into two transactions. That way when you go to process the CC the next time, you have a record stating there's already a transaction in-flight.

For financial records, I'd expect a bit more care. Sounds like they had proper records, but only as a backup/logging.

(Even for telecom, in which I work. There are fully ACID databases that have no problems handling millions of transactions/sec. In-flight balance information is trivial to handle.)
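
A sketch of the two-transaction variant (the db helpers here are hypothetical): the first transaction durably records the charge as in-flight before the external call, and the second finalizes it.

    def charge_card(db, cc, account, amount):
        with db.transaction():                    # transaction 1: record intent
            db.insert("charges", account=account, amount=amount,
                      state="in-flight")
        ref = cc.charge(account, amount)          # external, non-transactional
        with db.transaction():                    # transaction 2: finalize
            db.update("charges", account=account, state="settled", ref=ref)
            db.increment("balances", account=account, amount=amount)
        # a crash after the CC call leaves the in-flight row behind, so the
        # next attempt sees a charge already in progress instead of re-billing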


While transactions are generally important for, well, monetary transactions, they're not necessary in your example. You could just:

    Update Balances
    Call CC Processor
And if updating balances threw an exception, you'd never reach the Call CC Processor statement. In fact, putting it inside of a transaction is more dangerous than not as the following example demonstrates:

    Start Transaction
    Update Balances
    Call CC Processor
    OH SHIT APPLICATION PROCESS DIES (or network partition, etc etc)
    DB connection is terminated
    Updated balances roll back
    But you still charged your customers' cards
    And destroyed your primary record of it
    Hope you didn't have any other work planned this week
And unfortunately the above tends to happen more often than you'd expect.


Hence, I mentioned that if you're worried about commits failing (not a major recurring issue that I've seen, but possible), then you do two separate transactions.


Nope. You'd do at most one transaction, around the transactional part. You wouldn't wrap an external call in a transaction: if it's not transactional, you're literally doing nothing but adding load to your database. If your DB doesn't have access to the data, the transaction is meaningless: your database can't roll back the state.

And if the Update Balances call is a single SQL statement, you actually wouldn't use any transactions at all. You'd just issue the write and have it be written. No point adding the overhead of a transaction if it isn't doing anything.


What databases are those?


These ones, for starters: http://www.tpc.org/tpcc/results/tpcc_perf_results.asp?result...

On good server hardware, Postgres will happily push 100-200K small transactions per second. Naturally the definition of "transaction" varies, and you'll see vastly different performance depending on contention, locality, indices, etc. I'd expect logging independent events in a single table to be a good deal faster than multi-table transactions, especially those involving contended rows.

On EC2, the story is (naturally) a good deal less heartening. I think you'd be hard-pressed to make it to 10K on TPC-B, given reports like http://www.palominodb.com/blog/2013/05/08/benchmarking-postg...

Granted, single-node performance may not be the question to ask here, because this problem is readily shardable by customer ID, and operations can also be buffered in local memory to some extent.

Remember: if it fits in Redis, the problem requires, at maximum, a single node's memory and a single core--and can tolerate network latencies.


"Remember: if it fits in Redis, the problem requires, at maximum, a single node's memory and a single core--and can tolerate network latencies." Not entirely true. A number of places out there use sharded redis. In which case, you're not limited by the memory or core in the same manner. A nit to pick, but hoping to keep the conversation on point (as you alluded to with the "shardable" comment)


If you're using a sharded Redis, your problem is a collection of problems satisfying the above law. What I'm getting at is that if performance is seriously a problem, few things beat a dedicated service with concurrent, in-process memory--instead of pushing every operation to an external, fully serialized heap.


> if it fits in Redis, the problem requires, at maximum, a single node's memory and a single core

How many companies like Twilio are using an RDBMS for which this isn't the case? I mean, if people had enough money to buy an oracle cluster, they'd still probably buy something else?


As an example, VoltDB. The memory footprint required for balances and such is not very much, and VoltDB gives you near-linear scaleout to dozens of nodes. A few years ago I tried it for a CDR/balance system (insert record, update balance - single tx) and had no problem reaching 100K+ transactions/sec on a 3-way system (3x $500 servers).


Given appropriate hardware, it's possible to get astounding performance, safety and reliability out of relational database systems like NonStop SQL, DB2, and Oracle.



