Hacker News new | past | comments | ask | show | jobs | submit login
Scaling with Queues (wingify.com)
68 points by paraschopra on Sept 10, 2013 | hide | past | favorite | 22 comments

I'm curious about the justification for Redis in this pipeline. That seems to introduce additional failure modes. Was it the case that a non-durable non-confirming (in-memory only) RabbitMQ wouldn't satisfy the latency requirements? I'd love to see some benchmarks comparing Redis to an in-memory RabbitMQ.

Pulling data out of Redis and into RabbitMQ seems fraught with problems, whereas you could have the durable RabbitMQ use shovel to pull data out of the faster in-memory RabbitMQ with a few lines of config.

The post claims that RabbitMQ is too slow, but the description of the problem is very confusing to me:

"Later, I ... ran a loader.io load test. It failed this model due to very low throughtput, we performed a load test on a small 1G DigitalOcean instance for both models. For us, the STOMP protocol and slow RabbitMQ STOMP adapter were performance bottlenecks. RabbitMQ was not as fast as Redis, so we decided to keep it and work on the second model."

So they're using Redis as a queue for RabbitMQ :-) Just use Kafka, I say:

"Kafka handles over 10 billion message writes per day with a sustained load over the highest traffic hour that averages 172,000 messages per second. ... the footprint for the Kafka pipeline is only 8 servers per data center" http://sites.computer.org/debull/A12june/pipeline.pdf

That oughta be enough.

RabbitMQ can easily do 25,000 messages per second on a single machine using the Java client. (One producer/ One Consumer)

> RabbitMQ wouldn't satisfy the latency requirements?

From the comments at the bottom of the article:

> I think it would depend on one's use case and requirements, but Redis is generally faster than RabbitMQ since it avoids the heavier AMQP protocol overhead and internal (data structure, etc.) implementation (C vs Erlang) and we are using unix sockets to communicate with it (locally) but then it's not as reliable as RabbitMQ.

In our case it is more important to process a HTTP request as fast as possible, while the processing, propagation and storage of request data for analytics could be slow but has to be reliable therefore we chose the Lua->Redis->agentredrabbit->RabbitMQ model.


Personally I wouldn't think it would be an issue, elsewhere he mentions: "we aim for less than 40ms" I can't imagine enqueuing a (seemingly) small message in rabbitMQ takes more than 1 or 2 ms.

Hi, I'm the backend engineer at Wingify who wrote this post.

whisk3rs: For our use case, message had to be reliable, we required publisher confirms and you're sort of right, RabbitMQ in our use case did not satisfy our latency requirements. I've tried shovel and it does not solve our problem like we wanted. I don't have component based benchmarks between Redis and RabbitMQ but I've already shared the loader.io results of our two pipelines. Some details below.

noelwelsh: We're using Redis as an intermediate storage sink for RabbitMQ and instead of just passing messages from Redis to RabbitMQ, we move them in chunks. I'll explain below why we could not move away from Redis. "mbell" is correct about why do it that way.

NOTE: this is going to be a long reply;

This blog post talks mostly about data acquisition, let me start by giving some background on our backend services. You may read more about VWO on visualwebsiteoptimizer.com, so I'm skipping that. We've multiple servers across globe which helps us do data acquisition (capturing data for analytics) and servers which serve javascript snippets which are applied on one's website. Our users install VWO code on their website and depending on the test etc. the code from our servers is served dynamically and applied (like "karolisd" commented, we do it as fast as possible and we're still tuning our systems, and it is dynamic).

We don't use any CDN (such as akamai, cloudfront etc.) for our dynamic content as they fail at providing us tweaking mechanisms while serving dynamic content on same url and such a design would break user experience and we don't want our users to keep changing the installed code on their websites -- for them it should just work. Many such services require you to install some code that would brings some js from a url, each time you modify something they may either ask you to install new code with the new url or if they're using CDN they might send you all the changes by the same url (for ex. increase of payload size due to unnecessary data).

So, we have two most important requirements; one -- to do data acquisition reliably, as fast as possible with minimum payload; two -- to serve content dynamically with minimum payload and as fast as possible because it is most important for us to not slow down website of our users.

To do that we use a custom-compiled OpenResty (nginx mod) with luajit and our (Lua) code runs inside OpenResty offering us minimum latencies and processing speeds (no reverse proxies). At the time we started solving our scaling problem, there was no lua-resty library that does publishing from Lua/OpenResty to RabbitMQ. Writing a production grade lua-resty AMQP library would require a lot of time (you may search our discussions on openresty-en mailing list). So I started by writing an opensource stomp based library to use RabbitMQ's STOMP adapter and STOMP was much light weight and easy to implement compared to AMQP (there are multiple versions as well). So, in our Lua/OpenResty code we had two options either to publish messages directly to RabbitMQ or to Redis. Publishing to RabbitMQ was slow due to latencies, the AMQP overhead and the small payload size (less than 1kB), it was a deal breaker for us. But running Redis locally on unix sockets to which our Lua/OpenResty code would write was much better in terms of latencies, and transferring data in chunks from Redis to RabbitMQ improved throughputs.

Coming back to the earlier comment on this thread, Redis does not add any failure case, instead it provides us a failover: So far I've seen that there may be more cases or chances of the whole datacenter (network) going down instead of a server, in that case data sits in Redis. Our cron jobs along with monitoring tools make sure local services on each servers are up and running. In case our server goes down, our Anycast DNS (with low TTL) would switch traffic to other available servers automatically. In case RabbitMQ servers/datacenter goes down, data would be pushed into it next time the network/server goes down; using reliable messaging ensures messages/chunks are written to disk (fsync) so they persist if not consumed, publisher confirms give us reliability. In case the consumers die, data sits in RabbitMQ. In case of timeouts and latencies of moving data to RabbitMQ, agentredrabbit handles that.

Comments, questions welcome.

Thanks for the detailed response. I hope to learn from your experiences.

AMQP is indeed a complicated protocol so I can appreciate your reluctance to write a client.

The failure mode I was referring to with Redis is that you now have additional custom code which pulls from Redis and inserts to RabbitMQ. This is another process that can fail, and some of those failure modes could result in duplicate or lost messages. For example, agentredrabbit could shutdown in an unclean way where the "dump" file doesn't get written. Also, you didn't mention if you're using Redis in AOF or RDB persistence mode, but in RDB it only writes to disk occasionally so you could lose messages there. Also, if agentredrabbit fails then Redis could run out of memory (RabbitMQ automatically writes to disk once memory limits are exceeded).

I will believe that writing to Redis is /always/ faster than writing to RabbitMQ, but I'd really like to know what other things you guys tried to make RabbitMQ actually work for you. Did you experiment with queue settings? Memory limits? Erlang tweaks? SSDs? What was it about shovel that didn't work the way you wanted it to? I'm working towards what I hope will be a pure RabbitMQ infrastructure so knowing what you've tried and didn't try will be informative.


Yes, you're correct there are corner cases where every component could fail. For example, a DDOS on our servers could slow us down or a kill -9 of RabbitMQ (even with persistence), OpenResty or agentredrabbit and the consumers could result in loss of messages or unexpected results. We've done such tests and seen loss of messages. Our servers have enough disk space and RAM to hold some 100s of billions messages and the multiple servers make sure we don't run out of resources. Queue settings are very important, for our case we've persistent messages with durable queues with few other tweaks and settings. Lastly, yes we explored our options and at that time used the best of what we could get and hack the best we could come up with in limited time and resources. If you're just starting with RabbitMQ I would recommend the tutorial on RabbitMQ's website and the book "RabbitMQ in Action" by old_sound et al

What happens when your RabbitMQ queue goes down and messages start building up in redis? Won't that eventually use up all of the memory on the machine?

We followed a similar path using Redis as a queue, and ended up having to write a 'valve' process that watches the queue and starts popping off messages into a file when the queue length gets too long. At that point we had half-reimplemented a real queue.

So you make sure that you've multiple RabbitMQ servers or a RabbitMQ cluster so there is no single point of failure :)

Ok, more simply, what about a network partition where the redis machine gets cut off from the RabbitMQ servers?

My only point is really just that redis-as-a-queue has some bad failure modes if you aren't careful.

Sure, I agree and I understand things go wrong, so we've monitoring tools (munin, pingdom, pagerduty etc.) and we check them often or they post notifications often. There are at least two folks on-call 24x7. Based on present workloads we've calculations on how much time it would take to exhaust resources and we plan our servers accordingly, this buys us time to react.

Is there a reason not to just use Tibco RV? Because all of these are long-solved problems.

We love opensource, AMQP was created by wall-street giants and RabbitMQ fits our case and Tibco RV does not solve our particular case in our particular environment.

Did you guys evaluate ZeroMQ? IIRC It's made by the same guys that originally designed AMQP. If you did, I'm curious to see what your conclusions were.

Yes. Among several queuing systems we played with RabbitMQ just worked out of the box with features we wanted such as reliability, confirms and the queueing patterns (routing, topologies, fan-out etc.), it was easy to use and deploy. 0MQ gave us no message persistence or broker implementation (having a broker decouples our producers and consumers), if we were to use 0MQ many features we wanted would have to written which RabbitMQ provided out of the box. I think 0MQ is more like a framework than a queueing system/platform like RabbitMQ which just works.

Last time i checked it was quite expensive

There is no such thing as cheap or expensive in business, there is only worth the money, or not. And time is also money.

Twitter too, shouldn't have wasted so much time reinventing the wheel and very publicly getting it wrong a few times. It probably cost them a lot more in the end too (but hey, it was only VC money, right?)

When I'm working on an A/B test in Visual Website Optimizer, the speed at which the script is updated when I hit save never ceases to amaze me. I've never had to hit refresh multiple times and wait for it, unlike with other services. It's interesting to see how it works behind the scenes.

Thanks for your comment. I'm the backend engineer at Wingify who wrote this post. This post talks mostly about data acquisition, I'll suggest my team to post more technical details of our dynamic CDN which serves the dynamic content.

Very cool, and awesome on opensourcing the agentredrabbit code!

We had a somewhat simpler but related "tons of data coming in via RESTful services and don't want to flood directly to the db with massive parallel connections" issue and solved it with Redis likewise. Since we used postgres as the backend, we (http://www.ivc.com) actually sponsored a project create a Redis FDW:


so we could batch up 1,000s of inserts into a single database insert, thus reducing IO to disk.

Great, thanks for sharing.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact