Comparing Message Queue Architectures on AWS (forter.com)
87 points by itaifrenkel on Jan 24, 2015 | 42 comments

Have you also evaluated Kafka? It is a common choice for feeding Storm.

Sure we have. Kinesis was considered because it was inspired by Kafka, and Kafka is known to work well. The thing is that any distributed stateful service, by design, requires DevOps experience and takes time to manage. We try to stay lean at this stage, and chose Kinesis over Kafka because it took very little time to set up and has no "maintenance costs". Had we chosen Kafka, we would have enjoyed much lower latency.

This is for AWS. Kafka isn't designed to work well in an environment where partitioning occurs with any frequency. AWS is such an environment.


For that matter, RabbitMQ isn't either. See https://aphyr.com/posts/315-call-me-maybe-rabbitmq .

I always wondered if Amazon would really be open about it and go public if a split brain in their network caused one of the hosted data services to lose data. Maybe they just throw money at the hardware and hope for the best.

Kafka is a log, not a queue. The two have ever so slightly different semantics: logs suffer head-of-line (HOL) blocking per partition when processing messages because they track progress with a watermark, whereas queues track progress on a per-message basis.
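
To make the difference concrete, here is a minimal in-memory sketch. It does not use Kafka or any real broker; `consume_log`, `consume_queue`, and the "poison" message are made up for illustration. The log consumer only advances its watermark past a success, so one bad message holds up everything behind it, while the queue consumer requeues the bad message and moves on.

```python
from collections import deque


def consume_log(messages, can_process, attempts):
    """Log-style consumer: a single watermark (offset) per partition."""
    offset, done = 0, []
    for _ in range(attempts):
        if offset >= len(messages):
            break
        msg = messages[offset]
        if can_process(msg):
            done.append(msg)
            offset += 1  # the watermark advances only past a success
        # On failure the watermark stays put, so the same message is retried.
    return done


def consume_queue(messages, can_process, attempts):
    """Queue-style consumer: every message is tracked individually."""
    pending, done = deque(messages), []
    for _ in range(attempts):
        if not pending:
            break
        msg = pending.popleft()
        if can_process(msg):
            done.append(msg)     # acknowledged on its own
        else:
            pending.append(msg)  # requeued; later messages are not held up
    return done


messages = ["a", "poison", "b", "c"]
ok = lambda m: m != "poison"

log_result = consume_log(messages, ok, attempts=6)      # stuck behind "poison"
queue_result = consume_queue(messages, ok, attempts=6)  # "b" and "c" get through
```

With the same six attempts, the log consumer only gets through "a", while the queue consumer also processes "b" and "c".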

The article already includes a couple of "not queues" that they use for event stream processing, Kinesis being one of them. Kinesis consumers also maintain a watermark to keep track of how far they've read in a stream. Considering that, a Kafka comparison seems like a reasonable ask.

The first architecture using two ELBs has a long list of cons, most of which are solved by using HAProxy as your internal load balancer. You may want to consider adding that as another option on the matrix.

Interesting. Could you please elaborate?

Sure! Here's your list of cons and how HAProxy would address each one:

> Some API requests need to get a higher priority over other API requests, and that is not taken care of. This was one of our main problems with this architecture, especially with a mix of real-time clients and clients that send batch jobs.

HAProxy would let you assign different requests to different pools, each with its own priority and queue. At reddit, we had it broken down roughly into four quartiles based on the 95th percentile response speed of each API call.
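
A rough sketch of the grouping step only, not of any HAProxy configuration: given a p95 response time per API call, split the calls into four pools by quartile. The call names and latencies below are invented, and `assign_pools` is a hypothetical helper.

```python
from bisect import bisect_left
from statistics import quantiles

# Hypothetical API calls with their 95th percentile response times in ms.
P95_MS = {"/ping": 10, "/score": 50, "/search": 200, "/batch-import": 900}


def assign_pools(p95_by_call):
    """Map each call to a pool index 0..3 (0 = fastest quartile)."""
    cuts = quantiles(p95_by_call.values(), n=4)  # three quartile cut points
    return {call: bisect_left(cuts, p95) for call, p95 in p95_by_call.items()}


pools = assign_pools(P95_MS)
```

Each pool index would then correspond to one backend pool in the load balancer, so slow batch calls cannot starve the fast real-time ones.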

> This architecture assumes that there is enough Processing Servers to handle all requests (peak throughput). If there isn’t the Processing Server applies back pressure on the API server (error or timeout), which in turn returns an error to the API user which in turn re-tries the API request (applying more pressure). To avoid this, the number of running processing servers needs to be enough to handle peak traffic.

With HAProxy in the middle, that tier queues the requests, so all the back pressure builds up at the second load balancer. Whether that is good is debatable, but at least you won't return errors to clients right away. You'll still need a fairly long timeout, depending on how long it takes for resources to come online, but then you could go back to the first point and set longer or shorter timeouts per API call, as appropriate for your application.
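
As an illustration of that queueing behaviour (a toy model, not how HAProxy is implemented), a bounded backlog with a wait timeout only rejects a request once the backlog has stayed full for the whole timeout. `MiddleTier` and its parameters are hypothetical names.

```python
import queue


class MiddleTier:
    """Toy stand-in for a queueing load balancer between API and processing."""

    def __init__(self, max_queued, wait_seconds):
        self.backlog = queue.Queue(maxsize=max_queued)
        self.wait_seconds = wait_seconds

    def accept(self, request):
        """Queue the request; fail only if the backlog stays full too long."""
        try:
            self.backlog.put(request, timeout=self.wait_seconds)
            return True
        except queue.Full:
            return False  # the client sees an error only at this point

    def next_request(self):
        """Called by a processing server when it has spare capacity."""
        return self.backlog.get_nowait()


tier = MiddleTier(max_queued=2, wait_seconds=0.01)
first = tier.accept("r1")      # queued
second = tier.accept("r2")     # queued
third = tier.accept("r3")      # backlog full for the whole timeout: rejected
served = tier.next_request()   # a processing server frees a slot
retry = tier.accept("r3")      # now there is room
```

A longer `wait_seconds` trades client-visible errors for client-visible latency, which is the trade-off described above.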

> ELB was not designed to handle huge traffic spikes since it takes a few minutes to internally scale. You should contact AWS support to warm your ELB if you have a planned traffic spike. We at Forter are in the eCommerce market where traffic spikes are rare.

This is still true, and there is no easy way around it unless you also make HAProxy your front-end load balancer. If you do, you can keep a hot spare on standby and spin up new ones pretty quickly. That said, I believe the ELB team is making improvements in this area in 2015.

> API Server needs to handle retries. Not provided by the ELB itself. Processing Server must respond within the http timeout (configurable between 1 to 3600 seconds). Otherwise the protocol would need two phases which adds more complexity.

Still true, although it will have to retry less because of the queues in the middle layer. Again, though, if you use HAProxy as the front end you can solve this issue as well by sending the request to a "retry pool".

Really nice write up. I'm curious if you guys have done any extensive throughput and latency benchmarking? I saw the note saying "ballpark figures."

FWIW, I've been working on a framework for empirically testing queue performance for scaled-up, distributed deployments (https://github.com/tylertreat/Flotilla). Haven't gotten around to adding support for AWS services yet, but would be interesting to see how they compare.

We haven't done comparative throughput benchmarking. SaaSiness, low latency, and use case matching were the main concerns in choosing the queues. As for throughput, the main concern was the event stream processing pipeline. Redis and Kinesis are (somewhat) SaaSed on AWS, throughput can scale out horizontally with sharding, and we haven't had any problems since :)

Do you have a document showing the results of the Flotilla tests ... how beanstalkd compares with RabbitMQ for example?

Not yet. The blog post linked in the readme there provides some more background on the motivation behind the project.

I'm hoping to do some in-depth analysis of several brokers at some point, but I want to get the benchmark instrumentation right first.

OP here... I would be happy to discuss any comments you have.

For RabbitMQ you mention:

> No message delivery guarantee in face of RabbitMQ server failure.

Wouldn't that be solved by using a persistent queue and high availability clustering (i.e. the queue is duplicated over N servers)?

Another very minor detail, but you mention that RabbitMQ can do priority based on multiple queues. While that's certainly a fine way to do it, it's worth noting that AMQP also supports per-message priorities within one queue:


> The server MUST implement at least 2 priority levels for basic messages, where priorities 0-4 and 5-9 are treated as two distinct levels.

Depending on your client, it may be difficult to prioritize consumption of one queue over another, so this solution could be preferred.
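
For illustration, here is a small in-memory model of the minimum the quoted sentence requires: priorities 0-4 and 5-9 collapse into two levels, with FIFO order inside each level. This is not a RabbitMQ or AMQP client; `TwoLevelQueue` is a made-up class.

```python
import heapq
import itertools


class TwoLevelQueue:
    """Single queue with two priority levels: 5-9 (high) and 0-4 (low)."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that keeps FIFO per level

    def publish(self, body, priority=0):
        if not 0 <= priority <= 9:
            raise ValueError("priority must be between 0 and 9")
        level = 0 if priority >= 5 else 1  # lower level value pops first
        heapq.heappush(self._heap, (level, next(self._seq), body))

    def get(self):
        return heapq.heappop(self._heap)[2]


q = TwoLevelQueue()
q.publish("batch-1", priority=1)
q.publish("realtime-1", priority=9)
q.publish("batch-2", priority=4)
q.publish("realtime-2", priority=5)

order = [q.get() for _ in range(4)]
```

The consumer reads from one queue and still sees both real-time messages before either batch message.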

Thanks! I'll update the post. A quick search found this RabbitMQ plugin https://github.com/rabbitmq/rabbitmq-priority-queue , which specifies: "In contrast to the AMQP spec, RabbitMQ queues by default do not support priorities. When creating priority queues using this plugin, you can specify as many priority levels as you like."

You are correct. The sentence you referenced is in the section that describes RabbitMQ without clustering. The next section talks about the tradeoffs of RabbitMQ clustering.

OK, fair enough. But that's merely a configuration choice and not really a shortcoming of RabbitMQ. Even without a cluster, persistent queues are at your disposal; it's just that an HA cluster would give you more of a guarantee.

Really enjoyed reading your post by the way...

Thanks. The pros/cons were supposed to refer to the complete architecture setup, and not to the queue component by itself. I can see how they could have been read that way...

I know it's not the core topic of the article, but I would love a deeper dive into your Kinesis implementation. What are your API and processing servers written in? How easy was it to pump events into Kinesis? What does your event processing architecture look like with respect to Kinesis? Where does the event data end up? Redshift?

The event stream processing is probably worthy of another blog post, but to be brief... we try to align all of our API servers (or event dispatchers) to be nodejs and have all of the processing in Storm (Java). But reality is a bit more complicated. Specifically, the API server pushing events into Kinesis is in Python. It is a refactoring of some code that was written in the first days of the company. At the time it was based on multiple processes communicating through files... which didn't scale (as one would expect). The Kinesis implementation was done by a researcher who is a very talented coder but did not have experience with Kinesis. He was up and running after a short whiteboard session and a few hours of coding. We had to tweak a few things in the following days (such as adding more shards for higher read throughput, then abandoning multi-shard streams altogether in favor of multiple streams with one shard each).
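
The comment above does not say how events are routed among the single-shard streams, so the following is only one plausible sketch under that assumption: hash a partition key and pick a stream by modulo, which keeps all events for one key in the same stream. The stream names and `stream_for` are hypothetical.

```python
import hashlib

# Hypothetical stream names; each stream is assumed to have exactly one shard.
STREAMS = ["events-0", "events-1", "events-2"]


def stream_for(partition_key, streams=STREAMS):
    """Pick a stream deterministically from the partition key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return streams[int.from_bytes(digest, "big") % len(streams)]


chosen = stream_for("user-42")
spread = {stream_for(f"user-{i}") for i in range(100)}
```

Note that changing the number of streams remaps most keys, so this simple modulo scheme suits a fixed stream count.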

I'm also very interested in the details of your event streaming. Are you using Node.js streams at all? I am a huge fan of Node streams but see few examples of them used in large-scale production, especially when combined with other architectures (as in your Java Storm example).

We are currently not using nodejs streams. The nodejs components are the event dispatchers placing the events in a queue; they do not include much event processing. Most of the event processing is done in Apache Storm. We have also contributed the Storm-Nodejs integration, but the nodejs Storm components are not using streams either. I have tried using streams to compress the information between the Java parent process in Storm and the nodejs child process, but compression wasn't a good tradeoff, so even that little bit didn't get into production.

Any reason why you don't send the data directly to the processing servers as a way to minimize complexity? If it fails to reach one, try the other.
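
A minimal sketch of that "try the other" idea, assuming each processing server is reachable through some callable that raises `ConnectionError` on failure. `send_with_failover` and the two fake servers are made up for illustration.

```python
def send_with_failover(event, senders):
    """Try each processing server in order; return the first success."""
    errors = []
    for send in senders:
        try:
            return send(event)
        except ConnectionError as exc:
            errors.append(exc)  # remember the failure and try the next server
    raise ConnectionError(f"all {len(senders)} processing servers failed: {errors}")


def unreachable(event):
    raise ConnectionError("server A is down")


def healthy(event):
    return f"processed {event} on server B"


result = send_with_failover("evt-1", [unreachable, healthy])
```

If every server fails, the caller gets a single `ConnectionError` and has to decide whether to retry, buffer, or drop the event.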

When you mention low latency, how low are we talking? Milliseconds, seconds, minutes? The reason I am asking is that you can use S3 as intermediate storage: you ship your compressed logs/events there at a rollover interval and the processing servers discover them there. Of course, this only works if latency is not a big deal.
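
A sketch of the rollover idea with the storage call left abstract: events are buffered and shipped as one gzip batch once the interval has elapsed. `RolloverBuffer` is a hypothetical class, and `upload` stands in for whatever actually writes the object (for example a wrapper around an S3 put); no AWS call is made here.

```python
import gzip
import json
import time


class RolloverBuffer:
    """Collects events and ships them as one gzip batch per rollover interval."""

    def __init__(self, interval_seconds, upload, clock=time.monotonic):
        self.interval_seconds = interval_seconds
        self.upload = upload  # hypothetical callable that stores the batch
        self.clock = clock
        self.events = []
        self.started = clock()

    def add(self, event):
        self.events.append(event)
        if self.clock() - self.started >= self.interval_seconds:
            self.flush()

    def flush(self):
        if self.events:
            lines = "\n".join(json.dumps(e) for e in self.events)
            self.upload(gzip.compress(lines.encode("utf-8")))
        self.events = []
        self.started = self.clock()


# Usage with a fake clock so the example does not have to wait a minute.
now = [0.0]
uploaded = []
buf = RolloverBuffer(60, uploaded.append, clock=lambda: now[0])
buf.add({"id": 1})   # t = 0, stays buffered
now[0] = 61.0
buf.add({"id": 2})   # interval elapsed, both events ship together
batch = gzip.decompress(uploaded[0]).decode("utf-8").splitlines()
```

An event can wait up to a full interval before it ships, which is why this only fits workloads that tolerate that delay.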

Our first architecture was indeed peer-to-peer, where all components discovered each other. It is difficult to isolate failures in this kind of architecture. A reliable queue makes a big difference in zooming in on the root cause of a problem (a queue producer or a queue consumer problem). Regarding latency: we are aiming for a few ms tops, preferably less, so bulk events via S3 were not considered.

I'd argue that queues can introduce a lot of similar problems, especially when there's slowness (backing up, slow disk, etc.). Messages sent over TCP directly to the processors are not that hard to design for failure and high availability, and you'll probably reduce the latency significantly.

You can also get rid of auto discovery and use config files.

Out of curiosity, what tool did you use to sketch the various architectures for your post?

I used Lucid Charts. It felt easy to work with.

https://www.lucidchart.com/ http://aws.amazon.com/architecture/icons/

Out of curiosity did you look at gearman? I really like the ability to coalesce identical jobs.

Not yet. gearman is a job service which sits on top of a queue (much like python celery). We did not need all that, given the fact we are using Storm.

You give Redis a yellow rating on prioritization, noting that this can be partially achieved using multiple lists. Wouldn't it be much better to take advantage of Redis's sorted set type?

A sorted set does not have the properties of a queue. For example, it does not allow duplicates (different messages with the same priority).

Stick a UUID on your queued items as part of the priority queue implementation.
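
Here is a sketch of the UUID trick against an in-memory stand-in for a sorted set, not a real Redis client. `SortedSet`, `enqueue`, and `dequeue` are hypothetical names; the stand-in only models that members are unique, each has a score, and the lowest score pops first.

```python
import uuid


class SortedSet:
    """In-memory stand-in: unique members, each with a score."""

    def __init__(self):
        self._scores = {}

    def add(self, member, score):
        self._scores[member] = score  # re-adding a member only updates its score

    def pop_min(self):
        member = min(self._scores, key=lambda m: (self._scores[m], m))
        del self._scores[member]
        return member

    def __len__(self):
        return len(self._scores)


def enqueue(zset, payload, priority):
    """Prefix a UUID so identical payloads become distinct members."""
    zset.add(f"{uuid.uuid4()}:{payload}", priority)


def dequeue(zset):
    return zset.pop_min().split(":", 1)[1]  # strip the UUID prefix


plain = SortedSet()
plain.add("charge", 1)
plain.add("charge", 1)          # silently collapses into one member

tagged = SortedSet()
enqueue(tagged, "charge", 1)
enqueue(tagged, "charge", 1)    # kept as a second item
enqueue(tagged, "audit", 5)
enqueue(tagged, "urgent", 0)
size_before = len(tagged)
order = [dequeue(tagged) for _ in range(size_before)]
```

One caveat: items with equal priority come out in member order, which is effectively random with UUID prefixes, so this gives priority ordering but not FIFO within a priority.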

Surprised they didn't mention IronMQ: http://www.iron.io/mq

It's not perfect but pretty damn close. And it runs on AWS et al.

You are right about IronMQ; they are said to have a very strong product. For the record, we haven't actually checked redislabs and cloudamqp either, but since they are based on open source software that is mentioned in the blog post, I added them to the notes section. I couldn't say anything intelligent about IronMQ since it is not based on an open source offering, so other than actually trying it, I cannot tell its pros/cons.

I'm surprised that SNS wasn't mentioned here, being an AWS queueing product.

Simple Queue Service (SQS) was mentioned; Simple Notification Service (SNS) was not. Did you mean SNS or SQS?

Given that I was surprised that SNS wasn't mentioned, I think I mean SNS, given they are both queuing services.

SNS is a push notification service that focuses on various consumers outside the cloud. It can be used with web hooks (HTTP endpoints) to publish the same message to various web servers. In that case it would guarantee at-least-once semantics (which means the message is delivered, but sometimes twice), but it does not guarantee FIFO.
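
Since at-least-once means the same message may arrive twice, the receiving side usually has to be idempotent. A minimal sketch, assuming every delivery carries a stable message ID; `IdempotentHandler` is a made-up class, not part of any AWS SDK.

```python
class IdempotentHandler:
    """Drops repeat deliveries of a message ID it has already handled."""

    def __init__(self, handle):
        self._handle = handle
        self._seen = set()  # unbounded here; a real service would expire old IDs

    def deliver(self, message_id, body):
        if message_id in self._seen:
            return False  # duplicate delivery, ignored
        self._handle(body)
        self._seen.add(message_id)  # recorded only after handling succeeds
        return True


processed = []
handler = IdempotentHandler(processed.append)
results = [
    handler.deliver("m1", "order created"),
    handler.deliver("m1", "order created"),  # redelivery of the same message
    handler.deliver("m2", "order paid"),
]
```

Because the ID is recorded only after the handler returns, a crash mid-processing leads to a retry rather than a lost message.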

SQS also doesn't guarantee FIFO.

SNS guarantees delivery with back-off to HTTP/S endpoints (handling HTTP status codes properly) with very low latency in an asynchronous manner.
