

Comparing Message Queue Architectures on AWS - itaifrenkel
http://tech.forter.com/comparing-message-queue-architectures-on-aws/

======
yashinovsky
Have you also evaluated Kafka? It is a common choice for feeding Storm.

~~~
eikenberry
This is for AWS. Kafka isn't designed to work well in an environment where
partitioning occurs with any frequency. AWS is such an environment.

[https://aphyr.com/posts/293-call-me-maybe-kafka](https://aphyr.com/posts/293-call-me-maybe-kafka)

~~~
yashinovsky
For that matter, RabbitMQ isn't either; see
[https://aphyr.com/posts/315-call-me-maybe-rabbitmq](https://aphyr.com/posts/315-call-me-maybe-rabbitmq).

------
jedberg
The first architecture using two ELBs has a long list of cons, most of which
are solved by using HAProxy as your internal load balancer. You may want to
consider adding that as another option in the matrix.

~~~
itaifrenkel
Interesting. Could you please elaborate?

~~~
jedberg
Sure! Here's your list of cons and how HAProxy would solve them:

> Some API requests need to get a higher priority over other API requests, and
> that is not taken care of. This was one of our main problems with this
> architecture, especially with a mix of real-time clients and clients that
> send batch jobs.

HAProxy would let you assign requests to different pools, each with its own
priority and queue. At reddit, we had it broken down roughly into four
quartiles based on the 95th-percentile response times of each API call.
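
That pooling scheme can be sketched in Python; the endpoint names and p95
latencies below are invented for illustration, and in production each pool
would map to an HAProxy backend with its own queue rather than a dict:

```python
# Hypothetical sketch: rank API calls by their measured p95 response time
# and split them evenly into four priority pools (quartiles).
p95 = {  # seconds, hypothetical measurements
    "/realtime/score": 0.05,
    "/user/profile": 0.15,
    "/batch/report": 3.0,
    "/batch/export": 8.0,
}

ranked = sorted(p95, key=p95.get)  # fastest endpoints first
# evenly assign ranks to quartile pools: pool1 (fastest) .. pool4 (slowest)
pools = {call: f"pool{i * 4 // len(ranked) + 1}" for i, call in enumerate(ranked)}

for call in ranked:
    print(call, "->", pools[call])
```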

> This architecture assumes that there is enough Processing Servers to handle
> all requests (peak throughput). If there isn’t the Processing Server applies
> back pressure on the API server (error or timeout), which in turn returns an
> error to the API user which in turn re-tries the API request (applying more
> pressure). To avoid this, the number of running processing servers needs to
> be enough to handle peak traffic.

Using HAProxy in the middle, that tier will queue the requests, so all the
back pressure builds up at that second load balancer. Whether that is good or
not is questionable, but at least you won't return errors to the clients right
away. You'll still need a pretty long timeout, depending on how long it takes
for resources to come online, but then you could go back to the first part and
have longer or shorter timeouts per API call, as is appropriate for your
application.
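
A minimal sketch of that behaviour, with a bounded `queue.Queue` standing in
for the HAProxy tier (the queue size and timeout are illustrative):

```python
# Sketch: a bounded queue at the middle tier absorbs bursts; only when it is
# full and the timeout expires does back pressure reach the client as an error.
import queue

middle_tier = queue.Queue(maxsize=3)  # stands in for the HAProxy queue

def submit(request, timeout=0.01):
    try:
        middle_tier.put(request, timeout=timeout)
        return "queued"
    except queue.Full:
        return "error"  # back pressure finally surfaces to the client

results = [submit(f"req-{i}") for i in range(5)]
print(results)  # first 3 are queued, the rest error out
```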

> ELB was not designed to handle huge traffic spikes since it takes a few
> minutes to internally scale. You should contact AWS support to warm your ELB
> if you have a planned traffic spike. We at Forter are in the eCommerce
> market where traffic spikes are rare.

This is still true, and there is no easy way around it unless you also make
haproxy your front end load balancer. If you make it your frontend as well,
you can have a hot spare on standby and spin up new ones pretty quickly. That
being said, I believe the ELB team is making improvements in this area in
2015.

> API Server needs to handle retries. Not provided by the ELB itself.
> Processing Server must respond within the http timeout (configurable between
> 1 to 3600 seconds). Otherwise the protocol would need two phases which adds
> more complexity.

Still true, although it will have to retry less because of the queues in the
middle layer. Again, though, if you use HAProxy as the front end you can solve
this issue as well by sending the request to a "retry pool".

------
tylertreat
Really nice write-up. I'm curious if you've done any extensive
throughput and latency benchmarking? I saw the note saying "ballpark figures."

FWIW, I've been working on a framework for empirically testing queue
performance for scaled-up, distributed deployments
([https://github.com/tylertreat/Flotilla](https://github.com/tylertreat/Flotilla)).
Haven't gotten around to adding support for AWS services yet, but would be
interesting to see how they compare.

~~~
itaifrenkel
Do you have a document showing the results of the Flotilla tests ... how
beanstalkd compares with RabbitMQ for example?

~~~
tylertreat
Not yet. The blog post linked in the readme there provides some more
background on the motivation behind the project.

I'm hoping to do some in-depth analysis of several brokers at some point, but
I want to get the benchmark instrumentation right first.

------
itaifrenkel
OP here... I would be happy to discuss any comments you have.

~~~
pan69
For RabbitMQ you mention:

    No message delivery guarantee in face of RabbitMQ server failure.

Wouldn't that be solved by using a persistent queue and high-availability
clustering (i.e. the queue is mirrored across N servers)?

~~~
ekimekim
Another very minor detail, but you mention that RabbitMQ can do priority based
on multiple queues. While that's certainly a fine way to do it, it's worth
noting that AMQP also supports per-message priorities within one queue:

[http://www.rabbitmq.com/amqp-0-9-1-reference.html#class.basi...](http://www.rabbitmq.com/amqp-0-9-1-reference.html#class.basic)

> The server MUST implement at least 2 priority levels for basic messages,
> where priorities 0-4 and 5-9 are treated as two distinct levels.

Depending on your client, it may be difficult to prioritize consumption of one
queue over another, so this solution could be preferred.
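
The two-level minimum the spec mandates can be sketched like this (an
in-memory toy to show the semantics, not RabbitMQ client code):

```python
# Sketch of the AMQP minimum: per-message priorities 0-9 collapsed into two
# levels (0-4 low, 5-9 high); higher-level messages are consumed first.
from collections import deque

levels = {"high": deque(), "low": deque()}

def publish(body, priority=0):
    levels["high" if priority >= 5 else "low"].append(body)

def consume():
    for level in ("high", "low"):
        if levels[level]:
            return levels[level].popleft()
    return None

publish("batch job", priority=1)
publish("realtime score", priority=9)
print(consume())  # "realtime score" wins despite being published second
```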

~~~
itaifrenkel
Thanks! I'll update the post. A quick search found this RabbitMQ plugin,
[https://github.com/rabbitmq/rabbitmq-priority-queue](https://github.com/rabbitmq/rabbitmq-priority-queue),
which specifies: "In contrast to the AMQP spec, RabbitMQ queues by default do
not support priorities. When creating priority queues using this plugin, you
can specify as many priority levels as you like."

------
jpp123
Out of curiosity, did you look at Gearman? I really like the ability to
coalesce identical jobs.

~~~
itaifrenkel
Not yet. Gearman is a job server that sits on top of a queue (much like
Python's Celery). We did not need all that, given that we are using Storm.

------
omni
You give Redis a yellow rating on prioritization, noting that this can be
partially achieved using multiple lists. Wouldn't it be much better to take
advantage of Redis's sorted set type?

~~~
itaifrenkel
A sorted set does not have the properties of a queue. For example, it does
not allow duplicate members (different messages with the same priority).

~~~
omni
Stick a UUID on your queued items as part of the priority queue
implementation.
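
A sketch of that idea, using a plain dict as a stand-in for the Redis sorted
set (with a real server the same logic maps to ZADD and ZPOPMIN via a client
such as redis-py):

```python
# Priority queue on a sorted set: score = priority, member = "uuid:payload",
# so identical payloads at the same priority never collide as members.
import uuid

zset = {}  # member -> score, mimicking a Redis sorted set

def enqueue(payload, priority):
    member = f"{uuid.uuid4()}:{payload}"
    zset[member] = priority  # ZADD queue <priority> <member>

def dequeue():
    if not zset:
        return None
    member = min(zset, key=zset.get)  # ZPOPMIN queue
    del zset[member]
    return member.split(":", 1)[1]  # strip the UUID, return the payload

enqueue("msg-a", priority=1)
enqueue("msg-a", priority=1)  # duplicate payload, distinct member
enqueue("msg-b", priority=0)
print(dequeue())  # "msg-b" (lowest score first)
```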

------
threeseed
Surprised they didn't mention IronMQ:
[http://www.iron.io/mq](http://www.iron.io/mq)

It's not perfect but pretty damn close. And it runs on AWS et al.

~~~
itaifrenkel
You are right about IronMQ; they are said to have a very strong product. For
the record, we haven't actually checked Redis Labs and CloudAMQP either, but
since they are based on open-source software that is mentioned in the blog
post, I added them to the notes section. I couldn't say anything intelligent
about IronMQ since it is not based on an open-source offering, so other than
actually trying it, I cannot tell its pros and cons.

------
kondro
I'm surprised that SNS wasn't mentioned here, being an AWS queueing product.

~~~
itaifrenkel
Simple Queue Service (SQS) was mentioned; Simple Notification Service (SNS)
was not. Did you mean SNS or SQS?

~~~
kondro
Given that I was surprised that SNS wasn't mentioned, I think I mean SNS,
given they are both queuing services.

~~~
itaifrenkel
SNS is a push notification service that focuses on various consumers outside
the cloud. It can be used with web hooks (HTTP endpoints) to publish the same
message to multiple web servers. In that case it guarantees at-least-once
semantics (the message is delivered, but sometimes more than once), but it
does not guarantee FIFO.
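
Since at-least-once delivery implies occasional duplicates, consumers
typically deduplicate on a message id. A minimal sketch (the ids and payloads
are made up):

```python
# Idempotent consumer: remember seen message ids and skip redeliveries.
seen = set()
processed = []

def handle(message_id, body):
    if message_id in seen:
        return "duplicate"  # already handled, skip side effects
    seen.add(message_id)
    processed.append(body)
    return "processed"

deliveries = [("m1", "order"), ("m2", "refund"), ("m1", "order")]  # m1 redelivered
print([handle(mid, body) for mid, body in deliveries])
```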

~~~
kondro
SQS also doesn't guarantee FIFO.

SNS guarantees delivery with back-off to HTTP/S endpoints (handling HTTP
status codes properly) with very low latency in an asynchronous manner.

