
Show HN: Auto-Cluster RabbitMQ with AWS Autoscaling Groups - crad
http://aweber.github.io/rabbitmq-autocluster/
======
Cieplak
RabbitMQ is a great tool, but you should always test your deployment with a
tool like Jepsen. Rabbit has historically had issues with multi-node clusters
related to split-brain and dropped acked messages following network
partitions. In contrast, I've had very few issues using Rabbit shovels, a
store-and-forward plugin:

[https://www.rabbitmq.com/shovel.html](https://www.rabbitmq.com/shovel.html)
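For anyone who hasn't used shovels: they can be declared dynamically as runtime parameters rather than in the static config. A minimal sketch of relaying one queue to a remote broker might look like this (queue names and URIs are illustrative, not from the project):

```shell
# Declare a dynamic shovel that consumes from the local "orders" queue
# and republishes to the same-named queue on a remote broker.
# The shovel buffers locally and forwards when the destination is reachable.
rabbitmqctl set_parameter shovel relay-orders \
  '{"src-uri": "amqp://localhost", "src-queue": "orders",
    "dest-uri": "amqp://remote-host", "dest-queue": "orders"}'
```

This requires the `rabbitmq_shovel` plugin to be enabled; the store-and-forward behavior is what makes shovels tolerant of the network partitions that trip up full clustering.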

------
spotman
Looks like a helpful project.

Having said that, clustering, or at least configuring it, is often not the
hard part with rabbit.

The more difficult issues are keeping it clustered through the network
hiccups typical of cloud environments (rabbit can be sensitive to these),
preventing mnesia from going out to lunch, and, as far as scaling goes,
proper planning, including queue HA strategies and queue ownership.

Since rabbit doesn't automatically scale each queue across all boxes in a way
that distributes load, a smart team will use several queues carefully plotted
onto specific hosts that own each queue, and replicate each queue to only a
small number of other rabbits, never the whole cluster.
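A placement strategy like that is usually expressed with policies. For example, a classic mirrored-queue policy that replicates matching queues to exactly two nodes rather than cluster-wide might look like this (the policy name and queue pattern are illustrative):

```shell
# Mirror queues whose names start with "orders." to exactly 2 nodes
# (the master plus one mirror) and sync new mirrors automatically,
# instead of mirroring to every node in the cluster (ha-mode: all).
rabbitmqctl set_policy ha-orders "^orders\." \
  '{"ha-mode": "exactly", "ha-params": 2, "ha-sync-mode": "automatic"}' \
  --apply-to queues
```

Bounding the mirror count keeps replication traffic proportional to the policy, not to cluster size, which is the point being made above.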

For these reasons I have to think this would be a nice addition to aid in
setting up clusters and maintaining a fleet size; for actually scaling and
operating a high-volume rabbit cluster, however, it would be only a small
tool in the toolbox.

Without knowing the whole picture of what kind of workload birthed this
approach, it makes me nervous to think of rabbit being managed by autoscale:
the group deciding it wants to replace pieces of the cluster at any time,
different high-volume queues landing on different hosts, mnesia deciding it
doesn't want to play ball anymore, etc.

Possibly it's for very volatile workloads where none of the queues are HA,
and this plugin forcefully hard-resets rabbit to recluster it when something
goes awry?

~~~
crad
Our use case is not for scaling the cluster up and down, but rather for
ensuring cluster size and stability. When AWS nodes randomly go away or are
retired, the autoscaling group ensures that we keep N nodes in the cluster,
up and running, by using ELB health checking.

This is the first version that has the destructive action of removing nodes
from the cluster when they disappear, and even then it's behind two different
config variables: one to turn cluster health checking on, and the other to
actually do something when bad nodes are noticed. A bad node is one that is
no longer visible in the service discovery backend and no longer pingable by
the cluster.
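For reference, the plugin is configured through environment variables; the two settings described above might be wired up along these lines (variable names are my recollection of the project README and may differ between versions, so check it before relying on them):

```shell
# Enable the periodic cluster-cleanup check that looks for nodes missing
# from both the service discovery backend and the Erlang distribution.
export AUTOCLUSTER_CLEANUP=true

# Only log a warning about bad nodes; set to false to actually remove
# nodes that fail both checks from the cluster.
export CLEANUP_WARN_ONLY=true
```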

~~~
spotman
So if you have an ASG of 4 nodes and one goes down, how do you know that the
traffic will land on the right node? What if that node is really busy? When
the new machine comes up automatically does the system migrate queue ownership
to it? Is there a time when the system can be partially unavailable?

Possibly most importantly: how are you routing queues to hosts, and how is
the platform load balanced? Using an ELB with a TCP listener is a great way
to throw away performance with clustered rabbit, not to mention adding a
ceiling on how wide the cluster can get.

~~~
crad
In our case, we use queues in ha-mode:all and queue ownership isn't as much of
an issue there. Our biggest limiter for some of our workloads is actually
message size, not velocity. That may change in the future.

You're absolutely correct re load balancing for high-performance/ceilings/etc.
It's not a huge issue for our use, but I can see how it can be for others. In
my previous environment, for example, it was absolutely critical.

I'm sure this tool isn't for everyone; for us it was about making cluster
startup easier and letting RabbitMQ handle that instead of a configuration
management system like Chef/etc.

For example, our test environment is torn down every night on an autoscaling
schedule and rebuilt in the morning. Not having to worry about config
management for starting up a new RabbitMQ cluster on demand is a nice feature.
:)

Thanks for pointing out the clear (and accurate) concerns one must consider
when running RabbitMQ @ scale. It sounds like your experience of where the
gotchas can be is similar to my own.

