
Call me maybe: RabbitMQ - timf
http://aphyr.com/posts/315-call-me-maybe-rabbitmq
======
lobster_johnson
RabbitMQ requiring a reliable network is causing problems for us in
production. Has anyone else struggled with this?

We're running several clusters on different providers; one is Digital Ocean,
another is on a partner's VMware vMotion-based system. The kernel gets soft
lockups now and then (in the case of vMotion, when VMs are automatically
migrated to other physical nodes), which causes RabbitMQ partitions. The
lockups may last a few seconds, but I've seen minute-long pauses.

When this happens, RabbitMQ starts throwing errors at clients. Queues
disappear and clients can't do anything even though the local node is running.
Although I understand the behaviour, that's not what I want from a queue; I
want a local node to store messages until it rejoins the cluster and can
dispatch them, and I want the local node to continue offering messages to
those listening.

Unfortunately, RabbitMQ's design doesn't allow this: Queues live on the node
they were created on, and RabbitMQ does not replicate them. We have turned on
autohealing, but I don't like the fact that minority nodes simply wipe their
data when they rejoin the cluster. The federation plugin doesn't look like a
great solution.
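
For reference, this is the knob in question; a sketch of the classic
rabbitmq.config format (path and surrounding context assumed):

    %% /etc/rabbitmq/rabbitmq.config
    %% cluster_partition_handling can be: ignore (the default),
    %% pause_minority, or autoheal. We have autoheal enabled.
    [
      {rabbit, [
        {cluster_partition_handling, autoheal}
      ]}
    ].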

I really like RabbitMQ, but maybe it's time to consider something else. Any
suggestions? Something equally lightweight and "multimaster", but without the
partition problem?

~~~
jpgvm
You need to look into using HA queues (i.e. mirroring) and ensure you are
using a client that properly supports consumer cancel notifications, and that
you are actually processing them correctly.
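
Mirroring is enabled by policy; roughly something like this (the policy name
and queue pattern here are made up):

    rabbitmqctl set_policy ha-all "^ha\." \
      '{"ha-mode":"all","ha-sync-mode":"automatic"}'

That mirrors every queue whose name begins with "ha." across all nodes and
synchronises new mirrors automatically.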

There is also significant TCP/IP and RabbitMQ tuning that can be done to make
failover much faster.

I have used RabbitMQ for years, and though it's not perfect, if you do try
something else you will quickly understand that it's still far ahead of the
competition.

Best of luck!

~~~
atombender
HA queues have this caveat in the documentation:

    This solution requires a RabbitMQ cluster, which means
    that it will not cope seamlessly with network partitions
    within the cluster and, for that reason, is not
    recommended for use across a WAN

And then:

    However, there is currently no way for a slave to know
    whether or not its queue contents have diverged from the
    master to which it is rejoining (this could happen
    during a network partition, for example). As such, when
    a slave rejoins a mirrored queue, it throws away any
    durable local contents it already has and starts empty.

So I don't think that's helpful to us at all.

We are going to migrate to a client that supports consumer cancel
notifications, though. Thanks for the tip.
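
For anyone else doing the same, the shape of it is roughly this: a sketch
using pika's asynchronous SelectConnection API (which, as I understand it,
exposes add_on_cancel_callback); the queue name is made up and the
re-subscribe logic is simplified:

    import pika

    QUEUE = "jobs"  # hypothetical queue name

    def on_message(channel, method, properties, body):
        channel.basic_ack(delivery_tag=method.delivery_tag)

    def subscribe(channel):
        # (Re-)declare then (re-)consume; declaring an existing queue
        # with identical arguments is a no-op.
        channel.queue_declare(
            queue=QUEUE, durable=True,
            callback=lambda frame: channel.basic_consume(
                queue=QUEUE, on_message_callback=on_message))

    def on_channel_open(channel):
        # The broker sends basic.cancel when it kills our consumer
        # (e.g. a mirrored queue failed over). Without this callback
        # the client just sits there receiving nothing.
        channel.add_on_cancel_callback(lambda frame: subscribe(channel))
        subscribe(channel)

    def on_connection_open(connection):
        connection.channel(on_open_callback=on_channel_open)

    conn = pika.SelectConnection(pika.ConnectionParameters("localhost"),
                                 on_open_callback=on_connection_open)
    conn.ioloop.start()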

~~~
jpgvm
True. What I do recommend, though, is using RMQ clusters in each cloud (where
networking should be a bit more reliable, the exception being AWS, which
always sucks in this regard), then using federation (probably via shovel, but
there are other means) to the other clouds.
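
With the rabbitmq_shovel plugin enabled, a dynamic shovel between clouds
looks roughly like this (URIs and queue names made up):

    rabbitmqctl set_parameter shovel my-shovel \
      '{"src-uri": "amqp://", "src-queue": "jobs",
        "dest-uri": "amqp://rabbit.other-cloud.example.com",
        "dest-queue": "jobs"}'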

Ultimately, though, when you get to this stage I question whether your app is
big enough to warrant this, and suggest you use Azure Service Bus, Simple
Queue Service, or whatever else your providers make available.

If your business really needs such control over messaging... I understand. I
have been in the situation where those easy ways out aren't available, and I
know your pain. Unfortunately there is no vendor you can go to to make it go
away; Tibco, Sterling, etc. aren't much better than RMQ.

I wish you the best of luck with your multi-cloud federated messaging system.
I do highly suggest you look at Azure Service Bus; I have nothing but praise
for it, despite being a devout RMQ zealot.

~~~
lobster_johnson
You misunderstood me, I think. Our clouds aren't connected. The partitioning
problem exists _within_ each data center (e.g., Digital Ocean).

So federation/shovel is probably not the solution.

SQS is far too simple for our needs. No idea what Azure is, but if it's SaaS
the latency will likely be too high. We need local performance.

------
rdtsc
If you didn't read the article, the first part is related to this blog post:

[http://www.rabbitmq.com/blog/2014/02/19/distributed-semaphores-with-rabbitmq/](http://www.rabbitmq.com/blog/2014/02/19/distributed-semaphores-with-rabbitmq/)

That post talks about how to build a distributed semaphore using RabbitMQ in
the "clustered" configuration.

Aphyr's post is good, but RabbitMQ makes it pretty clear that this setup is
not resilient to network partitions. They even mention it at the end of the
blog post. That is also clear in the documentation.

[https://www.rabbitmq.com/distributed.html](https://www.rabbitmq.com/distributed.html)

The big general question is: how likely are you to encounter network
partitions?

Aphyr believes they are very dangerous and likely to happen to you at some
point and should be taken more seriously by the programming world.

I tend to take more of an "it depends on your environment" view. Recently,
with VM and cloud deployments, network partitions have become quite likely,
so they should be at the top of your "things to worry about" list. But I
haven't seen one happen on a local LAN. During testing I have induced them by
hand (by pulling the network cable out of a switch), but I just haven't seen
one happen otherwise. I probably got lucky. I have seen other things come up,
though: memory corruption, memory leaks in software, and so on. Many other
things worry me more than a network partition on a LAN.

All that said, it is great to see these experiments run. Please read and
study all the other posts in the "Call me maybe" series. They are just very
good. It turns out most products with a "distributed" component will fall
over in the face of a partition. And keep in mind that it is better to have
Aphyr discover these issues than your customers ;-)

~~~
jamesaguilar
If I'm reading the dates right, they only admitted its lack of resilience to
partitions _after_ the Aphyr blog post came out. That's a completely
different thing from proactively raising the issue.

It's also not clear what usefulness, if any, a lock service has beyond being
a demonstration if it can't survive partitions.

Network partitions are rare on small scale networks over short durations. But
as the saying goes, if you run a billion trials, the one-in-a-million event
happens a thousand times. The larger a network grows, the more likely you'll
eventually experience the joy of a partition.

~~~
rdtsc
> they only admitted its lack of resilience to partitions after the Aphyr
> blog post came out. That's a completely different thing from proactively
> raising the issue.

They admitted it in that blog post, yes. But this doc page was there long
before:

[https://www.rabbitmq.com/distributed.html](https://www.rabbitmq.com/distributed.html)

Network partition tolerance is a general configuration trade-off, not related
just to that particular trick with distributed semaphores. Presumably someone
who set up a clustered configuration already decided on the likelihood of
experiencing a network partition and read the docs.

Now, there is another issue explored here, and that is tolerance to network
failures in general between clients and even a single server. That is (the
way I understand it) not related to the clustered or un-clustered
configuration; it relates to the stability of network connections between
clients and server(s).

~~~
jamesaguilar
I don't know if you read the blog post, but it demonstrates that the
partition-tolerant mode from the page you just linked is not actually
partition tolerant, and that RabbitMQ, even in that mode, can't be used as a
lock service.

I didn't see the part of the post about connections between clients and
servers, but I was reading on mobile, so maybe I just missed it.

~~~
rdtsc
> but it demonstrates that the partition-tolerant mode from the page you
> just linked is not actually partition tolerant

I was talking about the part about picking a response to a partition
detection event: in this case "ignore" (see the "Recommendations" section),
instead of "autoheal" or "pause_minority".

------
keypusher
I thought that in order to create a distributed locking system, you need to be
able to reliably fence unreachable nodes. "Network Partitions" sound a lot
like "Split Brain" to me. I am more familiar with traditional clustering
solutions such as Pacemaker/Corosync and the GFS2 DLM than RabbitMQ, so
perhaps I am missing something here?

~~~
teraflop
That's more of a software design consequence than a fundamental limitation.

It's common to build fault-tolerant systems by designing a master-slave
architecture, and then bolting a failover mechanism on top so that exactly one
node is the master at any time. That approach suffers from exactly the problem
you describe: if you can't contact a node, there's no foolproof way to ensure
it isn't acting as a master, so you risk split-brain unless you can remotely
fence it.
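
To make the fencing idea concrete: one approach is to have the shared
resource itself reject requests that carry a stale token from a deposed
master. A toy sketch (every name here is hypothetical, not any real system's
API):

    class FencedStore:
        """Toy shared resource that refuses writes from deposed masters."""

        def __init__(self):
            self.highest_token = 0
            self.data = {}

        def write(self, token, key, value):
            # The failover mechanism hands each new master a strictly
            # larger token. A node cut off by a partition still holds
            # its old token, so its writes are refused here instead of
            # silently causing split-brain.
            if token < self.highest_token:
                raise RuntimeError("stale token: this node was fenced")
            self.highest_token = token
            self.data[key] = value

    store = FencedStore()
    store.write(token=1, key="x", value="old master")
    store.write(token=2, key="x", value="new master after failover")
    store.write(token=1, key="x", value="old master, post-partition")  # raises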

Consensus algorithms like Paxos/Raft don't suffer from this problem; in order
to make steady progress, there _should_ be only one master, but they still
operate consistently if that assumption is violated in the short term. So no
fencing is needed.

A lot of people seem to have a fundamental misunderstanding that using Paxos
for leader election is enough to make a system consistent, and it's
emphatically not. For instance, I hope this isn't still true in current
versions, but for a long time HBase was vulnerable to losing committed data
because it misused Zookeeper and allowed multiple regionservers to
simultaneously "own" a given table region. (I found about that issue when it
happened to me in production, so now my default assumption is that all
"distributed locking" systems are broken until proven otherwise.)

------
Tomdarkness
In relation to the second part of the article, regarding lost messages, it
would be interesting to see the same test done with HA queues:

[http://www.rabbitmq.com/ha.html](http://www.rabbitmq.com/ha.html)

I might be wrong, but I suspect it might solve the problem: while the
partitioned nodes will still discard their data, it should be present on the
master node.

~~~
jpgvm
Those tests were done with mirrored queues:

"We’ll be using durable, triple-mirrored writes across our five-node cluster"

------
KayEss
The seriousness of this depends on the consequences of the failure and on how
much of the time the lock is held. If the failure causes double-charging a
customer, that's likely far more serious than if it double-emails a customer.
And if the lock spends most of its time not held, then a doubling of its
messages can probably be spotted and fixed far more easily than if the lock
spends most of its time held.

------
EGreg
How does RabbitMQ compare with 0MQ?

~~~
lobster_johnson
They are conceptually very different.

Rabbit is a broker based on the AMQP model of exchanges, queues and bindings.
Producers and consumers talk to Rabbit, which processes, stores and
dispatches messages across the cluster. It supports both direct (one message
goes to a single consumer) and fan-out (every message goes to all consumers)
delivery, and can pick targets based on wildcard routes. Queues can be
"durable", meaning Rabbit will store messages on disk atomically in case of a
crash. You can restart Rabbit, as well as producers and consumers, and the
queues will still be intact.
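
A minimal sketch of that model with the Python pika client (exchange and
queue names invented):

    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    ch = conn.channel()

    # Producers publish to an exchange; bindings route to queues.
    ch.exchange_declare(exchange="events", exchange_type="topic", durable=True)
    ch.queue_declare(queue="audit", durable=True)  # survives broker restart
    # '#' matches zero or more words, so "audit" sees every user.* event.
    ch.queue_bind(queue="audit", exchange="events", routing_key="user.#")

    ch.basic_publish(
        exchange="events",
        routing_key="user.signup",
        body=b'{"user": 42}',
        properties=pika.BasicProperties(delivery_mode=2))  # persist to disk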

ZeroMQ is (despite the name) not a message queue in the classical sense; it's
more like multicast TCP or UDP, with some message queue functionality. To
start, since Zero is embedded in each program, a "queue" does not persist if a
producer and consumer are not both running; any buffering occurs in the
consumer, so if a producer emits messages and no consumers are running, the
messages are lost. Similarly, there is no durability; all messages must be
handled right away.
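
A tiny pyzmq sketch of the contrast (port and prefix made up): there is no
broker anywhere, and a message sent while no subscriber is connected is
simply dropped.

    import zmq

    ctx = zmq.Context()

    # "Producer": a plain socket, no broker involved.
    pub = ctx.socket(zmq.PUB)
    pub.bind("tcp://*:5556")
    # Dropped on the floor unless a subscriber is already connected:
    pub.send(b"user.signup 42")

    # "Consumer": connects directly to the producer, filters by prefix.
    sub = ctx.socket(zmq.SUB)
    sub.connect("tcp://localhost:5556")
    sub.setsockopt(zmq.SUBSCRIBE, b"user.")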

ZeroMQ is a low-level toolset for implementing robust messaging -- it could be
used to implement a high-level broker like Rabbit, but in itself it doesn't
compete directly.

------
nateburke
Cool. I've been bitten by RabbitMQ before. When are you going to give
Cassandra the Jepsen treatment?

~~~
jpgvm
The unfortunate reality for RabbitMQ is that about 99.99% of the problems
people have with it (assuming they are using it as a queue) are that they
either a) are using a bad/incomplete client, b) aren't using the HA features
correctly, or c) are doing all of that but then aren't checking for failure
states (CCN etc.) properly, so their application isn't responsive enough in
failure conditions.

~~~
slantedview
I don't think any of that applies here.

------
mjamil
A bit meta: that was a really well-explained description of a non-trivial
problem. I wish there were (or I knew of) a curated collection of bloggers
who wrote this well about software.

~~~
freshhawk
Aphyr's Clojure tutorial series is one of the best programming language
tutorials I've ever read. I'm definitely a fan.

[http://aphyr.com/tags/Clojure-from-the-ground-up](http://aphyr.com/tags/Clojure-from-the-ground-up)

