Hacker News new | past | comments | ask | show | jobs | submit login

RabbitMQ is a good piece of software, but it has issues. You mention unreliable connections; RMQ does not handle that well.

Even just a few seconds of network hiccup will throw it into a partitioned state, out of which you cannot reliably escape without manual intervention. There are also quite a few bugs that you tend to hit only when RMQ is exposed to network issues.

To use RMQ with bad networks, you need to use either the Shovel plugin or the Federation plugin. Both are somewhat awkward to use, and require manually setting up the different routes that copy messages across.

My advice is, if you like Erlang's architecture and need failure-tolerant IPC, just use Erlang directly. Don't use RabbitMQ for IPC if you need reliability.

I wish I could say better things about RabbitMQ, but it is easily the flakiest component of our stack. When you're used to rock solid server software like Postgres, HAProxy and Nginx, RabbitMQ is a big disappointment.




Man, I hated to read that. Thankfully I haven't done anything more than just an hour or so of playing with it, but nothing I've read indicated there were issues with connectivity :(

Could you point me to or elaborate on some specifics on the partitioned states and manual interventions required? I'm only really calling the shots on the server side and endpoints will be written in various languages, part of my reason for looking at RabbitMQ since it has so many libs.


You may want to read Aphyr's analysis [1], which is rigorous and well written.

If you read the RabbitMQ documentation, they readily admit to the fact that the clustering support is not designed for unreliable network connections. It's mentioned only once, and it's easy to overlook.

Here's a very quick, rough overview (it's complicated):

RabbitMQ works by assigning queues to specific nodes. A queue is a physical thing belonging to only one node at a time; consumers, publishers and so on are distributed, but queues aren't. If you lose that node, you lose the queue. If you get a network partition, the nodes in the other partition(s) will not see the queue anymore. Connected publishers and consumers will start barfing when their bindings no longer seem to exist.

RabbitMQ can mitigate these loss scenarios by mirroring queues. They call this "high availability", or HA [2]. Each queue will have a master node and a number of slave nodes that maintain replicas. In the event of a cluster issue, a slave will be elected master.

Ironically, HA, despite its name and apparent purpose, still requires a stable network. It does not deal well with network partitions. It's not really high availability. It is, I suspect, designed for situations when you wanted to take down a node manually, in a controlled manned, without disturbing a live cluster.

The reason, and this is the important part, is that when a network issue is resolved and the partitions can talk to each other again, RabbitMQ has no way of reconciling queues. If the partition divided nodes into sets A and B, then A will likely have promoted a mirror slave to a master, while B will have kept its master, or vice versa. When the partition is resolved, you now have two masters, whose queues have very likely gotten out of sync in the mean time.

When this happens, RabbitMQ normally needs to be manually restored; it will simply cease to work properly. Note that I'm talking about the case when the network issue has been resolved. The network is back, RabbitMQ isn't. You need to decide which nodes should survive and which should be wiped, and this recovery process is manual. Of course, if this happens in the middle of the night, you have a potentional emergency.

Fortunately, RabbitMQ has a band aid on top of HA called "automatic partition handling" [3]. It offers three different modes of behaviour, of which autoheal is probably the most important one. In autoheal mode, RabbitMQ will automatically pick a winner to solve conflicts, and promote it to a master. This will result in data loss; the losing nodes will be wiped. Autoheal is actually pretty crazy, because it is lossy by design; the only way to find out if you lost any data today is by grepping the logs for autoheal messages.

Worse, due to the "poor man's database" nature of RabbitMQ, doing any kind of manual recovery -- like dumping the conflicts, reconciling them, and dumping them back in -- is probably not realistic when it happens.

Note that even on a flawless LAN, RabbitMQ can get network partitions if your server has high enough CPU or I/O load. Another thing that can induce network partitions is soft kernel lockups, which is common enough in virtualized environments; for example, VMware's vMotion can move a node from one physical machine to another, a process which can take several seconds, during which time RabbitMQ gets a fit. In such cases you'll want to increase the "net_tick_timeout" [4] setting.

[1] http://aphyr.com/posts/315-call-me-maybe-rabbitmq

[2] https://www.rabbitmq.com/ha.html

[3] https://www.rabbitmq.com/partitions.html#automatic-handling

[4] https://www.rabbitmq.com/nettick.html




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: