You may want to read Aphyr's analysis [1], which is rigorous and well written.

If you read the RabbitMQ documentation, they readily admit that clustering is not designed for unreliable network connections. It's mentioned only once, and it's easy to overlook.

Here's a very quick, rough overview (it's complicated):

RabbitMQ works by assigning queues to specific nodes. A queue is a physical thing belonging to only one node at a time; consumers, publishers and so on are distributed, but queues aren't. If you lose that node, you lose the queue. If you get a network partition, the nodes in the other partition(s) will not see the queue anymore. Connected publishers and consumers will start barfing when their bindings no longer seem to exist.
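To make the model concrete, here's a toy sketch (plain Python, not RabbitMQ code; the names are made up) of the "a queue lives on exactly one node" idea and what a partition does to visibility:

```python
# Toy model of RabbitMQ's queue placement: each queue is pinned to a
# single owning node, so a partition that separates a client from that
# node makes the queue effectively disappear for that client.

cluster = {"node_a", "node_b", "node_c"}
queue_home = {"orders": "node_a"}  # hypothetical queue pinned to node_a

def queue_visible(queue, reachable_nodes):
    """A client only sees a queue if the owning node is reachable from it."""
    return queue_home[queue] in reachable_nodes

# Before a partition: every client sees "orders".
assert queue_visible("orders", cluster)

# Network splits into {node_a} and {node_b, node_c}: clients connected
# to the b/c side lose the queue, even though node_a itself is healthy.
assert not queue_visible("orders", {"node_b", "node_c"})
```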

RabbitMQ can mitigate these loss scenarios by mirroring queues. They call this "high availability", or HA [2]. Each queue will have a master node and a number of slave nodes that maintain replicas. In the event of a cluster issue, a slave will be elected master.
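Mirroring is enabled through policies rather than per-queue flags; something like this (the policy name and queue-name pattern here are arbitrary examples) mirrors matching queues across all nodes:

```shell
# Mirror every queue whose name starts with "ha." across all cluster nodes.
# "ha-all" is just a policy name; pattern and ha-mode are up to you.
rabbitmqctl set_policy ha-all "^ha\." '{"ha-mode":"all"}'
```

There are also "exactly" and "nodes" modes if you want a fixed replica count instead of mirroring everywhere; see the HA docs [2] for the details.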

Ironically, HA, despite its name and apparent purpose, still requires a stable network. It does not deal well with network partitions. It's not really high availability. It is, I suspect, designed for situations where you want to take down a node manually, in a controlled manner, without disturbing a live cluster.

The reason, and this is the important part, is that when a network issue is resolved and the partitions can talk to each other again, RabbitMQ has no way of reconciling queues. If the partition divided nodes into sets A and B, then A will likely have promoted a mirror slave to a master, while B will have kept its master, or vice versa. When the partition is resolved, you now have two masters, whose queues have very likely gotten out of sync in the meantime.

When this happens, RabbitMQ normally needs to be manually restored; it will simply cease to work properly. Note that I'm talking about the case when the network issue has been resolved. The network is back, RabbitMQ isn't. You need to decide which nodes should survive and which should be wiped, and this recovery process is manual. Of course, if this happens in the middle of the night, you have a potential emergency.

Fortunately, RabbitMQ has a band-aid on top of HA called "automatic partition handling" [3]. It offers three different modes of behaviour, of which autoheal is probably the most important one. In autoheal mode, RabbitMQ will automatically pick a winner to resolve conflicts, and promote it to master. This will result in data loss; the losing nodes will be wiped. Autoheal is actually pretty crazy, because it is lossy by design; the only way to find out if you lost any data today is by grepping the logs for autoheal messages.
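For reference, the mode is a single setting in rabbitmq.conf (older releases use the Erlang-style config file instead):

```ini
# rabbitmq.conf
cluster_partition_handling = autoheal
# alternatives: ignore (the default), pause_minority
```

pause_minority is arguably the saner choice if your cluster has an odd number of nodes, since the minority side stops serving rather than diverging, but you trade availability for it.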

Worse, due to the "poor man's database" nature of RabbitMQ, doing any kind of manual recovery -- like dumping the conflicts, reconciling them, and dumping them back in -- is probably not realistic when it happens.

Note that even on a flawless LAN, RabbitMQ can get network partitions if your server has high enough CPU or I/O load. Another thing that can induce network partitions is soft kernel lockups, which are common enough in virtualized environments; for example, VMware's vMotion can move a node from one physical machine to another, a process which can take several seconds, during which time RabbitMQ throws a fit. In such cases you'll want to increase the "net_ticktime" [4] setting.
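Raising it is a one-liner (120 here is just an example value; the default is 60 seconds, and partitions are typically detected within roughly 4x this interval):

```ini
# rabbitmq.conf
net_ticktime = 120
```

The tradeoff is that a genuinely dead node also takes longer to be noticed.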

[1] http://aphyr.com/posts/315-call-me-maybe-rabbitmq

[2] https://www.rabbitmq.com/ha.html

[3] https://www.rabbitmq.com/partitions.html#automatic-handling

[4] https://www.rabbitmq.com/nettick.html