It's extremely dangerous to run in production because it will lose data [1] by design.
RabbitMQ does not have a good strategy for recovering from partitions, which happen when a node is unable to talk to its peers. Partitions can occur not just from actual network hiccups but also from high CPU or I/O load, or even benign VM migrations.
The underlying cause is that RabbitMQ is not multi-master by default. A queue is owned by a specific node, and if you have a partitioned cluster, that queue (and related objects such as exchanges and bindings) will simply disappear from other nodes.
You can patch this deficiency by enabling "high availability" (HA), which is their term for mirrored queues. Each queue gets a designated master and is replicated to the other nodes automatically. If a partition happens, each side of the partition ends up with its own master for the mirrored queue.
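(For context, mirroring is enabled via a policy rather than per queue; a minimal example looks roughly like this, where the policy name "ha-all" and the catch-all pattern are just illustrative:)

    # Mirror every queue (pattern "^" matches all names) across all nodes in the cluster.
    rabbitmqctl set_policy ha-all "^" '{"ha-mode":"all"}'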
Unfortunately, this opens the cluster up to conflicts. Let's say you have two nodes, A and B, with one queue, Q. You experience a brief hiccup that causes a partition. A and B will both assume the role of master for Q, because RabbitMQ has no quorum support, and both will therefore continue to accept messages from apps. The hiccup passes, and the two nodes see each other again. Meanwhile, apps have sent messages to both A and B, causing Q to diverge into Q^1 and Q^2.
However, RabbitMQ has no way to consolidate the two versions into a single queue. To fix this situation, you either need to reconstruct the queue manually (usually impossible from an application's point of view), wipe it (hardly a solution in a production environment), or simply have RabbitMQ automatically pick a winning master and discard the other master(s). The latter strategy is called "autoheal": RabbitMQ automatically picks the master which has the most messages, and the previous master(s) are wiped and become slaves. This is incidentally the only mode in which RabbitMQ can continue to run after a partition without manual intervention. Without autoheal, RabbitMQ will become unusable.
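(For reference, the partition-handling mode is a server-wide setting in the RabbitMQ config; a minimal sketch of enabling autoheal, assuming the classic Erlang-term rabbitmq.config format:)

    %% rabbitmq.config: heal partitions automatically by restarting the losing side.
    %% Other accepted values include ignore (the default) and pause_minority.
    [{rabbit, [{cluster_partition_handling, autoheal}]}].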
In practice, recovery has proved flaky for us. Nodes often stay partitioned even after they should be able to see each other. We have also encountered a lot of bugs: bindings or exchanges disappearing on some nodes but not others, queues being inexplicably lost, and nodes otherwise just misbehaving. We're on a cloud provider which is otherwise rock solid; of all the software (databases etc.) we employ in our clusters, RabbitMQ is the only one that misbehaves. I should add that the last few minor versions have improved stability considerably, though the fundamental clustering design issue remains.
I wouldn't call this a design flaw; it just shifts the responsibility for handling conflicts onto the consuming app rather than forcing you into a server-side solution. What kind of partition recovery you need is too application-specific. Most people don't mind losing the odd message. If you don't want to lose messages (CP mode?), just pin all clients for a given app object (AMQP tree) to consume from the same node, and either block until everything is healthy or fail them over in some app-specific way to ensure consistency. Other AMQP servers may handle it better for your case, but these things tend to be app-specific, since your app knows better than the server which trees are critical and which are not.
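To illustrate the "pin everything to one node" idea, here's a minimal sketch using the Python pika client. The node address, queue name, and retry policy are made-up placeholders; real failover logic would be app-specific.

    import time
    import pika
    from pika.exceptions import AMQPConnectionError

    # Hypothetical: all clients for this AMQP tree talk only to one designated node.
    PRIMARY_NODE = "rabbit-1.example.internal"  # placeholder hostname
    QUEUE = "orders"                            # placeholder queue name

    def on_message(ch, method, properties, body):
        # ... application-specific processing ...
        ch.basic_ack(delivery_tag=method.delivery_tag)

    while True:
        try:
            connection = pika.BlockingConnection(
                pika.ConnectionParameters(host=PRIMARY_NODE))
            channel = connection.channel()
            channel.queue_declare(queue=QUEUE, durable=True)
            channel.basic_consume(queue=QUEUE, on_message_callback=on_message)
            channel.start_consuming()
        except AMQPConnectionError:
            # Block and retry against the same node instead of failing over,
            # so we never consume from a diverged copy of the queue.
            time.sleep(5)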
I would definitely call it a design flaw. After all, RabbitMQ pretends to be multi-master; but if you set up clients to treat a single node (of several) as a master, the other nodes will be reduced to being dumb backup nodes.
Also, don't forget that clients don't decide which node owns a queue — it's owned by the node where it's first created. Being consistent about always talking to the correct node, and keeping track of which node owns the master replica of any given queue, puts a complicating burden on both clients and the system administrator.
You can run RabbitMQ without any partition handling, but that means the ops staff needs to wake up in the middle of the night to handle downtime. Not to mention that there's no queue merging support; at the very least, it ought to be possible to merge conflicting queues where you don't particularly care about ordering.
[1] https://aphyr.com/posts/315-jepsen-rabbitmq