RabbitMQ's high availability support is, frankly, terrible. It's a single point of failure no matter how you configure it, because it cannot merge conflicting queues that result from a split-brain situation. And partitions can happen not just during network outages, but also under high load.
NATS is also a lot faster, and its client network protocol is so simple that you can implement a client in a couple hundred lines in any language. Compare that to AMQP, which is complex, often implemented wrong, and requires a lot of setup on the client side (at the very least: declare exchanges, declare queues, then bind them). NATS does topic-based pub/sub out of the box, no schema required.
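To illustrate how simple the wire protocol is: publishing and subscribing are plain text commands. Here's a rough sketch of composing them in Python (function names are mine, not from any NATS client library):

```python
def nats_pub(subject: str, payload: bytes) -> bytes:
    # PUB <subject> <#bytes>\r\n<payload>\r\n -- that's the whole publish operation
    return b"PUB %s %d\r\n%s\r\n" % (subject.encode(), len(payload), payload)

def nats_sub(subject: str, sid: int) -> bytes:
    # SUB <subject> <sid>\r\n -- subscribe with a client-chosen subscription id
    return b"SUB %s %d\r\n" % (subject.encode(), sid)

print(nats_pub("updates.user", b"hello"))  # b'PUB updates.user 5\r\nhello\r\n'
```

Parsing the server's responses is about as involved, which is why a full client fits in a couple hundred lines.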
(Re performance: relying on ACK/NACK with RPC is a bad idea. The better solution is to move retrying to the client side and rely on timeouts and, of course, error replies.)
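Client-side retrying with timeouts can be a thin wrapper. A minimal sketch (names and backoff policy are my own illustration, not from any particular library):

```python
import time

def call_with_retries(fn, attempts=3, timeout=1.0, backoff=0.2):
    """Client-side retry: each attempt gets a deadline; timed-out calls
    are retried with exponential backoff instead of relying on
    broker-side ACK/NACK redelivery."""
    last_err = None
    for i in range(attempts):
        try:
            return fn(timeout=timeout)
        except TimeoutError as e:  # explicit error replies would be handled here too
            last_err = e
            time.sleep(backoff * (2 ** i))  # back off between attempts
    raise last_err
```

The key point is that the caller decides when to give up, rather than waiting on the broker to decide when to redeliver.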
RabbitMQ is one of the better message queue implementations for scenarios where you need the bigger features it provides: durability (on-disk persistence), transactions, cross-data center replication (shovel/federation plugins), hierarchical topologies and so on.
We currently use it to send hundreds of thousands of messages per second, large and small, around the world to different data centers and it always works smoothly.
For anyone interested, Microsoft Channel 9 has a great 6 part series on the AMQP 1.0 protocol: https://channel9.msdn.com/Blogs/Subscribe/The-AMQP-10-Protoc...
AMQP is nice, but I don't think anyone would categorize it as simple. It's binary, for one. And to read and write it, you have to deal with framing, which is not always easy to get right. For example, for a long time RabbitMQ had an issue with dead-letter exchanges (aka DLX), where each bounce would add a header to the message envelope. DLX is great for retries, but after a bunch of retries, a message could get quite large. Some clients (the Node.js client in particular) have a small limit on frame sizes and will throw on such messages rather than grow the buffer. (Fortunately, this header growth was fixed in a recent RabbitMQ version.)
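For a sense of what "dealing with framing" means: an AMQP 0-9-1 frame is a type octet, a channel, a 4-byte size, the payload, and an end octet. A sketch of a reader with a fixed frame-size limit (my own illustration of where a client like the one described above would throw):

```python
import struct

FRAME_END = 0xCE  # AMQP 0-9-1 end-of-frame octet

def read_frame(buf: bytes, max_frame_size: int = 131072):
    """Parse one AMQP 0-9-1 frame: type(1) channel(2) size(4) payload frame-end(1).
    A client with a fixed buffer rejects oversized frames here -- which is
    how a header that grows on every DLX bounce can break delivery."""
    ftype, channel, size = struct.unpack_from(">BHI", buf, 0)
    if size > max_frame_size:
        raise ValueError("frame of %d bytes exceeds max %d" % (size, max_frame_size))
    payload = buf[7:7 + size]
    if buf[7 + size] != FRAME_END:
        raise ValueError("missing frame-end octet")
    return ftype, channel, payload
```

Compare that with the one-line text commands NATS uses.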
Didn't say AMQP is simple though. It's definitely not, but that's where the features and functionality come from. It seems the issues you described are with the broker and clients, not the protocol itself, which I find to be pretty solid.
NATS doesn't have replication, sharding, or total ordering. Consistency is a challenge for clustered messaging brokers that need those features.
With NATS, queues are effectively sharded by node. If a node dies, its messages are lost. Incoming messages to the live nodes will still go to connected subscribers, and subscribers are expected to reconnect to the pool of available nodes. Once a previously dead node rejoins, it will start receiving messages.
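A toy model of that behavior (this is my own simplification, not the actual NATS implementation): messages are routed only to currently connected subscribers, and nothing is buffered for a dead node.

```python
class NatsLikeRouter:
    """Toy model: fan messages out to whoever is connected right now.
    A dead node is simply forgotten; its undelivered messages are lost."""
    def __init__(self):
        self.subs = {}  # subject -> set of live subscriber callbacks

    def subscribe(self, subject, cb):
        self.subs.setdefault(subject, set()).add(cb)

    def disconnect(self, subject, cb):
        self.subs.get(subject, set()).discard(cb)  # node death: no buffering, no handoff

    def publish(self, subject, msg):
        delivered = 0
        for cb in list(self.subs.get(subject, ())):  # only currently connected subscribers
            cb(msg)
            delivered += 1
        return delivered  # 0 means the message is gone -- there is no persistence
```

When the dead node reconnects and resubscribes, it starts receiving new messages; it never sees the ones published while it was down.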
NATS in this case replaces something like HAProxy; a simple in-memory router of requests to backends.
I don't like pushing the retry into the client side. Since microservices by necessity have lots of communication between them, that can add up to quite a bit of code across all the services. I would rather the architecture deal with it and just ensure that endpoints are idempotent, so calls can be retried without adding client complexity. This is a personal preference, and in some cases clients do need to deal with retries, but I just like it not to be the default.
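The idempotency part is the crux: if the same request can be replayed safely, it doesn't matter which layer does the retrying. A common sketch is deduplicating on an idempotency key (names and the payment example are mine, purely for illustration):

```python
processed = {}  # idempotency key -> cached result (in production: a shared store with expiry)

def handle_payment(idempotency_key, amount):
    """Idempotent endpoint: a retried call with the same key returns the
    original result instead of charging twice."""
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = {"charged": amount}  # the actual side effect runs exactly once per key
    processed[idempotency_key] = result
    return result
```

With endpoints like this, the infrastructure (or a client, or both) can retry blindly without double-applying effects.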
In the services folder there is a pretty good RabbitMQ HA setup (https://github.com/LoyaltyNZ/alchemy-framework/blob/master/s...), with:
1. cluster_partition_handling set to autoheal so that it will do its best to recover (a few lost messages is infinitely better than a broken system)
2. queue_master_locator is min-masters, so queues are mastered on the nodes where the fewest other queues are mastered. This balances the queues across the cluster, meaning that if a node goes down, there will be a minimal number of queues to recreate
3. A mirror policy to mirror every queue (this will only mirror service queues, because response queues are exclusive). This makes the system a bit slower, but much more robust.
This is enough to handle split brain (although this is difficult to test) and nodes going down and coming back (much easier to test).
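For reference, points 1 and 2 translate to something like this in the classic Erlang-terms config format (exact option names and value types may vary by RabbitMQ version):

```erlang
%% rabbitmq.config
[
  {rabbit, [
    {cluster_partition_handling, autoheal},
    {queue_master_locator, <<"min-masters">>}
  ]}
].
```

and the mirror policy (point 3) can be applied with something along the lines of `rabbitmqctl set_policy ha-all ".*" '{"ha-mode":"all"}'`.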
Cheers for the comment :)
RabbitMQ's autohealing just solves your problem in the wrong way. Yes, it will usually fix itself (if it doesn't die with a Mnesia inconsistent_database error), but it will discard messages, and you won't know which ones.
Meanwhile, NATS will forward messages to subscribers as long as there's a clear path. There are no network partition issues because the queues don't have RabbitMQ's strict, total ordering.
Note that RabbitMQ is notoriously sensitive to partitions; one small blip and it gets its knickers in a twist. This is why I recommend increasing the net_ticktime option to something like 180 so you're less exposed.
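(For reference, net_ticktime is an Erlang kernel setting, so in the classic config format that's roughly:

```erlang
[
  {kernel, [
    {net_ticktime, 180}  %% seconds; the Erlang default is 60
  ]}
].
```

Exact placement may differ depending on how your RabbitMQ is configured.)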
Having done this for a long time, my advice is that making the client more intelligent is always the better option. If you're reluctant, consider a sidecar proxy like Linkerd, which can handle the gritty details for you.
I moved one large project from RabbitMQ to NSQ over a year ago and haven't looked back. It has just been wonderful to work with and build on top of.
Anything you didn't like about NSQ or pitfalls to watch out for from your experience?
Just to clarify the comment about the benchmark: even though there are both Go and Node.js components, the actual bit being benchmarked was HTTP and NATS for the inter-service communication. All the code is available on GitHub if anybody wants to rerun the benchmarks.
One thing I do find lacking in both of these queues, though, is support for per-message TTLs for pruning time-sensitive messages. I'm not sure what the performance overhead would be for supporting something like that.
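The overhead can be small if expiry is checked lazily. A sketch of per-message TTL with pruning at dequeue time (my own illustration; `now` is injectable just to make the behavior easy to demonstrate):

```python
import time
from collections import deque

class TTLQueue:
    """Per-message TTL sketch: each entry carries its own expiry, and stale
    entries are dropped lazily at dequeue time, so the steady-state cost is
    one timestamp comparison per pop."""
    def __init__(self):
        self._q = deque()

    def put(self, msg, ttl, now=None):
        now = time.monotonic() if now is None else now
        self._q.append((now + ttl, msg))

    def get(self, now=None):
        now = time.monotonic() if now is None else now
        while self._q:
            expires, msg = self._q.popleft()
            if expires > now:
                return msg  # still fresh
        return None  # everything pending had expired
```

The catch is that expired messages sitting mid-queue still occupy memory until a consumer reaches them, which is presumably part of why brokers are cautious about the feature.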
First, its data model shards queues into partitions, each of which can be consumed by just a single consumer. Assume we have partitions 1 and 2. P1 is empty, P2 has a ton of messages. You will now have one consumer, C1, which is idle, while C2 is doing work. C1 can't take any of C2's work because it can only process its own partition. In other words: a single slow consumer can block a significant portion of the queue. Kafka is designed for fast (or at least evenly performant) consumers.
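A toy model of that partitioning (my own simplification, using integer keys instead of Kafka's key hashing) makes the skew easy to see:

```python
def assign(messages, num_partitions):
    """Toy Kafka partitioning: each keyed message lands in exactly one
    partition, and each partition is owned by exactly one consumer in a
    group. An idle consumer on an empty partition cannot steal work from
    a backlogged one."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[key % num_partitions].append(value)
    return partitions

# All messages share key 1, so partition 0's consumer sits idle
# while partition 1's consumer drains the whole backlog.
backlog = assign([(1, "a"), (1, "b"), (1, "c")], 2)
```

With a hot key or a slow consumer, rebalancing doesn't help: the partition boundary is the unit of parallelism.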
Kafka's queues are also persisted on disk, which is terrible for RPC.
Think of Kafka as a linear database that you can append to and read from sequentially. Its main use case is for data that can fan out into multiple parallel processing steps. For example, a log processing system that extracts metrics: You feed the Kafka queue into something like Apache Storm, which churns the data and emits counts into an RDBMS (for example).
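That "linear database" mental model can be sketched in a few lines (a conceptual illustration, not Kafka's API): append returns an offset, and each reader tracks its own offset, which is what makes cheap fan-out to parallel processing steps possible.

```python
class Log:
    """Append-only log sketch: one writer sequence, many independent readers."""
    def __init__(self):
        self._entries = []

    def append(self, record):
        self._entries.append(record)
        return len(self._entries) - 1  # the record's offset

    def read(self, offset):
        return self._entries[offset:]  # every reader sees the same history

log = Log()
log.append("pageview:/home")
log.append("pageview:/about")
metrics_batch = log.read(0)  # one consumer, e.g. a metrics extractor
audit_batch = log.read(0)    # another, fully independent consumer
```

Each downstream system consumes at its own pace from its own offset, which is exactly the fan-out pattern described above, and exactly the wrong shape for request/response RPC.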