
As someone who has used RabbitMQ in production for many years, I'd suggest you consider NATS [1] for RPC instead.

RabbitMQ's high availability support is, frankly, terrible [2]. It's a single point of failure no matter how you turn it, because it cannot merge conflicting queues that result from a split-brain situation. Partitions can happen not just on network outage, but also in high-load situations.

NATS is also a lot faster [3], and its client network protocol is so simple that you can implement a client in a couple hundred lines in any language. Compare to AMQP, which is complex, often implemented wrong, and requires a lot of setup (at the very least: declare exchanges, declare queues, then bind them) on the client side. NATS does topic-based pub/sub out of the box, no schema required.
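To give a feel for how small the NATS wire protocol is, here is a rough sketch of the framing for the three core verbs (SUB, PUB, MSG). The function names are made up for illustration and are not part of any NATS client library; a real client also needs CONNECT, PING/PONG and incremental parsing off a socket.

```python
# Sketch of NATS client-protocol framing: text control lines terminated by CRLF.
# Function names are illustrative only, not taken from any NATS client library.

CRLF = b"\r\n"


def encode_sub(subject: str, sid: str, queue_group: str = "") -> bytes:
    """SUB <subject> [queue group] <sid>"""
    parts = ["SUB", subject] + ([queue_group] if queue_group else []) + [sid]
    return " ".join(parts).encode() + CRLF


def encode_pub(subject: str, payload: bytes, reply_to: str = "") -> bytes:
    """PUB <subject> [reply-to] <#bytes>, followed by the payload and a CRLF."""
    parts = ["PUB", subject] + ([reply_to] if reply_to else []) + [str(len(payload))]
    return " ".join(parts).encode() + CRLF + payload + CRLF


def parse_msg(frame: bytes):
    """Parse one complete frame: MSG <subject> <sid> [reply-to] <#bytes>."""
    header, _, rest = frame.partition(CRLF)
    fields = header.decode().split(" ")
    if fields[0] != "MSG" or len(fields) not in (4, 5):
        raise ValueError("not a MSG frame")
    subject, sid = fields[1], fields[2]
    reply_to = fields[3] if len(fields) == 5 else ""
    size = int(fields[-1])
    return subject, sid, reply_to, rest[:size]
```

The optional reply-to subject on PUB and MSG is what request/reply (and therefore RPC) is built on: the requester subscribes to a unique inbox subject and passes it as reply-to.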

(Re performance, relying on ACK/NACK with RPC is a bad idea. The better solution is to move retrying into the client side and rely on timeouts, and of course error replies.)
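A minimal sketch of what client-side retrying with timeouts could look like. Everything here is an assumption for illustration: `send` is a hypothetical transport function that raises TimeoutError when no reply arrives in time, and `RpcError` stands in for an error reply from the service.

```python
import time


class RpcError(Exception):
    """Error reply from the remote service; not retried in this sketch."""


def call_with_retry(send, request, attempts=3, timeout=1.0, backoff=0.1, sleep=time.sleep):
    """Call send(request, timeout) up to `attempts` times, retrying only on TimeoutError.

    `send` is a hypothetical transport callable. Any other exception, including
    RpcError, propagates immediately to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            return send(request, timeout)
        except TimeoutError:
            if attempt == attempts:
                raise  # out of attempts: surface the timeout
            sleep(backoff * 2 ** (attempt - 1))  # exponential backoff between tries
```

Timeouts are retried because the outcome is unknown; error replies are not, because the service has already answered.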

RabbitMQ is one of the better message queue implementations for scenarios where you need the bigger features it provides: durability (on-disk persistence), transactions, cross-data center replication (shovel/federation plugins), hierarchical topologies and so on.

[1] http://nats.io

[2] https://aphyr.com/posts/315-jepsen-rabbitmq

[3] http://bravenewgeek.com/dissecting-message-queues/

While a simple HTTP interface is easy to code around, I quite like the AMQP protocol. It's fast, efficient, reliable and powerful.

We currently use it to send hundreds of thousands of messages per second, large and small, around the world to different data centers and it always works smoothly.

For anyone interested, Microsoft Channel 9 has a great 6 part series on the AMQP 1.0 protocol: https://channel9.msdn.com/Blogs/Subscribe/The-AMQP-10-Protoc...

Note that NATS is not HTTP, it's its own very simple text-based protocol.

AMQP is nice, but I don't think anyone would categorize it as simple. It's binary, for one. And to read and write it, you have to deal with framing, which is not always easy to get right. For example, for a long time RabbitMQ had an issue with dead-letter exchanges (aka DLX), where each bounce would add a header to the message envelope. DLX is great for retries, but after a bunch of retries, a message could get quite large. Some clients (the Node.js client in particular) have a small limit on frame sizes and will throw on such messages rather than grow the buffer. (Fortunately, this header was fixed in a recent RabbitMQ version.)

Yes, thanks for reminding me, NATS isn't HTTP.

Didn't say AMQP is simple though. It's definitely not, but that's where the features and functionality come from. It seems the issues you described are with the broker and clients, not the protocol itself, which I find to be pretty solid.

AMQP also specifies the data model (exchanges, queues, bindings and so forth), which dictates the implementation of it. I find that data model a bit heavy-handed, but it's not terrible. However, it's not very suitable for RPC.

It's not too difficult to do RPC over AMQP by having an ephemeral reply queue per request, if your load isn't too high.
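Roughly, the pattern looks like this. To keep it runnable without a broker, the sketch below fakes the broker with an in-memory dict of `queue.Queue` objects; the queue names and message fields are invented for illustration and this is not AMQP client code.

```python
import queue
import threading
import uuid

# In-memory stand-in for a broker: queue name -> queue.Queue (hypothetical, illustration only).
queues = {"rpc.upper": queue.Queue()}


def serve_once():
    """Service side: take one request and publish the reply to its reply_to queue."""
    msg = queues["rpc.upper"].get()
    reply = {"correlation_id": msg["correlation_id"], "body": msg["body"].upper()}
    queues[msg["reply_to"]].put(reply)


def rpc_call(body, timeout=1.0):
    """Client side: create an ephemeral reply queue, send, wait, then delete the queue."""
    reply_to = "reply." + uuid.uuid4().hex
    correlation_id = uuid.uuid4().hex
    queues[reply_to] = queue.Queue()
    try:
        queues["rpc.upper"].put(
            {"reply_to": reply_to, "correlation_id": correlation_id, "body": body}
        )
        reply = queues[reply_to].get(timeout=timeout)  # raises queue.Empty on timeout
        if reply["correlation_id"] != correlation_id:
            raise RuntimeError("reply does not match request")
        return reply["body"]
    finally:
        del queues[reply_to]  # the reply queue lives only for this one request
```

The per-request create/delete of the reply queue is the part that gets expensive on a real broker under load.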

I spoke to Kyle Kingsbury (Aphyr) about 9 months ago and at that time, the only queue or pub/sub system he thought was relatively safe from partition errors was Kafka. Not sure if his position has changed recently.

Indeed, though this is irrelevant to this particular use case.

NATS doesn't have replication, sharding or total ordering. Consistency is a challenge for clustered messaging brokers that need this.

With NATS, queues are effectively sharded by node. If a node dies, its messages are lost. Incoming messages to the live nodes will still go to connected subscribers, and subscribers are expected to reconnect to the pool of available nodes. Once a previously dead node rejoins, it will start receiving messages.

NATS in this case replaces something like HAProxy; a simple in-memory router of requests to backends.

Ahh .. OK, that makes it clearer. Thanks.

This is also my personal experience with message queues, even though I haven't had a chance to work with NATS yet. Kafka is just a really solid piece of engineering when you need 5-50 servers. With that many servers you can handle millions of messages per second, which is usually enough for a mid-size company. I am not sure about higher scale, but I believe LinkedIn has much larger clusters.

+1 to this. Used RabbitMQ extensively in production, but it's now #10 on my list. NATS is #1 as a transport message bus, in my opinion.

Initially when developing Alchemy we had a look at NSQ http://nsq.io and found it a little difficult (that was a year or so ago, so might be better now). Then we started looking at RabbitMQ and thought it fit our requirements better. I have not heard of NATS but will definitely have a look see. At the moment Alchemy is tied to RabbitMQ pretty tightly, but abstracting and supporting many queue solutions would be good.

I don't like pushing the retry into the client side. Since microservices by necessity have lots of communication between them, that can be quite a bit of code across all the services. I would rather the architecture deal with it and just ensure that endpoints are idempotent, so calls can be retried without adding client complexity. This is a personal preference, and in some cases clients do need to deal with retries, but I just like it not to be the default.
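As a sketch of what "idempotent endpoint" can mean in practice: the service remembers results by request ID, so a redelivered or retried request does not repeat the side effect. The class and the request-ID scheme are assumptions for illustration, not how Alchemy does it.

```python
class IdempotentHandler:
    """Wraps a handler so that a repeated request ID returns the stored result."""

    def __init__(self, handler):
        self._handler = handler
        self._results = {}  # request_id -> result; unbounded in this sketch

    def handle(self, request_id, payload):
        # Run the real handler only the first time this request ID is seen.
        if request_id not in self._results:
            self._results[request_id] = self._handler(payload)
        return self._results[request_id]
```

A real service would need to bound or expire the stored results and share them across instances.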

In the services folder there is this https://github.com/LoyaltyNZ/alchemy-framework/blob/master/s... pretty good RabbitMQ HA setup, with:

1. cluster_partition_handling set to autoheal, so that it will do its best to recover (a few lost messages is infinitely better than a broken system).

2. queue_master_locator set to min-masters, so queues are mastered on the node that currently masters the fewest other queues. This balances the queues across the cluster, meaning that if a node goes down there is a minimal number of queues to recreate.

3. A mirror policy to mirror every queue (this will only mirror service queues, because response queues are exclusive). This makes the system a bit slower, but much more robust.

This is enough to handle split brain (although this is difficult to test) and nodes going down and coming back (much easier to test).

Cheers for the comment :)

Destructive cluster recovery, patched with mirroring, raises the question of how suitable RabbitMQ is for the job in the first place. Mirroring for RPC requests, think about it for a second! Mirroring. For RPC!

RabbitMQ's autohealing just solves your problem in the wrong way. Yes, it will usually fix itself (if it doesn't die with a Mnesia inconsistent_database error), but it will discard messages, and you won't know which ones.

Meanwhile, NATS will forward messages to subscribers as long as there's a clear path. There are no network partition issues because the queues don't have RabbitMQ's strict, total ordering.

Note that RabbitMQ is notoriously sensitive to partitions; one small blip and it gets its knickers in a twist. This is why I recommend increasing the net_ticktime option to something like 180 so you're less exposed.

Having done this for a long time, my advice is that making the client more intelligent is always the better option. If you're reluctant, consider a sidecar proxy like Linkerd [1] which can handle the gritty details for you.

[1] https://linkerd.io/

Just curious, what did you find difficult about NSQ?

I moved one large project from RabbitMQ to NSQ over a year ago and haven't looked back. It has just been wonderful to work with and build on top of.

Anything you didn't like about NSQ or pitfalls to watch out for from your experience?

Curious about this too. I've found NSQ to be rock solid and easy to set up and work with.

Also felt this post by Diogo in the NATS community was interesting here:


Impressive numbers, though it seems he's testing Go vs. Node.js at the same time. I'd like to see performance numbers for HTTP/2, though.

HTTP/2 definitely makes it faster for browsers to load assets in a webpage, however, not sure how much it would speed up individual REST requests since most of the time is bound to the request/response round trip.

Just clarifying the comment about the benchmark: even though there are both Go and Node.js components, the actual bit that was being benchmarked was HTTP and NATS for the inter-service communication. All the code is available on GitHub if anybody wants to rerun the benchmarks.

HTTP/2's main feature is that it's multiplexed, which a client written in a suitably async-friendly language can exploit to pipeline parallel requests, or at least better reuse connections. There's presumably not too much performance gain (though header compression helps) if your client can't exploit the multiplexing.

Setting RabbitMQ's "cluster_partition_handling" to "pause_minority" in theory ameliorates split-brain issues in clusters of 3 or more (odd numbers), where the majority would ignore the minority nodes.
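The reason for odd numbers becomes obvious if you write the rule down. This is only a sketch of the decision, not RabbitMQ code: a node pauses when the nodes it can still reach (counting itself) make up half the cluster or less.

```python
def should_pause(reachable_nodes: int, cluster_size: int) -> bool:
    """True if this node's side of the partition is not a strict majority.

    `reachable_nodes` counts the node itself plus every peer it can still see.
    """
    return reachable_nodes * 2 <= cluster_size


# 3-node cluster split 2/1: the pair keeps running, the lone node pauses.
three_node_split = (should_pause(2, 3), should_pause(1, 3))

# 4-node cluster split 2/2: neither side is a majority, so both sides pause.
four_node_split = (should_pause(2, 4), should_pause(2, 4))
```

With an even cluster size, a clean half/half split pauses everything, which is why an odd node count is the usual recommendation.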

NSQ and NATS are my go-to tools for messaging, though NSQ seems more flexible to me because it supports message persistence and also provides NATS-like ephemeral channels for when persistence is not a hard requirement. And it comes with a shiny admin dashboard, which NATS lacks. NATS is useful when raw performance is a priority.

One thing I do find lacking in both of these queues is support for per-message TTL though, for pruning time sensitive messages. I'm not sure what the performance overhead would be for supporting something like that.

Would this library be better off utilizing NATS instead of RabbitMQ, barring any requirements a person might have for the persistence and other features you mentioned?

Yes, I think so.

Would Kafka be a better option in this case? There might be some properties I am not aware of that makes it impossible.

No, Kafka is completely unsuitable for RPC, for several reasons.

First, its data model shards queues into partitions, each of which can be consumed by just a single consumer. Assume we have partitions 1 and 2. P1 is empty, P2 has a ton of messages. You will now have one consumer C1 which is idle, while C2 is doing work. C1 can't take any of C2's work because it can only process its own partition. In other words: A single slow consumer can block a significant portion of the queue. Kafka is designed for fast (or at least evenly performant) consumers.
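A toy model of the P1/P2 situation, in plain Python rather than Kafka code. The partition contents and the fixed partition-to-consumer ownership are assumptions chosen to match the example above.

```python
from collections import deque

# Two partitions, each owned by exactly one consumer in the group.
partitions = {"P1": deque(), "P2": deque(f"msg-{i}" for i in range(6))}
owner = {"P1": "C1", "P2": "C2"}


def drain(partitions, owner):
    """Process every message; return how many each consumer handled."""
    handled = {consumer: 0 for consumer in owner.values()}
    for name, backlog in partitions.items():
        while backlog:
            backlog.popleft()
            handled[owner[name]] += 1  # only the partition's owner may consume it
    return handled


handled = drain(partitions, owner)
```

C1 ends up with nothing to do while C2 handles the whole backlog, which is exactly the work-stealing you want for RPC and don't get here.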

Kafka's queues are also persisted on disk, which is terrible for RPC.

Think of Kafka as a linear database that you can append to and read from sequentially. Its main use case is for data that can fan out into multiple parallel processing steps. For example, a log processing system that extracts metrics: You feed the Kafka queue into something like Apache Storm, which churns the data and emits counts into an RDBMS (for example).
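The "linear database" mental model can be sketched in a few lines. This is an illustration of the idea only (an in-memory list, with class and method names invented here), not how Kafka is implemented.

```python
class AppendLog:
    """Append-only log; each reader tracks its own offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records=100):
        """Return records starting at `offset`, plus the next offset to read from."""
        batch = self._records[offset:offset + max_records]
        return batch, offset + len(batch)
```

Because reading never removes anything, several independent consumers can each walk the same log at their own pace, which is what makes the fan-out use case work.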

Kafka stores everything to disk, which may not be what you are looking for in your RPC calls (that you would usually make as a direct service-to-service HTTP call). Moreover, Kafka "topics" are statically declared (i.e. by admin scripts instead of a public API), and it's a heavyweight operation. So it's not the best fit for having microservices register themselves and dynamically create "topics" for each method.
