An introduction to RabbitMQ (erlang-solutions.com)
586 points by olikas on May 21, 2020 | 255 comments

I'm really surprised to see so much positivity about rabbitmq here when it's probably the most sworn-at software in the space.

Let me share my anecdote. At my last workplace I got onboarded onto rabbitmq, and it was such painful software to work with, and so nearly impossible to set up locally, that I quietly sneaked in a simple redis list as a queue alternative for my dev environment. The whole of rabbitmq and its pika library was replaced by 3 lines of python and a redis server.

One day rabbitmq died and it took the sysadmins a few weeks to get it back up and running. In that time I deployed my simple redis list and never looked back. To this day the redis solution works without any friction whatsoever, with a fraction of the resources.

RabbitMQ's AMQP exchange model is severely flawed and convoluted. It's the worst example of corporate software, where everything works and doesn't work at the same time.

I wouldn't recommend rabbitmq to my worst enemy yet there's still something attractive about it. Maybe there's a sane alternative? Maybe zeromq?

If you were capable of replacing your use of rabbitmq with 3 lines of python and a redis instance, you shouldn't have been using rabbitmq to start with.

That doesn't mean redis is a drop in replacement for any of the valid uses for rabbitmq though...

Can you share some of the features where rabbitmq is useful? Thanks.

Sounds to me like the problem is that the company was taking out the cannon to shoot at sparrows, and that nobody seems to have bothered to provide working dev environments?

It feels common for people to think that their every problem requires a planetary-scale solution, or one that handles every conceivable case that could occur before the heat death of the universe. Be that because you're a startup and think you are going to need to support a hundred million concurrent users on launch day, or because you're an enterprise and think that, because you are oh so important, you need to use the same tools other important companies use.

My own anecdote: When we refactored our early-stage app we had a huge mess with the Redis-based queue system. It was one of the biggest sources of errors and a massive pain to troubleshoot, or even to monitor what was going on. So we investigated a bunch of different solutions, including all the usual contenders, and we ended up with: let's just ditch the messages / queues altogether for now and do boring old cron-like jobs invoking internal API endpoints at regular intervals. This made 9 out of 10 cases a lot easier to maintain, and for the remainder, while temporarily more difficult, we introduced a queue again at a much later stage.

I'm not saying you shouldn't use RabbitMQ, or any technology for that matter, or that you can't benefit from it at small scales. But I think too often in tech decision making one's own or the company's perceived importance, what is cool, or what would look good on a CV takes precedence over what really fits the problem in context.

Odd, I've been running it for years and it's been flawless. Integrating new clients is a breeze using their, or any amqp, library.

brew install rabbitmq?

I agree rabbitmq is often adopted in places it shouldn't be, or ill-configured. But to completely get rid of it because you didn't take the time to RTFM when setting it up seems a little extreme.

And redis/rabbitmq have completely different use-cases 80% of the time. Sounds like you were trying to get drunk on kombucha.

You are right. In our case redis failed spectacularly with a celery queue when the tasks in the queue exceeded the memory of the machine. It took weeks to diagnose, as the error messages were inconsistent, and setting up a message queue cluster with redis is a nightmare.

Went back again to tried and tested rabbitMQ and it works so well. Adding and removing nodes is also easy: we just used ansible to set up the Erlang cookie and connect the node (thanks to OTP and BEAM). The best part is that for important task queues where we cannot accept failure, we built a mechanism for fault tolerance. When you work with highly available, fault-tolerant queues, rabbitMQ is so good. It can recover from hardware or even VM failures; can’t say the same for redis, which was a nightmare even with redis cluster.

Does your solution handle the situation where the consumer crashes and the queue has to accumulate (while RAM allows) until the consumer is up again (maybe 1 day later)? AFAIK ZMQ can't guarantee this.

It must be emphasised that, despite the name, ZeroMQ is not a message queue. It is a networking library. The current blurb says:

> ZeroMQ (also known as ØMQ, 0MQ, or zmq) looks like an embeddable networking library but acts like a concurrency framework.

And the old meme was "ZeroMQ is a replacement for Berkeley sockets".

It's a pretty cool networking library. But it makes no sense to think of it in the same slot as RabbitMQ, or even Redis.

That the GP mentioned it does make me wonder if they don't really understand what they're doing.

EDIT I love that the official guide actually has this diagram in! Pieter Hintjens's death was a sad loss. http://zguide.zeromq.org/page:all#How-It-Began

As an aside, the ZeroMQ Guide is the finest piece of technical writing I have ever seen - closely followed by Pieter’s (sadly unfinished) “Scalable C” book. His non-technical writing is also excellent.

I had a fun time implementing the Paranoid Pirate pattern with my coworker a few years back. He wrote the server part in Python, I wrote the client in PHP. We essentially built it as a wrapper to run some C code our boss wrote that we didn't want to write a PHP extension for - we used Python as a broker to allow for some concurrency. Worked super well.

a Redis list (what he's using) would handle this easily (I think Redis' PubSub also handles this)

Amusingly, RabbitMQ is the one message broker that HASN'T given me grief in my career.

With docker it takes 5 seconds to set up a rabbitmq instance locally.

Your fault is using pika. I thought the same until I tried some other python libraries. Try Oslo messaging. It's used in openstack (enough said).

We used both zeromq and its sort-of-descendant nanomsg but eventually moved to NATS because of its approach to high availability (every node knows the topology of the whole cluster) and zero management overhead.

Some criticize NATS for the absence of message durability in the core technology, but we figured out we can drop this requirement in 95% of cases. Your microservices should be highly available anyway, so there's always a live consumer - and it's better to hand the message to it directly rather than introducing the overhead of storing a message somewhere and dealing with more-than-once delivery.

There's a bit more to it, like letting your microservices exit gracefully and finish processing consumed messages. And of course in the 5% of cases where you can't afford to lose a single message, you have to use NATS Streaming or the like, but so far we've been greatly impressed by NATS under high load.

ZeroMQ compiles to well over 600 kilobytes of machine code (just on 32 bit x86).

  $ size /usr/lib/i386-linux-gnu/libzmq.so
     text    data     bss     dec     hex filename
   662134   12952      24  675110   a4d26 /usr/lib/i386-linux-gnu/libzmq.so

Yikes, is there an operating system kernel with virtual memory, USB support, a few filesystems and a TCP/IP stack hiding in there?

It's like 35% of glibc:

  $ size /lib/i386-linux-gnu/libc.so.6
     text    data     bss     dec     hex filename
  1917549   11624   11112 1940285  1d9b3d /lib/i386-linux-gnu/libc.so.6

Which is itself heavily bloated, trying to provide as much POSIX as it can.

I'm equally surprised by your anecdote.

I found it quite simple to install RabbitMQ server and its admin panel in my WSL local dev environment.

And the cloud/prod instance took a few clicks (just spun up a DO Marketplace server image) followed by < five minutes of RabbitMQ user and firewall configuration.

It was also dead simple to start using RabbitMQ within my application. I found a well maintained package, installed it, edited a couple lines of my application's config, and everything just worked.

I specifically avoided Redis based on my understanding that it can't guarantee message persistence, so if it crashes, your unprocessed messages are lost.

I'd be interested to know what the 3 lines of python were and also more details about the redis server you deployed to replace RabbitMQ during the outage.

I'm being a bit hyperbolic, but using a redis list as a task queue is as simple as:

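    import time

    while True:
        msg = r.lpop(key)  # r is a redis client; lpop returns None when the list is empty
        if msg is None:
            time.sleep(1)  # nothing queued, wait before polling again
            continue
        try:
            result = do_something(msg)
        except Exception as e:
            log.error(f'failure for msg "{msg}" got {e} back to queue {key}')
            r.rpush(key, msg)  # failed: put it back at the end of the list

Pop a member from a list, if failed plop it back to the end. It's simple, explicit and just works™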

It just works, until PUSHing the member back fails. At least use RPOPLPUSH and a separate worker queue to make sure you don't accidentally drop packets.

Don't get me wrong, I'm using Redis queues myself and trying to get rid of the last remnants of RabbitMQ in our code, but there are use cases where RabbitMQ means less thinking about stuff, because it just works(tm)...
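
For anyone wondering what the RPOPLPUSH variant could look like, here is a minimal sketch rather than anyone's production code. `InMemoryLists` is a hypothetical stand-in I made up so the example runs without a server, and `process_one` plus the list names are illustrative too. The idea is that a message is never "in flight" only in the worker's memory: it sits in a processing list until the handler has finished.

```python
class InMemoryLists:
    """Hypothetical stand-in for a Redis client, covering only the calls used below."""

    def __init__(self):
        self.lists = {}

    def lpush(self, name, value):
        # Like LPUSH: insert at the head of the list.
        self.lists.setdefault(name, []).insert(0, value)

    def rpoplpush(self, src, dst):
        # Like RPOPLPUSH: pop the tail of src and push it onto the head of dst.
        if not self.lists.get(src):
            return None
        value = self.lists[src].pop()
        self.lists.setdefault(dst, []).insert(0, value)
        return value

    def lrem(self, name, count, value):
        # Simplified LREM: remove up to `count` occurrences (count > 0 assumed).
        items = self.lists.get(name, [])
        removed = 0
        while removed < count and value in items:
            items.remove(value)
            removed += 1
        return removed


def process_one(client, pending, processing, handler):
    """Park one message in `processing`, handle it, remove it only on success."""
    msg = client.rpoplpush(pending, processing)
    if msg is None:
        return False  # nothing queued
    handler(msg)  # if this raises, msg stays parked in `processing`
    client.lrem(processing, 1, msg)
    return True
```

As far as I know a redis-py client exposes `lpush`, `rpoplpush` and `lrem` with the same argument order, so it should be usable in place of the stand-in. What the sketch leaves out is the sweep that decides when a message stuck in the processing list is stale and should be re-queued.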

The other classic variant here is when msg is poisonous. Then pushing it back onto the queue means it'll go again at some point, poisoning your system again.

In practice, you need some dead-lettering with a message timeout. Something RabbitMQ does provide out of the box, for all its complexity.
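
To make the poison-message point concrete, here is a rough sketch of dead-lettering by attempt count. It uses plain in-memory deques instead of a broker, and the envelope format, `MAX_ATTEMPTS` and the function name are all my own assumptions. It only covers the retry budget, not the message timeout part.

```python
import json
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry budget, tune for your workload


def handle_with_dead_letter(queue, dead_letters, handler):
    """Pop one JSON envelope; retry on failure, dead-letter once the budget is spent."""
    if not queue:
        return
    envelope = json.loads(queue.popleft())
    try:
        handler(envelope["body"])
    except Exception as exc:
        # Count the failure and keep the last error for whoever inspects the message.
        envelope["attempts"] = envelope.get("attempts", 0) + 1
        envelope["last_error"] = repr(exc)
        # Back to the end of the queue, or aside for good once the budget is used up.
        target = dead_letters if envelope["attempts"] >= MAX_ATTEMPTS else queue
        target.append(json.dumps(envelope))
```

A message that always fails goes round the queue three times and then lands in `dead_letters`, where it stops poisoning the workers and can be looked at by a human.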

couple of problems with this: (1) RPUSH can fail (2) LPOP's result might never reach your client after redis has executed it.

rabbitmq used to be one of the hardest things to self-host (even at relatively small scale). But it has been pretty stable since moving to Cloudamqp as a managed solution.

I had the same experience. I had particular trouble configuring encryption for connections. It didn't help that they'd switched configuration file-formats, but the documentation and tutorials still used the old format.

Never had any stability issues though. Once it was up and running, it was solid.

RabbitMQ has a huge learning curve if you're trying to build a worker queue.

First, you'll learn about ack/noack and have the worker ack on success.

Then, you'll learn about dead letter queues etc. for delayed retries.

Now, you'll have a topic exchange and some slightly hairy routing in place using wildcards.

And you mistakenly set the dead letter routing key so that expired messages end up in multiple queues (retry queues and the actual worker queue ...).

Then you rewrite your service in python and use Celery or something.

It's nearly impossible to get RabbitMQ working correctly within few months.
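
For reference, the ack plus dead-letter retry steps above can be sketched roughly like this. It is one common delayed-retry layout, not the only one, and the queue names, the 30 second delay and the function names are my assumptions. The functions take a channel object and use the pika-style `queue_declare`, `basic_ack` and `basic_nack` calls; the `x-dead-letter-exchange`, `x-dead-letter-routing-key` and `x-message-ttl` arguments are the standard RabbitMQ queue arguments.

```python
def declare_retry_topology(channel, work_queue="jobs", retry_delay_ms=30_000):
    """Declare a work queue plus a TTL'd retry queue that feeds back into it."""
    retry_queue = f"{work_queue}.retry"
    # Rejected messages (nack with requeue=False) are dead-lettered to the retry
    # queue through the default exchange (""), which routes by queue name.
    channel.queue_declare(
        queue=work_queue,
        durable=True,
        arguments={
            "x-dead-letter-exchange": "",
            "x-dead-letter-routing-key": retry_queue,
        },
    )
    # Nobody consumes the retry queue: messages wait there until the TTL expires
    # and are then dead-lettered back to the work queue.
    channel.queue_declare(
        queue=retry_queue,
        durable=True,
        arguments={
            "x-message-ttl": retry_delay_ms,
            "x-dead-letter-exchange": "",
            "x-dead-letter-routing-key": work_queue,
        },
    )
    return retry_queue


def make_callback(handler):
    """Wrap a handler so the message is acked on success and rejected on failure."""
    def on_message(channel, method, properties, body):
        try:
            handler(body)
        except Exception:
            channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)
        else:
            channel.basic_ack(delivery_tag=method.delivery_tag)
    return on_message
```

Using exact queue names on the default exchange sidesteps the wildcard routing-key mistake described above. Note that as written a message that always fails will loop forever; RabbitMQ records dead-lettering history in the `x-death` header, which a consumer could inspect to cap retries, but that part is not shown here.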

And I forgot about HA. Paying for hosted RabbitMQ might be better. But CloudAMQP in particular could be tricky as well. It can run out of AWS IOPS and your production gets hosed.

Also setting up monitoring on queue health, shoveling error queues ... etc take time to learn and apply. Be careful about routing keys when you shovel error queue to a topic.

Celery can be backed by RabbitMQ, not sure if that's what you meant, but all of what you described can be abstracted away. I didn't have the same experiences with months taken to get up to speed. Moreover, at work RabbitMQ is probably our most stable underlying tool, perhaps toe to toe with Redis. And that's saying a lot, since I consider Redis to almost be a piece of art in how great of a tool it is.

Back to RabbitMQ though: we have run an HA 2-node deployment (just one active writer) for over 3 years, requiring minimal changes and hardly any maintenance. It has scaled to a hundred-plus queues, ranging from some with super high numbers of messages per second to some with only tens of messages per day. Some queues stay low and process fast; others are heavy jobs that get enqueued all at once and generate hundreds of thousands of jobs.

Sure, if you have a service that interacts with disks you should have an automated monitor that covers your IOPS consumption, but I don't see how that's specific to RabbitMQ; you should be doing this for all your instances.

All in all, these are two identical instances, one active, one failover, and in a world of Kafkas and Pulsars and understanding the ins and outs of SQS pricing and capacity allocation, RabbitMQ is a tool that I consider simple to administer and that allows me to sleep at night.

Interesting how the same tool can evoke such different reactions, but whatever works - works.

You would think, until you get to a split brain issue. The master and failover lose connectivity, and they each then think they're the master.

There's ways to repair it (and it has happened to me one total time in 4 years), but it does happen. I personally try to make my message processing idempotent for the worker to help alleviate these situations.
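
A minimal sketch of what idempotent processing can mean in practice, assuming every message carries a unique ID. The class name and the in-memory set are my own illustration; a real worker would need durable storage for the seen IDs.

```python
class IdempotentHandler:
    """Skip messages whose ID has already been processed successfully."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()  # assumption: in production this would be durable storage

    def __call__(self, message_id, body):
        if message_id in self.seen:
            return False  # duplicate delivery, ignore it
        self.handler(body)
        self.seen.add(message_id)  # record only after the handler succeeded
        return True
```

With this in front of the worker, a message that gets delivered twice after a split brain is repaired is processed once and ignored the second time.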

haven't encountered it personally, so honest question here: how does a split brain situation become an issue in a message queue?

there are some possible situation from my naive viewpoint:

1. the 'active' queue keeps jumping between, consumers & producers keep reconnecting

=> everything is still consumed, but takes longer as producers write into alternating queues, which are consumed ... albeit slowly whenever the switch happens

2. they're database backed, so they'll try to write into the same table

=> usually software that does this (but cant handle several writers) also creates a `lock` which has to be manually reset before the failover can come up. if its reset, the other node would fail. only one is up, so no issue?

3. producers/consumers dont notice that the 'active' mq changed, and keep running on initial

=> issue manifests as soon as any system is restarted. but only slowly so you got time to handle it with minor service degradation

none of them really sound that bad to me -- but as i said before, i haven't encountered it before, so i might just be overlooking something really obvious?

There is a reason why you're supposed to run an odd number of nodes so that you will hopefully have a majority in case of a failure.
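
The odd-number advice comes down to majority arithmetic. A small sketch (the function names are mine, and this says nothing about how RabbitMQ itself decides):

```python
def has_quorum(reachable_nodes, cluster_size):
    """True if this side of a partition holds a strict majority of the cluster."""
    return reachable_nodes > cluster_size // 2


def tolerated_failures(cluster_size):
    """How many nodes can be lost while a majority still exists."""
    return (cluster_size - 1) // 2
```

With two nodes, a clean split leaves each side with 1 of 2, so neither side has a majority. With three nodes, one side keeps 2 of 3. Going from 3 to 4 nodes does not buy any extra tolerated failure, which is why even sizes are usually not worth it.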

Once every four years sounds like a no-brainer, to be honest.

I have simple single node deployment and I was floored how easy it was to set up with Celery. Really surprised. I was kicking myself for not using it sooner.

Granted I don't know all the intricacies of RabbitMQ and this was just one step beyond os.popen, but it was painless, like half an hour painless to set up and it has worked really well.

*edit: reading some of the other posts now I'm waiting for the other shoe to drop. but so far it's worked wonderfully.

I also got my first queue set up and running within a reasonable period of time with celery. I have no idea of the internals of RabbitMQ and took longer with celery really (back on python 2.7) but that system has been in prod for 6 years now without really needing any maintenance

Same experience. Single node with a few clients and Celery. Works well.

My main issue in the beginning was network timeouts now and then. Those went away after tuning some TCP settings.

Thank you for this post.

When I first started using RabbitMQ I experienced just about everything you described.

I felt incredibly stupid when a customer would have issues with a queue being stuck or messages that were being dropped, and having no clue on why this was happening.

> It's nearly impossible to get RabbitMQ working correctly within few months.

This is so true. You can get it running in 10 minutes, but it takes weeks of banging your head against the wall and angry customers before you have it running right.

I understand where you're coming from, but what you're describing is learning how to use a queue to maintain consistency guarantees across a distributed system. You can get something simple like AWS SQS working with a few clicks, but then you don't have any of those consistency guarantees.

If you don't need crazy throughput, I find that Azure Storage Queues are crazy easy, built in retry and just simple as can be. Though when I've used it in the past, I've created a slightly simpler to use abstraction.


Thinking of doing something that works like an async generator so I can just use it like...

    const work = queue.subscribe('somequeue');
    for await (const {item, done} of work) {
      // do something with JSON.parsed item from message
      await done(); // wrapper for the delete/finish
    }

Azure Storage Queues are about on par with SQS. That is, easy to use, but lacking strict concurrency control. If you need that level of concurrency control (and stricter serialization), then you’d be better off with their (more complicated) service bus product [0].

RabbitMQ isn’t more complicated because it’s been improperly designed. It’s more complicated because it’s doing a much more complicated task.

A fair amount of that complexity lies in the hosting, so a managed service can take some of that off your hands (for an increased price obviously), but part of it is necessarily going to lie with the message consumer (your application logic). If your use case doesn’t need that level of control, then it doesn’t need that level of complexity either, so something like Rabbit would just be the wrong choice.

[0]: https://docs.microsoft.com/en-us/azure/service-bus-messaging...

Depending on your usage patterns, SQS can be significantly more expensive, too.

I have my own share of objections, mainly concerning the over-engineered nature of RabbitMQ, but most of the “huge learning curve” items that you’ve described can be learned in an afternoon by a motivated software engineer. Besides, she will have to learn those concepts anyway because they apply to most brokers.

You're right. It's difficult to get right. However, it is totally worth it. Once you get it working it just works.

I wish a standard set of higher abstractions existed on it though. Celery, from what I hear fills that gap very well in the python world but nothing like this exists in Nodejs land which leaves the room open for a bunch of redis-backed solutions which are pretty fragile in comparison.

I'm curious if you are comparing this to a non-queue solution or to a different queue system?

weird, I haven't done much digging into the details of RabbitMQ, but I integrated it in a matter of hours, and have had it deployed in production systems (for quite some time now) and it works really solidly. I haven't tried to get too clever though.

You must have followed a good guide on getting it set up. Took me 2 days to get it solid. Then we decided to just use redis.

I just used the official docs and guides they had on the website, they seemed pretty good to me. I might have googled a few extra things, but can't really remember, I just remember it being pretty straightforward. I remember they pointed out a number of things you had to take care of.

Can anyone recommend an easier alternative?

I switched to redis several years ago as a simple task queue solution. For my usage (low to medium traffic at most, in a corporate environment) redis is easier to use and has a very small cpu and ram footprint compared to rabbitmq (note that I only use redis as a message queue, hence the low memory consumption). I've never had a message dropped so far. RabbitMQ uses too much memory right from startup, which is not ideal on a resource-constrained server.


Surprised to see not much mention of ActiveMQ in these comments, but it's an obvious alternative choice. The general (simplistic) comparison being:

- ActiveMQ more featureful, robust default settings, better integrated with Java/JMS but slower

- RabbitMQ faster, simpler, more "just works"

The defaults of ActiveMQ lean more towards robustness (hence often naive benchmarks will tell you it's slow). However in practice it is pretty damn easy to run, you literally can just download the default cross-platform distribution and type `./bin/activemq` and it will start running.

We use ActiveMQ + Apache Camel which makes a pretty nice combo to achieve lots of generalised messaging and routing functionality.

I've heard a lot of praise for Nats, but isn't it more like a kafka alternative? Someone new needs to spend some time grasping the stream concept.

One practical reason we chose Nats over Kafka was that Nats doesn't need zookeeper for HA.

Nats doesn't provide message durability either; luckily it's not required for 95% of our use cases. Also, with NATS already in place, it's a natural move to use NATS Streaming for durability rather than introducing a completely new technology to your stack.

Fewer pieces - fewer chances something breaks.

NATS by itself is designed to be more of an always-on style queuing system (the term they use is "dial tone") but doesn't handle node failures by itself. If you're looking for a Kafka-flavored NATS, there's a new release I saw recently called LiftBridge that adds some durability to the NATS protocol.

Someone mentioned this already below, but nats streaming also adds durability.

Disclaimer: I work for CloudAMQP

Yeah we hear you regarding AWS IOPS: for some type of loads and smaller plans we need to offer an alarm + an easy way to scale IOPS. It is something we're working on.

The biggest annoyance I found with RabbitMQ was that it could take up to 10-15 mins to restart if it had a lot of jobs.

This was back in 2015 - might be better now.

Sounds like resource (design) issues to me.

isn't Rabbitmq prone to "split brain" problem on HA setups?

RabbitMQ is great. One of the few pieces of software I've used that "just works".

The only downside is once you get message-queue-pilled, you start seeing opportunities to refactor/redesign with message queues everywhere and it can be hard to resist the urge. It really is remarkable how, when used appropriately, message queues can dramatically simplify a system.

> The only downside is once you get message-queue-pilled, ...

I think this is why email will never die. It's basically turned into a huge message queue. Even voice mails come into my inbox.

====== EDIT - I meant to say "huge universal message queue" and left out the word "universal" accidentally

It always was a message queue in a very literal sense.

There's a lot of work in mailer-daemons to ensure that email has as reliable a delivery as possible in a store-and-forward system.

You're correct - I left out the word "universal" accidentally, which would have made my intent much more clear.

Thanks for catching that.

I think this is what a lot of people who complain about Slack don't get. It's just a better message queue for your business. The fact that you can funnel all your business events, regardless of whether they originate from humans or bots, into one place and then each worker (again, either human or bot) can subscribe/filter/react to relevant events is super powerful. However, if you try to use it as a corporate SMS platform or email replacement, you will very quickly feel overwhelmed because both of those message queues are designed for much lower throughput.

And you can literally use the maildir format for a queue!


Perl had the original implementation, and there are implementations in other languages.
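
For the curious, a rough sketch of the maildir trick in Python. This is a simplification: real maildir file names also encode hostname and pid, and the function names here are mine. The parts that matter are writing into `tmp/` and renaming into `new/`, so a reader never sees a half-written message, and claiming a message by renaming it into `cur/`, so only one consumer wins.

```python
import os
import time
import uuid


def init_queue(root):
    """Create the three maildir-style subdirectories."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)


def enqueue(root, payload: bytes):
    """Write to tmp/ first, then rename into new/ so readers never see partial files."""
    name = f"{time.time_ns()}.{uuid.uuid4().hex}"
    tmp_path = os.path.join(root, "tmp", name)
    with open(tmp_path, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())
    os.rename(tmp_path, os.path.join(root, "new", name))
    return name


def dequeue(root, handler):
    """Claim the oldest message by renaming new/ -> cur/, delete it after handling."""
    for name in sorted(os.listdir(os.path.join(root, "new"))):
        claimed = os.path.join(root, "cur", name)
        try:
            os.rename(os.path.join(root, "new", name), claimed)
        except FileNotFoundError:
            continue  # another consumer claimed it first
        with open(claimed, "rb") as fh:
            handler(fh.read())
        os.remove(claimed)
        return True
    return False
```

If the handler raises, the file is left behind in `cur/`, which doubles as a crude "in progress or failed" list you can inspect by hand.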

I forgot about that - used to be a cool hack!

How is it for production deployment? I was considering it for something recently, but got overwhelmed by the documentation on setting up a fault-tolerant production deployment, so have been avoiding it. Was this an overreaction? What is your experience with that?

Also, do you happen to know how well it works in a fault-tolerant way for communicating between services that are in different data centers?

My main use-case is to receive status/change notifications from a service running elsewhere from the API server servicing the UI, in order to avoid polling for new data.

RabbitMQ, even a single-node RabbitMQ, has a hard time going down. You are more likely to have your server/container go down long before RabbitMQ node goes down. That being said, if you want to have a clustered solution with nodes being in different DCs, configure shoveling (https://www.rabbitmq.com/shovel.html) or for a simpler solution, use a private VPN to interconnect the RabbitMQ nodes. I would go for the latter.

We use it in a fairly big scale for our slack bot system. It was just set up once, as akyu said, and since then it just works. Whenever we had troubles, it was always anything other than RabbitMQ.

I've also looked into other solutions (ActiveMQ, Google PubSub, ...) and RabbitMQ is by far the most straight-forward and quick to set up. There are some edge cases that it doesn't cover as well, for example automatic retries, but there are some "RabbitMQ patterns" to make it work. For a simple message broker/queue system, it's great and the docs are also great.

We use Google Pub-Sub and got the whole thing up and running very quickly with Spring integration. Message durability, automatic consumer load balancing, automatic retries, some easy broadcast patterns - all out of the box and literally a click of a button on the infra side.

Has worked out quite well so far.

Was it easy to setup in terms of reliability and failover?

Given what you and larrik are saying, I think I need to give it a trial run, but it's a project with a tiny team, so I want to be sure it won't be the cause of sleepless nights when things go wrong. It sounds like RabbitMQ is quite solid and shouldn't be a cause for concern, which is promising!

Is there anything I should keep in mind for running it in production? Any best practices or gotchas, based on your experience (eg don't run in docker, or make sure there's lots of RAM or things like that)? I guess it's all in the production checklist. I need to read through it all again!

This came up in several other threads here: Don't use RabbitMQ's clustering. It's surprisingly brittle and hard to recover from.

The accepted wisdom that I've seen is to run a single broker with a completely independent hot spare. But of course switching over to your hot spare will violate most of the guarantees that Rabbit gives you around durability, ordering etc, so you have to be very careful how you use it.

I desperately want to like Rabbit (and have used it heavily in the past) but right now I wouldn't use it if I can get away with anything else, it just has no real HA story.

Rabbit dev here. We released quorum queues a few months ago. It's a Raft based replicated queue that addresses all the old problems. https://www.rabbitmq.com/blog/2020/04/20/rabbitmq-gets-an-ha...

Thanks for all your work on RabbitMQ, and for your great blog posts about it and other messaging systems.

For anyone who wants to understand the potential complexities of HA RabbitMQ, spend some time reading https://jack-vanlightly.com/blog/2018/8/31/rabbitmq-vs-kafka...

In my experience RMQ is solid enough that for many use-cases it's reasonable to run it without a standby (especially if you're on Kubernetes where you'll get a replacement instance created automatically if your active instance fails).

A common use-case is for async tasks (Celery) that can tolerate a few minutes of downtime. If you're running a fully evented architecture then this might not apply - though if you're not targeting 4-5 nines of reliability or an RTO of < 5mins, then you might not need a standby even if RMQ is a core part of your architecture. "Avoid single points of failure" is a good heuristic, but "consider the SLAs of your dependencies" is the more granular way of thinking about this, and a single RMQ instance has a very high uptime.
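
To put numbers on the "nines", here is a quick back-of-the-envelope helper (the names are mine, and it ignores leap years):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(nines):
    """Yearly downtime budget for an availability of `nines` nines (3 means 99.9%)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability
```

Four nines works out to roughly 53 minutes of downtime a year and five nines to a little over 5 minutes, so a single restart that takes a few minutes already eats most of a five-nines budget. At three nines you have almost 9 hours a year to play with.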

For context I had an RMQ docker container running for almost two years without any interruptions. If you're in a small team then HA might well be overkill.

Fun gotcha - if you're running RMQ in Kubernetes/Docker, make sure you give it an explicit memory limit, else it will try to allocate disk space equal to 40% of your host's memory. (See "memory limits" in https://hub.docker.com/_/rabbitmq). That's a good best-practice for any containerized environment regardless what workload you're running, but this one will cause errors if you're trying to use a small disk volume on a host with lots of memory.

Which Kubernetes operator for RabbitMQ are you using?

I’m just using the Helm manifests, but when I set this up operators were not a thing. I’d probably look into the operator approach if I was starting from scratch now.

I've been running RabbitMQ for >8 years in production, once even in a fleet of 180 buses where every bus had an instance of rabbitmq running locally.

Never had a single issue in all those years.

But, I must admit that running a HA cluster is something that I've never tried, it sounds complicated and scary once you start digging through the docs.

All my deployments have been to bare metal Ubuntu and Debian machines with durable Qs and messages.

If you need to use transactions, they are really slow, couple of orders of magnitude slower compared to regular AMQP usage.

> Was it easy to setup in terms of reliability and failover?

To be honest, the current setup was set up by colleagues who have even less experience than me and it's still running flawlessly. Iirc, it's just two instances that are behind a load balancer and the consumer just consumes from both, but I'm not super certain on that.

I've tested the cluster functionality to see how to set it up and it worked fine for me, but I have no experience with that in production, but other people in this thread don't seem to be too happy with it, so ymmv.

> Is there anything I should keep in mind for running it in production?

Nothing special that you wouldn't do otherwise when getting to know a new component/microservice. Just check out the get started section[1] on their page in the appropriate language and play around with a small setup. Get familiar with the libraries to connect and send/queue/fetch stuff and the topics. Make sure to use your brainpower before you set it up, so that it handles all eventualities and acts exactly how you want it to (ACK/NACK, what happens if a sender/consumer dies, etc.), because then you set it up once and probably never touch it again.

One thing I'm not really sure about and haven't really answered for myself yet: there might be some "logic" to your RabbitMQ instance, depending on which metadata you add to each message (e.g. retries). If you have such logic, it might be better to have a service around the RabbitMQ instance, otherwise this logic ends up in the code base of your actual solution, and that might not be wanted and may be harder to maintain. But I'm not so sure myself on that one.

Oh yeah, and check out some patterns for your needs. There are multiple ways to implement retries, for example with a queue for queues, etc. But if your API is REST based, everything should be straightforward.

[1]: https://www.rabbitmq.com/getstarted.html

Thanks for the detailed response (and to everyone else who responded too!), I appreciate it! I will prototype something and play around and see how it handles different situations when I get time. I've also just bought the book mentioned elsewhere, so hopefully I can get up to speed quickly. It does sound that my original impression about it being complex to run/maintain was perhaps overblown. That's good, because from a features point of view, RabbitMQ seemed like a good fit for the things I want an MQ solution for.

My go-to solution for fault-tolerant message queues is nsq (https://nsq.io/). nsq works differently from most other message queues in that it's supposed to be run in a distributed fashion, i.e. one nsqd running wherever messages are produced. That way you have a lightweight and fast local message queue that you can push messages to and not worry about network connectivity. You can use nsqlookupd to find the distributed nsqd that hold the topic you want to subscribe to, or you can run an additional nsq-to-nsq process to push messages from one broker to the next. It's a really great and very mature and stable piece of software. I'd say the only downside to using nsq is that you have to invest a little more in monitoring and you have to make sure that network connectivity between your consumer and each nsqd that carries a certain topic is possible.

Thanks for the recommendation! That looks pretty nice and “ops friendly” is definitely a plus. I will investigate this further.

Been using RabbitMQ for a lot of projects in production. It can handle quite a lot of data and this thing never fails. Sometimes it can be running for an entire year and we force restart just because.

Yup been my production experience as well ! Super solid system !

We use it for more or less everything at reddit. Almost every user action corresponds to a rabbit queue

sounds cool! how big queues are on your setup? how big mq instances (servers) are? do you use HA, replications/failovers?

We switched from Amazon's SQS to RabbitMQ, because SQS was killing our performance, and wasn't nearly as powerful overall.

RabbitMQ gave us such a performance increase that we killed our database. We ended up having to rate limit RabbitMQ!

> I was considering it for something recently, but got overwhelmed by the documentation on setting up a fault-tolerant production deployment, so have been avoiding it. Was this an overreaction?

In general the defaults are pretty good I think. There is a one page production deployment guide: https://www.rabbitmq.com/production-checklist.html that I followed to replace our handbuilt cluster w/ a new automated deployment, plus a few other niceties like docker logs & rmq metrics to cloudwatch and then auto clustering via autoscaling groups lookup.

The thoroughness of the docs can perhaps seem daunting, but I see it as a badge of quality, and especially if you are growing its usage organically it should "just work".

If it's super simple like that and the throughput isn't massive, use something else you don't need to support, like AWS's SQS.

If you're bad at hosting and need the throughput, there's cloudamqp.

So many options for pub/sub systems so use what works for you.

+1. Discovered RabbitMQ/AMQP around 2010, since then tech went through a 2015-era wave of HTTP microservices that has come, and, largely gone or moved to MQ.

When you say "gone or moved to MQ" - if not moved to messaging services like RabbitMQ/NATS/etc, where else could things have gone? At least from my experience, HTTP microservices are still very common, especially when using things like AWS Lambdas.

I feel like most continually-running backends will make use of RabbitMQ/NATS/ZeroMQ/etc, or more and more I see lightweight systems going completely serverless and just using lambdas - which are HTTP microservices.

> When you say "gone or moved to MQ" - if not moved to messaging services like RabbitMQ/NATS/etc, where else could things have gone?

They could have stayed trying to do continually running microservices on HTTP.

> I feel like most continually-running backends will make use of RabbitMQ/NATS/ZeroMQ/etc

I do too.

> more and more I see lightweight systems going completely serverless and just using lambdas - which are HTTP microservices.


But long running HTTP microservices are lame, and everybody realises that now, despite it being a cool idea back in 2015.

To be fair, I started working post-2015, so I've actually never come face-to-face with a long running HTTP microservice backend... what would something like that even look like? I'm thinking of systems I've worked on that use a messaging queue, but that only rely on HTTP requests - is that what it would be? So like, I'd make a request to a microservice behind an endpoint, which in turn would make requests to 3 more microservices behind other endpoints? If so, I'm certainly glad that idea isn't cool anymore because that seems greatly inefficient :)

> moved to MQ

Are you referring to IBM MQ?

probably just meant message queues in general.

I've had truly terrible experiences with RabbitMQ. I believe that it should not be used in any application where message loss is not acceptable. Its two big problems are that it cannot tolerate network partitions (reason enough to never use it in production systems, see https://twitter.com/antifuchs/status/735628465924243456), and it provides no backpressure to producers when it starts running out of memory.

In my last job, we used Rabbit to move about 15k messages per sec across about 2000 queues with 200 producers (which produced to all queues) and 2000 consumers (which each read from their own queues). Any time any of the consumers would slow down or fail, rabbit would run out of memory and crash, causing sitewide failure.

Additionally, Rabbit would invent network partitions out of thin air, which would cause it to lose messages, as when partitions are healed, all messages on an arbitrarily chosen side of the partition are discarded. (See https://aphyr.com/posts/315-jepsen-rabbitmq for more details about Rabbit's issues and some recommendations for running Rabbit, which sound worse than just using something else to me.)

We experimented with "high availability" mode, which caused the cluster to crash more frequently and lose more messages, "durability", which caused the cluster to crash more frequently and lose more messages, and trying to colocate all of our Rabbit nodes on the same rack (which did not fix the constant partitions, and caused us to totally fail when this rack lost power, as you'd expect.)

These are not theoretical problems. At one point, I spent an entire night fighting with this stupid thing alongside 4 other competent infrastructure engineers. The only long term solution that we found was to completely deprecate our use of Rabbit and use Kafka instead.

To anyone considering Rabbit, please reconsider! If you're OK with losing messages, then simply making an asynchronous fire-and-forget RPC directly to the relevant consumers may be a better solution for you, since at least there isn't more infrastructure to maintain.

Rabbitmq blocks producers when it hits memory high watermark (default 40% of available RAM) - https://www.rabbitmq.com/memory.html
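
A toy model of what "blocks producers at the high watermark" means, as a hedged sketch only. `ToyBroker` and its methods are hypothetical names, not RabbitMQ's API, and the 0.4 default is simply the 40% figure quoted above.

```python
class ToyBroker:
    """Toy stand-in for a broker with a memory high watermark (not RabbitMQ's API)."""

    def __init__(self, total_ram: int, high_watermark: float = 0.4):
        # Assumption: the limit is a plain fraction of total RAM, in bytes.
        self.limit = int(total_ram * high_watermark)
        self.used = 0

    @property
    def publishers_blocked(self) -> bool:
        # The "memory alarm" is active once usage reaches the limit.
        return self.used >= self.limit

    def publish(self, size: int) -> bool:
        """Accept the message unless the memory alarm is active."""
        if self.publishers_blocked:
            return False
        self.used += size
        return True

    def consume(self, size: int) -> None:
        """Consumers draining messages frees memory and lifts the alarm."""
        self.used = max(0, self.used - size)
```

With `ToyBroker(total_ram=1000)`, publishing 100-byte messages is accepted until 400 bytes are held, refused after that, and accepted again once a consumer drains something.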

We used to have a pub rate of about 200k msgs/s, from about 400 producers all to a single exchange and had similar issues. However, we were able to mitigate this by using lazy queues.

This worked fine until things got behind and then we couldn't keep up. We were able to work around that by using a hashed exchange that spread messages across 4 queues. It hashed based on timestamp inserted by a timestamp plugin. Since all operations for a queue happen in the same event loop, any sort of backup led to pub and sub operations fighting for CPU time. By spreading this across 4 queues we wound up with 4x the CPU capacity for this particular exchange. With 2000 queues you probably didn't run into that issue very often.
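
Roughly what the spreading looks like, as a minimal sketch. It assumes a plain hash-modulo mapping, which is a simplification and not how a broker-side hashing exchange is actually implemented; `pick_queue` and `spread` are hypothetical helper names, and the keys stand in for the plugin-inserted timestamps.

```python
import hashlib
from collections import Counter


def pick_queue(routing_key: str, num_queues: int = 4) -> int:
    """Map a routing key to a queue index using a stable hash."""
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_queues


def spread(keys, num_queues: int = 4) -> Counter:
    """Count how many keys land on each queue index."""
    return Counter(pick_queue(key, num_queues) for key in keys)
```

Feeding it a run of distinct timestamps, e.g. `spread(str(1_590_000_000_000 + i) for i in range(10_000))`, should land close to a quarter of the keys on each of the 4 queues, which is the point: each queue gets its own share of the work.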

We had a similar experience where I work. We just ended up rolling our own queue system because we really just needed point to maybe a few other points.

I'm glad Kafka is working for you! Rabbit's HA story has definitely been rough until recently. But I think a few of the issues you describe can be mitigated with a bit better understanding of what's going on.

> Any time any of the consumers would slow down or fail, rabbit would run out of memory and crash, causing sitewide failure.

Not to be glib, but in any brokered system, you have to have enough (memory and disk) buffer space to soak up capacity when consumers slow down, within reason. Older (2.x) RabbitMQs did a very poor job rapidly paging queue contents to disk when under memory pressure. Newer versions do better, but you can still run the broker out of memory with a high enough ingress/low enough egress, which brings me to...

It sounds like you did not set your high watermarks correctly (another commenter already pointed this out); RabbitMQ can be configured to reject incoming traffic when over a memory watermark, rather than crash.

However, a couple of things can complicate this: rejection of incoming publishes on already-established connections may not make it back to your clients, if they are poorly behaved (and a lot of AMQP client libraries are poorly behaved) or are not using publisher confirms. Additionally, if your clients do notice that this is happening and continually reattempt to reconnect to RabbitMQ to handle the (actually backpressure due to memory) rejection notification, this connection churn can put massive amounts of strain on the broker, causing it to slow down or hang. In RabbitMQ's defense, connect/disconnect storms will damage many/most other databases as well.

> We experimented with ... "durability", which caused the cluster to crash more frequently and lose more messages

A few things to be aware of regarding durability:

Before RabbitMQ 3-point-something (I want to say 3.2), some poorly chosen Erlang IO-threadpool tunings caused durability to have higher latency than expected with large workloads. Anecdotally, the upgrade from 3.6 to 3.7 also improved performance of disk-persisted workloads.

If you have durability enabled, you should really be using publisher confirms (https://www.rabbitmq.com/confirms.html) as well. This isn't just for assurance that your messages made it; without confirms on, I've seen situations where publishers seem to get "ahead" of Rabbit's ability to persist and enqueue messages internally, causing server hiccups, hangs, and message loss. That's all anecdotal, of course, but I've seen this occur on a lot of different clusters. Pub confirms are a species of backpressure, basically--not from consumers to producers, but from RabbitMQ itself to producers.
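
To illustrate the "confirms are backpressure" point, here is a toy simulation, not client code. `SlowBroker` and `publish_all` are hypothetical, and the fixed per-tick persist rate is an assumption made to keep the sketch deterministic. The publisher stops sending whenever too many messages are still unconfirmed.

```python
from collections import deque


class SlowBroker:
    """Toy broker that persists and confirms at most `per_tick` messages per tick."""

    def __init__(self, per_tick: int):
        self.per_tick = per_tick
        self.pending = deque()
        self.max_backlog = 0  # largest number of unpersisted messages ever held

    def accept(self, seq: int) -> None:
        self.pending.append(seq)
        self.max_backlog = max(self.max_backlog, len(self.pending))

    def tick(self) -> list:
        """Persist a batch and return the sequence numbers being confirmed."""
        count = min(self.per_tick, len(self.pending))
        return [self.pending.popleft() for _ in range(count)]


def publish_all(total: int, broker: SlowBroker, max_unconfirmed: int) -> int:
    """Publish `total` messages with a bounded unconfirmed window; return ticks used."""
    unconfirmed = set()
    next_seq = 0
    ticks = 0
    while next_seq < total or unconfirmed:
        # Only publish while the window of unconfirmed messages has room.
        while next_seq < total and len(unconfirmed) < max_unconfirmed:
            broker.accept(next_seq)
            unconfirmed.add(next_seq)
            next_seq += 1
        for seq in broker.tick():
            unconfirmed.discard(seq)
        ticks += 1
    return ticks
```

In this model the total time is the same either way (the broker is the bottleneck), but a window of 50 keeps the broker's backlog at 50 messages, whereas a window as large as the whole batch dumps everything on it at once.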

When moving a high volume of non-tiny messages (where tiny is <500b), you really need a fast disk. That means the equivalent of NVMe/a write-cache-backed RAID (if on real hardware; ask me about battery relearns killing RabbitMQ sometime ... that was a bad night like the one you described), or paying attention to max-throughput/IOPS if deploying in the cloud (for example, a small EBS gp2 volume may not bring enough throughput, and sometimes you may need to RAID-0 up a pair of sufficiently-sized gp2's to get what you need). And no burst IOPS, ever.

> We experimented with "high availability" mode

You're 100% right about this. RabbitMQ's story in this area was pretty bad until recently. Quorum queues and lots of internal improvements have made the last ~4 years' worth of Rabbit versions behave better in HA configurations. But things can still get really dicey. Always "pause minority" (trade away some uptime to avoid message loss), as the Jepsen article you linked mentioned.

For failure recovery (though it's not that "HA") if you can get single-node durability working well and are using networked disks (e.g. NFS, EBS) or a snapshotted-and-backed-up filesystem, one of the nice things about RabbitMQ's persistence format is that at the instant of crash, all but the very most recent messages are recoverable in the storage layer. That doesn't solve availability, but it does mean you don't have catastrophic data loss when you lose a node (restore a snapshot or reattach the data volume to a replacement server).

Wow, that error message! Unless you are Google, network partitions are a thing. With CAP, you don’t get to choose CA.

Kyle's analysis of RabbitMQ is almost 6 years old. Rest assured, things have changed since then.

How did the switch to Kafka solve your issue with providing backpressure to producers?

My general problem is that it's really hard to figure out which architecture is right for which system.

There's a different architecture for:

* one queue with billions of messages

* millions of queues with small numbers of messages per queue

* many queues with many messages per queue

There are also different topologies:

* Anyone can send a message to anyone (O(n^2) queues)

* One publisher with millions of subscribers

* One subscriber with millions of publishers

* Complex processing networks, where messages get routed in complex ways between processing nodes.

There are differences in timing:

* More-or-less instant push notifications

* Jobs which run within e.g. 5 minutes with polling

* Jobs which run in hours/days, with a cron-style architecture

And in reliability:

* Messages get delivered 100% of the time, and archived once delivered

* Messages get delivered 99.999% of the time, but might be dropped on a system outage

* ... all the way down to ephemeral pub-subs

... and so on.

I'd give my VP's right eye to get a nice chart of what supports what. For the most part, I've found build to be cheaper than buy due to lack of benchmarks and documentation for my use cases. Otherwise, you build. You benchmark. You optimize. And things melt down.

My use case right now requires a large number of queues (eventually millions). I'd like to have an archival record of messages. Peak volume is moderate (several messages per second per queue), but usage patterns are sporadic (most queues are idle most of the time). Routing is slightly complex but not super-complex (typically, about 30 sources per sink, at most 200; most sources only go to one sink, but might go to 2-3). Messages are relatively small (typically, around 1k), but isolated messages might be much bigger (still <1MB, but not small).

My experience has been that when I throw something like that into pick-your-queue/pub-sub, things melt down at some point, and building representative benchmarks is a ton of work.

All software breaks at some point. If you're dealing with this scale of load, it's mandatory to perform synthetic load testing to validate, otherwise you're just guessing what the breakage threshold will be.

Check out https://github.com/yevhen/Streamstone. Had similar needs, it was a good place to start. It’s just the persistence part of it though, did the messaging part using actors.

Fantastic points. We needed millions of queues with millions of items with fair queueing and scheduled release of some items and immediate release of others. 10s of thousands of messages per second. We had to build our own.

RabbitMQ is highly configurable in this regard but you will hit snags in how you distribute queues across exchanges.

Likewise this configurability makes case specific benchmarks very awkward.

I feel like RabbitMQ is sort of the "swiss army knife" of message queues, and I mean that in the nicest way possible.

People will compare it to Kafka, claiming that its pubsub is faster than Rabbit's, but that's sort of missing the point: Rabbit thrives because it's easy to set up, will work well for 99% of cases, and handles nearly every kind of distributed problem you're likely to come across.

I recently did a project with Rabbit on my home server, and while the project had some issues, the issues were never Rabbit.

Rabbit doesn't have Kafka's ability to massively distribute and scale (it does have a distributed story but from what I hear few explore it). But Rabbit also supports more complex use cases than Kafka because its messaging protocol (AMQP) is more intelligent. Unless you're a "web-scale"/s company, Rabbit's scale even on one node is likely enough.

I've been using Rabbit in production for RPC and pub/sub for the past 5 years (single instance running on a non-dedicated VM, medium traffic) and it's been pretty easy to set up and has been pretty reliable in practice.

I've always been concerned about losing messages, and I did have to learn to turn on persistence and durability for messages to survive server interruptions, but it was easy enough. Message acknowledgements are also a nice feature, and Rabbit is able to achieve at-least-once messaging semantics.
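
A small sketch of why acknowledgements give you at-least-once rather than exactly-once. This is a toy in-memory queue (`AckQueue` and `demo` are hypothetical names, not a client library): a message delivered but not acked is requeued when the consumer dies, so it can be processed twice.

```python
from collections import deque


class AckQueue:
    """Toy queue: a delivered message stays 'unacked' until the consumer acks it."""

    def __init__(self):
        self.ready = deque()
        self.unacked = {}
        self._next_tag = 1

    def publish(self, body):
        self.ready.append(body)

    def get(self):
        """Deliver the next message as (delivery_tag, body), or None if empty."""
        if not self.ready:
            return None
        tag = self._next_tag
        self._next_tag += 1
        body = self.ready.popleft()
        self.unacked[tag] = body
        return tag, body

    def ack(self, tag):
        del self.unacked[tag]

    def consumer_died(self):
        """Requeue everything delivered but never acked, keeping the original order."""
        for tag in sorted(self.unacked, reverse=True):
            self.ready.appendleft(self.unacked.pop(tag))


def demo():
    queue = AckQueue()
    for body in ("a", "b", "c"):
        queue.publish(body)
    processed = []

    tag, body = queue.get()
    processed.append(body)
    queue.ack(tag)

    tag, body = queue.get()
    processed.append(body)  # work done, but the consumer crashes before acking
    queue.consumer_died()

    while (item := queue.get()) is not None:
        tag, body = item
        processed.append(body)
        queue.ack(tag)
    return processed
```

`demo()` processes "b" twice and nothing is lost, which is the trade: consumers need to tolerate the occasional duplicate.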

Yeah, I don't dispute that for certain usecases, Kafka is definitely the better choice, use the right tool for the right job.

That said, for most small to medium-large tasks, Rabbit will handle things without much trouble, making it a good fit for most common usecases.

it's also super duper stable even on default configs. it's one of my favorite softwares ever.

I'm very well versed on RabbitMQ. We use it internally in a .NET codebase.

Anyone considering RabbitMQ needs to read up on "network partitions", how to build your cluster to avoid them (odd number of nodes and pause_minority), your recovery strategy for when a network partition occurs (it will occur), your personal/organizational tolerance for message loss and a plan for how you will upgrade your cluster at some later date (ensure you architect your application to handle whatever type of upgrade strategy you will pursue).
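
The "odd number of nodes and pause_minority" advice comes down to a few lines of arithmetic. A hedged sketch, assuming the rule is "pause if you can see half or fewer of the cluster's nodes, yourself included"; `should_pause` and `surviving_sides` are hypothetical helpers, not broker code.

```python
def should_pause(visible_nodes: int, cluster_size: int) -> bool:
    """True if a node that sees `visible_nodes` (itself included) lacks a strict majority."""
    return visible_nodes * 2 <= cluster_size


def surviving_sides(cluster_size: int, split: tuple) -> list:
    """Given the side sizes of a partition, return the sides that keep running."""
    if sum(split) != cluster_size:
        raise ValueError("split sizes must add up to the cluster size")
    return [side for side in split if not should_pause(side, cluster_size)]
```

A 3-node cluster split 2/1 keeps the side of 2 running; a 4-node cluster split 2/2 pauses both sides, which is why an even node count buys you nothing here.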

There are definitely ways to operate to minimize these failures but you SHOULD KNOW ABOUT THEM before you add this service to your environments.

If you're using RabbitMQ on .Net, I highly recommend using NServiceBus. It's made working with queues so easy. It handles maintaining a connection and retrying/acknowledging messages for you.

Hindsight is the best site. That's definitely what I would do if I was starting a new project using RabbitMQ. Although I'll defend myself on this front; I inherited our RabbitMQ project from the developer who left the company 7/8 of the way through the implementation. I had the "make it work" directive and not the decision making luxury he had from the beginning.

I worked on NServiceBus years ago (not just used, but actually was an active developer on the project). It's an excellent piece of software and Udi Dahan really knows what he's talking about.

This. We used Rabbit in our platform for a few years and it was an absolute disaster. Network partitions mainly. In retrospect I'm sure we were doing it wrong but that really wasn't obvious at the time.

Can you talk a little bit about how you've managed your RabbitMQ infrastructure? Also if you've done any comparison to Azure Message Queues and what were the pros and cons against Rabbit?

I'm looking to pitch adding a message queue to our infrastructure (at a .Net shop on Azure), and I'm sure there will be some questions about the comparisons between the two. Unfortunately, that's been tough to really track down.

>Can you talk a little bit about how you've managed your RabbitMQ infrastructure?

From a ten thousand foot view, two or three node clusters running in non-prod environments on virtual machines running Windows. In Prod, three node clusters on Windows virtual machines.

All work to install and configure RabbitMQ is done manually. Sadly enough.

I'm on the application/architecture side of this equation but I know enough about our infrastructure to perhaps answer follow-ups or more specific questions.

Our application is single tenant (so each customer is deployed in their own isolated area) so we use virtual hosts to isolate each customer within the cluster.

>Also if you've done any comparison to Azure Message Queues and what were the pros and cons against Rabbit?

Definitely looked in to the Azure native queueing options but it's been awhile. Azure Message Queues is an AMQP compliant messaging system that seems fairly robust. To be transparent, I have no production experience with this product. If your company/department is in to managing virtual machines then they might want/prefer to go with RabbitMQ. However, if they're in to PaaS systems then I'd probably roll with Azure Message Queues and never look back.

Thanks for the response. From the other responses in this thread, it seems like the admin of the nodes/cluster is not overly onerous. Would you agree with that statement? Also, being a .Net shop, the Windows VMs make sense, but are there any tradeoffs to running Rabbit on Windows, as opposed to Linux?

I think part of the sell is how we would manage the admin component of a Message Queue, which tilts things towards Azure Message Queues as it's PaaS. We're mostly IaaS at the moment, and starting to see some of the admin overhead that comes with managing that infrastructure ourselves. We're not ready to jump onto a PaaS solution for the things we've grown accustomed to managing, but for something brand new, I think my company would be open to it.

Architecturally, we'd lean on it initially for background job processing, which is currently at a scale where our homegrown, db-backed solution is starting to show its weaknesses. Once it's in place though, I think it could be leveraged as a key component to decouple subsections of our application and give us more flexibility with scaling and deployment.

>From the other responses in this thread, it seems like the admin of the nodes/cluster is not overly onerous. Would you agree with that statement?


>are there any tradeoffs to running Rabbit on Windows, as opposed to Linux?

Should be fine to run on Linux, assuming you are (or you have people who are) comfortable admin'ing Linux servers. I think that a Windows admin would get frustrated trying to set up/configure RabbitMQ on a Linux server. There's also a container advantage as RabbitMQ is published to Docker only with officially maintained Linux images.

>We're not ready to jump onto a PaaS solution for the things we've grown accustomed to managing, but for something brand new, I think my company would be open to it.

I'd push you to figure out why Azure Message Queues would not work for you. If there's no compelling "no" argument then you'll thank yourself later.

>Architecturally, we'd lean on it initially for background job processing, which is currently at a scale where our homegrown, db-backed solution is starting to show its weaknesses.

We pursued RabbitMQ for very similar reasons (queueing mechanisms via SQL Server tables and stored procedures). Keep in mind that you still need something to submit the job (initiate the background task). RabbitMQ is not going to automagically schedule anything for you. We have a couple of applications that use the tool Hangfire for job scheduling and in one case, the Hangfire job simply sends a message to RabbitMQ.

RabbitMQ runs perfectly fine on Windows too. As others mentioned in the comments, RabbitMQ supports a great variety of use cases. If you want to reach out for help, you can find my contact in the article.

Download and set aside a copy of the erlang and rabbitmq installers if you're running on windows... I've had issues on many occasions with the erlang installer being unavailable or very slow to download.

Also, if Redis 5 is already part of your stack then you should look at their Streams feature before adding anything like RabbitMQ or Azure MQ.

Unfortunately streams were released after we introduced RabbitMQ to our application and I really wish we could just focus on Redis.

I'll add one comment, if you aren't doing a really large number of queued items (under 50k messages every few minutes), Azure Storage Queues are pretty nice and easiest to use imo.

Using the opportunity to pimp my book, RabbitMQ in Depth: https://www.manning.com/books/rabbitmq-in-depth


So weird seeing you post that, as I literally have this book on my desk right now.

Thanks Gavin, I learned a lot from reading it!

That's awesome! I'm glad it was useful!

It is a great book, indeed. I always recommend it.

Thank you!

It is a nice book, highly recommend it!

Thank you!

This is HN - we need a sales chart! :)

RabbitMQ has been awesome in my experience. One of the few tools that just works and has a super useful management web interface and Prometheus support among other plugins.

For those noting HA and scalability, it's not meant for use cases where (virtually infinite) horizontal scalability is the biggest concern. If you need horizontal scalability at a massive scale, use Kafka. But for the majority of cases, you can get away with limited scalability, and the prod setup, development experience, and reliability of RabbitMQ are unmatched in my experience.

I've been trying to rationalize using either RabbitMQ or Kafka for something I'm building. High messages per second but with more complex routing topologies.

Rabbit seems to be the right path but I'm worried about scaling out, as many sources seem to point to Kafka being more scalable (at least horizontally). I've been looking into Rabbit's Federation but it's still not clear if that will solve the problem down the road.

Can anyone shine some light?

I've been running RabbitMQ on pretty small VMs for a long time. RabbitMQ doesn't need a lot of resources per message, even with very small VMs (512MB RAM, single CPU) I've seen it handle peaks of many thousands of messages a second without running into problems. Give it a bit of beefy hardware and it'll probably handle whatever load you were thinking, unless you're saturating 10gig links with messages or something.

RabbitMQ and Kafka are very different struggles when thinking of scaling and performance. Kafka is almost a database itself of messages which have routed through the system. In many configurations clients can come back and demand to replay the message stream from almost any point in time. This means you need to handle _a lot_ of disk and memory access. With RabbitMQ, messages are traditionally very ephemeral. Once a message has been ack'd, it's gone. Poof. Not in memory. Not on disk. Nobody is going to come back asking for that message. This leads to a lot more efficiency in handling things per message, but at the cost of not being able to remember the messages that went through the system a few milliseconds ago.
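
The difference can be sketched with two toy in-memory structures. Both classes are hypothetical illustrations of the two models, not how either system is implemented.

```python
from collections import deque


class EphemeralQueue:
    """Queue model: a message is forgotten as soon as it is consumed and acked."""

    def __init__(self):
        self._messages = deque()

    def publish(self, body):
        self._messages.append(body)

    def consume_and_ack(self):
        """Hand out the oldest message and drop it immediately."""
        return self._messages.popleft() if self._messages else None

    def __len__(self):
        return len(self._messages)


class RetainedLog:
    """Log model: records are appended and kept; each consumer tracks its own offset."""

    def __init__(self):
        self._records = []

    def append(self, body):
        """Store the record and return its offset."""
        self._records.append(body)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return every record at or after `offset`; nothing is removed."""
        return list(self._records[offset:])

    def __len__(self):
        return len(self._records)
```

After consuming from `EphemeralQueue` there is nothing left to replay, while `RetainedLog.read_from(0)` returns the full history every time, at the cost of having to store all of it.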

CPU usage highly depends on the number of connected clients, not that much on message throughput. You can experiment with the excellent rabbitmq-perf-test tool to get some ballpark numbers.

I have a system that only pushes 5k messages per second but it needs 32 cores.

What’s the amount of connected clients around for that 32 core setup?

around 3000 connections, 2000 queues, 1K message size

Yeah that sounds about right. Of course if you had 200 connections and 50 queues you'd more likely be seeing 100000 msg/s. The number of connections and queues has a big effect on total throughput.

As someone who has run a number of messaging systems in production, this is what my current take is in general:

If you are moving to a more "event-sourced" architecture, usually two main concerns (beyond basic operational stuff of uptime, scale, etc) are routing and long-term retention.

RabbitMQ has the routing but not the retention. Kafka can have the retention and the routing, but it can be complex/expensive. Apache Pulsar really shines here as the API is pub/sub but it is underpinned by a log structure that gives you long-term retention (that doesn't need to be manually re-balanced), but its flexibility does come with some operational complexity when compared to RabbitMQ.

If your need is pretty much just moving large amounts of data, Kafka is definitely the most mature and has a big ecosystem, but long-term retention is difficult and there are some sharp edges around consumer groups.

If you really really don't need long-term retention and need complex topologies, RabbitMQ is your best bet and is fairly reasonable to operate even up to fairly high message rates (~10k msgs/sec shouldn't be too hard to achieve)

There are a TON more options these days though: older, more Java-centric solutions like ActiveMQ and RocketMQ, or more "minimal" implementations like NATS, not to mention the hosted services on cloud providers.

Personally, I am a big fan of Apache Pulsar for its flexibility and some nice design choices, but I don't think there is any silver bullet in this space.

Would you mind expanding on some of the operational complexity you ran into with pulsar?

I think pulsar is wonderful, but I haven't had the chance to use it for anything serious / in production yet, so I'm curious what pain points you had.

I'm guessing that the pain points surrounded having to set up a Zookeeper cluster in conjunction with Pulsar. I think Pulsar has the best model of the various queuing systems at the moment for the routing flexibility of RabbitMQ, the high-throughput of Kafka (topic/partitions), as well as the ability to seamlessly integrate with cold storage (S3/GCS) and to recall messages from cold storage without extra code (unlike Kafka), I just wish that ZK wasn't an additional dependency.

Anyone know of any Pulsar hosting providers?

Adding to what the sibling comment says, be careful about buying into RabbitMQ's clustering; having run it for years, I found it to be extremely brittle.

We often lost entire queues because a small network blip caused RabbitMQ to think there was a network partition [1], and when the other nodes became visible, RabbitMQ has no reliable way to restore its state to what it was. It has a bunch of hacks to mitigate this, but they don't solve the core problem; the only way to run mirrored queues ("classic mirrored queues", as they're now called) reliably is to disable automatic recovery, and then you have to manually repair RabbitMQ every time this happens. If you care about integrity, you can use the new quorum queues [2] instead, which use a Raft-based consensus system, but they lack a lot of the features of the "classic" queues. No message priorities, for example.

I've never used federation or Shovel, which are different features with other pros/cons.

If you're willing to lose the occasional message under very high load, NATS [3] is absolutely fantastic, and extremely fast and easy to cluster. Alternatively, NATS Streaming [4] and Liftbridge [5] are two message brokers built on top of NATS that implement reliable delivery. I've not used them, but heard good things.

[1] https://www.rabbitmq.com/partitions.html

[2] https://www.rabbitmq.com/quorum-queues.html

[3] https://nats.io/

[4] https://docs.nats.io/nats-streaming-concepts/intro

[5] https://github.com/liftbridge-io/liftbridge

> lost entire queues because a small network blip caused RabbitMQ to think there was a network partition, and when the other nodes became visible, RabbitMQ has no reliable way to restore its state to what it was

I can offer a similar anecdote: we started seeing rabbitmq reporting alleged cluster partitions in production after enabling TLS between rabbitmq nodes, where manual recovery was needed each time.

After a bit of investigation we noticed that cluster partitions seemed to correlate with sending an unusually large message (think something dumb like 30 megs) through rabbitmq when TLS between rabbitmq nodes was enabled. What I believe was happening was that Rabbitmq was so busy encrypting/decrypting the large message that it delayed sending or receiving heartbeats, and then the cluster falsely assumed there had been a network partition.

Mitigated that issue by rewriting the system to not send 30 meg messages. There was only one message producer that sent messages anywhere near that large, and after a bit of thought we realised it was not necessary to send any message at all in that case (sending the large message was a hack around some other old system performance problem that had gotten fixed properly a year back, but the hack that generated a huge message was still in place).

Erlang/OTP-22 (released last year) introduced TLS distribution optimizations and message fragmentation which sound very related to the problem you saw:


The fragmentation in particular addresses the problem where a large message would block all other messages, including heartbeats, and cause nodes to look “down” when they’re not.

fantastic. thank you for sharing that -- my anecdote about this problem is slightly dated -- it would have been late 2017 early 2018 we were seeing the issue, which indeed predates OTP 22 release.

The old network partition problems people remember about RabbitMQ are solved by quorum queues.

Yes, but quorum queues don't have many of the features of classic mirrored queues.

it used to be really bad, that's super true.

nowadays? it's actually quite simple to set up and works pretty well (source: i know two different companies that set up clustering recently and both had good experiences with no downtime).

I've used both. I was introduced to Rabbit at one job and at another, was "fed" Kafka during a selection process. At that time, I was definitely not opposed to Kafka because, hey new resume item. I ended up yearning for Rabbit for three reasons.

1) Much easier to implement and maintain for small to medium architectures. However, the war stories I've heard are that it starts to become a hassle for large clustering architectures.

2) Because it's a traditional message broker, the input and output ends, which I was responsible for, were much simpler to write because I didn't have to worry about replays when it came back online. Rabbit knows which client it has already routed to and where messages went. Kafka is not that sophisticated in that regard. Kafka has been described as "dumb broker/smart clients" while Rabbit is "smart broker, dumb clients."

3) The scaling. Rabbit is very scalable. Once you get to the Uber/Paypal level (like, a couple of million writes per second), then Kafka becomes the obvious choice. Rabbit handles thousands of writes per second just fine. However, at that second company and like many others, they thought they'd have to suck up all the data, so of course, Kafka was the more scalable tool long term. Spoiler: We were never, ever close to PayPal-level transactions. If the size of the sun represents paypal/Uber transactions, we were basically Manhattan.

Kafka is one of those things where if you're new to it, especially if you're coming from Rabbit or similar, you might tend to assume the happy path - exactly once delivery. This is a bad mistake (whether that's possible and to what definition is not a debate I'd like to dive into now). What you should expect from Kafka is at least once delivery.

There will be times when you lose offsets or when you actually want to replay every message, so take an hour and figure out what that means to your app. It's usually only a few lines of code in your consumer that compares source timestamps, but it's by far the most beneficial thing you can do when working with Kafka in my experience.
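A minimal sketch of the "compare source timestamps" idea, assuming each event carries a key and a producer-side timestamp. The class and field names are made up for illustration, and no Kafka client is involved:

```python
from typing import Dict


class LatestWinsConsumer:
    """Apply an event only if it is newer than what was already applied for its key."""

    def __init__(self) -> None:
        self.last_seen: Dict[str, float] = {}  # key -> newest source timestamp applied
        self.state: Dict[str, str] = {}

    def handle(self, key: str, source_ts: float, value: str) -> bool:
        if source_ts <= self.last_seen.get(key, float("-inf")):
            return False  # replayed or stale event: skip it
        self.last_seen[key] = source_ts
        self.state[key] = value
        return True


consumer = LatestWinsConsumer()
events = [("order-1", 100.0, "created"), ("order-1", 105.0, "paid")]
applied_first = [consumer.handle(*e) for e in events]
applied_replay = [consumer.handle(*e) for e in events]  # simulate a full replay
```

With this guard a full replay leaves the state unchanged, which is the property you want under at-least-once delivery.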

It's also relatively easy to hit "tens of thousands" messages/second, especially in replay or bootstrapping scenarios, and that's when Kafka becomes useful to the non-FAANG companies.

Author here.

I've seen quite a lot of messages going through RabbitMQ. I wouldn't worry too much about scaling, because the possibilities depend very much on the architecture. With some tuning RabbitMQ can take you a long way. I would give clustering a go and see where the limits are before exploring more complicated architectures like federation.

Could you explain how RabbitMQ clustering is going to improve performance? For how it works I would expect it to lower performance.

With clustering, you can have more nodes and you can shard (distribute) your queues over the cluster. You don't need to mirror every queue on every node. But you are right, mirroring alone will add more load.

Rabbit's federation is a good way to bridge point-to-point connections between geographically distributed systems. I'm not sure that's a great scaling pattern for throughput though.

The clustering might look tempting but it hasn't been resilient for me in the face of janky networks. Split brains and data loss can result.

In the past I've scaled my rabbits for throughput by implementing my own routing/sharding layer.

If you're tempted to use the message persistence and you care about retaining messages, kafka is a bigger but much more capable hammer.

If you’re trying to “rationalize” a decision, that’s already a red flag. Also, Kafka and RabbitMQ are intended for different use cases. One is (the log component of) a streaming data processing system, the other is a message queue. Figure out which kind of system you need before deciding on a particular system. BTW, if you need to really scale, Apache Pulsar is designed to handle both scenarios.

Look into Pulsar, it can function as a message queue or pub/sub like Kafka.

By default it only retains non-acked messages, multiple subscription modes, can use non-persistent messaging, dead letter queue, scheduled delivery, can use Pulsar Functions to implement custom routing etc.

Scales like Kafka (probably better) and has cluster replication built in.

Rabbit MQ is a traditional message broker; you use it when you have lots of messages you don't particularly want/need to be stored persistently, and where you want/need to take advantage of the routing feature--that you put keyed messages into some topic/exchange and then subscribe to only part of the messages any given application is interested in.
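To make the routing feature concrete, here is a rough sketch of how an AMQP 0-9-1 topic exchange decides whether a binding pattern matches a routing key (`*` stands for exactly one dot-separated word, `#` for zero or more). The function name is mine, and this is an illustration of the matching rule, not the broker's implementation:

```python
def topic_matches(pattern: str, routing_key: str) -> bool:
    """Return True if an AMQP-style topic pattern matches the routing key."""
    p = pattern.split(".")
    k = routing_key.split(".") if routing_key else []

    def match(i: int, j: int) -> bool:
        if i == len(p):
            return j == len(k)
        if p[i] == "#":
            # '#' may absorb zero or more words
            return any(match(i + 1, n) for n in range(j, len(k) + 1))
        if j < len(k) and (p[i] == "*" or p[i] == k[j]):
            return match(i + 1, j + 1)
        return False

    return match(0, 0)
```

A consumer bound with `orders.*.created` would then see `orders.eu.created` but not `orders.eu.cancelled`, which is the "subscribe to only part of the messages" behaviour described above.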

Kafka creates the abstraction of a persistently stored, offset-indexed log of events. You read all events in a topic. Kafka can be used to distribute messages in the way AMQP is used, but is more likely to be the centerpiece of an architecture for your entire system where system state is pushed forward/transformed by deterministically processing the event logs.

If your main concern is scalability: Each queue in rabbit gets its own thread. So if you can spread your workload across multiple different queues you can scale without too many problems.
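One way to spread work across several queues is to hash a stable key and publish to the queue it maps to. This is only a sketch: the `work.N` naming and the choice of SHA-256 are assumptions, and it presumes you declare the `work.0` .. `work.N-1` queues yourself:

```python
import hashlib


def shard_queue(routing_key: str, shards: int, prefix: str = "work") -> str:
    """Map a key to one of `shards` queue names, stable across processes."""
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % shards
    return f"{prefix}.{index}"
```

Because the same key always lands on the same queue, per-key ordering is preserved while different keys are handled in parallel.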

Both RabbitMQ and Kafka are extremely simple to stress test with simulated data which will let you make a decision that you will be comfortable with.

Are you replacing an existing system that's already at scale?

No, greenfield

Then the odds of you hitting the scale where RabbitMQ v. Kafka is relevant are a million to one. There is a lot of overhead with Kafka compared to RabbitMQ.

Unless you already have Kafka infrastructure, setting up Kafka for a brand new project is crazy unless your only goal is learning how to set up Kafka.

you either look at pulsar with rabbit

We've used RabbitMQ since 2010 in KAZOO. I would argue, save one or two instances in the intervening 10 years, that RabbitMQ is the most stable piece of the infrastructure. I think it might be the only open-source project we build on that we haven't committed upstream to because we haven't encountered any issues in our usage.

RabbitMQ is one of those pieces of software I usually forget are there. I can't remember having to deal with any rabbit issue in last few years.

This might be the Achilles heel of RabbitMQ. It works so well that people forget it for years, and then they have forgotten how to upgrade it, etc. :)

Lol this! WRITE down that rabbitmq-web-admin passwd. After the setup and first few weeks of checking the speed of your queues you will forget about it and try to log in 1 year later :)

We started using RabbitMQ for several projects last year, and it's been a joy.

Some of that joy is surely just moving from older, creakier solutions. But it hasn't let us down, and everyone is eager to use it for new features or refactoring legacy code.

Using this opportunity to shout out to Rascal (https://github.com/guidesmiths/rascal) which makes using RabbitMQ on Node an absolute joy.

Same with MassTransit[0] and .NET. We have several distributed .NET Core services running in our data center, services running on employee PCs, etc all communicating via RMQ with MassTransit and it's great. The primary maintainer is very active (streams every Thursday evening) and the documentation has gone from "pretty bad" to really good in the last few months.

[0] https://masstransit-project.com/

MassTransit is awesome! I love what Chris Patterson (the author) did. It essentially allows you to swap out RabbitMQ for SQS or Azure Service Bus or a few others. Pretty cool stuff if you're in .NET land.

I have never had a good experience with RabbitMQ wherever I have worked. Often it was buggy and unreliable. It’s almost always been some thing shoehorned into a service, but failed to gain widespread adoption with future services. Furthermore, it’s usually some hot potato no one even wants to deal with. We have written some code around it to make it more reliable. You quickly figure out why there seems to be so many half baked implementations of it wherever you go work.

It’s basically caught between being too bloated and complex for use with smaller systems (as some commenters have poked at people for not being the ‘right’ kind of person to be running it)

While at the same time, it’s not robust and reliable enough to use in prime time.

What’s left is this enticing and sexy sounding message broker called RabbitMQ that actually just sort of sucks.

In my experience, someone gets stoked on trying this out, but once everything is implemented it disappoints, and the system or service it is a part of ends up a one-off because future services use something more mature the next time around.

For scale I have used NSQ to handle millions of messages a second, and for smaller scale, AWS services like SQS can handle things much more reliably.

I love RabbitMQ but deploying/managing a cluster can be tricky. We had problems with network partitioning and since we didn't really need a cluster for performance reasons - only availability - we switched to a single node.

Try the new quorum queues, they don't have those issues.

It's working well for us but we occasionally get blips where for very short periods of time messages get "stuck" in between application code events on different servers and we cannot figure out why. It's very rare. Maybe a burst of 5 messages every 10 million messages.

Any ideas on how to even debug this type of thing? Help! We think it might be a tcp connection failure but we have no idea.

Tcpdump and wireshark?

One big thing I’ve appreciated about RabbitMQ is how well it separates publishing, message routing, and subscription concerns. Plus it’s never been the issue in any infrastructure I’ve encountered it.

We use AWS SQS and Rabbit. At our scale, SQS is easy peasy and we can wrap it to make http calls instead of using SQS, as we're using AWS Beanstalk workers. SQS is generally quicker to get up and running with and we can have metrics out of the box. We use rabbit for some other stuff and it works just fine; it's when things go into a black hole that we struggle, but that's our lack of knowledge.

Depending on your scale, we find SQS is cheaper than a managed rabbit service. Although I'd be interested in using kafka!

Used RabbitMQ for a call center handling thousands of calls per second. It worked fine integrated with Flower, Celery and Python... but once we went to production, it became a black box where for every setting it was hard to find documentation or support. We ended up having to build huge machines with tons of memory and CPU and still saw messages lost with no explanation. Ended up moving to PubSub and rebuilding the whole app.

Rabbit is not a "turn it on and hope it works" kind of solution and if it's a blackbox to you then you shouldn't use it. AMQP is a relatively fancy protocol and Rabbit is endlessly configurable which is both a pro and a con. You will need to develop expertise in Rabbit to use it well at scale.

I've been using rabbitmq heavily for a fairly large hobby project (20-100 messages/sec) for a few years now. I'm generally happy with it, but there are a number of caveats I've learnt.

1. If you have large messages and use keepalives (and you'll need keepalives), you need to write your own message fragmentation.

2. There are no python libs that just work. I'm currently using a vendored version of amqpstorm with a bunch of hacks to handle wedged connections. I have some AMQP connections that are intercontinental, and I've been able to wedge literally every other AMQP library.

3. If you have a single open connection, it will get stuck from time-to-time. With a bunch of both in-band and out-of-band keepalives, I've got it to the point where I don't have things permanently block, but you should expect things getting stuck for ~2x your heartbeat time periodically. This doesn't seem to result in message loss. I've dealt with this by just running LOTS of concurrent connections, and aggregating them client side. This has worked fine.

4. In general, exactly-once delivery isn't a thing. You should design either for at-most-once, or at-least-once delivery modes exclusively. Idempotency is your friend.

5. The tooling /around/ the rabbitmq server is a dumpsterfire.

Basically, I feel like the core server is super durable (note: I'm not running a cluster, so this doesn't generalize to multi-instance cases), but the management stuff is god-awful. The main management CLI tool actually calls the HTTP interface, which is kind of ridiculous. I've occasionally run into a situation where I wound up with leaking temporary exchanges, and just flushing bogus exchanges is super annoying.

I don't think there's any other options that can do what rabbitmq does for my use-case, but it's had quite the learning curve.
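For point 1, an application-level fragmentation scheme could look roughly like the sketch below: split one logical payload into numbered chunks, publish each as its own message, and reassemble on the consumer. The chunk size, the dict layout and the class name are all assumptions for illustration; no AMQP library is involved:

```python
import uuid
from typing import Dict, Iterator, Optional

CHUNK_SIZE = 1024 * 1024  # 1 MiB per fragment; an arbitrary illustrative value


def fragment(payload: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[dict]:
    """Yield fragments that each carry enough metadata to be reassembled."""
    message_id = uuid.uuid4().hex
    total = max(1, -(-len(payload) // chunk_size))  # ceiling division
    for index in range(total):
        yield {
            "id": message_id,
            "index": index,
            "total": total,
            "body": payload[index * chunk_size:(index + 1) * chunk_size],
        }


class Reassembler:
    """Collect fragments (in any order) until a logical message is complete."""

    def __init__(self) -> None:
        self.parts: Dict[str, Dict[int, bytes]] = {}

    def add(self, frag: dict) -> Optional[bytes]:
        parts = self.parts.setdefault(frag["id"], {})
        parts[frag["index"]] = frag["body"]
        if len(parts) < frag["total"]:
            return None  # still waiting for more fragments
        del self.parts[frag["id"]]
        return b"".join(parts[i] for i in range(frag["total"]))
```

Keeping each fragment small means no single frame occupies the connection for long, so heartbeats can be interleaved between fragments. A real version would also need to expire partial messages whose remaining fragments never arrive.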

> If you have large messages and use keepalives (and you'll need keepalives), you need to write your own message fragmentation.

I'm confused by what you mean by that. Do you mean "large" as in "take a long time to process in the consumer"? If so, and if your consumer is not issuing heartbeats concurrently with message processing, then that is true.

> There are no python libs that just work.

Completely agree. Having hacked on and patched the code inside Celery, it's really quite a bummer. I think this is because the Python libs try to abstract over things that ... just straight up can't be abstracted away given the semantics of AMQP: specifically connection-drop-detection, "resumption" of a consume (not really possible; this isn't Kafka), and the specific error code classes (connection-closed vs channel-closed vs information).

> If you have a single open connection, it will get stuck from time-to-time.

Are you talking about publishing connections? Consuming connections? One used for both? What does "stuck" mean? I'd be interested in hearing more about this.

> exactly-once delivery isn't a thing

Kinda pedantic, but exactly once delivery is possible in some very restricted situations (see Kafka's implementation of this guarantee: https://www.confluent.io/blog/exactly-once-semantics-are-pos...). Exactly once processing is what's tough-née-impossible. So yeah, idempotence is great.

> I'm confused by what you mean by that.

By large, I mean 10+ MByte.

> Completely agree. Having hacked on and patched the code inside Celery, it's really quite a bummer.

I don't understand what the point of celery is. Literally everything I do requires /some/ persistent state in the workers, and there's no way to do that with celery.

> Are you talking about publishing connections? Consuming connections? One used for both? What does "stuck" mean? I'd be interested in hearing more about this.

TCP connections. As in, a connection to the server from a consumer. High latency connections seem to exacerbate the issue.

I think the issue is the state machines server-side and client-side get out of sync, and things just stop until the keep-alives/heartbeat cause the connection to reset, but that's a bunch of time to wait with no messages.

I also ran into the issue that basically every python library had at least one or two locations where `read()` was called without a timeout, but that was at least easier to fix.
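The fix for an unbounded `read()` is usually just to put a timeout on the socket before reading. A small standard-library sketch (the helper name is made up; how a given AMQP client exposes its socket varies):

```python
import socket


def recv_with_timeout(sock: socket.socket, nbytes: int, timeout_s: float) -> bytes:
    """Read from the socket, but give up instead of blocking forever."""
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # Surface the stall so the caller can tear down and reconnect.
        raise TimeoutError(f"no data within {timeout_s}s; treat the connection as wedged")
```

The caller can then treat the `TimeoutError` as a signal to reset the connection rather than waiting for the heartbeat to notice.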

> Kinda pedantic, but exactly once delivery is possible in some very restricted situations (see Kafka's implementation of this guarantee: https://www.confluent.io/blog/exactly-once-semantics-are-pos...). Exactly once processing is what's tough-née-impossible. So yeah, idempotence is great.

Well, it isn't really a thing, so you at least shouldn't depend on it being a thing for your architecture if possible.

> By large, I mean 10+ MByte.

OK. Did Rabbit or your client libraries bug out when sending single giant messages? What does message fragmentation (by which I assume you mean splitting one logical message up over multiple AMQP messages? Or something else?) have to do with keepalives (and what do you mean by keepalives? Connection heartbeats? TCP keepalives?)?

> Literally everything I do requires /some/ persistent state in the workers, and there's no way to do that with celery.

Sure there is. In-memory caches persist between requests. And there's always sqlite and friends. Celery's more intended for the "RPC/fire-and-forget" case than stateful workloads, but it's not too painful to use those with it. And you get the benefits of its (reasonably) hardened connection/heartbeat management, which may help with some of your other issues.

Basically every time I've seen code that rolled its own bespoke consumer loop for RabbitMQ, it was wrong in some fundamental ways; the state machine on the consumer side did indeed get out of whack, and badly. Best to outsource the "keep the connection alive, establish subscription, detect failures" work to a higher-level library (like Celery) that provides a long-lived consumer so your code can just be occupied with data processing.

Would anyone be able to explain the benefits of RabbitMQ over NATS? As far as I've seen, it's really just that RabbitMQ is more feature-rich, which I personally feel like isn't that crucial, as frankly many systems are not going to take advantage of those more complex functionalities anyway.

Durability. If you need to push messages that don't get lost, RabbitMQ is a pretty solid choice. In years past the clustering situation wasn't great and there was some potential for message loss, and that seems to be resolved now with quorum queues, but the biggest difference between NATS and RMQ is the durability guarantees and the at-least-once delivery guarantees that RMQ has. NATS is more like ZeroMQ in that it expects the subscribers to be online. There has been some work by others using that NATS protocol to create a Kafka-like system (written in Go, I believe) called LiftBridge. So if you like NATS and it's working for you and you want durability, take a look at LiftBridge.

This isn't true anymore. Nats streaming has persistence, so the OP's question still remains

My understanding is that NATS (a protocol) and NATS streaming were related but separate:


(The issue is from 2017 but illustrates a distinction)

That's right, but I think at least since both are listed on their website as different ways to run it that it should at least be considered a native feature at this point.

All very interesting - this is great!

Rabbit saved my life. I had a project that involved getting the AMQP Proton library working on the Xbox. Rabbit was so easy to set up and use, it gave me a reliable way to test my work. Getting into AMQP at the time was confusing and poorly documented. Rabbit did indeed "just work".

How does RabbitMQ compare with Kafka?

I'm more familiar with SQS than RabbitMQ, but have used both, and have chosen between queue and stream based solutions.

Kafka is a stream, and can be replayed (if you have it set up to store stuff). Rabbit is simply a queue, and when the messages are gone, they're gone.

This means that queues are a lot smaller, but can only serve one set of consumers at a time. If you want to have multiple things listening to messages, you have to use fan-out patterns that place messages on multiple queues. Queues can also suffer from less than atomic delivery, especially if the system is distributed. This means you have to jump through some hoops and add an atomic layer somewhere if you want to ensure you're not double processing anything.

Kafka can have infinite retention (if you got the storage/$), and you don't need to have multiple streams to service multiple consumers. Each consumer stores where they are in the stream, and can traverse as needed. You'll need to be careful to make sure that a single consumer is handling a single partition to promise that you'll only process a message once.
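A toy in-memory model of the distinction drawn above, with made-up class names and none of the real systems' durability, partitioning or acknowledgement behaviour: a queue hands each message out once and forgets it, while a log retains entries and lets every consumer keep its own offset.

```python
from collections import deque
from typing import Deque, Dict, List


class InMemoryQueue:
    """Messages are removed on delivery, so each one reaches a single consumer."""

    def __init__(self) -> None:
        self._items: Deque[str] = deque()

    def publish(self, msg: str) -> None:
        self._items.append(msg)

    def consume(self) -> str:
        return self._items.popleft()


class InMemoryLog:
    """Messages are retained; each consumer tracks its own read position."""

    def __init__(self) -> None:
        self._entries: List[str] = []
        self._offsets: Dict[str, int] = {}

    def append(self, msg: str) -> None:
        self._entries.append(msg)

    def poll(self, consumer: str) -> List[str]:
        start = self._offsets.get(consumer, 0)
        self._offsets[consumer] = len(self._entries)
        return self._entries[start:]

    def seek(self, consumer: str, offset: int) -> None:
        self._offsets[consumer] = offset  # rewind to replay from an earlier point
```

Two consumers polling the same log both see every entry, and `seek(consumer, 0)` replays from the start; with the queue, a second set of consumers needs its own copy of the messages (the fan-out pattern mentioned above).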

Managing streams can be a headache, but less so now if you have money to have Amazon or Confluent manage it for you. They offer pretty much unlimited scalability, and are the production grade solution for a ton of problems.

Queues are really simple to understand and build and still scale pretty dang well. Just make sure your message processing is idempotent and make sure you can handle if something is processed multiple times.

I've been interested in this question as well. There's a lot of sources online comparing the two but none really definitive.

RabbitMQ is not suitable for event sourcing. Kafka is. In general, RabbitMQ is a “river” and Kafka is a “lake”.

RabbitMQ has excellent support for complex message flow topologies. Kafka out of the box does not provide these features.

I highly recommend all of Jack's blog posts about RabbitMQ - https://jack-vanlightly.com/blog/tag/RabbitMQ

Jack works with me on the RabbitMQ core engineering team. We've been hard at work to address a lot of the issues brought up in comments here. It's worth it to try out our latest releases. The engineering team is very active with the community and takes all constructive, helpful (i.e. reproducible) feedback seriously. Feedback is encouraged via the rabbitmq-users mailing list. Thanks.

I really like RabbitMQ. But I really dislike the database that it relies on, Mnesia. I had a client that, because of licence issues, could only do one operation at a time in the ERP software. So I used RabbitMQ to line up the requests and do one at a time. It worked great, was fast and low on resources. But the power supply at the place was a problem, and more than once the place had a blackout, and when power returned Mnesia messed up and lost the queues. So I ended up just making my own simple queue using sqlite on the server.

Debated using RabbitMQ but decided the infrastructure overhead was too high.

Ended up looking into `rq` and `arq` which were both excellent!



Would recommend if you're looking for a (faster) worker queue without all the overhead (in my case, didn't need all the other features that came w/ RabbitMQ so this got the job done).

We use ZeroMQ a bit. It's been pretty much flawless as far as I can see but I get the impression that it's becoming obsolete. Is RabbitMQ a viable replacement?

The "MQ" in "ZeroMQ" is misleading, so this is an apples-to-oranges comparison. ZeroMQ is a socket abstraction that allows you to build apps that send messages to each other. RabbitMQ is a reliable message queue broker; a central server that stores messages and that clients connect to in order to push/pop them.

You might find it interesting to note that Pieter Hintjens was one of the core authors of the AMQP 0-9-1 Specification [1], which RabbitMQ implements.

ZeroMQ was born out of a frustration with complex routing patterns and the need for a broker-less architecture for maximal performance message delivery.

[1] https://www.rabbitmq.com/resources/specs/amqp0-9-1.pdf

He and Martin Sustrik both created ZeroMQ. Then after that, they saw some of the limits of ZMQ and created nanomsg. It was exciting to see what cool stuff they were working on. It's a little hard to see ZeroMQ become abandonware from them. That said, the community around ZeroMQ is solid and supportive, which I would say is actually the best part. In other words, you can tell a project has staying power when the original creator no longer has to be there to maintain it.

I'll add that the AMQP 1.0 spec (supported in Rabbit using a plugin) is a peer-to-peer protocol that supports both the traditional broker use case, 'direct' p2p messaging and opens some interesting uses of message routers like Apache Qpid Dispatch Router.

I am no expert but I have heard PH say that it's much worse than the AMQP-0.9 Spec. It's a design-by-committee thing, where he was sidelined.

No. The only thing ZeroMQ and RabbitMQ have in common are the letters M and Q.

RabbitMQ is a messaging system. ZeroMQ is sockets on steroids.

If the ZeroMQ community seems quieter lately it's because things work well and there's not much left to do within the project's intentionally limited scope. libzmq is certainly maintained.

Our company has been using ZeroMQ for over 8 years. We'll be putting out another ZeroMQ-based open source project soon too.

I think they’re slightly different solutions — ZeroMQ works without a broker, RabbitMQ requires a server process.

If you use the brokerless model, there was a bit of drama over ZeroMQ — the original technical developer (Martin Sustrik) left and created a successor, nanomsg, with what he learned. At some point, Martin lost interest, and Garrett D’Amore took over maintenance and did a rewrite called nng. Both the old nanomsg and nng are maintained, with nng being somewhat actively developed, but also fairly “complete”, so there’s not a lot of excitement like you see with some projects. ;) nanomsg and nng are essentially wire-compatible, so you can mix and match depending on bindings availability for your language.

Yes, my handwavy reading of the situation was that he left due to issues with zeromq that he couldn't/wasn't allowed to 'fix'. Then Pieter Hintjens unfortunately died a few years back. I haven't heard about nng, so thanks for that, I'll check it out.

ZeroMQ certainly isn't perfect, for example there's no way to tell if a message was successfully written to a PUB socket, or if it was dropped (just one minor issue)


Anyway, This is digressing from the main topic

You should take a look at nng (nanomsg-next-gen) [1], which is a successor to nanomsg, which was a successor to ZeroMQ.

[1]: https://github.com/nanomsg/nng

For now, I'll just address the one point -- obsolete? NOT!

We've been working with ZeroMQ a lot over the past couple of years, and have gotten to know some of the maintainers -- we've been very favorably impressed by their ability and dedication.

Pieter Hintjens was the "voice" of ZeroMQ, and with his passing things have gotten a bit quieter, but no less active. (Just take a look at the commit log: https://github.com/zeromq/libzmq/commits/master).

JeroMQ has been nothing but a pleasure. I don't see ZeroMQ being obsolete in its forked forms anytime soon.

We were looking into RabbitMQ but quickly retracted once we realized that it does not support external OAuth2.0 providers in a straightforward way.

I used RabbitMQ to distribute messages between components of a distributed grading service that I wrote in Kotlin and deployed on Kubernetes.

My experiences were pretty mixed. Overall I found it to be more difficult than I would have wanted to get simple things to work. Part of this seems to be a problem with the Java library, which is not great. For example, IIRC you have to be really careful not to create the same queue twice, even with identical configurations, since the second time something blows up. At the end of the day just a simple fan-out configuration ends up involving a lot of somewhat-intricate code. It definitely does not Just Work (TM).

And then there were the bizarre hangs that I would experience during testing. I set up a Docker Compose configuration so that I could test the various parts of the system independently. It included one container running RabbitMQ to simulate the cluster we have running on our cloud.

Usually tests ran fine. But then, from time to time, the client would just hang trying to send a message through RabbitMQ. Unfortunately, again, the code you need to just run a basic configuration using RabbitMQ is complex enough that at first I was pretty sure that I had done something wrong. But after a few hours of increasing frustration I finally broke down and discovered that a simple test case that just sent a single message using code torn right out of the docs would hang. Forever. (Or, long enough that I gave up waiting.)

After a lot of digging I found the culprit. RabbitMQ will just take its ball and go home if the broker doesn't have enough disk space. Given that I use Docker heavily for a lot of projects, the amount available to new containers would vary a lot depending on what other data sets I had loaded or how recently I had run docker system prune.

I filed an issue about this, asking to have a better error message displayed when an attempt to send a message was made. The response was: there's already an error message, printed during startup. You didn't see it? No. I must have missed it among the hundreds of other lines of output that RabbitMQ spews when it starts.

Overall my favorite part of this story is that RabbitMQ chooses to start but refuse to send messages when low on disk space, when just crashing would be much more useful and make it much easier to pinpoint what was going on.
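For test setups like this, a cheap guard is a preflight check that fails loudly when free disk space is near the broker's limit. The sketch below uses only the standard library. The 50 MB threshold reflects my understanding of RabbitMQ's default `disk_free_limit` and should be treated as an assumption, and the check has to run where the broker's data directory actually lives (for example inside the container):

```python
import shutil

# Assumption: roughly RabbitMQ's default disk_free_limit; adjust to your broker config.
DISK_FREE_LIMIT_BYTES = 50 * 1000 * 1000


def broker_disk_ok(path: str = "/", limit: int = DISK_FREE_LIMIT_BYTES) -> bool:
    """Return True if the filesystem holding `path` has more than `limit` bytes free."""
    return shutil.disk_usage(path).free > limit
```

Calling this at the start of a test run and aborting when it returns False turns a silent publisher hang into an explicit failure.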

Anyway, I'm in the market for a simpler alternative that's Kotlin friendly.

Man I love this piece of software.. We used it as a bare-bones msg queue FIFO and some FAN-OUT patterns. Basically only scratched the surface of what is possible. But this beast ran our ETL distributed update system at PriceCheck (S.A.'s largest price comparison service). Haven't worked there now for a few years but back then RabbitMQ was rock-solid for us!

A few years back I wrote a blogpost showcasing a nice use-case for RabbitMQ and Elixir


Using it in multiple projects: The software itself is great and provides great value.

The only pitfall is the available libs. Especially with the .NET implementation we had quite a lot of trouble. It's not following current .NET patterns and has strange quirks. Does anyone know a good alternative to the "official" one?

> Especially with the .NET implementation we had quite a lot of trouble. It's not following current .NET patterns and has strange quirks.

It would be great to get specific, actionable feedback on your experience, either via a message to the rabbitmq-users mailing list or via a GitHub issue. The .NET client is an old library but considerable improvement effort went into version 6.0. The plan for 7.0 is to address old patterns that remain in the library. Feedback would help guide that effort.

I just released version 6.1.0-rc.1 and would appreciate testing if you have time. Thanks!

The biggest issues are the public API surface.

If the library were being designed from scratch today, pretty much every method on the model would be Async. After all, if it leads to any network I/O of any kind, that can block.

Working with the current public API, trying to implement a publish wrapper that never blocks, and returns a task that either completes when the publisher confirm is received, or faults after some provided timeout, is a lot trickier than it might sound.
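The shape of such a wrapper, sketched in Python's asyncio rather than .NET and with the client library replaced by a hypothetical `send(seq, body)` hand-off, might be: keep one future per publish sequence number, resolve it when the broker's confirm arrives, and bound the wait with a timeout.

```python
import asyncio
from typing import Callable, Dict


class ConfirmTracker:
    """Maps publish sequence numbers to futures resolved by broker confirms."""

    def __init__(self) -> None:
        self._pending: Dict[int, "asyncio.Future[None]"] = {}
        self._next_seq = 1

    async def publish(self, send: Callable[[int, bytes], None],
                      body: bytes, timeout_s: float) -> int:
        """`send` is a hypothetical non-blocking hand-off to the client library."""
        seq = self._next_seq
        self._next_seq += 1
        fut = asyncio.get_running_loop().create_future()
        self._pending[seq] = fut
        send(seq, body)
        try:
            # Raises asyncio.TimeoutError if no confirm arrives in time.
            await asyncio.wait_for(fut, timeout_s)
        finally:
            self._pending.pop(seq, None)
        return seq

    def on_confirm(self, seq: int) -> None:
        """Call this from the library's confirm callback (on the event loop thread)."""
        fut = self._pending.get(seq)
        if fut is not None and not fut.done():
            fut.set_result(None)
```

This leaves out the hard parts the comment alludes to: multiple-message confirms, nacks, and what to do with pending futures when the channel dies.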

Recovery from network interruptions is complicated, and auto-recovery features are limited, and in some use cases actually dangerous. For example, if you are manually acknowledging messages to ensure end-to-end at-least-once delivery, then you cannot safely use the auto-recovery, since the delivery numbers would reset when the connection does, and you can accidentally acknowledge the wrong message with delivery tag 5. (Acknowledge the new one, when you were trying to ack the old one).

In my implementation, which included my own recovery, I ended up needing to pass around the IModel itself with the delivery tags, so I can check if the channel I am about to acknowledge on is really the same one I received the message on. (There is no unique identifier of a channel instance, since even the channel number is likely to get re-used).
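The same idea, sketched in Python with a stand-in channel class (real client libraries differ, and `FakeChannel`, `Delivery` and `safe_ack` are names invented here): carry the channel instance alongside the delivery tag and refuse to ack if the channel has been replaced.

```python
from dataclasses import dataclass
from typing import List


class FakeChannel:
    """Stand-in for a client channel object; real libraries differ."""

    def __init__(self) -> None:
        self.is_open = True
        self.acked: List[int] = []

    def basic_ack(self, delivery_tag: int) -> None:
        self.acked.append(delivery_tag)


@dataclass(frozen=True)
class Delivery:
    channel: FakeChannel  # the exact channel instance the message arrived on
    delivery_tag: int


def safe_ack(delivery: Delivery, current_channel: FakeChannel) -> bool:
    """Ack only if the message arrived on the channel we are about to ack on."""
    if delivery.channel is not current_channel or not current_channel.is_open:
        return False  # tag belongs to a dead channel; expect a redelivery instead
    current_channel.basic_ack(delivery.delivery_tag)
    return True
```

The identity check (`is`) does the job of the missing unique channel identifier: after a reconnect the new channel is a different object, so a stale tag 5 can never be applied to it.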

Thanks for taking the time to respond. I created this issue so that this feedback is not lost - https://github.com/rabbitmq/rabbitmq-dotnet-client/issues/84...

If you have code you can share that you used to address shortcomings in the client, we could get ideas from it for the next major release. Cheers!

MassTransit is great, the maintainer is very active on Discord, and since quarantine has been streaming every Thursday night (for my UTC-5 anyway). Documentation quality has increased greatly the last few months as well.

You might want to look into EasyNetQ[0]. I've not played with it much but it appears to be a cleaner, more modern abstraction over the existing .NET Client. I'm not sure whether it fixes all the 'quirks' in the client however (I've run into them too :))

[0] - https://github.com/EasyNetQ/EasyNetQ

It's been a few years since I used it, but I used EasyNetQ for a while when I was working with RabbitMQ and it was great. A quick peek at GitHub shows that it still seems to be actively maintained. Maybe it's what you're looking for: https://github.com/EasyNetQ/EasyNetQ

I've got a connection/channel question for those who have built solutions with rabbitmq-- how did you decide as to how many connections and channels-per-connection to use? Does connection pooling even make sense for RabbitMQ? My impression is that channel pooling may make more sense. Thoughts?

An application usually has one connection, and many channels. Our pattern is to dedicate one channel for all publishing and then N channels mapped to consumer threads.

You don't have to pool connections as channels are multiplexed by them.

Things to watch out for:

- opening too many channels - these map to Erlang processes and can overwhelm your server if you go over ulimits

- sharing consumer channels between threads - you might see weird behavior (e.g. acking wrong messages etc)

We've built our own library/framework for creating resilient consumers, and it enforces a 1:1 mapping of channels to consumer threads, as well as automatic reconnections and channel clean-ups.

+1 for everything that's been said. Another thing to consider is message throughput, if that's a concern. In the case of multiple channels per single connection, note that a connection is a single TCP connection such that multiple channels contend for the TCP stream. At the same time, connections aren't completely free either.

The general takeaway from this should be: if you've got a particular stream of messages (either a producer or a consumer) that pushes many thousands or even tens of thousands of messages per second, use a separate TCP connection. For anything else that is slower (dozens of messages per second), multiple channels on the same connection work great.

One last consideration is that when a given channel misbehaves or you perform an operation that the broker doesn't like, the only recovery that I've seen is to shut down the entire connection, which can affect other channels on the same connection.

I used RabbitMQ together with python and celery quite extensively and it scales really well. One thing we had trouble with, though, was finding a nice mechanism to schedule tasks, e.g. “Run this task 12 hours before departure”. Maybe AMQP is the wrong place to solve that problem.

I've been using something like this for exponential backoffs, but I think it'd work for this case as well.

Let's say you've got one exchange and one main queue for processing: jobs.exchange and jobs.queue respectively.

If you need to schedule something for later, you'd assert a new queue with a TTL for the target amount of time (scheduled-jobs-<time>.queue). Also set an expiry of some amount of time, so it'd get cleaned up if nothing had been scheduled for that particular time in a while. Finally, have its dead-letter-exchange set to jobs.exchange.

This could lead to a bunch of temporary queues, but the expiration should clean them up when they haven't been used for a bit.
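A rough sketch of what the arguments for such a holding queue could look like. The x-message-ttl, x-expires and x-dead-letter-exchange keys are RabbitMQ's queue arguments; the function name, the 60 second grace period and the choice to make the expiry longer than the TTL are my own assumptions. It also assumes jobs.exchange routes to jobs.queue regardless of routing key (e.g. a fanout exchange); otherwise an x-dead-letter-routing-key argument would be needed too.

```python
def delay_queue_spec(delay_ms, target_exchange="jobs.exchange", idle_grace_ms=60_000):
    """Return (queue_name, arguments) for a per-delay holding queue."""
    name = f"scheduled-jobs-{delay_ms}.queue"
    arguments = {
        # Messages sit here for delay_ms, then get dead-lettered.
        "x-message-ttl": delay_ms,
        # Expired messages are republished to the main exchange.
        "x-dead-letter-exchange": target_exchange,
        # The queue is deleted after being unused this long. It should exceed
        # the TTL, or the queue could vanish with messages still waiting.
        "x-expires": delay_ms + idle_grace_ms,
    }
    return name, arguments


name, arguments = delay_queue_spec(12 * 60 * 60 * 1000)  # 12 hours
print(name)
print(arguments)

# With pika's BlockingChannel this would be declared roughly as:
#   channel.queue_declare(queue=name, durable=True, arguments=arguments)
# and jobs are then published straight to that queue via the default exchange.
```

Putting the delay in the queue name matters because re-declaring an existing queue with different arguments is rejected by the broker.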

Haven’t used this myself, but there seems to be a plugin for delaying message delivery: https://www.rabbitmq.com/blog/2015/04/16/scheduling-messages...

You can schedule tasks very easily with the Celery Beat Scheduler. I've used it in a production system to kick off big jobs and notifications.


You're usually stuck polling for stuff like that. I'm a big fan of using things like Advanced Python Scheduler for those sorts of tasks: https://apscheduler.readthedocs.io/en/stable/

Yeh I would argue don't use a message queue for this, they're really best processing many messages quickly, there are plenty of scheduling libraries that have various persistence layers to handle this depending on your ecosystem.

Celery has eta/countdown params that allow for running tasks at a specific time
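For the "12 hours before departure" case, a small sketch of computing the eta. Only the datetime arithmetic is runnable here; the Celery calls are shown as comments, and send_reminder and booking_id are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def reminder_eta(departure, lead=timedelta(hours=12)):
    """Return the UTC time at which the reminder task should run."""
    if departure.tzinfo is None:
        raise ValueError("departure must be timezone-aware")
    return (departure - lead).astimezone(timezone.utc)


departure = datetime(2020, 6, 1, 18, 30, tzinfo=timezone.utc)
eta = reminder_eta(departure)
print(eta.isoformat())  # 2020-06-01T06:30:00+00:00

# With a Celery task (send_reminder is a hypothetical task):
#   send_reminder.apply_async(args=[booking_id], eta=eta)
# or, relative to now, in seconds:
#   send_reminder.apply_async(args=[booking_id], countdown=3600)
```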

you can use a redis sortedset https://redis.io/commands/zadd

My biggest issue with RabbitMQ is the only official erlang downloads for windows binaries are from the official website and slow as all getout in most of the world.

I really don't get why they don't publish at least the windows binaries with their github releases.

I tend to see people preferring RabbitMQ over Kafka and vice versa as if they were products solving the same problems, but they are not. They do have in common the fact that they help decouple applications, but in different ways. Both are great.

Everywhere I've worked in the last 10 years has been a cornucopia of databases, programming languages, cloud platforms, linux flavors and everything was different except for one thing: they all used RabbitMQ.

My only experience with RabbitMQ is managing an Openstack environment. In that environment, it's a huge resource hog and we had to put it on 3 separate bare metal instances to keep it stable.

I read it but I still don't understand. I would like to see a very simple example of something that can't be solved with CRUD, and can be solved with RabbitMQ.

Really really bursty loads. You have a customer upload a data file and you have to process it. If you crud it, you have a worker chopping it apart and making sync API calls. If something fails in the middle, it has to retry, but what happens if your container/database goes down when you're halfway through? Now you have to reprocess that file again, etc.

You move this to a queue, and have a worker chop that data file up into individual records, those records go onto a queue, and you can process them however you want, no worries about something crashing and not being able to be retried. If the database goes down, everything just pauses until it can go again. You can limit the queue throughput to whatever you want to avoid having to scale your API/Database.
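A toy sketch of that flow, using an in-process deque as a stand-in for the broker (so none of this is RabbitMQ API, and all names are made up). The file is chopped into records, each record is processed on its own, and a record that fails goes back on the queue instead of forcing the whole file to be reprocessed.

```python
from collections import deque


def split_into_records(file_text):
    """One record per non-empty line; a stand-in for real parsing."""
    return [line for line in file_text.splitlines() if line.strip()]


def drain(queue, handler, max_attempts=3):
    """Process records one by one, requeueing any that fail."""
    done, dead = [], []
    while queue:
        record, attempts = queue.popleft()
        try:
            handler(record)
            done.append(record)
        except Exception:
            if attempts + 1 < max_attempts:
                queue.append((record, attempts + 1))  # retry later
            else:
                dead.append(record)  # give up; a real system might dead-letter it
    return done, dead


failures = {"b": 1}  # record "b" fails once, simulating a brief outage


def flaky_handler(record):
    if failures.get(record, 0) > 0:
        failures[record] -= 1
        raise RuntimeError("database unavailable")


queue = deque((r, 0) for r in split_into_records("a\nb\nc\n"))
done, dead = drain(queue, flaky_handler)
print(done)  # ['a', 'c', 'b']
print(dead)  # []
```

With a real broker the requeue step would be a negative acknowledgement or a dead-letter route rather than an append, but the shape is the same.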

Can you handle stuff via all CRUD sync APIs? Sure, just like you could handle running a restaurant where you have one person who takes the order and cooks it and delivers it to a table. However, it's more efficient to have a waiter (API) take requests and give them to a cook (queue based async worker) to handle stuff that's not as time sensitive. This saves you a lot of money in certain situations.

One common way queues can be used is to give an async-like feel to your applications and flatten out spikes of activity without having to add hardware.

So, for example, you would have a CRUD that takes requests, and when there is background work to be done, places a message on the queue, and immediately returns to the user. This frees up the server for more requests. Meanwhile in the background, a worker process chugs through the queue and does its work. During long spikes it will take longer to get through the queue, but your end users will not have disruption of service.

I've been meaning to give RabbitMQ a try in the last few years, but our good old beanstalkd is serving us well. It has all the features we need, and it just works.

Given the recent "boom" of MQTT, anyone use RabbitMQ for MQTT clients? Any benefits of using it that way over using MQTT-only brokers?

I do and it's great. It doesn't have some MQTT features such as persisted messages or QoS 2, but if you don't need those, it's a fine MQTT broker.

Wait, what are you talking about? RabbitMQ does have persistent messages, you just need to set the queue as "durable", and the messages persist even during failures.

Indeed, I was a bit confused. I remembered having to find a workaround because I couldn't use retained messages. It's actually only broken for subscribers with wildcards: https://github.com/rabbitmq/rabbitmq-mqtt/issues/154

Would anyone using RabbitMQ as a replacement for GCM/FCM on Android mind sharing their experiences?

Doesn't keep the messages for historical analysis. No deal. Kafka please.

At work we built a microservice-like (more like meso services) architecture which uses RabbitMQ for messaging.

RabbitMQ itself is great, but there are some downsides to this architecture:

* Lots of tooling (for blue/green deployments, load balancing, autoscaling, service meshes etc.) assumes HTTP(s)+JSON or GRPC these days

* Getting people who aren't deep into software engineering to write a service that connects to RabbitMQ has a much higher perceived hurdle than making them write a HTTP service

* Operations is different than with HTTP-based services, and many operators aren't used to it

TL;DR: it's more of a niche product for inter-service communication, which comes with all of the problems that niche products typically face.

Does anyone know of any big name brands using RabbitMQ? And if so, what specifically for?

While an official list of customers can't be published, you can get some ideas from the speakers at the last two RabbitMQ summits - https://rabbitmqsummit.com/

Also, see the following articles:

Laika - https://www.rabbitmq.com/blog/2019/12/16/laika-gets-creative...

Bloomberg - https://tanzu.vmware.com/content/rabbitmq/keynote-growing-a-...

Goldman Sachs - https://tanzu.vmware.com/content/rabbitmq/keynote-scaling-ra...

Softonic - https://www.cloudamqp.com/blog/2019-01-18-softonic-userstory...

I recommend going through presentations of the RabbitMQ Summit. https://www.youtube.com/channel/UCp20sSF_JZv5aqpxICo-ZpQ

There are some big companies talking about their experience.

We use it heavily at Reddit as well.
