Hacker News new | past | comments | ask | show | jobs | submit login
The Big Little Guide to Message Queues (sudhir.io)
295 points by sudhirj on Dec 31, 2020 | hide | past | favorite | 108 comments

The decoupling narrative is oversold for queues.

There's essential decoupling and accidental decoupling; decoupling you want, and decoupling which mostly just obscures your business logic.

Resilience in the face of failure, where multiple systems are communicating, or there's a lot of long-running work which you want to continue as seamlessly as possible, is the biggest essential decoupling. You externalize transitions in the state machine (edges in the state graph) of the logic as serialized messages, so you can blow away services and bring them back and the global state machine can continue.

Scaling from a single consumer to multiple consumers, from multiple CPUs to multiple machines, is mostly essential decoupling. Making decisions about how to parallelize subgraphs of your state machine, removing scaling bottlenecks, is an orthogonal problem to the correctness of the state machine on its own, and representing it as a state machine with messages in queues for edges helps with that orthogonality. You can scale things up without external persistent queues but you'll end up with queues somewhere, even if it's just worker queues for threads.

Accidental decoupling is where you have a complex state machine encapsulating a business procedure with multiple steps, and it's coordinated as messages between and actions in multiple services. The business logic might say something like: take order from user; send email notification; complete billing steps; remove stock from inventory system; schedule delivery; dispatch stock; etc.

All this logic needs to complete, in sequence, but without higher order workflow systems which encode the state machine, a series of messages and producers and consumers is like so much assembly code hiding the logic. It's easy to end up with the equivalent of COMEFROM code in a message system.


This is touched on in the article, but I should probably expand on it a bit more. A queue is a transport tool that allows moving information between the stages of the workflow, but using a queue doesn't mean you have an excellent workflow. The decoupling business advantages come mainly from having the workflow, not necessarily using a queue as a building block for it.

When you do use a queue as a building block in a workflow you get some technical and operational advantages, maybe those are oversold.

Will think about how to advocate business decoupling using a workflow as a the zeroth step before getting into queues.

Message queues are one of these things that sound great in theory but IME has simply been a way of adding more failure modes and points of failure to chase the (largely unjustified) goal of systems decoupling.

There are two types of people in the world: those who divide the world into two kinds of people and those that don't. As one of the former, I like to view MQs as being essentially one of two types:

1. Stateless

2. Stateful

Stateless means if there's no one there to handle the message, it is simply lost. The benefit of course is you don't need to consider all the complexity of persisting messages. But a stateless MQ is really just a routing / service discovery architecture. Now that's completely fine to abstract that away but it's not really that exciting. It's still synchronous delivery, essentially.

Stateful OTOH means you really can queue things up and deal with them later. Unfortunately you're now dealing with:

- Reliability: what if the persist call fails? Do you have enough storage? Can your persistence medium keep up with the volume of reads and writes?

- Is delivery guaranteed or is it simply best-effort? This is a big one. Best effort is MUCH less problematic but doesn't tend to be what people want. You see guaranteed delivery in a lot of enterprise MQ software (eg TIBCO certified messaging) where you have things like MQ integration with two-phase commits and the like.

3. What ordering guarantees are there? Guaranteed ordering or best-effort or no promise?

Basically you've probably just created another database. Worse, that database may be different to your other databases.

Also, is your MQ a FIFO or some form of priority queue? If it's a FIFO, how do you handle throttling and buffering?

The post describes the apocryphal MQ as having infinite capacity. No MQ is.

IME the supposed benefits have been outweighed by all the negatives.

> Basically you've probably just created another database

I think that's a fair assessment of persistent queues, using "database" in the most general sense.

> Worse, that database may be different to your other databases.

Different isn't inherently worse. It's just a different tool for a different job than a traditional RDBMS. The problem only occurs when we pretend they're the same.

while this is a good overview, i think this article paints too rosy a picture of message queues, specifically

> We can put as many messages as we want into the tube (let's assume we have a infinitely long tube) at whatever speed is comfortable to us.


> The receiver will never be impacted by our actions—they will pull out as many messages as they want at whatever rate is comfortable to them.


> Neither the sender nor the receiver are concerned with how the other works.

First, I would argue that the sender and receiver still absolutely have to know how the other works, because your API contract is now the structure of all messages that enter the queue. You also need to be aware of the semantics of both client and receiver: if I receive a message I can't process, can I get rid of it? Do I need to retry it forever? And as a sender, will my receiver retry this message or do I need to track success? In synchronous systems you have the opportunity provide clearer feedback in these scenarios, which help mitigate or at least share responsibility in mitigating poison pill scenarios.

It's also the case that when you introduce a queue you're adding an arbitrary buffer. During short transient outages this is usually ok, as your system has some slack somewhere at some time to drive down the pileup, but during persistent outages you can end up driving up a difficult to overcome buffer. And worse, in the event of time sensitive message a substantial amount of that buffer may be otherwise useless. Unless your receiver knows whether a message can be safely ignored, you can encounter outage scenarios that take hours or even days to recover from and exacerbate impact after recovery compared to if you just had no queue and shed load in the first place.

This was one of the things that made me look into lean manufacturing (book recommendation: The Machine That Changed the World) and lean product development (book recommendation: The Principles of Product Development Flow). The latter is more relevant for software systems, but the former was very good to have read first to see where software is different.

These areas can teach us a lot about the dangers of queues and how to handle them carefully.

Thanks for the points, yeah, definitely need to cover poison pills and dead letter queues.

Re. size the main examples I'm looking at are SQS, Google's Pub/Sub etc which are pretty hard to fill up or overload.

Will think about how to talk about recovery measures. The handling is very application specific, it seems that guidelines will boil down to

1) if your messages are important, suck it up and add more subscriber processing capacity or

2) if they're not important, purge the queue and restart your subscribers.

Not sure I want to get into this kind level of advice without looking at the nature of the messages in the system.

> Re. size the main examples I'm looking at are SQS, Google's Pub/Sub etc which are pretty hard to fill up or overload.

It's not the size of the queue supported by the queue service that's the problem, it is the size of the queue in relation to the receivers capacity to process it. The infinite buffer can become a liability when your receivers are unable to handle it.

> 1) if your messages are important, suck it up and add more subscriber processing capacity or > > 2) if they're not important, purge the queue and restart your subscribers.

But what if things aren't so binary? Say I have a queue of emails that have backed up because my email service (or one of its dependencies) had a multi hour outage on black friday. Some of these messages are time sensitive ("We're packing it!" for a delivery that already happened) and some must be sent ("Your account is being terminated, this is the last notification"). You would like to discard messages like the former (which are certainly the overwhelming bulk of your messages) while making sure that you absolutely send the latter. Oh, and it turns out your email service has a fixed rate limit on your account and they won't raise it. This stuff happens.

Another example I've seen is a queue used as a workflow system: I enqueue commands (add X to Y, ensure X is in state Z, remove W from Y) in a backing system because, right, sometimes the dependencies for doing these operations are down for a little bit. But then one day they are down for a long-time, and the automation that has been set up on top of the external API has enqueued A LOT of commands. And a lot of those commands are now useless, things like an unprocessed "add X to Y" when there's a "remove X from Y" just a couple thousand messages later. My dependency is working to come back up to full capacity and isn't going to give me a higher rate limit, so I need to figure out how to drain this queue faster to get back to my normal line rate.

Another way to put this is that queues introduce bimodal state. When the queue is empty or only experiences temporary periods of overfill, my service runs in one way and has predictable behavior. But when my queue becomes chronically overfilled, my service is in an entirely different state. This is certainly bad news for my system, but can also be bad news for adjacent parts of the greater system that expect certain behaviors out of my service.

Yeah... let me think about this. Not sure how to boil this down to generalized notes or advice.

Could you purge conditionally? Eg Purge where message importance < critical.

Depends on the particular system you’re using. Some might allow it, or you could do it with a script if you really wanted to.

you’d have to build that into your system, or build a synchronous system in front of you queue to provide these semantics. and does critical imply time sensitivity, necessity of delivery, or both?

My guide to message queues: Never use them unless you have no other option.

Message queues are an organizational band-aid for lack of architecture and agreeable contracts between systems. RPC is always going to be a more reliable approach.

If something goes wrong with a complex system, I would much rather an exception be raised immediately and crater the caller than have a message wind up not being answered silently and having to remember to check other areas of the system after the fact.

I have worked in environments that used messaging on a massive scale, and can vividly recall having to write adhoc software to reprocess 3+ gigabytes worth of messages that failed to be picked up by another system.

A simple example would be a "Constraint Warning" message that might be produced by some factory tool. Perhaps you want to make sure all lots that tripped that warning are handled in an extra QC phase to double check things. Perhaps this is another system listening for these messages. What happens if that system fails to see these for whatever reason and you dont have a full-time employee monitoring dead letter queues? You might end up with weeks worth of broken products in your logistics chain and millions of dollars in damages before you figure out what went wrong.

The example of factory QA is really an OLTP / near real time system, for which queues are definitely a bad idea.

Any system where you "would much rather an exception be raised immediately and crater the caller" is by definition an OLTP type system, which is a really bad fit for queues.

Lots of bad experiences with queues don't necessarily mean that queues aren't useful tools. They're just tools—they have a time and a place. If you've tried integrating with an email or SMS gateway, or a legacy rate limited bank you might have a different experience. Or dealing with slow job execution combined with spiky high speed job creation.

Depends entirely on the problem, and probably the solution team and constraints as well. Don't think it makes sense to dismiss them out of hand for all problems, though—just as it doesn't make sense to claim that they should be used for each and every problem.

There is some nuance here. My contention is with queues as a means to communicate between systems. Queues within systems are totally reasonable.

For instance - You have 2 services: Bank Core Integration and Teller. The Bank Core Integration service contains the queue of items that need to ultimately be pushed to the underlying legacy, rate-limited system. The teller service operates in RPC terms with the bank core integration service, so callers into the problem area are isolated from the semantics.

This might sound like a subtle difference, but it is huge in practice. If there is some problem with that queue, it is entirely contained to that one service and can be dealt with in isolation. Logs from that one service should comprehensively document the concern. If there was a message bus between these systems, you would have to go back and review message logs to see if things got missed between systems.

This eventually becomes a microservices vs monolith discussion though.

Implementing both the message queue and the bank core integration service within 1 service makes for less communication/network overhead and is easier to debug, which is definitely valuable.

Seperating them out means you can use off the shelf products for the message queue and is easier to scale, which can also be valuable.

Makes sense, yeah. If a system architecture demands that failures be cascaded downstream to all dependencies, then queues should not be used. There are definitely architectures for which this is a reasonable demand.

This is severely overgeneralized advice. Some traffic needs to be processed right now, and if it can't be processed within a narrow timeframe a failure needs to be acknowledged and handled. Other traffic can wait an hour, but needs to eventually succeed. Why conflate the two?

> What happens if that system fails to see these for whatever reason and you dont have a full-time employee monitoring dead letter queues?

"What if you fail to monitor correctly" is a bad scenario no matter the architecture, but doesn't critique message queues. Any sensible message queue will have lag and backlog metrics on subscriptions, which you can monitor just as easily as an RPC end point.

> RPC is always going to be a more reliable approach.

Seems like an odd statement. RPC seems considerably less reliable to me since it ties your business logic services states' together.

> I would much rather an exception be raised immediately and crater the caller

Why? The caller might not care. And what if you have a large graph of callers? You're going to end up with a cascade of failures, when really none of them are going to do anything except retry or drop - something that the failing service is capable of making the call to do in most cases.

> I have worked in environments that used messaging on a massive scale,

Same, and in my experience it was the RPCs that caused the most trouble. I don't think anecdotes are going to be helpful here.

> Perhaps you want to make sure all lots that tripped that warning are handled

If you want to make sure of something, and then you don't make sure of it, I don't see how that's the fault of a queue. Just... check?

counterpoint: you can implement RPC using queues. queues are just one more tool in the toolbox. it's not the screwdriver's fault if it's used to hammer nails.

Author here. This was self-posted, so hope you find it useful. Please ask me anything.

Great writeup. One thing that could be expanded upon could be the "append-only log" approach to message queues, which is the abstraction Kafka is using. The paradigm itself is extremely intuitive and performant. Producers write to the end of the log, and consumers maintain an offset of how far they've gotten.

That pattern is really powerful and simple, even if the main tool using it (Kafka) ends up being difficult to operate.

Thanks for the guide, as others have said this is good introductory reading for new engineers. I know I really only learned about dedicated message queues 1-2 years into my career and felt really dumb for not having known/thought about message queuing before.

One thing you might consider adding are more examples for different popular queuing systems and how they differ from one another. The software I always reach for is nsq (https://nsq.io/) because it's meant to run co-located with the message producers and readers are supposed to connect to multiple instances where the messages are produced (using a lookup daemon). This is quite different from the queues on your list, so much so that I'd consider adding it just because it works so differently.

Will add NSQ notes. Researched it once for distributed web socket pub sub, did find it pretty interesting.

Added NSQ notes.

Great write up. I think you have clearly laid out the things that a newcomer to the space should know about queues. Bookmarked!

Talking to a bunch of engineering teams I found that some use case for queues are very generic (almost identical use case and implementation across teams). Specifically webhook handling is something that keeps coming up. We've been working for a few months of a queue that's specifically for ingesting and delivery of webhooks. Do you see a future for use case specific queueing systems instead of defaulting to a general purpose queue?

In our case we abstract the actual implementation and behave more like you would expect a standard webhook.

For reference, it's https://hookdeck.io

The way I’ve handled incoming webhooks is to run a Lambda+API Gateway to put them into an SQS queue, and the inverse, SQS to Lambda, for sending webhooks out.

This was possible only because AWS provides these services, of course. If you’re offering an infinitely scalable HTTP endpoint to soak up webhooks and allow me to query them at my leisure, or put them into a queue for me, that would be useful.

I haven’t looked into hookdeck in detail yet, will post again once I do.

That's essentially it.

We've heard from teams having issues dealing with large uncontrollable spikes from their webhook providers and we can smooth out that out. There's additional benefits that can be introduce before it gets to your own infra such as verifying signatures, filtering events, etc.

API Gateway + SQS + Lamda is definitely a common and good approach. My understanding is that you often start running into into other problems. Hitting DB connection limits from serverless invocation is a recurring one! I'm hoping we can make the troubleshooting / replayability easier as well.

Thanks for sharing your approach and opinion! Hoping to hear more!

Are you aware of any nice articles or open source projects for setting up the ability to dispatch webhooks events? I am currently thinking of using GCP PubSub for it and have it consumed by consumer (GCP Cloud functioon?) which does the network request, and requeues it back to the topic when its non 200 response. If it keeps failing 10 times it will get send to a dead letter queue.

Not off the top of my head, no. Your plan sounds good except for the "and requeues it back" part. Ideally you should just ignore failure (don't acknowledge/delete) and have the queue control plane decide when and how to retry—unless you have special logic (exponential backoff?) around that. If you do need to re-queue, just make sure you re-queue before you delete/ack the current message, otherwise you might lose jobs.

Is that something you would be interested in a hosted solution for? We've built the infrastructure to deal with incoming webhooks but we've been exploring the idea to also leverage it for dispatching. Turns out the same infrastructure works both ways!

I am from an embedded background/Operating system background. I would like to know if the ring buffers/circular queues in drivers or stacks suffer same problems as people are discussing here?

Thank you for putting this together. I am reading it with interest.

I spotted one small typo and thought you would like to know:

“like transferring information form your software into an email or an SMS on the cellphone network.”

Thanks, fixing.

What web framework is used for this blog? Really liked it

It’s Ghost (Ghost.org). The theme is Dawn (it’s on the official theme marketplace). My pandemic project is to offer Ghost hosting (https://ghosting.dev) - let me know if you want a free personal setup.

No mention of MQTT?

As a protocol? Not sure how it would deserve more mention than JSON or some other wire protocol. Almost every example I’ve given will use a client SDK, which may or many not use MQTT under the hood.

Anything specific you think is worth pointing out?

Not persé as a protocol but one of the implementations like Mosquitto, like RabbitMQ, Kafka, etc... was covered.

I'm putting this one on my new hire required reading list. So often people end up re-inventing message queues and not even noticing they are doing it.

I've bumped into this recently. Someone waved their hand and said that x queue system didn't work for their specific use case. so they spent an innovation token on making a slow unmonitored queuing system. unsurprisingly the project is late, and it only now has started to work on the bits its supposed to handle. (three months after the public announcement)

When I asked for the rationale, they said that x queue system didn't do ordered delivery. The queue system literally has the word "ordered" in it.

Its not really their fault, they project manager should have stopped it. However its still not an excuse to re-invent the wheel.

Yeah, a similar nightmare prompted a lot of this reasearch. A colleague would argue for what felt like hours against SQS because it was “random” ordering, and kept demanding we use something that had strict FIFO. Despite the clients who placed the jobs not being affected by or caring about execution order. The worst part was that we had paid a shit ton of money for many globally distributed executors to increase throughout - I simply didn’t have the vocabulary to explain how asking for strict FIFO would mean only one executor would be running at a time, wasting hundreds of thousands of dollars.

This is along the same lines, but something people don’t often think about when using and naming queues.


> people end up re-inventing message queues and not even noticing they are doing it

Amazing how people sucked down this rabbit hole. My work is doing this right now, and its driving me mad as there are well supported alternatives out there. But I suppose there is something intrinsically fun about building a complex communications system.

I've experienced excellent results building distributed systems where the nodes talk to each over via raw udp. often, in the context of message queues, there is much talk about delivery guarantees, but in many contexts the same goals can be accomplished by designing the system so it can handle lost messages. for example, rather than a node sending a single message notifying that it has started work on something, it sends a message once a second, saying, 'I am currently working on x'. is that reinventing the wheel? to me it just seems much simpler.

Interesting. This is actually sort of a distillation of all the new hire training / discussions I’ve done. It’s as much a “things I wish someone had told me 15 years ago” thing too.

Would you be willing to share the list with us?

Hmm. I really should find time to get my personal site back online. I took it down a month ago to re-face it and life has intervened.

You could just dump it in a gist


If the poster is a point-haired-middle-manager archetype, maybe. More often though, it’s very useful for a manager or team leader to ensure that everyone on the team has a common base of knowledge, concepts and terminology. That way the team members can unambiguously understand jargon (at-least-once), communicate with high bandwidth (“let’s switch RMQ queue X to at-most-once mode, the persistent audit FIFO will handle missed messages”). Even if the team members disagree, at least everyone knows what the disagreement is about. The shared vocabulary is very useful.

thats one way to look at it.

However if you want your employees to grow, you need to teach, encourage, guide and in some cases tell them to stop.

doing this in a transparent and accessible way seems like a good idea to me

API vs MQ : if there is an API you don't need MQ, or will you make a front MQ for every API you're calling ?

This article is inconsiderate of the added complexity of MQs.

How is the payment processor given as example decoupled from the main process from let's say an e-commerce website using a MQ ?

If the payment processor isn't responding, the payment process on the main site is broken and a client can't purchase.

If many clients try again to pay, the MQ will be flooded and attain its max limit. How do you deal with added complexity and extra error handling of the MQ ?

Same question for integrating a new payment processor: with a different API or SDK to integrate, how would that not be the case with a MQ as stated in the article?

A MQ is simply a buffer when you need to scale up service consumption. But services with APIs don't need it, they do rate limiting and if they're down you should better set the status on your program and stop communicating until it's up again.

MQs are not automatic and this article is selling false promises and examples.

I learnt a ton form this talk by Tim Bray about how Amazon does event driven architecture. The talk isn't directly about queues, but they play a massive part in the narrative, including notes about idempotency, duplication, FIFO (or lack of it), and message design. The article adapts a lot of these ideas from a pure queue point of view.


Using an MQ with a payment processor is very common - payment processors do have transient outages and we still want to allow customers to transact. We may also use multiple payment processors in this case and fallback to a more expensive one if we see our preferred one go offline for a while.

MQ's are distinct from APIs. APIs require both sides to be free to talk, while an MQ allows one side to send when it's able and the other to read when it's able which can happen out of sync.

MQs do have limits but this is typically several orders of magnitude higher than an API - a few million queued messages is pretty trivial on any decent MQ cluster.

An MQ isn't a buffer alone, it's also routing, retries, error handling, deadlettering, etc. This is the difference between your crappy in-memory blind buffer and something like RabbitMQ. Yes MQs still require maintenance/attention/etc, they're not magic.

> if there is an API you don't need MQ

Im not sure what you are talking about here. The two concepts are quite different and are meant to be used for different processes.

> selling false promises

AFAIK, the article wasnt selling anything. I think it was a nice intro and overview to message queues.

> if there is an API you don't need MQ, or will you make a front MQ for every API you're calling ?

I don't think this is quite right. An API might use a message queue as a transport mechanism.

The MQ only gives you a way to communicate, it doesn't define how you structure your requests.

in your payment example, if you replace MQ with REST, it gives you the same answer.

Some message queues give you rich diagnostics for when messages fail. Others like NATS only give you the guarantee that the server will be up.

MQs come in many flavours. some act like python's Queue, but allow you to use many machines.

Some are as you say, a buffer.

However I like to use them for connecting n transient instances of a service to an API front end. The routing is handled for you, and you can isolate services from each other.

Before I got my current job I reached the final round (of 4 or 5!!) at a startup where I was failed because I failed to say 'message queue' or mention RabbitMQ (the interview used to work for them I think) for a question of 'how would you implement whatsapp?' I was told I was a good enough lowly developer but not the right stuff for senior.

So perhaps I'm bitter but, while they seem useful for specific purposes, it does feel a little religious and one of these presumed 'must have if we're serious' class of program when other solutions which architecturally resemble them might suit the problem far better.

In my current job we interact with kafka and have definitely encountered many less than ideal aspects of its behaviour. But I guess that must be the junior developer talking again ;)

Great article, dead letter queues might be another thing add to later updates. Also Apache Pulsar might be worth mentioning.

Yeah, forgot to talk about dead letters, will add a section. And notes for Pulsar, once I read up on it.

I look forward to you considering Pulsar. Even though it is the least mature, it is the superset of most of the others listed.

Nice write up. There should be a section on MQ gotchas. Here's my war story:

Back in the days when we didn't know what we were doing, we had a system where MQ was used in Request/Response style for querying databases. On the beginning of this was a web endpoint. Frequently we end up in a cascading failure whenever we hit an overload situation, because the web client would time out, and retry another call. Meanwhile, there's a bunch of query messages enqueued to be executed against the DB despite the fact the client had already disconnected. If you are trying to web scale, you'd end up paying for IO twice, first to query the DB and second to persist the results in a queue. The big take away for me was to be wary of coupling an unreliable synchronous RPC with a reliable asynchronous MQ.

When reading the article I was wondering: what if clients just spam the queue (by retrying) because they don’t see any responses coming their way (because subscribers have a long backlog to get through first).

This article is great! It was long I was looking for a simple article explaining what options are there. I've been developing custom semifunctional queue systems for years, and I always wanted to know why we could not use an existing one. It looks like DIY was the rule back then.

Thanks. Yeah, DIY was pretty much the norm. Jobs as table rows, in memory arrays, the whole deal. Spent quite a bit of my life writing glue and management code that should have just been queues.

Great write-up.

Back-pressure would be a useful concept to add.

Thought it was already implicit in the capacity decoupling, could you elaborate a bit more / add an example. I’ve always thought back pressure isn’t applicable to queues because there’s no pressure to begin with, front or otherwise.

In most cases, queues are going to be a lot faster than whatever they are feeding into. If you are doing something like sending every SQS message to a Lambda with infinite concurrency then, sure, but many queues are feeding into a compute or persistence layer of some kind.

Back-pressure here would mean that emitters of messages into your queue behave differently when there is an excessive number of messages waiting to be processed.

For example:

- If it's a job queue, load-shed less important jobs until the queue is healthy.

- If you are ingesting a high volume of messages, be more aggressive with batching them together before inserting as a single queue message.

Ah, got it. I’m not entirely sure I want to put that in, though. The message I’m “preaching” is that you have full decoupling with queues - not sure that should be diluted. Wouldn’t this kind of checking defeat the purpose of using the queue in the first place? It seems unlikely that Amazon would stop selling stuff just because the order-confirmation-email-sending-service was backed up, for instance. The point of using SQS would be that the sales system wouldn’t have to care.

In the universe of message queues, an email-per-sale queue is not going to be your problem.

Think of the queue-as-message-bus pattern -- very common. You don't want to have a service with millions of pending messages to process, it forces you into painful, slow recovery, or making the tough choice of dropping everything.

Anyway, great write up, and happy new year!

For example, in a near-real-time application it may be better to fail than to have overly delayed message delivery. If a queue becomes too large then we get the idea that messages are not being consumed fast enough (in near-real-time). In this situation continuing to send more messages is just going to make the problem worse (although this is more like a circuit-breaker/fail fast than back-pressure). Also it might be better from a business logic perspective to inform the user of the problem than to apply their input at a later time (eg. a game where user input is unlikely to be intended for anything but the very recent game state).

I think that you description of exactly once messaging is inaccurate

> This is the holy grail of messaging, and also the fountain of a lot of snake-oil.

Exactly once message delivery is quite possible with messaging systems that support transactions. When combined with other transactional resources (e.g. database) and a distributed transaction monitor, exactly once messaging works well and is rock solid reliable. The grand-daddy of message brokers IBM MQ is absolutely capable of exactly once messaging.

AWS makes the same claim with FIFO SQS, and maybe I’m getting it wrong, but these claims 1) have a lot of caveats and 2) only work inside the boundaries of the messaging system.

There’s a note in the next paragraph about how systems manage to say that if you pass in the same message ID / token for X minutes they won’t be duplicated, and my ensuring FIFO there’s a side effect of not giving out the next message until the current one is acknowledged.

This leads to a situation where there’s a guarantee of exactly once acknowledgement, but not necessarily exactly-once processing or delivery. Given that the semantics of at-most-once and at-least-once apply to processing and delivery, I personally don’t think the goalposts should move on exactly once.

Systems claiming exactly-once lull developers into not planning for multiple deliveries on the subscriber, or the need to do multiple publishes, both of which can still happen.

It's better to use a SQS standard queue and have the consuming system provide the exactly-once processing guarantee for various reasons. You will need to introduce something like Redis, if you are not already using it, but I still think it's net superior to using an SQS FIFO queue if you want exactly-once processing.

Might not even need to use Redis. If the message has a proper idempotent ID a transactional database is more than enough. If the consumer is running MySQL/Postgres/DynamoDB etc nothing else is needed.

Not quite, there are always a bunch of edge cases that inevitably make "exactly once" actually "almost always exactly once, but always at least once."

Indeed. "exactly once" violates the CAP theorem, so if you actually make a system that can guarantee "exactly once" then you should apply for a Turing medal immediately.

I think that you are misunderstanding the CAP theorem. The CAP theorem states that in the event of a network partition, a system can either be consistent or available, but not both. So a messaging system that provided exactly once message delivery would not provide availability during a network partition. However, there are many applications for which consistency is more important than availability, especially if the period of unavailability can be limited.

Ah, but "especially if the period of unavailability can be limited" is exactly the type of edge case kasey_junk was talking about. Network partitions may persist for unbounded amounts of time as far as the CAP theorem is concerned, and an unspecified amount of packets may be dropped and/or delayed. It could be the case that every message you send gets dropped due to a persisting partition and in such a case none would arrive, thereby violating the "guarantee" of exactly-once delivery.

In practice I agree that these problems are quite rare since most network are reasonably stable. However, especially at scale it's not rare to see messages dropped or delivered more than once. I have no doubt IBM MQ can achieve exactly-once most of the time, but no distributed system can achieve exactly-once delivery all of the time.

> It could be the case that every message you send gets dropped due to a persisting partition and in such a case none would arrive, thereby violating the "guarantee" of exactly-once delivery.

That is not correct. All interactions between the client and the broker are performed in transactional units. If the transaction in which messages are sent fails to commit, then the messages are not sent, and all work is rolled back. Once a message is successfully send (that is, sent and transaction committed), it will be delivered once and only once to the receiver.

Likewise on the receiving side, a message is delivered and the encompassing transaction is committed once and only once. A message may be delivered more than once if the encompassing transaction is later rolled back due to say network failure. But a message delivery in a transaction that does not commit is not a delivery.

The benefit here is that application programmers don't need to concern themselves with message duplicate checking and the risk that duplicate checking is done incorrectly leading to bugs that are very difficult to identify.

A transaction which is partition-tolerant in the way you're describing requires stronger semantics than mere client acknowledgement, it requires all participants to engage in the consensus protocol. Unless your application joins the message broker's topology as an active member -- some systems do work this way, like Zookeeper -- it can still suffer message loss.

But even if it does join, that's still not sufficient, because these systems can become unavailable during partitions, and that is definitionally incompatible with "exactly once".

It’s been an age since I’ve worked with IBM MQ and there are dials upon dials when setting up MQ based systems but it doesn’t off exactly once in the face of broker failure in most of its HA configurations and it uses deduplication at the protocol level to prevent duplicates.

When people say “exactly once” is impossible they really mean in the face of failure at the queue level.

> When people say “exactly once” is impossible they really mean in the face of failure at the queue level.

And what exactly is impossible with that? Just wait it out, i.e. like all the CP systems do (as per CAP).

The premise is that unavailability is the same as zero delivered messages, not one.

Note none of this is rigorously defined either in the article or with most message queues and the configuration of queues/brokers/clients means that there are all manner of edge cases around delivery guarantees in practice.

"Wait it out" is only a valid strategy when message rates are low enough that you can buffer them all until the network partition goes away again.

As an example, imagine a system sending a million 1 KB messages per second. To survive a 1 minute network outage it would need 60 GB extra storage to park the messages. If the outage lasts longer than it has space available, dropping messages becomes inevitable.

Even in the face of network partitions?

See my answer to WJW above. Yes, even in the face of network partition, but with system unavailability in the event of a network partition.

System unavailability means the messages get delivered zero times, which is not exactly once.

So we're just playing semantic games at this point, using different definitions of terms. The definition of "exactly once" you're using isn't the formal definition.

I'm speaking specifically of linux/unix message queues:

Richard Stevens book APUE, seems to not like Message Queues, as much as FIFOs, especially IIRC, FIFOs work on file descriptors and therefore works nicely with poll/select and also since destruction of FIFOs is handled more gracefully as the connected programs terminates. Is that view still holds good?

I wonder if that view still has currency?

We normally use message queues when we have a distributed problem involving multiple nodes where you can't typically push a file descriptor. A FIFO is still basically a message queue though it has less semantics than we'd typically see.

All the System V IPC things are ugly ducklings imho.

So today, if I want reliable messaging, I use asynchronous, non-blocking functions. Like async/await in .NET. If I want guaranteed delivery I write logic to provide that. Again, using .NET. If in-order delivery is required, again, .NET.

Of course .NET can be replaced by anything that supports async, non-blocking functions.

If instead I used message queuing instead of writing the code myself I'm adding yet another configuration item to my system, replete with security holes, patches, operational management, and sourcing the skills to procure, install and run it. Of course each has idiosyncracies you need to learn and understand - like Rabbit's memory-based queues. When the instance goes, so potentially does reliable messaging.

I'm very much not a fan of message queues. Haven't been for a good many years.

Worth noting - Apache Kafka is not a queue. It's a distributed commit log.

Might add a note, but I don't want to muddy the waters. Technically, neither is the Redis Streams implementation. But they both quack like a queue, and are meant to be used like a queue, so will probably talk about implementation details in a deep dive of some sort.

EventBridge isn't a queue either, actually.

> But they both quack like a queue, and are meant to be used like a queue

Not necessarily. Nothing stops one from using Kafka topic in a fashion similar to a queue but even then, the way this would have been done differs significantly.

Very simplified: in a queue, an available message is picked up by a consumer and must be acked / nacked within a certain period of time. If message is acked, it is considered processed and may be chosen to be removed, depending on the queue semantics. If nacked, it becomes available for reprocessing. Multiple consumers may be processing messages at completely different offsets and a failure of processing at an early offset does not prevent other consumers advancing the queue.

In a commit log, if the consumer is not able to process the message, it has to keep retrying that one message until it can be processed OR ignore failure and publish the message back to the log at a new offset for future reprocessing. Regardless of that, any other consumer (or group) has its own view of the world so two consumers (or consumer groups) will process all messages from the topic.

The semantics are completely different.

True enough. Will think about whether it makes sense to add a section on stream processing, or maybe do a different post. It's an interesting complement to message queues. There's also Kinesis in this space, will look out for others.

Added a note about the consumer groups, and how each consumer group behaves like a queue.

Is that worth noting??

Yes. Kafka does not have a concept of a queue. Messages are written to an append only log and the broker does not track the state of the message. The consumer is responsible for tracking the offset. Kafka is not a queue.

> Given all the things that can go wrong, it's impossible for any messaging system to guarantee exactly-once delivery.

It's possible to guarantee exactly-once delivery, just not in bounded time, but eventually. That's what FLP impossibility is about, impossibility to achieve consensus (e.g. exactly-once) in bounded time. But if you are talking in the context of semantics, it's a totally different thing, it's a computation model and it's absolutely possible to have exactly-once semantics, regardless whether you achieve it by waiting, like consensus-based systems, or by using special data types and operations that can be performed without waiting and eventually converge, like CRDT systems.

Don’t quite get the gist of this... a consensus based system like Raft or a CRDT operation will converge on to exactly one state, yes. I haven’t seen these concepts applied to a message queue. One could build a consensus or CRDT system that used a normal at-least-once message broker as a component, for sure. Is that what we’re talking about?

Hi, I'd just like to draw your attention to the community guidelines ( https://news.ycombinator.com/newsguidelines.html ).

I hope you'll find them as useful as I have in shaping my interaction with this community. Specifically I'd like to highlight the theme of substantiative comments and the discouragement of commenting on small issues like website formatting.

I assume you’re saying the theme doesn’t work for you? Thanks for the feedback, will customise it to something more readable, better colours, nicer fonts etc soon.

Everything worked great for me, its why I was surprised to find a somewhat rude link as the only comment once I'd read the article.

On a side-note, it's customary to disclose the author's connection with his own submission.

I understand that it's a bit on the nose to see user account sudhirj submitting a link to sudhirj.io. However, disclosures are also a clear indication on where to post questions and/or remarks directed at the author.

Anyway, my 0.02€.

Will post a note.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact