Ask HN: Why do message queue-based architectures seem less popular now?
389 points by alexhutcheson 4 months ago | 364 comments
In the late 2000s and early 2010s, I remember seeing lots of hype around building distributed systems using message queues (e.g. Amazon SQS, RabbitMQ, ZeroMQ, etc.) A lot of companies had blog posts highlighting their use of message queues for asynchronous communication between nodes, and IIRC the official AWS design recommendations at the time pushed SQS pretty heavily.

Now, I almost never see engineering blog posts or HN posts highlighting use of message queues. I see occasional content related to Kafka, but nothing like the hype that message queues used to have.

What changed? Possible theories I'm aware of:

* Redis tackled most of the use-case, plus caching, so it no longer made sense to pay the operational cost of running a separate message broker. Kafka picked up the really high-scale applications.

* Databases (broadly defined) got a lot better at handling high scale, so system designers moved more of the "transient" application state into the main data stores.

* We collectively realized that message queue-based architectures don't work as well as we hoped, so we build most things in other ways now.

* The technology just got mature enough that it's not exciting to write about, but it's still really widely used.

If people have experience designing or implementing greenfield systems based on message queues, I'd be curious to hear about it. I'd also be interested in understanding any war stories or pain points people have had from using message queues in production systems.




I like a lot of the answers, but something else I'd add: lots of "popular" architectures from the late 00s and early 2010s have fallen by the wayside because people realized "You're not Google. Your company will never be Google."

That is, there was a big desire around that time period to "build it how the big successful companies built it." But since then, a lot of us have realized that complexity isn't necessary for 99% of companies. When you couple that with hardware and standard databases getting much better, there are just fewer and fewer companies who need all of these "scalability tricks".

My bar for "Is there a reason we can't just do this all in Postgres?" is much, much higher than it was a decade ago.


We also have much, much bigger single machines available for reasonable money, so a lot of reasonable workloads can now fit on one machine that used to require a small cluster.


It's kind of mind boggling just how powerful mundane desktop computers have gotten, let alone server hardware.

Think about it: that 20-core CPU (e.g. an i7-14700K) you can buy for just a couple hundred dollars today would have been supercomputer hardware costing tens or hundreds of thousands of dollars just a decade ago.


According to Geekbench, an i7-4790 released a decade ago is ~5 times slower than an i7-14700. 4790s go for $30 on eBay vs. $300 for a 14700, so price/performance seems to be in favor of the older hardware :)


What about power consumption? When running a server 24/7, power is likely to be a bigger cost concern than the one-off cost of purchasing the processor.


Under full load, roughly 100W for the 4790 and 350W for the 14700. Note that both links [0][1] are for the K variants, and both figures were measured running Prime95. More normal workloads are probably around 2/3 of those peak values.
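Back-of-the-envelope, here is what that difference costs over a year of 24/7 full load (the $0.12/kWh rate is just an assumed figure, substitute your own):

    # Annual electricity cost at 24/7 full load.
    # The $0.12/kWh rate is an assumption, not from the linked reviews.
    HOURS_PER_YEAR = 24 * 365
    RATE_USD_PER_KWH = 0.12

    for name, watts in [("i7-4790K", 100), ("i7-14700K", 350)]:
        kwh = watts / 1000 * HOURS_PER_YEAR
        print(f"{name}: {kwh:.0f} kWh/yr, about ${kwh * RATE_USD_PER_KWH:.0f}/yr")
    # i7-4790K: 876 kWh/yr, about $105/yr
    # i7-14700K: 3066 kWh/yr, about $368/yr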

For a desktop, yeah, you’re generally better off buying newer from a performance/$ standpoint. For servers, the calculus can shift a bit depending on your company’s size and workloads. Most smaller companies (small is relative, but let’s go with “monthly cloud bill is < $1MM”) could run on surprisingly old hardware and not care.

I have three Dell R620s, which are over a decade old. They have distributed storage via Ceph on NVMe over Mellanox ConnectX3-PRO. I’ve run DB benchmarks (with realistic schema and queries, not synthetic), and they nearly always outclass similarly-sized RDS and Aurora instances, despite the latter having multiple generations of hardware advancements. Local NVMe over Infiniband means near-zero latency.

Similarly, between the three of them, I have 384 GiB of RAM, and 36C/72T. Both of those could go significantly higher.

Those three, plus various networking gear, plus two Supermicro servers stuffed with spinning disks pulls down around 700W on average under mild load. Even if I loaded the compute up, I sincerely doubt I’d hit 1 kW. Even then, it doesn’t really matter for a business, because you’re going to colo them, and you’re generally granted a flat power budget per U.

The downside of course is that you need someone(s) on staff who knows how to provision and maintain servers, but it's honestly not that hard to learn.

[0]: https://www.guru3d.com/review/core-i7-4790k-processor-review...

[1]: https://www.tomshardware.com/news/intel-core-i9-14900k-cpu-r...


You might like https://labgopher.com/


I think that for server-type workloads, a reasonable way to estimate the performance improvement is to compare single-core performance and multiply by the ratio of core counts.
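As a rough sketch of that estimate (the scores and core counts below are made-up placeholders, not actual Geekbench results):

    # Server-throughput scaling estimate: single-core ratio times core-count ratio.
    # All four numbers are placeholders for illustration only.
    old_single, old_cores = 1000, 4     # assumed single-core score / cores, old CPU
    new_single, new_cores = 2500, 20    # assumed values for the new CPU
    estimate = (new_single / old_single) * (new_cores / old_cores)
    print(f"estimated throughput improvement: ~{estimate:.1f}x")  # ~12.5x here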


On the other hand, the E7-8890 v3 (the closest equivalent to a 14700K in core count at the time from a quick glance) had an MSRP of $7174.00[1].

So maybe I was a bit too high on the pricing earlier, but my point still stands that the computing horsepower we have such easy access to today was literal big time magic just a decade ago.

[1]: https://ark.intel.com/content/www/us/en/ark/products/84685/i...


RAM has also gotten much larger and cheaper, and it is now possible to have several terabytes (TB) of RAM (not storage) in a single PC or workstation. The i7-14700K can support 192 GB of RAM, but a lower-end Xeon W workstation CPU such as the w3-2423, costing around USD 350, can support 2 TB of RAM, albeit with only 6 cores [1]. And with not much extra budget you can scale the machine to your heart's content [2].

[1] Intel Xeon w3-2423 Processor 15M Cache, 2.10 GHz:

https://www.intel.com/content/www/us/en/products/sku/233484/...

[2] Intel Launches Xeon W-3400 and W-2400 Processors For Workstations: Up to 56 Cores and 112 PCIe 5.0 Lanes:

https://www.anandtech.com/show/18741/intel-launches-xeon-w-3...


Good point. And going back to the start of this thread, you can put a whole lot of Postgres into a machine with even a few hundred gigs of RAM.


This was true in the 2000s and 2010s as well. A lot of the work could be handled by a single monolithic app running on one or a small handful of servers. However, because of the microservices fad, people often created complicated microservices distributed across auto-scaling Kubernetes clusters, just for the art of it. It was unneeded complexity then, as it is now, in the majority of cases.


Hardware has certainly progressed significantly over the years, but the size of workloads has also grown.

The big question is: does a 'reasonable' workload today fit on a single machine better than a 'reasonable' workload did 20 years ago?


This! You NEEDED to scale horizontally because single machines just couldn't keep up with what they were being asked to do. I remember when our Apache boxes couldn't even cope with doing SSL, so we had a hardware box doing it on ingress!


I used to administer a small fleet of Sun SPARC hosts with SSL accelerators. They were so much money $$$$.

I proposed dumping them all for a smaller set of x86 hosts running Linux; it took 2-3 years before the old admins believed in the performance and cost savings. They refused to believe it would even work.


I lived through that era too; it was wild to see how quickly x86 dethroned SPARC (even Intel's big misses like Itanium were only minor bumps in the road).

In those days, you had to carefully architect your infrastructure and design your workload to deal with it, and every hardware improvement required you to reevaluate what you were doing. Hence novel architectural choices.

Everything is way easier for normal sized organizations now, and that level of optimization is just no longer required outside of companies doing huge scale.


I have the same memories, trying to convince people to dump slow-as-hell SPARC processors for database workloads in favor of x86 machines costing a tenth of the price.

To this day I still argue with ex Solaris sysadmins.


15 years ago I ran a website (Django+Postgres+memcached) serving 50k unique daily visitors on a dirt cheap vps. Even back then the scalability issues were overstated.


The stock market prices in expected future growth, so architectures had to justify rising stock prices by promising future scalability.

It was never about the actual workloads, much more about growth projections. And a whole lot of cargo cult behavior.


What happens when the single machine fails?


Worst case scenario, your service is not available for a couple of hours. In 99% of businesses, customers are totally okay with that (as long as it's not every week). IRL shops are also occasionally closed due to incidents; heck, even ATMs and banks don't work 100% of the time. And that's the worst case: because your setup is so simple, restoring a backup or even doing a full setup of a new machine is quite easy. Just make sure you test your backup restore system regularly.

Simple systems also tend to fail much less. I've run a service (with customers paying top euro) that was offline for ~two hours due to an error maybe once or twice in 5 years. Both occurrences were due to a non-technical cause (one was a bill that wasn't paid - yes, this happened; the other I don't recall).

We were offline for a couple of minutes daily for updates or the occasional server crash (a Go monolith, crashing mostly due to an unrecovered panic), but the reverse proxy was configured to show a nice static image with text along the lines of "The system is being upgraded, great new features are on the way - this will take a couple of minutes". I installed this in the first week when we started the company, with the idea that we would build a live-upgrade system when customers started complaining. Nobody ever complained - in fact, customers loved to see that we did an upgrade once in a while (although most customers never mentioned having seen the image).


Depending on your product, this could mean tens of thousands to millions of dollars worth of revenue loss. I don't really see how we've gone backwards here.

You could just distribute your workloads using...a queue, and not have this problem, or have to pay for and pay to maintain backup equipment etc.


If your product going down for an hour will lead to the loss of millions of dollars, then you should absolutely be investing a lot of money in expensive distributed and redundant solutions. That's appropriate in that case.

The point here is that 99% of companies are not in that scenario, so they should not emulate the very expensive distributed architectures used by Google and a few other companies that ARE in that scenario.

For almost all companies on the smaller side, the correct move is to take the occasional downtime, because the tiny revenue loss will be much smaller than the large and ongoing costs of building and maintaining a complex distributed system.


> The point here is that 99% of companies are not in that scenario

I'd argue that is wrong for any decently sized e-commerce platform or production facility. Maybe not millions per hour, but enough to warrant redundancy. There are many revenue and redundancy levels between Google and your mom-and-pop restaurant menu.


From the original post: “Your business is not Google and will never be Google”

From the post directly above: “Most businesses…”

The thread above is specifically discussing business which won’t lose a significant amount of money if they go down for a few minutes. They also postulate that most businesses fall into this category, which I’m inclined to agree with.


I understand it in practice, but I also think it's weird to be working on something that isn't aiming to grow. Maybe not to Google scale, but building systems which are "distributable" from an early stage seems wise to me.


Hours, not minutes. That is relevant for most businesses.


It could. In those cases, you set up the guardrails to minimize the loss.

In your typical seed, series A, or series B SaaS startup, this is most often not the case. At the same time, these are the companies that fueled the proliferation of microservice-based architectures, often with a single point of failure in the message queue or in the cluster orchestration. They shifted easy-to-fix problems into hard-to-fix problems.


Hellishly and endlessly optimising for profit is how we've gotten the world into its current state, lmao.


Machine failures are few and far between these days. Over the last four years I've had a cluster of perhaps 10 machines. Not a single hardware failure.

Loads of software issues, of course.

I know this is just an anecdote, but I'm pretty certain reliability has increased by one or two orders of magnitude since the 90s.


Also anecdotally, I’ve been running 12th gen Dells (over a decade old at this point) for several years. I’ve had some RAM sticks report ECC failures (reseat them), an HBA lose its mind and cause the ZFS pool to offline (reseat the HBA and its cables), and precisely one actual failure – a PSU. They’re redundant and hot-swappable, so I bought a new one and fixed it.


You didn't answer the question though. Your answer is "it won't", and that isn't a good strategy.


It is, in that if something happens less often, you don't need to prepare for it as much, assuming the severity stays the same (cue Nassim Taleb entering the conversation).


I'm not sure what types of products you work on, but it's kind of rare at most companies I've worked at where having a backup like that is a workable solution.


Your monitoring system alerts you on your phone, and you fix the issue.

When I worked with small firms who used Kubernetes, we had more Kubernetes code issues than machines failing. The solution to the theoretical problem was the cause of real issues. It was expensive to keep fixing this.


Depending on your requirements for uptime, you could have a stand-by machine ready or you spin up a new one from backups.


> "You're not Google. Your company will never be Google."

I'm not sure people realize this now more than then. I was there back then and we surely knew we would never be Google hence we didn't need to "scale" the same way they did.

Nowadays every project I start begins with a meeting where a document is presented describing the architecture we are going to implement, using AWS of course, because "auto-scale", right? And 9 times out of 10 it includes CloudFront, which is a CDN, and I don't really understand why this app I am developing, which is basically an API gateway with some customization that made Nginx slightly less than ideal (but still perfect for the job), and that averages 5 rps, needs a CDN in front of it... (or AWS, or auto-scaling, or AWS Lambda, for that matter)


The autoscaling is nice because a lot of performance issues just get resolved without much meddling by the ops team, buying time for proper optimizations should it get out of hand.

The disadvantage is that people don't think hard about performance requirements anymore. Premature optimization is bad, but it's also a warning sign if a project has no clue whatsoever how intensely the system is going to be used.


In defense of CDNs they're also pretty neat for cutting down latency, which benefits even the first customer.

Of course that only helps if you don't end up shoving dozens of MBs in Javascript/JSON over the wire.


Putting your app behind a CDN also gives you some cheap defense against (most, casual) DDoS.


usually, but this app is not even exposed publicly to the internet


AWS overcharges 100x for bandwidth, and CloudFront's free tier has 10x more bandwidth in it.


5 rps aside, the more data you can push to the edge (your customer), the cheaper it will be and the better the performance for your customer.


> because people realized "You're not Google. Your company will never be Google."

Is that also why almost no one is using microservices and Kubernetes?


I realize you're being sarcastic (I think), but I actually would put microservices in the same boat. There was a huge push to microservices in the mid teens, and a lot of companies came to hugely regret it. There is a reason this video, https://youtu.be/y8OnoxKotPQ, is so popular.

And it's not that "no one is using microservices", it's just that tons of companies realized they added almost as many complications as they alleviated, and that for many teams they were just way too premature. And a lot of the companies that I've seen have the most success with microservices are also the most pragmatic about them: they use them in some specific, targeted areas (e.g. authn and authz), but otherwise they're content using a well-componentized monolith where they can break off independent services later if there is an explicit reason to do so.


> Is that also why almost no one is using microservices and Kubernetes?

I don’t know of a single 100+ sized organisation in my area which doesn’t use micro services in some form. A lot of places also use kubernetes indirectly through major cloud provider layers like Azure Container Apps.

Our frontend (and indeed quite a bit of our backend) lives in a NX mono-repo. As for how it actually works, however, it’s basically a lot of micro-services which are very independently maintainable. Meaning you can easily have different teams work on different parts of your ecosystem and not break things. It doesn’t necessarily deploy as what some people might consider micro services of course. But then micro services were always this abstract thing that is honestly more of a framework for management and change management than anything tech.


Maybe add /s ;). It may reduce the number of hot-headed responses.


Yeah but you can’t deny the rofls without the /s. At least I certainly can’t deny the humor from reading DevJab’s comment.


lol they gon get flamed


Kubernetes brings more than just being Google. In a way it’s also an ecosystem.


That's nice, but the question is whether you (i.e. your company) need this ecosystem.


Need no. But it’s nice especially since I have the know how.

Creating a new deployment is just super fast. Developers can deploy their own apps etc.

And then there is Helm.

If we only decide by “need” then most of the time we also wouldn’t need object oriented programming.


> And then there is Helm

Right, who doesn't want to template hundreds of lines of code in a language that uses whitespace for logic and was made neither for templating nor for long, complex documents (YAML)? What could possibly go wrong ("error missing xxx at line 728, but it might be a problem elsewhere")?


I wonder why people don't use fromYaml + toJson to avoid stupid indentation errors.

YAML is, for all intents and purposes, a superset of JSON, so if you render your subtree as JSON you can stick it in a YAML file and you don't need to care about indentation.
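A quick way to see why that works (a Python sketch of the same idea that Helm's toJson relies on; PyYAML is assumed to be installed):

    import json
    import yaml  # PyYAML, assumed installed

    subtree = {"annotations": {"team": "payments", "tier": "backend"}}

    # Rendered as JSON, the subtree sits on a single line, so it can be pasted
    # into a YAML document at any indentation level without breaking anything.
    doc = "metadata:\n  labels: " + json.dumps(subtree)
    print(yaml.safe_load(doc))
    # {'metadata': {'labels': {'annotations': {'team': 'payments', 'tier': 'backend'}}}}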


"in a language that uses whitespace for logic"

This argument kind of died when python became one of the most popular programming languages used.


It's not that whitespace is bad per se; it's that whitespace is very difficult to template relative to something character-delimited like JSON or s-expressions.

In JSON or s-expressions, the various levels of objects can be built independently because they don't rely on each other. In YAML, each block depends on its parent to know how far it should be indented, which is a huge pain in the ass when templating.


Python uses whitespace for scoping - not for logic. That said, the same is true for YAML.


Yeah, but Python has actual for loops.


But my infrastructure is code! Can't you see how it's all in git?


What's wrong with having your infrastructure as code and storing it in Git?


Nothing. Even if it's objectively terrible (thousands of lines of templated YAML, or thousands of lines of spaghetti bash), being in Git as code is still better. At least you know what it is, how it evolved, and can start adding linting/tests.


I manage large IaC repos and it's mostly HCL, well-structured and easy to work with. Where we have Kubernetes manifests, they're usually split into smaller files and don't cause any trouble, as we usually don't deploy manifests directly.


YAML isn't code. Same as your reply, YAML has very little awareness of context.


In computer science, the word "code" has a very specific meaning, and markup languages definitely fit this definition.


Honestly, Helm kinda sucks (for all the reasons you mentioned).

But Kustomize is very nice. Although, their docs could be a bit better.


It's not that bad if you need to deploy at least 3 things, and for most cases it beats the alternatives. You can get away with a bootstrapped deployment YAML and a couple of services for most scenarios. What should you use instead? Vendor-locked app platforms? Roll your own deploy bash scripts?

Sure, the full extent of Kubernetes is complicated and managing it might be a pain, but if you don't go bonkers, it's not that hard to use as a developer.


I don’t like helm itself. But I was referring to the deployment part. I like Kustomize more.


Lots of things are nice to have but are expensive.

I'd love to have a private pool in my backyard. I don't, even though it's nice to have, because it is too expensive.

We intuitively make cost-benefit choices in our private lives. When it comes to deciding the same things at work, our reasoning often goes haywire.

Sometimes we need an expensive thing to solve a real problem.

Your point about object-oriented programming makes sense. Sometimes, a bash script suffices, and the person who decides to implement that same functionality in Java is just wasting resources.

All of these solutions have a place where they make sense. When they are blindly applied because they're a fad they generate a lot of costs.


That’s true. Using k8s to host a static website would be silly.

Generally I only use it when I see a compelling case for it and the introduced complexity takes away complexity from somewhere else.


I’ve only ever seen a single dev team managing their own K8s cluster. If by deploy you mean “they merge a branch which triggers a lot of automation that causes their code to deploy,” you don’t need K8s for that.

Don’t get me wrong, I like K8s and run it at home, and I’d take it any day over ECS or the like at work, but it’s not like you can’t achieve a very similar outcome without it.


Of course. And I can write a web server in C. Doesn't mean it's the best way given the circumstances.

There are many ways in tech to achieve the same result. I don’t understand why people constantly need to point that out.

I also don’t understand why k8s ruffles so many feathers.

Reminds me a bit of Linux vs windows vs Mac debates.


K8s ruffles my feathers because it’s entirely too easy to build on it (without proper IaC, natch) without having any clue how it works, let alone the underlying OS that’s mostly abstracted away. /r/kubernetes is more or less “how do I do <basic thing I could have read docs for>.”

I’m a fan of graduated difficulty. Having complex, powerful systems should require that you understand them, else when they break – and they will, because they’re computers – you’ll be utterly lost.


Genuine question: say you have 3-4 services and a bunch of databases that make up your product, what's the alternative to dumping them all into K8s, according to you?


3-4 services and a bunch of databases?

Assuming there aren’t any particular scaling or performance requirements, if I were managing something like that, I would almost certainly not use k8s. Maybe systemd on a big box for the services?


I agree with you and I'm always confused when people talk about process isolation as a primary requirement for a collection of small internal services with negligible load.

In addition, the overhead and reporting drawbacks of running multiple isolated databases are vastly higher than any advantage gained from small isolated deployments.


For personal stuff I simply run systemd services, and I believe that scales quite far (as in, you can rely on it for real production services).


My hero.


If I had 3-4 services and a bunch of databases, I would look at them and ask "why do we need to introduce network calls into our architecture?" and "how come we're using MySQL and Postgres and MongoDB and a Cassandra cluster for a system that gets 200 requests a minute and is maintained by 9 engineers?"

Don't get me wrong, maybe they're good choices, but absent any other facts I'd start asking questions about what makes each service necessary.


AWS Fargate is popular among large companies in my experience.

Some of them try to migrate from it to a unified k8s "platform" (i.e. frequently not pure k8s/EKS/helm but some kind of in-house layer built on top of it). It takes so long that your tenure with the company could end before you see it through.


Using cloud platform as a service options. For example, on Azure you can deploy such system with Azure App Service (with container deployment), or Azure Container Apps (very suitable for microservices). For database, you can use Azure Database for PostgreSQL (flexible), or Azure Cosmos DB for PostgreSQL.

This way, Azure does most of the heavy lifting you would otherwise have to do yourself, even with managed kubernetes.


In my home environment I run a VM with docker for that.

In a commercial environment I’d still use kubernetes. But maybe something like k3s or if we are in a cloud environment something like EKS.

Usually with time other services get added to the stack (elastic, grafana, argocd, …)


If you use AWS, it's probably easier to use ECS, which takes away some of the complexity for you.


Maybe at first, but once you start building all of the IaC and tooling to make it useful and safe at scale, you might as well have just run EKS. Plus, then you can get Argo, which is IMO the single best piece of software from an SRE perspective.


Docker compose is another option.


Define "no one". If you mean small shops, maybe. If you mean large organizations, I haven't seen even one in the last 5 years that wouldn't use them in one way or another.


It was meant to be ironic


Ah sorry, it is more and more difficult for me to detect irony these days.


Yeah sorry as well, I actually wanted to add a “… oh wait” to my original comment but forgot to do it… (too busy fixing a podman issue… )


To be fair, I worked on multiple projects removing queues at Google, so it's more than just that.


and mandates that virtually all new projects not directly use borg/kubernetes.


Can you expand on that? How do they deploy, and where do they deploy their projects?


That's interesting. What's the rationale behind that?


I left Google in 2022, but I remember seeing the beginning of this. The system that got built up to do all the things Google wanted to do -- previously "simple" things, orchestrated on raw hosting (Borg) -- got too big and complex. (Partially because it was too flexible, so too many projects did too many unique things, and running them grew difficult to match.) So they started adding abstraction layers, to make it "simpler". (Which as always means the common/easy case got easier, and the rare/difficult case got harder. The team I was on at the time couldn't use the new thing for (most) of what we did, because our needs were incompatible. But we did enough other compatible things that I started learning about it/using it.)

(Personally I'm sure it's doomed to failure. But partially because most all technology operates in cycles/like a pendulum. The abstraction layers will grow too onerous and limiting, and the solution will be to dive closer to the metal again.)


I guess such a service gets coupled too strongly to that platform, and major engineering effort is required to deploy it the old-school way.


> You're not Google. Your company will never be Google

True, but the CTO comes from twitter/meta/google/some open-source big-data project, the director loves databases, etc.

So we have 40-100 people managing queues with events driven from database journals.

Everyone sees how and why it evolved that way. No one has the skill or political capital to change it. And we spend most of our time on maintenance tasked as "upgrades", in a culture with "a strong role for devops".


Meta, where they run a PHP monolith with mysql? ;)


So true, people optimize prematurely for when they'll have 100m monthly users, when they have no product yet and won't likely reach 100k users for many years (which can run on a single $100/mo dedicated machine...).


My perhaps overly cynical view is that Message Queue architecture and blogging was all about "Resume Driven Development" - where almost everybody doing it was unlikely to ever need to scale past what a simple monolith could support running on a single laptop. All the same people who were building nightmare microservice disasters requiring tens of thousands of dollars a month of AWS services.

These days all those people who prioritise career-building technical feats over solving actual business problems in pragmatic ways - they're all hyping and blogging about AI, with similar results for the companies they (allegedly) are working for: https://www.theregister.com/2024/06/12/survey_ai_projects/


I'm sure this happens. But ... most websites I load up have like a dozen things trying to gather data, whether for tracking, visitor analytics, observability, etc. Every time I view a page, multiple new unimportant messages are being sent out, and presumably processed asynchronously. Every time I order something, after I get the order confirmation page, I get an email and possibly a text message, both of which should presumably happen asynchronously, and possibly passing through the hands of more than one SaaS product en route. So given what seems to be the large volume of async messages, in varying/spiking volumes, possibly interacting with 3rd party services which will sometimes experience outages ... I gotta expect that a bunch of these systems are solving "actual business problems" of separating out work that can be done later/elsewhere, can fail and be retried without causing disruptions, etc in order to ensure the work that must happen immediately is protected.


Bingo - I work on the backend of a medical system and basically anything that interacts with a 3rd party gets put into a queue so our application doesn't choke immediately when one of them has issues. We also have some uses for it within our system.

As far as the question, I was thinking that queues have probably just become a standard aspect of modern distributed systems; it's considered a pretty foundational cloud service for any provider (though we just run RabbitMQ ourselves and it has worked well for us).
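A minimal sketch of what that looks like on the publishing side with pika (the queue name and payload are made up for illustration):

    import json
    import pika  # RabbitMQ client library

    # Instead of calling the flaky third party inside the request path, drop the
    # work onto a durable queue; a separate worker consumes it and retries.
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="third_party_jobs", durable=True)

    def enqueue_job(payload: dict) -> None:
        channel.basic_publish(
            exchange="",
            routing_key="third_party_jobs",
            body=json.dumps(payload),
            properties=pika.BasicProperties(delivery_mode=2),  # persist the message
        )

    enqueue_job({"kind": "notify_lab", "order_id": 1234})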


At work we usually integrate via a queue, but then some partners/customers wanted a synchronous flow because the user has to see and pick from data we don't own, and now life is pain.


Tracking, visitor analytics, and observability type things are all (in general) going out to 3rd party specialist services for those things, and getting dropped into a time series database (or, for us old school gray beards, a log file) and processed later. It's rare for the website devs to be doing anything more complex than adding even more JavaScript to their pages for those; no need for message queues there.

Order confirmation emails and sms messages are triggered by the order submission, and again usually sent off to a 3rd party bulk email or SMS service. Twilio or Campaign Monitor or Mailchimp will have queues and retry mechanisms, but again the website devs are just firing off an API call to some 3rd party that's dealing with that.

So there are no doubt message queues being used in non-sexy 3rd party services, but those companies probably consider that kind of thing to be their "secret sauce" and don't blog about it.


> Order confirmation emails and sms messages are triggered by the order submission, and again usually sent off to a 3rd party bulk email or SMS service. Twilio or Campaign Monitor or Mailchimp will have queues and retry mechanisms, but again the website devs are just firing off an API call to some 3rd party that's dealing with that.

In my case, I need to compile templates for the e-mails to be sent, which is somewhat slow. Even if I have an in memory cache for the compiled templates that can then be quickly filled in with actual data, I don't want to make the first user to request them after a restart/flush wait like 4-7 extra seconds upon clicking on the confirm order button (in addition to any other 3rd party calls, checking payment status etc.).

Ergo, I need to carry the actual e-mail preparation logic (regardless of whether I send it myself or use a 3rd party service) out into a separate thread. The problem then is that I most likely don't want to create a new thread for every e-mail to be processed, so I need a queue for the requests and one or multiple worker threads. There is functionality for this in the stack I'm using, so no big deal, except I can't pass live DB entities across threads, so I also need to serialize the data before passing it off to the queue and deserialize it inside the worker (or just pass some IDs and do DB calls within the worker).

Essentially I've created a simple queue system in the app code, since processing that data definitely shouldn't make the user wait on it. I can see why some might also go the extra step and opt for something like RabbitMQ, since at the end of the day my queue system is likely to be way more barebones than something that's been around for years. But until a certain point, YAGNI.
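The shape of that pattern, sketched in Python rather than .NET (queue.Queue plus a worker thread; the load/render/send helpers are placeholder stubs, not real code from the system described):

    import queue
    import threading

    # Placeholder stand-ins for the real DB/template/email code described above.
    def load_order(order_id): return {"id": order_id}
    def render_confirmation(order): return f"Thanks for order {order['id']}"
    def send_email(body): print("sending:", body)

    email_jobs: queue.Queue = queue.Queue()

    def email_worker():
        while True:
            job = email_jobs.get()  # blocks until the request handler enqueues work
            try:
                # Pass plain IDs across threads, not live DB entities, and
                # re-load whatever the e-mail needs inside the worker.
                send_email(render_confirmation(load_order(job["order_id"])))
            finally:
                email_jobs.task_done()

    threading.Thread(target=email_worker, daemon=True).start()

    # In the request handler: enqueue plain data and return to the user immediately.
    email_jobs.put({"order_id": 42})
    email_jobs.join()  # only for this demo, so the script waits for the worker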


What is taking so long to compile?

And what language are you using? E.g. in NodeJS you can fire off requests without having to have a new thread or having to wait for them.


> What is taking so long for compiling?

Razor templates: https://learn.microsoft.com/en-us/aspnet/core/mvc/views/razo... with this outside of MVC https://github.com/adoconnection/razorenginecore

Though to be honest, if it wasn't template compilation, it might as well be slow DB queries, slow external service calls, or anything else that you wouldn't want to do in the lifecycle of the user's request; anything that should happen in the background, regardless of the particular stack.


There's always a queue somewhere, at some level of resilience. Sometimes it's as mundane as the TCP buffer; other times the other endpoints may become unresponsive for a while, including the time-series DB you seem to want to treat as the source of truth, or whatever it is that tries to resend and reconcile the requests.


One word: scale. The services you mention above do require scale if commercial. OP argues, and I somewhat agree, that lots of resume-driven tech was oversold and overused, making things more complicated and expensive than they should have been. Once tech gets more mature it's harder to misuse, and it gets used where real needs arise.


This is true, but in my opinion badly misunderstood.

There are a huge number of "commercial" things that are hitting several million dollars a month in revenue running Ruby On Rails or WordPress/PHP. You can scale a long long way with "boring technology".

Way too many people think they are "the unicorn" whose user base is going to grow so quickly that they need billion-simultaneous-user scale right now - instead of realising that before they get even close to that, they'll be generating enough revenue to have an engineering team of hundreds who will have rewritten everything two or three times with more suitable architectures already.

If you need a billion users to turn a profit, then whether you admit it or not your business model is "burn VC money until they stop giving it to us or we pivot using Doctorow's Enshittification blog posts as a guide book". That, though, is a vanishingly small percentage of all dev work. Most business models have a way to make real profits off thousands, or perhaps tens or hundreds of thousands, of transactions a month - and they should be rolling in profit to reinvest in future higher-scale development well before they run out of fairly pedestrian "boring technology" platforms. Horizontally scalable Java/PHP/Ruby/Node with vertically scaling databases on your cloud provider of choice is a well-known and battle-tested way to generate real value and cashflow in probably 99% of all businesses.


I absolutely agree. But keep in mind that a lot of these services are made by startups that want to be acquired by someone with big pockets. Using hyped tech helps the sale. Don’t ask me why…


"MongoDB is web scale"


> My perhaps overly cynical view is that Message Queue architecture and blogging was all about "Resume Driven Development" - where almost everybody doing it was unlikely to ever need to scale past what a simple monolith could support running on a single laptop. All the same people who were building nightmare microservice disasters requiring tens of thousands of dollars a month of AWS services.

Yes, that is cynical. People have been building architectures off MQ for a much longer time than microservices have been around. Lots of corporates have used JMS for a long time now.


Anything can be 'resume driven design'. If someone is rewarded with a raise, promotion, or even just praise sometimes, for applying a technology to a problem without being required to prove that the technology is appropriate, they'll find a way to jam that tech into their domain regardless.

Sometimes that promotion is rewarded by going to a different company while being able to say "yes, I used X in my previous role."


I have also seen a lot of cases where engineers would use an unnecessarily complex structure on purpose to make themselves less replaceable, as it takes newcomers longer to get familiar with the deployed environment.


The reality is that management never cares how non-replaceable an engineer is, fires them anyway, a bunch of stuff breaks and the newcomers are stuck holding the bag.


All fun and games until stock price tanks because of shitty uptime


It's simply that the complexity rises and more people are employed, which then creates fiefdoms to simplify and delineate responsibilities.


I can offer one data point. This is from purely startup-based experience (seed to Series A).

A while ago I moved from microservices to monolith because they were too complicated and had a lot of duplicated code. Without microservices there's less need for a message queue.

For async stuff, I used RabbitMQ for one project, but it just felt...old and over-architected? And a lot of the tooling around it (celery) just wasn't as good as the modern stuff built around redis (bullmq).

For multi-step, DAG-style processes, I prefer to KISS and just do that all in a single, large job if I can, or break it into a small number of jobs.

If I REALLY needed a DAG thing, there are tools out there that are specifically built for that (Airflow). But I hear they're difficult to debug issues in, so I would avoid them if at all possible.

I have run into scaling issues with redis, because their multi-node architectures are just ridiculously over-complicated, and so I stick with single-node. But sharding by hand is fine for me, and works well.


To your comment on Airflow, I've been around that block a few times. I've found Airflow (and really any orchestration) to be the most manageable when it's nearly devoid of all logic, to the point of DAGs being little more than a series of function or API calls, with each of those responsible for managing state transfer to the next call (as opposed to relying on orchestration to do so).

For example, you need some ETL to happen every day. Instead of having your pipeline logic inside an airflow task, you put your logic in a library, where you can test and establish boundaries for this behavior in isolation, and compose this logic portably into any system that can accept your library code. When you need to orchestrate, you just call this function inside an airflow task.

This has a few benefits. You now decouple, to a significant extent, your logic and state transfer from your orchestration. That means if you want to debug your DAG, you don’t need to do it in Airflow. You can take the same series of function calls and run them, for example, sequentially in a notebook and you would achieve the same effect. This also can reveal just how little logic you really need in orchestration.

There are some other tricks to making this work really well, such as reducing dependency injection to primitives only where possible, and focusing on decoupling logic from configuration. Some of this is pretty standard, but I've seen teams not have a strong philosophy on this and then struggle with maintaining clean orchestration interfaces.
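To make the shape concrete, a minimal sketch using Airflow's TaskFlow API (Airflow 2.4+ assumed; my_etl_lib and its functions are hypothetical stand-ins for the library that holds your real logic):

    from datetime import datetime
    from airflow.decorators import dag, task

    # All real pipeline logic lives in an importable, testable library;
    # the DAG below is just glue. my_etl_lib is a hypothetical package name.
    from my_etl_lib import extract_orders, transform_orders, load_orders

    @dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
    def daily_orders_etl():
        @task
        def extract():
            return extract_orders()

        @task
        def transform(raw):
            return transform_orders(raw)

        @task
        def load(rows):
            load_orders(rows)

        load(transform(extract()))

    daily_orders_etl()

The same three function calls can be run sequentially in a notebook to debug the pipeline without touching Airflow at all, which is the decoupling described above.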


Helpful comment! If I could pick your brain...

I'm looking at a greenfield implementation of a task system for human tasks - people need to do a thing, and then mark that they've done it, and that "unlocks" subsequent human tasks - and as near as I can tell the overall task flow is a DAG.

I'm currently considering how (if?) to allow for complex logic about things like which tasks are present in the overall DAG - things like skipping a node based on some criteria (which, it occurs to me in typing this up, can benefit from your above advice, as that can just be a configured function call that returns skip/no-skip) - and, well... thoughts? (:


I think there are some questions to ask that can help drive your system design here. Does each node in the DAG represent an event at which some complex automated logic would happen? If so, then I think the above would be recommended, since most of your logic isn’t the DAG itself, and the DAG is just the means of contextually triggering it.

However, if each node is more of a data check/wait (e.g. we’re on this step until you tell me you completed some task in the real world), then it would seem rather than your DAG orchestrating nodes of logic, the DAG itself is the logic. In this case, i think you have a few options, though Airflow itself is probably not something I would recommend for such a system.

In the case of the latter, there are a lot of specifics to consider in how it's used. Is this a universal task list, where there is exactly one run of this DAG (e.g. tracking tasks at a company level), or would you have many independent runs of it (e.g. many users use it)? Are runs of it regularly scheduled (e.g. users run it daily, or as needed)?

Without knowing a ton about your specifics, a pattern I might consider could be isolating your logic from your state, such that you have your logical DAG code, baked into a library of reusable components (a la the above), and then allowing those to accept configuration/state inputs that allow them to route logic appropriately. As a task is completed, update your database with the state as it relates to the world, not its place in the DAG. This will keep your state isolated from the logic of the DAG itself, which may or may not be desirable, depending on your objectives and design parameters.


Do you avoid things like task sensors? Based on what you described, it sounds like they'd be an anti-pattern if you're using them.

Great description of good orchestration design. Airflow is fairly open ended in how you can construct dags, leading to some interesting results.


Yes, I think you could make an argument for them, but in general it means putting your state sensing into orchestration (local truth) rather than something external (universal truth). As with anything, it does depend on your application though. If you were running something like an ETL, I think it’s generally more appropriate to sense the output of that ETL (data artifact, table, partition, etc) than it is to sense the task itself. It does present some challenges for e.g. cascading backfills, but I think it’s a fine tradeoff in most applications.


If you’re already in the Kubernetes system, Argo Workflows has either capabilities designed around what you are describing or can be built using the templates supported (container, script, resource). If you’re not on Kubernetes, then Argo Workflows is not worth it on its own because it does demand expertise there to wield it effectively.

Someone suggested Temporal below and that’s a good suggestion too if you’re fine with a managed service.


Not the GP or specifically an Airflow user, but my approach is to have a fixed job graph, and unnecessary jobs immediately succeed. And indeed, jobs are external executables, with all the skip/no-skip logic executed therein.

If nothing else, it makes it easy to understand what actually happened and when - just look at job logs.


I'm working on a similar system. My plan is to have multiple terminal states for the tasks:

Closed - Passed

Closed - Failed

Closed - Waived

When you hit that Waived state, it should include a note explaining why it was waived. This could be “parent transaction dropped below threshold amount, so we don’t need this control” or “Executive X signed off on it”.

I’m not sure about the auto-skip thing you propose, just from a UX perspective. I don’t want my task list cluttered up with unnecessary things. Still, I am struggling with precisely where to store the business logic about which tasks are needed when. I’m leaning towards implementing that in a reporting layer. Validation would happen in the background and raise warnings, rather than hard stopping people.

The theory there is that the people doing the work generally know what’s needed better than the system does. Thus the system just provides gentle reminders about the typical case, which users can make the choice to suppress.
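Roughly how that could be modeled (a sketch; the state and field names are illustrative, not from an actual system):

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class TaskState(Enum):
        OPEN = "open"
        CLOSED_PASSED = "closed_passed"
        CLOSED_FAILED = "closed_failed"
        CLOSED_WAIVED = "closed_waived"

    @dataclass
    class HumanTask:
        name: str
        state: TaskState = TaskState.OPEN
        waive_note: Optional[str] = None

        def waive(self, note: str) -> None:
            # Waiving always requires an explanation, per the rule above.
            if not note:
                raise ValueError("a waived task must include a note explaining why")
            self.state = TaskState.CLOSED_WAIVED
            self.waive_note = note

    t = HumanTask("secondary sign-off")
    t.waive("parent transaction dropped below threshold amount")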


I think of jobs more as prerequisites. If a prerequisite is somehow automatically satisfied (dunno, only back up on Mondays, and today is Tuesday) then it succeeds immediately. There is no "skipping". Wfm.

I find embedding logic into DSLs usually quite painful and less portable than having a static job graph and all the logic firmly in my own code.


Tbh that sounds almost like an already built workflow engine like n8n or even Jira would be preferable to reinventing the wheel.


Have you looked into temporal.io? It supports dynamic workflows.


Ok, so question (because I really like the DAG approach in principle but don't have enough experience to have had my fingers burned yet):

The way you use Airflow, what advantage does it have over crontab? Or to put it another way, once you remove the pipeline logic, what's left?


Airflow provides straightforward parallelism and error handling of dependent subtasks. Cron really doesn’t.

With cron you have to be more thoughtful about failover, especially when convincing others to write failure-safe code in whatever cron invokes. With Airflow you shouldn't be running code locally, so you can have a mini framework for failure handling.

Cron doesn't natively provide singleton locking, so if the system bogs down you can end up running N copies of the same job at the same time, which slows things down further. Airflow isn't immune to this by default, but it's easier to set up centralized libraries that everything uses, so more junior people avoid this when writing quick one-off jobs.
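(For what it's worth, the usual workaround under plain cron is an exclusive file lock; a rough Python sketch of the idea below, with run_nightly_job as a placeholder. flock(1) in a shell wrapper does the same thing.)

    import fcntl
    import sys

    def run_nightly_job():
        print("doing the actual work")  # placeholder for the real job

    # Take an exclusive, non-blocking lock; if a previous run still holds it,
    # exit instead of piling up N copies of the same job.
    lock_file = open("/tmp/nightly_job.lock", "w")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit("previous run still in progress, skipping this one")

    run_nightly_job()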


Observability is a huge upside.


Backfilling is also very useful


Thanks to both comments.


This is exactly what we do, but with Spark instead. We develop the functions locally in a package and call the necessary functions from the job notebooks, and the job notebooks are very minimalistic because of this.


Spark-via-Airflow is also the context we use this in; glad to see the pattern works for you too.


Thanks, this was really helpful.


In my experience monoliths don't reduce complexity, they just shift it. The main issue with monoliths is that they don't have clear and explicit separation between domain concerns, so it's very easy for your monolith codebase to devolve into a mess of highly interconnected spaghetti code with time. This is especially true if you're building something large with a lot of developers who don't necessarily understand all of the domain complexity of the code they're touching.

Monoliths imo are better for smaller projects with a few devs, but otherwise within a few years most of the time you'll regret building a monolith.

I also disagree with the duplicated code point. I don't understand why that would be a significant problem assuming you're using the same language and sharing packages between projects. This isn't a problem I've ever had while working on microservices anyway. I'd also debate whether they're any more complex than monoliths on average. My favourite thing about microservice architecture is how simple individual microservices are to understand and contribute to. The architecture and provisioning of microservices can be more complicated, but from the perspective of a developer working on a microservice it should be much simpler to work on compared to a monolith.


I think lots of microservices can be replaced with a monolith which in turn can be replaced with a set of composable libraries versioned separately.

If anyone doubts that, this very browser you're using to read and write is built all the way up from dozens of libraries (compression, networking, image encoding/decoding, video encoding/decoding, encryption, graphics, sound and whatnot), where each library is totally separate and sometimes was never intended by its original authors to be used to build web browsers.

Rest assured, most business systems (or web 2.0 systems: search, curate, recommend, surface, etc.) are a lot simpler than an A-class browser.


If you are using Chrome, it's also a combination of multiple well separated processes talking via RPC with each other, which is pretty similar to microservices, although the separation boundaries are more influenced by attack mitigation requirements than your typical microservice architecture would be.


But that's due to security, not for any supposed benefit of microservices. Also, both processes are from the same repo sharing code, so I wouldn't really qualify it as microservices.


That's literally an example of the decision depending on multiple factors. Separation of concerns -> more isolation -> stronger security of the overall system is exactly one of the possible benefits of microservices.

Scale is just one. There is also fault tolerance, security, organizational separation (which can, up to a point, also be realized with libraries as you suggest), a bigger ecosystem to choose from, …


1. microservices also create security boundaries

2. microservices living in monorepos is common


And even that process separation is spinning up more processes from within the same binary or build artefact. The usual fork() and CreateProcessW() etc and then wait on them.

Unlike microservices, where each process is possibly a totally different language, runtime and framework, spun up individually and possibly in totally different ways.


These blanket statements about monoliths are what made every junior dev think that microservices are the only solution.

If you cannot make a clean monolith, I have never seen any evidence that the same team can make good microservices. It is just the same crap, but distributed.

The last 2 years I see more and more seasoned devs who think the opposite: monoliths are better for most projects.


> It is just the same crap, but distributed.

Yes, but also - more difficult to refactor, more difficult to debug (good luck tracking a business transaction over multiple async services), slower with network overhead, lack of ACID transactions... Microservices solve problems which few projects have, but add a huge amount of complexity.


monoliths are the postgres of architectures - keep it simple until you really can't, not until you think you can't.


In my experience, the biggest issue with microservices is that they convert nice errors with a stack trace to network errors. Unless you also invest heavily in observability (usually using expensive tools), running and debugging monoliths generally seems easier


Microservices necessarily add more complexity and overhead when compared to a monolith. Just the fact that you have to orchestrate N services instead of just pressing run on a single project demonstrates some of the additional complexity.


Counterpoint: a monolith usually contains a complex init system which allows multiple ways of running the codebase. Microservices can avoid at least that one complexity.


Another advantage of microservices is that you can avoid the overhead of multiple services by having one really big microservice.


You mean like profiles? The monolith can run as a front-end service or a background worker depending on the config?

In a technical sense it is a complexity, but IME it pales in comparison with the alternative of having to manage multiple services.

I actually really like this model, it's pretty flexible. You can still have specialized server instances dedicated for certain tasks, but you have one codebase. What's very sweet is that for local development, you just run a single application which (with all enabled profiles) can fulfill all the roles at the same time.


Thanks, no. I'd rather wait 10 ms for a rebuild and 1 s on gdb init than 25 minutes for a monolith rebuild and 2 minutes on gdb init.

Separate processes, yes.


Who says monoliths don't have clear and explicit separation between domain concerns? I think that just comes down to how the codebase is organized and how disciplined the team is, or possibly breaking out core parts into separate libraries or other similar strategies - technically it's still a monolith.


Libraries are a great way to manage separation of concerns. Any dependency you add has to be explicit. There's nothing stopping you from adding that dependency but you can't just do it accidentally.

The graph of dependencies between components makes for explicit separation of concerns just like you would have a graph of dependencies between different network services.


Or you just use a language with support for clear module API boundaries (vs something like Ruby where without bolt on hacks, every piece of code can call any other in the same process).


The lower cost of a function call versus any microservice network call is a good performance advantage of a monolith. Monoliths also make refactoring code a lot easier. While in theory I agree about the spaghetti issue, in practice I haven't seen much of a difference. In part because microservices seem to encourage proactive overdesigning, and then the designs don't age well.

I also find monoliths a lot easier to debug. You have the whole call stack, and you get rid of a lot of potential sources of latency problems. You don't have RPC calls that might sometimes get forgotten.

Given the choice, I'd choose monolith every time. Unless, of course, microservices are needed for various other reasons. (Scale, the ability to distribute, etc.)


> In my experience monoliths don't reduce complexity, they just shift it

This is both true and false in a way. Sure, the same business logic is distributed across microservices, but a method call in a monolith can only fail in a couple of ways, while network calls are much more finicky - handling that in every case is pure added complexity in a microservice architecture.

Also, don’t forget the observability part - a mainstream language will likely have a sane debugger, profiler, a single log stream, etc. I can easily find bugs, race conditions, slow code paths in a monolith. It’s much more difficult if you have to do it in a whole environment communicating with potentially multiple instances of a single, or multiple microservices.

Lastly, we have programming languages to help us write correct, maintainable code! A monolith != spaghetti code. We have language tools to enforce boundaries, we have static analysis, etc. A refactor will work correctly across the whole codebase. We have nothing of this sort for microservices. You might understand a given microservice better, but does anyone understand the whole graph of them? Sure, monoliths might become spaghettis, but microservices can become spaghettis that are tangled with other plates of spaghettis.


Microservices also introduce the issue of maintaining schema compatibility for messages between services which usually leads to additional code in order to maintain backward compatibility

From a technical POV, they are good for horizontally scaling different workloads that have different resource footprints.

From my experience, when a company decides to go the microservice route, it's more for the sake of solving an organizational problem (e.g. making team responsibilities and oncall escalation more clear cut) than it is to solve a technical problem. Sometimes they will retroactively cite the technical benefit as the reason for using them, but it feels like more of an afterthought

But in all honesty: microservices are very good at solving this organizational problem. If microservice x breaks, ping line manager y who manages it. Very straightforward


You could do that with code owners file in a monolith as well.


This presupposes that there is more than one line manager.

I see people trying to apply microservice architectures to a web app with a single developer.

As in literally taking a working monolith written by one person and having that one person split it up into tiny services.

It’s madness.


If your goal is to learn kubernetes instead of developing a product, then go for it IMHO, no better way. Just make sure everyone is on board with the idea.


I call this customer-funded self-training.


>My favourite thing about microservice architecture is how simple individual microservices are to understand and contribute to.

I generally agree with your stance and I would add that I find the whole simplistic „microservices suck“ talking point as nonsensical as viewing them as a panacea. They do solve a few specific (for most companies, mostly organizational/human factors because scale and redundancy don’t matter that much) problems that are harder to solve with monoliths.

I still think this point is a bit misleading, because yes, the components become simpler, but their interaction becomes more complex and that complexity is now less apparent. See:

>The architecture and provisioning of microservices can be more complicated, but from the perspective of a developer working on a microservice it should be much simpler to work on compared to a monolith.

I think that perspective often doesn't match reality. Maybe most microservice systems I've seen had poorly separated domains, but changes small enough to ignore the interactions with other services have been rare in my experience.


Amazing this was downvoted. The comment starts with "in my experience" and is hardly a controversial perspective. I beg the HN community, stop disincentivizing people from respectfully providing a converse opinion, lest this become yet another echo chamber of groupthink.


It relates that there was experience, but not what that experience was - we can read and understand that they're reporting their own experience, but that's about it.

One could say "In my experience, the earth is flat", but there's not much a conversation to be had there.

One could say, "In my experience, the earth is flat - I got in my car, went for a drive in one direction, and eventually hit a cliff at the ocean instead of finding myself back where I started". Now there's something to talk about.

(To be clear: this is the internet, and a limited communication medium: I'd assume OP could relate details about their experience, and it's totally reasonable that instead of taking the time to do that, they went outside and touched grass)


That’s not a reason to downvote. Downvoting is a censorship tool.


It's because it didn't loop back to queues at any point. It's just a tangent on a tired topic.


We've swung back and it's trendy to hate on microservices now, so join in! /s


Patiently waiting for common sense to be fashionable. Alas, sensible people are too busy to advocate on the internet.


> My favourite thing about microservice architecture is how simple individual microservices are to understand and contribute to.

Whether this is good depends on the type of changes you need to make. Just as you mentioned maintaining modularity in a monolith can be difficult with entropy tending to push the code to spaghetti, there is an equivalent risk in microservices where developers duplicate code or hack around things locally versus making the effort to change the interfaces between microservices where it would make sense.

Ultimately microservices add structure that may be useful for large enough teams, but is still overhead that has to earn its keep.


> My favourite thing about microservice architecture is how simple individual microservices are to understand and contribute to.

You can achieve exactly the same with simple individual libraries.


"Cleverness, like complexity, is inevitable. The trick is making sure you're getting something worthwhile in exchange."


I think this: "* The technology just got mature enough that it's not exciting to write about, but it's still really widely used."

Messaging-based architecture is very popular


Agreed. It has become a tool just like any other. Just like nobody writes about how they use virtual machines in the cloud anymore.


This is the answer. I'd wager that almost every distributed system that runs at scale uses message queues in some capacity.


I think that's definitely part of it. Two roles ago my team was invested heavily in SQS and Kinesis. The role before that and my current role are pretty heavy with Kafka still.

I wouldn't call their use super interesting, though.

The last role was simply because the business required as close to real time message processing as possible for billing analytics. But if I tell someone that, it's not incredibly interesting unless I start diving into messages per second and such.


Yep. It’s a really nice architecture for lots of use cases where it fits.

Every new idea goes through the same cycle of overuse until it finds its niches.


Yeah, this is the most likely reason.

It used to be popular to post about rewriting Angular to React. Now everyone just uses React (or they write posts about rewriting React to Vue or whatever the flavor of the month is).


I think "message queues" have become pretty commoditized. You can buy Confluent or RedPanda or MSK as a service and never have to administer Kafka yourself.

Change Data Capture (CDC) has also gotten really good and mainstream. It's relatively easy to write your data to a RDBMS and then capture the change data and propagate it to other systems. This pattern means people aren't writing about Kafka, for instance, because the message queue is just the backbone that the CDC system uses to relay messages.

These architectures definitely still exist and they mostly satisfy organizational constraints - if you have a write-once, read-many queue like Kafka you're exposing an API to other parts of the organization. A lot of companies use this pattern to shuffle data between different teams.

A small team owning a lot of microservices feels like resume-driven development. But in companies with 100+ engineers it makes sense.


This. You don't need Google scale for Kafka to make sense. Just few acquisitions and a need to fan out some data to multiple products. For example you have SCIM hooks that will write to Kafka so all parts of org can consume the updates. Or customer provisioning.


Going to give the unpopular answer. Queues, Streams and Pub/Sub are poorly understood concepts by most engineers. They don't know when they need them, don't know how to use them properly and choose to use them for the wrong things. I still work with all of the above (SQS/SNS/RabbitMQ/Kafka/Google Pub/Sub).

I work at a company that only hires the best and brightest engineers from the top 3-4 schools in North America and for almost every engineer here this is their first job.

My engineers have done crazy things like:

- Try to queue up tens of thousands of 100mb messages in RabbitMQ instantaneously and wonder why it blows up.

- Send significantly oversized messages in RabbitMQ in general despite all of the warnings saying not to do this

- Start new projects in 2024 on the latest RabbitMQ version and try to use classic queues

- Create quorum queues without replication policies or doing literally anything to make them HA.

- Expose clusters on the internet with the admin user being guest/guest.

- The most senior architect in the org declared a new architecture pattern, held an organization-wide meeting and demo to extol the new virtues/pattern of ... sticking messages into a queue and then creating a backchannel so that a second consumer could process those queued messages on demand, out of order (and making it no longer a queue). And nobody except me said "why are you putting messages that you need to process out of order into a queue?"...and the 'pattern' caught on!

- Use Kafka as a basic message queue

- Send data from a central datacenter to globally distributed datacenters with a global lock on the object and all operations on it until each target DC confirms it has received the updated object. Insist that this process is asynchronous, because the data was sent with AJAX requests.

As it turns out, people don't really need to do all that great of a job and we still get by. So tools get misused, overused and underused.

In the places where it's being used well, you probably just don't hear about it.

Edit: I forgot to list something significant. There's over 30 microservices in our org to every 1 engineer. Please kill me. I would literally rather Kurt Cobain myself than work at another organization that has thousands of microservices in a gigantic monorepo.


To second this theory with some real-world data: a few startups ago, I worked at a NY Scala shop that used tons of Akka (an event-driven queuing Scala thing). Why? Because a manager at his prior job "saved the company" when "everything was slow" by doing this, so he mandated it at the new job.

What were we doing that required queueing? Not much: we showed people's 401ks on a website, let them adjust their asset mix, and sent out hundreds of emails per day. As you would expect, people almost never log into their 401k website.

A year or so after working there I realized our servers had been misconfigured all along and basically had 0 concurrency for web-requests (and we hadn't noticed because 2 production servers had always served all the traffic we needed). Eventually we just ripped out the Akka because it was unnecessary and added unnecessary complexity.

In the last month this company raised another funding round with a cash-out option, apparently their value has gone up and they are still doing well!


I think I'm actually somewhat familiar with your story.

There's something about "Java/Scala", "New York startup" and cargo-culted behaviors. I'm sure this happens elsewhere too in other circles, but I've both heard of and read about what you're referring to before.


Two observations:

1. There doesn't seem to be a design review process before people start implementing these things. Devs should make a design document, host a meeting where everyone reads it, and have a healthy debate before implementing stuff. If you have bright people but not-so-bright outcomes, it's because there is no avenue for those who know their shit to speak up and influence things.

2. I will always rather hire a 2-5 YOE dev from a no-name school over a new grad from a top 5 school. The amount that software engineers learn and grow in the first 5 years of their career is immense and possibly more than the rest of their career combined.


That doesn't sound like hiring only the "brightest".


Intelligence != experience. We are all aware of the very bright junior engineer that produces huge code volume and creates massive technical debt. It takes experience to go with the boring, simple approach.


But we only hire from MIT, Waterloo, CMU, etc!

No, truly, I feel you. The Waterloo kids are literally cut from a different cloth, though; I'll take one of them over ten of anyone else. I feel really guilty about grabbing young grads with all that potential and putting them through this...

Grads aren't expected to know anything. It's all the early ex-google hiring that fucked us.


Not sure how big your company is and how large your influence. But it sounds to me like your new hires would benefit from more guidance and oversight. Those mistakes shouldn't make it to production unless nobody gives a damn.


I keep the platform systems running, so they aren't my responsibility, but as their profession used to be mine I think the way they do things is more than a bit mad. Had COVID not upended absolutely everything, I would have moved on years ago. I believe I'll be here until the industry winds change significantly.

I do get to slam my foot down on some things from time to time -- usually for compliance reasons. Like now when people tell me they need credentials for things so they can run their code in dev on their laptops, I tell them no and that they can use mocks for their testing like a professional.

Funnily the junior people expect and have no problem with this -- it's the seasoned folks who chafe and complain.


> I work at a company that only hires the best and brightest engineers from the top 3-4 schools in North America and for almost every engineer here this is their first job.

Are you making this up to make your claim sound more extravagant? Even Citadel is not that picky so unless you work at OpenAI/Anthropic I'm calling nonsense.


There's nothing about the company or the product that requires it but we in fact are that picky when it comes to hiring new grads. Only new grads will work here at the pay we're offering and for all the other reasons mentioned though, so there you go.


To be fair, thousands of microservices in a monorepo sounds better than thousands of microservices, each with its own repository (but sub-repo’ing common libraries so everything jams up in everything else).


When your monorepo grows into the dozens upon dozens of gigabytes in size, the effort to manage it grows exponentially and the tooling available to you to interact with it shrinks exponentially. For example, your options for CI/CD systems and code review tools to use are extremely limited.


Because JS is single threaded, and an average programmer today is too dumb to learn any language beyond JS, you must build everything into "microservices".


You sound like a joy to work with


I'm retired, so I can tell the truth now :)


Raw intelligence is of limited help when working in an area that requires lots of rapidly depreciating, domain-specific knowledge.

Relatively few graduates know their way around the Snowflake API, or the art of making an electron app not perform terribly. Even sending an email on the modern internet can require a lot of intuition and hidden knowledge.

> There's over 30 microservices in our org to every 1 engineer

I wonder if this is a factor in making onboarding of new hires difficult?


> I wonder if this is a factor in making onboarding of new hires difficult?

Surely it's because they focused on making onboarding easy?

New hires just make a new microservice for every task they get; they don't need to know the broader scope, just the specs for the service.


It's hard to make code changes without understanding the broader scope.

For example, a developer asked to change the payments processor from the Stripe API to the Adyen API needs to know about:

1) credential scoping across microservices and test environments

2) impacts on billing, tax management, and customer support

3) the organisation's approach to external dependencies (e.g. do they use some kind of internal registry)

Microservices don't inherently make this kind of change more difficult. A well isolated payments microservice might even help. But having hundreds or thousands of microservices spread across several code repos makes it quite difficult to figure out how your change will affect other services.

I acknowledge I've picked an extreme example that wouldn't be asked of a new hire. However, smaller versions of the same challenges will happen in the smaller API upgrades & migrations that make up a big part of what the average CRUD developer does.


My point is with microservices you can, in theory, write all that stuff in a spec, and hand it to some new kid on the block. They then don't need to know about billing or what not, you've taken care of that in the spec.

Not saying it's a good way, but it's one way to end up with an inordinate amount of microservices per dev.


> > There's over 30 microservices in our org to every 1 engineer

> I wonder if this is a factor in making onboarding of new hires difficult?

Not so much that but a culture of rapidly reconfiguring the whole organization around new product work every three sprints and no ownership/responsibility for maintaining prior work.

It almost seems obvious what would happen.


> - Start new projects in 2024 on the latest RabbitMQ version and try to use classic queues

I am out of the loop here. Did Rabbit break classic queues?


They were deprecated versions (and years) ago and are set to be removed in the next release.

Also they're being used in a case where reliability is a must, thus they shouldn't be using classic queues.


> Use Kafka as a basic message queue. Since I'm guilty of that (I use Kafka as the backbone for pretty much any service-to-service communication, under a "job" API), I wonder why you think that's wrong.


You have easier ways to do a basic message queue. Kafka is overkill, and it isn't worth the overhead in simple scenarios.


This. And as-mentioned, RabbitMQ is already part of our platform. Using Kafka here is both extremely wasteful and unnecessarily complicates the product.


Queues are a tool in your distributed system toolbox. When it's suitable it works wonderfully (typical caveats apply).

If your perception is indeed correct it'd attribute it to your 3rd point. People usually write blogposts about new shiny stuff.

I personally use queues in my design all the time, particularly to transfer data between different systems with higher decoupling. The only pain I have ever experienced was when an upstream system backfilled 7 days of data, which clogged our queues with old requests. Running normally it would have taken over 100 hours to process all the data, while massively increasing the latency of fresh data. The solution was to manually purge the queue, and manually backfill the most recent missing data.

Even if you need to be careful around unbounded queue sizes, I still believe they are a great tool.


Message queues have moved on past the "peak of inflated expectations" and past the "trough of disillusionment" into the "slope of enlightenment", perhaps even the "plateau of productivity".

https://en.wikipedia.org/wiki/Gartner_hype_cycle


I used zmq to build our application used for testing new hardware. Everything comes in via serial every second, so I made a basic front end for that serial bus that sends telemetry out over zmq and waits for any commands coming in, using pub/sub. The front end piece can sit forever sending out telemetry no one ever hears, or I can hook up a logger, debug terminal, data plotter, or a factory test GUI that runs scripts, or all of them at once.

Dealing with COM ports on Windows is a huge hassle, so zmq lets me abstract those annoyances away as a network socket. Other engineers can develop their own applications custom to their needs, and they have.

Our old application tried to shove all of this functionality into one massive Python script, along with trying to update a complicated Tk GUI every second with new telemetry. The thing was buckling under its own weight and would actually screw up the serial data coming in if you were running a heavy script in another thread. I know there are ways to turn a serial port into a network socket, but I wanted something that didn't require a server or client to be online for the other to function.
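For anyone curious, a minimal sketch of that kind of serial-to-ZeroMQ front end in Python with pyzmq and pyserial (the port name, baud rate, and TCP ports here are placeholders, not what I actually use):

    import serial  # pyserial
    import zmq     # pyzmq

    ser = serial.Serial("COM3", 115200, timeout=1)   # placeholder port/baud

    ctx = zmq.Context()
    pub = ctx.socket(zmq.PUB)                        # telemetry fan-out to any listener
    pub.bind("tcp://*:5556")
    cmds = ctx.socket(zmq.SUB)                       # commands fan-in from any tool
    cmds.bind("tcp://*:5557")
    cmds.setsockopt(zmq.SUBSCRIBE, b"")

    poller = zmq.Poller()
    poller.register(cmds, zmq.POLLIN)

    while True:
        line = ser.readline()                        # one telemetry frame per second
        if line:
            pub.send_multipart([b"telemetry", line])
        if poller.poll(timeout=0):                   # forward any pending command, non-blocking
            ser.write(cmds.recv())

Loggers, plotters, and test GUIs then just connect a SUB socket to 5556 and a PUB socket to 5557, and neither side cares whether the other is running.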


They have become boring, so there are fewer blog posts about them.

That's good. The documentation for e.g. RabbitMQ is much better and very helpful. People use it as a workhorse just like they use Postgres/MySQL. There's not much surprising behavior to architect around, etc.

I love boring software.


Agree


I find it super interesting that the comments calling out "obviously we all still use message queues and workers, we just don't write about them" are buried half way down the comments section by arguments about Microservices and practical scalability. A junior engineer reading the responses could definitely get the false impression that they shouldn't offload heavy computation from their web servers to workers at all anymore.


It’s blowing my mind. Queues are the most bog-standard architectural component.


Speaking from my own experience, message queues haven't disappeared so much as they've been abstracted away. For example, "enqueue to SQS + poll" became "invoke a serverless process." There is a message queue in there somewhere; it's just not as exposed.

Or take AWS SNS which IMO is one level of abstraction higher than SQS. It became so feature rich that it can practically replace SQS.

What might have disappeared are those use cases which used queues to handle short bursts of peak traffic.

Also, streaming has become very reliable tech, so a class of use cases that used queues as a streaming pipe has migrated to streaming proper.


> There is a message queue in there somewhere just that it’s not as exposed.

It's still pretty exposed. You can set redelivery timeouts and connect a dead letter queue to lambda functions. Even the lambda invoke API is obviously just a funny-looking endpoint for adding messages to the queue.

> as much as have been abstracted away

In AWS in particular, into EventBridge, which further extends them with state machines. They've become the very mature cornerstone of many technologies.


Good point. TCP was, at its birth, justified as a queueing component. Today nobody dares to think of it as such.


I think it's simple: async runtimes/modules in JavaScript/Node, Python (asyncio), and Rust. Those basically handle message queues for you transparently inside of a single application. You end up writing "async" and "await" all over the place, but that's all you need to do to get your MVP out. And it will work fine until you really become popular. And then that can actually still work without external queues etc. if you can scale horizontally such as giving each tenant their own container and subdomain or something.

There are places where you need a queue just for basic synchronization, but you can use modules that are more convenient than external queues. And you can start testing your program without even doing that.

Async is also used heavily in Rust, which can stretch that out to scale even farther on an individual server.

Without an async runtime or similar, you have to invent an internal async runtime, or use something like queues, because otherwise you are blocked waiting for IO.

You may still eventually end up with queues down the line if you have some large number of users, but that complexity is completely unnecessary for getting a system deployed towards the beginning.
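As a rough illustration (not anyone's production code), here's the in-process version of this in Python, where asyncio.Queue plays the role an external broker would and handle_job stands in for whatever slow, I/O-bound work you'd otherwise ship to a worker:

    import asyncio

    async def handle_job(job):
        await asyncio.sleep(0.1)          # stand-in for a slow API call, DB write, etc.

    async def worker(queue):
        while True:
            job = await queue.get()       # other coroutines keep running while we wait
            await handle_job(job)
            queue.task_done()

    async def main():
        queue = asyncio.Queue()
        workers = [asyncio.create_task(worker(queue)) for _ in range(4)]
        for i in range(20):
            await queue.put({"job_id": i})
        await queue.join()                # wait until every queued job is processed
        for w in workers:
            w.cancel()

    asyncio.run(main())

Once the work no longer fits in one process, the queue.put/queue.get pair is exactly the seam where an external queue slots in.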


To back up the story regarding async a bit, at least on the front end ... A long time ago in the 2000s, on front-end systems we'd have a server farm to handle client connections, since we did all rendering on the server at the time. On the heavyweight front end servers, we used threading with one TCP connection assigned to each thread. Threading was also less efficient (in Linux, at least) than it is now, so a large number of clients necessitated a large number of servers.

When interfacing with external systems, standard protocols and/or file formats were preferred. Web services of some kind were starting to become popular, usually just when interfacing with external systems, since they used XML (SOAP) at the time and processing XML is computationally expensive.

This was before Google V8 was released, so JavaScript was seen as sort of a (slow) toy language to do only minor DOM modifications, not do significant portions of the rendering. The general guidance was that anything like form validation done on the client side in JS was to be done only for slight efficiency gains and all application logic had to be done on the server. The release of NGINX to resolve the C10K problem, Google V8 to make JS run faster, and Node.js to scale front end systems for large numbers of idle TCP connections (C10k) all impacted this paradigm in the 2000s.

Internally, applications often used proprietary communication protocols, especially when interacting with internal queueing systems. For internal systems, businesses prefer data be retained and intact. At the time, clients still sometimes preferred systems be able to participate in distributed two-phase commit (XA), but I think that preference has faded a bit. When writing a program that services queues, you didn't need to worry about having a large number of threads or TCP connections -- you just pulled a request message from the request queue, processed the message, pushed a response onto the response queue, and moved on to the next request message. I'd argue that easing the strong preference for transactional integrity, the removal of the need for internal services to care about the C10k problem (async), and the need to retain developers that want to work with recent "cool" technologies reduced the driver for internal messaging solutions that guarantee durability and integrity of messages.

Also, AWS's certifications try to reflect how their services are used. The AWS Developer - Associate still covers SQS, so people are still using it, even if it isn't cool. At my last job I saw applications using RabbitMQ, too.


It may be that lambdas (cloud functions, etc) have become more popular and supported on other platforms.

When you enqueue something, you eventually need to dequeue and process it. A lambda just does that in a single call. It also removes the need to run or scale a worker.

I think Kafka continues to be popular because it is used as a temporary data store, and there is a large ecosystem around ingesting from streams.

I personally use queues a lot and am building an open source SQS alternative. I wonder if an open source lambda replacement would be useful too. https://github.com/poundifdef/SmoothMQ


This is a big part of it IMO. When your downstream consumers can scale up and down quickly, you don’t necessarily need anything in the middle to smooth out load unless your workloads are especially spiky.

I think this also speaks to a related phenomenon where there are simply more tools and technologies you can buy or run “off the shelf” now. Back in the 2010s everybody was trying to roll their own super complex distributed systems. Nowadays you have a ton of options to pay for more or less polished products to handle that mess for you. No need for engineering meetups and technical blogs about tools that kinda-sorta work if you really know what you’re doing - just pay snowflake or confluent and work on other problems.


Regarding this issue, I have some observations of my own. I've noticed that systems based on queues, such as Kafka, AMQP, etc., are still very widespread, for example in vehicle networking, transaction systems, and so on. I recently encountered a customer deploying Kafka on AWS, with monthly spend on Kafka-related compute and storage exceeding $1 million. The cluster scale is huge, containing various system events, logs, etc. I've also seen customers building IoT platforms based on Kafka. Kafka has become very central to the IoT platform, and any problem can make the entire IoT platform unavailable.

I personally have written over 80% of the code for Apache RocketMQ, and today I have created a new project, AutoMQ (https://github.com/AutoMQ/automq). At the same time, we also see that competition in this field is very fierce. Redpanda, Confluent, WarpStream, StreamNative, etc., are all projects built on the Kafka ecosystem.

Therefore, the architecture based on message queues has not become obsolete. A large part of the business has shifted toward streaming. I think streaming and MQ are highly related: streaming leans more toward data flow, while MQ leans more toward individual messages.


People got excited about it as a pattern, but usually apps don't have that many things that really have to go in the background. And once you do, it becomes really hard to ensure transactional safety across that boundary. Usually that's work you want to do in a request in order to return a timely error to the client. So most jobs these days tend to be background things, pre-caching and moving bits around on cdns. But every single one of those comes with a cost and most of us don't really want a mess of background jobs or distributed tasks.

I just added a RabbitMQ-based worker to replace some jobs that Temporal.io was bad at (previous devs threw everything at it, but it's not really suited to high throughput things like email). I'd bet that Temporal took a chunk of the new greenfield apps mindshare though.


"The technology just got mature enough that it's not exciting to write about, but it's still really widely used."

My money is on this. I think the simple usecase of async communication, with simple pub/sub messaging, is hugely useful and not too hard to use.

We (as a Dev community) have just gotten over event sourcing, complex networks and building for unnecessary scale. I.e. we're past the hype cycle.

My team uses NATS for async pub/sub and synchronous request/response. It's a command-driven model, and we keep a huge log table with all the messages we have sent. Schemas and usage of these messages are internal to our team, and messages are discarded from NATS after consumption. We do at-least-once delivery, and message handlers are expected to be idempotent.

We have had one or two issues with misconfiguration in NATS resulting in message replay or missed messages, but largely it has been very successful. And we were a 3 person dev team.

It's the same thing as Kubernetes in my mind - it works well if you keep to the bare essentials and don't try to be clever.
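For a flavor of how small the moving parts are, here is roughly what the pub/sub side looks like with the nats-py client, assuming a local NATS server and a placeholder subject (JetStream is what layers on the persistence and at-least-once delivery described above):

    import asyncio
    import nats

    async def main():
        nc = await nats.connect("nats://localhost:4222")   # assumes a local NATS server

        async def handler(msg):
            # at-least-once delivery means duplicates are possible; handlers stay idempotent
            print(f"{msg.subject}: {msg.data!r}")

        await nc.subscribe("commands.provision", cb=handler)   # placeholder subject
        await nc.publish("commands.provision", b'{"tenant_id": 42}')
        await nc.flush()
        await asyncio.sleep(0.5)
        await nc.close()

    asyncio.run(main())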


I think it's all of the above.

In large enterprises, there is usually some sort of global message bus on top of Kafka, AWS Kinesis or similar.

In smaller shops, a dedicated message bus is over-engineering and can be avoided by using the db or something like Redis. It is still a message queue, just without a dedicated platform.


I think it's very much your last theory -- used everywhere but not as interesting to tell people about as it might have been a decade ago. Queues are now Boring Technology(tm), and that's a good thing.


They aren't a general solution and don't really add much to your average application. But there are still instances where they make a lot of sense.

What I would need to see before bothering with a message queue architecture:

* High concurrency, atomic transactions

* Multiple stages of processing of a message required

* Traceability of process actions required

* Event triggers that will actually be used required

* Horizontal scaling actually the right choice

* Message queues can be the core architecture and not an add on to a Frankenstein API

Probably others, and yes you can achieve all of the above without message queues as the core architecture but the above is when I would think "I wonder if this system should be based on async message queues".


Could you elaborate a little on the traceability part, please? If the different steps of a processing chain are distributed via queues, you rather have an overhead in collecting the information in a central place, I would think.


That is certainly true, but message queues are pretty well suited to tagging and audit logs. So the systems involved tag the messages and an audit repository can accept logs from all services.

I guess this isn't in contrast to a monolith which is easier to log, but in contrast to non-message based microservices I think it's easier to implement the audit logging.


My company heavily relies on Amazon SQS for background jobs. We use Redis as well, but it is hard to run at scale. Hence, anything critical goes to SQS by default. SQS usage is so ubiquitous I can't imagine anyone being interested in writing a blog post or presenting at a conference about it. Once you get used to SQS specifics (more-than-once delivery, message size limits, client/server tooling, expiration settings, DLQs), I doubt there's anything that can beat it in terms of performance/reliability, unless you have the resources to run Redis/Kafka/etc. yourself. I would recommend searching for talks by Shopify eng folks on their experience, in particular from Kir (e.g. https://kirshatrov.com/posts/state-of-background-jobs)
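The consumer side really is as boring as it sounds. A sketch of a boto3 worker loop (the queue URL and handler are placeholders), with the usual SQS caveats as comments:

    import boto3

    sqs = boto3.client("sqs", region_name="us-east-1")
    queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"   # placeholder

    def process(body):
        print("processing:", body)   # stand-in for the real handler; must be idempotent

    while True:
        # long-poll up to 20s instead of hammering the API
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,
        )
        for msg in resp.get("Messages", []):
            process(msg["Body"])
            # deleting is the "ack"; if we crash first, the message reappears
            # after the visibility timeout (hence more-than-once delivery)
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])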


I never fully understood the need for back end message queues TBH. You can just poll the database or data store every few seconds and process tasks in batches... IMO, the 'real time' aspect was only ever useful for front end use cases for performance reasons since short polling every second with HTTP (with all its headers/overheads) is prohibitively expensive. Also, HTTP long polling introduces some architectural complexity which is not worth it (e.g. sticky sessions are required when you have multiple app servers).

Unfortunately, moving real-time messaging complexity entirely to the back end has been the norm for a very long time. My experience is that, in general, it makes the architecture way more difficult to manage. I've been promoting end-to-end pub/sub as an alternative for over a decade (see https://socketcluster.io/) but, although I've been getting great feedback, this approach has never managed to spread beyond a certain niche. I think it's partly because most devs just don't realize how much complexity is added by micromanaging message queues on the back end and figuring out which message belongs to what client socket instead of letting the clients themselves decide what channels to subscribe to directly from the front end (and the back end only focuses on access control).

I think part of the problem was the separation between front end and back end developer responsibilities. Back end developers like to ignore the front end as much as possible; when it comes to architecture, their thinking rarely extends beyond the API endpoint definition; gains which can be made from better integrating the back end with the front end are 'not their job'. From the perspective of front-end developers, they see anything performance-related or related to 'architectural simplicity' as 'not their job' either... There weren't enough full stack developers with the required insights to push for integration efficiency/simplicity.


Our whole backend is queue-based. If it's asynchronous and you don't need a fast response time, use a queue. It's easy, reliable, and the queue can drive lambdas. Queues also make it easier to collect metrics and performance data.

During heavy load the queue bloats up to a few million messages, then drains off over time. Or it spawns a few hundred lambdas to chow all the messages down...depending on what we want.


I've gone through all of these at different scales. What I find these days is that (1) databases have gotten good enough that separate queue infrastructure often isn't worth it, and (2) databases provide much better observability (you can query the contents). If you really do want a queue, you're still better off using a stream (e.g. Kafka) for multiple producers and/or consumers. Using a database table as a 'transactional outbox' is a poor man's Kafka, which works well enough without the additional infrastructure at most scales.

Redis taking some of the duty as you mentioned and microservices/distributed systems being less fashionable likely also factor into it.
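To make the "transactional outbox as poor man's Kafka" concrete, here's a rough sketch of the polling side in Python/psycopg2, assuming a hypothetical outbox table with id, payload, and processed_at columns; SKIP LOCKED is what lets several workers poll the same table without stepping on each other:

    import psycopg2

    conn = psycopg2.connect("dbname=app")   # placeholder connection string

    def handle(payload):
        print("relaying:", payload)         # stand-in for publishing / processing the event

    def claim_one():
        with conn, conn.cursor() as cur:    # commits on success, rolls back on error
            cur.execute(
                """
                SELECT id, payload FROM outbox
                WHERE processed_at IS NULL
                ORDER BY id
                LIMIT 1
                FOR UPDATE SKIP LOCKED
                """
            )
            row = cur.fetchone()
            if row is None:
                return False
            job_id, payload = row
            handle(payload)                 # still inside the transaction holding the row lock
            cur.execute("UPDATE outbox SET processed_at = now() WHERE id = %s", (job_id,))
            return True

The observability point really shows here: the "queue" is just rows you can SELECT.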


1) Distributed databases do the same job, putting data local. They are slower, however they have less overhead and work involved if you already have a db layer.

2) Serverless, e.g. AWS lambdas can be joined with step functions instead and scale without a queue.

3) People have been burned. Multiple configurations, multiple queues, multiple libraries and languages, multiple backing stores, multiple serialisation standards and bugs - it's just overly complex engineering a distributed system for small to medium business. YAGNI.

4) Simpler architectures, e.g. microservices, for better or worse tend to be fat monoliths made more efficient - mostly, in my experience, because those in charge don't actually understand the pattern - but the side effect is fewer queues compared to a real microservices arch.

O/T: I cringe whenever I hear our tech architects describe our arch as microservices and datamesh. The former because it's not (as above, it's multiple small services); the latter because datamesh is an antiquated pattern whose need is better filled with segregated schemas on a distributed database, and scoped access per system to the data each needs, instead of adapter patterns, multiple dbs with slightly different schemas, and facades/specialised endpoints all the fucking way down.


I think it depends but to add some noise to the discussion:

People really abused kafka: https://www.confluent.io/en-gb/blog/publishing-apache-kafka-... like really abused it.

Kafka is hard to use, has lots of rough edges, doesn't scale all that easily, and isn't nice to use as a programmer. But you can make it do lots of stupid shit, like turn it into a database.

People tried to use message queues for synchronous stuff, or things that should be synchronous, and realised that queuing those requests is a really bad idea. I assume they went back to REST calls or something.

Databases are much, much faster now, with SSDs, better designs, and fucktonnes of RAM. Postgres isn't really the bottleneck it once was.

SQS and NATS cover most of the design use cases for pure message queues (as in, no half-arsed RPC or other features tacked on) and just work.

Message queues are brilliant, I use them a lot, but only for data processing pipelines. And I only use them to pass messages, not actual data, so I might generate a million messages, but each message is <2k.

Could I use a database? Probably, but then I'd have to make an interface for that, and do loads of testing and junk.


The only place I find the complexity of message queues worth the trouble is in the embedded world. The limited resources make messaging a sensible way to communicate across devices with a fairly agnostic perspective on what runs on the devices apart from the message queue.

Most of us in the desktop computing world don't actually need the distribution, reliability features, implementation-agnostic benefits of a queue. We can integrate our code very directly if we choose to. It seems to me that many of us didn't for a while because it was an exciting paradigm, but it rarely made sense in the places I encountered it.

There are certainly cases where they're extremely useful and I wouldn't want anything else, but again, this is typically in settings where I'm very constrained and need to talk to a lot of devices rather than when writing software for the web or desktop computers.

As for your last point, the Internet of Things is driven by message queues (like MQTT), so depending on the type of work you're doing, message queues are all over the place but certainly not exciting to write about. It's day-to-day stuff that isn't rapidly evolving or requiring new exciting insights. It just works.


A few things I've noticed, in a large "I pick things up and put them down" solution for high-dollar-value remote transactions:

- a lot of services we need to communicate with have much higher resiliency than in the past, so we've seen a big decrease in operational tasks for the queues that are "guarding" those transactions; newer workloads might have less of a need to guard;

- many services we use support asynchronous execution themselves, using patterns like events/callbacks etc., and while they may use message queues internally, we don't necessarily have to do so;

- in what would have been called an "enterprise integration" environment, we are using internal event buses more and more, given they enable looser coupling and everyone speaks HTTP.

From a technology perspective, message queuing has been commodified, I can pull an SQS off the shelf and get right to work. And so maybe the ubiquity of cloud based solutions that can be wired together has just removed the need for choice. If I need mqtt, there’s an app for that. Fanout? App for that. Needs to be up 25/7? …


I am under the impression people are still actively using MQs; it's just become a commodity and not as exciting as it was. I think there are two major cases: you need to do something asynchronously, and in a specific order.

Simple example from a past project: in a workflow/process management app (a task manager on steroids), there's a group of tasks (a branch) that can be completed by multiple people in any order. When all tasks are done we have to mark the whole branch as completed and move the workflow further. Many instances of the workflow are running at the same time. The logic is much simpler to implement when you process all task completions within the same workflow instance in order, but from different instances in parallel. It's also much easier to provide a close-to-realtime experience to users: when a user clicks on a checkbox, the task is shown completed instantly, along with the other effects - the next task becomes active, the branch is shown as completed, the whole workflow is shown as completed, etc.


I think it's a mix of:

1. Queues are actually used a lot, esp. at high scale, and you just don't hear about it.

2. Hardware/compute advances are outpacing user growth (e.g. 1 billion users 10 years ago was a unicorn; 1 billion users today is still a unicorn), but serving (for the sake of argument) 100 million users on a single large box is much more plausible today than 10 years ago. (These numbers are made up; keep the proportions and adjust as you see fit.)

3. Given (2), if you can get away with stuffing your queue into e.g. Redis or a RDBMS, you probably should. It simplifies deployment, architecture, centralizes queries across systems, etc. However, depending on your requirements for scale, reliability, failure (in)dependence, it may not be advisable. I think this is also correlated with a broader understanding that (1) if you can get away with out-of-order task processing, you should, (2) architectural simplicity was underrated in the 2010s industry-wide, (3) YAGNI.


Most of those architectures were run in company data centers. The swap to the cloud and to small stateless services (the rise of the SPA) meant that a complex staged event-driven system was less needed.

On AWS for example, you use SQS and a sprinkling of SNS, or perhaps Kinesis for a few things and you're good. There isn't a lot to talk about there, so the queues no longer become the center of the design.

Message-queue based architectures are great for data processing, but not great for interactive web sites, and if most people are building interactive web sites, then the choices seem a little obvious. I still design event systems for data processing (especially with immutable business data where you have new facts but still need to know that you were "wrong" or had a different picture at some earlier time). But for most apps... you just don't need it.


For example MQTT finds plenty of use with IoT device communication.

https://en.wikipedia.org/wiki/MQTT
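For reference, the subscriber side of MQTT is about as small as messaging code gets. A sketch using the paho-mqtt 1.x callback API (the broker and topic here are placeholders):

    import paho.mqtt.client as mqtt

    def on_connect(client, userdata, flags, rc):
        client.subscribe("sensors/+/temperature")        # placeholder topic filter

    def on_message(client, userdata, msg):
        print(f"{msg.topic}: {msg.payload!r}")

    client = mqtt.Client()
    client.on_connect = on_connect
    client.on_message = on_message
    client.connect("test.mosquitto.org", 1883)           # public test broker; use your own
    client.loop_forever()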


The web got faster, and it became easier to build and consume APIs, so we eliminated the need for an intermediary. More "native" event-driven architectures emerged.


"The web got faster" [citation needed]

Sure the network has gotten faster, but with a few exceptions like Craigslist and this site, in general "the web" has gotten way, way slower in real terms, at least as I see it. These days megabytes of JavaScript (or WebAssembly) seem to be required just to display pages with a kilobyte or two of actual text content...


Browsing one of those bloated monstrosities on a nice new MacBook with a 2gbps connection is still a lot faster and nicer than Craigslist on old Celeron at 56k.


I should have clarified that I mean the backend infrastructure is much faster; I'm not talking about the client's experience (since rarely do end clients interact with message queues)


I was going to say something like this but you beat me to it. Moving more application state and logic to frontend SPAs reduced the need for so much backend work.

Backends are faster and more robust when they don't need to manage the "session" so much and instead just validate the final API request data and if they queue up anything it's those actions that require database transactions or to publish a message.


It's weird that people say this and yet the web feels very slow.


I should have clarified that I mean the backend infrastructure is much faster; I'm not talking about the client's experience (since rarely do end clients interact with message queues)

(in other words, by "web" I mean the network, not the content)


I see a fair amount of Kafka, while most other platforms have diminished. I think that is because people treat Kafka like a database/system of record. A queue is not a system of record.

A lot of the difficulty in modeling a complex system has to do with deciding what is durable state vs what is transient state.

Almost all state should be durable, but durability is more expensive upfront. So people make tradeoffs to model a transition as transient and put it in a queue. One or two or three years in, that is almost always a regretted decision.

Message queues that are not databases/systems of record wind up glossing over this durable/transient state problem, and then, when you also have this unique piece of infrastructure to support, it's a "now you have two problems" moment.


Personal experience: I needed a message broker while working with multiple sensors constantly streaming data at high frequency.

I have seen a startup where RabbitMQ was being used to hand-off requests to APIs (services) that take long to respond. I argued for unifying queueing and data persistence technology using Postgres even though I know a simple webhook would suffice.

Given that AWS has to sell, and complexity tends to make people look smart, another server was spun up for RabbitMQ :)

Many companies that have run a highly distributed system have figured out what works for them. If requests are within the read and write rates that Redis or Postgres can handle, why introduce RabbitMQ or Kafka?

Always remember that the Engineer nudging you towards more complexity will not be there when the chips are down.


Another theory: HTTP + Service Discovery gained popularity, alleviating the need to integrate with message brokers. Most popular languages have lightweight HTTP servers that can run in-process and don't need heavy Application servers to run them now. And Service Discovery routes the requests to the right places without the need for going through a central broker.

Message brokers need client libraries for every language and serialization support. HTTP clients and JSON Serialization have first-class support already, so many software distributors ship those APIs and clients first. Everyone got used to working with it and started writing their own APIs with it too.


HTTP is also very well understood with at least basic monitoring / health management that is fairly straightforward. I know what happens when I make an HTTP request. I know how long it should take to get a response, and I know what a success or error response will look like. And since I'm the originator, my log will itself contain all the necessary information to troubleshoot whether the issue is on my side or the other's.

That's not to say this can't also be done with message brokers, but unless there's a good reason HTTP won't work well, a lot of this stuff already works intuitively out of the box with HTTP.


I maintain and develop a message queue-based architecture at work (started around 2014), so here's my take:

* message queues solve some problems that are nowadays easily solved by cloud or k8s or other "smart" infrastructure, like service discovery, load balancer, authentication

* the tooling for HTTPS has gotten much better, so using something else seems less appealing

* it's much easier to get others to write an HTTPS service than one that listens on a RabbitMQ queue, for example

* testing components is easier without the message queue

* I agree with your point about databases

* the need for actual asynchronous and 1:n communication is much lower than we thought.


Because you can set up a rudimentary queueing system in MySQL/PostgreSQL very quickly these days. And it scales really well for small to medium sized applications!

I maintain a web application with a few hundred daily users and with the following table I have never had any problems:

    CREATE TABLE `jobs` (
      `id` BIGINT NOT NULL AUTO_INCREMENT,
      `queue` VARCHAR(255) NOT NULL,
      `payload` JSON NOT NULL,
      `created_at` DATETIME NOT NULL,
      PRIMARY KEY (`id`),
      INDEX (`queue`)
    );

Using MySQL's LOCK and UNLOCK I can ensure the same job doesn't get picked up twice.

All in all, it's a very simple solution. And simple is better!
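To spell out the worker side, here's a sketch in Python with mysql-connector, using LOCK/UNLOCK TABLES to serialize pickup as described above (connection settings are placeholders; MySQL 8's SELECT ... FOR UPDATE SKIP LOCKED is an alternative if you'd rather avoid a table lock):

    import mysql.connector

    conn = mysql.connector.connect(database="app", autocommit=True)   # placeholder params

    def claim_next_job(queue_name):
        cur = conn.cursor(dictionary=True)
        cur.execute("LOCK TABLES jobs WRITE")    # no other worker can grab a job while we hold this
        try:
            cur.execute(
                "SELECT id, payload FROM jobs WHERE queue = %s ORDER BY id LIMIT 1",
                (queue_name,),
            )
            job = cur.fetchone()
            if job is not None:
                cur.execute("DELETE FROM jobs WHERE id = %s", (job["id"],))
        finally:
            cur.execute("UNLOCK TABLES")
            cur.close()
        return job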


I think probably for garbage collection reasons.

At my [place of work] we have built a simple event system in top of lambda functions, sqs, s3, eventbridge etc to ingest and add metadata to events before sending them on to various consumers.

We replaced an older kafka system that did lots of transformations to the data making it impossible to source the origin of a field at the consumer level; the newer system uses an extremely KISS approach - collate related data without transformation, add metadata and tags for consumers to use as a heads up and then leave it at that.

I agree that most regular stuff should just be http (or whatever) microservices as the garbage collection is free; requests, sockets, etc time out and then there's no rubbish left over. In an event based system if you have issues then suddenly you might have dozens of queues filled with garbage that requires cleanup.

There are definitely pros to event based, the whole idea of "replaying" etc is cool but like...I've never felt the need to do that...ever.

The event volume that we do process is quite low though, maybe a couple hundred k messages a day.


I think the premise is wrong: we are still using it en masse and in ever-increasing amounts.

We just call it something different. Or use different underlying products. Nearly all web frameworks have some worker system built in. Many languages have async abilities using threads and messaging built in.

The only popular language and ecosystem I can think of that doesn't offer "worker queues" and messaging OOTB is JavaScript.

We are using it more than ever before. We just don't talk about it anymore, because it has become "boring" tech.


I implemented RabbitMQ based messaging queues as a mechanism to coordinate execution among discrete components of a handful of ambitious laboratory automation systems ~4-8 years ago.

Given a recent opportunity to rethink messaging based architectures, I chose the simplicity and flexibility of Redis to implement stack and queue based data-structures accessible across distributed nodes.

Even with a handful of nodes, it was challenging to coordinate a messaging-based system. The overhead of configuring a messaging architecture, essentially developing an ad hoc messaging schema with each project (typically simple JSON objects), and the relatively opaque infrastructure that often required advanced technical support led messaging systems to fall out of favor for me.

Kafka seems to be the current flavor of the day as far as messaging based systems, but I don't think I'll ever support a system that approaches the throughput required to even think about implementing something like Kafka in the laboratory automation space - maybe there's a use case for high-content imaging pipelines?

Right now, I'd probably choose Redis for intra-system communication if absolutely necessary, then something like hitting a Zapier webhook with content in a JSON object to route information to a different platform or software context, but I'm not in a space where I'm handling Terabytes of data or millions of requests a second.


Sounds like you're about to reinvent a queueing system on top of Redis, in a very painful way.


Redis already has the data structures for a queue and stack essentially straight out of the box.
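Right - with redis-py it's a handful of lines, assuming a reachable Redis instance and arbitrary key names:

    import redis

    r = redis.Redis(host="localhost", port=6379)

    # queue (FIFO): push on the left, blocking-pop from the right
    r.lpush("jobs", b'{"task": "resize", "id": 1}')
    item = r.brpop("jobs", timeout=5)       # (key, payload), or None on timeout

    # stack (LIFO): push and pop from the same end
    r.lpush("undo", b"step-1")
    r.lpush("undo", b"step-2")
    latest = r.lpop("undo")                 # b"step-2"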


The technology just got mature enough that it's not exciting to write about, but it's still really widely used.

Message queue-based architectures are the backbone of distributed, event-driven systems. Think of systems where when this particular event happens, then several downstream systems need to be aware and take action. A message queue allows these systems to be loosely coupled and supports multiple end system integration patterns.

Notice, this is systems or enterprise development, not application development. If you're using a message queue as part of your application architecture, then you may be at risk of over-engineering your solution.


My sense is the hype just died down. There are genuine cases where message queues are the correct solution. But since most of the developers are clueless they just jump on the latest trend and write blog posts and make Youtube videos.

This effect has also happened with microservices/monoliths, lambda/serverless, agile/scrum (still no concrete definition of these). Even with cloud as a whole, there are so many articles about how companies managed to cut cloud costs to a fraction just by going bare metal.


I kind of stumbled into it. I have a server that's processing a lot of incoming data from field devices that expect quick 200 responses. The processing I was required to do on the data was pretty expensive time-wise, mainly because I have to make multiple calls to a third-party API that isn't highly performant. In order to keep everything stable, I had to delegate the data processing to a separate process via a message broker (Redis, with the Bull npm package as an abstraction layer to handle the message-passing), and I have no regrets. This pattern was suggested in the documentation of NestJS, the framework I am using.

After I realized the power of this pattern (especially because of my heavy leaning on the mentioned third-party API), I started using it in other areas of my application as well, and I find it to be a helpful pattern.

As far as maintenance goes, I just have Heroku take care of my Redis instance. I can easily upgrade my specs with a simple CLI command. There was a slight learning curve in the beginning, but I got the hang of it pretty quickly, and it's been easy to reason about since.


It's the same reason why any old fad isn't in vogue anymore - it's just how popular things go. I mean, message queues weren't exactly new things in the late 2000s... your standard GUI and mouse uses message queues, pretty much since the 1980s. More and more people just caught on over time, popularity hit a peak, and then people eventually moved on. They're still used in many places, just no longer what's being crazed about.


I really hope that people are slowly starting to understand that using Kafka and turning it into a single point of failure (yes, it fails) of your architecture is not a good idea.

This pattern has its uses, but if you are using it everywhere, every time you have some sort of notification, because "it's easy" or whatever, you are likely doing it wrong, and you will understand this at some point and it will not be pleasant.


It is impressive if Kafka is the weakest link in your system (99.99% is achievable, and even 99.999% is not impossible, with the architecture).


I've yet to see the numbers you list live, but in addition to pure uptime, I've seen a lot of instances where it has been used as the default communication channel instead of proper RPCs, with devastating effects on the long run from the point of view of issues, debugging, predictability, etc...

It's being overused where it has no reason to be, good and simple on paper, awful in the way many are using it. Most of the times, you don't need it.


We often use RabbitMQ as middleware, and it is boring because it has proven very reliable.

Most people that deal with <40k users a day will have low server concurrency loads and can get away with database-abstracted state-machine structures.

Distributed systems are hard to get right, and there are a few key areas one needs to design right...

If I were to give some guidance, then these tips should help:

1. user UUID to allow separable concurrent transactions with credential caching

2. global UTC time backed by GPS/RTC/NTP

3. client-side application-layer load balancing through time-division multiplexing (can reduce peak loads by several orders of magnitude; see the sketch after this list)

4. store, filter, and forward _meaningful_ data

5. peer-to-peer AMQP with role enforcement reduces the producer->consumer design to a single library. i.e. if done right the entire infrastructure becomes ridiculously simple, but if people YOLO it... failure is certain.

6. automatic route permission and credential management from other languages can require a bit of effort to sync up reliably. Essentially you end up writing a distributed user account management system in whatever ecosystem you are trying to bolt on. The client user login x509 certs can make this less laborious.

7. redacted

8. batched 128 kB AMQP message payloads; depending on how the consumer is implemented, this can help reduce hits to the APIs (some user UUID owns those 400 insertions, for example)

9. One might be able to just use Erlang/Elixir channels instead, and that simplifies the design further.
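
A minimal sketch of tip #3, under the assumption that each client derives a deterministic slot from its user UUID (tip #1) and delays its sync accordingly; the function names and the 60-second window are purely illustrative:

    // Time-division multiplexing on the client side: each client hashes its
    // user UUID into a slot within a sync window, so thousands of devices
    // don't all phone home in the same second.
    const SYNC_WINDOW_MS = 60_000; // hypothetical 60s reporting window

    function slotOffsetMs(userUuid: string): number {
      let hash = 0;
      for (const ch of userUuid) {
        hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple deterministic hash
      }
      return hash % SYNC_WINDOW_MS;
    }

    function scheduleSync(userUuid: string, sync: () => Promise<void>): void {
      const now = Date.now();
      const windowStart = Math.floor(now / SYNC_WINDOW_MS) * SYNC_WINDOW_MS;
      let fireAt = windowStart + slotOffsetMs(userUuid);
      if (fireAt <= now) fireAt += SYNC_WINDOW_MS; // missed this window, take the next
      setTimeout(() => void sync(), fireAt - now);
    }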

Have a great day, =3


Tip #7 is my favorite. ;-p


#7 is extremely important...

If it was free it would be worthless... lol =3


Message based tech is less popular to talk about, not that it's less used overall.

These days, with AI, vector dbs are all the rage, so everyone hops onto that train.


I use webhooks now and they work really well for asynchronous situations. Spinning up a base app in Node to do this is super simple and much easier to maintain than doing it with a Kafka message bus.

Nice to see so many developers owning up to the "resume building" and being pragmatic about solving human/business problems versus technology for the sake of it.


- message queues are still widely used, and are more often cloud-hosted than self-hosted, so you will see them less in architecture diagrams

- event logs (stateful multi-consumers) have taken a portion of the message queue workload; this is more likely than moving them to redis/database

message queuing works incredibly well for many problems. it's as critical to most companies' architectures as an application database


Nice observation. I still use Google Pub/Sub in my application - I recently also gave a talk on how we use Pub/Sub for our use case in a GCDG event (GCDG stands for Google Cloud Developer Group).

But now that I think about it, we don't use it in the traditional sense. Most of our regular operations work well enough by just using the "async" pattern in our programming (in JS and Rust).

The only place we use Pub/Sub is for communication between our NodeJS backend server and the Rust servers that we deploy on our client's VMs. We didn't want to expose a public endpoint on the Rust server (for security). And there was no need for a response from the Rust servers when the NodeJS server told it to do anything.

We don't fully utilize the features of a messaging queue (like GCP's Pub/Sub), but there just wasn't a better way for our specific kind of communication.


* The technology just got mature enough that it's not exciting to write about, but it's still really widely used.

That’s it. Full stop.


We have a new project (~ 6 years now) where we implemented a queue with RabbitMQ to temporarily store business events before they are stored in a database for reporting later.

It's awesome!

It absorbs the peaks, smoothes them out, acts as a buffer for when the database is down for upgrades, and I think over all these years we only had one small issue with it.

10/10 would recommend.


What's the data consistency story around crashes? Backup/recovery?


Sorry, I'm not sure I understand your question. Can you rephrase?


Presumably you have backups for both the primary database and the message queue (or maybe no backup for the latter). If a disaster happens, requiring you to restore backups, how confident are you that your system as a whole behaves as expected (no events acknowledged but not processed, no events processed twice)?


looks like you could have had a blue/green DB setup, which would give you one less system to maintain, plus other benefits that a simple queue doesn't provide.

also, what do you do when the queue is down?


I don't think the queue has ever been down in 6 years. It certainly never was a breaking point.

As for the database, yes, we could do blue/green I guess, but it's a big database and it's more cost effective to rely on the queue.

To be honest, I'm not even sure blue/green would be an option given our constraints.


In our case, we managed to solve most of the use cases with less specialized tools.

I still think queues are great, but most of the time I can just run my queues using language constructs (like Channels) communicating between threads. If I need communication between machines, I can usually do that with Postgres or even s3. If you're writing constantly but only reading occasionally, you don't need to consume every message – you can select the last five minutes of rows from a table.
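
A rough sketch of that last idea, assuming a node-postgres Pool and an events table with a created_at column (both names are just for illustration, not anything prescriptive):

    import { Pool } from 'pg';

    const pool = new Pool(); // connection settings come from PG* env vars

    // Instead of consuming a queue, the occasional reader just pulls
    // whatever was written in the last five minutes.
    async function recentEvents(): Promise<any[]> {
      const { rows } = await pool.query(
        `SELECT * FROM events
         WHERE created_at > now() - interval '5 minutes'
         ORDER BY created_at`
      );
      return rows;
    }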

I've also seen a general trend in my line of work (data engineering) to try and write workloads that only interact with external services at the beginning or end, and do the rest within a node. That makes a lot of sense when you're basically doing a data -> data transformation, and is easier to test.

There are still cases where we need the full power of Kinesis, but it's just a lot less common than we thought it would be 10 years ago.


Last year I left a job that was about to give me the maintenance of a poorly implemented kafka system. I'm so glad I left before I really had to work with that system.

Since then I've been reading about async and await in the newer versions of javascript and it really threw me for a loop. I needed this to slow down some executing code but as I worked through my problems I realized "my god! this is exactly what we could have used for pub/sub at my last job".

We could have replaced a Kafka system, as well as an enterprise workflow system, with JavaScript and the async/await paradigm. Some of these systems cost millions per year to license and administer.


IMHO, message queues were all the hype at the time because cloud was picking up steam and people discovered how to decouple and handle massive loads.

For the blog posts, most were garbage (and still are) if my memory serves right. I recall reading a lot of blog posts, and almost all of them were basically a different derivative of the same "quick-start tutorial" you would find for any decently documented piece of software. Once you get into the real trenches, the blog posts immediately show their limits and how shallow they are.

That all being said, message queues are a crucial part of most complex systems these days. Like your typical tools (containers, git, your language of choice, etc.), they have moved on to being mature and boring.


Hype is a function of the number of discussions, not the number of applications.

There is no hype because not much news there.

That doesn’t mean it is less used.


It’s because they were billed as a way to reduce complexity but in reality just added a ton.

The fundamental issue with event driven architecture is getting out of sync with the source of truth.

Every single design doc I’ve seen in an organization pitching event driven architecture has overlooked the sync issue and ultimately been bitten by it.


can you explain how event driven is getting out of sync?

my understanding is that you pipe data change events right into the queue, after each and every change, so in theory your application should reflect all changes in near real time.

it will only go out of sync if your kafka consumers lag behind producers really, really badly and cannot process what is being produced fast enough


Events get missed for all kinds of reasons, having perfect delivery is basically impossible


Others have mentioned in previous posts that 99% of companies building an app will not actually need this level of infrastructure, especially considering how much better computers have gotten since the 2000s.

But...

Isn't another reason we don't see hype around message queues and such in distributed systems that they are standard practice now? The discourse around this feels less like "message queues will make your architecture so much better, you should try it!" and more like "just use a message queue..." The hype isn't there anymore because the technology is just standard practice.

I could be wrong but whenever I come across any articles on building Distributed Systems, message queues and their variants are one of the first things mentioned as a necessity.


Corollary Question: what do people with prior experience scaling Elixir/Phoenix think about scaling with it?

I've read very favorable reports about it. For instance, it can be scaled using progressively better hardware (like a better CPU or more RAM), or scaled horizontally too. Also, with a database on the same network, there won't be much need for a cache. Presumably, the ability to throw CPU and RAM at it would lessen some of the need for a queue too.

At the same time, I don't notice much Elixir usage in practice and it has remained a small community.


It's fantastic to scale with.

I had written a reply to this thread days ago and decided not to post it.

But it's amazing that you can do away with all the message queue infra since it's essentially baked in.


How good is the vertical scaling with modern hardware?

Can you really just keep upgrading the hardware and essentially just scale a Rails app?


IMO, many "microservices" just needed a way to do async processing without needing to hold on to connections, and landed up leveraging msg queues for such use cases. Now this is mostly getting replaced by new orchestration workflows like was step functions/temporal/orkes etc


It’s that it’s well established and there’s little need for hype on the topic. I work in a large enterprise with lots of autonomous teams, where decoupling is a key tenet for loosely coupling those teams' systems.

We are now based firmly in the Azure landscape, and Event Grid provides us with an effective team service boundary for other teams to consume our events, all with the appropriate RBAC. Within a team, Azure Service Bus is the underlying driver for building decoupled and resilient services where we have to guarantee eventual consistency between our internal system landscape and the external SaaS services we actively leverage. At this scale it works very effectively, especially when pods can be dropped at any point within our k8s clusters.


I think that’s just standard software engineering now. Like no one is struggling to build these architectures.


Message queues are great for flowing information to external systems, one-way. But it's always worth asking why you have separate systems in the first place. And if your app has to also consume from a queue, forming any sort of cycle, look out.

Especially if they are services that are conceptually related, you will immediately hit consistency problems - message queues offer few guarantees about data relationships, and those relationships are key to the correctness of every system I've seen. Treating message queues as an ACID data store is a categorical mistake if you need referential integrity - ACID semantics are ultimately what most business processes need at the end of the day and MQ architectures end up accumulating cruft to compensate for this.


Simple: messaging systems / event-driven systems aren't the "fad of the day" anymore, so you don't have a gazillion vendors pumping out blog posts and Youtube videos to hawk their wares, or a million companies writing about them to look "hip" and "cool" to help recruit developers, or endless legions of consulting companies writing about them in order to attract customers, etc.

Basically every "cool, shiny, new" tech goes through this process. Arguably it's all related to the Garnter Hype Cycle[1], although I do think the process I'm describing exists somewhat independently of the GHC.

[1]: https://en.wikipedia.org/wiki/Gartner_hype_cycle


Message queues are often chosen as a communication protocol in microservice-based architectures. Microservices were a fad and people have sobered up. People have learned when microservices deliver a benefit, and when they are unnecessary. In many cases, they are unnecessary.

Queues are still very useful for queueing up asynchronous work. Most SaaS apps I've worked with use one. However, there is a difference to what kind of queue you need to queue a few thousand tasks per day, vs using the queue as the backbone of all of your inter-service communications. For the first use case, using a DB table or Redis as a queue backend is often enough.
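
For that first use case, a common pattern (my own illustration, not anything from the comment above) is a plain jobs table plus FOR UPDATE SKIP LOCKED, so multiple workers can pull tasks without stepping on each other; table and column names here are made up:

    import { Pool } from 'pg';

    const pool = new Pool();

    // Claim and complete at most one pending job inside a transaction.
    async function workOneJob(handle: (payload: any) => Promise<void>) {
      const client = await pool.connect();
      try {
        await client.query('BEGIN');
        const { rows } = await client.query(
          `SELECT id, payload FROM jobs
           WHERE status = 'pending'
           ORDER BY id
           LIMIT 1
           FOR UPDATE SKIP LOCKED`
        );
        if (rows.length === 1) {
          await handle(rows[0].payload);
          await client.query(`UPDATE jobs SET status = 'done' WHERE id = $1`, [rows[0].id]);
        }
        await client.query('COMMIT');
      } catch (err) {
        await client.query('ROLLBACK');
        throw err;
      } finally {
        client.release();
      }
    }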


The other answers around here are mostly right, but I'd like to add another one, which is right in some situations:

Message queues are often the wrong tool. Often you rather want something like RPC, and message queues were wrongly used as poor man's async DIY RPC.


They aren't, you just don't hear about it.

Basically everything in tech has gone through a hype cycle when everyone was talking about it, when it was the shiny new hammer that needed to be applied to every problem and appear in every resume.

A bit over 20 years ago I interviewed with a company that was hiring someone to help them "move everything to XML". Over the course of the two-hour interview I tried unsuccessfully to figure out what they actually wanted to do. I don't think they actually understood what XML was, but I still wonder from time to time what they were trying to achieve and whether they ever accomplished it.


Maybe one of the reasons it became unpopular is the additional code you have to implement for asynchronous processing in a separate system, and the tracking needed when errors occur during processing in the target system.

It's easier and faster to make a web service request where you get an instant result that you can handle directly in the source system.

Often the queue is implemented in the source system, where you can monitor and see the processing status in real time, without delays.


A lot of great comments here about overcomplicating architectures and using unnecessary tech, but you also need to consider that almost any service you can use on AWS will cost you less per month than a few hours of development time.


Queueing systems haven't disappeared, as they're an important part of distributed systems. If you need to distribute work/data asynchronously to multiple workers, you're gonna use a queuing system.

Although queuing systems can be implemented on top of a database, message queues like RabbitMQ / ZeroMQ are doing a fine job. I use RabbitMQ all the time, precisely because I need to transfer data between systems and I have multiple workers working asynchronously on that data.

I guess these architectures might be less popular, or less talked about, because monoliths and simple webapps are more talked about than complex systems?


We all learned that a distributed monolith is worse than just having a monolith. Truly independent event-based systems are still very useful, but not when they have to communicate back and forth to solve a single problem.


We're using lots of RabbitMQ queues in production. It works well, is efficient, low-maintenance and scales well up to 10k msg/s. By all means, I'd say queues aren't unpopular. It's just that I'm not the kind of person to loudly shill whatever tech I'm using, so you probably won't hear from people like me without asking.

And for a consulting company, a solid message-based deployment is not a good business strategy. If things just work and temporary load spikes get buffered automatically, there's very little reason for clients to buy a maintenance retainer.


We make heavy use of message queues to power millions of daily transactions and requests. I think it's because the patterns have been written about and it is no longer a fun hot topic. AWS SQS, Azure Event Hubs, etc. are all very standard in the recent architecture diagrams I've seen from many companies.


It’s important to keep in mind that 12 years ago, the patterns for high scale message queues (and often even low scale) were still in flux and getting established.

In the ruby world, delayed job was getting upended by sidekiq, but redis was still a relatively new tool in a lot of tool-belts, and organizations had to approach redis at that time with (appropriate) caution. Even Kafka by the mid 10s was still a bit scary to deploy and manage yourself, so it might have been the optimal solution but you potentially wanted to avoid it to save yourself headaches.

Today, there are so many robust solutions that you can choose from many options and not shoot yourself in the foot. You might end up with a slightly over complicated architecture or some quirky challenges, but it’s just far less complex to get it right.

That means fewer blog posts. Fewer people touting their strategy. Because, to be frank, it’s more of a “solved” problem with lots of pre existing art.

All that being said, I still personally find this stuff interesting. I love the stuff getting produced by Mike Perham. Kafka is a powerful platform that can sit next to Redis. And the tooling being built on top of Postgres continues to impress and show how simple even high-scale applications can be.

But, maybe not everyone cares quite the way we do.


Hypothesis: The performance of systems has increased so much that many things that required a queue before can now just be made into regular synchronous transactions. aka PostgreSQL is eating the world.


Nah, messaging based/async operations are rarely for performance benefits...


Message queues are still definitely in use, it's just behind the scenes in most frameworks you're using now. They're still great for the highest scale stuff when you can't pay the abstraction cost and don't need stuff like FIFO semantics.

Along with much more mentioned in this thread, I think a lot of companies realized that they indeed are not AWS/Google/Meta/Twitter scale, won't be in the next decade, and probably never will need to be to be successful or to support their product.


All the devs have already put "event driven" on their resumes. They need something new to show they're at the forefront of technology. I think we're in the AI hype phase, where everyone is competing for those tasty $500k-per-year AI jobs at Google, so the ATS systems don't know what to do with resumes that say "event driven."

The last thing you want on your LinkedIn profile is a link to a video you made in 2015 about your cool Kafka solution. The ATS would spit you out so fast...


Have they gone away? I still use message queues a lot, be it rabbitmq, through Google PubSub, Redis, etc. They are such a normal thing nowadays, just another tool in the toolbox really


We’re in the process of shifting an entire business from DB-stored state driven by large (Java and .NET) apps to an AMQP-based system.

I can’t really say I’m enjoying it. But it does help with scale


Most MQs had architectural side effects: messages could be lost or reordered, some required duplication to support multiple listeners, they lacked history (which complicated debugging), and they introduced or required messy reconciliation and remediation processes.

Distributed logs such as Kafka, bookkeeper, and more stepped in to take some market share and most of the hype.

MQs retain their benefits and are still useful in some cases but the more modern alternatives can offer fewer asterisks.


People discovered that using messaging can add tons of overhead, and async messaging architectures are harder to manage and monitor.

Serialization and going over the network are an order of magnitude slower and more error-prone than good ol' function calls.

I've seen too many systems that spent more time on marshalling from and to json and passing messages around than actual processing.


I have a couple of datapoints.

One project I know of started with message queues for request response pattern. It performed poorly because Windows Service Bus writes messages to a database. That increased latency for a UI heavy application.

The second project used message queues, but the front end was an HTTP API. When overloaded, the API timed out at 30 seconds, but the job was still in the queue and wasn’t cancelled. It led to a lot of wastage.


> Windows service Bus??

MSMQ?


It was renamed Azure Service Bus


It's still around, and it's still going strong in the embedded space. Examples:

* PX4

* Ardupilot

* Betaflight

* DroneCAN

* Cyphal

* ROS/ROS2

* Klipper

* GNU Radio

Also would like to mention that all of the most popular RTOSes implement queues as a very common way of communicating between processes.


As does Apple's Darwin kernel, when used as intended. :-)


At least what I have seen is that MQ was mostly used for batches in the areas I worked in.

The new thing is "event driven architecture" (or whatever they can pass off as that hype). In a lot of cases, it's a better architecture. For fhe remaining batches, we are running against S3 buckets, or looking at no SQL entries in a specific status in a DB. And we still use a little SQS, but not that often.


I’m concerned that a lot of commenters don’t appreciate the difference between a queue and a log. Kafka is not a queue.

I think, like most have said, it's just not a popular topic to blog about anymore, but it's still used. OTOH, logs like Kafka have become more ubiquitous. Even new and exciting systems like Apache Pulsar (a log system that can emulate a queue) have implemented the Kafka API.


There is a constant flow of HN posts about how people have built their own message queues.

I personally have built three. It's the latest thing to do.


> * Databases (broadly defined) got a lot better at handling high scale, so system designers moved more of the "transient" application state into the main data stores.

This but also: computers got incredibly more capable. You can now have several terabytes of ram and literally hundreds of cpu cores in a single box.

Chances are you can take queuing off your design.


We use queues all the time, they’re practical, effective and easy to use. Too mature to talk about is my explanation.


> RabbitMQ, ZeroMQ

There is literally nothing in common between RabbitMQ and ZeroMQ except for the two symbols 'MQ' in the name.


Well, the one thing they have in common is that both projects were originally designed by the same person, Pieter Hintjens.


He was involved in the AMQP mess, not in designing RabbitMQ.


At least if you took his blogs at face value, he wasn't involved in the "mess", but the mess was forced on everyone by corporate interests.

And I believe this to be true. Pieter was a brilliant and wonderful human being with whom I shared a tremendous amount in common and largely the same viewpoint on life. There's not a week that goes by where I don't mourn his loss and wonder where we could be today if he were still with us.


You are correct, I misremembered.


ZeroMQ has never been a managed queue. It was always more of a networking library on top of which you could implement some interesting paradigms, but it has never been on the same playing field as MQs (on purpose).

SQS is still very much alive. It is more than likely the first or second most deployed resource in AWS in my daily work.


I think it's both. They're boring if you need them, because you probably started implementing these things over a decade ago. They're boring if you don't need it because you don't need it - e.g. maybe you're a small team able to work on a single monolithic codebase effectively.


To process all your tasks, it costs the same to run 100 EC2 instances for an hour as to run 10 EC2 instances for 10 hours. It's much easier than before to design a stateless, scalable system, so people would rather scale out and get things done quickly than wait in a queue.


In some ways they’re still popular, but the abstraction has changed.

AWS Lambda, Google Cloud Functions, etc. often resemble message queue architectures when used with non-HTTP triggers: an event happens and a worker spawns to handle it.
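
For instance, a minimal sketch of an SQS-triggered Lambda handler (the queue-to-function wiring itself would live in infrastructure config, and the payload handling here is made up):

    import type { SQSHandler } from 'aws-lambda';

    // Each batch of queue messages spawns an invocation; there is no
    // long-lived consumer process to operate.
    export const handler: SQSHandler = async (event) => {
      for (const record of event.Records) {
        const payload = JSON.parse(record.body);
        // ... handle the event; throwing lets SQS redeliver failed messages
        console.log('processing', payload);
      }
    };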


I continue to use redis as a multipurpose tool including acting as a message queue.


I think it is primarily your last bullet point, it is less exciting to write about unless you really are looking for specifically that content. They are widely used but they are normal parts of a lot of architectures now.


I miss the days of Microsoft’s Robotics Studio, a message passing architecture for controlling robotics and other hardware. If only they had continued development instead of stopping halfway before it reached maturity.


Hype died down along with "microservices" where queues made more sense.


and none of the tech are part of the keyword bingo you slap on the resume for AI jobs.


I believe it’s the last one. I use queues more than ever.

Almost every AWS architecture diagram will have a queue.

SQS is extremely stable, mature and cheap. Integrates with so many services and comes out of the box with good defaults.


This is my experience too. I frequently require a single queue for some very specific task but it does not form part of a greater application.

My current work (data engineering) is such that I don't have a relational database instance that it would make sense to use instead. SQS is very straightforward and just works.

None of this is to say that I would necessarily advocate for a user-facing app to be _based_ on a queue-based architecture.


Kafka is still used heavily, but the barrier is very high.

if you are running a truly global (as in planetary-scale) distributed service, and have multiple teams developing independent components, then it makes sense.


Every one of my customers uses message queues, whether Redis, SQS, or (rarely) Kafka, via Celery, Sidekiq, etc. If anything, it is boring, it works, and it's just part of the default stack everyone uses.


Consensus in distributed systems is hard and MQs don’t help.

If possible, boil systems down to a traditional DB replica set. That is a local minimum in complexity that will serve you for a long time.


Another:

* The Log is the superior solution


Tech cycles, hype, maturity. I remember the prevalence of javascript frameworks, big data bashing, and crypto currency debates for years. Then... nothing.


I have personally found them useful as “glue” layer in heterogeneous and large systems with moving and decoupled parts.

There are any number of ways to do the same thing — context matters.


It's boring tech. They're quietly being used everywhere in big companies doing the heavy lifting.


it's disk speeds.

suddenly everyone could scale much much more, but by then they were moving to the cloud and execs don't understand two buzzwords at the same time.


The hype cooled down once companies realized they "have more microservices than actual users".

I was never obsessed with "event-driven" distributed systems using message queues.

The major issue is keeping state in sync between services.

For quite a long time I got decent results with simple Golang and Postgres scripts distributing work between workers on multiple bare-metal machines.

I took ideas from projects similar to MessageDB:

https://redis.io/docs/latest/develop/data-types/streams/

    CREATE TABLE IF NOT EXISTS message_store.messages (
      global_position bigserial NOT NULL,
      position bigint NOT NULL,
      time TIMESTAMP WITHOUT TIME ZONE DEFAULT (now() AT TIME ZONE 'utc') NOT NULL,
      stream_name text NOT NULL,
      type text NOT NULL,
      data jsonb,
      metadata jsonb,
      id UUID NOT NULL DEFAULT gen_random_uuid()
    );

along with NOTIFY https://www.postgresql.org/docs/9.0/sql-notify.html

or polling techniques
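
A rough consumer-side sketch of the LISTEN/NOTIFY idea (the original scripts were Go; this TypeScript version with node-postgres is just to show the shape, and the channel name and intervals are invented):

    import { Client } from 'pg';

    // Wake the worker via NOTIFY instead of tight polling, with a slow
    // polling loop as a fallback in case a notification is missed.
    async function listenForMessages(onWake: () => Promise<void>) {
      const client = new Client(); // connection settings from PG* env vars
      await client.connect();
      await client.query('LISTEN new_message'); // producers run: NOTIFY new_message
      client.on('notification', () => { void onWake(); });
      setInterval(() => void onWake(), 30_000);
    }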

Redis Stream here and there worked well too https://redis.io/docs/latest/develop/data-types/streams/

Another possible alternative can be "durable execution" platforms like Temporal, Hatchet, Inngest mentioned on HN many times

- Temporal - https://temporal.io/ - https://github.com/temporalio - https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

- Hatchet https://news.ycombinator.com/item?id=39643136

- Inngest https://news.ycombinator.com/item?id=36403014

- Windmill https://news.ycombinator.com/item?id=35920082

The biggest issue I had with workflow platforms is that they require a huge cognitive investment in understanding SDKs and best practices, and in deciding how to properly persist data. Especially if you opt to host it yourself.


Using them all the time in an ETL/ML-Eng type context. So I would say it's the last point for me.


On mobile the phones are so powerful now that you can just make them do most of the heavy lifting :)


I think DBs like etcd, Mongo, and Postgres all have some notion of events and notifications. The only drawback is that you still have to poll from the UI if you use these. This is sufficient for most use cases. If you want real-time updates, Kafka provides long-running queries, which keeps Kafka relevant.


Because it’s massive over engineering of dubious value, unless you are Uber etc


I find it useful in small doses even at small scale. When you have only one of a thing processing stuff, it's useful to isolate it behind queues so you can bounce it and lose nothing: "pause" the incoming queue (new messages stack up), let the pipeline drain, do whatever you must, and then un-pause. Nothing is lost, despite the backend being simple-minded and not fault tolerant.

The key here is "small dose": very limited and strategic use at key points in the system, not a wholesale, endemic "messaging architecture."

Queues also provide valuable observability independent of applications: metrics that can be monitored for performance and availability. This is necessary even at small scale. Put "alerts" on the queues: when they start backing up something has likely fallen over.

Queues also help paper over unanticipated conditions, such as Important Customer (tm) fixing/reworking something and suddenly flushing a huge number of messages into my systems: the queue just stacks that up and the rest of the system avoids being overloaded. At the other end, Important Customer neglects their asynchronous consumers, which fail to process 50% of the time. The queue just automatically retries till it goes through.

All of this is transparent to my applications. That has enormous value.

So there you are; real world "grug brained" work that benefits from judicious application of message queues and deftly avoiding the "architecture" tar pit.


Agreed. Small doses fine, but don’t go full blown event sourcing with distributed transactions etc from the get go.


These things are more-or-less proven, and so aren't sexy because they either work or there are better ways to do them.

Be careful not to conflate message transport with message storage, although message brokers usually do both.

SQS was slower than frozen molasses when I used it c. 2010. ZMQ, ZK, rabbit, and MQTT are oft mentioned. ESBs always come up and a large % of technical people hate them because they come with "software architect" historical baggage.

It's risky to have a SPoF one-grand-system-to-rule-them-all when you can have a standardized API for every department or m[ia]cro-service exposed as RESTful, gRPC, and/or GraphQL in a more isolated manner.

Redis isn't needed in some ecosystems like Elixir/Erlang/BEAM. Memcache is simpler if you just need a temporary cache without persistence.


We use message queues for everything that doesn't have to be real time. It's just another tool in the toolbox and pretty old and well known. I don't think it needs extensive blogging as pretty much anyone who need to use message queues knows how to do it.

It has some downsides: if something goes wrong, it's harder to debug than just using good old REST API calls.


It is much easier now to scale services (microservices, FaaS, etc.) to meet high or fluctuating demand than it used to be. So there are fewer cases where there is much to gain by carefully controlling how messages are buffered up.


too boring(to write about) for engineers, too complex for AI.

Still, very much present/popular in the ecosystems I dabble in.


Like micro-services, it's not a bad idea, but it's not the hammer for every nail, as the people who write books and blog posts seem to push (they've now moved on past blockchain to AI).

If you have a problem that logically has different async services, sure, use Redis or something. Databases also were able to handle this problem, but weren't as sexy, and they explicitly handle the problem better now. Just another tool in the toolbelt.

NoSQL was another solution in search of problems, but databases can handle this use-case better now too.


Kafka is everywhere. But I guess there is a point where hyped new tech becomes boring and mainstream, and people stop writing about it.


"If people have experience designing or implementing greenfield systems based on message queues, I'd be curious to hear about it."

Ok, you asked.

It was early Summer 1997 and I had just graduated college with a computer information systems degree. I was employed at the time as a truck driver doing deliveries to Red Lobster, and my father was not happy and offered me double my current pay and guaranteed 40 hours per week to return to work for him as an electrician. I returned to work with my Dad for the Summer, but after 14+ years of electrical work with him I decided I needed to get a job with computers. Labor Day weekend of 1997 I blasted out over 40 resumes, and on Tuesday 9/9/97 I had my first interview at a small payments startup dotcom in Wilmington, Delaware, the credit card capital of the world at that time. I was hired on the spot and started that day as employee #5 and the first software hire for the company, yet I had NO IDEA what I was doing.

I was tasked with creating a program that could take payments from a TCP/IP WAN interface and proxy them back out over serial modems to Visa, and batch ACH files to the U.S. Fed. This started as a monolith design, and we were processing automated clearing house as well as credit cards by late 1997. I would continue in this architectural design, and in sole software developer support of this critical 100%-uptime system, for many years. Somewhere around mid 1998 volume started to increase and the monolith design experienced network socket congestion; the volume from the firehose continued to increase and I was the sole guy tasked with solving it. That volume increase came from a little-known company at the time: PayPal. Since the 'mafia' was very demanding, they knew I was the guy, but management isolated me because the 'mafia' demanded extensive daily reporting that only the guy who built the platform could provide. This, however, took a backseat to the network connection issues, which were growing at an increasing rate.

I was involved in a lot of technology firsts as a result, and herein starts the queue story. After processing payments for Microsoft TechEd in 1998, I was given an NT Option Pack CD. I was constantly seeking a solution to reduce the network congestion on the monolith, and within this option pack was something called "Microsoft Message Queue". I spent several months of nonstop nights and weekends redesigning the entire system from the ground up around an XML interface API, writing individual services that read from an ingress queue and output to an egress queue; this structure is now known as microservices, and this design solved all the load problems since it scaled extremely well. The redesigned system had many personal-experience enhancements added, such as globally unique identifiers, and the API was fully extensible, but the greatest unseen win was the ability to 100% recreate software bugs, since all message passing was recorded.

PayPal ended up leaving us in ?2003? for JPMorgan after the majority holder refused to sell to the 'mafia'. Some years later I was informed by several management executives that PayPal had also offered to exclusively hire only me, but I was of course never told that at the time.

I have many a business horror story; however, in over a decade of using MSMQ at my first payments company (1998-2010), I only had one data corruption event in the queue broker, which required me to reverse engineer the MSMQ binary, and thus its file format, to recover live payment records. That corruption was linked to bad ECC memory on one of those beige 1U Compaq servers that many here likely recall.

This story may reveal my age but while my ride has been exciting it isn't over yet as so many opportunities still exist. A rolling rock gathers no moss!

Stay Healthy!


I honestly cannot comment on RabbitMQ or Kafka. I have not used them. I have also not used Redis other than learning.

However, a few years ago, I did use MSMQ (Microsoft Message Queue) and it worked very well. At the time, though, I wanted something that didn't limit me to Windows.

In the end, I learned ZeroMQ. Once I understood its "patterns", I created my own broker software.

Originally, all requests (messages) sent to the broker were stored in files. Eventually, I moved over to SQLite. As the broker is designed to process one thing at a time, I was not worried about multiple requests going to SQLite. So now my broker requires few dependencies... just ZeroMQ and SQLite.

(I did not need to worry about async processing as they get passed to the workers)

So, the broker communicates with a client and a worker/consumer.

- client can ask for state (health of a queue, etc)

- client can send a message (to a queue, etc)

The worker communicates with the broker

- please connect me to this queue

- is there anything in this queue for me?

- here is the result of this message (success or failure)

etc.
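
To give a feel for that kind of request/reply exchange, here's a tiny worker-side sketch using the zeromq npm package; the broker address, queue name, and message framing are all made-up placeholders, not the broker described above:

    import { Request } from 'zeromq';

    // Worker poll loop: ask the broker for work, report the result, repeat.
    async function workerLoop() {
      const sock = new Request();
      sock.connect('tcp://broker.internal:5555'); // hypothetical broker address
      for (;;) {
        await sock.send(JSON.stringify({ op: 'poll', queue: 'orders' }));
        const [reply] = await sock.receive();
        const job = JSON.parse(reply.toString());
        if (!job.message) {
          await new Promise((r) => setTimeout(r, 1000)); // nothing queued, back off
          continue;
        }
        // ... do the work, then report success or failure back to the broker
        await sock.send(JSON.stringify({ op: 'result', id: job.message.id, ok: true }));
        await sock.receive(); // broker acknowledgement
      }
    }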

I also made use of the pub-sub pattern. I made a GUI app that subscribes to these queues and feeds updates. If there are problems (failures), you can see them here. I leave it to staff to re-send the message; if it's already been sent, maybe the subscriber missed that packet. Pub-sub is not reliable, after all... but it works 99% of the time.

Overall it has been a great system for my needs -- again, it is lightweight, fast, and hardly costs anything. I do not need a beefy machine, either.

Honestly, I swear by this application (broker) I made. Now, am I comparing it to RabbitMQ or Kafka? No! I am sure those products are very good at what they do. However, especially for the smaller companies I work for, this software has saved them a few pennies.

In all, I found "distributed computing" to be rewarding, and ZeroMQ + SQLite have been a nice combination for my broker.

I have been experimenting with nanomsg-NG as a replacement for ZeroMQ, but I just haven't spent proper time on it due to other commitments.


Because Kafka is using Java. And Java should die.


The pendulum just swung the other way. Message queues were complicated and don't solve every problem, so rather than think about what the individual needs, the "community" took a 180 and adopted solutions that were overly simplistic and didn't solve every problem. It happens every 5 years or so with a different tech thing. HN is nothing if not constantly chasing any contrarian opinion that seems novel.



