
To Message Bus or Not: Distributed Systems Design (2017) - gk1
https://www.netlify.com/blog/2017/03/02/to-message-bus-or-not-distributed-systems-design/
======
cpitman
In enterprise environments, I more often see _overuse_ of message brokers. I
am all for targeted use of messaging systems, but what I usually see is an
all-or-nothing approach where, as soon as a team starts using messaging, they
use it for _all_ interprocess communication.

This has some very painful downsides.

First, traditional message brokers (i.e. queues, topics, persistence, etc.)
introduce a potential bottleneck for all communication, and on top of that
they are _slow_ compared to just using the network (i.e. HTTP, with no
middleman). I've had customers who start breaking apart their monoliths and
replacing them with microservices, and since the monolith used messaging they
use messaging between microservices as well. Well, the brokers were already
scaled to capacity for the old system, so going from "1" monolith service to
"10" microservices means deploying 10x more messaging infrastructure.
Traditional messaging brokers also don't scale all that well horizontally, so
that 10x is probably more like 20x.

Second, performance tuning becomes harder for people to understand. A core
tenet of "lean" is to reduce the number/length of queues in a system to get
more predictable latency and to make it easier to diagnose bottlenecks. But
everyone always does the opposite with messaging: massive queues everywhere.
By the time the alarms go off, multiple queues are backing up and no one is
clear on where the actual bottleneck is.

What I would like to see more of is _strategic_ use of messaging. For example,
a work queue to scale out processing, but the workers use HTTP or some other
synchronous method to call downstream services. Limiting messaging to very few
points in the process still gives a lot of the async benefits to the original
client, while limiting how often requests hit the broker and making it easier
to diagnose performance problems.
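For instance, a rough sketch of that shape, assuming RabbitMQ via amqplib, a
queue named "work", and a made-up downstream HTTP endpoint (all purely
illustrative):

```typescript
// Sketch: one work queue to fan work out, then plain synchronous HTTP
// from the worker to downstream services (no broker hop in the middle).
// Requires Node 18+ for the global fetch.
import amqp from "amqplib";

async function runWorker(): Promise<void> {
  const conn = await amqp.connect("amqp://localhost");
  const ch = await conn.createChannel();
  await ch.assertQueue("work", { durable: true });
  await ch.prefetch(10); // bound in-flight work per worker

  await ch.consume("work", async (msg) => {
    if (msg === null) return;
    try {
      // Synchronous call downstream, straight over the network.
      const res = await fetch("http://downstream.internal/process", {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: msg.content.toString(),
      });
      if (!res.ok) throw new Error(`downstream returned ${res.status}`);
      ch.ack(msg); // ack only after the downstream work succeeded
    } catch {
      ch.nack(msg, false, true); // requeue for a later retry
    }
  });
}

runWorker().catch(console.error);
```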

~~~
lulf
I recommend looking at a standard messaging protocol like AMQP 1.0, which will
allow you to implement the request-response pattern in an efficient manner
without message brokers. Either "direct" client-server as with HTTP, or via
an intermediary such as the Apache Qpid Dispatch Router.

With the dispatch router, you can use pattern matching to specify whether
addresses should be routed to a broker or directly to a particular service.

This way you can get the semantics that best fit your use case.
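As a sketch of the request-response flow (loosely following rhea's
request/response examples, with the host, port, and the "service/quote"
address all assumed; sending by `to` through an anonymous sender relies on the
intermediary, such as the dispatch router, supporting it):

```typescript
// Request/response over AMQP 1.0 with rhea, peer-to-peer or through an
// intermediary like Qpid Dispatch Router.
import { create_container } from "rhea";

const container = create_container();

container.on("connection_open", (ctx) => {
  // A dynamic (temporary) address for the responder to reply to.
  ctx.connection.open_receiver({ source: { dynamic: true } });
});

container.on("receiver_open", (ctx) => {
  // Once the dynamic reply address exists, send the request with reply_to set.
  ctx.connection.send({
    to: "service/quote", // a router can match this prefix to a service or a broker
    reply_to: ctx.receiver.source.address,
    correlation_id: "req-1",
    body: "how much?",
  });
});

container.on("message", (ctx) => {
  console.log("response:", ctx.message?.correlation_id, ctx.message?.body);
  ctx.connection.close();
});

container.connect({ host: "localhost", port: 5672 });
```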

~~~
geezerjay
> I recommend looking at a standard messaging protocol like AMQP 1.0, which
> will allow you to implement the request-response pattern in an efficient
> manner without message brokers.

Honest question: if the goal is to implement the request/response pattern then
why should anyone adopt AMQP instead of plain REST/RPC-over-HTTP?

~~~
lulf
If you don’t use any other pattern than request-response, I agree there is no
point.

If you have a mix of pub-sub, work queues and request-response, it could
simplify your dependencies perhaps.

Also, AMQP 1.0 has some nice async capabilities and acknowledgement modes that
I believe go beyond what HTTP/2 and gRPC support today.

OTOH I don't have any real-world experience operating such a mix of different
communication patterns, so it could be that the advantage is insignificant.

~~~
geezerjay
> If you have a mix of pub-sub, work queues and request-response, it could
> simplify your dependencies perhaps.

That's a good point. Indeed if the project is already rolling with a message
bus then it wouldn't make much sense to increase complexity just to use a
specific message exchange pattern.

------
redact207
This article talks about the benefit of a message bus being pub/sub. This
definitely helps to decouple the internals of apps, which in turn makes things
easier to maintain.

There are many other benefits to using a message bus, and it's a better fit
for distributed systems in general. The hard part is understanding how to
structure apps to make them compatible with messaging.

Imagine sending out a command to the bus and not knowing when it'll get
processed. Perhaps under a second, perhaps next week. You can't expect a reply
at that point because the service that sent it may no longer be running.

If you're on Node, try [https://node-ts.github.io/bus/](https://node-ts.github.io/bus/)

It's a service bus that manages the complexity of rabbit/sqs/whatever, so you
can build simple message handlers and complex orchestration workflows.
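The shape is roughly this; the names here are illustrative, not the
@node-ts/bus API:

```typescript
// The publisher fires the command and moves on; it never awaits a reply,
// because by the time the handler runs the publisher may be gone.
interface SendWelcomeEmail {
  type: "SendWelcomeEmail";
  userId: string;
}

async function onSignup(
  publish: (cmd: SendWelcomeEmail) => Promise<void>,
  userId: string,
): Promise<void> {
  await publish({ type: "SendWelcomeEmail", userId });
  // Done. Maybe the handler runs in under a second, maybe next week.
}

// In another process, whenever the message is finally delivered:
async function handle(cmd: SendWelcomeEmail): Promise<void> {
  // Must tolerate running at any time, and redelivery, so keep it idempotent.
  console.log(`sending welcome email for ${cmd.userId}`);
}
```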

~~~
ledneb
> Imagine sending out a command to the bus and not knowing when it'll get
> processed

I would love to hear how others are correlating output with commands in such
architectures - especially if the results can be displayed to users as a
direct consequence of a command. I've always felt like I'm missing a thing or
two.

It seems the choices are:

* Manage work across domains (sagas, two phase commit, rpc)

* Loosen requirements (At some point in the future, stuff _might_ happen. It may not be related to your command. Deal with it.)

* Correlation and SLAs (correlate outcomes with commands, have clients wait a fixed period while collecting correlating outcomes)

Is that a fair summary of where we can go? Any recommended reading?

~~~
bunderbunder
I don't know about correlating output with commands, but if you're looking to
correlate output with input, one option is to stick an ID on every message,
and, for messages that are created in response to other messages, also list
which one(s) it's responding to.
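In code, such an envelope might look like this (the correlation id tying the
whole chain back to the original command is a common extension, added here as
an assumption):

```typescript
import { randomUUID } from "node:crypto";

// Every message carries its own id plus the id(s) of the message(s) it
// was created in response to.
interface Envelope<T> {
  id: string;             // unique per message
  causationIds: string[]; // the message(s) this one responds to
  correlationId: string;  // the original command's id, copied down the chain
  body: T;
}

function respondTo<T>(parent: Envelope<unknown>, body: T): Envelope<T> {
  return {
    id: randomUUID(),
    causationIds: [parent.id],
    correlationId: parent.correlationId,
    body,
  };
}
```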

I would say that loosening requirements is also a reasonable option. You
_can't_ assume that anything downstream will be up, or healthy, or whatever.
On a system that's large enough to benefit from a message bus, you have to
assume that failures are the norm and not the exception, and trying to build a
system that acts as if they weren't is likely to be more expensive than it's
worth. For a decent blog post that touches on the subject, see "Starbucks Does
Not Use Two-Phase Commit"[1].

[1]: https://www.enterpriseintegrationpatterns.com/ramblings/18_starbucks.html

~~~
hadsed
Nice blog post! Certainly puts things into perspective in terms of how one
should deal with errors, including sometimes just not caring about them much.

------
lugg
Whether to use pub/sub or rest is entirely domain dependent.

So, who cares? It's an implementation detail. Just do whatever makes sense.
The most sane systems I've seen make use of both.

> In an architecture driven by a message bus it allows more ubiquitous access
> to data.

Please stop mistaking design details for architecture. Lots of things allow
more ubiquitous access to data.

Talking about this stuff in this way is just going to wind up with you
replacing MySQL with Kafka and never actually solving any real problems with
your contexts/domains.

~~~
hinkley
I've been burned by both, but I think I still lean away from event-based
systems.

REST is more likely to result in a partially ordered sequence of actions from
one actor. A user does action A, which causes things B and C to happen. In
event-driven systems, C may happen before B, or when the user does multiple
things, B2 is now more likely to happen before C1.

IME, fan-out problems are much easier to diagnose in sequential (e.g.,
REST-based) systems, if only because event systems historically didn't come
with cause-effect auditing facilities or back pressure baked in.

~~~
zmmmmm
Isn't sequentiality just an implementation detail too?

We use Apache Camel to orchestrate messages and you just declaratively state
where you want sequential vs multicast / parallel behavior.

~~~
hinkley
I don't mean sequentiality of events, I mean sequential steps in the process:
all of the side effects of my button press happening in order. If you're
trying to say that's an implementation detail instead of a design element,
then I'll counter that by that yardstick, the difference between email and
Slack is just an implementation detail.

~~~
zmmmmm
I mean that is orthogonal to whether you are using messaging or not.

With API calls you default to doing them in order but you can certainly
multithread the calls and lose the deterministic order if you want. With
messaging you default to sending them off in parallel and not knowing which
are received / processed / completed before the others, but the non-default is
available there too. You have a message router that knows what has to be
ordered and what doesn't and manages the flow. In both cases the non-default
requires a bit more work but in both cases it's available and it just depends
what your most critical needs are.
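A toy version of that router idea, where messages sharing an ordering key run
strictly in sequence and everything else runs in parallel (all names made up):

```typescript
// Toy message router: per-key ordering, cross-key parallelism.
type Task = () => Promise<void>;

class KeyedRouter {
  private chains = new Map<string, Promise<void>>();

  dispatch(orderingKey: string, task: Task): Promise<void> {
    const prev = this.chains.get(orderingKey) ?? Promise.resolve();
    // Chain after the previous task for this key, even if it failed.
    const next = prev.then(task, task);
    this.chains.set(orderingKey, next);
    return next;
  }
}

// B and C from one button press stay ordered; another user's work interleaves.
const router = new KeyedRouter();
router.dispatch("user-1", async () => console.log("effect B"));
router.dispatch("user-1", async () => console.log("effect C, after B"));
router.dispatch("user-2", async () => console.log("independent, in parallel"));
```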

One thing though is that as systems scale up and get more complex, you want
less and less to have blocking API calls anywhere that you don't need them, as
those quickly become your bottlenecks and hard failure points.

------
scraegg
The first sentence is already triggering me.

No, it's not hard. Like most topics in software engineering, there are 50+
years of pretty successful development behind it, backed by science, backed by
software from the 70s, backed by engineers of all seniority levels.

The problem nowadays is that people don't want to learn the proven stuff.
People want to learn the new hip framework. Learning Kubernetes and React.js
is much more fun than learning how actual routing in TCP/IP works, right?

The problem is that something can only be hip for 1-5 years, but really stable
complex systems like a distributed network can only be developed in a time
frame of 5-10 years. Therefore most of the hip stuff is unfinished or already
flawed in its basic architecture. And usually if you switch from one hip thing
to another, you get the same McDonald's software menu, just with differently
flavored ketchup and marketing.

So if you feel something is hard it might be because you are not dealing with
the actual problem. For instance you might think about doing Kafka, and that's
fine. But be aware that email is shipping more messages than Kafka, and it's
been doing it for longer.

For instance, topologies: there is no "point-to-point". There's star, mesh,
bus, etc. See here:
[https://en.wikipedia.org/wiki/Network_topology](https://en.wikipedia.org/wiki/Network_topology)

If you don't know your topology it might be star or mesh. But it's still a
topology or a mix of multiple topologies.

And if you develop a distributed system you really need to think about how
your network should be structured. Then if you know which topology fits your
use case you can go and figure out the way this topology works and what the
drawbacks are. Star networks (like k8s), for instance, are easy to set up but
have a clear bottleneck in the center. A bus (like Kafka) is like a street: it
works fine until it is full, and there are sadly some activity patterns where
an overloaded bus will cascade, with the overload still visible weeks later
(even though the traffic has already been reduced) unless you reset it
completely from time to time.

It's not magic. You can look all of it up on Wikipedia if you know the
keywords. Also, there is no single "good" solution. It always depends on how
well the rest of your system integrates with the pros and cons of your
topology choice. And if you use multiple topologies in parallel you have a
complexity overhead at some point, which is why working in big corps is
usually so slow.

~~~
geezerjay
> The problem nowadays is that people don't want to learn the proven stuff.
> People want to learn the new hip framework. Learning Kubernetes and React.js
> is much more fun than learning how actual routing in TCP/IP works, right?

That's an awfully short-sighted comparison.

There are far more job offers for deploying and managing systems with
Kubernetes and for developing front-ends with React than there are for
developing TCP/IP infrastructure. It's fun to earn a living and enjoy the
privileges of a high salary, and the odds of getting that by studying solved
problems that nowadays just work are not that high.

~~~
scraegg
If my livelihood depends on doing bullshit then of course I will also do
bullshit. But that doesn't stop me from applying for other jobs or from
creating random HN accounts and complaining about it. ;-)

I've also found it's not bad everywhere, though. If you treat the smart people
around you well, here and there you will get an opportunity to actually change
something.

So what I also try to do is get people out of the mindset that they are
actually doing something reasonable when they do this bullshit circus to pay
the rent. When you are really frustrated, it's possible to spend a few hours
here and there learning the actual stuff instead of the new hip stuff, and
over time you will thereby be able to solve more and more problems with actual
solutions.

An example from my own life: At one point I really learned about Ansible,
Chef, Puppet, etc. Then I learned about the actual configurations, improved my
knowledge of ssh, bash scripting, etc., and in the end I wrote bash scripts
that replaced all the Ansible I had used. The results were more flexible, the
error messages more readable (thanks to set -x you could see what actually
went wrong), it was well under 1000 lines of code, and it was a lot of fun to
do some actual problem solving for a change.

------
ww520
Besides the distributed case, a message bus is invaluable in building
crash-proof applications. I've used a lightweight message bus within the app
itself for better crash recovery on long running tasks. E.g., you need to
generate lots of emails and send them out. I would create a small command
object for each email, queue it to the message bus, and let the message
handler handle email generation one by one. The lightweight command objects
can be queued up quickly and the UI can return to the user right away. The
slow email generation can run in the background. In the event of a system
crash, the message bus will be restored automatically and continue with the
remaining tasks in the queue.
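A minimal sketch of that pattern, with a JSON-lines journal standing in for
the message bus (the file name and format are illustrative):

```typescript
// In-process "message bus" for crash recovery: append each command to a
// journal before the UI returns, replay whatever is left on startup.
import { appendFileSync, existsSync, readFileSync, writeFileSync } from "node:fs";

interface EmailCommand {
  to: string;
  template: string;
}

const JOURNAL = "email-commands.jsonl";

// Fast path: durable enqueue, then the UI can return right away.
function enqueue(cmd: EmailCommand): void {
  appendFileSync(JOURNAL, JSON.stringify(cmd) + "\n");
}

// Slow path, also run at startup after a crash: work through the journal.
async function drain(send: (c: EmailCommand) => Promise<void>): Promise<void> {
  if (!existsSync(JOURNAL)) return;
  const pending: EmailCommand[] = readFileSync(JOURNAL, "utf8")
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line));
  while (pending.length > 0) {
    await send(pending[0]); // generate and send one email at a time
    pending.shift();
    // Rewrite the journal minus completed commands so a crash resumes here.
    writeFileSync(JOURNAL, pending.map((c) => JSON.stringify(c) + "\n").join(""));
  }
}
```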

~~~
sk5t
What if your process to generate the initial messages crashes halfway through?

~~~
ww520
You return the error to the user.

Crash handling is a matter of managing user expectations. If the user hits a
button and the UI shows the command as a success, the system had better ensure
the command will complete eventually. If the user hits a button and the UI
shows an error, well, the user can try again when the system is up.

~~~
sk5t
Ah, let me restate the question: suppose the user has clicked a button to
approve sending 5,000 emails; the server process puts 2,000 messages onto a
queue for later processing--but then, something occurs to interrupt adding the
remaining 3,000 messages to the queue.

Presuming the queue doesn't support any transactionality beyond per-message,
and optionally supposing a queue consumer has already started sending some of
these emails, how do you recover from the fault and help the user send the
rest of the email (without duplicating outbound email)?

~~~
mattcaldwell
You update the status of each message when it's acked. You show a live count
of messages sent / total messages on the screen where the user sent it.
2000/5000 sent... if those 3,000 never get sent, it will be obvious to the
user.

If you want the user to be able to try re-sending, you can provide that
functionality... you'd need to "cancel" the outgoing messages using a separate
queue, re-sending when each cancellation is acked.
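Sketched out, with the status store left abstract (an RDBMS table works) and
deterministic message ids added as an assumption, so that re-running the
enqueue step can't create duplicates:

```typescript
// Per-message status tracking for a batch send.
interface StatusStore {
  upsertQueued(id: string, recipient: string): Promise<void>;
  markSent(id: string): Promise<void>; // called when the ack comes back
  unsent(batchId: string): Promise<string[]>;
  counts(batchId: string): Promise<{ sent: number; total: number }>; // "2000/5000 sent"
}

async function enqueueBatch(
  store: StatusStore,
  publish: (id: string) => Promise<void>,
  batchId: string,
  recipients: string[],
): Promise<void> {
  for (let i = 0; i < recipients.length; i++) {
    const id = `${batchId}:${i}`; // deterministic, so a retry can't create new rows
    await store.upsertQueued(id, recipients[i]);
    await publish(id);
  }
}

// After a crash, publish only what never got acked. A message that was
// published but not yet acked may go out twice, so consumers still need
// to dedupe on the id.
async function resume(
  store: StatusStore,
  publish: (id: string) => Promise<void>,
  batchId: string,
): Promise<void> {
  for (const id of await store.unsent(batchId)) {
    await publish(id);
  }
}
```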

~~~
sk5t
It seems as though you might be describing a process with state stored in an
RDBMS or the like. Which, while a perfectly reasonable approach, is not much
like the initially-described case of firing a bunch of "send email to
foo@example.com"-type messages into a queue, subsequently to be drained and
acted upon by possibly-remote workers.

What I am trying to uncover here is how one might expect to use a queue, on
its own, to support a very-much-non-idempotent interruptible non-transactional
process.

Also, what does it mean to "cancel" an already-sent email?

~~~
floriol
I have yet to use it, but isn't Java EE's Batch API used for things like this?
It tracks the progress of long running tasks (e.g. the index of the current
mail) and can continue from the last sent index in case of an error.

------
Pamar
I _strongly_ suggest that anyone interested in this type of architecture read
[Enterprise Integration Patterns](https://www.goodreads.com/search?q=Enterprise+Integration+Patterns%3A+Designing%2C+Building%2C+and+Deploying+Messaging+Solutions)
\- it is a really well thought out catalog of patterns and solutions for
message-based systems.

------
phoe-krk
It's strange to me not to see even a single mention of any BEAM languages,
such as Erlang or Elixir. They are naturally distributed and have discovery,
networking, and messaging built into the virtual machine itself.

------
hestefisk
My main gripe with service buses is that they can be very hard to deploy and
test automatically, at least for traditional 'middleware' like WebSphere MQ,
WebLogic, etc. It potentially adds another monolith to your architecture,
which, whilst fancy, may not be required. Using ZeroMQ or similar
'lightweight' tech could be a better choice for small teams, as it is easy to
integrate into containers and to test.

------
kbouck
I'm curious to hear others' opinions on using a database as a message queue.
One issue I have with most message brokers is that you can not perform adhoc
queries or manipulations of the queue.

When you've got a backlog situation, it's nice to be able to answer questions
like: \- how many of these queued msgs are for/from customer/partner X.
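For what it's worth, the usual shape of this on Postgres (table and column
names invented here) keeps the queue queryable with plain SQL, while SKIP
LOCKED keeps concurrent workers safe:

```typescript
// A Postgres table as the queue.
import { Client } from "pg";

const client = new Client(); // connection settings via PG* env vars

async function popOne(work: (payload: unknown) => Promise<void>): Promise<void> {
  await client.query("BEGIN");
  try {
    const { rows } = await client.query(
      `SELECT id, payload FROM jobs
        WHERE status = 'queued'
        ORDER BY enqueued_at
        LIMIT 1
        FOR UPDATE SKIP LOCKED`,
    );
    if (rows.length > 0) {
      await work(rows[0].payload);
      await client.query("UPDATE jobs SET status = 'done' WHERE id = $1", [rows[0].id]);
    }
    await client.query("COMMIT"); // crash before this leaves the job 'queued'
  } catch (err) {
    await client.query("ROLLBACK"); // job stays visible to another worker
    throw err;
  }
}

async function main(): Promise<void> {
  await client.connect();
  await popOne(async (p) => console.log("processing", p));
  await client.end();
}
main().catch(console.error);

// And the ad hoc question from the comment is just a query:
//   SELECT count(*) FROM jobs WHERE status = 'queued' AND customer = 'X';
```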

------
ngrilly
It must be noted that service meshes bring some of the message bus advantages
to point-to-point architectures.

~~~
opportune
Somewhat, but it's a different use case. I'd say the main difference is that
message buses are better for non-urgent, hopefully-soon-but-definitely-
eventually workloads, while creating something like a message bus between
services within a mesh will impose more urgency.

Anecdotally, I've heard that extremely chatty services (like something that
approximates a message bus) are considered poor mesh design, but I don't
really understand why that is the case so long as the service architecture is
kept clean.

~~~
ngrilly
I agree. The advantages I was referring to are things like adding monitoring,
which is mentioned in the article.

Interesting point about chatty services :-)

------
carc
A couple of big reasons I don't like message buses (but I'm open to hearing
about why I'm wrong):

- All "requests" will succeed, even if malformed

- It couples producers/consumers to the bus (unless you put in the extra work to wrap it in a very simple service)

~~~
ww520
\- The consumer needs to validate incoming requests. Just like any input
source, the incoming data cannot be trusted until validated (see the sketch
below).

\- You need coupling somewhere anyway. Moving the coupling to the bus lets the
consumer and producer evolve more freely.
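For example, a hand-rolled type guard at the consumer edge (a schema library
would do the same job; the command shape here is made up):

```typescript
// Treat bus messages like any untrusted input: parse, then validate
// before acting.
interface SendEmailCmd {
  type: "SendEmail";
  to: string;
  subject: string;
}

function isSendEmailCmd(raw: unknown): raw is SendEmailCmd {
  if (typeof raw !== "object" || raw === null) return false;
  const m = raw as Record<string, unknown>;
  return (
    m.type === "SendEmail" &&
    typeof m.to === "string" &&
    m.to.includes("@") &&
    typeof m.subject === "string"
  );
}

function onMessage(body: string): void {
  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch {
    parsed = undefined;
  }
  if (!isSendEmailCmd(parsed)) {
    // The producer may be long gone, so don't drop this silently:
    // log it or dead-letter it for inspection.
    console.warn("rejecting malformed message:", body);
    return;
  }
  // parsed is now a well-typed SendEmailCmd.
  console.log(`sending "${parsed.subject}" to ${parsed.to}`);
}
```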

~~~
jeremyjh
> The consumer needs to validate incoming requests. Just like any input
> source, the incoming data cannot be trusted until validated.

That's correct. And the producer will be long gone by the time the consumer
attempts to validate that message and rejects it.

------
bvrmn
Message bus proponents never build large systems. The only sane way is to be
pretty specific about data flow, with a clear mental model shared between
developers and ops. A message bus hides producer-consumer relations, and with
multiple endpoints it's very hard to reason about the system as a whole.

~~~
redact207
Why do you say that? I'm someone who's worked on a few very large
message-based systems in finance for years (100s of devs working together in
multiple countries). I found that messaging and workflow orchestration were
the things that helped keep things sane.

~~~
PaulHoule
I've seen people succeed in a big way with message bus architectures and I
have seen them fail in a big way.

The master symptom I've seen is that people queue work and that work never
gets unqueued. It's amazing how many real-life systems fail with the same
symptom.

~~~
K2L8M11N2
Perhaps that could be solved by monitoring the number of pending requests and
alerting over a certain threshold?
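Something like this sketch, assuming RabbitMQ via amqplib, with the queue
name, threshold, and polling interval all arbitrary:

```typescript
// Poll the queue depth and alert past a threshold.
import amqp from "amqplib";

const THRESHOLD = 10_000;

async function checkBacklog(): Promise<void> {
  const conn = await amqp.connect("amqp://localhost");
  const ch = await conn.createChannel();
  const { messageCount } = await ch.checkQueue("work"); // current depth
  if (messageCount > THRESHOLD) {
    // In real life: page someone or emit a metric instead of logging.
    console.error(`ALERT: queue depth ${messageCount} > ${THRESHOLD}`);
  }
  await conn.close();
}

setInterval(() => checkBacklog().catch(console.error), 60_000);
```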

~~~
PaulHoule
That diagnoses the problem, sets the stage for solving it, but doesn't
actually solve it.

~~~
dajohnson89
Isn't that an "easy" problem to solve, in the sense that you just increase the
processing capacity of the consumer?

