
Microservice Antipatterns: The Queue Explosion - cpitman
http://cpitman.github.io/microservices/2018/03/25/microservice-antipattern-queue-explosion.html
======
airfreak
I take the opposite view: the microservice anti-pattern is the API explosion.
When you start chaining HTTP REST APIs into wide and/or deep call graphs, you
get all kinds of problems:

\- need complex circuit breaker patterns to deal with API call failures

\- overall uptime is affected by direct dependencies on a graph of synchronous
calls (you are at the mercy of the other APIs for your own uptime)

\- latency can increase as what could be done asynchronously is done
synchronously

\- developing and testing microservices becomes difficult because of so many
dependencies. You end up having to write your own service stubs or even run
the other services.

Microservices are about independently deployable units. When you isolate
services via message queues, you can create services with far fewer runtime
dependencies, which leads to greater decoupling, greater reliability, and
easier development/testing. You are also less likely to end up with a
distributed monolith.

~~~
cpitman
What I'd like to advocate for is a middle ground: using queues at key points
in the larger distributed system to reduce the scope of cascading failures and
HTTP/P2P messaging elsewhere.

If a request comes in from an external system, and that request is queued,
then the external system is already insulated from downtime _after_ the
queue. Additional latency in processing after the queue is also hidden by
that single queue. Once a single queue makes the overall system async, the
goal should be to increase overall throughput and efficiency, and I don't
believe using only queues accomplishes that.
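
The ingress-queue idea in miniature, using Python's in-process `queue.Queue` as a stand-in for a durable broker (a toy sketch, not a real deployment):

```python
import queue
import threading

inbox = queue.Queue()  # stands in for a durable broker queue

def handle_external_request(payload):
    """Ingress: enqueue and acknowledge immediately. The caller's latency
    and uptime no longer depend on the downstream services."""
    inbox.put(payload)
    return {"status": "accepted"}  # analogous to HTTP 202

def worker(results):
    """Downstream processor: drains the queue at its own pace; if it is
    down for a while, requests simply wait in the queue."""
    while True:
        item = inbox.get()
        if item is None:  # sentinel to stop the worker
            break
        results.append(item.upper())  # placeholder for real work
        inbox.task_done()
```

Everything downstream of `worker` can then talk synchronously without the external caller ever noticing a hiccup.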

I've seen this play out multiple times now, where the engineers decide they
are "doing messaging", and use it for everything, even what could be simple
request/response REST services. Tuning the overall system for performance
becomes complicated, and the messaging brokers have to be scaled out to handle
the multiplied load.

On a less serious note, the logistics game Factorio
([https://www.factorio.com/](https://www.factorio.com/)) really drives this
point home. In a distributed system/assembly line, really large buffers all
over the place make it harder to detect issues and tune throughput.

------
tybit
I find it interesting that I’ve never been bothered by these issues, but the
one I have been isn’t mentioned: difficulty in ensuring ordering. It seems
like any time I consider a queue, the lack of ordering guarantees forces
significant complexity into the applications instead.
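
The kind of complexity that gets pushed into the application looks roughly like this: a resequencer that buffers out-of-order messages until the gaps are filled (toy sketch; single stream, no timeout or gap-abandonment handling):

```python
class Resequencer:
    """Re-establishes ordering on the consumer side: buffers messages
    that arrive ahead of their turn and releases them only once all
    earlier sequence numbers have been seen."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}  # seq -> message, held until its turn

    def accept(self, seq, message):
        self.pending[seq] = message
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released
```

With a partitioned log like Kafka, per-partition ordering comes from the broker and this buffering logic disappears from the app.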

Not sure if I’m just buying into the Kafka hype, but it seems like queues are
the awkward middle ground between synchronous communication and message
streams for many use cases.

~~~
bonesss
I'm in the same boat -- queues sound nice until you look at the micro-picture,
at which point "spew it out" and "deal with it later" via message streams
seems like the sane middle ground across apps.

Hybrid cloud apps can do both, but in terms of complexity I agree: pushing
increasing complexity into apps on the periphery over time seems like a big
anti-pattern.

------
mindcrash
Kafka deserves some credit here, because it eliminates this kind of
architectural problem.

It completely changes how you think about service-based architectures. In
fact, with a properly designed architecture built around Kafka, APIs and a
plethora of queues magically disappear, because instead of being function-
and request/response-driven, the system becomes reactively data-driven,
sitting on a single bus.

This makes designing the actual services a lot easier, because instead of
thinking about a few hundred function names, all you need to think about are
the actual data flows (data comes in at a central point -> gets transformed ->
gets moved out at a central point).

As an added bonus it scales like crazy, to the point that hammering through a
terabyte of data looks like nothing.

A great talk from Ben Stopford (Confluent employee, so mostly Kafka centric)
about this subject can be found here:

[https://www.slideshare.net/benstopford/devoxx-london-2017-re...](https://www.slideshare.net/benstopford/devoxx-london-2017-rethinking-services-withstatefulstreams)

~~~
cpitman
I agree: when Kafka is (or can be made) a good fit, it simplifies a lot of
this. Thanks for the presentation link. I'm seeing companies trying to use
Kafka for "request style" services, i.e. as just another enterprise messaging
system.

------
philwelch
Queues come from the following chain of reasoning:

BOB: I need to do X. I'm going to call the X-doing-service.

ALICE: What if the X-doing-service is down?

BOB: I'll use retries.

ALICE: What if the service is down for longer than your request timeout
threshold?

BOB: I'll use a queue.

I don't really see a way out of this, TBH. Maintaining queues is easier than
being a hardass about service uptimes. It also cuts down on request latency,
because "I promise I'm going to do X" is faster than "I just did X and it
totally worked".
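
Bob's chain of reasoning, end to end, looks something like this (a toy illustration; `call_x` and `retry_later` are stand-ins for the real service client and a durable queue):

```python
import time
from collections import deque

retry_later = deque()  # stands in for a durable queue

def do_x_or_enqueue(call_x, payload, attempts=3, backoff=0.01):
    """Try the X-doing service a few times; if it stays down longer
    than we are willing to wait, park the request on a queue instead
    of failing the caller."""
    for i in range(attempts):
        try:
            return call_x(payload)
        except ConnectionError:
            time.sleep(backoff * (2 ** i))  # exponential backoff
    retry_later.append(payload)
    return "queued"  # "I promise I'm going to do X"
```

The "queued" branch is exactly the latency win: acknowledging the promise is cheaper than waiting for the work.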

Sure, sometimes you can just gracefully degrade and just not do X if you can't
do X right now. Or, at least, it's not the end of the world. And sometimes you
need X done right now and you can't wait. But message queues are a basic
building block for a reason, and it's not an antipattern to use a lot of them
any more than it's an antipattern to use a lot of pre-stressed concrete and
rebar to build an office building.

In terms of how you build your message queues themselves, this is probably one
of the best selling points for cloud services.

------
xyz-x
For simplicity, let's take RMQ

> Multiplying the load on message brokers, often 2-5x, requiring extensive and
> costly scale out of the messaging system

Scaling it out is no problem: set `replicas: 5` in the StatefulSet manifest,
`kubectl apply -f statefulset.yaml`, done.

Also, message brokers almost never come under load because they are primarily
IO bound, not CPU bound. If your broker comes under heavy CPU usage, you're
almost certainly doing something wrong.

> No easy flow-control, fast services can produce and consume messages quickly
> enough to cause side effects for the slow services that are the bottlenecks
> and dictate overall system performance. This can cause large swings in
> performance as queues fill up.

In an RPC-based architecture, flow control would be even worse: backpressure
cascades up to the caller and causes timeouts and unbounded ephemeral queues
that end up looking like a hung system. Even worse, the thundering herd
problem can leave you unable to get back up.

> Increased latency for end-to-end processing because of repeated round trips
> to the messaging broker, which must persist messages to disk each time

No, RMQ does not necessarily persist the message each time; you can get a
publisher confirm back if all receivers have claimed the message.

Furthermore, while send-then-ACK latency can be up to the disk latency, you're
by no means limited to sending one message at a time.
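
To make the batching point concrete, here is a minimal sketch (plain Python, not the actual RMQ client API; `send_batch` is a stand-in for whatever bulk-publish call your client library provides):

```python
class BatchingPublisher:
    """Buffers messages and publishes them in one round trip, so the
    broker's per-confirm cost (including any fsync) is amortized over
    the whole batch instead of paid per message."""

    def __init__(self, send_batch, max_batch=100):
        self.send_batch = send_batch
        self.max_batch = max_batch
        self.buffer = []

    def publish(self, message):
        self.buffer.append(message)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        """Send whatever is buffered, then start a fresh buffer."""
        if self.buffer:
            self.send_batch(self.buffer)
            self.buffer = []
```

With confirms per batch rather than per message, throughput is bounded by bandwidth and broker CPU rather than by one disk round trip per publish.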

> Increased difficulty isolating performance issues to an individual service
> vs the messaging brokers vs side-effects from other services impacting the
> messaging brokers

You almost never have performance issues affecting the brokers.

Suppose you do; then it's most often because you've let your queues grow too
long. Where would you be in an HTTP world? You'd be having an outage.

\---

> HTTP-based services (REST, gRPC, etc) offer low latency and high throughput
> compared to queues. They also have strictly lower resource requirements
> (network, disk, compute, memory) than a message queue solution.

No, not strictly lower. I can imagine a scenario where you have >80%
utilisation on your HTTP server on average; now you'll have growing ephemeral
TCP request queues once in a while, and you have to bump the node specs.
Consider the alternative: a node that has no latency requirement can be
utilised to a higher degree, 90-100%. Always at 100%? Then scale out based on
queue size, and scale down again later.
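
The scale-out-on-queue-size idea can be sketched as a sizing function (illustrative numbers only; in practice this would feed an autoscaler driven by a queue-depth metric):

```python
import math

def desired_replicas(queue_length, msgs_per_replica_per_sec,
                     target_drain_seconds, min_replicas=1, max_replicas=20):
    """Size the consumer pool so the current backlog drains within a
    target window, clamped to sane bounds."""
    needed = queue_length / (msgs_per_replica_per_sec * target_drain_seconds)
    return max(min_replicas, min(max_replicas, math.ceil(needed)))
```

Note how the signal here is the backlog itself, which an HTTP service doesn't expose: its "queue" is hidden in TCP accept buffers and timed-out requests.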

~~~
lobster_johnson
RabbitMQ isn't a great study in horizontal scalability and fault tolerance.
Its clustering is garbage.

The _only_ safe way to run a multinode RabbitMQ setup is to have it stop (the
pause_minority setting) when it detects a network partition. Any other mode is
lossy by definition. There's no safe HA mode that isn't lossy.

(RabbitMQ also wants a bunch of CPU and RAM. On Kubernetes you'll want to
dedicate entire nodes to it to avoid problems.)

As an aside, I think the message queue model used by RabbitMQ has far outlived
its usefulness. The big problem with this data model is that data disappears;
there's no replay, and zero visibility into processed data. What should be a
database that can always be queried at arbitrary points is instead a conveyor
belt that is always moving forward, discarding its history. (And,
problematically, NACK is broken with respect to ordering; you can only discard
or put back at the end of the queue, not ask to retry.) Kafka and NATS
Streaming get this right.

~~~
xyz-x
I used RMQ because I just finished a smaller study of its properties when
running it on Kube.

In fact, its clustering on Kubernetes is really well done. It's very easy to
get up and running and to use from an operations standpoint.

`pause_minority` yes, so there's a setting for this. No safe HA mode? Could
you explain that further? It is not entailed by your previous statements.

I did a load test in the above-mentioned study: each node in the 5-node
cluster hovered around 300 mCPU and 250 MiB with a throughput of 10 MiB/s of
messages, in publisher-confirms + consumer auto-ack mode with 100 messages in
flight. That was 7x our needs, so I left it there for now.

A model in computing does not outlive its usefulness because you say so:

\- no replay: not needed; we're contrasting with RPC here, which also does not
have replay

\- zero visibility: this is false, as there are lots of metrics and libraries
focused on AMQP

\- should be a database: no, it should be a networked queue with atomic
broadcast in the happy case (unhappy case: see the FLP result)

\- NACK does not do what you think it does: it puts the message as close to
the head of the queue as possible, and even putting it at the back of the
queue like RMQ did around v1 is a valid resolution, because you generally
don't get ordering guarantees nor exactly-once in a distributed queue (Kafka
does not give you exactly-once; see the atomic producers RFC on their wiki,
and the consumers are not transactional, so they don't consume exactly-once
either)

So, I guess someone was wrong on the internet. ;)

Notes:

\- You have to use publisher confirms

\- You have to use an HA mode of at least exactly=N/2+1 nodes on every queue

\- You have to use acking consumers

\- You will get duplicate messages

\- You will have to implement retries
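
Given the "you will get duplicate messages" point, the consumer side typically needs to be idempotent. A minimal sketch (the dedup store is an in-memory set purely for illustration; production would need a persistent store):

```python
class IdempotentConsumer:
    """With at-least-once delivery (publisher confirms + consumer acks),
    duplicates are inevitable; deduplicate by message id before
    applying side effects."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()  # message ids already processed

    def on_message(self, msg_id, body):
        if msg_id in self.seen:
            return False  # duplicate: ack it, but do nothing
        self.handler(body)
        self.seen.add(msg_id)
        return True
```

The same pattern covers redeliveries caused by the retries in the last bullet, since a retried message carries the same id.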

------
mikhailfranco
The answer to proliferation of microservice queues

is to make lightweight microqueues proliferate.

Erlang is not an antipattern.

------
kujaomega
To solve this issue from the article:

\- No easy flow-control, fast services can produce and consume messages
quickly enough to cause side effects for the slow services that are the
bottlenecks and dictate overall system performance. This can cause large
swings in performance as queues fill up.

What do you think about using a Load balancer -> Queue -> Service? Has anyone
tried it?

