Hacker News new | comments | show | ask | jobs | submit login
Microservice Antipatterns: The Queue Explosion (cpitman.github.io)
58 points by cpitman 4 months ago | hide | past | web | favorite | 13 comments



I take the opposite view, the microservice anti-pattern is the API explosion. When you start chaining together HTTP REST APIs into wide and/or deep call graphs then you get all kinds of problems:

- need complex circuit breaker patterns to deal with API call failures

- overall uptime is affected by direct dependencies on a graph of synchronous calls (you are at the mercy of the other APIs for your own uptime)

- latency can increase as what could be done asynchronously is done synchronously

- developing and testing microservices becomes difficult because of so many dependencies. You start having to develop your own service stubs or evening running the other services.

Microservices are about independently deployable units. When you isolate services via message queues you can create services that have many fewer runtime dependencies which can lead to greater decoupling, greater reliability and easier development/testing. You are less likely to end up with a distributed monolith also.


What I'd like to advocate for is a middle ground: using queues at key points in the larger distributed system to reduce the scope of cascading failures and HTTP/P2P messaging elsewhere.

If a request comes in from an external system, and that request is queued, then the external system is already not directly impacted by downtime after the queue. Additional latency in processing after the queue is also hidden by the single queue. After a single queue makes the overall system async the goal should be to increase overall throughput and efficiency, and I don't believe only using queues accomplishes that.

I've seen this play out multiple times now, where the engineers decide they are "doing messaging", and use it for everything, even what could be simple request/response REST services. Tuning the overall system for performance becomes complicated, and the messaging brokers have to be scaled out to handle the multiplied load.

On a less serious note, the logistics game Factorio (https://www.factorio.com/) really drives this point home. In a distributed system/assembly line, really large buffers all over the place makes it harder to detect issues and tune throughput.


Of course queue push and pull operations are often RPC style APIs, and TCP connections are really just queues (ok, streams) of bytes. It's async and sync turtles all the way down.


I find it interesting that I’ve never been bothered by these issues but the one I have isn’t mentioned. Difficulty in ensuring ordering, it seems like any time I consider a queue the lack of ordering guarantees forces significant complexity into the applications instead.

Not sure if I’m just buying into the Kafka hype but it seems like queues are the awkward middle ground between synchronous communications and message steams for many use cases.


I'm in the same boat -- queues sound nice until the micro-picture at which point "spew it out" and "deal with it later" via message streams seems like the sane middle-ground across apps.

Hybrid cloud apps can do both, but in terms of complexity I agree: pushing increasing complexity into apps on the periphery over time seems like a big anti-pattern.


Some messaging systems have features to make ordering easier. For example, ActiveMQ has Message Groups (http://activemq.apache.org/message-groups.html) which guarantee a single consumer processes message from the same group and that it receives them in order.


Kafka deserves some credits here, because it destroys these kind of architectural problems however.

It completely changes how to think about service based architectures. In fact, with a properly designed architecture sitting around Kafka APIs and a plethora of queues magically disappear, because instead of function and request-response driven the system becomes reactively data driven, sitting around a single bus.

This makes designing actual services a lot easier because instead of thinking about a few hundred function names all you need to think about are actual data flows (data comes in at a central point -> gets transformed -> gets moved out at a central point).

As a added bonus it scales like crazy, to the point that hammering through a terabyte of data looks like nothing.

A great talk from Ben Stopford (Confluent employee, so mostly Kafka centric) about this subject can be found here:

https://www.slideshare.net/benstopford/devoxx-london-2017-re...


I agree, when Kafka is (or can be made) a good fit it simplifies a lot of this. Thanks for the presentation link. I'm seeing companies trying to use Kafka for "Request style" services, ie as just another enterprise messaging system.


Queues come from the following chain of reasoning:

BOB: I need to do X. I'm going to call the X-doing-service.

ALICE: What if the X-doing-service is down?

BOB: I'll use retries.

ALICE: What if the service is down for longer than your request timeout threshold?

BOB: I'll use a queue.

I don't really see a way out of this, TBH. Maintaining queues is easier than being a hardass about service uptimes. It also cuts down on request latency, because "I promise I'm going to do X" is faster than "I just did X and it totally worked".

Sure, sometimes you can just gracefully degrade and just not do X if you can't do X right now. Or, at least, it's not the end of the world. And sometimes you need X done right now and you can't wait. But message queues are a basic building block for a reason, and it's not an antipattern to use a lot of them any more than it's an antipattern to use a lot of pre-stressed concrete and rebar to build an office building.

In terms of how you build your message queues themselves, this is probably one of the best selling points for cloud services.


The answer to proliferation of microservice queues

is to make lightweight microqueues proliferate.

Erlang is not an antipattern.


For simplicity, let's take RMQ

> Multiplying the load on message brokers, often 2-5x, requiring extensive and costly scale out of the messaging system

Scaling it out is no problem. kubectl apply -f statefulset.yaml replicas=5 — done.

Also, message brokers almost never come under load because they are primarily IO bound, not CPU bound. If your broker comes under heavy CPU usage, you're almost certainly doing something wrong.

> No easy flow-control, fast services can produce and consume messages quickly enough to cause side effects for the slow services that are the bottlenecks and dictate overall system performance. This can cause large swings in performance as queues fill up.

In a RPC based architecture flow control would be even worse, cascade up to the caller and cause timeouts and unbounded ephemeral queues that end up looking like a hung system. Worse even, the thundering herd problem will make you unable to get back up.

> Increased latency for end-to-end processing because of repeated round trips to the messaging broker, which must persist messages to disk each time

No, RMQ does not necessarily persist the message each time; you can get a publisher confirm back if all receivers have claimed the message.

Furthermore, while send-then-ACK latency can be up to the disk latency, you're by no means limited to sending one message at a time.

> Increased difficulty isolating performance issues to an individual service vs the messaging brokers vs side-effects from other services impacting the messaging brokers

You almost never have performance issues affecting the brokers.

Suppose you do; then it's most often because you've let your queues grow too long. Where would you be in a HTTP world? You'd be having an outage.

---

> HTTP-based services (REST, gRPC, etc) offer low latency and high throughput compared to queues. They also have strictly lower resource requirements (network, disk, compute, memory) than a message queue solution.

No, not strictly lower; I can imagine a scenario where you have >80% utilisation on your HTTP server on averge; now you'll have growing ephemeral TCP request queues once in a while; then you have to bump the node stats. Consider the alternative; a node that has no latency requirement; it can be utilised to a higher degree; 90-100%. Always at 100%? Then, scale out based on queue size, then scale down again.


RabbitMQ isn't a great study in horizontal scalability and fault tolerance. Its clustering is garbage.

The only safe way to run a multinode RabbitMQ setup is to have it stop (the pause_minority setting) when it detects a network partition. Any other mode is lossy by definition. There's no safe HA mode that isn't lossy.

(RabbitMQ also wants a bunch of CPU and RAM. On Kubernetes you'll want to dedicate entire nodes to it to avoid problems.)

As an aside, I think the message queue model used by RabbitMQ has far outlived its usefulness. The big problem with this data model is that data disappears; there's no replay, and zero visibility into processed data; what should be a database that can always be queried at arbitrary points is instead of a conveyor belt that is always moving forward and discarding its history. (And, problematically, NACK is broken with respect to ordering; you can only discard or put back at the end of the queue, not ask to retry.) Kafka and NATS Streaming get this right.


I used RMQ because I just finished a smaller study of its properties when running it on Kube.

In fact, its clustering on Kubernetes is really well done. It's very easy to get up and running and to use from an operations standpoint.

`pause_minority` yes, so there's a setting for this. No safe HA mode? Could you explain that further? It is not entailed by your previous statements.

I did a load test in the above mentioned study and each node in the 5-node cluster hovered around 300 mCPU and 250 MiB with a throughput of 10 MiB/s of messages in publisher-confirms + consumer auto-ack + 100 inflight —mode. That was 7x our needs so I left it there for now.

A model in computing does not outlive its usefulness because you say so:

- no replay: not needed; we're contrasting with RPC here which also does not have replay - zero visibility: this is false as there are lots of metrics and libraries focused on AMQP - should be a database: no, it should be a networked queue with atomic broadcast in the happy case (unhappy case: see FLP result) - NACK does not do what you think it does; it puts the message as close to the head of the queue as possible and even putting at the back of the queue like RMQ did around v1 is a valid resolution, because you don't get ordering guarantees nor exactly-once in a distributed queue, generally (Kafka does not give you exactly once [see atomic producers RFC on their Wiki and the consumers are not transactional so they don't consume exactly-once either])

So, I guess someone was wrong on the internet. ;)

Notes:

- You have to use publisher confirms - You have to use a HA-mode of at least exactly=N/2+1 nodes on every queue - You have to use acking consumers - You will get duplicate messages - You will have to implement retries




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: