
IMO using Rabbit to connect "microservices" is an emerging anti-pattern. It often introduces a single point of failure to your "distributed" system and has problems with network partitions. If critical functionality stops working when Rabbit is down, you're probably doing it wrong.

Most real-world microservice projects (I've worked on several) already have many single points of failure. Often there is one service that needs to be up for the system to be up (such as the one that processes your customers' orders), or you don't realise that some VMs are sharing a physical disk, or everything depends on a single router somewhere you've never heard of that will one day run out of memory and drop TCP connections. This is not to mention the risks posed to availability by third-party tracking software that pushes changes that break web forms (#1 cause of long outages in my experience).

Message brokers like RabbitMQ give you a lot of benefit and introduce only a small number of failure modes. You can obviate tricky service discovery boot orders, do RPC without caring whether your server might be restarted, and of course you get a good implementation of pub-sub too. If you stay away from poorly considered high availability schemes, I am absolutely fine with recommending it for intra-service communication.

Great comment. One thing I don't like very much about microservices is simply that my service will often have some such hard dependency. For example, if the logging service is down, I lose all logs-based statistics. If the authentication microservice is down, I'm screwed. The parent comment made the excellent point about SPOF, but it seems like for microservices to work correctly, there will always be some SPOFs.

Maybe I'm being too pessimistic. I use microservices, but without significant engineering rigor, I think it's a recipe for disaster.

> If the authentication microservice is down, I'm screwed.

Well, make sure your service is not dependent on those services then. Use signed tokens so that you rely on the authenticator only for logins. Everything afterwards can work it out by itself.

Logging: keep your stuff in a queue or logfile until the log service is back up.
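
One way the queue-until-it-is-back idea could look, as a rough sketch. `BufferedLogger` and the `send` callable are hypothetical, and the sketch assumes the transport raises `ConnectionError` while the log service is unreachable.

```python
from collections import deque
from typing import Callable, Deque


class BufferedLogger:
    """Keeps log lines in memory while the log service is unreachable."""

    def __init__(self, send: Callable[[str], None], max_buffered: int = 10_000) -> None:
        self._send = send
        # A bounded deque drops the oldest lines once max_buffered is exceeded.
        self._pending: Deque[str] = deque(maxlen=max_buffered)

    def log(self, line: str) -> None:
        self._pending.append(line)
        self.flush()

    def flush(self) -> None:
        # Send in order; stop at the first failure and retry on the next call.
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return  # log service still down; keep the lines queued
            self._pending.popleft()
```

The buffer is in memory only, so lines are lost if the process itself dies; writing to a local logfile first, as the comment suggests, avoids that.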

Do you have a better solution? Genuinely interested since I'm about to implement webservice data change event notifications using probably RabbitMQ.

It does require some up-front work, but I've generally moved over to Kafka for most of my stuff. Rabbit has some nice aspects, but as noted its HA options start at "dire" and escalate smoothly to "oh dear god no". Throughput with Kafka is very, very good and, in my experience, it's remarkably difficult to kill. If you want fully ephemeral topics, you can write the broker data to a tmpfs. But I really like keeping data around, because I can replay my topics later both for DR and for debugging.

Kafka makes reliable durable messaging not just possible, but solves all the usual attendant problems you try to design around.

* Durable messages cause a slow-down when publishing, but not with Kafka because it uses the Linux kernel's page cache to write.

* Durable messages are slow to read if they have to come from disk. Kafka is optimized to load sequential blocks into memory and push them out through a socket with very few copies. This makes for very fast reads.

* Slow consumers can bog down the broker. Kafka stores all messages and keeps them on the topic for a time horizon. No back-pressure from slow consumers.

* Disconnected but subscribed consumers cause messages to back up on disk and eventually clog the broker. Kafka stores all messages. There's no clogging or backup, that's just how it works.

* Brokers must track whether a consumer actually received the message, failures can cause missed messages or clogs. Kafka clients may read from a given point on the topic forward. If they fail during a read, they just back up and read again. The messages will be there for hours/days/weeks as configured.
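
The model behind these bullets can be sketched with a toy in-memory log. This is not the Kafka client API: `TopicLog` and its methods are hypothetical, and retention, partitions and disk storage are all left out. The point is only that the log keeps every message while each consumer keeps its own offset.

```python
from typing import List, Tuple


class TopicLog:
    """Toy append-only log: the broker stores messages, consumers store offsets."""

    def __init__(self) -> None:
        self._messages: List[str] = []

    def append(self, message: str) -> int:
        """Append a message and return the offset it was stored at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read_from(self, offset: int, max_messages: int = 100) -> List[Tuple[int, str]]:
        """Return (offset, message) pairs starting at offset; nothing is removed."""
        end = min(offset + max_messages, len(self._messages))
        return [(i, self._messages[i]) for i in range(offset, end)]
```

Because reading does not consume anything, a slow consumer just has a smaller offset than a fast one, and a consumer that crashed mid-batch calls `read_from` again with its last committed offset.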

With a rock-steady durable messaging system based on commit logs, all of those problems that arose from attempting to avoid durable messaging go away.

Now you build microservices that emit and respond to events. Microservices that can "rehydrate" their state from private checkpoints and topic replays. And all of this with partition tolerance and simple mirroring.
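
As a rough illustration of "rehydrating" from a checkpoint plus a replay, assuming a hypothetical event shape of `(account, delta)` and a topic represented as a plain list:

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # (account, balance delta); hypothetical event shape


def apply_event(state: Dict[str, int], event: Event) -> None:
    account, delta = event
    state[account] = state.get(account, 0) + delta


def rehydrate(
    checkpoint_state: Dict[str, int],
    checkpoint_offset: int,
    topic: List[Event],
) -> Tuple[Dict[str, int], int]:
    """Rebuild state from a private checkpoint plus the events recorded after it.

    Returns the rebuilt state and the offset to store with the next checkpoint.
    """
    state = dict(checkpoint_state)  # copy, so the checkpoint itself is untouched
    for event in topic[checkpoint_offset:]:
        apply_event(state, event)
    return state, len(topic)
```

Starting from an empty state at offset 0 replays the whole topic and should give the same result as starting from any later checkpoint.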

Although it isn't usually necessary, if you want, you can make all of that elastic with Mesos, too.

Further reading:

[1] https://engineering.linkedin.com/distributed-systems/log-wha...

[2] http://mesos.apache.org/

Kafka saved my "life" a few times. We had TTL set to 168h and somebody pushed a change to production that silently ignored a type of message. We realized it a few days later. Luckily we could replay all of the messages after fixing the code. I know there are so many things wrong with this, yet Kafka is excellent at storing data for the medium term, and that can be a real blessing.

Kafka is very good for synchronizing data streams, but I can't imagine that it's suitable for RPC?

You don't (or shouldn't) do ordinary RPC over messaging anyway. Rather, use an Event Sourcing style: http://martinfowler.com/eaaDev/EventSourcing.html

I didn't read that web page carefully, but it seems to describe a transaction log, which is what Kafka excels at, but it has precious little to do with RPC.

RPC is point-to-point communication based on requests and replies. Kafka's strictly sequential requirement would be terrible for this because a single slow request would hold up its entire partition — no other consumer would be able to process the pending upstream events. Kafka is also persistent (does it have in-memory queues?), which is pointless for RPC.

Message queues, period, aren't particularly good for RPC. HTTP, as an online protocol, has such huge advantages that trying to replace it doesn't make sense to me. However, for comms between services where a consumer isn't waiting on the other end, a message queue is plenty appropriate--and Kafka is much, much better at that than RabbitMQ is in terms of throughput and data sanity.

I also quite like NATS, but Kafka provides similar performance characteristics in the general case (generally higher latency being the exception, though I have never encountered latency-sensitive processes where a message queue made sense in the first place) and means babysitting fewer systems.

To be clear, NATS is not a heavy messaging broker like RabbitMQ. For one, it's in-memory only, and queues only exist when there are consumers: If you publish and there are no subscribers, the message doesn't go anywhere. NATS is closer to ZeroMQ than RabbitMQ or Kafka.
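
The "no subscribers, no delivery" behaviour can be shown with a toy in-process bus. This is not the NATS client API; `FireAndForgetBus` is a hypothetical stand-in that only mirrors the semantics described above.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class FireAndForgetBus:
    """Toy bus with no storage: a message reaches current subscribers or nobody."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, subject: str, handler: Handler) -> None:
        self._subscribers[subject].append(handler)

    def publish(self, subject: str, message: str) -> int:
        """Deliver to whoever is subscribed right now; return how many received it."""
        handlers = self._subscribers.get(subject, [])
        for handler in handlers:
            handler(message)
        return len(handlers)  # 0 means the message was simply dropped
```

A message published before anyone subscribes is gone for good, which is the opposite of the keep-everything log model discussed for Kafka.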

A lot of people use HAProxy to route messages via HTTP to microservices — what's HAProxy if not a glorified message queue?

If you don't use an intermediary (meaning you do point-to-point HTTP between one microservice and another), you have to find a way to discover peers, perform health checks, load-balance between them, and so on. You can do that, and services like etcd and Consul exist for this, but using an intermediary such as NATS or Linkerd [1] is also a great, possibly simpler solution.

[1] https://linkerd.io/

The point I was making is that use of messaging in the first place implies an architectural style other than RPC.

If you're going to use RPC, host a REST endpoint and cache the living daylights out of it.

But if you read what I wrote elsewhere in this thread, NATS is not a traditional messaging broker, and is highly suitable for RPC.

See the comment above: the file https://github.com/LoyaltyNZ/alchemy-framework/blob/master/s... describes creating a cluster of RabbitMQ nodes that can auto-heal if an individual node goes down. We are running CoreOS, which will occasionally shut down a node and update. When this happens we see zero downtime and no error messages.

Even if you have a cluster of Rabbit nodes, the logical RabbitMQ cluster is still a single point of failure.

This is a fundamental problem with message bus architectures that advocates seem to ignore. It's even more problematic in a microservices architecture, where you (presumably) do domain-driven design in order to allow overall forward progress in the face of partial service/component/datastore outages. To throw all that out the window by coupling everything to a message bus... I still don't really understand.

In my experience failure occurs more frequently when you use more and more systems in more complex ways, e.g. using HAProxy for load balancing, with Consul for service discovery and Consul template for configuration. Each of these is a single point of failure, as they are all required for the system to work.

If you define a single point of failure as any one computer whose failure takes the system down with it, then RabbitMQ is not a single point of failure.

I am not sure how domain driven design helps solve this.

> HAProxy for load balancing, with Consul for service discovery and Consul template for configuration. Each of these is a single point of failure as they are all required for the system to work.

Not necessarily. I don't know anything about Consul, but if you use something like ZooKeeper to discover services and write those into an HAProxy config, include a failsafe in whatever writes the HAProxy config on ZK updates, such that if the delta is "too large" it will refuse to rewrite the config.
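
A sketch of such a failsafe, under the assumption that "too large" means the fraction of backends added or removed in one update. The function name and the 50% default are hypothetical; nothing here talks to ZooKeeper or HAProxy.

```python
from typing import Set


def should_rewrite_config(
    current: Set[str],
    discovered: Set[str],
    max_change_ratio: float = 0.5,
) -> bool:
    """Return True if the newly discovered backend list looks safe to apply.

    current and discovered are sets of "host:port" strings. The update is
    refused when too many entries appear or disappear at once, which is more
    likely a discovery outage than a real deployment.
    """
    if not current:
        return True  # nothing to protect yet; accept the first listing
    changed = len(current ^ discovered)  # entries added plus entries removed
    return changed / len(current) <= max_change_ratio
```

If discovery suddenly returns an empty list, the check fails and the proxy keeps routing to the last known set of backends.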

Then if ZK becomes unavailable, what you lose is the ability to easily _make changes_ to what's in the service list. If your service instances come and go relatively infrequently, this might be fine while the ZK fire gets put out.

Service instances in a continuous deployment environment are coming and going all day. If your service discovery and config breaks then everything stops; nothing can be developed or deployed until the broken stuff is fixed.

If SD or config mgmt dies, you can't deploy new stuff, but the existing services continue to work. When your message bus dies, everything dies. It's a fundamentally different failure.

So what is the solution then?

Depends on the application, but as one possible general solution: have microservices talk to each other directly when it makes sense to rather than communicating over RabbitMQ / central message bus out of laziness/convenience.

Doesn't this make the architecture a lot more complex though? I mean, if every service uses a common messaging broker, the total number of connections is O(n). Whereas if every microservice needs to talk to every other microservice, it's O(n^2). And analysis is much harder, unless all the services send their logs/metrics to a common logging/metrics system.
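
The arithmetic behind the O(n) versus O(n^2) claim, counting one link per pair of services in the worst case where everything talks to everything (function names are illustrative):

```python
def broker_links(services: int) -> int:
    """Each service holds one connection to the shared broker."""
    return services


def full_mesh_links(services: int) -> int:
    """Every pair of services holds a direct link: n * (n - 1) / 2."""
    return services * (services - 1) // 2


for n in (5, 10, 50):
    print(n, broker_links(n), full_mesh_links(n))
```

For 10 services that is 10 links against 45, and for 50 services 50 against 1225.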

You shouldn't need every single microservice to talk to every single other microservice. If so, you have a design problem.

If you don't have every service talk to every other service then it is either a deployment problem or a load balance problem.

If a service only talks to, say, local instances of another service, then every node in a cluster must contain ALL services. If the local service is overloaded but a remote service isn't, then the service you are talking to will be slow, regardless of any front-end load balancing.

Every service must be able to talk to every other service, because if it cannot then you cannot load balance or deploy without it being an n^2 problem. So the question is how to implement it WITHOUT complicating the services.

profilesvc may only need to talk to (depend on) usersvc and accountsvc to do its job. usersvc and accountsvc may only need to talk to (depend on) their data stores.

Yes, every instance of every service needs to manage its communication paths to other services. That's N connections, rather than just 1 to the message bus. But this is a pretty well-understood problem. We have connection pools and circuit breakers and so on. And the risks are distributed, isolated, heterogeneous.
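
For readers who have not met the pattern, a circuit breaker can be sketched roughly as follows. The class name, thresholds and injectable `clock` are assumptions made for illustration; production libraries add more states and metrics.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when calls are rejected without contacting the dependency."""


class CircuitBreaker:
    def __init__(
        self,
        max_failures: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unavailable")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Each client keeps one breaker per dependency, so a failing usersvc does not stop calls to accountsvc, which is the "distributed, isolated" risk profile described above.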

So you have a load balancer for every set of microservices? (Presumably, you have more than one instance running.)

Service Discovery often removes the need for load balancers. Let the clients discover where all the instances of Service X are and build the clients to handle failures to connect to individual instances.

Service discovery does not remove the need for load balancing.

For example, if you had three nodes, each with every service, and round-robin service discovery, then overloading the system is just a matter of receiving a difficult request every third query. No matter how good your front-end load balancing is in a microservice system, if your intra-service requests are not load balanced you can have problems with overloading one node while others are idle.
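
The three-node scenario can be simulated in a few lines. Request costs are made-up units of work, and the model is deliberately crude: load only accumulates and never drains as requests finish.

```python
from typing import List


def round_robin(costs: List[int], nodes: int) -> List[int]:
    """Assign request i to node i % nodes, ignoring how busy each node is."""
    load = [0] * nodes
    for i, cost in enumerate(costs):
        load[i % nodes] += cost
    return load


def least_loaded(costs: List[int], nodes: int) -> List[int]:
    """Assign each request to whichever node has the least accumulated work."""
    load = [0] * nodes
    for cost in costs:
        target = load.index(min(load))
        load[target] += cost
    return load


# Every third request is expensive (cost 10); the rest are cheap (cost 1).
requests = [1, 1, 10] * 4
print(round_robin(requests, nodes=3))
print(least_loaded(requests, nodes=3))
```

With round robin, every expensive request lands on the third node, which ends up with 40 units of work while the other two have 4 each. The load-aware policy spreads the same 48 units more evenly.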

Service discovery can remove the need for load balancing, if you move load balancing logic into the client services. Have them architect their own load balancing over available instances of their dependent services.
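
A minimal sketch of moving that logic into the client, assuming the instance list comes from some service discovery lookup and that an unreachable instance raises `ConnectionError`. `call_with_failover` and its parameters are hypothetical names.

```python
import random
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def call_with_failover(
    instances: List[str],
    request: Callable[[str], T],
    rng: Optional[random.Random] = None,
) -> T:
    """Try discovered instances in random order until one answers.

    instances: addresses returned by service discovery (assumed input).
    request:   performs the call against one address and may raise ConnectionError.
    """
    rng = rng or random.Random()
    candidates = list(instances)
    rng.shuffle(candidates)  # spread load without any central balancer
    last_error: Optional[Exception] = None
    for address in candidates:
        try:
            return request(address)
        except ConnectionError as error:
            last_error = error  # this instance is down; try the next one
    raise ConnectionError("no healthy instance available") from last_error
```

Random selection is the simplest policy; it avoids the fixed rotation problem only statistically and does not account for how busy each instance is.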

Unless you're deploying a Smartstack-esque LB strategy, where each physical node hosts its own load balancer, then the details of what's deployed where are mostly irrelevant. You use your SD system to abstract away the physical dimension of the problem, and address logical clusters of service instances. And you rely on your scheduler to distribute service instances evenly among nodes.

If it goes down the Rabbit hole, you've engineered it wrong.
