Message brokers like RabbitMQ give you a lot of benefit and introduce only a small number of failure modes. You can obviate tricky service discovery boot orders, do RPC without caring about whether your server has been restarted, and of course you get a good implementation of pub-sub too. If you stay away from poorly considered high availability schemes I am absolutely fine with recommending it for intra-service communication.
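For what it's worth, the pub-sub case really is only a few lines with a fanout exchange. A minimal sketch using Python and pika, assuming a local RabbitMQ with default credentials; the exchange name and payload are made up:

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = conn.channel()

# Pub-sub: a fanout exchange copies every message to each bound queue.
channel.exchange_declare(exchange="events", exchange_type="fanout", durable=True)

# Subscriber side: each service binds its own throwaway queue to the exchange.
queue = channel.queue_declare(queue="", exclusive=True).method.queue
channel.queue_bind(exchange="events", queue=queue)

# Publisher side: no routing key, no knowledge of who is listening.
channel.basic_publish(exchange="events", routing_key="", body=b'{"type": "user.created"}')

channel.basic_consume(queue=queue,
                      on_message_callback=lambda ch, method, props, body: print(body),
                      auto_ack=True)
channel.start_consuming()  # blocks; Ctrl-C to stop
```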
Maybe I'm being too pessimistic. I use microservices, but without significant engineering rigor, I think it's a recipe for disaster.
Well, make sure your service is not dependent on those services then. Use signed tokens so you only rely on the authenticator for logins; everything afterwards can work it out by itself.
logging: keep your stuff in a queue or logfile until the log service is back up.
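To make the signed-token point concrete: a small sketch with PyJWT, assuming the authenticator signs with RS256 and ships its public key to each service (the key path is made up):

```python
import jwt  # PyJWT

# Hypothetical public key distributed alongside the service.
PUBLIC_KEY = open("auth_public_key.pem").read()

def verify_request(token: str) -> dict:
    # Verified locally against the authenticator's public key;
    # no call to the auth service is needed after login.
    return jwt.decode(token, PUBLIC_KEY, algorithms=["RS256"])
```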
* Durable messages cause a slow-down when publishing, but not with Kafka because it uses the Linux kernel's page cache to write.
* Durable messages are slow to read if they have to come from disk. Kafka is optimized to load sequential blocks into memory and push them out through a socket with very few copies. This makes for very fast reads.
* Slow consumers can bog down the broker. Kafka stores all messages and keeps them on the topic for a configured retention period. No back-pressure from slow consumers.
* Disconnected but subscribed consumers cause messages to back up on disk and eventually clog the broker. Kafka stores all messages; there's no clogging or backup, that's just how it works.
* Brokers must track whether a consumer actually received the message; failures can cause missed messages or clogs. Kafka clients may read from a given point on the topic forward. If they fail during a read, they just back up and read again (sketched just below). The messages will be there for hours/days/weeks as configured.
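A quick illustration of that last bullet with kafka-python (just one client option; the topic and group names here are invented): the committed offset only advances after a message has actually been processed, so a crash mid-handling simply means re-reading from the last commit.

```python
from kafka import KafkaConsumer

def handle(payload: bytes) -> None:
    print("processing", payload)  # stand-in for real work

consumer = KafkaConsumer(
    "orders",                       # hypothetical topic
    bootstrap_servers="localhost:9092",
    group_id="billing-service",     # hypothetical consumer group
    enable_auto_commit=False,       # offsets only advance when we say so
    auto_offset_reset="earliest",
)

for msg in consumer:
    handle(msg.value)
    consumer.commit()  # crash before this line and the message is simply read again
```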
With a rock-steady durable messaging system based on commit logs, all of those problems that arose from attempting to avoid durable messaging go away.
Now you build microservices that emit and respond to events. Microservices that can "rehydrate" their state from private checkpoints and topic replays. And all of this with partition tolerance and simple mirroring.
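Rehydration can be as simple as replaying a (compacted) topic from the start and folding it into a map. A rough sketch with kafka-python, assuming a single-partition topic whose name I've made up:

```python
from kafka import KafkaConsumer, TopicPartition

tp = TopicPartition("account-balances", 0)   # hypothetical compacted topic, partition 0
consumer = KafkaConsumer(bootstrap_servers="localhost:9092", enable_auto_commit=False)
consumer.assign([tp])
consumer.seek_to_beginning(tp)

state = {}
end = consumer.end_offsets([tp])[tp]          # replay up to "now"
while consumer.position(tp) < end:
    for msg in consumer.poll(timeout_ms=1000).get(tp, []):
        state[msg.key] = msg.value            # last write wins, as with log compaction
# `state` is now the service's rebuilt in-memory view.
```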
Although it isn't usually necessary, if you want, you can make all of that elastic with Mesos, too.
RPC is point-to-point communication based on requests and replies. Kafka's strictly sequential requirement would be terrible for this because a single slow request would hold up its entire partition — no other consumer would be able to process the pending upstream events. Kafka is also persistent (does it have in-memory queues?), which is pointless for RPC.
I also quite like NATS, but Kafka provides similar performance characteristics in the general case (generally higher latency being the exception, though I have never encountered latency-sensitive processes where a message queue made sense in the first place) and means babysitting fewer systems.
A lot of people use HAProxy to route messages via HTTP to microservices — what's HAProxy if not a glorified message queue?
If you don't use an intermediary, meaning you do point-to-point HTTP between one microservice and another, you have to find a way to discover peers, perform health checks, load-balance between them, and so on. Which you can do (services like etcd and Consul exist for this) but using an intermediary such as NATS or Linkerd is also a great, possibly simpler, solution.
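For the intermediary route, request-reply over NATS is about as small as it gets. A sketch with the nats-py client, assuming a local server and an invented subject name; the point is that the caller addresses a subject, not a peer's IP:

```python
import asyncio
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")
    # Whoever is subscribed to "users.lookup" answers; the caller never
    # discovers, health-checks, or load-balances individual instances.
    reply = await nc.request("users.lookup", b'{"id": 42}', timeout=1.0)
    print(reply.data)
    await nc.close()

asyncio.run(main())
```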
If you're going to use RPC, host a REST endpoint and cache the living daylights out of it.
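That mostly means letting the headers do the work so any cache in front can absorb repeat reads. A toy Flask sketch (the route and max-age are arbitrary):

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/users/<int:user_id>")
def get_user(user_id):
    resp = jsonify({"id": user_id, "name": "example"})
    # nginx, Varnish, a CDN, or a client-side cache can now serve
    # repeat requests without hitting the service at all.
    resp.headers["Cache-Control"] = "public, max-age=60"
    return resp
```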
This is a fundamental problem with message bus architectures that advocates seem to ignore. It's even more problematic in a microservices architecture, where you (presumably) do domain-driven design in order to allow overall forward progress in the face of partial service/component/datastore outages. To throw all that out the window by coupling everything to a message bus... I still don't really understand.
If you define a single point of failure as any computer that takes the system down with it when it goes down, then RabbitMQ is not a single point of failure.
I am not sure how domain-driven design helps solve this.
Not necessarily. I don't know anything about Consul, but if you use something like ZooKeeper to discover services and write those into an HAProxy config, include a failsafe in whatever rewrites the HAProxy config on ZK updates, such that if the delta is "too large" it will refuse to rewrite the config.
Then if ZK becomes unavailable, what you lose is the ability to easily _make changes_ to what's in the service list. If your service instances come and go relatively infrequently, this might be fine while the ZK fire gets put out.
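The failsafe itself is tiny. A rough sketch of the "refuse on a too-large delta" check (the function name and threshold are mine, not from any particular tool):

```python
def should_rewrite_config(current: set, discovered: set,
                          max_delta_ratio: float = 0.5) -> bool:
    """Decide whether a ZK-driven backend list change is sane enough to apply."""
    if not current:
        return True  # first run, nothing to compare against
    delta = len(current.symmetric_difference(discovered))
    if delta / len(current) > max_delta_ratio:
        # Half the fleet "disappearing" at once usually means ZK or the
        # watcher is sick, not that the services actually died.
        return False
    return True
```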
If a service only talks to, say, instances of another service that are local, then every node in a cluster must contain ALL services. If the local service is overloaded but a remote one isn't, then the service you are talking to will be slow regardless of any front-end load balancing.
Every service must be able to talk to every other service, because if it cannot then you cannot load balance or deploy without it becoming an n^2 problem. So the question is how to implement it WITHOUT complicating the services.
Yes, every instance of every service needs to manage its communication paths to other services. That's N connections, rather than just 1 to the message bus. But this is a pretty well-understood problem. We have connection pools and circuit breakers and so on. And the risks are distributed, isolated, heterogeneous.
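For instance, the circuit-breaker half of "connection pools and circuit breakers" is maybe twenty lines if you don't need anything fancy. An illustrative sketch (thresholds and names are arbitrary):

```python
import time

class CircuitBreaker:
    """Trip after N consecutive failures, then fail fast until a cooldown passes."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one probe call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```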
For example, if you had three nodes, each running every service, with round-robin service discovery, overloading the system is just a matter of receiving a difficult request every third query. No matter how good your front-end load balancing is in a microservice system, if your intra-service requests are not load balanced you can have problems with overloading one node while others sit idle.
Unless you're deploying a Smartstack-esque LB strategy, where each physical node hosts its own load balancer, the details of what's deployed where are mostly irrelevant. You use your SD system to abstract away the physical dimension of the problem and address logical clusters of service instances. And you rely on your scheduler to distribute service instances evenly among nodes.