
Ask HN: What should we consider when moving to a service mesh architecture? - ciguy
I work for a company that's been looking at moving to a service mesh style architecture. They run a few dozen microservices which are currently Dockerized and deployed on an ECS cluster.

We are considering moving to Kubernetes, as ECS has a lot of limitations in the deployment/resiliency area that come more or less standard with Kube. As part of this move we are looking at using a service mesh to enable easier cross-region routing on AWS and more dynamic load distribution across our various clusters.

We were looking into Linkerd as a possible solution, but just noticed that Consul's latest release has a service mesh feature. So for those who've run a service mesh before, what sort of things should I be considering as we evaluate our options? Pros/cons of specific tools, or of service meshes in general, are welcome.
======
rdli
IMO, you don't need a service mesh until you have a complex topology (i.e.,
services that call other services that call other services). A service mesh
is fundamentally designed so that, when you have a deep topology, you can
manage and mitigate failures better: limiting blast radius, containing
cascading failures, and improving mean time to respond.

However, there are a _ton_ of techniques you should consider before you go to
a service mesh:

* canary deployments / rolling updates / etc.

* monitoring via APM, distributed tracing, etc.

* health checking / auto scaling via Kubernetes or equivalent

(I actually wrote an article about this recently that will be published in
InfoQ, but it's under embargo. If you ping me, I can send you a not-for-
public-yet version.)

~~~
ciguy
We do have a complex topology by your definition, and we already do many of
the things you mentioned (though we have some limitations due to ECS being
less flexible with deployment options). So I'm glad to hear that a service
mesh may fit well with our current architecture.

I'd love to see the article, will email you.

------
bloomthrowaway
I wouldn't do it. You should split your services out of necessity, not as a
design pattern. Service meshes are also quite immature at the moment and
basically guaranteed to give you lots of pain. Inter-service communication
adds a ton of complexity and isn't a totally solved problem. The biggest issue
is distributed transactions.

Say you have a service that updates two others within the same call. A write
"fan out". What happens if one of the writes succeeds and the other fails?
Since they don't share the same transaction boundary you need to implement
rollbacks yourself, one for every single call you fan out to. For writes to
two services you need to write two rollback mechanisms to handle cases where
either call fails.

But it gets even worse. What if the service doing the fan-out dies when it's
only made one of the two calls, and that first call succeeds? Well, now you
have no way to rollback, so the next step is to persist "fanout status" to
durable storage as each call succeeds.
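The compensation-plus-journal approach described above can be sketched roughly like this. Everything here is a hypothetical stand-in: the services are in-memory fakes and the "journal" is just a list standing in for durable storage.

```python
# Sketch of a write fan-out with hand-written compensating rollbacks and a
# persisted "fanout status" journal. FakeService and the journal list are
# stand-ins for real RPC targets and durable storage.

class ServiceError(Exception):
    pass

class FakeService:
    def __init__(self, name, fail=False):
        self.name, self.fail, self.data = name, fail, []

    def write(self, payload):
        if self.fail:
            raise ServiceError(f"{self.name} write failed")
        self.data.append(payload)

    def compensate(self, payload):
        # Hand-written rollback: undo the earlier write.
        self.data.remove(payload)

journal = []  # stand-in for durable storage, so a crashed caller can recover

def fan_out_write(payload, services):
    """Write payload to each service; on failure, roll back prior writes."""
    completed = []
    try:
        for svc in services:
            svc.write(payload)
            journal.append((svc.name, "written"))  # record progress durably
            completed.append(svc)
    except ServiceError:
        # A later write failed: run the compensation for every service
        # whose write already succeeded.
        for svc in reversed(completed):
            svc.compensate(payload)
            journal.append((svc.name, "rolled_back"))
        raise
```

If the caller itself dies mid-fan-out, the journal is what would let a recovery process decide which compensations still need to run.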

But it gets worse still... What if you get a service loop? You write to one
service, which writes to another, and that third service asks your service
for something, all within the same call. WTF is the global view of data
consistency when this happens? There isn't one; it's undefined behavior.

If you go down this rabbit hole far enough you end up writing your own global
database to synchronize everything. The proper way to do "microservices" is to
share a datastore so it can handle the transactions and rollback for you. But
once you share a datastore, why not just build a monolith that can
horizontally scale easily?

I get that some companies are writing services with their own datastores, I've
worked on several of these projects myself. They have all been a data
consistency nightmare. Write your services like Google does, "monoliths"
running hundreds or thousands of instances all connected to the same database
that scales in the same way.

------
pramodbiligiri
This short book authored by George Miranda who works at Buoyant might be
helpful - [https://pages.buoyant.io/Oreilly-Service-Mesh-
Book.html](https://pages.buoyant.io/Oreilly-Service-Mesh-Book.html)

~~~
ciguy
Thanks I will check it out!

------
hackersac
We're definitely going to hear a lot more about service meshes in the coming
years. At one of my startups, we are in the process of implementing a service
mesh based on Istio (using Kubernetes and Docker). For complex transaction
handling, our goal is to implement it with an event-driven design
(RabbitMQ). So our architecture will end up being a hybrid of service mesh
(service orchestration) and event-driven design. We feel that is the way to
go to build a complex microservices-based architecture.
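A minimal in-process sketch of the event-driven half of that hybrid: the owning service commits its own work and publishes an event, and other services react asynchronously. Here `queue.Queue` is a stand-in for a RabbitMQ queue, and all names are made up for illustration.

```python
# In-process sketch of event-driven transaction handling. queue.Queue is a
# stand-in for a RabbitMQ queue; in a real system each consumer would be a
# separate service subscribed over AMQP.
import queue

events = queue.Queue()

def place_order(order_id):
    # The owning service commits locally, then publishes an event instead
    # of calling the other services synchronously.
    events.put({"type": "order_placed", "order_id": order_id})

def run_consumers(handlers):
    # Drain the queue, dispatching each event to its handlers. In RabbitMQ
    # a failed handler would nack and requeue; here it simply raises.
    processed = []
    while not events.empty():
        event = events.get()
        for handler in handlers.get(event["type"], []):
            handler(event)
        processed.append(event)
    return processed
```

The design choice this illustrates: the publisher never needs to know who consumes the event, which avoids the synchronous write fan-out problem discussed elsewhere in this thread.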

------
cbluth
As you do more and more research of the options, you will find that each
"service mesh solution" has its own distinct set of pros and cons.

What was the driving factor in choosing a service mesh?

~~~
ciguy
We aren't 100% sold on a service mesh yet, but it seems appropriate for our
needs: we are going multi-region soon and already have a microservice-style
architecture. At some point an abstraction layer is needed to handle routing
amongst services and, for example, provide failover to a different region if
one goes down.

I know there are other ways to do this, but they all seemed clunky in
comparison. Care to share any pros/cons of mesh solutions you've used?
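At its simplest, the failover behavior described above is a try-the-next-region loop. A hand-rolled sketch (region names and the `send` callable are hypothetical) of the logic a mesh would instead express as declarative routing rules:

```python
# Hand-rolled sketch of cross-region failover routing. A service mesh moves
# this retry/failover logic out of application code and into the proxy layer.

class RegionDown(Exception):
    pass

def call_with_failover(request, regions, send):
    """Try each region in preference order; fail over on error."""
    last_error = None
    for region in regions:
        try:
            return send(region, request)
        except RegionDown as err:
            last_error = err  # this region is out; try the next one
    raise last_error
```

Doing this in every client is exactly the kind of clunkiness a mesh is meant to absorb.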

~~~
rdli
I'd also add that the current service mesh solutions are all pretty immature.
Linkerd is the most mature, but Buoyant is splitting its resources between it
and Conduit. There's Cilium. There's Tigera/Calico (sort of). There's now
Consul Connect. And of course Istio.

------
williamallthing
I'm biased, but Linkerd has been in prod at companies around the world for
almost 2 years so it's about as mature as it gets. If you're moving to
Kubernetes, you're also a candidate for Conduit (next generation ultralight
mesh), which will become Linkerd 2.0 next month.

~~~
dankohn1
Does that explain why I'm not getting many hits on my job req demanding 5
years of production Kubernetes experience and 3 years of production Linkerd
experience?

[https://landscape.cncf.io/selected=kubernetes](https://landscape.cncf.io/selected=kubernetes)
[https://landscape.cncf.io/selected=linkerd](https://landscape.cncf.io/selected=linkerd)

