
The Service Mesh: What Engineers Need to Know - scott_s
https://servicemesh.io/
======
ajessup
One reason for the explosive interest in service mesh over the last 24 months
that this article glosses over is that it's deeply threatening to a range of
existing industries, which are now responding.

Most immediately to API gateways (e.g. Apigee, Kong, MuleSoft), which provide
similar value to a service mesh (centralized control and auditing of an
organization's East-West service traffic) but implement it differently. This is
why Kong, Apigee, nginx etc. are all shipping service mesh implementations now,
before their market gets snatched away from them.

Secondly to cloud providers, who hate seeing their customers deploy vendor-
agnostic middleware rather than use their proprietary APIs. None of them want
to get "Kubernetted" again. Hence Amazon's investment in the very Istio-like
"App Mesh" and Microsoft's (which already had "Service Fabric") attempt to do
an end run around Istio with the "Service Mesh Interface" spec. Both are part
of a strategy to ensure that if you are running a service mesh, the cloud
provider doesn't cede control.

Then there's a slew of monitoring vendors who aren't sure if the service mesh
is a threat (by providing a bunch of metrics "for free" out of the box) or an
opportunity to expand the footprint of their own tools by hooking into the
mesh rather than requiring folks to deploy their agents everywhere.

Finally there's the multi-billion dollar Software Defined Networking market,
whose players are seeing a lot of their long-term growth and value threatened
by these open source projects, which are solving at Layer 7 (and with much
more application context) what SDN had been solving at Layers 3-4. VMware NSX
already has a service mesh implementation (NSX-SM) built on Istio, and while I
have no idea what Nutanix et al are doing, I wouldn't be surprised if they
launched something soon.

It will be interesting to see where it all nets out. If Google pulls off the
same trick that they did with Kubernetes and creates a genuinely independent
project with clean integration points for a wide range of vendors then it
could become the open-source Switzerland we need. On the other hand it could
just as easily become a vendor-driven tire fire. In a year or so we'll know.

~~~
streetcat1
This is a good overview. However, I think the reason we see a lot of service
mesh variations is that the core tech, namely Envoy, contains all the "hard"
tech (the data plane), while creating a "service mesh" basically comes down to
building a management layer on top of it.

Another interesting note is that Google did NOT cede control over Istio to the
CNCF.

~~~
jacques_chester
> _Envoy contains all the "hard" tech (the data plane), while creating a
> "service mesh" basically comes down to building a management layer on top
> of it._

I'd argue this is backwards. Envoy has a fairly tightly defined boundary with
relatively strong guarantees of consistency given by hardware -- each instance
is running on a single machine, or in a single pod, with a focus on that
machine or pod.

The control plane is dealing with the nightmare of good ol' fashioned
distributed consistency, with a dollop of "update the kernel's routing tables
quickly but not _too_ quickly" to go with it. It's "simple" insofar as you
don't need to be good at lower-level memory efficiency and knowing shortcuts
that particular CPUs give you. But that's detail complexity. The control plane
faces dynamic complexity.
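
To make "quickly but not too quickly" concrete, here's a toy sketch of the
dynamic side (Go, purely illustrative; no real control plane works exactly
like this): a loop that coalesces a burst of endpoint churn into a single
debounced push to the proxies.

```go
// Toy debounce loop for a control plane: collapse bursts of endpoint
// updates into one push so proxies aren't reconfigured on every event.
// All names are illustrative, not taken from any real control plane.
package main

import (
	"fmt"
	"time"
)

func debouncePush(updates <-chan string, quiet time.Duration) {
	var pending []string
	var timer <-chan time.Time // nil channel blocks, so no spurious pushes
	for {
		select {
		case u, ok := <-updates:
			if !ok {
				return
			}
			pending = append(pending, u)
			timer = time.After(quiet) // restart the quiet-period timer
		case <-timer:
			fmt.Printf("pushing %d coalesced updates to proxies\n", len(pending))
			pending = nil
			timer = nil
		}
	}
}

func main() {
	ch := make(chan string)
	go debouncePush(ch, 200*time.Millisecond)
	for i := 0; i < 5; i++ {
		ch <- fmt.Sprintf("endpoint-change-%d", i) // a burst of churn
	}
	time.Sleep(500 * time.Millisecond) // one push fires, not five
	close(ch)
}
```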

------
cbsmith
I'm going to sound like an old man but...

What amuses me about this is that back in the day everyone thought the Mach
guys were crazy for thinking things like network routing and IPC services
should be implemented in user space... and others mocked the OSI model's 7
layers as overly complex (e.g. RFC 3439's "layering considered harmful").

Now we've moved all our network services onto a layer 7 protocol (HTTP), and
we've discovered we need to reinvent layers we skipped over on top of it.
We're doing it all in user space with comparatively new and untested
application logic, somehow forgetting that this can be done far more
efficiently and scalably with established and far more sophisticated
networking tools... if only we'd give up on this silly notion that everything
must go over HTTP.

~~~
taneq
Network-over-network is just another Inner Platform Effect.

~~~
jonahx
Wonderful term. I've been aware of the phenomenon for a while but not its
name.

Link for others:

[https://en.wikipedia.org/wiki/Inner-platform_effect](https://en.wikipedia.org/wiki/Inner-platform_effect)

~~~
Bombthecat
Oh wow! I had a customer who was really bad at it. All their software was
affected by it in one way or another.

Now I have a name for that at least :)

------
lycidas
At my company, we have been migrating all our apps to a Kubernetes + Istio
platform over the past couple of months, and my advice is this - don't use a
service mesh unless you really, really need to.

We initially chose Istio because it seemed to satisfy all our requirements
and more - mTLS, sidecar authz, etc. - but configuring it turned out to be a
huge pain. Things like crafting a non-superadmin pod security policy for it,
trying to upgrade versions via Helm, and trying to debug authz policies took
up a non-trivial amount of time. In the end, we got everything working, but I
probably wouldn't recommend it again.

It's funny that I was at KubeCon last week, and there was a startup whose
value prop was hassle-free Istio, and the Linkerd people stressed that they
were less complex than Istio.

~~~
snupples
I would go as far as to say that the vast majority of people don't need a
specialized service mesh. We unfortunately started with Linkerd, and it is
actually the cause of most of our reliability/troubleshooting issues. I don't
think lack of complexity is actually a good selling point for it, because it's
inherently more complex than not using a service mesh.

Istio may appear more complex, but that's because it has a superior
abstraction model and supports greater flexibility. We're beginning to migrate
from Linkerd to Istio at this point. I had the same initial frustrations with
PodSecurityPolicy (and Linkerd suffers from the same), but istio-cni solves
the superuser problem, and I believe even the Istio control plane is much more
locked down in the latest release.

However, if I had my way I would be telling every team they don't need a
service mesh. We don't have any particular service large and complex enough to
really take advantage of the features it's sold on.

~~~
bogomipz
>"We unfortunately started with Linkerd and it actually is the cause of most
reliability/troubleshooting issues"

Would you mind elaborating on what those Linkerd issues are/were that were
affecting reliability and troubleshooting?

~~~
williamallthing
I'm also curious about this (author here btw). The majority of people we see
coming to Linkerd today are coming _from_ Istio. They get the service mesh
value props, but want Linkerd's simplicity and lower operational overhead.
Would love some more details, especially GitHub issues.

------
tick_tock_tick
My favorite use of this kind of system is to manage TLS and ACLs. The service
itself can be extremely dumb and just expose a unix socket.
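
A minimal sketch of that split, assuming the mesh's sidecar terminates mTLS
and enforces ACLs before forwarding plaintext to the socket (the socket path
and handler are made up):

```go
// "Dumb" service: plain HTTP over a unix socket, no TLS, no auth.
// A sidecar proxy is assumed to terminate mTLS and check ACLs, then
// forward plaintext requests here.
package main

import (
	"fmt"
	"net"
	"net/http"
)

func main() {
	// Illustrative path; the sidecar would be configured to match it.
	ln, err := net.Listen("unix", "/var/run/app/app.sock")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello from behind the mesh")
	})
	// Identity and encryption live entirely in the proxy.
	panic(http.Serve(ln, nil))
}
```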

The 10:1 ratio of microservices to developers sounds like hell, though -
that's just too much to reason about.

------
tracer4201
Good article - I must admit I was only vaguely familiar with the concept, and
this read certainly gave me some new insights.

One meta call-out on the writing - I read and scrolled at least 30% through
the page on my iPhone before the author explained why I should care about a
service mesh, i.e. what problems it tries to simplify or solve.

It seems to me there are some strong use cases here, but it’s only worth your
while if you’re operating at sufficient scale.

For instance, if my team at some FAANG-scale company is responsible for
vending the library that provides TLS or log rotation or <insert cross
cutting/common use case here>, and it requires some non-trivial onboarding
and operational cost, migrating longer term to this kind of architecture,
where these concerns are handled out of the box, may be beneficial.

Still - it doesn’t mean the service owners are off the hook. They still need
to tune their retry logic, or confirm the proxy is configured to call the
correct endpoints (let’s say my service is a client of another service B, and
for us, B has a dedicated fleet because of our traffic patterns). This is an
abstraction. Abstractions have cost.
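
For a sense of what "tune their retry logic" means in practice, here's a
hedged sketch of the knobs involved (attempt count, backoff, what counts as
retryable) - the same decisions apply whether this lives in your code or in a
sidecar's config. The URL is hypothetical.

```go
// Retry policy in miniature: the service owner still has to pick the
// numbers, even if a mesh proxy executes them.
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

func getWithRetries(url string, attempts int) (*http.Response, error) {
	base := 100 * time.Millisecond
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(url)
		// Retry only transport errors and 5xx; retrying 4xx just adds load.
		if err == nil && resp.StatusCode < 500 {
			return resp, nil
		}
		if resp != nil {
			resp.Body.Close()
		}
		// Exponential backoff with jitter so retries don't synchronize.
		time.Sleep(base<<uint(i) + time.Duration(rand.Int63n(int64(base))))
	}
	return nil, errors.New("all retries exhausted")
}

func main() {
	if _, err := getWithRetries("http://service-b.internal/health", 3); err != nil {
		fmt.Println(err)
	}
}
```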

Trust but verify.

The trap people fall into is, “Here’s a new technology or concept. Let’s all
flock to it without considering the costs.”

------
theamk
It’s a pity “fat clients” are dismissed so quickly. I think that when your
tech stack is uniform enough to use them, they can provide much more than
service meshes, and do it faster as well. After all, why do “service is
down” and “service is sending nonsense” have to be handled via completely
different paths?

~~~
omeze
The main problem with fat clients is that for polyglot architectures (which
most large companies that end up building a service mesh evolve into over
time) you have to maintain a fat client library for every language. You can
get very far leveraging existing tools like gRPC that codegen fat clients for
you, but the quality of tooling is very uneven depending on the language of
choice. By pushing all of this into the network layer you skip all of that.

~~~
theamk
Right, polyglot architectures have no choice, but this text talks about “5
person startups” as well. Surely they can keep the set of languages limited?

Plus, it’s not an either/or situation. Fat clients for Go + Node.js, and a
proxy for everything else. This way your core logic can enjoy increased
introspection / more speed / higher reliability, while special-purpose
services get a proxy that allows interoperability.

------
jayd16
As someone who's familiar with the API gateway pattern, is it fair to say this
is just another API gateway for internal services? It seems like it is, but
it's also described in an extremely convoluted way, with 'control planes' and
such.

~~~
hardwaresofton
The service mesh is a bit different from an API gateway -- in its current
most popular implementations (Linkerd[0] & Istio[1]), there are basically
small programs that run next to _each individual instance_ of the programs you
want to run. Linkerd has been around for a while, and IMO there weren't _that_
many companies at a scale where they needed it (I didn't see it deployed that
often), but it's the same concept at a more granular level -- if you delegate
all your requests to some intermediary, then the intermediaries can deal with
the messy logic and tracing so your program doesn't have to.

A better way to describe it is "smart pipes, dumb programs". Imagine that all
your circuit-breaking/retry/etc. robustness logic was moved into another
process that happened to be running right next to the program actually doing
the work.
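
In miniature, and with all names illustrative, the "smart pipe" is little
more than a local reverse proxy the app routes everything through:

```go
// Sidecar sketch: the app talks only to localhost; this process forwards
// traffic on, and is where metrics, retries, and mTLS would be hung.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	upstream, _ := url.Parse("http://service-b.internal:8080") // hypothetical
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		proxy.ServeHTTP(w, r)
		// "Free" observability: the app never wrote this log line.
		log.Printf("%s %s took %s", r.Method, r.URL.Path, time.Since(start))
	})

	// The app is configured to send all outbound calls to 127.0.0.1:15001.
	log.Fatal(http.ListenAndServe("127.0.0.1:15001", handler))
}
```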

You can have both an API gateway _and_ a service mesh deployment -- for
example Kong's Service Mesh[2] works this way. They're saying stuff like
"inject gateway functionality in your service", but that only makes sense if
you send literally every request (whether intra-service or to/from the outside
world) through the gateway. _Maybe_ that's how some people used Kong, but I
don't think everyone thought of API gateways as a place to send every single
request through. You'll have a Kong API gateway at the edge _and_ the Kong
proxies (little programs that you send all your requests through) next to
every compute workload.

[0]: [https://linkerd.io/](https://linkerd.io/)

[1]: [https://istio.io/](https://istio.io/)

[2]: [https://konghq.com/solutions/service-mesh/](https://konghq.com/solutions/service-mesh/)

~~~
jayd16
Hmm, is the assumption that, because you're deploying an instance of the mesh
as close to the application as possible, you don't need robust logic between
the application and the service mesh? I can buy that, I suppose.

~~~
hardwaresofton
Yes, kind of -- except it's not between the application and the service mesh,
it's between application and application.

Imagine that for every application there is _one_ small binary that runs and
serves _all_ its traffic, like a chauffeur. Your application stops talking to
the outside world completely and sends all messages to the small chauffeur
binary -- which then talks to _other_ chauffeurs, over the network.

Keeping with the chauffeur analogy, there is a "head office" which calls the
chauffeurs on CB radio at regular intervals and lets them know which cars go
where, how to start them, etc.

"head office" => "control plane"

"chauffeur" => "side-car proxy"/"data plane"

In the end what this means for your application is that you just make calls
to external services (whether your own or others), and since _all_ your
communication goes through this other binary, you get monitoring, traffic
shaping, enhanced security, and robustness for free.

Another interesting feature is that if the side-car proxy can actually
_understand_ your traffic, it can do even more advanced things. For example
you can prevent `DELETE`s from being sent to Postgres instances at the
_network_ level.
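
As a grossly simplified illustration of what "understanding the traffic"
buys: Postgres's simple-query protocol frames each query as a 'Q' message
(type byte, 4-byte length, NUL-terminated SQL), so a proxy can peek at the
statement before forwarding it. This sketch ignores the startup handshake,
TLS, and the extended query protocol entirely:

```go
// Protocol-aware filtering, heavily simplified: read one client frame
// and refuse to forward simple-query messages that start with DELETE.
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
	"strings"
)

func filterFrame(r *bufio.Reader) ([]byte, error) {
	typ, err := r.ReadByte()
	if err != nil {
		return nil, err
	}
	var lenBuf [4]byte
	if _, err := io.ReadFull(r, lenBuf[:]); err != nil {
		return nil, err
	}
	n := binary.BigEndian.Uint32(lenBuf[:]) // length includes these 4 bytes
	body := make([]byte, n-4)
	if _, err := io.ReadFull(r, body); err != nil {
		return nil, err
	}
	if typ == 'Q' { // simple-query message: NUL-terminated SQL string
		sql := strings.TrimSpace(strings.TrimRight(string(body), "\x00"))
		if strings.HasPrefix(strings.ToUpper(sql), "DELETE") {
			return nil, fmt.Errorf("blocked by policy: %q", sql)
		}
	}
	return append([]byte{typ}, append(lenBuf[:], body...)...), nil
}

func main() {
	// Build a fake 'Q' frame carrying a DELETE and watch it get blocked.
	sql := []byte("DELETE FROM users;\x00")
	frame := append([]byte{'Q', 0, 0, 0, byte(4 + len(sql))}, sql...)
	if _, err := filterFrame(bufio.NewReader(bytes.NewReader(frame))); err != nil {
		fmt.Println(err) // blocked by policy: "DELETE FROM users;"
	}
}
```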

------
peterwwillis
Every part of a service mesh could be baked into operating systems so that all
this extra technology was just there by default. This would put a fair number
of start-ups out of business, but it would also mean a lot fewer people having
to be hired to set up and maintain all this stuff. Devs could just... develop
software, with a clear view into how their apps run at scale. And Ops wouldn't
have to custom-integrate 100 different services.

This is really the future of distributed parallel computing, but we're still
just bolting it on rather than baking it in.

------
reilly3000
I'm evaluating AWS App Mesh at the moment. We're a really small team, so we're
choosing Fargate over Kubernetes - mainly because we don't need nodes nor want
to deal with them.

The appeal of App Mesh for us was initially around using it to facilitate
canary deployments. AWS CodeDeploy does a nice job with blue/green
deployments, and that may suffice for us, but it doesn't support canary for
Fargate. Is that enough reason to add the additional complexity to our stack?
Not sure; looking for input.

Also, much of the documentation is focused on K8s. I'm murky on how to
implement an internal namespace for routing. Most of what I've seen is like
myenv.myservice.svc.cluster.local, but it's not clear to me that using that
pattern is needed in the context of Fargate.

Consistent observability is valuable, but again, Fargate can do that pretty
well - it just doesn't mandate access logging, so that would be left to the
app itself.

We want to implement OIDC on the edge for some services, but App Mesh doesn't
support that yet, as other meshes like Ambassador, Gloo, and Istio seem to.
Since App Mesh doesn't really act as a front proxy on AWS, we'll still be
using an ALB to handle auth, which is fine, I think. I get mixed messages
about the need for JWT validation, but if it is needed, it would have to be
implemented at the app level with the ALB fronting it.

Can anybody help me find resources to sort this out? I've been through the
`color-teller` example time and time again, but it still leaves lots of open
questions about how to structure a larger project and handle deployments
effectively.

~~~
hardwaresofton
> The appeal of App Mesh for us was initially around using it to facilitate
> canary deployments. AWS CodeDeploy does a nice job with blue/green
> deployments, and that may suffice for us, but it doesn't support canary for
> Fargate. Is that enough reason to add the additional complexity to our
> stack? Not sure; looking for input.

Maybe you should write a script for this? It sounds like you're about to take
on a _lot_ of complexity for just the ability to do canary deployments when
you could probably hack up a script in a day or two.
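
Something like the following is roughly what "a script" means here - shift
weight in steps, soak, and roll back on an error-rate spike. setWeight and
errorRate are hypothetical stand-ins for whatever your load balancer and
metrics store expose; the thresholds are yours to tune.

```go
// Poor-man's canary: step traffic up, watch errors, roll back on a spike.
package main

import (
	"fmt"
	"time"
)

// Hypothetical hooks: wire these to your LB / App Mesh route weights and
// to CloudWatch (or whatever metrics store you use).
func setWeight(canaryPercent int) {}
func errorRate() float64          { return 0.001 }

func main() {
	for _, pct := range []int{10, 25, 50, 100} {
		setWeight(pct)
		fmt.Printf("canary at %d%%, soaking...\n", pct)
		time.Sleep(5 * time.Minute) // soak time; tune to taste
		if errorRate() > 0.01 {     // so is the threshold
			setWeight(0)
			fmt.Println("error rate spiked; rolled back")
			return
		}
	}
	fmt.Println("canary promoted to 100%")
}
```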

> We want to implement OIDC on the edge for some services, but App Mesh
> doesn't support that yet, as other meshes like Ambassador, Gloo, and Istio
> seem to. Since App Mesh doesn't really act as a front proxy on AWS, we'll
> still be using an ALB to handle auth, which is fine, I think. I get mixed
> messages about the need for JWT validation, but if it is needed, it would
> have to be implemented at the app level with the ALB fronting it.

JWTs are only required for client-side identity tokens (you can use opaque IDs
and other kinds of stuff for backends) -- it also seems like you're looking
for something to take authentication off your hands? App Mesh doesn't do that
AFAIK; it's _only_ trying to solve service<->service communication.
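
If JWT validation does end up at the app level behind the ALB, it's a small
middleware. A sketch, assuming github.com/golang-jwt/jwt/v5 and an HS256
shared secret purely for illustration - a real setup would verify the IdP's
signing keys instead:

```go
// Bearer-token check as plain HTTP middleware behind the load balancer.
package main

import (
	"net/http"
	"strings"

	"github.com/golang-jwt/jwt/v5"
)

var secret = []byte("demo-only-secret") // illustration; never hardcode keys

func requireJWT(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		raw := strings.TrimPrefix(r.Header.Get("Authorization"), "Bearer ")
		tok, err := jwt.Parse(raw,
			func(t *jwt.Token) (interface{}, error) { return secret, nil },
			jwt.WithValidMethods([]string{"HS256"}))
		if err != nil || !tok.Valid {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	http.Handle("/", requireJWT(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})))
	http.ListenAndServe(":8080", nil)
}
```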

I think it might be a good idea to make a concise list of what you're trying
to accomplish here; it seems kind of all over the place. From what I can tell
it's:

- Ability to do canary deployments

- The ability to shape traffic to services (?)

- Observability, with access logging

- AuthN via OIDC at the edge

A lot of meshes do the above list of things, but whether it's worth adopting
one just to get the pieces you don't already have (which is really only #2,
assuming you scripted up #1) is a harder question.

~~~
shubha-aws
- Namespaces: In order to identify the versions of services for routing, you
need independent virtual nodes and routes in a virtual router. You can reuse
the DNS names or use Cloud Map names with metadata to identify the
versions/virtual nodes.

- OIDC at ingress: App Mesh does not do this yet; ALB / API Gateway is needed
for this. App Mesh has this on the roadmap.

- Resources: You can reach the App Mesh team with specific questions on the
App Mesh roadmap GitHub, and we can help.

------
solatic
Re. "fat client" libraries:

> Sure, it only worked for JVM languages, and it had a programming model that
> you had to build your whole app around, but the operational features it
> provided were almost exactly those of the service mesh.

The thing is, all of our microservices communicate with each other using
Kafka. Envoy has an issue open for Kafka protocol support [1], but it's a
fundamentally difficult issue because adopting Kafka forces you to build out
"fat client" code and building a network intercept that can work with pre-
existing Kafka client code is non-trivial. On observability, Kafka produces
its own metrics.

Granted, Kafka doesn't offer the same level of control. But Kafka does offer
incredible request durability guarantees. We don't have "outages" - we have
increased processing latency, and Istio/Envoy and other service meshes can't
offer that because they do not replicate and persist network requests to disk.

[1]: [https://github.com/envoyproxy/envoy/issues/2852](https://github.com/envoyproxy/envoy/issues/2852)

------
reissbaker
Opinionated read, but interesting. That being said, Linkerd wasn't the first
service mesh — SmartStack predates it by three years. [1] Although they didn't
use the (then-nonexistent) "service mesh" term at the time, it pioneered the
concept of userspace TCP proxies configured by a control plane management
daemon. I doubt the Linkerd folks are unaware of it, so it was a surprising
omission.

[1]: [https://medium.com/airbnb-engineering/smartstack-service-discovery-in-the-cloud-4b8a080de619](https://medium.com/airbnb-engineering/smartstack-service-discovery-in-the-cloud-4b8a080de619)

------
golover721
While nobody ever seems to want to hear it, the vast majority of companies
utilizing service meshes and k8s are wasting huge amounts of time and money on
things they don’t need.

Unfortunately these technologies are at peak hype, so everyone seems to be
implementing them for their small-to-medium CRUD apps - but they get very
sensitive if you try to point it out.

------
Animats
How many transactions per second before you need all that stuff? If you're not
in the top 100 sites, it seems unnecessary.

~~~
tptacek
It's not as much about load as it is about complexity; it starts to make sense
when you hit some threshold number of internal services, regardless of the
amount of traffic you're doing. You use a service mesh to factor out network
policy and observability from your services into a common layer.

~~~
thom
What is the threshold above which a service needs to exist at all, over a
module in an existing codebase?

~~~
cpitman
The point at which you have multiple teams working on the same codebase, and
their velocity is suffering from communication overhead and missteps.

~~~
koffiezet
A few remarks:

* 'Codebase' should be defined as 'the platform', where one team will most
likely never look at the code of other teams' microservices.

* These communication problems and overhead start the moment you go from 2 to
3 or more teams.

* The term 'team' in this context should be interpreted very broadly. One dev
working alone on a microservice should be considered "a team".

Also, things mentioned in the article: you don't want to implement TLS,
circuit breakers, retries, ... in every single microservice. Keep them as
simple as possible. Adding stuff like that creates bloat very quickly.

------
djohnston
This is quite interesting. I used to work in more devops-y kinds of roles, but
at my current gig that has been almost entirely removed from my purview. It's
impressive to step away for a few years and return to see so many changes, and
the article laid out the concepts in an easy-to-understand manner.

------
adolph
If one were to implement a service mesh of microservices, wouldn't the
services need to be versioned, similar to how the packages used by a
microservice are version-pinned?

~~~
dodobirdlord
Sort of, but only for major versions, and it's preferable to bake that sort of
thing into the API itself. The API exposed by a microservice should only ever
be updated in backwards compatible ways unless you can verify that you have no
callers, which is hard. New functionality should be introduced using backwards
compatible constructs like adding fields to JSON or protobuf. Breaking changes
go in a new API. This is easily managed conceptually by having the
microservice expose version information as part of the API. A FooService might
define "v1/DestroyFoo" and "v2/DestroyFoo" with different calling contracts.
Perhaps v1 was eventually consistent and returned a completion token that
could be used with a separate "v1/CheckFooDeletionStatus", but now with v2 the
behavior has been made strongly consistent and there is no
"v2/CheckFooDeletionStatus". The v2 of the API can thus be thought of really
as a separate API that happens to be exposed by the same microservice, and
pre-existing callers can continue to call the (perhaps now inefficient) v1
API.
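
A rough sketch of that shape, using the commenter's hypothetical FooService
(storage is stubbed out):

```go
// One service, two contracts: v1 is async-with-token, v2 is synchronous,
// so v2 has no CheckFooDeletionStatus counterpart at all.
package main

import (
	"encoding/json"
	"net/http"
)

// Stubs standing in for real deletion logic.
func destroyFooAsync(id string)        {}
func destroyFooSync(id string)         {}
func lookupStatus(token string) string { return "done" }

func main() {
	mux := http.NewServeMux()

	// v1: eventually consistent; hand back a token to poll with.
	mux.HandleFunc("/v1/DestroyFoo", func(w http.ResponseWriter, r *http.Request) {
		go destroyFooAsync(r.URL.Query().Get("id"))
		json.NewEncoder(w).Encode(map[string]string{"completionToken": "tok-123"})
	})
	mux.HandleFunc("/v1/CheckFooDeletionStatus", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(map[string]string{
			"status": lookupStatus(r.URL.Query().Get("token")),
		})
	})

	// v2: strongly consistent; the call returns only once the foo is gone.
	mux.HandleFunc("/v2/DestroyFoo", func(w http.ResponseWriter, r *http.Request) {
		destroyFooSync(r.URL.Query().Get("id"))
		w.WriteHeader(http.StatusNoContent)
	})

	http.ListenAndServe(":8080", mux)
}
```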

