
The service mesh era: Using Istio and Stackdriver to build an SRE service - crcsmnky
https://cloud.google.com/blog/products/devops-sre/the-service-mesh-era-using-istio-and-stackdriver-to-build-an-sre-service
======
talawahdotnet
I am still praying that some day soon AWS will announce that they are joining
OpenCensus (along with Google, MS, Datadog, and Prometheus)[1], in the hopes that
we can move towards standard tooling for observability.

They also seriously need to give CloudWatch a UI/UX overhaul.

1. [https://opencensus.io/introduction/#partners-contributors](https://opencensus.io/introduction/#partners-contributors)

~~~
sciurus
OpenCensus seems like it's really a Google-only project. OpenTracing and
OpenMetrics appear to have more community and vendor engagement.

E.g., Datadog is basing its newer tracing libraries on OpenTracing, and the
Prometheus devs are behind OpenMetrics.

~~~
manigandham
That's exactly the fragmentation that we don't need. OpenCensus has backing
from Microsoft too and is designed as a single API and library to support
both tracing and metrics using the same context.

OpenTracing and OpenMetrics are more like API specs, with the libraries left
for others to implement, and they're rarely used standalone, which doesn't
justify keeping them as separate projects. The best option for the industry
would be to fold OT and OM into OC and make a single stack, hopefully
including structured logging as well.
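
To illustrate the single-context point, here is a rough Go sketch (not taken from any of these projects' docs; the span and measure names are made up): with OpenCensus the same context carries both the trace span and the tags that metrics are recorded against.

```go
// Sketch only: one OpenCensus context feeding both a trace span and a metric.
// "checkout" and the measure name are hypothetical; a real service would also
// register views and an exporter (Stackdriver, Prometheus, ...) at startup.
package main

import (
	"context"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/trace"
)

var latencyMs = stats.Float64("checkout/latency", "Checkout latency", stats.UnitMilliseconds)

func checkout(ctx context.Context) {
	// The span is started from, and stored back into, the same ctx
	// that stats.Record uses below.
	ctx, span := trace.StartSpan(ctx, "checkout")
	defer span.End()

	start := time.Now()
	// ... actual work ...
	stats.Record(ctx, latencyMs.M(float64(time.Since(start))/float64(time.Millisecond)))
}

func main() {
	checkout(context.Background())
}
```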

~~~
lugg
Why is that best for the industry?

Or more precisely, which industry?

~~~
manigandham
Because tracing, metrics, and logs are the three fundamentals of observability,
and having a single standard API and library would let everyone move on
from reimplementing the same thing all these years.

OpenTracing and OpenMetrics offer zero advantages over OpenCensus that would
justify keeping them as separate projects.

~~~
sciurus
OpenTracing began in 2015. It was accepted as a CNCF project in 2016.

OpenCensus began in 2018.

Don't get me wrong, it's better for Google and Microsoft to collaborate on
OpenCensus instead of continuing to develop separate client libraries for
Stackdriver and Application Insights. It would be lovely if AWS joined too.

I just want to point out that there was a lot of community effort and vendor
adoption around OpenTracing by 2018 that Google chose to ignore. If you want
to reduce fragmentation and reimplementation, criticize that decision, not the
existence of OpenTracing.

~~~
manigandham
Users don't care what came first. OpenCensus created a better model by
combining tracing and metrics and building libraries.

------
andriosr
Zero-instrumentation visibility with the service mesh, but the demo app is
instrumented. I've seen this point sold everywhere for service meshes, but
the vanilla tracing data you get from Istio or others is not that useful by
itself. There is no magic; you still need to instrument your code.
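
For example (a hedged Go sketch with hypothetical service names): even with Istio's sidecars reporting spans, the application still has to forward the trace headers it received, otherwise the inbound and outbound spans never get stitched into one trace.

```go
// Rough sketch of the instrumentation Istio's tracing still needs from the app:
// forward the B3 trace headers Envoy injected. Upstream URL is hypothetical.
package main

import (
	"log"
	"net/http"
)

var traceHeaders = []string{
	"x-request-id",
	"x-b3-traceid",
	"x-b3-spanid",
	"x-b3-parentspanid",
	"x-b3-sampled",
	"x-b3-flags",
}

func handler(w http.ResponseWriter, r *http.Request) {
	out, err := http.NewRequest("GET", "http://reviews:9080/reviews", nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	// Without this loop the sidecar starts a fresh trace for the outbound
	// call instead of continuing the inbound one.
	for _, h := range traceHeaders {
		if v := r.Header.Get(h); v != "" {
			out.Header.Set(h, v)
		}
	}
	if _, err := http.DefaultClient.Do(out); err != nil {
		log.Println(err)
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```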

~~~
kcmastrpc
There is magic (caveat: I work here): Instana
([https://instana.com](https://instana.com)) will instrument most major
languages and frameworks auto-magically. As in, I don't have to declare a
dependency, change a configuration, or anything: our agent finds processes
running on the system, bootstraps the libraries while they're running, and
monkey-patches a huge number of standard libraries and frameworks with no
restarts. (Don't believe me? Give the trial a shot.)

There is literally nothing else quite like it in the market, and it gives you
distributed tracing, automatic metric collection, and pre-defined alerts for a
reasonable price.

[https://docs.instana.io/core_concepts/tracing/#supported-technologies](https://docs.instana.io/core_concepts/tracing/#supported-technologies)

~~~
Diggsey
That sounds horrific for whoever is going to be supporting that system...

The last thing I would want in a production environment is to have some 3rd
party software monkey-patching the code at runtime.

What happens when:

- a bug only occurs (due to timing or some other extremely subtle issue) when this monkey-patching is applied?
- there's a bug in the monkey-patching itself? (Sounds like a fun debugging session!)
- a library is accidentally monkey-patched with a slightly different version, or falsely detected as a known library (maybe it is a fork)?

Give me statically compiled, reproducible, dependency-free musl binaries that
are bit-for-bit identical with what has been thoroughly tested in CI, any day.
That's how you avoid getting woken up at 4am.

This kind of magic should happen at compile time, if at all.

~~~
kcmastrpc
We have hundreds of customers (and thousands of engineers) who are willing to
make the trade-off.

It's OK that you're not, but I hope you can agree that engineering
observability is neither cheap nor easy - and if you're using standard
libraries, frameworks, and tooling (and not going way off the rails), we have
observed that, for the most part, our agent works as intended.

We always recommend our customers run the agent in their test and integration
environments, but you are correct, there are always risks involved. Other than
the automation, how is this any different than putting a New Relic jar into
your Java app, or including a Datadog library? We simply figured out how to do
it automatically at runtime.

~~~
Diggsey
I definitely value ease of use and zero-setup solutions - the issue for me
here is that the situations where I would consider running a service mesh are
the same situations where predictability and reproducibility would trump ease
of use - namely any kind of production setting where down-time is sufficiently
undesirable.

Testing with the agent would certainly help, but then you lose some of the
"ease of use" benefits as I expect you would have to run a mini cluster in CI
in order to run your agent?

There are a few important differences between this and a "normal" dependency:

- Even if the application is fully tested with your agent, it could be
something as simple as turning your agent _off_ that could break things.

Hypothetical scenario: multiple instances of the application are running with
your agent enabled. Someone decides to turn off monitoring for some reason -
nothing bad happens and they go home at the end of the day. Later on, some
instances are restarted, or the cluster is re-scaled. Now you have half your
cluster on a different code-base and your serialisation breaks because you
were doing something silly like using pickle or a java object stream.

- The examples I mentioned in my previous comment would not happen with a
normal dependency, because the version of that dependency would already be
managed through standard means. If I were to go and look at the code, I would
be able to see the actual code that is running, and the exact versions of all
dependencies used.

~~~
kcmastrpc
Well, I'm guessing a competitor flagged my OP; but I digress, I was just
trying to raise awareness of what is actually possible (even if it's not free -
but I'd argue nothing of value is ever free; even with open source you still
have to implement and operate it).

Anyways, I feel like we've come to an impasse; there is no monitoring solution
out there which is bug-free (even OpenTracing and its various implementations
have caused performance/stability issues, re:
[https://github.com/opentracing-contrib/java-spring-web/pull/44](https://github.com/opentracing-contrib/java-spring-web/pull/44)).

Regarding CI, our agent has no requirements other than a supported OS - you
could be running your integration tests as a bare JVM and our agent would
detect, instrument, and monitor it the same way as if it were running inside a
CRI-O container on K8S (though I'd question why you would run your integration
tests in that manner).

IRT your examples, I'll be brief in my responses because you're not wrong, but
the engineers on our team have taken great care to ensure we don't break our
customers' environments (we run on systems which process tens of millions of
requests an hour, where minutes of downtime cause losses in the hundreds of
thousands).

We dynamically unload our sensors/instrumentation when the agent is unloaded,
so the likelihood of the issue mentioned earlier happening is slim (though
nothing is impossible).

We also don’t instrument serialization methods (unless you were to decide to
use our SDK to do so) so that’d literally never happen. We hook onto methods
which handle communication between systems — HTTP request handlers, DB
handlers, Messaging System handlers, Schedulers, etc.

Our sensors are open source, so you can check out the code if you’d like
([https://github.com/instana](https://github.com/instana)). As I said earlier,
we live in a world of trade-offs. I'd argue that systems which require a
service mesh are complex enough to warrant this level of automation to provide
visibility that, quite frankly, 99.9% of organizations don't have the time to
build themselves.

------
pmlnr
SRE service. Not any kind of service - an SRE service!

Can we please stop the buzzword train?

------
bogomipz
I would be interested in anyone's feedback on embracing and rolling out a
service mesh/Istio in a non-GCP environment.

I apologize if this is a naive question, but how come this wasn't included as
part of the Kubernetes project, given that it has the same Google origins?

~~~
twblalock
The Istio project is not supposed to be tied to Kubernetes. It is supposed to
be a general-purpose service mesh.

That being said, I have been looking for a while and I can't find anyone who
uses it in production on a platform other than Kubernetes.

~~~
barbecue_sauce
Also worth noting that Istio is not part of the CNCF while Linkerd is.

~~~
lcalcote
It is a good note to make. How does this bear on the question above?

~~~
barbecue_sauce
He was asking why Istio was not included as part of Kubernetes itself despite
both projects originating at Google. I was implying that there must be some
reason, given that Istio is not in the CNCF while Linkerd (arguably an Istio
competitor) and Kubernetes are. To further that idea, it seems that Google
wants to maintain direct control over Istio itself, rather than put it into an
organization with multiple sources of institutional governance (Amazon,
Huawei, Samsung, Microsoft, Oracle, etc.). If Istio or any other service mesh
had been incorporated directly into Kubernetes by Google, they would have lost
some control over it.

(There are also obvious technical reasons for decoupling something like this
from Kubernetes, mostly the opinionated nature of forcing a service mesh over
other potential solutions).

------
jcims
n00b question: I always see service meshes used in the context of containers,
and mostly with Kube. Would they work with a more monolithic/traditional
n-tier architecture deployed directly on the host OS as well? Or, put another
way, are there likely to be pain points that don't exist in containerized
architectures?

~~~
twblalock
The main pain point is that meshes were written for containerized environments
first, and attempts to extend full functionality to other environments are
still pretty immature.

Meshes are a lot more than just sidecar proxying -- they are what make sidecar
proxying manageable, and they add a lot of other features like authentication,
network policies, various other traffic control policies, service discovery,
etc. They are an attempt to do for service-to-service communication what
Kubernetes has done for container deployment -- make it abstract and
declarative, with configurations that are independent from the underlying
implementation.
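
As a concrete (if hypothetical) illustration of that declarative style, an Istio VirtualService along these lines shifts a fraction of traffic to a new version of a service, and the control plane turns the declared intent into routing config on every sidecar:

```yaml
# Sketch of declarative traffic control: send 10% of "reviews" traffic to v2.
# The v1/v2 subsets would be defined in a companion DestinationRule.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
    - reviews
  http:
    - route:
        - destination:
            host: reviews
            subset: v1
          weight: 90
        - destination:
            host: reviews
            subset: v2
          weight: 10
```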

The underlying implementation that works right now is the Kubernetes API and
etcd, and alternate implementations need to be provided for those features to
work well outside of Kubernetes. I think it will happen sometime in the next
few years.

