
Riding the Tiger: Lessons Learned Implementing Istio - zwischenzug
https://zwischenzugs.com/2020/05/05/riding-the-tiger-lessons-learned-implementing-istio/
======
joekrill
This jibes pretty well with my (admittedly limited) experience with Istio. It's
certainly frustrating at times, but I still find the documentation is actually
pretty decent.

It's just that everything seems to lead down a rabbit hole. But this, I think,
is just a Kubernetes thing in general. I had the same experience they did with
a monitoring stack. But that's because you have to ramp up on so many
additional technologies (Prometheus, Grafana, Kiali, etc...). And not just
ramp up on their usage, but how they work, interact, and are configured. For
Prometheus, for example, they suggest using a federated setup, which adds
further complexity.

I messed around with strict mTLS for quite some time before simply giving up -
it just wasn't worth the time sink.
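(For anyone curious about where that time sink starts: in Istio 1.5+ the strict-mTLS config itself is tiny, a single mesh-wide PeerAuthentication resource like the sketch below; older releases used MeshPolicy instead. The hard part is everything around it, like workloads outside the mesh that can no longer talk to you.)

```yaml
# Sketch: mesh-wide strict mTLS in Istio 1.5+ (the istio-system
# namespace makes it apply mesh-wide).
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```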

But in general I agree with the conclusions. Most things are pretty straight-
forward and the documentation has really good examples (using their "Bookinfo"
project). It's just the "going off the beaten path" thing they describe when
things become difficult.

~~~
zwischenzug
Glad to hear it’s not just us...

------
pcj-github
Can completely relate to this article... Have been up and down rabbit holes
trying to find an ingress controller that works well as an edge proxy for gRPC
streaming services (tried nginx-ingress, contour, esp, istio, ambassador).
Very challenging to get the configuration right (and I haven't managed it
yet).

------
arrayjumper
This is very close to our experience of working with Kubernetes and Istio for
over a year. You get so many things seemingly for free that when things work
as expected the whole thing is actually quite nice.

It is when things don't work as expected that it's really hard to find help.
We had this issue happen a couple of weeks ago where we were trying to connect
from one of our clusters to an Elasticache instance. We could connect to it
from a namespace with istio sidecars disabled but not from a namespace with it
enabled. We could also connect to it from a namespace in a different cluster
which did have istio enabled. It took nearly a week to figure that out because
there is so little prior art (in terms of stackoverflow questions, github
issues etc). This comes with the territory of using something relatively new I
suppose.

------
pluies
This post certainly rings a bell! Even with deep Kubernetes experience, we
struggled at times with Istio at $previous_job, especially around:

- Control-plane performance on a cluster with a large-ish number of pods
(thousand+) can be hit-and-miss, and "what to scale up" was hard to pinpoint
(though admittedly it seems to be getting better)

- Istio upgrades are often a pain, but mostly around the actual way of
deploying the upgrade, rather than the upgrade itself. For a long time there
was no official Helm chart, then there was a Helm chart, then two Helm charts,
now it looks like Helm is deprecated and will be removed; instead installing
via `istioctl` is recommended... Some of it is due to the pain of upgrading
CRDs, which is a general Kubernetes issue, but there's still a _lot_ of churn
to keep up with.

- Adding a new VirtualService registered to the same hostname as an existing
one will be accepted by Istio (at least as of 1.4), and will proceed to
_silently break all routing for new pods joining your istio cluster_! This was
a bear to debug too, given how noisy and confusing the Pilot logs are, and we
ended up stitching up a custom Prometheus alert around this, given it bit us
roughly every other week.

- HA for control-plane components isn't explained in the docs. Is it safe to
run two Citadel pods? We did it and it seemed fine, but who knows?

- We sometimes ran into pathological cases where traffic would for some
reason completely drop after a new deploy, and gradually pick up as the
config was streamed onto sidecars, over a span of ~10-15 minutes. We never
managed to debug this issue (which happened probably half a dozen times over a
year), and that mere fact turned me off the complexity of Istio in general.
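On the duplicate-VirtualService point above: a lightweight guard (assuming you can dump VirtualService specs, e.g. via `kubectl get virtualservices -A -o json`) is to flag any hostname claimed by more than one VirtualService. A minimal sketch in Python, with hypothetical example names:

```python
from collections import defaultdict

def duplicate_hosts(virtual_services):
    """Given (name, hosts) pairs pulled from VirtualService specs,
    return {host: [names]} for every host claimed more than once."""
    claims = defaultdict(list)
    for name, hosts in virtual_services:
        for host in hosts:
            claims[host].append(name)
    return {h: names for h, names in claims.items() if len(names) > 1}

# Hypothetical example: two VirtualServices both route api.example.com
vss = [
    ("default/api-v1", ["api.example.com"]),
    ("default/api-v2", ["api.example.com"]),
    ("default/web", ["web.example.com"]),
]
print(duplicate_hosts(vss))
# {'api.example.com': ['default/api-v1', 'default/api-v2']}
```

Running something like this in a cron-style checker is one way to get an alert before the silent breakage bites.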

(Some of these might have been fixed in Istio 1.5+, as 1.4 was my latest
experience)

Of course, once your setup is stable, everything is awesome: sidecar injection
works flawlessly, observability is excellent, distributed tracing is a breeze,
Kiali is a great crowd-pleaser when showing off features, mTLS + TLS
origination mean full on-the-wire encryption without losing any of the
previous benefits... A lot of features that meant we carried on with it, but
if I had to start again I'd probably have a good hard look at Linkerd before
recommending Istio for any prod setup.
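(For reference, the TLS-origination half of that is just a DestinationRule telling the sidecar, or an egress gateway, to upgrade plaintext to TLS toward a destination. A sketch; the hostname is a placeholder:)

```yaml
# Sketch: TLS origination for an external host (placeholder name).
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: originate-tls
spec:
  host: api.example.com  # placeholder external host
  trafficPolicy:
    tls:
      mode: SIMPLE  # proxy originates TLS; the app can speak plaintext
```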

~~~
williamallthing
Thanks for the Linkerd shoutout! Many of our Linkerd adopters these days in
fact seem to be Istio refugees, looking for something simpler and lighter.
Happy to have them :)

------
jordanbeiber
Envoy is where the magic happens.

For those interested in a quick spin of envoy - take a look at consul connect
for something a bit more concise.

Although not for k8s, the end goal is pretty much the same.
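(For a taste of how concise it is: registering a service with a Connect sidecar in Consul is a small HCL stanza. Service names and ports below are placeholders:)

```hcl
# Sketch: Consul service definition with a Connect sidecar proxy
# (Envoy under the hood). Names/ports are placeholders.
service {
  name = "web"
  port = 8080

  connect {
    sidecar_service {
      proxy {
        upstreams = [
          {
            destination_name = "api"
            local_bind_port  = 9191  # web reaches api via localhost:9191
          }
        ]
      }
    }
  }
}
```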

~~~
omar16100
Any reference blog post/doc which you like?

~~~
jordanbeiber
Lyft's announcement of the Envoy proxy:

[https://eng.lyft.com/announcing-envoy-c-l7-proxy-and-communication-bus-92520b6c8191](https://eng.lyft.com/announcing-envoy-c-l7-proxy-and-communication-bus-92520b6c8191)

Interesting read.

I actually wrote a consul SDS integration for envoy a couple of years back at
a previous employer.

HashiCorp's simple local setup:

[https://learn.hashicorp.com/consul/developer-mesh/connect-envoy](https://learn.hashicorp.com/consul/developer-mesh/connect-envoy)

------
DevopsQuestions
The titular allusion: [https://www.amazon.com/Ride-Tiger-Survival-Manual-Aristocrats/dp/0892811250](https://www.amazon.com/Ride-Tiger-Survival-Manual-Aristocrats/dp/0892811250)

------
acd
I tend to prefer simpler ingress controllers like Traefik.

