
Failure Injection on Kubernetes with SMI and Linkerd - williamallthing
https://linkerd.io/2019/07/18/failure-injection-using-the-service-mesh-interface-and-linkerd/
======
lemoncucumber
It's worth noting that Istio has built-in support for failure injection (i.e.
without needing to run a separate service to return 500s):
[https://istio.io/docs/tasks/traffic-management/fault-injecti...](https://istio.io/docs/tasks/traffic-management/fault-injection/#injecting-an-http-abort-fault)

As far as I know Linkerd does not (yet) have such a feature though, so this
post seems like a reasonable alternative.

~~~
adlleong
That's right, and there are some interesting trade-offs. Having failure
injection built-in is convenient but running a separate service gives you full
control over the error responses. This can be useful if you want to simulate
responses with error bodies, for example.
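
To make the "error bodies" point concrete, here is a minimal sketch of a standalone failure service that answers every GET with a 500 and a JSON body. It is not the service used in the blog post; the payload shape, the `FailingHandler` class, and the `make_server` helper are all assumptions made up for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical error payload; the field names are assumptions, not from the post.
ERROR_BODY = {"error": "injected_failure", "message": "simulated backend outage"}


class FailingHandler(BaseHTTPRequestHandler):
    """Answers every GET with HTTP 500 and a JSON error body."""

    def do_GET(self):
        payload = json.dumps(ERROR_BODY).encode("utf-8")
        self.send_response(500)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        # Silence the default per-request logging to stderr.
        pass


def make_server(port=0):
    """Build the server on localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), FailingHandler)

# To run it standalone: make_server(8080).serve_forever()
```

Because the body is ordinary application code, it can be changed to mimic whatever error format the real backend produces, which is the control a built-in abort fault may not give you.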

------
adlleong
I'm the author of this blog post and I'm more than happy to answer any
questions people have!

------
jrockway
So... if I were going to inject failures into my service mesh, it would be my
service mesh that I'd be counting on to do the retry after the failure. Does
it even make sense to do it in that case?

~~~
ihcsim
It will be helpful for testing your retries, and getting insights into your
client's behaviour in the event of service failures.
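
As a rough illustration of testing retries against injected failures, the sketch below fakes a backend that fails a fixed number of times and wraps it in a simple retry loop. Everything here (`InjectedFailure`, `make_flaky`, `call_with_retries`) is hypothetical and stands in for a real client and a real faulty service; it is not how Linkerd implements retries.

```python
class InjectedFailure(Exception):
    """Stands in for an HTTP 500 returned by the faulty backend."""


def make_flaky(failures):
    """Return a callable that raises `failures` times, then succeeds."""
    state = {"calls": 0}

    def call():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise InjectedFailure(f"injected failure #{state['calls']}")
        return "ok"

    return call


def call_with_retries(fn, max_attempts=3):
    """Call `fn` up to `max_attempts` times.

    Returns (result, attempts_used) on success and re-raises the last
    InjectedFailure if every attempt fails.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(), attempt
        except InjectedFailure as err:
            last_error = err
    raise last_error
```

With two injected failures and three attempts the call succeeds on the third try; with three injected failures the error reaches the caller, which is the client behaviour you would want to observe.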

------
samstave
You know what would be an interesting service:

Chaos Monkey/failure injection-as-a-service: you define the parameters
by which you want to be assessed...

Kind of like pen test contractors...

So OK, let me spin up an environ and attack the fuck out of it. Show me where
I'm weak. So that in prod... I'm good.

~~~
0vermorrow
You mean like [https://www.gremlin.com/](https://www.gremlin.com/) ?

One of the founders of Gremlin is an engineer who worked at Netflix and
probably worked on Chaos Monkey as well :)

~~~
adlleong
If I understand correctly, one of the limitations of doing application-level
failure injection with Gremlin is that you need to integrate it into your
code: [https://www.gremlin.com/docs/application-layer/installation/](https://www.gremlin.com/docs/application-layer/installation/)

It might be interesting to combine these approaches and use a traffic split to
send a percentage of traffic to Gremlin instead of integrating into the code
directly.
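
For intuition, sending a percentage of traffic elsewhere amounts to weighted random routing. The sketch below simulates that in plain Python; the backend names and the 900/100 weights are made up, and it does not call any Linkerd, SMI, or Gremlin API.

```python
import random

# Hypothetical backend names and weights; neither comes from the post.
BACKENDS = {"real-service": 900, "failure-injector": 100}


def pick_backend(rng, backends=BACKENDS):
    """Choose one backend with probability proportional to its weight."""
    names = list(backends)
    weights = [backends[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def simulate(requests, seed=0):
    """Count how many of `requests` land on each backend."""
    rng = random.Random(seed)
    counts = {name: 0 for name in BACKENDS}
    for _ in range(requests):
        counts[pick_backend(rng)] += 1
    return counts
```

With these weights roughly one request in ten would land on the failure-injecting backend, while the application code stays untouched.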

