
Envoy Proxy Performance on Kubernetes - kelseyevans
https://www.getambassador.io/resources/envoyproxy-performance-on-k8s
======
nickvanw
This is somewhat valuable, but it's missing any investigation into why there
were outlying latency spikes when using non-Envoy software for load balancing.
Furthermore, using average latency like this doesn't tell us much, especially
with outliers making the graphs worthless for steady-state performance
analysis.

My first thought is that the spikes are fairly clearly the result of requests
being sent to pods that no longer exist, or that are starting up and not yet
ready to process requests. This might just reflect how all three of these
proxies were configured, and say absolutely nothing about how well they
actually fare at load balancing.

If someone came to me with this at work, I would say it is the beginning of a
series of troubleshooting steps to answer the question of why there are such
outlying requests when using our load balancer of choice, and not an analysis
of which software to pick.

Edit: Even worse, this appears to be from a company that sells... an API
gateway built on top of Envoy.

~~~
rdli
(one of the authors here)

Thanks for the feedback.

So regarding your hypothesis that the spikes come from requests being sent to
pods that no longer exist or are still starting:

1) It is the responsibility of the ingress controller on K8s to handle that
situation properly; Kubernetes only lists a pod in a Service's endpoints once
its readiness probe passes, and the controller should route only to those
endpoints (see the sketch below).

2) It would be highly unlikely for people to implement their own custom
ingress controller around a given proxy (it's actually somewhat complicated).

3) The pod theory wouldn't explain the latency spikes seen on
reconfiguration.
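
For context, this readiness mechanism is what keeps not-yet-ready pods out of
rotation. A minimal sketch using the kubernetes Python client; the probe path,
port, and image are made up for illustration, not the article's test setup:

    from kubernetes import client

    # A pod is only listed in a Service's endpoints (and thus seen by
    # the ingress controller) once this probe starts passing.
    readiness = client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/healthz", port=8080),
        initial_delay_seconds=2,  # don't probe before the app can answer
        period_seconds=2,
        failure_threshold=2,      # drop out of rotation quickly on failure
    )

    backend = client.V1Container(
        name="backend",
        image="example/backend:latest",  # hypothetical image
        ports=[client.V1ContainerPort(container_port=8080)],
        readiness_probe=readiness,
    )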

But you're right that there probably should be some explanation of why we
think this is happening (I just didn't want to speculate too much; I suspect
the issue is with the hitless reload implementations in the proxies, which
are tricky to do well).

~~~
w8vY7ER
Could it be at all related to the circuit-breaking behavior that nginx
describes[1] in some of their reference architecture? Unclear to me which (if
any) of these properties might be in play for this test.

[1] https://www.nginx.com/blog/microservices-reference-architecture-nginx-circuit-breaker-pattern/

~~~
rdli
I suspect it's because of reloads:

https://kubernetes.github.io/ingress-nginx/how-it-works/#when-a-reload-is-required

The NGINX ingress controller goes to some lengths to avoid reloads because it
recognizes the latency hit they cause. In Ambassador-land, we use Envoy's xDS
APIs to avoid this problem entirely. It's not clear what the HAProxy ingress
controller does.

~~~
rogerdonut
The official HAProxy Ingress Controller uses the Runtime API [1] to avoid
restarts/reloads, and also has hitless reloads configured by default. HAProxy
Technologies contributed the capability to use the Runtime API to the
jcmoraisjr/haproxy-ingress controller as well, but there it requires
activating the dynamic-scaling option [3].
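
To illustrate what the Runtime API looks like in practice, here is a minimal
sketch that talks to HAProxy's admin socket from Python; the socket path and
the backend/server names are assumptions:

    import socket

    # Path to HAProxy's admin socket; an assumption, set via
    # "stats socket /var/run/haproxy.sock level admin" in haproxy.cfg.
    SOCKET_PATH = "/var/run/haproxy.sock"

    def runtime_api(command: str) -> str:
        """Send one Runtime API command and return HAProxy's reply."""
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.connect(SOCKET_PATH)
            sock.sendall(command.encode() + b"\n")
            chunks = []
            # HAProxy answers, then closes the connection.
            while data := sock.recv(4096):
                chunks.append(data)
        return b"".join(chunks).decode()

    # Re-point a server slot at a new pod IP, no reload needed.
    # The backend/server names ("bk_app"/"srv1") are hypothetical.
    print(runtime_api("set server bk_app/srv1 addr 10.244.1.23 port 8080"))
    print(runtime_api("show servers state bk_app"))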

One thing I wanted to point out is that the HAProxy Ingress Controller
actually has over 25 configuration options [2] at the time of publishing, not
the 8 mentioned.

While we have identified a few on our own, we'd love to work with you further
to identify any missing configuration directives that would help produce more
accurate benchmarks using the official HAProxy Ingress Controller.

disclosure: I work at HAProxy Technologies

[1] https://www.haproxy.com/blog/dynamic-configuration-haproxy-runtime-api/

[2] https://github.com/haproxytech/kubernetes-ingress/tree/master/documentation

[3] https://github.com/jcmoraisjr/haproxy-ingress#dynamic-scaling

~~~
rdli
Thanks! Drop me a line (email in my profile) and would love to chat.

I updated the article to clarify that there were 8 configuration options at
the time of testing (we started this effort a while ago) and that there are
now 25.

We'd definitely like to rerun the tests with the official controller to use
the Runtime API.

------
rumanator
The article focuses on latency spikes, which happen only sporadically. Can
anyone more knowledgeable on the subject chime in on whether this comparison
is fair or relevant? Personally, I was expecting to see histograms or
empirical distribution functions of each service's latency.

~~~
jchw
Personally, I prefer looking at something like 95th-percentile latency rather
than the average, which is what I think this article is showing. I suppose a
histogram would give you the fullest picture, though.
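
For illustration, here's the kind of summary I mean: a minimal sketch assuming
the sampled per-request latencies have been dumped to a text file (the
filename and bin edges are made up):

    import numpy as np

    # Hypothetical input: the sampled per-request latencies (ms),
    # one per line, dumped by the load generator.
    latencies = np.loadtxt("latencies_ms.txt")

    mean = latencies.mean()
    p50, p95, p99 = np.percentile(latencies, [50, 95, 99])
    print(f"mean={mean:.1f}ms  p50={p50:.1f}ms  "
          f"p95={p95:.1f}ms  p99={p99:.1f}ms")

    # A coarse histogram makes the spike population visible in a way
    # an average never can; the bin edges here are arbitrary.
    counts, edges = np.histogram(
        latencies, bins=[0, 5, 10, 25, 50, 100, 500, 1000, 5000])
    for lo, hi, n in zip(edges[:-1], edges[1:], counts):
        print(f"{lo:>6.0f}-{hi:<6.0f} ms  {n}")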

~~~
rumanator
FTA:

> We measure latency for 10% of the requests, and plot each of these latencies
> individually on the graphs.

So, for what it's worth, these spikes may very well be single requests that
aren't representative and were only triggered by the way the Kubernetes
cluster was being manipulated for the test.

~~~
rdli
The spikes aren't single requests (at 1000 RPS, the spikes last well over a
second and you see hundreds of requests spike). As for the reason, we suspect
that the different config reload mechanisms in the respective proxies are
what trigger the spikes.

(disclaimer: one of the authors)

------
jwkane
Nginx-ingress (the free version) does a reload when pods are added or
removed, so I would expect to see latency spikes. As I understand it, the
paid version (built on NGINX Plus) can do these reconfigurations on the fly.
If pods are being added and removed frequently, as in the article's "We then
scale up the backend service to four pods and then scale it back down to
three pods every thirty seconds", this is exactly what you'd see. The more
interesting question is whether that is a realistic expectation for your
workload. Unless you are doing aggressive auto-scaling, pods can be very
stable, only being added or removed during upgrades.
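
For reference, the churn pattern the article describes is easy to reproduce;
a minimal sketch driving kubectl from Python (the deployment name is a
guess):

    import subprocess
    import time

    # The deployment name is hypothetical; the article doesn't give it.
    DEPLOYMENT = "deployment/backend"

    def scale(replicas: int) -> None:
        """Scale the backend via kubectl (must be on PATH and configured)."""
        subprocess.run(
            ["kubectl", "scale", DEPLOYMENT, f"--replicas={replicas}"],
            check=True,
        )

    # Flip between four and three pods every thirty seconds, as in
    # the article's test; stop with Ctrl-C.
    while True:
        for replicas in (4, 3):
            scale(replicas)
            time.sleep(30)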

~~~
rdli
This is a good point. I think an interesting test would be to add multiple
services, and see if scaling up/down a single service affects latency to other
services.

------
bradwood
traefik?

~~~
aairey
I am wondering as well how that compares to this.

