
Debugging Network Stalls on Kubernetes - chmaynard
https://github.blog/2019-11-21-debugging-network-stalls-on-kubernetes/
======
KaiserPro
What I am struggling to understand is why it is a good idea to do IP inside
IP.

Giving containers their own NIC and IP isn't hard, and in most cases is far
faster than spending time using some immature tunnelling protocol.

I _can_ see that doing multi-cloud deploys with transparent failover _might_
benefit from a VPN, but short of operating on a _hostile_ network I can't see
that it's worth the heartache to deploy an overlay network.

~~~
navaati
This, a thousand times!

You don't even need to give them actual NICs, just veth pairs with routing
(some CNI plugins do just that, btw).
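
Roughly, per pod, a routed plugin does something like this (a sketch only:
the names, addresses, and namespace below are made up, and the pod-side
default route is omitted):

    # Sketch of a routed (non-overlay) per-pod setup, driving plain iproute2
    # commands from Python. All names and addresses are hypothetical.
    import subprocess

    def sh(*args):
        subprocess.run(args, check=True)

    pod_ip = "10.32.0.5"    # hypothetical pod IP
    netns = "pod-netns"     # hypothetical pod network namespace

    sh("ip", "netns", "add", netns)
    sh("ip", "link", "add", "veth-host", "type", "veth", "peer", "name", "veth-pod")
    sh("ip", "link", "set", "veth-pod", "netns", netns)
    sh("ip", "-n", netns, "addr", "add", pod_ip + "/32", "dev", "veth-pod")
    sh("ip", "-n", netns, "link", "set", "veth-pod", "up")
    sh("ip", "link", "set", "veth-host", "up")
    # A plain /32 host route: packets to the pod IP go straight down the
    # veth, no encapsulation. Cross-node reachability then comes from
    # ordinary routing (BGP via Calico/kube-router, cloud route tables, ...).
    sh("ip", "route", "add", pod_ip + "/32", "dev", "veth-host")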

I think a big part of why overlay nets are so common is that a lot of the
engineers deploying these systems have a shaky or non-existent understanding
of networking and routing; they are stuck in the "local network and a default
route" model, and if that's all you have, an overlay it is...

~~~
silasb
Do you know which CNI plugins support veth pairs?

~~~
navaati
It's about routing more than veth pairs (I mean you can also do simple
bridging as other comments say, but it's more limited), so look for BGP-using
ones, mainly Calico, but kube-router too.

I know there are simpler ones that don't use BGP, but I don't remember their
names. For example, there is one for AWS that uses ENI interfaces and passes
them directly to containers (you're obviously limited by the number of ENIs
you can attach to a particular host, then). There is also one that configures
routes in the VPC routing table, and a similar one for GCP.

------
maxpert
I keep seeing articles showing up on debugging network issues like DNS or
something else. It makes me wonder how much engineering effort we are putting
in to fight the tool. Have these people considered/investigated other tools
like Nomad (HashiCorp)? How much value does Kube actually add versus these
issues?

~~~
sytelus
Kubernetes is the CORBA of this generation. It will float around because of a
few heavyweights preaching it and suckers falling for it. Eventually it will
die like all overly-complex, hideously-designed, and poorly-implemented things
do. You can save your time by just ignoring this nonsense.

~~~
rumanator
> Kubernetes is the CORBA of this generation. It will float around because of
> a few heavyweights preaching it and suckers falling for it.

This sort of assertion is oblivious to the fact that Kubernetes does solve a
few basic problems that no other orchestration service solves, at least as
easily.

I'm referring to problems like cluster autoscaling.

Until there's an alternative to Kubernetes that not only offers these
features but is also supported by service providers like AWS, Azure, or GCP,
Kubernetes is not a fad but the default tool of the trade.

~~~
pojzon
> I'm referring to problems like cluster autoscaling.

When it comes to autoscaling, Kubernetes has two levels of scaling - pods and
nodes. This introduces pros and cons in itself.

Either your resources are under-utilized and you waste money, or you scale
only as fast as your cloud provider's regular scaling.

I'll just point out that for example AWS ASG TargetTracking scaling policy
blasts default Kubernetes HPA/CA scaling out of the water. It's more
conservative while HPA is highly susceptible to pod thrashing.
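
For reference, the stock HPA algorithm (per the Kubernetes docs) is roughly
desired = ceil(currentReplicas * currentMetric / targetMetric). A toy sketch
with made-up numbers shows how a noisy metric bounces the replica count
around (real HPA adds a tolerance and a stabilization window that damp this
somewhat):

    # Illustrative only: the core HPA replica formula fed with fabricated,
    # noisy CPU readings against a 60% utilization target.
    import math

    def hpa_desired_replicas(current_replicas, current_metric, target_metric):
        return math.ceil(current_replicas * current_metric / target_metric)

    replicas = 4
    for cpu_pct in [55, 95, 40, 90, 45]:   # spiky CPU %, target is 60%
        replicas = hpa_desired_replicas(replicas, cpu_pct, 60)
        print(replicas)   # prints 4, 7, 5, 8, 6 - up and down each tick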

Two-level scaling introduces a lot of complexity. It gives more flexibility
and power to the user, but only if that user has enough experience to avoid
the multiple pitfalls it also creates.

I'm confident enough to say that ANY corporation that is not familiar with
Kubernetes and decides to introduce it into its technology stack will most
likely shoot itself in the foot, also increasing the cost of infrastructure
for their platform.

I've seen it too many times. The corporation then ends up hiring some
consultant, who looks at the abomination their devops created, shakes their
head, and spends a few months fixing their incompetence.

Doing Kubernetes the right way is hard.

~~~
rumanator
> I'll just point out that for example AWS ASG TargetTracking scaling policy
> blasts default Kubernetes HPA/CA scaling out of the water.

That doesn't really count as it's a proprietary service controlled by a single
service provider.

> I'm confident enough to say that ANY corporation that is not familiar with
> Kubernetes (...) also increasing the cost of infrastructure for their
> platform.

That assertion doesn't pass muster because a) you're assuming generalized and
widespread inexperience and/or incompetence and b) you're assuming that not
being able to learn how to use a service is a permanent state of affairs.

Meanwhile, back in the real world, cluster autoscaling works well and does in
fact let users shut down nodes they are not using, which would otherwise have
to stay up and cost real money. Kubernetes is the reason this feature is
available to the general public. Until a better alternative appears,
Kubernetes is by far the best and only option available to the whole industry.

~~~
pojzon
> That assertion doesn't pass muster because (..)

It does, because in that same time the corporation could have used the tech
stack they are familiar with, focusing on the product, which would directly
improve their revenue. In most cases I've consulted on, the move to
Kubernetes was pushed by people who did not understand that for their use
case it made absolutely no difference and only brought a lot of complexity to
their table, which then turned into a year or two of consulting costs and
even more operational work than they previously had, just to maintain the new
tech stack.

A lose-lose situation.

> Kubernetes is by far the best and only option available to the whole
> industry.

No it isn't. And the sooner people understand that, the better.

------
sciurus
Fantastic write-up. I'm curious how many people were involved in the
investigation and how long it took.

~~~
theojulienne
This is a great question, thank you for asking!

Initially, a few teams around the org had folks investigating the poor
performance from the perspectives of the different applications that were
observing issues.
Once it was clear that it wasn’t the applications themselves or their
configuration at fault, the team that runs our Kubernetes infrastructure
started collating information together (in github issues) and getting to the
point of having a clear repro (the Vegeta test) and what to look out for. This
was the slowest part of the process because we needed to understand that
something non-application-level was going on (and because “random network
latency” is a very difficult thing to narrow down) - it probably took on the
order of months from the first sign of an issue to fixing all the other issues
that were contributing to small amounts of latency and being sure we still had
an underlying problem to find.

At that point it became clear that something more low-level was going on, so
we put together a focus team drawn from a selection of teams to investigate
the underlying cause - that was a group of about 5 engineers actively working
on it, with another 5-10 interested engineers following along and helping
out. Folks were typically working in pairs or solo to dive into different
potential leads, looping everyone else in on Slack as they went. Most of the
work here was finding signal in the noise; we found a lot of other smaller
system-level issues along the way that either got ruled out or were low
priority to fix. There were other DNS-related issues at play, and fixing
those also improved things, but not the specific underlying issue in the post
here. Going down the specific path in the post took just a few days once the
first few steps showed something was wrong at the packet level. The
remediation from there was also just a few days, because we already had
infrastructure in place to detect a known issue and mitigate it in a
safe/graceful way. The focus team was working on this as a primary task for a
few weeks overall.

------
acd
There have been issues with Kubernetes and NAT reported by Xing engineering.

12 min into the video
[https://www.youtube.com/watch?v=MoIdU0J0f0E](https://www.youtube.com/watch?v=MoIdU0J0f0E)

~~~
theojulienne
The insert_failed issue described in the video was one of the ones we
discovered during our investigation as well, but it was already well
understood thanks to this excellent Xing blog post, which was extremely useful and
referenced internally a lot: [https://tech.xing.com/a-reason-for-unexplained-
connection-ti...](https://tech.xing.com/a-reason-for-unexplained-connection-
timeouts-on-kubernetes-docker-abd041cf7e02)

------
destitude
Would these zombie cgroups be detectable in any of the Kubernetes statsd
metrics emitted?

~~~
theojulienne
We didn't find any metric that surfaced zombie cgroups, presumably because
the kernel mostly tries to hide them from user space since they have been
deleted but not yet cleaned up. The only way we found at the time to track
them was via a BCC script and by observing the latency of reading the
/sys/fs/cgroup/memory/memory.stat file.
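
Not the BCC script itself, but a minimal user-space sketch of that second
signal: time how long reading memory.stat takes for every memory cgroup
(this assumes a cgroup v1 host with the memory controller at the usual mount
point). Healthy reads take microseconds; reads taking milliseconds suggest
the kernel is walking a long list of zombie cgroups.

    # Time reads of memory.stat across all memory cgroups and flag slow ones.
    import os
    import time

    ROOT = "/sys/fs/cgroup/memory"
    for dirpath, dirnames, filenames in os.walk(ROOT):
        stat_path = os.path.join(dirpath, "memory.stat")
        if not os.path.exists(stat_path):
            continue
        start = time.monotonic()
        with open(stat_path) as f:
            f.read()
        elapsed_ms = (time.monotonic() - start) * 1000
        if elapsed_ms > 1.0:   # arbitrary "suspiciously slow" threshold
            print(f"{elapsed_ms:8.2f} ms  {stat_path}")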

------
je42
Awesome deep dive! I wonder how one could debug this without deep insight
into the networking stack. I hope I'll never have a problem like this myself ;)

------
hogetesco
I read the article with great enjoyment. The question is, how do I run the
bcc script in the article? In a sidecar container, or on the worker node
machine itself?

~~~
theojulienne
The bcc script was run on the Kubernetes node itself directly over SSH, but it
should be possible to run it in a privileged container as well.

------
suresk
Super interesting, thanks for sharing!

These things are always my favorite to deal with - the feeling at the end when
you figure it out is amazing, and you usually learn a ton on your way there,
too.

------
joemag
Is the tunneling protocol really IPIP for these overlays? Oh boy - that's
going to really suck on wide multipath networks like the ones used by cloud
providers.

------
bigbluedots
I learned a lot reading this writeup, thanks for posting!

------
seminatl
This does not really have anything to do with either Kubernetes or networks.
If your computer is busy, it won't be able to process packets. Accessing
certain kernel stats via proc, sys, or other special files can be really
expensive. For example, reading /proc/pid/smaps of a running mysqld takes 2
seconds on a computer I happen to have on hand. Sometimes, when you have many
cores, it is expensive to produce some of the fields of /proc/pid/stat
because the kernel has to visit numerous per-CPU data structures.
/proc/pid/statm is better for this reason, if it contains what you are
looking for.

TL;DR reading kernel stats can take a long time and cost a lot of CPU cycles.
It costs more for more containers, and more on bigger machines.
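
A quick way to see the cost difference on your own machine (pass the PID of
something large like mysqld; defaults to the script's own PID):

    # Compare a full read of /proc/<pid>/smaps (per-VMA detail, expensive for
    # processes with many mappings) to /proc/<pid>/statm (one cheap summary line).
    import sys
    import time

    pid = sys.argv[1] if len(sys.argv) > 1 else "self"

    def timed_read(path):
        start = time.monotonic()
        with open(path) as f:
            data = f.read()
        return time.monotonic() - start, len(data)

    for name in ("smaps", "statm"):
        secs, size = timed_read(f"/proc/{pid}/{name}")
        print(f"{name:6s} {size:9d} bytes in {secs * 1000:8.3f} ms")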

~~~
rumanator
> This does not really have anything to do with either kubernetes or networks.

The article is literally about an issue that was experienced while operating
Kubernetes clusters.

FTA:

> Essentially, applications running on our Kubernetes clusters would observe
> seemingly random latency of up to and over 100ms on connections, which would
> cause downstream timeouts or retries.

Sounds like a problem affecting Kubernetes to me, and an important one.

More importantly, it sounds like a non-trivial problem that others operating
Kubernetes clusters would be interested in learning how to identify and how to
search for the root cause.

~~~
seminatl
It has literally nothing to do with K8s. An equally suitable title would have
been "Debugging network stalls on the Intel Xeon processor" or "Debugging
network stalls on planet Earth".

------
Polyisoprene
An inefficient cadvisor implementation on Linux causes unexpected latency.
Also, this affects buzzword-compatible technologies.

~~~
nielsole
What makes you say it was a cadvisor issue? Isn't it clearly a kernel issue?

~~~
alecco
It looks more like _Docker_ is pushing the Linux kernel in unprecedented
ways. You can't blame Linux for having a cache there for (so far) normal
workloads. And cadvisor was pushing the problem further by constantly reading
stats.

