
A visual guide on troubleshooting Kubernetes deployments - Cantaro86
https://learnk8s.io/troubleshooting-deployments
======
organsnyder
Note that this flowchart includes monitoring, logging, and networking, in
addition to typical app stuff. _Of course_ it's complicated.

The article does contain a lot of useful information. I don't think it was
intended to ride the "k8s is too complicated compared to vanilla X" bandwagon,
despite the huge flowchart at the top.

~~~
Kaiyou
Flowchart looks pretty straightforward to me. Nothing I'd categorize as
complicated.

~~~
emj
It's a good chart if you want to teach people the commands needed for
troubleshooting.

------
orthoxerox
I like how it has quite a few boxes that say "no idea what the reason might
be". Each one of them hides another chart as big as this one, if not bigger.

------
tinco
Fun fact, on GCP Kubernetes you can have green lightbulbs on every single
dashboard, and your entire site can be down anyway.

Our CI/CD was leaking "review" deployments; I forgot about them until one day
I upgraded a node and the entire site went down, even though everything was
green. It turned out there is some sort of maximum number of nginx entries in
the ingress, and we were hitting it. That was some frantic debugging; the
solution was just to delete the spurious review deployments.
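
For anyone hitting the same leak: assuming the CI tags its review deployments
with a label like `env=review` in a `review` namespace (label and namespace
names here are hypothetical, not what our pipeline actually used), the cleanup
can be sketched as:

```bash
# List review deployments oldest-first to spot the stale ones
# (label and namespace names are placeholders)
kubectl get deployments -n review -l env=review \
  --sort-by=.metadata.creationTimestamp

# Delete everything carrying the review label in that namespace
kubectl delete deployments -n review -l env=review
```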

~~~
nhumrich
This is true of anything, not just k8s. There is always the possibility that
all system monitors are green while the apps are dead or misbehaving for some
reason. This is why you need more than system monitors. You need app
monitoring as well.

~~~
StreamBright
That is called broken monitoring.

------
nathanwh
I enjoyed petulantly answering every question with “no” and then finding the
final state to be “Consult StackOverflow”.

~~~
dsnuh
Try [https://magick8sball.github.io/](https://magick8sball.github.io/) for
those scenarios. :)

------
nodesocket
Is there a way to prevent Kubernetes from killing and restarting a pod (from a
deployment) while you are debugging it with kubectl exec -it? I.e., inform
Kubernetes that I am using this pod and it shouldn't restart it automatically.

~~~
markbnj
Depends on why it is getting restarted. If it's exceeding mem limits and being
oomkilled that's the kernel, not k8s. If PID 1 inside the namespace is
terminating then k8s will restart the pod. No way to prevent that I am aware
of, but presumably you can't do much debugging once that happens anyway. If
the process is failing liveness probes and getting terminated for being
unhealthy probably the simplest approach is just to patch away those probes
until you have the workload stable.
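
A sketch of that patch, assuming a single-container deployment named `myapp`
(the name and container index are placeholders):

```bash
# Strip the liveness probe from the pod template; k8s rolls out new pods
# that the kubelet will no longer restart for failed health checks
kubectl patch deployment myapp --type=json \
  -p='[{"op": "remove", "path": "/spec/template/spec/containers/0/livenessProbe"}]'

# Re-apply the original manifest to restore the probe when done
kubectl apply -f myapp-deployment.yaml
```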

~~~
nodesocket
Yes failing liveness probe.

~~~
anirudhrx
Liveness probes are used by the kubelet to restart the underlying container
and are independent of the deployment object. This has come up before in
[https://github.com/kubernetes/kubernetes/issues/57187](https://github.com/kubernetes/kubernetes/issues/57187)
but sadly, isn't possible yet. Your best bet is to create a new pod and hope
for a repro, or you could mount a configmap into the pod containing a debug
flag that your liveness probe also checks - i.e. "debug == true || curl
localhost:6789". Not a clean solution, but it may work in the interim.
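
A sketch of that probe, assuming the configmap is mounted at /etc/debug and
the app serves its health check on port 6789 (the paths, port, and flag-file
name are made up for illustration):

```yaml
# Probe passes unconditionally while the mounted debug flag file exists;
# otherwise it falls back to the real health check
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - test -f /etc/debug/enabled || curl -fsS http://localhost:6789/healthz
  periodSeconds: 10
```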

------
billfruit
Haven't had to work with k8s yet, but that flow chart looks really detailed.
It must have taken a lot of effort to make, which makes me wonder: how long
will it remain current? Will it, six months down the line, become invalid in
some subtle manner that only adds to the confusion?

~~~
etxm
All the CLI commands used have been around for quite a few “minor” versions.
I’d say it’s probably valid for the foreseeable future for debugging k8s
_primitive_ resources.

With the customization you can do to k8s, the debugging can get a bit weirder
when dealing with operators and webhooks.
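
E.g., a common first step when a webhook is suspected (these are the standard
resource names, no cluster-specific assumptions):

```bash
# Admission webhooks can mutate or reject objects before they're stored;
# listing them shows what's intercepting API requests in the cluster
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations
```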

------
de_watcher
What about kubectl randomly hanging and some PLEG errors in the kubelet log?

~~~
erulabs
Kubectl randomly hanging sounds like a problem with the Metrics API. You can
use -v=9 on kubectl and see what it’s pausing on. Note that metrics aren’t
refreshed on every run - they update on a timer - hence the “random” feel
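
For reference, the verbosity flag works on any kubectl command, e.g.:

```bash
# -v=9 logs every HTTP request/response kubectl makes to the API server,
# which shows exactly which call is stalling
kubectl get pods -v=9
```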

------
gravypod
Looking at this I think it would be really cool for someone to build a Wheel
of Misfortune to help dry run some of these scenarios for debugging.

------
fheyfhth14353
Tickles my funny bone every time I see the industry jump on a new bandwagon
that starts simple and then inevitably adapts to the real world that the old
technology was addressing. When are we gonna learn folks?

Props to the author for the chart. He's actually providing real value to all
the suckers, err, SREs stuck dealing with this stuff.

------
jtdev
Yikes

------
StreamBright
I guess I can just print this out and show it to people any time somebody
wants to move to k8s.

~~~
Kamshak
There are for sure a few concepts you need to learn with k8s; however, I
wonder whether a guide like this would be much shorter for manual,
bash-script-powered, or ansible deployments.

~~~
linsomniac
I dunno, I did an install using the k8s install instructions, and got to a
point where I could access the containers locally but not from other nodes in
the cluster. I could see a bunch of routing and iptables rules, but I didn't
have any model for what it SHOULD look like, so I was at a loss for untangling
all that spaghetti.
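
For what it's worth, a sketch of comparing the rules against what the cluster
intends, assuming kube-proxy is running in iptables mode (the chain names are
the standard kube-proxy ones):

```bash
# On a node: dump the NAT chains kube-proxy programs. Each Service gets a
# KUBE-SVC-* chain that fans out to KUBE-SEP-* endpoint chains.
sudo iptables -t nat -L KUBE-SERVICES -n

# Cross-check against what the control plane thinks the endpoints are
kubectl get endpoints -A
```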

~~~
etxm
Go through Kubernetes the Hard Way to get a low level grasp and once you
understand what’s going on use a managed service.

