Everything self-hosted has its maintenance tax, but why is Kubernetes (especially self-hosted) so hard? What aspect is it that makes Kubernetes so hard operationally?
- Is it the networking model, which is simple from the consumption standpoint but has too many moving parts in its implementation?
- Is it the storage model, CSI and friends?
- Is it the bunch of controller loops doing their own thing, with nothing that gives a "holistic" picture to identify the root cause?
For me personally, the first and foremost thing on my mind is the networking details. They are "automatically generated" by each CNI solution in slightly different ways and with different constructs (iptables, virtual bridges, routing daemons, eBPF, etc.), and because they are generated, it is not uncommon to find hundreds of iptables rules and chains, or similar configuration, on a single node.
Being automated, these solutions generate tons of components and configuration. When something breaks, even someone who has mastered them needs time to hop through all the pieces (virtual interfaces, virtual bridges, iptables chains and rules, IPVS entries, etc.) to identify what is causing the trouble. Essentially, one pretty much has to be a network engineer: besides the underlying physical network (or the virtual one, I mean cloud VPCs), k8s brings its very own network (pod network, cluster network), implemented at the software/configuration layer, which has to be fully understood to be maintained.
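To get a rough feel for how much gets generated on a node, something like the sketch below could count rules per chain from `iptables-save`-style text. It is only a minimal illustration, not a diagnostic tool: the function name `summarize_iptables_save` and the sample dump are made up for this post (the `KUBE-SVC-EXAMPLE` chain in particular is hypothetical), and it assumes the usual `*table` / `:CHAIN` / `-A CHAIN ...` line layout.

```python
from collections import Counter

# Hypothetical sample in iptables-save layout; not taken from a real cluster.
SAMPLE = """\
*nat
:PREROUTING ACCEPT [0:0]
:KUBE-SERVICES - [0:0]
:KUBE-SVC-EXAMPLE - [0:0]
-A PREROUTING -j KUBE-SERVICES
-A KUBE-SERVICES -d 10.96.0.1/32 -p tcp -m tcp --dport 443 -j KUBE-SVC-EXAMPLE
-A KUBE-SERVICES -d 10.96.0.10/32 -p udp -m udp --dport 53 -j KUBE-SVC-EXAMPLE
COMMIT
"""


def summarize_iptables_save(dump: str) -> Counter:
    """Count appended rules per (table, chain) in iptables-save style text."""
    counts = Counter()
    table = None
    for raw in dump.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("*"):
            table = line[1:]  # "*nat" starts the nat table
        elif line.startswith(":"):
            chain = line[1:].split()[0]
            counts[(table, chain)] += 0  # register the chain even if it has no rules
        elif line.startswith("-A "):
            chain = line.split()[1]
            counts[(table, chain)] += 1  # one appended rule in this chain
    return counts


# Print chains from busiest to emptiest for the sample dump.
for (table, chain), n in summarize_iptables_save(SAMPLE).most_common():
    print(f"{table:8} {chain:24} {n}")
```

On a real node the same function could be fed the text produced by `iptables-save`. It only tells you where the bulk of the rules live; it does not say which rule is wrong.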
God forbid the CNI solution hits some edge case or, due to some other misconfiguration, keeps generating inadequate or misconfigured rules, routes, etc. A broken "software-defined network" that I cannot diagnose in time on a production system is my nightmare, and I don't know how to reduce that risk.
What's your Kubernetes nightmare?