Ask HN: What is your Kubernetes nightmare?
220 points by wg0 on June 27, 2022 | 259 comments
Everything self-hosted has its maintenance tax, but why is Kubernetes (especially self-hosted) so hard? What aspect is it that makes Kubernetes operationally so hard?

- Is it the networking model that is simple from the consumption standpoint but has too many moving parts for it to be implemented?

- Is it the storage model, CSI and friends?

- Is it the bunch of controller loops doing their own things with nothing that gives a "wholesome" picture to identify the root cause?

For me personally, the first and foremost thing on my mind is the networking details. They are "automatically generated" by each CNI solution in slightly different ways and with different constructs (iptables, virtual bridges, routing daemons, eBPF, etc.), and because they are generated, it is not uncommon to find hundreds of iptables rules and chains on a single node and/or similar configuration.

Being automated, these solutions generate tons of components/configurations, so in case of trouble, even someone who has mastered them would take some time to hop through all the components (virtual interfaces, virtual bridges, iptables chains and rules, IPVS entries, etc.) to identify what's causing the trouble. Essentially, one pretty much has to be a network engineer, because besides the underlying physical (or virtual, I mean cloud VPC) network, k8s brings its very own network (pod network, cluster network) implemented on the software/configuration layer, which has to be fully understood to be maintained.

God forbid the CNI solution hits some edge case, or some other misconfiguration makes it keep generating inadequate or misconfigured rules, routes, etc. A broken "software defined network" that I cannot diagnose in time on a production system is my nightmare, and I don't know how to reduce that risk.

What's your Kubernetes nightmare?

EDIT: formatting

It's odd, but I actually really enjoy using Kubernetes in production.

We have a few rules:

1. Read a good intro book cover-to-cover before trying to understand it.

2. Pay a cloud vendor to supply a working, managed Kubernetes cluster.

3. Prefer fewer larger clusters with namespaces (and node pools if needed) to lots of tiny clusters.

3. Don't get clever with Kubernetes networking. In fact, touch it as little as possible and hope really hard it continues to work.

This is enough to handle 10-50 servers with occasional spikes above 300. It's not perfect, but then again, once you have that many machines, pretty much every solution requires some occasional care and feeding.

My personal Kubernetes nightmare is having to build a cluster from scratch on bare metal.

> 3. Don't get clever with Kubernetes networking. In fact, touch it as little as possible and hope really hard it continues to work.

This one.

Kubernetes on bare metal is actually pretty easy. Kubernetes on a hosted solution which doesn't have a managed version is prone to error. Usually on bare metal you can make some guarantees regarding bandwidth and storage speed. Trying to roll out a cluster on a service that can't give you these guarantees is truly a nightmare.

I would also say that if you are going to be administering clusters at your company that you should at least set up a cluster from scratch (doesn't have to be bare metal) and learn how the kubernetes control plane works by breaking it in various ways etc.

In my experience most people don't like black magic, they want something that they understand on some level. A fully managed k8s cluster is black magic, once you have set up a vanilla cluster you get a much better feeling about how the control plane works together to get things done.

I have tried several times over the past few years to install Kubernetes on bare metal, and it has never worked.

I don't mean installing it on VMs on a laptop, I mean on a real linux cluster of 8 to 32 nodes, with real networks and real switches.

Managing bare metal machines is a cakewalk compared to getting Kubernetes running in-house, at least in my experience.

Obviously the cloud providers do it, so it's possible. But IMO it is something you do only if you have a full-time admin team available to set it up and manage it. It's not by any stretch of the imagination something you install and forget about.

What were you using to install kubernetes?

Did you try using kubeadm to bootstrap installing kubernetes? It is pretty simple.

> Kubernetes on bare metal is actually pretty easy.

I would not call it easy at all. Last time I tried that a year ago you still needed a special load balancer to get it going (https://metallb.universe.tf). Has this changed?

MetalLB is pretty simple to configure.

That's just not true, especially if you compare it to the LoadBalancer you get on a cloud platform which usually involves zero clicks. I'm not saying it's impossible but it's definitely not "easy".

Configuration instructions: https://metallb.universe.tf/configuration/

Hint: You better know what all of these are in your environment:

    For a basic configuration featuring one BGP router and one IP address range, you need 4 pieces of information:

    The router IP address that MetalLB should connect to,
    The router’s AS number,
    The AS number MetalLB should use,
    An IP address range expressed as a CIDR prefix.

Did you miss the part about layer 2 configuration, where you don't need BGP at all? https://metallb.universe.tf/configuration/#layer-2-configura...

But then "When announcing in layer2 mode, one node in your cluster will attract traffic for the service IP."

This bottlenecking seems undesirable. At the very least, if you have one "main" traffic heavy service whichever node ends up servicing that IP address could have elevated cpu usage from processing all the network traffic via kube-proxy.

The obvious solution would be to allocate, say, two or more IP addresses for the service with DNS round robin set up. Then as long as all of them are being handled by different nodes, you are not bottlenecking nearly as badly. Perhaps I am missing it, but I'm not seeing a feature where you can force those two or more IP addresses to be claimed by different nodes. (If the feature is strict, then you would want more data plane nodes than IPs, so that having one node down does not leave part of the round-robin DNS unclaimed by any node.)
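To make the reasoning above concrete, here is a minimal sketch in Python. It is not a MetalLB feature or API; the function names, the "strict one IP per node" rule, and the node and IP values are all hypothetical, chosen only to illustrate why you would want more data plane nodes than service IPs.

```python
from collections import Counter


def traffic_share(ip_to_node):
    """Fraction of round-robin traffic landing on each node, assuming DNS
    hands out every service IP equally often (an assumption)."""
    counts = Counter(ip_to_node.values())
    total = len(ip_to_node)
    return {node: n / total for node, n in counts.items()}


def reassign_strict(ip_to_node, nodes_up):
    """Re-home service IPs after a failure under a hypothetical strict rule:
    a node may hold at most one service IP. IPs that find no free node stay
    unclaimed (mapped to None)."""
    result = {}
    used = set()
    # IPs whose node is still up stay where they are.
    for ip, node in ip_to_node.items():
        if node in nodes_up:
            result[ip] = node
            used.add(node)
    # Orphaned IPs take the next free node, if there is one.
    free = [n for n in nodes_up if n not in used]
    for ip in ip_to_node:
        if ip not in result:
            result[ip] = free.pop(0) if free else None
    return result


if __name__ == "__main__":
    ips = {"10.0.0.10": "node-a", "10.0.0.11": "node-b"}
    print(traffic_share(ips))
    # Two nodes, two IPs: losing node-a leaves one IP unclaimed.
    print(reassign_strict(ips, ["node-b"]))
    # A spare node lets the orphaned IP move.
    print(reassign_strict(ips, ["node-b", "node-c"]))
```

With exactly as many nodes as IPs, a single node failure leaves one round-robin entry pointing at nothing, which is the case the parenthetical above warns about.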

True. If you want true load balancing, you need a layer 3 solution (BGP.)

MetalLB has been in beta for YEARS. It's OK for dev/qa/staging, but I wouldn't put it in prod.

I wouldn't use it in prod when there are other alternatives from cloud providers. But to say it is difficult to configure for a bare metal dev cluster is not true. The instructions are pretty clear.

I don't disagree, I think it is easy to install on a bare metal cluster, although I think using HAProxy is just as easy and probably a better solution. I was just pointing out that it has been in beta for a very long time.

HAProxy isn't complicated to set up.

Good rules.

>2. Pay a cloud vendor to supply a working, managed Kubernetes cluster.

If one is at that level already, I don't think there's anything better than AWS ECS out there. It just works. Just works. Yes sure, it does not offer stateful workloads, for example, among other things, but it works for 90% of the cases.

> 3. Don't get clever with Kubernetes networking. In fact, touch it as little as possible and hope really hard it continues to work.

Pretty much... Each CNI generates the SDN its own way, slightly differently than the others. It is like how you can write a program to print a chessboard on a terminal in ten different ways.

Unfortunately, these implementation details aren't written or documented anywhere, and of course they would keep changing from release to release anyway. Your only way out if you have production workloads that you can't afford to have go down without losing revenue? Just pay for support for the respective CNI, as only they would know what the voodoo magic under the hood is.

Sure you can see the source code and all of them are open source but that's not your main business or the main day job and of course, the solutions aren't 100 line trivial implementations either.

> If one is at that level already, I don't think there's anything better than AWS ECS out there.

100%. To answer the OP's question: my nightmare is having to use it at all. I work with small, very early-stage companies whose applications by and large are not complicated. Perhaps at some level of scale and/or complexity, k8s makes sense. For the vast majority of the cases I see, something like ECS does everything they need, while being significantly simpler to understand, develop for, and debug.

Do you still recommend they host their applications in containers (e.g. Docker)? I feel like it's fairly low effort to start out that way, but can be a pain to add later.

Being that they're all using ECS, yes containerizing using Docker is a prerequisite.

Doh. I am only familiar with Azure, and was confusing ECS with ordinary VMs. Sorry about the stupid question!

No worries, the cloud acronyms overlap so much these days, if it's helpful:

EC2: Elastic Compute Cloud, bare VMs.

ECS: Elastic Container Service, Docker containers.

EKS: Elastic Kubernetes Service.

> I don't think there's anything better than AWS ECS out there

Do you have any experience with Kubernetes on GCP being less good than AWS ECS? I'd expect them to be the gold-standard when it's a project coming from Google originally and we haven't had any Kubernetes problems that were related to GCP.

I have experience with Google's managed Kubernetes service (GKE).

It's basically great. Solid, few surprises, no compatibility issues with third party software packaged for Kubernetes. Autopilot looks even better – billing you only for the resources you allocate rather than for the full nodes, basically removing the bin-packing problem. Very little about our config was Google specific or would cause issues porting to another provider. It was up-to-date enough for us to use relatively new features, while lagging enough that everything felt pretty stable.

The only issue we had was wanting to use a somewhat obscure configuration for the Google Cloud load balancer instance that was underlying the Kubernetes ingress. This was possible, we just had to configure it manually in Terraform and point it at the cluster rather than being able to treat it as a cluster resource if I remember correctly. This was only a temporary solution while they were in the process of adding more custom control via K8s.

As far as I can tell it is considered to be the gold-standard.

Disclaimer: I now work for Google on non-cloud stuff, but this was my experience doing a port from bare-metal to GKE.

> 3. Prefer fewer larger clusters with namespaces (and node pools if needed) to lots of tiny clusters.

This is interesting - last time I worked with Microsoft Engineers from Azure - they said exactly the opposite.

One workload = One cluster.

„There are too many shared resources in Kubernetes that can leak collateral damage from one workload to another”.

Azure might also require special precautions. Honestly, I've seen Azure have a lot of networking issues, for example. But this is based on scuttlebutt and limited personal experience.

I've found that on GCP, certain workloads benefit from a dedicated node pool. This gets them their own CPU and RAM and volume I/O. Yes, I could imagine that there are shared Kubernetes control plane resources that might be affected, but I haven't seen that with any of our workloads. It might get more complicated if you have lots of in-cluster networking.

But none of this is my area of expertise. I just think that Kubernetes can mostly be pretty pleasant in practice for companies that have outgrown PaaS offerings like Heroku and Render.com.

Not really. You can do fairly large clusters; you just need differently sized node pools. For example, we run Apache NiFi in AKS, which is a complete memory and CPU hog. We have a 16 CPU / 64 GB RAM node pool for that workload, which we target with a node selector. Microservices use a different node pool. System services run on the default node pool.

If you're running Azure Functions with KEDA, set up a node pool for that with a lower CPU/memory footprint.
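As a rough illustration of the node-selector approach described above, the sketch below builds a minimal Pod manifest as a Python dict and mimics how `spec.nodeSelector` narrows the candidate nodes. The `pool` label key, the node names, and the helper functions are hypothetical; managed providers attach their own label keys to node pools, so check what your cluster actually exposes.

```python
def pod_manifest(name, image, node_selector):
    """Build a minimal Pod manifest (as a dict) pinned via spec.nodeSelector."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "nodeSelector": dict(node_selector),
            "containers": [{"name": name, "image": image}],
        },
    }


def eligible_nodes(nodes, node_selector):
    """nodeSelector semantics: a node qualifies only if it carries every
    key/value pair in the selector."""
    return [
        node_name
        for node_name, labels in nodes.items()
        if all(labels.get(k) == v for k, v in node_selector.items())
    ]


# Hypothetical node labels, one node per pool for brevity.
NODES = {
    "big-0": {"pool": "nifi"},
    "small-0": {"pool": "microservices"},
    "sys-0": {"pool": "default"},
}

if __name__ == "__main__":
    pod = pod_manifest("nifi", "apache/nifi", {"pool": "nifi"})
    print(eligible_nodes(NODES, pod["spec"]["nodeSelector"]))
```

An empty selector matches every node, which is why workloads without one can land on the pool sized for the memory hog.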

It's really quite frustrating to find that out from experience. Things like operators, anything Cluster*, named resources, etc.

If you're doing relatively simple things with the cluster, then you can do namespaces. The more custom shit you do, the better off you are with true isolation.

What is their definition of a workload? Do they want a cluster per microservice? Per application? Per customer?

We had different sets of microservices, each handling a specific part of the system.

One part was responsible for data transformation and the other was responsible for user modifications.

It's hard to tell where the cut-off point is; it depends on the system architecture.

Wouldn't a namespace make more sense than a whole cluster?

Why is bare metal a nightmare? I have a project coming up which must be on bare metal so was thinking of doing this. Also, if it's so bad, what's better to use on bare metal? Thanks

My experience with bare metal is multi-fold:

* Documentation sees it as a second-class citizen, if that (loadbalancers, volumes are heavily biased towards cloud providers)

* Many cloud-provided instances of kubernetes will always use the exact same VMs backing the nodes. So they really don't have to care all that much about what config your bare metal cluster has or needs.

RancherOS/K3S can be really quite nice for getting bare-metal clusters up & going really fast. They don't always feel the most complete though, mostly lacking around failure documentation. Even RancherOS has a bias towards cloud clusters, but it's quite easy at least to get a simple k3s cluster going. I'd personally recommend going that way. RancherOS if you're managing multiple clusters, plain k3s if you're doing just one. It'll even come with a pretty decent LoadBalancer & volumes. If you need better management of volumes, Longhorn or minio isn't bad.

microk8s/KinD are for dev-env only, and I wouldn't recommend it for any bare metal cluster. 'Fun' to screw around with though.

Edit: I had a lot of really obnoxious DNS problems, mostly due to docker daemon & how the system config would interact with k8s/k3s. Super annoying when you can get everything working in docker containers manually, but not working in k8s. Once you get your bare metal system configured to work, it'll be fine. It's also very confusing how many different network options there are, and their claims are dubious at best.

To expand on the network subsystems: canal/calico/flannel/ipvs based vs iptables based, etc. We did a bunch of low-latency (sub ms) perf testing for ipvs vs iptables. Docs say ipvs should be both faster (throughput) and lower latency. Tested evidence did not show that to be the case for both small #s of pods & large numbers of pods. This was for a small cluster, so that could be impacting the results.

Never mind that it's a rather huge PITA to switch between them all. Rancher/K3S makes it a bit easier, but still annoying.

> loadbalancers, volumes are heavily biased towards cloud providers

Can you even run a "loadbalancer" if all you have is a single machine with a single IP behind a router you don't control? I got stuck on that the last time I tried running my own kubes.

Not necessarily a router you don't control, but MetalLB does provide some nice LoadBalancer constructs for a bare-metal deployment. Putting VyOS in front of it is magical!


K3s uses their own loadbalancer, so yes. It will add extra hops to any of your services if you care about sub-ms latency.

I looked at MetalLB, and it didn't really fit with what we wanted to do, so YMMV. It's pretty limited unless you control a lot about your IP space.

Why would you need a load-balancer if you only have a single machine?

Because kubernetes says so? Can you run it without a load balancer?

You can use NodePort instead of LoadBalancer. Or use MetalLB if you insist on having an LB so that it's closer to real production environments.

“Real production” smh. This is why docs make baremetal second class citizens, people assume something has to be a certain way for it to be “real.”

There's always one more thing that you need to install to have a working cluster that comes out of the box in cloud.

You want networking? OK, go read about Calico, Flannel, Cilium, etc and choose one. If you didn't fully read the instructions for the networking plugin you plan to use, plan to blow away your cluster and set it back up from scratch with the correct RFC1918 address range for your network plugin that doesn't conflict with your presumably existing network. Plan to dive in and re-jigger things when you need IPv6.
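The address-range conflict mentioned above is easy to check before you build the cluster. Here is a small sketch using Python's standard `ipaddress` module; the CIDR values are made-up examples, and the function names are mine, not part of any CNI tooling.

```python
import ipaddress

# The three private IPv4 blocks defined by RFC 1918.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(cidr):
    """True if the IPv4 CIDR sits entirely inside one RFC 1918 block."""
    net = ipaddress.ip_network(cidr)
    return any(net.subnet_of(block) for block in RFC1918)


def conflicts(pod_cidr, existing_cidrs):
    """Return the existing networks that overlap the proposed pod CIDR."""
    pod = ipaddress.ip_network(pod_cidr)
    return [c for c in existing_cidrs if pod.overlaps(ipaddress.ip_network(c))]


if __name__ == "__main__":
    existing = ["10.0.0.0/8", "192.168.1.0/24"]  # example office/DC ranges
    for candidate in ["10.244.0.0/16", "172.20.0.0/16"]:
        print(candidate, is_rfc1918(candidate), conflicts(candidate, existing))
```

Running a check like this against whatever your IPAM team hands you is much cheaper than rebuilding the cluster after the pod network collides with an existing subnet.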

You want a working LoadBalancer? OK, now you need MetalLB or PureLB, among others. Make sure your IPAM people know that you've blocked off several addresses or a CIDR range for K8S dynamic address allocation. IP's allocated via K8S aren't going to respond to ICMP packets and people will assume they're unused :)

You want ingress controllers? OK, well you can pick from Nginx or Traefik. There's actually a ton of them but those seem to be the most popular.

You want certificate management? OK, go install cert-manager. You'll need to have programmatic access to your DNS providers if you want to use Let's Encrypt with wildcard certificates.

Oh, you need some kind of volume provider? Well.. there's hostPath but people generally don't recommend that for security reasons. I guess you could use the NFS volume provider but that's a little creaky for all of the usual reasons that NFS has been creaky for the last 30 years. You could go install Rook - but that's another entire complex distributed system on top of your distributed system. (I love Ceph, BTW, but this is really overwhelming for a new person.)

At this point you have essentially a working cluster, probably with a single master unless you set up something like OKD, in which case you already had to setup an entire HAProxy setup before even approaching the K8S parts.

Prepare to have a non-insignificant number of full time employees keeping the plane flying while you swap out the wings in real time to keep up with the fast K8S release cycle.

IMO, the complexity of K8S really incentivizes trashing all of your on-prem hardware and just paying for cloud. That's the end game.

I actually found bare metal to be fairly pleasant, and because I built it I understood a ton about how it worked so was able to figure out issues a lot easier.

My advice would be to take careful notes about your setup steps though, even if you're following a guide. For some reason in the k8s world I have a hard time finding blog posts/guides/etc that I used months later, and Chrome seems to eat my bookmarks :-(. I suspect SEO is a ruthless beast when it comes to K8s.

I did a project 5 years back that had to be bare metal, and going for Kubernetes was probably the worst project decision I've made so far. We didn't have the required competency and wasted so much time on it; we should have gone for something more bland and simple.

My only tip if you really decide to go for it is to make sure to use a well-supported Linux distro. We had to be on RHEL and that turned out to be a poor fit.

If you plan on running bare-metal I highly recommend RKE2. It just works, it sets up most things for you (CNI included).

Don't even think about using kubeadm, it's the worst. It's overcomplicated and the smallest issue will wreck your cluster.

Also as a quick tip, don't use firewalld or iptables, use CNI resources (e.g. Calico GlobalNetworkPolicy) ;)

Because a vendor will do a lot of ground work (choosing a CNI and CSI implementation for instance) for you, and everything usually covered by a cloud-controller will be entirely up to you (e.g. LBs)

Actual bare metal, where you own the physical hardware and pay for the physical network connections, is actually pretty painless. I see many people trying to use hosted compute to set up a bare metal cluster. This is a recipe for heisenbugs.

Having physical access (or IPMI) certainly helps, but there's also a lot more knowledge about networks bundled in companies that already run data centers, so setting up something like MetalLB (BGP load balancers) and Rook (Ceph CSI) to cover the parts that your cloud vendor would usually provide automatically is not as big of a deal. But the overall complexity for someone completely new to the topic is still higher.

> 2. Pay a cloud vendor to supply a working, managed Kubernetes cluster.

... which makes it trash IMHO. I don't see anything intrinsic about the problem domain that mandates a completely uninstallable unmaintainable Rube Goldberg machine. But having it be that way certainly benefits the cloud vendors who push it since it keeps people from escaping big cloud costs and using simple commodity VMs, bare metal, or colocated stuff.

Complex is the new closed. It can be fully "open" but it doesn't matter if mere mortals can't use it.

Regarding item 1, any recommendations for a good book? Manning has a couple titles that look good, but I’m curious to hear what others would suggest.

> My personal Kubernetes nightmare is having to build a cluster from scratch on bare metal.

We have a bare-metal k8s cluster... In my opinion the thing we got right is to use external load-balancers (good old haproxy) to point at nginx-ingress-controllers (whose pods are pinned to two "service" nodes) and to load-balance the apiserver traffic.

Most other traffic is intra-cluster, and managed by Calico anyway.

And don't expose workloads to the internet unless it is a prod app.

Can you recommend me a good intro book to read cover to cover (hopefully not too thick).

"Kubernetes in Action" by Marko Lukša (Manning Publications) is my recommendation. Took me from someone who knows docker/docker-compose to someone who can handle Azure/AWS Kubernetes, understand the terms and design apps. Very good book.

I'll second that rec, same story... I wouldn't consider myself at all an expert based solely on that book, but it did give me a lot more confidence in branching out from a straight-and-narrow configuration, that I'd at least be able to know what to look up when I run into problems.


Which book would you suggest please?

You hit every high point of my own experience but there are caveats.

1. If you have to do on-prem, then virtualize and package till you can wash, rinse, repeat.

2. Secure systems with k8s are a thing: STIG'd k8s, STIG'd host systems, mTLS, PSP, network policy, MAC integration. This makes k8s really unpleasant to deal with if you come from a public cloud, public k8s provider. See #1.

3. Performance: DNS sucks and it sucks for all kinds of reasons. Usually avoidable with node-local caching approaches, but sometimes not.

4. Yes: big clusters... until you need federation.

>My personal Kubernetes nightmare is having to build a cluster from scratch on bare metal.

One's heaven is another's nightmare. I like building it from scratch because then I know every single knob, and doing an excellent job on documentation makes sure that others have that knowledge too.

But hey, since "administrators" is a forgotten art, you are probably better off just buying some black-boxes with terrible performance.

Depending on your individual circumstances buying "some black-boxes with terrible performance" might be a worthwhile tradeoff

Well yes that's true. It's always about the circumstances.

What are some of the best Kubernetes books?

I personally liked O'Reilly's Kubernetes: Up and Running, which was fairly thorough, and Nigel Poulton's books, which were shorter and focused on the highlights (at least the editions I read).

The reason I always recommend that people read a book before getting into Kubernetes is that there are several things that make a lot more sense once someone takes the time to explain them.

It actually gave me some 90s nostalgia. In order to use a new server technology, I actually needed to sit down with an O'Reilly book.

Is the website documentation not good enough? It looks very thorough. Actually I'd say that it's rare to encounter such verbose and complete documentation nowadays. Maybe it's even too deep, but I enjoyed reading it.

The website has tons of reference documentation!

But what a lot of people need is someone to just explain:

1. The basic idea of setting a desired configuration, and having the cluster try to bring reality into sync with the config.

2. How pods, replica sets, deployments and services fit together, and why Google thought it was a good idea to split them up that way. Also, how ingress fits in with all this.

3. Basic volume management.

4. Other common optional topics, just to get an overview.

The big advantage of a book is that it will try to cover the essential ideas, and why they work the way they do, without getting lost in describing a hundred advanced features you can look up later.

If there's an introductory section on the website that covers just the essentials, that might be enough! But I didn't find one when I was learning.
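For point 1 in the list above, a toy version of the idea may help. This is a minimal sketch of a desired-state reconciliation loop, not Kubernetes controller code; the function names and the "replica count per workload" model are simplifications I made up for illustration.

```python
def reconcile(desired, actual):
    """Compare desired and actual replica counts and return the actions
    needed to bring reality in line with the config."""
    actions = []
    for name, want in desired.items():
        have = actual.get(name, 0)
        if have < want:
            actions.append(("start", name, want - have))
        elif have > want:
            actions.append(("stop", name, have - want))
    # Anything running that is no longer in the config gets removed.
    for name, have in actual.items():
        if name not in desired and have > 0:
            actions.append(("stop", name, have))
    return actions


def apply_actions(actual, actions):
    """Pretend to carry out the actions and return the new actual state."""
    new_state = dict(actual)
    for verb, name, count in actions:
        delta = count if verb == "start" else -count
        new_state[name] = new_state.get(name, 0) + delta
    return {name: n for name, n in new_state.items() if n > 0}


if __name__ == "__main__":
    desired = {"web": 3, "worker": 1}
    actual = {"web": 1, "cron": 2}
    actions = reconcile(desired, actual)
    print(actions)
    actual = apply_actions(actual, actions)
    # A second pass finds nothing to do: the loop has converged.
    print(actual, reconcile(desired, actual))
```

The real system runs many such loops (one per resource type), which is both why it is resilient and why it is hard to see a single root cause when something misbehaves.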

For intro level: Kubernetes Up and Running. (Here's a free version provided by VMware: https://www.vmware.com/content/dam/digitalmarketing/vmware/e...). Will teach you the basic vocabulary and get you well enough oriented to use k8s.

For trying to get to pro level: Programming Kubernetes. This one is focused on writing code for the k8s ecosystem, but it will teach you a lot of the internals.

Kubernetes in Action by Manning (https://www.manning.com/books/kubernetes-in-action) is quite thorough, good and beginner friendly.

Second that, but I do recommend coming in with an understanding of docker/docker-compose.

There’s a second edition in MEAP.

>Prefer fewer larger clusters with namespaces (and node pools if needed) to lots of tiny clusters.

People do this? I thought the whole point was to abstract everything away. You should have containers running on pods. You shouldn't care about what's in the containers or what metal the pods are running on.

Some people don't trust the namespacing in Kubernetes, or have contractual obligations to keep environments separate. I've rarely seen clusters with more than 10 nodes, but I have seen single customers run 5 different tiny clusters for different environments.

Thank you for your insight. Really solid advice.

I have a question though.

> My personal Kubernetes nightmare is having to build a cluster from scratch on bare metal.

Can you share a few details about which distribution you used and how you handled ingress?

I've done exactly that. It wasn't fun. RHEL, k3s, ansible, Longhorn, metallb, to name a few.

Storage is the fun part.

I was playing around with a local Elasticsearch cluster and I couldn't figure out how to "do" k8s storage. Some kind of like... shared NFS volume or something maybe?

Could you spread some tips from your experiences?

Longhorn is honestly pretty easy to get up & going for backing a PersistentVolume, which you can then mount however you want.

K3S has some local storage options too, but that's of mixed usage. Or you just do a hostpath + NFS if you want something that has as little Kubernetes magic as possible.

Yeah, NFS is your best bet if you are in a lab environment. Mount the volume on each host, then use hostPath to mount it in the pods (https://kubernetes.io/docs/concepts/storage/volumes/#hostpat...).

We used a large amount of local physical storage spread pretty evenly over the machines in the cluster and then gave this to Longhorn to manage.

It pretty much takes care of everything else. But does require some preventative TLC

can you recommend any good intro books?

Responded to a sibling comment with the same question here: https://news.ycombinator.com/item?id=31894095

If you have a fully-functioning "best practices" Kubernetes environment, each of the following topics ends up with its own full-depth tech:

- Compute

- Deployment

- CI

- Networking

- Storage

- Policies

Imagine running a microservices solution without an orchestration solution - how many people would it take to administer the servers, the storage, the network, the policies, etc. And with Kubernetes, you get maybe a couple of teams if you're lucky. This is the power and the leverage of the platform.

But also, imagine in that environment, how many things can go wrong, and the amount of expertise that you need to properly debug them. You still need that amount of expertise, because all of that complexity is still in place (or at least most of it is) - if your physical disks are throwing errors, you need someone who knows how to debug and replace that. Not hard. But then you have Ceph above that, and Rook above that (or whatever storage solution you use). And then you've got the deployment that has to make the PVC successfully. And it's like that for everything. Every problem has the potential to be a full stack problem for any one of half a dozen stacks.

It's a lot.

My team looks after 150 servers from Singapore to LA, about half physical, half virtual, some of which just sit on shelves between jobs. Being pessimistic, that takes about 100 hours a year of feeding, upgrading, etc., or about $6k a year.

That's comparing steady state with bootstrapping a Kubernetes environment. I suspect steady state Kubernetes is comparable.

The two things that get me are:

1. Latch-up states. It's very, very easy for something to go wrong, blow a whole deployment up and lose all the pods, for example through a health check failure. Most application frameworks have some sort of request queuing and the health checks sit in the same queue, so with any upstream issues you get health check failures and flapping. Of course the autoscaler goes fucking bonkers in the middle of that. The only thing you can do is drop traffic at the network edge and wait for it to get itself together.
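A back-of-the-envelope model shows why a shared queue makes health checks flap. This is a deliberately simplified sketch, not how the kubelet or any particular framework works: it assumes a FIFO queue, identical request cost, and made-up numbers for service time, worker count, and probe timeout.

```python
def probe_latency_ms(queue_depth, service_time_ms, workers):
    """Latency seen by a health probe that joins the back of a FIFO queue
    shared with user requests. Assumes every request costs the same and
    `workers` requests are served in parallel."""
    rounds_ahead = queue_depth // workers
    return rounds_ahead * service_time_ms + service_time_ms


def probe_passes(queue_depth, service_time_ms, workers, timeout_ms):
    """A probe passes only if it is answered within the timeout."""
    return probe_latency_ms(queue_depth, service_time_ms, workers) <= timeout_ms


def longest_failure_run(queue_depths, service_time_ms, workers, timeout_ms):
    """Longest streak of failed probes over a series of observed queue depths."""
    worst = run = 0
    for depth in queue_depths:
        if probe_passes(depth, service_time_ms, workers, timeout_ms):
            run = 0
        else:
            run += 1
            worst = max(worst, run)
    return worst


if __name__ == "__main__":
    # 4 workers, 50 ms per request, 1 s probe timeout (all assumed values).
    for depth in (0, 76, 80, 400):
        print(depth, probe_latency_ms(depth, 50, 4), probe_passes(depth, 50, 4, 1000))
    print(longest_failure_run([0, 400, 400, 400, 0], 50, 4, 1000))
```

In this model the app is still "healthy" in the sense that it answers every request, yet a backlog of a few hundred requests is enough to fail several probes in a row, which is the kind of streak that gets pods pulled or restarted right when capacity is needed most.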

2. No one knows how to fix it if anything major goes wrong. Even cloud providers. It's so large and complicated that no one has enough knowledge independently to actually fix it. For example, I suffered from months of weird network issues where pods would come up without network. To this day no one knows why that happened or can explain it. No amount of debugging and reverse engineering resulted in even a single step forward, the only outcome being "replace the whole cluster".

Don't get me wrong, I still like it but I wouldn't want to run it with little expertise at hand. It's not something I would trust someone to run without production experience, which is difficult because there are very few people out there who are battle hardened past trivial home deployments and tiny little stacks.

> Don't get me wrong, I still like it but I wouldn't want to run it with little expertise at hand. It's not something I would trust someone to run without production experience, which is difficult because there are very few people out there who are battle hardened past trivial home deployments and tiny little stacks.

That's the problem. Everyone used something like minikube to bring a kubernetes cluster online and believes it is simple. But when anything does not work correctly, the only approach is to kill the complete machine and start a new cluster on a server. Have fun with stateful data which needs to be copied…

It's not like you can let someone run bare VMs without any production experience either. You need a PaaS for something like that, even then...

> Everyone used something like minikube to bring a kubernetes cluster online and believes it is simple.

This hasn't been my experience, and certainly the Kubernetes project doesn't advertise the software as "so simple anyone can do it" or any such thing. Kubernetes definitely requires experience--it's not a PaaS, but a framework on which something like a PaaS could be built.

We sort of waltzed around that one with PVs on EBS on Amazon and also shifted the control plane to them. But there are still some serious problems in that space.

A limitation of EBS on Amazon that I've run into a few times is that an EBS volume can only attach to EC2 instances in its own zone. So if your k8s cluster has nodes across multiple AZs (which is obviously important), a pod that mounts that PV will always be zone-locked. This is also a problem if you write a pod that mounts claims in different zones: that will never work.
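A minimal sketch of that zone-locking constraint, assuming each volume lives in exactly one zone. The function name and the zone names are made up for illustration; the real scheduler works from node affinity on the PersistentVolume rather than a helper like this.

```python
from __future__ import annotations


def schedulable_zones(claim_zones: list[str], node_zones: set[str]) -> set[str]:
    """Zones where a pod mounting all of the given zonal volumes could run."""
    zones = set(node_zones)
    for zone in claim_zones:
        # Every zonal volume narrows the pod down to that volume's own zone.
        zones &= {zone}
    return zones


nodes = {"us-east-1a", "us-east-1b", "us-east-1c"}
print(schedulable_zones(["us-east-1a"], nodes))                # locked to one zone
print(schedulable_zones(["us-east-1a", "us-east-1b"], nodes))  # empty: never schedulable
```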

There are also Amazon limits on how many volumes you can attach per node, and I used to see problems with EBS volumes getting stuck while 'unmounting' from nodes. The latter was always problematic and required an admin with a hammer. However, I've never seen a 'kube' problem per se; they've all been AWS problems.

Resource zone affinity is annoying but usually works fine if you have 1 ASG per zone (Azure and afaik GCP are the same way). You mainly need to be careful during initial creation to make sure the volumes are spread out

WaitForFirstConsumer is very helpful, too (you basically guarantee the pod can be scheduled at least once before creating the resources which greatly improves the likelihood it can get rescheduled in the future without getting stuck)
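For reference, a hedged sketch of what such a StorageClass might look like, written as a Python dict and dumped as JSON (which kubectl accepts alongside YAML). The name `ebs-wait` is hypothetical, and the provisioner assumes the AWS EBS CSI driver; other clouds use a different provisioner string.

```python
import json

# Assumed example values: only volumeBindingMode is the point here.
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "ebs-wait"},      # hypothetical name
    "provisioner": "ebs.csi.aws.com",      # assumes the AWS EBS CSI driver
    # Delay volume creation until a pod using the claim has been scheduled,
    # so the volume is created in the zone the pod actually landed in.
    "volumeBindingMode": "WaitForFirstConsumer",
}

print(json.dumps(storage_class, indent=2))
```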

Don't remember seeing any issues on AWS, but I've seen Azure CSI take about 7.5 minutes to unmount a volume from one node and mount it on another (so each pod in a StatefulSet can take around 8-10 minutes).

1. You don't 'lose' the pods. If your pod fails on liveness or readiness checks it restarts, over and over until it passes.

2. Depends on your team for sure. If you are on a team that's like "We'll just spin this up and it will be fine." you ignore the fact that 'things happen'. I've seen similar situations with companies that deploy on bare linux servers as well, some update breaks something or something isn't optimally configured. Things go wrong with systems, people who know the systems are needed to fix them. It sounds like if you intend on using kubernetes you should learn to troubleshoot it.

It's not difficult to find people who know how to actually run kubernetes, it's just hard to convince them to switch jobs for you. I get a decent number of cold, non-recruiter LinkedIn contacts a month, and frankly many recruiters... but my current job pays well and matches my work-life balance. Zero people who are reaching out from startups offer the whole package: they can usually come in fine on the money, but I'm not working a hundred hours a week with little to no support. On the 'established corporate enterprise' side of the house, they can be inflexible when it comes to vacation time, salary ranges etc., but I've found a good place.

RE 1. Just to be precise (as I thought this until recently):

* A failed liveness check causes the container to be restarted.

* A failed readiness check causes the pod to be removed from the round-robin of new traffic requests so it has time to recover/finish processing what it's working on.

You are correct, readiness removes it from any service objects it would be an endpoint to.
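The distinction above, as a tiny lookup. The enum and function names are invented for illustration; the two outcomes are the ones described in this subthread.

```python
from enum import Enum


class Probe(Enum):
    LIVENESS = "liveness"
    READINESS = "readiness"


def on_probe_failure(probe: Probe) -> str:
    """What happens when a probe of the given kind keeps failing."""
    if probe is Probe.LIVENESS:
        # The kubelet kills and restarts the container.
        return "restart container"
    # The pod keeps running but stops receiving new traffic via its Services.
    return "remove pod from Service endpoints"


for probe in Probe:
    print(probe.value, "->", on_probe_failure(probe))
```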

My point was more 'the pod doesn't go away'. I've seen some people do stuff with the HPA that could cause it to scale down to minimum replicas if it's in a broken state, depending on what stats you are using to scale, but that's more of a 'kubernetes doing what you told it to do' problem.

> Don't get me wrong, I still like it

Stockholm syndrome?

I have 2:

1. We built our own custom build system, because there is no CI that can do actual DAGs (maybe a few). A custom Kubernetes operator that parses Jsonnet files to create 100s of CRDs and pods to achieve extreme parallelization. EKS was $144/mo (now $72) but with no info on master node types. Using watch endpoints with hundreds of pods did not scale well. They had to bump up the master node instances to c5.18xlarge, but at the same managed price. Figuring out that a scale-up was all that was needed took days, though. One c5.18xlarge is $2k a month, and EKS runs at least 3 for HA. So it's a horror story for them. But we also had 100s of worker nodes, so that might offset some of it.

2. Similar to CI, we allowed devs to deploy all microservices (~80) from any branch so that they can port-forward and use them. All of them had Ingress endpoints. After days of headaches and frustration, it turned out that nginx ingress generates megabytes of configuration whenever a new deployment occurs, forks a new subprocess with the new config, and kills the other connections. When that happens often, it takes 30GB of memory with 50 developers using it (~4000 pods), and it often dies and restarts. Similar story for Prometheus and kube-state-metrics; they do not like short-lived containers and hog memory.
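The arithmetic behind the "horror story for them" in point 1, using only the rough figures quoted there (the prices are the commenter's, not current AWS pricing):

```python
EKS_FEE_PER_MONTH = 144      # USD, managed control plane price quoted above
INSTANCE_PER_MONTH = 2_000   # USD, rough c5.18xlarge cost quoted above
HA_MASTERS = 3               # minimum master count quoted above

control_plane_cost = INSTANCE_PER_MONTH * HA_MASTERS
ratio = control_plane_cost / EKS_FEE_PER_MONTH

print(f"provider side: ~${control_plane_cost}/mo, customer side: ${EKS_FEE_PER_MONTH}/mo "
      f"({ratio:.1f}x)")
```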

> We build our own custom build system, because there is no CI that can do actual DAGs (maybe a few).

Have you had a look at GitLab CI? They have a bit of documentation here: https://docs.gitlab.com/ee/ci/directed_acyclic_graph/

Now, I don't work on any projects that are too complicated, but I recall that piece of functionality working as one would expect: https://docs.gitlab.com/ee/ci/yaml/index.html#needs

Also there's Drone CI, which also supports setting up dependencies in your pipelines, if you'd prefer something that's not connected to GitLab CI so closely: https://docs.drone.io/pipeline/docker/syntax/parallelism/

D'ya like dags?

- 60% of the Kubernetes ecosystem is half-baked alpha software

- Maintaining 200+ clusters for 10 small applications

- Cloud bills

- Autoscaling never working well

- Trying to untangle Terraform state without taking down Prod

We use GKE.

1. I don't know about any of this; we don't seem to have problems.

2. This sounds like an architecture issue, not a k8s issue.

3. Our entire GKE infrastructure costs less than $50 a month.

4. You're right here; it doesn't work 'well', but it works 'well enough' for our use cases.

5. I'm sure you're talking about some event that was far more complex than the few times we've had to drain our pool, but we did what we needed to do without downtime in production. While annoyingly esoteric, I thought it worked pretty fucking well compared to our alternatives.

With regards to 3, isn’t the management plane alone $72 for GKE without considering the cost for the nodes? How are your costs so low?

As noted by another HNer, we are using one cluster with up to 10 nodes at the top end, with only one running 90% of the time and only up to 3 covering the other 9%. We set the upper limit to 10 to make sure we don't get borked runs if some crazy random ML model takes everything out and forces jobs to be killed. The vast majority of workloads running on the cluster are HIGHLY variable, with most running on Monday morning/afternoon and Month or Quarter starts + 3 days.

We are running many hundreds of jobs on those peak days with only a handful running on any other day. While many bring up examples where 24/7 infrastructure from a single box is more than plenty, we find that we can run micro VMs in this configuration and not have to worry about resource contention as our jobs run.

Pre-GKE, we were managing the timing manually, which was fine until we started to scale, but we found this to be a far better situation. Particularly because we simply don't have to think about it.
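A back-of-the-envelope check on those utilization figures. The 90% and 9% shares come from the comment; treating the remaining 1% as running at the 10-node limit is an assumption made here to get an upper bound.

```python
# (share of time in percent, upper bound on node count during that share)
USAGE = [(90, 1), (9, 3), (1, 10)]  # the last row is an assumed worst case


def average_nodes(usage):
    """Time-weighted average node count for a list of (percent, nodes) rows."""
    assert sum(share for share, _ in usage) == 100
    return sum(share * nodes for share, nodes in usage) / 100


print(average_nodes(USAGE))
```

So even in the worst case the cluster averages well under two micro VMs, which is how a bursty workload can stay cheap.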


> 3. Our entire GKE infrastructure costs less than $50 a month.

At this scale, you don't need Kubernetes, invest in a pocket calculator instead.

Agreed. We've had only one autoscaling issue and it was on Google's end (datacenter ran out of nodes of a particular type and thus failed to scale up). Our GKE infrastructure costs a lot more, but we do a lot of heavy compute.

Most of our 'heavy compute' happens in BigQuery so we pay for it there, instead, but we make sure the rest of our analytics infrastructure needs are engineered to be as light as possible to keep costs low.

We are an ESOP so, as employee owners, it behooves us to be as cost-conscious as possible.

> Our entire GKE infrastructure costs less than $50 a month.

Uhhh what? I mean even my personal DO based cluster runs about $40 a month. I'm skeptical a production cluster is at $50.

    Service                 Compute Engine   Kubernetes Engine
    Cost                    $191.78          $62.34
    Discounts               ($13.96)         ($62.34)
    Promotions and others   $0.00            $0.00
    Subtotal                $177.83          $0.00

The GCE instance cost includes some of our 24/7 VMs which are the lion's share of that line item, not the micro VMs we use for the cluster.

We are using karpenter.sh for autoscaling. Works just fine.

I work with the karpenter team. Glad to hear you like it. Would love it if you added your info to our public reference adopters.md file https://github.com/aws/karpenter/blob/main/ADOPTERS.md

Literally heaps of resources used in a Kubernetes cluster are using API versions with alpha in the name!

Tell me you barely understand kubernetes without saying you barely understand kubernetes.

> 60% of the Kubernetes ecosystem is half-baked alpha software

This one is fair. Wasted a lot of time trying to find the "correct" dependencies—I remember the Nginx Ingress Controller specifically being a headache—only to find a maze of deprecations, poorly written documentation, or stuff that just flat out didn't work. That was ~18 months ago (I set up my cluster to run sites for my business and have basically left it alone) so things may have changed but at the time I remember being surprised after hearing so much hype.

Pretty sure nginx ingress controller is one of the more solid and widely-used pieces. I've had a lot more trouble with cert-manager, but it seems to be in a stable state on my cluster now and anyway similar solutions in the bare-VM world are just as painful (IIRC I gave up trying to get terraform to do the handshake for AWS ACM).

What I can say definitively is that having gone from not doing any infra work to using k8s and then over the past few months trying my hand at a bare-metal setup, just spinning up a Linux box and hand installing deps via apt or snap was far more enjoyable/easy to follow.

Primarily because there was very little obscurity (i.e., config files that automate away a lot of thinking or Dockerfiles/containers doing the same). It also left me feeling more confident about stability because if something isn't working, it's pretty clear what I broke/forgot. Worst "bug" I ran into was a snap server hanging when installing a dependency.

Well, that is a rude thing to say.

Call 'em like I see 'em

My #1 k8s nightmare is the widespread practice of just writing (or downloading and never even looking at!) YAML and applying it to the cluster, with no additional management layer (we use Terraform, but use whatever you want), meaning that eventually you have no idea what the intended state of the cluster is, only its actual state. Vendor READMEs encourage this (some even going so far as to suggest `kubectl apply -f https://...`!).

My #2 (probably partially caused by #1) is the lack of attention paid to RBAC in vendor-supplied manifests. Multiple times I've found that the vendor's YAML binds some controller's service account to a ClusterRole giving access to all secrets in the cluster, when it only really needs to read one. After filing a GitHub issue it seems that I'm the first to even notice, even on popular projects that have been around for years.
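A rough sketch of the kind of check that catches #2 before applying a vendor manifest. It assumes the ClusterRole has already been parsed into a dict; the function name and both example roles are hypothetical. It only looks at the `resources` and `resourceNames` fields of each rule, so treat it as a first-pass filter, not an audit.

```python
def broad_secret_rules(cluster_role: dict) -> list:
    """Return the rules that grant access to secrets without naming specific ones."""
    flagged = []
    for rule in cluster_role.get("rules", []):
        resources = rule.get("resources", [])
        touches_secrets = "secrets" in resources or "*" in resources
        # Without resourceNames, the rule covers every secret the binding can reach.
        if touches_secrets and not rule.get("resourceNames"):
            flagged.append(rule)
    return flagged


vendor_role = {  # hypothetical vendor manifest, already parsed
    "kind": "ClusterRole",
    "rules": [
        {"apiGroups": [""], "resources": ["secrets"], "verbs": ["get", "list"]},
        {"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get"]},
    ],
}
tightened_role = {  # what it arguably should have been
    "kind": "ClusterRole",
    "rules": [
        {"apiGroups": [""], "resources": ["secrets"],
         "resourceNames": ["controller-tls"], "verbs": ["get"]},
    ],
}

print(len(broad_secret_rules(vendor_role)), len(broad_secret_rules(tightened_role)))
```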

>Multiple times I've found that the vendor's <software does something stupid>

...sounds about right (at least for closed source products)

No experience with deploying anything from a closed source vendor on k8s — these experiences have been exclusively with OSS.

Cluster-wide secret access is one of the worst I've come across, but smaller problems are virtually universal. We've come to see the YAML shipped by projects as an example, even when they document it as the preferred installation method. We always write our own now.

Even shipped Helm charts are no better, they usually encapsulate the same problems but just make them harder to fix yourself (since you are incentivised not to fork the chart as you'll have to maintain it).

Honestly, creating my first K8s deployment of a service: typing out at least 150 lines of YAML to define my Deployment, and figuring out how my ConfigMaps, Secrets, Volumes, and Services are defined and connected together. Vanilla K8s YAML is extremely low-level.

Having experienced many Word documents full of deployment instructions and screenshots on how to deploy software, a few lines of YAML have been amazing for me :)

But no doubt there are other tools that are even better.

Compare with Docker Compose files, which still use YAML, but a lot less of it, and less verbose too.

And does a lot less too, since it describes single node deployment, without any complexities behind networking and communication of services on different nodes.

While the file specification is for Docker Compose, Docker Swarm uses the same files, and supports multi-node deployments.

There are a few differences in supported features; for example, IIRC, only Docker Swarm supports `secrets`.

A docker compose file is also used for Swarm, which is a multi-server deployment.

But is that a more complex file than the single node one?

I'm a platform engineer and I still think that Kubernetes and its tooling are unnecessarily complex.

As an aside, when I think "low-level" with regards to computer programming, I think machine byte code - closer to the hardware - so this statement read a little funny to me.

Yeah, it may be complex and highly configurable, but it's hard to imagine getting much higher level than defining networking, storage, secrets, etc. through YAML.

I feel it's a lot like the Java enterprise world of FactoryFactoryFactory-classes. A colleague coined the term "horizontal abstraction" for this type of trend: you never build a pyramid/hierarchy that composes truly higher level/abstract reasoning, you just complect in various constructs that all keep hold of most of the complexity from the level "below".

So you get to write 30-40 lines of yaml for each of your ten slightly different services...

The most freeing moment with k8s for me was realizing that no, you don't need to write all that YAML (or JSON), and moving to generating manifests from other formats. In my case that's usually Jsonnet with a mix of public and private libraries, which quickly came to embed knowledge specific to our setup.

I think one of the best things I ever heard from another colleague was "I took the example jsonnet and had a working version for a completely new deployment within an hour" :)
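The same idea sketched in plain Python instead of Jsonnet: one small function holds the boilerplate, and each service supplies only what differs. The helper name, service names and image tags are hypothetical; the output is JSON, which kubectl accepts the same way as YAML.

```python
import json


def deployment(name: str, image: str, replicas: int = 2) -> dict:
    """Build an apps/v1 Deployment manifest with house defaults baked in."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


# Ten slightly different services become ten short lines instead of ten YAML files.
manifests = [
    deployment("billing", "registry.example.com/billing:1.4.2"),
    deployment("search", "registry.example.com/search:0.9.0", replicas=4),
]
print(json.dumps(manifests[0], indent=2))
```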

Agree! That said, it's important to remember that someone needs to maintain those manifests, too. Copy-paste might work initially, but you will need to modify them at some point. Which means you need in-house knowledge of that K8s YAML, anyway. For a smaller shop this might be non-trivial. For a larger shop you likely have something close to a Platform team that can maintain the manifests to make them easy to use.

> Vanilla K8s YAML is extremely low-level.

I find trying to untangle something deployed with Kubespray/Helm/anything automated far more headache-inducing than flat YAML files.

Yeah, been there and it makes things one notch even worse. :)

I've found that creating the k8s resources manually in a local KinD/microk8s cluster and then dumping the resulting YAML is much easier than typing the YAML directly.

I'm buried up to my balls in this right now. My favourite part is nested 'spec' objects where I've no clear view into what each spec is.

Try k9s[1], the xray view (:xray [resource]) shows you nested resources as a tree. I find it very useful (and k9s in general is a fantastic administration tool).

[1]: https://k9scli.io

I haven't really used it in anger, but I sort of think kpt might be helpful in managing k8s: https://kpt.dev/

There's a vscode k8s language plugin that gives you autocomplete and tooltips when hovering over yaml keys, saves me a ton of time in situations like these.

kubectl explain

Yes it’s a bit much. When I was beginning with kubernetes I was writing Docker compose files first and then converting them to kubernetes using https://kompose.io/

My Kubernetes nightmare is that all kinds of organizations will end up cargo-culting it as required tooling when the reality is that it's massive overkill for most deployment scenarios. Oh wait...

Ironically, I think one of the biggest issues is around packaging, specifically Helm charts (but if there are others, it is probably the same). In many frameworks, packaging is to help people by hiding complexity. Need an ingress? Use a Helm chart!

But then upgrading can be very risky: if you hit any problem at all, you can rarely just downgrade or uninstall unless you understand the Helm chart. You could have caused a fatal problem, and a cluster's resilience is meaningless if you make a change that blocks access to all services.

Other issues relate to dependencies and breaking changes, which can be subtle and hard to discover, such as some old resource using a v1beta type that becomes deprecated.

I think once it is working, Kubernetes is very reliable for me but it is when making infrastructure changes that things can go south very quickly. Updating deployments etc. is fine.

I am not going to contest this but one upside is that you can install several ingresses at once and you don't have to uninstall the old one until the new one works

Slow performance.

So we have a few Spring Boot based webapps which were running (along with PgSQL) on a shared AWS t2.medium instance. We migrated these to a GKE cluster with a node pool of e2-standard-2 instances. The nodes are on a private network and don't have public IPs. The services are exposed via Load Balancer based Ingress (with SSL). Even after allocating one core and 2GB RAM to PgSQL, the API calls from the GKE applications are perceptibly slower than on the shared AWS t2.medium deployment. We tried giving generous CPU and RAM to the applications, but it still didn't improve the response time. Since these are the very first applications being moved to this cluster, there isn't much else running on it.

Not sure what's causing the slowness. Have any of you experienced something like this in GKE?

Slower in what sense? Latency, throughput, computation time?

Are you setting CPU requests in your pod spec? This influences the cgroup cpu.shares for the containers and (unless this has been fixed) leaving CPU requests unset results in cpu.shares=2, which the JVM interprets badly.
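For context, a sketch of how a CPU request turns into cpu.shares. It mirrors my understanding of the kubelet's conversion under cgroup v1 (1024 shares per core, floor of 2); the constant and function names are mine, and the kubelet's upper cap is left out for brevity.

```python
SHARES_PER_CPU = 1024   # cgroup v1 shares that correspond to one full core
MILLI_PER_CPU = 1000    # Kubernetes expresses CPU requests in millicores
MIN_SHARES = 2          # lowest value the conversion will hand out


def milli_cpu_to_shares(milli_cpu: int) -> int:
    """Convert a CPU request in millicores to a cgroup v1 cpu.shares value."""
    if milli_cpu == 0:
        # No request set: the container gets the minimum.
        return MIN_SHARES
    shares = milli_cpu * SHARES_PER_CPU // MILLI_PER_CPU
    return max(shares, MIN_SHARES)


for request in (0, 100, 500, 1000, 2000):
    print(f"{request}m -> cpu.shares={milli_cpu_to_shares(request)}")
```

A runtime that sizes its thread pools from cpu.shares would see close to nothing for the unset case, which is the behaviour the comment above is warning about.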

Thanks for replying. Yes, we are setting the CPU requests in our pod spec. I will experiment with this further and see if it solves the issue.

> What aspect is that makes Kubernetes operationally so hard?

Inherited a web site and hosting from another studio. They set up a PHP site in a Docker container inside a VPS. They don't use microservices; it's one monolith container. They didn't set up any way to get logs out of the thing. They don't use docker compose to build an image; they get a console for the container and use it like a VPS.

They literally just use it to add another layer of containerisation on their vps.

You already need to understand Linux to use Docker or Kubernetes. If you don't use microservices or need horizontal scaling, it's just more to learn: an extra layer of complexity that's super fragile and a nightmare to debug.

It has such a niche use case, but everyone uses it where it's not useful because it's trendy. They want to put on their CV that they have used Docker / Kubernetes; they don't have to write that it wasn't necessary and caused issues.

2025. 3am. Clear night. Full moon shining, many stars around.

I suddenly wake up, covered in cold sweat. My heart is pumping so hard.

I take out my phone. I search the internet. Kubernetes still reigns, no simpler approach made it.

The end.

I believe there is a rather relatable book called "The Dream-Quest of Unknown Kubernetes" or something like this.

Why do ppl obfuscate like this? k8s is a control plane + api and client and workers. Don't make it weird. If you do then control the weirdness: the hooks are all there. To me it still looks like client/server with some development overhead based on gRPC + REST, state + eventual consistency. When I was learning linux in 1997 I had butterflies in my stomach - it felt amazing working with tech that I could do _anything_ with. k8s is the only thing in 20+ years that has ever given me that feeling again.

Because even the built-in agents are huge and complex. Never mind the homegrown agents, some of which have config files that basically are YAML-serialized ASTs for some weird Turing-complete imperative language... and some just straight up embed a Lua script that does configuration and actual work. But thanks for your advice "don't make it weird": I won't! Too bad all the other people didn't heed it.

You aren't supposed to use a tool made for Google level complexity unless you work with such complexity in the first place.

As a Xoogler I'd say that Kubernetes is harder to use than Google's internal equivalents. It may not be harder to run, but that doesn't matter inside of Google unless you're on the teams responsible for the base layers.

My point is, Kubernetes isn't really "made for Google level complexity" - Google only uses it for a handful of cloud products, internal research stuff and not much else.

Why did Google then release the complex Kubernetes to the world rather than their simpler internal tools?

To loosen up AWS-entrenched customers and their much more varied use-cases, probably.

I've long wondered if K8s is really a dastardly scheme to hold back the industry …

It's basically impossible to release this kind of internal tooling due to the way Google works. You'd have to rewrite it, or open-source essentially the entire foundation, and Google probably doesn't want to do either.

It's a ninja smoke bomb to distract the tech world into thinking we are cool like Google.

Where do I go for my 3-5 node cluster which I want to schedule containers to, run dynamic workloads etc?

How do I build a container based DAG in a small cluster today without k8s? Solutions that are not k8s tend to be single programming language/sdk based or not easier to set up.

I get the feeling the alternatives are dying off unnaturally fast.

Docker Swarm?

Ideally it would be a hub and spoke model. I don't know how much swarm hammers etcd, but k8s in my cluster produces >1k IOPS on my SSDs doing nothing at all. Just health checks and cascading effects.

What exactly are you making that requires the use of multiple languages?

It would seem that the requirement for multiple languages often has nothing to do with making something or the customer requirements. It's often a developer requirement. Developers want to work in their favorite language. In the 90s you just wrote in whatever the company or ecosystem had mandated. Now that developers are in high demand they can specify what they want to use.

It started with books like "Beyond Java" (and Java was beyond C++) and disparaging articles by the likes of Paul Graham. They made some good points, but if you have ever worked on a single system with multiple languages and are honest about it, you would have to say that it wastes an enormous amount of time and generates unnecessary complexity. When a developer has to make a microservice just to write part of the system in another language, you have to wonder if they have the ability to evaluate technical trade-offs. Just pick one language and get on with it.

Kinda what I expect as well. I could understand it if a company was bringing in whole teams to scale up development of a large project quickly. Even then support could be a nightmare. Doing it just because sounds painful.

My problem needing multiple machines is less weird than needing more than python for my dag?

No, I honestly just don't know what you're talking about lol. What exactly are you building?

I'm building several things, some of them need puppeteer with plugins so have to use a lot of node.js there. Which I have no interest in otherwise. Some of it is built in higher performance languages. Some of it is off the shelf solutions

And most of all by having every node in the dag a container I don't have versioning issues.

Yes everything can be wrapped in python, doesn't make it a good idea however.

I don't think that is true. Kubernetes brings a lot of advantages to people who have to manage infrastructure (like me). It gives me a single interface that I can apply across multiple teams. They all only need to provide me with a docker image and that's mostly it.

Also what is the alternative? Self-written unmaintainable bash scripts? That's what they had before. Every team had their own way of deploying, creating packages, ... It was quite the nightmare.

To be fair, we use EKS, so a lot of the annoying work is done by AWS.

The first thing with bash scripts is you need to use a type system. Enforce json as your configuration language. Then have error handling as part of a default shell function library that emits json. All you are doing then is passing bad returns to a function that can emit json and reviewing json configurations for shell script variable definitions. Don't do in shell scripts what you can do in terraform. Don't use ansible/salt/puppet for what can be done in shell scripts. Shell scripts and makefiles go to CI/CD agencies (Jenkins). Ansible, terraform, etc...go to services that specialize in these.

>Also what is the alternative?

Docker compose.

Or if we're getting wild pick a technology, use it across the company, and rely on language tools to establish the APIs between modules. Then deploy the application however the language provides.

>Docker compose.

Swarm is a closer alternative, although its future is unclear.

Another alternative is Nomad.

I still wake up in cold sweat in the middle of the night feeling herds of yaml files are chasing me to pull me into the deep swamps of the clusters.

Yaml always mocking likortera. Sorry it’s early

There is also https://k8s.af , which covers some horror stories!

More of a distributed systems story than a Kubernetes one, but quite fun: we deployed a MongoDB cluster on our Kubernetes clusters. Our application had a chat feature that stored its messages in the MongoDB cluster. After some months, we realized we had a weird issue where some messages were arriving in the wrong order, like: 1. A: Hi! 2. B: Bye! See you next time! 3. A: Great and you? 4. B: Hello! How are you?

We thought it was an application issue, but it was actually on the database side: the timestamp of each message used the local time of the MongoDB instance, and the time differed between instances. We realized that the Kubernetes nodes had trouble connecting to the NTP server, due to a rule in a random firewall.

When we fixed it, all the messages were in the right order.
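A small reproduction of that failure mode. The node names, send times and clock offsets are invented so that the output matches the scrambled conversation in the story; the only real point is that sorting by per-node wall-clock time is not the same as sorting by send order.

```python
# Hypothetical clock error per node, in seconds (0 would mean NTP is working).
SKEW_S = {"node-a": 0.0, "node-b": 60.0, "node-c": -15.0}

# (true send time in seconds, node that stamped the message, text)
SENT = [
    (0.0, "node-a", "A: Hi!"),
    (10.0, "node-b", "B: Hello! How are you?"),
    (20.0, "node-a", "A: Great and you?"),
    (30.0, "node-c", "B: Bye! See you next time!"),
]


def stored_order(sent, skew):
    """Order messages by the timestamp each node's local clock would have written."""
    stamped = [(t + skew[node], text) for t, node, text in sent]
    return [text for _, text in sorted(stamped)]


print(stored_order(SENT, SKEW_S))                      # scrambled by clock skew
print(stored_order(SENT, {n: 0.0 for n in SKEW_S}))    # correct once clocks agree
```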

> due to a rule in a random firewall

The eternal practice of middleboxing your network. It didn't work well back when LANs were completely isolated, breaks even more nowadays when LANs are just a convention over WANs, and fails for virtual LANs on a single host too.

Yet people just do it, every single time. Because setting up security in a single place is expected to be easier than setting it up at the endpoints (I blame Windows for that culture). Which is kinda understandable, but here we are, talking about Kubernetes, and having that same culture.

My biggest surprise was how vanilla even hosted Kubernetes clusters are. For EKS I had to configure and install quite a lot to make it work as expected. At that point you are installing and self managing so much on your own that I wonder if you gain anything.

Mine is that I have to use YAML to configure k8s. Every k8s-ish tool has its own YAML API, including helm, gitops, argocd and friends, so you end up with a bunch of brittle YAML files that are very hard to understand and maintain ... Sigh

I think that it's just the "kubectl" tool which accepts yaml input. This tool then talks to the actual API of your cluster and uses json (and maybe other formats).

It's not only the k8s tool, unfortunately. Helm and argocd also expose a YAML API. I am not against YAML in general, btw; it's just the variety of tool-specific YAML formats that one mixes and has to learn every time that makes life difficult.

I hate helm. Writing helm charts sucks. We use kustomize, but still YAML at the end.

I'm afraid k8s will become like git: a great tool of which we mostly use about 5% of its capabilities, yet we all use it because everyone else is using it. And k8s doesn't really make all the underlying stacks go away. When the shit hits the fan, you have to troubleshoot it with knowledge of much more than just YAML syntax.

k8s is just Linux, if you forget the Linux there is no hope.

I admit to being a little frustrated with systems engineers telling me I should never need shell access to a production system again, that web-based Metrics and Tracing should be enough to debug all problems. I have twenty years of muscle memory using strace, dtrace, lsof, blah blah blah to troubleshoot complex problems. Furthermore I'm only brought in when the problem is sufficiently complex. I understand that it should be a break-glass exception, but I don't want linux abstracted away completely.

Kubernetes gives you high availability, deployment automation, and powerful management tools, which are all needed to run software applications 'at scale'

Running software at scale is my nightmare.

Sounds like crypto promises?


Shared FS between nodes, autoscaling volume claim sizes, autoscaling volume claim IOPS, and measuring storage utilisation (e.g. IOPS) per pod/node/PV.

How have I solved it? I haven't, and I know it's a key part of cost control for us in about 12 months.

Fast deploy:

I'm trying to get a test cluster up in less than half an hour. With the DAG for building it all I'm getting a failure rate of 30% if I don't leave arbitrary timings and extra steps. I've also only automated about 25% of our stuff, so I expect it will take longer.

Are you setting up managed k8s on a cloud provider and, if so, which one?

I've had some issues with getting EKS dependency ordering correct (using Terraform)

Indeed - GKE here.

I think some of the comments shed light on an interesting dichotomy I've noticed while talking to folk about K8s:

It seems that if you stick to simple configs, a setup hosted for you, etc, basically the happy path then people have had really good experiences with k8s. Those people can't understand how one could be inept enough _not_ to figure it out.

On the other hand, you'll also hear a lot of complaints about the difficulty of self-managed clusters, and attempting certain less popular or more complicated configs (or what have you). These people can't understand what benefit introducing such an insane amount of complexity could bring.

The second has mostly been my experience. I've tried now maybe a handful of times to create a cluster and get it running something on my home lab. At first I could rarely get it "up", but now I can usually get it to the point where I'd want to include storage or whatnot, and that's where I've been failing lately. Either way, I've never gotten it stable enough to warrant actual usage from me.

I like the idea of k8s; it seems like the natural next step of computing abstractions. I'm just not sure if "it's it", or if it's stable/reliable/evolved enough for people who don't need it now to invest in it yet.

For me it is actually getting services exposed. Say I buy a domain, then (trivially) containerize my applications and set up services in kubernetes. Now comes the networking part, which is just a pain. How do I make my service accessible? It's easy with docker and an nginx reverse proxy, but with kubernetes it's always seemed to be a real pain.

Just making your pod accessible is not as complicated as it seems. All you need are: 1. A kubernetes service resource. This just contains a selector that points at your pod. 2. An ingress. You point this at the service you just made. You will get a static IP. Point your DNS at that.

And you're good to go. This assumes that you are using a provider that comes with an 'ingress controller' out of the box (which is what actually makes the ingress function. It's usually just nginx). If not, install the nginx ingress controller with helm. Then install cert-manager with helm for tls cert provisioning.
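
To make the two resources above concrete, here is a minimal sketch in Python that builds a Service and an Ingress as plain dicts and prints them as JSON, which `kubectl apply -f` accepts. The app name, hostname, ports and the `nginx` ingress class are placeholders of mine, not anything from this thread, and it assumes an ingress controller is already installed.

```python
import json

APP = "my-app"            # hypothetical app name, also used as the pod label
HOST = "app.example.com"  # hypothetical hostname


def service(app: str, port: int, target_port: int) -> dict:
    """A ClusterIP Service whose selector points at pods labelled app=<app>."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": app},
        "spec": {
            "selector": {"app": app},
            "ports": [{"port": port, "targetPort": target_port}],
        },
    }


def ingress(app: str, host: str, port: int, ingress_class: str = "nginx") -> dict:
    """An Ingress that routes every path on <host> to the Service above."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": app},
        "spec": {
            "ingressClassName": ingress_class,
            "rules": [{
                "host": host,
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {"name": app, "port": {"number": port}}},
                }]},
            }],
        },
    }


def manifest_list(*items: dict) -> dict:
    """Wrap several objects in a v1 List so they fit in one JSON document."""
    return {"apiVersion": "v1", "kind": "List", "items": list(items)}


if __name__ == "__main__":
    bundle = manifest_list(service(APP, 80, 8080), ingress(APP, HOST, 80))
    print(json.dumps(bundle, indent=2))
```

Something like `python manifests.py | kubectl apply -f -` would then create both objects; the pods themselves still need a Deployment carrying the matching `app` label.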

So if I'm not using a cloud provider but just have a k8s cluster on a small VPS with a public IP, I just need the nginx ingress controller? What if I want HTTPS? Is there a way to automatically enable Let's Encrypt for different services and domains/subdomains?

There's a wealth of material online to explain these things. Let's Encrypt can integrate with a number of ingress controllers trivially. Like much of anything else, you need to actually experiment with it to understand how it all fits together.

Nginx ingress controller + cert-manager is the most common, best documented way of doing this. If you don't have a domain already pointing to your public IP, you can use nip.io.
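
A rough sketch of what that combination looks like on the Ingress side, in Python for brevity. It assumes cert-manager is installed and that a ClusterIssuer already exists; the issuer name `letsencrypt-prod`, the `app` prefix and the `<name>-tls` secret naming are placeholders I made up. The `cert-manager.io/cluster-issuer` annotation and the `spec.tls` block are what cert-manager's ingress integration looks at.

```python
import copy
import ipaddress


def nip_io_host(ip: str, prefix: str = "app") -> str:
    """Build a nip.io hostname that resolves to the given IPv4 address."""
    ipaddress.IPv4Address(ip)  # raises ValueError if ip is not valid IPv4
    return f"{prefix}.{ip}.nip.io"


def with_tls(ingress: dict, issuer: str) -> dict:
    """Return a copy of an Ingress dict with cert-manager TLS wiring added.

    The issuer must name an existing ClusterIssuer (assumed, not created here).
    """
    out = copy.deepcopy(ingress)
    hosts = [rule["host"] for rule in out["spec"]["rules"]]
    annotations = out["metadata"].setdefault("annotations", {})
    annotations["cert-manager.io/cluster-issuer"] = issuer
    # cert-manager stores the issued certificate in this Secret.
    out["spec"]["tls"] = [{
        "hosts": hosts,
        "secretName": f"{out['metadata']['name']}-tls",
    }]
    return out
```

For example, `with_tls(my_ingress, "letsencrypt-prod")` leaves the original dict alone and returns one with the annotation and a `tls` section covering every host in its rules.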

Cert manager

It's not quite that simple if you are not with a provider that offers LoadBalancer service integration with Kubernetes. Normally the way into a Kubernetes cluster is essentially a NodePort. That's normally a more or less random high port (like 31453) that is exposed on all nodes (or on the nodes that run the pods matching the selector). Unless you want your visitors to add that to the URL (and keep DNS up to date with active nodes), using it to provide end-user accessible HTTP/HTTPS services is not very viable.

You either need to find/create an integration with the provider's load balancer (or possibly a CDN that allows non-1:1 port mapping) or use a HostPort service. The latter has its own share of problems as well.

I honestly kinda like kubernetes and I have no problem tying together a bunch of distributed resources in my head.

The biggest nightmare for me is networking, simply because I'm not trained in networking. I know the basics to become a senior sysadmin but it's not natural to me. So mix in kubernetes and it becomes even more abstract.

I hate that kubectl wants all the images to be already built. Instead I'm forced to keep docker-compose yaml around to actually build the damn things first. Which introduces more yaml that kubectl will insist on reading.

The documentation at the main kubernetes site is poor, and is being deprecated, but not in favour of anything new.

Give Skaffold a shot, takes a lot of the pain out of running locally. I think of it as a simple wrapper around Docker, Kubernetes, Helm and Port-Forwarding (plus other options I don't use) that makes it easy to use the same build/deploy definition to work locally or build/deploy in CI/CD.

Couple examples:

- Build the images that need it with "skaffold build"

- Can watch for changes and rebuild automatically when you "skaffold dev"

- Can automatically detect your services (or arbitrary) ports and forward them when working locally

- Advanced features like profiles and modules for supporting multiple environments


Interesting. You build images at deployment time? Why?

Quick iteration. More like I deploy at build time. CI/CD takes minutes to deploy to a shared environment. Rather than contending for a bottleneck, we do local development on a single node K8S environment.

Maintaining docker-compose as well is a pain, and it's repeating ourselves.

Kubernetes itself. Maybe I just need some more "hands-on therapy".

Overall it is a great tool and I completely do not get the hate for it, but I did have some issues with AWS EKS. We made two mistakes in our project: using the k8s API instead of DNS for discovery, and using environment variables instead of config maps. This ended up overloading the master nodes, which started throttling sporadically, especially during load spikes. The AWS EKS support team seemed really puzzled by this, and it took us weeks to get to the root cause, even with their support. This might be considered more of an AWS issue than a k8s one.

The only nightmare is the same one as with npm and pip etc.: dependencies hiding behind other dependencies and having badly documented charts being used that don't share basic lifecycle information.

Kubernetes itself has always been fine, just like a bare network or bare OS has been, but when you start stacking stuff built by other people (especially when the stuff isn't of the best quality) it just goes downhill from there.

Perhaps the actual nightmare is inadequate quality control... but that's not really specific to packaging or shared components in Kubernetes.

I don't like to upgrade things that are working. Unfortunately that bit me in the butt with Kubernetes.

Long story short, a node crashed, and when it came back up, the pods wouldn't start. We spent a couple of days trying to figure it out, but nothing was working. This was in production, so we made the choice to rebuild the entire cluster with a newer version. We still had other nodes running, and were scaled enough that there was no complete downtime, but we were maxing out the CPU and some connections were getting dropped.

Mine is fairly boring. Overall - the tool does the job, and I much prefer it to hand managing servers, or some of the previous VM based management solutions.

My two biggest gripes:

- Loss of visibility, especially related to inspecting network data as it moves from LB to pod.

- Half baked tooling around the eco-system, although this does seem to be slowly improving

My two biggest likes:

- I genuinely save a bunch of time with it at this point (it still occasionally sucker punches me)

- I can take the experience from my day job and self-host quite a large number of useful applications at home on old hardware.

I think my top ones are endpoint security software or running on redhat OSes. I used to work on a kubernetes distribution, so a large amount of the support escalations and workload was around shipping kubernetes to customers who weren't as apt as the HN crowd.

Endpoint Security Software, just because it's adding some policy, that usually isn't written by the team trying to run the application, and will apply the policy sometimes in non-obvious ways. Even when you think it's turned off, sometimes it isn't, and the vendor will leave kernel modules running and partial configurations.

Red Hat was more a result of the stability policy for the kernel, and often running much older kernels than other distributions. We had lots of problems with the more modern kernel features used by Kubernetes, which we had to track down and often link to known fixes. We even had one customer replacing their kernels so they wouldn't have as many issues. This may be less and less the case with newer Red Hat releases, and I also have no reason to believe OpenShift suffers in the same way... just that I've spent a large amount of time troubleshooting this.

Honestly I don't get the hate k8s gets.

I run my own clusters and it just works.

Sure I have to ignore a lot of crap in the setup phase, there are so many products out there I don't want to pay for. The nightmare may come from some devop installing a bunch of helm charts without configuring things properly.

Scaling down to a minimal cluster is a real concern: I would like to run k8s for some micro project that literally runs on a $5 VPS, but it's too heavy for that.

Never in my experience when someone says it just works does it just work in reality. Usually they are hiding some things that they are keeping up with but for some reason don't include those issues as the thing not working.

I look after a lot of k8s professionally (hundreds of clusters), and the only issues I've had in the last few years were an istio/pilotd bug and a bit of pain moving from e.g. v1beta1->v1 during normal upgrades.

Everything else that's gone wrong with these environments was cloud/hw related, or app stuff.

Guess it depends on where your competencies are.

> Guess it depends on where your competencies are.

That's true of almost everything, no?

I use k8s myself (nowhere near that scale, basically the smallest scale you can think of... one cluster...) and don't think it is a bad product.

What about using k3s as a minimal cluster?

I find that even k3s' low overhead is often too heavy to run in tiny VMs and SBCs.

Overall Kubernetes is far better than anything else I've used to manage deployments and production workloads. That said, what gives me the heebie-jeebies (and what has caused outages, for me at least) is:

1. Managing etcd nodes -- Reconciliation is a patient waiting game; try to rush it and you'll lose your cluster.

2. Kubernetes Networking -- It is nearly impossible to trace packets coming through an LB into a Kubernetes pod without a very deep understanding of the different networking layers and CNIs. A lot can go wrong here.

3. Running persistent volumes in Kubernetes. This can range from outright unstable and dangerous to annoying, and at the very best you intermittently lose access to services due to volume claims being detached/reattached. Would highly avoid this.

4. Running "sticky" services. StatefulSets can allow you to run enumerated services with sticky sessions, but my experience with any sticky service is that it tends to be somewhat volatile, as Kubernetes really loves to move workloads at its convenience. I've found StatefulSets to be a red flag when considering putting something in Kubernetes.

Someone hating on kubernetes and cobbling together their own stuff that "does the same, but simpler". Usually only simpler for that person.

Love kubernetes. And at the same time it's not the magical thing we all expect.

But my main pain points are around Kubernetes and all the hidden stuff.

Kubernetes alone is not enough; you need Terraform or Helm (or both) to have something manageable and deployable by a team. When things error out or do not behave the way you expected, it all becomes so complicated or cryptic that you are sometimes better off deleting an entire resource than understanding the underlying issue.

Some stuff like dependency between resources (e.g: Deployments depending on ConfigMap, updating the ConfigMap won't restart the deployment) makes things a lot more complicated than you expect.
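
One common workaround for the ConfigMap case (borrowed from the Helm world, not something this comment proposed) is to stamp a hash of the config into the pod template annotations, so that any config change also changes the Deployment and triggers a rollout. A minimal sketch, with the annotation key `checksum/config` being just a convention:

```python
import copy
import hashlib
import json


def config_checksum(configmap_data: dict) -> str:
    """Stable SHA-256 over the ConfigMap's data, independent of key order."""
    canonical = json.dumps(configmap_data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stamp_deployment(deployment: dict, configmap_data: dict,
                     key: str = "checksum/config") -> dict:
    """Return a copy of the Deployment with the config hash in the pod template.

    A change to the pod template (including its annotations) makes the
    Deployment controller roll out new pods.
    """
    out = copy.deepcopy(deployment)
    template_meta = out["spec"]["template"].setdefault("metadata", {})
    template_meta.setdefault("annotations", {})[key] = config_checksum(configmap_data)
    return out
```

Applying the stamped Deployment alongside the ConfigMap means an edit to the config produces a different annotation value and therefore a restart, while an unchanged config produces the same hash and no rollout.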

There is too much vendor-specific stuff necessary to make a Kube cluster work, so you cannot expect to have one Terraform setup that is multi-cloud, etc.

Don't use TF for k8s. That is the one lesson I have learned. Use jenkins + shell scripts for k8s. Drive vendor tools (eksctl) or use kube* tools.

Debugging CrashLoopBackOff

I'm probably the wrong person to ask. My Kubernetes nightmare is having to configure anything from scratch. Or having to fix something that doesn't work.

My comfort zone is where Kubernetes works fine and I don't have to touch it, or only update trivial stuff.

A failed upgrade of a CNI plugin on a production cluster. Since then I always have a blue/green cluster deployment at hand, with each cluster containing the whole production environment, and I flip between them via a load balancer in front of the clusters.

For stateless applications - doing a blue/green cluster upgrade is really the smartest choice.

I ended up porting a few clients we had in a company that were already running on Docker to a Kubernetes cluster. The major issues came from trying to push everything there. I think it works very well to manage web clients.

Problems started when we tried to push too many things into the clusters. Databases, and especially ElasticCache with Kibana to collect metrics from the cluster, ended up killing the performance.

So it's like everything: some cases are great for K8s, some are terrible. This, plus the complex abstractions, makes it not that developer friendly, but overall it does a good job of running and scaling services without having to worry too much about hardware.

Here's another: fitting cluster addressing plans into my organization's ridiculously constrained IPv4 addressing plan. I find it crazy that a modern technology like k8s was not IPv6-native from the start.
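
For what it's worth, the first sanity check on any such plan can be scripted with the standard library. The sketch below just flags overlapping ranges; the names and CIDRs in `plan` are example values I picked, not anyone's real network, and it only handles a single address family at a time.

```python
import ipaddress
from itertools import combinations


def find_overlaps(named_cidrs: dict) -> list:
    """Return every pair of names whose CIDR ranges overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in named_cidrs.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(nets.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Example plan (all values are illustrative assumptions).
plan = {
    "corp-lan": "10.0.0.0/8",
    "pods": "10.244.0.0/16",
    "services": "10.96.0.0/12",
}

if __name__ == "__main__":
    for a, b in find_overlaps(plan):
        print(f"overlap: {a} ({plan[a]}) and {b} ({plan[b]})")
```

With this example plan both cluster ranges collide with the corporate /8, which is exactly the situation that forces people into carving out small, awkward pod CIDRs.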

We have been using OpenShift at work and it has been relatively troublesome. To be honest, when we have infrastructure issues (which is not often) we just have some Red Hat consultant fix them.

Disclaimer: Former Red Hat Consultant who fixed people's openshift :-)

Can you share more about your openshift stack? For example do you have a storage solution (and is it openshift container storage)? Are you running on top of vmware, aws, etc?

You're far from the only one to have that problem. I love OpenShift and generally do recommend it, so I've been trying to think of ways to improve on that situation. I don't work for Red Hat anymore, but the product is wonderful and (despite not being perfect) is IMHO the best one out there, so I want to see it succeed. It always felt to me like it was just enough of a black box as to be hard for an outsider to get into and tinker/debug.

I'm not sure Kubernetes is worse than any other complex, distributed system.

OpenStack, Pivotal Cloud Foundry, internal compute platform

So far I think the nightmare problem is people trying to run it and CNCF software (Prometheus, various operators) with only a cursory understanding of how it works (me included)

It's easy to shoot yourself in the foot (oops, forgot requests on a resource intensive, high replica count deploy and hosed cluster autoscaling)
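
The "forgot requests" foot-gun is at least easy to lint for before anything reaches the cluster. A minimal sketch, assuming the manifest has already been loaded into a dict (from YAML or JSON) and has the usual Deployment/StatefulSet shape:

```python
def containers_missing_requests(workload: dict, resources=("cpu", "memory")) -> list:
    """List (container name, missing resources) for a Deployment-like manifest.

    A resource counts as covered if it has a request, or a limit, because
    Kubernetes uses the limit as the request when only the limit is set.
    """
    pod_spec = workload["spec"]["template"]["spec"]
    findings = []
    for container in pod_spec.get("containers", []):
        res = container.get("resources") or {}
        covered = set(res.get("requests") or {}) | set(res.get("limits") or {})
        lacking = [r for r in resources if r not in covered]
        if lacking:
            findings.append((container["name"], lacking))
    return findings
```

Run in CI against every manifest, a non-empty result would be a reasonable reason to fail the build, since the scheduler and the cluster autoscaler both work from requests rather than actual usage.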

TLDR: Complexity.

The deprecation lifecycle, and running ingress controllers in an automatic scaling group.

The first isn't as much of an issue if you have a (partially) dedicated team for managing your clusters, but can be prohibitively expensive (effort / time-wise) for smaller organisations.

The second highlights a bigger problem in K8s in general. I'll have to give a little background first:

If you run an Nginx ingress controller on a node that's part of an ASG — i.e. a group where nodes can disappear, or increase in number — you will experience service disruption to a small percentage of your requests, every time a scaling event occurs. This is caused by a misalignment between timeout values for your load balancer and Nginx, which can not be fixed:

* https://github.com/kubernetes/ingress-nginx/issues/6281
* https://github.com/kubernetes/ingress-nginx/issues/6791
* https://github.com/kubernetes/ingress-nginx/issues/7175

The fix is to only run the controllers on nodes that reside in a separate statically sized group, and perform updates to them out of hours when necessary :|

I'll leave you to decide on whether that's a fix or not, but the larger point it highlights is how _theoretically_ everything's great in K8s, but the headaches introduced by the complexity often make it not worth it.

Another example is pod disruption budgets. These are needed because the behaviour of K8s when instructed to shutdown a node is, well, to shutdown the node. Seems reasonable, until you realise that it doesn't handle moving the workloads off that node _first_. No, at some point later, the scheduler realises the pods aren't running and schedules them somewhere else. So you use a combination of PDBs to tell K8s that it must keep n pods of this deployment running at all times, and distribution rules to tell it pods must run on different nodes. This solution falls apart when you have pods that should only have a single instance running.
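
A small sketch of the PDB side of this, again as Python dicts rather than YAML. The names and labels are placeholders; `allowed_disruptions` is my simplified reading of how an integer `minAvailable` budget works out (healthy pods minus the minimum, never below zero), not the controller's actual code.

```python
def pdb(name: str, match_labels: dict, min_available: int) -> dict:
    """A policy/v1 PodDisruptionBudget keeping at least min_available pods up."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": match_labels},
        },
    }


def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """How many pods may be evicted voluntarily right now (simplified)."""
    return max(0, healthy_pods - min_available)


if __name__ == "__main__":
    # Three replicas with minAvailable=2: one pod at a time may be evicted.
    print(allowed_disruptions(3, 2))
    # A single-instance workload with minAvailable=1: no eviction is ever allowed.
    print(allowed_disruptions(1, 1))
```

The second case is the single-instance problem described above: the budget either blocks the eviction entirely or has to be loosened to the point where it no longer protects anything.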

This is always a problem with ingress/load balancer unless you buy some REALLY fancy (auto failover) enterprise hardware loadbalancers unfortunately, regardless of K8S. It does make it harder to see what is going on though or work around it.

While I'm venting K8s frustration: They told me K8s was awesome because it allows me to easily test things locally on my machine. They told me minikube would solve it. Minikube ran out of memory, was unstable, crashed. Production YAML required certmanager not on my machine. Production YAML required volume manager not on my machine.

Some very basic things look very hard to me. Right now I’m scratching my head how do I implement even non-HA WireGuard server in a pod, so wg clients can access a pod network and pods can access wg client network. Seems like a very basic requirement for any installation yet zero guides about it. And don’t even talk about HA server with load balancer.

If you find anything about it, please do share. I recall I saw something similar somewhere. Maybe it was OpenVPN.

Operators/OLM/channels/operand versioning/upgrade testing.

Even typing out the words makes me want to lie down in a dark room.

It's the lack of good first-class observability and good administration tooling into what's going on. There's a bunch of third-party tools, and those seem to be lacking. The CLI is abysmal, tbh. Good luck figuring out why your pod crashed, or why it's frozen, or where the logs from non-existing pods are, etc...

A client who was on version 1.13 (which was the current version when they started the project in 2018) being forced to upgrade by the aws managed kubernetes service. I'm not sure what they ended up doing but they were facing the requirement of having to upgrade the nodes and the control plane one version at a time.
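
To put a number on "one version at a time": a tiny sketch that lists the intermediate minor versions between where you are and where you need to be. The target versions in the usage note are arbitrary examples, since the comment doesn't say which version the client was being pushed to.

```python
def upgrade_path(current: str, target: str) -> list:
    """Minor versions to step through, one at a time, from current to target."""
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles upgrades within one major version")
    if tgt_minor < cur_minor:
        raise ValueError("target is older than current")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]
```

`upgrade_path("1.13", "1.16")` gives `['1.14', '1.15', '1.16']`, and each of those steps means upgrading the control plane and then the nodes, so the work grows linearly with how far behind you are.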

For me the primary thing is OOM killed. But swap support is being worked on.

And then perhaps the proper handling of persistent disk.

The thing itself. I’ve been avoiding Java for most of my career and will do the same with kubernetes.

Having to use it. Same for Docker.

If anyone is looking for some tutorial/guides for a homelab. This is a great wiki. https://wiki.technotim.live/en/kubernetes


Kubernetes Failure Stories A compiled list of links to public failure stories related to Kubernetes. Most recent publications on top.


I haven't even touched intra-cluster networking, so it must be networking.

Out of the things I have touched, it's load-balanced ingress (when running on premises). So yeah, it's networking.

The iptables rules are mostly unrelated to your CNI plugin. They're added/managed by kube-proxy to provide your internal service routing and load balancing.

- PSP retirement and trying to define a replacement with 100% coverage. Gatekeeper seems to be the heir-apparent.

- Migration of all our customer workloads from PSP to gatekeeper.

The decoupling of ingress and deployments always bothered me, although it might not be a _nightmare_ exactly.

In short, the ingress may route traffic to a pod after it is killed. The solution is that when a pod gets a SIGTERM signal, it should mark itself not ready, wait for some amount of time and then shut down (see e.g. https://deepsource.io/blog/zero-downtime-deployment/). I've heard arguments for this behavior, but it's not the same trade-offs I would make.
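
The SIGTERM dance described above looks roughly like this in a toy Python HTTP server. It is a sketch under assumptions: the `/ready` path, port 8080 and the 10 second drain are values I picked, and the pod's readiness probe is assumed to point at `/ready`.

```python
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

DRAIN_SECONDS = 10  # assumed grace period; keep it below terminationGracePeriodSeconds

shutting_down = threading.Event()


def readiness_status() -> int:
    """HTTP status for the readiness probe: 503 once shutdown has begun."""
    return 503 if shutting_down.is_set() else 200


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only the readiness path reports "draining"; normal traffic is still served.
        code = readiness_status() if self.path == "/ready" else 200
        self.send_response(code)
        self.end_headers()
        self.wfile.write(b"ok\n" if code == 200 else b"draining\n")

    def log_message(self, *args):
        pass  # keep the example quiet


def serve(port: int = 8080, drain_seconds: float = DRAIN_SECONDS) -> None:
    """Serve until SIGTERM, then fail readiness, wait, and stop."""
    server = HTTPServer(("", port), Handler)

    def on_sigterm(signum, frame):
        shutting_down.set()  # step 1: mark ourselves not ready

        def stop():
            time.sleep(drain_seconds)  # step 2: let routers notice and drain
            server.shutdown()          # step 3: stop accepting requests

        # shutdown() must not be called from the thread running serve_forever().
        threading.Thread(target=stop, daemon=True).start()

    signal.signal(signal.SIGTERM, on_sigterm)
    server.serve_forever()
    server.server_close()
```

Calling `serve()` from the main thread starts it. In-flight and late-arriving requests keep being answered during the drain window, which is the whole point of waiting before shutting down.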

In most cases you can fix this with the service upstream annotation, which has existed since 2017: https://github.com/kubernetes/ingress-nginx/issues/257

I did not know that. Thanks for explaining.

On AWS, the best way we found to use K8S is using Fargate and using Copilot to create and manage infrastructure and deployments.

Late to the party, but figured I'd share my own story (some details obviously changed, but hopefully the spirit of the experience remains).

Suppose that you work in an org that successfully ships software in a variety of ways - as regular packaged software that runs on an OS directly (e.g. a .jar that expects a certain JDK version in the VM), or maybe even uses containers sometimes, be it with Nomad, Swarm or something else.

And then a project comes along that needs Kubernetes, because someone else made that choice for you (in some orgs, it might be a requirement from the side of clients, others might want to be able to claim that their software runs on Kubernetes, in other cases some dev might be padding their CV and leave) and now you need to deal with its consequences.

But here's the thing - if the organization doesn't have enough buy-in into Kubernetes, it's as if you're starting everything from 0, especially if paying some cloud vendor to give you a managed cluster isn't in the cards, be it because of data storage requirements (even for dev environments), other compliance reasons or even just corporate policy.

So, I might be given a single VM on a server, with 8 GB of RAM for launching 4 or so Java/.NET services, as that is a decent amount of resources for doing things the old way. But now, I need to fit a whole Kubernetes cluster in there, which in most configurations eats resources like there's no tomorrow. Oh, and the colleagues also don't have too much experience working with Kubernetes, so some sort of a helpful UI might be nice to have, except that the org uses RPM distros and there are no resources for an install of OpenShift on that VM.

But how much can I even do with that amount of resources, then? Well, I did manage to get K3s (a certified K8s distro by Rancher) up and running, though my hopes of connecting it with the actual Rancher tool (https://rancher.com/) to act as a good web UI didn't succeed. Mostly because of some weirdness with the cgroups support and Rancher running as a Docker container in many cases, which just kind of broke. I did get Portainer (https://www.portainer.io/) up and running instead, but back then I think there were certain problems with the UI, as it's still very much in active development and gradually receives lots of updates. I might have just gone with Kubernetes dashboard, but admittedly the whole login thing isn't quite as intuitive as the alternatives.

That said, everything kind of broke down for a bit as I needed to setup the ingress. What if you have a wildcard certificate along the lines of *.something.else.org.com and want it to be used for all of your apps? Back in the day, you'd just setup Nginx or Apache as your reverse proxy and let it worry about SSL/TLS termination. A duty which is now taken over by Kubernetes, except that by default K3s comes with Traefik as their ingress controller of choice and the documentation isn't exactly stellar.

So for getting this sort of configuration up and running, I needed to think about a HelmChartConfig for Traefik, a ConfigMap which references the secrets, a TLSStore to contain them, as well as creating the actual tls-secrets themselves with the appropriate files off of the file system, which still feels a bit odd and would probably be an utter mess to get particular certificates up and running for some other paths, as well as Let's Encrypt for other ones yet. In short, what previously would have been those very same files living on the file system and a few (dozen?) lines inside of the reverse proxy configuration, is now a distributed mess of abstractions and actions which certainly need some getting used to.
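
For anyone landing here with the same wildcard-certificate problem, this is a sketch of two of the pieces mentioned above, the TLS Secret and the TLSStore, built as Python dicts. The secret name and namespace are placeholders, and the TLSStore `apiVersion` is an assumption that depends on the Traefik release (older ones use `traefik.containo.us/v1alpha1`, newer ones `traefik.io/v1alpha1`), so check what your cluster actually serves. It does not cover the HelmChartConfig or ConfigMap parts.

```python
import base64


def tls_secret(name: str, namespace: str, cert_pem: bytes, key_pem: bytes) -> dict:
    """A kubernetes.io/tls Secret holding a certificate and its private key."""
    def encode(raw: bytes) -> str:
        return base64.b64encode(raw).decode("ascii")

    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "type": "kubernetes.io/tls",
        "metadata": {"name": name, "namespace": namespace},
        "data": {"tls.crt": encode(cert_pem), "tls.key": encode(key_pem)},
    }


def default_tls_store(secret_name: str, namespace: str = "default",
                      api_version: str = "traefik.containo.us/v1alpha1") -> dict:
    """A Traefik TLSStore named 'default' pointing at the Secret above.

    api_version is an assumption; adjust it to the CRD version installed.
    """
    return {
        "apiVersion": api_version,
        "kind": "TLSStore",
        "metadata": {"name": "default", "namespace": namespace},
        "spec": {"defaultCertificate": {"secretName": secret_name}},
    }
```

In practice the two byte strings would come from reading the certificate and key files off disk, and both objects would be dumped to JSON or YAML and applied in the same namespace.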

Oh, and Portainer sometimes just gets confused and fails to figure out how to properly setup the routes, though I do have to say that at least MetalLB does its job nicely.

And then? Well, we can't just ship manifests directly, we also need Helm charts! But of course, in addition to writing those and setting up the CI for packaging them, you also need something running to store them, as well as any Docker images that you want. In lieu of going through all of the red tape to set that up on shared infrastructure (which would need cleanup policies, access controls and lots of planning so things don't break for other parties using it), instead I crammed in an instance of Nexus/Artifactory/Harbor/... on that very same server, with the very same resource limits, with deadlines still looming over my head.

But that's not it, for software isn't developed in a vacuum. Throw in all of the regular issues with developing software, like not being 100% clear on each of the configuration values that the apps need (because developers are fallible, of course), changes to what they want to use, problems with DB initialization (of course, still needing an instance of PostgreSQL/MariaDB running on the very same server, which for whatever reason might get used as a shared DB) and so on.

In short, you take a process that already has pain points in most orgs and make it needlessly more complex. There are tangible benefits for using Kubernetes. Once you find a setup that works (personally, Ubuntu LTS or a similar distro, full Rancher install, maybe K3s as the underlying cluster or RKE/K3s/k0s on separate nodes, with Nginx for ingress, or a 100% separately managed ingress) then it's great and the standardization is almost like a superpower (as long as you don't go crazy with CRDs). Yet, you need to pay a certain cost up front.

What could be done to alleviate some of the pain points?

In short, I think that:

  - expect to need a lot more resources than previously: always have a separate node for managing your cluster and put any sorts of tools on it as well (like Portainer/Rancher), but run your app workloads on other nodes (K3s or k0s can still be not too demanding with resources for the most part)
  - don't actually shy away from tools like Portainer/Rancher/Lens for making the learning curve more shallow, inspect the YAML that they generate, familiarize yourself with the low level stuff as necessary, while still having an easy to understand overview of everything
  - don't forget about needing somewhere to store Helm charts and container images, be it another node or a cloud offering of some sort
  - if you can, just go for the cloud, but even if managed K8s is not in the cards for you, still strive at least for some sort of self-service approach for the inevitable reinstalls
  - speaking of which, treat your clusters as *almost* disposable, have all of the instructions for preparing them somewhere, ideally as an executable script (maybe use Ansible)
  - don't stray too far away from what you get out of the box, also look in the direction of the most tried and tested solutions, like an Nginx ingress (Traefik with K3s should *technically* have the better integration, but the lack of proper docs works against it, you'll probably want something like a cookbook of sorts)
  - also manage your expectations, getting things up and running will probably take a long time and will be a serious aspect of development that cannot be overlooked; no, you won't have a cluster up and running on-prem with everything you need in 2 days
  - ideally, have a proper DevOps team or even just a group of people who'll spearhead information sharing and creating any sorts of knowledgebases or templates so it's easier in the future
So, in summary, it can be a nightmare if you have unrealistic expectations or an unrealistic view of how Kubernetes might solve all of your problems, without an understanding of the tradeoffs that it would require. I still think that Nomad/Swarm/Compose might work better for many smaller projects/teams out there, but the benefits of Kubernetes are also hard to argue against. If you manage to get that far, though, and only then.

It's so saddening to see how the Kubernetes hype-cycle follows OpenStack and all the fundamental problems still seem unsolved. I sometimes feel like it's just the same story playing out 5 years later, one layer up the stack (IaaS -> CaaS) and with other fools to fall for it (with OpenStack it was sysadmins trying to run a control plane, with Kubernetes it's devs trying to run infrastructure).

The abstractions we have available to build and run distributed systems may have improved, but they still suck in the grand scheme of things. My personal nightmare is that nothing better comes along soon.

> - Is it the networking model that is simple from the consumption standpoint but has too many moving parts for it to be implemented?

Many poor sysadmins before us have tried to implement Neutron (OpenStack Networking Service) with OvS or a bunch of half-assed vendor SDNs. Or LBaaS with HAProxy.

> - Is it the storage model, CSI and friends?

I mean, the most popular CSI for running on-premise is rook.io, which is just wrapping Ceph. Ceph is just as hard to run as ever, and a lot of that is justified by the inherent complexity of providing high performance multi-tenant storage.

> - Is it the bunch of controller loops doing their own things with nothing that gives a "wholesome" picture to identify the root cause?

Partially. One advantage the approach has is that it's conceptually simple, consistent and feels easy to compose complex behavior. The problem is that Kubernetes enforces very little structure, even basics like object ownership. The result is unbounded complexity. A lack of tooling (e.g. time travel debugging for control loops) makes debugging complex interactions next to impossible. This is also not surprising, control loops are a very hard problem and even simple systems can spiral (or oscillate) out of control very quickly. Control theory is hard. David Anderson has a pretty good treatise of the matter https://blog.dave.tf/post/new-kubernetes/

Compared to OpenStack, Kubernetes uses a conceptually much simpler model (control loops + CRDs) and does a much better job at enforcing API consistency. Kubernetes is locally simple and consistent, but globally brittle.

The downside is that it needs much more composition of control loops to do meaningful work, and that leads to exploding complexity because you have a bunch of uncoordinated actors (control loops) each acting on partial state (a subset of CRDs).
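
A toy illustration of that failure mode, in Python. Two control loops each look at the same field and each "correctly" drives it to its own desired value; neither knows about the other. The loop names and replica counts are made up, and this is nothing like a real controller, just the shape of the problem.

```python
def make_loop(field: str, desired):
    """Build a reconcile function that forces state[field] to its desired value."""
    def reconcile(state: dict) -> bool:
        if state.get(field) != desired:
            state[field] = desired  # act on partial knowledge of the world
            return True             # something was changed
        return False                # already converged, from this loop's view
    return reconcile


def run(state: dict, loops: list, ticks: int) -> list:
    """Run every loop once per tick and record the state after each step."""
    history = []
    for _ in range(ticks):
        for loop in loops:
            loop(state)
            history.append(dict(state))
    return history


if __name__ == "__main__":
    gitops = make_loop("replicas", 3)      # hypothetical: config repo says 3
    autoscaler = make_loop("replicas", 5)  # hypothetical: scaler wants 5
    for step in run({"replicas": 3}, [gitops, autoscaler], ticks=3):
        print(step)
```

Each loop is locally simple and always succeeds, yet the system as a whole never settles: the replica count flips between 3 and 5 forever, and nothing in either loop's own view tells you why.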

The implementation model of an OpenStack service, otoh, is much simpler because they use straightforward "workflows", working on a much bigger picture of global state, e.g. neutron owning the entire network layer. This makes composition less of a source of brittleness, not that OpenStack doesn't still have its fair share of that as well. Workflows are however much more brittle locally, because they cannot reconcile themselves in case things go wrong.

Extra layers, debugging things that are ephemeral, and overlay networks.
