Ask HN: What is your Kubernetes nightmare?
220 points by wg0 on June 27, 2022 | 259 comments
Everything self-hosted has its maintenance tax, but why is Kubernetes (especially self-hosted) so hard? What aspect is it that makes Kubernetes operationally so hard?

- Is it the networking model, which is simple from the consumption standpoint but has too many moving parts in its implementation?

- Is it the storage model, CSI and friends?

- Is it the bunch of controller loops each doing their own thing, with nothing that gives a holistic picture to identify the root cause?

For me personally, the first and foremost thing on my mind is the networking details. They are "automatically generated" by each CNI solution in slightly different ways and with different constructs (iptables, virtual bridges, routing daemons, eBPF, etc.), and because they are generated, it is not uncommon to find hundreds of iptables rules and chains on a single node, and/or similarly sprawling configuration.

Being automated, these solutions generate tons of components and configuration, and in case of trouble, even someone with mastery of them would take some time to hop through all the components (virtual interfaces, virtual bridges, iptables chains and rules, IPVS entries, etc.) to identify what's causing the problem. Essentially, one pretty much has to be a network engineer, because besides the underlying physical (or virtual, i.e. cloud VPC) network, k8s brings its very own network (pod network, cluster network) implemented in the software/configuration layer, which has to be fully understood in order to be maintained.

God forbid the CNI solution hits some edge case, or some other misconfiguration makes it keep generating inadequate or broken rules/routes, resulting in a busted "software defined network" that I cannot diagnose in time on a production system. That is my nightmare, and I don't know how to reduce that risk.

What's your Kubernetes nightmare?

EDIT: formatting




It's odd, but I actually really enjoy using Kubernetes in production.

We have a few rules:

1. Read a good intro book cover-to-cover before trying to understand it.

2. Pay a cloud vendor to supply a working, managed Kubernetes cluster.

3. Prefer fewer larger clusters with namespaces (and node pools if needed) to lots of tiny clusters.

4. Don't get clever with Kubernetes networking. In fact, touch it as little as possible and hope really hard it continues to work.

This is enough to handle 10-50 servers with occasional spikes above 300. It's not perfect, but then again, once you have that many machines, pretty much every solution requires some occasional care and feeding.

My personal Kubernetes nightmare is having to build a cluster from scratch on bare metal.


> 4. Don't get clever with Kubernetes networking. In fact, touch it as little as possible and hope really hard it continues to work.

This one.


Kubernetes on bare metal is actually pretty easy. Kubernetes on a hosted solution which doesn't have a managed version is prone to error. Usually on bare metal you can make some guarantees regarding bandwidth and storage speed. Trying to roll out a cluster on a service that can't give you these guarantees is truly a nightmare.


I would also say that if you are going to be administering clusters at your company that you should at least set up a cluster from scratch (doesn't have to be bare metal) and learn how the kubernetes control plane works by breaking it in various ways etc.

In my experience most people don't like black magic, they want something that they understand on some level. A fully managed k8s cluster is black magic, once you have set up a vanilla cluster you get a much better feeling about how the control plane works together to get things done.


I have tried several times over the past few years to install Kubernetes on bare metal, and it has never worked.

I don't mean installing it on VMs on a laptop, I mean on a real linux cluster of 8 to 32 nodes, with real networks and real switches.

Managing bare metal machines is a cakewalk compared to getting Kubernetes running in-house, at least in my experience.

Obviously the cloud providers do it, so it's possible. But IMO it is something you do only if you have a full-time admin team available to set it up and manage it. It's not by any stretch of the imagination something you install and forget about.


What were you using to install kubernetes?


Did you try using kubeadm to bootstrap installing kubernetes? It is pretty simple.
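For reference, the happy path with kubeadm is roughly the following sketch; the pod CIDR must match the CNI you choose (10.244.0.0/16 is Flannel's default), and the token/hash come from your own cluster:

    # on the control-plane node
    kubeadm init --pod-network-cidr=10.244.0.0/16
    # ...then install a CNI (Flannel, Calico, ...) from its published manifest...
    # on each worker node, using the join command kubeadm init prints:
    kubeadm join <control-plane-ip>:6443 --token <token> \
        --discovery-token-ca-cert-hash sha256:<hash>

As others note in this thread, the trouble usually starts after this point (load balancers, storage, ingress).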


> Kubernetes on bare metal is actually pretty easy.

I would not call it easy at all. Last time I tried that a year ago you still needed a special load balancer to get it going (https://metallb.universe.tf). Has this changed?


MetalLB is pretty simple to configure.


That's just not true, especially if you compare it to the LoadBalancer you get on a cloud platform which usually involves zero clicks. I'm not saying it's impossible but it's definitely not "easy".

Configuration instructions: https://metallb.universe.tf/configuration/

Hint: You'd better know what all of these are in your environment:

    For a basic configuration featuring one BGP router and one IP address range, you need 4 pieces of information:

    The router IP address that MetalLB should connect to,
    The router’s AS number,
    The AS number MetalLB should use,
    An IP address range expressed as a CIDR prefix.
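Concretely, those four pieces end up in a config roughly like this (the pre-CRD ConfigMap format; all addresses and AS numbers below are placeholders):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      namespace: metallb-system
      name: config
    data:
      config: |
        peers:
        - peer-address: 10.0.0.1
          peer-asn: 64501
          my-asn: 64500
        address-pools:
        - name: default
          protocol: bgp
          addresses:
          - 192.168.10.0/24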


Did you miss the part about layer 2 configuration, where you don't need BGP at all? https://metallb.universe.tf/configuration/#layer-2-configura...
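In that mode a pool definition is about this small (same ConfigMap format as the BGP example above, addresses are placeholders):

    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 192.168.1.240-192.168.1.250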


But then "When announcing in layer2 mode, one node in your cluster will attract traffic for the service IP."

This bottlenecking seems undesirable. At the very least, if you have one "main" traffic heavy service whichever node ends up servicing that IP address could have elevated cpu usage from processing all the network traffic via kube-proxy.

The obvious solution would be to allocate, say, two or more IP addresses for the service, with DNS round robin set up. Then, as long as those addresses are being handled by different nodes, you are not bottlenecking nearly as badly. But perhaps I am missing it: I'm not seeing a feature where you can force those two or more IP addresses to be claimed by different nodes. (If the feature were strict, you would want more data-plane nodes than IPs, so that having one node down doesn't leave part of the round-robin DNS unclaimed by any node.)


True. If you want true load balancing, you need a layer 3 solution (BGP.)


MetalLB has been in beta for YEARS. It's OK for dev/qa/staging, but I wouldn't put in prod.


I wouldn't use it in prod when there are other alternatives from cloud providers. But to say it is difficult to configure for a bare metal dev cluster is not true. The instructions are pretty clear.


I don't disagree; I think it is easy to install on a bare metal cluster, although I think using HAProxy is just as easy and probably a better solution. I was just pointing out that it has been in beta for a very long time.


HAProxy isn't complicated to set up.


Good rules.

>2. Pay a cloud vendor to supply a working, managed Kubernetes cluster.

If one is at that level already, I don't think there's anything better than AWS ECS out there. It just works. Just works. Yes, sure, it does not offer stateful workloads, for example, among other things, but it works for 90% of the cases.

> 4. Don't get clever with Kubernetes networking. In fact, touch it as little as possible and hope really hard it continues to work.

Pretty much... Each CNI generates the SDN in its own way, slightly differing from the others. It's like writing a program to print a chessboard on the terminal: it can be done in ten different ways.

Unfortunately, these implementation details aren't written down or documented anywhere, and of course they keep changing from release to release anyway. Your only way out if you have production workloads that can't go down without losing revenue? Just pay for support for the respective CNI, as only they know what the voodoo magic is under the hood.

Sure, you can read the source code, and all of them are open source, but that's not your main business or your main day job, and of course the solutions aren't trivial 100-line implementations either.


> If one is at that level already, I don't think there's anything better than AWS ECS out there.

100%. To answer the OP's question: my nightmare is having to use it at all. I work with small, very early-stage companies whose applications by and large are not complicated. Perhaps at some level of scale and/or complexity, k8s makes sense. For the vast majority of the cases I see, something like ECS does everything they need, while being significantly simpler to understand, develop for, and debug.


Do you still recommend they host their applications in containers (e.g. Docker)? I feel like it's fairly low effort to start out that way, but can be a pain to add later.


Being that they're all using ECS, yes containerizing using Docker is a prerequisite.


Doh. I am only familiar with Azure, and was confusing ECS with ordinary VMs. Sorry about the stupid question!


No worries, the cloud acronyms overlap so much these days, if it's helpful:

EC2: Elastic Compute Cloud, bare VMs.

ECS: Elastic Container Service, Docker containers.

EKS: Elastic Kubernetes Service.


> I don't think there's anything better than AWS ECS out there

Do you have any experience with Kubernetes on GCP being less good than AWS ECS? I'd expect them to be the gold-standard when it's a project coming from Google originally and we haven't had any Kubernetes problems that were related to GCP.


I have experience with Google's managed Kubernetes service (GKE).

It's basically great. Solid, few surprises, no compatibility issues with third party software packaged for Kubernetes. Autopilot looks even better – billing you only for the resources you allocate rather than for the full nodes, basically removing the bin-packing problem. Very little about our config was Google specific or would cause issues porting to another provider. It was up-to-date enough for us to use relatively new features, while lagging enough that everything felt pretty stable.

The only issue we had was wanting to use a somewhat obscure configuration for the Google Cloud load balancer instance that was underlying the Kubernetes ingress. This was possible, we just had to configure it manually in Terraform and point it at the cluster rather than being able to treat it as a cluster resource if I remember correctly. This was only a temporary solution while they were in the process of adding more custom control via K8s.

As far as I can tell it is considered to be the gold-standard.

Disclaimer: I now work for Google on non-cloud stuff, but this was my experience doing a port from bare-metal to GKE.


> 3. Prefer fewer larger clusters with namespaces (and node pools if needed) to lots of tiny clusters.

This is interesting - the last time I worked with Microsoft engineers from Azure, they said exactly the opposite.

One workload = One cluster.

"There are too many shared resources in Kubernetes that can leak collateral damage from one workload to another."


Azure might also require special precautions. Honestly, I've seen Azure have a lot of networking issues, for example. But this is based on scuttlebutt and limited personal experience.

I've found that on GKE, certain workloads benefit from a dedicated node pool. This gets them their own CPU, RAM, and volume I/O. Yes, I could imagine that there are shared Kubernetes control plane resources that might be affected, but I haven't seen that with any of our workloads. It might get more complicated if you have lots of in-cluster networking.

But none of this is my area of expertise. I just think that Kubernetes can mostly be pretty pleasant in practice for companies that have outgrown PaaS offerings like Heroku and Render.com.


Not really. You can do fairly large clusters; you just need differently sized node pools. For example, we run Apache NiFi in AKS, which is a complete memory and CPU hog. We have a 16 CPU / 64 GB RAM node pool for that workload, which we target with a node selector. Microservices use a different node pool. System services run on the default node pool.

If you're running Azure Functions with KEDA, set up a node pool for that with a lower CPU/memory footprint.
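Roughly, the pinning is just a label match in the pod spec - something like this, where "nifipool" stands in for whatever the pool is actually named (AKS exposes the pool name via the agentpool label):

    # fragment of a Deployment's pod template
    spec:
      template:
        spec:
          nodeSelector:
            agentpool: nifipool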


It's really quite frustrating to find that out from experience. Things like operators, anything Cluster*, named resources, etc.

If you're doing relatively simple things with the cluster, then you can do namespaces. The more custom shit you do, the better off you are with true isolation.


What is their definition of a workload? Do they want a cluster per microservice? Per application? Per customer?


We had different sets of microservices, each doing a specific part of the system.

One part was responsible for data transformation and the other was responsible for user modifications.

It's hard to tell what the cutting point is; it depends on the system architecture.


Wouldn't a namespace make more sense than a whole cluster?


Why is bare metal a nightmare? I have a project coming up which must be on bare metal, so I was thinking of doing this. Also, if it's so bad, what's better to use on bare metal? Thanks


My experience with bare metal is multi-fold:

* Documentation sees it as a second-class citizen, if that (loadbalancers, volumes are heavily biased towards cloud providers)

* Many cloud-provided instances of kubernetes will always use the exact same VMs backing the nodes. So they really don't have to care all that much about what config your bare metal cluster has or needs.

RancherOS/K3s can be really quite nice for getting bare-metal clusters up and going really fast. They don't always feel the most complete, though, mostly lacking around failure documentation. Even RancherOS has a bias towards cloud clusters, but it's quite easy at least to get a simple k3s cluster going. I'd personally recommend going that way: RancherOS if you're managing multiple clusters, plain k3s if you're doing just one. It even comes with a pretty decent LoadBalancer and volumes. If you need better management of volumes, Longhorn or MinIO isn't bad.

microk8s/KinD are for dev environments only, and I wouldn't recommend them for any bare metal cluster. 'Fun' to screw around with, though.

Edit: I had a lot of really obnoxious DNS problems, mostly due to the docker daemon and how the system config would interact with k8s/k3s. Super annoying when you can get everything working in docker containers manually, but not in k8s. Once you get your bare metal system configured to work, it'll be fine. It's also very confusing how many different network options there are, and their claims are dubious at best.

To expand on the network subsystems: Canal/Calico/Flannel, IPVS-based vs iptables-based, etc. We did a bunch of low-latency (sub-ms) perf testing of IPVS vs iptables. The docs say IPVS should be both faster (throughput) and lower latency. Tested evidence did not show that to be the case, for either small or large numbers of pods. This was a small cluster, so that could be affecting the results.

Never mind that it's a rather huge PITA to switch between them all. Rancher/K3S makes it a bit easier, but still annoying.


> loadbalancers, volumes are heavily biased towards cloud providers

Can you even run a "loadbalancer" if all you have is a single machine with a single IP behind a router you don't control? I got stuck on that the last time I tried running my own kubes.


Not necessarily a router you don't control, but MetalLB does provide some nice LoadBalancer constructs for a bare-metal deployment. Putting VyOS in front of it is magical!

https://metallb.universe.tf/


K3s ships its own load balancer, so yes. It will add extra hops to any of your services if you care about sub-ms latency.

I looked at MetalLB, and it didn't really fit with what we wanted to do, so YMMV. It's pretty limited unless you control a lot about your IP space.


Why would you need a load-balancer if you only have a single machine?


Because kubernetes says so? Can you run it without a load balancer?


You can use NodePort instead of a LoadBalancer. Or use MetalLB if you insist on having an LB so that it's closer to a real production environment.
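A NodePort Service is only a few lines and needs nothing outside the cluster - a sketch, with the app label and ports made up:

    apiVersion: v1
    kind: Service
    metadata:
      name: my-app
    spec:
      type: NodePort
      selector:
        app: my-app
      ports:
      - port: 80          # Service port inside the cluster
        targetPort: 8080  # container port
        nodePort: 30080   # exposed on every node's IP (30000-32767 by default)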


“Real production” smh. This is why docs make baremetal second class citizens, people assume something has to be a certain way for it to be “real.”


There's always one more thing that you need to install to have a working cluster that comes out of the box in cloud.

You want networking? OK, go read about Calico, Flannel, Cilium, etc and choose one. If you didn't fully read the instructions for the networking plugin you plan to use, plan to blow away your cluster and set it back up from scratch with the correct RFC1918 address range for your network plugin that doesn't conflict with your presumably existing network. Plan to dive in and re-jigger things when you need IPv6.

You want a working LoadBalancer? OK, now you need MetalLB or PureLB, among others. Make sure your IPAM people know that you've blocked off several addresses or a CIDR range for K8S dynamic address allocation. IPs allocated via K8S aren't going to respond to ICMP packets and people will assume they're unused :)

You want ingress controllers? OK, well, you can pick from Nginx or Traefik. There are actually a ton of them, but those seem to be the most popular.

You want certificate management? OK, go install CertManager. You'll need to have programmatic access to your DNS providers if you want to use Let's Encrypt with wildcard certificates.
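For instance, a wildcard setup ends up being a ClusterIssuer pointing at your DNS provider's API - a sketch assuming Cloudflare with a pre-created API token Secret; names and email are placeholders:

    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: letsencrypt-prod
    spec:
      acme:
        server: https://acme-v02.api.letsencrypt.org/directory
        email: admin@example.com
        privateKeySecretRef:
          name: letsencrypt-account-key
        solvers:
        - dns01:
            cloudflare:
              apiTokenSecretRef:
                name: cloudflare-api-token
                key: api-token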

Oh, you need some kind of volume provider? Well, there's hostPath, but people generally don't recommend that for security reasons. I guess you could use the NFS volume provider, but that's a little creaky for all of the usual reasons NFS has been creaky for the last 30 years. You could go install Rook - but that's another entire complex distributed system on top of your distributed system. (I love Ceph, BTW - but this is really overwhelming for a new person)

At this point you have essentially a working cluster, probably with a single master unless you set up something like OKD, in which case you already had to stand up an entire HAProxy setup before even approaching the K8S parts.

Prepare to have a not-insignificant number of full-time employees keeping the plane flying while you swap out the wings in real time to keep up with the fast K8S release cycle.

IMO, the complexity of K8S really incentivizes trashing all of your on-prem hardware and just paying for cloud. That's the end game.


I actually found bare metal to be fairly pleasant, and because I built it I understood a ton about how it worked so was able to figure out issues a lot easier.

My advice would be to take careful notes about your setup steps though, even if you're following a guide. For some reason in the k8s world I have a hard time finding blog posts/guides/etc that I used months later, and Chrome seems to eat my bookmarks :-(. I suspect SEO is a ruthless beast when it comes to K8s.


I did a project 5 years back that had to be bare metal, and going for Kubernetes was probably the worst project decision I've made so far. We didn't have the required competency and wasted so much time on it; we should have gone for something more bland and simple.

My only tip, if you really decide to go for it, is to make sure to use a well-supported Linux distro. We had to be on RHEL and that turned out to be a poor fit.


If you plan on running bare-metal I highly recommend RKE2. It just works, it sets up most things for you (CNI included).

Don't even think about using kubeadm, it's the worst. It's overcomplicated and the smallest issue will wreck your cluster.

Also, as a quick tip: don't use firewalld or iptables directly, use CNI resources (e.g. a Calico GlobalNetworkPolicy) ;)
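For example (a sketch; the label, CIDR, and port are made up), a cluster-wide Calico policy allowing Postgres traffic to anything labelled role: db, but only from an internal range:

    apiVersion: projectcalico.org/v3
    kind: GlobalNetworkPolicy
    metadata:
      name: allow-db-from-internal
    spec:
      order: 100
      selector: role == 'db'
      types:
      - Ingress
      ingress:
      - action: Allow
        protocol: TCP
        source:
          nets:
          - 10.0.0.0/24
        destination:
          ports:
          - 5432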


Because a vendor does a lot of the groundwork for you (choosing a CNI and CSI implementation, for instance), while everything usually covered by a cloud controller will be entirely up to you (e.g. LBs).


Actual bare metal, where you own the physical hardware and pay for the physical network connections, is actually pretty painless. I see many people trying to use hosted compute to set up a bare metal cluster. This is a recipe for heisenbugs.


Having physical access (or IPMI) certainly helps, but there's also a lot more knowledge about networks bundled in companies that already run data centers, so setting up something like MetalLB (BGP load balancers) and Rook (Ceph CSI) to cover the parts that your cloud vendor would usually provide automatically is not as big of a deal. But the overall complexity for someone completely new to the topic is still higher.


> 2. Pay a cloud vendor to supply a working, managed Kubernetes cluster.

... which makes it trash IMHO. I don't see anything intrinsic about the problem domain that mandates a completely uninstallable unmaintainable Rube Goldberg machine. But having it be that way certainly benefits the cloud vendors who push it since it keeps people from escaping big cloud costs and using simple commodity VMs, bare metal, or colocated stuff.

Complex is the new closed. It can be fully "open" but it doesn't matter if mere mortals can't use it.


Regarding item 1, any recommendations for a good book? Manning has a couple titles that look good, but I’m curious to hear what others would suggest.


> My personal Kubernetes nightmare is having to build a cluster from scratch on bare metal.

We have a bare-metal k8s cluster... In my opinion, the thing we got right is using external load balancers (good old haproxy) to point at nginx-ingress-controllers (whose pods are pinned to two "service" nodes) and to load-balance the apiserver traffic.

Most other traffic is intra-cluster, and managed by calico anyway.


And don't expose workloads to the internet unless it's a prod app.

Can you recommend a good intro book to read cover to cover (hopefully not too thick)?


"Kubernetes in Action" by Marko (Manning publishing) is my recommendation. Took me from someone who knows docker/docker-compose to someone who can handle Azure/AWS Kubernetes, understand the terms and design apps. Very good book.


I'll second that rec, same story... I wouldn't consider myself at all an expert based solely on that book, but it did give me a lot more confidence in branching out from a straight-and-narrow configuration, in that I'd at least know what to look up when I run into problems.


Thanks


Which book would you suggest please?


You hit every high point of my own experience, but there are caveats.

1. If you have to do on-prem, then virtualize and package until you can wash, rinse, repeat.

2. Secure systems with k8s are a thing: STIGed k8s, STIGed host systems, mTLS, PSPs, network policy, MAC integration - this makes k8s really unpleasant to deal with if you come from a public cloud / public k8s provider. See #1.

3. Performance: DNS sucks, and it sucks for all kinds of reasons, usually avoidable with node-local caching approaches, but sometimes not.

4. Yes: big clusters... until you need federation.


>My personal Kubernetes nightmare is having to build a cluster from scratch on bare metal.

One person's heaven is another's nightmare. I like building it from scratch because then I know every single knob, and doing an excellent job on documentation makes sure that others have that knowledge too.

But hey, since being an "administrator" is a forgotten art, you are probably better off just buying some black-boxes with terrible performance.


Depending on your individual circumstances, buying "some black-boxes with terrible performance" might be a worthwhile tradeoff.


Well yes that's true. It's always about the circumstances.


What are some of the best Kubernetes books?


I personally liked O'Reilly's Kubernetes: Up and Running, which was fairly thorough, and Nigel Poulton's books, which were shorter and focused on the highlights (at least the editions I read).

The reason I always recommend that people read a book before getting into Kubernetes is that there are several things that make a lot more sense once someone takes the time to explain them.

It actually gave me some 90s nostalgia. In order to use a new server technology, I actually needed to sit down with an O'Reilly book.


Is the website documentation not good enough? It looks very thorough. Actually, I'd say it's rare to encounter such verbose and complete documentation nowadays. Maybe it's even too deep, but I enjoyed reading it.


The website has tons of reference documentation!

But what a lot of people need is someone to just explain:

1. The basic idea of setting a desired configuration, and having the cluster try to bring reality into sync with the config.

2. How pods, replica sets, deployments and services fit together, and why Google thought it was a good idea to split them up that way. Also, how ingress fits in with all this.

3. Basic volume management.

4. Other common optional topics, just to get an overview.

The big advantage of a book is that it will try to cover the essential ideas, and why they work the way they do, without getting lost in describing a hundred advanced features you can look up later.

If there's an introductory section on the website that covers just the essentials, that might be enough! But I didn't find one when I was learning.
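To make point 2 above concrete, the minimal shape is roughly a Deployment that keeps N replicas of a pod template running (via a ReplicaSet it manages), plus a Service that gives them one stable name. A sketch, with the names, image, and ports made up:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          containers:
          - name: web
            image: example/web:1.0
            ports:
            - containerPort: 8080
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: web
    spec:
      selector:
        app: web
      ports:
      - port: 80
        targetPort: 8080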


For intro level: Kubernetes Up and Running. (Here's a free version provided by VMware: https://www.vmware.com/content/dam/digitalmarketing/vmware/e...). Will teach you the basic vocabulary and get you well enough oriented to use k8s.

For trying to get to pro level: Programming Kubernetes. This one is focused on writing code for the k8s ecosystem, but it will teach you a lot of the internals.


Kubernetes in Action from Manning (https://www.manning.com/books/kubernetes-in-action) is quite thorough, good, and beginner friendly.


Second that, but I do recommend coming in with a docker/docker-compose understanding already.


There’s a second edition in MEAP.


>Prefer fewer larger clusters with namespaces (and node pools if needed) to lots of tiny clusters.

People do this? I thought the whole point was to abstract everything away. You should have containers running in pods. You shouldn't care about what's in the containers or what metal the pods are running on.


Some people don't trust the namespacing in Kubernetes, or have contractual obligations to keep environments separate. I've rarely seen clusters with more than 10 nodes, but I have seen single customers run 5 different tiny clusters, one per environment.


Thank you for your insight. Really solid advice.

I have a question, though.

> My personal Kubernetes nightmare is having to build a cluster from scratch on bare metal.

Can you share a few details about which distribution you used and how you handled ingress?


I've done exactly that. It wasn't fun. RHEL, k3s, ansible, Longhorn, metallb, to name a few.

Storage is the fun part.


I was playing around with a local Elasticsearch cluster and I couldn't figure out how to "do" k8s storage. Some kind of like... shared NFS volume or something maybe?

Could you spread some tips from your experiences?


Longhorn is honestly pretty easy to get up & going for backing a PersistentVolume, which you can then mount however you want.

K3S has some local storage options too, but those are of mixed usefulness. Or you just do hostPath + NFS if you want something that has as little Kubernetes magic as possible.


Yeah, NFS is your best bet if you are in a lab environment. Mount the volume on each host, then use hostPath to mount it in the pods (https://kubernetes.io/docs/concepts/storage/volumes/#hostpat...).
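A pod-spec fragment for that approach might look like this (the share path and mount point are made up; the NFS export has to be mounted at the same path on every node):

    # fragment of a pod spec
    containers:
    - name: elasticsearch
      image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
      volumeMounts:
      - name: es-data
        mountPath: /usr/share/elasticsearch/data
    volumes:
    - name: es-data
      hostPath:
        path: /mnt/nfs/es-data   # the NFS share, mounted on every node
        type: Directory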


We used a large amount of local physical storage spread pretty evenly over the machines in the cluster and then gave this to Longhorn to manage.

It pretty much takes care of everything else, but it does require some preventative TLC.


Can you recommend any good intro books?


Responded to a sibling comment with the same question here: https://news.ycombinator.com/item?id=31894095


If you have a fully-functioning "best practices" Kubernetes environment, each of the following topics ends up with its own full-depth tech:

- Compute

- Deployment

- CI

- Networking

- Storage

- Policies

Imagine running a microservices solution without an orchestration solution - how many people would it take to administer the servers, the storage, the network, the policies, etc.? With Kubernetes, you get by with maybe a couple of teams if you're lucky. This is the power and the leverage of the platform.

But also, imagine in that environment, how many things can go wrong, and the amount of expertise that you need to properly debug them. You still need that amount of expertise, because all of that complexity is still in place (or at least most of it is) - if your physical disks are throwing errors, you need someone who knows how to debug and replace that. Not hard. But then you have Ceph above that, and Rook above that (or whatever storage solution you use). And then you've got the deployment that has to make the PVC successfully. And it's like that for everything. Every problem has the potential to be a full stack problem for any one of half a dozen stacks.

It's a lot.


My team looks after 150 servers from Singapore to LA, about half physical, half virtual, some of which just sit on shelves between jobs. Being pessimistic, it takes about 100 hours a year of feeding, upgrading, etc., about $6k a year.


That's comparing steady state with bootstrapping a Kubernetes environment. I suspect steady state Kubernetes is comparable.


The two things that get me are:

1. Latch-up states. It's very, very easy for something to go wrong and blow a whole deployment up and lose all the pods - for example, a health check failure. Most application frameworks have some sort of request queuing, and the health checks sit in the same queue, so any upstream issue gives you health check failures and flapping. Of course the autoscaler goes fucking bonkers in the middle of that. The only thing you can do is drop traffic at the network edge and wait for it to get itself together.

2. No one knows how to fix it if anything major goes wrong. Even cloud providers. It's so large and complicated that no one independently has enough knowledge to actually fix it. For example, I suffered months of weird network issues where pods would come up without network. To this day no one knows why that happened or can explain it. No amount of debugging and reverse engineering produced even a single step forward, the only outcome being "replace the whole cluster".

Don't get me wrong, I still like it but I wouldn't want to run it with little expertise at hand. It's not something I would trust someone to run without production experience, which is difficult because there are very few people out there who are battle hardened past trivial home deployments and tiny little stacks.


> Don't get me wrong, I still like it but I wouldn't want to run it with little expertise at hand. It's not something I would trust someone to run without production experience, which is difficult because there are very few people out there who are battle hardened past trivial home deployments and tiny little stacks.

That's the problem. Everyone uses something like minikube to bring a Kubernetes cluster online and believes it is simple. But when anything does not work correctly, the only approach is to kill the complete machine and start a new cluster on a server. Have fun with stateful data which needs to be copied…


It's not like you can let someone run bare VMs without any production experience either. You need a PaaS for something like that, even then...

> Everyone uses something like minikube to bring a Kubernetes cluster online and believes it is simple.

This hasn't been my experience, and certainly the Kubernetes project doesn't advertise the software as "so simple anyone can do it" or any such thing. Kubernetes definitely requires experience--it's not a PaaS, but a framework on which something like a PaaS could be built.


We sort of waltzed around that one with PVs on EBS on Amazon and also shifted the control plane to them. But there are still some serious problems in that space.


A limitation of EBS on Amazon that I've run into a few times is that EBS volumes can only attach to EC2 instances in their own zone. So if your k8s cluster has nodes across multiple AZs (which is obviously important), a pod that mounts that PV will always be zone-locked. It's also a problem if you write a pod that mounts claims that live in different zones; that will never work.

There are also Amazon limits on how many volumes can attach to a node, and I used to see problems with EBS volumes getting stuck 'unmounting' from nodes. The latter was always problematic and required an admin with a hammer. However, I've never seen a 'kube' problem per se; they've all been AWS problems.


Resource zone affinity is annoying but usually works fine if you have 1 ASG per zone (Azure and, afaik, GCP are the same way). You mainly need to be careful during initial creation to make sure the volumes are spread out.

WaitForFirstConsumer is very helpful, too (you basically guarantee the pod can be scheduled at least once before creating the resources which greatly improves the likelihood it can get rescheduled in the future without getting stuck)

Don't remember seeing any issues on AWS, but I've seen Azure CSI take about 7.5 minutes to unmount a volume from one node and mount it to another (so each pod in a StatefulSet can take around 8-10 minutes).
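For reference, WaitForFirstConsumer is a one-line setting on the StorageClass - a sketch assuming the EBS CSI driver, with the class name and volume type made up:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: gp3-topology-aware
    provisioner: ebs.csi.aws.com
    parameters:
      type: gp3
    volumeBindingMode: WaitForFirstConsumer   # volume gets created in the zone the pod is scheduled to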


1. You don't 'lose' the pods. If your pod fails on liveness or readiness checks it restarts, over and over until it passes.

2. Depends on your team, for sure. If you are on a team that's like "We'll just spin this up and it will be fine," you ignore the fact that 'things happen'. I've seen similar situations with companies that deploy on bare Linux servers as well: some update breaks something, or something isn't optimally configured. Things go wrong with systems, and people who know the systems are needed to fix them. It sounds like if you intend on using Kubernetes, you should learn to troubleshoot it.

It's not difficult to find people who know how to actually run Kubernetes; it's just hard to convince them to switch jobs for you. I get a decent number of cold, non-recruiter LinkedIn contacts a month, and frankly many recruiters... but my current job pays well and matches my work-life balance. Zero people reaching out from startups offer the whole package: they can usually come in fine on the money, but I'm not working a hundred hours a week with little to no support. On the 'established corporate enterprise' side of the house, they can be inflexible when it comes to vacation time, salary ranges, etc., but I've found a good place.


RE 1. Just to be precise (as I thought this until recently):

* A failing liveness check will cause pod restarts.

* A failing readiness check causes the pod to be removed from the round-robin of new traffic requests so it has time to recover/finish processing what it's working on.
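Spelled out in a container spec, the two look alike but have very different consequences (paths and port here are made up):

    livenessProbe:            # failing this restarts the container
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 3
      periodSeconds: 10
    readinessProbe:           # failing this only removes the pod from Service endpoints
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5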


You are correct, readiness removes it from any service objects it would be an endpoint to.

My point was more 'the pod doesn't go away'. I've seen some people do stuff with the HPA that could cause it to scale down to minimum replicas if its in a broken state, depending on what stats you are using to scale, but that's more of a 'kubernetes doing what you told it to do' problem.


> Don't get me wrong, I still like it

Stockholm syndrome?


I have 2:

1. We build our own custom build system, because there is no CI that can do actual DAGs (maybe a few): a custom Kubernetes operator that parses Jsonnet files to create hundreds of CRDs and pods to achieve extreme parallelization. EKS was $144/mo (now $72), but with no info on master node types. Using watch endpoints with hundreds of pods did not scale well. They had to bump the master node instances up to c5.18xlarge, at the same price for managed. But figuring out that a scale-up was all that was needed took days. One c5.18xlarge is about $2k/month, and EKS runs at least 3 for HA. So it's a horror story for them. But we also had hundreds of worker nodes, so that might offset some of it.

2. Similar to CI, we allowed devs to deploy all microservices (~80) from any branch so that they could port-forward and use them. All of them had Ingress endpoints. After days of headaches and frustration, it turned out nginx ingress generates megabytes of configuration whenever a new deployment occurs, forks a new subprocess with the new config, and kills the other connections. When that happens often, it takes 30GB of memory with 50 developers using it (~4000 pods), and it often dies and restarts. Similar story for Prometheus and kube-state-metrics; they do not like short-lived containers and hog memory.


> We build our own custom build system, because there is no CI that can do actual DAGs (maybe a few).

Have you had a look at GitLab CI? They have a bit of documentation here: https://docs.gitlab.com/ee/ci/directed_acyclic_graph/

Now, I don't work on any projects that are too complicated, but I recall that piece of functionality working as one would expect: https://docs.gitlab.com/ee/ci/yaml/index.html#needs

Also there's Drone CI, which also supports setting up dependencies in your pipelines, if you'd prefer something that's not connected to GitLab CI so closely: https://docs.drone.io/pipeline/docker/syntax/parallelism/


D'ya like dags?


- 60% of the Kubernetes ecosystem is half-baked alpha software

- Maintaining 200+ clusters for 10 small applications

- Cloud bills

- Autoscaling never working well

- Trying to untangle Terraform state without taking down Prod


We use GKE.

1. I don't know about any of this; we don't seem to have problems.

2. This sounds like an architecture issue, not a k8s issue.

3. Our entire GKE infrastructure costs less than $50 a month.

4. You're right here; it doesn't work 'well', but it works 'well enough' for our use cases.

5. I'm sure you're talking about some event that was far more complex than the few times we've had to drain our pool, but we did what we needed to do without downtime in production. While annoyingly esoteric, I thought it worked pretty fucking well compared to our alternatives.


With regards to 3, isn’t the management plane alone $72 for GKE without considering the cost for the nodes? How are your costs so low?



As noted by another HNer, we are using one cluster with up to 10 nodes at the top end, with only one running 90% of the time and up to 3 covering the other 9%. We set 10 as the upper limit to make sure we don't get borked runs when some crazy random ML model takes everything out and forces killed jobs. The vast majority of workloads running on the cluster are HIGHLY variable, with most running on Monday mornings/afternoons and for 3 days after month or quarter starts.

We are running many hundreds of jobs on those peak days with only a handful running on any other day. While many bring up examples where 24/7 infrastructure from a single box is more than plenty, we find that we can run micro VMs in this configuration and not have to worry about resource contention as our jobs run.

Pre-GKE, we were managing the timing manually, which was fine until we started to scale, but we found this to be a far better situation. Particularly because we simply don't have to think about it.

YMMV.


> 3. Our entire GKE infrastructure costs less than $50 a month.

At this scale, you don't need Kubernetes, invest in a pocket calculator instead.


Agreed. We've had only one autoscaling issue and it was on Google's end (datacenter ran out of nodes of a particular type and thus failed to scale up). Our GKE infrastructure costs a lot more, but we do a lot of heavy compute.


Most of our 'heavy compute' happens in BigQuery so we pay for it there, instead, but we make sure the rest of our analytics infrastructure needs are engineered to be as light as possible to keep costs low.

We are an ESOP so, as employee owners, it behooves us to be as cost-conscious as possible.


> Our entire GKE infrastructure costs less than $50 a month.

Uhhh what? I mean even my personal DO based cluster runs about $40 a month. I'm skeptical a production cluster is at $50.


    Service                  Compute Engine   Kubernetes Engine
    Cost                      $191.78          $62.34
    Discounts                ($13.96)         ($62.34)
    Promotions and others      $0.00            $0.00
    Subtotal                  $177.83           $0.00
The GCE instance cost includes some of our 24/7 VMs which are the lion's share of that line item, not the micro VMs we use for the cluster.


We are using karpenter.sh for autoscaling. Works just fine.


I work with the karpenter team. Glad to hear you like it. Would love it if you added your info to our public reference adopters.md file https://github.com/aws/karpenter/blob/main/ADOPTERS.md


Literally heaps of resources used in a Kubernetes cluster are using API versions with alpha in the name!


Tell me you barely understand Kubernetes without saying you barely understand Kubernetes.


> 60% of the Kubernetes ecosystem is half-baked alpha software

This one is fair. Wasted a lot of time trying to find the "correct" dependencies—I remember the Nginx Ingress Controller specifically being a headache—only to find a maze of deprecations, poorly written documentation, or stuff that just flat out didn't work. That was ~18 months ago (I set up my cluster to run sites for my business and have basically left it alone) so things may have changed but at the time I remember being surprised after hearing so much hype.


Pretty sure nginx ingress controller is one of the more solid and widely-used pieces. I've had a lot more trouble with cert-manager, but it seems to be in a stable state on my cluster now and anyway similar solutions in the bare-VM world are just as painful (IIRC I gave up trying to get terraform to do the handshake for AWS ACM).


What I can say definitively is that having gone from not doing any infra work to using k8s and then over the past few months trying my hand at a bare-metal setup, just spinning up a Linux box and hand installing deps via apt or snap was far more enjoyable/easy to follow.

Primarily because there was very little obscurity (i.e., config files that automate away a lot of thinking or Dockerfiles/containers doing the same). It also left me feeling more confident about stability because if something isn't working, it's pretty clear what I broke/forgot. Worst "bug" I ran into was a snap server hanging when installing a dependency.


Well, that is a rude thing to say.


Call'um like i see'um


My #1 k8s nightmare is the widespread practice of just writing (or downloading and never even looking at!) YAML and applying it to the cluster, with no additional management layer (we use Terraform, but use whatever you want), meaning that eventually you have no idea what the intended state of the cluster is, only its actual state. Vendor READMEs encourage this (some even going so far as to suggest `kubectl apply -f https://...`!).

My #2 (probably partially caused by #1) is the lack of attention paid to RBAC in vendor-supplied manifests. Multiple times I've found that the vendor's YAML binds some controller's service account to a ClusterRole giving access to all secrets in the cluster, when it only really needs to read one. After filing a GitHub issue it seems that I'm the first to even notice, even on popular projects that have been around for years.
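As a sketch of the difference (the controller and secret names are hypothetical), vendor manifests typically ship the first form when the second would do:

    # what often ships: cluster-wide read on every Secret
    kind: ClusterRole
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: vendor-controller
    rules:
    - apiGroups: [""]
      resources: ["secrets"]
      verbs: ["get", "list", "watch"]
    ---
    # what the controller usually needs: one named Secret in its own namespace
    kind: Role
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: vendor-controller
      namespace: vendor-system
    rules:
    - apiGroups: [""]
      resources: ["secrets"]
      resourceNames: ["vendor-controller-credentials"]
      verbs: ["get"]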


>Multiple times I've found that the vendor's <software does something stupid>

...sounds about right (at least for closed source products)


No experience with deploying anything from a closed source vendor on k8s — these experiences have been exclusively with OSS.

Cluster-wide secret access is one of the worst I've come across, but smaller problems are virtually universal. We've come to see the YAML shipped by projects as an example, even when they document it as the preferred installation method. We always write our own now.

Even shipped Helm charts are no better, they usually encapsulate the same problems but just make them harder to fix yourself (since you are incentivised not to fork the chart as you'll have to maintain it).


Honestly, creating my first K8s deployment of a service: typing out at least 150 lines of YAML to define my Deployment, and figuring out how my ConfigMaps, Secrets, Volumes, and Services are defined and connected together. Vanilla K8s YAML is extremely low-level.


Having experienced many Word documents full of deployment instructions and screenshots on how to deploy software, a few lines of YAML has been amazing for me :)

But no doubt there are other tools that are even better.


Compare with Docker Compose files, which still use YAML, but a lot less of it, and are less verbose too.


And it does a lot less too, since it describes a single-node deployment, without any of the complexities of networking and communication between services on different nodes.


While the file specification is for Docker Compose, Docker Swarm uses the same files, and supports multi-node deployments.

There are a few differences in supported features; for example, IIRC, only Docker Swarm supports `secrets`.


A docker compose file is also used for Swarm, which is a multi-server deployment.


But is that a more complex file than the single node one?


I'm a platform engineer and I still think that Kubernetes and its tooling are unnecessarily complex.

As an aside, when I think "low-level" with regards to computer programming, I think machine byte code - closer to the hardware - so this statement read a little funny to me.


Yeah, it may be complex and highly configurable, but it's hard to imagine getting much higher level than defining networking, storage, secrets, etc. through YAML.


I feel it's a lot like the Java enterprise world of FactoryFactoryFactory classes - a colleague coined the term "horizontal abstraction" for this type of trend; you never build a pyramid/hierarchy that composes truly higher-level/abstract reasoning - you just complect in various constructs that all keep hold of most of the complexity from the level "below".

So you get to write 30-40 lines of yaml for each of your ten slightly different services...


The most freeing moment with k8s for me was internalizing that no, you don't need to write all that YAML (or JSON) by hand, and moving to generating manifests from other formats - in my case, usually Jsonnet, with a mix of public and private libraries which quickly embedded knowledge specific to our setup.

I think one of the best things I ever heard from another colleague was "I took the example Jsonnet and had a working version for a completely new deployment within an hour" :)


Agree! That said, it's important to remember that someone needs to maintain those manifests, too. Copy-paste might work initially, but you will need to modify them at some point. Which means you need in-house knowledge of that K8s YAML, anyway. For a smaller shop this might be non-trivial. For a larger shop you likely have something close to a Platform team that can maintain the manifests to make them easy to use.


> Vanilla K8s YAML is extremely low-level.

I find trying to untangle something deployed with Kubespray/Helm/anything automated far more headache-inducing than flat YAML files.


Yeah, been there and it makes things one notch even worse. :)


I've found that creating the k8s resources manually in a local KinD/microk8s cluster and then dumping the resulting YAML is much easier than typing the YAML directly.
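Something like this, if you haven't tried it (names and image are arbitrary; the dumped YAML will include status and other server-added fields you'd normally trim):

    kubectl create deployment my-app --image=nginx
    kubectl expose deployment my-app --port=80
    kubectl get deployment,service my-app -o yaml > my-app.yaml
    # or skip the cluster entirely:
    kubectl create deployment my-app --image=nginx --dry-run=client -o yaml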


I'm buried up to my balls in this right now. My favourite part is nested 'spec' objects where I've no clear view into what each spec is.


Try k9s[1]; the xray view (:xray [resource]) shows you nested resources as a tree. I find it very useful (and k9s in general is a fantastic administration tool).

[1]: https://k9scli.io


I haven't really used it in anger, but I sort of think kpt might be helpful in managing k8s: https://kpt.dev/


There's a VS Code k8s language plugin that gives you autocomplete and tooltips when hovering over YAML keys; it saves me a ton of time in situations like these.


kubectl explain
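For example, for the nested 'spec' problem mentioned above:

    kubectl explain deployment.spec.template.spec
    kubectl explain deployment.spec.template.spec.containers.livenessProbe
    # or dump the whole field tree at once:
    kubectl explain deployment --recursive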


Yes, it's a bit much. When I was beginning with Kubernetes, I would write Docker Compose files first and then convert them to Kubernetes using https://kompose.io/
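If you haven't seen kompose, the workflow is basically one command (the file name is the Compose default):

    kompose convert -f docker-compose.yml
    # emits a <service>-deployment.yaml (and -service.yaml where needed) per Compose service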


My Kubernetes nightmare is that all kinds of organizations will end up cargo-culting it as required tooling when the reality is that it's massive overkill for most deployment scenarios. Oh wait...


Ironically, I think one of the biggest issues is around packaging, specifically Helm charts (but if there are others, it is probably the same). In many frameworks, packaging is to help people by hiding complexity. Need an ingress? Use a Helm chart!

But then upgrading can be very risky, because if you have any problem at all, unless you understand the Helm chart you can rarely simply downgrade/uninstall; you could have caused a fatal problem, and for a cluster, the resilience is meaningless if you make a change that blocks access to all services.

Other issues relate to dependencies and breaking changes, which might be subtle and not easy to discover, like the fact that some old resource uses a v1beta type which becomes deprecated.

I think once it is working, Kubernetes is very reliable for me but it is when making infrastructure changes that things can go south very quickly. Updating deployments etc. is fine.


I am not going to contest this, but one upside is that you can install several ingresses at once, and you don't have to uninstall the old one until the new one works.


Slow performance.

So we have a few Spring Boot based webapps which were running (along with PgSQL) on a shared AWS t2.medium instance; we migrated these to a GKE cluster with a node pool of e2-standard-2 instances. The nodes are on a private network and don't have public IPs. The services are exposed via a Load Balancer based Ingress (with SSL). Even after allocating one core and 2GB RAM to PgSQL, the API calls to the GKE applications are perceptibly slower than on the shared AWS t2.medium deployment. We tried giving the applications generous CPU and RAM; however, it still didn't improve the response time. Since these are the very first applications being moved to this cluster, there isn't much else running on it.

Not sure what's causing the slowness. Have any of you experienced something like this in GKE?


Slower in what sense? Latency, throughput, computation time?


Are you setting CPU requests in your pod spec? This influences the cgroup cpu.shares for the containers, and (unless this has been fixed) leaving CPU requests unset results in cpu.shares=2, which the JVM interprets badly.
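Concretely, that means making sure each container has something like this (values are illustrative):

    # fragment of a container spec
    resources:
      requests:
        cpu: "1"        # drives cgroup cpu.shares; leaving it unset maps to cpu.shares=2
        memory: 2Gi
      limits:
        memory: 2Gi     # a CPU limit is often left off deliberately to avoid throttling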


Thanks for replying. Yes, we are setting the CPU requests in our pod spec. I will experiment with this further and see if it solves the issue.


> What aspect is it that makes Kubernetes operationally so hard?

Inherited a website and hosting from another studio. They set up a PHP site in Docker inside a VPS. They don't use microservices; it's one monolithic container. They didn't set up any way to get logs out of the thing. They don't use docker compose to build an image; they get a console into the container and use it like a VPS.

They literally just use it to add another layer of containerization on top of their VPS.

You already need to understand Linux to use Docker or Kubernetes. If you don't use microservices or need horizontal scaling, it's just more to learn: an extra layer of complexity that's super fragile and a nightmare to debug.

It has such a niche use case, but everyone uses it where it's not useful because it's trendy. They want to put on their CV that they have used Docker/Kubernetes; they don't have to write that it wasn't necessary and caused issues.


2025. 3am. Clear night. Full moon shining, many stars around.

I suddenly wake up, covered in cold sweat. My heart is pumping so hard.

I take out my phone. I search the internet. Kubernetes still reigns, no simpler approach made it.

The end.


I believe there is a rather relatable book called "The Dream-Quest of Unknown Kubernetes" or something like this.


Why do people obfuscate like this? k8s is a control plane + API, clients, and workers. Don't make it weird. If you do, then control the weirdness: the hooks are all there. To me it still looks like client/server with some development overhead based on gRPC + REST, state + eventual consistency. When I was learning Linux in 1997 I had butterflies in my stomach - it felt amazing working with tech that I could do _anything_ with. k8s is the only thing in 20+ years that has ever given me that feeling again.


Because even the built-in agents are huge and complex. Never mind the homegrown agents, some of which have config files that are basically YAML-serialized ASTs for some weird Turing-complete imperative language... and some just straight up embed a Lua script that does configuration and actual work. But thanks for your advice to "not make it weird": I won't! Too bad all the other people didn't heed it.


You aren't supposed to use a tool made for Google level complexity unless you work with such complexity in the first place.


As a Xoogler I'd say that Kubernetes is harder to use than Google's internal equivalents. It may not be harder to run, but that doesn't matter inside of Google unless you're on the teams responsible for the base layers.

My point is, Kubernetes isn't really "made for Google level complexity" - Google only uses it for a handful of cloud products, internal research stuff and not much else.


Why did Google then release the complex Kubernetes to the world rather than their simpler internal tools?


To loosen up AWS-entrenched customers and their much more varied use-cases, probably.


I've long wondered if K8s is really a dastardly scheme to hold back the industry …


It's basically impossible to release this kind of internal tooling due to the way Google works. You'd have to rewrite it, or open-source essentially the entire foundation, and Google probably doesn't want to do either.


It's a ninja smoke bomb to distract the tech world into thinking we are cool like Google.


Where do I go for my 3-5 node cluster on which I want to schedule containers, run dynamic workloads, etc.?

How do I build a container-based DAG on a small cluster today without k8s? Solutions that are not k8s tend to be tied to a single programming language/SDK or are not easier to set up.

I get the feeling the alternatives are dying off unnaturally fast.


Docker Swarm?


Ideally it would be a hub-and-spoke model. I don't know how much Swarm hammers etcd, but k8s in my cluster produces >1k IOPS on my SSDs doing nothing at all - just health checks and their cascading effects.


What exactly are you making that requires the use of multiple languages?


It would seem that the requirement for multiple languages often has nothing to do with making something or the customer requirements. It's often a developer requirement. Developers want to work in their favorite language. In the 90s you just wrote in whatever the company or ecosystem had mandated. Now that developers are in high demand they can specify what they want to use.

It started with books like "Beyond Java" (and Java was beyond C++) and disparaging articles by the likes of Paul Graham. They made some good points, but if you have ever worked on a single system with multiple languages, and are honest about it, you would have to say that it wastes an enormous amount of time and generates unnecessary complexity. When a developer has to make a microservice just to write part of the system in another language, you have to wonder if they have the ability to evaluate technical trade-offs. Just pick one language and get on with it.


Kinda what I expect as well. I could understand it if a company were bringing in whole teams to scale up development of a large project quickly. Even then, support could be a nightmare. Doing it just because sounds painful.


My problem needing multiple machines is less weird than needing more than python for my dag?


No, I honestly just don't know what you're talking about lol. What exactly are you building?


I'm building several things; some of them need Puppeteer with plugins, so I have to use a lot of Node.js there, which I have no interest in otherwise. Some of it is built in higher-performance languages. Some of it is off-the-shelf solutions.

And most of all, by having every node in the DAG be a container, I don't have versioning issues.

Yes everything can be wrapped in python, doesn't make it a good idea however.


I don't think that is true. Kubernetes brings a lot of advantages to people who have to manage infrastructure (like me). It gives me a single interface that I can apply across multiple teams. They all only need to provide me with a docker image and that's mostly it.

Also, what is the alternative? Self-written unmaintainable bash scripts? That's what they had before. Every team had their own way of deployment, creating packages, and so on. It was quite the nightmare.

To be fair, we use EKS, so a lot of the annoying work is done by AWS.


The first thing with bash scripts is you need to use a type system: enforce JSON as your configuration language. Then have error handling as part of a default shell function library that emits JSON. All you are doing then is passing bad returns to a function that can emit JSON, and reviewing JSON configurations for shell script variable definitions. Don't do in shell scripts what you can do in Terraform. Don't use Ansible/Salt/Puppet for what can be done in shell scripts. Shell scripts and makefiles go to CI/CD agents (Jenkins). Ansible, Terraform, etc. go to services that specialize in these.


>Also what is the alternative?

Docker compose.

Or if we're getting wild pick a technology, use it across the company, and rely on language tools to establish the APIs between modules. Then deploy the application however the language provides.


> Docker compose.

Swarm is a closer alternative, although its future is unclear.

Another alternative is Nomad.


I still wake up in cold sweat in the middle of the night feeling herds of yaml files are chasing me to pull me into the deep swamps of the clusters.


Yaml always mocking likortera. Sorry it’s early


There is also https://k8s.af, which covers some horror stories!


More distributed system than Kubernetes, but quite fun: we deployed a MongoDB cluster on our Kubernetes clusters. Our application had a chat feature that stored the messages in the MongoDB cluster. After some months, we realized we had some weird issues; some messages were arriving in the wrong order, like:

1. A : Hi !

2. B : Bye ! See you next time !

3. A : Great and you ?

4. B : Hello ! How are you ?

We thought it was an application issue, but it was actually on the database side: the timestamp of each message was using the local time of the MongoDB instance, and between different instances the time was different. We realized that the Kubernetes nodes had issues connecting to the NTP server, due to a rule in a random firewall.

When we fixed it, all the messages were in the right order.


> due to a rule in a random firewall

The eternal practice of middleboxing your network. It didn't work well even back when LANs were completely isolated, breaks even worse nowadays when LANs are just a convention over WANs, and fails for virtual LANs on a single host too.

Yet people just do it, every single time. Because setting the security in a single place is expected to be easier than setting it at the endpoints (I blame Windows for that culture). Which is kinda understandable, but here we are, talking about Kubernetes, and having that same culture.


1. A : Hi !

2. B : Bye ! See you next time !

3. A : Great and you ?

4. B : Hello ! How are you ?


My biggest surprise was how vanilla even hosted Kubernetes clusters are. For EKS I had to configure and install quite a lot to make it work as expected. At that point you are installing and self managing so much on your own that I wonder if you gain anything.


It's that I have to use YAML to configure k8s. Every bit of k8s-ish tooling has its own YAML API, including Helm, GitOps tools, Argo CD and friends, so you end up with a bunch of brittle, very hard to understand and maintain YAML files... Sigh.


I think that it's just the "kubectl" tool which accepts YAML input. This tool then talks to the actual API of your cluster and uses JSON (and maybe other formats).


It is not only the kubectl tool, unfortunately. Helm and Argo CD also expose YAML APIs. I am not against YAML in general, btw; it's the variety of tool-specific YAML formats that one mixes and has to learn every time that makes one's life difficult.


I hate helm. Writing helm charts sucks. We use kustomize, but still YAML at the end.


I'm afraid k8s will become like git: a great tool of which we mostly use like 5% of its complete capabilities, yet we all use it because everyone is using it. Yet k8s doesn't really make all the underlying stacks go away. When the shit hits the fan, you have to troubleshoot it with knowledge of much more than just YAML syntax.


k8s is just Linux, if you forget the Linux there is no hope.


I admit to being a little frustrated with systems engineers telling me I should never need shell access to a production system again, that web-based Metrics and Tracing should be enough to debug all problems. I have twenty years of muscle memory using strace, dtrace, lsof, blah blah blah to troubleshoot complex problems. Furthermore I'm only brought in when the problem is sufficiently complex. I understand that it should be a break-glass exception, but I don't want linux abstracted away completely.


Kubernetes gives you high availability, deployment automation, and powerful management tools, which are all needed to run software applications 'at scale'

Running software at scale is my nightmare.


Sounds like crypto promises?


Storage:

Shared FS between nodes, autoscaling volume claim sizes, autoscaling volume claim iops, and measuring storage utilisation (iops e.g.) for pod/node/pv.

How have I solved it? I haven't, and I know it's a key part of cost control for us in about 12 months.

Fast deploy:

I'm trying to get a test cluster up in less than half an hour. With the DAG for building it all, I'm getting a failure rate of 30% if I don't leave in arbitrary waits and extra steps. I've also only automated about 25% of our stuff, so I expect it will take longer.


Are you setting up managed k8s on a cloud provider and, if so, which one?

I've had some issues with getting EKS dependency ordering correct (using Terraform)


Indeed - GKE here.


I think some of the comments shed light on an interesting dichotomy I've noticed while talking to folk about K8s:

It seems that if you stick to simple configs, a setup hosted for you, etc. (basically the happy path), then people have had really good experiences with k8s. Those people can't understand how one could be inept enough _not_ to figure it out.

On the other hand, you'll also hear a lot of complaints about the difficulty of self-managed clusters, and attempting certain less popular or more complicated configs (or what have you). These people can't understand what benefit introducing such an insane amount of complexity could bring.

The second has mostly been my experience. I've tried now maybe a handful of times to create a cluster and get it running something on my home lab. At first I could rarely get it "up", but now I can usually get it to the point where I'd want to include storage or whatnot, and that's where I've been failing lately. Either way, I've never gotten it stable enough to warrant actual usage from me.

I like the idea of k8s; it seems like the natural next step of computing abstractions. I'm just not sure if "it's it", or if it's stable/reliable/evolved enough for people who don't need it now to invest in it yet.


For me it is actually getting services exposed. Say I buy a domain and then I (trivially) containerize my applications and set up services in Kubernetes. Now comes the networking part, which is just a pain. How do I make my service accessible? It's easy with Docker and an nginx reverse proxy, but with Kubernetes it's always seemed to be a real pain.


Just making your pod accessible is not as complicated as it seems. All you need are:

1. A Kubernetes Service resource. This just contains a selector that points at your pod.

2. An Ingress. You point this at the service you just made. You will get a static IP. Point your DNS at that.

And you're good to go. This assumes that you are using a provider that comes with an 'ingress controller' out of the box (which is what actually makes the ingress function. It's usually just nginx). If not, install the nginx ingress controller with helm. Then install cert-manager with helm for tls cert provisioning.
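
Roughly, and assuming the nginx ingress controller is already installed and your DNS points at it, it's about this much YAML (a minimal sketch; the names, domain, and ports are placeholders):

  apiVersion: v1
  kind: Service
  metadata:
    name: my-app
  spec:
    selector:
      app: my-app          # matches the labels on your pods
    ports:
      - port: 80
        targetPort: 8080
  ---
  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: my-app
  spec:
    ingressClassName: nginx
    rules:
      - host: app.example.com      # placeholder domain
        http:
          paths:
            - path: /
              pathType: Prefix
              backend:
                service:
                  name: my-app
                  port:
                    number: 80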


So if I'm not using a cloud provider but just have a k8s cluster on a small VPS with a public IP, I just need the nginx ingress controller? What if I want HTTPS? Is there a way to automatically enable Let's Encrypt for different services and domains/subdomains?


There's a wealth of material online explaining these things. Let's Encrypt can integrate with a number of ingress controllers trivially. Like much of anything else, you need to actually experiment with it to understand how it all fits together.


Nginx ingress controller + cert-manager is the most common, best documented way of doing this. If you don't have a domain already pointing to your public IP, you can use nip.io.
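
For the Let's Encrypt part, once cert-manager is installed, the usual pattern is a ClusterIssuer plus an annotation on each Ingress. A minimal sketch (the issuer name, email, and secret name are placeholders):

  apiVersion: cert-manager.io/v1
  kind: ClusterIssuer
  metadata:
    name: letsencrypt-prod
  spec:
    acme:
      server: https://acme-v02.api.letsencrypt.org/directory
      email: you@example.com              # placeholder
      privateKeySecretRef:
        name: letsencrypt-account-key
      solvers:
        - http01:
            ingress:
              class: nginx

Then you add the cert-manager.io/cluster-issuer: letsencrypt-prod annotation and a tls: section to your Ingress, and cert-manager requests and renews the certificates for you.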


Cert manager


It's not quite that simple if you are not with a provider that offers LoadBalancer service integration with Kubernetes. Normally the input to a Kubernetes cluster is essentially a NodePort. That's normally a more or less random high port (like 31453) that is exposed on all nodes (or the nodes that run the pods matching the selector). Unless you want your visitors to add that to the URL (and keep DNS up to date with active nodes), using it to provide end-user-accessible HTTP/HTTPS services is not very viable.

You either need to find/create an integration with your provider's load balancer (or possibly a CDN that allows non-1:1 port mapping) or use a HostPort service. The latter has its own share of problems as well.
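
For reference, a NodePort service is just this (a sketch; names and ports are placeholders):

  apiVersion: v1
  kind: Service
  metadata:
    name: my-app
  spec:
    type: NodePort
    selector:
      app: my-app
    ports:
      - port: 80
        targetPort: 8080
        nodePort: 31453   # optional; must fall in the default 30000-32767 range

Every node then listens on :31453 and forwards to the pods, which is exactly why you still need something in front of it for real end-user traffic.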


I honestly kinda like kubernetes and I have no problem tying together a bunch of distributed resources in my head.

The biggest nightmare for me is networking, simply because I'm not trained in networking. I know the basics to become a senior sysadmin but it's not natural to me. So mix in kubernetes and it becomes even more abstract.


I hate that kubectl wants all the images to be already built. Instead I'm forced to keep docker-compose yaml around to actually build the damn things first. Which introduces more yaml that kubectl will insist on reading.

The documentation at the main kubernetes site is poor, and is being deprecated, but not in favour of anything new.


Give Skaffold a shot, takes a lot of the pain out of running locally. I think of it as a simple wrapper around Docker, Kubernetes, Helm and Port-Forwarding (plus other options I don't use) that makes it easy to use the same build/deploy definition to work locally or build/deploy in CI/CD.

Couple examples:

- Build the images that need it with "skaffold build"

- Can watch for changes and rebuild automatically when you "skaffold dev"

- Can automatically detect your services (or arbitrary) ports and forward them when working locally

- Advanced features like profiles and modules for supporting multiple environments

https://skaffold.dev/
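
For anyone curious, a minimal skaffold.yaml is roughly this (a sketch only; the apiVersion string changes between Skaffold releases, and the image name and manifest paths are placeholders):

  apiVersion: skaffold/v2beta29
  kind: Config
  build:
    artifacts:
      - image: registry.example.com/my-app   # built from ./Dockerfile by default
  deploy:
    kubectl:
      manifests:
        - k8s/*.yaml

With that in place, "skaffold dev" rebuilds the image and redeploys the manifests whenever files change.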


Interesting. You build images at deployment time? Why?


Quick iteration. More like I deploy at build time. CI/CD takes minutes to deploy to a shared environment. Rather than contending for a bottleneck, we do local development on a single node K8S environment.

Maintaining docker-compose as well is a pain, and it's repeating ourselves.


Kubernetes itself. Maybe I just need some more "hands-on therapy".


Overall it is a great tool and I completely do not get the hate for it, but I did have some issues with AWS EKS. We made two mistakes in our project, using the k8s API instead of DNS for discovery and environment variables instead of config maps, and this ended up overloading the master nodes, which started throttling sporadically, especially with load spikes. It seemed like the AWS EKS support team were really puzzled by this and it took us weeks to get to the root cause, even with their support. This might be considered more of an AWS issue than a k8s one.


The only nightmare is the same one as with npm and pip etc.: dependencies hiding behind other dependencies, and badly documented charts being used that don't share basic lifecycle information.

Kubernetes itself has always been fine, just like a bare network or bare OS has been, but when you start stacking stuff built by other people (especially when the stuff isn't of the best quality) it just goes downhill from there.

Perhaps the actual nightmare is inadequate quality control... but that's not really specific to packaging or shared components in Kubernetes.


I don't like to upgrade things that are working. Unfortunately that bit me in the butt with Kubernetes.

Long story short, a node crashed, and when it came back up, the pods wouldn't start. We spent a couple of days trying to figure it out, but nothing was working. This was in production, so we made the choice to rebuild the entire cluster with a newer version. We still had other nodes running, and were scaled enough that there was no complete downtime, but we were maxing out the CPU and some connections were getting dropped.


Mine is fairly boring. Overall - the tool does the job, and I much prefer it to hand managing servers, or some of the previous VM based management solutions.

My two biggest gripes:

- Loss of visibility, especially related to inspecting network data as it moves from LB to pod.

- Half-baked tooling around the ecosystem, although this does seem to be slowly improving

My two biggest likes:

- I genuinely save a bunch of time with it at this point (it still occasionally sucker punches me)

- I can take the experience from my day job and self-host quite a large number of useful applications at home on old hardware.


I think my top ones are endpoint security software and running on Red Hat OSes. I used to work on a Kubernetes distribution, so a large amount of the support escalations and workload was around shipping Kubernetes to customers who weren't as apt as the HN crowd.

Endpoint Security Software, just because it's adding some policy, that usually isn't written by the team trying to run the application, and will apply the policy sometimes in non-obvious ways. Even when you think it's turned off, sometimes it isn't, and the vendor will leave kernel modules running and partial configurations.

Red Hat was more a result of the stability policy for the kernel, and often running much older kernels than other distributions. We had lots of problems with the more modern kernel features used by Kubernetes that we had to track down and often link to known fixes. We had one customer even replacing their kernels so they wouldn't have as many issues. This may be less and less the case with newer Red Hat releases, and I also have no reason to believe OpenShift suffers in the same way... just that I've spent a large amount of time troubleshooting this.


Honestly I don't get the hate k8s get.

I run my own clusters and it just works.

Sure I have to ignore a lot of crap in the setup phase, there are so many products out there I don't want to pay for. The nightmare may come from some devop installing a bunch of helm charts without configuring things properly.

Scaling down to a minimal cluster is a real concern: I would like to run k8s for some micro project that literally runs on a $5 VPS, but it's too heavy for that.


In my experience, when someone says "it just works", it never actually just works. Usually they are quietly keeping up with some things but for some reason don't count those issues as it not working.


I look after a lot of k8s professionally (hundreds of clusters), and the only issues I've had in the last few years were a istio/pilotd bug and a bit of pain moving from e.g v1beta1->v1 during normal upgrades.

Everything else that's gone wrong with these environments was cloud/hw related, or app stuff.

Guess it depends on where your competencies are.


> Guess it depends on where your competencies are.

That's true of almost everything, no?

I use k8s myself (nowhere near that scale, basically the smallest scale you can think of... one cluster...) and don't think it is a bad product.


What about using k3s as a minimal cluster?


I find that even k3s' low overhead is often too heavy to run in tiny VMs and SBCs.


Overall Kubernetes is far better than anything else I've used to manage deployments and production workloads. That said, what gives me the heebie-jeebies (and what has caused outages, for me at least) is:

1. Managing etcd nodes -- Reconciliation is a patient waiting game; try and rush it and you'll lose your cluster.

2. Kubernetes networking -- It is nearly impossible to trace packets coming through an LB into a Kubernetes pod without a very deep understanding of the different networking layers and CNIs. A lot can go wrong here.

3. Running persistent volumes in Kubernetes. This can range from outright unstable and dangerous to annoying, and at the very best you'll intermittently lose access to services due to volume claims being detached/reattached. Would highly avoid this.

4. Running "sticky" services. Statefulset's can allow you to run enumerated services with stick sessions but my experience with any sticky service is it tends to be somewhat volatile as kubernetes really loves to move workloads at its convenience. I've found statefulsets to be a redflag when considering putting it in kubernetes.


Someone hating on kubernetes and cobbling together their own stuff that "does the same, but simpler". Usually only simpler for that person.


Love kubernetes. And at the same time it's not the magical thing we all expect.

But my main pain points are around Kubernetes and all the hidden stuff.

Kubernetes alone is not enough; you need Terraform or Helm (or both) to have something manageable and deployable by a team. When things error or do not behave the way you expected, it all becomes so complicated or cryptic that it's sometimes better to delete an entire resource than to understand the underlying issue.

Some stuff like dependency between resources (e.g: Deployments depending on ConfigMap, updating the ConfigMap won't restart the deployment) makes things a lot more complicated than you expect.
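
The common workaround for the ConfigMap case, if you happen to template with Helm, is the checksum-annotation trick from the Helm docs: hash the ConfigMap into a pod template annotation so that config changes roll the Deployment. A minimal sketch (the "configmap.yaml" path is whichever template holds your ConfigMap):

  # inside the Deployment template
  spec:
    template:
      metadata:
        annotations:
          checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}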

There is too much vendor-specific stuff necessary to make a Kube cluster work that you cannot expect to have one Terraform setup that is multi-cloud, etc.


Don't use TF for k8s. That is the one lesson I have learned. Use jenkins + shell scripts for k8s. Drive vendor tools (eksctl) or use kube* tools.


Debugging CrashLoopBackOff


I'm probably the wrong person to ask. My Kubernetes nightmare is having to configure anything from scratch. Or having to fix something that doesn't work.

My comfort zone is where Kubernetes works fine and I don't have to touch it, or only update trivial stuff.


A failed upgrade of a CNI plugin on a production cluster. Since then I always have a blue/green cluster deployment at hand, with each cluster containing the whole production environment, and I flip between them via a load balancer in front of the clusters.


For stateless applications - doing a blue/green cluster upgrade is really the smartest choice.


I ended up porting a few clients we had at a company, already running on Docker, to a Kubernetes cluster. The major issues came from trying to push everything there. I think it works very well for managing web clients.

Problems started by trying to push too many things into the clusters. Databases, and especially ElasticCache with Kibana to collect metrics from the cluster, ended up killing the performance.

So it's like everything: some cases are great for K8s, some are terrible. This plus the complex abstractions makes it not that developer-friendly, but overall it does a good job of running and scaling services without having to worry too much about hardware.


Here's another: fitting cluster addressing plans into my organization's ridiculously constrained IPv4 addressing plan. I find it crazy that a modern technology like k8s was not IPv6-native from the start.


We have been using OpenShift at work and it has been relatively troublesome. To be honest, when we have infrastructure issues (which is not often) we just have some Red Hat consultant fix them.


Disclaimer: Former Red Hat Consultant who fixed people's openshift :-)

Can you share more about your openshift stack? For example do you have a storage solution (and is it openshift container storage)? Are you running on top of vmware, aws, etc?

You're far from the only one to have that problem. I love OpenShift and generally do recommend it, so I've been trying to think of ways to improve on that situation. I don't work for Red Hat anymore, but the product is wonderful and (despite not being perfect) is IMHO the best one out there, so I want to see it succeed. It always felt to me like it was just enough of a black box as to be hard for an outsider to get into and tinker/debug.


I'm not sure Kubernetes is worse than any other complex, distributed system.

OpenStack, Pivotal Cloud Foundry, internal compute platform

So far I think the nightmare problem is people trying to run it and CNCF software (Prometheus, various operators) with only a cursory understanding of how it works (me included)

It's easy to shoot yourself in the foot (oops, forgot requests on a resource intensive, high replica count deploy and hosed cluster autoscaling)


TLDR: Complexity.

The deprecation lifecycle, and running ingress controllers in an automatic scaling group.

The first isn't as much of an issue if you have a (partially) dedicated team for managing your clusters, but can be prohibitively expensive (effort / time-wise) for smaller organisations.

The second highlights a bigger problem in K8s in general. I'll have to give a little background first:

If you run an Nginx ingress controller on a node that's part of an ASG — i.e. a group where nodes can disappear, or increase in number — you will experience service disruption to a small percentage of your requests, every time a scaling event occurs. This is caused by a misalignment between timeout values for your load balancer and Nginx, which can not be fixed:

* https://github.com/kubernetes/ingress-nginx/issues/6281

* https://github.com/kubernetes/ingress-nginx/issues/6791

* https://github.com/kubernetes/ingress-nginx/issues/7175

The fix is to only run the controllers on nodes that reside in a separate statically sized group, and perform updates to them out of hours when necessary :|

I'll leave you to decide on whether that's a fix or not, but the larger point it highlights is how _theoretically_ everything's great in K8s, but the headaches introduced by the complexity often make it not worth it.

Another example is pod disruption budgets. These are needed because the behaviour of K8s when instructed to shutdown a node is, well, to shutdown the node. Seems reasonable, until you realise that it doesn't handle moving the workloads off that node _first_. No, at some point later, the scheduler realises the pods aren't running and schedules them somewhere else. So you use a combination of PDBs to tell K8s that it must keep n pods of this deployment running at all times, and distribution rules to tell it pods must run on different nodes. This solution falls apart when you have pods that should only have a single instance running.
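
For concreteness, the PDB-plus-spread combination looks roughly like this (a sketch; the app name is a placeholder):

  apiVersion: policy/v1
  kind: PodDisruptionBudget
  metadata:
    name: my-app
  spec:
    minAvailable: 1            # never let a drain take out the last replica
    selector:
      matchLabels:
        app: my-app
  ---
  # and in the Deployment's pod spec: keep replicas on different nodes
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: my-app

And, as above, none of this helps for workloads that can only ever run a single instance.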


This is always a problem with ingress/load balancer unless you buy some REALLY fancy (auto failover) enterprise hardware loadbalancers unfortunately, regardless of K8S. It does make it harder to see what is going on though or work around it.


While I'm venting K8s frustration: they told me K8s was awesome because it allows me to easily test things locally on my machine. They told me minikube would solve it. Minikube ran out of memory, was unstable, crashed. The production YAML required cert-manager, not on my machine. The production YAML required a volume manager, not on my machine.


Some very basic things look very hard to me. Right now I'm scratching my head over how to implement even a non-HA WireGuard server in a pod, so wg clients can access the pod network and pods can access the wg client network. Seems like a very basic requirement for any installation, yet there are zero guides about it. And don't even talk about an HA server with a load balancer.



If you find anything about it, please do share. I recall I saw something similar somewhere. Maybe it was OpenVPN.


Operators/OLM/channels/operand versioning/upgrade testing.

Even typing out the words makes me want to lie down in a dark room.


It's the lack of good first-class observability and good administration tooling into what's going on. There are a bunch of third-party tools, and those seem to be lacking. The CLI is abysmal tbh. Good luck figuring out why your pod crashed, or why it's frozen, or where the logs from non-existing pods are, etc.


A client who was on version 1.13 (which was the current version when they started the project in 2018) being forced to upgrade by the aws managed kubernetes service. I'm not sure what they ended up doing but they were facing the requirement of having to upgrade the nodes and the control plane one version at a time.


For me the primary thing is getting OOM-killed. But swap support is being worked on.

And then perhaps the proper handling of persistent disk.


The thing itself. I’ve been avoiding Java for most of my career and will do the same with kubernetes.


Having to use it. Same for Docker.


If anyone is looking for some tutorial/guides for a homelab. This is a great wiki. https://wiki.technotim.live/en/kubernetes


Related:

Kubernetes Failure Stories A compiled list of links to public failure stories related to Kubernetes. Most recent publications on top.

https://k8s.af/


I haven't even touched intra-cluster networking, so it must be networking.

Out of the things I have touched, it's load-balanced ingress (when running on premises). So yeah, it's networking.


The iptables rules are mostly unrelated to your CNI plugin. They're added/managed by kube-proxy to provide your internal service routing and load balancing.


- PSP retirement and trying to define a replacement with 100% coverage. Gatekeeper seems to be the heir-apparent.

- Migration of all our customer workloads from PSP to gatekeeper.


The decoupling of ingress and deployments always bothered me, although it might not be a _nightmare_ exactly.

In short, the ingress may route traffic to a pod after it is killed. The solution is that when a pod gets a SIGTERM signal, it should mark itself not ready, wait for some amount of time and then shut down (see e.g. https://deepsource.io/blog/zero-downtime-deployment/). I've heard arguments for this behavior, but it's not the same trade-offs I would make.
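
The usual workaround, for reference, is a preStop sleep plus a long enough grace period, along these lines (a sketch; the numbers depend on how long your LB/endpoint propagation takes, and it assumes the image has a sleep binary):

  # in the Deployment's pod spec
  spec:
    terminationGracePeriodSeconds: 60
    containers:
      - name: my-app
        image: registry.example.com/my-app:1.0     # placeholder
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "15"]   # keep serving while endpoints/ingress catch up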


In most cases you can fix this with the service upstream annotation, which has existed since 2017: https://github.com/kubernetes/ingress-nginx/issues/257
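
That is, on ingress-nginx it's just an annotation on the Ingress (sketch):

  apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    name: my-app
    annotations:
      # route via the Service's ClusterIP instead of pod endpoints directly
      nginx.ingress.kubernetes.io/service-upstream: "true"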


I did not know that. Thanks for explaining.


On AWS, the best way we found to use K8S is using Fargate, with Copilot to create and manage infrastructure and deployments.


Late to the party, but figured I'd share my own story (some details obviously changed, but hopefully the spirit of the experience remains).

Suppose that you work in an org that successfully ships software in a variety of ways - as regular packaged software that runs on an OS directly (e.g. a .jar that expects a certain JDK version in the VM), or maybe even uses containers sometimes, be it with Nomad, Swarm or something else.

And then a project comes along that needs Kubernetes, because someone else made that choice for you (in some orgs it might be a requirement from the side of clients, others might want to be able to claim that their software runs on Kubernetes, and in other cases some dev might be padding their CV before leaving) and now you need to deal with the consequences.

But here's the thing - if the organization doesn't have enough buy-in into Kubernetes, it's as if you're starting everything from 0, especially if paying some cloud vendor to give you a managed cluster isn't in the cards, be it because of data storage requirements (even for dev environments), other compliance reasons or even just corporate policy.

So, I might be given a single VM on a server, with 8 GB of RAM for launching 4 or so Java/.NET services, as that is a decent amount of resources for doing things the old way. But now, I need to fit a whole Kubernetes cluster in there, which in most configurations eats resources like there's no tomorrow. Oh, and the colleagues also don't have too much experience working with Kubernetes, so some sort of a helpful UI might be nice to have, except that the org uses RPM distros and there are no resources for an install of OpenShift on that VM.

But how much can I even do with that amount of resources, then? Well, I did manage to get K3s (a certified K8s distro by Rancher) up and running, though my hopes of connecting it with the actual Rancher tool (https://rancher.com/) to act as a good web UI didn't succeed. Mostly because of some weirdness with the cgroups support and Rancher running as a Docker container in many cases, which just kind of broke. I did get Portainer (https://www.portainer.io/) up and running instead, but back then I think there were certain problems with the UI, as it's still very much in active development and gradually receives lots of updates. I might have just gone with Kubernetes dashboard, but admittedly the whole login thing isn't quite as intuitive as the alternatives.

That said, everything kind of broke down for a bit when I needed to set up the ingress. What if you have a wildcard certificate along the lines of *.something.else.org.com and want it to be used for all of your apps? Back in the day, you'd just set up Nginx or Apache as your reverse proxy and let it worry about SSL/TLS termination. A duty which is now taken over by Kubernetes, except that by default K3s comes with Traefik as its ingress controller of choice, and the documentation isn't exactly stellar.

So for getting this sort of configuration up and running, I needed to think about a HelmChartConfig for Traefik, a ConfigMap which references the secrets, a TLSStore to contain them, as well as creating the actual TLS secrets themselves from the appropriate files on the file system, which still feels a bit odd and would probably be an utter mess if I needed particular certificates for some other paths, and Let's Encrypt for yet others. In short, what previously would have been those very same files living on the file system and a few (dozen?) lines inside of the reverse proxy configuration is now a distributed mess of abstractions and actions which certainly needs some getting used to.
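
For anyone hitting the same wall, the wildcard-cert piece ended up being roughly the following (a sketch of what worked for me on the Traefik v2 bundled with K3s; the CRD group has changed in newer Traefik releases, and the secret name is a placeholder):

  # secret created beforehand from the wildcard cert/key files, e.g.
  #   kubectl -n default create secret tls wildcard-cert --cert=tls.crt --key=tls.key
  apiVersion: traefik.containo.us/v1alpha1
  kind: TLSStore
  metadata:
    name: default
    namespace: default
  spec:
    defaultCertificate:
      secretName: wildcard-cert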

Oh, and Portainer sometimes just gets confused and fails to figure out how to properly set up the routes, though I do have to say that at least MetalLB does its job nicely.

And then? Well, we can't just ship manifests directly, we also need Helm charts! But of course, in addition to writing those and setting up the CI for packaging them, you also need something running to store them, as well as any Docker images that you want. In lieu of going through all of the red tape to set that up on shared infrastructure (which would need cleanup policies, access controls and lots of planning so things don't break for other parties using it), instead I crammed in an instance of Nexus/Artifactory/Harbor/... on that very same server, with the very same resource limits, with deadlines still looming over my head.

But that's not it, for software isn't developed in a vacuum. Throw in all of the regular issues with developing software, like not being 100% clear on each of the configuration values that the apps need (because developers are fallible, of course), changes to what they want to use, problems with DB initialization (of course, still needing an instance of PostgreSQL/MariaDB running on the very same server, which for whatever reason might get used as a shared DB) and so on.

In short, you take a process that already has pain points in most orgs and make it needlessly more complex. There are tangible benefits for using Kubernetes. Once you find a setup that works (personally, Ubuntu LTS or a similar distro, full Rancher install, maybe K3s as the underlying cluster or RKE/K3s/k0s on separate nodes, with Nginx for ingress, or a 100% separately managed ingress) then it's great and the standardization is almost like a superpower (as long as you don't go crazy with CRDs). Yet, you need to pay a certain cost up front.

What could be done to alleviate some of the pain points?

In short, I think that:

  - expect to need a lot more resources than previously: always have a separate node for managing your cluster and put any sorts of tools on it as well (like Portainer/Rancher), but run your app workloads on other nodes (K3s or k0s can still be not too demanding with resources for the most part)
  - don't actually shy away from tools like Portainer/Rancher/Lens for making the learning curve more shallow, inspect the YAML that they generate, familiarize yourself with the low level stuff as necessary, while still having an easy to understand overview of everything
  - don't forget about needing somewhere to store Helm charts and container images, be it another node or a cloud offering of some sort
  - if you can, just go for the cloud, but even if managed K8s is not in the cards for you, still strive at least for some sort of self-service approach for the inevitable reinstalls
  - speaking of which, treat your clusters as *almost* disposable, have all of the instructions for preparing them somewhere, ideally as an executable script (maybe use Ansible)
  - don't stray too far away from what you get out of the box, also look in the direction of the most tried and tested solutions, like an Nginx ingress (Traefik with K3s should *technically* have the better integration, but the lack of proper docs works against it, you'll probably want something like a cookbook of sorts)
  - also manage your expectations, getting things up and running will probably take a long time and will be a serious aspect of development that cannot be overlooked; no, you won't have a cluster up and running on-prem with everything you need in 2 days
  - ideally, have a proper DevOps team or even just a group of people who'll spearhead information sharing and creating any sorts of knowledgebases or templates so it's easier in the future
So, in summary, it can be a nightmare if you have unrealistic expectations or an unrealistic view of how Kubernetes might solve all of your problems, without an understanding of the tradeoffs that it requires. I still think that Nomad/Swarm/Compose might work better for many smaller projects/teams out there, but the benefits of Kubernetes are also hard to argue against... if you manage to get that far, though, and only then.


It's so saddening to see how the Kubernetes hype cycle follows OpenStack's, and all the fundamental problems still seem unsolved. I sometimes feel like it's just the same story playing out 5 years later, one layer up the stack (IaaS -> CaaS) and with other fools falling for it (with OpenStack it was sysadmins trying to run a control plane, with Kubernetes it's devs trying to run infrastructure).

The abstractions we have available to build and run distributed systems may have improved, but they still suck in the grand scheme of things. My personal nightmare is that nothing better comes along soon.

> - Is it the networking model that is simple from the consumption standpoint but has too many moving parts for it to be implemented?

Many poor sysadmins before us have tried to implement Neutron (OpenStack Networking Service) with OvS or a bunch of half-assed vendor SDNs. Or LBaaS with HAProxy.

> - Is it the storage model, CSI and friends?

I mean, the most popular CSI for running on-premise is rook.io, which is just wrapping Ceph. Ceph is just as hard to run as ever, and a lot of that is justified by the inherent complexity of providing high performance multi-tenant storage.

> - Is it the bunch of controller loops doing their own things with nothing that gives a "wholesome" picture to identify the root cause?

Partially. One advantage the approach has is that it's conceptually simple, consistent, and feels easy to compose into complex behavior. The problem is that Kubernetes enforces very little structure, not even basics like object ownership. The result is unbounded complexity. A lack of tooling (e.g. time-travel debugging for control loops) makes debugging complex interactions next to impossible. This is also not surprising: control loops are a very hard problem, and even simple systems can spiral (or oscillate) out of control very quickly. Control theory is hard. David Anderson has a pretty good treatise on the matter: https://blog.dave.tf/post/new-kubernetes/

Compared to OpenStack, Kubernetes uses a conceptually much simpler model (control loops + CRDs) and does a much better job at enforcing API consistency. Kubernetes is locally simple and consistent, but globally brittle.

The downside is that it needs much more composition of control loops to do meaningful work, and that leads to exploding complexity because you have a bunch of uncoordinated actors (control loops) each acting on partial state (a subset of CRDs).

The implementation model of an OpenStack service, otoh, is much simpler because they use straightforward "workflows", working on a much bigger picture of global state, e.g. Neutron owning the entire network layer. This makes composition less of a source of brittleness, though OpenStack still has its fair share of that as well. Workflows are, however, much more brittle locally, because they cannot reconcile themselves when things go wrong.


Extra layers, debugging things that are ephemeral, and overlay networks.


Users pulling any old image off of Docker Hub!


What are you fishing for exactly here? K8s has its issues, but the way you have phrased the question is only going to get you biased answers.


It's only biased if you think of these technologies as teams to cheer or boo. If you just want to be pragmatic and get things done then knowing the pitfalls is valuable.


I am a k8s user myself and I believe it is one of the most innovative projects out there. My intent here is to reach out to the wider community to share my fears/pain points and see if:

- Is there a knowledge gap that I have and can work on myself to get better?

- Are there any practices, techniques, tricks, SOPs, solutions, or mechanisms that the wider community has developed (or that are still evolving) that I can put to use?

I know k8s has issues, but knowing those issues up close (including why they exist) can make one a better operator of the machinery at hand.

So mainly, I'm seeking advice, opinions, views that come from people's experiences and aren't necessarily in official docs or books/courses/workshops/seminars etc.


> only going to get you biased answers.

Unlike, say, every opinion from engineers who are fond of "resume-oriented software development"?

Or every blog post from the dev team in a company that is working with kubernetes, even when there is absolutely no need to, but they do it anyway, because "it helps with hiring"?

Of course opinions, testimonials and even reports will be biased. They are still important anyway.


I guess OP is looking for real life stories, of 'nightmares', caused by excessively complex or opaque kubernetes configurations.

AKA "Tell us about how your boss/team has shot themselves in the foot by using kubernetes [needlessly or wrong]".


That just as I understand v1, the Kubernetes maintainers roll out v2 and it's completely different!


My Kubernetes nightmare is not the software but the strident, intolerant, social justice keyboard warriors involved with the CNCF and k8s technologies.

They make up a tiny minority but they are loud and often nasty within their Twitter echo chamber and they are excellent at getting companies and people to do their bidding out of fear of reprisal.

It’s supposed to be the “most welcoming community” but it only takes you stepping out of line on the outrage du jour to get a proverbial face full of spittle and chased out of town with pitchforks.

I’m posting this with a throwaway obviously because I’m not trying to lose my job or get doxxed. Which would 100% happen if I posted under my real name.


It's a cult.

While I agree with you I don't think it's particularly worse in k8/CNCF than in other parts of tech.



