Kubernetes Horror Stories?
19 points by omginternets on July 28, 2020 | 25 comments
I often see it mentioned in the comments that K8 is unfit for all but the largest orgs, and that the complexity of k8 comes with a very steep price. Interestingly though, I have yet to come across any specific examples or anecdotes. As someone who's considering a k8-based deployment and has consumed a fair amount of k8-related marketing, I'm worried that I might be seeing the upside, but not the downside.

So, HN ... what does it look like when k8 goes wrong? Is it a technical issue? A business issue (e.g. we had to hire a full-time k8 expert)? Both?




K8s is sophisticated software that offers a wide range of options because people want to use it in a wide range of scenarios. But you probably don't need most of those options. Start with something like microk8s, which is on rails, and add knobs only if you need them.
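
For a sense of what "on rails" means here, getting a single-node cluster up is just a few commands (a sketch for Ubuntu; the addon names come from the microk8s docs and may differ by version):

  # install microk8s and wait for the cluster to report ready
  sudo snap install microk8s --classic
  microk8s status --wait-ready

  # add knobs only as you need them: DNS, local storage, ingress
  microk8s enable dns storage ingress

  # the bundled kubectl behaves like the standalone one
  microk8s kubectl get nodes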


You can ask for C, C++, PHP, Java horror stories. You can ask for AWS, GCP, on-prem, Digital Ocean horror stories too and get some. You can ask for developer, sysadmin, DBA horror stories, or MySQL, Postgres, Oracle ones. Just because you get a story doesn't mean the technology in question shouldn't be used or isn't useful. Sometimes horror stories happen because of ignorance.

Here's a k8s horror story I heard. A company lost their entire cluster. Everything. Here's the flip side: they were able to rebuild their entire production from scratch and be back up in about 2 hours.


> Here's the flip side, they were able to rebuild their entire production from scratch and be back up in about 2 hours.

Are you sure about those 2 hours?


https://k8s.af collects prod incident stories.


I worked at a company where their k8s setup was implemented by someone who did not grok k8s. It was basically a tangle of bash scripts that imperatively set up the infrastructure, including many loops of polling to wait for things to be ready, and direct search-and-replace of string tokens in the resource definitions. The incredible thing was that it mostly worked. The downside was that it was quite brittle, so you had to make changes with great care. It could easily get stuck in a loop, or hit a timeout, often both. This would leave the cluster in a bad state. Then one would have to run a clean-up procedure before retrying. All of this was slow beyond belief.
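
For contrast, the declarative equivalent of most of that machinery is tiny (a sketch; myapp and manifests/ are placeholders), and the polling loops come built in:

  # declare the desired state; the control plane reconciles toward it
  kubectl apply -f manifests/

  # built-in replacement for hand-rolled readiness polling
  kubectl rollout status deployment/myapp --timeout=120s

  # or wait on a specific condition
  kubectl wait --for=condition=Available deployment/myapp --timeout=120s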


A place I used to work had started their 'Google Cloud' journey, and the first thing the account manager did was teach everyone what GKE and K8s were and how they worked, regardless of whether or not they were required.

It just seems they were pushing it as a marketing thing rather than to solve specific technical issues. From my limited perspective I've seen it as a business issue that turns into a technical one, i.e. inexperienced developers being forced into a complicated technology for the sake of marketing.


That is a good one:

https://www.youtube.com/watch?v=6sDTB4eV4F8

It's about Zalando (the biggest clothing and shoe online shop in Germany), which is a big open-source contributor, especially Patroni (Postgres HA).

https://github.com/zalando


I can't contribute anything (luckily), GKE has been absolutely smooth sailing for us (small-ish data eng/data science team doing most of their own cloud ops) – even BigQuery has caused more trouble.

Very curious how others have fared, though.


We went from building VMs with Ansible on AWS to everything being on GKE. Previously, if we wanted to build out a whole new stack it could be a multi-day endeavor; now on GKE I can build out a whole new GKE cluster in a matter of hours. It's also allowed our team to sleep at night, as our Ansible-driven AWS architecture was fickle and fragile. I'm sure that we could have engineered our AWS infra much better, but it's a lot of work, and GKE allowed us to bypass that and focus more on the business's bottom line.
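
For a sense of scale, a basic cluster build is roughly two commands (a sketch; the cluster name, zone, and node count are made up):

  # create the cluster
  gcloud container clusters create my-cluster \
      --zone us-central1-a --num-nodes 3

  # fetch credentials so kubectl talks to it
  gcloud container clusters get-credentials my-cluster --zone us-central1-a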

There are 2 major issues that we tend to run into on GKE:

DNS Problems: most likely self-inflicted, but they are a pain and can take a while to troubleshoot (a rough debugging sketch follows below).

Google Cloud Platform Quota Increases take FOREVER: as far as I am aware, Google refuses to allow customers to set organization-wide quotas for new projects that get spun up, and there is absolutely no way to automate quota increase requests via an API. It's absolutely infuriating and makes disposable projects all but impossible, which I was really hoping to set up for my company.

If there is anyone at GCP reading this: Please, please, please allow customers to set organization-wide quotas (even for new projects that get spun up) and please give customers an API to manage these things!
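
For what it's worth, rough starting points for both issues (a sketch: the dnsutils image is the one the Kubernetes DNS-debugging docs suggest, and the gcloud --flatten/--format incantation is my assumption, worth double-checking):

  # DNS: resolve a service name from a throwaway debug pod
  kubectl run -it --rm dnsutils --restart=Never \
      --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
      -- nslookup kubernetes.default

  # and check the health of the cluster DNS pods themselves
  kubectl get pods -n kube-system -l k8s-app=kube-dns

  # Quotas: current usage/limits can at least be read per project,
  # even if increase requests still have to go through the console
  gcloud compute project-info describe --flatten=quotas \
      --format='table(quotas.metric, quotas.usage, quotas.limit)'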

Overall though, GKE is pretty great for us.


Would you like to compare notes on your work? We're building an internal (for now) machine learning platform for our team (play version in bio).

We've added features to suit our needs. Examples: near-real-time collaboration on notebooks, with user cursors, for when a data scientist colleague has a problem implementing something and we need to pair-program.

Automatic logging for parameters, metrics, and model to take the cognitive load off.

Multiple checkpoints for our notebooks.

Ability to turn a notebook into an AppBook: we automatically generate a form for parameters without the author tagging cells so you can train a model with a form or an HTTP request, without changing the notebook. Also useful when we want to allow a domain expert to train models by tweaking parameters that matter to them (no need to have Jupyter). Then click to deploy a model, or click to build its Docker image if they have a developer team that wants to send data to it.

Again, not all scenarios are handled; we're just adding the features we needed in our own projects.


Most freakouts are caused by folks whose infra was chicken-wire / chewing-gum to begin with. Start small, test often, and then make your technical decisions dispassionately.


I would add that the biggest gripe with Kubernetes, the one that surfaces most often, is just a reflection of a lack of exposure not only to distributed systems but also to their deployment and operation; Kubernetes just gets called into the discussion as a scapegoat.


Consume a k8s course, not the marketing, and set up a test server. For me the downside is there is a lot of new stuff to learn. Information overload!

(are we calling it K8 these days?)


What problems are you trying to solve...? Make a list of them, then evaluate the complexity and trade-offs of each solution.

K8s is a tool; we have a lot of tools in software for solving problems.


An untold horror story would be the management & development overhead (and thus cost) associated with it.


> An untold horror story would be the management & development overhead (and thus cost) associated with it.

How do you expect to manage clusters of COTS hardware and VMs communicating through a VPN and supporting version-controlled blue-green deployments without adding any complexity?


You can have two separate machines (whether VMs or bare-metal) plus a load-balancer in front and do blue/green deployments by deploying changes on one machine, then redirecting the traffic to it by reconfiguring the LB; if all is good, you then update the second machine as well.
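
Concretely, with nginx as the LB, "redirecting the traffic" can be a symlink flip (a sketch; the paths and the blue.conf/green.conf upstream files are made up):

  # point the active upstream at the green machine
  ln -sfn /etc/nginx/upstreams/green.conf /etc/nginx/conf.d/active-upstream.conf

  # validate the config, then reload without dropping connections
  nginx -t && nginx -s reload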

You do not need a VPN if all your machines are on premises in the same network, or within a VPC in the case of a cloud provider. In cases where you do need a VPN (let's say a hybrid on-prem + cloud architecture), you can have a single VPN gateway (either software like Strongswan on a dedicated machine, or a hardware router plus whatever AWS's equivalent of that is) bridging the two networks.
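
For anyone picturing it, the strongSwan side of such a site-to-site bridge is a fairly short config (a sketch using the classic ipsec.conf syntax; every address, subnet, and the PSK here is made up):

  # /etc/ipsec.conf
  conn onprem-to-cloud
      keyexchange=ikev2
      authby=secret
      left=203.0.113.10         # on-prem gateway public IP
      leftsubnet=10.0.0.0/16    # on-prem network
      right=198.51.100.20       # cloud-side VPN gateway
      rightsubnet=10.1.0.0/16   # VPC network
      auto=start                # establish on startup

  # /etc/ipsec.secrets
  203.0.113.10 198.51.100.20 : PSK "change-me"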

In a lot of projects, though, neither of those things is required at all. Not every project has a threat model requiring VPN'd communications between hosts, or multiple environments and seamless deployments (a lot of projects would be fine with taking 10 minutes of downtime out of hours to deploy updates).

I'm not saying Kubernetes is bad or should never be used, but just like any tool it should be used for the right task and in some cases the cost, complexity or downsides associated with the tool end up being a net negative. I dislike the current trend that everything should be in containers, Kubernetized, etc where you spend more time on infrastructure than writing the code you're trying to deploy in the first place.


> You can have two separate machines (whether VMs or bare-metal) plus a load-balancer in front and go blue/green deployments

Congratulations, you've just added half a dozen moving parts that are untested and require constant maintenance.

And you still don't get version-controlled rollbacks.

What have you gained by not using Kubernetes? I see only losses.
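
To make the rollback point concrete (a sketch; myapp is a placeholder):

  # deployments keep a revision history of rollouts
  kubectl rollout history deployment/myapp

  # roll back to the previous revision, or pin one explicitly
  kubectl rollout undo deployment/myapp
  kubectl rollout undo deployment/myapp --to-revision=2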

> You do not need a VPN if all your machines are on premises

That assertion is simply wrong. You use a VPN to isolate your application's traffic.


K8 is also unfit for the largest organizations. They just have the advantage that they can throw any number of people at it and pretend that everything is alright.


What would be a better use of their resources instead?


What exactly makes it unfit for large orgs?


The same stuff that makes it unfit for smaller organisations.


And those would be?


This was posted elsewhere in this thread:

https://github.com/hjacobs/kubernetes-failure-stories

To me the biggest problem is the hype and massive salesmanship used to sell extreme complexity to gullible companies and individuals. K8s really is the J2EE application server of the last century, on steroids.



