
Kubernetes Horror Stories? - omginternets
I often see it mentioned in the comments that K8 is unfit for all but the largest orgs, and that the complexity of k8 comes with a very steep price. Interestingly though, I have yet to come across any _specific_ examples or anecdotes. As someone who's considering a k8-based deployment and has consumed a fair deal of k8-related marketing, I'm worried that I might be seeing the upside, but not the downside.

So, HN ... what does it look like when k8 goes wrong? Is it a _technical_ issue? A _business_ issue (e.g. we had to hire a full-time k8 expert)? Both?
======
hardwaresofton
[https://github.com/hjacobs/kubernetes-failure-stories](https://github.com/hjacobs/kubernetes-failure-stories)

~~~
elderbarry
K8s is sophisticated software that offers a wide range of options because
people want to use it in a wide range of scenarios. But you probably don't
need most of those options. Start with something like microk8s, which is on
rails, and add knobs only if you need them.
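
A minimal sketch of that "start on rails" approach, assuming a single Ubuntu host with snap available; the add-on list is just an example:

    import subprocess

    # Bring up a one-node cluster, then enable add-ons one at a time, only as needed.
    for cmd in (
        ["sudo", "snap", "install", "microk8s", "--classic"],
        ["sudo", "microk8s", "status", "--wait-ready"],
        ["sudo", "microk8s", "enable", "dns"],  # the first "knob"; stop here until you need more
        ["sudo", "microk8s", "kubectl", "get", "nodes"],
    ):
        subprocess.run(cmd, check=True)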

------
segmondy
You can ask for C, C++, PHP, or Java horror stories. You can ask for AWS, GCP,
on-prem, or Digital Ocean horror stories too and get some. You can ask for
developer, sysadmin, or DBA horror stories, or MySQL, Postgres, or Oracle ones. Just
because you get a story doesn't mean the topic under discussion shouldn't be used
or isn't useful. Sometimes horror stories happen because of ignorance.

Here's a k8s horror story I heard. A company lost their entire cluster.
Everything. Here's the flip side, they were able to rebuild their entire
production from scratch and be back up in about 2 hours.
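
That kind of recovery time is plausible when everything is declared as manifests kept in version control. A minimal sketch of the idea, assuming a fresh cluster has already been provisioned and the (made-up) k8s/ directory holds the checked-in manifests:

    import subprocess

    # Re-apply every manifest from the repo against the new, empty cluster,
    # then watch the workloads come back up.
    subprocess.run(["kubectl", "apply", "--recursive", "-f", "k8s/"], check=True)
    subprocess.run(["kubectl", "get", "pods", "--all-namespaces"], check=True)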

~~~
nix23
> Here's the flip side, they were able to rebuild their entire production from
> scratch and be back up in about 2 hours.

Are you sure about those 2 hours?

------
kapilvt
[https://k8s.af](https://k8s.af) prod incident stories.

------
oftenwrong
I worked at a company where their k8s setup was implemented by someone who did
not grok k8s. It was basically a tangle of bash scripts that imperatively set
up the infrastructure, including many loops of polling to wait for things to
be ready, and direct search-and-replace of string tokens in the resource
definitions. The incredible thing was that it mostly worked. The downside was
that it was quite brittle, so you had to make changes with great care. It
could easily get stuck in a loop, or hit a timeout - often both. This would
leave the cluster in a bad state. Then one would have to run a clean-up
procedure before retrying. All of this was slow beyond belief.
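
To make that concrete, a hypothetical reconstruction of the shape of such a script (all names and numbers invented), which shows where the stuck loops and timeouts come from:

    import subprocess, time

    # Imperative token replacement in a resource definition, then apply it.
    manifest = open("deployment.yaml.tpl").read().replace(
        "__IMAGE__", "registry.example.com/app:1.2.3")
    subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest, text=True, check=True)

    # Hand-rolled readiness poll: a wedged rollout just spins here until the deadline.
    deadline = time.time() + 600
    while time.time() < deadline:
        ready = subprocess.run(
            ["kubectl", "get", "deployment", "app",
             "-o", "jsonpath={.status.readyReplicas}"],
            capture_output=True, text=True).stdout
        if ready == "3":
            break
        time.sleep(10)
    else:
        raise SystemExit("timed out; run the clean-up procedure before retrying")

The declarative equivalent is a single "kubectl rollout status deployment/app --timeout=10m", which is part of why the hand-rolled version ends up so brittle.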

------
tcbasche
A place I used to work had started their 'Google Cloud' journey, and the first
thing the account manager did was teach everyone what GKE and K8s were and how
they worked, regardless of whether or not it was required.

Just seems they were pushing it as a marketing thing rather than to solve
specific technical issues. From my limited perspective I've seen it as a
business issue, that turns into a technical one. i.e. inexperienced developers
being forced into a complicated technology for the sake of marketing.

------
nix23
That is a good one:

[https://www.youtube.com/watch?v=6sDTB4eV4F8](https://www.youtube.com/watch?v=6sDTB4eV4F8)

It's about Zalando (the biggest clothes and shoes online shop in Germany), a big
open-source contributor, particularly around Patroni (Postgres HA).

[https://github.com/zalando](https://github.com/zalando)

------
dividedbyzero
I can't contribute anything (luckily); GKE has been absolutely smooth sailing
for us (small-ish data eng/data science team doing most of our own cloud
ops) – even BigQuery has caused more trouble.

Very curious how others have fared, though.

~~~
farisjarrah
We went from building VMs with Ansible on AWS to everything being on GKE.
Previously, if we wanted to build out a whole new stack it could be a multi-day
endeavor; now I can build out a whole new GKE cluster in a matter of hours.
It's also allowed our team to sleep at night, as our Ansible-driven AWS
architecture was fickle and fragile. I'm sure that we could have engineered
our AWS infra much better, but it's a lot of work, and GKE allowed us to
bypass that and focus more on the business's bottom line.

There are 2 major issues that we tend to run into on GKE:

DNS problems: most likely self-inflicted, but they are a pain and can take a
while to troubleshoot (a quick first check is sketched below).

Google Cloud Platform quota increases take FOREVER: as far as I am aware,
Google refuses to allow customers to set organization-wide quotas for new
projects that get spun up, and there is absolutely no way to automate quota
increase requests via an API. It's absolutely infuriating and makes
disposable projects all but impossible, which is something I was really hoping
to set up for my company.

If there is anyone at GCP reading this: Please, please, please allow customers
to set organization-wide quotas (even for new projects that get spun up) and
please give customers an API to manage these things!!!!
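
On the DNS point above, a minimal first check, assuming kubectl access to the cluster:

    import subprocess

    # Resolve the in-cluster API service name from a throwaway pod.
    # If this is slow or fails, look at CoreDNS/kube-dns and node-level resolver config first.
    subprocess.run(
        ["kubectl", "run", "dns-check", "--rm", "-i", "--restart=Never",
         "--image=busybox:1.28", "--", "nslookup", "kubernetes.default"],
        check=True,
    )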

Overall though, GKE is pretty great for us.

------
daviddever23box
Most freakouts are caused by folks whose infra was chicken-wire / chewing-gum
to begin with. Start small, test often, and then make your technical decisions
dispassionately.

~~~
rumanator
I would add that the gripe with Kubernetes that surfaces most often is really
a reflection of a lack of exposure not only to distributed systems but also to
their deployment and operation; Kubernetes just gets dragged into the
discussion as a scapegoat.

------
quickthrower2
Consume a k8s course, not the marketing, and set up a test server. For me the
downside is there is a lot of new stuff to learn. Information overload!

(are we calling it K8 these days?)

------
aprdm
What problems are you trying to solve...? Make a list of them, then evaluate
the complexity and trade-offs of each solution.

K8s is a tool; we have a lot of tools in software for solving problems.

------
Nextgrid
An untold horror story would be the management & development overhead (and
thus cost) associated with it.

~~~
rumanator
> An untold horror story would be the management & development overhead (and
> thus cost) associated with it.

How do you expect to manage clusters of COTS hardware and VMs communicating
through a VPN and supporting version-controlled blue-green deployments without
adding any complexity?

~~~
Nextgrid
You can have two separate machines (whether VMs or bare-metal) plus a
load-balancer in front and do blue/green deployments by deploying changes to
one machine, then redirecting traffic to it by reconfiguring the LB; if all is
good, you then update the second machine as well.
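
A minimal sketch of that flow, assuming nginx as the load balancer; the addresses, paths, and deploy command are all placeholders:

    import subprocess, urllib.request

    IDLE, LIVE = "10.0.0.12", "10.0.0.11"  # placeholder backend addresses

    # 1. Deploy the new build to the idle machine (stand-in for your real deploy step).
    subprocess.run(["ssh", IDLE, "sudo /opt/app/deploy.sh v2.0.0"], check=True)

    # 2. Health-check it before it takes any traffic.
    with urllib.request.urlopen(f"http://{IDLE}:8080/healthz", timeout=5) as resp:
        assert resp.status == 200

    # 3. Flip the LB: rewrite the upstream include and reload
    #    (assumes permission to edit the nginx config).
    with open("/etc/nginx/conf.d/upstream.conf", "w") as f:
        f.write(f"upstream app {{ server {IDLE}:8080; }}\n")
    subprocess.run(["sudo", "nginx", "-s", "reload"], check=True)

    # 4. If everything looks good, repeat steps 1-2 on the machine that just went idle.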

You do not need a VPN if all your machines are on premises on the same network,
or within a VPC in the case of a cloud provider. In cases where you do need a VPN
(let's say a hybrid on-prem + cloud architecture), you can have a single VPN
gateway (either software like Strongswan on a dedicated machine, or a hardware
router + whatever AWS's equivalent of that is) bridging the two networks.

In a lot of projects, though, neither of those things is required at all. Not
every project has a threat model requiring VPN'd communication between hosts,
or multiple environments and seamless deployments (a lot of projects would be
fine with taking 10 minutes of downtime out of hours to deploy updates).

I'm not saying Kubernetes is bad or should never be used, but just like any
tool it should be used for the right task, and in some cases the cost,
complexity, or downsides associated with the tool end up being a net negative.
I dislike the current trend that _everything_ should be in containers,
Kubernetized, etc., where you spend more time on infrastructure than on writing
the code you're trying to deploy in the first place.

~~~
rumanator
> You can have two separate machines (whether VMs or bare-metal) plus a
> load-balancer in front and do blue/green deployments

Congratulations, you've just added half a dozen moving parts that are untested
and require constant maintenance.

And you still don't get version-controlled rollbacks.

What have you gained by not using Kubernetes? I see only losses.
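
For reference, a minimal sketch of the rollback side with Kubernetes (the deployment name is illustrative), with the manifests themselves living in git:

    import subprocess

    # Step the Deployment back to its previous revision and wait for the rollout to settle.
    subprocess.run(["kubectl", "rollout", "undo", "deployment/web"], check=True)
    subprocess.run(["kubectl", "rollout", "status", "deployment/web", "--timeout=5m"], check=True)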

> You do not need a VPN if all your machines are on premises

That assertion is simply wrong. You use a VPN to isolate your application's
traffic.

------
chvid
K8 is also unfit for the largest organizations. They just have the advantage
that they can throw any number of people at it and pretend that everything
is alright.

~~~
omginternets
What exactly makes it unfit for large orgs?

~~~
chvid
The same stuff that makes it unfit for smaller organisations.

~~~
dividedbyzero
And those would be?

~~~
chvid
This was posted elsewhere in this thread:

[https://github.com/hjacobs/kubernetes-failure-stories](https://github.com/hjacobs/kubernetes-failure-stories)

To me the biggest problem is the hype and massive salesmanship used to sell
extreme complexity to gullible companies / individuals. K8s really is the J2EE
application server of the last century, on steroids.

