
Ask HN: Do's/don'ts of working with Kubernetes you learned through experience? - fiddlerINT
We are rolling out Kubernetes to production next month and I'm interested to hear from people who made that step already.
======
nameless912
Can I be honest?

As someone who hopped on the K8s bandwagon back in the early days (circa-early
2017), _do not_ go into production with Kubernetes if you're still asking this
question.

Just a few of the issues I've run into over the past 2 1/2 years or so:

\- Kubernetes DNS flaking out completely

\- Kubernetes DNS flaking out occasionally (for ~5 percent of queries)

\- Giving out too many permissions, causing pods to be deleted without a clear
reason why, often taking down production traffic or logging with it

\- Giving out too few permissions, making our deployment infrastructure depend
on a few lynchpins rather than sharing the production burden

\- probably a dozen different logging aggregation systems, none of which
strike a balance between speed and CPU cost

\- probably a half-dozen different service meshes, all of which suck (with the
exception of linkerd, which is actually quite good)

\- teams with bad sanitization practices leaking credentials all over the place

\- Running Vault in Kubernetes (really, don't ever do this)

\- Disks becoming unattached from their pods for no discernable reason, only
to be re-attached minutes later again with no explanation

\- At least one major production outage on every single Kubernetes-based
system I've built that can be directly attributed to Kubernetes

\- Etcd failovers

\- Etcd replication failures

\- Privilege escalation due to an unsecured Jenkins builder causing credential
exfiltration (this one was _super_ fun to fix)

Kubernetes is a powerful tool, and I've helped run some massive (1000+ node,
5000+ pod x 3 AZ's) systems based on K8s, but it took me a solid year of
experimenting and tinkering to feel even remotely comfortable putting anything
based on K8s into production. If you haven't run into any "major" issues,
you're going to very soon. I can only wish you good luck.

~~~
cocoa19
Thanks for describing your experience.

You do suggest not going into production with k8s, but what if the alternative
is just as difficult to implement, or worse?

------
wikibob
See the excellent collection of Kubernetes incident reviews at
[https://k8s.af](https://k8s.af)

~~~
yongjik
Wow, there are at least two linked stories where people are running pods
without resource limits. Without. Resource. Limits.

At my workplace, I've been whining that we really need resource limits for...
well, much longer than I'd like... all while thinking few other places would be
crazy enough to start jobs without limits. But apparently many people do! Why
are we running Kubernetes, again?

------
CameronBarre
Do keep a GitOps folder/repository to keep your cluster in sync with
expectations; do not let ad hoc edits become the norm.

Use tools like kustomize to reduce proliferation of duplicate k8s resource
files.

Do make sure you are using readiness and liveness checks.

Definitely take care to specify resource requests and limits.

Do use annotations to control provider resources (e.g. cloud load balancers),
rather than manually tweaking resources that were auto-generated from bare k8s
manifests with no annotations.

Aggregate your logs.
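
For the readiness/liveness and requests/limits items above, here's a minimal
sketch of what that looks like on a Deployment (image, probe paths, and numbers
are made-up placeholders; tune them to your workload):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: web
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          containers:
          - name: web
            image: registry.example.com/web:1.0.0   # placeholder image
            ports:
            - containerPort: 8080
            readinessProbe:              # gate traffic until the pod is ready
              httpGet:
                path: /healthz
                port: 8080
              periodSeconds: 10
            livenessProbe:               # restart the container if it wedges
              httpGet:
                path: /healthz
                port: 8080
              initialDelaySeconds: 15
              periodSeconds: 20
            resources:
              requests:                  # what the scheduler reserves
                cpu: 100m
                memory: 128Mi
              limits:                    # hard ceiling before throttling/OOM kill
                cpu: 500m
                memory: 256Mi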

~~~
TuringNYC
Could you comment more on your log aggregation tooling? Did you set up an ELK
stack? Self-installed or paid? That is the most frustrating part for me: why
some of this tooling is not built in.

~~~
CameronBarre
We don't have the most sophisticated setup; we're on AWS, so we use a fluentd
DaemonSet to ship logs to CloudWatch.
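
For reference, a rough sketch of that kind of log-shipping DaemonSet (the image
tag and mounts are illustrative; a real setup also needs a fluentd ConfigMap
plus an IAM role/service account allowed to write to CloudWatch Logs):

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: fluentd
      namespace: logging
    spec:
      selector:
        matchLabels:
          app: fluentd
      template:
        metadata:
          labels:
            app: fluentd
        spec:
          serviceAccountName: fluentd   # must be able to write to CloudWatch Logs
          containers:
          - name: fluentd
            # illustrative tag; pick the cloudwatch variant matching your cluster
            image: fluent/fluentd-kubernetes-daemonset:v1.7-debian-cloudwatch-1
            volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: containers
              mountPath: /var/lib/docker/containers
              readOnly: true
          volumes:
          - name: varlog
            hostPath:
              path: /var/log
          - name: containers
            hostPath:
              path: /var/lib/docker/containers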

In many respects Kubernetes is just a really nice piece of marble to start
with; you still have to carve your own statue.

If certain tooling were built in, the project would be making stack-specific,
subjective decisions for you, which is somewhat antithetical to the Kubernetes
model.

Kubernetes will make the critical operational decisions that are core
competencies (e.g. scheduling work across a pool of resources), but when it
comes to supporting tooling, you still have to make some decisions on your own.

For a lesser burden, there are most likely helm charts or other prepackaged
log aggregation tools out there.

------
moon2
\- Deleting or bulk changing something? Always use the flag --record. This
way, you can refer back to what you changed using kubectl rollout history.

\- If you're planning on using GKE, you'll have to expose your apps using
Ingress (this way you can use GCP's L7 load balancing with HTTPS). However,
this architecture has many limits (e.g. a hard limit of 1000 forwarding rules
per project, each Ingress creates a forwarding rule, and a k8s Ingress can't
refer to another namespace), so make sure you use namespaces wisely.

\- Try to learn and teach people on your team about requests and limits. If
you don't use them carefully, you'll end up wasting a lot of resources. Also,
make sure you have Prometheus and Grafana set up, to give you some visibility.

\- Set up Heptio's Velero; it's a lifesaver, especially when running in a
managed environment where you have no access to etcd. It can be used to back
up your whole cluster and migrate workloads between clusters. If, for some
reason, you end up deleting a cluster by mistake, it will be easier to recover
its workloads using Velero.
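
As an illustration, a rough sketch of a scheduled backup using Velero's
Schedule custom resource (field names are from the Velero docs as I remember
them; double-check against the version you install):

    apiVersion: velero.io/v1
    kind: Schedule
    metadata:
      name: nightly-backup
      namespace: velero
    spec:
      schedule: "0 3 * * *"      # cron expression
      template:
        includedNamespaces:
        - "*"                    # back up everything
        ttl: 720h0m0s            # keep backups for ~30 days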

~~~
ithkuil
Minor note: you can have multiple ingress resources for the same hostname.
This way you can route some paths to some services in a namespace and other
paths to other namespaces.

(Yes, it's confusing. Yes, it can be dangerous)
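
A minimal sketch of that pattern, using the networking.k8s.io/v1beta1 API of
the time (hostname, namespaces, and services are made up). Note that whether
these actually merge behind one load balancer depends on the ingress
controller: nginx-style controllers merge rules for the same host, while the
GCE controller gives each Ingress its own IP.

    apiVersion: networking.k8s.io/v1beta1
    kind: Ingress
    metadata:
      name: team-a
      namespace: team-a
    spec:
      rules:
      - host: apps.example.com
        http:
          paths:
          - path: /team-a        # path-matching semantics vary by controller
            backend:
              serviceName: team-a-web
              servicePort: 80
    ---
    apiVersion: networking.k8s.io/v1beta1
    kind: Ingress
    metadata:
      name: team-b
      namespace: team-b
    spec:
      rules:
      - host: apps.example.com   # same hostname, different namespace
        http:
          paths:
          - path: /team-b
            backend:
              serviceName: team-b-web
              servicePort: 80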

~~~
moon2
It's really confusing. The PaaS we provide to our devs creates a namespace per
app with an ingress (when the app is a web app). We have hit this limit and we
were thinking about doing what you mentioned.

After reading this gigantic issue
([https://github.com/kubernetes/kubernetes/issues/17088](https://github.com/kubernetes/kubernetes/issues/17088)),
we gave up and just created another GCP project.

------
sergiotapia
We're moving away from Kubernetes, to Aptible.

If you're asking these kinds of questions you shouldn't be using kubernetes.

If you are going to use it, be ready to have an engineer on your team doing
full-time devops, or be ready to hire someone who knows k8s. That will run
around $110k to $140k.

But really, don't use it. The gospel you hear is from engineers who already
invested their careers in it. Buyer beware.

~~~
943_924
Something I'm struggling to understand about the hype of orchestrating a
container architecture for everything you do: if it's so incredibly nuanced,
full of pitfalls, and k8s takes 80 hours of instruction/practice to fully grasp
how "easy" it makes life (despite there being layers on top of it, like
OpenShift, that most orgs are going to use anyway), and it takes at least one
"devops" person on a six-figure salary to make sure it all doesn't crash and
burn... what was the point?

~~~
sergiotapia
After a few weeks of diving into it, the point is that you can scale from a
tiny single pod to huge, Google-scale enterprise infrastructure.

But realistically, you don't need all that bullshit when you're a startup or
even a large company making 7 figures.

------
longcommonname
Don't give everybody prod access, but give enough people prod access.

Use namespaces and logically bounded clusters. Get your monitoring, tracing,
and a dashboard to visualize it all figured out now.

------
rochacon
Managed or custom deploy? What is the size of the cluster and team that will
be using it?

Kubernetes is configurable as hell, so your environment matters a lot for
what's a must-have versus a nice-to-have.

If not managed, make sure you go through every component's flags and configure
things like reserved resources, forbidding hostPath usage, pod security
policies (do not allow root), etc.
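
For example, a minimal sketch of a PodSecurityPolicy along those lines
(policy/v1beta1; it only takes effect once the PodSecurityPolicy admission
controller is enabled and the policy is granted to service accounts via RBAC):

    apiVersion: policy/v1beta1
    kind: PodSecurityPolicy
    metadata:
      name: restricted
    spec:
      privileged: false
      allowPrivilegeEscalation: false
      runAsUser:
        rule: MustRunAsNonRoot       # do not allow root
      seLinux:
        rule: RunAsAny
      supplementalGroups:
        rule: RunAsAny
      fsGroup:
        rule: RunAsAny
      volumes:                       # note: hostPath deliberately not listed
      - configMap
      - secret
      - emptyDir
      - persistentVolumeClaim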

Also, avoid service meshes until you fully understand how to use “vanilla”
Kubernetes, don’t add this complexity from day 1 because debugging cluster
issues can get a lot harder.

------
charlieegan3
We experienced an issue with a validating webhook controller configured to
validate (way) more than needed - I wrote it up here:
[https://blog.jetstack.io/blog/gke-webhook-outage](https://blog.jetstack.io/blog/gke-webhook-outage).
It's on [https://k8s.af](https://k8s.af) \- a great place for k8s-related
postmortems.
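
One takeaway is to scope webhooks as narrowly as possible. A rough sketch of a
narrowly scoped ValidatingWebhookConfiguration (group, resource, and service
names are made up; the linked post has the real details):

    apiVersion: admissionregistration.k8s.io/v1beta1
    kind: ValidatingWebhookConfiguration
    metadata:
      name: example-validator
    webhooks:
    - name: validate.example.com
      failurePolicy: Ignore       # don't block the whole cluster if the webhook is down
      namespaceSelector:          # only namespaces that opt in
        matchLabels:
          example-validation: enabled
      rules:                      # only the resources you actually validate
      - apiGroups: ["example.com"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["widgets"]
      clientConfig:
        service:
          namespace: example
          name: example-validator
          path: /validate
        # caBundle: <base64 CA cert> would normally go here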

------
kasey_junk
Have a very clear business case of why you are using it.

Rolling out K8s should not be the goal. It’s a toolset, and an expensive,
bleeding-edge one. It’s also very much geared toward operators, not developers,
so you likely need to build guardrails on top of it.

There are lots of good reasons to use K8s but make sure you know why you are.

------
yellow_lead
If your application requires high availability, make sure you are setting pod
disruption budgets and have some special behavior when SIGKILL is sent to an
app/pod. For some of our applications, we have some logic to finish all
current requests after SIGKILL is sent, so that none are dropped.
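
For the disruption-budget part, a minimal sketch (label and threshold are
placeholders); note a PDB only protects against voluntary disruptions such as
node drains and upgrades, not crashes:

    apiVersion: policy/v1beta1
    kind: PodDisruptionBudget
    metadata:
      name: web-pdb
    spec:
      minAvailable: 2            # or use maxUnavailable
      selector:
        matchLabels:
          app: web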

~~~
deanmoriarty
You mean SIGTERM, right? That's what gets sent to the containers before the
grace period expires.

~~~
yellow_lead
Yes, my mistake.

------
anon284271
Don't use Kubernetes.

------
eeZah7Ux
DON'T: use it.

~~~
kylek
You're gonna get downvoted to hell. But this is probably the sanest advice.

To OP: elaborate on what you're actually trying to accomplish.

------
iamnothere123
DON'T USE IT !!!!!

