I tried to get a 3-node bare-metal cluster running, and I came to the conclusion that it's not worth it unless you have 50+ services running across 5+ servers and genuinely need instant automatic failover. Otherwise Docker Compose is so much easier.
I found very poor documentation on how to recover from failure scenarios. I don't like how the control plane has to be separate from the workers. I happened to choose k3s, which was abandoned. I never understood why etcd NEEDS more than 2 nodes. I hated all the "beta" configuration tags/domains (or whatever they're called); there was no good way to know when they changed or what they changed to. I thought the whole point of Helm was that you could update things easily, but it's the opposite: it breaks things. And I couldn't believe there was no built-in support for uninterruptible power supplies; maybe that's just a k3s thing.
I'm sure it works for huge-scale, entire/multiple data center deployments, but for anything in one location I bet it's a mistake.
The consensus algorithm (Raft in this case) requires that a majority of nodes agree, so you want an odd number of nodes. With 2 nodes the majority is 2, so losing either node loses quorum: you get no more fault tolerance than with 1. Hence 3 is the practical minimum, and in practice it works very well.
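A quick sketch of the quorum arithmetic:

    # Raft quorum: a cluster of n voting members needs floor(n/2) + 1 of
    # them up to commit writes; the remainder is how many can fail.
    def quorum(n: int) -> int:
        return n // 2 + 1

    def tolerated_failures(n: int) -> int:
        return n - quorum(n)

    for n in (1, 2, 3, 4, 5):
        print(f"{n} nodes: quorum={quorum(n)}, can lose {tolerated_failures(n)}")
    # 2 nodes tolerate 0 failures (no better than 1), and 4 tolerate only 1
    # (no better than 3), which is why odd cluster sizes are the norm.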
We have a couple of 4U8N Supermicro "fat twins" hooked to a MikroTik CRS17 (10 GbE). Each node has 6 drive bays; to simplify things, we installed similar-capacity drives and use ZFS raidz2. 7 of the 8 nodes are production and the last node is staging. Each node runs 2x older v4 Xeons (for a total of 64 virtual cores per node) with 256 GB of RAM. This cost us about US$8k upfront per server: the chassis were decommissioned second-hand units (they looked nice), the CPUs second-hand, and everything else new. We pay around US$2k for unmetered 30 Mbps colocation in Shanghai. In the EU/US it must be cheaper and faster.
We provisioned the servers manually, just installing Proxmox on each of them by hand. Proxmox is nice because it has ZFS root support out of the box without any magic. All other Proxmox services are disabled.
After that we installed K3s on top via Ansible, with each node being both master and worker. There's no need for separation unless you have some really write-heavy workloads that will throttle etcd and cause split-brain.
On top of K3s we have three types of workloads:
- stateless workloads, or workloads with volumes from ConfigMaps or Secrets,
- stateful highly available workloads where HA is managed by the application (Postgres, Redis...),
- stateful single-instance workloads.
For HA workloads we use ZFS directly via the openebs-zfs CSI driver. This ties a workload to specific nodes, but since HA is managed by the application and we run a minimum of 5 or 7 nodes, we can usually turn a node off for maintenance and reseed the workload on another node without much trouble. Most Helm charts have this functionality built in (e.g. Postgres has Patroni).
For stateful non-HA workloads (e.g. our knowledgebase) we use Longhorn on top of a zvol formatted as ext4. Longhorn gives us nice backups to any S3-compatible storage, plus durability. We usually go with 5 replicas, or 3 for less important workloads.
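As a rough sketch (the class names, pool name, and settings here are made up, not the exact config above), those two storage tiers could be expressed as StorageClasses via the Kubernetes Python client:

    from kubernetes import client, config

    config.load_kube_config()
    storage = client.StorageV1Api()

    # Node-local ZFS class (openebs zfs-localpv): each volume lives in a
    # pool on one node, so WaitForFirstConsumer pins the pod to that node.
    storage.create_storage_class(body={
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": "zfs-local"},                        # made-up name
        "provisioner": "zfs.csi.openebs.io",
        "parameters": {"poolname": "tank/k8s", "fstype": "zfs"},  # made-up pool
        "volumeBindingMode": "WaitForFirstConsumer",
    })

    # Longhorn class for non-HA stateful apps: Longhorn itself replicates
    # each volume across nodes, here 5 copies.
    storage.create_storage_class(body={
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": "longhorn-r5"},                      # made-up name
        "provisioner": "driver.longhorn.io",
        "parameters": {"numberOfReplicas": "5"},
        "reclaimPolicy": "Retain",
    })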
It worked nicely like this for a few years and survived several minor faults. A failing node is drained and replaced without any drama.
We mostly use Helm charts for applications: Bitnami for Redis and other things, TimescaleDB's chart for Postgres, and our own charts for our applications. Before rolling out to production we test on the staging machines. Of course, it's not the same, but it's close enough.
For dev we use k3d, which is multi-node K3s on top of Docker.
I use a single-node MicroK8s instance mainly to stay familiar with the syntax and tooling for work, rather than as an actual high-availability system.
But I use Portainer and store my files on GitHub so Portainer can auto-update a deployment if I submit a code change: a kind of rudimentary CI/CD that could probably be fleshed out more with some GitHub Actions.
I use iSCSI-mounted storage from my NAS on the host, with k8s volumes storing configs there. Actual app data lives on the NAS and is accessed via NFS by the relevant apps.
So a new deployment is usually: test locally on my laptop; once it's good, commit the code to GitHub and either let the deployment auto-update, or go to Portainer and do it manually if it's a new deployment. Ingress traffic is handled via Cloudflare Tunnels deployed in k8s.
I keep most apps in a single namespace called prod unless they need more than 1-2 pods. If I were doing this again I'd use a namespace per app. I do use a dedicated namespace for anything with a Helm deployment or that needs a lot of pods (e.g. Immich).
Can you tell me more about your Portainer setup? Does it just update your app from an image, or is it checking out code from a git repo on deploy? This approach sounds very interesting.
I stopped doing it, but I used to manage in house k8s servers like pets and not cattle.
We had a MAAS installation at some point, which was neat while it worked. The server boots from the network, runs a tiny Linux distribution to register itself in MAAS, and shuts off. You can later provision it from MAAS and it will boot via Wake-on-LAN, install the image you selected, and be ready for SSH after a little while.
We also had an OpenStack cluster, but after some years we went bare metal because it was cooler at the time. This infrastructure was there to learn, experiment, and have fun.
Monitoring was done in a strange way. Nowadays I would install kube-prometheus-stack and be done. At the time it was Munin and some bespoke monitoring script linked to a stack light. https://en.m.wikipedia.org/wiki/Stack_light
I run k8s "in my house", which is probably different than what you had in mind, but it might still be useful for others.
I use ESXi and run a separate master node and a separate worker node. There are around 12 services running on the worker node, mostly things related to media. I maintain a workbook for how I bring up a new node and how I upgrade the cluster, which are the normal operations I've had to do in the past. For example, at one point I had allocated too little storage to the worker node and it was easier to bring up a new one than to edit the existing one.
I use dynamically provisioned volumes backed by NFS on my NAS for any data that needs to persist across pod restarts. This works surprisingly well. I use nfs-client-provisioner, installed with Helm.
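For reference, a claim against that kind of provisioner might look roughly like this (the class and claim names are assumptions, not the actual setup above):

    from kubernetes import client, config

    config.load_kube_config()

    # Hypothetical PVC: the NFS provisioner creates a per-claim directory
    # on the NAS export and binds a PersistentVolume to it automatically.
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": "app-config"},        # made-up name
        "spec": {
            "accessModes": ["ReadWriteMany"],      # NFS allows shared mounts
            "storageClassName": "nfs-client",      # assumed class name
            "resources": {"requests": {"storage": "1Gi"}},
        },
    }
    client.CoreV1Api().create_namespaced_persistent_volume_claim(
        namespace="default", body=pvc)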
I also use a combination of MetalLB, an nginx ingress controller, and a BIND service, so I can point my laptop's DNS at the BIND server and access all my services by name instead of by IP.
It's rather complex. There's a whole team handling it: we have servers in ~20 data centers distributed globally, large pools have >20k pods running on each, and each DC has 100 TB to 1 PB of RAM available.
We have pod affinity rules (we usually flush entire racks for infra updates) so failures don't bring down services.
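As a minimal sketch (assuming a hypothetical rack label, not their actual config), that kind of rule can be written as an anti-affinity block in the pod spec:

    # Hypothetical pod-spec fragment: keep replicas of one service off the
    # same rack, so flushing a rack takes out at most one replica.
    anti_affinity = {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [{
                "topologyKey": "example.com/rack",            # assumed rack label
                "labelSelector": {
                    "matchLabels": {"app": "my-service"},     # made-up app label
                },
            }],
        },
    }
    # This dict would sit under spec.template.spec.affinity in a Deployment.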
Node failure is rather unusual, it's more likely that we either need to flush a rack to update it or some service has some issue.
We have separate environments with isolated hardware pools for production and testing (though they may be colocated in the same DC).
Nodes have a high-performance NAS available and ephemeral local storage (SSDs) that is wiped on pod restart.
If a node fails, you remove it from the pool and send someone to replace it when feasible.
Provisioning depends on the application. You can provision your own pod (if you have the right access), but applications tend to have deployer services that handle provisioning for them.
My current client recently started running several Kubernetes clusters to support DTAP, using NKE (Nutanix Kubernetes Engine) on several VMs on a single physical Nutanix cluster.
Although we faced some initial hiccups setting it up, because the client does not have physical equipment for application delivery and Nutanix does not provide MetalLB out of the box, everything seems to work beautifully as we speak.
Management of the cluster is basically a combination of web-based Nutanix tools and kubectl, and the nodes are virtualized and thus will survive hardware outages.
We run a 7-node cluster on a Proxmox cluster, currently consisting of two ProLiants and an MSA attached via SAS controllers. We use NFS for persistent storage.
K8s can be redeployed using Kubespray from a GitLab pipeline. We're currently experimenting with Capsule to run tenants inside the cluster.
My team does quite a bit of this. We handle it in two different ways:
For some clusters we carve nodes out of VMware simply using OS templates. For other nodes we use cheap-and-deep blade servers and install the OS on bare metal using PXE. Once the nodes are provisioned we use Ansible to deploy Kubernetes. (Lately it's been RKE2 on top of Rocky.)
Generally speaking, VM-based nodes are extremely reliable and seldom have to be rebuilt. (If we're paying to run VMware, it's because the underlying hardware is high-quality.) Bare-metal nodes, on the other hand, are built on inexpensive hardware and tend to fail in various ways. When they fail, we cordon them, remove them from the cluster, and put them in a list. (We maintain sufficient overcapacity to soak up failures as they come.)
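For what it's worth, the cordon step is easy to script; a minimal sketch with the Kubernetes Python client (node name made up) might be:

    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # Mark a failed node unschedulable (the equivalent of `kubectl cordon`),
    # then list what is still running there so it can be drained.
    node_name = "blade-17"  # hypothetical node name
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})

    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}")
    for pod in pods.items:
        print(pod.metadata.namespace, pod.metadata.name)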
If we're using persistence we have to take care that the StatefulSets are configured correctly. Sometimes we use local-disk persistence so that our services can benefit from local NVMe performance. Other times we use NFS (when we need persistence but not performance).
We monitor cluster node health internally to Kubernetes and also externally using Nagios (shudder).
Kubernetes upgrades are a pain in the ass. Lots of times we'll just set up a second cluster to avoid the risk of a failure during an upgrade.
I'm doing a PoC deploying Talos on Proxmox via Terraform.
One snag I've hit is applying specific config to the cluster. Some things need to be patched on a single master node, which requires me to have some ugly conditionals in the HCL code.
Have you had that situation? How did you solve it?