I have experience running Kubernetes clusters on Hetzner dedicated servers, as well as working with a range of fully or highly managed services like Aurora, S3, and ECS Fargate.
From my experience, the cloud bill on Hetzner can sometimes be as low as 20% of an equivalent AWS bill. However, this cost advantage comes with significant trade-offs.
On Kubernetes with Hetzner, we managed a Ceph cluster using NVMe storage, MariaDB operators, Cilium for networking, and ArgoCD for deploying Helm charts. We had to handle Kubernetes cluster updates ourselves, which included facing a complete cluster failure at one point. We also encountered various bugs in both Kubernetes and Ceph, many of which were documented in GitHub issues and Ceph trackers. The list of tasks to manage and monitor was endless. Depending on the number of workloads and the overall complexity of the environment, maintaining such a setup can quickly become a full-time job for a DevOps team.
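(For the ArgoCD-plus-Helm part, each chart ends up as a single Application object that ArgoCD keeps in sync; a minimal sketch with placeholder repo URL, chart name, version and namespaces, not our actual manifests:)

    # Minimal sketch of deploying a Helm chart via an ArgoCD Application
    # (repoURL, chart, targetRevision and namespaces are placeholders)
    cat <<'EOF' | kubectl apply -f -
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: mariadb-operator
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://example.com/helm-charts
        chart: mariadb-operator
        targetRevision: 0.x.x
      destination:
        server: https://kubernetes.default.svc
        namespace: databases
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
    EOF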
In contrast, using AWS or other major cloud providers allows for a more hands-off setup. With managed services, maintenance often requires significantly less effort, reducing the operational burden on your team.
In essence, with AWS, your DevOps workload is reduced by a significant factor, while on Hetzner, your cloud bill is significantly lower.
Determining which option is more cost-effective requires a thorough TCO (Total Cost of Ownership) analysis. While Hetzner may seem cheaper upfront, the additional hours required for DevOps work can offset those savings.
I've never operated a Kubernetes cluster except for a toy dev cluster for reproducing support issues.
One day it broke because of something to do with certificates (not that it was easy to determine the underlying problem). There was plenty of information online about which incantations were necessary to get it working again, but instead I nuked it from orbit and rebuilt the cluster. From then on I did this every few weeks.
A real Kubernetes operator would have tooling in place to automatically renew certs and who knows what else. I imagine a company would have to pay such an operator.
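(For reference, on a kubeadm-built cluster the manual fix is usually just a couple of commands; a rough sketch, assuming kubeadm 1.20+ and shell access to the control-plane node:)

    # see which control-plane certificates are expired or about to expire
    sudo kubeadm certs check-expiration

    # renew everything kubeadm manages
    sudo kubeadm certs renew all

    # the API server, controller-manager, scheduler and etcd static pods
    # then need a restart, e.g. by briefly moving their manifests out of
    # /etc/kubernetes/manifests and back
    sudo mv /etc/kubernetes/manifests /etc/kubernetes/manifests.off
    sleep 20 && sudo mv /etc/kubernetes/manifests.off /etc/kubernetes/manifests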
> Determining which option is more cost-effective requires a thorough TCO (Total Cost of Ownership) analysis. While Hetzner may seem cheaper upfront, the additional hours required for DevOps work can offset those savings.
Sure, but the TLDR is going to be that if you employ n or more sysadmins, the cost savings will dominate. With 2 < n < 7. So for a given company size, Hetzner will start being cheaper at some point, and it will become more extreme the bigger you go.
Second, if you have a "big" cost, whatever it is (bandwidth, disk space, essentially anything but compute), the cost savings will dominate faster.
Not always. Employing sysadmins doesn't mean Hetzner is cheaper, because those "sysadmin/ops type people" are being hired to manage the Kubernetes cluster. And ops people who truly know Kubernetes are not cheap.
Sure, you can get away with legoing some K3S stuff together for a while but one major outage later, and that cost saving might have entirely disappeared.
When I worked in web hosting (more than 10 years ago), we would constantly be blackholing Hetzner IPs due to bad behavior. Same with every other budget/cheap VM provider. For us, it had nothing to do with geo databases, just behavior.
Yep, I had the same problem years ago when I tried to use Mailgun's free tier. Not picking on them, I loved the features of their product, but the free tier IPs had a horrible reputation and mail just would not get accepted, especially by Hotmail or Yahoo.
Any free hosting service will be overwhelmed by spammers and fraudsters. Cheap services the same but less so, and the more expensive they are, the less they will be used for scams and spam.
Depending on the prices, maybe a valid strategy would be to have servers at Hetzner and then tunnel ingress/egress through somewhere with a better reputation. Maybe it still makes financial sense even after adding that network traffic to the calculation?
That's a really good article. We actually went through a similar migration recently, and we were using dedicated nodes in our setup.
In order to integrate a Hetzner-provided load balancer with our k8s on dedicated servers, we had to implement a super thin operator that does it: https://github.com/Intreecom/robotlb
If anyone is inspired by this article and wants to do the same, feel free to use this project.
I loved the article. Insightful, and packed with real world applications. What a gem.
I have a side-question pertaining to cost-cutting with Kubernetes. I've been musing over the idea of setting up Kubernetes clusters similar to these ones but mixing on-premises nodes with nodes from the cloud provider. The setup would be something like:
- vCPUs for bursty workloads,
- bare metal nodes for the performance-oriented workloads required as base-loads,
- on-premises nodes for spiky performance-oriented workloads, and dirt-cheap on-demand scaling.
What I believe will be the primary unknown is egress costs.
>All root servers have a dedicated 1 GBit uplink by default and with it unlimited traffic.
>Inclusive monthly traffic for servers with 10G uplink is 20TB. There is no bandwidth limitation. We will charge € 1/TB for overusage.
So it sounds like it depends. I have used them for (I'm guessing) 20 years and have never had a network problem with them or a surprise charge. Of course I mostly worked in the low double digit terabytes. But have had servers with them that handled millions of requests per day with zero problems.
1 Gbit/s is 1/8 GB/s, and 1/8 * 3600 * 24 * 30 = 324,000 GB, so that 1 Gbit/s server could conceivably push 324 TB of traffic per month "for free". It obviously won't, but even a tenth of that is more than the 20 TB included with the 10G link.
They do have a fair use policy on the 1GBit uplink. I know of one report[1] of someone using over 250TB per month getting an email telling them to reduce their traffic usage.
The 10GBit uplink is something you need to explicitly request, and presumably it is more limited because if you go through the trouble of requesting it, you likely intend to saturate it fairly consistently, and that server's traffic usage is much more likely to be an outlier.
They run a WireGuard network between the nodes so you can have a mix of on-premises and cloud within one cluster. It works really well but unfortunately is a commercial product with a pricing model that is a little inflexible.
But at least it shows it's technically possible so maybe open source options exist.
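For a rough idea of what a hand-rolled open-source version looks like: each node gets a WireGuard interface and one peer entry per other node, and the cluster components talk over the overlay addresses. A minimal sketch, with placeholder keys and IPs:

    # keys and IPs below are placeholders; repeat a [Peer] block per node
    sudo tee /etc/wireguard/wg0.conf >/dev/null <<'EOF'
    [Interface]
    # overlay address this node is known by inside the cluster
    Address = 10.8.0.2/24
    PrivateKey = <on-prem-node-private-key>
    ListenPort = 51820

    [Peer]
    # a Hetzner cloud node, reachable on its public IP
    PublicKey = <cloud-node-public-key>
    Endpoint = 203.0.113.10:51820
    AllowedIPs = 10.8.0.1/32
    # keeps NAT mappings alive when this node sits behind on-prem NAT
    PersistentKeepalive = 25
    EOF
    sudo systemctl enable --now wg-quick@wg0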
You could make a mesh with something like Netmaker to achieve something similar using FOSS. Note I haven't used Netmaker in years, but I was able to achieve this in some of their earlier releases. I found it to be a bit buggy and unstable at the time due to it being such young software, but it may have matured enough now that it could work in an enterprise-grade setup.
The sibling comment's recommendation, Nebula, does something similar with a slightly different approach.
yes, like i said, throw an overlay on that motherfucker and ignore the fact that when a customer request enters the network it does so at the cloud provider, then is proxied off to the final destination, possibly with multiple hops along the way.
you can't just slap an overlay on and expect everything to work in a reliable and performant manner. yes, it will work for your initial tests, but then shit gets real when you find that the route from datacenter a to datacenter b is asymmetric and/or shifts between providers, altering site to site performance on a regular basis.
the concept of bursting into on-prem is the most offensive bit about the original comment. when your site traffic is at its highest, you're going to add an extra network hop and proxy into the mix with a subset of your traffic getting shipped off to another datacenter over internet quality links.
a) Not every Kubernetes cluster is customer facing.
b) You should be architecting your platform to accommodate these very common networking scenarios, e.g. having edge caching. Because slow backends can be caused by a range of non-networking issues as well.
c) Many cloud providers (even large ones like AWS) are hosted in or have special peering relationships with third party DCs e.g. [1]. So there are no "internet quality links" if you host your equipment in one of the major DCs.
Be careful with Hetzner, they null routed my game server on launch day due to false positives from their abuse system, and then took 3 days for their support team to re-enable traffic.
By that point I had already moved to a different provider of course.
Digital Ocean did this to my previous company. They said we'd been the target of a DoS attack (no evidence we could see). They re-enabled the traffic, then did it again the next day, and then again. When we asked them to stop doing that, they said we should use Cloudflare to prevent DoS attacks... all the box did was store backups that we transferred over SSH. Nothing that could go behind Cloudflare, no web server running, literally only one port open.
Nothing beats AWS tbh: the level of extra detail AWS adds, like emailing and alerting a gazillion times before making any changes to underlying hardware, even if non-disruptive.
Robust <24-hour support from thorough, experienced, technical support staff, and a very visible customer-obsession-infused experience all around.
OVH has issues with taking down VPS/bare-metal instances at random, with their support staff having no clue (or only stale, non-real-time data) about instance state. They lost a ton of customer data in their huge datacenter fire a couple of years ago, didn't even replicate the backups across multiple datacenters like they were supposed to, and got sued a ton too.
I use OVH because the cost reduction really adds up for my workloads (remote video editing / a custom rendering farm at scale, with much cheaper OVH S3 that suits my temporary-but-numerous-asset workload and its high egress requirements), but otherwise I miss AWS, and I now get just how much superior their support and attention to detail are.
Reading comments from the past few days makes it seem like dealing with Hetzner is a pain (and as far as I can tell, they aren't really that much cheaper than the competitors).
> (and as far as I can tell, they aren't really that much cheaper than the competitors)
Can you say more? Their Cloud instances, for example, are less than half the cost of OVH's, and less than a fifth of the cost of a comparable AWS EC2 instance.
I don't think so. We see the outliers. Those happen at Linode, Digital Ocean, etc. also. And yes, even at Google Cloud and AWS you sometimes get either unlucky or unfairly treated.
I haven't used it personally, but https://github.com/kube-hetzner/terraform-hcloud-kube-hetzne... looks amazing as a way to set up and manage Kubernetes on Hetzner. At the moment I'm on Oracle free tier, but I keep thinking about switching to it to get off... Well, Oracle.
I'm running two clusters on it, one for production and one for dev. It works pretty well, with a schedule to reboot machines every Sunday for automatic security updates (openSUSE MicroOS). I've also expanded machines for increased workloads. You have to make sure to inspect every change Terraform wants to make, but then you're pretty safe.
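For what it's worth, the day-to-day loop is just plain Terraform, which is where that habit of inspecting every change comes in; roughly (the variable name is an assumption from memory, check the project's kube.tf.example):

    # Hetzner Cloud API token, passed via Terraform's TF_VAR_ convention
    # (the module's variable name may differ; treat it as a placeholder)
    export TF_VAR_hcloud_token="..."

    terraform init
    terraform plan    # review every planned change before applying
    terraform apply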
The only downside is that every node needs a public IP, even though they are behind a firewall. But that is being worked on.
I recently read an article about running k8s on the Oracle free tier and was looking to try it. I'm curious, are there any specific pain points that are making you think of switching?
Nope, just Oracle being a corp with a nasty reputation. Honestly, it was easy to set up and has been super stable, and if you go ARM the amount of resources you get for free is crazy. I actually do recommend it for personal projects and the like. I'd just be hesitant about building a business on any Oracle offering.
I've used this to set up a cluster to host a dogfooded journalling site.
In one evening I had a cluster working.
It works pretty well. I had one small problem where the auto-update wouldn't run on ARM nodes, which stopped the single node I had running at that point (the control-plane taint blocked the update pod from running on it).
> While DigitalOcean, like other providers, offers a free managed control plane, there is typically a 100% markup on the nodes that belong to these managed clusters.
I don't think this is true. With Digital Ocean, the worker nodes are the same cost as regular droplets; there are no additional costs involved. This makes Digital Ocean's offering very attractive: a free control plane you don't have to worry about, free upgrades, and some extra integrations for things like the load balancer, storage, etc. I can't think of a reason not to go with that over self-managed.
I'm planning on doing something similar but want to use Talos with bare metal machines. I expect to see similar price reductions compared to our current EKS bill.
It took minutes to set up a cluster, and I love having a UI to see what is happening.
I wish there were more products like this as I suspect there will be a trend towards more self-managed Kubernetes clusters given how expensive the cloud is becoming.
I set up a Talos bare metal cluster about a year ago, and documented the whole process on my website. Feel free to reach out if you have any questions!
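The core bootstrap flow is only a handful of talosctl commands; roughly, and from memory, so treat it as a sketch (flags vary a bit by Talos version) rather than the exact steps from my write-up:

    # generate machine configs for a cluster endpoint (name and IPs are placeholders)
    talosctl gen config my-cluster https://203.0.113.20:6443

    # push the configs to machines booted from the Talos installer image
    talosctl apply-config --insecure --nodes 203.0.113.20 --file controlplane.yaml
    talosctl apply-config --insecure --nodes 203.0.113.21 --file worker.yaml

    # bootstrap etcd on the first control-plane node, then fetch a kubeconfig
    talosctl bootstrap --nodes 203.0.113.20 --endpoints 203.0.113.20
    talosctl kubeconfig --nodes 203.0.113.20 --endpoints 203.0.113.20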
This is probably out of left field, but what is the benefit of having a naming scheme for nodes without any delimiters? Reading at a glance, and not knowing the region naming convention of a given provider (i.e. Hetzner), I'm at a loss to quickly map "euc1pmgr1" back to "<region><zone><environment><role><number>". I feel like I'm missing something, because having delimiters would make all sorts of automated parsing much easier.
Thanks for getting back to me! Now that you've written it out, it's plainly obvious, but for me the readability and flexibility of delimiters beats the speed of typing and scanning. Many a time I've been grateful that I added delimiters, because then I was no longer hamstrung by potential changes to the length of any particular segment within the name.
> Hetzner volumes are, in my experience, too slow for a production database. While you may in the past have had a good experience running customer-facing databases on AWS EBS, with Hetzner's volumes we were seeing >50ms of IOWAIT with very low IOPS.
There is a surprisingly easy way to address this issue: use (ridiculously cheap) Hetzner metal machines as nodes. The ones with nvme storage offer excellent performance for dbs and often have generous amounts of RAM. I'd go as far as to say you'd be better off to invest in two or more beefy bare metal machines for a master-replica(s) setup rather than run the db on k8s.
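(If you want to see the gap for yourself, a quick fio run against a Hetzner volume mount and against local NVMe makes it obvious; a typical 4k random-read test, with arbitrary parameters:)

    # 4k random reads, direct I/O, 60 seconds; run once on the volume mount
    # and once on the NVMe mount, then compare IOPS and latency
    fio --name=randread --filename=/mnt/test/fio.dat --size=4G \
        --rw=randread --bs=4k --iodepth=32 --ioengine=libaio \
        --direct=1 --runtime=60 --time_based --group_reporting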
If you don't want to be bothered with the setup, you can use one of many modern packages such as Pigsty: https://pigsty.cc/ (not affiliated but a huge fan).
Very nicely written article. I'm also running a k8s cluster, but on bare metal with QEMU/KVM VMs for the base load. I wonder why you would choose VMs instead of bare metal if you're looking for cost optimisation (additional overhead maybe?); could you share more about this, or did I miss it?
Thank you! The cloud servers are sufficiently cheap for us that we could afford the extra flexibility we get from them. Hetzner can move VMs around without us noticing; by contrast, they have been rebooting a number of metal machines for maintenance recently and over the last little while, which would have been disruptive, especially during the migration. I might have another look at metal next year, but I'm happy with the cloud VMs currently.
Note, they usually do not reboot or touch your servers. But yes, the current maintenance of their metal routers (rare, like once every 2 years) requires you to juggle a bit with different machines in different datacenters.
I didn’t touch on that in the article, but essentially it’s a one line change to add a worker node (or nodes) to the cluster, then it’s automatically enrolled.
We don’t have such bursty requirements fortunately so I have not needed to automate this.