Kubernetes on Hetzner: cutting my infra bill by 75% (bilbof.com)
165 points by BillFranklin 11 hours ago | 80 comments





I have experience running Kubernetes clusters on Hetzner dedicated servers, as well as working with a range of fully or highly managed services like Aurora, S3, and ECS Fargate.

From my experience, the cloud bill on Hetzner can sometimes be as low as 20% of an equivalent AWS bill. However, this cost advantage comes with significant trade-offs.

On Kubernetes with Hetzner, we managed a Ceph cluster using NVMe storage, MariaDB operators, Cilium for networking, and ArgoCD for deploying Helm charts. We had to handle Kubernetes cluster updates ourselves, which included facing a complete cluster failure at one point. We also encountered various bugs in both Kubernetes and Ceph, many of which were documented in GitHub issues and Ceph trackers. The list of tasks to manage and monitor was endless. Depending on the number of workloads and the overall complexity of the environment, maintaining such a setup can quickly become a full-time job for a DevOps team.

In contrast, using AWS or other major cloud providers allows for a more hands-off setup. With managed services, maintenance often requires significantly less effort, reducing the operational burden on your team.

In essence, with AWS, your DevOps workload is reduced by a significant factor, while on Hetzner, your cloud bill is significantly lower.

Determining which option is more cost-effective requires a thorough TCO (Total Cost of Ownership) analysis. While Hetzner may seem cheaper upfront, the additional hours required for DevOps work can offset those savings.


I've never operated a kubernetes cluster except for a toy dev cluster for reproducing support issues.

One day it broke because of something to do with certificates (not that it was easy to determine the underlying problem). There was plenty of information online about which incantations were necessary to get it working again, but instead I nuked it from orbit and rebuilt the cluster. From then on I did this every few weeks.

A real kubernetes operator would have tooling in place to automatically upgrade certs and who knows what else. I imagine a company would have to pay such an operator.


Manually updating k8s clusters is a huge tradeoff. I can’t imagine doing that to save a couple bucks unless I was desperate

> Determining which option is more cost-effective requires a thorough TCO (Total Cost of Ownership) analysis. While Hetzner may seem cheaper upfront, the additional hours required for DevOps work can offset those savings.

Sure, but the TLDR is going to be that if you employ n or more sysadmins, the cost savings will dominate, with 2 < n < 7. So for a given company size, Hetzner will start being cheaper at some point, and it will become more extreme the bigger you go.

Second, if you have a "big" cost, whatever it is (bandwidth, disk space, essentially anything but compute), the cost savings will dominate faster.
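
To make that concrete, here's a toy break-even calculation; every number below is made up, so plug in your own:

    # Toy TCO sketch -- all figures are hypothetical placeholders
    aws_monthly      = 40_000   # managed-services bill
    hetzner_monthly  = 10_000   # roughly a quarter of the AWS bill
    sysadmin_monthly = 10_000   # fully loaded cost per extra ops hire
    extra_ops_hires  = 2        # headcount needed to self-manage

    savings = aws_monthly - hetzner_monthly - extra_ops_hires * sysadmin_monthly
    print(savings)  # positive => Hetzner wins at this scale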


Not always. Employing sysadmins doesn't mean Hetzner is cheaper, because those "Sysadmin/Ops type people" are being hired to manage the Kubernetes cluster. And Ops type people who truly know Kubernetes are not cheap.

Sure, you can get away with legoing some K3S stuff together for a while but one major outage later, and that cost saving might have entirely disappeared.


When I worked in web hosting (more than 10 years ago), we would constantly be blackholing Hetzner IPs due to bad behavior. Same with every other budget/cheap VM provider. For us, it had nothing to do with geo databases, just behavior.

You get what you pay for, and all that.


Yep, I had the same problem years ago when I tried to use Mailgun's free tier. Not picking on them, I loved the features of their product, but the free tier IPs had a horrible reputation and mail just would not get accepted, especially by Hotmail or Yahoo.

Any free hosting service will be overwhelmed by spammers and fraudsters. Cheap services the same but less so, and the more expensive they are, the less they will be used for scams and spam.


Tragedy of the Commons Ruins Everything Around Me.

Depending on the prices, maybe a valid strategy would be to have servers at Hetzner and then tunnel ingress/egress somewhere more prominent. Maybe adding the network traffic to the calculation still makes financial sense?

They could put the backend on Hetzner, if it makes sense (for example queues or batch processors).

That's a really good article. Actually, recently we were migrating as well and we were using dedicated nodes in our setup.

In order to integrate a load-balancer provided by hetzner with our k8s on dedicated servers we had to implement a super thin operator that does it: https://github.com/Intreecom/robotlb

If anyone is inspired by this article and wants to do the same, feel free to use this project.


I loved the article. Insightful, and packed with real world applications. What a gem.

I have a side-question pertaining to cost-cutting with Kubernetes. I've been musing over the idea of setting up Kubernetes clusters similar to these ones but mixing on-premises nodes with nodes from the cloud provider. The setup would be something like:

- vCPUs for bursty workloads,

- bare metal nodes for the performance-oriented workloads required as base-loads,

- on-premises nodes for spiky performance-oriented workloads, and dirt-cheap on-demand scaling.

What I believe will be the primary unknown is egress costs.

Has anyone ever toyed around with the idea?


For dedicated they say this:

>All root servers have a dedicated 1 GBit uplink by default and with it unlimited traffic.

>Inclusive monthly traffic for servers with 10G uplink is 20TB. There is no bandwidth limitation. We will charge € 1/TB for overusage.

So it sounds like it depends. I have used them for (I'm guessing) 20 years and have never had a network problem with them or a surprise charge. Of course I mostly worked in the low double digit terabytes. But have had servers with them that handled millions of requests per day with zero problems.


> We will charge € 1/TB for overusage.

It sounds like a good tradeoff. The monthly cost of a small vCPU is equivalent to a few TB of bandwidth.


1/8 GB/s * 3600 * 24 * 30 = 324,000 GB, so that 1 GBit/s server could conceivably get 324TB of traffic per month "for free". It obviously won't, but even a tenth of that is more than the traffic included with the 10G link.
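
For anyone who wants to sanity-check that figure, the same back-of-the-envelope in Python:

    # 1 Gbit/s saturated for a 30-day month
    gbyte_per_s = 1 / 8                  # 0.125 GB/s
    seconds     = 3600 * 24 * 30         # 2,592,000 s
    print(gbyte_per_s * seconds / 1000)  # ~324 TB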

They do have a fair use policy on the 1GBit uplink. I know of one report[1] of someone using over 250TB per month getting an email telling them to reduce their traffic usage.

The 10GBit uplink is something you need to explicitly request, and presumably it is more limited because if you go through the trouble of requesting it, you likely intend to saturate it fairly consistently, and that server's traffic usage is much more likely to be an outlier.

[1]: https://lowendtalk.com/discussion/180504/hetzner-traffic-use...


20TB egress on AWS runs you almost $2,000, btw. One of the biggest benefits of Hetzner.
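
Rough arithmetic behind that number, assuming AWS's usual tiered egress rates of roughly $0.09/GB for the first 10 TB and $0.085/GB for the next tier (worth double-checking current pricing):

    # Approximate AWS egress cost for 20 TB/month (rates are assumptions)
    cost = 10_000 * 0.09 + 10_000 * 0.085
    print(cost)  # ~1750 USD, vs traffic that is simply included at Hetzner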

> Has anyone ever toyed around with the idea?

Sidero Labs have done this with Omni: https://omni.siderolabs.com

They run a Wireguard network between the nodes so you can have a mix of on-premise and cloud within one cluster. Works really well but unfortunately is a commercial product with a pricing model that is a little inflexible.

But at least it shows it's technically possible so maybe open source options exist.


You could make a mesh with something like Netmaker to achieve something similar using FOSS. Note I haven’t used Netmaker in years, but I was able to achieve this in some of their earlier releases. I found it to be a bit buggy and unstable at the time due to it being such young software, but it may have matured enough now that it could work in an enterprise-grade setup.

The sibling comments recommendation, Nebula, does something similar with a slightly different approach.


> They run a Wireguard network between the nodes so you can have a mix of on-premise and cloud within one cluster.

Interesting.

A quick search shows that some people already toyed with the idea of rolling out something similar.

https://github.com/ivanmorenoj/k8s-wireguard


Slack’s Nebula does something similar, and it is open source.

I'm a bit sad the aggressive comment by the new account was deleted :-(

The comment was making fun of the wishful thinking and the realities of networking.

It was a funny comment :-(


Enable "showdead" on your profile and you can see it.

It wasn’t funny. I can still see it. The answer was vpn. If you want to go fancy you can do istio with vms.

And if you wanna be lazy, there is a tailscale integration to run the cluster communication over it.

https://tailscale.com/kb/1236/kubernetes-operator

They've even improved it, so you can now actually resolve the services etc via the tailnet dns

https://tailscale.com/learn/managing-access-to-kubernetes-wi...

I haven't tried that second part though, only read about it.


Okay, vpn it is.

I just wanted to provide the link in case someone was interested, I know you already mentioned it 。 ◕ ‿ ◕ 。

(Setting up a k8s cluster over software VPN was kinda annoying the last time I tried it manually, but super easy with the tailscale integration)


yes, like i said, throw an overlay on that motherfucker and ignore the fact that when a customer request enters the network it does so at the cloud provider, then is proxied off to the final destination, possibly with multiple hops along the way.

you can't just slap an overlay on and expect everything to work in a reliable and performant manner. yes, it will work for your initial tests, but then shit gets real when you find that the route from datacenter a to datacenter b is asymmetric and/or shifts between providers, altering site to site performance on a regular basis.

the concept of bursting into on-prem is the most offensive bit about the original comment. when your site traffic is at its highest, you're going to add an extra network hop and proxy into the mix with a subset of your traffic getting shipped off to another datacenter over internet quality links.


a) Not every Kubernetes cluster is customer facing.

b) You should be architecting your platform to accommodate these very common networking scenarios, i.e. having edge caching. Because slow backends can be caused by a range of non-networking issues as well.

c) Many cloud providers (even large ones like AWS) are hosted in or have special peering relationships with third party DCs e.g. [1]. So there are no "internet quality links" if you host your equipment in one of the major DCs.

[1] https://www.equinix.com.au/partners/aws


> yes, like i said, (...)

I'm sorry, you said absolutely nothing. You just sounded like you were confused and for a moment thought you were posting on 4chan.


Nobody said „do it guerilla-style”. Put some thought into it.

This is an interesting writeup, but I feel like it's missing a description of the cluster and the workload that's running on it.

How many nodes are there, how much traffic does it receive, what are the uptime and latency requirements?

And what's the absolute cost savings? Saving 75% of $100K/mo is very different from saving 75% of $100/mo.


Be careful with Hetzner, they null routed my game server on launch day due to false positives from their abuse system, and then took 3 days for their support team to re-enable traffic.

By that point I had already moved to a different provider of course.


Digital Ocean did this to my previous company. They said we’d been the target of a DoS attack (no evidence we could see). They re-enabled the traffic, then did it again the next day, and then again. When we asked them to stop doing that, they said we should use Cloudflare to prevent DoS attacks… all the box did was store backups that we transferred over SSH. Nothing that could go behind Cloudflare, no web server running, literally only one port open.

Where did you move? Asking to keep a list of options for my game servers; I’m using OVH game servers atm.

I went back to AWS. Expensive but reliable and support I can get ahold of. I’d still like to explore OVH someday though.

Nothing beats AWS tbh. The level of extra detail AWS adds, like emailing and alerting a gazillion times before making any changes to underlying hardware, even if non-disruptive; robust <24 hour support from detailed, experienced and technical support staff; a very visible customer-obsession-laced experience all around. OVH has issues with randomly taking down VPS/bare-metal instances, with their support staff having no clue / late, non-real-time data on their instance state. They lost a ton of customer data in their huge datacenter fire 2 yrs ago, didn't even replicate the backups across multiple datacentres like they were supposed to, and got sued a ton too.

I use OVH because the cost reduction supremely adds up for my workloads (remote video editing / custom rendering farm at scale, with much cheaper OVH S3 suitable for my temporary but asset-heavy workload with high egress requirements), but otherwise I miss AWS and only now get just how much superior their support and attention to detail is.


Reading comments from the past few days makes it seem like dealing with Hetzner is a pain (and as far as I can tell, they aren't really that much cheaper than the competitors).

> (and as far as I can tell, they aren't really that much cheaper than the competitors)

Can you say more? Their Cloud instances, for example, are less than half the cost of OVH's, and less than a fifth of the cost of a comparable AWS EC2 instance.


even free servers are of no use if they're not usable during a product launch. :) You get what you pay for, I guess.

But i do agree, it is much cheaper.


To be fair, what use is a server if you can’t afford to keep it running? This is especially true for very bootstrapped startups.

What competitors are similar to Hetzner in pricing? Last I checked, they seemed quite a bit cheaper than most.

Forum for cheap hosts:

https://lowendtalk.com/

Wouldn't recommend any of these outside of personal use though.


I don't think so. We see the outliers. Those happen at Linode, Digital Ocean, etc. also. And yes, even at Google Cloud and AWS you sometimes get either unlucky or unfairly treated.

> they aren't really that much cheaper than the competitors

This is demonstrably false.


Honestly, Hetzner support has been outstanding in my experience. They are always there and very responsive over email.

I haven't used it personally, but https://github.com/kube-hetzner/terraform-hcloud-kube-hetzne... looks amazing as a way to set up and manage Kubernetes on Hetzner. At the moment I'm on Oracle free tier, but I keep thinking about switching to it to get off... Well, Oracle.

I'm running two clusters on it, one for production and one for dev. Works pretty well. With a schedule to reboot machines every Sunday for automatic security updates (openSUSE MicroOS). Also expanded machines for increased workloads. You have to make sure to inspect every change Terraform wants to make, but then you're pretty safe. The only downside is that every node needs a public IP, even though they are behind a firewall. But that is being worked on.

I recently read an article about running k8s on the Oracle free tier and was looking to try it. I'm curious, are there any specific pain points that are making you think of switching?

Nope, just Oracle being a corp with a nasty reputation. Honestly it was easy to set up and has been super stable, and if you go ARM the amount of resources you get for free is crazy. I actually do recommend it for personal projects and the like. I'd just be hesitant about building a business based on any Oracle offering.

I've used this to set up a cluster to host a dogfooded journalling site.

In one evening I had a cluster working.

It works pretty well. I had one small problem when the auto-update wouldn't run on ARM nodes, which stopped the single node I had running at that point (with the control plane taint blocking the update pod from running on it).


> While DigitalOcean, like other providers, offers a free managed control plane, there is typically a 100% markup on the nodes that belong to these managed clusters.

I don't think this is true. With Digital Ocean, the worker nodes are the same cost as regular droplets; there are no additional costs involved. This makes Digital Ocean's offering very attractive: free control plane you don't have to worry about, free upgrades, and some extra integrations to things like the load balancer, storage, etc. I can't think of a reason to not go with that over self-managed.


The actual nodes are still way more expensive on digital ocean than they are in Hetzner. That’s probably the main reason.

8GB RAM, shared cpu on hetzner is ~$10

Equivalent on digital ocean is $48


I’m planning on doing something similar but want to use Talos with bare metal machines. I expect to see similar price reductions from our current EKS bill.

Depending on your cluster size I highly recommend Omni: https://omni.siderolabs.com

It took minutes to setup a cluster and I love having a UI to see what is happening.

I wish there were more products like this as I suspect there will be a trend towards more self-managed Kubernetes clusters given how expensive the cloud is becoming.


I set up a Talos bare metal cluster about a year ago, and documented the whole process on my website. Feel free to reach out if you have any questions!

Any thoughts/feelings about Talos vs Bottlerocket?

Can anybody speak to the pros and cons of Hetzner vs OVH?

There ain't many large European cloud companies, and I would like to understand how they differentiate.

Ionos is another European one. Currently, it looks like their cloud business is stagnating, though.


I'd say stay clear of Ionos.

Bonkers first experience in the last two weeks.

Graphical "Data center designer", no ability to open multiple tabs, instead always rerouting to the main landing page.

Attached 3 IGWs to a box, all public IPs, GUI shows "no active firewall rules".

IGW 1: 100% packet loss over 1 minute.

IGW 2: 85% packet loss over 1 minute.

IGW 3: 95% packet loss over 1 minute.

Turns out "no active Firewall rules" just wasn't the case and explicit whitelisting is absolutely required.

But wait, there's more!

Created a hosted PostgreSQL instance, assigned a private subnet for creation.

SSH into my server, ping the URL of the created Postgres instance: The DB's IP is outside the CIDR range of the assigned subnet and unreachable.

What?

Deleted the instance, created another one, exact same settings. Worked this time around.

Support quality also varies extremely.

Out of 3 encounters, I had a competent person once.

The other two straight out said they had no idea what was going on.


This is probably out of left field, but what is the benefit of having a naming scheme for nodes without any delimiters? Reading at a glance and not knowing the region name convention of a given provider (i.e. Hetzner), I'm at a loss to quickly decipher the "<region><zone><environment><role><number>" to "euc1pmgr1". I feel like I'm missing something because having delimiters would make all sorts of automated parsing much easier.

Quicker to type and scan! Though I admit this is preference, delimiters would work fine too.

Parsing works the same but is based on a simple regex rather than splitting on a hyphen.

euc=eu central; 1=zone/dc; p=production; wkr=worker; 1=node id
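
A minimal Python sketch of that parsing; the exact segment widths (3-letter region, 1-digit zone, 1-letter environment, 3-letter role) are my guess from the euc1pmgr1 example:

    import re

    # "<region><zone><environment><role><number>", e.g. "euc1pwkr1"
    NODE_NAME = re.compile(
        r"^(?P<region>[a-z]{3})(?P<zone>\d)(?P<env>[a-z])(?P<role>[a-z]{3})(?P<node>\d+)$"
    )
    print(NODE_NAME.match("euc1pwkr1").groupdict())
    # {'region': 'euc', 'zone': '1', 'env': 'p', 'role': 'wkr', 'node': '1'}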


Thanks for getting back to me! Now that you’ve written it out, it’s plainly obvious, but for me the readability and flexibility of delimiters beats the speed of typing and scanning. Many a time I’ve been grateful that I added delimiters, because then I was no longer hamstrung by potential changes to the length of any particular segment within the name.

Yea, not putting in delimiters and then having to change our format has bitten me so many times. Delimiters or bust.

You can treat the numeric parts as self-delimiting ... that leaves only the assumption that "environment" is a single letter.

> Hetzner volumes are, in my experience, too slow for a production database. While you may in the past have had a good experience running customer-facing databases on AWS EBS, with Hetzner's volumes we were seeing >50ms of IOWAIT with very low IOPS.

There is a surprisingly easy way to address this issue: use (ridiculously cheap) Hetzner metal machines as nodes. The ones with nvme storage offer excellent performance for dbs and often have generous amounts of RAM. I'd go as far as to say you'd be better off to invest in two or more beefy bare metal machines for a master-replica(s) setup rather than run the db on k8s.

If you don't want to be bothered with the setup, you can use one of many modern packages such as Pigsty: https://pigsty.cc/ (not affiliated but a huge fan).


Thanks, hadn’t heard of pigsty. As you say, I had to use nvme ssds for the dbs, the performance is pretty good so I didn’t look to get metal nodes.

There are plenty of options for running a database on Kubernetes whilst using local NVMe storage.

These range from simply pinning the database pods to specific nodes and using a LocalPathProvisioner, up to distributed solutions like JuiceFS, OpenEBS, etc.
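
As a minimal sketch of the node-pinning approach, using the official Kubernetes Python client; the StatefulSet name, namespace and the disktype=nvme node label are hypothetical:

    # Pin an existing "postgres" StatefulSet to nodes labelled disktype=nvme
    from kubernetes import client, config

    config.load_kube_config()
    apps = client.AppsV1Api()
    patch = {"spec": {"template": {"spec": {"nodeSelector": {"disktype": "nvme"}}}}}
    apps.patch_namespaced_stateful_set("postgres", "default", patch)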


Very nicely written article. I’m also running a k8s cluster, but on bare metal with qemu-kvm VMs for the base load. Wonder why you would choose VMs instead of bare metal if you're looking for cost optimisation (additional overhead maybe?). Could you share more about this, or did I miss it?

Thank you! The cloud servers are sufficiently cheap for us that we could afford the extra flexibility we get from them. Hetzner can move around VMs without us noticing but in contrast they are rebooting a number of metal machines for maintenance now and for the last little while, which would have been disruptive especially during the migration. I might have another look next year at metal but I’m happy with the cloud VMs currently.

Note, they usually do not reboot or touch your servers. But yes, the current maintenance of their metal routers (rare, like once every 2 years) requires you to juggle a bit with different machines in different datacenters.

Anybody running k3s/k8s on Hetzner using cax servers? How's that working?

I went Hetzner bare metal, set up a Proxmox cluster over it and then have Kubernetes on top. Gives me a lot of flexibility, I find.

Did you try Cloud66 for deploy?

Do you know that they are cutting their free tier bandwidth? Did not read too much into it, but heard a few friends were worried about it.

End of the day, they are a business!


What about cluster autoscaling?

I didn’t touch on that in the article, but essentially it’s a one line change to add a worker node (or nodes) to the cluster, then it’s automatically enrolled.

We don’t have such bursty requirements fortunately so I have not needed to automate this.


Great write up Bill!

https://github.com/puppetlabs/puppetlabs-kubernetes

What do the fine people of HN think about the size/scope/amount of technology of this repo?

It is referenced in the article here: https://github.com/puppetlabs/puppetlabs-kubernetes/compare/...


Lovely website.

this is good

well, running on bare metal would be even better




