I've 'rebuilt' our Kubernetes cluster almost 3 times since I started, applying lessons learned from running the last iteration for a few months. It's just like anything else in software development: when you start, your tech debt is high, mostly due to inexperience. Force yourself to reduce that debt whenever you can.
As an example: the first version had a bunch of N1s (1 vCPU machines) with hand-written YAML files and no autoscaling. I had to migrate our database and had a headache updating the DB connection string on each deployment. Then I discovered external services, which let me define the DB hostname once (https://cloud.google.com/blog/products/gcp/kubernetes-best-p...).
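For anyone curious, a minimal sketch of that idea using an ExternalName Service (one of the approaches in that post; the names and hostname here are placeholders, not my actual setup):

```yaml
# Define the DB hostname once; deployments resolve "postgres" inside the
# cluster, and only this object changes when the database moves.
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  type: ExternalName
  externalName: db.example.internal   # placeholder hostname
```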
It's just to say that with Kubernetes, I think it's impossible to approach it thinking you'll get it right the first time. Just dedicate more time to monitoring at the beginning so you don't do anything 'too stupid', and take the time to bake what you learn into your cluster.
On most consultancy projects, there is only one shot at every user story.
In my experience on corporate projects, you don't get to hire another contractor; rather, management two levels above you gets a nice lunch with the management level from the other side, a few things get sorted out amid a couple of polite disagreements, and you need to be happy with whatever was delivered or go look for something else.
I would guess he more or less halved the price, if you suppose that at any minute 70% of the pods are idle but still consuming > 0 resources, and that the price per vCPU didn't change, as he said. Then you also have the overhead of running the Kubernetes system processes, which is cut to roughly 1/100th, lowering the price further.
Not real numbers, but gives you an idea.
Let's not forget that you'd also have to pay at least 3 people (8 hours per shift, not accounting for weekends) to ensure 24/7 service availability across those 15 Hetzner nodes.
AWS/GCP/Azure et al make it so you generally don't need to worry about the infrastructure part of the problem. Everything else still applies though.
This is true if you run your own datacenter, but dedicated server providers typically monitor their servers and immediately intervene if they detect any connectivity/network issue (which is outside our control even if we use cloud providers like AWS).
For an individual server failure (disk issue, etc.), how you handle the failure condition on dedicated servers is not much different from handling failure on cloud servers (remove the bad server from the cluster, re-image a new server and join it to the cluster). The main difference is that provisioning a new physical server is not instant, so you'll need to plan ahead and either have some spares or slightly over-provision your cluster (so you can take down a few nodes without degrading your service). You can do this automatically or manually; it's not much different than using cloud servers.
Using dedicated servers is not as scary as people make it out to be.
For memory resources, GKE reserves the following:
255 MiB of memory for machines with less than 1 GB of memory
25% of the first 4GB of memory
20% of the next 4GB of memory (up to 8GB)
10% of the next 8GB of memory (up to 16GB)
6% of the next 112GB of memory (up to 128GB)
2% of any memory above 128GB
For CPU resources, GKE reserves the following:
6% of the first core
1% of the next core (up to 2 cores)
0.5% of the next 2 cores (up to 4 cores)
0.25% of any cores above 4 cores
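To make that concrete, here's a rough back-of-the-envelope for a 1 vCPU node with ~3.75 GB of memory (n1-standard-1 figures assumed; everything below is approximate):

```yaml
# memory reserved: 25% of ~3.75 GB ≈ 0.94 GB
# cpu reserved:    6% of the first core = 60m
# (GKE additionally holds back a small eviction threshold, not listed above)
allocatable:
  cpu: 940m      # of the advertised 1000m
  memory: 2.7Gi  # approximate, of the advertised ~3.75 GB
```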
It's important to grok the underlying systems here, imo. CPU requests map to the cpu.shares property of the container's cpu,cpuacct cgroup. A key thing about cpu.shares is that it guarantees a minimum share of CPU time (expressed in units of 1/1024 of a core), but doesn't prevent a process from consuming more if cycles are available. The CPU limit uses the CFS bandwidth controller, which specifies a period (100,000µs by default) and a quota of CPU time within that period which the process cannot, afaik, exceed. So by setting the request to 20m and the limit to 200m, the author left a lot of room for pods that look like they fit fine under normal operating conditions to spike up and consume all the CPU resources on the machine. K8s is supposed to reserve resources for the kubelet and other components on a per-node basis, but I'm not surprised it's possible to place these components under pressure using settings like these on a 1-core node.
To be clear I don't know exactly how the k8s scheduler weighs CPU requests vs. limits in fitting pods to nodes. It's something I've wanted to dig further into. I do know basically how the two underlying control systems function. The cpushares system (K8S CPU requests) cannot prevent a process from taking more shares. The CPU bandwidth control system (K8S CPU limits) will throttle a process at the upper limit, but processes will not be evicted by the kubelet for hitting this limit. So if you have a pod with requests set to 20m and limits set to 200m, those pods are able to take 200m. If the scheduler is using limits to fit pods then maybe you can get 3-4 of these on a 1 core node and leave enough CPU for the other system components. If it's using some weighted combination of limits and requests then it might place more than 3-4 pods on that node, each of which is then permitted to take up to 200m. Just a theory, and I am sure there are some folks here who know exactly how it works. Maybe we'll hear from someone.
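For illustration, here is the kind of spec being discussed, with comments on how (as I understand it) those values map onto the cgroup knobs; the pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bursty-worker           # placeholder
spec:
  containers:
    - name: worker
      image: example/worker     # placeholder
      resources:
        requests:
          cpu: 20m    # -> cpu.shares ≈ 20 (20/1000 * 1024): a floor, not a ceiling
        limits:
          cpu: 200m   # -> CFS quota of 20,000µs per 100,000µs period: a hard throttle
```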
Constantly adding and removing hosts can also negatively affect e.g. the network provider, depending on which one you use. In my experience, Weave worked significantly worse than something "simpler" like flannel when combined with frequent autoscaling.
What would you do if you were serving tons of data but didn't have to compute much?
I have been playing with an autoscaling k3s cluster (https://github.com/rcarmo/azure-k3s-cluster) to figure out the right way to scale up compute nodes depending on pod requirements. Even though the Python autoscaler I am noodling with is just a toy, I'm coming to the conclusion that all the work involved in using the Kubernetes APIs to figure out pod requirements and decide whether to spawn a new VM is barely more efficient than just using 25% "padding" in CPU metrics to trigger autoscaling with standard Azure automation, at least for batch workloads (I run Blender jobs on my toy cluster to keep things simple).
YMMV, but it's fun to reminisce that oversubscription was _the_ way we dealt with running multiple services on VMware stacks, since it was very rare to have everything need all the RAM or IO at once.
* 96x single-core CPUs with no multithreading
* 1x 96-core CPU with multithreading, but running all cores at full power all the time
* 1x 96-core CPU that can turn off sets of 16 cores at a time when they're not in use.
But mostly if it dies...
Actually, X * Y is massively higher than 1x (X * Y)
This isn't unique to k8s either; all sorts of queue-worker-oriented deployments have this issue.
The wasted idle capacity can be mitigated by having separate "burst" capacity (idle workers sitting around waiting for work) and "post-burst" capacity (a bunch of new workers that get created in response to a detected backlog on a queue), but orchestrating that is complicated: how much of a backlog merits the need for post-burst workers to begin starting? Instead, can the normal burst workers pay it down fast enough that no new instances/pods need to be scheduled? Do your post-burst workers always start up at the same rate (hint: as your application/dependency stack grows, they start slower and slower)? How do you define SLOs/SLAs for a service with a two-tiered scale-out behavior like this (some folks are content with just a max time-to-processed SLA based on how long it takes a post-burst worker to come online and consume the oldest message on the queue at a given point in time, other workloads have more demanding requirements for the worst/average cases)?
In many cases, just keeping the peak-scale amount of idle workers sitting around waiting for queue bursts is cheaper (from an operational/engineering time perspective) than building something that satisfactorily answers those questions.
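For what it's worth, a rough sketch of the two-tier idea in Kubernetes terms: a small always-on burst pool (a plain Deployment with fixed replicas, not shown) plus a post-burst pool scaled on queue backlog. The deployment name, metric name, and thresholds are all made up, and it assumes a metrics adapter already exposes the queue depth as an external metric:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: post-burst-workers
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: post-burst-workers
  minReplicas: 1
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: queue_backlog          # hypothetical metric exposed by an adapter
        target:
          type: AverageValue
          averageValue: "100"          # add pods once backlog per pod exceeds ~100
```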
The Kubernetes cluster was configured with horizontal pod autoscaling and cluster autoscaling, and to avoid problems the CPU limits were set to 0.5 CPU. The end result was Kubernetes creating a myriad of nodes running 70% idle to accommodate the cluster's autoscaling policy, because a 1 vCPU node does not have much headroom to accommodate peaks. For example, if you have 3 or 4 pods with a 500m CPU limit running on a single-vCPU node and two happen to peak at the same time, resource limits will be hit and cluster scaling will kick in to create yet another node just to meet that demand. In practice this means that for each and every 1 vCPU node to accommodate the peak demand of a single pod without triggering cluster autoscaling, it needs to keep roughly 50% of its capacity idle.
This problem is mitigated by replacing 1 vCPU nodes with higher vCPU counts (the author switched to nodes with 16 vCPUs), because they have enough resources to scale deployments up without having to launch new nodes.
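To sketch that arithmetic (the allocatable figure is assumed, per the GKE reservations quoted elsewhere in the thread):

```yaml
# On a 1 vCPU node (~940m allocatable), two pods bursting toward this limit
# at the same time already exceed the node, and per the reasoning above the
# cluster autoscaler ends up adding another node; a 16 vCPU node absorbs
# the same burst without scaling.
resources:
  limits:
    cpu: 500m   # the 0.5 CPU limit described in the post
```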
Inspect all the stuff and generated configs! That's how I started.
I’ve got a project using this setup, but it’s a fairly common one - e.g. Express with Node clustering, Puma on Rails, etc. On Kubernetes you obviously just forgo the clustering ability and let k8s handle the concurrency and routing for you.
So in this instance, I’m struggling to see why I wouldn’t request a value of 1 vCPU for each process. My thinking is that my program is already single threaded, and asking the Kubernetes CPU scheduler to spread resources between multiple single-threaded processes is pure overhead. At that point I should allow each process to run to its full capacity. Is that correct?
This, I feel, gets a lot more complex now that my chosen language, DB drivers, and web framework are just starting to support multithreading. That’s a can of worms I can’t begin to figure out how to allocate for - 2 vCPUs each? Does anyone know?
Depending on how much (and how heavy) async work you're doing it might be reasonable to let a node process use multiple cores.
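If it helps, the shape I'd sketch for a single-threaded Node/Puma-style worker (values are illustrative, not a recommendation):

```yaml
resources:
  requests:
    cpu: "1"         # one full core per single-threaded worker process
    memory: 512Mi    # illustrative
  limits:
    memory: 512Mi    # memory request = limit
    # no cpu limit, so GC / async helper threads can use spare cycles
    # instead of being throttled
```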
As for the 96 CPU instance, that really isn't good either unless your pods are all taking 1+ CPUs each. Even then, I'd rather run 6 x 16 CPU. There's a pod limit of ~110 per node, not to mention the loss of redundancy. I find 16-32 CPU nodes the best balance.
Agreed, this is the other side of the node sizing question. At the low end you have to consider what your most resource hungry workloads need (and we use nodepools to partition our particularly edge-casey things), and then at the upper end you don't want the failure of one node to take out half your stuff.
He has many idle pods with a bursty workload.
The author says they need to reserve a lot of cpu or containers fail to create. Why is this? Wouldn't memory be more likely a cause for the failure? How does lack of CPU cause a failure?
Later the author notes that a many-core machine is good for his workload because "pods can be more tightly packed." How does that follow? A pod using more than its reserved resources will bump up against the other pods on that physical machine whether you've virtualized it as a standard-1 or a standard-16. Is there a cost saving because the unreserved RAM is over-provisioned? Wouldn't that overbooking be dangerous if you had uniform load across all the pods in a standard-16?
Said another way, why is resource contention with yourself in a standard-16 better or cheaper than with others in the standard-1 pool?
My understanding was that the choice among vCPU options is simply a trade-off between pricing granularity and the per-node CPU overhead of k8s.
- needs granular scaling
- devops expertise is not core to the business
- save developer time
"K8s is so hard! All the tutorials are too basic! I had to redo my cluster multiple times!"
This is one of the author’s fatal assumptions. The best practice as I understand it is to set CPU requests to around 80% of peak and limits to 120% of peak before deploying to prod.
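In manifest terms that rule of thumb looks something like this, assuming you measured a peak of roughly 250m for the pod (the numbers are hypothetical):

```yaml
resources:
  requests:
    cpu: 200m   # ~80% of the measured ~250m peak
  limits:
    cpu: 300m   # ~120% of the measured peak
```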
They set themselves up for disaster with this architecture where they have many idle pods polling for resource availability. This resource monitoring should have been delegated to a single pod.
Also it’s really unclear what specific strategy led to extra costs of 1000s of dollars...
This is a capacity planning or "right-sizing" problem. In prod you just don't go and completely flip your layout (100 1 vCPU servers vs 1 100 vCPU server or whatever), even more so in a stack you are not yet an expert on; you change a bit and then measure. Actually, you try to figure this out first in a dev environment if possible.
this community is ripe for implosion - what a joke
Devs like to focus on costs related to lost time. This is a pretty common trend, and I'm not really sure why you think there's anything pathological about k8s specifically. Maybe there's something pathological about infra folks in general and dev folks.
Not sure of the unit, but it generally means 83% of a single CPU's processing power. My understanding is that this is not strictly enforced; it's just a tool for Kubernetes to schedule pods and to make sure no set of pods adds up to more than 100% of any CPU.
Never set cpu limits, always set mem request=limit unless you really have good reason not to.
Edit: also, please advise how I can tune the CFS period on GKE.
Kubelet has its own cgroup in hierarchy above pods and should set its cpu shares there as well (most cloud providers already do this).
The only two reasons to set CPU limits that I’m aware of are to force pods onto dedicated cores (which doesn’t work on GKE as of 1.13) and if you run some sort of pay-per-use compute platform for external users.
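Concretely, the pattern being described would look something like this (values are examples and obviously workload-dependent):

```yaml
resources:
  requests:
    cpu: 250m        # scheduling hint / cpu.shares floor
    memory: 512Mi
  limits:
    memory: 512Mi    # memory request = limit to avoid surprise evictions
    # no cpu limit: spare cycles get used instead of being throttled away
```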
Why in the world do I need an account to read a glorified blog? It’s text data, I should be able to consume it with curl if I’m so inclined.
Which is a completely different approach than building one's own business. These platforms can still help drive traffic to a decent website though.
I use it for professional stuff that needs to be linked to but which isn’t on an official blog property of some sort.
I do keep a personal blog but sometimes cross-post to Medium.
This is the add-on I use: https://addons.mozilla.org/en-US/firefox/addon/disable-javas...
I hope there will be some kind of option like ‘open this link in reader mode’.
There is a reason why these DevOps certifications exist in the first place, and why it is a huge risk for a company to spend lots of time and money on training to learn a tool as complex as Kubernetes (unless they are preparing for a certification). Perhaps it would be better to hire a consultant skilled in the field rather than using it blind and making these mistakes later.
When mistakes like this occur and go unnoticed for a long time, the costs rack up, creating unnecessary spend of as much as $10k/month, which, depending on the company's budget, can be very expensive and can make or break a company.
Unless you know what you are doing, don't touch tools you don't understand.
I'm sorry but this is an absurd comment, and a fairly miserable reply to make to an engineer who is sharing a mistake in hopes of helping others to learn. We all touch tools we don't understand, and hopefully we learn something and come to understand them better. FWIW I have been running prod. workloads on GKE since 2016, and I think any engineer who is familiar with managing and deploying to cloud infrastructure can readily learn to run a GKE cluster.
> Unless you know what you are doing, don't touch tools you don't understand.
I think it might be better to say "don't take risks you can't afford". You should experiment with new tools to see if they're better than what you're using now. Just don't deploy systems to prod before you really understand them.
That's laughable but I will play:
I will pay anyone with a devops cert $0.01 for the right to 10% of my savings over a one-year period. If I end up paying more for the service after hiring such a person, that person will pay me 110% of the excess that I paid for the service as a result of hiring them. If a devops cert is actually any good, then this would be a license to print money for anyone with one.
OP's problem is that his organization did not engage in any sort of risk management, which is why they had:
a) K8s as something magical that makes things work
b) Someone who did not know how K8s works being allowed to re-engineer K8s
c) No alert on a change of the usage data exported by Google
P.S. If you are on a cloud, drop everything and implement (c). It will save your shirt dozens if not hundreds of times a year.
An engineer who does it right isn't going to save you much money over your best case scenario - but they're going to keep you from losing millions in the worst case scenarios.
Those are the tales told by consultants and engineers who like to play with toys: it's a typical case of premature optimization. The odds of you having enough traffic that you need to scale are slim to none.
If you do need to scale, the odds are your apps are over-engineered on corner cases and under-engineered in the main path: if your ORM takes 300 ms to initialize on every request without fetching any data from the database "scaling" is the last thing you should be worried about.
> An engineer who does it right isn't going to save you much money over your best case scenario - but they're going to keep you from losing millions in the worst case scenarios.
You will go out of business before those savings are going to matter.
I'll offer it for free so you can pay me immediately. I got my DevOp certificate in 1999 from BrainBench as a web master.
That approach makes sense for businesses that need to be risk-averse, e.g. self-driving cars.
But in a "move fast/break things" operation, wasting some money might be preferable to taking the time to do things right the first time around.
Yes there is a reason, but you're not going to like the answer if I tell you what it is.
But how do I know I’m an idiot?