A Kubernetes/GKE mistake that cost me thousands of dollars (medium.com)
155 points by dankohn1 33 days ago | 111 comments

The problem of learning by doing is that it's extremely hard to find good tutorials designed for production. Most of what I find these days is 'hello world'y and then you need some tool like Sentry to catch edge cases that don't get caught in your limited testing.

I've 'rebuilt' our Kubernetes cluster almost 3 times since I started, applying lessons learned from running the previous iteration for a few months. It's just like anything else in software development: when you start, your tech debt is high, mostly due to inexperience. Force yourself to reduce that debt whenever you can.

As an example: the first version had a bunch of N1's (1 VCPU machines) with hand written yaml files, no auto scaling. I had to migrate our database and had a headache updating the DB connection string on each deployment. Then I discovered external services, which let me define the DB hostname once. (https://cloud.google.com/blog/products/gcp/kubernetes-best-p...).
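For anyone curious, the pattern is a small Service with no selector; a minimal sketch (the names and hostname here are made up):

```yaml
# Hypothetical sketch of an ExternalName Service: app pods connect to
# "my-database" and cluster DNS answers with a CNAME to the real host,
# so only this one manifest changes when the database moves.
apiVersion: v1
kind: Service
metadata:
  name: my-database
spec:
  type: ExternalName
  externalName: db.example.internal   # made-up external hostname
```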

It's all to say that with Kubernetes, I think it's impossible to approach it expecting to get it right the first time. Just dedicate more time to monitoring at the beginning so you don't do anything 'too stupid', and take the time to bake what you learn into your cluster.

These getting-started tutorials are also very annoying for IT departments that have developers pushing for containerization and not understanding that it isn't just a few commands when you are responsible for operating containers in production. The developers aren't the ones who have to be on call in the middle of the night and make sure it's all running.

Yes, this developers vs operations divide is originally the raison d'être for the "devops" (or SRE in google-speak) concept.

Depends on your org. You should try applying SRE principles: developers get the pager until their application meets defined criteria that you both agree on, with management buy-in.

Sometimes, though, an infrastructure team will want to manage the service internally, even if a managed service (e.g. GKE) is available and compliant with the workloads, and would reduce overhead.

Reduce overhead = eliminate their job

Which is nice when one has the luxury of being allowed to redo stuff.

On most consultancy projects, there is only one shot at every user story.

The point of getting consultants is that they bring a higher level of skill and experience to the table. I sympathize with the consultants, but that's for them and their employer to sort out. Otherwise I'll hire a contractor at another rate.

There is also the other side, getting lemons for oranges, because no one wants to be left behind doing pitches, and naturally everyone is an expert with x years of experience delivering solid applications in production with multiple nines.

From my experience on corporate projects, you don't get to hire another contractor. Rather, management two levels above you has a nice lunch with the management from the other side, a few things get sorted out amid a couple of polite disagreements, and you need to be happy with whatever was delivered or go look for something else.

I'm just getting started with Kubernetes after years as a happy Heroku customer, and it's nice to hear others with similar experiences. Not sure if you've seen it, but I read Kubernetes: Up and Running from a recommendation in another thread and it was very useful.

It would be helpful if the author specified what the price difference actually was here at the end of the day - their initial cost was $3,500/mo but they don't mention how much they pay now after changing instance types.

> In case of this particular deployment, these pods will change between idle/ active state multiple times every minute and spend 70%+ in idle state.

I would guess he more or less halved the price if you suppose at any minute 70% of the pods are idle, but they still consume > 0 resources, and the price per vcpu didn't change, like he said. Then you also have the overhead of running the Kubernetes processes which is cut to about 1/100th lowering the price further.

Not real numbers, but gives you an idea.

I wonder what the cost would be without the k8s setup and no containers at all, just pure compute nodes. I'd bet my last socks it would be at most half the price.

With a $3500/mo budget you could get 15 AMD servers on Hetzner, each with 32 cores and 128 GB of RAM. You won't get autoscaling and other nice stuff available from the cloud providers, but nothing beats dedicated servers in terms of price vs. performance ratio.

Commoditized cloud is basically paying a company to run your ops for you.

Let's not forget that you'd also have to pay at least 3 (8 hours per shift, not accounting for weekends) people to ensure 24/7 service availability through those 15 Hetzner nodes.

AWS/GCP/Azure et al make it so you generally don't need to worry about the infrastructure part of the problem. Everything else still applies though.

> Let's not forget that you'd also have to pay at least 3 (8 hours per shift, not accounting for weekends) people to ensure 24/7 service availability through those 15 Hetzner nodes.

This is true if you run your own datacenter, but dedicated server providers typically monitor their servers and immediately intervene if they detect any connectivity/network issue (which is outside our control even if we use cloud providers like AWS).

For individual server failures (disk issues, etc.), handling failure conditions on dedicated servers is not much different from handling failures on cloud servers (remove the bad server from the cluster, re-image a new server and join it to the cluster). The main difference is that provisioning a new physical server is not instant, so you'll need to plan ahead and either have some spares or slightly over-provision your cluster (so you can take down a few nodes without degrading your service). You can do this automatically or manually, not much different from using cloud servers.

Using dedicated servers is not as scary as what people thought it would be.

Except for security. You still have to worry about that part of the infrastructure.

I meant it in cloud, AWS/GCP, because devops, lol. With autoscaling.

Yeah, basically this.

I would advise reading this section of the GKE docs. It explains the marginal gains in allocatable memory and CPU from running larger nodes: https://cloud.google.com/kubernetes-engine/docs/concepts/clu...

For memory resources, GKE reserves the following:

255 MiB of memory for machines with less than 1 GB of memory
25% of the first 4 GB of memory
20% of the next 4 GB of memory (up to 8 GB)
10% of the next 8 GB of memory (up to 16 GB)
6% of the next 112 GB of memory (up to 128 GB)
2% of any memory above 128 GB

For CPU resources, GKE reserves the following:

6% of the first core
1% of the next core (up to 2 cores)
0.5% of the next 2 cores (up to 4 cores)
0.25% of any cores above 4 cores
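To make the marginal gains concrete, here's a quick sketch (my own arithmetic from the tiers quoted above, not an official calculator; it ignores eviction thresholds and OS overhead):

```python
# Rough model of GKE's reserved-resource tiers quoted above.

def reserved_cpu_millicores(cores):
    """CPU reserved by GKE for system components, in millicores."""
    reserved = 0.06 * min(cores, 1) * 1000            # 6% of first core
    if cores > 1:
        reserved += 0.01 * min(cores - 1, 1) * 1000   # 1% of second core
    if cores > 2:
        reserved += 0.005 * min(cores - 2, 2) * 1000  # 0.5% of cores 3-4
    if cores > 4:
        reserved += 0.0025 * (cores - 4) * 1000       # 0.25% above 4
    return reserved

def reserved_memory_gb(gb):
    """Memory reserved by GKE, in GB (machines with >= 1 GB)."""
    tiers = [(4, 0.25), (4, 0.20), (8, 0.10), (112, 0.06)]
    reserved, remaining = 0.0, gb
    for size, frac in tiers:
        take = min(remaining, size)
        reserved += take * frac
        remaining -= take
    return reserved + max(remaining, 0) * 0.02        # 2% above 128 GB

# 96 single-core nodes reserve 96 * 60m = 5760m of CPU in total;
# one 96-core node reserves only 60 + 10 + 10 + 92 * 2.5 = 310m.
print(reserved_cpu_millicores(1))    # ~60m on every 1-vCPU node
print(reserved_cpu_millicores(96))   # only ~310m on a 96-vCPU node
print(reserved_memory_gb(60))        # ~5.24 GB on a 60 GB machine
```

So the Kubernetes "tax" per unit of compute shrinks dramatically as nodes get bigger, which is the marginal gain the docs describe.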

I think the main issue this author ran into was caused by setting the CPU requests so far below the limits. I get that he was trying to communicate the spiky nature of the workload to the control plane, but I think it would have been better to reserve CPU to cover the spikes by setting the requests higher, especially on a 1 core node.

It's important to grok the underlying systems here, imo. CPU requests map to the cpushares property of the container's cpu,cpuacct cgroup. A key thing about cpushares is that it guarantees a minimum number of 1/1024 shares of a core, but doesn't prevent a process from consuming more if they're available. The CPU limit uses the CPU bandwidth control scheduler, which specifies a number of slices per second (default 100k) and a share of those slices which the process cannot, afaik, exceed. So by setting the request to 20m and the limit to 200m the author left a lot of room for pods that look like they fit fine under normal operating conditions to spike up and consume all the CPU resources on the machine. K8s is supposed to reserve resources for the kubelet and other components on a per node basis but I'm not surprised it's possible to place these components under pressure using settings like these on a 1 core node.
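In spec terms, the combination being described looks something like this (the 20m/200m values come from the discussion of the article; the rest is a sketch):

```yaml
# requests -> cgroup cpu.shares: a guaranteed minimum under contention,
# but not a cap. limits -> CFS bandwidth quota: a hard ceiling.
# With this gap, every pod the scheduler packed in at 20m can still
# burst to 10x that and starve the node.
resources:
  requests:
    cpu: 20m
  limits:
    cpu: 200m
```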

Another thing to keep in mind is that CPU rate is a compressible resource, unlike memory or disk space. It’s fine to briefly oversubscribe CPU. No terrible thing will happen.

If I am understanding the article right, this would have led to fewer pods per node and a higher overall cost, wouldn’t it have? The author claims that the extra headroom allowed per pod for spiky CPU usage helped absorb those spikes.

> If I am understanding the article right, this would have led to fewer pods per node and a higher overall cost, wouldn’t it have?

To be clear I don't know exactly how the k8s scheduler weighs CPU requests vs. limits in fitting pods to nodes. It's something I've wanted to dig further into. I do know basically how the two underlying control systems function. The cpushares system (K8S CPU requests) cannot prevent a process from taking more shares. The CPU bandwidth control system (K8S CPU limits) will throttle a process at the upper limit, but processes will not be evicted by the kubelet for hitting this limit. So if you have a pod with requests set to 20m and limits set to 200m, those pods are able to take 200m. If the scheduler is using limits to fit pods then maybe you can get 3-4 of these on a 1 core node and leave enough CPU for the other system components. If it's using some weighted combination of limits and requests then it might place more than 3-4 pods on that node, each of which is then permitted to take up to 200m. Just a theory, and I am sure there are some folks here who know exactly how it works. Maybe we'll hear from someone.

It's missing something crucial: 100 nodes vs. 1 means you have the overhead of running Kubernetes on all 100 of those nodes, which is actually high (kubelet, etc.). On one big node you have roughly one core used by Kubernetes, and the rest is available for your app.

Agreed. Although kubelet itself is not too terrible, all the other stuff you need to run alongside it (the "etc" part of your post) is what costs you. Network provider, per host monitoring, reserved resources per host, just to name a few.

Constantly adding and removing hosts can also negatively affect e.g. network provider, depending on which you use. In my experience, Weave worked significantly worse than something "simpler" like flannel when combined with frequent auto scaling.

So let's say you were building a bandwidth-heavy service: on the best providers, each 1 GB/1 vCPU node is limited to around a 1 Gbps port. If the goal is to maximize total data transferred, my thought would be that it's better to have as many nodes as possible. Sending data at the rate cap shouldn't be that big a hit on the CPU with big enough chunks, even considering TLS costs. But I'm not sure about this. It always seems that people are trying to maximize their compute capabilities and not their throughput when they talk about Kubernetes, but I've never really had that focus.

What would you do if you were serving tons of data but didn't have to compute much?

I've seen this kind of thing happen a number of times, and it's good to remind ourselves that oversubscribing resources is still a good way to tackle the "padding" related to scaling.

I have been playing with an autoscaling k3s cluster (https://github.com/rcarmo/azure-k3s-cluster) in order to figure out the right way to scale up compute nodes depending on pod requirements, and even though the Python autoscaler I am noodling with is just a toy, I'm coming to the conclusion that all the work involved in using the Kubernetes APIs to figure out pod requirements and deciding whether to spawn a new VM based on that is barely more efficient than just using 25% "padding" in CPU metrics to trigger autoscaling with standard Azure automation, at least for batch workloads (I run Blender jobs on my toy cluster to keep things simple).

YMMV, but it's fun to reminisce that oversubscription was _the_ way we dealt with running multiple services on VMware stacks, since it was very rare to have everything need all the RAM or IO at once.

That's less of a K8s issue and more of a general multiprocessing issue. Would you rather have:

* 96x single-core CPUs with no multithreading

* 1x 96-core CPU with multithreading, but running all cores at full power all the time

* 1x 96-core CPU that can turn off sets of 16 cores at a time when they're not in use.

It depends. What if that beefy machine dies, needs a reboot, or the kernel can't scale up to serve all these containers?

But mostly if it dies...

I read the whole thing and couldn't tell what the "mistake that cost thousands" was.

In a cloud, X * Y != 1x (X * Y)

Actually, X * Y is massively higher than 1x (X * Y)

Here's a whole collection of Kubernetes bloopers: https://github.com/hjacobs/kubernetes-failure-stories. I for one am glad people are sharing!

This is all very interesting, but one thing that occurs to me is: why are there so many idle pods? Is there any way to fold the work that is currently being done in multiple different pods into one pod? Perhaps via multiple threads, or even just plain single-threaded concurrency? Unless there is some special tenancy requirement, that might be the most efficient way to deal with this situation.

The author alluded to this by referencing task queues. It's not uncommon to have task queue workers listening on "bursty" queues which are usually quiet, with the constraint that when work arrives in the queue it should be picked up really quickly (i.e. no waiting for program, pod/container, or hardware startup). If you have multiple task queues and sets of workers for different kinds of work (I don't know if the author does, but it is definitely a common pattern), then you can easily end up with a decent number of idle pods sitting around.

This isn't unique to k8s either; all sorts of queue-worker-oriented deployments have this issue.

The wasted idle capacity can be mitigated by having separate "burst" capacity (idle workers sitting around waiting for work) and "post-burst" capacity (a bunch of new workers that get created in response to a detected backlog on a queue), but orchestrating that is complicated: how much of a backlog merits the need for post-burst workers to begin starting? Instead, can the normal burst workers pay it down fast enough that no new instances/pods need to be scheduled? Do your post-burst workers always start up at the same rate (hint: as your application/dependency stack grows, they start slower and slower)? How do you define SLOs/SLAs for a service with a two-tiered scale-out behavior like this (some folks are content with just a max time-to-processed SLA based on how long it takes a post-burst worker to come online and consume the oldest message on the queue at a given point in time, other workloads have more demanding requirements for the worst/average cases)?

In many cases, just keeping the peak-scale amount of idle workers sitting around waiting for queue bursts is cheaper (from an operational/engineering time perspective) than building something that satisfactorily answers those questions.

The idle pods are so that your cloud provider maxes out its profit.

It is difficult to determine what exactly the 'mistake' was in this post.

They were using high-resource nodes and made the mistake of shifting to low-resource nodes for fine-grained control over the total amount of resources provisioned -- i.e. a single 96-unit node vs 96 1-unit nodes. That was a problem because the low-resource nodes were much less efficient at processing the actual load; a portion of the resources for each node were allocated to k8s system processes, but also any idle pod consumed a higher proportion of the node's resources to do nothing. As a result the autoscaling functions provisioned even more resources than they were using with the high-resource nodes. The solution was to use medium-resource nodes that offered somewhat granular control but made efficient use of the available resources.

My take was that the mistake was using a myriad of low resource nodes to run deployments that required an increasing amount of computational resources to run without problems. This led to launching more nodes to accommodate peaks which then just stayed idling.

The Kubernetes cluster was configured with horizontal pod autoscaling and cluster autoscaling, and to avoid problems the CPU limits were set to 0.5 CPU. The end result was Kubernetes creating a myriad of nodes running 70% idle to accommodate the cluster's autoscaling policy, because a 1vCPU node does not have much headroom to accommodate peaks. For example, if you have 3 or 4 pods with a 500m CPU limit running on a single-vCPU node and it so happens that two peak at the same time, resource limits will be hit and cluster scaling will kick in to create yet another node just to meet that demand. In practice this means that for each and every 1vCPU node to accommodate the peak demand of a single pod without kicking in cluster autoscaling, it needs to run at least 50% idle.

This problem is mitigated by replacing 1vCPU nodes with higher vCPU count (the author switched to nodes with 16 vCPUs) because they have enough resources to scale up deployments without having to launch new nodes.

I didn't read the post (I don't do Medium, and most of the time it's useless stuff) but I'm not surprised. GKE/Kubernetes has a ton of dark stuff going on. There are dozens of parameters, and if you don't set them they fall back to defaults, which can... lead to strange situations :)

Inspect all the stuff and generated configs! That's how I started.

One thing that I’ve been unable to wrap my head around is how to effectively calculate the right CPU share values for single threaded web-servers.

I’ve got a project using this setup, but it’s fairly common one l- e.g. Express with node clustering, Puma on Rails etc. On Kubernetes you obviously just forgo the clustering ability and let k8s handle the concurrency and routing for you.

So in this instance, I’m struggling to see why I wouldn’t request a value of 1vCPU for each process. My thinking is that my program is already single threaded, and asking the kubernetes CPU scheduler to spread resource between multiple single threaded processes is pure overhead. At that point I should allow each process to run to its full capacity. Is that correct?

This, I feel, gets a lot more complex now that my chosen language, DB drivers, and web framework are just starting to support multithreading. That's a can of worms I can't begin to figure out how to allocate for - 2 vCPUs each? Does anyone know?

JavaScript, the language, is single threaded, but node and V8 are not.

Depending on how much (and how heavy) async work you're doing it might be reasonable to let a node process use multiple cores.

Good point - although it’s worth clarifying that I’m on Crystal for this, meaning I’m definitely single threaded in this instance. Would that be as simple as a case of 1vCPU per pod?

At $job we default to 1vCPU per pod with the option to ask for more if that makes sense (e.g. you use a lot of shared heap and can meaningfully multithread).

Thanks, this was almost exactly the brief validation I was looking for.

K8s nodes are going to work much better with more CPUs (to a certain extent). As the post said, when you have idle and bursting pods, you need headroom. If you have single-CPU nodes, your bursting pods are going to oversubscribe the node more often, as the pool size is smaller. If you had 3 pods per CPU and on average one of the 3 bursts during any time period, there's a chance that 2 or 3 burst at once and cause pods to be evicted and moved. But if they were on 16-CPU nodes it averages out more. Also, single-CPU nodes still need to run their network layer, kubelet, etc.

As far as the 96 CPU instance, that really isn't good either unless your pods were all taking 1+ CPUs each. Even then, I'd rather run 6 x 16 CPU. There's a pod limit of ~110 per node, not to mention the loss of redundancy. I find 16-32 CPU nodes the best balance.

> As far as the 96 CPU instance, that really isn't good either unless your pods were all taking 1+ CPUs each. Even then, I'd rather run 6 x 16 CPU. There's a pod limit of ~110 per node, not to mention the loss of redundancy. I find 16-32 CPU nodes the best balance.

Agreed, this is the other side of the node sizing question. At the low end you have to consider what your most resource hungry workloads need (and we use nodepools to partition our particularly edge-casey things), and then at the upper end you don't want the failure of one node to take out half your stuff.

Unless you set cpu limits that is - then it’s going to be shit performance regardless (especially for multithreaded stuff)

I'm really struggling with connecting his conclusion to what we know of his workload. Can someone spell it out for me?

He has many idle pods with a bursty workload.

The author says they need to reserve a lot of cpu or containers fail to create. Why is this? Wouldn't memory be more likely a cause for the failure? How does lack of CPU cause a failure?

Later the author notes that a many-core machine is good for his workload because "pods can be more tightly packed." How does that follow? A pod using more than its reserved resources will bump up against the other pods on that physical machine whether you've virtualized it as a standard-1 or a standard-16. Is there a cost saving because the unreserved RAM is over-provisioned? Wouldn't that overbooking be dangerous if you had uniform load across all the pods in a standard-16?

Said another way, why is resource contention with yourself in a standard-16 better or cheaper than with others in the standard-1 pool?

My understanding of the vCPU options was that it's simply a trade-off between pricing granularity and the per-node CPU overhead of k8s.

It’s a bit hard to unpack as there seem to be multiple unrelated things there but i think the gist is classic problem of smaller nodes causing more resource fragmentation. The solution to increase node size is also classic and easy enough in cloud environment but has its tradeoffs as well (like lower reliability)

Choosing the right size for nodes comes up often enough that I blogged some rough guidelines last year: http://cpitman.github.io/openshift/2018/03/21/sizing-openshi...

This article makes such a strong case for ditching k8s for serverless:

- needs granular scaling

- devops expertise is not core to the business

- save developer time

I thought the same. Even more so after reading the comments here.

"K8s is so hard! All the tutorials are too basic! I had to redo my cluster multiple times!"

> Therefore, the best thing to do is to deploy your program without resource limits, observe how your program behaves during idle/ regular and peak loads, and set requested/ limit resources based on the observed values.

This is one of the author's fatal assumptions. The best practice as I understand it is to set CPU requests to around 80% of peak and limits to 120% of peak before deploying to prod.
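With made-up numbers, if the observed peak were 250m CPU, that rule of thumb would give roughly:

```yaml
# Hypothetical: observed peak CPU usage of 250m.
resources:
  requests:
    cpu: 200m    # ~80% of observed peak
  limits:
    cpu: 300m    # ~120% of observed peak
```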

They set themselves up for disaster with this architecture where they have many idle pods polling for resource availability. This resource monitoring should have been delegated to a single pod.

Also it’s really unclear what specific strategy led to extra costs of 1000s of dollars...

This is more of a procedural mistake than a specific technical (Kubernetes/GKE) mistake, even if the tech stack is the root cause.

This is a capacity planning or "right-sizing" problem. In prod you just don't go and completely flip your layout (100 1vCPU servers vs. 1 100vCPU server or whatever), and even more so in a stack you are not yet expert in; you change a bit and then measure. Actually, you try to figure this out first in a dev environment if possible.

what is the psychosis in the k8s community where they feel the need to talk about losing thousands of dollars? it's a recurring theme with this community that they think somehow makes them look cool - wouldn't that be a clear sign that they should not be using k8s to begin with?

this community is ripe for implosion - what a joke

Most infrastructure folks love to talk about the expensive costs that occur from incidents like fires, the lazy infra engineers who don't do anything, hurricane shut down my data center, etc.

Devs like to focus on costs related to lost time. This is a pretty common trend, and I'm not really sure why you think there's anything uniquely pathological about k8s. Maybe there's something pathological about infra folks and dev folks in general.

Truly learning kubernetes requires running it as a production system which runs the risk of incurring costs accidentally.

learning to develop a good drug habit runs the risk of incurring large costs as well - what's your point?

serious question, what do you use for deploying your services? scp fat bins on bare metal? podman and buildah? Maybe lxc? Ansible, etc? I've seen your posts but all I see are complaints, not remedies for what you seem to think is the scourge of kubernetes and docker.

it might surprise you that most devops/sysadmins don't use k8s/docker and think it's a disease - some of us are brave enough to speak our minds against the brownshirts that want to tear us down

You missed my question, what do you do in order to deploy your services?

Anyone knows what the "CPU(cores)" means exactly (e.g. 83m)? What's that m unit?

> Anyone knows what the "CPU(cores)" means exactly (e.g. 83m)? What's that m unit?

Not sure of the unit, but it generally means 83% of a single CPU's processing power. My understanding is that this is not strictly enforced, it's just a tool for Kubernetes to schedule pods and to make sure no set of pods add up to more than 100m on any CPU.

I actually just found the relevant Kubernetes documentation [0]. It stands for "millicores", so 83m would be 8.3% of a single CPU.

[0] https://kubernetes.io/docs/concepts/configuration/manage-com...

1m is 1/1000 of “cpu”. Quotes are because for requests and limits it actually means different things at container level.

Repeat this mantra every morning:

Never set cpu limits, always set mem request=limit unless you really have good reason not to.
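As one pod-spec reading of that mantra (the numbers are made up):

```yaml
# Memory is incompressible, so request == limit avoids surprise OOM
# kills from overcommit. CPU is compressible, so no cpu limit is set;
# the request alone determines the proportional share under contention.
resources:
  requests:
    cpu: 500m        # made-up value
    memory: 512Mi
  limits:
    memory: 512Mi    # equal to the request; no cpu limit
```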

CPU limits are essential for keeping runaway pods from taking over the whole node. Except in pathological cases, it works very well with a lowered cfs_period_us.

This is wrong - cpu requests already constrain them using cgroup cpu shares. All limits do is fuck up your latency and waste resources on underutilized nodes.

Edit: also please advise how can i tune cfs period on gke

Requests do not constrain anything; they more or less specify the proportion of CPU time allocated when contention occurs. A 50m-request pod can be cut off from CPU completely by a runaway 500m pod, to the point that it starts timing out and failing its readiness probe. Or worse, these effects start to be seen on the kubelet, which almost nobody runs with a low enough GOMAXPROCS. Setting limits keeps the node healthy.

It will constrain it to its proportional share, as you noted, but only during contention (which is a feature not a bug). Thus 50m pod will get 1/10th of cycles of 500m pod which is WAI. In case of fully subscribed node former pod will get exactly 50/(1000 * number of allocatable cores) share so I don’t see how that would cause issues provided that pod can actually survive on such small slice in the first place.

Kubelet has its own cgroup in hierarchy above pods and should set its cpu shares there as well (most cloud providers already do this).

Also I suggest reading original paper for cfs quota, it explains motivation pretty well.

The only two reasons to set cpu limits I’m aware of is to force pods onto dedicated cores (which doesn’t work on gke as of 1.13) and if you run some sort of pay-per-use compute platform for external users.

> Pardon the interruption. We see you’ve read on Medium before - there’s a personalized experience waiting just a few clicks away. Ready to make Medium yours?

Why in the world do I need an account to read a glorified blog? It’s text data, I should be able to consume it with curl if I’m so inclined.

Medium is terrible, I don't quite understand why people allow a company to monetize their blog for medium's gain. But, that's their choice, not mine.

In my experience, people use it for street cred basically. Whether somewhat disingenuously to 'validate' (sic) poor content by mere association with the 'brand' name (cue Forbes, LinkedIn...), or more genuinely when 'legitimate' authors seek to give exposure to their sharing via the 'brand' platform (e.g. "Our <N> years of exp with tech <X> at shop <B>" by lead figure <I>).

Which is a completely different approach than building one's own business. These platforms can still help drive traffic to a decent website though.

It’s a convenient place to publish the odd posting or two. I’ve also found that there’s some (unwarranted) perceived credibility plus it can be a good place to link that’s not mixed in with random personal stuff.

I use it for professional stuff that needs to be linked to but which isn’t on an official blog property of some sort.

I do keep a personal blog but sometimes cross-post to Medium.

Ever since Facebook, YouTube and the other mega platforms, people seem to forget that a decentralized web also means that you should host your own content. I agree with you and find it sad that everything on the internet today is hosted by some of those mega tech companies that dictate their laws and what we can and cannot see.

It's a pleasant writing experience. Don't even have to think about the title... meanwhile for my static site the title torments me because I need to put it there as the file name... of course I could just rename it or whatever but there's cognitive overload here (for me).


Content discovery via Medium itself is extremely poor, so this isn't really a complete answer on its face.

The ability to handle traffic is what was meant.

I don't think it was. Handling blog traffic is not difficult.

The thing is, it doesn't get better when you log in. I deleted my account shortly after creating it, because medium kept nagging me about stuff.

How about disable JS/Cookies for Medium?

That works like a charm and takes care of the most obnoxious aspects of reading something on Medium, but there is a caveat: images seem to load gradually as you scroll, so if you disable JS you might miss some of them.

This is the add-on I use: https://addons.mozilla.org/en-US/firefox/addon/disable-javas...

So that they can make money by tracking you. Curl does not give money to Medium.

Medium works mostly ok with JavaScript off. Some images don't show, but all the text, and no nagging.

Click that little "x" in the upper left corner of the pop-up.

I just instinctively hit reader view on my iPad, and that worked too.


I hope there will be some kind of option like ‘open this link in reader mode’.

What a great idea. I would use that so much. Firefox reader mode is already pretty good, this would be an icing on the cake.

There is an overlay setting in Safari on iOS 13; I think it was on 12 as well but harder to get to (I think you have to tap and hold the reader icon).

I think you can set reader view as default for a domain in safari (going forward)

You can even just press esc!

Sorry to hear this.

> As a disclaimer, I will add that this is a really stupid mistake and shows my lack of experience managing auto-scaling deployments.

There is a reason why these DevOps certifications exist in the first place, and why it is a huge risk for a company to spend lots of time and money on training to learn such a complex tool as Kubernetes (unless they are preparing for a certification). Perhaps it would be better to hire a consultant skilled in the field rather than using it blind and creating these mistakes later.

When mistakes like this occur and go unnoticed for a long time, they rack up unnecessary costs, as much as $10k/month, which depending on the company's budget can be very expensive and can make or break it.

Unless you know what you are doing, don't touch tools you don't understand.

> Unless you know what you are doing, don't touch tools you don't understand.

I'm sorry but this is an absurd comment, and a fairly miserable reply to make to an engineer who is sharing a mistake in hopes of helping others to learn. We all touch tools we don't understand, and hopefully we learn something and come to understand them better. FWIW I have been running prod. workloads on GKE since 2016, and I think any engineer who is familiar with managing and deploying to cloud infrastructure can readily learn to run a GKE cluster.

I've been in this industry for over 15 years. I don't have a single certification. I learnt everything by reading and doing. With your method, I would still know nothing.

> When mistakes like this occur and go unnoticed for a long time, it racks up and creates unnecessary costs which amount as much as $10k/month which depending on the company budget can be very expensive and can make or break a company.

> Unless you know what you are doing, don't touch tools you don't understand.

I think it might be better to say "don't take risks you can't afford". You should experiment with new tools to see if they're better than what you're using now. Just don't deploy systems to prod before you really understand them.

> There is a reason why these DevOps certifications exist in the first place and why it is a huge risk for a company to spend lots of time and money on training to learn such a complex tool like Kubernetes (Unless they are preparing for a certification). Perhaps it would be better to hire a consultant skilled in the field rather than using it blind and creating these mistakes later.

That's laughable but I will play:

I will pay anyone with a devops cert $0.01 for the right to 10% of my savings over a one-year period. If I end up paying more for the service after hiring such a person, that person will pay me 110% of the excess I paid as a result of hiring them. If a devops cert were actually any good, this would be a license to print money for anyone who holds one.

OP's problem is that his organization did not engage in any sort of risk management, which is why they had:

a) K8s treated as something magical that makes things work

b) Someone who did not know how K8s works being allowed to re-engineer K8s

c) No alert on a change in the usage data exported by Google

P.S. If you are on a cloud, drop everything and implement (c). It will save your shirt dozens if not hundreds of times a year.
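On GCP, the simplest version of (c) is a billing budget with threshold alerts. A rough sketch using gcloud follows; the billing-account ID, display name, and amounts are placeholders, and the exact flags may vary by gcloud version, so check `gcloud billing budgets create --help` before relying on it:

```shell
# Create a monthly budget that notifies billing admins when actual spend
# crosses 50% and forecasted spend crosses 90% (all values are placeholders).
gcloud billing budgets create \
  --billing-account=0X0X0X-0X0X0X-0X0X0X \
  --display-name="k8s-cluster-budget" \
  --budget-amount=1000USD \
  --threshold-rule=percent=0.5 \
  --threshold-rule=percent=0.9,basis=forecasted-spend
```

Note that budget alerts notify you; they don't cap spending. Pairing the budget with a Pub/Sub notification channel lets you react programmatically (e.g. scale down or page someone) instead of waiting on an email.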

Nah, you've probably just got a fragile configuration that won't scale and will cost you money in downtime, lost sales, or failure to live up to contract.

An engineer who does it right isn't going to save you much money over your best case scenario - but they're going to keep you from losing millions in the worst case scenarios.

> Nah, you've probably just got a fragile configuration that won't scale and will cost you money in downtime, lost sales, or failure to live up to contract.

Those are the tales told by consultants and engineers who like to play with toys: it is a typical case of premature optimization. The odds that you have enough traffic to need to scale are slim to none.

If you do need to scale, the odds are your apps are over-engineered on corner cases and under-engineered on the main path: if your ORM takes 300 ms to initialize on every request without fetching any data from the database, "scaling" is the last thing you should be worried about.
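To make the point concrete, here is a small sketch (with simulated timings, not any particular ORM) of measuring fixed per-request overhead against the actual query cost before concluding you have a scaling problem:

```python
import time

def timed_ms(fn):
    """Run fn once and return its wall-clock duration in milliseconds."""
    start = time.perf_counter()
    fn()
    return (time.perf_counter() - start) * 1000

# Simulated per-request costs: a heavyweight "ORM init" vs the actual query.
def orm_init():
    time.sleep(0.30)   # stand-in for 300 ms of mapper/metadata setup

def run_query():
    time.sleep(0.005)  # stand-in for a 5 ms database round trip

init_ms = timed_ms(orm_init)
query_ms = timed_ms(run_query)
print(f"init: {init_ms:.0f} ms, query: {query_ms:.0f} ms")
# If fixed overhead dominates the main path like this, fix that first;
# adding replicas just multiplies the waste.
```

The same pattern works with a real profiler; the point is to know the ratio before paying for horizontal scale.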

> An engineer who does it right isn't going to save you much money over your best case scenario - but they're going to keep you from losing millions in the worst case scenarios.

You will go out of business before those savings are going to matter.

Love that explanation! Part of why Facebook grew so big is that it was basically never down!

I'll save you tons of money but we'll need to move to my data hosting center in my basement. I have many c64s networked together most with a 1581 drive.

I'll offer it for free so you can pay me immediately. I got my DevOp certificate in 1999 from BrainBench as a web master.

> Unless you know what you are doing, don't touch tools you don't understand.

That approach makes sense for businesses that need to be risk-averse, e.g. self-driving cars.

But in a "move fast/break things" operation, wasting some money might be preferable to taking the time to do things right the first time around.

>> There is a reason why these DevOps certifications exist in the first place...

Yes there is a reason, but you're not going to like the answer if I tell you what it is.

> Unless you know what you are doing, don't touch tools you don't understand.

But how do I know I’m an idiot?

Even if you are just a noob, how will you ever learn if you don't get in there and have a try? A very un-hacker sentiment for Hacker News!

Remember that not everyone here is a hacker. Some are specialized consultants who have vested interests in speaking out against DIY. (And that’s not to say that they are wrong)
