> Kubernetes is complex and I think they are partially right
Kubernetes is a distributed centralized operating system which itself depends on a distributed decentralized database, and has a varying network topology, permissions system, plugins, scheduler, storage, and much more, depending on how & where it was built, and runs applications as independent containerized environments (often deeply dependent on Linux kernel features) which can all have their own base operating systems. All of which must be maintained, upgraded, patched, and secured, separately, and frequently.
Kubernetes is literally the most complex single system that almost anyone in the world will ever use. It is the Katamari Damacy of the cloud.
> It allows dev teams to not worry about all these things; all they must do is to write a simple YAML file.
cackles, then sobs
> More importantly, teams no longer need to ask DevOps/infra folks to add DNS entry and create a Load Balancer just to expose a service.
more sobbing
> Should you use Kubernetes?
Should you change a tire with a crowbar, a can of WD40, and a lighter? Given an alternative, the alternative is usually better, but sometimes you don't have an alternative.
I worked at a large company that deployed its own Kubernetes stack, on a VERY large number of physical hosts. The theory was that K8S would simplify our devops story enough that we could iterate quickly and scale linearly.
In reality, the K8S team ended up being literally 10x larger than the team building the application we were deploying on it. In addition, K8S introduced entirely new categories of failure mode (ahem: CNI updates/restarts/bedsh*tting, operator/custom resource failures, and tons of other ego-driven footguns).
The worst part? The application itself ran fine on a single dev workstation, but also on any random assortment of VMs. Just pass the consul details as environment variables. I am not saying everybody on K8S is in the same boat, but I think that far more people are planning on becoming a unicorn cloud service than have any hope of becoming a unicorn cloud service.
TL;DR: If your hosting solution requires more maintenance than the application itself, you made a boo-boo.
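For context, the whole "pass the consul details as environment variables" setup is about this small. A hedged sketch in Compose-style YAML; the image name and addresses are placeholders, though CONSUL_HTTP_ADDR/CONSUL_HTTP_TOKEN are the standard Consul client variables:

```yaml
# Illustrative only: the app reads its Consul coordinates from the environment,
# so the same container runs on a workstation, a random VM, or under Compose.
services:
  app:
    image: example/app:1.2.3                      # placeholder image
    environment:
      CONSUL_HTTP_ADDR: "consul.internal:8500"    # standard Consul client env var
      CONSUL_HTTP_TOKEN: "${CONSUL_TOKEN}"        # injected from the host environment
    ports:
      - "8080:8080"
```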
Using k8s is not remotely the same thing as maintaining your own k8s stack. One is easy (get your feet wet in an afternoon), the other is hard (maybe after a couple months of full-time study you can pull it off).
The vast majority of teams that have enough crap to run to warrant using k8s should not be maintaining their own k8s stack.
In fact, it’s entirely possible that running k8s is so hard that the only players that can do it reliably are the big cloud companies.
Actually, most devs should have a much easier time with Borg than k8s because there's an army of SREs/devs running it for them. I was one of those running it, but even I didn't appreciate the complexity of running an airtight Kubernetes distro at the beginning. That's on-prem, though; if you don't know how to network, just use the cloud thing.
I'm not sure what you mean by operationally broken. While I would agree that k8s is not a great match for many, I wouldn't say it's fundamentally broken. It still has many rough edges.
What's amazing about borg is the number of clusters, their sizes, and the support teams that keep them healthy. k8s does not have that.
I agree with the point that production is hard. There's so many things you just don't think about as a developer that end up being important. Log storage, certificate renewal, etc.
I think how "hard" kubernetes is depends on how deep you go. If you're building a cluster from scratch, on your own hardware, setting up the control plane yourself etc. it's very very hard. On the other hand, if you're using a hosted service like EKS and you can hand off the hardware and control plane management to someone else, IMO it's actually very easy to use; I actually find it a lot easier than working with the constellation of services amazon has to offer for instance.
I do think there are parts of it where "best practices" are still being worked out though, like managing YAML files. There's also definitely some rough edges. Like, Helm charts are great... to use. They're an absolute nightmare to write, and there's all sorts of delightful corner cases like not being able to reliably upgrade things that use StatefulSet (last I used anyway). It's not perfect, but honestly if you learn the core concepts and use a hosted service you can get a lot out of it.
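To make "the core concepts" concrete: a minimal Deployment plus Service is most of what a simple stateless app needs on a hosted cluster. This is a sketch with placeholder names, not anyone's production manifest:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                        # desired copies; the controller keeps them running
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example/web:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web                         # routes traffic to pods carrying this label
  ports:
    - port: 80
      targetPort: 8080
```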
I disagree that production is hard in itself; I think the way people approach production makes it needlessly more difficult. When launching a service, everyone seems to try to get the entire stack right in one shot, which is difficult.
A much better approach is "1. Make it work 2. Make it good 3. Make it fast". Your initial prototypes need to define the core functionality, and then you incrementally build stuff on top of that.
Anecdotally, every time someone tells me Kubernetes is overkill and then follows your approach, within a year they end up building the capabilities that come out of the box with Kubernetes, and of course they aren't as well thought out, because they were done ad hoc and as needed.
Why is that bad? If the business survived for long enough to have that problem, that's a win, not a failure. Being killed or at least hobbled by unnecessary complexity, on the other hand, is a thing in way more businesses than we like to admit.
Yep, you can grow to be a huge company with just autoscaling groups and some terraform.
Absolutely painfully boring stack. I’m working at a place now with that stack and we haven’t had a single page in over a year now and we do 5 deploys a day ish.
When I see developers use k8s as “programmable infra replacing on-prem functionality as a baseline”, some time after step 3 comes “make it as cheap as on-prem”. Unfortunately, the “move fast and break things” tenor that got the services built on k8s in the first place baked in decisions that will cost as much to re-engineer into an elastic, cloud-savvy form that also saves money compared to on-prem.
This often stems more from organizational or project-management mis-coordination than from technical choices the developers are fully aware of. Like a system’s owners not prioritizing putting events of interest into a message queue and insisting upon polling for the information on one end. And an offshore operations team saying that learning how to support AWS Lambda/Azure Automation/GCP Cloud Functions is added cost and scope to the contract. So the developer is backed into a corner and runs cron/systemd-based polling off a small EC2. Thousands of decisions like this add up fast, and it takes a major executive-committee commitment to root out tech debt and prevent new tech debt from accumulating to unsustainable levels, if cloud initiatives are not to sink beneath a mire of expensive small decisions made in pursuit of new accomplishments. It is a tough balancing act.
Disagree. The first two objectives clash very much with each other. I have worked in teams that did that. And generally, if anything non-trivial is built without planning, just to "make it work", it will more often than not be riddled with design issues that don't allow it to be good.
I've never seen anything that couldn't be incrementally improved. There are definitely tons of situations where the additional effort to "make it good" is worth incurring upfront. But you can definitely make something work then make it better. And a lot of times, you'll find that your carefully planned out "good" solution ends up not being all that good and needs incremental improvement. It can be hard to get something right on the first try especially if you haven't done it before.
I have rarely seen a bad but barely workable website become good with incremental improvement. After the product reaches some threshold of users, it is hard to change the code structure in a safe, testable way if there is no good way to test it in a production-like environment, and so developers are rightfully afraid to make any big changes in one go. I have seen many examples, both internally and externally:
- My bank's website, which is a decade old and has millions of users, has tons of issues. I am not even talking about UI or looks: it randomly hangs, sessions disconnect, etc.
- The ERP tool at one of my previous orgs sucked, even though it was used by tens of thousands of people and had a dedicated team maintaining it.
- One team that I worked with had no proper staging environment; building one would have required synced deployment of dozens of services, with many teams working in different languages and configs. And just this simple task couldn't be completed even with a push from upper management, purely because of the complexity.
A lot of companies don't even have these: a way to properly test the product, actionable alerting/logging, reproducible environments, and clearly distributed ownership among code components. Kubernetes more or less forces a reproducible environment.
I see way more teams making poor decisions and underbidding complexity, then spending years hacking on shit because it's kind of jank.
When folks have license to do good work, I usually see it going well, if they have some informed opinions & group wisdom at their disposal and actually leverage them.
If you start designing into the future, you are now trying to make it good before making it work.
Making it work means you start with barebones, just to have a proof of concept that runs, is accessible, is secured, even if you have to click things together and manually ssh into EC2s to launch code. Once you have everything working, then the bigger design should be a lot clearer, in terms of what you have to do to make it good.
You can go from out of the box Debian listening on 22/80/443 directly on internet to WAF, intrusion detection, encrypted overlay networks with CA hierarchy and offline route keys, encryption at rest, elaborate IAM, hardware signed commits, full supply chain audits with every dependency vetted and vendored, multi-access-level logs, mandatory spyware on all corporate gear, and so on and so forth. For most businesses it's more of the former than the latter, and there's definitely no such thing as categorically "secure".
At the end of the day, getting a stable production environment is simply a tradeoff between the amount of complexity you need to make your infrastructure do what you want, and reducing complexity because it removes failure points from the production environment.
K8s is nice and all, but if all you really need can be solved by 2 VMs and a way to announce an anycast address (or use a load balancer if that is not an option), why would I add all that complexity?
For reasons of experience, all I ever want from a system is that it’s reproducible.
I had a vanity website running k8s in a managed cloud. I thought I was backed up by my provider and original ansible deployment, which was of course developed iteratively.
I originally did this mostly to do a practice deployment, and get the workflow.
A few years later, it went down and I didn’t notice for a few weeks. It was too unimportant to invest in monitoring, and not worth it to do over. Redeploying gave me lots of confusing errors (note: I also do this stuff for work).
Frankly, I was surprised that the provider doing updates would make my years-stable site fall over. I haven’t tried that one again for funsies, yet. It’s the vendor specificity that was my d’oh!
Cost aside, I wonder how far you can get with something like a managed newsql database (Spanner, CockroachDB, Vitess, etc.) and serverless.
Most providers at this point offer ephemeral containers or serverless functions.
Does a product focused, non infra startup even need k8s? In my honest opinion people should be using Cloud Run. It’s by far Google’s best cloud product.
Anyway, going back to the article - k8s is hard if you’re doing hard things. It’s pretty trivial to do easy things using k8s, which only leads to the question - why not use the cloud equivalents of all the “easy” things? Monitoring, logging, pub/sub, etc.; basically all of these things have cloud equivalents as services.
The question is, cost aside, why use k8s? Of course, if you are cost constrained you might do bare metal, or a cheaper collocation, or maybe even a cheap cloud like DigitalOcean. Regardless, you will bear the cost one way or another.
If it were really so easy to use k8s to productionize services to then offer as a SaaS, everyone would do it. Therefore I assert, unless those things are your service, you should use the cloud services. Don’t use cloud vms, use cloud services, and preserve your sanity. After all, if you’re not willing to pay someone else to be oncall, that implies the arbitrage isn’t really there enough to drive the cost down enough for you to pay, which might imply it isn’t worth your time either (infra companies aside).
> Iteration speed and blazing fast automated tests.
Wholeheartedly agreed!
It is also nice to have that additional assurance of being able to self-host things (if ever necessary) and not being locked into a singular implementation. For example, that's why managed database offerings generally aren't that risky to use, given that they're built on already established projects (e.g. compatible with MySQL/MariaDB/PostgreSQL).
> When I discovered minio, I suddenly got much more confident coding against s3.
MinIO is pretty good, but licensing-wise it could become problematic if you don't work on something open source but ever want to run it in prod. Not really what this discussion is about, but AGPL is worth mentioning: https://github.com/minio/minio/blob/master/LICENSE
That said, thankfully S3 is so common that we have alternatives even to MinIO available, like Zenko https://www.zenko.io/ which is good for both local development as well as hosting in whatever environments necessary. I was actually about to mention Garage as well which seems better because it's a single executable but they also switched to AGPL, probably not an issue for local testing though: https://garagehq.deuxfleurs.fr/
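For the local-development case specifically, running MinIO under Compose is a few lines. A sketch assuming the stock minio/minio image and made-up, local-only credentials:

```yaml
services:
  minio:
    image: minio/minio:latest                       # AGPL-licensed; fine for local testing
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: localdev                     # local-only credentials
      MINIO_ROOT_PASSWORD: localdev123
    ports:
      - "9000:9000"   # S3-compatible API
      - "9001:9001"   # web console
    volumes:
      - minio-data:/data
volumes:
  minio-data:
```

Point your S3 client at http://localhost:9000 with those credentials and most SDKs just work.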
> Does a product focused, non infra startup even need k8s? In my honest opinion people should be using Cloud Run. It’s by far Google’s best cloud product.
> Or just app engine honestly.
As a former App Engine PM and PM of Cloud Run, this warms my heart to hear--I'm glad folks love to use it :)
It's been a few years since I've worked on these products, but they were designed to be composed and used by each other. Cloud Run provides an autoscaling container runtime with Knative compatible API; Cloud Build (and Container Registry) + Cloud Run = App Engine, App Engine + CloudEvents = Cloud Functions.
With a Knative compatible API, you can theoretically take those primitives and move them to your own K8s cluster, managed (GKE/EKS/AKS) or self-managed, giving folks a tremendous amount of flexibility down the line if/when they need it (hint: the majority of customers didn't, and the fully managed services were great).
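For readers who haven't seen it, a Knative Service manifest is roughly this small, which is what makes the "move it to your own cluster later" story plausible. A sketch of the Knative Serving API shape only (placeholder image), not a claim about Cloud Run internals:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
spec:
  template:
    spec:
      containers:
        - image: example/hello:1.0.0   # placeholder image
          env:
            - name: TARGET
              value: "world"
```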
I had the idea that cloud run was like a (possibly auto-scaling?) abstraction of a container runtime or something so it makes sense that app engine uses it.
I don't think I have been aware of that fact though when using app engine in the past -- which is good, by the way, I don't want to know or care about this, just run my containers somehow :-)
Similar with fly.io: I have an application running on there and was pleasantly surprised they don't even charge if the amount is under $5/month. I've been very happy with how easy it is to deploy and with the hosted Postgres. I'm using the JVM and it works well; I originally played around with Elixir and was especially impressed with their feature set for it.
I think you could push that setup far. I'm not familiar with GCP or Cloud Run, but it probably integrates nicely with other services GCP offers (for debugging, etc.).
I'd be curious to read if anybody has that setup and what scale they have.
Regarding the second part, I totally agree, either use cloud or don't. For some reason, most companies want to be cloud-agnostic and so they stay away from things that are too difficult to migrate between cloud providers.
This. Greenfield products should be serverless by default. By the time you have sustained traffic to the point where you can run the numbers and think that you could save money by switching off serverless, that's a Good Problem To Have, one for which you'll have investors giving you money to hire DevOps to take care of the servers.
I tried to use Lambda. Cold startup really is awful. You have to deal with running db migrations in Step Functions or find other solutions. Aurora Serverless also does not scale to zero. Once you get traffic you overload RDS and need to pay for and set up an RDS proxy, and don't get me started on the pointless endeavor of trying to keep your lambdas warm. Sort of defeats the point. Serverless is not actually serverless and ends up costing more for less performance and more complexity.
It's way simpler and cheaper to start with a single-VPS single point of failure, then over time graduate to running docker compose or a single-node k3s cluster on that VPS. And then eventually scale out to more nodes…
Without more details on how you tried to set up lambda...
> Cold startup really is awful
It has gotten significantly better over time, particularly for VPC-connected functions, as AWS no longer creates an ENI per function but re-uses an ENI. If most of your UI code is elsewhere (CDN, mobile app) then you're not hitting the lambda endpoint for initial UI draws.
> db migrations in step functions or find other solutions
Fargate? Especially as DB migrations might exceed the 15 minute maximum runtime for Lambda
> aurora serverless also does not scale to zero
Serverless V2 does auto-pause, but yeah, I agree that AWS's serverless SQL portfolio is lacking compared to Planetscale, Neon, other similar new entries. Which you can run without RDS proxy or a VPC.
I'll agree that projects for which response latency needs to be lower than what cold starts will reasonably permit should pick a different architecture, but I don't think most greenfield product projects are so latency-sensitive. That sounds to me like premature optimization.
> way simpler and cheaper to start with a single VPS single point of failure, then over time graduate to running docker compose or a single node k3 cluster on that VPS. And then eventually scale out to more nodes…
Cheaper in raw early cloud infrastructure costs, sure. Cheaper in total cost of ownership, particularly as the service starts to scale, including overprovisioning waste and engineering time dedicated to concerns unrelated to value? For any project whose autoscaling behavior would be bursty, or at best unknown, I beg to differ.
Serverless does not necessarily mean lambda. It could be just about anything that runs containers for you. AWS ECS has an offering called Fargate that I've been happy with for our hosting. You are right though that the compute costs are typically more than renting a traditional VPS. There is definitely a tradeoff between labor and compute costs.
The point of serverless isn't to run 0 servers in your downtime, its to abstract away everything related to running hardware. I have an app that is built on a runtime (jdk, node, whatever) and I shouldn't have to deal with anything below that layer.
37signals is not like the typical large-scale startup. They have an extremely small team (around 30 people?), and just a couple of products.
Large-scale startups use dynamic-scheduled cloud services in part to reduce coupling between teams. Every service --- and there are dozens --- is scheduled independently, and new teams can get spun up to roll out new services without too much intervention from other teams.
When you've got a couple products that have been in maintenance mode for 10+ years and then just two primary products, both of which are on the same stack, and you can predict your workloads way out into the future (because you charge money for your services, don't do viral go-to-markets, and don't have public services), there simply isn't much of a win to dynamic scheduling. You can, in fact, just have a yaml file somewhere with all your hosts in it, and write some shell-grade tooling to roll new versions of your apps out.
A lot of the reflexive pushback to not using k8s seemed like it came from people that either didn't understand that 37signals was doing something closer to static scheduling, or that don't understand that most of what makes k8s complicated is dynamic scheduling.
Most startups that I see trying to go with microservices too early do so while keeping a shared database between the services, so they're not really microservices, but a distributed monolith. This turns into a massive, massive pain.
Doing microservices well means building out templates for CI/CD pipelines, templates for observability, figuring out how best to share (or not share) common credentials like third-party API keys, setting up a service mesh, setting up services like Backstage and Buf... which inevitably requires hiring dedicated engineers at enormous cost.
If you can set up a monolith, why would you switch to microservices? So that you can have a smaller container image and faster autoscaling? So the developers on the second and third teams you hire can wait a little less time for CI tests to run?
It's a pretty big mistake to adopt microservices too early.
Sure, but I'm not thinking of small startups, but rather large ones like Doordash, Airbnb, Box, Square, Stripe; these places legitimately have lots of services, complex coupling patterns, and a need for some kind of indirection and dynamic placement to enable teams to get things done without coordinating every single PR.
37signals presumably doesn't have any of those problems. They shouldn't be using K8s. In the previous thread, I suggested that I'd consider Nomad if I were them, but really, they don't need Nomad either. They need a yaml file mapping containers to hosts, and some tooling to do smooth deploys and to hook up a relatively static service discovery map. Which: that's what they built.
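To spell out what "a yaml file mapping containers to hosts" could look like, a purely hypothetical sketch; none of this is 37signals' actual tooling or file format:

```yaml
# Hypothetical static placement map: hostnames, images, and ports are invented.
hosts:
  app-01.dc1.example.com:
    - image: example/app:2024-01-15
      ports: ["8080:8080"]
  app-02.dc1.example.com:
    - image: example/app:2024-01-15
      ports: ["8080:8080"]
  jobs-01.dc1.example.com:
    - image: example/worker:2024-01-15
```

The deploy tooling then just walks the map, pulls the listed image on each host, and swaps containers.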
Nitpick: it may be helpful to not use the term "startups" when you mean companies like Doordash, Airbnb, Stripe etc. Those companies _were_ startups... a very long time ago. Same with 37signals.
This advice offered in the hope that it may improve communication!
You are talking about a handful of unicorns, not just large startups, and even then only a subset of them — Shopify with their wildly successful monolith comes to mind. By definition their requirements in their current unicorn state can't be the default.
Yet we collectively behave as if they are the default, and as if everyone needs dynamic scheduling (and the rest of the kitchen sink).
> Large-scale startups use dynamic-scheduled cloud services in part to reduce coupling between teams.
This is the crux. It's Conway's Law in action. The promise of Kubernetes to a large org is that you can split the baremetal and OS layer into one team that manages everything up to Kubernetes, and then that's the common interface where all other teams deploy their applications into. And besides the team separation, the value is that you have a central layer to put automatic policy enforcement, engineering doctrines, and so forth.
While I agree with you that Kubernetes allows you to split the stack down to individual "specialisations", I don't see how this is different with Kubernetes compared to "traditional" virtual machines.
Usually the differentiator is either the container or the VM. Everything beneath that gets managed by some kind of operations team (or teams if you are really large or specialized), while the stuff running on top of the infrastructure is usually done by developers/application maintainers.
> what makes k8s complicated is dynamic scheduling.
… Which almost no startup, or anyone else, will ever need. Creating complex stuff for workloads you will never have. You hope to have them, but that's called premature optimisation. And then you still most likely fall in the bracket of a company that will never need it.
I work for a company that routinely deploys very large-scale software to airlines/airports/rail companies around the world. Millions of lines of mission-critical server and mobile/browser/desktop client code.
We do it without the cloud, without micro-services, without Kubernetes, etc. Just straightforward, good old-fashioned client/server monoliths. It's simple. It works.
The reality is that 99% of people who think they need Kubernetes don’t actually need it. Almost all problems in software development are caused by the developers themselves. Not by the actual business problems.
Who said anything about a large-scale startup? Kubernetes is approachable all the way down to N=1 employees.
I strongly disagree with your take on static vs dynamic scheduling. Static scheduling ties your hands early. In a mature organization, it is very much an optimization.
Dynamic scheduling forces a cattle-not-pets mentality out of the gate, which is great. It also gives you all the knobs to figure out HA, right-sizing and performance that you'd ever want, for whenever you're ready for them. It's considerably more laborious to rearrange or tune things with static scheduling. I've run the gamut here and Kubernetes is by far the easiest and most approachable way to manage a fleet from 1 to $BIG that I've ever encountered. If you want to treat it like a static scheduler, that is also trivial. It's not like there's some huge cost to doing so. It's basically a NOP.
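As a sketch of "treat it like a static scheduler": pinning a workload to a named node is a single field. Hostname and image are placeholders; kubernetes.io/hostname is the standard node label:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pinned-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pinned-app
  template:
    metadata:
      labels:
        app: pinned-app
    spec:
      nodeSelector:
        kubernetes.io/hostname: worker-03   # pin to one specific node
      containers:
        - name: app
          image: example/app:1.0.0          # placeholder image
```

Drop the nodeSelector later and the same manifest becomes dynamically scheduled.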
37signals blew off their foot by doubling down on their sunk costs (read: capex) of metal. They clearly don't want to stop thinking about KVM and F5s and SSH keys and all the other odds and ends that are entirely solved away by managed services for reasonable prices.
Which is it? Are they too big for the cloud or too small? You can't have it both ways.
You're likely looking at the problem from a different angle. Kubernetes is an ecosystem for managing software deployed on a fleet. There's value in that even if it's two or three hosts. What size of an org is too small for Terraform? Puppet? CI/CD? Docker? Some bash scripts?
All these tools should fundamentally solve some piece of the puzzle for you. Kubernetes on a managed platform just happens to cover many pieces that you still need to solve otherwise. It may be more novel than some other technology but it's not less proven or difficult, fundamentally.
Most of the criticism I see here is literally just FUD. I run Kubernetes at home. It's fine. It's less work than anything comparable I've ever encountered.
It's not a huge cost. This is ludicrous. Sign up for a free account on GCP and walk through their getting started guide and you'll have stuff running in under an hour.
> your "fine" rests on a base level of knowledge which is enormous, certainly incompatible with p99 of users
I doubt it. I have far, far more experience running services on bare metal pizza boxes in 48U racks in PoPs all over the world, but that is a base level of knowledge that actually is enormous and certainly incompatible with p99 of my pool of potential colleagues.
If you can figure out how to run a serverless app, you can figure out Kubernetes. It's not rocket science.
This is the attitude I recoil from. Some of the technologies you listed --- CI is an example --- earn their keep so well that they're essentially universal best practices (at least for serverside software). But others don't, and you can't use a list that includes, like, "source control" as a way to launder in the idea that Kubernetes is a universal best practice. It is not. It earns its keep in some places and not in others.
The only best practice I strongly preach is using source control to define your infrastructure all the way down to the versions of binaries serving your customers. Kubernetes is a means to an end there, when used in combination with "gitops". It's very compelling and a very malleable pattern.
As far as I can tell, the alternatives sit on a spectrum. One end is to go low and own the kernel on up, which means a lot more depth in your stack. While this is the traditional systems curriculum, it still requires a lot of expertise that is relatively uncommon. I was once like you, I suspect, and our rose-colored glasses for the days of yore make going low more appetizing than I think is reasonable from a business perspective. No one at a startup should ever care about iptables.
The other end of the spectrum is stapling your software to managed services that range from portable (S3) to not (GCP's PubSub). For most startups I'd reckon this is actually the preferred approach as portability can be solved after you find some sort of product market fit. My reservation here is that going this route often blinds folks from understanding how to solve these problems yourself, without the handholding of a big cloud.
Here is where Kubernetes shines. It's the best of both worlds, and the lowest common denominator is relatively portable but at a high enough level that you can be productive. Surely our experiences here differ but mine thus far has been nothing short of positive in the last couple of years. Before then? Mostly a shitshow without deep expertise, but today you can get by without learning much at all.
In comparison to Kubernetes, the closest thing I've used that wasn't kubernetes was a bunch of home-rolled Terraform and puppet and shell scripts. Not too different than what 37signals is bragging about. But it sucked. It's fragile and complex and if the one person who wrote 95% of it leaves your company you are hosed. Kubernetes, when used in an appropriate manner, unshackles you from this sort of thing. Sure, YAML is terrible but at least your YAML looks a lot like my YAML or that YAML on stack overflow.
Kubernetes is fundamentally the next easiest thing behind whatever your equivalent of Heroku/Cloud Run/Fly.io is.
If you think it's easier to run a service on a VPS by scp'ing a tarball and ./'ing the server, then you're not being honest with yourself about your sysadmin hobby/addiction affecting your work. If you're skilled and competent, good for you! I am too, and we could run our servers happily together forever. But it's sadly a lift to ask newcomers to follow along. The industry is moving away from this level of depth as far as most developers are concerned.
I think the post title should be called “Production is hard” (as the author talks about later on). Pick up any technology out there: from Python, to C++, to K8s, to Linux… Do the analogous “Hello world” using such technologies and run the program on your laptop. Easy. You congratulate yourself and move on.
Production is another story. Suddenly your program that wasn’t checking for errors, breaks. The memory that you didn’t manage properly becomes now a problem. Your algorithm doesn’t cut it anymore. Etc.
Definitely if they weren't running very complex services or their business tolerated an occasional maintenance window.
The particular way Kubernetes can bite you is that it makes it much easier to start with far more complex setups - not necessarily much harder than starting with a simple setup! - but then you have to maintain and operate those complex setups forever, even if you don't need them that much.
If you're growing your team and your services, having a much bigger, more complicated toolbox for people to reach into on day 1 gives you way more chance of building expensive painful tech debt.
I think it may appear so because Kubernetes promotes good practices. Do logging, do metrics, do traces. That list quickly grows and while these are good practices, there's a real cost to implement them. But I wouldn't agree that Kubernetes means building tech debt - on the contrary, if you see the tech debt, k8s makes it easier to get rid of it, but that of course takes time and if you don't do it regularly that tech debt is only gonna grow.
I just rarely see people tackling greenfield problems with the discipline to choose to do Kubernetes without also choosing to do "distributed" in a broader, complexity-multiplying way.
If not for Kubernetes (and particularly Kube + cloud offerings) I really doubt they'd do all the setup necessary to get a bunch of distributed systems/services running with such horizontal sprawl.
I'm going to diverge from sibling comments: it depends.
As the article points out, k8s may really simplify deploys for devs, while giving the autonomy. But it isn't always worth it.
Yes, until you've scaled enough that it isn't. If you're deploying a dev or staging server, or even prod for your first few thousand users, then you can get by with a handful of servers and stuff. But a lot of stuff that works well on one or three servers starts working less well on a dozen servers, and it's around that point that the up-front difficulty of k8s starts to pay off with the lower long-term difficulty.
Whatever crossover point might exist for Kubernetes, it's not at a dozen servers; at the low end it's maybe 50. The fair comparison isn't against "yolo scp my php to /var/www", but against any of the disciplined orchestration/containerization tools other than Kubernetes.
I ran ~40 servers across 3 DCs with about 1/3 of my time going to ops using salt and systemd.
The next company, we ran about 80 in one DC with one dedicated ops/infra person also embedded in the dev team + "smart hands" contracts in the DC. Today that runs in large part on Kubernetes; it's now about 150 servers and takes basically two full ops people disconnected from our dev completely, plus some unspecified but large percentage of a ~10 person "platform team", with a constant trickle of unsatisfying issues around storage, load balancing, operator compatibility, etc. Our day-to-day dev workflow has not gotten notably simpler.
No it didn't. You ended up with each site doing things differently. You'd go somewhere and they would have a magical program with a cute name written by a founder that distributed traffic, scheduled jobs and did autoscaling. It would have weird quirks and nobody understood it.
Or you wouldn't have it at all. You'd have a nice simple infra and no autoscaling and deploys would be hard and involve manually copying files.
Right up until you needed to do one of the very many things k8s implements.
For example, in multiple previous employers, we had cronjobs: you just set up a cronjob on the server, I mean, really, how hard is that to do?
And that server was a single point of failure: we can't just spin up a second server running crond, obviously, as then the job runs twice. Something would need to provide some sort of locking, then the job would need to take advantage of that, we'd need the job to be idempotent … all of which, except the last, k8s does out of the box. (And it mostly forces your hand on the last.)
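A sketch of what "out of the box" means for that cron example (placeholder image); concurrencyPolicy: Forbid is the bit that replaces the home-grown locking:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 3 * * *"
  concurrencyPolicy: Forbid        # never run two copies at once, even across nodes
  startingDeadlineSeconds: 600     # skip a run rather than pile up if the cluster was busy
  jobTemplate:
    spec:
      backoffLimit: 2              # retry a failed run twice
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: example/report-job:1.0.0   # placeholder image
```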
Need to reboot for security patches? We just didn't do that, unless it was something like Heartbleed where it was like "okay we have to". k8s permits me to evict workloads while obeying PDB — in previous orgs, "PDBs" (hell, we didn't even have a word to describe the concept) were just tribal knowledge known only by those of us who SRE'd enough stuff to know how each service worked, and what you needed to know to stop/restart it, and then do that times waaay too many VMs. With k8s, a daemonset can just handle things generically, and automatically.
Need to deploy? Pre-k8s, that was just bespoke scripts, e.g., in something like Ansible. If a replica failed to start after deployment, did the script cease deployment? Not the first time it brought everything down, it didn't: it had to grow that feature by learning the hard way. (Although I suppose you can decide that you don't need that readiness check in k8s, but it's at least a hell of a lot easier to get off the ground with.)
Need a new VM? What are the chances that the current one actually matches the Ansible, and wasn't snowflaked? (All it takes is one dev, and one point in time, doing one custom command!)
The list of operational things that k8s supports that are common amongst "I need to serve this, in production" things goes on.
The worse part of k8s thus far has been Azure's half-aaS'd version of it. I've been pretty satisfied with GKE, but I've only recently gotten to know it and I've not pushed it quite as hard as AKS yet. So we'll see.
I've never heard the term "resource budget" used to describe this concept before. Got a link?
That'd be an odd set of words to describe it. To be clear, I'm not talking about budgeting RAM or CPU, or trying to determine do I have enough of those things. A PodDisruptionBudget describes the manner in which one is permitted to disrupt a workload: i.e., how can I take things offline?
Your bog simple HTTP REST API service, for example, might have 3 replicas, behind like a load balancer. As long as any one of those replicas is up, it will continue to serve. That's a "PodDisruptionBudget", here, "at least 1 must be available". (minAvailable: 1, in k8s's terms.)
A database that, e.g., might be using Raft, would require a majority to be alive in order to serve. That would be a minAvailable of "51%", roughly.
So, some things I can do with the webservice, I cannot do with the DB. PDBs encode that information, and since it is in actual data form, that then lets other things programmatically obey that. (E.g., I can reboot nodes while ensuring I'm not taking anything offline.)
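Concretely, those two examples are about this much YAML (labels are illustrative):

```yaml
# Web service: any single replica is enough to keep serving during a node drain.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: web
---
# Raft-backed database: a voluntary disruption must leave a majority alive.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: db-pdb
spec:
  minAvailable: 51%
  selector:
    matchLabels:
      app: db
```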
A PDB is a good example of Kubernetes's complexity escalation. It's a problem that arises when you have dynamic, controller-driven scheduling. If you don't need that you don't need PDBs. Most situations don't need that. And most interesting cases where you want it, default PDBs don't cover it.
> A PDB is a good example of Kubernetes's complexity escalation. It's a problem that arises when you have dynamic, controller-driven scheduling. If you don't need that you don't need PDBs. Most situations don't need that.
No, and that's my point: PDBs exist always. Whether your org has a term for it, or whether you're aware of them is an entirely different matter.
When I did work comprised of services running on VMs, there was still a (now, spiritual) PDB associated with each service. I could not just take out nodes willy-nilly, or I would be the cause of the next production outage.
In practice, I was just intimately familiar with the entire architecture, out of necessity, and so I knew what actions I could and could not take. But it was not unheard of for a less-cautious or less-skilled individual to act before thinking. And it inhibits automation: automation needed to be aware of the PDB, and honestly we'd probably just hard-code the needs on a per-service basis. PDBs, as k8s structures them, solve the problem far more generically.
Sounds like a PDB isn't a resource budget then. We were using that concept in ESX farms 20 years ago, but it seems PDBs are more what SREs would describe as minimum availability.
Because they're completely different things you're comparing. The functionality that I describe as having to have built out as part of Ansible (needing to check that the deploy succeeded, and not move on to the next VM if not) is not present in any Helm chart (as that's not the right layer / doesn't make sense), as it's part of the deployments controller's logic. Every k8s Deployment (whether from a Helm chart or not) benefits from it, and doesn't need to build it out.
> needing to check that the deploy succeeded, and not move on to the next VM if not
It's literally just waiting for a port to open and maybe checking for an HTTP response, or running an arbitrary command until it exits successfully; all the orch tools can do that in some way.
… there's a difference between "can do it" and "is provided."
In the case of either k8s or VMs, I supply the health check. There's no getting around that part, really.
But that's it in the case of k8s. I'm not building out the logic to do the check, or the logic to pause a deployment if a check fails: that is inherent to the deployments controller. That's not the case with Ansible/Salt/etc.¹, and I end up re-inventing portions of the deployments controller every time. (Or, far more likely, it just gets missed/ignored until the first time it causes a real problem.)
¹and that's not what these tools are targeting, so I'm not sure it's really a gap, per se.
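For comparison, here is a sketch of the split the parent describes: the readinessProbe is the part you supply, while the pause-on-failure rollout behavior comes from the Deployment controller via the strategy block. Image and paths are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0    # never remove a healthy replica before its replacement is ready
      maxSurge: 1
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: example/api:1.4.0       # placeholder image
          readinessProbe:                # the health check you provide
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
```

If the new pods never go Ready, the rollout simply stalls instead of replacing the working replicas.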
Yep. Still doing it today. Very large scale Enterprise systems with complex multi-country/multi-organisational operational rules running 24/7. Millions of lines of code. No Kubernetes. No micro-services. No BS. It’s simple. It works. And it has worked for 30+ years.
Kubernetes is hard because it's over-complicated and poorly designed. A lot of people don't want to hear that because it was created by The Almighty Google and people have made oodles of money being k8s gurus.
After wasting two years chasing config files, constant deprecations, and a swamp of third-party dependencies that were supposedly "blessed" (all of which led to unnecessary downtime and stress), I swapped it all out with a HAProxy load balancer server in front of some vanilla instances and a few scripts to handle auto-scaling. Since then: I've had zero downtime and scaling is region-specific and chill (and could work up to an infinite number of instances). It just works.
The punchline: just because it's popular, doesn't mean it's the best way to do it.
It's not overly complicated, it's just trying to serve everyone's use cases. I've tried deploying to 10k servers with custom scripts in Jenkins, Bamboo, and AWS auto scaling groups, but I've found Kubernetes is the only tool that will elegantly handle the problem. You can probably write a script for the happy path, but for a production service I'd bet my money on something that can handle all of the problems that come along with the statistical blow-ups at scale. That said, it can be complete overkill for most systems.
For a happy medium, check out Nomad. I've been managing our infrastructure on Nomad for years by myself, with upwards of 40 nodes (auto-scaled) and the number of problems we've had can be counted on one hand (and was almost always a simple user error or fixed by upgrading). I spend most of the time I would otherwise spend doing tedious ops shit actually building things.
That said, Nomad and stateful services don't mix. Don't try. I think the same goes for k8s though.
Run them on EC2 or whatever managed service whatever cloud host provides (RDS, S3, etc). It's possible to run stateful services on Nomad (and I assume k8s) but from my understanding the cost is extremely high, usually much higher than the benefit.
You can setup a solid k3s cluster in 30 minutes. I'm sorry you had a hard time but just because you didn't succeed at your attempt doesn't mean it actually is super hard.
Setting it up is relatively easy, keeping it running consistently while navigating the schizophrenic levels of change and indecision on how things are done is another bag of chips. No need for passive aggression.
Yeah we use k3s for local and testing environments, where it shines. I wouldn’t want to use it to run stuff in prod and figure out how to do zero-downtime upgrades with multiple nodes.
What do you see as limitations to running k3s for prod? What makes it not shine there?
How big is your enterprise? You talk about zero-down-time upgrades of nodes. What prompts that to be a demand? What kind of band of maturity do you think this need buckets you into?
I am consistently confused by all of the talk about how "hard" Kubernetes is.
We spin up EKS. We install the newrelic and datadog log ingestion pods onto it, provided in a nice "helm" format.
We install a few other resources via helm, like external secrets, and external dns, and a few others.
Kubernetes EKS runs like a champ. My company saves 100k/mo by dynamically scaling our cloud services, all of which are running on Kubernetes, to more efficiently use compute classes.
My company has over 50 million unique users monthly. We have massive scale. Kubernetes just works for us and we only have 2 people maintaining it.
What we gain is a unified platform with a consistent API for developing our services. And if we wanted to migrate elsewhere, it is one less thing to worry about.
¯\_(ツ)_/¯
Feels like some kind of hipster instinct to dislike the "cool new thing"... even though k8s has been around for years now and has been battle-tested to the bone.
So, what do you do when one of your pods suddenly cannot connect to another, even though both nodes seem to be passing healthchecks and stuff?
Spinning up a K8s cluster "in the cloud" is easy and everyone jumps on that happy "look how simple it all is" bandwagon, forgetting that it's just the beginning of a very long journey. There are millions of blog articles of varying quality that explain how easy it is, because it's very simple to spam search engines retelling a story of how to click a couple buttons or do some basic Terraform/CloudFormation/whatever.
And here's what they don't tell you - maintaining this machinery is still your business, because all you get is a bunch of provisioned node machines and a cookiecutter to spin it up. Plus a bunch of scripts to handle the most common scenarios (scaling, upgrading, some basic troubleshooting, etc). The rest is either on you or on tech support (if you pay extra for it). And if you have a sysadmin-for-hire contract anyway, then it's them who should have an opinion on what's easy and what's hard. Contracting other people is always relatively easy - compared to what they do.
Yeah, it's easy for you because it's two people's full-time job to maintain it? Many of us are having to learn it and use it in our spare time, or on top of our other work. We wouldn't necessarily know the best practices, or to use newrelic and datadog, or what to use for external secrets or external dns, or how to diagnose and debug the issues which will inevitably occur when setting it up.
Now this is true for doing it without k8s too, but somehow there was never a huge set of blog posts about "it's really hard to set up a load balancer and a secrets service and networking", yet there is for k8s, so there must be something intrinsic in either its design or its documentation that is causing that. I think it's probably that k8s is designed for google-scale deployments, so for most people the initial burst of complexity is a bit overwhelming.
Same, we use EKS and a very similar setup, our workload has some pretty high throughput and scaling requirements. Works amazing for our team, wouldn't change it for anything else at this point. Very low maintenance effort since AWS manages the K8s infra.
Your company "saves" over 100k/month paying WAY too much for EKS, which is extremely expensive.
If you're at any decent scale (looks like you are), then switch to GKE, or switch to on-prem and buy some hardware + a Kubernetes distro like Mirantis/Openshift/Tanzu.
Heck, go run k3s on Hetzner and you won't have that much more work, but save literally millions at the scale you're talking about.
While I understand where the author is coming from, my opinion of Kubernetes (and production deployment in general) isn't that it is hard per se, but that it involves many components.
I liken it to Lego. Each component separately isn't hard to work with, and once you figure out how to connect it to other components, you can do it 100 times easily. And like Lego, a typical Kubernetes environment may consist of several dozen or hundred pieces.
So, I wouldn't describe Kubernetes as hard - I would describe it as large (i.e., comprised of multiple interconnected components). And by being large, there is a fair amount of time and effort necessary to learn it and maintain it, which may make it seem hard. But in the end, it's just Lego.
As an Infra person reading k8s posts on hackernews has got to be one of the most frustrating and pointless things to read on here. You all just regurgitate the same thing every post. It's even the same people, over and over again.
30% of you are developers who think K8s is the devil and too complex and difficult, 30% of you like it and enjoy using it, and another 20% of you have never touched it but have strong opinions on it.
I would not use k8s unless we are convinced it will benefit us in the long run (think about the constant effort that needs to be put in to get things running). k8s is not magic. I would just stick with docker-compose or DigitalOcean for a small startup, OR rent a VM on Azure, OR if you really really need k8s, use a managed k8s.
Docker swarm is a great option too, for production environments. It’s like the production-ready big brother to docker-compose (with better health checks and deployment rollout options). And it has much less of a learning curve than k8s.
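A sketch of what that buys you over plain docker-compose: the deploy: section below is only honored by `docker stack deploy` on a Swarm (image is a placeholder):

```yaml
services:
  web:
    image: example/web:1.0.0        # placeholder image
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]
      interval: 10s
      timeout: 3s
      retries: 3
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        order: start-first           # start the new task before stopping the old one
        failure_action: rollback
      restart_policy:
        condition: on-failure
```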
> [K8s] allows dev teams to not worry about all these things; all they must do is to write a simple YAML file. More importantly, teams no longer need to ask DevOps/infra folks to add DNS entry and create a Load Balancer just to expose a service. They can do it on their own, in a declarative manner, if you have an operator to do it.
Yeah, as opposed to Cloudformation or Terraform, where you...uhhh...
Don't get me wrong, it requires work to set up your corporate infrastructure in your Favourite Cloud Provider(tm) to make those things available for developers to manage. But it takes work in k8s too - even the author says "if you have an operator to do it". Kubernetes is great for what it's great for, but these are terrible arguments in favour of it.
That's true, but I'd argue that TF is not as powerful as k8s. You could combine it with auto scaling services offered by cloud providers, but then you need to understand plugins, whereas with k8s it often is a single value. For example, you can "add" a Prometheus scraper just by adding labels. You won't have that with TF.
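One hedge on the labels point: with the common scrape-config convention it's annotations rather than labels that do this, and they only work if your Prometheus scrape config looks for them (with the Prometheus Operator it's label selectors on PodMonitor/ServiceMonitor objects instead). A sketch with a placeholder image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api
  labels:
    app: api
  annotations:
    prometheus.io/scrape: "true"   # conventional annotations; honored only if the
    prometheus.io/port: "9090"     # scrape config is set up to discover them
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: api
      image: example/api:1.0.0     # placeholder; must expose /metrics on 9090
```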
Kubernetes has been a total failure at defining a simple "just works" devops workflow, but I don't think that is due to any deficiencies in the product itself. The basic premise behind its common use case – automating away the SRE/ops role at a company – is what is flawed. Companies that blindly make the switch are painfully finding out that the job of their system operator wasn't just to follow instruction checklists but to apply reasoning and logic to solve problems, similar to that of any software engineer. And that's not something you can replace with Kubernetes or any other such tool.
On the other hand, there's still a lot of value in having a standard configuration and operating language for a large distributed system. It doesn't have to be easy to understand or use. Even if you still have to hire the same number of SREs, you can at least filter on Kubernetes experience rather than having them onboard to your custom stack. And on the other side, your ops skills and years of experience are now going to be a lot more transferrable if you want to move on from the company.
Kubernetes has been a masterwork ultra flexible but consistent underlay for building simple "just works" platforms.
It would be trash if it tried to be the answer; it'd be a good fit for no one. That's an unbelievable anti-goal.
But there are dozens of really good CI/CD & gitops systems that work very well with it and make all kinds of sense. Install a k3s cluster in hour 1, set up gitops in hour 2.
Tentative agreement on where we both land. You need to make intelligent choices. You need technical understanding & problem solving. Kube isn't really much better or worse, but it at least sets a shape & form where there are common patterns no matter what systems & platforms a particular company happens to be running on kube, no matter which concern you're dealing with.
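For the "set up gitops in hour 2" part above, assuming Argo CD (the comment doesn't name a tool), the whole handoff is roughly one Application object pointing at a repo; the repo URL and paths here are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-config   # hypothetical repo
    targetRevision: main
    path: apps/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true      # delete cluster objects removed from git
      selfHeal: true   # revert manual drift back to what git says
```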
I think setting up something similar without k8s is, like, 100 times harder? I was never deeply into DevOps, but a single short video and a few doc pages told me how to bring up a highly available, load-balanced 2-node cluster, and how to roll out new services and versions in minutes with zero downtime. I also can precisely control it, and monitor all the logs and resources without leaving my working terminal for a minute. I would never be able to set up an infra like this without kube in the timespan of one day with little prior DevOps knowledge. The complexity beast it tames into structure is just mind-blowing, and it's a virtue that it came out just being a bit "hard".
"Torb is a tool for quickly setting up best practice development infrastructure on Kubernetes along with development stacks that have reasonably sane defaults. Instead of taking a couple hours to get a project started and then a week to get your infrastructure correct, do all of that in a couple minutes."
Perhaps put more simply: operating in production has a lot of intrinsic complexity that will probably surface in the tooling, and if you constantly reinvent to "fix" the complexity you'll eventually end up putting it back.
That's how you end up with the modern javascript tooling hellscape where it looks like no one was around to tell a bright young developer "no"
I have a stack which runs with a single docker-compose file, 13 services, nothing too fancy. I tried to transform it to Kubernetes (using kompose) and my file was converted into almost 24 yaml files. I'm not even looking at those little gremlins, and will stick with simple docker-compose settings and docker swarm.
You need a decent sized team to run an on premise k8s infra. If you’re in the cloud use a managed k8s.
It’s not for everyone in that I agree with the point the author makes. But if you have multiple teams doing app development k8s can be really nice. We do data movement and ML services. AKS has proven great for our use.
Your team needs scale with your demands. For many folk, a pretty boring setup is fine & requires no fiddling. Find a good gitops solution & many folks don't even need to think about kube, ever, ideally.
You do however need some folks who know Kube & systems. If you don't have them, hire some, but not just for kube. If you can't do that, then yeah, it's a real personnel investment.
Once you start having multiple teams doing more complex stuff, you either need teams that work well together, or some kind of entity to figure out how not to make a mess. Turning on RBAC is a significant early step towards being a mature, scale-out capable team.
I have yet to see an org scale past these needs myself, but most of my & my colleagues' experience is at smaller/medium-sized orgs.
I feel that with AI, maybe in a couple of years, it's going to be trivial to deploy things on current infra stacks. AI can probably create a whole range of Terraform scripts, deploy k8s and Docker containers and scale them automatically, with maybe a few humans as supervisors.
I suppose once the AI takes all the ops and dev jobs, we'll just have to seek employment doing the one thing AI can never seem to do - driving a vehicle.
The best reason to use k8s is to take advantage of the wide array of open source k8s resources, including helm charts, custom operators, articles and tutorials etc. It's become an open standard.
Like with many other technologies, k8s' advantage is its ecosystem.
something that considerably helped my communication was to transition from "what this intelligent person said is patently wrong" to "in what sense is this intelligent person correct".
It saves me and everyone else a lot of time, because if the obvious black-and-white response were needed I probably wouldn't be having the conversation in the first place.