How Kubernetes and Kafka Will Get You Fired (medium.com/jankammerath)
52 points by brakmic on April 20, 2023 | 53 comments



Not that I want to advocate for Kubernetes, but this is just a stepping stone for another blog post in a few years, "How managed services in AWS got us fired!", detailing how, when AWS changed its pricing/strategy, the companies had no way out of their design choices.

The "vendor agnostic" approach was the right call to make at that point in time. Sure, every business is unique and some cases fit better for a hands-off (lets pay for the convenience of AWS taking care of managed offerings) approach. But, it is a fallacy to think there is no cost to pay for that decision.

The cost of operating your vendor-agnostic infrastructure is replaced by your team now needing to learn the intricacies of AWS, such as IAM, AWS's way of networking, backups, etc. Those operational needs don't just go away; they just become "easier" and more defined as AWS's way of doing it.

As a consultant, one must know where to draw the line and recommend the appropriate route.


This is easy if all you use is EC2, as K8s is effectively just a layer over EC2. Once you start using other AWS features you're locked in anyway, and tbh, what's the point of using AWS if all you're using is EC2?

Choosing to be vendor agnostic is purposely choosing to need to implement every single part of a stack yourself, rather than the (usually considerably better) AWS offerings, just in case you may want to switch away in the future. It's a massive waste of money and time for nearly every business.

Unless you're in the business of providing an infrastructure-level service (like Heroku), you'll ship faster and cheaper by using a single cloud and going all in.


This is why I was not claiming that nobody should use AWS and that everybody should aim to be vendor agnostic. You're right: in many cases, for smaller businesses, vendor lock-in with AWS should be the least of their worries, and they should just go about their business. To my mind, for businesses with bursty needs, the ability to scale infrastructure up/down on AWS can be quite beneficial.

However, once you start growing and reach the scale where you have dedicated teams of engineers operating your infrastructure, at some point it makes (economic and operational) sense to own your infra. For example, instead of 10 different teams each having their own MSK setup, spinning up so many instances for the occasional Kafka need, you could have a centralised Kafka cluster operated by a central entity within the company. Instead of every team spinning up their own EKS clusters, and all the cruft associated with making them production ready, the company could set one way of running production services (and have a plan to be able to replicate that outside of AWS should there be a future need to).

> K8s is effectively just a layer over EC2

Yes, but one that is highly abstracted and lures you into tooling such as ingress, secret management, image registries, and all the Kubernetes artifacts that you need to care about in addition to your service itself.

> what's the point of using AWS if all you're using is EC2?

Not needing to operate a data center or its networking, not worrying about visiting the server to change a failing disk, and, largely, being able to scale up/down on demand without capital expenses.

Reading back what I wrote, perhaps it sounds extreme, and my grey beard agrees with this ideology more than most of the HN audience would.


> However, once you start growing and reach the scale where you have dedicated teams of engineers operating your infrastructure, at some point it makes (economic and operational) sense to own your infra. For example, instead of 10 different teams each having their own MSK setup, spinning up so many instances for the occasional Kafka need, you could have a centralised Kafka cluster operated by a central entity within the company. Instead of every team spinning up their own EKS clusters, and all the cruft associated with making them production ready, the company could set one way of running production services (and have a plan to be able to replicate that outside of AWS should there be a future need to).

If you have multiple teams spinning up MSK, then you have a different problem and running your own infra isn't going to make this easier. Same thing with EKS clusters. This is a problem with not having a team that manages your infrastructure platform.

Product engineers should own the software they write, should own deploying them (but not the infra that deploys them), should own the oncall (but not the observability infra), and should have access to relevant platforms they need.

Platform engineers should build the platform that product engineers use. Usually this does not include the CI infrastructure, but in some cases it should.

SRE often owns the observability platform (but in some cases there's an observability team for this).

This is more the norm in large companies, and usually those teams are split into multiple other teams that own different parts of the infra that platform engineers depend on for their services. This is usually the case whether the company uses a cloud or runs its own infra.

When you run your own infra, you have to have expertise in all the things that you were previously using the cloud for. IMO running k8s yourself is asking for trouble, and EKS is saving you a lot of time and money. Running your own kafka is basically masochistic. Running your own database infrastructure is also extremely difficult, and you'll almost certainly have worse uptime than if you used Aurora (or DynamoDB).

Though you can save money running your own infra, you're going to have to hire people to run it, you're going to have to plan further out, and you'll be less agile.


> what's the point of using AWS if all you're using is EC2?

It's very high quality, gives you a great audit trail, and you can have the same control plane in a ton of regions... EC2 is fantastic.


Yes, EC2 is dope. But so is DynamoDB, MSK, EKS, SQS, SNS, SES, IAM, KMS, etc. If all you use is EC2, then you need to manage your own equivalents of a large number of services that are high quality and well run, and you'll need staff with expertise in running them well.


For fun, I run a bare metal k8s cluster with my blog and other projects running on it. My last three nights have been fighting bugs. Bugs with volumes not attaching, nginx magically configuring itself incorrectly, and a whole bunch of other crap. This just magically started happening, but crap like this seems to happen at least once a month. It’s to the point where I spend at least one night a week babysitting the cluster.

I don’t have to pay someone else to handle this, but if I did, I would get rid of k8s in a heartbeat. I’ve seen a devops team of only a few people manage tens of thousands of traditional servers, but I doubt such a small team could handle a k8s cluster of the same size.

I’m considering moving back to traditional architecture for my blog and other projects. K8s has been fun, but there’s too much magic everywhere.


No one has ever explained the point of it to me either.

I’ve heard it’s supposed to solve the problem of programs running differently on different machines. That’s a problem I’ve never encountered in my 12 years of experience.

But the types of issues you describe are very real and very time consuming.


> No one has ever explained the point of it to me either.

It makes sense if you use docker. Docker containers need somewhere to live. If you want two copies of your service alive at all times, K8s is the thing which will listen for crashes, and restart them, etc.


Well, sure, but what’s the point of the whole ecosystem?


The ecosystem isn't really anything more than the sum of its features.

I already mentioned K8s as an automatic container runner/restarter. But if you run two copies of a service, you need a load balancer to route traffic to them. You can program your own (more work), or download & run someone else's (less work). Or you can see what K8s provides [0] and do even less work than that.

If your services talk to one another, they could talk by hard-coded IP (maintenance nightmare), or by hostname. If they talk by hostname, then they need DNS to resolve those host names. Again, you can roll DNS yourself, or you can see what K8s gives you [1].
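To make that concrete, here's a tiny sketch (TypeScript, Node 18+ with built-in fetch) of a service calling a sibling service by its cluster DNS name rather than an IP. The service name "orders", the namespace, and the port are assumptions for illustration, not anyone's actual setup.

    // "orders" is a hypothetical Service in the "default" namespace; the cluster
    // DNS (CoreDNS) resolves this name to the Service's address, so no IPs are
    // hard-coded anywhere in the caller.
    const res = await fetch("http://orders.default.svc.cluster.local:8080/healthz");
    console.log(`orders service responded with ${res.status}`);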

And on and on. Firewalls, https, permissions, password/secrets management.

There's one more thing to say about K8s, which is that it has become a bit of a de facto standard. So you don't need to relearn a completely new way of doing this stuff if you decide to switch jobs / cloud providers.

[0] https://kubernetes.io/docs/concepts/services-networking/ingr... [1] https://kubernetes.io/docs/concepts/services-networking/dns-...


K8s gives you a lot for free, until it doesn’t. I’m not saying the old way is better, but where it is better is that it’s easier to fix when shit hits the fan. A bad day on k8s will take you completely offline, while a bad day on a single server may or may not take you completely offline (depending on your backup situation and how good your devops is).


You’re not making an apples-to-apples comparison. You can run k8s on a single server or run 1,000 bare metal servers. The number of servers and how you deploy to them are orthogonal concerns, not mutually exclusive ones.

You also seem to be implying that by running a single bare metal server you have eliminated any chance of downtime, which isn’t true.

For example, if your process crashes on bare metal, you go down unless you have some kind of supervisor that watches and restarts the process. If you’re not using Kubernetes as a supervisor, then you need to set one up using some other tool (a rough sketch of the idea is below). At the end of the day you can’t eliminate all tooling/downtime.
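To illustrate what a "supervisor" boils down to, here is a toy TypeScript/Node sketch: watch the process and restart it whenever it exits. This is only an illustration, not production tooling; systemd, supervisord, or an orchestrator does this job for real, and "server.js" is just a placeholder entry point.

    import { spawn } from "node:child_process";

    // Toy supervisor: keep one process alive by restarting it whenever it exits.
    // Real setups use systemd, supervisord, or an orchestrator instead.
    function supervise(cmd: string, args: string[]): void {
      const child = spawn(cmd, args, { stdio: "inherit" });
      child.on("exit", (code) => {
        console.log(`process exited with code ${code}, restarting in 1s`);
        setTimeout(() => supervise(cmd, args), 1000);
      });
    }

    // "server.js" is a hypothetical entry point for the service being kept alive.
    supervise("node", ["server.js"]);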


I was just saying that no matter what, all your eggs are in one basket. K8s is a program that can fail like any other program. If it does fail (like etcd getting corrupted, or even the process itself crashing for some reason), you can end up with a collection of servers that can’t do anything (I’m actually in this position right now). It’s exceedingly rare that this can happen, but it’s also exceedingly rare with regular servers. The difference is cost, right?

If a single server fails, you may be offline but there are well-tread paths to come back online. Your material cost is the cost of that single server. If k8s goes down, oh boy. Not only is it very complex, requiring knowledge of how it works to diagnose and recover from, but there can be zero documentation on how to recover. You are now also paying for a cloud of bricks.


A random example from $dayjob: vendors like ESRI ship products that are actually a dozen services spread across five sets of servers, with certificates and load balancers everywhere. My customer has 7 sets of them due to acquisitions, each with dev, test, and prod instances. That’s 21 sets of a dozen servers or so. Just keeping up with OS updates and app patching is nearly a full-time job!

Or just apply their official helm chart… and you’re pretty much done. You’ll also get better efficiency because the various environments will all get bin-packed together.

Is it perfect? No, but it’s better than doing it yourself!


Consider the alternative in conditions where you need various forms of scalability in a cloud agnostic way. Especially when you have a complicated system of many services.


It makes redundancy, self-healing, scaling, rolling updates, rollbacks, and the like easy, assuming that your services are stateless.

If you do not need these features, k8s is not the thing to use, unless you have the skill set anyway.

Things get messy when state and persistence are involved; I’d prefer to have my backend DB not on k8s and link the services against it.


I think some use cases might be: running/testing software on a variety of hardware configurations, and sharing a limited pool of machines among people/projects.


K8s was never meant to be used for running a blog :) It was built to support Google-scale deployments, with probably dozens of engineers just supporting the live clusters as they stumble into various bizarre states.


Well, yeah. I just stuck my blog on there to reduce infrastructure costs for myself. The cluster runs much bigger things beside my blog :)


> I don’t have to pay someone else to handle this, but if I did, I would get rid of k8s in a heartbeat. I’ve seen a devops team of only a few people manage tens of thousands of traditional servers, but I doubt such a small team could handle a k8s cluster of the same size.

This has been my experience with a lot of the "we need to be cloud native! containers!" mantra in the enterprise. Some exec gets it in their head that this is a must-do (and probably gets non-trivial "referral agent fees"), and all of the young, hip developer types are happy to cheerlead it.

Two years later OpEx is exploding, most of the processes haven't yet been converted to be in the cloud, and the environment isn't noticeably better or different. It sucks, just sucks in a new and more expensive way that gives you less control of your data.

Seen this at 3 F500 orgs and with multiple cloud providers, including the big 3 plus one of the well-known second-tier providers.


Hey, you can try Nomad. It works nicely for small-to-medium projects.

It works well with Terraform, and extending with Nomad clients is a breeze.

I set up my personal bare-metal machines with Nomad infra and never looked back.


How do you solve logging and storage? Those were two issues that caused me to leave it behind. With k8s, there is longhorn for storage so I can move databases around and have volumes replicated to deal with disk failures. Is there anything like that for Nomad?


Do you mean this? https://developer.hashicorp.com/nomad/tutorials/stateful-wor...

Honestly, I haven't gotten into a position where this became critical, and I've always managed to stay ahead.

So far (2 yrs), so good. I am mostly a one-man show, and I found k8s a bit too much; there is always something.

Maybe I was doing something wrong or didn't know how to plan better, idk :)

Edit//typos


Nowadays when I try to scaffold quick ideas, I just start a Cloudflare Worker. You get a URL, cron, a key-value store, and an Express-like JS server going with a click of a button. I don't even have to npm i.
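For a rough idea of what that looks like, here's a minimal Worker sketch (module syntax, TypeScript). The "HITS" KV binding name and the cron schedule are assumptions that would live in wrangler.toml; this is an illustration, not anyone's actual setup.

    // "HITS" is a hypothetical KV namespace binding declared in wrangler.toml.
    // Types like KVNamespace come from @cloudflare/workers-types.
    interface Env {
      HITS: KVNamespace;
    }

    export default {
      // Every Worker gets a URL; this handler just counts visits in KV.
      async fetch(request: Request, env: Env): Promise<Response> {
        const hits = Number((await env.HITS.get("count")) ?? "0") + 1;
        await env.HITS.put("count", String(hits));
        return new Response(`hello, visit #${hits}`);
      },
      // Runs on whatever cron schedule is declared in wrangler.toml.
      async scheduled(event: ScheduledEvent, env: Env): Promise<void> {
        await env.HITS.put("lastCron", new Date().toISOString());
      },
    };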

I can definitely see the appeal of tinkering with 'advanced tech' as a personal hobby though. Because now I am pretty sure you know more about K8s than me :)


> My last three nights have been fighting bugs

> I spend at least one night a week babysitting the cluster

...

> K8s has been fun

This is why everything sucks now.


Well, for a counter-anecdote, my bare-metal K8s cluster hasn't bugged out on me for months, living through multiple version upgrades.

To each his own, I guess.


I’m quite jealous. How big is your cluster? I’ve got several hundred cpus, nodes just for storage, and multi-region services. It’s quite a beast.


I've got 80 nodes, totaling around 700 CPU cores, all in the same DC though.


Use Docker Compose or Portainer.


I can confirm that maintaining a Kubernetes cluster is a full-time job. Due to its design, there are a lot of moving parts even for the most minimal deployments.

Low key I hate touching Google-created projects. On paper technically sound but in practice a guaranteed usability disaster.


I think this might be because google engineers are rewarded for inventing new things, not so much for refining or maintaining old things.


For some people, learning to change the oil on their car could be “a full time job”, but it doesn’t change the fact that oil changes are a commodity I can go pay $50 for at a shop.

Similarly, any cloud provider worth its salt provides a managed k8s cluster as a commodity these days.


This post touches a chord. Systems like Kubernetes and Kafka are inherently complicated. My previous company got bare metal from AWS and installed a k8s cluster on it. No offense to whoever architected it; we had multi-country infra, and it made sense to take advantage of lower cloud costs using alternate providers.

We got a lot of critical infra running on them, and then slowly tech debt started accumulating. Clusters have to get updated, older DNS versions in k8s are slow, and networking strains (older Weave versions were bursting at the seams when traffic exploded as many applications were onboarded). SRE teams got overwhelmed, and constant requests for adding PVCs (Kafka & C* were on k8s) took a toll. Sanity prevailed in the end, and there was a decision to move to hosted PaaS infra. Though I no longer work there, I just reminisced about what we were going through.

Though a "cloud-independent" solution will save pennies, it will definitely drown dollars in personnel costs and the uptime/SLA

History repeats itself, because we don't learn from our mistakes (ours or others').


I came expecting a war story about running Kafka on K8s. Instead, I found an advertisement for AWS serverless.


I don't really think this was an advert for AWS. My guess would be that if they'd been on Azure or Google Cloud he just would've suggested that the client use the built-in products on those platforms.


Very funny article. We are spending 2 people full-time (out of 4) trying to build on AWS services. It is really a mess and costs a lot of human resources.


It took me 4-6 weeks to get a k8s cluster up from scratch and migrate to it. I don’t do devops for a living, but I’ve been building servers and doing devops stuff for fun for nearly 20 years. So I’m not a pro, but I know what I’m doing.

If your team is a bunch of software devs, you are doing the right thing… because k8s requires a bunch of knowledge you likely don’t have if you don’t have a Linux guru on the team. Even if you are using an expensive managed solution, things will go wrong and that knowledge is needed to prevent downtime.


This doesn’t seem consistent with my experience. When you say “4-6 weeks to get a cluster up”, I wonder if you actually mean learning Kubernetes and playing around with deploying things, as that would make sense. I was able to install k3s and deploy my first service myself in around 15 minutes.


I mean for production, with scripts/playbooks/firewalls/rbac/storage/networking/ingress/logging/backups/etc. Yeah, you can stand up a toy cluster in matter of minutes, but that’s not the same.


Having run k8s and Kafka in a previous job (I left before I got the sack) this article rings completely true.

New shop: lambda and eventbridge = life is good.


> lambda and eventbridge = life is good

Very interesting. I am struggling to articulate this mentally. Would you please expand on this comment?


I'm on a stack that also uses Lambdas and EventBridge; I think the OP means that the stresses originating from k8s and Kafka are gone when using Lambdas and EventBridge.


Indeed, that is exactly what I mean. There are some caveats though -- EventBridge doesn't give strict ordering or exactly-once delivery, so some work is needed to make consumers idempotent, etc. But, in aggregate, it's still much, much easier than managing all the infrastructure that k8s and Kafka require.
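For anyone wondering what that idempotency work looks like, here's a rough TypeScript sketch (not the parent's actual code): the Lambda records the EventBridge event id with a conditional write, so a redelivered event is recognised and skipped. The "processed-events" table name and "pk" key are assumptions for illustration.

    import {
      DynamoDBClient,
      PutItemCommand,
      ConditionalCheckFailedException,
    } from "@aws-sdk/client-dynamodb";

    const ddb = new DynamoDBClient({});

    // Lambda handler for an EventBridge rule target. "processed-events" is a
    // hypothetical DynamoDB table whose partition key is "pk".
    export const handler = async (event: { id: string; detail: unknown }) => {
      try {
        // Record the event id; the conditional write fails if we've seen it before.
        await ddb.send(
          new PutItemCommand({
            TableName: "processed-events",
            Item: { pk: { S: event.id } },
            ConditionExpression: "attribute_not_exists(pk)",
          })
        );
      } catch (err) {
        if (err instanceof ConditionalCheckFailedException) {
          return; // duplicate delivery, already processed
        }
        throw err;
      }
      // ...actual business logic on event.detail goes here...
    };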


An explanation on what was actually being so difficult about managing Kafka and K8s would have been helpful.

Why didn’t the customer use EKS?


That also wasn't really clear in the article: did they use managed k8s and managed Kafka? In that case I don't really understand the issue, as there's nothing to manage.


I am happy to see people are talking about this.

Every time I try to point out that systems are getting more and more complicated and call for simplicity, I am pushed away.

I've heard you are not taken seriously if you don't use a well-established cloud provider or similar.

Truth be told, there are not many projects you do or will work on which need these kinds of things.

We thought the cloud would help us with a lot of things, but at what cost? And by cost I mean stress, data protection, money, etc.


This is a good example of people making questionable practical decisions based on good principles. Yes, in principle it'd be good if you were agnostic of your cloud provider. But there are a few issues with it. Firstly, you're going to work so much harder doing what you want to do on AWS whilst avoiding doing the things AWS wants you to do. They're not dumb! You don't want to be locked in? They really want you locked in and they're going to work very hard to make that happen. Secondly, the likelihood of you ever actually making the decision to leave is extremely low, so you're paying all these costs for what is at best a theoretical risk. And finally, even if you do everything perfectly and never depend on anything uniquely Amazon... leaving is still going to suck! It's still going to be a huge amount of work to migrate away!


Quite the advertisement for AWS. I’d like more details on what the migration to native AWS services looked like.


Not OP but I wrote about migrating to MSK from EC2 here: https://theleo.zone/posts/migrating-to-msk/


There seem to be two camps: those who think k8s is a godsend, and those who think it's the devil incarnate. We fall into the former.

We run Rancher across a couple of bare metal clusters and it's been mostly an amazing experience (ca. 3 years). The only issues we had were with Rancher-specific bugs, but those have been resolved, and for the most part our infra is pretty autonomous. We do all HA at the application layer, so local NVMe as opposed to network storage. This means Patroni, Redis Sentinel/Cluster, etc. But it broadly just works. Maybe we're not big enough to bump into issues, but I couldn't imagine migrating to the labyrinth of vendor lock-in masquerading as cloud services.

What am I missing? Why do we have such a wildly different experience to others?


I don't particularly like Kubernetes at all, and while I like Kafka it's definitely overkill for the kind of system discussed in the article. But I gotta say, 87%? What the hell? I had >98% uptime with the first Kafka tooling I ever built, and that was 3 nodes, shared machines with the ZKs, producers/consumers split across three cities, and processing 10x the traffic they're talking about; maintained by just me on medium-range Hetzner boxes.

It feels like there's something deeper hiding here, more along the lines of "our developers really don't / can't care about how the software is operating in production."


i hate k8s!!!!!!!!!!!



