Container technologies at Coinbase: Why Kubernetes is not part of our stack (coinbase.com)
778 points by arkadiyt 32 days ago | 397 comments



> We would need to build/staff a full-time Compute team

This actually was a very real problem at my current job. The data pipeline was migrated to k8s and I was one of the engineers who worked on that migration. Unfortunately, neither I nor the other data engineer was a Kubernetes guy, so we kept running into dev-ops walls while also trying to build features and maintain the current codebase.

It was a nightmare. If you want k8s, you really do need people that know how to maintain it on a more or less full time schedule. Kubernetes is really not the magic bullet it's billed to be.

> Managed Kubernetes (EKS on AWS, GKE on Google) is very much in its infancy and doesn’t solve most of the challenges with owning/operating Kubernetes (if anything it makes them more difficult at this time)

Oh man, this hits home. EKS is an absolute shitshow and their upgrade schedule is (a) not reliable, and (b) incredibly opaque. Every time we did a k8s version bump, we'd stay up the entire night to make sure nothing broke. We've since migrated to cloud functions (on GCP; but AWS Lambda could also work) and it's just been a breeze.

I also want to add that "auto-scaling" is one of the main reasons people are attracted to Kubernetes... but in a real-life scenario, running something like 2000 pods with an internal DNS, a few Redis clusters, Elasticsearch, and yadda yadda, it's a complete pain in the butt to actually set up auto-scaling. Oh, and the implementation of Kubernetes cron jobs is complete garbage (spawning a new pod for every job is insanely wasteful).


I work on a 2-person project and decided to go with kubernetes (through digitalocean) for the cluster. I am managing everything with terraform and I don't have any big problems. I like that I can write everything as terraform manifests, have it diffed on git push and applied to prod if I want to.

Sure it had a learning curve but now I just describe my deployments and k8s does the rest, which then reflects back on digitalocean. If I need more power for my cluster, I increase the nodes through digitalocean and k8s automatically moves my containers around how it deems fit.

I used normal blue/green deployments on self-managed VMs in the past, then worked with beanstalk, heroku, appengine and I much prefer k8s. Yes it's easier on heroku, but try to run 2-3 different containers on the same dyno for dev to keep cost down. On k8s I can run my entire stack on one single small digitalocean $10 VM if I wanted to.

I wouldn't even know what else I could pick that gives me equal flexibility and power.


> I used normal blue/green deployments on self-managed VMs in the past, then worked with beanstalk, heroku, appengine and I much prefer k8s. Yes it's easier on heroku, but try to run 2-3 different containers on the same dyno for dev to keep cost down. On k8s I can run my entire stack on one single small digitalocean $10 VM if I wanted to.

So you already spent about a decade learning all the skills. The other guy is talking about coming from dev, not from ops. If you come from dev you don't necessarily know what an ingress or egress is, and you might never have done a blue/green deployment, etc. This is all stuff that needs to be learned first. I worked with many, many teams who had zero skills in data center tech before they were moved to k8s full time.

I personally like learning all that stuff. And I love that my job requires it now. But it's more like vim than like node-red, and that was a shock for many people, from engineer to EVP.


I'm also on a 2-person project on DigitalOcean k8s, also very happy.

K8s is kind of messy compared to Heroku, which I don't love, but is also way more powerful and can be more secure. I don't know what I'd use instead of it, exactly as you said.

Also, we run a VPC-only K3s node for some simple internal tools that works great as well.


For those missing Heroku: there exists Dokku [1], a small Heroku-like implementation for container management. It uses the same underlying buildpacks and you get the same comfort as with Heroku. And it's free to use. You can't deploy to multiple host machines though. But for small projects that fit on a single host, it's very nice to use.

[1] http://dokku.viewdocs.io/dokku/


Tried Dokku, but found CapRover [1] to be a much better / easier option

[1] https://caprover.com/


Looks great! Considering migrating from Dokku. I also came across Exoframe recently, which looks lower level but works with your existing Docker projects: https://github.com/exoframejs/exoframe


Better for what reason?

It seems Dokku wins at "easier" since in many cases, you can just push the application code you used for development and the required stack is automatically detected. Adding a database is two commands. No need to know Dockerfiles.
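
For reference, a minimal Dokku deploy is roughly this (a sketch, assuming the server already has Dokku plus the dokku-postgres plugin installed; `myapp` and the hostname are placeholders):

```bash
# on the server: create the app and a Postgres database, then link them
dokku apps:create myapp
dokku postgres:create myapp-db
dokku postgres:link myapp-db myapp    # exposes DATABASE_URL to the app

# on your dev machine: push the code you already have; the buildpack is auto-detected
git remote add dokku dokku@dokku.example.com:myapp
git push dokku main
```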


It's easier because in many cases you don't even need to search the docs to know what command to run to fire up a database. You just select it from a GUI list.


Looks interesting - but the installation instructions put me off a bit. Open a port on your server, and don't change the default password `captain42` - then run a cli tool from your dev machine.

I'll look more into it, but it didn't really inspire confidence.


You can choose a custom password during installation


I just don't understand why you can't change the password (or even better, choose a certificate) before you open the ports to the world.


> Also, we run a VPC-only K3s node for some simple internal tools that works great as well.

We do exactly the same thing! We have a one-node k8s for all these dev things that just works. Everything is containerized for local dev anyway so moving it to k8s was just writing the deployment manifest.

On heroku, all of these would be separate dynos (or one glued-together dyno that does everything). On a self-hosted VM we'd have to deal with managing that.

I liked this approach so much that I now have a small 1-node personal cluster that hosts all of my private hobby projects that aren't ready for prime time yet, that were on heroku previously. Costs me only $10 + (persistent storage + IP address if needed)


I feel like I’m witnessing two co-founding colleagues - who sit by each other day in and day out - discover the other’s persona on HackerNews.

Sure, maybe you two (@dvcrn and @arcticfox) don’t work together and don’t know each other, but it’s definitely more entertaining imagining the scenario above.


“If you like pina coladas...”



You really should write an article detailing how all this is set up. It sounds fascinating.


Heroku is also very cheap on the low end. EKS costs at least $72/mo before adding compute & storage. That would get you a lot of dynos.


I'm a bit dubious about the Heroku-is-less-secure-than-k8s claims.


For a 2-person project, Dokku is the smartest choice.


If you can run everything on a $10 VM... do you really need k8s?


You do if you're doing resume-driven development.


Ooooh, that one's going to sting in the morning


The point is to keep cost down while a project is in development, then scale depending on needs without having to worry about container distribution and resource utilization.

On dev I don't need a 10-node cluster if I have 10 containers running. One $10 VM is fine.

On prod I can start with a 3-node cluster for these 10 containers, then scale it up depending on traffic and needs while controlling my spending.

Not everyone has thousands of dollars of VC money to throw at hosting


Most people probably want the following:
- no-downtime deployments
- distributed jobs
- infra that is as managed as possible

without k8s some things would be hard.


no downtime deployments? happened before k8s; used to do that several times with some haproxy.

distributed jobs? same; nothing prevents from spawning runners with adhoc libraries and queues.

managed infra? not specific to k8s


The "adhoc" part is the problem. K8S is standardized and offers high-availability, failover, logging, monitoring, load balancing, networking, service discovery, deployments, stateful services, storage volumes, batch jobs, etc. And it can run self-contained on a single machine or scale out to 1000 nodes.

Why piece all of that functionality together yourself into some fragile framework instead of using an industry standard?


"Why piece all of that functionality together yourself into some fragile framework instead of using a industry standard?"

A quite recently developed "industry standard". Many of the tools mentioned have been used for decades; they work robustly, are well documented, and there are lots of people who can use them. I personally would use the term "industry standard" a little differently.


You still have to put them all together into some custom solution just for your setup which adds overhead and fragility. New employees will have to learn that instead of using K8S APIs. Deploying new components can’t take advantage of the wide and fast growing ecosystem.

There’s really nothing like the full suite that kubernetes provides.


> New employees will have to learn that instead of using K8S APIs

You know tech has made it when people offer argumentum ad Java.

'Sure, X might not be the best solution, but you can always buy someone to do it.'


It's also very empowering for people whose job isn't to run things, but to build the things that are to run. I've worked on a dozen or so bespoke "industry standard" setups, and each and every one had a number of weird quirks, involved learning some new "industry standard" components, and either made it very hard and dangerous for non-ops/devops/infra people to run their things themselves, or had homegrown tooling that pretty much replicated what the k8s API can do, just a small subset and badly.

Some YAML and kubectl are well within what a typical data scientist can be expected to understand, more so if it means they can run their things on a dev cluster themselves, and in a pinch that data scientist can debug prod issues of their things, because it all works the same way. We have a very useful bot that was built and deployed by someone decidedly not ops while waiting for jobs to finish – a simple k8s deployment YAML is like 50 LOC, with 40 being pretty much standard, et voilà, a running bot, without having to build lots and lots of automation in-house, take up ops time to deploy it for them, or grok advanced sysadmin-ing first. Used with appropriate caution and safeguards, it's super powerful.


> Quite recently developed "industry standard". Many tools mentioned have been used for tens of years, they work robustly, are well documented and there is lots of people who can use them.

k8s is based on Borg, which has existed for far longer.

https://research.google/pubs/pub43438/


"What Google does" is not an industry standard.


That's not what I'm saying. Your point about the maturity of k8s would make sense if it had sprung from nowhere, but k8s encapsulates a lot of "lessons learned" from a very stable and mature product (even if a proprietary one).


Really, it's just "inspired" by Borg. Having used both, they're really not very similar.


It can be worth it to piece things together yourself. A complex tool can also be fragile if you don't take the time to learn and understand every facet of it.

If you only need certain parts of what k8s offers, building those parts yourself can offer you more stability, control, and insight into what your application is doing.

As with anything else it varies case-by-case.


Building it yourself is also a case of "resume-driven development".

Not that there aren't good reasons or good outcomes from having done a deep dive, but just putting it there.


I think using something like k8s prematurely is more an example of "resume driven development" than the other way around.

Building it yourself doesn't mean building all of it. For example, it's quite easy to get zero downtime deployments with a tiny bit of systemd configuration and the SO_REUSEPORT socket option. That seems easier for a team to understand than "here is kubernetes and everything that comes with it"
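
A rough sketch of what that deploy step can look like, assuming the app binds its port with SO_REUSEPORT and a hypothetical templated unit `myapp@.service`:

```bash
# old and new instances can listen on the same port at once thanks to SO_REUSEPORT,
# so a deploy is: start the new version, wait until it's healthy, stop the old one
systemctl start myapp@v2
until curl -fsS http://localhost:8080/healthz; do sleep 1; done   # hypothetical health endpoint
systemctl stop myapp@v1
```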


That is, however, a very tiny slice of what might be needed.

You need to deliver the application to the machines somehow. You might need to configure the network to reach the application, etc. All quite common tasks.

(And honestly, systemd can bite you just as much if not worse than k8s definitions, because the API is much less cohesive and the defaults like to cut off your hands)


Building one yourself is a good exercise for understanding something, I agree.

The problem with that logic is everyone on your team will need to do it, so you're going to be stuck picking a standard. Should it be yours, should it be mine, or should we both just learn something with a large community behind it?

Nothing is perfect of course, but k8s makes a really good target for CI/CD, which is something you want when you're developing as part of a team. If you're not quite a team yet and you don't know how to bootstrap k8s and CI/CD, then you need to figure out when those types of things are important.

Probably lots of people could stick with a monolith and a VM for longer than they did, but automated testing will save you a fair bit of time if you're not figuring out how to do it at the same time.


k8s is nice, yes. But, given its age, it is no more standardized now than MySQL was 15 years ago.

k8s also addresses a very small portion of the market. If you have to scale, yes, you might need k8s. Chances are, you don't. Really.

I stopped counting the (supposedly big) customers that burnt themselves on k8s when they really didn't need it, and only brought organisational issues on themselves.


"some haproxy" - what is the haproxy configuration? I tried doing this myself and then realized setting up a cheap k8s cluster on DO was way easier. Now, yes, there have been occasional problems, but not since I just let DO handle the whole thing. I can do zero-downtime deploys, cronjobs (why is spawning a pod so wasteful? it spawns it, runs the job, and then kills it?), all with a single "kubectl apply" command that takes a split second to run.


If you're in AWS, you can use their ELB (Elastic Load Balancer) service instead of setting up your own haproxy. I worked on a team that used it for years without any issues orchestrating zero-downtime deployments. It was extremely easy and didn't require any real configuration.


Having done it myself in the past?

Setting up and taking care of k3s is much easier.


Yeah, we started out with Ansible as well, but some stuff is just way harder, especially no-downtime deploys. HAProxy is fragile, and so is shipping files over SSH. Docker is also a standardized package + repository format; before, we used some kind of hacky CDN solution, etc. It was stuff glued together, written by myself, and I was the only guy who understood it and ever will be.


If a $10 VPS would work but you want no-downtime deployments, Heroku would work perfectly.


I use Dokku since distributed jobs aren't a feature I'm looking for.


I use k3s to manage my Plex/NAS/home server machine. Just simply as a deployment tool, it's much easier for me than managing a collection of services deployed by Ansible/Puppet held together by custom scripts and systemd units.


Sure, you can go without it. But you'll need to write a lot of annoying scripts manually.


Could you share the resources to get started? I'm in a similar boat: a 2-person team trying to set up k8s on DO with automation (CI/CD).


Sadly I don't have many resources I can refer you to (maybe someone else can add some?), but DigitalOcean's Kubernetes guides are excellent, so I'd start with those: https://www.digitalocean.com/docs/kubernetes/how-to/

k8s has a loooot of stuff it can do and reading too many blog posts that go into too much detail can be intimidating, so my advice would be:

1. Create a dummy cluster on digitalocean (or docker desktop with k8s support / minikube), then setup kubectl to connect to it.

2. Start with the difference between resource types: What are deployments, what are pods?

3. Create a deployment manifest that just tries to pull a container from some registry, then apply it with kubectl apply -f foo.yml (see the sketch after this list).

4. Play with kubectl to inspect things: kubectl get pods, kubectl get deployments, kubectl describe pod xxxxxx, kubectl logs xxxx

5. Learn how kubernetes ties things together through labels: Create a ClusterIP/LoadBalancer service and try to get it to balance to your pods from above through labels (https://www.digitalocean.com/docs/kubernetes/how-to/add-load...)

Deployments/services/(pods) are all you need in the beginning for running containers on k8s and exposing them. Of course then there are things like persistent storage but if your app is made to run ephemeral, you likely have storage/db setup externally already.
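
As a concrete example for steps 3 and 5, a minimal Deployment plus Service looks roughly like this (a sketch; the image, names, labels and ports are placeholders):

```bash
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: hello
        image: nginxdemos/hello:latest   # placeholder image
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: hello
spec:
  type: ClusterIP            # use LoadBalancer on DigitalOcean to get an external IP
  selector:
    app: hello               # this label selector is what ties the Service to the pods
  ports:
  - port: 80
    targetPort: 80
EOF
```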

For running this through CI, once you have your manifests you could run kubectl apply directly from the CI if you wanted to. We are using Terraform in front of k8s with HashiCorp's hosted state, then run `terraform plan`. If it passes, on dev we automatically apply to the cluster; on prod there is a manual step through the HashiCorp admin UI that needs an apply trigger. Then there are more advanced tools like Spinnaker that can be used to set up more complex pipelines for what to do on push.


Gitlab CI is my weapon of choice here since it’s integrated nicely with Kubernetes. There’s a wealth of tools out there - but the work I do on managing PCGamingWiki is publicly available [1] to give you a starting point. I use Kustomize + kubectl, and when I need to rollback a deployment I can just do it from Gitlab’s environment page.

1: https://gitlab.com/pcgamingwiki/pcgamingwiki


Try k3s.io. You will have a much more pleasant experience.


My experience is the same. I really like automating tedious and error prone parts of deployments, and Kubernetes is the best tool I've found for that. It is a lot to learn about, and there are a lot of missing features that people go to great lengths to build for themselves (see "service mesh" for example), but the core is very solid.

I like loosely coupling things, and Kubernetes is the first ecosystem where that has worked well for me. (OK, it worked great when I worked at Google, but a lot of effort was put into that by thousands of people.) For example, for the first time in my life, I automatically renewed a TLS certificate for my various personal projects. When I started using Let's Encrypt, I just manually ran certbot every 3 months when I got a warning email that my cert was expiring. That is fine, but it's kind of a waste of time. There are tightly coupled solutions to this problem, but they basically require you to totally commit to their approach (Caddy is a good example of this). I use Envoy, but Kubernetes let me not care. I run cert-manager, which just runs in the background and updates my certs when they need to be updated. It's stored as a Kubernetes secret, which can be mounted into my Pod as files. When the secret changes, the filesystem is atomically updated. Envoy can notice this and start using the new certificate. cert-manager doesn't know anything about Envoy and Envoy doesn't know anything about Let's Encrypt. So I'm not locked into any particular decision -- I can change my CA, and nothing about my frontend proxy has to change. I can change my frontend proxy, and nothing about my certificate management has to change. This, to me, is a big deal. I have one less thing to worry about, and I am not locked into any other decisions.
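
For context, the cert-manager side of that is roughly this kind of resource (a sketch; the domain and issuer name are placeholders, and it assumes cert-manager and a Let's Encrypt ClusterIssuer are already installed):

```bash
kubectl apply -f - <<'EOF'
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-com
spec:
  secretName: example-com-tls   # cert-manager keeps this Secret renewed
  dnsNames:
  - example.com
  issuerRef:
    name: letsencrypt           # placeholder ClusterIssuer name
    kind: ClusterIssuer
EOF
```

The resulting Secret is what gets mounted into the proxy's Pod as files, which is how Envoy picks up renewals without knowing anything about Let's Encrypt.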

I also like the flexibility with which I can write programs to manage my infrastructure. All the primitives available to me as a programmer are high-level and well-tested, and are the same things that the CLI tools do. For example, in preparation for HTTP/3, I needed some way to get UDP traffic into my cluster. My cloud provider doesn't provide a load balancer for UDP, so instead I wrote a program that watches changes to Nodes from the Kubernetes API server, and updates a DNS record with the external IP addresses of all the healthy nodes. Then I can instruct browsers capable of HTTP/3 to use that DNS address to attempt an upgrade to HTTP/3, and it doesn't matter that my cloud provider can't do that at a lower layer in the stack. The alternative to this approach is to basically commit to having a certain IP address available, and keep that updated manually. It's fine, but again, one more thing to worry about. I can take this exact code, and it will work perfectly on any other Kubernetes provider -- so I'm not tied to DigitalOcean, and I'm not tied to any manual processes. One less thing to worry about.
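
A rough polling sketch of the idea (the real program watches the Node API instead of polling and also checks node health; the DNS update is left as a provider-specific step):

```bash
# collect the external IPs of all nodes and push them to a DNS record
while true; do
  ips=$(kubectl get nodes -o jsonpath='{.items[*].status.addresses[?(@.type=="ExternalIP")].address}')
  echo "node IPs: $ips"
  # update your DNS provider's A/AAAA records here (provider-specific API call)
  sleep 60
done
```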

I agree that a lot of people get into a situation where they have to move hundreds of apps and tens of nodes all at once, and under those circumstances, it sure is a lot of work to figure out Kubernetes compared to putting a band-aid on the problem and getting back to work. The biggest problem is that you are probably facing some sort of crisis, and have to decide, with very little experience, whether you want to use a managed offering or build it yourself. Building it yourself is quite complicated. What CNI plugin are you going to use (they all seem both wonderful and horrible on paper)? Why do you have to buy five nodes only dedicated to master tasks, like etcd? How are we going to upgrade to the next version with no downtime? You can go managed, but then you give up a lot of control. Who controls DNS at the node level (fun fact: container pulls don't go through the same DNS stack that the Pod will eventually use)? How can you use gVisor to isolate pods from the host kernel? (You can't! You will have to run it yourself.) Compromise fatigue is going to kill you here -- you have a crisis, and all the options are bad. (I've been there myself. I started using Kubernetes because our Convox Rack was so outdated that we couldn't deploy new software anymore. We tried upgrading things, but it broke things even more. So until we got k8s working, and converted every workload from a proprietary format, we couldn't deploy software. It was frustrating. But the reality is that I wanted to switch a long time ago, so the transition was quite smooth, with no prior real-world experience. And now this problem won't happen again, because tens of thousands of people know how to deal with Kubernetes.)

I also agree with this article that Amazon's managed Kubernetes offering is terrible. EKS was my first Kubernetes experience, and it was clear to me that Jeff Bezos walked into someone's office and said "we need Kuberthingie in two weeks or you're all fired." The team saved their jobs, but that's about it. It's very much the managed Kubernetes solution for people that are locked into AWS already. What people really want is not Managed Kubernetes but "namespace as a service". They just want to kubectl apply something and let a background task provision their machines. They don't want to screw around with RBAC, service meshes, managing the Linux distribution on their worker nodes, managing the master nodes, etc. That service unfortunately doesn't exist. Maybe send me an email if you want to work on something like this, though, because I certainly do ;)

In summary, I get the pain points, but I think they are worth embracing. Things aren't perfect, but you are going to have pain points at all the big breakpoints in infrastructure. Going from 0 applications to 1 application is going to be a major change for your team/company. Going from 1 application to 2 applications is also going to be a major change, but most people overcome this with sheer willpower and tedium until they hit something like 10 or 15 applications, and then are up a creek without a paddle. I recommend embracing future growth early, so that your second application is as easy to run as your first. It's not hard, it's not time consuming, it's just very different from "I'll pop in a Debian CD and rsync our app over."


> What people really want is not Managed Kubernetes but "namespace as a service". They just want to kubectl apply something and let a background task provision their machines. They don't want to screw around with RBAC, service meshes, managing the Linux distribution on their worker nodes, managing the master nodes, etc. That service unfortunately doesn't exist.

I think you just described AWS Fargate and Google Cloud Run.

https://aws.amazon.com/fargate/

https://cloud.google.com/run


AWS's version is pretty half-baked. You can't provision or use persistent volumes (so no stateful apps), and you have to use their load balancer which terminates TLS (preventing your software from being able to do ALPN, using Let's Encrypt, supporting HTTP/3, etc.).

Cloud Run just seems like standard "serverless" stuff, nothing to do with Kubernetes. (The downsides involve not being able to run applications that are designed to run on a generic Linux box; everything has to be specially developed. That is fine, and they have open-sourced all the tools necessary to move off of them so you aren't locked in, but it's a bigger paradigm shift.)


I use Cloud Run and nothing is specially developed for it, other than just being stateless. I can take my container and run it on a generic Linux box with zero changes (in fact, I run it in WSL2 on my Windows machine all the time). And I just install the normal Node.js, imagemagick, etc in my Dockerfile, no special builds or flags.


That sounds pretty compelling then.


Cloud Run uses Knative, which is a stateless extension of k8s.


I'm on a 0.05 person project, using K8S through AKS


If it runs on a single instance then a simple docker-compose file gives you all of that.
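
Something like this, as a minimal sketch (service names and images are placeholders):

```bash
cat > docker-compose.yml <<'EOF'
services:
  web:
    image: myorg/web:latest    # placeholder image
    ports:
      - "80:8080"
    restart: always            # automatic restart on failure
    depends_on:
      - redis
  redis:
    image: redis:7
    restart: always
EOF
docker compose up -d           # or `docker-compose up -d` on older installs
```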


> > We would need to build/staff a full-time Compute team

I'm not sure I get this objection. Presumably in a company like Coinbase there is already an infrastructure team that runs the AWS instances, helps build the AMIs, etc. This team could re-tool and hire some k8s experts to help them make the shift. The promise of k8s (at least one of them) is that you can do more with fewer ops resources, since the system is so programmable. The idea that you'd need a completely new full-time team doesn't add up for me; that new team should replace another team that's no longer needed, or more likely, involve a combination of hiring some experts and retraining your existing engineers.

I do take seriously the other issues raised RE: security (though one GKE cluster per security boundary is a perfectly reasonable approach and gets you further than you might think).

> Unfortunately, neither I nor the other data engineer was a Kubernetes guy

I think this is a different issue than the OP was raising; in any case, in order for a technology to succeed, you need to have subject matter experts embedded in your dev teams, or a separate function that provides the service to the teams that use it.

In the context of the OP, I think your case would be more like saying "hey data team, you need to build your data jobs into AMIs, go figure it out". Regardless of the technology chosen, it's not going to succeed if the teams doing the work don't know how to use the tools.


> that new team should replace another team that's no longer needed, or more likely, involve a combination of hiring some experts and retraining your existing engineers.

The gotcha there is that it rarely goes that way unless you have a very clear direction from senior management, at least at big corps. In most cases, it's just another thing that gets added to the pile, and it's incredibly difficult to migrate entirely out of whatever the old solution was, so now you end up supporting both.


You should always have a next-gen and a production version of your infrastructure as code.


Note that they already have a team who has built their current compute platform, who built the pipeline to run containers/processes on VMs with auto scaling groups.

It's great if their own solution works well for them at less cost, but that system didn't build itself and has non-zero maintenance costs.


> Note that they already have a team who has built their current compute platform, who built the pipeline to run containers/processes on VMs with auto scaling groups.

In that case I would expect Coinbase to write blog posts on how their setup is the absolute best solution to their problem, and not how they refrain from adopting the best solution to their problem because they claim they don't have anyone on the team that is able to pull that off.


> Presumably in a company like Coinbase there is already an infrastructure team that runs the AWS instances, helps build the AMIs, etc. This team could re-tool and hire some k8s experts to help them make the shift.

The key is that there are a lot of additional services and interface points to handle. As the Coinbase article noted, you need extra pieces on top of k8s (storage, service mesh, config/secrets, etc) that need care and feeding. Even if the company moved 100% of their services into k8s there's now more work to be done for the same level of service.

The control points that k8s exposes are not simple "drop in your provider here" bits of integration. You would likely still have the same core providers (ex: EBS for storage) but there is now more code running to orchestrate them, and more access control to implement and verify.


My personal experience (4 years on GKE in production) has been the opposite; running on k8s has abstracted away a number of things that I’d otherwise have to engineer.

Volumes just get attached (using PersistentVolumeClaims), and automatically migrate to a new node if the original node dies. Vs. having to do some sort of rsync between nodes to keep disks in sync.

Secrets get encrypted by k8s and mounted where needed. I would agree that RBAC is a bit tricky but I don’t think it’s harder than IAM provisioned with Terraform.

If you are not using a service mesh for your VMs then you don’t need one in k8s. (I don’t use one, and rolled TLS to the pod in less effort than it would take to maintain TLS to the VM). The reason you want a service mesh is to abstract TLS and retry mechanics from the application layer - i.e. make your service authors more productive. If you don’t use a service mesh then you are back to managing TLS per-service, which is where you are with VMs already.

There are definitely more services you _could_ run, but in my experience these are additive, i.e. they are extra work, but give you a productivity boost.

Anyway, YMMV and I haven’t operated a system as large as Coinbase, so I could be missing something. Interested in hearing others’ experiences though.


> As the Coinbase article noted, you need extra pieces on top of k8s (storage, service mesh, config/secrets, etc) that need care and feeding.

The problem with that assertion is that it does not make any sense at all. For instance, storage and config/secrets are already supported out-of-the-box with Kubernetes. Even so, complaining about storage with Kubernetes is like complaining about EBS or EFS or arguably S3 in AWS. And if you feel strongly about service meshes, you really aren't forced to use them.

> Even if the company moved 100% of their services into k8s there's now more work to be done for the same level of service.

There really isn't. For example, if they go with managed Kubernetes solutions then the only thing they need to worry about is to actually design their deployments, which would be very strange if they couldn't pull off. That's a ramp-up project for an intern if the solution architecture is already in place.

> You would likely still have the same core providers (ex: EBS for storage) but there is now more code running to orchestrate them

There really isn't. Kubernetes' equivalent to EBS is either setting up a volume or a persistent volume claim against a persistent volume. Just state how much storage you want and you're set.


> If you want k8s, you really do need people that know how to maintain it on a more or less full time schedule.

What is the alternative to k8s that does not need people to have any technical knowledge?

To me Kubernetes is extremely attractive because it helps me avoid learning cloud vendors' proprietary technologies. K8s is learn once, use everywhere, which is fantastic.

I am a 1-person venture doing everything from JavaScript/React to maintaining backend infra, and I couldn't have done it without k8s.


Plain old linux is the alternative, which is also "learn once, use everywhere" whether its AWS EC2 or GCP Instances or nearly any machine under the sun.

I don't see how k8s avoids the need to learn about cloud vendor specific tech. E.g., searching "aws RDS k8s" gives me some beta GitHub projects and a bunch of blog posts on how to configure it right. It doesn't sound like much less work than learning how to use RDS without k8s - read their docs, figure out the API.

Maybe I'm an "old man yelling at new tech", but meh, I just see very little value in k8s because you inevitably need to understand the layer beneath - Linux (k8s is far from a non-leaky abstraction IMO) - PLUS all the complexity of k8s itself. I do see the value when managing a big and complex infra with 100s of servers or something, but very few people have that problem.


> Plain old linux is the alternative

How do you run an application on a cluster of plain old linux machines? How do you do load balancing? How do you scale up and down? How do you update your app without downtime? How do you roll back easily if something goes wrong? How do you ensure all your servers are running the same version of dependencies? How do you update those dependencies? How do you replicate your environment if you want to add a new server to your cluster? If your app has microservices how do services discover each other? How do you mount volumes from cloud storage? How do you update configuration? How do you automatically restart failed applications? How do you monitor if your applications are working? How do you make sure the right number of MongoDB replicas are running at all times? How do you view your log files remotely? How do you port-forward from localhost to your Linux server to test your app locally?


These are commonly raised concerns, all of which have answers much simpler than "install this giant distributed system". I'll go ahead and answer them since I take the questions to be in good faith...

> How do you run an application on a cluster of plain old linux machines?

Build a package, install it in an image, run that image in an autoscaling group (or whatever equivalent your cloud of choice offers).

> How do you do load balancing?

An Elastic Load Balancer (v1 or v2), HAProxy, an F5 - this is deployment environment specific (just like in Kubernetes).

> How do you update your app without downtime?

Blue-green deployment, or phased rollout.

> How do you ensure all your servers are running the same version of dependencies?

Build them from a common image.

> How do you update those dependencies?

Update the Packer template that builds that image.

> How do you replicate your environment if you want to add a new server to your cluster?

Start the server from the same image.

> If your app has microservices how do services discover each other?

Consul, or DNS, depending on your appetite.

> How do you mount volumes from cloud storage?

It's a bit unclear exactly what you mean here, but I'll assume you mean either block devices (just attach them at machine boot, or on startup if they need a claim), or NFS.

> How do you update configuration?

Either update Consul and have it propagate configuration, or update a configuration package and push it out.

> How do you automatically restart failed applications?

Systemd restart policy.

> How do you monitor if your applications are working?

From outside - something like pingdom, and some kind of continuous testing. It's critical that this is measured from the perspective of a user.

> How do you make sure the right number of MongoDB replicas are running at all times?

Somewhat flippant answer here: the right number of MongoDB servers is zero. More generally, by limiting the size of an autoscaling group.

> How do you view your log files remotely?

Cloudwatch, Syslog, SSH (depending on requirements).

> How do you port-forward from localhost to your Linux server to test your app locally?

SSH.
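
E.g., for that last one, plain SSH port forwarding (hosts and ports below are placeholders):

```bash
ssh -L 8080:localhost:8080 deploy@app-server.example.com   # local 8080 -> the app on the server
ssh -L 5432:db.internal:5432 deploy@bastion.example.com    # reach an internal DB through a bastion
```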


Thank you. I use most of this, I've been using it for years and I just don't talk about it because it's hard to argue when people just want to force an idea that k8s is "really the best way of doing things".

Also, HAProxy is one of the most reliable pieces of software I've ever used.


> Thank you. I use most of this, I've been using it for years and I just don't talk about it because it's hard to argue when people just want to force an idea that k8s is "really the best way of doing things".

I wouldn't call it the best way; rather, a good way, because Kubernetes does encapsulate the really good bits from a scalability, development, security and reliability perspective. It's not a panacea, but if you have the team bandwidth to run a k8s cluster, it's definitely worth a look.

<3 HAProxy. It's a solid piece of software tested to the teeth over the years and is great; Here's the thing: You can run it as your preferred ingress controller too :) https://www.haproxy.com/blog/dissecting-the-haproxy-kubernet...


So if you'll indulge me -- this list is exactly why a system like Kubernetes is valuable and why I think personally that it contains a lot of essential complexity.

Kubernetes attempts to do all of the above, which is why it's so massive, and I'd argue it's actually less complex than knowing all the tools above -- but it's an equal degree less universally applicable. In this way, it's perfect for the dev who never wants to "do ops", and less so for the dev that already knows ops (or any regular sysadmin/ops person), because they already know all these answers.


Devops person here: I already know all of these answers and would still choose Kubernetes over hand-rolling all of this again and again.

"Just use Packer, some AMIs, some ASGs, some CF templates, some ELBs, some EC2 instances!"

No thanks: I'll Terraform an EKS cluster in 30 lines of HCL and deploy my applications with a Dockerfile and a handful of YAML files.


> CF Templates

Yeah, this is where it kicked in for me. Never mind the fact that all of that is AWS specific and absolutely doesn't help you if you ever move clouds. Great to know all the stuff below, but Kubernetes is a wonderful abstraction layer above that stuff, and it gets better every day.

CF could have become Kubernetes -- it was supposed to be, but it just never got the mixture right (and of course is AWS exclusive).


I find the ridiculous false dichotomy between Terraform for Kubernetes and Cloudformation for more basic infrastructure even more ironic given that I am still the eighth most prolific contributor to Terraform _over three years after leaving HashiCorp_.


Both options are questionable.


> So if you'll indulge me -- this list is exactly why a system like Kubernetes is valuable and why I think personally that it contains a lot of essential complexity.

Yes. I would agree with your statement, precisely as an answer to @jen20.

Some things, such as stateful systems, HPAs and persistent storage, were a little tricky initially but a breeze afterwards.

But I do want to mention that you really, really need a team to look after it. Without one, it's bound to become another snowflake.

[edit]: <sigh/> i meant to say stateful when i wrote stateless.


I feel like your post describes exactly what Kubernetes and container images would bring to your infra.

If you were to deploy a solution like you described, you would get something more complex than simply running Kubernetes, except worse. I suspect that you believe your solution would be simpler only because you are more comfortable with those technologies than with k8s. The more I read criticism of k8s, the more I'm persuaded that what people call "old boring technologies" is really "technologies I'm comfortable with".

On top of that, you'd need to separately document everything you do on your infra. The advantage of Docker images over AMI is that you have a file that describes how the image was built. With an AMI, you would need to hope that the guy who created the AMI documented it somewhere (or hope that he has not quit). Same goes for k8s, where configurations are committed into your repository.

At the end of the day, k8s remains a tool that you should use only when needed (and only if you have the capability of using it), but I think you shouldn't discard it simply because you are capable of producing the same result by other means. You get a lot more by using k8s, in my opinion.


> The advantage of Docker images over AMI is that you have a file that describes how the image was built.

https://packer.io - how is this even a discussion?

> the more I'm persuaded that what people call "old boring technologies" is really "technologies I'm comfortable with".

My infrastructure runs in Nomad on bare metal. I am by no means opposed to “progress”, I just don’t think Kubernetes is the be-all-and-end-all of infrastructure and would like to have a less hysterical debate about it than the parent to my original post presented.


I loved this list of succinct answers.


> Somewhat flippant answer here: the right number of MongoDB servers is zero.

AMEN to this one.

Very nice answer. It's the way we run infra.


What does this mean?


Thanks for this great list of answers. I was startled to see someone vomiting a list of unresearched questions as if they constituted a rebuttal. "Taking the bait" was the right call. Thanks again.


Would you mind pointing to some examples of how you can achieve these things on a bare-bones installation in an automated manner? I mean sharing some real examples that I can install and forget. You can deploy the simplest "hello world" webserver in any language of your choice.

>> How do you run an application on a cluster of plain old linux machines?

> Build a package, install it in an image, run that image in an autoscaling group (or whatever equivalent your cloud of choice offers).

How is that any different from running the Docker image on Kubernetes? The image build process is the same, delegating running the image to the platform is the same; the only difference at this point is the name of the platform you are running. Even if you were deploying ZIP archives to Elastic Beanstalk, if it doesn't work as expected, you'd have to debug it as an outsider, and you'd still have to know the technology. I don't see how it is any different from Kubernetes.

>> How do you update your app without downtime?

> Blue-green deployment, or phased rollout.

How exactly? There are a gazillion ways of doing them; they are rough concepts. What we need is a reliably working setup that requires as little effort from us as possible, and there are absolutely no standards on how to do them. Are you going to use Ansible? Maybe just SSH into the node and change the symlink? Maybe some other way?

>> How do you replicate your environment if you want to add a new server to your cluster?

> Start the server from the same image.

How do you do that? You'd either do it manually in the AWS console, or build some tooling to achieve it. If you were to do it via the autoscaling options the vendor provides, then it is no different from Kubernetes: if that doesn't work, you have to debug it regardless of the platform that manages the autoscaling.

>> If your app has microservices how do services discover each other?

> Consul, or DNS, depending on your appetite

What is the difference between learning how Consul handles service discovery and learning how Kubernetes handles it?

>> How do you mount volumes from cloud storage?

> It's a bit unclear exactly what you mean here, but I'll assume you mean either block devices (just attach them at machine boot, or on startup if they need a claim), or NFS.

Would you mind sharing examples that are not vendor-specific and that'd be configurable on a per-service fashion easily, hopefully without writing any code?

>> How do you update configuration?

> Either update Consul and have it propagate configuration, or update a configuration package and push it out.

How is this any better than pushing your changes to Kubernetes? I personally don't know how Consul works or how to update a configuration package and push it out somewhere; I don't even know where to push it. In this context, learning those is no better than learning how to do the same things on Kubernetes.

>> How do you automatically restart failed applications?

> Systemd restart policy.

So this means that you'd need to learn how to use systemd properly in order to be able to run your application, write the configuration for that somewhere, and also deal with propagating that configuration to all the machines.

>> How do you monitor if your applications are working?

> From outside - something like pingdom, and some kind of continuous testing. It's critical that this is measured from the perspective of a user.

The question was not really that. Tools like Pingdom won't help you if an internal dependency of your application suddenly starts failing. You need a standardized solution for gathering standard metrics from your various services - things like request rate, error rate and request durations - as well as for defining custom metrics at the application level, such as open database connections, latencies on dependencies, and so on. You will definitely need a proper metrics solution for running any serious workload, and on top of that you'll also want to be able to alert on some of these metrics. There is no standardized solution for these problems, which means you'll need to roll your own.

>> How do you view your log files remotely?

> Cloudwatch, Syslog, SSH (depending on requirements).

The proper alternative to Kubernetes' solution is CloudWatch, and even then the simplicity of `kubectl logs <pod_name>` beats trying to understand how CloudWatch works.

>> How do you port-forward from localhost to your Linux server to test your app locally?

> SSH.

This is not a trivial setup. Let's say you have a service A running remotely that is not exposed, meaning you cannot reach it from your local machine, and you'd like to be able to use it while developing your service B locally - how would you set this up in an easy way?

The points regarding the images are the same points as any Docker image, so it really boils down to the choice and one doesn't have an advantage over the other in this context.

What I am trying to say is: there are quite a lot of problems when running any kind of serious workload, and there are thousands of alternative combinations for solving them, and they were solved even before Kubernetes existed; however, there was no standardized way of doing things, and that's what Kubernetes gives people. There are definitely downsides to Kubernetes, but pointing at specific tools like these doesn't help, as they are individual pieces of software that also have a learning curve and all operate differently. I do wish there were a simpler solution, I wish Docker Swarm had succeeded as a strong alternative to Kubernetes for simpler cases as it is brilliant to work with locally, and I wish we didn't have to deal with all these problems, but it is what it is.

As of today, I can write a Golang web application, prepare a 10-line Dockerfile, write ~50 lines of YAML, and I am good to go: I can deploy this application on any Kubernetes cluster on any cloud provider and get all the stuff defined above automatically. Do I need to add a Python application alongside it? I just write another 20-line Dockerfile for that application, again ~50 lines of YAML for the Kubernetes deployment and bam, that's it. For both of these services I have automated recovery, load balancing, auto-scaling, rolling deployments, stateless deployments, and aggregated logging, without writing any code for any tooling.


Great answers.

To be fair though, that's not "Plain old linux" like zaptheimpaler suggested was somehow possible. That's linux plus AWS managed services plus software from Hashicorp. Which is a great stack to be on, but has its own complexities and tradeoffs.


I wish I knew how to do all this. Is there any best-practice guide that I can read like a book that covers all of the above?


The difficulty of all that stuff on plain old Linux is overstated and the difficulty of doing it on K8s is understated. And I agree with the grandparent's point that if you don't understand how to do that on plain old Linux, you will struggle on K8s. K8s is OK for huge enterprises with tons of engineers, but somehow K8s advocates make it seem like learning and operating nginx on Ubuntu is this huge challenge when it's usually not.


[flagged]


I think the point is more that the complexity of a cluster of VMs that manage the lifecycle of containers is often overkill for a service that would work with nginx installed on Ubuntu, and that often times the former is sold as reducing complexity and the latter as increasing it.


rumanator: the mistake is yours. Please read the parent post again, since you missed any nuance.

I agree with said parent. You can go a long way with ASGs, iptables and yum install - literally the entire first decade of your startup.

In larger companies, just the politics introduced with managing k8s by another dept. is mind-numbing.


No, the assertion that nginx on Ubuntu is equivalent to Kubernetes is mind-numbingly wrong, for starters for being entirely and completely oblivious to containers. The comparison isn't even wrong: it simply makes no sense at all.

And no, being able to run software is not equivalent to Kubernetes. It's not even in the same ballpark of the discussion. You are entirely free to go full Amish on your pet projects, but let's not pretend that managing a pet server brings any operational advantage over, say, being able to seamlessly deploy N versions of your app with out-of-the-box support for blue/green deployments, with the help of a fully versioned system that allows you to undo and resume operating with a past configuration. You don't do that by flaunting iptables and yum, do you?

If anyone needs to operate a cluster then they need tooling to manage and operate a cluster. Your iptables script isn't it.


> No, the assertion that nginx on Ubuntu is equivalent to Kubernetes is mind-numbingly wrong

Read the actual post, instead of projecting your nonsense.


DNS round-robin sends inbound to 2 haproxy servers, which proxy traffic on to the rest of the cluster. Scaling means "add another box, install your app, add haproxy entry". Service discovery is just an internal-facing haproxy front. If you must have "volumes from cloud storage", you use NFS (but if you can help it, you don't do that). Updates, restarts (including zero-downtime courtesy of draining the node out of haproxy), etc. are all quite doable with SSH and shell scripts. You run an actual monitoring system, because it's not like k8s could monitor server hardware anyways. Likewise, syslog is not exactly novel. I... don't understand why you're port forwarding? Either run on the dev cluster or your laptop.

So yes, you'd need a handful of different systems, but "k8s" is hardly a single system once you have to actually manage it, and most of the parts are simpler.
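
For the curious, the haproxy side of that isn't much config - a minimal sketch appended to an existing haproxy.cfg (IPs, ports and the health endpoint are placeholders):

```bash
cat >> /etc/haproxy/haproxy.cfg <<'EOF'
frontend fe_http
    bind *:80
    default_backend be_app

backend be_app
    balance roundrobin
    option httpchk GET /healthz          # hypothetical health endpoint
    server app1 10.0.0.11:8080 check
    server app2 10.0.0.12:8080 check     # drain a box via the stats socket before updating it
EOF
```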


Having done exactly that in the past, both by hand and with configuration management (including custom scripting that synchronized HAproxy configs etc.), I would say that I can do all of that much, much simpler in k8s.

Installing application and routing it with load balancing, TLS etc and support for draining is really simplified, something that used to require annoying amount of time (it got better as Lua scripting was extended in HAproxy, but by that time I also had k8s).

Resource allocation, including persistent storage (including NFS), is so much easier it's not funny; it makes my old setups painful to think about. Syslog is not novel, but getting all applications to log the same way was always a PITA, and at least with containers it's a bit easier (I still sometimes have to ship physical logs from containers...).

As for monitoring, that's one of the newer and more advanced topics, but it's possible to integrate more complex server monitoring with k8s - it already provides a lot of OS/application-side monitoring that makes it easier to set up observability - and now there's a quite simple way to integrate custom node states into Kubernetes, allowing an at-a-glance way to check why a node is not accepting jobs, integrated with the system so you can actually trigger from a health check that the node is in trouble and should not take jobs.


The easy answer is that you don't need to run your service distributed across 1000 VPS instances.

A handful of dedicated machines is enough. For example: stackoverflow.

None of these are difficult to answer. You can set up automated deployment/rollback however you want. You don't care about the dependencies: your code is compiled and Linux has binary compatibility. You don't need to split your app into 1000 unmanageable, interdependent microservices. You have enough disk space on your dedicated machine, or you use NFS. You set up your systemd service file to auto-restart your service. Etc, etc.

When you have k8s as a hammer, everything looks like a nail.


This is how Amazon works internally, and by choice, because unnecessary complexity is the enemy.

> How do you do load balancing?

With dedicated load balancers. They are even cheaper than servers because they are optimized for doing one thing.

> How do you scale up and down?

With pretty trivial automation that triggers deployments when load is too high.

> How do you update your app without downtime?

(app?!) A package manager pulls updated packages, deploys them and restarts the services. The LB does the flipping.

> How do you roll back easily if something goes wrong?

The package manager does that quite easily

> How do you ensure all your servers are running the same version of dependencies?
> How do you update those dependencies?

Package managers support versioning on dependencies.

> How do you replicate your environment if you want to add a new server to your cluster?

You install MyStuff version 1.2.3 and everything goes in.

> If your app has microservices how do services discover each other?

Thankfully Amazon uses services that are not micro. DNS does the trick.

> How do you update configuration?

It's in the packages.

> How do you automatically restart failed applications?

OSes have done that for decades.

> How do you monitor if your applications are working?

LBs check the endpoints, monitoring tools do the rest.

Essentially you can build all of this with the technology that existed 15 years ago. It works reliably and it takes much less learning.


Tangentially lots of those features are quite nicely handled by Erlang.

Using plain old Linux and Erlang/Elixir as a backend you can get extremely far with scaling with very little overhead and issues.

It is unfortunate that so many ignore the benefits of using Erlang as a backend technology.


This is a toxic trolling technique called "sealioning".

Thankfully, `jen20` was kind enough to take your questions in good faith and made a good demonstration of this confusion.


I call bullshit on any 1-man team needing to worry much about this stuff.

Also, see jen20's response.

If your only tool is a hammer (called k8s) then I'm sure everything really does look like a nail...


“Plain old Linux” really isn’t an alternative to K8S though. You would need a load balancer, service discovery, a distributed scheduler, a configuration management system (K8S is a very strong alternative to building things around Ansible IMO). You can do all of those things without K8S, of course, but not with “plain old Linux” (what would that be anyway? GNU coreutils and the kernel? Vanilla Debian stable?)


Maybe I’m way off, but aren’t all of those things required for k8s? Ingress controllers, etcd cluster, terraform modules, storage configuration, etc...

I guess if you pay for a hosted service a lot of the control plane is taken care of.

I've used k8s in orgs where it's a great fit and really fills a need, but it is considerably more complex than running a web service balanced across a couple of machines, and it definitely requires a lot more upfront complexity (as in, you have to solve problems before they are actually problems).


At least in Azure Kubernetes, ingress and etcd are yours; the other things are taken care of. And the Terraform you get is quite nice.


I was stating that plain Linux isn’t an alternative to K8S, not that one should always use Kubernetes ;)


> I don't see how k8s avoids the need to learn about cloud vendor specific tech e.g searching "aws RDS k8s"

Not sure what you mean here.

Kubernetes is designed to be cloud and vendor agnostic.

And RDS is a hosted database that you connect to from your application. Whether that application runs standalone, in a container, or in a VM is irrelevant.


> Kubernetes is designed to be cloud and vendor agnostic.

But it’s not. Connecting to EKS is completely different from connecting to GCP. Setting up worker nodes is completely different too, oh and load balancers.

It’s only the last mile that’s similar.


We've migrated from Heroku to EKS using Terraform to set it up. We did a small PoC for Azure, and within a day we had a small cluster and one of our apps running. The app Terraform code required only one variable change.

Sure, it's far from a drop-in replacement, but "completely different" is a huge misrepresentation in my experience. Running multi-cloud seemed quite doable.


Provisioning a new cluster on each of these platforms is tens of lines of Terraform.


I feel like what 98% of companies really need is just barebones Linux with some good documentation on how to spawn new nodes.

To use k8s you need to know Linux anyway, but to use Linux you don’t need k8s knowledge. Most things are as easy to set up with Linux, and the areas where k8s really shines are not needed most of the time.

What I really wonder is where you learned k8s and which parts you spent the most time learning. It seems huge and I would love to be in your position.


K8s is a nice API on top of GNU/Linux. Want an iptables rule? Write a yaml (network policy). Want storage for your app? Write a yaml (persistent volume). Etc. etc.
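To make that concrete, a NetworkPolicy is roughly this much yaml. A minimal sketch, assuming a namespace, labels and port that are made up purely for illustration:

    # Illustrative only: allow ingress to "backend" pods solely from "frontend" pods on port 8080.
    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: backend-allow-frontend
      namespace: demo
    spec:
      podSelector:
        matchLabels:
          app: backend
      policyTypes:
        - Ingress
      ingress:
        - from:
            - podSelector:
                matchLabels:
                  app: frontend
          ports:
            - protocol: TCP
              port: 8080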

For people who already know Linux, kubernetes comes naturally because it is pretty obvious.

But indeed, by experience, many companies can go to "unicorn scale" with two or three boxes.


Couldn't you get nearly the same behavior using some basic Ansible playbooks? My impression was that the killer feature of k8s was scaling, automatic failover, etc., although to be fair it's been several years since I last looked into it.


Think of kubernetes as a cluster operating system. Instead of dealing with vms, you deal with your applications directly, without worrying about where they're running. It gives you a unified view and ability to manage a distributed system at the level of the application components.

Ansible can't give you anything like that. Even if you use Ansible to automate something like network setup, the commands and modules you use will be different between e.g. cloud providers. Kubernetes give you a consistent abstraction layer that's almost unchanged across providers, with the exception of annotations that are needed in some cases for integration with certain provider services.


For me k8s kinda extends what you can do with ansible.

For example, if you use just the kubelet (the main daemon) without the API and drop some static yaml files (à la unit files) under /etc/kubernetes, it behaves something like systemd does. No big deal so far, but adding the Kubernetes API (just another daemon) lets you run all of those things anywhere.
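A sketch of such a static pod manifest, with the image and names purely illustrative, dropped into the kubelet's static manifest directory:

    # Illustrative static pod: the kubelet alone keeps this container running,
    # much like a systemd unit would, with no API server involved.
    apiVersion: v1
    kind: Pod
    metadata:
      name: my-web
    spec:
      containers:
        - name: web
          image: nginx:1.19   # hypothetical image/tag
          ports:
            - containerPort: 80
      restartPolicy: Always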

I mean, if you terminate one computer, k8s will figure out that the program should still be running and start it on another computer (including detaching/attaching the block device between machines, that kind of thing).

There is not much secret to it.

For me it really feels like a (well designed/integrated) API on top of standard Linux technology (ipvs, netfilter, mount, process mgmt, virtual IPs, etc.) via yaml, running your stuff on 1..N computers while you manage it as one big computer :)

EDIT: added more context within "()"


I go in and out of development and ops and devops. 20 years.

To me as a developer heroku is the gold standard.

Docker-compose makes sense.

Kubernetes and Helm are a giant, soul-crushing WTF when it comes to getting anything done.

Now I’ll eventually figure it out. Angular + RxJS + TypeScript was a similar experience after I took a break from front end for a few years. A few months of pain, then it starts clicking.

But this just seems insane for a basic web app. So many different tools are needed to get it going. Tutorials all have many steps that don’t apply, or have other odd pieces swapped in.


Many tutorials are unfortunately bad, they actually got worse over time.

Oooold presentations (think 2016 and older) tended to talk more about basic building blocks and how they interacted, especially the design involving resources and their controllers working in a loop of "check requested state, check real state, do changes to implement requested state", and how those loops went up from single pods, through ReplicationControllers (now ReplicaSets), then Deployments, etc.


This is the exact problem I ran into on rxjs. Every tutorial was badly out of date. Even ones written six months earlier.

You had to become a master at figuring out the direction the community was going and staying on the bleeding edge.

Some people find this fun. I have enjoyed it many times myself. But when I just want to get a boring feature built and handed off to the junior developers so I can spend some time with my kids, it’s frustrating.


The most annoying thing, to me, is that Kubernetes doesn't even move fast enough!

The old presentations? Ones that remember K8s 1.0? There's one major change (move from Replication Controller to Deployment+ReplicaSet) that doesn't invalidate ~90% of the material, because the core stuff is about how controllers work!

Yet it seems more and more common to me that people don't learn the core mechanism of Kubernetes unless by chance they got there writing CRDs :|


> Couldn't you get nearly the same behavior using some basic Ansible playbooks?

No. For starters, Ansible playbooks don't allow you to dynamically deploy and redeploy a set of containers onto whichever node happens at a given moment to have the most available computational resources (such as bandwidth), nor do they respawn a container that just happened to crash.

Arguably Ansible only implements a single feature of Kubernetes: deploying an application. Even so, while with Ansible you need to know and check all the details about the underlying platform, with Kubernetes you essentially only need to say "please run x instances of these containers".


Ansible gets you like 80% of the value of Kubernetes, yes. But you have to allocate applications to hosts manually, your configuration is less declarative, and what do you gain?


What about service discovery? What if you need to power your node down for maintenance? What if your app goes crazy and consumes so much memory it OOMs some other services? Reliable, multi-service deployments ARE HARD and always will be, regardless of what tech you use to achieve them. Of course you can use Ansible and do it the old-school way, wasting lots of underutilised VMs and custom idempotent scripts. But K8S solves many of these challenges in a standardised, data-driven way. It has a steep learning curve, but once it's learned, all resources/limits are properly set, logging is in place, etc., it works nicely.
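For what it's worth, the resources/limits part is just a small stanza per container (numbers made up for illustration), and it is exactly what stops one runaway service from OOMing its neighbours:

    # Illustrative container spec fragment: requests are what the scheduler reserves,
    # limits are the hard cap enforced at runtime (memory cgroup limit / CPU throttling).
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi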


Firstly, tell me more about your 1-person full-stack venture; but secondly, how come you, with barely any time to sit down, can use k8s happily while it falls over for others? I am struggling to see the truth among the comments here :-(


TBF with a completely greenfield project and managed K8s (GKE or EKS) you can absolutely get a pretty well set up infra very quickly if you are willing to learn how to do so.

I often get the feeling a lot of the negativity comes (rightfully so) from trying to replicate a currently existing project on Kubernetes. This is true of almost any paradigm - try replicating a Java EE monolith in Erlang and you are going to have a lot of problems. The big thing to note is that starting a project on Erlang very well might solve the problems that a Java EE project ran into, but that is because they were able to solve them at the ground floor; just popping a Java EE project with all of its architecture into an Erlang project will probably end up in a worse spot.

I think that this is what often happens with k8s as well - if you or your company have a currently working implementation that isn't on k8s, of course you won't be able to just easily plop it into a k8s cluster and have everything be well and good, but I think the problem is that people are equating that issue with k8s itself, which is a completely different paradigm.


> managed K8s (GKE or EKS) you can absolutely get a pretty well set up infra very quickly

And then tear your hair out when something doesn't work for some reason and root causing it requires learning a stupid number of layers. k8s is easy until it goes wrong.


Isn't this the same for every software? How is debugging issues with Linux, NGINX, any complex framework any easier?


Well, for one thing, all logs are in /var/log.

If it doesn’t fail just right in kubernetes there might be no logs at all.

I’m thinking in particular about trying to mount filesystems into pods.


Debugging k8s issues is like debugging a vast distributed monolith with only vague guesses about where the problem occurred.


Not the GP, but I honestly couldn't tell you. A lot probably comes down to tooling, the applications you are deploying, security requirements, etc., as well as how familiar you are with k8s itself.

I migrated PCGamingWiki from running on some Hetzner boxes to DigitalOcean Kubernetes in a few days of work creating Dockerfiles and k8s manifests. I run a Kubernetes cluster at work fairly hands-free that hosts applications critical to our billing operations, and developers on the team deploy new applications with little or no support. Any of the issues I've hit are an artifact of migrating legacy applications not designed to run in more-or-less stateless environments, which is why the PCGW Community site still runs on its own server (Invision Community sucks).

I really don't see all the issues people have that aren't due to a mismatch of application design vs target environment (and no, it's not monolith vs microservice - monoliths run just fine on k8s; but you should be designing your application with a 12-factor environment in mind) or a misguided notion that you will be drowning in YAML hell (it's real, but you can manage it - and it's directly related to the complexity of the services you are deploying).


I can totally agree with this. Most of my customers that I see struggling with K8s are those that haven't internalized the 12-factor principles: not just heard, read or understood them, but really internalized them. It is unfortunate that K8s talks / blog posts / articles do not focus enough on this prerequisite.


Seems like you are a thoughtful engineer who can also make sure you don't make any fundamental design flaws while building these systems well. Whenever I have seen Kubernetes fail, it's often because the engineer(s) who built it were not thoughtful at all and often didn't fully understand what they were doing. Perhaps k8s's failing is that it makes people who don't know enough think that they do.


I mean, I’m far from perfect - the trick has always been to KISS. I don’t use Istio; it’s absolute overkill for my needs. I use nginx-ingress because it fits the bill, and I know nginx, as do enough other people that they could exec into a pod to debug it. I don’t run stateful applications that aren’t prepared to have servers randomly vanish, because it takes a LOT of work to get those running in-cluster. I don’t use public Helm charts because they often suck, and making your own container is something you can do quickly if you were able to deploy the software on a traditional server. Every choice I make is made with day 2 operations in mind - not what is hot, not what gets initially deployed fastest, but what makes it so that I can touch the thing as frequently as possible.

PCGW is a great example - installing a new Mediawiki extension, changing a config file, upgrading to a newer MW release is just updating a file or a git submodule and committing. I don’t get paid thousands a month to manage the site for the owner, so I make my time spent as efficient as can be done.


I think this is a big failing of the DevOps movement as a whole (at least what DevOps became in practice — devs doing ops) which results in things like passwordless mongodb exposed to the internet...


> I think this is a big failing of the DevOps movement as a whole (at least what DevOps became in practice — devs doing ops) which results in things like passwordless mongodb exposed to the internet...

Hardly. If anything at all, it tells you about the _team_ and/or the culture of the organisation. In any DevOps/SRE/Opsec culture worth its salt, an immediate blameless postmortem analysis would be performed to help with premortem analysis in the future.

DevOps is not about exposing unsecured endpoints. You've got it all wrong son.


I'm not your son, nor am I talking about what DevOps "is about", but about what it became in practice, which you would have understood had you not rushed to reply in the most condescending tone you managed to invoke.


> I'm not your son, nor am I talking about what DevOps "is about", but about what it became in practice, which you would have understood had you not rushed to reply in the most condescending tone you managed to invoke.

Fair enough. I am letting you know, though, where you got mixed up - that is not because of _what DevOps has become in practice_. That is precisely because of failings and shortcomings in the team culture and/or the organisation that practices DevOps.


This is pretty typical when a product is still climbing the adoption curve. I've helped small companies set up (managed) k8s clusters and migrate their apps to them, and when you know what you're doing, it's a super smooth experience that's basically all upside.

But, if you're approaching it for the first time with no assistance, there are lots of things that can trip you up, and lots to learn. That's not a reflection on k8s, it's just the nature of the large set of problems it's solving.

K8s is succeeding because it's very well designed, has a large and diverse ecosystem, and solves a set of important problems that very few other tools even try to tackle. Apache Mesos perhaps comes closest, but it's not quite as pragmatic, and its adoption level reflects that.

Also, because of k8s' scope, many people may not fully appreciate the range of problems it's solving, seeing it through the lens of their own background and focus.


People who don't like - or don't "get" - declarative systems tend to spend an inordinate amount of time and effort fighting them. I've seen the same thing with a declarative build system (Maven), or with adopting an ORM - if you're willing to work with the tool then it will save you a lot of effort, but if you're determined to do things your own way then you can make it almost arbitrarily difficult.


> if you're determined to do things your own way then you can make it almost arbitrarily difficult.

This is true of software development in general, if not life itself!

But you're right that declarative systems amplify this issue.


Might also be the amount of experience a person has.

I have done software development for ~15 years, then switched to infrastructure and built a GKE cluster. It was awesome: very logical, nice and easy to use.

Now I read stories like Coinbase's and don't get it.


> What is the alternative to k8s that does not need people to have any technical knowledge?

Paying for a managed provider. Heroku, Elastic Beanstalk, GAE, GCP, Fargate, etc. You push some buttons, they manage your cluster/services.

People still think they can get a free lunch by downloading some free software. If that were so, Windows and Mac would be dead and Linux would be the only desktop OS. But good news: I hear 2030 will be the year of the Linux Desktop!


Check out Dokku[0]

[0]: https://github.com/dokku/dokku


Couldn't have done it without k8s?

Ahahahahaha, kids say the darndest things these days...


I'm in the same boat opting for docker-compose instead. docker-compose is much simpler to manage. Obviously it doesn't have feature parity with k8s but docker-compose does the basics well.

An inexpensive VPS runs compose well, with more resources at a lower cost than managed k8s.
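For a single VPS, the whole "cluster" can be a compose file along these lines (the image name is hypothetical):

    # Illustrative docker-compose.yml: one app container plus Redis,
    # both restarted automatically if they crash.
    version: "3.8"
    services:
      app:
        image: registry.example.com/app:latest   # hypothetical image
        ports:
          - "80:8080"
        environment:
          REDIS_URL: redis://cache:6379
        restart: always
      cache:
        image: redis:6
        restart: always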


I don't understand this comment at all.

Kubernetes is designed for a cluster of nodes not a single one.

Of course a single VPS is lower cost and easier to use, but what happens if you want a second or third? Or any form of redundancy?


You just run scp with the compose file and run docker compose down; docker compose up

Bonus points if you just mount the compose file on an NFS share.

I've wasted enough man months on kubernetes that unless you tell me to manage 1000 nodes this approach will never cost me more time than the time I spent using and learning kubernetes.


A lot of people use it as a deployment standard and don't care about multi-node.


I presume they are using docker swarm in this instance.


Agreed. It has good support for compose currently, and Swarm is usually overkill. With slightly more elegant HA functionality, docker-compose could be the go-to for many more people. The comment below claiming he doesn't understand your comment reminds me of all of those who will say "well, it's not FOR production" - which seems more like superstition than science.


It's a pity that docker swarm did not make it. It wasn't perfect but it was a lot simpler to setup and manage than kubernetes.

If you can get away with it, vanilla Docker hosts running docker-compose provide most of the same benefits at a fraction of the cost. For most startups, that's a great way to avoid getting sucked into a black hole of non-value-adding devops activity. You lose some flexibility, but vanilla Ubuntu hosts with vanilla Docker installs are easy to set up and manage. We used Packer and Ansible to build AMIs for this a few years ago, with some very minimal scripts for container deploys.


> It's a pity that docker swarm did not make it.

Sorry I do not understand that statement, in my naive opinion Docker Swarm seems to be a thing. Care to elaborate, please?


It is, but at this point it is unclear for how long Docker Swarm will be supported, see e.g. https://boxboat.com/2019/12/10/migrate-docker-swarm-to-kuber...

We are actually currently in the process of migrating from Docker Swarm to k8s and I am not 100% sure that's a good idea. We will see.


"conversations have led us to the conclusion that our customers want continued support of Swarm without an implied end date."

https://www.mirantis.com/blog/mirantis-will-continue-to-supp...


No matter what they claim, it's really not supported in the sense most commercial oss projects are. We finally switched off after a minor version introduced a segfault when adding nodes in certain conditions, and the issue was unfixed after 5 months.


This. Docker Swarm, and by extension Docker EE / UCP, is barely in maintenance mode. Go compare the Moby and docker projects on GitHub vs kubernetes.

To be clear, I use Docker Swarm in my home cluster due to simplicity and ease of use. Unfortunately that pattern hasn't scaled to the Enterprise.


What about HashiCorp's nomad? Seems a lot simpler to manage than k8s and is actively developed.


It exists, but in terms of people using it or it being actively developed, it's dead as a doornail - ever since Docker was more or less forced to also support Kubernetes and basically gave in to the reality that world + dog was opting for Kubernetes instead of Swarm.

They never really retired it but at this point it's a footnote in Docker releases.

I've not actually encountered it in the wild in four years or so and never in a production setup.


Docker Swarm doesn't support multiple users and there is no remote-accessible API. Nomad doesn't implement network policy (Nomad Connect sidecars may be an option but sidecars bring new problems). Just learn K8S, Helm, Terraform & Terragrunt properly. Use proper tools (k9s, Loki, wrapper scripts around kubectl). Stop finding excuses for not using K8S. Stop putting proxies everywhere (that Istio/servicemesh bull*hit) and use Cilium CNI instead.


I'm also a fan of swarm, and still use it in production.

It's just so damn simple in comparison to k8s - basically, if you know Docker Compose, you know Docker Swarm.

I appreciate it doesn't have the full power of k8s, but it has what most apps need: simple deployments, zero-downtime updates, distributed configs and secrets.
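And it all lives in the compose format you already know. A rough sketch, with illustrative image and names, deployable via docker stack deploy:

    # Illustrative compose v3 stack: replicas, rolling updates and a secret,
    # which covers most of what's listed above.
    version: "3.8"
    services:
      web:
        image: registry.example.com/web:1.2.3   # hypothetical image
        ports:
          - "80:8080"
        deploy:
          replicas: 3
          update_config:
            parallelism: 1
            order: start-first        # start the new task before stopping the old one
        secrets:
          - db_password
    secrets:
      db_password:
        external: true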


You can't make the complexity disappear. Kubernetes just offers a very standardized, stable and relatively polished way to handle it.

The other alternatives, like running your own orchestration system, are just as complex if not more so, although you might be more familiar with it since you built it.

That being said, many companies don't need any of it in the first place for their scale, and that's probably the biggest issue with K8S today.


> That being said, many companies don't need any of it in the first place for their scale, and that's probably the biggest issue with K8S today.

Yes. I helped push out a k8s platform where I work, and it's been running well in production for ~3 years.

We most definitely did not start with "we want to deploy k8s". We started with "we're being asked to meet certain business requirements", which led to "we will need to change how we do some of our development and deployment", which led to "our platform will require these characteristics."

The easiest way to get those turned out to be buying a packaged k8s from a vendor (OpenShift, not that it matters; there's plenty of options).

Most of the pain with k8s seems to come down to people wanting to polish their CVs (by "having done k8s"), or people who sneer at packaged/hosted solutions because they want to build a cottage industry of building k8s services from scratch because it's their idea of a good time.

Unsurprisingly both take orgs down the route of pain, money, and regrets. Oftentimes the people driving this decision then prefer to say the problems is k8s, or containers. Or microservices or whatever fad it was they were chasing without understanding whether it would be a good idea.


It's impossible to make essential complexity disappear but it is certainly possible to reduce incidental complexity. Most software is much more complex than it needs to be and Kubernetes is no exception.


Before k8s every serious shop automated the crap out of their infra. Jump/kickstart recipes, rolling cluster patching that split RAID mirrors before applying, blue/green deployment scripts to tickle the loadbalancer, cron jobs to purge old releases ...

That stack is super complex and utterly bespoke to the company.

With k8s it’s standardised and usually better quality.


It's on a path to being standardised, but not there yet: etcd vs. others, different ingress controllers, providers replacing most of the network parts, storage that is bumpy/not so standard, and deploys that may be kubectl apply/helm/operator.

I would really appreciate a more mature ecosystem.


I'd say it the other way; k8s has to do all the things that another management system would have to do, but sufficiently tied together that you can't ease into it as needed.


I guess that depends primarily on whether you're installing and operating K8S yourself. Use something like GKE and it's a very seamless experience, with AKS getting pretty good and the rest being rather crummy.

Once you have a managed cluster, deploying apps is fairly easy. A single container/pod is a 1-liner and you can work your way up from there.


That's fair; all of this is heavily influenced by your operating environment. If you can run GKE or if you have an ops team to deal with that stuff, then yeah k8s is great. Unfortunately, I'm part of the ops team, and our company is too small to have a dedicated k8s team and too low-budget to (likely) do well with a managed service (we get absurd value per money out of bare metal servers, which is very much a tradeoff that is sometimes painful). So to me, k8s looks like a very iffy tradeoff. Bigger company, bigger budget, different constraints? Yeah, k8s would be great.


> we get absurd value per money out of bare metal servers, which is very much a tradeoff that is sometimes painful

What do you find most painful about your use of bare-metal servers? The thing that I like most about a hyperscale cloud provider is the level of redundancy, including even multiple data centers per region, and their built-in health checks and recovery (e.g. through auto-scaling groups) based on that redundancy. With bare-metal servers, I'd have to cobble together my own failover system for the occasional time when one of those servers goes down or becomes unreachable due to a network issue. And of course, I'd probably find out that my home-made failover system doesn't actually work at the worst possible time.


I don't know that it's a single big thing; it is indeed many small things that we have to manage ourselves. We backup the databases with our own scripts, we failover manually (I miss RDS), we use an overlay network because no VPC, and deployments involve ansible running docker-compose. There's basically no elasticity; we provision massive bare metal servers with fixed memory and disk installed. But, it is dirt cheap, so we manage, and all the pieces are small and easy enough to use.


k8s and even Docker are trying to solve a problem not many people will ever face. However, being the sexy new things, loads of people get sucked into integrating them into their stack from the word go.

Being reluctant to adopt new technology unless you really, really need it might be a more sensible thing to do.


> This actually was a very real problem at my current job. The data pipeline was migrated to k8s and I was one of the engineers that worked to do that. Unfortunately, neither myself (nor the other data engineer) was a Kubernetes guy, so we kept running into dev-ops walls while also trying to build features and maintain the current codebase.

I've experienced this as well. At the last large company I worked at we had a Heroku-like system to run our apps on. That was deprecated for a Docker-based solution. And then _that_ has now been deprecated for a Kubernetes offering. We just ran some Python web apps–we didn't want to have to learn and support an entire system. And here's the thing, most big tech companies I've worked with are all made up of these small, "internal" services, that just want a simple place to run their services.


I've used k8s on an early-phase project I later moved off from, so I don't have much experience with it, but I got the impression that simple scenarios worked as documented. You already had a Docker-based stack, so it sounds like hosted k8s shouldn't be that far off from what you were running.

What issues did you encounter? I had to spend a week working through it to get familiar with everything, but I wasn't experienced with Docker previously (I've played with it a few times but never had to set up stuff like a custom registry, versioning, etc.)

I did end up in situations where the cluster was FUBAR but I eventually figured out that everyone just recommends rebuilding the cluster over diagnosing random stuff that went wrong during development.


Been there. Simple python django app. Max 20 APIs. Expecting 100 requests per month. But POs and directors wanted shiny DevOps tools. Problems with typical MNCs. Every quarter manager/director comes with some new hype.


If it's a hosted k8s, all you'd do is containerize your application, then create a deployment that pulls that container and exposes it through a load balancer. The container is the same whether you'd push it to Heroku's container registry or Beanstalk.
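As a rough sketch (image, names and ports are placeholders), that's about this much yaml:

    # Illustrative Deployment plus a LoadBalancer Service in front of it.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp
        spec:
          containers:
            - name: myapp
              image: registry.example.com/myapp:1.0.0   # hypothetical image
              ports:
                - containerPort: 8080
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: myapp
    spec:
      type: LoadBalancer          # the cloud provider provisions the external LB
      selector:
        app: myapp
      ports:
        - port: 80
          targetPort: 8080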


With Heroku, I configure external items with some environment variables. Often there is a simple command to do it for me.

With Helm, I’m staring at 20 pages of YAML. Most of it seems arbitrary, with a crap ton of interconnections. Plus defining all the IAM roles.

It doesn’t help that the devops team keeps writing custom scripts to interact with it.


On k8s, you write those env variables into the deployment manifest that describes the container. If they are sensitive, pack them into a k8s secret and add that to the container instead.

You can use external services the same way if they are already set up - e.g. with something like DATABASE_URL holding a Postgres connection string to Amazon RDS. Running k8s doesn't mean you have to move your entire stack into k8s; you can still use hosted services just fine.
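As a sketch, the container section of the manifest would carry something like this (the secret and key names are made up):

    # Illustrative container env: plain values inline, sensitive ones pulled from a Secret.
    env:
      - name: LOG_LEVEL
        value: "info"
      - name: DATABASE_URL              # e.g. a connection string for a hosted RDS instance
        valueFrom:
          secretKeyRef:
            name: myapp-secrets         # hypothetical Secret created separately
            key: database-url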

In all my time with k8s, I never used Helm. If you don't need it, don't use it and keep it simple. k8s can do a ton of stuff but in reality, you barely need more than the basic.


Isn't that the appeal of Google's Cloud Run? A sort of go-between between cloud functions and a Docker image.


> If you want k8s, you really do need people that know how to maintain it on a more or less full time schedule.

I think this is roughly similar to saying "if you want linux, you need people that know how to maintain it." Which is to say, you can create an architecture where this is absolutely true, but it doesn't need to be true.

The big issue with K8S right now isn't K8S; it's that there aren't (big, well established) solutions like Heroku or Zeit for K8S, where you don't need to worry about "the cluster", just like those solutions don't make you worry about Linux. K8S really is two parts: the API and the Cluster. The API is the more valuable of the two.

And, you know, maybe it won't ever get there. Heroku and Zeit solve strikingly similar problems to K8S. Maybe K8S just is a platform like that, but for enterprises who want to home-grow, and maybe most companies shouldn't worry about it. But I think the platform, and thus the community, simply needs more time to figure out where it makes sense.

Most companies shouldn't touch K8S. You'll probably regret it. But, to your second point: AWS literally has nothing beyond EKS/ECS + Fargate which approaches a Heroku-like service. Beanstalk is supposed to be that, but it's really just a layer on top of EC2 which doesn't approach the "ultra low maintenance" of Heroku, Zeit, or App Engine. So if you're on AWS, and you want to use their other excellent managed services, you either go outside AWS, or you go EKS, or you end up trying to in-house something even worse.


> I was one of the engineers that worked to do that. Unfortunately, neither myself (nor the other data engineer) was a Kubernetes guy, so we kept running into dev-ops walls

This seems like an organizational problem to me. "DevOps walls" sounds like there is a "DevOps team" (a famous DevOps anti-pattern) and there are knowledge silos between development and ops, which ironically is the exact opposite of what DevOps is about. What this also means is that developers need to be aware of the environment in which their services run and should be very familiar with how k8s works, why not take that as an opportunity to learn?


I have said this multiple times in the past (https://news.ycombinator.com/item?id=23361176 https://news.ycombinator.com/item?id=23243626) and will say it again - You (business logic developers) are not meant to use K8s directly, you are supposed to use a PaaS built on top of K8s (like Cloud Functions, Lambda etc)


But this makes no sense. Why would a cloud operator even bother with k8s if the customer is only interacting with functions? It’s much more efficient to bypass Kubernetes and run directly on the cloud’s native system (like borg).

I think you’re right though - Kubernetes is a massive red herring, we should ideally be running containers/functions on as close to bare metal as possible. Fundamentally, VMs are the wrong abstraction if all your code is containerized.

Joyent’s Triton is the closest thing we have to this... I really don’t know why AWS/Azure/GCP haven’t cottoned onto this; it would massively reduce their COGS and improve our developer experience.


> Why would a cloud operator even bother with k8s if the customer is only interacting with functions?

Because there are various "tiers" of users, some companies (like coinbase) could actually leverage K8s in their Codeflow/Odin project and prevent a lock-in. But a regular developer looking to just "get things shipped" isn't meant to waste his/her time with pure K8s.

> Kubernetes is a massive red herring

We agree, but on a different note. The biggest selling point of K8s is its API design. The entire industry needs to converge on one de facto standard of packaging and deployment. Google's Cloud Functions is a perfect example of this. The API is based on K8s and Knative, but under the hood it actually runs directly on GCE rather than GKE. What happens underneath is hidden from the business developer; you only care about the data in your yaml and your Docker image.


>> I really don’t know why AWS/Azure/GCP haven’t cottened onto this

Conflict of interest. If k8s yields the most revenue, why would they try to decrease that? If some customers are so delusional that they go for an inefficient abstraction, so be it. Btw, this is my experience with k8s too: people use it because it is a trend. Not a single company / developer could justify using it to me over leaner resources like EC2, ASG, cloud-native resources, etc.


How about companies who don't want to rely on cloud and want on-prem as much as possible? Or those who don't want to be tied to a single vendor? Apple is already getting rid of their Mesos based PaaS and moving to Kubernetes.


>> How about companies who don't want to rely on cloud and want on-prem as much as possible?

I don't know. Are you saying that staying on-prem means you have to use k8s? Google is not a single vendor?

>> Apple is already getting rid of their Mesos based PaaS and moving to Kubernetes.

And? Should every other company follow Apple?


The 'needing a team' aspect of Kubernetes sounds remarkably similar to conversations I had like 8 years ago when Openstack was the new hotness.

We went with ECS and have been happy with it. It plays well with all of AWS's other products and features. For the few things we have to run On-Prem we use Docker Swarm in single node mode and it works well (albeit missing a few features like crons from Kubernetes).


AWS Kubernetes (EKS) with Fargate is just as simple as ECS.

Far simpler if you can include the fact that with Kubernetes you can install new applications as simply as "helm install prometheus".


Are they still charging something ridiculous for control planes? When I looked at it they were like $200/month.

I wasn't aware they were offering fargate on EKS now. They weren't when I looked at it last.



Yep. I remember the OpenStack consultants swarming the office trying to sell it as a solution and not a giant overhead / problem. Luckily they failed the POC, so we did not need to waste our lives on a non-issue trying to solve it with a non-solution. Now it is k8s' turn to do the same. We will see how far this buzzword train is gonna go.


TBF I like k8s as a service or platform. Once metallb came out to solve the "bare metal load balancer" problem it could deliver the entire product I was looking for.

My big issue with it is the underlying complexity and house of cards nature of running your control plane as sidecar containers on your runtime infrastructure.

You CAN set it up and run the control plane out of band, but last I looked there wasn't step by step documentation for doing so. I also couldn't find anyone doing it in prod which to me is a nonstarter. If I can't figure it out myself and can't hire for it I'm not doing it.

I'm sure it's fine on google cloud, but ECS solves 90% of our problems AND integrates with everything else we're using already.


I agree with most of this, but was surprised by your comment:

> Oh, also, the implementation of Kubernetes cron jobs is also complete garbage (spawning a new pod every job is insanely wasteful).

How often/how many cron jobs are you running that spawning a new pod per job is a problem ?


The first iteration (I actually wasn't around for that) was trying to run a cron for every "data ingestion job" -- at some points, we were doing about 50k+ API requests daily (FB/Instagram/Twitter/etc.) and that was absolutely not tenable using k8s cronjobs.


Why use cronjobs at all for this? This is a classic work queue problem.


I wasn't there for this decision, but I assume cronjobs were being treated as "cloud functions" -- and to be fair, the k8s documentation kind of makes it seem that you could technically do that, but fails miserably if you try to do so in practice.


50k/day is less than 1 qps. This is nothing. This is either not the full story or your cluster was setup completely wrong


Depends how spiky the distribution was. 20k a second at 2pm? Gonna have a bad time.


Counterpoint:

Run 11 person startup. Use hosted GKE. We spend less than 1 man-hour per week dealing with K8s or anything like that. K8S is a big reason we are able to out-execute our competition.


I agree with you completely, we use Kops and see a similar workload. The real boon for us is not in production, where the HA/error-tolerant/easy horizontal scaling certainly helps, but in development, where we can easily bring up ephemeral feature branch deploys as part of the CI/CD pipeline (which itself runs on k8s using Gitlab CI)


Yah.

I mean, k8s has 1000s of features. I am not using every one of them. I carved out the subset I need, and it is FANTASTIC. I get

* Great rolling deploys (sketched below)

* Self healing clusters

* Resource sharing / bin packing (rather than have 20 half used servers, I can have 6 much more heavily utilized servers)

I could maybe frankenstein these features on top of something else.. but frankly it seems absurd. If these are the only 3 features I use to run docker pods, k8s is a HUGE win to me.
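For reference, the rolling-deploy part from the list above is literally a few lines on the Deployment spec (numbers illustrative):

    # Illustrative rolling update strategy: replace pods one at a time,
    # never dropping below the desired replica count.
    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxUnavailable: 0
        maxSurge: 1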


You could do the same with AWS ECS

there is nothing special about K8s in your situation.


I do not use AWS

ECS does not bin pack, does it?



neat.

Not available for those of us who choose not to AWS.


I don’t understand the value of container tech (unless dealing with legacy code).

Serverless seems better and cheaper (lambda, GC Functions, Azure Functions). Pay as you go with no overhead costs. In many cases usage may never exceed the free credits.

For the database and persistent storage, you need a cloud native service like S3 and Postgres RDS. (Running a database in a container is asking for complete data loss from what I understand.) This is the primary cost for my tech stack and for lite loads $15 is about as minimal as possible.

So I don’t see how using containers could be cheaper unless you are keeping the data in the container, which is a bad idea since containers are supposed to be stateless.

So serverless cloud native tech seems better by far from everything I’ve experienced.

Also when using a tool like serverless framework, deploying an entire stack of serverless resources is dead simple.

I’ve tried a few times to see the value of containers, but I just don’t get it.


Your comment and the post and threads in general are making me chuckle, having just spent the last half hour reading this post, for no particular reason, from a few months ago about the transformation of racknut-and-perl roles like neteng and sysadmin into text editor-based devops:

https://news.ycombinator.com/item?id=22508968

I have to say, certain comments -- which I'm sure are just as real-world as yours -- lie in...let's call it tension...with the threads and comments here in this post. That everything in the old post appears to be subsumed into a "sysadmin II: internet boogaloo" of orchestrator expertise is humorously ironic.


I've found that ECS on AWS works great for similar use-cases. I guess it gets more love than EKS on AWS.

Seems like EKS is great for people to start looking at AWS without vendor lock-in, but then people switch over to ECS and/or Lambdas for compute.


Nomad + Consul has been pretty great to us over the last year. We're a small nonprofit and chose it specifically because we can't afford to pay someone to keep watch over k8s


GKE is fantastic compared to EKS.


Did you evaluate Cloud Run/Fargate for deploying the same Docker images without k8s?


> We've since migrated to cloud functions (on GCP; but AWS lambdas could also work) and it's just been a breeze.

Did you run into cold start delays with GCP cloud functions? AFAIK it is one of the reasons many prefer AWS Lambda over GCP


We actually had a cold start tolerance of ±30 seconds and so far, very few jobs are out of that range. This was one of our main sticking points, and the Google guys gave us some great tips on how to reduce cold starts.


I had a similar problem: I set up a small cluster and it worked fine until the next upgrade, and the nightmare repeated on each upgrade. I realized I was spending most of my time fixing k8s issues rather than doing any valuable work.


What did the typical "devops wall" look like? Can you recall any specific examples?


> so we kept running into dev-ops walls

This reads like an oxymoron. 'DevOPS' was supposed to break down walls between teams. You probably have a traditional "Ops" organization, regardless of team naming conventions.


Anecdata: Series B startup. I've found GKE to be almost completely painless, and I've been using it in production for more than 4 years now. I don't think the article gave a fair representation on this count; sharing a link to a single GKE incident that (according to the writeup) spanned multiple days and only affected a small segment of users doesn't (for me) substantiate the claim that "it isn’t uncommon for them to have multi-hour outages".

In my experience, multi-hour control-plane outages are very rare, and I've only had a single data-plane outage (in this case IIRC it was a GKE networking issue). Worst case I see is once or twice a year I'll get a node-level issue, of the level of severity where a single node will become unhealthy and the healthchecks don't catch it; most common side-effect is this can block pods from starting. These issues never translate into actual downtime, because your workloads are (if you're doing it right) spread across multiple nodes.

I wouldn't be surprised if EKS is bad, they are three years behind GKE (GA released in 2018 vs 2015). EKS is a young service in the grand scheme of things, GKE is approaching "boring technology" at 5 yrs old.


EKS is a feature parity product. The pricing makes that painfully obvious.


AWS wants k8s to fail because it works against the significant lock-in AWS has tricked teams into building themselves into. They do not want people already in the AWS ecosystem to move to EKS, it is instead there to not lose potential customers.

And that is why pricing and features sit right at “good enough” and not great.


We've been using EKS since it went GA. We haven't had a single control plane outage that I am aware of.


I'm sure it works fine if you can get it running (documentation didn't work when I played with it). I'm referring more to the $200/month it was per control plane.

That to me is a product they offer because someone else is offering it, but they don't want you to actually use it.


I take it the isolated $200/control plane is too expensive. What do you think the right price point is for that isolation?


ECS control planes cost nothing.

If you're actually looking to build isolation in AWS then you're going to need EC2 dedicated for your EKS member hosts. So you're not getting isolation for $200/month (I'd have to spec it out but dedicated hosts are pricey to the point that it'd be competing with physical hardware in a colo).

That $200/month for not-really-isolated is also per plane. So if you want a separate staging environment it's another $200/month. Client API sandbox? Same thing. It's wack.


Perhaps, in the box-ticking sense.

I'd be surprised if they have the same operational maturity, which translates to uptime, which was what I was talking about.

A quick search turns up these SLAs:

<bad link>

vs

https://cloud.google.com/blog/products/containers-kubernetes...

EKS: 3 nines

GKE: 3.5 nines

So Google is committing to 1/2 the downtime.

(Note the GKE SLA is for regional clusters, which is what you should be doing if you care about uptime. The zonal cluster SLA is 2.5 nines. I couldn't find a difference in EKS, maybe there's an equivalent better SLA for regional clusters I couldn't find.)

Edit - that EKS SLA link formatted weirdly, and so I did a little more digging and found a more recent SLA which matches GKE: https://aws.amazon.com/about-aws/whats-new/2020/03/amazon-ek...

So, per my original comment, I am surprised. (Having never used EKS directly I have no idea what their actual uptime is; in my experience GKE has been way higher than 3.5 nines, but obviously I don't have enough data to make statistically significant observations on this.)


In the region we are running it, it's not had an outage that has affected us since they launched it.

We haven't had any issues with k8s upgrades either.


I work for a mid size company with 30-40 engineers managing 20-30 very diverse apps in terms of scale requirements and architectural complexity. It took our devops team (4-5 people) probably 18 months to learn and fully migrate all our apps to Kubernetes. The upfront cost was massive, but nowadays the app teams own their deployments, configurations, SLA's, monitors, and cost metrics.

Introducing Kubernetes into our org allowed us to do this; we would have never gotten here with our legacy deployment and orchestration Frankenstein. The change has been so positive, that I adopted Kubernetes for my solo projects and I am having a blast.

I understand Coinbase's position, and they need to stick to what works for them. I just wanted to bring up a positive POV for a technology I am becoming a fan of.


Do you have any intuition as to what percentage of the benefits your company has seen come from kubernetes specifically, and how much just from the exercise of spending 18 months working on a modern, coherent, efficient infrastructure?


Kubernetes is almost a standard; all our engineers are eager to learn it. It is extensible - some app teams have started building operators to tackle problems in a more efficient way. It allowed us to standardize metrics and monitoring; we didn't have a clear story for this before. It is cloud agnostic with great compatibility; I was part of a migration from GKE to EKS, and it was painless.


One other advantage of Kubernetes that is overlooked in the article is instant auto-scaling on a heterogeneous cluster. For example, if you have 10 apps on a k8s cluster that each use the same resources, you can give the cluster a 20% buffer, which lets any single app use 300% of its allocation instantly. With VMs, you’ll be stuck waiting for VMs to spin up, or you have to give each app its own large buffer to handle bursts of traffic.
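Roughly speaking, one way that burst headroom gets expressed is by setting requests well below limits (numbers purely illustrative), so each app reserves a small baseline but can immediately use the cluster's spare capacity:

    # Illustrative burstable container: the scheduler reserves only the request,
    # but the pod may burst up to the limit into unreserved cluster capacity.
    resources:
      requests:
        cpu: 500m
      limits:
        cpu: 1500m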


It's amazing to me how many people associate this with kube. You’ve had this for years in numerous ways; the only difference is some yaml and marketing.


What other solution is cloud agnostic as well as has an easy local dev story?


_any_ non-proprietary tool is "cloud agnostic". Kubernetes is bundled software and achieves the things those pieces achieve. There is nothing holy about k8s specifically; the tools you train with are the ones that feel easy, and it's very easy to get skewed opinions from that.

For example, a lot of people would find writing scripts cumbersome, but not a person who's written a lot of them. Scripts are not any more fragile than any other software that can contain logic errors.


You didn't actually answer the question: what free, cloud-agnostic tool lets me specify "keep 60% CPU load average" and works the rest out?

With k8s, a horizontal autoscaler is a few lines in a yaml file, and the result works in any cluster run by any vendor.
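Something like this is all it takes (the target name and numbers are illustrative):

    # Illustrative HPA: keep average CPU utilisation around 60% by scaling
    # the target Deployment between 2 and 10 replicas.
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    metadata:
      name: myapp
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: myapp
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 60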


Any load balancer will let you do that. Scaling is a few lines of script on any platform, and well worth the couple of minutes; you don't even need a tool for just that.

No tool "works the rest out". It's _always_ a compromise, because inherent complexity can never be removed, only moved. What you gain in one area, encumbers another.

You may freely use k8s, but it's not magical, nor easier to use than existing systems. In fact, adopting it often takes non-trivial time, and the web is full of failure stories with very benign warnings and catastrophic results.


Isn't that the benefit of cloud?

Maybe I'm spoiled by GCP and it's not the same with other providers, but I can have a brand new Debian VM configured and operational in less than 10s (not counting the time my own startup script takes to run).

Debian machines spawn with incredible speed; not much slower (if at all slower) than a new container.


Spawning a bare bones VM is easy and fast, the problem is getting the applications there, which is where k8s shines.


Maybe I'm spoiled then, it takes less than 2s to spin up my application on a new instance. :\

There are also many options for creating Debian based image files with all you need baked in.

I guess it depends very much on the application you have.


A lot depends on whether "bake application into AMI, use ASG to scale and zero-downtime deploy" is applicable or cost effective.

I think I've been in total of one company where it was somewhat true, and even there we didn't do it.


I'm a fan of cloud native kubernetes.

But: I've also been put in the terrible position of supporting a platform sold to run on kubernetes in an air gapped, on prem, bare metal environment.

But: I've also been put into the terrible position of fighting vendors armed with agile, k8s and microservices, selling that combined mayhem as a replacement for an OpenMP/OpenACC-based massively parallel, on-prem, bare-metal HPC system with a strongly conservative development and operational tradition.

I understand loving efficiency and time savings but not everything simply works better with k8s.


> I've also been put in the terrible position of supporting a platform sold to run on kubernetes in an air gapped, on prem, bare metal environment.

Would you share your experience?


Sorry, can't share details as I enjoy being employed.


Exactly. A lot of the comments here are along the lines of "we set up a k8s cluster and now managing it is a huge burden", which is not surprising. The power of k8s is that it allows separation of concerns in your technology organization. You can have a dedicated team build and maintain the underlying cluster, and then app development teams are consumers who deploy their applications on the common infrastructure. Kubernetes provides a nice abstraction layer so those two teams/orgs can interact through a well-defined API. As a dev team, we can manage our own infrastructure and pipeline through declarative configurations and let someone else manage the underlying compute and network infrastructure. As long as you don't fall into the anti-pattern of "every team builds its own k8s cluster", you should be able to derive some nice economies of scale.


You're comparing 18 months of migrating to K8s to doing nothing, rather than 18 months of other possible solutions.


As someone who's gone the opposite way (moving from ECS to Kubernetes), I think the author is understating how good managed Kubernetes solutions are.

At my current job, I use Azure's managed Kubernetes service, which does a great job at providing a consistent environment that's very easily managed, no unexpected updates, great dataviz, and if you choose, simple integrations to their data storage solutions (run stateless K8 clusters if you can) and key vault. We don't do much outside of our kubectl YAML files, which as commented below has a de-facto understanding by a large number of people.

CVEs will always exist, which is why network security is important. I think we can agree that the only ingress into your cloud environments should be through API servers your team builds, and everything else should be locked down to be as strict as possible (e.g. VPNs and SSO). With a system like K8, so many eyes on the code mean so many more CVEs will exist, so I don't find this argument compelling.

My team, and so many other teams worldwide, are betting that the K8 community will accelerate much faster than roll-your-own solutions, and K8 gives us the best opportunity to create cloud-agnostic architecture. Additionally, Helm charts are easy to install, and afaict more software vendors are providing "official" versions - which means a team like mine, which is happy to pay for services to manage state, in the same vein as a company choosing AWS RDS over managing their own Postgres server, can get the same benefits as the author with a cloud-agnostic solution.


You don't see random network errors, often visible with DNS, on your Azure managed kubernetes clusters?


I haven't yet; our ingresses have passed routine ping checks (we use New Relic Synthetics for this) for a while now. Fingers crossed.


Ingress on AKS is easy; egress will be the pain if you need anything from it.


One thing that is regrettable about K8s winning the orchestration wars so remarkably, is that it pretty much killed all other solutions. Swarm is dead, Nomad doesn't seem like it has much community support and Mesos feels like it's on life support. Mesos still has a lot of people working on it however, but the perception feels different.

Personally I've found Mesos much easier to manage, secure, and operate than k8s. However, when it first came out all the cool kids were using it, then most of them jumped ship to k8s. AirBnB's Chronos is now pretty much a dead project, Mesosphere's Marathon is now gimped (no UI) and major features moved into DCOS. At the same time, Mesosphere (now D2iQ?), now seems like it's more focused on k8s.

k8s is everything plus the kitchen sink, and managed k8s isn't the killer feature I thought it would be. I don't blame people for not jumping on the k8s bandwagon at all.


We're using Nomad to manage a large fleet of firecracker vms at fly.io. It's not as robust as k8s, but I think that's a feature. It's well documented, extensible, and predictable. Not a big community, but hashicorp folks are responsive on GitHub.


We're working on community! There's an office hours tomorrow: https://mobile.twitter.com/HashiCorp/status/1270126346103132...

There's also https://discuss.hashicorp.com/ and we're discussing (harhar) how else to improve our community relations.


That's great to hear!


Nomad, Consul + Swarm is still a lovely solution and I prefer it a lot to K8s. K8s is a big monolith and often way too complex for my personal use cases. I hope Hashicorp sooner or later builds a proper replacement for Swarm so that we can have overlay networks without hassle. I know there is Weave, but never tried it.


K8s is a lot of things, but a monolith it is not. Quite the opposite - the complexity comes from the large number of relatively simple components interacting in various ways.


Which part of Swarm are you looking to replace?

Nomad already does the container part, and Consul Connect is the networking overlay / service mesh. Work is being done to get Nomad better integrated with Connect.


Broader CNI and multi-interface networking support coming in 0.12!

Would love to hear what else you miss from swarm: schmichael@hashicorp.com

Update: Honestly if you haven't tried out our Consul Connect integration, please do. mTLS with just "connect { sidecar_proxy {} }" stanza and Consul running: https://www.nomadproject.io/docs/integrations/consul-connect...


FWIW, where I work we're going with Nomad over K8S. It gives us everything we want and nothing more (plus it's from a source we trust and love: HashiCorp).

The nice thing with Nomad is that, thanks to its straightforward design/approach, it should be easy(ish) in the future to swap it out and go with something else if we outgrow it (or HashiCorp abandons it for some reason).


I know quite a few startups going the Nomad route. It might not be as mainstream, but I don't think it's going to be phased out any time soon. Rancher was still maintaining their own scheduler for a while (not sure if they still do) because there were a lot of legacy customers still on it. Rancher and RancherOS have pretty much moved to being a full k8s management shop though.


I tried using Nomad once after being a little worn out by Kubernetes' complexity. For some reason, the Nomad abstractions didn't click on the first couple attempts. In comparison, Kubernetes' abstractions mapped 1:1 to my understanding of the service oriented architecture.

I'd have probably gotten used to it had I spent more time with it, but it'd have taken some rewiring of my thinking process.


> I tried using Nomad once after being a little worn out by Kubernetes' complexity. For some reason, the Nomad abstractions didn't click on the first couple attempts. In comparison, Kubernetes' abstractions mapped 1:1 to my understanding of the service oriented architecture.

<3 Nomad. However, Nomad only covers a really tiny part of the Kubernetes ecosystem, which is the ability to pack containers and schedule them efficiently in a cluster; plus, it scales really well. Kubernetes provides a bit more than that.


Swarm being dead is news to me; we use it for our on-prem bits without issue (though we only have a few of them).

We use ECS for the majority of our container orchestration and love it.


I'd agree that k8s has a lot of functionality built in, but another important thing to realise is what k8s doesn't do.

In addition to the well-known integration points (Container Runtime/Network/Storage Interfaces), there are gaps like the lack of a good built-in user authentication mechanism in Kubernetes, which means you pretty much always need some external authentication service for your clusters.

That's not too bad if you're on one of the big managed providers (GKE/AKS/EKS) but can get complex for people who want to deploy on-prem.


> That's not too bad if you're on one of the big managed providers (GKE/AKS/EKS) but can get complex for people who want to deploy on-prem.

Go spin up Keycloak, join it to your user directory of choice (or don't, and just use the internal directory), configure it as your cluster's authentication provider, done.


Right, so in addition to the complexity of running k8s (which is the general point of the post) you now have to learn about OAuth servers and LDAP integration.

In many corporates you also now have the challenge of cross-team/department work, for the k8s team to work with the AD team to get it set up.

And even then you still have the problem that, without first-class user/group objects in k8s, people often run into trouble over time with JML (joiner/mover/leaver) processes and mismatches between AuthN and AuthZ...


Or use Dex: https://github.com/dexidp/dex

Which has the advantage of not needing any external databases.


That works too depending on your requirements. Either way, authentication is not a hard problem to solve.


LOL. You clearly haven't worked with SSO or anything a bit more complex. It's a pretty hard problem; there are even companies whose whole portfolio is built around authentication alone!


Chronos wasn't killed by k8s -- it was killed by Airflow.


We've been mostly happy users of Mesos, Apache Aurora, and Consul. It works pretty well for us (200+ engineers). We have maybe 5 people dedicated to keeping it all alive, and they'd be able to maintain the pieces of it that we use. Aurora configuration kinda sucks, but I think job configurations might just suck in general.

That being said, it has been concerning to us to see the big institutional players move off of Mesos and Aurora. We wouldn't like to be active maintainers, but we could.

I think that’s the main difference between Mesos and K8s: with K8s, we wouldn’t want to maintain it, and we wouldn’t be able to (since it’s so large). Somehow mesos and aurora feel more manageable.


I'm a former Apache Aurora maintainer. Aurora has been (and continues to be) awesome for us and I'm so happy to hear other folks are still using it and it's working out for them.

Funny that you mention the configuration part. At the most recent KubeCon in San Diego, CA, the folks at Reddit gave a talk in which they said they got sick and tired of dealing with YAML. They went on to accidentally recreate Pystachio as the remedy, so I think you're right on the money with your statement.

When the Project Management Committee (PMC) voted to put Aurora in the attic we were all super bummed, but we just ran out of interested developers :(.

The PMC agreed to kick off an "official" fork but so far it's just me maintaining it: https://github.com/aurora-scheduler/aurora


Oh it’s super cool to see you in the wild! To clarify, I had a lot of qualifying thoughts running through my head when I said “kinda sucks” (hence the “kinda”!). :)

I actually think managing Aurora configs is way easier than managing YAML files, and I agree that Aurora configs were ahead of the game: having access to Python in your config feels like a superpower. I feel like we'll converge on something that compiles Aurora configs into YAML files prior to runtime.

That being said, we’ve never been able to get good editor support for things like “go to definition”, with the whole “include” syntax. We have maybe 2-3k aurora config files, of which maybe 100 are shared boilerplate. Do you have any advice on this? I tell vim to treat them like python files, but pylint hates them :)

We were bummed by the PMC decision too. I think some people at my company have considered becoming maintainers over the years, but, for the most part, everything "just works", so we haven't felt a selfish need to, so to speak. I actually think it's kind of an unintuitive credit to your project that it doesn't require a horde of maintainers. That being said, I'll set aside some time this weekend to take a look at some issues. :)


Oh, no worries, no offence taken at all with the comment; configuration files tend to suck in general :).

Pystachio was indeed very forward-looking, and the folks who worked on it at Twitter at the time deserve all the credit there.

I think what you mention is a general problem I've encountered with IDEs when it comes to dealing with Python (esp. the "go to" issue you mention). Even when I've had to touch the Aurora client code, which is full-on Python code, I've come out pulling my hair out thanks to PyCharm acting wonky.

> We have maybe 2-3k Aurora config files

That's a lot of files! The boilerplate stuff is definitely something I've heard before from users but, unfortunately, there doesn't seem to be a better answer.

When it comes to managing job configs, I'm pretty low on the pecking order in terms of knowledge, since we ended up creating our own Thrift client in Go to use with Aurora. (As a consequence, all our job definitions exist as Go code.)

Stephan Erb (https://twitter.com/erbstephan/) may have better advice in this case. Some of the Twitter folks may have good info too, but they've been radio silent for months.

> I actually think it’s a kind of unintuitive credit to your project, that it doesn’t require a horde of maintainers.

That's definitely a great point and a great compliment to the project. There's a lot of love that went into this project and I'd be ecstatic to get some new contributions, even if it's something simple like fixing documentation or bumping up dependencies :D.


5 dedicated folks = a lot of dollars. $500k-1M? Do those 5 folks keep things up reliably without burning out?

Honestly curious - I've never used k8s or Mesos* in production, and we have a much smaller team managing a much smaller infrastructure, I am sure.

edit: meant to say mesos but applies to both


Yea, $500k-1M is probably correct.

But, to be fair, they’re not dedicated to only keeping the lights on, they probably spend 20% (1 eng year per year) of their collective hours fighting fires, so I don’t think burnout is too much of an issue. The remainder of their time is spent maintaining libraries and web interfaces. They’re a pretty standard “platform engineering” team.

In my perspective, it doesn’t matter what technology you go with (K8s or otherwise): a responsibly-managed your platform team requires at least one team with a full on-call rotation (i.e. at least 4 engineers), depending on how wide your golden path is.


I think it's more that Mesos committed suicide by DCOS, which sapped the strength from the community - something that first Google, and then the CNCF, have worked hard to avoid.

Another thing, which might have just been me: I really couldn't see any structure in Mesos deployments. Yes, you could run X, Y, Z on top of it to get various features, but they all had separate APIs, separate input files, etc.

Having used k8s before that, it was a huge blow: at the time (late 2018) I used to think that Mesos et al. were more mature and advanced, but instead I encountered something that felt like k8s circa 1.2 with less cohesion :/


Mesos was great. It was a much cleaner, more flexible separation of concerns. Too bad Mesosphere killed it.

