Except that state on kubernetes seems really hard to do well, and data is inherently state.
Also, the complexity cost significantly raises the 'configuration that outperforms a single thread' threshold. 99% of the time you'll be better off putting a Docker container on a single box.
It took me a long time to come to terms with this fact. I was once asked during an interview what I wouldn't run in k8s, and I said legacy apps and hardware-specific apps. Now I think that for many applications Cloud Run or Cloud Functions can reduce the complexity of software delivery. It takes a whole team just to manage the k8s infra, while an app team can be responsible for their own Cloud Run app.
When I talk to people about k8s, if they haven't containerized their apps yet, I generally recommend starting with a simple CaaS setup and then, if you need some of the flexibility and power that k8s provides, moving those apps into a cluster.
It can do a lot of cool stuff, but there's plenty of complexity there, and quite a few potential nasty surprises (from a security standpoint, anyway, which is where I focus).
OP is using Google Cloud (GCP); Cloud Functions and Cloud Run are Google Cloud's serverless offerings. Cloud Functions is comparable to Lambda; Cloud Run is basically a Kubernetes-based PaaS.
CFs are higher level than Lambda: they can implement HTTP endpoints directly, more like the Serverless framework or a combination of API Gateway + Lambda. And they have good container support; Lambda's container support is a bit iffy.
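To illustrate the "HTTP endpoints directly" point, here's a minimal sketch of an HTTP-triggered Cloud Function in Python. The function name and greeting are made up; in the real runtime `request` is a `flask.Request`, and no API Gateway wiring is needed, unlike a bare Lambda handler:

```python
def hello_http(request):
    """Sketch of an HTTP Cloud Function entry point.

    In Google's Python runtime, `request` is a flask.Request and the
    return value (a string, tuple, or Response) becomes the HTTP
    response body directly.
    """
    name = "World"
    # Query parameters come straight off the request, e.g. ?name=k8s
    if request is not None and getattr(request, "args", None):
        name = request.args.get("name", name)
    return f"Hello, {name}!"
```

You deploy the function and get a URL; there is no separate routing layer to configure.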
Thanks both for explaining. Now I understand why AWS Lambda recently added support for invocation URLs. Or I thought they did, but the page seems to have been taken down now.
This is an interesting facet of software that feels lost. The naive way can often scale remarkably well. Even better, if you leave an approach stable, as in not changing it, you can often get a lot of improvements by focusing on other sections.
This isn't to say that throwing more code and complexity at a problem can't help. But it's probably no different from any other kind of help, in that most gains are slow to realize and more marginal than you'd hope.
The problem with a container on a single box is that you still need to solve a bunch of problems that k8s solves for you. By the time you've handled deploying updated containers, a sidecar for logging and/or metrics, service discovery, and decided you _maybe_ want two instances of your app, you likely should have just used k8s from the get-go. You get all of that straight out of the box, and it's going to be much easier to hire someone to manage that than the mess of bash scripts you end up with otherwise.
I think they mean that 99% of the time you're better off using a single container on a single server and wiring them together using traditional tools (load balancers, network-attached storage, etc) than you would be if you adopt Kubernetes. This gives you many of the same application-development benefits (standard packaging format, write once run anywhere, etc) that Kubernetes does without the overhead of managing the infrastructure complexity.
Kubernetes is so complicated, and many organizations don't really think about what problems they're trying to solve before diving in. Kubernetes solves many problems that most IT organizations don't actually have, or that are perfectly solvable in a simpler way with the existing virtual-machine ecosystem.
99% of the time for 99% of people you're better off not using a container at all and actually running your application on your OS instead of under $alot of layers of extra abstraction.
Of course if you're not a human person it's different. Corporate people, being made up of multiple different people, care more about their human elements being able to interact with each other's work. There containers might help.
Containers do not add extra abstraction layers, they're about proper namespacing of all resources. They simplify the OS's existing abstractions and remove a lot of pointless attack surface.
Exactly. Some people think containers are a whole new layer, akin to a VM. I think it's just that their mental model is used to equating containers with VMs.
I am pretty sure they're supposed to use a somewhat "cooperative namespacing" scheme, which makes both the application and the OS/container abstraction layer more complex while attempting to deliver lower overhead for that namespacing. Sort of like cooperative multitasking compared to preemptive multitasking.
And even that doesn't work in some situations. As an example, old versions of Java (7 and older) read the non-namespaced Linux filesystem APIs, so they see whole-system limits instead of the current container's limits when setting up memory and CPU parameters.
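A rough sketch of that mismatch (the memory values and cgroup path below are made up for illustration; cgroup v1 shown, v2 uses `memory.max` instead): `/proc/meminfo` reports the host's memory, while the container's real ceiling lives in the cgroup filesystem, which old JVMs never consulted.

```python
def parse_meminfo_total_kb(meminfo_text: str) -> int:
    """Extract MemTotal (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")

# Illustrative contents only, not real measurements:
host_meminfo = "MemTotal:       65536000 kB\nMemFree:        1234 kB\n"
cgroup_limit_bytes = 2 * 1024**3  # e.g. /sys/fs/cgroup/memory/memory.limit_in_bytes

host_kb = parse_meminfo_total_kb(host_meminfo)  # ~64 GiB: what Java 7 sized against
container_bytes = cgroup_limit_bytes            # 2 GiB: the actual ceiling
```

A JVM sizing its default heap from the first number inside a 2 GiB container gets OOM-killed long before it thinks it's out of memory.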
> remove a lot of pointless attack surface.
Not really; containers are for lower-overhead cooperative namespacing, not security. You want bare-minimum lightweight VMs for security.
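One way to see the "containers are namespacing, not a new layer" point directly: on Linux, every process already has a set of kernel namespaces attached, visible under `/proc/self/ns`. A containerized process has the same kinds of namespaces, just distinct instances from the host's. A small sketch (returns an empty list on non-Linux systems):

```python
import os

def current_namespaces(path: str = "/proc/self/ns") -> list:
    """List the kernel namespace links of the current process (Linux).

    On a typical Linux box this includes entries like 'mnt', 'pid',
    'net', and 'uts'; a container simply gets fresh instances of the
    same namespace types rather than an extra abstraction layer.
    """
    if not os.path.isdir(path):  # non-Linux or no procfs: nothing to list
        return []
    return sorted(os.listdir(path))
```

Comparing these links between a host shell and a container shell (e.g. `ls -l /proc/self/ns` in each) shows the same entries with different inode numbers.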
Microservices do allow scaling the development of an application on the human side. This is true!
A kickass CICD setup and good developer tools probably should be done first for the typical company, otherwise the majority of the added engineering time will go into scaling the infrastructure setup.
Unless severely resource constrained, it makes sense to at least toss the application under something light like an nspawn container. If you're using systemd hosts it's already there.
Even if you don't use user or network namespaces it at least confers a desirable separation of application and its dependencies vs. underlying host at the filesystem level.
It simplifies administration and allows your app and its dependencies to move at a pace decoupled from the host kernel and its minimal dependencies required for providing ssh access and systemd.
There's a concept called COST (Configuration that Outperforms a Single Thread), which basically asks: how many cores/servers does a system need before it outperforms a single thread/core? Often this means that to outperform a single beefy server you need to get to 100+ servers to overcome the coordination costs and actually do more processing, or process faster. There are lots of workload-specific details here, but the general application in data engineering is: "How many nodes does your Spark cluster need to outperform my laptop, or how many nodes does your cluster need to outperform a single turbo.large.beef instance?"
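A toy model of why the break-even cluster size can be so large (every number and the efficiency curve here are invented for illustration, not measurements): each added node contributes fixed throughput, but coordination overhead grows with cluster size.

```python
def smallest_cluster_to_beat(single_node: float, per_node: float,
                             coord: float) -> int:
    """Smallest n where n small nodes out-throughput one big node.

    Toy model: aggregate throughput = n * per_node / (1 + coord * (n - 1)),
    i.e. coordination eats a growing slice as the cluster grows.
    Returns -1 if coordination cost means the cluster never catches up.
    """
    n = 1
    while per_node * n / (1 + coord * (n - 1)) <= single_node:
        n += 1
        if n > 10_000:
            return -1
    return n

# One beefy server doing 100 units/s vs. nodes doing 10 units/s each,
# with a 5% coordination penalty per extra node: you need 20 nodes,
# not the naive 11.
print(smallest_cluster_to_beat(100, 10, 0.05))  # -> 20
```

With zero coordination cost the naive answer (11 nodes) holds; at a high enough coordination cost the cluster never wins at all, which is the COST paper's point in miniature.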
My belief is that most companies chase scalability wrong in data.
That said, there are some companies that handle telemetry at scale and actually need to record and analyze billions of events per second. In these cases you will trivially hit the scaling limits of a single server. 99% of companies are not here.
I could be convinced that this number is actually as low as 90% :)
On a lot of platforms you'll see better cost savings scaling vertically, too. A Go app will run just as well (or better?) on one (or two, for HA) beefy servers as on a fleet; the HTTP router will saturate all of your cores just fine.
Than normal processes, which are platform-dependent.
> Not sure what “on containers” means. “On nodes”, sure. Do other platforms not solve this as easily? (Running multiple workloads per host)
"On containers" means your application is packaged to be used as a consumable service, which is in turn used to compose a more complex workflow service that solves a 'real' problem. I guess other platforms also solve this, but the concept of containers (which Docker streamlined) got there first. Everyone followed later.
> I guess same question as above, there aren’t other solutions where you give something a fleet of hosts and it runs things on them?
There are years of bin-packing research (think operations research from the 50's). But none of it came with the integration that something like k8s brought. Mesos is another example, but commercial support is slowly killing it.
> Again compared to what I guess?
Compared to having heterogeneity, even in your local environment, let alone in a production system.
I didn't say either, but I know why they exist. Basically, I said that these (containers, bin-packing, load balancing, proxying, namespacing, isolation, etc.) are all realizations of, and solutions to, problems that you need to solve when autonomically managing any distributed system that is shared among multiple distributed applications, each using its own deployment model.
You don’t need to use k8s. But if you want scalability and manageability, you have to solve the same kind of problems it tries to solve, because ultimately resources aren’t infinite. Of course, this will always depend on the size of your (multiple) service(s).
This seems to lack depth, so it's unclear why it's being upvoted. While I agree Kubernetes adds some advantages to real-time processing, not all of those advantages require you to run Kubernetes. This post also seems to be missing several of the downsides: massive complexity, networking, scheduling decisions, etc.
Kubernetes is a very flexible way to host microservices today, which also means it takes a lot of effort to manage and keep running as part of a workflow.
And while we needed something like Kubernetes this past decade to grow microservice architecture en masse, many of the things that Kubernetes does right now will be abstracted away over time in a way that maintains much of that same flexibility.
In that respect, yes, Kubernetes is important for the future of data platforms - but as a stepping stone (an angle that was absent in the author's post).
Ah - by flexibility, I meant to imply that there are thousands of different ways to host microservices, based on the needs and architecture of your Web app.
And because of this, there isn't a set of standard tools and practices that works for all Web apps, and when things change it often requires big workflow changes as well. In other words, for anything other than a simple Web app (which could be handled with GitOps), you have to have a good SRE to keep things running with Kubernetes ;-)
I think that'd be near impossible to speculate on. There could be hundreds of different ways it could evolve that way, including ways that involve frameworks that haven't been invented yet.
"For the sake of God" should be "For the love of God" or maybe "For God's sake", I think; that's the analogy he was probably trying to use since I don't think English is his first language.
You can get resilience against hardware failure, and shorter downtime windows when a host needs to be rebooted, since the database will immediately start up on another node.
A good Kubernetes cluster will also give you centralized logging/monitoring/alerting as well.
So if you already have a cluster (and an administrator), you do get benefits. But the benefits are only really worth it if you can amortize the cost of the cluster administration over many applications.
I've seen some companies using development clusters and assigning namespaces per user. With MSSQL being able to run in containers, they can spin it up quickly and point it at a sanitized version of their production data. If they update MSSQL, it's just a matter of killing the old container and spinning up a new one. Or if you want to test an upgrade, just spin up the new version, point it at the same data, and run your tests.
Even databases like MySQL and PostgreSQL do scale out for HA reasons: they have multiple replicas. It's easy to set this up in Kubernetes, especially for databases that have operators.