Hacker News
The future of Kubernetes is virtual machines (paulcz.net)
220 points by MordodeMaru on Dec 26, 2018 | 122 comments

An important element of Kubernetes is that it standardizes the infrastructure control plane, and allows different pieces to be plugged in (for networking, storage, etc.).

The "virtual kubelet" essentially throws all that away and keeps Kubernetes "in API only". For example, with virtual kubelets, scheduling is meaningless and networking and storage are restricted to whatever the virtual kubelet target supports (if usable at all).

Personally, I think the value proposition is tenuous -- you can create VMs today, doing so via the Kubernetes API isn't suddenly revolutionary. Just like throwing something in a "hardware virtualized" instance doesn't suddenly make the whole system secure.

Containers and Kubernetes are compelling for a variety of reasons, improving them to handle multi-tenancy is a broad challenge but I don't think the answer is to reduce the standard to what we have today (a bunch of disparate VMs).

> Personally, I think the value proposition is tenuous

Multi-tenancy is a pretty compelling value proposition when you reach any kind of scale. If you're in a regulated sector, it's non-negotiable.

Relying on the cluster as the security boundary is very effective ... and very wasteful.

> Containers and Kubernetes are compelling for a variety of reasons, improving them to handle multi-tenancy is a broad challenge but I don't think the answer is to reduce the standard to what we have today (a bunch of disparate VMs).

I think the argument is that rather than the painful (and it will be very painful) and probably incomplete quest to retrofit multi-tenancy into a single-tenancy design, we can introduce multi-tenancy where it basically actually matters: at the worker node.

At first glance it's confusing to go from "one master, many nodes" to "one node pool, many masters". But it actually works better on every front. Workload efficiency goes up. Security surface area between masters becomes close to nil.

Very cheap VMs are the means to that end.

Disclosure: I work for Pivotal and this argument fits our basic doctrine of how Kubernetes ought to be used.

I don't think multi-tenancy has been "retrofitted" onto Kubernetes. Kubernetes was designed with multi-tenancy in mind from the very early releases -- namespaces, authn/authz (initially ABAC, later RBAC), ResourceQuota, PodSecurityPolicy, etc. New features are added over time, such as NetworkPolicy (which has been in Kubernetes for a year and a half, so perhaps not "new" anymore!), EventRateLimit, and others, but always in a principled way. And the integration of container isolation technologies like gVisor and Kata are using a standard Kubernetes extension point (the Container Runtime Interface) so I do not view this work as retrofitting.

Moreover, even today there are real public PaaSes that expose the Kubernetes API served by a multi-tenant Kubernetes cluster to mutually untrusting end-users, e.g. OpenShift Online and one of the Huawei cloud products (I forget which one). Obviously Kubernetes multi-tenancy isn't going to be secure enough today for everyone, especially folks who want an additional layer of isolation on top of cgroups/namespaces/seccomp/AppArmor/etc., but there are a lot of advantages to minimizing the number of clusters. (See my other comment in this thread about the pattern we frequently see of separate clusters for dev/test vs. staging vs. prod, possibly per region, but sharing each of those among multiple users and/or applications.)

Disclosure: I work at Google on Kubernetes and GKE.

From a security (as opposed to workload isolation) perspective, I don't think k8s was designed with multi-tenancy in mind at all, in early versions.

Definitely I've had conversations with some of the project originators where it was clear the security boundary was intended to be cluster level in early versions.

Some of the security weaknesses in earlier versions (e.g. no AuthN on the kubelet, cluster-admin grade service tokens etc) make that clear.

Now it's obv. that secure hard multi-tenancy is a goal going forward (and I'll be very interested to see what the 3rd party audit throws up in that regard), but it is a retro-fit.

> I don't think multi-tenancy has been "retrofitted" onto Kubernetes. Kubernetes was designed with multi-tenancy in mind from the very early releases -- namespaces, authn/authz (initially ABAC, later RBAC), ResourceQuota, PodSecurityPolicy, etc.

My complaint is that these require assembly and are in many cases opt-in (making RBAC opt-out was a massive leap forward).

Namespaces are the lynchpin, but are globally visible. In fact an enormous amount of stuff tends to wind up visible in some fashion. And I have to go through all the different mechanisms and set them up correctly, align them correctly, to create a firmer multi-tenancy than the baseline.

Put another way, I am having to construct multi-tenancy inside multiple resources at the root level, rather than having tenancy as the root level under which those multiple resources fall.

> there are a lot of advantages to minimizing the number of clusters.

The biggest is going to be utilisation. Combining workloads pools variance, meaning you can safely run at a higher baseline load. But I think that can be achieved more effectively with virtual kubelet.

> The biggest is going to be utilisation. Combining workloads pools variance, meaning you can safely run at a higher baseline load.

Utilization is arguably the biggest benefit (fewer nodes if you can share nodes among users/workloads, fewer masters if you can share the control plane among users/workloads), but I wouldn't under-estimate the manageability benefit of having fewer clusters to run. Also, for applications (or application instances, e.g. in the case of a SaaS) that are short-lived, the amount of time it takes to spin up a new cluster to serve that application (instance) can cause a poor user experience; spinning up a new namespace and pod(s) in an existing multi-tenant cluster is much faster.

> But I think that can be achieved more effectively with virtual kubelet.

I think it's hard to compare virtual kubelet to something like Kata Containers, gVisor, or Firecracker. You can put almost anything at the other end of a virtual kubelet, and as others have pointed out in this thread virtual kubelet doesn't provide the full Kubelet API (and thus you can't use the full Kubernetes API against it). At a minimum I think it's important to specify what is backing the virtual kubelet, and what Kubernetes features you need, in order to compare it with isolation technologies like the others I mentioned.

Disclosure: I work at Google on Kubernetes and GKE.

One trick I used before was to create resources and leave them unused until they are allocated, at which point I create another one to top off the pool of pre-created resources. A stopped cluster takes up disk space and nothing else and this is an easy solution to the user experience issue.
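A rough Python sketch of that top-off trick, for anyone who wants to see it spelled out. The `WarmPool` class and the `create` factory are made up for illustration; in practice `create` would be whatever provisions a stopped cluster, and the top-off would probably run in the background rather than inline.

```python
from collections import deque
from itertools import count


class WarmPool:
    """Keeps `target` pre-created resources idle and tops off after each allocation."""

    def __init__(self, create, target):
        self._create = create    # hypothetical factory, e.g. "provision a stopped cluster"
        self._target = target
        self._idle = deque(create() for _ in range(target))

    def allocate(self):
        # Hand out an already-created resource so the caller does not wait;
        # fall back to creating one on demand if the pool is empty.
        resource = self._idle.popleft() if self._idle else self._create()
        # Top the pool back up to its target size.
        while len(self._idle) < self._target:
            self._idle.append(self._create())
        return resource

    def idle_count(self):
        return len(self._idle)


_ids = count(1)
pool = WarmPool(create=lambda: f"cluster-{next(_ids)}", target=3)
first = pool.allocate()
print(first, pool.idle_count())
```

The user only ever waits for a hand-off, never for provisioning, as long as allocations arrive more slowly than the pool can be refilled.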

Of course, hardening multi-tenant clusters is also needed. Even if the use case requires resource partitioning, there are use cases that don't and keeping one friend from stepping on another's toes is always a good idea.

I'd like to understand more about your second paragraph, since it shapes some of the work I want to do in 2019. What should I be reading or looking up?

Are multiple disclosures in the same thread really necessary?

I'm saying that I think the value proposition for the virtual kubelet is tenuous, not multi-tenancy as a whole.

For a single cluster, "very cheap" VMs solve some of the problems, but leave others unsolved (e.g. they prevent some hardware and kernel exploits, but lots of security issues can still hit you -- like the last two big K8s CVEs). They also sacrifice a lot of the things that make containers compelling (high efficiency and density), so I don't think they should be spun as a panacea.

You seem to be arguing that one shouldn't bother with multi-tenancy on a single cluster, which is a fine approach, but I do think that the technologies and tools to support the single cluster model are evolving. Calling it a "multi-tenancy retrofit" seems a bit FUD-y to me. Just because there are challenges doesn't mean it's not worth doing.

> I'm saying that I think the value proposition for the virtual kubelet is tenuous, not multi-tenancy as a whole.

I was tying them together because I see the former as an effective strategy to achieve the latter.

> Calling it a "multi-tenancy retrofit" seems a bit FUD-y to me. Just because there are challenges doesn't mean it's not worth doing.

What should I call it? It's being added retrospectively to a single-tenant design. The changes have to be correctly threaded through everything, through codebases managed by dozens of working groups, without breaking thousands of existing extensions, tools and applications.

What I expect will happen instead is that it will be better than it is now -- which is a win -- but that no complete, mandatorily-secure, top-to-bottom security boundaries will be created inside single clusters. We will still be left with lots of leaks.

Our industry is replete with folks trying to wedge the business of hypervisors and supervisors into applications and services. It's possible but always leaks and breaks and diverts enormous development bandwidth away from the core thing that is meant to be achieved. Kernels and hypervisors have privileged hardware access and decades of hardening that can't be truly replicated at the application or service level and which when imitated need to be designed in from the beginning.

I don't see that as FUD. I think it just is what it is. But I appreciate that my thinking is in line with the doctrine Pivotal advances to its customers, which differs from the doctrine Red Hat and others advance (One Cluster To Rule Them All).

I’m not sure who at Red Hat is advocating one cluster to rule them all, but it’s just one point on the spectrum. There are lots of places where one cluster makes sense and two would be overkill - if you want to run lots of simple workloads, or have one very large scale app. But it’s equally smart to separate clusters by security domain or regulatory zone, or to create partitions to force your teams to treat clusters as fungible.

If there’s Red Hat documentation advising silly absolutes please let me know and I’ll make sure it gets fixed.

I don't have an example to hand, so it's obvious I went on second-hand accounts. Do you have something you'd normally point customers to when describing the tradeoffs?

For myself I see the argument for fewer clusters as about utilisation, the argument for more clusters about isolation. It's the oldest tug-of-war in computing. I think that shared node pools for multiple masters is going to be the combination that for most workloads will increase utilisation without greatly weakening isolation. I don't think multi-tenancy in the master will be as easily achieved or as effective.

In Red Hat OpenShift Consulting, we openly advise against “One cluster to rule them all” and the vast majority of our customers heed our advice. Our default delivery models support Sandbox, Nonprod, Prod cluster stand up. Some of us even support the idea that good IaC/EaC practices get our customers to where the cluster can be treated like cattle (much like pods and containers) in well-designed apps. My colleague Raffaele hinted as much when describing the problem as a matter of availability, disaster recovery and federation [0]. At least in OpenShift, multi-tenancy is a solved problem when cluster right-sizing has taken place. RBAC, node labels and selectors, EgressIP, quotas, requests and limits, multi-tenant or networkpolicy plug-ins go a long way.

[0] https://blog.openshift.com/deploying-openshift-applications-...

> In Red Hat OpenShift Consulting, we openly advise against “One cluster to rule them all” and the vast majority of our customers heed our advice. ... My colleague Raffaele hinted as much when describing the problem as a matter of availability, disaster recovery and federation [0].

To be honest, I should have realised this would be so.

> At least in OpenShift, multi-tenancy is a solved problem when cluster right-sizing has taken place. RBAC, node labels and selectors, EgressIP, quotas, requests and limits, multi-tenant or networkpolicy plug-ins go a long way.

Well, as you can guess, I am not convinced that this is really solved -- it looks like multiple discretionary access control mechanisms that need to be aligned properly, instead of a single mandatory access control mechanism to which other things align.

It's also about tooling.

I've seen many clusters being sold, but with no tooling to automatically build, monitor, secure and maintain these clusters, so you've got a DevOps team playing cluster whack-a-mole.

Of course the consultancies love that because it's a bespoke layer for them to build and support, but the reality is setting up a small team to run a couple of clusters eases the job of discoverability and secops, and for many orgs is "good enough".

Still, there is room for improvement, but I doubt it's many masters without another product on top.

> It's also about tooling. I've seen many clusters being sold, but with no tooling to automatically build, monitor, secure and maintain these clusters, so you've got a DevOps team playing cluster whack-a-mole.

Pivotal's doctrine of how to use Kubernetes is explicitly multi-cluster oriented, but that's because we come to the table with tooling that excels at this kind of problem: BOSH.

I agree, but I think I’ve put less thought into it than you have. Just because I tend to focus on research and low-latency infrastructure, I’ve been really happy with the container-first approach. I really like being able to deal with these processes at the Linux level instead of the VM level and being able to tweak that stuff (cpu placement and isolation, accelerators and rdma network devices, etc.) There’s a reason VMs never really took off in HPC, but I think CRI-O is really poised to change the HPC paradigm, and k8s can be really beneficial in some business applications of HPC.

I definitely understand and agree that multitenancy is super important, but it would be a shame to agree that the bare metal performance is an okay sacrifice.

I agree, I'm not sure the virtual kubelet concept is a great idea overall. It sounds good on the surface, but most of the time these kinds of abstractions do more harm than good, as far as I'm concerned. But, I could be wrong. :)

What we need is an open technology stack such that we can get full standardization and integration with k8s and other systems. We might get that with RISC-V.

As someone who has been a sysadmin and system programmer for 25 years and a crusty rat bastard for almost that long, I have to wonder how long it is going to be until someone realizes that a piece of hardware doing a task is more efficient than 27 layers of virtual machines and a long pair of tongs.

Your definition of efficiency is too simplistic. In many cases it’s more “efficient” to have rapid development cycle and low ops overhead than save a few bucks on hardware.

Even if we accept that server cost is key, Google realized a long time ago that they can squeeze more out of their fleet if they can overcommit it because many workloads have variable utilization over time. Hence “containers” were conceived.

Very much this. People time is always the biggest cost. Docker and Kubernetes have been a game changer in time-to-production. Unless you need maximum performance out of the biggest iron available, it's cheaper to buy the next tier of EC2 instance than to waste developer and operations staff time trying to squeeze more out of existing servers. There are many levels of abstraction, but...who cares? It's about productivity per dollar.

> People time is always the biggest cost

It really depends on scale. If I spend 15 minutes improving the utilization of my home cluster by 10%, the only payoff I get is experience. If I do the same to Google's indexing infrastructure, I probably delayed the collapse of civilization by global warming by a full year.

This attitude leaves a lot of low-hanging fruit that, as the operation grows, can shave a couple million dollars off the operating costs.

> People time is always the biggest cost.


> Docker and Kubernetes have been a game changer in time-to-production.


> ...it's cheaper to buy the next tier of EC2 instance than to waste developer and operations staff time trying to squeeze more out of existing servers.

Only if your management is completely clueless. You can't solve people problems by buying machines or installing containers.

> It's about productivity per dollar.

No. Or, rather, only if by "productivity" you mean "clueless management KPI, meaning wasted company dollars per hour".

The root problem is higher-up management being unable to set any useful goals except "let's get investment capital and waste it ASAP to get more investment capital next year, growth, #yolo lol".

how can you say these are false when they're quite evidently true in some circumstances?

I'm on a team that provides a kubernetes-based internal SaaS thing and personnel costs seem to be easily recouped by the money we save from having autoscaling and a common node pool instead of 2 machines per service.

This has been going on for a number of years now and the most important part has always been to be backwards compatible with whatever developers have been doing for those years. That is productivity.

That's not to mention the leverage Kubernetes gives you to optimize and discover costs. Being able to see where optimization yields the best results is much better than optimizing everything. Sometimes it just isn't worth it.

I agree with you, being in the same position. I wonder if the DevOps trend started because of the constant friction between devs and ops. While the devs' main concern and job is to deliver new features, the ops' main concern and job is to keep things stable. These different goals caused ops to be seen as an obstacle to the devs' job. With the devs being closer to the business, it was easy to predict that we would be taken out or at least required to change. I think we (ops) are mainly the ones to blame for the current situation, and now we can only adapt or die.

It has become manageable to do both for one person.

Unfortunately, it's only really possible when you stand on the shoulders of massive companies like Amazon and Google, at which point you're not really doing "dev" and "ops", you're just doing "dev" and outsourcing "ops" to a big company that'll only do half the work for you.

Using recipes, pre-built images, automation, and mostly opaque system controllers that abstract away all the underlying detail. That isn't a reliable/secure system and only salesmen and those who have a stake in that model push it. Unfortunately this is the new normal, fait accompli.

But not allowed in many many organizations and firms.

Governance and security forbid people to be op and dev in one person.

Or God forbid: control the dev, pre stage and production environment.

Or maybe raw performance isn’t always the most important attribute of a system?

Not directly, but cost usually is. Being able to serve the same number of customers with less computational power means reducing your cost.

That's a very simplistic view on the issue.

Most businesses do not have a constant workload 24/7, which means the ability to scale up and down will save more money than reducing the overhead of not running directly on metal.

There is also the cost of having to care about hardware to begin with.

Also, in the big picture, having 100 companies doing their own bare metal deployments is not terribly efficient, compared to 1 company doing that and 99 paying the first company for this service.

The list could go on, but I think you get my point.
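To put rough numbers on the scale up/down point, here is a back-of-the-envelope sketch in Python. Every figure in it is made up: a daily demand profile with 16 quiet hours and 8 busy ones, and a deliberately pessimistic 10% overhead for not running directly on metal.

```python
def fixed_machine_hours(demand):
    """Bare metal sized for the peak hour, running all day."""
    return max(demand) * len(demand)


def elastic_machine_hours(demand, overhead_pct=10):
    """Capacity follows demand hour by hour, inflated by an overhead percentage.

    Integer arithmetic (ceiling division) avoids floating point rounding surprises.
    """
    return sum((d * (100 + overhead_pct) + 99) // 100 for d in demand)


# Made-up profile: 16 quiet hours needing 10 machines, 8 busy hours needing 40.
demand = [10] * 16 + [40] * 8

fixed = fixed_machine_hours(demand)
elastic = elastic_machine_hours(demand)
print(fixed, elastic)  # 960 528
```

With this particular profile the elastic setup uses roughly half the machine-hours despite the overhead. A flat 24/7 workload would flip the result, which is the case the parent has in mind.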

This sounds like classic devops/cloud snake oil. I wonder what industry pays your bills?

* The most businesses/always on reasoning appeals to executive decision makers and rolls down hill to the technical people who are best educated to make the decision. Many people who believe that you can cut cost by sizing workloads end up racing their own models when it doesn't scale financially or computationally over time.

* What is so difficult about hardware? The cost of forgetting how to deal with it will be much higher in the long run.

* Yes, monopolies are healthy.

> Yes, monopolies are healthy.

Seems all 3 major cloud providers, aws, google and azure, are providing facilities to run docker containers. Seems Digital Ocean is getting there too. Hardly a monopoly, init?

> What is so difficult about hardware?

All of it. This is knowledge my company doesn't have and there is no point in investing in acquiring this knowledge at this time, since we have an easier solution. Not to mention that if we decide going bare metal is the way to go, we can do that later.

> The most businesses [...]

I think my english is failing me, I don't really understand this paragraph.

> This sounds like classic devops/cloud snake oil. I wonder what industry pays your bills?

Not the snarky remarks industry...

The biggest cost for growth companies is opportunity cost: what they aren't able to sell and deliver because of constraints in their development and operational setup.

If you have customers beating down your door to buy your product after you add n+1 feature, you can find investors to eat the extra operational cost without a problem as long as you can build n+1 fast enough that those customers don't go elsewhere.

The other element is elastic workload, of course. Having the application dynamically scale across a pool of machines in as tractable a manner as possible at the developer level can be an enormous cost saving all on its own. Instead of allocating machines based on individual high water marks, you can allocate a cluster based on the sum of average usage + as many standard deviations of resource usage to get as many 9s as you need.
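A small worked example of that sizing argument, in Python. The numbers are invented, each service's high water mark is approximated as its own mean plus k standard deviations, and the workloads are assumed to be independent so that variances (not standard deviations) add up.

```python
import math


def per_service_capacity(means, stds, k):
    """Each service gets its own headroom: sum of (mean + k * std)."""
    return sum(m + k * s for m, s in zip(means, stds))


def pooled_capacity(means, stds, k):
    """One shared pool: sum of means plus k standard deviations of the total.

    Assumes independent workloads, so the variance of the total is the
    sum of the individual variances.
    """
    total_std = math.sqrt(sum(s * s for s in stds))
    return sum(means) + k * total_std


means = [10.0] * 16   # made-up: 16 services averaging 10 cores each
stds = [4.0] * 16     # each with a standard deviation of 4 cores
k = 3                 # headroom of three standard deviations

separate = per_service_capacity(means, stds, k)
pooled = pooled_capacity(means, stds, k)
print(separate, pooled)  # 352.0 208.0
```

The headroom shrinks from 16 * 12 = 192 cores to 3 * 16 = 48 cores because the pooled standard deviation grows with the square root of the number of services. Correlated workloads (everyone peaking at the same time) would erode that saving.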

10 Years ago when i needed a machine for anything, i had to create a ticket, wait a week or so to get what? One machine.

Then i got told that they have to control it, otherwise backup and restore is not supported, and on the other hand, they were only able to install stuff like svn by spending virtual 3 project days! 3! I never asked what it would cost to install some etherpad.

Two years ago in a startup, i logged into aws, clicked around for an hour and had 2 isolated networks, 4 instances, a load balancer, dns server, snapshot and backup configured. A few mouseclicks later i could have had an autoscaler as well.

Ops / own Hardware is a means to an end.

I want to be in a world where allocating hardware is as simple (at the API level) as allocating a block of memory or creating a thread. Kubernetes is a means to that end, and with an API that isn't tied to a specific vendor.

The efficiency of the actual code executing on the hardware is secondary to the efficiency of being able to automate allocation of hardware at the application level.

This is why things like MAAS exist, which is a 'cloud provider' that lets you provision bare metal machines. Plug in a new rack of servers, let them netboot, and let the clients spin up a machine matching their requirements in the same way they would spin up an AWS or Azure instance.

We mix bare metal spun up this way with Open Stack VMs, depending on the requirements. And slice up into containers using LXD when that makes sense.

That's kind of the point of k8s. Are you saying containers have too much overhead?

I don't think he knows what he's saying. At a small scale you have overhead because you need to run the orchestration software.

But everything else (disk, CPU, network) typically amounts to less than 1%.

amen. At first i figured containers were a good way to run code with difficult dependencies and weird operating environments, but that still doesn't make sense.

- systemd and uwsgi for example play well when run as a single user per wsgi application under a single nginx/lb.

- php-fpm already handles a ton of overhead from php apps.

- ansible deployments called from gitlab-ci can roll out apps just as well as deploying from the registry.

then i figured maybe they were on to something with the autoscaling thing...but that seems like a meaningless feature. Every good project already has metrics and forecasting...it would be absurd to think a final product like imgur.com or twitter does not know (down to the byte) how much storage they'll need in 4 months and the potential drivers.

auto-scaling infrastructure just betrays the fact that most developers throw resources at load problems instead of waiting for ops to figure out the actual issue.

I helped build an e-commerce marketplace, which is essentially a shopping cart system that was used by third parties. We had no idea how to predict traffic loads because we had no idea when one of our customers would run a successful campaign. Autoscaling on google appengine was a lifesaver for us. Our very small dev team focused on building features because we had zero devops and we didn't have to carry a pager.

I don't think most developers or admins really have the toolset/skills to debug to root cause. Most people see debugging as the process of making problems go away, not as the science of understanding a problem.

You need both developer and admin skills, as well as a pretty good understanding of the whole stack - the application, the framework/libraries, the application runtime/support libraries, the database, the kernel and the network.


This is, like all good HN articles, technically correct and practically incorrect.

It is correct that containers leak, and people know this. Multi-cluster strategies are real, and they shouldn't be. It should be OK to have one big cluster[1]. Until Kubernetes fixes this, there will be some friction to adopt it, based on real use cases like untrusted code and noisy neighbors.

It is incorrect because users (e.g. non-infrastructure engineers) don't know or care what the precise definitions of containers and VMs are. The point of "containers" is that I can define something that acts like an operating system from the ground up, and it builds quickly and runs quickly in production.

Kubernetes doesn't win by forcing users to think about VMs. Kubernetes wins by adopting a VM standard that can be built by Dockerfiles. Infra engineers will love it.

But besides them? Nobody will care, because Docker for Mac will look the same.

[1] Maybe 1 cluster per region? There's a whole fascinating topic that starts with the question "when building a PaaS, do you expose region placement to devs?" The answer implies a ton of stuff about what exactly it's reasonable to expect from a PaaS and how much infrastructure your average dev has to know.

> Kubernetes doesn't win by forcing users to think about VMs. Kubernetes wins by adopting a VM standard that can be built by Dockerfiles. Infra engineers will love it.

This is what the CRI does/is, basically. Various projects sprung up to make it possible @ the runtime/kubelet level (kata-containers, frakti, containerd untrusted workloads), but support for runtimeClass[0] is what's going to tie it all together and it's already in Alpha.

Personally I think we should be moving away from docker files -- docker's superior ergonomics pushed the industry forward at the outset, but they lagged in features and compliance with any standards for a long time and were basically usurped. The Dockerfile is a decent format but it lacks a lot of good qualities, like being trivially machine-editable, and the docker client itself has some unfavorable tradeoffs when compared with tools like rkt and podman. I think it's a mistake to standardize on Dockerfiles, but it's definitely a good idea to standardize on the CRI (which Docker now adheres to via a containerd shim by default).

The best thing by far about kubernetes is the CRI[1], CNI[2] (Container Networking Interface), and CSI[3] (Container Storage Interface) standards that are coming out of it. I don't think anyone realizes it yet, but they are inadvertently building secure/sandboxed computing for everyone. Containers are just sandboxed processes, and before this, most people were running basically completely unsandboxed processes (both in the cloud and on their personal computers) -- once all this stuff lands, linux is going to have some amazing features for running applications more safely -- production-grade safety for any application you run on your own machine available with standardized tooling.

[0]: https://github.com/kubernetes/enhancements/blob/master/keps/...

[1]: https://github.com/kubernetes/community/blob/master/contribu...

[2]: https://github.com/containernetworking/cni/

[3]: https://kubernetes-csi.github.io/

> It should be OK to have one big cluster

Assuming you're deploying your Kube cluster in the cloud, the cost of having multiple clusters is really reduced. You don't have to allocate physical machines or worry about utilisation as much - you just pick a node size and autoscale.

What that enables is thinking about other concerns when deciding how many clusters, and where, are right for your team.

There are operational reasons why having multiple clusters is a good idea. At the simplest level, making a config change and only risking a portion of the infrastructure is an example.

A pattern we're seeing a lot of recently is one cluster per "stage" per region, where a "stage" is something like dev/test, canary, and prod. (In some cases only prod is replicated across multiple regions.) I think this may end up being the "sweet spot" for Kubernetes multi-tenancy architecture. The number of clusters isn't quite at the "Kubesprawl" level (I love that phrase and am absolutely going to steal it) -- you can still treat them as pets. But you get good isolation; you can limit access to the prod clusters to only the small set of folks (and perhaps the CD system) authorized to push code there, you can canary Kubernetes upgrades on the canary cluster(s), etc.

As an aside, something that's useful when thinking about Kubernetes multi-tenancy is to understand the distinction between "control plane" multi-tenancy and "data plane" multi-tenancy. Data plane multi-tenancy is about making it safe to share a node (or network) among multiple untrusting users and/or workloads. Examples of existing features for data plane multi-tenancy are gVisor/Kata, PodSecurityPolicy, and NetworkPolicy. Control plane multi-tenancy is about making it safe to share the cluster control plane among multiple untrusting users and/or workloads. Examples of existing features for control plane multi-tenancy are RBAC, ResourceQuota (particularly quota on number of objects; quota on things like cpu and memory are arguably data plane), and the EventRateLimit admission controller.
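To make the ResourceQuota part of that distinction concrete, here is a small Python sketch that builds a quota manifest as a plain dict and splits its limits into object counts (control plane) and compute (arguably data plane). The namespace, names and numbers are made up, and the classifier is a simplification that only recognises the `requests.` and `limits.` prefixes.

```python
import json

# Hypothetical quota for a tenant namespace called "team-a".
quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "team-a-quota", "namespace": "team-a"},
    "spec": {
        "hard": {
            "count/deployments.apps": "20",   # number of objects
            "pods": "50",                     # number of objects
            "secrets": "100",                 # number of objects
            "requests.cpu": "10",             # compute
            "requests.memory": "20Gi",        # compute
        }
    },
}


def split_by_plane(hard):
    """Separate object-count limits from compute limits.

    Simplification: anything prefixed with "requests." or "limits." is
    treated as compute; everything else is treated as an object count.
    """
    compute_prefixes = ("requests.", "limits.")
    control, data = {}, {}
    for key, value in hard.items():
        target = data if key.startswith(compute_prefixes) else control
        target[key] = value
    return control, data


control_plane, data_plane = split_by_plane(quota["spec"]["hard"])
print(json.dumps(quota, indent=2))
print("control plane:", sorted(control_plane))
print("data plane:", sorted(data_plane))
```

The object-count entries protect the API server and etcd from one tenant creating too many objects, while the compute entries protect the nodes.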

There's active work in the Kubernetes community in both of these areas; if you'd like to participate (or lurk), please join the kubernetes-wg-multi-tenancy mailing list: http://groups.google.com/forum/#!forum/kubernetes-wg-multite...

Also, I gave a talk at KubeCon EU earlier this year that gives a rough overview of Kubernetes multi-tenancy, that might be of interest to some folks: https://kccnceu18.sched.com/event/Dqvb?iframe=no (links to the slides and YouTube video are near the bottom of the page)

Disclosure: I work at Google on Kubernetes and GKE.

Your experience mirrors what I've seen.

Many teams use clusters for stages because they work on underlying cluster components and need to ensure they work together and upgrade processes work (e.g. terraform configs come to mind). There's no reason to separate accounts because the cluster constructs aren't there for security.

Considering it deeper (I haven't had to think about this for a while), I think multi tenancy would cover almost all of the use cases I've seen except for the platform dev where people use clusters for separation when testing cluster config-as-code changes.

I basically split the clusters into livedata, nolivedata, random untrusted code (ci), shared tooling.

The idea being that you have process around getting your code to run on the livedata cluster, and thus we add more stringent requirements for accessing each API.

This is for soft tenancy, and you want to write admission controllers to reject apps that haven't gone through the defined process.
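
As a rough sketch of what such an admission check could look like, the function below takes an `AdmissionReview` request (as a dict) and builds the response a validating webhook would return. The label `example.com/release-approved` is a hypothetical marker that a release process might set; serving this over HTTPS and registering it with the API server are left out.

```python
# Hypothetical label that the release process is assumed to set on approved workloads
REQUIRED_LABEL = "example.com/release-approved"

def review(admission_review):
    """Allow the object only if it carries the approval label; otherwise reject it."""
    request = admission_review["request"]
    labels = request["object"].get("metadata", {}).get("labels") or {}
    allowed = labels.get(REQUIRED_LABEL) == "true"

    response = {"uid": request["uid"], "allowed": allowed}
    if not allowed:
        response["status"] = {"message": f"missing label {REQUIRED_LABEL}=true"}
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }

# Example requests: one unlabeled pod, one that went through the process
unapproved = {"request": {"uid": "1", "object": {"metadata": {"name": "adhoc"}}}}
approved = {
    "request": {
        "uid": "2",
        "object": {"metadata": {"name": "api", "labels": {REQUIRED_LABEL: "true"}}},
    }
}
print(review(unapproved)["response"])
print(review(approved)["response"])
```

A label is easy to forge, so in practice the check would more likely verify something the submitter cannot set themselves (an image signature or registry path, for example); the label just keeps the sketch short.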

The distinction is very helpful and gets at something I was struggling to articulate.

Edit: looking more in the thread, you clearly know this much better than I do. I'd like to get the chance to talk and improve my understanding, if you ever find some spare time.

Config blast radius, hardware sizing, networking partitions, disaster recovery, dev/prod separation -- lots of good reasons to have multiple clusters!

As a gut check, most of these reasons apply to AWS AZs as well. If your Kubernetes strategy calls for more than one per region per AWS account, it means that you, organizationally, don't trust containers as much as you trust VMs on AWS. And right now, you're probably right to do so.

It’s not only about trusting containers. It’s also what data that cluster has access to (anything on a PV attached to a cluster is readable and writable by someone with privileges). The control plane can program the infra to some degree (increasingly so over time).

Treating a single cloud account and the clusters it hosts as a single security cell is more than reasonable, especially if you need to handle sensitive data, be able to audit the state of the cluster (hope your cluster can’t overwrite the contents of the S3 bucket that you’re using for cloudtrail logs, etc), or reason about the risk of compromise.

We run lots of multitenant kube clusters, and I would still recommend anyone who expects to grow to create hard walls between clusters and cloud infra as soon as reasonably possible.

I don't understand your argument suggesting I don't trust containers because I would have more than one Kube cluster per region/account.

Multiple Kubernetes clusters for workload separation and AWS accounts for workload separation aren't quite the same and bring different levels of complexity depending on what your internal processes look like.

You have the blast radius reduction by deploying to multiple AWS regions, similar to what AWS does with their infrastructure.

This is true at the machine and network level, however once you overlay a Kubernetes cluster over those constructs you create a new failure domain.

Right, but you have to be able to survive a region going down, so while K8s might increase the chance marginally that your region will go down, you need to be able to handle that eventuality anyway.

Glad to see people believe there should be one cluster per region, and k8s can be fixed to make that happen.

I am a firm believer in a Borg-style cluster OS; what Borg lacks is a modern API, but the fundamentals are already there.

Disclaimer: I am with Google's Borg team, with a focus on its client side.

"virtual machines" may be the wrong word. Containers is probably still the right word. What we probably want, is actually isolated containers.

And hell, you can nearly get that today. Combining Docker with gVisor is a potential solution to the soft tenancy problem as far as I can tell, and Kubernetes supports using it.

(And gVisor is by no stretch of the imagination a 'VM' - it is, at best, a tiny hypervisor, and maybe less than that.)

Dammit, most of you reading this do not need any of this shit! Build simple things!

You don't need them until you're hired by a client with nation-wide deployment requirements, and then you need them.

Not necessarily. Needless to say, containers and kubernetes are pretty new technologies, and nation-wide deployment requirements existed before docker / k8s.

Most of the old big companies aren't using docker or k8s for their core services. They're all using legacy fat apps that are load balanced in baremetal or vms.

I doubt a home-made solution to load balancing will be seen as simpler, which is what's being argued by the comment I replied to.

The point isn't that you need Kubernetes specifically, it's that requiring a system that scales well is not as uncommon as the OP puts it.

k8s is not the only avenue to building scalable systems -- scaling is mostly about architecture, not how you run your software.

> k8s is not the only avenue to building scalable systems

I had already conceded that. What's your point?

I really worry about cost in this future of leaning on something like ACI/Fargate to actually run the containers.

An m5.large instance (2vcpu/8gb) costs $70/mo on-demand ($44/mo with a 1 year reservation). A similar Fargate runtime costs $146/mo.

A b2ms Azure instance (2vcpu/8gb) costs $60/mo on-demand ($39/mo 1 year reservation). Azure Container Instances at a similar provisioning level costs $176/mo by my calculations.

That's not a small difference. That's, like, 3x.

Point being, I love virtual-kubelet from the perspective of a scale-up just trying to get a product out the door. But for established companies, I still think the core idea of a container on a VM you control is going to rule. Fortunately Kubernetes allows amazingly easy flexibility to switch, and that's a reason why it might be the most important technology created in recent history.

Also, the instance prices are already some crazy multiple of buying a machine, racking it, and running dozens of VMs on it for 5 years.

I don’t know the prices for hyperconverged or converged infrastructure off the top of my head, but doubt the amazon rates are even close to competitive unless you’re at a tiny scale (or need a ton of tiny presences in different regions).

This feels like an infinite russian doll problem, putting k8s inside a bunch of vms I mean.. but I am glad that while everyone is still trying to learn how to properly do k8s these folks are thinking of the next thing

And running Java services in the JVM. It's not real infrastructure unless you have at least four layers of VMs.

Kubernetes has a real chance to succeed where OpenStack failed. Good people at AWS have good reasons to be worried and will push us to a proprietary form of "serverless".

OpenStack's failure is that it's an extremely complex system that looks to be nearly impossible to deploy by yourself. You have to use a Distro that does everything for you- their way.

Kubernetes is simple enough to set up yourself. They have well documented tooling, and a solid do-it-yourself guide. OpenStack has none of that (that I can find). You select RedHat Openstack (RDO) or Canonical OpenStack (MAAS), and you have to use their all-in-one system to have a deployment, and that requires a narrow set of variables which every environment might not have. Which is insane, and will hinder adoption.

EDIT. Not 100% correct, see below comments.

You don't have to use a distro. There are well tested puppet modules, which you could make into your own "distro", as well as openstack-ansible and kolla, and openstack-helm which uses helm to deploy openstack. There is also StarlingX. There is no kubeadm like system however, though I'm not sure how many people will really use kubeadm in prod.

Is it complex? Yes. Is it more complex than k8s? Probably. However, there are multiple open source distros outside of RDO (which is not actually a distro and is instead packaging; see tripleo for a distro-like solution based on RDO). MaaS is not an OpenStack distro; it's a way to manage baremetal nodes that OpenStack is then deployed onto using other Canonical related tools. That said, selecting a way to deploy and manage openstack is complex, but the same is true of k8s.

True. My only experience is with VMware OpenStack, and my quick googling didn't turn up much info. I think Kubernetes will meet the same fate that OpenStack is sliding into: growing complexity with promises of the world. Time will tell.

You seem to know more than I do, so I got to ask. Why does openstack-helm exist? Why would anyone want to deploy OpenStack on top of kubernetes? Is it so you can have the OpenStack API run in Kubernetes that manages physical boxes?

Because the OpenStack control plane is made up of many (some required, some optional) services that are basically Python daemons. It makes sense to run them in containers and manage their life cycle...which means it makes perfect sense (to me, and others) to run openstack as an application managed by k8s.

One of the issues with Openstack is managing the services required to run it. I imagine that helm could be used to make it easier to run these services in kubernetes.

True, but that adds another layer of complexity to an already complex system. If I'm using the ansible module, that's complex enough. Throwing in management of Kubernetes and the additional cruft containers add, it seems like it's a lot of hassle for little gain.

I see this argument relatively frequently. "Why use Kubernetes when I can accomplish X with [puppet|chef|ansible]".

The answer is that k8s offers something fundamentally different and until the person posing the question gets that distinction, the argument is relatively pointless.

I'm not just hiding behind the argument that "you just don't get it... man". Let me point out that you're right. You can manage the openstack control plane perfectly well with your configuration management tool of choice and if you have that process really dialed, then you'll have a difficult time improving upon it with something like k8s.

The issue with Kubernetes isn't that it's hard to set up, but that it's difficult to keep operating long term. Simple things like balancing workloads across availability zones are still not trivial.

I've heard many stories of people who tried to run Kubernetes themselves in production and didn't have a great experience for it.

I've been running it in production for stateful workloads for over a year and it's been going swimmingly. I've never had better uptime or efficiency.

I think they are quite different solutions, and each benefits from the other. Further, they also tend to run into the same concerns at about the same time in their lifetimes, e.g. concerns around complexity.

It's still containers. On a cloud provider the Kubernetes workers are VMs which orchestrate containers. With Kata Containers you're just spawning containers inside micro-VMs.

> It's still containers.

The security profiles of containers and VMs, including kernel-based VMs, are different. VMs still have a significant edge, because the attack surface is smaller and doesn't have many competing missions.

The attack surface of a container can be massively reduced with seccomp profiles -- there was a paper a few years ago which found that the effective attack surface of a hypervisor was about the same as the attack surface of a locked-down seccomp profile of a container (and LXC/Docker/etc already have a default whitelist profile which has in practice mitigated something like 90% of kernel 0days).
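
To show what a locked-down allowlist looks like, here is a minimal sketch that generates a seccomp profile in the JSON format Docker accepts. The syscall list is purely illustrative and far too small for most real workloads; it is not the default Docker profile.

```python
import json

# Illustrative allowlist only; a real workload needs many more syscalls than this.
ALLOWED_SYSCALLS = ["read", "write", "close", "exit", "exit_group", "futex", "mmap", "brk"]

def allowlist_profile(syscalls):
    """Return a seccomp profile that denies everything except the listed syscalls."""
    return {
        # Anything not explicitly allowed fails with an errno instead of reaching the kernel
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [{"names": sorted(syscalls), "action": "SCMP_ACT_ALLOW"}],
    }

profile = allowlist_profile(ALLOWED_SYSCALLS)
print(json.dumps(profile, indent=2))
```

Saved to a file, a profile like this can be passed to Docker with `--security-opt seccomp=profile.json`. The smaller the list, the less kernel code an attacker inside the container can reach.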

And let's not forget the recent CPU exploits which found that VMs aren't very separated after all.

The fact that Kubernetes disables this (and other) security features by default should be seen as a flaw in Kubernetes. (Just as some of the flaws of Docker should be seen as Docker flaws not containers-in-general flaws.)

> The attack surface of a container can be massively reduced with seccomp profiles

Yes, though as capabilities are added to the kernel, the profiles have to be updated.

That said, VM or no VM, this should be done no matter what.

> And let's not forget the recent CPU exploits which found that VMs aren't very separated after all.

This is a nil-all draw in terms of the respective security postures, though.

The economics agree, Zerodium pays as much for VM escape as for LPE. It does seem to be a bit of a low price though, $50,000.

The attack surface argument is debatable, depending on how the system is designed, since virtualization introduces the hypervisor surface.

The attack surface argument certainly is debatable.

I wonder how many multi-tenant workloads are actually at risk of an escape vulnerability. I wager that the multi-tenancy described in the article in the OP is actually disparate workloads across disparate teams in a particular enterprise where it seems (to me) fairly unlikely for someone with access to run a workload to also have the willingness to compile and run malicious code to take advantage of an escape vulnerability.

On the other hand, publicly available compute, i.e. AWS, GCP, Azure, seems way more likely to be the subject of attacks from random malicious individuals seeking to take advantage of an escape vulnerability if one existed.

The hypervisor surface can be made smaller, since its major goal is to manage hardware resources. A kernel has the same mission, but also has a mission to provide a rich API for applications.

it's not really as clear cut as that (IMO).

Shared kernel linux containers can be hardened to the point where they likely have a smaller attack surface than a general purpose hypervisor (for example, look at the approach that Nabla takes).

You then have the hybrid approach of gVisor, still containers, but smaller attack surface than the Linux kernel.

Of course this hardening approach can (and should be) applied to VMs too, which may tip the balance back to them, which is one reason that firecracker looks so interesting.

That is one big difference, with big implications.

Just wanted to clarify that Kubernetes was still scheduling containers, even if VMs are being used to isolate them.

It's not all or nothing either. Containerd will support running a mix of containers and kata-containers across workers.

For anyone interested in this topic I wrote about some other container runtimes here: https://kubedex.com/kubernetes-container-runtimes/

Note that this is not the case for virtual kubelet-based implementations, and your point here and above are specific to how Kata works (the article is talking more generally).

Yeah.. I think I see what you're saying. I mean, the end-user interface is pretty much the same, but this has big implications for the system design anyway, so it's not a small thing :)

While the use of namespaces seemed a bit discounted in the article, I think part of the problem is that it's an underutilized and poorly conveyed feature, but maybe that's just me? I'd like to see a better out-of-the-box vanilla K8S user management interface in the dashboard. Maybe the default UX on cluster creation is a user/service account that isn't cluster admin and is limited to 1 or more namespaces. There should be a better dashboard to configure and create users that is front and center when you first create a cluster. You should have to work a bit to get something you can use to authenticate with a cluster-admin-like role. This should help direct people towards creating more users and isolated namespaces.


User admin and the reliance on client-cert authentication is one of the biggest weaknesses I see in k8s security at the moment.

There are obviously other options like OIDC available, but it can be tricky to set up and isn't on by default, so instead client certs are used for user auth, and given the lack of certificate revocation, they're really not suited for that.

Agree, I'm not sure what the author is getting at here:

"Compounding this is the fact that most Kubernetes components are not Tenant aware. Sure you have Namespaces and Pod Security Policies but the API itself is not. Nor are the internal components like the kubelet or kube-proxy. This leads to Kubernetes having a “Soft Tenancy” model."

That's like saying the Hypervisor isn't "tenant" aware.

My org is using namespaces for tenant isolation. Works quite well.

Namespaces++. Unless you’re just running one app in a Kubernetes cluster (where default == that app), namespaces solve a plethora of problems.

Resource limits, network policy boundaries, etc.
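
As one example of a namespace-scoped boundary, here is a minimal sketch of a default-deny ingress NetworkPolicy, built as a Python dict and printed as JSON. The namespace name `team-a` and the policy name are made up for illustration.

```python
import json

def default_deny_ingress(namespace):
    """Build a NetworkPolicy that blocks all ingress to every pod in the namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        # "default-deny-ingress" is an arbitrary, illustrative name
        "metadata": {"name": "default-deny-ingress", "namespace": namespace},
        "spec": {
            "podSelector": {},           # empty selector matches every pod in the namespace
            "policyTypes": ["Ingress"],  # no ingress rules listed, so all ingress is denied
        },
    }

policy = default_deny_ingress("team-a")  # hypothetical tenant namespace
print(json.dumps(policy, indent=2))
```

Tenants then add narrower policies that allow only the traffic they need. Note that this only has an effect if the cluster's network plugin enforces NetworkPolicy.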

That’s what we designed them for :)

I do think we’ve not explored enough of the per namespace policy stuff though - i’d like both podpreset and a reasonably simple scheduling policy (toleration + node selector control to replace the annotation based system) to make it in, as well as a simpler namespace initialization path so you can more easily lock down the contents of a namespace without having to proxy the create namespace API call.

By podpreset are you referring to something like poddisruptionbudget?

Because it'd be neat to define that on a per namespace basis.

PodPreset went alpha with service catalog but hasn’t made it out to beta yet. It makes certain forms of injection / rules easier (you must use a standard log dir, you should use the provided HTTP_PROXY vars, etc). https://kubernetes.io/docs/concepts/workloads/pods/podpreset...

Being able to limit a user to only being able to edit one pod preset or scheduling policy (via rbac name access) would provide some useful flexibility for splitting control between admin and namespace user.

Okay, I get your point.

I'm still living in the world where operating a platform for the benefit of a set of developers entails building and operating a set of services that abstracts the details of the infrastructure sufficiently that these things don't matter.

Also gVisor.

I get the multi-tenancy argument, for certain use cases. I'm not sure I understand the point about greater resource utilization. Presumably they mean workloads can be scheduled more densely given a set of hardware resources, vs. containers... but I'd think that containers would score better on that metric than VMs. Can someone expound?

The best part of the piece for me was "kubesprawl," my new favorite word for the week. We've seen it ourselves to some extent, but we are at least aware of it and try to exert some pressure in the other direction. Beyond that I am not particularly bothered by the idea of running lots of clusters for different purposes.

I'm picking the eyes out of the article here but I had to twitch at this ...

> Linux containers were not built to be secure isolated sandboxes (like Solaris Zones or FreeBSD Jails). Instead they’re built upon a shared kernel model ...

Solaris Zones and BSD Jails both use a shared kernel.

And I'd bet you they were far from perfect in security isolation.

Now it may be true that security wasn't Linux containers' prime reason for being but we have an existence proof that they can be made secure enough -- anyone can get a trial Openshift container for the price of a login.

As 2018 comes to a close it's time to drag out the hubris and make a bold prediction. The future of Kubernetes is Virtual Machines, not Containers.

I’d say that’s less of a prediction than a matter of fact since it’s already happened in 2018 for AWS and GCP.

This post is kind of irrelevant for consumers imo, in that the future of Kubernetes is still the container interface, regardless of whether your vendor decides to run it in a container or a VM.

How has it already happened? It's true that EKS and GKE nodes are virtual machines, but that's entirely beside the point of this article which seems to make the point that the workloads running inside of Kubernetes will also be virtual machines.

Interesting notion, but I don't see it. Kubesprawl as it exists today is a result of the fact that Kubernetes is hard to tune for disparate workloads, which leads most folks to just punt and stand up multiple clusters.

That said, people are starting to figure it out and more tools like the vertical pod autoscaler are coming. Eventually the more efficient choice will be to run disparate workloads across the same set of hardware.

The future of Kubernetes (for some interesting use cases) is virtual machines. For the rest of us the virtual kubelet project represents a bridge between Kubernetes and Serverless.

I'd much rather pay my cloud provider to run my k8s workloads (billed by pod requests/limits) than pay for a control plane and three nodes just to run my workloads.

The nested kubernetes idea is interesting (linked to from the post). However, amazon and google aren’t using nested control planes for their infrastructure. Why is a single control plane good for them, but not for a Kubernetes deployment?

And people still wrangle with this... how can it be easier to struggle like this than learn how to use SmartOS and Triton? Kubernetes is a solution to severe OS virtualization deficiency in Linux, most notably orchestration. You know, the problem which is non-existent in SmartOS with Triton and large scale configuration management with operating system packages. Every so often this hits "Hacker News" and people will rather struggle than master something new. But one cannot polish a turd.

At the risk of tarnishing my reputation amongst the hacker news docker/kubernetes hypecycle elite, have an upvote. The I.T. industry in general is funny. New technologies come and go like pop stars. Docker == ke$ha, Kubernetes == ice cube, triton is fred astaire. They all have their off moments. I personally like my platform stable, performant, secure, and boring. If I spent all of my time keeping up on the latest trends on how to spin up machines, I'd have little time to work on actual product. Something good will come out of this influx of cash, marketing, and cloud sales, eventually. Fits and starts. /me goes back to coding and deploying on triton, while patiently watching the docker/kubernetes show.

That was a big risk you took. You have guts for standing up to the hype machinery.

I'm still fighting the sneaking suspicion that putting kubernetes/etc out to the general public and having such a fast release cycle was just a genius play by the big cloud vendors to acquire customers (who will realize running this stuff on premises isn't as cheap as they thought it was after doing the math (all the math, security, training, operational expenses, personnel training/expenses, moving from docker->moby->rkt->gvisor->firecracker->now vm expenses, blah blah etc)). This current tech wave is kind of disheartening. Everybody is focused on hosting...can we not get icecube to play a show for folks that are pushing the envelope with technology as applied to the medical field, or saving the environment, yo?

Triton on-prem is a snap. Boot the headnode from usb, boot the cluster nodes from usb+pxe, and let's get to kicking ass, fighting the good fight focusing on real groundbreaking applications.

*edit: I'm still a little butt-hurt after kubernetes being rammed down my throat in a large enterprise environment. Apologies to those that are fighting the good fight with kubernetes, I know you're out there, and big high 5 :)

I'd be fine with that if they were using LKVM (from kvmtool) or that new Firecracker. But as far as I can tell most projects in this area use QEMU still, which is an outstanding piece of software, but doesn't scream micro-vm to me.

I had the impression DevOps ppl would cling to containers as the final form of VMs?

But I'm mostly a high level front-end guy doing back-ends with serverless tech only.

I'm convinced that Kubernetes is very good for job security, and not much else. Unless you're a managed services host, you should probably not be running it.

Please, for the love of all that is holy, use a cloud services provider if you need K8s-style service features. If you don't, then just cobble together your infrastructure in the simplest way possible that uses DevOps principles, methods and practices.

Could you elaborate a bit?

K8s is a very complex system. And the more complex the requirements, the more complex one needs to make K8s through additional software that isn't baked in. Complex systems are costly to run, but more importantly, they're costly to maintain due to the typical level of service required, the number of employees needed to maintain it, and the amount of specialized knowledge required. The system is also under constant maintenance due to its short release cycle. Basically, you need to build an entire cloud services team just to keep it running smoothly. (not for "test labs", but for real production services) And on top of all this, if you're running it on your own hardware, you don't even get the benefit of reduced infrastructure costs.

Because this is not only hard to get right, but very costly, it is much cheaper and easier to pay someone to do all this for you. It is almost guaranteed that doing it yourself will not give you any significant advantage, cost savings, or increased development velocity.

On top of this, most people don't even need k8s. K8s is a containerized microservice mesh network. If you don't need containers and you aren't running microservices, you may be trying to fit a square peg in a round hole. Even if you did need k8s, the benefit may be small if you don't have complex requirements.

Most people can get high-quality, reliable end results with simple, general-purpose solutions using DevOps principles and tools. If you're not Google or Facebook, you probably just need immutable infrastructure-as-code, monitoring, logging, continuous integration/deployment, and maybe autoscaling. You don't need an orchestration framework to deliver all that. And by going with less complex implementations, it will be easier and more cost-effective to maintain.

At the end of the day, if you need k8s, use it. But I really worry about most people who hop on the k8s bandwagon because they see a lot of HN posts about it, or because Google touts it.

TLDR: Kubernetes is an anti-pattern.
