"""For example, the PID namespace makes it so that a process can only see PIDs in its own namespace, and therefore cannot send kill signals to random processes on the host."""
Weird. You can't kill a process that's not running as your user, even if you can see it. That's not the point of PID namespacing, and I hate seeing non-existent security rationale backported to unrelated concepts.
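For what it's worth, the visibility behavior itself is easy to demonstrate. A minimal Go sketch (assuming a Linux host and root, since creating a PID namespace needs CAP_SYS_ADMIN):

    package main

    import (
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        // The child shell becomes PID 1 inside the new PID namespace,
        // so $$ prints 1. A real container runtime also remounts /proc
        // so tools like ps only see namespace-local processes.
        cmd := exec.Command("/bin/sh", "-c", "echo pid inside namespace: $$")
        cmd.SysProcAttr = &syscall.SysProcAttr{Cloneflags: syscall.CLONE_NEWPID}
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }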
Are your hosts running SELinux in enforcing mode? It wasn't clear to me what methods were used to implement MAC; the focus of the article seemed to be on capabilities.
Kata Containers sidestep this issue by running Kubernetes containers as lightweight VMs. Same for Firecracker. Did the team consider microVM solutions?
Yes. The work that Intel has done on Kata Containers is great. It seems like AWS, Fly.io, and others are leveraging Firecracker. AWS delivering bare metal machines has been a pretty big game changer in the cloud space.
I'm not sure how much easier the MicroVM model would be to deal with. I think it would be great to add the additional layer of security that gVisor and co. have. There are also benefits to being able to run a special kernel per workload. On the other hand, you do gain some complexity by having to run a separate kernel per workload, and by having to inject all the sidecars (SSH, secret manager, etc.) into the VM.
From a performance perspective, there are a lot of unknowns. Up until relatively recently, virtio-fs did not exist, so you had to carve off a block device. You can't do things like share a page cache and dedupe inodes and dentries AFAIK with MicroVMs. Then, on top of that, there are still security boundaries we need to enforce, like network filtering. If you do this with VMs, you need to do it at the packet layer versus the connection layer, which is much more expensive. There are others...
Lastly, it's a question of maturity. "Full-blown" VMs are very mature; containers, on the other hand, have been maturing for upwards of a decade now. We'll probably explore MicroVMs further in the future, but for now, Linux containers are the tool of choice.
> You can't do things like share a page cache and dedupe inodes and dentries AFAIK with MicroVM
I work on several large Kubernetes clusters, each running diverse jobs, and this would be a huge BENEFIT. The Linux virtual memory (VM) subsystem regularly locks up under heavy contention (e.g. one process performing a ridiculous amount of local, buffered I/O, dirtying a ton of pages, while another process struggles to page its executable code back in), leaving random processes frozen for seconds or minutes. (The kernel's hung-task watchdog fires after 120s by default, but when things are really bad I've seen processes frozen for much longer.) It's a persistent source of alerts and complaints from people. It was much worse last year (2019) owing to some kernel regressions, but it's beginning to creep back up as the clusters see more production use, so I'm not looking forward to 2021.
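For reference, that 120s threshold is just a sysctl; a quick sketch to read it off a node (standard procfs path):

    package main

    import (
        "fmt"
        "os"
        "strings"
    )

    func main() {
        // The "task blocked for more than N seconds" warnings come from
        // the hung-task watchdog; this is its threshold (120 by default).
        b, err := os.ReadFile("/proc/sys/kernel/hung_task_timeout_secs")
        if err != nil {
            panic(err)
        }
        fmt.Printf("hung_task_timeout_secs = %s\n", strings.TrimSpace(string(b)))
    }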
Kubernetes pods rarely need to share filesystems between themselves--at least not for heavy usage--and if they do, it's usually via NFS or something similar. So MicroVMs really only leave you with the fixed cost of duplicating kernel data structures, which, given the amount of memory on most nodes, isn't that much. And there's performance potential here precisely because processes in different VMs aren't contending for as many of the same locks, etc. The tradeoffs very much mirror those of multi-process vs. multi-threaded and monolithic vs. microservice architectures.
Realistically, though, I can see some performance degradation due to the indirection needed for accessing local storage, except where you can make use of hardware passthrough. But I'll take a few percentage points of runtime cost over frozen processes.
Linux just isn't as robust in these heavy multi-tenant environments. That's what operating systems like Solaris excelled at. It's kind of ironic, but not really, because that's typically how these things play out, unfortunately :( People are attracted by the ease and low cost of running popular Linux-based solutions, but then try to bend them to run the kinds of workloads and architectures the older systems were designed and optimized for. MicroVM architectures are better suited to Linux's strong points.
And that's all before we ever consider security. User namespaces are just a band-aid over the problem of Linux security. The real issue is the endless parade of kernel exploits. That's where seccomp comes in. But expecting SREs or even developers to properly seccomp-jail all the myriad containers deployed--homegrown and especially third-party--is not realistic. The whole conceit of containers is to lower the barrier of entry to writing and deploying applications at scale while pretending we can still protect people from themselves. If people couldn't figure out how to run multi-tenant before containers (that is, leverage traditional APIs, the process model, and the networking stack), how could we ever expect them to do any better with even more complex infrastructure? We already tried all that with SELinux. seccomp is better than SELinux because it puts the application developer in control, and they are best positioned to know when, where, and how privileges are needed (assuming they know at all). Moving that work back to, effectively, the system administrator isn't going to turn out any better than SELinux did, or than present-day container security has, for that matter.
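To make the seccomp point concrete, here's a rough Go sketch of a developer-installed allow-list using the libseccomp-golang bindings. The syscall list is illustrative only (the Go runtime alone needs far more than these four), so treat it as a shape, not a working policy:

    package main

    import (
        "fmt"

        seccomp "github.com/seccomp/libseccomp-golang"
    )

    func main() {
        // Default action: fail anything not on the allow-list with EPERM (1).
        filter, err := seccomp.NewFilter(seccomp.ActErrno.SetReturnCode(1))
        if err != nil {
            panic(err)
        }
        // Illustrative allow-list; a real service needs an audited set,
        // and the Go runtime itself needs many more syscalls than this.
        for _, name := range []string{"read", "write", "exit_group", "futex"} {
            sc, err := seccomp.GetSyscallFromName(name)
            if err != nil {
                panic(err)
            }
            if err := filter.AddRule(sc, seccomp.ActAllow); err != nil {
                panic(err)
            }
        }
        if err := filter.Load(); err != nil {
            panic(err)
        }
        fmt.Println("filter loaded; only allow-listed syscalls succeed now")
    }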
Have you shared bug reports / reproductions with LKML? I think Facebook was doing a bunch of work in this space. There were a bunch of improvements in 5.9, especially with cgroup2.
Thanks for offering to answer questions! I have a question (though I haven't finished reading yet, so maybe the answer is in there). The article starts by saying that Titus is your orchestration engine but later says you're moving to Kubernetes. Are you completely moving off of Titus, or running Titus within/parallel to K8s?
The general idea is that we're adopting parts of Kubernetes where we can, and working to extend Kubernetes in ways that integrate well with the approach Netflix has taken to Compute. We plan to bring those approaches to the Kubernetes community over time, where it makes sense for the broader ecosystem. This work on leveraging user namespaces is a good example: while it grew out of Netflix's needs, it's something we would like to see the community benefit from as well.
This isn't my area of expertise (the cluster orchestration system), but I can answer to the best of my abilities: Titus today is a system that sits on top of Kubernetes and uses Kubernetes components to do its thing, but we've substituted many of those systems with our own. For example, closer to my area of knowledge, we've used our own executor/provider along with the Virtual Kubelet project (https://github.com/virtual-kubelet/virtual-kubelet) instead of the stock Kubelet.
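To give a rough idea of what that looks like: a Virtual Kubelet provider is essentially a type satisfying the node package's PodLifecycleHandler interface. A stub sketch (the titusStyleProvider name and in-memory store are made up for illustration; a real provider is far more involved):

    // Package provider sketches the shape of a Virtual Kubelet backend;
    // the method set mirrors the PodLifecycleHandler interface from
    // github.com/virtual-kubelet/virtual-kubelet/node.
    package provider

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
    )

    // titusStyleProvider is a made-up name; the in-memory map stands in
    // for whatever actually launches and tracks workloads.
    type titusStyleProvider struct {
        pods map[string]*corev1.Pod
    }

    func key(namespace, name string) string { return namespace + "/" + name }

    func (p *titusStyleProvider) CreatePod(ctx context.Context, pod *corev1.Pod) error {
        p.pods[key(pod.Namespace, pod.Name)] = pod // launch the real workload here
        return nil
    }

    func (p *titusStyleProvider) UpdatePod(ctx context.Context, pod *corev1.Pod) error {
        p.pods[key(pod.Namespace, pod.Name)] = pod
        return nil
    }

    func (p *titusStyleProvider) DeletePod(ctx context.Context, pod *corev1.Pod) error {
        delete(p.pods, key(pod.Namespace, pod.Name))
        return nil
    }

    func (p *titusStyleProvider) GetPod(ctx context.Context, namespace, name string) (*corev1.Pod, error) {
        return p.pods[key(namespace, name)], nil
    }

    func (p *titusStyleProvider) GetPodStatus(ctx context.Context, namespace, name string) (*corev1.PodStatus, error) {
        if pod := p.pods[key(namespace, name)]; pod != nil {
            return &pod.Status, nil
        }
        return nil, nil
    }

    func (p *titusStyleProvider) GetPods(ctx context.Context) ([]*corev1.Pod, error) {
        out := make([]*corev1.Pod, 0, len(p.pods))
        for _, pod := range p.pods {
            out = append(out, pod)
        }
        return out, nil
    }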
We're exploring where we can leverage the Kubernetes ecosystem, adapt components, or contribute changes back that others can use, so that we can rely on more COTS components of Kubernetes.
tl;dr: We're swapping out the engines while in flight
You have to think about it like this: the average skill level of engineers at a large company will always trend toward the true average across all engineers outside the company. This means they have engineers who don't know what they're doing, and there's not much they can do to prevent it. The average "security" skill level is very, very low, and even people who are good at it make huge mistakes constantly.
If you accept that, then it makes sense to spend time and money on preventing people who don't know what they're doing from hurting everyone else. It is therefore essential that mitigations like this are applied, even though they would not be necessary if everyone did their job perfectly.
I work on this team. We host Netflix compute that, while "internal", processes requests from Netflix users and the internet. We use industry-standard frameworks and technologies in our workloads. All software ends up having security incidents. We have this lower-level security to protect ourselves when the higher-level technology has temporary security problems.
That wouldn't have been these user namespaces, as they were first released much later, in Linux 3.8. I think the user namespace was the last one to land.
I think vserver used a completely different set of functionality. I never really used it, but didn't it involve its own custom kernel modules?
Linux VServer wasn't/isn't kernel modules, but a patchset that directly modifies the kernel. It works based on so-called contexts: a process in a context cannot see processes in other contexts (at all), with the exception of context 0, the "root context", or if you disable PID isolation. VServer was a competing implementation to what evolved into LXC, and as such it implemented similar concepts, such as PID namespacing, pretty early, using its own code. The resemblance is only superficial, though: VServer used a single global PID namespace, but individual guest contexts could not "see" PIDs outside of their own context.
VServer is dead. A long-time maintainer occasionally posts patches for newer kernels to the mailing list, if he feels like it. -> http://list.linux-vserver.org/
It wasn't clear to me how using user namespaces helped prevent the CVEs. Docker is already using them and it didn't protect them. Did I miss something in the design that's different than Docker?
The trouble is that Docker does not enable user namespaces by default, and that's what left it exposed to these CVEs. A lot of integrations (like the examples of secrets and sidecars) do not work properly in conjunction with user namespaces, and tend to require modification. We did the work to make these integrations function, and created a model (injecting processes into the container) in order to create this clear boundary layer.
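Concretely, turning user namespaces on means mapping container-root to an unprivileged host uid. A minimal Go sketch (host uid 100000 is an arbitrary example; mapping arbitrary host uids requires privilege or the newuidmap/newgidmap helpers):

    package main

    import (
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        // "id" reports uid 0 inside the namespace, but on the host the
        // process runs as uid/gid 100000; a compromise of container-root
        // lands as that unprivileged user.
        cmd := exec.Command("/bin/sh", "-c", "id")
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUSER,
            UidMappings: []syscall.SysProcIDMap{
                {ContainerID: 0, HostID: 100000, Size: 1},
            },
            GidMappings: []syscall.SysProcIDMap{
                {ContainerID: 0, HostID: 100000, Size: 1},
            },
        }
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }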
Many people use Docker with Kubernetes. Unfortunately, the Kubernetes Kubelet does not work with Docker and user namespaces (https://github.com/kubernetes/enhancements/issues/127). There is still work being done in this area by folks like Kinvolk.