"""For example, the PID namespace makes it so that a process can only see PIDs in its own namespace, and therefore cannot send kill signals to random processes on the host."""
Weird. You can't kill a process that's not running as your user, even if you can see it. That's not the point of PID namespacing, and I hate seeing non-existent security rationale backported to unrelated concepts.
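For what it's worth, the visibility behavior itself is easy to demonstrate. A minimal Go sketch (assuming a Linux host and root, since creating a PID namespace needs CAP_SYS_ADMIN):

    package main

    import (
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        // The child shell becomes PID 1 inside the new PID namespace,
        // so $$ prints 1. A real container runtime also remounts /proc
        // so tools like ps only see namespace-local processes.
        cmd := exec.Command("/bin/sh", "-c", "echo pid inside namespace: $$")
        cmd.SysProcAttr = &syscall.SysProcAttr{Cloneflags: syscall.CLONE_NEWPID}
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }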
Are your hosts running SELinux in enforcing mode? It wasn't clear to me what methods were used to implement MAC; the focus of the article seemed to be on capabilities.
Kata Containers sidestep this issue by running Kubernetes containers as lightweight VMs. Same for Firecracker. Did the team consider microVM solutions?
Yes. The work that Intel has done on Kata Containers is great. It seems like AWS, Fly.io, and others are leveraging Firecracker. AWS delivering bare metal machines has been a pretty big game changer in the cloud space.
I'm not sure how much easier the MicroVM model would be to deal with. I think it would be great to add the additional layer of security that gVisor and co. have. There are also benefits to being able to run a special kernel per workload. On the other hand, you do gain some complexity by having to run a separate kernel per workload, and by having to inject all the sidecars (SSH, secret manager, etc.) into the VM.
From a performance perspective, there are a lot of unknowns. Up until relatively recently, virtio-fs did not exist, so you had to carve off a block device. You can't do things like share a page cache and dedupe inodes and dentries AFAIK with MicroVMs. Then, on top of that, there are still security boundaries we need to enforce, like network filtering. If you do this with VMs, you need to do it at the packet layer versus the connection layer, which is much more expensive. There are others...
Lastly, it's a question of maturity. "Full-blown" VMs are very mature; containers, on the other hand, have been maturing for upwards of a decade now. We'll probably explore MicroVMs further in the future, but for now, Linux containers are the tool of choice.
> You can't do things like share a page cache and dedupe inodes and dentries AFAIK with MicroVM
I work on several large Kubernetes clusters, each running diverse jobs, and this would be a huge BENEFIT. The Linux virtual memory (VM) subsystem regularly locks up under heavy contention (e.g. one process performing a ridiculous amount of local, buffered I/O, dirtying a ton of pages, while another process struggles to page its executable code back in), leaving random processes frozen for seconds or minutes. (The kernel's hung-task watchdog fires after 120s by default, but when things are really bad I've seen processes frozen for much longer.) It's a persistent source of alerts and complaints from people. It was much worse last year (2019) owing to some kernel regressions, but it's beginning to creep back up as the clusters see more production use, so I'm not looking forward to 2021.
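For reference, that 120s threshold is just a sysctl; a quick sketch to read it off a node (standard procfs path):

    package main

    import (
        "fmt"
        "os"
        "strings"
    )

    func main() {
        // The "task blocked for more than N seconds" warnings come from
        // the hung-task watchdog; this is its threshold (120 by default).
        b, err := os.ReadFile("/proc/sys/kernel/hung_task_timeout_secs")
        if err != nil {
            panic(err)
        }
        fmt.Printf("hung_task_timeout_secs = %s\n", strings.TrimSpace(string(b)))
    }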
Kubernetes pods rarely need to share filesystems between themselves--at least not for heavy usage--and if they do, it's usually via NFS or something similar. So MicroVMs really only leave you with the fixed cost of duplicating kernel data structures, which, given the amount of memory on most nodes, isn't that much. And there's performance potential here precisely because processes in different VMs aren't contending for as many of the same locks, etc. The tradeoffs very much mirror those of multi-process vs. multi-threaded and monolithic vs. microservice architectures.
Realistically, though, I can see some performance degradation due to the indirection needed for accessing local storage, except where you can make use of hardware passthrough. But I'll take a few percentage points of runtime cost over frozen processes.
Linux just isn't as robust in these heavy multi-tenant environments. That's what operating systems like Solaris excelled at. It's kind of ironic, but not really, because that's typically how these things play out, unfortunately :( People are attracted by the ease and low cost of running popular Linux-based solutions, but then try to bend them to run the kinds of workloads and architectures the older systems were designed and optimized for. MicroVM architectures are better suited to Linux's strong points.
And that's all before we ever consider security. User namespaces are just a band-aid over the problem of Linux security. The real issue is the endless parade of kernel exploits. That's where seccomp comes in. But expecting SREs or even developers to properly seccomp-jail all the myriad containers deployed--homegrown and especially third-party--is not realistic. The whole conceit of containers is to lower the barrier of entry to writing and deploying applications at scale while pretending we can still protect people from themselves. If people couldn't figure out how to run multi-tenant before containers (that is, leverage traditional APIs, the process model, and the networking stack), how could we ever expect them to do any better with even more complex infrastructure? We already tried all that with SELinux. seccomp is better than SELinux because it puts the application developer in control, and they are best positioned to know when, where, and how privileges are needed (assuming they know at all). Moving that work back to, effectively, the system administrator isn't going to turn out any better than SELinux did, or than present-day container security has, for that matter.
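To make the seccomp point concrete, here's a rough Go sketch of a developer-installed allow-list using the libseccomp-golang bindings. The syscall list is illustrative only (the Go runtime alone needs far more than these four), so treat it as a shape, not a working policy:

    package main

    import (
        "fmt"

        seccomp "github.com/seccomp/libseccomp-golang"
    )

    func main() {
        // Default action: fail anything not on the allow-list with EPERM (1).
        filter, err := seccomp.NewFilter(seccomp.ActErrno.SetReturnCode(1))
        if err != nil {
            panic(err)
        }
        // Illustrative allow-list; a real service needs an audited set,
        // and the Go runtime itself needs many more syscalls than this.
        for _, name := range []string{"read", "write", "exit_group", "futex"} {
            sc, err := seccomp.GetSyscallFromName(name)
            if err != nil {
                panic(err)
            }
            if err := filter.AddRule(sc, seccomp.ActAllow); err != nil {
                panic(err)
            }
        }
        if err := filter.Load(); err != nil {
            panic(err)
        }
        fmt.Println("filter loaded; only allow-listed syscalls succeed now")
    }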
Have you shared bug reports / reproductions with LKML? I think Facebook was doing a bunch of work in this space. There were a bunch of improvements in 5.9, especially with cgroup2.
Thanks for offering to answer questions! I have a question (though I haven't finished reading yet, so maybe the answer is in there). The article starts by saying that Titus is your orchestration engine but later says you're moving to Kubernetes. Are you completely moving off of Titus, or running Titus within/parallel to K8s?
The general idea is that we're adopting parts of Kubernetes where we can, and working to extend Kubernetes in ways that integrate well with the approach Netflix has taken to Compute. We plan to bring those approaches to the Kubernetes community over time, where it makes sense for the broader ecosystem. This work on leveraging user namespaces is a good example: while it grew out of Netflix's needs, it's something we would like to see the community benefit from as well.
This isn't my area of expertise (the cluster orchestration system), but I can answer to the best of my abilities: Titus today is a system that sits on top of Kubernetes and uses Kubernetes components to do its thing, but we've substituted many of those systems with our own. For example, closer to my area of knowledge, we've used our own executor/provider along with the Virtual Kubelet project (https://github.com/virtual-kubelet/virtual-kubelet) instead of the stock Kubelet.
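To give a rough idea of what that looks like: a Virtual Kubelet provider is essentially a type satisfying the node package's PodLifecycleHandler interface. A stub sketch (the titusStyleProvider name and in-memory store are made up for illustration; a real provider is far more involved):

    // Package provider sketches the shape of a Virtual Kubelet backend;
    // the method set mirrors the PodLifecycleHandler interface from
    // github.com/virtual-kubelet/virtual-kubelet/node.
    package provider

    import (
        "context"

        corev1 "k8s.io/api/core/v1"
    )

    // titusStyleProvider is a made-up name; the in-memory map stands in
    // for whatever actually launches and tracks workloads.
    type titusStyleProvider struct {
        pods map[string]*corev1.Pod
    }

    func key(namespace, name string) string { return namespace + "/" + name }

    func (p *titusStyleProvider) CreatePod(ctx context.Context, pod *corev1.Pod) error {
        p.pods[key(pod.Namespace, pod.Name)] = pod // launch the real workload here
        return nil
    }

    func (p *titusStyleProvider) UpdatePod(ctx context.Context, pod *corev1.Pod) error {
        p.pods[key(pod.Namespace, pod.Name)] = pod
        return nil
    }

    func (p *titusStyleProvider) DeletePod(ctx context.Context, pod *corev1.Pod) error {
        delete(p.pods, key(pod.Namespace, pod.Name))
        return nil
    }

    func (p *titusStyleProvider) GetPod(ctx context.Context, namespace, name string) (*corev1.Pod, error) {
        return p.pods[key(namespace, name)], nil
    }

    func (p *titusStyleProvider) GetPodStatus(ctx context.Context, namespace, name string) (*corev1.PodStatus, error) {
        if pod := p.pods[key(namespace, name)]; pod != nil {
            return &pod.Status, nil
        }
        return nil, nil
    }

    func (p *titusStyleProvider) GetPods(ctx context.Context) ([]*corev1.Pod, error) {
        out := make([]*corev1.Pod, 0, len(p.pods))
        for _, pod := range p.pods {
            out = append(out, pod)
        }
        return out, nil
    }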
We're exploring where we can leverage the Kubernetes ecosystem, adapt components, or contribute changes back that others can use, so that we can rely on more COTS components of Kubernetes.
tl;dr: We're swapping out the engines while in flight
You have to think about it like this: the average skill level of engineers at a large company will always trend toward the true average across all engineers outside the company. This means they have engineers who don't know what they're doing, and there's not much they can do to prevent it. The average "security" skill level is very, very low, and even people who are good at it make huge mistakes constantly.
If you accept that, then it makes sense to spend time and money on preventing people who don't know what they're doing from hurting everyone else. It is therefore essential that mitigations like this are applied, even though they would not be necessary if everyone did their job perfectly.
I work on this team. We host Netflix compute that, while "internal", processes requests from Netflix users and the internet. We use industry-standard frameworks and technologies in our workloads. All software ends up having security incidents. We have this lower-level security to protect ourselves when the higher-level technology has temporary security problems.
That wouldn't have been these user namespaces, as they were first released much later, in Linux 3.8. I think the user namespace was the last one to land.
I think vserver used a completely different set of functionality. I never really used it, but didn't it involve its own custom kernel modules?
Linux VServer wasn't/isn't kernel modules, but a patchset that directly modifies the kernel. It works based on so-called contexts: a process in a context cannot see processes in other contexts (at all), with the exception of context 0, the "root context", or if you disable PID isolation. VServer was a competing implementation to what evolved into LXC, and as such it implemented similar concepts, such as PID namespacing, pretty early, using its own code. The resemblance is only superficial, though: VServer used a single global PID namespace, but individual guest contexts could not "see" PIDs outside of their own context.
VServer is dead. A long-time maintainer occasionally posts patches for newer kernels to the mailing list, if he feels like it. -> http://list.linux-vserver.org/
It wasn't clear to me how using user namespaces helped prevent the CVEs. Docker is already using them and it didn't protect them. Did I miss something in the design that's different than Docker?
The trouble is that Docker does not enable user namespaces by default, and that's what left it exposed to these CVEs. A lot of integrations (like the examples of secrets and sidecars) do not work properly in conjunction with user namespaces, and tend to require modification. We did the work to make these integrations function, and created a model (injecting processes into the container) in order to create this clear boundary layer.
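Concretely, turning user namespaces on means mapping container-root to an unprivileged host uid. A minimal Go sketch (host uid 100000 is an arbitrary example; mapping arbitrary host uids requires privilege or the newuidmap/newgidmap helpers):

    package main

    import (
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        // "id" reports uid 0 inside the namespace, but on the host the
        // process runs as uid/gid 100000; a compromise of container-root
        // lands as that unprivileged user.
        cmd := exec.Command("/bin/sh", "-c", "id")
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUSER,
            UidMappings: []syscall.SysProcIDMap{
                {ContainerID: 0, HostID: 100000, Size: 1},
            },
            GidMappings: []syscall.SysProcIDMap{
                {ContainerID: 0, HostID: 100000, Size: 1},
            },
        }
        cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }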
Many people use Docker with Kubernetes. Unfortunately, the Kubernetes Kubelet does not work with Docker and user namespaces (https://github.com/kubernetes/enhancements/issues/127). There is still work being done in this area by folks like Kinvolk.