New Linux vulnerability affecting cgroups: can containers escape? (paloaltonetworks.com)
123 points by zelivans on March 4, 2022 | 82 comments



Back in the day, people insisted that containers were not security boundaries and should not be treated as such. They're meant to contain things from going off the rails unintentionally, but an actual threat was another story.

However, realistically, given the environment that a container gives you, it certainly looks and feels like a security boundary. So are we just going to be stuck in this retroactive security cleanup mode forever? My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place? If not, is there any realistic way to go from where we are to where we should be?

The only other design I'm familiar with that sort of comes close is MicroVMs. Those have the downside of actually needing to run a VM, though, and most (all?) cloud providers don't allow nested virtualization, so you're stuck running on an enormous bare-metal box.


I don't think the industry is moving towards deepening dependence on container/jail interfaces for multitenant workloads --- virtualization has gotten incredibly cheap. So these issues are mostly problems for internal data center segregation and blast radius reduction. It's not nothing, they're important security problems, but unless you're doing something dubious, they shouldn't be existentially important.

There are AWS and GCP instance types with nested virtualization that'll let you run Firecracker. Digital Ocean apparently supports it everywhere.


Slightly pedantic: ec2 doesn't actually support nested virtualization on any instance type I know of, but does have baremetal instance types that support virtualization.

The reason I mention this is because, sadly, baremetal instance types are only ever the largest size of a given family which is cost prohibitive for most users. And even if cost isn't an issue, they take much much longer to start (like 10-20+ minutes) and they actually fail to start far too frequently. It's really a shame that all instance types other than baremetal have virtualization extensions disabled, otherwise we'd be operating far more workloads in firecracker or kata. We operate huge kubernetes clusters so the cost is roughly the same whether it's fewer big instances or more smaller instances, but those startup times and reliability are terrible for autoscaling.

Please, AWS, bring nested virtualization to all nitro instance types!


You can run https://gvisor.dev/ without any virtualization requirement. We use this to host user-submitted configurations (not arbitrary code, but arbitrary input to ~mostly trusted code).

Does this not meet your requirements?


gvisor is awesome and works for particularly untrusted applications, but it's not a performance hit we'd be willing to take across the board, and it effectively only protects you from security bugs rather than other kernel issues. We run thousands of production database workloads, hundreds of load balancers, thousands of web apps, ML jobs, batch processing, etc. in kubernetes, most of which require as much performance as possible.

When an EBS volume for a pod goes impaired, if it's using xfs you can basically count the whole server as dead no matter how many xfs + block io timeouts you set. xfs will stop being able to mount/unmount any other filesystems once hung in an unmount call for one. With a proper VM, you'd passthrough the nvme device with pcie passthrough and the host would be totally unimpacted.

Also, gvisor's better mode requires kvm, but it's cool that it effectively functions with ptrace when you can't use kvm.


> My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place? If not, is there any realistic way to go from where we are to where we should be?

Yes, because systems that are designed with these kinds of security boundaries in mind already look like containers -- they're a natural match to actual capability-based systems like, for example, plan9's.

The problem here stems entirely from trying to keep these globally-overriding capabilities like CAP_SYS_ADMIN and CAP_DAC_OVERRIDE while also allowing users to create their own namespaces. None of these CVEs were an issue as long as only root could create new user namespaces; now that normal users can, all the places where things weren't checked are coming out of the woodwork.

But a ground-up capability-based system avoids this kind of problem by simply making it impossible to elevate to a privilege level like 'root' on POSIX systems, and so namespacing within those systems is so natural that it didn't really get a name (containers) until one was needed for Linux's cognitive dissonance around the idea.
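To make the "globally-overriding capabilities" concrete: a task's effective capability set shows up as a bitmask in /proc/&lt;pid&gt;/status (the CapEff line). A minimal Python sketch that decodes a few of those bits (the table is an illustrative subset of the values in &lt;linux/capability.h&gt;, and the helper name is mine):

```python
# Decode a Linux capability bitmask, as shown in the CapEff line of
# /proc/<pid>/status.  Bit numbers come from <linux/capability.h>;
# only a handful are listed here for illustration.
CAP_BITS = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    12: "CAP_NET_ADMIN",
    21: "CAP_SYS_ADMIN",
}

def decode_caps(mask):
    """Return the names of the set capability bits we know about."""
    return [name for bit, name in sorted(CAP_BITS.items()) if mask & (1 << bit)]

# A task holding only CAP_DAC_OVERRIDE and CAP_SYS_ADMIN:
print(decode_caps((1 << 1) | (1 << 21)))  # ['CAP_DAC_OVERRIDE', 'CAP_SYS_ADMIN']
```

Note that inside a fresh user namespace a process can show a full CapEff, but those capabilities are only meaningful over objects owned by that namespace. The boundary of exactly which objects that covers is what these CVEs get wrong.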


You're confusing capabilities systems. Linux capabilities are not "capabilities", they're a misnomer. They're just groupings of privileges.

Here is what capabilities are.

https://en.wikipedia.org/wiki/Capability-based_security

I don't think what you're advocating for makes a ton of sense tbh. You're basically saying "just make it impossible to privesc", which, yeah, that would be nice... but it's not like you can just do that.

I think your point is more that least privilege should be more common - that way exploits have less impact. I agree. That said, Linux Capabilities are extremely coarse, and most container escapes involve owning the Kernel, which from a real Capabilities model would be the trusted broker of capabilities to begin with.


I am not accusing linux of having a real capability system, so nope, I'm not confusing them at all. I'm honestly not sure where you got me saying that it does; my comment is a criticism of linux (or really POSIX) and its lack of true capabilities.

Also, I used plan9 as an example for a reason. The kernel is quite hands off about capabilities in general in plan9, and is definitely not the primary source of trust in the system beyond the fact that a kernel is always a central trust node (some userspace processes like factotum and the authentication server do the real work and hold secure information).

There are systems out there that "just make it impossible to privesc", so it is possible. It's just not really possible within POSIX, because POSIX is built around the ability to escalate.


OK, I apologize - that was my misunderstanding, and I should have worded it as "I think you're confusing" rather than accusatory. I wouldn't hold it against anyone to do so - the naming collision is unfortunate and has been a source of confusion for as long as it has existed.


Oh yeah it is absolutely confusing, and I think it's done real harm to the concept to have it misused in linux so badly.


What do you think about Fuchsia ? It's fully capability-based: https://fuchsia.dev/fuchsia-src/concepts/components/v2/capab...


I'm not sure a year has gone by without a vulnerability that breaks shared-kernel isolation in reasonable configurations. Nobody was going to DAC or MAC out `waitid`, but for a time `waitid` would accept a kernel address for its siginfo_t parameter.


I didn't mean to imply that there'd never been any kind of "container escape" vuln before userns creation was opened up; just that the "create userns, escape with magic privs" kind was new, and largely because of that change.

(I do think the change will be a net good in the long run, because rootless docker is probably a net improvement, but I think maybe it would have also been a good opportunity to reconsider how they inherit these global capabilities)


It's not a binary thing. I would say something is a boundary if it requires an additional vulnerability to bypass. Containers these days fit that model. The nuance is how strong of a boundary it is.

Containers rely on the Linux kernel. The Linux kernel is shit, in terms of security, for a number of reasons. So all one requires is to own the kernel, and there are a lot of ways to do that. Containers block some system calls and can lower attack surface to a degree, which is great - I think it's a huge win that containers are so popular and, finally, some degree of isolation is widespread.

We'll be stuck in retroactive security mode until developers care to change that, especially ones with influence like kernel maintainers.

> My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place?

Absolutely not. We'd have ended up with something like Firecracker or GVisor. The issues with containers are fundamental to the concept of having a shared Linux kernel, which is basically what makes a container a container.

> If not, is there any realistic way to go from where we are to where we should be?

Use Firecracker or GVisor.

> Those have the downside of actually needing to run a VM though

I think at this point VMs are not that big of a deal. It's clearly good enough for the vast majority of people who are running on the cloud.

> don't allow nested virtualization so you're stuck running on an enormous bare metal box.

This part is a bummer.

The other option though is to just not care if your OS gets owned. Split your services up, move capabilities across other boundaries like mTLS.


gvisor doesn't require nested virtualization, right? If you're willing to take a tenable user-mode-Linux performance hit, you should be able to run it on anything?


My understanding is that gvisor supports two modes of execution - one with virtualization and one without. AFAIK the official recommendation is to use the one with virtualization, but I've never dug into it.


Yeah, the original mode uses ptrace to intercept system calls, and then just implements the system call itself.


I'll quote Theo de Raadt here; he was talking about virtualization, but I would guess the same could be said of containers:

You are absolutely deluded, if not stupid, if you think that a worldwide collection of software engineers who can't write operating systems or applications without security holes, can then turn around and suddenly write virtualization layers without security holes


Who was he referring to?


No one in particular. He's saying there are no perfect developers so no hypervisors will ever be perfectly secure.


Which is a silly statement, because for all X, no X will ever be perfectly secure. That's why we have multiple layers available and containers and VMs are just one of them.


> My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place?

It might have looked like FreeBSD jails or Illumos / Solaris Zones. Both of which are containers designed as a security boundary from the start.


I'm here to push back on the fabled security powers of ground-up security-focused shared-kernel isolation. People love to bring up Zones and Jails in these conversations, presumably since both are much more coherent designs than Linux namespaces, MAC, BPF and cgroups, which are now comparably (if not more) featureful, but shambolic and hard to reason about. But none of these systems are sufficient for multitenant isolation. It would not be OK to rely on Zones for a major multitenant compute workload.


> But none of these systems are sufficient for multitenant isolation. It would not be OK to rely on Zones for a major multitenant compute workload.

You can definitely run hostile workloads securely in zones next to each other. Joyent ran a public cloud on zones and there are still smaller cloud providers who do.

In the Sun Solaris days zones were even certified for a bunch of high profile security certifications (if you care about such things).


And Joyent had problems doing that:

https://news.ycombinator.com/item?id=27078349

There's nothing you can do to "certify" zones to mitigate this. The problem is that zone cotenants share a kernel. You have to trust that the kernel attack surface is free of LPEs, and no reasonable person can trust that.


I don't see how zone-escape bugs and the like are necessarily proof that the concept doesn't work.

Chrome also has had its fair share of sandbox escapes and zero-click remote code execution exploits. Does that mean you can't have a browser? By those standards, if even Google can't get it right, us "mere mortal developers" might as well quit altogether.

> The problem is that zone cotenants share a kernel.

Even with a "hardware" VM they share a kernel (it's just called a hypervisor). And while they share that kernel to a lesser extent, there are also VM escapes. The VMware and KVM security advisories are a testament to that.


The Chrome sandbox would also be problematic for these workloads, for similar reasons! The point of isolated kernels is to foreclose on whole large classes of vulnerabilities. The problem of shared-kernel isolation is that you opt into them.

In the status quo ante of Firecracker, there were colorable arguments that hypervisors had comparably large attack surfaces to containers and jails and zones. But that's mostly out the window now: you can write a mostly memory-safe hypervisor and give it a tiny attack surface by providing only minimal support for virtio devices --- the big challenge with legacy hypervisor stacks is that they were designed to support things like desktop Windows, rather than being scoped down to serverside Linux.


Or HP-UX vaults grown out of Tru64.


> Back in the day, people insisted that containers were not security boundaries and should not be treated as such. They're meant to contain things from going off the rails unintentionally, but an actual threat was another story.

> However, realistically, given the env that a container gives you, it certainly looks and feels like a security boundary.

It has to be secure. Browsers use pretty much the same technologies (seccomp-bpf, cgroups, namespaces, etc.) to tightly sandbox Javascript from websites. Browsers run wildly untrusted code from all over the web and are expected to withstand many forms of malware without letting them escape the sandbox.

If containers can't be made secure, we have bigger problems.

> So are we just going to be stuck in this retroactive security cleanup mode forever? My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place?

No! Linux and Unix APIs are a patchwork mess. They are pretty much insecure by default, with rare exceptions.

We could make a new platform with a saner API, make it run on top of Linux, and write new backend services targeting it. I think WASI may just be that. The only problem is that wasm has some overhead and doesn't have access to all CPU features.


I think Unikernel VMs are the future. Build your app into one blob, with no user/kernel-space boundary, that runs in a guest VM. No boot time, no wasted memory, no latency (context-switch) issues.

That said, even VM are best-effort security boundaries, then apparmor/selinux type restrictions put in place on the host should be the main hard security boundary IMO.


Good luck debugging that.


Shouldn't need luck. It wouldn't use qemu or vmware, but a specialized VM manager that interfaces with the unikernel via network/virtual hardware and exposes a virtual file system to it. For example, the app would call "read()", but instead of glibc wrapping a syscall, a compiled-in wrapper would ask the hypervisor to "read()"; the hypervisor would just memcpy() from a file opened at virtual boot, rather than asking a kernel to read the file and send the data back, avoiding the context switch (just send the request and wait for the interrupt).

The file system, networking, and security need to be abstracted in a way that is ideal for performance and introspection, specifically for a unikernel built to interface with that abstraction.


> They're meant to contain things from going off the rails unintentionally, but an actual threat was another story.

I disagree with that idea. The actual threat may be as limited in capabilities as a standard bug. Let's say you have a problem with your webapp where an attacker can read an arbitrary file, but nothing else. Containers are a perfect protection in this case if you want to isolate the app from any other services running on the host (monitoring, provisioning, etc.).

There's no perfection and defence in depth is what we need to use everywhere. Unless you can break through all layers at the same time, imperfect layers are a valid improvement. See how many default protections you have to turn off to even make this bug viable.



At least some of the Azure series support nested virtualization. See https://docs.microsoft.com/en-us/azure/virtual-machines/dv4-.... There are a lot of series and I don’t know the breakdown but I would expect dsv4 to be one of the more widely used options because it is for generic CPU workloads.


> My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place?

Yes. The only difference is that the Linux-based systems and tools, as opposed to Zones or Jails, were the first to pivot to a developer-focused view rather than the sysadmin's. This utility is why containers gained critical mass, not because the security-focused foundations of the other implementations were an impediment.


When developer productivity comes before sysadmins, that is when security goes south, as history has proven on desktop systems.


With Spectre we discovered that not even VMs are adequate security boundaries.

My opinion: want security? Separate (bare metal) machines. Period.


...and in the spirit of the parent comment, Intel didn't intend for protected mode to be a security boundary either. The 286 and 386 programming manuals referred to the protections as a form of reducing the severity of bugs.


How are they not a security boundary? Nearly everything is a security boundary using defense in depth no?


Security boundaries in Linux are UIDs/GIDs, capabilities, SELinux domains, and others. These can be applied to processes regardless of whether the process runs in a container.

i.e. root inside a container is root on the host; the container itself doesn't help that. But other security features, that are applied to the processes within the container when the container is created, might.
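One concrete way to see whether "root inside" is root on the host is /proc/self/uid_map. A rough sketch that parses it (the helper name is mine; the sample maps are typical of default vs. remapped/rootless setups):

```python
def maps_container_root_to_host_root(uid_map_text):
    """Given the contents of /proc/<pid>/uid_map, return True if uid 0
    inside the namespace maps to uid 0 on the host ("root is root")."""
    for line in uid_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        inside, outside, count = (int(f) for f in fields)
        if inside <= 0 < inside + count:
            # uid 0 falls inside this extent; translate it to the host uid.
            return outside + (0 - inside) == 0
    return False

# Default (no user namespace remapping): identity map.
print(maps_container_root_to_host_root("0 0 4294967295"))  # True
# Rootless/remapped: uid 0 inside is uid 100000 on the host.
print(maps_container_root_to_host_root("0 100000 65536"))  # False
```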


Important note on this:

"Fortunately, the default security hardenings in most container environments are enough to prevent container escape. Containers running with AppArmor or SELinux are protected. "

So, all that hard work on SELinux continues to pay off.


Sadly, many answers to questions related to selinux issues, or howto's start with: Disable selinux.


Things may have changed, but the last few times I looked, it was breathtakingly hard to a) identify if/when selinux is what's screwing you, and then b) get selinux to stop it.

I really wanted an audit mode that could also say "this command will unlock the specific thing I just blocked".

That was a few years ago. Since then, I've turned off selinux whenever I'm getting screwed by some opaque process, stuff starts working, and closing it back down while leaving what I need open remains impossible black magic.


Is audit2allow the thing you want?


Probably yes, but audit2allow is very hard to reason about. You can run it and hopefully it will enable you to allow the things you want to allow without also allowing things you didn't want.

Red Hat doesn't seem to have any interest in making SELinux more accessible than programming in assembly. The UX for the tooling around SELinux is an absolute dumpster fire.


On server environment that command is most of the time not installed by default.

Quick! tell me which package I need to install to get audit2allow on a system; without using Google, dnf whatprovides, or repoquery --whatprovides.

I'm still baffled why such an essential tool for quickly assessing violations and potential selinux-boolean quick fixes is hidden in a package with an obscure name. I think the setroubleshoot family of tools might be installed by default on some systems, even if most answers will guide people to just use audit2allow.


I recall taking a stab at audit2allow a few years ago, and finding that it was incredibly opaque and felt like practising dark arts.

At this point, it's probably true that I should get onboard the SELinux train and learn it properly, but it's just... ain't nobody got time for that.


I believe this is considered one of the best videos: https://www.youtube.com/watch?v=_WOKRaM-HI4


I strongly believe that software that works for users is better than software that doesn't, and it's clear that for most lay folks, SELinux is software that doesn't work.

SELinux remains inscrutable and unusable to the lay person. Microsoft had the same problem with Windows XP, especially after Service Pack 2 introduced the Windows Firewall: it was difficult to debug, and applications didn't prompt to open ports or have an API to do so. So many a lay person posted on forums: "disable firewall".

Users don't care why their tools don't work; they don't understand why, or how to fix it. Technically complex SELinux audit tutorials are not helpful. There needs to be real, genuine attention to user experience: an almost tutorial-like CLI command, something so simple anyone could safely make a program run. Whether that program is itself safe is another question, and users should be told that too.


> it's clear that for most lay folks, SELinux is software that doesn't work.

I have always used selinux enabled systems. For the first few years it was a bit confusing and frustrating at times, but for the last (decade?) I have never had to butt heads with it. The default policies shipped by e.g. Fedora (a userland closest to the development of this work and therefore probably better maintained than some others) work out of the box without hassle.

This very article refutes your assertion: here we see SELinux working for ordinary users without any additional fiddling. You, on the other hand, are probably exposed to this privilege escalation.


That's not my assertion, my assertion is that SELinux doesn't work for a lot of people even if it works for you or I; and that's why you see the advice to disable it in forum posts.

To be clear: SELinux is an important mitigation - just like the Windows Firewall - and one should not disable either.


I disagree. The advice to disable SELinux, like your assertion that it's too complicated for ordinary users, belongs to an older time. It's time to lay that myth to bed.

Sure, if you're messing around with k8s and doing fun eBPF stuff you are going to need to be careful. But for just installing an OS, running it to do some web-browsing, gaming, image editing, wordprocessing? I would be highly surprised if the defaults do not work.


> The advice to disable SELinux... time to lay that myth to bed.

I think we agree, and Fedora / Red Hat have done great work setting up great defaults.

But when a user encounters an issue with SELinux, the lack of feedback mechanisms to help them onto a better path results in them finding that advice.


Fedora literally gives you a notification and you can take action

(Me a as novice Linux user)


That's fantastic for Fedora desktop users. I don't expect you'd know, but is there a way to get the same quality of information via a CLI command?


grep denied /var/log/audit/*

On some systems the avc violations also get printed in dmesg.

If violations block your whole system from even running, you can enable permissive mode, this only logs violations without enforcing them.

As others already mentioned, turning violation logs into allow rules can be done with audit2allow. Wouldn’t recommend blindly using that though as the generated rules are always either too narrow or too wide, just use it as a guideline.
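To illustrate the grep-the-audit-log step: a rough Python sketch that pulls the useful fields out of an AVC denial line. The sample line is made up, but follows the usual audit.log shape; the field names (comm, scontext, tcontext, tclass) are the real ones.

```python
import re

# Extract the interesting fields from an AVC denial line as found in
# /var/log/audit/audit.log (or dmesg on some systems).
AVC_RE = re.compile(
    r'avc:\s+denied\s+\{ (?P<perms>[^}]+)\}'
    r'.*?comm="(?P<comm>[^"]+)"'
    r'.*?scontext=(?P<scontext>\S+)'
    r'.*?tcontext=(?P<tcontext>\S+)'
    r'.*?tclass=(?P<tclass>\S+)'
)

def parse_avc(line):
    """Return a dict of fields from an AVC denial line, or None."""
    m = AVC_RE.search(line)
    return m.groupdict() if m else None

sample = ('type=AVC msg=audit(1646400000.123:456): avc:  denied  { read } for '
          'pid=1234 comm="nginx" name="index.html" '
          'scontext=system_u:system_r:httpd_t:s0 '
          'tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file')
print(parse_avc(sample))
```

Knowing which permission was denied, on which target class, and between which contexts is usually enough to judge whether an audit2allow-generated rule is too wide.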


FYI, "ausearch -i -ts recent -m avc" gives you SELinux violations from the last 10 minutes in slightly more readable form.


I think there is, but it's been a long time since I had issues with SELinux. To be honest I have no idea how the GUI works; I do everything with the CLI.

From the top of my head, I don't know. But this might help:

https://wiki.archlinux.org/title/SELinux


One of the best things on Android is having SELinux and seccomp enabled.


I'd argue the vast majority of Linux desktop users (already a small group) don't use SELinux. So naturally when trying to help someone using something we don't have experience with and don't find necessary, that advice becomes more prevalent.


Which is an excellent indicator that the following advice is bad.


Why is the Linux community full of horrible advice?


Often the issue is that what was decent advice five years ago can become horrible advice later, but the formerly decent advice is already moderated to the top on Stack Overflow and Reddit and is the first hit on a Google search, while the approach that people should be using now just isn't found.

You'll often find horribly complex multi-step instructions for How to Do X ranked more highly than simple instructions about how to use a new interface to do everything in one step, because there was a window of time when the complex instructions were required, everyone was so grateful for them and upvoted them.


you are also safe if you are not running (EDIT: inside) the container as root, which is a common security practice for containers nowadays.
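For Docker images, dropping root usually amounts to creating a user and switching to it; a minimal sketch (the base image and user name here are arbitrary):

```dockerfile
FROM alpine:3.15
# Create an unprivileged user and make it the default for the container.
RUN adduser -D -u 10001 app
USER app
```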


This style of writing sucks, and the abuse of the meaningless term "container" does nothing to clear it up. To reduce this CVE to one sentence: a process running in the top-level control group, which has the ability to create user namespaces, can take over the machine, because the kernel fails to check for CAP_SYS_ADMIN.

See how easy that was?


You kind of missed the key to the whole thing here, though, which is that users are able to create userns' now by default. This is really important to understanding this and the last few container escape CVEs.

The article doesn't do much better on that front, but it is in there at least.


Isn't the whole purpose of this style of writing to define terms like "top level control group" and "CAP_SYS_ADMIN" for those people who don't already understand what they mean?


The article doesn't do that. It throws around jargon without defining it, or defining it vaguely or inaccurately.


Wow, all that long article and no mention of the versions affected. Fortunately it is mentioned in the Red Hat bugzilla for the CVE, which states that it is fixed in stable kernel v5.16.6. I assume it is also fixed in the other stable kernels released at the same time: 5.15.20, 5.10.97, and 5.4.177.

sources: https://bugzilla.redhat.com/show_bug.cgi?id=2051505 https://lwn.net/Articles/883949/


Couldn’t you protect against this sort of thing by using disposable VMs to host the containers? Sure, it would be an extra layer of resources, but it would double the complexity of the attack required to breach the physical node.


Correct on both counts; you can, and it hurts performance / resource use. There's also intermediate options like gvisor. In practice, the performance issues mean that most people don't bother.


I consider cgroups an administrative layer, not a security layer. They're for keeping apps from accidentally blowing up the host, not to prevent them hacking it. If you want security with containers, use Firecracker.


The article is maddening in its generic “update to the latest kernel” advice and not listing the specific kernel versions fixed. They are:

- Stable: 5.16.6

- LTS (for Alpine Linux): 5.15.20

Alpine 3.15 main is currently at 5.15.16 and thus vulnerable.
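If it helps, a quick sketch of checking a `uname -r` string against those fixed releases (the helper names are mine; series not in the table are reported not vulnerable, which is only correct for the branches listed):

```python
def parse_kver(v):
    """'5.15.16' or '5.15.16-0-lts' -> (5, 15, 16)."""
    return tuple(int(p) for p in v.split("-")[0].split(".")[:3])

# First fixed release on each affected stable/LTS series, per the
# Red Hat bugzilla / stable-release announcements quoted above.
FIXED = {(5, 16): (5, 16, 6), (5, 15): (5, 15, 20),
         (5, 10): (5, 10, 97), (5, 4): (5, 4, 177)}

def vulnerable(running):
    """True if `running` is on a listed series but below its fixed release."""
    kv = parse_kver(running)
    fixed = FIXED.get(kv[:2])
    return fixed is not None and kv < fixed

print(vulnerable("5.15.16"))  # True  (Alpine 3.15 main at the time)
print(vulnerable("5.15.20"))  # False (first fixed 5.15 release)
```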


Podman and other container tools are now using user namespaces by default. I think it is clear there are some extra precautions needed, but ultimately the goal with running rootless containers is to improve security.


Podman also works fine rootless and with cgroups2, double win.


Does it support docker-compose?


Yes, there is a podman-compose package as well.


Actually, it supports docker-compose proper (v1 and v2, even if there are some bugs for v2)


“Container” seems to be used throughout to mean “Docker container”. There are other types of containers.


I think container escape is well understood by most to mean (for Linux) escaping cgroups and/or the stack most folks use (containerd, Docker). It's a generic but useful term, like VM escape, even though there are many kinds of virtual machine managers and hypervisors.


The other reply alluded to this, but to make it explicit: nothing about this CVE requires docker. It looks like you should be able to do it with a few syscalls in any process, starting with a call to unshare(), unless something else (like selinux) gets in your way.
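A Linux-only sketch of just that first step: checking whether the kernel will let an unprivileged process unshare into a new user namespace at all. This only creates the namespace; it is not the exploit.

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # value from <linux/sched.h>

def unprivileged_userns_allowed():
    """Fork a child and see whether a plain unshare(CLONE_NEWUSER) succeeds,
    i.e. whether this kernel permits unprivileged user-namespace creation."""
    libc = ctypes.CDLL(None, use_errno=True)
    pid = os.fork()
    if pid == 0:
        # Try in the child so the parent's namespaces are untouched.
        os._exit(0 if libc.unshare(CLONE_NEWUSER) == 0 else 1)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0

print(unprivileged_userns_allowed())
```

On distros that restrict this (or inside sandboxes that seccomp-filter unshare), the call fails with EPERM and that whole class of "create userns, get magic privs" escapes is off the table.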



