However, realistically, given the env that a container gives you, it certainly looks and feels like a security boundary. So are we just going to be stuck in this retroactive security cleanup mode forever? My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place? If not, is there any realistic way to go from where we are to where we should be?
The only other design I'm familiar with that sort of comes close is MicroVMs. Those have the downside of actually needing to run a VM though, and most (all?) cloud providers don't allow nested virtualization, so you're stuck running on an enormous bare metal box.
There are AWS and GCP instance types with nested virtualization that'll let you run Firecracker. Digital Ocean apparently supports it everywhere.
The reason I mention this is that, sadly, bare-metal instance types are only ever the largest size of a given family, which is cost-prohibitive for most users. And even if cost isn't an issue, they take much, much longer to start (like 10-20+ minutes) and they fail to start far too frequently. It's really a shame that all instance types other than bare metal have virtualization extensions disabled, otherwise we'd be running far more workloads in Firecracker or Kata. We operate huge Kubernetes clusters, so the cost is roughly the same whether it's fewer big instances or more smaller instances, but those startup times and reliability are terrible for autoscaling.
Please, AWS, bring nested virtualization to all nitro instance types!
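For what it's worth, a quick way to check whether a given instance actually exposes what Firecracker or Kata need (rough sketch, assuming x86; the CPU flag is "svm" on AMD):

    # >0 means the virtualization extensions are visible on this instance
    grep -c -E 'vmx|svm' /proc/cpuinfo
    # Firecracker and Kata both need this device to exist and be usable
    ls -l /dev/kvm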
Does this not meet your requirements?
When an EBS volume for a pod becomes impaired, if it's using XFS you can basically count the whole server as dead, no matter how many XFS and block I/O timeouts you set. Once XFS hangs in an unmount call for one filesystem, it stops being able to mount or unmount any others. With a proper VM, you'd pass the NVMe device through with PCIe passthrough and the host would be totally unimpacted.
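Roughly the kind of knobs I mean (a sketch; exact sysfs paths vary by kernel, and nvme1n1 is just an example device):

    # NVMe I/O timeout is a module-wide parameter (seconds), not per-volume
    cat /sys/module/nvme_core/parameters/io_timeout
    # Per-filesystem XFS error-handling knobs (newer kernels)
    cat /sys/fs/xfs/nvme1n1/error/metadata/EIO/max_retries
    cat /sys/fs/xfs/nvme1n1/error/fail_at_unmount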
Also, gVisor's better mode requires KVM, but it's cool that it still functions effectively with ptrace when you can't use KVM.
Yes, because systems that are designed with these kinds of security boundaries in mind already look like containers -- they're a natural match to actual capability-based systems like, for example, plan9's.
The problem here stems entirely from trying to keep these globally-overriding capabilities like CAP_SYS_ADMIN and CAP_DAC_OVERRIDE while also allowing users to create their own namespaces. None of these CVEs existed as long as only root could create new user namespaces; now that normal users can, all the areas where things weren't checked are coming out of the woodwork.
But a ground-up capability-based system avoids this kind of problem by simply making it impossible to elevate to a privilege level like 'root' on POSIX systems, and so namespacing within those systems is incredibly natural, to the point that it didn't really get a name (containers) until one was needed for Linux's cognitive dissonance around the idea.
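You can see the awkwardness on any kernel that allows unprivileged user namespaces (quick sketch):

    # An ordinary user becomes "root" with a full capability set -- but only
    # inside the new namespace; the shared kernel attack surface is the problem.
    unshare --user --map-root-user sh -c 'id; grep CapEff /proc/self/status'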
Here is what capabilities are.
I don't think what you're advocating for makes a ton of sense tbh. You're basically saying "just make it impossible to privesc", which, yeah, that would be nice... but it's not like you can just do that.
I think your point is more that least privilege should be more common - that way exploits have less impact. I agree. That said, Linux Capabilities are extremely coarse, and most container escapes involve owning the kernel, which in a real capabilities model would be the trusted broker of capabilities to begin with.
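In practice, "least privilege" for containers mostly means dropping those coarse capabilities wholesale, something roughly like this (a sketch; nginx is just a stand-in image):

    # Drop everything, add back only what the workload actually needs,
    # and forbid gaining privileges via setuid binaries.
    docker run --rm \
      --cap-drop=ALL --cap-add=NET_BIND_SERVICE \
      --security-opt no-new-privileges:true \
      nginx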
Also, I used plan9 as an example for a reason. The kernel is quite hands off about capabilities in general in plan9, and is definitely not the primary source of trust in the system beyond the fact that a kernel is always a central trust node (some userspace processes like factotum and the authentication server do the real work and hold secure information).
There are systems out there that "just make it impossible to privesc", so it is possible. It's just not really possible within POSIX, because POSIX is built around it.
(I do think the change will be a net good in the long run, because rootless docker is probably a net improvement, but I think maybe it would have also been a good opportunity to reconsider how they inherit these global capabilities)
Containers rely on the Linux kernel. The Linux kernel is shit, in terms of security, for a number of reasons. So all an attacker needs is to own the kernel, and there are a lot of ways to do that. Containers block some system calls and can lower the attack surface to a degree, which is great - I think it's a huge win that containers are so popular and that, finally, some degree of isolation is widespread.
We'll be stuck in retroactive security mode until developers care to change that, especially ones with influence like kernel maintainers.
> My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place?
Absolutely not. We'd have ended up with something like Firecracker or GVisor. The issues with containers are fundamental to the concept of having a shared Linux kernel, which is basically what makes a container a container.
> If not, is there any realistic way to go from where we are to where we should be?
Use Firecracker or GVisor.
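gVisor in particular is a fairly small lift if you're already on Docker (rough sketch, assuming runsc is installed and registered as a runtime):

    # The "kernel" the workload talks to is gVisor's user-space Sentry, not the host
    docker run --rm --runtime=runsc alpine dmesg | head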
> Those have the downside of actually needing to run a VM though
I think at this point VMs are not that big of a deal. It's clearly good enough for the vast majority of people who are running on the cloud.
> don't allow nested virtualization so you're stuck running on an enormous bare metal box.
This part is a bummer.
The other option though is to just not care if your OS gets owned. Split your services up, move capabilities across other boundaries like mTLS.
You are absolutely deluded, if not stupid, if you think that a worldwide collection of software engineers who can't write operating systems or applications without security holes can then turn around and suddenly write virtualization layers without security holes.
It might have looked like FreeBSD jails or Illumos / Solaris Zones. Both of which are containers designed as a security boundary from the start.
You can definitely run hostile workloads securely in zones next to each other. Joyent ran a public cloud on zones and there are still smaller cloud providers who do.
In the Sun Solaris days zones were even certified for a bunch of high profile security certifications (if you care about such things).
There's nothing you can do to "certify" zones to mitigate this. The problem is that zone cotenants share a kernel. You have to trust that the kernel attack surface is free of LPEs, and no reasonable person can trust that.
Chrome has also had its fair share of sandbox escapes and zero-click remote code execution exploits. Does that mean you can't have a browser? By those standards, if even Google can't get it right, us "mere mortal developers" might as well quit altogether.
> The problem is that zone cotenants share a kernel.
Even with a "hardware" VM they share a kernel (it's just called a hypervisor). And while they share that kernel to a lesser extent, there are also VM escapes. The VMware and KVM security advisories are a testimony to that.
In the status quo ante of Firecracker, there were colorable arguments that hypervisors had comparably large attack surfaces to containers and jails and zones. But that's mostly out the window now: you can write a mostly memory-safe hypervisor and give it a tiny attack surface by providing only minimal support for virtio devices --- the big challenge with legacy hypervisor stacks is that they were designed to support things like desktop Windows, rather than being scoped down to serverside Linux.
> However, realistically, given the env that a container gives you, it certainly looks and feels like a security boundary.
If containers can't be made secure, we have bigger problems.
> So are we just going to be stuck in this retroactive security cleanup mode forever? My point is that if it were designed from the ground up with the hard security boundary in mind, would we have ended up with containers in the first place?
No! Linux and Unix APIs are a mess of patchworks. They are pretty much insecure by default, with rare exceptions.
We could make a new platform with a saner API, make it run on top of Linux, and write new backend services targeting it. I think WASI may be just that. The only problem is that wasm has some overhead and doesn't have access to all CPU features.
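The appealing part is the capability model: a WASI module only sees what the host explicitly grants it. With wasmtime, for example (a sketch; server.wasm is a made-up module name):

    # The module only sees the directory it was explicitly pre-opened with;
    # there is no ambient filesystem or network access.
    wasmtime run --dir=/srv/app-data server.wasm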
That said, even VMs are best-effort security boundaries; AppArmor/SELinux-type restrictions put in place on the host should be the main hard security boundary, IMO.
Filesystem, networking and security need to be abstracted in a way that is ideal for performance and introspection, specifically for a unikernel built to interface with the abstraction.
I disagree with that idea. The actual attack may be as limited in capabilities as a standard bug. Let's say you have a problem with your webapp where you can read an arbitrary file, but nothing else. Containers are a perfect protection in this case if you want to isolate the app from any other services running on the host (monitoring, provisioning, etc.).
There's no perfection and defence in depth is what we need to use everywhere. Unless you can break through all layers at the same time, imperfect layers are a valid improvement. See how many default protections you have to turn off to even make this bug viable.
Yes. The only difference is that the Linux-based systems and tools, as opposed to Zones or Jails, were the first to pivot to a developer-focused view rather than that of the sysadmin. This utility is why containers gained critical mass, not because the security-focused foundations of the other implementations were an impediment.
My opinion: want security? Separate (bare metal) machines. Period.
i.e. root inside a container is root on the host; the container itself doesn't help with that. But other security features that are applied to the processes within the container when the container is created might.
"Fortunately, the default security hardenings in most container environments are enough to prevent container escape. Containers running with AppArmor or SELinux are protected. "
So, all that hard work on SELinux continues to pay off.
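It's easy to check whether that protection is actually in play on a given host (quick sketch for an SELinux distro like Fedora/RHEL):

    getenforce                   # should print "Enforcing"
    ps -eZ | grep container_t    # container processes confined to the container_t domain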
I really wanted an audit mode that could also say "this command will unlock the specific thing I just blocked".
That was a few years ago. Since then, whenever I'm getting screwed by some opaque process, I turn off SELinux, stuff starts working, and locking it back down while leaving what I need open remains impossible black magic.
Red Hat doesn't seem to have any interest in making SELinux more accessible than programming in assembly. The UX for the tooling around SELinux is an absolute dumpster fire.
Quick! tell me which package I need to install to get audit2allow on a system; without using Google, dnf whatprovides, or repoquery --whatprovides.
I'm still baffled why such an essential tool for quickly assessing violations and potential SELinux boolean quick fixes is part of a package with an obfuscated name. I think some setroubleshoot family of tools might be installed by default on some systems, even if most answers will guide people to just use audit2allow.
At this point, it's probably true that I should get onboard the SELinux train and learn it properly, but it's just... ain't nobody got time for that.
SELinux remains inscrutable and unusable to the lay person. Microsoft had the same problem with Windows XP, especially after Service Pack 2 introduced the Windows Firewall: it was difficult to debug, and applications didn't prompt to open ports or have an API to do so. So many a lay person posted on forums "disable firewall".
Users don't care why their tools don't work; they don't understand why, or how to fix it. Technically complex SELinux audit tutorials are not helpful. There needs to be real, genuine attention to user experience: an almost tutorial-like CLI command, something so simple anyone could safely make a program run. Whether that program is itself safe is another question, and users should be told that too.
I have always used selinux enabled systems. For the first few years it was a bit confusing and frustrating at times, but for the last (decade?) I have never had to butt heads with it. The default policies shipped by e.g. Fedora (a userland closest to the development of this work and therefore probably better maintained than some others) work out of the box without hassle.
This very article refutes your assertion: here we see SELinux working for ordinary users without any additional fiddling. You, on the other hand, are probably exposed to this privilege escalation.
To be clear: SELinux is an important mitigation - just like the Windows Firewall - and one should not disable either.
Sure, if you're messing around with k8s and doing fun eBPF stuff you are going to need to be careful. But for just installing an OS, running it to do some web-browsing, gaming, image editing, wordprocessing? I would be highly surprised if the defaults do not work.
I think we agree, and Fedora / Red Hat have done great work setting up great defaults.
But when a user encounters an issue with SELinux, the lack of feedback mechanisms to help them onto a better path results in them finding that advice.
(Me, as a novice Linux user)
On some systems the avc violations also get printed in dmesg.
If violations block your whole system from even running, you can enable permissive mode, which only logs violations without enforcing them.
As others already mentioned, turning violation logs into allow rules can be done with audit2allow. Wouldn’t recommend blindly using that though as the generated rules are always either too narrow or too wide, just use it as a guideline.
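For reference, the loop usually looks roughly like this (a sketch; "mylocal" is an arbitrary module name, and read the generated .te file before installing it):

    # Review recent denials, then turn them into a local policy module
    ausearch -m avc -ts recent
    ausearch -m avc -ts recent | audit2allow -M mylocal
    semodule -i mylocal.pp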
From the top of my head, I don't know. But this might help:
You'll often find horribly complex multi-step instructions for How to Do X ranked more highly than simple instructions about how to use a new interface to do everything in one step, because there was a window of time when the complex instructions were required, everyone was so grateful for them and upvoted them.
See how easy that was?
The article doesn't do much better on that front, but it is in there at least.
- Stable: 5.16.6
- LTS (for Alpine Linux): 5.15.20
Alpine 3.15 main is currently at 5.15.16 and thus vulnerable.