Please check out the bulletin and update if you are using one of these services.
It is definitely possible to prevent a task running in ECS from getting root access to the host. If there is something missing that you feel we need to add to ECS to better enable this, please reach out and let me know!
Here's a link to our security bulletin with more information and upgrade procedures: https://cloud.google.com/kubernetes-engine/docs/security-bul...
However, Google and Sysdig announced a partnership around Falco and GCSCC integration. It would make sense for such a tool to be able to run on COS.
Perhaps I'm guilty of wanting to have my cake and eat it, too. But this seems like an area where GKE and COS are somewhat limited.
So, Falco will work on GKE and COS.
In other words, this won't affect anyone who understands the implications of running a process as root. Unfortunately, the sad truth is that most people I've come across who have "lots of experience" with implementing Docker containers do not even understand the basics of how they work, let alone the implications of root access. I've interviewed candidates who claim to know Docker but can't even tell me how Docker differs from traditional virtualisation or how it achieves its isolation. The best explanation most of them come up with is, "Docker containers are more lightweight".
This sort of vulnerability should have been a non-issue, but it has gained attention due to the sheer number of incorrectly configured containers in the wild. This was an accident waiting to happen, and I doubt we've heard the last of this sort of thing.
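For the record, the "isolation" in question is just kernel namespaces plus cgroups, not virtualization, and you can see it with nothing but /proc (no Docker needed):

```shell
# Docker's isolation is kernel namespaces + cgroups; there is no guest
# kernel anywhere, unlike a VM. Each process's namespace memberships
# are plainly visible under /proc:
readlink /proc/self/ns/pid /proc/self/ns/net /proc/self/ns/mnt
# Processes in the same container share these namespace ids; the host
# and other containers see different ones. cgroups (/sys/fs/cgroup)
# then cap resource usage on top of that.
```

That's also why a kernel bug is game over for containers in a way it isn't for hardware VMs: every container shares the one host kernel.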
We looked at this at $work and got into a serious rabbit hole about how exactly Linux capabilities work. I think if I started asking interviewees to explain permitted vs. effective capability sets and how file and process capabilities differ, I'd never hire anyone. (And I think to figure out yourself how to "correctly configure" a container, you need to have at least some understanding of that.)
> We looked at this at $work and got into a serious rabbit hole about how exactly Linux capabilities work. I think if I started asking interviewees to explain permitted vs. effective capability sets and how file and process capabilities differ, I'd never hire anyone. (And I think to figure out yourself how to "correctly configure" a container, you need to have at least some understanding of that.)
I think you've hit the nail on the head. I only ask those interview questions because I believe it's important to find out just how much a candidate understands. I have to admit I've let some of these things go, otherwise I'd never hire anyone either. I think in the end what it comes down to is that Docker is an ambitious project that is somewhat flawed from a security perspective. There have been numerous namespace vulnerabilities to date and I expect there will be plenty more found in the future.
A week ago Docker gained support for running dockerd as non-root: https://github.com/moby/moby/pull/38050
And there is this project for running Kubernetes as non-root: https://github.com/rootless-containers/usernetes
I disagree. It is definitely "safer" than running as root on the host -- but that shouldn't be the baseline. Honestly, I would argue that not using user namespaces (or non-root inside the container) is basically negligence at this point. Yes, it's annoying to do it with Docker, but other runtimes have solved this problem.
LXC actually explicitly states that privileged (non-userns) containers are fundamentally unsafe and I agree. When you look at the wide array of ns_capable and other userns checks that protect against all sorts of attacks, you really start to not trust anything that doesn't use user namespaces. Kernel developers assume that container runtimes are using user namespaces if they are trying to secure something.
Additionally, yes capabilities help. But you still have traditional Unix DAC issues (which is what is leveraged here).
> and this vulnerability is notable precisely because it's one of the rare ways to exercise that root privilege outside the container.
I would argue there's several very foundational security problems (which I'm trying to fix) that are made significantly worse by running a container process as root. Please don't do it.
We run Kubernetes + Docker with a policy of no root inside containers (we map you to your normal UID inside the container), but most of the rabbit hole we got into was trying to figure out the implications of setuid binaries inside the container. It seems like on a normal system, setuid binaries inside a container do in fact get host root unless you tell Docker to drop those capabilities, and also fscaps aren't really usable in a container image, I think.
As for setuid and fscaps, they both work in containers (container images can contain them, though some filesystems, such as AUFS, don't support the necessary xattrs). And yes, without user namespaces, they escalate to host root. You can use no_new_privs, which blocks things like setuid and fscap escalation, but it can also cause problems (though I think you can enable no_new_privs in Kubernetes).
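You can see no_new_privs in action without Docker at all (setpriv is part of util-linux; the image name in the Docker line is just a placeholder):

```shell
# Set the no_new_privs bit and show it took effect. Once set, the flag
# is inherited by every child and can never be cleared, so setuid
# binaries and fscaps can no longer elevate anything this tree execs.
setpriv --no-new-privs grep NoNewPrivs /proc/self/status

# The Docker equivalent (some-image is a placeholder):
#   docker run --security-opt no-new-privileges --cap-drop ALL some-image
```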
User namespaces force this.
It's not just about not running as root, it's making sure the container cannot map to the real UID 0.
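Concretely, you can check that mapping from inside any container:

```shell
# Each line of uid_map is: <uid-in-this-ns> <uid-in-parent-ns> <range>.
# "0 0 4294967295" means uid 0 in the container IS real uid 0 on the
# host, i.e. no user namespace remapping is in effect. With remapping,
# you'd see something like "0 100000 65536" instead.
cat /proc/self/uid_map
```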
I'd say Docker is a rather spiffed-up version of BSD chroot jails, with a repository and more than just filesystem isolation. Jails for all the things.
Probably just as wrong as the "lightweight virtualization" but I found it interesting that people claiming to have experience with Docker would immediately compare it with virtualization instead of jails. Not enough BSD?
RH CVE page, with the vulnerability’s metrics and the list of RH packages affected (plus links to the errata pages that have details on fixed builds): https://access.redhat.com/security/cve/cve-2019-5736
I wonder how many of those 4,000 docker daemons are running/managing containers of dubious origin.
It's a drop-in replacement for runc. With Kata Containers, it runs Docker containers in a lightweight VM, so you get all the security benefits of a VM. The downsides are slightly slower container start-up times, and it might not work in nested virtualized environments.
gVisor is used behind Go 1.11 on App Engine so Google must be fairly confident that it's a sufficient security boundary though I'm fairly sure they don't use the public KVM isolation so YMMV.
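For anyone curious, wiring an alternative runtime into Docker is just a daemon config change; the binary path below is a typical install location, so treat it as an assumption for your setup:

```shell
# Register kata as an additional runtime in /etc/docker/daemon.json.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "runtimes": {
    "kata-runtime": { "path": "/usr/bin/kata-runtime" }
  }
}
EOF
sudo systemctl restart docker
# Then opt in per container:
#   docker run --runtime kata-runtime alpine uname -r
# (uname -r will report the guest VM's kernel, not the host's.)
```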
Some more info: https://www.freebsd.org/doc/handbook/jails.html
And for quick start: https://github.com/iocage/iocage
The upthread recommendation was using hardware VM technology, which is a fundamentally different isolation model from what software can provide and (at least in theory) makes that kind of exploit impossible. And while there are tradeoffs with everything, for you to throw that argument out due to personal platform loyalty is really, really bad advice.
Look, this isn't about whether jails are secure containers or not. I'm sure they're great. It's that responding to "if you want more isolation, try hardware virtualization" with "FreeBSD is just better because 19 years!" is not really engaging with the argument as framed.
(from a technical perspective, you would be running jails for years too - so much for platform loyalty)
Jails haven't been used to protect as many high-value targets as Linux containers have. This is not a comment on the technical quality of jails. It may well be a comment on the world's anti-FreeBSD prejudice. But either way it's still true, and that means the 19 years of existence didn't magically harden the product.
This is not true in my experience at all. It may be true that it hasn't been in use at startups until Docker came out, but a few large, established companies I've worked at absolutely used Jails or Zones to protect their most valuable IP. And have been for a long time.
The distinction here is that people are running containers in the cloud and also often running untrusted code (e.g. vendor software, random exciting open-source things) inside containers, and collocating those with high-value targets in other containers. And large, established companies are doing that now just as much as startups are.
It took a massive marketing campaign to get people to use Docker and realize it made their life easier, so something like iocage would need the same push. (Also, nobody wants to start adopting additional OSes unless absolutely necessary)
Any kind of isolation is just icing on the cake.
If I have misunderstood jails and it's immune to kernel exploits please do correct me.
And Ubuntu’s: https://people.canonical.com/~ubuntu-security/cve/2019/CVE-2...
Personally, I like these vs. RHEL's, since all the info is on one page.
git clone https://salsa.debian.org/security-tracker-team/security-tracker.git
git clone https://git.launchpad.net/ubuntu-cve-tracker
Please patch if you don't 100% trust all users on your host.
If you are using SELinux, verify your containers are running as container_t. If not, verify you are using user namespaces that don't map host root into the container user's namespace. These should mitigate the issue.
(as far as trust goes, just don't trust any local users. there's too many ways to privesc on Linux, and SELinux is the only thing that stops most of them)
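A quick sketch of those two checks (the docker exec line assumes you substitute a real container name):

```shell
# SELinux route: container processes should be confined as container_t.
ps -eZ | grep -F container_t

# User namespace route: inside each container, /proc/1/uid_map must NOT
# map container uid 0 to host uid 0 (i.e. no "0 0 ..." line).
#   docker exec <container> cat /proc/1/uid_map
```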
There is a distinct lack of knowledge about how to manage a system in the container ecosystem.
If I am understanding the CVE correctly, you need to be able to launch privileged containers with an attacker-controlled image where the container user is root and not namespaced (i.e., the same root as the outside root user). How is this not "on the wrong side of an airtight hatch"?
Am I missing something here? If you can start privileged containers, why not just execute evil.exe directly?
I think my OS is on a read-only filesystem, though, and maybe I've got it namespaced correctly as well, but it's still pretty dangerous.
In my experience unprivileged containers work for most tasks, but there is breakage in some areas. Usually the issues are simple to resolve, like disabling OOM adjustments in systemd or changing the idmap range in winbind to be within the namespace allotment.
IIUC, using Docker's userns-remap would protect against this CVE by making the containers run unprivileged (container's id 0 != host's id 0) and should generally be the industry's best practice.
What's needed is user namespaces set up such that UIDs in the container are not mapped to any in-use UIDs on the host.
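A minimal sketch of enabling that in Docker ("default" makes dockerd create and manage a dockremap user, with subordinate ranges taken from /etc/subuid and /etc/subgid):

```shell
# Remap container root to an unprivileged host UID range.
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "userns-remap": "default"
}
EOF
sudo systemctl restart docker
# Verify from inside any container: /proc/1/uid_map should show
# container uid 0 mapped to a high host uid, not to host uid 0.
```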
Also, I would bet that freshly written C code has about 1 RCE bug per 100 LoC. This patch is 236 LoC, so probably about 2.36 RCEs.
Even if you leave a codebase alone in the sense that you only fix security bugs, you’ll end up with a slow trickle that never quite ends. There are the RCEs that get reported plus a bunch that don’t. So, if you:
- project into the future, assuming that if there still has been a trickle of bugs being found then more bugs will also still be found in the future.
- take into account that there are some number of unreported bugs. Maybe for every reported one there is one that isn’t. Dunno the ratio there.
Put all that together and it’s not hard to imagine a 1RCE/100LoC rate.
But still I’m kinda joking. But only slightly. Maybe if I had a way to bet money on this and it was a testable bet (it’s not because of the unreported RCEs) then I’d throw some cash down.
My first attempts at this patch used Go code but it wasn't possible to protect against all cases. Doing it in C was the only way to do it.
Source, Google talk at Linux Kernel Summit 2018.
(Fish in a Barrel, LLC is a nonexistent security research "company" consisting of people setting up fuzzers on the weekend and then proceeding to shoot fish in a barrel.)
I'll take your memory safe languages and up the ante with microkernels, which actually help immensely in all those cases. But Linux isn't going to go that route, either.
Google has been pushing the Kernel Self Protection Project for quite a while now, which Android and ChromeOS make the best use of; it's also a reason why the NDK is so constrained on Android.
Security in the kernel also had quite a few talks at linux.conf.au 2019, held just recently in New Zealand.
But taking the joke further, are you counting each release’s lines of code towards the RCEs or only new/modified code?
If you’re counting all of it, then you’re double-counting RCEs. I wouldn’t double count.
Edit: I missed the “RCE” context. Most of these are just privescs or memory disclosures.
It’s interesting that Linux kernel bug stats contradict my bold bet. But I’m imagining rando C code here, not necessarily open source, not necessarily in the kernel, not necessarily tested and reviewed the same way. This code is at least open source but I dunno to what extent this newly added code path in runc gets the kind of shaking out that makes kernel code solid.
Runc aside, I expect most C code to have a higher rate of every kind of bug than the kernel.
Any C code which is not available on the network (e.g. C code running on your refrigerator) by definition cannot have a remote code execution vulnerability.
Lots of software, such as the 'top' utility, makes no networking related calls in the codebase, so any instances of bugs would be buffer overflows or crashes, but not remotely exploitable by the usual meaning.
I think that you vastly under-estimate how difficult it is to accidentally write a remotely exploitable bug.
Sure, buffer overflows and undefined behavior happen all the time in C code. Those bugs might be 1 per 100 lines even in the average C code.
Of those, hardly any will be network-exploitable. Relatively little code handles data sourced from the network.
For example, I bet that some dude writing an image decoder in the 90's was thinking "it's cool, I don't have to worry about security" because he just knew that his code wasn't going to be remotely exploitable.
Anyway, my original comment was supposed to be as funny as your handle. I guess the humor ended up being just in how seriously folks took it.
The part I'm not joking about is that folks always underestimate the amount of security bugs that will be found in a piece of code in the future, either because the code ends up used in a way that wasn't predicted, or because some really great bug was just waiting for the right kind of genius to uncover it.