
CVE-2019-5736: runc container breakout - afshinmeh
https://seclists.org/oss-sec/2019/q1/119
======
NathanKP
Amazon employee here: we have released a security bulletin covering how to
update to the latest patched Docker on Amazon Linux, Amazon ECS, Amazon EKS,
AWS Fargate, AWS IoT Greengrass, AWS Batch, AWS Elastic Beanstalk, AWS Cloud9,
AWS SageMaker, AWS RoboMaker, and AWS Deep Learning AMI.

Please check out the bulletin and update if you are using one of these
services.

[https://aws.amazon.com/security/security-
bulletins/AWS-2019-...](https://aws.amazon.com/security/security-
bulletins/AWS-2019-002/)

~~~
tristanz
As far as I understand, EKS doesn't support PodSecurityPolicy yet so any user
that can launch a pod can trivially root the host via host mounts already.
This surprisingly isn't clearly documented.

~~~
NathanKP
ECS doesn't have a top level resource called "PodSecurityPolicy" but we do
provide task level configuration options for all the major settings that you
would normally put in your pod security policy, including adding and
dropping capabilities, privileged or unprivileged mode, docker security
options for controlling SELinux or AppArmor, ulimits, sysctl settings, among
others. You can find all these configuration options and more documented here:
[https://docs.aws.amazon.com/AmazonECS/latest/developerguide/...](https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#container_definition_security)

It is definitely possible to prevent a task running in ECS from getting root
access to the host. If there is something missing that you feel we need to add
to ECS to better enable this, definitely reach out and let me know!
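
For illustration, the container-level settings mentioned above might look something like this fragment of a container definition (the field names are from the linked ECS docs; the specific values are hypothetical, not a recommendation):

```shell
# Print a sketch of an ECS container definition that drops all Linux
# capabilities except NET_BIND_SERVICE, runs unprivileged, and sets a
# Docker security option. Values are illustrative only.
taskdef='
{
  "privileged": false,
  "linuxParameters": {
    "capabilities": {
      "drop": ["ALL"],
      "add": ["NET_BIND_SERVICE"]
    }
  },
  "dockerSecurityOptions": ["no-new-privileges"]
}'
printf '%s\n' "$taskdef"
```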

~~~
tristanz
I'm referring to EKS here not ECS. EKS doesn't yet enable the
PodSecurityPolicy admission controller, so any user that can launch a pod via
EKS can root the EKS cluster regardless of RBAC rules. The main ask here is to
find a way to enable the PodSecurityPolicy admission controller so that secure
multi-user EKS clusters are possible, as they are with ECS.

------
sethvargo
Hey all - Seth from Google here. Please let us know if you have any questions
regarding GKE or questions about the upgrade process. I'm also happy to
escalate any feedback to our internal product and engineering teams.

Here's a link to our security posting with more information and upgrade
procedures: [https://cloud.google.com/kubernetes-engine/docs/security-
bul...](https://cloud.google.com/kubernetes-engine/docs/security-
bulletins#february-11-2019-runc)

~~~
sethvargo
One thing I want to emphasize: you are only affected if you're using Ubuntu
base images for your node pools. If you're using COS, you are unaffected.

~~~
numbsafari
It would be great if folks could run COS and leverage tools like Falco. I
realize there is a trade-off between having a totally locked down OS and being
able to flexibly use such tools.

However, Google and Sysdig announced a partnership around Falco and GCSCC
integration. It would make sense that such a tool would be able to be run on
COS.

Perhaps I'm guilty of wanting to have my cake and eat it, too. But this seems
like an area where GKE and COS are somewhat limited.

~~~
markstemm
Hi, Falco developer here. We do have support for running Falco with an eBPF
program taking the place of the kernel module. You can learn more about eBPF
support at
[https://github.com/draios/sysdig/wiki/eBPF](https://github.com/draios/sysdig/wiki/eBPF),
and you should be able to run Falco with eBPF by setting the environment
variable SYSDIG_EBPF_PROBE="".

~~~
numbsafari
This is awesome news! I see it’s still beta, which is probably why the Falco
docs still say GKE users must run Ubuntu images. Adding this to my tracking
list. Thanks.

------
wicket
> However, it _is_ blocked through correct use of user namespaces (where the
> host root is not mapped into the container's user namespace).

In other words, this won't affect anyone who understands the implications of
running a process as root. Unfortunately, the sad truth is that most people
I've come across who have "lots of experience" implementing Docker
containers do not even understand the basics of how they work, let alone the
implications of root access. I've interviewed candidates who claim to know
Docker but can't even tell me how Docker differs from traditional
virtualisation or how it achieves its isolation. The best explanation that
most of them come up with is, "Docker containers are more lightweight".

This sort of vulnerability should have been a non-issue but it has gained
attention due to the sheer amount of incorrectly configured containers in the
wild. This was an accident waiting to happen, and I doubt we've heard the last
of this sort of thing.

~~~
geofft
I think it's subtler than that. It is _mostly_ safe to run a contained process
as "root" because in theory the ways that root access can be exercised is
highly sandbox by the use of various namespaces, as well as things like
capability restrictions (you generally don't have CAP_SYS_ADMIN or a few
others), limited syscall attack surface (you generally have a syscall
allowlist via seccomp-bpf), etc. Yes, it's wrong to not understand that the
runc process runs as root. But I think it's only very slightly less wrong to
claim that a process inside the container has root access in the way that,
say, ssh root@host-system has root access. It mostly does not, and this
vulnerability is notable precisely because it's one of the rare ways to
exercise that root privilege outside the container.
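
A quick way to see the capability sets in question: every process reports its masks in /proc/self/status, and on a host root shell CapEff is the full mask, while inside a default Docker container it is a restricted one (runc applies an allowlist). The exact values vary by kernel and runtime version.

```shell
# Show the capability sets (permitted, effective, bounding, etc.) of the
# current process. Compare the output on the host vs. inside a container.
grep '^Cap' /proc/self/status

# To turn a hex mask into capability names, capsh from libcap can decode
# it, e.g.:  capsh --decode=00000000a80425fb
```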

We looked at this at $work and got into a serious rabbit hole about how
exactly Linux capabilities work. I think if I started asking interviewees to
explain permitted vs. effective capability sets and how file and process
capabilities differ, I'd never hire anyone. (And I think to figure out
yourself how to "correctly configure" a container, you need to have at least
some understanding of that.)

~~~
wicket
> I think it's subtler than that. It is mostly safe to run a process as "root"
> because, in theory, the ways that root access can be exercised are heavily
> sandboxed by the use of various namespaces, as well as things like capability
> restrictions (you generally don't have CAP_SYS_ADMIN or a few others),
> limited syscall attack surface (you generally have a syscall allowlist via
> seccomp-bpf), etc. Yes, it's wrong to not understand that the runc process
> runs as root. But I think it's only very slightly less wrong to claim that
> runc has root access in the way that, say, ssh root@host-system has root
> access. It mostly does not, and this vulnerability is notable in that it's
> one of the few ways to exercise that root privilege.

> We looked at this at $work and got into a serious rabbit hole about how
> exactly Linux capabilities work. I think if I started asking interviewees to
> explain permitted vs. effective capability sets and how file and process
> capabilities differ, I'd never hire anyone. (And I think to figure out
> yourself how to "correctly configure" a container, you need to have at least
> some understanding of that.)

I think you've hit the nail on the head. I only ask those interview questions
because I believe it's important to find out just how much a candidate
understands. I have to admit I've let some of these things go, otherwise I'd
never hire anyone either. I think in the end what it comes down to is that
Docker is an ambitious project that is somewhat flawed from a security
perspective. There have been numerous namespace vulnerabilities to date and I
expect there will be plenty more found in the future.

~~~
nicoburns
I believe there was work on non-root container host processes going on at some
point? Did that ever get to a usable state?

~~~
geofft
Rootless containers using user namespaces was merged into runc in March 2017:
[https://github.com/opencontainers/runc/pull/774](https://github.com/opencontainers/runc/pull/774)

A week ago Docker gained support for running dockerd as non-root:
[https://github.com/moby/moby/pull/38050](https://github.com/moby/moby/pull/38050)

And there is this project for running Kubernetes as non-root:
[https://github.com/rootless-
containers/usernetes](https://github.com/rootless-containers/usernetes)

------
gr2020
Looks like Docker 18.09.2 was released a few minutes ago to address this:
[https://github.com/docker/docker-
ce/releases](https://github.com/docker/docker-ce/releases)

~~~
el_duderino
PoC here:

[https://github.com/feexd/pocs/blob/master/CVE-2019-5736/expl...](https://github.com/feexd/pocs/blob/master/CVE-2019-5736/exploit.c)

------
CaliforniaKarl
Red Hat’s page on the vulnerability:
[https://access.redhat.com/security/vulnerabilities/runcescap...](https://access.redhat.com/security/vulnerabilities/runcescape)

RH CVE page, with the vulnerability’s metrics and the list of RH packages
affected (plus links to the errata pages that have details on fixed builds):
[https://access.redhat.com/security/cve/cve-2019-5736](https://access.redhat.com/security/cve/cve-2019-5736)

------
achillean
There are nearly 4,000 exposed Docker daemons:
[https://www.shodan.io/report/ol761bRb](https://www.shodan.io/report/ol761bRb)

~~~
nineteen999
So inexperienced developers are just as likely to reach for Docker as
experienced ones?

I wonder how many of those 4,000 docker daemons are running/managing
containers of dubious origin.

~~~
achillean
Around 10% of them are running cryptominers so it looks like there are already
people out there compromising these public Docker instances.

------
miguelmota
For better isolation check out KataContainers: [https://github.com/kata-
containers/runtime](https://github.com/kata-containers/runtime)

It's a drop-in replacement for runc. KataContainers runs Docker containers in
a lightweight VM, so you get the security benefits of a VM. The downsides are
slightly slower container start-up times, and it might not work in nested
virtualized environments.

~~~
justanother-
Alternative idea: throw away docker and katacontainers and move to freebsd,
where jails were introduced on 14 Mar 2000 (no, seriously, superior technology
has existed for 19 years: stable, time-proven, working).

Some more info:
[https://www.freebsd.org/doc/handbook/jails.html](https://www.freebsd.org/doc/handbook/jails.html)

And for quick start:
[https://github.com/iocage/iocage](https://github.com/iocage/iocage)

~~~
ajross
Jails are virtually identical technology to Linux containers from a security
point of view. They've had holes before and they likely will again, and a
breakout like this (seems like the root cause here is a writable file
descriptor to the host binary) can absolutely compromise the host system.

The upthread recommendation was using hardware VM technology, which is a
fundamentally different isolation model from what software can provide and (at
least in theory) makes that kind of exploit impossible. And while there are
tradeoffs with everything, for you to throw that argument out due to personal
platform loyalty is really, really bad advice.

~~~
int_19h
My understanding is that jails were designed as a security boundary from the
get-go, unlike containers. Wouldn't that result in code that's less likely to
be exploitable?

~~~
ajross
FWIW, "containers" aren't a thing. Namespaces, cgroups et. al. certainly were
designed with security in mind, as was docker/runc.

Look, this isn't about whether jails are secure containers or not. I'm sure
they're great. It's that responding to "if you want more isolation, try
hardware virtualization" with "FreeBSD is just better because 19 years!" is
not really engaging with the argument as framed.

------
CaliforniaKarl
Debian’s security tracker, showing the affected versions, and (when available)
the fixed versions: [https://security-
tracker.debian.org/tracker/CVE-2019-5736](https://security-
tracker.debian.org/tracker/CVE-2019-5736)

And Ubuntu’s: [https://people.canonical.com/~ubuntu-
security/cve/2019/CVE-2...](https://people.canonical.com/~ubuntu-
security/cve/2019/CVE-2019-5736.html)

Personally, I like these vs. RHEL, since all the info is on one page.

~~~
dfc
In addition to having all the information on one page, you can also:

    git clone https://salsa.debian.org/security-tracker-team/security-tracker.git
    git clone https://git.launchpad.net/ubuntu-cve-tracker

There are a lot of interesting things you can do with the data.

------
wodny
The vulnerability description seems to be lacking an explanation why the
/proc/$PID/exe symlink is so special and why using the #!/proc/self/exe
hashbang will work while using #!/usr/sbin/runc probably won't. Am I right
that the proc filesystem in proc_exe_link() fills the file_operations struct
in a way that causes open() not to go through a dereferencing procedure using
the filesystem but just open the file used to run the executable?

~~~
wodny
So I'll answer myself. Experiments suggest that this is indeed the case:
[https://www.reddit.com/r/linux/comments/apmptq/cve20195736_r...](https://www.reddit.com/r/linux/comments/apmptq/cve20195736_runc_vulnerability_enabling_container/egcc313/)
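
A small demonstration of the behavior in question: /proc/$PID/exe refers to the file that backs the running binary, independent of path lookup, which is why opening /proc/self/exe from inside the container yields the host's runc binary itself.

```shell
# Copy a binary, run the copy, and confirm that /proc/<pid>/exe points at
# the file that was actually exec'd, not the original path.
cp "$(command -v sleep)" /tmp/mysleep
/tmp/mysleep 10 &
pid=$!
sleep 1                      # give the child a moment to exec
readlink "/proc/$pid/exe"    # prints /tmp/mysleep
kill "$pid"
```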

------
darren0
The best fix is to upgrade to 18.09.2. For those that can't do that
immediately, backported versions of runc for Docker releases going back to
1.12.6 are available from Rancher at [https://github.com/rancher/runc-
cve](https://github.com/rancher/runc-cve). But please only do that as a
temporary workaround until you can properly upgrade to 18.09.2.

Please patch if you don't 100% trust all users on your host.

~~~
peterwwillis
For systems where you already enforce security policies, updating policy is
faster than upgrading software, due to fewer build processes, quality control
issues, and potential side-effects.

If you are using SELinux, verify your containers are running as _container_t_.
If not, verify you are using user namespaces that don't map host root into the
container user's namespace. These should mitigate the issue.
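
One way to verify the user-namespace side of this: /proc/&lt;pid&gt;/uid_map shows how uids inside a process's namespace map to the host. An unremapped process (host root mapped in) shows "0 0 4294967295"; a remapped container shows something like "0 100000 65536" instead (the exact range depends on your /etc/subuid configuration).

```shell
# Each line is: uid-inside-namespace  uid-on-host  range-length.
# "0 0 4294967295" means uid 0 inside IS host root -- i.e. no remapping.
cat /proc/self/uid_map
```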

(As far as trust goes, just don't trust any local users. There are too many
ways to privesc on Linux, and SELinux is the only thing that stops most of them.)

~~~
geofft
I believe that if you're in an environment where users don't have (and can't
gain) root inside the container, you're also fine, and if that's your
theoretical policy, getting to the point where you 100% enforce that might
also be easier than patching.

------
yujie1984
Mesosphere employee here. We have released the product advisory on this CVE.
Please check out the advisory and update your software.

[https://support.mesosphere.com/s/article/Known-Issue-
Contain...](https://support.mesosphere.com/s/article/Known-Issue-Container-
Runtime-Vulnerability-MSPH-2019-0003)

------
geggam
Next year this exploit will still be in thousands, if not millions, of
containers all over.

There is a distinct lack of knowledge about how to manage a system in the
container ecosystem.

~~~
ec109685
What do you mean _in_ containers?

------
deathanatos
Not that this shouldn't be patched and all, but this seems like it is being
treated with more urgency than is required.

If I am understanding the CVE correctly, you need to be able to launch
_privileged_ containers with an attacker-controlled image where the container
user is root _and_ not namespaced (i.e., the same root as the outside root
user). How is this not "on the wrong side of an airtight hatch[1]"?

Am I missing something here? If you can start privileged containers, why not
just execute evil.exe directly?

[1]:
[https://blogs.msdn.microsoft.com/oldnewthing/20060508-22/?p=...](https://blogs.msdn.microsoft.com/oldnewthing/20060508-22/?p=31283)

~~~
tinco
I run a privileged container, of which I am not the author. I also know many
other people run this container in privileged mode. No one is paying that
person for it, and if they turn malicious or get compromised at some point, we
all might get rooted when we update the image.

I think my OS is on a read-only filesystem, though, and maybe I've got it
namespaced correctly as well, but it's still pretty dangerous.

------
tyingq
Is this something that non-privileged containers mitigate? Curious what the
big barriers are to this. I know they exist, but they aren't used widely... I
assume because some functionality doesn't work.

~~~
iwalton3
Yes. The lxc commit[1] states that this issue only affects privileged
containers. No CVE has been issued for lxc because they consider privileged
containers to be insecure.

In my experience unprivileged containers work for most tasks, but there is
breakage in some areas. Usually the issues are simple to resolve, like
disabling OOM adjustments in systemd or changing the idmap range in winbind to
be within the namespace allotment.

[1]
[https://github.com/lxc/lxc/commit/6400238d08cdf1ca20d49bafb8...](https://github.com/lxc/lxc/commit/6400238d08cdf1ca20d49bafb85f4e224348bf9d)

~~~
brauner
I've also written a smallish blogpost about this CVE. I'm a LX{C,D} maintainer
and I've worked with Aleksa the runC maintainer together on a fix for this
CVE: [https://brauner.github.io/2019/02/12/privileged-
containers.h...](https://brauner.github.io/2019/02/12/privileged-
containers.html)

~~~
arno1
Thank you @brauner for writing this blogpost!

IIUC, using Docker's userns-remap would protect against this CVE by making the
containers run unprivileged (container's id 0 != host's id 0) and should
generally be the industry's best practice.
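
A hypothetical sketch of the userns-remap setting described above: placing this in /etc/docker/daemon.json and restarting dockerd makes container uid 0 map to a subordinate uid taken from /etc/subuid ("default" tells dockerd to manage a "dockremap" user and its id ranges for you).

```shell
# Sketch of the daemon.json contents enabling user-namespace remapping.
# After a dockerd restart, root in a container is no longer host root.
daemon_json='
{
  "userns-remap": "default"
}'
printf '%s\n' "$daemon_json"
```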

------
koolba
Is this issue specific to containers running as root?

~~~
cyphar
Yes. You need to be able to run a container as root (or rather, as a user
which has write access to the host runc binary -- which is usually root). User
namespaces protect you for this reason.

------
morpheuskafka
I've yet to see an Ubuntu Security Notice released; I'm presuming an update to
the docker.io package will be released?

~~~
alexmurray
docker.io and runc are in universe, so they're not officially supported by the
Ubuntu Security team and won't get an Ubuntu Security Notice.

------
pizlonator
Yikes that's a big patch! Just on a meta-level, security vulnerabilities fixed
with big patches are usually the least fun.

Also, I would bet that freshly written C code has about 1 RCE bug every 100
LoC. This patch has 236 LoC, so probably about 2.36 RCEs.

~~~
megous
By your guess each new Linux release would have about 5000 new RCEs. So that's
25000-30000 new RCEs over the last year alone.

~~~
pjmlp
In 2018, 68% of Linux CVEs were caused by C's memory corruption features.

Source, Google talk at Linux Kernel Summit 2018.

~~~
wahern
And how many of those were related to the difficulty of managing IPC across
user/kernel boundaries, DMA, mbuf-based networking code, eBPF JIT'ing, etc?
So-called memory safe languages don't help with any of that.

I'll take your memory safe languages and up the ante with microkernels, which
actually help immensely in all those cases. But Linux isn't going to go that
route, either.

~~~
pjmlp
Plenty of them related to lack of bounds checking handling strings and arrays.

Google has been pushing for the Kernel Self-Protection Project for quite a
while now, which Android and ChromeOS make best use of, also a reason why the
NDK is so constrained on Android.

Security in the kernel also had quite a few talks at linux.conf.au 2019, held
just recently in New Zealand.

