Understanding Docker Container Escapes

rtempaccount1 · on July 20, 2019

Whilst from an overall security perspective, this may not be that meaningful (the pre-reqs are CAP_SYS_ADMIN or --privileged), it's still a really neat hack.

It's also a good demonstration that runc container security is really just Linux security: the isolation provided by that type of container depends on underlying Linux features.

I could see circumstances where this would be of practical use, where you've landed in a privileged container and are looking for a relatively easy/universal breakout to the underlying node.
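For anyone wondering how you'd tell that you've landed in that kind of container: a rough sketch is to read the CapEff mask from /proc/self/status and test the CAP_SYS_ADMIN bit (bit 21 in linux/capability.h). The function names here are mine, and the two sample masks in the usage note are just values commonly seen for a fully privileged process and for a default Docker container, so treat them as illustrative.

```python
CAP_SYS_ADMIN = 21  # bit index of CAP_SYS_ADMIN in linux/capability.h


def effective_caps(status_text: str) -> int:
    """Return the CapEff bitmask parsed from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            # The value is a hexadecimal bitmask, e.g. "CapEff:\t00000000a80425fb"
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(status_text: str, cap: int = CAP_SYS_ADMIN) -> bool:
    """True if the given capability bit is set in the effective set."""
    return bool((effective_caps(status_text) >> cap) & 1)
```

Inside a container you would call it as `has_cap(open("/proc/self/status").read())`. A mask like 0000003fffffffff has the bit set, while a00000000a80425fb-style default mask (00000000a80425fb) does not.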

The ToB breakdown is cool too as the initial tweet was a bit hard to parse, unless your Bashfu is good.

(ofc the post from ToB I'm really looking forward to hasn't been released yet, which is what they found in the Kubernetes audit)

panpanna · on July 20, 2019

Two minor remarks:

1. Docker is not synonymous with Linux containers. LXC for example has always had much better security than Docker despite using the same infrastructure.

2. You don't need to interface with the kernel directly. There is for example a Google project (gVisor?) that acts as a syscall proxy in userland to combat these types of problems.

rtempaccount1 · on July 20, 2019

Out of curiosity, what makes you say that LXC has better security than Docker? What aspects of LXC's design or implementation provide additional security controls over those of Docker?

panpanna · on July 20, 2019

For example LXC containers are unprivileged by default. If you escape the container you end up as a normal user on host, not root.

LXC main dev is on HN. I'm sure he can explain this much better.

rtempaccount1 · on July 20, 2019

I'm not sure I follow you here.

The Docker daemon runs as root (which is kind of inevitable, if you don't then you end up needing hacks like slirp4netns to hook up networking)

However, there's nothing innately inside Docker that requires contained processes to run as root.

Indeed it's a standard part of recommended Docker security guidance not to run your contained processes as the root user.

Docker also provide the facility to enable user namespaces at the daemon level, so that root inside the container != root on the host.
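To make "root inside the container != root on the host" concrete, here is a small sketch that translates a uid seen inside a user namespace into the uid the parent namespace sees, using the three-column format of /proc/<pid>/uid_map (inside start, outside start, length). The helper names and the 100000 offset in the usage note are assumptions for illustration, not something Docker guarantees.

```python
from typing import List, Optional, Tuple

# Each extent is (inside_start, outside_start, length), as in /proc/<pid>/uid_map.
UidMap = List[Tuple[int, int, int]]


def parse_uid_map(text: str) -> UidMap:
    """Parse the contents of /proc/<pid>/uid_map into a list of extents."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(x) for x in line.split())
            entries.append((inside, outside, length))
    return entries


def to_host_uid(uid: int, uid_map: UidMap) -> Optional[int]:
    """Translate a uid inside the namespace to the uid seen from outside.

    "Outside" is relative to the namespace of whoever read the file.
    Returns None when the uid is not covered by any extent (unmapped).
    """
    for inside, outside, length in uid_map:
        if inside <= uid < inside + length:
            return outside + (uid - inside)
    return None
```

With a map of `0 100000 65536`, container root (uid 0) comes out as host uid 100000. With the identity map `0 0 4294967295` that you see without user namespaces, uid 0 stays uid 0.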

cyphar · on July 20, 2019

When GP said unprivileged containers, they meant user namespaces (that's the terminology LXC uses). Docker doesn't default to user namespaces being used (LXC does) and within Docker it has many limitations that LXC/LXD do not. LXC/LXD can also isolate containers from each other by mapping different uid_maps, but with Docker all containers use the same mapping.
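A rough way to picture the difference: if you write down the host uid range each container is mapped onto, containers sharing a mapping overlap completely, while per-container mappings do not overlap at all. The sketch below just checks ranges for overlap. The input format (container name to first host uid and length) and the function name are hypothetical, made up for the example.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical input: container name -> (first_host_uid, length)
HostRange = Tuple[int, int]


def overlapping_containers(ranges: Dict[str, HostRange]) -> List[Tuple[str, str]]:
    """Return every pair of containers whose host uid ranges intersect."""
    pairs = []
    for (a, (a_start, a_len)), (b, (b_start, b_len)) in combinations(
        sorted(ranges.items()), 2
    ):
        # Two half-open intervals intersect when each starts before the other ends.
        if a_start < b_start + b_len and b_start < a_start + a_len:
            pairs.append((a, b))
    return pairs
```

For example, `{"web": (100000, 65536), "db": (100000, 65536)}` reports the pair as overlapping, whereas giving "db" the range `(165536, 65536)` reports nothing.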

Disclaimer: I'm a maintainer of runc, the runtime Docker uses.

namibj · on July 21, 2019

That sharing makes me sad.

It's part of why Docker doesn't have a good reputation for security.

rtempaccount1 · on July 21, 2019

I understand that Docker's user namespace support is relatively basic, but the point I was looking at was that if you don't run your contained process as root (i.e. unprivileged) by specifying a USER in the Dockerfile, then I wasn't aware of major differences in the security of an LXC container as against one running under runc.

ofc happy to be corrected, as I'm aware you'd know more about this :)

cyphar · on July 21, 2019

It really depends whether you want to compare similarly-configured containers or the defaults.

If you compare the defaults, LXC wins overall because they have rootless containers and user namespaces by default (runc has them too -- I implemented them -- but it's not the default in Docker). To be balanced, LXD's isolation of individual containers is not on by default either (because of backwards compatibility requirements) -- but Docker doesn't have an equivalent feature. If you configure a Docker setup to be as-close-as-possible to an LXC setup, then it's much harder to give a definitive answer. Generally, the containers we set up look almost identical from the kernel's point of view so we have similar kernel 0day problems. So it comes down to the security of the runtime in particular.

I am currently working on solving several pretty fundamental security issues that exist both within LXC and runc (and many more programs generally)[1], so it's not like either is perfect (though LXC does have more code to defend against the attacks I'm working on fixing). LXC does make use of more of the kernel hardening work that we (both the LXC folks and myself) have worked on. A trivial example is that LXC uses TIOCGPTPEER (a feature I originally implemented that allows you to avoid certain theoretical attacks by container processes against the runtime) but Docker doesn't use it (and because runc doesn't have a container manager by design we can't implement it in runc). LXC also supports using pidfds (a new feature in Linux 5.1 that Christian Brauner has been working on for a while) which allow much nicer methods of avoiding PID recycling race conditions -- with runc we still use the old pid+starttime method which is prone to well-known (though usually harmless) attacks.
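For readers who haven't seen the pid+starttime trick: the idea is to remember a process as the pair (pid, starttime), where starttime is field 22 of /proc/<pid>/stat, and to treat a pid whose starttime has changed as a different (recycled) process. This is only a sketch of the general technique with names of my own choosing, not runc's actual code.

```python
from typing import Tuple


def parse_starttime(stat_text: str) -> int:
    """Extract starttime (field 22) from the contents of /proc/<pid>/stat."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain spaces
    # or ')', so split on the LAST ')' rather than on whitespace.
    _, _, rest = stat_text.rpartition(")")
    fields = rest.split()
    # fields[0] is field 3 (state), so field 22 is fields[19].
    return int(fields[19])


def same_process(recorded: Tuple[int, int], pid: int, stat_text: str) -> bool:
    """True if (pid, starttime) still matches what was recorded earlier."""
    return recorded == (pid, parse_starttime(stat_text))
```

Note the check is inherently racy: the pid can still be recycled between reading the stat file and acting on the pid, which is the gap a pidfd closes.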

Funnily enough, I'm actually giving a talk about this topic at the end of this week[2] and was writing slides when I saw this thread. :P

[1]: https://github.com/openSUSE/libpathrs

[2]: https://2019.container.camp/au/schedule/securing-container-r...

zapita · on July 20, 2019

That’s correct. Lxc is not “more secure” than Docker in any meaningful way. They are equivalent in their use of the Linux containment “plumbing”: cgroups, namespaces, capabilities, etc.

cyphar · on July 20, 2019

LXC supports user namespaces where containers are isolated from one another. LXD furthers this by remapping containers on restart (so you can change mappings). And you can "punch out" individual mappings for shared volumes and so on. Docker doesn't support this. In fact the recent expansion of the number of mappings allowed by the kernel is work done by Christian Brauner (an LXC maintainer who I collaborate with) -- this feature is so useful to LXC that they had to add the ability to have more mappings.

LXC also supports unprivileged operation (which I named "rootless mode"). Docker gained support for this very recently as an experimental feature in 19.03 (still not released), but LXC has supported it (and defaulted to it whenever possible) for years. Though, the Docker one is arguably better in some respects of lack-of-privilege (thanks to great work from Akihiro Suda and Giuseppe Scrivano) but it's still new.

LXC also has put a lot more work into fundamental security work (both in-kernel and within LXC).

Disclaimer: I maintain runc, the runtime Docker uses. There is no question that LXC has better engineering in this department. I collaborate with them quite often, but they have more engineers working on fundamental problems within containers.

zapita · on July 21, 2019

Thanks for the detailed answer. A few follow-up questions if I may.

1. Docker does support user namespaces today, correct? Your reply seemed to imply that it doesn’t.

2. Once rootless mode is released in Docker stable, the only difference in available security features between lxc and Docker will be the more flexible uid mapping for user namespaces, correct?

3. The flexible uid mapping feature, compared to user namespaces with static mapping as implemented by Docker, is an additional protection against container-to-container attacks, but not against container-to-host attacks. Did I get that right?

4. User namespaces, with or without flexible uid mapping, are considered a less secure containment method than seccomp and selinux/apparmor, all of which Docker/runc and lxc support equally well, correct?

cyphar · on July 21, 2019

> 1. Docker does support user namespaces today, correct? Your reply seemed to imply that it doesn’t.

Yes (though I don't agree my comment implied that Docker doesn't support user namespaces at all), but it doesn't support having different mappings for individual containers. This has both usability problems (--volume is painful to use) and security problems (inter-container attacks are still possible if you can "break out" of the container or otherwise disrupt the other container).

> 2. Once rootless mode is released in Docker stable, the only difference in available security features between lxc and Docker will be the more flexible uid mapping for user namespaces, correct?

Security features, (arguably) yes. But I would still argue that LXC has more security hardening work put into it than Docker. Of course they've had their own security issues but there definitely are arguments to be made that it isn't identical. I outlined some examples here[1].

Also the default configuration is still going to be run-as-root-without-user-namespaces with Docker (meaning the vast majority of users are running hideously insecurely). LXD and LXC defaults to using user namespaces. To be fair, both use seccomp and AppArmor/SELinux policies by default -- but depending on seccomp and AppArmor/SELinux is a much worse security position than

> 3. The flexible uid mapping feature, compared to user namespaces with static mapping as implemented by Docker, is an additional protection against container-to-container attacks, but not against container-to-host attacks. Did I get that right?

Yes.

> 4. User namespaces, with or without flexible uid mapping, are considered a less secure containment method than seccomp and selinux/apparmor, all of which Docker/runc and lxc support equally well, correct?

That's not quite true. User namespaces are arguably a much better containment method for containers. There are hundreds of user-namespace related hardening checks within the kernel (as well as the obvious "the euid space is different" protections) which you don't end up taking advantage of if you run in &init_userns. In fact, most kernel developers working in this space (namely Eric Biederman) don't consider security issues to be as serious if you can't exploit them without disabling user namespace protections. CVE-2019-5736 and CVE-2016-9962 were both blocked by using user namespaces.

But yes, there are some breakouts that user namespace support in your kernel has historically caused (and we have seen that many times) -- but that's why both Docker and LXD block unshare(CLONE_NEWUSER) with seccomp. But you can have all three! And (once Docker is configured) then all three support them all equally effectively.

[1]: https://news.ycombinator.com/item?id=20491291

zapita · on July 22, 2019

Thank you once again for the detailed reply.

I appreciate that you have a nuanced position on the topic of Docker security, based on deep expertise. Sadly, that nuance is lost on 99% of the people I see shouting that "Docker is insecure", the same people who presumably downvoted my original comment into oblivion. They are calling Docker insecure not because they understand what you explained (they don't), but because they have heard half-truths or outright fabrications, and are repeating them with absolute conviction, without bothering to argue their point or check even the most basic facts. As someone who has a lot of actual first-hand experience with Docker I find that very frustrating.

So, although I agree with everything you said, and appreciate that you took the time to write it down; I believe that your answer has unintentionally vindicated the many people lurking on this site who hold the widespread, almost cult-like belief that Docker is very insecure - insecure to the level of gross negligence, in a way that you and I understand it isn't.

panpanna · on July 20, 2019

The "plumbing" is the same but they set it up slightly differently which in the end affects things like security.

For example user mappings were handled very differently in lxc compared to (earlier versions of) docker.

https://linuxcontainers.org/lxc/security/

throw2016 · on July 21, 2019

LXC is daemonless, there is no process hanging around after the container start, so it starts the container and uses any privileges required to set up things like networking, mounts etc and then drops privileges.

LXC had unprivileged container support since 2013 so that part is fairly mature now. 'Unprivileged' in this case means the container process itself is running as a normal user.

cyphar · on July 21, 2019

LXC does have a container manager though, which is a single process that stays alive for the life of a single container. Within runc (the runtime Docker uses), we don't have a container manager but the downside is that now the upper level needs to keep alive the descriptors and other kernel objects that allow for safe container management by the runtime.

[I maintain runc, and collaborate with the LXC folks.]

awinter-py · on July 20, 2019

> use official docker images

(as a mitigation)

okay, but even for official images, figuring out the provenance of a build on docker hub is totally impossible

I challenge you to start with an image sha and tell me what git version (or even what repo) was used to create it

docker needs to get better at supply chain

zeroxfe · on July 20, 2019

This is a weird kinda escape. The only time I use --privileged is when I'm debugging, _because_ it lets me easily elevate access. Does anyone actually run privileged containers for production workloads?

ianamartin · on July 20, 2019

If a thing can be done, some nontrivial number of people do it in production.

The number of people who do it in production is correlated with the complexity of the thing they are trying to do and how much the "fuck it, just do <some terrible idea>" relieves that complexity.

derefr · on July 20, 2019

Docker containers that create/manipulate your Docker containers (e.g. the Kubernetes control plane) are --privileged, if they don't just mount the Docker daemon socket into themselves (which is basically equivalent.) If you're using kubectl, you're speaking to a daemon running in a --privileged Docker container, which you could perhaps exploit.

rtempaccount1 · on July 20, 2019

There are some cases where it's used in Kubernetes. Some of the system pods will run privileged and also some CNI pods will too.

Xylakant · on July 20, 2019

Elasticsearch requires some sysctls set on the node and some of the helm charts around use a privileged init container to set those. A solution that’s obviously more or less accepted, though one I personally dislike.

merb · on July 20, 2019

well you can run that on the host directly and disable the init container.

Xylakant · on July 20, 2019

Absolutely (though that requires host access), it just doesn’t seem to be what people do.

tln · on July 20, 2019

Not sure about production workloads, but it's not just privileged containers you start that should be a concern. Anyone who can access /var/run/docker.sock can run a privileged container, so this can be a privilege escalation.

Because of this escape, giving access to /var/run/docker.sock to regular users is the same as giving them root access.

Also as the article says mounting /var/run/docker.sock is (now, because of this escape) the same as giving that container access to the host system.
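If you want to audit for this, one simple-minded sketch is to scan the bind mounts requested for a container (in the `source:destination[:options]` form that `docker run -v` takes) and flag any whose source is the Docker socket or the host root. The function name is hypothetical, and treating /run/docker.sock as an alias is an assumption based on /var/run usually being a symlink to /run.

```python
import posixpath
from typing import List

# Host paths that effectively hand over the host when mounted into a container.
# /run/docker.sock is included on the assumption that /var/run links to /run.
RISKY_SOURCES = ("/", "/var/run/docker.sock", "/run/docker.sock")


def risky_binds(binds: List[str]) -> List[str]:
    """Return the bind specs whose host-side source is the socket or host root."""
    risky = []
    for bind in binds:
        # Everything before the first ':' is the host-side source.
        source = posixpath.normpath(bind.split(":", 1)[0])
        if source in RISKY_SOURCES:
            risky.append(bind)
    return risky
```

For example, `risky_binds(["/srv/data:/data:ro", "/:/host"])` flags only the second entry. It does not catch a parent directory such as /var/run being mounted, so it's a starting point rather than a complete check.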

mclehman · on July 20, 2019

On the other side of things, my favorite demo for people new to docker who aren't yet aware that sudoless docker ~~ root access is:

    docker run -itv /:/host ubuntu chroot /host

cyphar · on July 21, 2019

Or, even better use nsenter to join all the namespaces of PID 1 on the host (making your process an ordinary root process in the init namespaces).

zapita · on July 20, 2019

Mounting /var/run/docker.sock has always been the same as full host access. That is a well-documented fact irrespective of this particular escape.

deathanatos · on July 20, 2019

Yes. We use Logspout, a container that attaches to the stderr/out of other containers and forwards those stderr/outs to more centralized logging services, such as an ELK stack. The attach call requires the docker socket, and so it is effectively a --privileged container.

The idea, of course, is to minimize the surface area for attacks. That container has a need for it to run that way; the vast majority of our other containers do not.

wearsaredhat · on July 20, 2019

Container security products like Twistlock require privileged access. I’m sure there are other use cases.

the8472 · on July 20, 2019

Sometimes you need to disable at least some of the security features, e.g. to run gdb, perf. Or headless chrome with its security sandbox which just blows up with the default seccomp filters.

shaklee3 · on July 21, 2019

Yes. You need privileged to run DPDK in a container.

gcb0 · on July 21, 2019

> when I'm debugging

Ironically, attackers would have access to more valuable data (and more freedom of movement) on your personal/dev box than on a monitored production host.

oso2k · on July 20, 2019

The title ought to be updated to specify the container types in question are privileged containers. Even if the original blog post doesn’t. A privileged container is quite like running a process with a user that has sudo permission. Being surprised that such a container can do bad things seems disingenuous. It’s like saying, “When you give me the keys to your house, I can rob you blind. I just need to figure out which key is for the front door.”

tptacek · on July 20, 2019

The post says, over and over, almost ad nauseam, that you shouldn't be using --privileged and that the original tweet depends on --privileged; moreover, it presents an alternate version of the escape that doesn't depend on --privileged (but does depend on other elevated privileges).

The title should not in this case be editorialized; relative to the topic, it's about as boring as titles come.

lawnchair_larry · on July 20, 2019

I think the title is a bit clickbaity, perhaps unintentionally. I think parent’s point is that this post isn’t really going to teach you anything at all about docker container escapes, but rather, it teaches you about escaping containers that an admin took several deliberate uncommon and non-default steps to basically escape the container for you. If one clicked it to learn about real world container escapes, they left disappointed.

OJFord · on July 20, 2019

Obviously CAP_SYS_ADMIN and root-running containers should be avoided.

But on the occasion that they're necessary, isolating from other pods with a network policy and having no public ingress is enough right?

Assuming of course that you trust the container process(es) - or is that the issue?

rtempaccount1 · on July 20, 2019

I'd say "that depends"

For example if you're running Kubernetes, then if you have a user who has RBAC rights to exec into a privileged pod, but doesn't have rights to create new privileged pods, then this could be a privesc risk, as they can use something like this to escalate to the underlying node after exec'ing into the privileged pod.

It (as with all things security) depends on your threat model :)
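If it helps with the threat modelling, here is a sketch for finding which containers in a pod manifest would be interesting exec targets: anything with `securityContext.privileged: true` or with SYS_ADMIN (or ALL) in `securityContext.capabilities.add`. It assumes the manifest has already been loaded into a plain dict (e.g. from YAML or JSON), and the function name is made up.

```python
from typing import Any, Dict, List


def privileged_containers(pod: Dict[str, Any]) -> List[str]:
    """Names of containers in a pod manifest that run privileged or add SYS_ADMIN."""
    spec = pod.get("spec") or {}
    containers = (spec.get("initContainers") or []) + (spec.get("containers") or [])
    names = []
    for container in containers:
        ctx = container.get("securityContext") or {}
        added = (ctx.get("capabilities") or {}).get("add") or []
        if ctx.get("privileged") or "SYS_ADMIN" in added or "ALL" in added:
            names.append(container["name"])
    return names
```

This only looks at container-level settings in a single pod; who is actually allowed to exec into it is a separate RBAC question.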

OJFord · on July 21, 2019

Good point - I was thinking that 'operators' were trusted though, potential threat is the (ab)users of the running software.

As far as I can tell, as long as there is no service/ingress on the privileged container, and a netpol blocks those that do from accessing it, it's less than ideal but 'ok' that this privileged container is running behind the scenes.

paulddraper · on July 20, 2019

AFAIK, that's the only way to run Chrome unless you disable Chrome's sandboxing.

the8472 · on July 20, 2019

Lateral movement of an attacker within your cluster would also be a problem.

mychael · on July 21, 2019

The author is presenting this as if they found a serious security flaw in Docker. They're starting a container in privileged mode, what exactly did they expect?

This is like acting surprised that a Linux root user can do immeasurable harm to the underlying OS.

based2 · on July 20, 2019

https://www.reddit.com/r/netsec/comments/cfh7rk/understandin...

king_phil · on July 20, 2019

Funny thing about docker container security, a bug that has not been fixed for ages: a custom AppArmor profile is only applied on the first container start, but not on any later restart.

Yes, the container runs in the "unconfined" profile after a restart.

https://github.com/moby/moby/issues/38075

zapita · on July 20, 2019

That’s disingenuous. In this issue, the maintainers clearly explain that running your container as privileged is supposed to disable all confinement by apparmor. The bug is that the custom apparmor profile is sometimes applied, when it should never be. This is not a security issue in any way since the container is already privileged.

king_phil · on July 20, 2019

But in a privileged container you could still take away capabilities and/or permissions with an apparmor profile. Sometimes that happens, sometimes it does not. And when it does not, you have no way of knowing.

zapita · on July 21, 2019

> But in a privileged container you could still take away capabilites and/or permissions with an apparmor profile.

Right, what you want is “privileged except for XYZ”, which is not supported by Docker. That’s a missing feature which is not the same as a bug. Calling it a security bug is even more misleading.

> Sometimes that happens, sometimes it does not. And when it does not, you have no way of knowing.

Right, it should fail every time. That is a bug. But it’s not a security bug, and fixing that bug won’t give you the feature you want, it will just make it clearer that the feature is not supported.