Docker container breakout? (openwall.net)
317 points by eugeneionesco on June 18, 2014 | 89 comments

Hi all, I'm a maintainer of Docker. As others have already indicated, this doesn't work on 1.0. But it could have.

Please remember that at this time, we don't claim Docker out-of-the-box is suitable for containing untrusted programs with root privileges. So if you're thinking "phew, good thing we upgraded to 1.0 or we were toast", you need to change your underlying configuration now. Add AppArmor or SELinux containment, map trust groups to separate machines, or ideally don't grant root access to the application.

Docker will soon support user namespaces, which is a great additional security layer but also not a silver bullet!

When we feel comfortable saying that Docker out-of-the-box can safely contain untrusted uid0 programs, we will say so clearly.

Thank you for being so completely transparent about things like this. I wish this attitude were more common in the IT world.

Don't worry. This is how things get more secure. Just stay the course and whack the moles. :)

Docker is awesome by the way.

Speaking of whack a mole, last month I built a docker image for the Trinity syscall fuzzer. It's a great way of finding those moles, for anyone interested in contributing to either Docker or the kernel:


It is better to think "less insecure" than "more secure".

Great response on this. Nice seeing the transparency.

Great response.

"..., or ideally don't grant root access to the application."


Solomon, as always, brilliant and to the point. Keep rocking!!

Also worth noting: you can still de-elevate the process inside the container. Discourse's web process runs under the discourse user in the container.
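Inside a container this usually means starting as root only long enough to do setup, then switching to a dedicated user. Below is a minimal Python sketch. It assumes the process starts as root and that the target uid/gid are known; the 65534 mentioned here, often `nobody`, is only an example. Order matters: supplementary groups and the gid must be changed while the process still has root, because after setuid() it can no longer do so.

```python
import os


def drop_privileges(uid: int, gid: int) -> bool:
    """Irrevocably switch a root process to the given uid/gid.

    Returns False (and changes nothing) if the process is not running as root.
    """
    if os.getuid() != 0:
        return False  # already unprivileged; nothing to drop
    os.setgroups([])  # clear supplementary groups while we still can
    os.setgid(gid)    # change the group before the uid: setgid needs root
    os.setuid(uid)    # as root, this sets the real, effective and saved uid
    return True
```

Call drop_privileges(65534, 65534) once root-only setup (binding low ports, opening privileged files) is finished.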

I don't use Docker, but this is just good peace-of-mind practice.

Here is your annual reminder: Use SELinux.

Nothing with a shared kernel is going to be very secure. That's just the nature of the beast. It's why Docker supports SELinux. It's why RHEL and CentOS ship with pre-written SELinux policy for common daemons.

If you intend to have more than zero services on the system, you want SELinux.

Or real virtualization + SELinux + sVirt ... libvirt on RHEL/Fedora/CentOS puts the qemu process into a container too.

For security, yes. For performance and resource requirements, however, Docker is going to beat libvirt/qemu.

I think of docker as a "nicer chroot", i.e. it might be nice for testing deployment of networked applications with lots of servers, where setting up a new VM for each one would be both slow and overkill.

Is running something inside docker worse than running the same application on the host from a security pov? If not then you can consider docker just as one way of deploying an application on the host, i.e. not something for shared hosting of independent/possibly malicious applications.

In addition, we have repeatedly hit problems with our "mock" build system which uses a different kernel from what userland software is normally tested with. eg: [1] [2]. This stuff is going to hit Docker users sooner or later. It is also infuriatingly hard to debug.

[1] http://bugzilla.redhat.com/1062533

[2] https://bugzilla.redhat.com/563103#c8

Or for running anything other than Linux on the exact same kernel. Running Windows, for example, or older copies of Linux.

Does it run Windows? Because libvirt/qemu does. A slow speed beats 0 speed usually? ;-)

A slow speed beats 0 speed usually? ;-)

Not if you're running Windows.

> Not if you're running Windows.

Not if Windows puts money in the pocket

Just buy more computers. Works for us and is cheaper than VMware ESX (the thing we replaced).

Xen would be cheaper and give us better utilisation but getting rid of VMware was a good step towards cost savings :)

Use grsecurity and a least privilege policy generated specifically for your exact system, and not generic policies.

I can't help but be skeptical about SELinux, having been written by the NSA. What would make you choose SELinux over AppArmor?

NSA creation or not, SELinux is at its core really little more than a strict state/transition machine. It has also been vetted pretty well over the years.

The syntax for the transition rules on the other hand looks like someone disgorged C-struct assignments on paper and left them there. So while SELinux (the system) is quite easily understood, the rules used to build SELinux systems are frightening, large, and from first appearance, very, very complex.
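Here is a toy Python model that makes the "state/transition machine" point concrete. It covers two core ideas: default-deny allow rules keyed on (source type, target type, object class), and domain transitions on exec. It is a sketch, not SELinux's policy language or its exact semantics. The type names are illustrative, and real SELinux also needs separate allow rules before a transition is permitted.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPolicy:
    # (source_type, target_type, object_class) -> set of permitted operations
    allow: dict = field(default_factory=dict)
    # (current_domain, executable_type) -> domain the new process runs in
    transitions: dict = field(default_factory=dict)

    def permits(self, source: str, target: str, cls: str, perm: str) -> bool:
        # Default deny: anything not explicitly allowed is refused.
        return perm in self.allow.get((source, target, cls), set())

    def exec_domain(self, domain: str, exe_type: str) -> str:
        # A process keeps its current domain unless a transition rule applies.
        return self.transitions.get((domain, exe_type), domain)


# Illustrative policy: init starting a web server binary.
policy = ToyPolicy(
    allow={("httpd_t", "httpd_content_t", "file"): {"read", "getattr"}},
    transitions={("init_t", "httpd_exec_t"): "httpd_t"},
)

web = policy.exec_domain("init_t", "httpd_exec_t")              # "httpd_t"
print(policy.permits(web, "httpd_content_t", "file", "read"))  # True
print(policy.permits(web, "shadow_t", "file", "read"))         # False
```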

(disclosure: I had the privilege of going through SELinux a few years ago when Nokia considered using it as part of their Maemo platform security. The engineering effort was eventually deemed too large and the benefit too little, so it was skipped as infeasible. Less than two years later, Google announced that they would be taking on the task with their SE-Android project.)

While I'm as suspicious of the NSA and their intentions as anyone (I am very actively involved in my local Restore the 4th and Cryptoparty chapters), the fact is that the code is Open Source, and has been vetted by some of the best, and most trustworthy, developers in the world (it has been in the mainline Linux kernel for over a decade).

I trust SELinux. I don't necessarily trust my understanding of SELinux...it is a complicated beast. But, I believe that when configured correctly, it is a very powerful tool.

As I understand it, SELinux covers more ground than AppArmor. I have low familiarity with AppArmor, however, so don't know enough to argue why one might choose one over the other. But, I don't have any suspicion of SELinux containing exploitable code inserted by the NSA.

The principles really aren't that complicated. I like this diagram of a similar implementation:


For all its sins, SELinux code is actually pretty clear and simple. It's also nicer than AppArmor if you ask me, and it labels inodes, not paths.
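A hard link shows why the inode-versus-path choice matters, because it gives one file a second name. In the minimal Python sketch below, which assumes a Unix filesystem that supports hard links, a rule keyed on the path misses the new name. A rule keyed on the inode (device plus inode number) still applies. Real MAC systems are more involved: SELinux keeps its labels in extended attributes on the inode, and path-based systems can restrict link creation. Treat this only as an illustration of the difference.

```python
import os
import tempfile


def path_vs_inode_rules():
    """Return (path_rule_blocks_alias, inode_rule_blocks_alias)."""
    with tempfile.TemporaryDirectory() as d:
        secret = os.path.join(d, "secret")
        alias = os.path.join(d, "alias")
        with open(secret, "w") as f:
            f.write("top secret\n")
        os.link(secret, alias)  # second name for the same inode

        # Path-based rule: attached to a name.
        denied_paths = {secret}
        path_blocks = alias in denied_paths

        # Inode-based rule: attached to the file object itself.
        st = os.stat(secret)
        denied_inodes = {(st.st_dev, st.st_ino)}
        al = os.stat(alias)
        inode_blocks = (al.st_dev, al.st_ino) in denied_inodes

        return path_blocks, inode_blocks


print(path_vs_inode_rules())  # (False, True)
```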

SELinux is supported by the principal Linux vendor, Red Hat. As a result, on RHEL/CentOS systems, there's tons of high-quality policy pre-written for any daemon you could want to install.

Instead of writing thousands of lines of policy from scratch, even a very complex system configuration might require a one-liner tweak to the Red Hat-provided policy.

Isn't it for the NSA, but by Red Hat? May have my history mixed up.

I believe that the National Security Agency was the original developer, but that Red Hat is a significant contributor.

NSA developed it and open-sourced it (much like their development of the crypto hash used for git), but Red Hat and the Linux kernel devs have pushed it the most since.

> Nothing with a shared kernel is going to be very secure.

SELinux is still operating in a shared kernel.

Apparently this is already fixed in Docker 1.0:

  It's fixed in docker 1.0 since CAP_DAC_READ_SEARCH is no longer available.

  Other FS-related threats to container based VMMs that have been discussed:

  - subvolume related FS operations (snapshots etc)
  - FS ioctls that accept FS-handles as well (XFS)
  - CAP_DAC_READ_SEARCH also defeats chroot and other
    bind-mount containers (privileged LXC)
  - CAP_MKNOD might be a problem too (still available in docker 1.0) depending on the drivers available in the kernel
Source: http://seclists.org/oss-sec/2014/q2/565
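To check which capabilities a process inside a container actually holds, you can decode the CapEff bitmask from /proc/self/status. Below is a minimal Python sketch. The bit numbers come from the kernel's capability list: CAP_DAC_READ_SEARCH is 2, CAP_SYS_ADMIN is 21 and CAP_MKNOD is 27. The sample line is an illustrative value, not output from any particular Docker release.

```python
# Bit positions from the Linux capability list (<linux/capability.h>).
CAP_DAC_READ_SEARCH = 2
CAP_SYS_ADMIN = 21
CAP_MKNOD = 27


def effective_caps(status_text: str) -> int:
    """Parse the CapEff bitmask out of /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split()[1], 16)
    raise ValueError("no CapEff line found")


def has_cap(mask: int, cap: int) -> bool:
    return bool((mask >> cap) & 1)


# Illustrative sample; inside a container, use open("/proc/self/status").read().
sample = "Name:\tsh\nCapEff:\t00000000a80425fb\n"
mask = effective_caps(sample)
print(has_cap(mask, CAP_DAC_READ_SEARCH))  # False
print(has_cap(mask, CAP_MKNOD))            # True
```

open_by_handle_at() requires CAP_DAC_READ_SEARCH, so a cleared bit 2 is consistent with the "Operation not permitted" in the transcript that follows.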

Confirmed not working in Docker 1.0:

  root@377a6f4ab0a4:/# history
  10  wget http://stealth.openwall.net/xSports/shocker.c  
  11  cc -Wall -std=c99 -O2 shocker.c -static
  12  apt-get install build-essential
  13  cc -Wall -std=c99 -O2 shocker.c -static
  14  cc -Wall -std=c99 -O2 shocker.c -static -Wno-unused-result
  15  ls
  16  ./shocker
  17  shocker
  18  nano a.out
  19  cat a.out
  20  ./a.out
  21  history
  root@377a6f4ab0a4:/# ./a.out
  [***] docker VMM-container breakout Po(C) 2014           
  [***] The tea from the 90's kicks your sekurity again.     [***]
  [***] If you have pending sec consulting, I'll happily     [***]
  [***] forward to my friends who drink secury-tea too!      [***]
  [*] Resolving 'etc/shadow'
  [-] open_by_handle_at: Operation not permitted
  root@377a6f4ab0a4:/# uname -r

I believe it was Theo de Raadt who once said, "Why does everyone think that when it comes to writing VM/container software suddenly people gain super human programming powers and no longer make the same mistakes they make writing operating systems?" (Slightly paraphrasing).

While the issue is fixed in the 0.12 and 1.0 versions, I doubt Docker is completely bulletproof.

His words were: "You are absolutely deluded, if not stupid, if you think that a worldwide collection of software engineers who can't write operating systems or applications without security holes, can then turn around and suddenly write virtualization layers without security holes." (http://marc.info/?l=openbsd-misc&m=119318909016582)

It's a wonderful quote, by the way. I really like it, and it mirrors my reservations regarding some people's use of virtualization.

Virtualization is perfectly fine for hardware utilization, ease of deployment and so on; just don't rely on it for additional security, because that's not what it's there for.

I disagree: the hardware virtualization mechanisms provide an extra level of protection. Just like the non-virtualized protection mechanisms do.

No virtualization developer is under the impression that it's magical or bulletproof.

You rely on the operating system's security mechanisms continuously, and developers work hard to fix bugs and vulnerabilities when they appear. Same goes for virtualization -- the security semantics are just different.

>No virtualization developer is under the impression that it's magical or bulletproof.

Developers: no, of course not. Some users, however, assume that they're automatically safe because they run VMware/Xen/Hyper-V or whatever.

That is why Red Hat contributed SELinux support for Docker, so you can run with Mandatory Access Control enabled. Docker is a layer, and security is best in multiple layers. One of them will always be broken.

This phrasing unfairly conflates VM/hypervisor technology and containers. Containers being a pure software technology do require near superhuman ability to secure but VM/hypervisors can lean on chip-level separation.

People forget that in-chip memory protection didn't come about for security reasons: memory errors were a particularly dangerous and particularly common kind of bug, and the hardware was extended to help with memory isolation. OS session-ending memory errors are almost unheard of since operating systems started fully utilizing the on-chip protection. Programmers didn't become "superhuman" at preventing these errors.

For similar reasons it's much easier for hardware-backed virtualization programmers to protect you from malicious business inside a VM than it is for OS or container programmers.

You now rely on chip designers being super-human.

The real truth is that the difficulty of containment is proportional to the interface that is available to the contained process. You don't need VM or hypervisor technology to build a virtually unbreakable container. You only need to prevent the contained process from using any syscalls at all.

Hardware only seems better at this kind of stuff because (a) it's harder to find errata in hardware and (b) the syscall interfaces of commonly used operating systems are much larger than what the hardware offers, and were developed without keeping containability in mind. It is a well known fact that tacking on security features in hindsight is problematic.
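Linux already has the extreme case of shrinking the interface. Seccomp strict mode leaves a process with only read, write, exit and sigreturn, and any other syscall kills it. The Python below only models that allowlist decision by syscall name; it does not install a real filter. The list of requested syscalls is hypothetical.

```python
# Syscalls seccomp strict mode permits (modelled by name only).
STRICT_ALLOWLIST = frozenset({"read", "write", "exit", "sigreturn"})


def decide(syscall: str, allowlist: frozenset = STRICT_ALLOWLIST) -> str:
    # Default deny: anything outside the allowlist is fatal.
    return "allow" if syscall in allowlist else "kill"


def reachable(requested, allowlist: frozenset = STRICT_ALLOWLIST) -> list:
    """The part of the kernel interface a contained process can still reach."""
    return sorted(set(requested) & allowlist)


wanted = ["read", "open_by_handle_at", "mount", "write"]
print(reachable(wanted))            # ['read', 'write']
print(decide("open_by_handle_at"))  # kill
```

The smaller the allowlist, the less kernel code an attacker can reach, which is the "difficulty is proportional to the interface" argument in miniature.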

You don't need super-human chip designers because, as you say, "the difficulty of containment is proportional to the interface that is available". Hardware doesn't just seem better because "the syscall interfaces of commonly used operating systems are much larger than what the hardware offers", it is better. It is easier to analyse, has a more limited state-space, has more provable behavior, etc.

You can't just argue away the fact that a certain class of error has been all but eliminated by hardware-supported virtual memory. Multi-tasking as we know it today would basically be impossible without it. The reliability of "just get it right" systems like the early Macintosh isn't even comparable to, for example, a modern Linux machine that uses the chip to trap large classes of erroneous memory accesses.
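The mechanism being credited here is that every memory access goes through a translation that can trap. Below is a toy Python model of it. It assumes a flat, single-level page table and 4 KiB pages; real x86 page tables are multi-level and carry many more permission bits. The class and method names are hypothetical.

```python
PAGE_SIZE = 4096


class PageFault(Exception):
    """Raised where real hardware would trap into the kernel."""


class ToyMMU:
    def __init__(self):
        # virtual page number -> (physical frame number, writable?)
        self.table = {}

    def map(self, vpn: int, pfn: int, writable: bool = False) -> None:
        self.table[vpn] = (pfn, writable)

    def translate(self, vaddr: int, write: bool = False) -> int:
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        entry = self.table.get(vpn)
        if entry is None:
            raise PageFault(f"unmapped address {vaddr:#x}")
        pfn, writable = entry
        if write and not writable:
            raise PageFault(f"write to read-only page {vaddr:#x}")
        return pfn * PAGE_SIZE + offset


mmu = ToyMMU()
mmu.map(1, 7, writable=True)       # this process's data page
mmu.map(2, 3)                      # a read-only page (e.g. code)
print(hex(mmu.translate(0x1010)))  # 0x7010
```

A stray pointer into another process's memory is simply unmapped in this process's table, so it faults instead of silently corrupting data.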

Given that we have the above, a case of a class of error that programmers seemed unable to eliminate (practically) eliminated, I'm not really sure what you're arguing. Are you saying that hardware designers of the 80's were superhuman?

Okay... maybe Jay Miner...

My point is that there is no difference between software and hardware.

You don't need hardware to eliminate memory errors: software can do it as well. Two examples of this are the Singularity system that Microsoft Research built and Google's NaCl, where the system only loads code that can be verified not to access memory incorrectly.
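One classic software fault isolation technique is address masking: every load and store has its address forced into the sandbox region, so a wild or hostile pointer can at worst corrupt the sandbox itself. Below is a minimal Python sketch with a hypothetical base and size. It is not how Singularity (which leaned on type-safe languages) or NaCl (whose details differ by architecture) are actually implemented.

```python
SANDBOX_BITS = 16                  # hypothetical sandbox size: 64 KiB
SANDBOX_SIZE = 1 << SANDBOX_BITS
SANDBOX_BASE = 0x1000_0000         # hypothetical base, aligned to SANDBOX_SIZE
SANDBOX_MASK = SANDBOX_SIZE - 1


def confine(addr: int) -> int:
    """Force any address into [SANDBOX_BASE, SANDBOX_BASE + SANDBOX_SIZE)."""
    return SANDBOX_BASE | (addr & SANDBOX_MASK)


class SandboxMemory:
    def __init__(self):
        self.mem = bytearray(SANDBOX_SIZE)

    # Every load and store goes through confine(); a verifier in a real SFI
    # system would reject code containing any unmasked memory access.
    def store(self, addr: int, value: int) -> None:
        self.mem[confine(addr) - SANDBOX_BASE] = value

    def load(self, addr: int) -> int:
        return self.mem[confine(addr) - SANDBOX_BASE]


box = SandboxMemory()
box.store(0xDEAD_BEEF, 7)         # a wild pointer...
print(hex(confine(0xDEAD_BEEF)))  # 0x1000beef ...lands inside the sandbox
```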

Your claim that hardware is easier to analyze is also incorrect. Modern processors are extremely complex beasts and are not inherently simpler than software. All processors have long lists of errata. You may be misled into thinking hardware is easier to secure because (a) those errata are less visible to userspace developers because the kernel shields you from them and (b) hardware developers invest much more resources into formal verification than software developers out of necessity (you can't just patch silicon). If software developers invested a similar amount of effort into formal verification tools, your impression would be rather different.

Again, the point is that there is no inherent distinction between software and hardware when it comes to securing systems. It is always and everywhere first a question of how you design your systems and interfaces and second a question of investment in development effort targeted at eliminating bugs.

"My point is that there is no difference between software and hardware."

Okay, now I see where you're coming from. Theoretically I agree. However, practically there are a number of things that make hardware different:

* Hardware has inherent "buy-in". The software systems you describe as also solving the memory access problem are basically opt-in frameworks. While you can make software frameworks hard to opt-out of (e.g. OS integration etc.) by definition... software runs on hardware...

* Hardware solutions are often much more transparent. Again, your software examples require a great deal of re-tooling. One of the most elegant aspects of the classic 80's memory access solution was how transparent it was.

* There are far fewer hardware vendors than software vendors. Combine this with the fact that, as you point out, hardware is so expensive to retool, and you create an environment where it is much more likely that a single hardware solution will be "correct enough" to enforce a constraint on software than that the majority of software will properly opt in to a framework or be written correctly.

> You now rely on chip designers being super-human.

At this point I want to ask how we're defining "super-human." What level of reliability is considered to have "super-human" requirements? There are certainly very simple and clear ways that one product produced by normal humans is much more reliable than another. For example, if you admonished someone to wear their seat belt while driving, you would scoff if they replied "well then I'm just relying on seat belt designers being super-human."

I actually agree with this. I believe that, using the right techniques, both software and hardware can be produced correctly. It's a function of their design and complexity how easy it is.

It's also worth keeping in mind that modern processors are actually extremely complex and that they do regularly have errata, even though chip designers are extremely conservative in their approach by necessity (you can't just patch silicon) and are much more thorough and disciplined in their use of formal verification tools than the vast majority of software designers.

Agreed. It's also a question of complexity. Xen (for example) has a significantly smaller attack surface than the linux kernel because it just has less stuff to do

That hasn't stopped Xen from having bugs that allowed an attacker to escape the domU and gain access to dom0 and the hardware.

The key really is: "Don't rely on virtualization for security".

Even if you physically separate, you risk being exploited over whatever medium you have to communicate with the untrusted machine. There are no silver bullets, unless you count total isolation.

All user mode code has been in OS-enforced, security-bounded, per-process VMs that access each other and hardware through virtualized interfaces since forever (well, since the 90s for mainstream microcomputer OSes).

"Containers" are just user-accessible support tooling to get creative with how those interfaces work. It really should be much easier to make container software than the entire virtualization infrastructure from scratch, in the same way that it's easier to write tar than a filesystem driver.

No software is ever gonna be bullet proof.

Except the software running a tank.

I write software that runs on tanks; it's not bulletproof. Most of the communication protocols just use security through obscurity. If you tell a gearbox to shift from 1st to [snip], it'll do it, and break everything.

But when you consider the air gap and the physical security surrounding it (12 inches of steel plate, a five-man team with guns, a massive main gun), it's pretty secure.

Except the software running a tank.

I write software that runs on tanks

Ha! Only on HN...

I'm thinking the joke is that, since the software runs inside the tank, which itself is bulletproof, the software is literally bulletproof since you can't shoot it with a bullet.

I think it was a joke... You can fire bullets at the tank and the stuff inside is fine.

It is, but vehicles including tanks and other armored things use horribly insecure serial protocols for communication. Which is no joke :\

Only for suitably small values of "bullet". :)

Not unless you're defining "bullets" to exclude things like https://en.wikipedia.org/wiki/Depleted_uranium#Ammunition.

Modern tanks have steel armor plates that are 12 inches thick?

I think they measure tank armour as equivalent to RHA (rolled homogeneous armour) steel, so it probably isn't literally that much steel, just something as good as 300mm of RHA.

I've used important software for submarines. OPSEC limits what I can say, but let's just say I wasn't very impressed.

This vulnerability is a good example of how a security bug is still a bug. That is: even if all the bad guys went away there would still be problems. This issue was fixed (as far as I can tell) pre-1.0 for non-security reasons. See this discussion: https://github.com/dotcloud/docker/issues/5661 .

Basically, docker used to have a "drop" list associated with each execdriver. By default docker kept all kernel-bestowed capabilities but would explicitly drop those on the execdriver's drop list. This created compatibility issues: if an image was prepared on a kernel that didn't have a particular capability at all and was then run on one that did, weird, hard-to-diagnose behavioral differences could emerge. So now docker drops everything by default and the execdrivers have a "keep" list. Also, there's now a check for the kernel defining a capability before trying to mess with it.

There are, of course, still some issues. For example, docker drops everything and then tries to add capabilities back. That won't work for some capabilities, because some can't be added back once they've been renounced. Still, the situation is an interesting example of how a security bug is still a bug and is likely to bite you in some way sooner or later. See also XSS flaws that limit valid inputs.
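The drop-list versus keep-list change comes down to which side of a set difference the unknown capabilities land on. Below is a minimal Python sketch. The capability names and kernel sets are made up for illustration, and Docker's actual lists differ.

```python
def drop_list_caps(kernel_caps: set, drop: set) -> set:
    # Old model: keep whatever the kernel offers, minus an explicit drop list.
    return set(kernel_caps) - set(drop)


def keep_list_caps(kernel_caps: set, keep: set) -> set:
    # New model: start from nothing; keep listed caps the kernel actually defines.
    return set(keep) & set(kernel_caps)


# Hypothetical capability sets for an older and a newer kernel.
old_kernel = {"CHOWN", "MKNOD", "SYS_ADMIN"}
new_kernel = old_kernel | {"NEW_CAP"}  # a capability the drop list never heard of
drop = {"SYS_ADMIN"}
keep = {"CHOWN", "MKNOD"}

print(sorted(drop_list_caps(new_kernel, drop)))  # ['CHOWN', 'MKNOD', 'NEW_CAP']
print(sorted(keep_list_caps(new_kernel, keep)))  # ['CHOWN', 'MKNOD']
```

With a drop list, the same image silently gains NEW_CAP on the newer kernel. With a keep list, the result stays the same on both kernels and shrinks only when the kernel lacks a listed capability.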

Yes. I pointed it out for LXC originally, filing the bug with LXC on SourceForge and later re-filing it on GitHub. Nothing to do with docker. Finally, they implemented it. We're talking about a timeframe of multiple years. Later I pointed it out for docker. Just sayin'. There's a lot of us contributing here, and much of the work isn't under the docker banner.

"There's a lot of us contributing here, and much of the work isn't under the docker banner."

Of course! If my comment could be taken as part of the mass of prose that can be read to imply otherwise, I apologize. I don't want to undermine what Docker has done with providing a packaging format, UX, and history niceness. I also don't want to undermine the years of hard work that's gone into kernel namespacing etc. that makes it all possible.

Just to be clear, this doesn't really seem like a problem with Docker specifically. It looks like a problem with the kernel's namespace isolation, affecting any container-based solution. Yes, that's in the PPS, but probably should be in the title.

That's true to some extent, but it also depends on which namespaces and isolation features a given container solution supports. For example, LXC has seccomp syscall filtering and user namespace support out of the box, which would have limited this attack surface to that of the unprivileged user running the container (which also covers the P.S. about kexec). In addition, LSM usage (SELinux, AppArmor) can also limit the attack surface.
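User namespaces reduce the damage because "root" inside the container maps to an unprivileged uid on the host. The kernel exposes this mapping in /proc/<pid>/uid_map, with one range per line: inside start, outside start, length. Below is a minimal Python sketch of the translation, using a hypothetical mapping.

```python
def parse_uid_map(text: str) -> list:
    """Parse uid_map lines of the form 'inside-start outside-start length'."""
    return [tuple(int(x) for x in line.split())
            for line in text.splitlines() if line.strip()]


def to_host_uid(uid: int, mapping: list):
    for inside, outside, length in mapping:
        if inside <= uid < inside + length:
            return outside + (uid - inside)
    return None  # this uid has no mapping in the namespace


# Hypothetical map: container uids 0-65535 become host uids 100000-165535.
mapping = parse_uid_map("0 100000 65536\n")
print(to_host_uid(0, mapping))     # 100000
print(to_host_uid(1000, mapping))  # 101000
```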

Linux capabilities are not a good security model in general. They're more suitable for sysadmins who want to do the very basic locking down of system resources to prevent users from fucking things up, or preventing programs from doing basic accidental mistakes. A MAC or RBAC implementation is a lot more robust and actually fulfills the qualifications for things like secret/top secret computing systems.

grsec + containers is nicer.

From what I recall, grsec's RBAC doesn't give you the same flexibility as SELinux's MAC. You have to pick and choose whether you want advanced heuristics to prevent different kinds of attack, or just get really fine-grained with your system control. I prefer grsec personally, but only because I'm lazy, and it's more than likely not certified for top secret systems.

grsec's developer philosophy fully admits this pragmatic focus... basically, if we are too lazy to use a custom policy because policy development is painful, then essentially we are not going to use any policy and will instead remain vulnerable. grsec makes it easier, therefore it actually gets used.

I studied SELinux policies in depth back in 2000 or so, but have never once deployed a custom policy. I suspect others are the same. Common daemons on common distributions did recently (in the last five years or so) begin to ship usable pre-supplied policies, but standard services are so commodified these days that they're often outsourced (email, chat, web, etc.), so the benefits of this too-little-too-late development are partly lost in practice.

"tested with docker 0.11"

Looks to be a few releases out past 0.11, notably 0.11.1, 0.12, and now 1.0. Can anyone confirm this works on the later versions of docker?

The kernel version is just as important, if not more important. I've performed tests that have given me non-specific breakout capabilities on certain combinations of Linux and Docker. I haven't however, as is done here, reproduced and written an exploit.

As an employee of Docker, I feel it is more important to me to know if we can breakout and patch those issues than to write viable exploits for them.

I have noticed that newer kernels and Docker versions (such as 1.0) are currently more difficult to break out of than they were in earlier versions. Again, however, it's highly dependent on the pairing.

What's important to recognize here is that even with breakout potential, containers should add a useful layer of security that an attacker has to break out of, which wouldn't otherwise exist. Containers should never remove security from your system; they should only add to it. However, deployers may find that removing virtual machines weakens their security story. The "secure all the things" story would be to put Qemu in a container, then run containers inside the VM.

Otherwise, security practices are as they always have been: Don't leave setuid binaries floating around, etc.
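One way to audit for stray setuid/setgid binaries, in the spirit of find's -perm tests, is a small Python walk over the filesystem. This is only a sketch. It uses lstat so that symlinks are not followed, and it silently skips files it cannot stat.

```python
import os
import stat


def find_setuid(root: str) -> list:
    """Return sorted paths of regular files under root with setuid or setgid set."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # vanished or unreadable; skip it
            if stat.S_ISREG(mode) and mode & (stat.S_ISUID | stat.S_ISGID):
                hits.append(path)
    return sorted(hits)
```

Running find_setuid("/") inside an image before shipping it gives a list of binaries worth justifying or stripping.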

Relevant recent discussion on the security of containers vs full virtualization: https://news.ycombinator.com/item?id=7834338

Now try breaking out of a FreeBSD jail ;-)

That's dated Jan 15, 2009. Are there any newer exploits for breaking out of jails on recent versions of FreeBSD? Or breaking out of a Bhyve VM?

And that's not even an exploit, that's the sysadmin doing it wrong: "If the host system and the jail share the `john' user and you are sharing `/usr/local' as read-write between the host and the jail, then ``you are doing it wrong!''."

Confirmed this works on lxc 0.7.5-3ubuntu69 (ships with Ubuntu 12.04 lts). Just change "/.dockerinit" to any file within a bind mount and run it.

To fix this you can shut down the container, edit the config in /var/lib/lxc/<name>/config and add dac_read_search to lxc.cap.drop. Voila.
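If you have several containers to fix, a small helper can add the capability to the config text idempotently. Below is a minimal Python sketch. It assumes, as in LXC's config format, that multiple lxc.cap.drop lines accumulate, and it should only be run while the container is stopped. The function name is hypothetical.

```python
import re


def ensure_cap_dropped(config_text: str, cap: str) -> str:
    """Return config text with `cap` listed in some lxc.cap.drop line."""
    for line in config_text.splitlines():
        m = re.match(r"\s*lxc\.cap\.drop\s*=\s*(.*)$", line)
        if m and cap in m.group(1).split():
            return config_text  # already dropped; leave the file untouched
    if config_text and not config_text.endswith("\n"):
        config_text += "\n"
    return config_text + f"lxc.cap.drop = {cap}\n"
```

For example, read /var/lib/lxc/<name>/config, pass its text through ensure_cap_dropped(text, "dac_read_search"), and write the result back.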

  [*] Resolving 'etc/shadow'
  [-] open_by_handle_at: Operation not permitted

Docker image to test if your host is vulnerable to this particular open_by_handle_at() container breakout: https://registry.hub.docker.com/u/gabrtv/shocker/

Remember to always audit the source code before running something like this. Origin repo is here: https://github.com/gabrtv/shocker

> "The tea from the 90's kicks your sekurity again. If you have pending sec consulting, I'll happily forward to my friends who drink secury-tea too!"

Uh, what?

stealth is a legend of the '90s


Never really got talking but they sent me some friendly warez when I released an ARP-based remote OS detection script in like 1999. As I recall they were all Germanic. Doubt there's any 'reach out' from groups today, too much profit/loss going on...

With a technology used in production in a lot of places, does responsible disclosure apply here? Did the author (OP?) notify Docker and/or the Linux kernel team before distributing this for public consumption?

While the author called the exploit "shocker" I don't think anybody is shocked by it. There are likely many other ways to break out of Docker. And while I assume the Docker (and kernel!) people are closing them as they come along, I don't think anyone is claiming Docker is inescapable, so it's no surprise that it can be escaped.

I think he might have been sarcastic

See: http://www.docker.com/resources/security/

It has been used effectively in the past! I'd encourage anyone who is researching security and Docker to send new information here.


A fix has been released already - Docker 1.0. Thus, it shouldn't really be an issue.
