Hi all, I'm a maintainer of Docker. As others already indicated this doesn't work on 1.0. But it could have.
Please remember that at this time, we don't claim Docker out-of-the-box is suitable for containing untrusted programs with root privileges. So if you're thinking "pfew, good thing we upgraded to 1.0 or we were toast", you need to change your underlying configuration now. Add apparmor or selinux containment, map trust groups to separate machines, or ideally don't grant root access to the application.
Docker will soon support user namespaces, which is a great additional security layer but also not a silver bullet!
When we feel comfortable saying that Docker out-of-the-box can safely contain untrusted uid0 programs, we will say so clearly.
Speaking of whack a mole, last month I built a docker image for the Trinity syscall fuzzer. It's a great way of finding those moles, for anyone interested in contributing to either Docker or the kernel:
Nothing with a shared kernel is going to be very secure. That's just the nature of the beast. It's why Docker supports SELinux. It's why RHEL and CentOS ship with pre-written SELinux policy for common daemons.
If you intend to have more than zero services on the system, you want SELinux.
I think of docker as a "nicer chroot", i.e. it might be nice for testing deployment of networked applications with lots of servers, where setting up a new VM for each one would be both slow and an overkill.
Is running something inside docker worse than running the same application on the host from a security pov?
If not then you can consider docker just as one way of deploying an application on the host, i.e. not something for shared hosting of independent/possibly malicious applications.
In addition, we have repeatedly hit problems with our "mock" build system which uses a different kernel from what userland software is normally tested with. eg:  . This stuff is going to hit Docker users sooner or later. It is also infuriatingly hard to debug.
NSA creation or not, SELinux is at its core really little more than a strict state/transition machine. It has also been vetted pretty well over the years.
The syntax for the transition rules on the other hand looks like someone disgorged C-struct assignments on paper and left them there. So while SELinux (the system) is quite easily understood, the rules used to build SELinux systems are frightening, large, and from first appearance, very, very complex.
(disclosure: I had the privilege of going through SELinux a few years ago when Nokia considered using it as part of their Maemo platform security. The engineering effort was eventually deemed too large and the benefit too little, so it was skipped as infeasible. Less than two years later, Google announced that they would be taking on the task with their SE-Android project.)
While I'm as suspicious of the NSA and their intentions as anyone (I am very actively involved in my local Restore the 4th and Cryptoparty chapters), the fact is that the code is Open Source, and has been vetted by some of the best, and most trustworthy, developers in the world (it has been in the mainline Linux kernel for over a decade).
I trust SELinux. I don't necessarily trust my understanding of SELinux...it is a complicated beast. But, I believe that when configured correctly, it is a very powerful tool.
As I understand it, SELinux covers more ground than AppArmor. I have low familiarity with AppArmor, however, so don't know enough to argue why one might choose one over the other. But, I don't have any suspicion of SELinux containing exploitable code inserted by the NSA.
Its fixed in docker 1.0 since CAP_DAC_READ_SEARCH is no longer available.
Other FS-related threats to container based VMM's that have been discussed:
- subvolume related FS operations (snapshots etc)
- FS ioctl's that accept FS-handles as well (XFS)
- CAP_DAC_READ_SEARCH also defeats chroot and other
bind-mount containers (privileged LXC)
- CAP_MKNOD might be a problem too (still available in docker 1.0) depending on the drivers available in the kernel
I believe it was Theo de Raadt who once said, "Why does everything think that when it comes to writing VM/container software suddenly people gain super human programming powers and no longer make the same mistakes they make writing operating systems?" (Slightly paraphrasing).
While the issue is currently fixed in the .12 and 1.0 versions. I doubt Docker is still completely bullet proof.
His words where: "You are absolutely deluded, if not stupid, if you think that a worldwide collection of software engineers who can't write operating systems or applications without security holes, can then turn around and suddenly write virtualization layers without security holes." (http://marc.info/?l=openbsd-misc&m=119318909016582)
It a wonderful quote by the way, I really like it and it mirrors my reservations regarding some peoples use of visualization.
Visualization is perfectly fine, for hardware utilization, ease of deployment and so on, just don't rely on it for additional security, because that's not what it's there for.
I disagree: the hardware virtualization mechanisms provide an extra level of protection. Just like the non-virtualized protection mechanisms do.
No virtualization developer is under the impression that it's magical or bulletproof.
You rely on the operating system's security mechanisms continuously, and developers work hard to fix bugs and vulnerabilities when they appear. Same goes for virtualization -- the security semantics are just different.
That is why Redhat contributed SELinux support for Docker so you can run with Mandatory Access Control enabled. Docker is a layer, and security is best in multiple layers. One of them will always be broken.
This phrasing unfairly conflates VM/hypervisor technology and containers. Containers being a pure software technology do require near superhuman ability to secure but VM/hypervisors can lean on chip-level separation.
People forget that in-chip memory protection didn't come about for security reasons, memory errors were a particularly dangerous and particularly common kind of bug and the hardware was extended to help with memory isolation. OS session ending memory errors are almost unheard of since operating systems have started fully utilizing the on-chip protection. Programmers didn't become "superhuman" at preventing these errors.
For similar reasons it's much easier for hardware-backed virtualization programmers to protect you from malicious business inside a VM than it is for OS or container programmers.
The real truth is that the difficulty of containment is proportional to the interface that is available to the contained process. You don't need VM or hypervisor technology to build a virtually unbreakable container. You only need to prevent the contained process from using any syscalls at all.
Hardware only seems better at this kind of stuff because (a) it's harder to find errata in hardware and (b) the syscall interfaces of commonly used operating systems are much larger than what the hardware offers, and were developed without keeping containability in mind. It is a well known fact that tacking on security features in hindsight is problematic.
You don't need super-human chip designers because, as you say, "the difficulty of containment is proportional to the interface that is available". Hardware doesn't just seem better because "the syscall interfaces of commonly used operating systems are much larger than what the hardware offers", it is better. It is easier to analyse, has a more limited state-space, has more provable behavior, etc.
You can't just argue away the fact that a certain class of error has been all but eliminated by hardware-supported virtual memory. Multi-tasking as we know it today would basically be impossible without it. The reliability of "just get it right" systems like the early Macintosh isn't even comparable to, for example, a modern Linux machine that uses the chip to trap large classes of erroneous memory accesses.
Given that we have the above, a case of a class of error that programmers seemed unable to eliminate (practically) eliminated, I'm not really sure what you're arguing. Are you saying that hardware designers of the 80's were superhuman?
My point is that there is no difference between software and hardware.
You don't need hardware to eliminate memory errors: software can do it as well. Two examples of this are the Singularity system that Microsoft Research built and Google's NaCl, where the system only loads code that can be verified not to access memory incorrectly.
Your claim that hardware is easier to analyze is also incorrect. Modern processors are extremely complex beasts and are not inherently simpler than software. All processors have long lists of errata. You may be mislead into thinking hardware is easier to secure because (a) those errata are less visible to userspace developers because the kernel shields you from them and (b) hardware developers invest much more resources into formal verification than software developers out of necessity (you can't just patch silicon). If software developers invested a similar amount of effort into formal verification tools, your impression would be rather different.
Again, the point is that there is no inherent distinction between software and hardware when it comes to securing systems. It is always and everywhere first a question of how you design your systems and interfaces and second a question of investment in development effort targeted at eliminating bugs.
"My point is that there is no difference between software and hardware."
Okay, now I see where you're coming from. Theoretically I agree. However, practically there are a number of things that make hardware different:
* Hardware has inherent "buy-in". The software systems you describe as also solving the memory access problem are basically opt-in frameworks. While you can make software frameworks hard to opt-out of (e.g. OS integration etc.) by definition... software runs on hardware...
* hardware solutions are often much more transparent. Again, your software example require a great deal of re-tooling. One of the most elegant aspects of the classic 80's memory access solution was how transparent it was.
* The ratio of software to hardware vendors has far fewer hardware vendors. Combine this with the fact that, as you point out, hardware is so expensive to retool and you create an environment where it is much more likely that a single hardware solution will be "correct enough" to enforce a constraint on software than it is the case that the majority of software will properly opt-in to a framework/code-correctly.
> You now rely on chip designers being super-human.
At this point I want to ask how we're defining "super-human." What level of reliability is considered to have "super-human" requirements? There are certainly very simple and clear ways that one product produced by normal humans is much more reliable than another. For example, if you admonished someone to wear their seat belt while driving, you would scoff if they replied "well then I'm just relying on seat belt designers being super-human."
I actually agree with this. I believe that, using the right techniques, both software and hardware can be produced correctly. It's a function of their design and complexity how easy it is.
It's also worth keeping in mind that modern processors are actually extremely complex and that they do regularly have errata, even though chip designers are extremely conservative in their approach by necessity (you can't just patch silicon) and are much more thorough and disciplined in their use of formal verification tools than the vast majority of software designers.
All user mode code has been in OS-enforced, security-bounded, per-process VMs that access each other and hardware through virtualized interfaces since forever (well, since the 90s for mainstream microcomputer OSes).
"Containers" are just user-accessible support tooling to get creative with how those interfaces work. It really should be much easier to make container software than the entire virtualization infrastructure from scratch, in the same way that it's easier to write tar than a filesystem driver.
I write software that runs on tanks, its not bullet proof. Most the communication protocols just use security though obscurity. If you tell a gear box to shift form 1st to snip it'll do it, and break everything.
But when you considered air gap, and physical security surrounding it (12 inches of plate steal, 5 man team with guns, massive main gun), its pretty secure.
This vulnerability is a good example of how a security bug is still a bug. That is: even if all the bad guys went away there would still be problems. This issue was fixed (as far as I can tell) pre-1.0 for non-security reasons. See this discussion: https://github.com/dotcloud/docker/issues/5661 .
Basically, docker used to have a "drop" list associated with each execdriver. By default docker kept all kernel bestowed capabilities but would explicitly drop those on the execdriver's drop list. This created issues with compatibility. If an image was prepared on a kernel that didn't have a particular capability at all and was suddenly run on one that was, weird hard to diagnose behavioral differences could emerge. So now docker drops everything by default and the execdrivers have a "keep" list. Also, there's now a check for the kernel defining a capability before trying to mess with it.
There are---of course---still some issues. For example, docker drops everything then tries to add the capability back. That won't work for some capabilities because some can't be added back once they've been renounced. Still, the situation is an interesting example of how a security bug is still a bug and is likely to bite you in some way sooner or later. See also XSS flaws that limit valid inputs.
Yes. I pointed it for LXC originally, filing the bug with LXC on sourceforge and later re-filing on github. Nothing to do with docker. Finally, they implemented it. This is like multiple years of timeframe we're talking about. Later I pointed it out for docker. Just sayin'. There's a lot of us contributing here, and much of the work isn't under the docker banner.
"There's a lot of us contributing here, and much of the work isn't under the docker banner."
Of course! If my comment could be taken as part of the mass of prose that can be read to imply otherwise, I apologize. I don't want to undermine what Docker has done with providing a packaging format, UX, and history niceness. I also don't want to undermine the years of hard work that's gone into kernel namespacing etc. that makes it all possible.
Just to be clear, this doesn't really seem like a problem with Docker specifically. It looks like a problem with the kernel's namespace isolation, affecting any container-based solution. Yes, that's in the PPS, but probably should be in the title.
that's true to some extent, but it also has to do with which namespaces and isolation features a given container solution supports, for example lxc has seccomp syscall filtering support and user namespace support ootb which would have mitigated this attack surface to those of the unprivileged user running the container (and covering the ps on kexec). in addition lsm usage (selinux, apparmor) can also limit the attack surface area.
Linux capabilities are not a good security model in general. They're more suitable for sysadmins who want to do the very basic locking down of system resources to prevent users from fucking things up, or preventing programs from doing basic accidental mistakes. A MAC or RBAC implementation is a lot more robust and actually fulfills the qualifications for things like secret/top secret computing systems.
From what I recall, grsec's rbac doesn't give you the same flexibility as selinux's mac. You have to pick and choose whether you want advanced heuristics to prevent different kinds of attack, or just get really fine-grained with your system control. I prefer grsec personally, but only because i'm lazy, and it's more than likely not certified for top secret systems.
grsec's developer philosophy fully admits this pragmatic focus... basically, if we are too lazy to use a custom policy because policy development is painful, then essentially we are not going to use any policy and will instead remain vulnerable. grsec makes it easier, therefore it actually gets used.
I studied SEL policies in depth back in 2000 or so, but have never once deployed a custom policy. I suspect others are the same, though common daemons on common distributions recently (~last 5 years) began to have usable pre-supplied policies, unfortunately standard services are so commodified these days they're often outsourced (email, chat, web, etc.) and so the benefits of this 'too-little-too-late' development are partly mitigated in practice.
The kernel version is just as important, if not more important. I've performed tests that have given me non-specific breakout capabilities on certain combinations of Linux and Docker. I haven't however, as is done here, reproduced and written an exploit.
As an employee of Docker, I feel it is more important to me to know if we can breakout and patch those issues than to write viable exploits for them.
I have noticed that newer kernels and Docker versions (such as 1.0) are currently more difficult to break out of than they were in earlier versions. Again, however, it's highly dependent on the pairing.
What's important to recognize here is that even with breakout potential, containers should add a useful layer of security to break out of that wouldn't otherwise exist. Containers should never remove security from your system, they should only add to it. However, although deployers may find the removal of virtual machines to weaken their security story. The "secure all the things" story would be to put Qemu in a container, then run containers inside the VM.
Otherwise, security practices are as they always have been: Don't leave setuid binaries floating around, etc.
And that's not even an exploit, that's the sysadmin doing it wrong:
"If the host system and the jail share the `john' user and you are
sharing `/usr/local' as read-write between the host and the jail, then
``you are doing it wrong!''."
Never really got talking but they sent me some friendly warez when I released an ARP-based remote OS detection script in like 1999. As I recall they were all Germanic. Doubt there's any 'reach out' from groups today, too much profit/loss going on...
With a technology used in production in a lot of places, does responsible disclosure apply here? Did the author (OP?) notify Docker and/or the Linux kernel team before distributing this for public consumption?
While the author called the exploit "shocker" I don't think anybody is shocked by it. There are likely many other ways to break out of Docker. And while I assume the Docker (and kernel!) people are closing them as they come along, I don't think anyone is claiming Docker is unescapable so it's not a surprise if it is.