Good post, except that it's extremely misleading to use Solaris as the canonical example of non-Linux containers and then say that non-Linux containers "haven't had as much exposure" and "the source code isn't always available for peer review and auditing". Solaris containers (in Solaris first, and then illumos when Solaris became closed-source again) have been open source since 2005 and running in hostile production environments that whole time.
True, I will update the blog post so that it feels less misleading. Source code for Solaris zones is indeed available; but I wouldn't consider it widely deployed. Of course, some people are using it in public hosting environments (the most notable example is probably Joyent); but I don't think that it's significant compared to the installed base of VServer, OpenVZ, or LXC out there.
I mean — it's trivially easy to get access to a Linux VPS, for a ridiculously low price (sometimes for free). Now compare with something equivalent based on Solaris zones.
But yeah, I'll definitely update the blog post, thanks!
> Source code for Solaris zones is indeed available; but I wouldn't consider it widely deployed.
Why would you not consider it widely deployed? If it needs to be said: just because you don't use something doesn't mean that others don't. Speaking only for us (I work for Joyent), we have deployed hundreds of thousands of zones into production over the years -- and Joyent was running with FreeBSD jails before that. And that's just us; there are many others in the illumos/SmartOS/OmniOS, Solaris and FreeBSD communities who have been running this technology in production -- broadly -- for years. Perhaps OS virtualization is a new technology for you, but understand that it's not new for everyone; some of us have been doing this for a while -- widely deployed and in production.
I wouldn't consider it "widely deployed" compared to the Linux installed base.
Sure, Joyent (and others) have "deployed hundreds of thousands of zones into production over the years", but you guys are the only well-known, large-scale, public hosting service using zones (and you're damn good at that, no doubt about it!).
Now compare with dotCloud, Heroku, Dreamhost, 1&1, Mediatemple, OVH, Amen (just to name those I can remember without doing extensive research): those guys have also "deployed hundreds of thousands of Linux-based VPS into production", using VServer, OpenVZ, and more recently, LXC.
Don't get me wrong: I'm a big fan of Solaris (and its heritage); I have lots of my marbles on ZFS; I hacked basic ZFS support into Docker just for fun a while ago; and if I knew better, I would love to find a way to run sub-zones and a ZFS pool in a Joyent SmartOS instance and port Docker to your platform. But there are a helluva lot of Linux hosters out there.
I'll close with a lovely paraphrase:
« Perhaps LXC containers are a new technology for you, but understand that it's not new for everyone; some of us have been doing this for a while -- widely deployed and in production. » ☺
Just because popular hosting companies are not using Solaris zones, that does not mean there are not large Zones-based deployments elsewhere in the industry. Particularly, some major corporations (including banks and telcos) are using Zones in their production Solaris environments. These turn out to be particularly large deployments, with hundreds of machines in data centers across the US.
These users do not broadcast their use of Zones, but having worked with them, I can say they certainly do exist.
*Used to work at Sun on a project related to Zones and ZFS.
By the way, if anyone knows of a documented exploit for LXC, I would love to hear about it. People (generally advocating VMs, zones, jails, OpenVZ...) will often say that "containers are not secure", but once you've taken some basic steps (like locking down kernel caps and device access) it becomes difficult to find an actual threat.
Any local root vulnerability will also work in a container, e.g. this one: http://www.ubuntu.com/usn/usn-1914-1/ - note that a lot of kernel vulnerabilities are never really announced, just quietly fixed.
A variant on http://grsecurity.net/~spender/msr32.c would have worked up until the capability check was added. Capabilities will help you, but if your business model is built on the assumption that the kernel performs capabilities checks everywhere it should then you really ought to be actively reviewing kernel entry points yourself.
Thank you so much for the first link! The very same "very black unix domain sockets magic" has been confounding me while reverse engineering a binary. OK, it calls recvmsg and then a wild FD appears from another process!? I had no idea...
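For anyone else puzzled by the same thing: the "wild FD" is descriptor passing over a Unix domain socket, done with an SCM_RIGHTS control message. Here is a minimal Python sketch of the mechanism. The helper names (send_fd, recv_fd, demo) are made up for illustration, and a socketpair inside one process stands in for the two separate processes you would see in practice.

```python
import array
import os
import socket


def send_fd(sock, fd):
    """Send one file descriptor as SCM_RIGHTS ancillary data."""
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))]
    # At least one byte of ordinary data must accompany the control message.
    sock.sendmsg([b"x"], ancillary)


def recv_fd(sock):
    """Receive one file descriptor; the kernel installs it in this process."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing bytes that do not form a whole int.
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
            return fds[0]
    raise RuntimeError("no file descriptor received")


def demo():
    """Pass the write end of a pipe across a socket, then write through it."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    try:
        send_fd(left, write_end)
        received = recv_fd(right)  # a new descriptor for the same pipe
        os.write(received, b"hello")
        os.close(received)
        return received != write_end, os.read(read_end, 5)
    finally:
        left.close()
        right.close()
        os.close(read_end)
        os.close(write_end)
```

The received descriptor has a different number than the one that was sent, but it refers to the same open file, which is why the data written through it shows up on the original pipe.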
>Finally, if you run Docker on a server, it is recommended to run exclusively Docker in the server, and move all other services within containers controlled by Docker.
This looks like what CoreOS is providing, a stripped down barebones host, with all other services not strictly necessary in the host moved to the containers.
>Capabilities turn the binary “root/non-root” dichotomy into a fine-grained access control system. Processes (like web servers) that just need to bind on a port below 1024 do not have to run as root: they can just be granted the net_bind_service capability instead. And there are many other capabilities, for almost all the specific areas where root privileges are usually needed.
This is awesome; it has been a personal pain point in the past, trying to get a JVM running as non-root on Ubuntu Server. Theoretically it's easy with iptables, but in practice it can be tricky to get working exactly right.
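If you want to check whether a process actually holds that capability, the kernel exposes the capability sets as hex bitmasks in /proc/<pid>/status. A rough Python sketch of decoding them follows. The function names are mine, not from any library; the only hard fact baked in is that CAP_NET_BIND_SERVICE is bit 10 in linux/capability.h.

```python
CAP_NET_BIND_SERVICE = 10  # bit number from <linux/capability.h>


def effective_mask(status_text):
    """Extract the hex CapEff mask from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    raise ValueError("no CapEff line found")


def has_capability(mask_hex, cap_bit):
    """Return True if the given capability bit is set in a hex mask."""
    return bool((int(mask_hex, 16) >> cap_bit) & 1)


def can_bind_low_ports(pid="self"):
    """Check the effective set of a process (Linux only, reads /proc)."""
    with open(f"/proc/{pid}/status") as status_file:
        mask = effective_mask(status_file.read())
    return has_capability(mask, CAP_NET_BIND_SERVICE)
```

Calling can_bind_low_ports() from a non-root process should return True only if the binary was granted the capability some other way, for example with setcap from libcap.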
I support docker in its efforts. However, docker is too cute, too hyped, and too rapidly developed to trust with your security as yet. Quite frankly, you have to understand a bit more than how to call an API to have faith in your infrastructure's inherent security.
For example, in this article the author links to the 'list of dropped capabilities in the Docker code'. As it happens, I wrote that list quite some time ago, and wrote it for lxc-gentoo, a guest-generation script for raw LXC against an earlier kernel version with an earlier LXC userspace. Not only is the list now out of date, it's no longer using the preferred approach. Why is this? After some months of raising the issue, one of the LXC devs finally added 'lxc.keep' (i.e. "deny all, allow some") as an alternative to explicit drop ("allow all, deny some"). It is architecturally more secure against things like kernel upgrades which add or modify kernel capabilities.
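To make the difference concrete, here is a toy Python model of the two approaches. It is not LXC code: the function names and the capability sets are illustrative assumptions, and CAP_SYSLOG simply plays the part of a capability that a newer kernel introduces.

```python
def caps_with_drop_list(kernel_caps, dropped):
    """'Allow all, deny some': everything the kernel knows, minus the list."""
    return set(kernel_caps) - set(dropped)


def caps_with_keep_list(kernel_caps, kept):
    """'Deny all, allow some': only listed capabilities that the kernel has."""
    return set(kernel_caps) & set(kept)


# Illustrative capability sets, not a complete list for any real kernel.
OLD_KERNEL = {"CAP_CHOWN", "CAP_NET_BIND_SERVICE", "CAP_SYS_ADMIN", "CAP_SYS_MODULE"}
NEW_KERNEL = OLD_KERNEL | {"CAP_SYSLOG"}  # a capability added by an upgrade

DROP = {"CAP_SYS_ADMIN", "CAP_SYS_MODULE"}
KEEP = {"CAP_CHOWN", "CAP_NET_BIND_SERVICE"}

# Both policies agree on the kernel they were written for.
same_on_old = caps_with_drop_list(OLD_KERNEL, DROP) == caps_with_keep_list(OLD_KERNEL, KEEP)

# After the upgrade, only the drop list lets the new capability through.
leaked = caps_with_drop_list(NEW_KERNEL, DROP) - caps_with_keep_list(NEW_KERNEL, KEEP)
```

With these sets, the drop list silently grants the container the new capability after the upgrade, while the keep list grants exactly what it did before.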
There's a reason we keep saying docker is not yet production-ready.
Right now our focus is on usability and stabilizing the management API to make deployment-centric deployment awesome. You can be sure that before we tell anyone that they can use docker to sandbox untrusted code in a shared environment (which by the way is not the only use case of docker) we will be locking down our default lxc configuration and doing a sweep of all pending security issues.
For the record, we (dotCloud) have tens of thousands of lxc containers currently running untrusted code in production on shared infrastructure, and have had to monitor and maintain them 24/7 for several years. Before that we ran openvz. And before that, we ran vserver. So while docker itself may not yet be ready for production (and indeed we don't use it in production at dotcloud either), you don't need to worry about our stance on security. We care about it just as much as you do.
Sure - for openvz it was more powerful resource accounting and limits (this is back in 2008). I think there was extra goodness around networking, but honestly I can't remember. Mostly we were trying to figure out which project would find its way into mainstream, so we could standardize on that. vserver had been around forever and somehow never made it in, so OpenVZ looked like our best bet. Of course we turned out to be wrong :)
Good question - no, we don't. Developers can request that certain whitelisted commands be executed within an environment that we know to be safe. For example, you can specify a list of system packages, and dotcloud will install them from the official LTS Ubuntu repository.
There's an ongoing discussion in the Docker community on the best way to make this possible in a shared environment. One possibility is to add support for OpenVZ, which has a better track record on that front (although it's not clear how much of the perceived difference is just FUD). Another is to combine namespaces with SELinux, so that even if you break out of the namespace, you're stuck in a "limbo" context with no ability to do harm. Lastly, there's the possibility of extra instrumentation around the container, to limit the risk - for example you could allow root privileges only for a whitelist of commands on a whitelist of base images. Or you could only authorize network connectivity with a whitelist of remote hosts (keeping in mind most use cases which require root access involve short-lived image building). Or you could map containers with root privileges to dedicated virtual machines, separately from the unprivileged containers. Etc.
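As a rough illustration of the "whitelist of commands on a whitelist of base images" idea, a policy check could look something like the Python sketch below. To be clear, nothing here is Docker's API or dotCloud's implementation; RootPolicy, its fields and the sample image and command names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RootPolicy:
    """Hypothetical policy: root is allowed only for known images and commands."""

    allowed_images: frozenset
    allowed_commands: frozenset

    def permits(self, image, argv):
        """Return True if running argv as root on this image is allowed."""
        if not argv:
            return False  # nothing to run, nothing to approve
        return image in self.allowed_images and argv[0] in self.allowed_commands


# Sample values, chosen only for the example.
policy = RootPolicy(
    allowed_images=frozenset({"ubuntu:12.04"}),
    allowed_commands=frozenset({"apt-get"}),
)
```

A real deployment would also need to validate the arguments, since a whitelisted command with arbitrary arguments can still do plenty of damage as root.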
> we keep saying docker is not yet production-ready
Just a heads-up: I know this isn't your fault, but docker.io does not say this on the front page, About, or FAQ that I can see. In fact, it currently says "same container that a developer builds and tests on a laptop can run at scale, in production".
Docker looks very interesting, thanks for your work.
That's a good point! The website is maintained separately from the rest of the project so I don't have direct visibility over it (I'm the lead maintainer of docker). I should have checked this earlier. Thank you!
I learned a lot from this reply, thank you :) It's clear you have a passion for containers (something we have in common) and security (something I'm not an expert on).
First, I think it's a little disingenuous to say that your issue disappeared. No one is censoring the Docker issue list. If you could provide a bit more information (your github handle, the issue title, etc.) I'll be happy to investigate.
edit: the first point was addressed, thanks :)
Second, Docker is an open source project with a rich community and a large number of contributors for any project, even more so for a project less than 6 months old. People like yourself with clear passion can only make it better. I encourage you to continue your contributions by opening an issue and working with the maintainers to solve it.
> I encourage you to continue your contributions by opening an issue and working with the maintainers to solve it.
Unfortunately I don't have time to run docker. Right now I am working on a broader-goaled system internally which supports arbitrary virtualization platforms and integrates concerns around platform integrity, host integrity, failover, automated scale-out, network topology specification and development/operations processes.
Docker apparently aims to make deployment really easy, and does this for some subset of cases, but that ease of use sacrifices security for new users who cannot evaluate statements such as the comments I added to its template in the commits referenced above.
To be frank I am not sure this is a winning goal, and suspect that any attempt to criticize docker's place within broader concerns would more likely result in something close to negative feedback from the existing developer community rather than an abstract thoughtfest resulting in wins for everyone. Happy to discuss further by email.
Hi, just to re-iterate my comment above: we absolutely care about security and welcome all security-related discussions. For example, just last week we released a hotfix to address an entirely different security concern. If you feel that a particular security concern has been overlooked, I apologize and encourage you to discuss it again by irc or email, keeping in mind that we are still at version 0.5 and actively discourage using docker in production.
At the same time, saying that Docker's goal is to "sacrifices security" is untrue and unfair to the project. So yes, as long as you make these unfounded statements, you will meet resistance in the form of a constructive rebuttal by the community. Especially coming from someone who "doesn't have time" to contribute to the project or even use it.
> saying that Docker's goal is to "sacrifices security" is untrue and unfair ... unfounded statements
People running things they don't understand means probable security issues for those users... and I think it's totally fair and in no way bad form to discuss this tradeoff in the context of docker and similar projects. Especially given two attack vectors documented in the current codebase, and the fact that the article we are commenting on ignored such. What docker is attempting to do - apparently give people easy to use 100% portable containers for arbitrary code - is hard, and security for arbitrary code is one of the challenges.
Personally I wonder if perhaps taking some time out to consider the blurrier and more complex edge cases with regards to the project's overall goals and architecture, potentially considering a dalliance into integration with weightier operations + development process concerns, higher security deployment requirement concerns and other areas that container-based deployments may affect would be really valuable for docker at the moment.
That's unfortunate. Even in development of products/internal infrastructure with overlap, there may be some ideas that benefit each project. It might also provide a more thorough understanding of the goals / strengths of the Docker project.
I'm eager to learn more and continue our discussion. I will definitely take you up on your offer to email further.
Regarding the lxc.drop vs lxc.keep: of course, we eventually want to switch to the latter, since it's obviously better to "deny all, then allow some" than the opposite. And I can only be grateful that you provided those elements. It looks like you think your contribution wasn't taken into account, but it definitely was, and I'm sorry that you feel that way.
So why does Docker still ship with lxc.drop? Well, a large number of people are still using LXC 0.7, which doesn't support lxc.keep, AFAIK. But it is very likely that Docker 1.0 will either require LXC 0.9, or totally get rid of LXC userland tools, or provide multiple implementations depending on what you have installed locally; and then lxc.keep will definitely kick in.
Also, the initial security choices of Docker represent a middle ground between "lock down everything" and "allow anything to happen". It had to be secure enough so that people could run regular app servers with a reasonable level of trust; and permissive enough to allow e.g. normal package managers to run.
Moreover, Docker is evolving: we recently added the "-privileged" flag (available in the master branch, and very probably in 0.6.0, due in a few days), which allows switching between a more secure configuration, suitable for e.g. public PAAS environments, and a more permissive configuration, suitable for private PAAS, continuous integration, that kind of thing. And this is just one step in that direction.
> It looks like you think... and I'm sorry that you feel that way.
Err, where did you get that idea? I couldn't be less concerned about the fate of my docker 'contribution' of inline comments (which was simply given out of shock that nobody seemed to be considering these vectors, and was merely copied from lxc-gentoo).
My motivation in commenting here is to prevent people from getting the wrong idea about security and LXC, something the article, IMHO, failed to do. In fact, it came across as fairly misleading to my mind.
Are you implying that gaining root access inside an LXC container means that you can escalate to the host system, or to sibling containers?
If yes, I would like to see an example of that (that works on systems with very minimal lockdown, i.e. using the device control group and kernel capabilities).
Otherwise, if you just mean that "0-day Linux root vulnerabilities can be used to escalate from non-root to root in a Linux container", that's a truism, and it also holds true for VMs or OpenVZ systems.
Just like 0-day vulnerabilities will help people to escalate from non-root to root in a FreeBSD jail or Solaris zone.
No, I think the implication is that the kinds of kernel bugs that allow you to escalate from a non-root user to root within a container (by corrupting kernel data structures, for instance) will probably also allow you to escalate to root at the host level. If you can change the UID of a process, why should it be harder to change the UID namespace as well?
Plus, the user namespace functionality is fairly new and complex, and there have already been a few bugs found. I assume all the known bugs have been fixed, but that doesn't ensure that more aren't lurking somewhere.
I disagree with this assertion that "VMs will always be more secure". Of course, they bring an extra layer (or rather, a layer of different nature).
But check the number of Xen vulnerabilities (I kept up with those for a while because I still run a Xen cluster): they are very real. And keep in mind that Xen (at least in my case!) doesn't bring an extra layer of security: if you are (e.g.) an IAAS provider using Xen to sell VMs, your customers can run anything they like in their VMs, and Xen will be the only layer. Your hypervisor will be "on the front line" if you see what I mean.
I would actually argue quite the contrary: exploits affecting containers are likely to be exploits affecting all Linux systems, meaning that they will draw much more attention and scrutiny than exploits affecting hypervisors, and they are likely to be fixed faster.
To clarify: VMs will always be more secure for sandboxing a non-root service. In that case, untrusted code would have to get root first, then use that to either replace or exploit the kernel, and then exploit the VM.
In the case where you run untrusted root or kernel code, that code only needs to exploit the VM, true. (On the other hand, many VMs have smaller attack surfaces than the Linux kernel.)
I would disagree with your assertion in the case of PV guests in Xen - they have an extremely small attack surface. The hypervisor may be "on the front line", but is a far simpler beast than the kernel.
Certainly Xen has its fair share of vulnerabilities, but vastly fewer than the kernel.
I'm very interested in all those things, but I clearly lack a trajectory for learning them. Is there a reference I could read or a 'name' for that domain? How does one become educated on these things?
So far I've grabbed knowledge by reading papers on operating systems (and misunderstanding 80% of their content), reading man pages, reading Tanenbaum's textbooks, etc. But still I don't feel like I know or understand.
They say a lack of words for things renders one blind to one's own ignorance. Sometimes it's also that you just don't know what needs to be learnt.