Have we landed in what seems like yet another mono-culture situation where there are actually several viable technologies at hand?
If so, that would be a real shame. Docker is broken in quite a lot of ways and needs all the competition it can get.
The good news is that Google Translate understands PDFs and mostly gets it right:
That should be a decent start for further discussion :)
I guess the explanation might be the example image here: it's called "fedora". I thought running whole Linux distributions VM/LXC-style under Docker didn't really work - the Docker people always said you're only supposed to run one app per container. Has this changed?
As with any philosophy, there are many smart people who hold an opposite view. Docker itself is perhaps deviating from the architectural simplicity of container==process, while the underdog rkt is enforcing this approach even at the cost of user-land usability (e.g. rkt cannot run containers as daemons by itself).
In practice, most containers must have an OS wrapping the application (unless you're distributing binaries), and that OS often ends up being a nearly fully baked version of Ubuntu/Debian/Fedora or whatever the person is most comfortable with outside of Docker.
edit PS: One common exception to this philosophy is a short-lived container for admin/management/debugging. Having a full OS with devel/admin tools is useful for investigation/maintenance purposes, then you kill it and nothing is changed in prod.
Who died and made you king of definitions? Use of the word "container" for virtualizing a full OS instance predates Docker, ACI, OCI.
: https://us-east.manta.joyent.com/bcantrill/public/ppwl-cantr... (I wouldn't be surprised if there are even earlier IBM or FreeBSD uses of the word "container" with respect to OS virtualization.)
Sometimes that involves using one container that you tightly control to do a bunch of different things -- e.g.: "fedora".
You might use a base (fully loaded) fedora image to run apache in one container (fedora:httpd), mysql in another (fedora:mysql), and something like tomcat in a third (fedora:tomcat).
The alternative, preferred in some (most?) instances, is to use something like Alpine and bake in the bare minimum of everything for every process.
When folks run a bunch of stuff in one container it's typically using something like circus or supervisord running at pid 1.
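For illustration, a minimal supervisord config for such a multi-process container might look like the sketch below (the program names and paths are assumptions for the example, not anything from this thread):

```
[supervisord]
nodaemon=true          ; keep supervisord in the foreground as PID 1

[program:httpd]
command=/usr/sbin/httpd -DFOREGROUND

[program:mysqld]
command=/usr/sbin/mysqld
```

The key point is `nodaemon=true`: the supervisor stays in the foreground as PID 1 and reaps its children, which is exactly the role a single application process would otherwise play.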
The whole container hype is an offshoot of the VM hype from a few years back, which was all about running multiple, apparently separate, servers on shared hardware. This was done by installing the OS and server software into VM images managed by hypervisors (effectively a stripped-down OS that exists only to run VMs).
The VM approach has certain overhead issues, as you are running multiple kernels and user spaces stacked on top of each other, with the underlying CPU spending time pretending to be whatever hardware the VM presents to the OS inside it.
Thus you get containers: a chroot on steroids that claims to bring some of the benefits of VMs (less idle hardware, which gets all the accountants hot and bothered) but with reduced overhead, by having the containers share a kernel and general userspace.
Frankly we keep adding turtles to the stack.
We start out with a single cpu running a single process.
Then we stack multiple cpus into racks.
Then we implement timeshare so that each cpu can run multiple processes.
Then we virtualize the cpu as a process, which in turn timeshares processes.
All this because hardware takes up space and electricity even while idle, but software does not. Thus the more software you can run on a single piece of hardware, the "better".
Frankly, it seems like it's an explicit goal of docker to let people who know nothing about linux use linux containers. Another obvious goal is to ship fast, and add security later maybe (see the disaster that is docker registry auth).
Containers cannot and will not provide effective security. Capabilities more generally cannot and will not provide effective security in practice. Capabilities have been around in Linux since before 2000 and they've always been trivial to circumvent because of kernel bugs. The problem has never been granularity of capabilities but about code quality and surface area. If you're concerned about security, you want at _least_ something like seccomp or pledge. But as the recvmmsg issue shows even that's not enough. Heck, virtual machines aren't even enough--Xen has had two privilege escalation vulnerabilities in less than a year.
Containers are about easing deployments. Docker understands that, I think. And while there are many aspects about the whole container phenomenon that I find very unpalatable, I applaud Docker for understanding that, or at least not peddling the security angle too heavily.
The only practical, one-size-fits-all security boundary for software as complex as Linux or Windows is physically separate hardware. Linux's recent recvmmsg exploit notwithstanding, direct remote exploits of kernels are rare enough that you can make an argument for their benefit without exploit authors rolling their eyes.
So because you can't really secure a linux kernel perfectly, you might as well not bother with even trying at all?
Defence in depth. Sure, I understand there might be exploits that affect my system, but that doesn't mean I shouldn't use prepared sql statements, shouldn't run my processes as uid!=0, shouldn't use selinux, etc.
None of what you said is an excuse for not having more secure defaults and encouraging users to practice defence in depth better.
My point is that expecting containers to "secure" broken software is a delusion, especially when the implementation enforcing the constraints--in this case, the Linux kernel--is fundamentally broken from a security perspective in the relevant dimension (that is, local exploits leading to privilege escalation).
Arguing defense in depth in this case is like arguing that stacking 3 web application firewalls on top of an application is better than just 1. The problem being that none of them are capable of addressing the bulk of the proven exploits.
Back to your prepared SQL statement analogy: consider that nobody should be running software with root privileges today, regardless. Whether your network daemon runs in a container or not, it shouldn't be running as root in its steady state. So trying to lock down a container in such an instance is like trying to prevent SQL injection in software that concatenates data into the SQL statement string. It's intrinsically a failed strategy, and not in any sense an engineering solution. It's like saying doing something is better than doing nothing. Well, actually, in reality that's not the case; in reality sometimes doing something is just as bad as doing nothing (sometimes it's even worse, but that's not the argument I'm making here).
There are other, similar reasons why it's a bad strategy to see containers as a security measure. And I fail to see how anybody who has spent any amount of time in this industry, and who has spent any amount of time applying privilege separation principles, could reasonably disagree.
Now, can containers help you write more secure software? Yes, depending on context. But not to the exclusion of other techniques. Containers don't add a useful security layer by themselves, and shipping with slightly tighter capabilities doesn't change this fact in practice. Again, saying that shipping with slightly tighter capabilities (the particular capabilities system at issue here) is better than not is a distinction without a difference when most kernel-based root exploits are in innocuous system calls (e.g. networking system calls) that you could never lock down in a default container setting. If an application writer can't be bothered to make their software work without root, there are evidently other things going on. Again, the problem in reality is software quality--bad kernel quality, bad application quality--and papering over this doesn't help.
A simple, single setuid call shortly after application startup would work at least as well as narrowing process capabilities, and would likely be more effective. Even better, write your application to inherit socket and file descriptors so inetd or systemd can perform those privileged operations. These are minimal things all _correctly_ written software should support. If an application doesn't support at least the former, it's fundamentally broken and it's foolhardy to believe this can be papered over. Also, newsflash: systemd doesn't solve PID file races for uncooperative network daemons. That's another instance where people believed somebody came up with some magic software that inherently solved a problem without having to fix the underlying software. The difference being that PID file races aren't a comparable security issue, because they usually already require local root capabilities; and systemd's solution arguably makes such a race meaningfully more difficult to exploit. So a misinformed engineer's decision in that context is unlikely to be part of a chain of events leading to billions of dollars of valuation going up in smoke.
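A minimal sketch of that single setuid call, in Python (the target account "nobody" is just an illustrative choice, and error handling is elided):

```python
import os
import pwd

def drop_privileges(username="nobody"):
    """Drop root privileges to `username` shortly after startup
    (e.g. after binding privileged ports). Returns False when the
    process isn't root and there is nothing to drop."""
    if os.getuid() != 0:
        return False
    pw = pwd.getpwnam(username)
    os.setgroups([])        # clear supplementary groups first
    os.setgid(pw.pw_gid)    # drop group before user, or setgid() will fail
    os.setuid(pw.pw_uid)    # the point of no return
    return True
```

The ordering matters: supplementary groups and the gid must be dropped while still root, because after setuid() the process no longer has permission to change them.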
I'm not sure why you're pointing out containers, Linux, or any specific implementation of anything. What you said seems both true and obvious - local exploits lead to privilege escalation. The only other options are denial of service or information disclosure. All practical systems will have all of those, unless we go into fully verified OS territory.
While I don't strongly disagree with the result, I think this is the wrong way to analyse the security of some components. Containers provide abstraction over namespaces. Namespaces isolate some operations (fs access, for example) but not others (syscalls). Whether that improves the security of the system or not depends heavily on the configuration and context. There's a big overlap with the security improvement provided by dedicated system accounts - and that's worth remembering.
But let's not say containers don't improve on security as some absolute statement. There are tradeoffs everywhere.
Put more simply: yes, doing it correctly requires inspecting each and every web-tier solution, and its sockets, permissions, group permissions, and dependencies. This consumes a hell of a lot of time! The alternative is to throw the web-tier solution or package in its own container, and place monitoring software on the container to kill it if things go wild or somebody figures out how to hack through the web-tier solution.
If you _carefully_ parse Poettering's statements regarding systemd, a charitable interpretation would be that he never claimed systemd solved the PID race for uncooperative software. Unfortunately, everybody believed and most people still believe that it does. After all, there was never a problem with PID races with _cooperative_ software, so it was natural to assume that what systemd was bringing to the table was a solution for uncooperative processes.
That doesn't mean systemd isn't slightly better than `killall [process name pattern]`. But in security an inherently broken solution isn't any solution at all.
I did not claim otherwise. And I wouldn't even describe it as PID file race. PID files are a specific thing that was used with older inits.
I don't believe any init can solve the issue you're describing - it exists on a different layer. But cgroups freezer should be able to solve it.
To answer your question more directly, why do I use PID files:
1) Because I write and value portable software, both across different Linux distros and across different POSIX-like systems. I regularly target AIX, FreeBSD, NetBSD, OpenBSD, OS X, and Solaris; none as second class citizens.
2) Because IME the compile-test-debug cycle and similar non-production runtime modes are easier when you don't have to fiddle with separate init files. Supporting a simple PID-file-based backgrounding capability directly in your software is a relatively simple matter when it isn't implemented as an afterthought.
3) Because sometimes you want to implement a service using multiple processes (e.g. worker pools, specialized privilege separation, etc), which is easier (both in development and production) when you fork them from your own master rather than trying to configure and coordinate through a third-party supervisor. Note that this doesn't exclude your service's supervisor running in the foreground under another supervisor like init (e.g. systemd) or daemontools.
4) Because using process groups and a shared controlling pseudo-terminal you _can_ atomically send a signal to all children in a process group. It's more robust than what systemd currently does with cgroups, it's portable as a practical matter, and relatively simple to accomplish early in main().
Note that I write my software to support running in the foreground under the control of any supervisor of choice, including support for inheriting a listening socket. This is always the default mode. Backgrounding, PID files, etc, are optional capabilities in my daemons.
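A sketch of the process-group approach from point 4, in Python (the worker logic is illustrative, and a POSIX system is assumed):

```python
import os
import signal
import time

def run_workers(n=3):
    """Fork n workers into our own process group, then terminate them
    all with a single killpg() call. Returns the number of workers reaped."""
    try:
        os.setpgid(0, 0)           # new process group, early in main()
    except OSError:
        pass                       # already a group/session leader
    kids = []
    for _ in range(n):
        pid = os.fork()
        if pid == 0:               # worker: pretend to do work
            time.sleep(60)
            os._exit(0)
        kids.append(pid)           # children inherit our process group
    signal.signal(signal.SIGTERM, signal.SIG_IGN)  # don't kill ourselves
    os.killpg(os.getpgid(0), signal.SIGTERM)       # one call, whole group
    for pid in kids:
        os.waitpid(pid, 0)
    return len(kids)
```

Because every worker inherits the master's process group, killpg() delivers the signal to all of them in one system call; the master only has to ignore the signal itself first.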
Regarding cgroups freezer, it does seem like it should work. I stand corrected wrt Linux supporting atomic signals to an entire cgroup. But it seems like neither systemd nor Docker has actually implemented support. Perhaps it's for lack of motivation, or perhaps because there are non-trivial issues to resolve. It would be interesting to know what the hold up is.
> Why would you use PID files when you've got proper service management available?
The flawed PID file mechanism was, of course, addressed long before systemd was even an idea.
cgroups do not solve the problem of "crazy runaway processes", by the way. cgroups are not jobs (in the Windows NT and VMS senses), and lack the functionality of a true job mechanism.
When systemd terminates all of the processes in a cgroup, it doesn't issue a single "terminate job" system call. There isn't such a thing. It sits in a loop in application-mode code repeatedly scanning all of the process IDs in the cgroup (by re-reading a file full of PID numbers) and sending signals to new processes that it hasn't seen before. There are several problems with this.
A first problem is that systemd can be slower than whatever is reaping child processes within the process group, leading to the termination signals being sent to completely the wrong process: one that just happened to re-use the same process ID in between systemd reading the cgroup's process list file and it actually getting around to sending signals to the list of processes. The order of events is: (1) systemd reads PID N from the cgroup's process ID list, then its timeslice happens to expire; (2) process N is scheduled and exits; (3) process N's reaper (its parent, or a sub-reaper) cleans up the zombie in the process table; (4) some vital part of the system is scheduled and spawns a new process re-using ID N; (5) systemd is finally scheduled, works down the list of PIDs that it read before, and merrily kills the new process.
A second problem is that a program that forks new processes quickly enough within the cgroup can keep systemd spinning for a long time, in theory indefinitely as long as suitable "weather" prevails, as at each loop iteration there will be one more process to kill. Note that this does not have to be a fork bomb. It only has to fork enough that systemd sees at least one more new process ID in the cgroup every time that it runs its loop.
A third problem is that systemd keeps process IDs of processes that it has already signalled in a set, to know which ones it won't try to send signals to again. It's possible that a process with ID N could be signalled, terminate, and be cleaned out of the process table by a reaper/parent; and then something within the cgroup fork a new process that is allocated the same process ID N once again. systemd will re-read the cgroup's process ID list, think that it has already signalled the new process, and not signal it at all.
These are addressed by a true "job" mechanism. But cgroups are not such.
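The loop described above can be sketched abstractly; `read_procs` and `send_signal` are my own stand-in names for re-reading the cgroup's PID list (cgroup.procs) and kill(2), since there is no single "terminate job" call to replace the loop:

```python
def kill_cgroup(read_procs, send_signal):
    """Sketch of the userspace scan-and-signal loop: repeatedly re-read
    the cgroup's PID list and signal any PIDs not seen before."""
    seen = set()
    while True:
        pids = set(read_procs())
        new = pids - seen       # problem 3: a reused PID looks already-seen
        if not new:
            return seen
        for pid in new:
            send_signal(pid)    # problem 1: the PID may belong to someone else
        seen |= new             # problem 2: a fast forker keeps us looping
```

The comments mark where each of the three races above bites: nothing in the loop can make "read the list" and "signal the PID" atomic with respect to PID reuse.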
Most of what you complain about, though, is solved by duplicate pids not being assigned in short succession (which is generally true, though in reality it is an implementation detail of Linux that is not guaranteed).
Which leads to another point. If you think that this is my complaint, you should read the systemd people on the subject over the years. They were rather expecting (version 2) cgroups to be a true job mechanism, and they've complained every time that it has once again turned out not to be.
Moreover: this is not to mention that Docker and others will manipulate the freeze status of control groups for their own purposes, and there is no real race-free mechanism for sharing this setting amongst multiple owners, such as an atomic read-and-update for it.
You know, nearly every time I hear that phrase, something bad follows. You are kind of an exception, that's not a bad set of protections, except that they are off-topic, and won't apply to container design.
In practice real defense does not have depth. If you are in a position where you can apply a deep stack of protections, you've already lost.
That "if you can't secure it perfectly with one tool, you're doing it wrong/you've lost/etc." attitude is so mind-bogglingly anti-security that if someone holding it is in charge of network security for anything more than their home network, that company has already lost. That's like saying "If your application needs high availability instead of just being reliable, you've already lost."
We secure networks by putting things on different VLANs. And putting them through firewalls that govern access between different parts of the network. And by enforcing user-based access control via the firewall ("next-gen" features) and then hopefully, at the application level. And for privileged systems, we are pushing to use two-factor authentication.
I just listed off 6 different access controls for a given application. Hopefully, each one of those independently would secure the application pretty well. But the sum of all of them makes for a very well secured app, with no single point of failure.
> run my processes as uid!=0, shouldn't use selinux
You can automatically have your applications all run in user namespaces (uid != 0) with containers. You can automatically apply selinux policies to all containers and then layer further contexts.
Will you PLEASE stop this FUD. What you mean to say is LINUX containers cannot and will not provide effective security.
And the reason for that is they were never designed with security as the starting point. The Linux world believes you can just bolt security on later. You can't.
FreeBSD jails and Joyent zones have solved containers by starting with security as the constraint. It's other platforms that are horribly designed and continue to waste resources reinventing the wheel over and over again.
Maybe you will be interested in a paper concerning air-gapping by Joanna Rutkowska: http://invisiblethingslab.com/resources/2014/Software_compar...
tl;dr: air-gapping has its own strong disadvantages.
You don't secure the container, you secure the system running the container.
When you hand me a container to run, maybe I run it in a secured VM. Or maybe I run it inside a secured container; or I [get further out of my depth and] secure the Docker process itself.
From the perspective of the person writing the code running in the image, I just need to secure my application code - such as standard SQL injection, proper management of configuration keys, etc - but I don't really worry about securing the system running my container from my container.
...Although I guess that raises the question: how much do I need to secure my container from the system it's running on - or is that such a breach of trust it's pointless?
I assume that we have some OpenShift engineers in the house -- is v3 running directly on Docker and dropping capabilities, or does it run on RunC and then add them back as needed?
Disclosure: I work for Pivotal, we're on the Cloud Foundry side of the PaaS fence, which I guess makes us the Montagues.
tl;dr run SmartOS?
>This one’s easy. If you have this capability, you can bind to privileged ports (e.g., those below 1024).
Sure, it can bind to port 80 on a virtual network device that is unique to that docker container, unless you give the container host network privileges.
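A quick way to see the CAP_NET_BIND_SERVICE check from Python (note that on recent kernels the `net.ipv4.ip_unprivileged_port_start` sysctl can relax it, and inside a container the check applies to the container's own network namespace):

```python
import socket

def can_bind(port):
    """Try to bind a TCP socket to `port` on localhost; report whether it
    succeeded. Ports below 1024 classically require CAP_NET_BIND_SERVICE
    (or root)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
        return True
    except PermissionError:
        return False
    finally:
        s.close()
```

Run as an unprivileged user on a default kernel, `can_bind(80)` is refused while `can_bind(8080)` succeeds; with the capability (or as root) both succeed.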
The `--privileged` flag is equivalent to giving your container all capabilities: https://github.com/docker/docker/blob/97660c6ec55f45416cb2b2...
I don't get what you mean by saying how networking affects capabilities. Because it doesn't?