Dropping Linux capabilities to improve container security (redhat.com)
175 points by cheiVia0 on Oct 18, 2016 | 52 comments

I wrote an open source BPF tool recently that shows which capabilities your application is using, live, for the purposes of determining such a whitelist: http://www.brendangregg.com/blog/2016-10-01/linux-bcc-securi...

As you guys are using FreeBSD at Netflix, have you reached over to OpenBSD and tried pledge[1]? It looks like a terrific addition to the sandboxing suite of tools. It seems like it would go hand in hand with your whitelisting tool if only they ran on the same platforms.

[1] http://man.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man2/...

It's sad to see the word "container" used synonymously with "Docker container". There are other container technologies out there, and this one only applies to Docker.

Have we landed in what seems like yet another mono-culture situation where there are actually several viable technologies at hand?

If so, that would be a real shame. Docker is broken in quite a lot of ways and needs all the competition it can get.

Could you elaborate on (or link us to examples of) those 'quite a lot of ways'?

Sure. Here's the slide-set for a pretty good rant (although in Norwegian):


The good news is that Google Translate understands PDFs and mostly gets it right:


That should be a decent start for further discussion :)

Yeah "decent". Whole thing is a childish spiel about "docker bad, systemd cool, m'kay".

Agree with digi_owl, this reads like the ramblings of a child. (And it doesn't seem to be the translation).

systemd-nspawn can drop capabilities too.

What's holding back user namespaces with Docker? Are containers requiring real root common, or is there some other reason it's turned off by default?

I guess the explanation might be the example image here: it's called "fedora". I thought running whole Linux distributions VM/LXC-style under Docker didn't really work - the Docker people always said you're only supposed to run one app per container. Has this changed?

The core philosophy of "one app per container" has not changed. Strictly speaking, this will never change because containers - whether Docker, ACI, or OCI - are processes and not VMs. I recommend this accessible presentation on how to write a poor man's Docker by hand[1] to remove some of the magic behind containers.

As with any philosophy, there are many smart people who hold an opposite view. Docker itself is perhaps deviating from the architectural simplicity of container==process, while the underdog rkt is enforcing this approach even at the cost of user-land usability (e.g. rkt cannot run containers as daemons by itself).

In practice, most containers must have an OS wrapping the application (unless you're distributing binaries), and that OS often ends up being a nearly fully baked version of Ubuntu/Debian/Fedora or whatever the person is most comfortable with outside of Docker.

edit PS: One common exception to this philosophy is a short-lived container for admin/management/debugging. Having a full OS with devel/admin tools is useful for investigation/maintenance purposes, then you kill it and nothing is changed in prod.

[1] https://www.youtube.com/watch?v=HPuvDm8IC-4

> Strictly speaking, this will never change because containers - whether Docker, ACI, or OCI - are processes and not VMs.

Who died and made you king of definitions? Use of the word "container" for virtualizing a full OS instance predates [1] Docker, ACI, OCI.

[1]: https://us-east.manta.joyent.com/bcantrill/public/ppwl-cantr... (I wouldn't be surprised if there are even earlier IBM or FreeBSD uses of the word "container" with respect to OS virtualization.)

Haha, valid point. My thinking is within the current nomenclature and implementations of "containers" vs "Virtual Machines". These definitions could change drastically in the future.

Even in the Linux world, OpenVZ has had full OS containers for many years. Cheap VPSs were often just a container with an OS installed in it.

The main problem with user namespaces is file ownership. User namespaces allow you to shift a range of user ids to a different set of user ids on the host, but there is no way to mount a filesystem with the same set of shifts. This means you have to copy the root filesystem for the container and chown it which is pretty hairy. Note that there have been various proposals to do some kind of shifting at the fs layer[1] but nothing has been finished yet. [1] http://lwn.net/Articles/637431/
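To make the id shifting concrete, here is a minimal sketch of how a user namespace map translates a uid inside the container to a uid on the host. The (inside, outside, length) triples mirror the three columns of /proc/<pid>/uid_map; the function name and the 100000 range are made up for illustration.

```python
def to_host_uid(container_uid, uid_map):
    """Translate a uid seen inside a user namespace to the uid on the host.

    uid_map is a list of (inside_start, outside_start, length) triples,
    the same three columns found in /proc/<pid>/uid_map.
    Returns None when the uid is not mapped at all.
    """
    for inside_start, outside_start, length in uid_map:
        if inside_start <= container_uid < inside_start + length:
            return outside_start + (container_uid - inside_start)
    return None


# Hypothetical mapping: container uids 0..65535 -> host uids 100000..165535
example_map = [(0, 100000, 65536)]
print(to_host_uid(0, example_map))     # host uid that container "root" really is
print(to_host_uid(1000, example_map))  # host uid of an ordinary container user
```

Files on disk carry host-side ids, so a root filesystem owned by host uid 0 does not belong to the container's "root" under this map. That is why the copy-and-chown step is needed until something can do the shift at mount time.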

On this issue, a more recent status update from James Bottomley at the 2016 Linux Security Summit:



So I think a better explanation is 'you should only run one command per container.'

Sometimes that involves using one container that you tightly control to do a bunch of different things -- e.g.: "fedora".

You may use a base (fully loaded) fedora container to run apache in one container, (fedora:httpd), mysql in another (fedora:mysql), and something like tomcat (fedora:tomcat).

The alternative, preferred in some (most?) instances, is to use something like Alpine and bake in the bare minimum of everything for every process.

When folks run a bunch of stuff in one container it's typically using something like circus or supervisord running at pid 1.

I never understood why container security needs to be treated differently than regular security for the OS and process.

Because they are effectively turning into a mini-os.

The whole container hype is an offshoot of the VM hype from a few years back, which was all about running multiple, apparently separate, servers on shared hardware. This was done by installing the OS and server into VM images that would be managed by hypervisors (effectively a stripped-down OS that only exists to run VMs).

The VM approach has certain issues with overhead, as you are running multiple kernels and user spaces stacked on top of each other, with the underlying CPU spending time pretending to be whatever hardware the VM is presenting to the OS within it.

Thus you get containers, a chroot on steroids that claims to bring some of the benefits of VMs (less idle hardware, thus getting all the accountants hot and bothered) but with reduced overhead. This is done by having the containers share the kernel and general userspace.

Frankly we keep adding turtles to the stack.

We start out with a single cpu running a single process.

Then we stack multiple cpus into racks.

Then we implement timeshare so that each cpu can run multiple processes.

Then we virtualize the CPU into a process that in turn timeshares processes.

All this because hardware takes up space and electricity even while idle, but software does not. Thus the more software you can run on a single piece of hardware the "better".

One difference would be that containers have the need to try and prevent escaping the container and mucking with the host os.

So... coming from Rails land, where convention reigns, why is the locked-down list not the standard?

Docker has repeatedly optimized for usability (including dumb users) over security. User experience is "my app just works", not "my app works, but only after I remember to add these 10 capabilities".

Frankly, it seems like it's an explicit goal of docker to let people who know nothing about linux use linux containers. Another obvious goal is to ship fast, and add security later maybe (see the disaster that is docker registry auth).

Linux had a remotely exploitable root privilege vulnerability in the recvmmsg system call just last week. Linux experiences local root exploits via innocuous-seeming system calls roughly every 3 months, last time I checked.

Containers cannot and will not provide effective security. Capabilities more generally cannot and will not provide effective security in practice. Capabilities have been around in Linux since before 2000 and they've always been trivial to circumvent because of kernel bugs. The problem has never been granularity of capabilities but about code quality and surface area. If you're concerned about security, you want at _least_ something like seccomp or pledge. But as the recvmmsg issue shows even that's not enough. Heck, virtual machines aren't even enough--Xen has had two privilege escalation vulnerabilities in less than a year.

Containers are about easing deployments. Docker understands that, I think. And while there are many aspects about the whole container phenomenon that I find very unpalatable, I applaud Docker for understanding that, or at least not peddling the security angle too heavily.

The only practical, one-size-fits-all security boundary for software as complex as Linux or Windows is physically separate hardware. Linux's recent recvmmsg exploit notwithstanding, direct remote exploits of kernels are rare enough that you can make an argument for their benefit without exploit authors rolling their eyes.

> The only practical security boundary for software as complex as Linux or Windows is physically separate hardware

So because you can't really secure a linux kernel perfectly, you might as well not bother with even trying at all?

Defence in depth. Sure, I understand there might be exploits that affect my system, but that doesn't mean I shouldn't use prepared sql statements, shouldn't run my processes as uid!=0, shouldn't use selinux, etc.

None of what you said is an excuse for not having more secure defaults and encouraging users to practice defence in depth better.

Using prepared SQL statements is about writing correct software.

My point is that expecting containers to "secure" broken software is a delusion, especially when the implementation enforcing the constraints--in this case, the Linux kernel--is fundamentally broken from a security perspective in the relevant dimension (that is, local exploits leading to privilege escalation).

Arguing defense in depth in this case is like arguing that stacking 3 web application firewalls on top of an application is better than just 1. The problem being that none of them are capable of addressing the bulk of the proven exploits.

Back to your prepared SQL statement analogy, consider that nobody should be running software with root privileges today, regardless. Whether your network daemon is run in a container or not, it shouldn't be running as root in its steady state. So trying to lock down a container in such an instance is like trying to prevent SQL injection in software that concatenates data into the SQL statement string. It's intrinsically a failed strategy, and not in any sense an engineering solution. It's like saying doing something is better than doing nothing. Well, actually, in reality that's not the case; in reality sometimes doing something is just as bad as doing nothing (sometimes it's even worse, but that's not the argument I'm making here).

There are other, similar reasons why it's a bad strategy to see containers as a security measure. And I fail to see how anybody who has spent any amount of time in this industry, and who has spent any amount of time applying privilege separation principles, could reasonably disagree.

Now, can containers help you write more secure software? Yes, depending on context. But not to the exclusion of other techniques. Containers don't add a useful security layer by themselves, and shipping with slightly tighter capabilities doesn't change this fact in practice. Again, saying that shipping with slightly tighter capabilities (the particular capabilities system at issue here) is better than not is a distinction without a difference when most kernel-based root exploits are in innocuous system calls (e.g. networking system calls) that you could never lock down in a default container setting. If an application writer can't be bothered to make their software work without root, there are evidently other things going on.[1] Again, the problem in reality is software quality--bad kernel quality, bad application quality--and papering over this doesn't help.

[1] A simple, single setuid call shortly after application startup would work at least as well as narrowing process capabilities, and would likely be more effective. Even better, write your application to inherit socket and file descriptors so inetd or systemd can implement those privileged operations. These are minimal things all _correctly_ written software should support. If an application doesn't support at least the former, it's fundamentally broken and it's foolhardy to believe this can be papered over. Also, newsflash, systemd doesn't solve PID file races for uncooperative network daemons. That's another instance where people believed somebody came up with some magic software that inherently solved a problem without having to fix the underlying software. The difference being PID file races aren't a comparable security issue because they usually already require local, root capabilities; and systemd's solution arguably makes such a race meaningfully more difficult to exploit. So a misinformed engineer's decision in that context is unlikely to be part of a chain of events leading to billions of dollars of valuation going up in smoke.
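A rough sketch of both ideas from the footnote, in Python for brevity. The helper names are hypothetical; the LISTEN_PID / LISTEN_FDS environment variables and the starting descriptor 3 follow systemd's socket-activation convention, and the drop order (supplementary groups, then gid, then uid) is the usual one. It assumes the process starts as root and is handed a target uid/gid by its configuration.

```python
import os

SD_LISTEN_FDS_START = 3  # first inherited descriptor in systemd's convention


def inherited_fds(environ, pid):
    """Return descriptors handed over by a supervisor via LISTEN_PID and
    LISTEN_FDS, or [] if nothing was passed to this process."""
    try:
        if int(environ.get("LISTEN_PID", "")) != pid:
            return []  # the variables were meant for some other process
        count = int(environ.get("LISTEN_FDS", ""))
    except ValueError:
        return []  # variables missing or malformed
    if count < 0:
        return []
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + count))


def drop_privileges(uid, gid):
    """Permanently switch from root to an unprivileged uid/gid."""
    os.setgroups([])   # drop supplementary groups while still root
    os.setgid(gid)     # group first: after setuid we could no longer do it
    os.setuid(uid)
    try:
        os.setuid(0)   # must fail now, otherwise the drop did not stick
    except PermissionError:
        return
    raise RuntimeError("privileges were not dropped")
```

In a daemon this would run early in startup: call `inherited_fds(os.environ, os.getpid())`, wrap the result with `socket.socket(fileno=fd)`, and only fall back to binding the port itself (followed by `drop_privileges`) when the list is empty.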

> is fundamentally broken from a security perspective in the relevant dimension (that is, local exploits leading to privilege escalation).

I'm not sure why you're pointing out containers, Linux, or any specific implementation of anything. What you said seems both true and obvious - local exploits lead to privilege escalation. The only other options are denial of service or information disclosure. All practical systems will have all of those, unless we go into fully verified OS territory.

While I don't strongly disagree with the result, I think this is the wrong way to analyse security of some components. Containers provide abstraction over namespaces. Namespaces isolate some operations (fs access for example), but not others. (syscalls) Whether that improves the security of the system or not depends heavily on the configuration and context. There's a big overlap with the security improvement provided by dedicated system accounts - and that's worth remembering.

But let's not say containers don't improve on security as some absolute statement. There are tradeoffs everywhere.

As a side note, I think your view is a bit too simplistic. In my own situation, and I guess in many others, most software packages assume some level of root access, and trying to manage the whole mix of configuration files when you're deploying multiple web frameworks on the same machine/VM can be a configuration nightmare.

Put more simply: yes, doing it correctly requires inspecting each and every web-tier solution, and its sockets, permissions, group permissions, and dependencies. This consumes a hell of a lot of time! The alternative is to throw the web-tier solution or package in its own container, and place monitoring software on the container to kill the container if things go wild or somebody figures out how to hack through the web-tier solution.

Why would you use pid files when you've got systemd available? All the process management can be done directly with children or cgroups for some crazy runaway processes.

systemd DOES NOT solve the PID file race. What it does is poll a cgroups proc file that lists the PIDs in a cgroup, then iteratively call kill on those PIDs. But this is _just_ as racy as a regular PID file. I explain more here, with links to the relevant bits of code:

Long story short, Linux does not offer a way to atomically send a signal to a cgroup, and so there's no way for systemd to reliably kill errant (or worse, malicious) processes that keep forking. Worse, last time I checked systemd had a bug in their read-iterate-kill loop that could lead to an infinite loop. I also detail that issue in the above LWN thread.

If you _carefully_ parse Poettering's statements regarding systemd, a charitable interpretation would be that he never claimed systemd solved the PID race for uncooperative software. Unfortunately, everybody believed and most people still believe that it does. After all, there was never a problem with PID races with _cooperative_ software, so it was natural to assume that what systemd was bringing to the table was a solution to uncooperative processes.

That doesn't mean systemd isn't slightly better than `killall [process name pattern]`. But in security an inherently broken solution isn't any solution at all.

> systemd DOES NOT solve the PID file race

I did not claim otherwise. And I wouldn't even describe it as PID file race. PID files are a specific thing that was used with older inits.

I don't believe any init can solve the issue you're describing - it exists on a different layer. But cgroups freezer should be able to solve it.

I thought your question implied that systemd was superior to PID files because it solved the PID race.

To answer your question more directly, why do I use PID files:

1) Because I write and value portable software, both across different Linux distros and across different POSIX-like systems. I regularly target AIX, FreeBSD, NetBSD, OpenBSD, OS X, and Solaris; none as second class citizens.

2) Because IME the compile-test-debug cycle and similar non-production runtime modes is easier when you don't have to fiddle with separate init files. Supporting a simple PID-file-based backgrounding capability directly in your software is a relatively simple matter when not implemented as an after-thought.

3) Because sometimes you want to implement a service using multiple processes (e.g. worker pools, specialized privilege separation, etc), which is easier (both in development and production) when you fork them from your own master rather than trying to configure and coordinate through a third-party supervisor. Note that this doesn't exclude your service's supervisor running in the foreground under another supervisor like init (e.g. systemd) or daemontools.

4) Because using process groups and a shared controlling pseudo-terminal you _can_ atomically send a signal to all children in a process group. It's more robust than what systemd currently does with cgroups, it's portable as a practical matter, and relatively simple to accomplish early in main().
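A minimal sketch of the process-group half of point 4, in Python rather than C. It puts a child in its own session and process group, lets it fork once, and then signals the whole group with a single killpg() call. The controlling-terminal part is not shown, and a process that deliberately calls setsid() or setpgid() would leave the group, so treat this as an illustration of the one-call group signal and nothing more. The function name is hypothetical.

```python
import os
import signal
import subprocess
import sys
import time


def kill_group_demo():
    """Start a child in a new process group, let it fork, then terminate
    every member of the group with a single killpg() call."""
    child_code = (
        "import os, time\n"
        "os.fork()\n"        # the grandchild inherits the process group
        "time.sleep(60)\n"
    )
    # start_new_session=True makes the child call setsid(), so it becomes
    # the leader of a fresh session and process group (pgid == its pid).
    proc = subprocess.Popen([sys.executable, "-c", child_code],
                            start_new_session=True)
    time.sleep(0.5)                    # give the child time to fork
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)    # one call, delivered to all members
    return proc.wait(timeout=10)       # negative signal number on POSIX
```

Calling `kill_group_demo()` returns the child's exit status, which on POSIX is the negated signal number when the process died from a signal.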

Note that I write my software to support running in the foreground under the control of any supervisor of choice, including support for inheriting a listening socket. This is always the default mode. Backgrounding, PID files, etc, are optional capabilities in my daemons.

Regarding cgroups freezer, it does seem like it should work. I stand corrected wrt Linux supporting atomic signals to an entire cgroup. But it seems like neither systemd nor Docker has actually implemented support. Perhaps it's for lack of motivation, or perhaps because there are non-trivial issues to resolve. It would be interesting to know what the hold up is.

Or, more generally:

> Why would you use PID files when you've got proper service management available?

as of course the flawed PID file mechanism was addressed long before systemd was even an idea.

cgroups do not solve the problem of "crazy runaway processes", by the way. cgroups are not jobs (in the Windows NT and VMS senses), and lack the functionality of a true job mechanism.

When systemd terminates all of the processes in a cgroup, it doesn't issue a single "terminate job" system call. There isn't such a thing. It sits in a loop in application-mode code repeatedly scanning all of the process IDs in the cgroup (by re-reading a file full of PID numbers) and sending signals to new processes that it hasn't seen before. There are several problems with this.

A first problem is that systemd can be slower than whatever is reaping child processes within the process group, leading to the termination signals being sent to completely the wrong process: one that just happened to re-use the same process ID in between systemd reading the cgroup's process list file and it actually getting around to sending signals to the list of processes. The order of events is: (1) systemd reads PID N from the cgroup's process ID list, then its timeslice happens to expire; (2) process N is scheduled and exits; (3) process N's reaper (its parent, or a sub-reaper) cleans up the zombie in the process table; (4) some vital part of the system is scheduled and spawns a new process re-using ID N; (5) systemd is finally scheduled, works down the list of PIDs that it read before, and merrily kills the new process.
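The five steps can be replayed with a toy process table. This is a simulation only: the class, the names, and the "lowest free PID" allocation policy are invented to keep the example deterministic and are not how Linux allocates PIDs.

```python
class ProcTable:
    """Toy process table that always hands out the lowest free PID."""

    def __init__(self):
        self.procs = {}  # pid -> process name

    def spawn(self, name):
        pid = 1
        while pid in self.procs:
            pid += 1
        self.procs[pid] = name
        return pid

    def exit_and_reap(self, pid):
        del self.procs[pid]  # process exits and its zombie is cleaned up

    def kill(self, pid):
        return self.procs.pop(pid, None)  # name of whatever actually died


def replay_race():
    table = ProcTable()
    table.spawn("init")
    n = table.spawn("runaway-worker")
    snapshot = [n]                  # (1) supervisor reads the PID list
    table.exit_and_reap(n)          # (2) + (3) process N exits and is reaped
    table.spawn("vital-daemon")     # (4) a new process re-uses PID N
    return [table.kill(pid) for pid in snapshot]  # (5) stale list is killed


print(replay_race())
```

The list returned by `replay_race()` names the processes that actually received the kill, which in this replay is the newcomer and not the worker the supervisor meant to stop.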

A second problem is that a program that forks new processes quickly enough within the cgroup can keep systemd spinning for a long time, in theory indefinitely as long as suitable "weather" prevails, as at each loop iteration there will be one more process to kill. Note that this does not have to be a fork bomb. It only has to fork enough that systemd sees at least one more new process ID in the cgroup every time that it runs its loop.

A third problem is that systemd keeps process IDs of processes that it has already signalled in a set, to know which ones it won't try to send signals to again. It's possible that a process with ID N could be signalled, terminate, and be cleaned out of the process table by a reaper/parent; and then something within the cgroup fork a new process that is allocated the same process ID N once again. systemd will re-read the cgroup's process ID list, think that it has already signalled the new process, and not signal it at all.
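The third problem is easy to reproduce with a toy version of such a loop. This is not systemd's code, just a sketch of the behaviour described above: remember which PIDs were signalled and stop once a scan turns up nothing new. The process names and PIDs are invented.

```python
def kill_loop(read_pids, send_signal, max_rounds=100):
    """Naive read-iterate-kill loop that remembers signalled PIDs and
    stops as soon as a scan finds no PID it has not seen before."""
    seen = set()
    for _ in range(max_rounds):
        new = [pid for pid in read_pids() if pid not in seen]
        if not new:
            return seen
        for pid in new:
            send_signal(pid)
            seen.add(pid)
    raise RuntimeError("cgroup never went quiet")


def demo_skipped_process():
    cgroup = {6: "forker", 7: "worker"}  # pid -> name, hypothetical
    signalled = []

    def read_pids():
        return sorted(cgroup)

    def send_signal(pid):
        name = cgroup[pid]
        signalled.append(name)
        if name == "worker":
            del cgroup[pid]            # worker dies and is reaped...
            cgroup[pid] = "new-child"  # ...forker forks and PID 7 is reused
        # "forker" handles the signal and keeps running

    kill_loop(read_pids, send_signal)
    return signalled, cgroup


signalled, survivors = demo_skipped_process()
print(signalled, survivors)
```

After the loop ends, "new-child" is still in the cgroup and was never signalled, because its PID was already in the seen set.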

These are addressed by a true "job" mechanism. But cgroups are not such.

You can also just freeze the whole cgroup, and then the kernel will cascade it to all tasks within said cgroup, and then you can do process management.
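Roughly, the freeze-then-manage idea looks like this. It is a sketch assuming the cgroup v1 freezer controller, where a cgroup directory exposes `freezer.state` (accepting FROZEN and THAWED) and `cgroup.procs`; the function name, the timeout, and the injectable `kill` parameter are illustrative choices, and real use needs enough privilege to write to the cgroup.

```python
import os
import signal
import time


def freeze_kill_thaw(cgroup_dir, kill=os.kill, sig=signal.SIGKILL, timeout=5.0):
    """Freeze a v1 freezer cgroup, signal every process in it, then thaw.

    While the group is frozen nothing in it can fork, so the PID list read
    from cgroup.procs cannot go stale between reading and signalling.
    """
    state_path = os.path.join(cgroup_dir, "freezer.state")
    procs_path = os.path.join(cgroup_dir, "cgroup.procs")

    with open(state_path, "w") as f:
        f.write("FROZEN")
    try:
        # The kernel may report FREEZING for a while; wait until it settles.
        deadline = time.monotonic() + timeout
        while True:
            with open(state_path) as f:
                if f.read().strip() == "FROZEN":
                    break
            if time.monotonic() > deadline:
                raise TimeoutError("cgroup did not freeze in time")
            time.sleep(0.01)

        with open(procs_path) as f:
            pids = [int(line) for line in f if line.strip()]
        for pid in pids:
            kill(pid, sig)  # signals stay pending until the group is thawed
        return pids
    finally:
        # Always thaw, even on error, so the group is not left stuck.
        with open(state_path, "w") as f:
            f.write("THAWED")
```

This assumes nothing else is toggling the freezer state on the same cgroup at the same time.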

Most of what you complain about though is solved by duplicate pids not being assigned in short succession (which is generally true, though in reality is an implementation detail of linux which is not promised).

I was expecting someone to mention the freezer. Not only does systemd not use the freezer, but the systemd people explicitly describe it as having "broken inheritance semantics of the kernel logic". You'll have to ask them what they mean by that, but the freezer does not, for them, magically turn cgroups into a job mechanism either.

Which leads to another point. If you think that this is my complaint, you should read the systemd people on the subject over the years. They were rather expecting (version 2) cgroups to be a true job mechanism, and they've complained every time that it has once again turned out not to be.

Moreover: this is not to mention that Docker and others will manipulate the freeze status of control groups for their own purposes, and there is no real race-free mechanism for sharing this setting amongst multiple owners, such as an atomic read-and-update for it.

Sure, it's still not perfect. What I meant with cgroups for runaway processes is that at least you still see which group they belong to. We've still got to solve the "kill the whole cgroup" idea.

> Defence in depth.

You know, nearly every time I hear that phrase, something bad follows. You are kind of an exception, that's not a bad set of protections, except that they are off-topic, and won't apply to container design.

In practice real defense does not have depth. If you are in a position where you can apply a deep stack of protections, you've already lost.

Please, just please tell me you don't work in cyber security full time.

That "if you can't secure it perfectly with one tool, you're doing it wrong/you've lost/etc." is so mind-bogglingly anti-security that if you are in charge of more than your home network for network security, that company has lost. That's like saying "If your application needs high availability instead of just being reliable, you've already lost."

We secure networks by putting things on different VLANs. And putting them through firewalls that govern access between different parts of the network. And by enforcing user-based access control via the firewall ("next-gen" features) and then hopefully, at the application level. And for privileged systems, we are pushing to use two-factor authentication.

I just listed off 6 different access controls for a given application. Hopefully, each one of those independently would secure the application pretty well. But the sum of all those make for a very well secured app, with no single point of failure.

Only one of those doesn't apply actually!

> run my processes as uid!=0, shouldn't use selinux

You can automatically have your applications all run in user namespaces (uid!=0) with containers. You can automatically apply selinux policies to all containers and then layer further contexts.

"Containers cannot and will not provide effective security."

Will you PLEASE stop this FUD. What you mean to say is LINUX containers cannot and will not provide effective security.

And the reason for that is they were never designed with security as the starting point. Linuxworld believes you can just bolt security on later. You can't.

FreeBSD jails and Joyent zones have solved containers by starting with security as the constraint. It's other platforms that are horribly designed and continue to waste resources reinventing the wheel over and over again.

> The only practical security boundary for software as complex as Linux or Windows is physically separate hardware

Maybe you will be interested in a paper concerning air-gapping by Joanna Rutkowska: http://invisiblethingslab.com/resources/2014/Software_compar...

tl;dr: air-gapping has its own strong disadvantages.

That sounds entirely reasonable. Based on the ensuing discussion, it sounds like -

You don't secure the container, you secure the system running the container.

When you hand me a container to run, maybe I run it in a secured VM. Or maybe I run it inside a secured container; or I [get further out of my depth and] secure the Docker process itself.

From the perspective of the person writing the code running in the image, I just need to secure my application code - such as standard SQL injection, proper management of configuration keys, etc - but I don't really worry about securing the system running my container from my container.

...Although I guess that raises the question: how much do I need to secure my container from the system it's running on - or is that such a breach of trust it's pointless?

> So... coming from Rails land

I'm sorry

As they note, RunC is definitely a safer substrate for containers than fully-dressed Docker.

I assume that we have some OpenShift engineers in the house -- is v3 running directly on Docker and dropping capabilities, or does it run on RunC and then add them back as needed?

Disclosure: I work for Pivotal, we're on the Cloud Foundry side of the PaaS fence, which I guess makes us the Montagues.

Clickbait headline.

Also the actual headline and probably used tongue in cheek.

Why yes - finally a savvy tech article to post on Facebook!

Sure is, but it made me chuckle.

Project Atomic (mentioned at the bottom of the OP) looks super-handy.

> Secure Your Containers with This One Weird Trick

tl;dr run SmartOS?

so much talk about security but still the blog is served as http. SMH

I'd find it a lot easier to believe this isn't a redhat FUD-job if the article listed the entire list of privileges and contrasted those that docker gives by default and those that require --privileged [1]. Also required would be meaningful discussion about how other defaults, such as networking, affect these capabilities.

>This one’s easy. If you have this capability, you can bind to privileged ports (e.g., those below 1024).

Sure, it can bind to port 80 on a virtual network device that is unique to that docker container, unless you give the container host network privileges.

[1] https://docs.docker.com/engine/reference/run/#/runtime-privi...

The article told you how to find the entire list of privileges. Just type `man 7 capabilities` or go here: http://man7.org/linux/man-pages/man7/capabilities.7.html

The `--privileged` flag is equivalent to giving your container all capabilities: https://github.com/docker/docker/blob/97660c6ec55f45416cb2b2...
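If you want to see what a given process actually holds, the kernel exposes the capability sets as hex bitmasks in /proc/<pid>/status. Below is a small sketch that decodes such a mask into names. The table covers capability numbers 0 through 37 as documented in capabilities(7); newer kernels define additional bits that this sketch ignores, and the function names are made up.

```python
# Capability names indexed by bit number, per capabilities(7).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ",
]


def decode_caps(mask):
    """Turn a capability bitmask into a list of capability names."""
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


def effective_caps(status_path="/proc/self/status"):
    """Read the CapEff line of a /proc status file and decode it."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapEff:"):
                return decode_caps(int(line.split()[1], 16))
    return []
```

Running `effective_caps()` inside a container before and after dropping capabilities gives a quick check that the drop took effect.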

I don't get what you mean by saying how networking affects capabilities. Because it doesn't?

The post chooses to emphasize the scary (ZOMG it allows port 22!) without the necessary information that the default networking mode is bridged, such that the container won't be opening port 22 on the host. So the article implies that this capability allows a security risk, but the default networking mitigates that risk.
