
Dropping Linux capabilities to improve container security - cheiVia0
http://rhelblog.redhat.com/2016/10/17/secure-your-containers-with-this-one-weird-trick/
======
brendangregg
I wrote an open source BPF tool recently that shows which capabilities your
application is using, live, for the purposes of determining such a whitelist:
[http://www.brendangregg.com/blog/2016-10-01/linux-bcc-securi...](http://www.brendangregg.com/blog/2016-10-01/linux-bcc-security-capabilities.html)
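For a quick static check (distinct from the live tracing the linked BPF tool does), a process's capability bitmaps can also be read from procfs. A minimal Python sketch, assuming Linux; the capability-name table here is deliberately abbreviated, with bit positions taken from linux/capability.h:

```python
# Read the effective capability bitmap of a process from /proc/<pid>/status.
def effective_caps(status_path="/proc/self/status"):
    """Return the CapEff bitmap as an int."""
    with open(status_path) as f:
        for line in f:
            if line.startswith("CapEff:"):
                return int(line.split()[1], 16)
    raise RuntimeError("CapEff not found")

# A few well-known bit positions from <linux/capability.h> (abbreviated).
CAP_NAMES = {
    0: "CAP_CHOWN",
    2: "CAP_DAC_READ_SEARCH",
    10: "CAP_NET_BIND_SERVICE",
    21: "CAP_SYS_ADMIN",
}

def held(caps, bit):
    """True if capability bit `bit` is set in bitmap `caps`."""
    return bool(caps >> bit & 1)

if __name__ == "__main__":
    caps = effective_caps()
    print(f"CapEff = {caps:#x}")
    for bit, name in sorted(CAP_NAMES.items()):
        print(f"{name}: {'yes' if held(caps, bit) else 'no'}")
```

This only shows what the process currently holds, not what it actually uses; the BPF approach answers the second question.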

~~~
fnord123
As you guys are using FreeBSD at Netflix, have you reached out to OpenBSD and
tried pledge[1]? It looks like a terrific addition to the sandboxing suite of
tools. It seems like it would go hand in hand with your whitelisting tool, if
only they ran on the same platform.

[1] [http://man.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man2/...](http://man.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man2/pledge.2)

------
josteink
It's sad to see the word "container" used synonymously with "Docker container".
There are other container technologies out there, and this one only applies to
Docker.

Have we landed in what seems like yet another mono-culture situation where
there are actually several viable technologies at hand?

If so, that would be a real shame. Docker is broken in quite a lot of ways and
needs all the competition it can get.

~~~
nilliams
Could you elaborate on (or link us to examples of) those 'quite a lot of
ways'?

~~~
josteink
Sure. Here's the slide-set for a pretty good rant (although in Norwegian):

[https://kly.no/presentations/2016-09-24_Ignite-talk_Docker.p...](https://kly.no/presentations/2016-09-24_Ignite-talk_Docker.pdf)

The good news is that Google Translate understands PDFs and mostly gets it
right:

[https://translate.google.com/translate?sl=no&tl=en&js=y&prev...](https://translate.google.com/translate?sl=no&tl=en&js=y&prev=_t&hl=en&ie=UTF-8&u=https%3A%2F%2Fkly.no%2Fpresentations%2F2016-09-24_Ignite-talk_Docker.pdf&edit-text=)

That should be a decent start for further discussion :)

~~~
digi_owl
Yeah, "decent". The whole thing is a childish spiel about "docker bad, systemd
cool, m'kay".

------
fulafel
What's holding back user namespaces with Docker? Are containers requiring real
root common, or is there some other reason it's turned off by default?

I guess the explanation might be the example image here: it's called "fedora".
I thought running whole Linux distributions VM/LXC-style under Docker didn't
really work - the Docker people always said you're only supposed to run one app
per container. Has this changed?

~~~
e1g
The core philosophy of "one app per container" has not changed. Strictly
speaking, this will never change because containers - whether Docker, ACI, or
OCI - are _processes_ and not _VMs_. I recommend this accessible presentation
on how to write a poor man's Docker by hand[1] to remove some of the magic
behind containers.

As with any philosophy, there are many smart people who hold an opposite view.
Docker itself is perhaps deviating from the architectural simplicity of
container==process, while the underdog rkt is enforcing this approach even at
the cost of user-land usability (e.g. rkt cannot run containers as daemons by
itself).

In practice, most containers must have an OS wrapping the application (unless
you're distributing binaries), and that OS often ends up being a nearly fully
baked version of Ubuntu/Debian/Fedora or whatever the person is most
comfortable with outside of Docker.

edit PS: One common exception to this philosophy is a short-lived container
for admin/management/debugging. Having a full OS with devel/admin tools is
useful for investigation/maintenance purposes, then you kill it and nothing is
changed in prod.

[1]
[https://www.youtube.com/watch?v=HPuvDm8IC-4](https://www.youtube.com/watch?v=HPuvDm8IC-4)

~~~
insaneirish
> Strictly speaking, this will never change because containers - whether
> Docker, ACI, or OCI - are processes and not VMs.

Who died and made you king of definitions? Use of the word "container" for
virtualizing a full OS instance predates [1] Docker, ACI, OCI.

[1]: [https://us-east.manta.joyent.com/bcantrill/public/ppwl-cantr...](https://us-east.manta.joyent.com/bcantrill/public/ppwl-cantrill-zones.pdf) (I wouldn't be surprised if there are even earlier IBM or FreeBSD uses of the word "container" with respect to OS virtualization.)

~~~
e1g
Haha, valid point. My thinking is within the current nomenclature and
implementations of "containers" vs "Virtual Machines". These definitions could
change drastically in the future.

------
tayo42
I never understood why container security needs to be treated differently than
regular security for the OS and process.

~~~
digi_owl
Because they are effectively turning into a mini-OS.

The whole container hype is an offshoot of the VM hype from a few years back,
which was all about running multiple, apparently separate, servers on shared
hardware. This was done by putting the OS and server installs into VM images
managed by hypervisors (effectively a stripped-down OS that exists only to
run VMs).

The VM approach has certain issues with overhead, as you are running multiple
kernels and user spaces stacked on top of each other, with the underlying CPU
spending time pretending to be whatever hardware the VM presents to the OS
within it.

Thus you get containers: a chroot on steroids that claims to bring some of the
benefits of VMs (less idle hardware, which gets all the accountants hot and
bothered) but with reduced overhead, by having the containers share a kernel
and general userspace.

Frankly we keep adding turtles to the stack.

We start out with a single CPU running a single process.

Then we stack multiple CPUs into racks.

Then we implement timesharing so that each CPU can run multiple processes.

Then we virtualize the CPU into a process, which in turn timeshares processes.

All this because hardware takes up space and electricity even while idle, but
software does not. Thus the more software you can run on a single piece of
hardware, the "better".

------
RangerScience
So... coming from Rails land, where convention reigns, why is the locked-down
list not the standard?

~~~
TheDong
Docker has repeatedly optimized for usability (including dumb users) over
security. User experience is "my app just works", not "my app works, but only
after I remember to add these 10 capabilities".

Frankly, it seems like it's an explicit goal of docker to let people who know
nothing about linux use linux containers. Another obvious goal is to ship
fast, and add security later _maybe_ (see the disaster that is docker registry
auth).

~~~
wahern
Linux had a remotely exploitable root privilege vulnerability in the recvmmsg
system call just last week. Linux experiences local root exploits via
innocuous-seeming system calls roughly every 3 months, last time I checked.

Containers cannot and will not provide effective security. Capabilities more
generally cannot and will not provide effective security in practice.
Capabilities have been around in Linux since before 2000 and they've always
been trivial to circumvent because of kernel bugs. The problem has never been
granularity of capabilities but about code quality and surface area. If you're
concerned about security, you want at _least_ something like seccomp or
pledge. But as the recvmmsg issue shows even that's not enough. Heck, virtual
machines aren't even enough--Xen has had two privilege escalation
vulnerabilities in less than a year.

Containers are about easing deployments. Docker understands that, I think. And
while there are many aspects about the whole container phenomenon that I find
very unpalatable, I applaud Docker for understanding that, or at least not
peddling the security angle too heavily.

The only practical, one-size-fits-all security boundary for software as
complex as Linux or Windows is physically separate hardware. Linux's recent
recvmmsg exploit notwithstanding, direct remote exploits of kernels are rare
enough that you can make an argument for their benefit without exploit authors
rolling their eyes.

~~~
TheDong
> The only practical security boundary for software as complex as Linux or
> Windows is physically separate hardware

So because you can't really secure a linux kernel perfectly, you might as well
not bother with even trying at all?

Defence in depth. Sure, I understand there might be exploits that affect my
system, but that doesn't mean I shouldn't use prepared sql statements,
shouldn't run my processes as uid!=0, shouldn't use selinux, etc.

None of what you said is an excuse for not having more secure defaults and
encouraging users to practice defence in depth better.

~~~
wahern
Using prepared SQL statements is about writing correct software.

My point is that expecting containers to "secure" broken software is a
delusion, especially when the implementation enforcing the constraints--in
this case, the Linux kernel--is fundamentally broken from a security
perspective in the relevant dimension (that is, local exploits leading to
privilege escalation).

Arguing defense in depth in this case is like arguing that stacking 3 web
application firewalls on top of an application is better than just 1. The
problem being that none of them are capable of addressing the bulk of the
proven exploits.

Back to your prepared SQL statement analogy, consider that nobody should be
running software with root privileges today, regardless. Whether your network
daemon is run in a container or not, it shouldn't be running as root in its
steady state. So trying to lock down a container in such an instance is like
trying to prevent SQL injection in software that concatenates data into the
SQL statement string. It's intrinsically a failed strategy, and not in any
sense an engineering solution. It's like saying doing something is better than
doing nothing. Well, actually, in reality that's not the case; in reality
sometimes doing something is just as bad as doing nothing (sometimes it's even
worse, but that's not the argument I'm making here).

There are other, similar reasons why it's a bad strategy to see containers as
a security measure. And I fail to see how anybody who has spent any amount of
time in this industry, and who has spent any amount of time applying privilege
separation principles, could reasonably disagree.

Now, can containers help you write more secure software? Yes, depending on
context. But not to the exclusion of other techniques. Containers don't add a
useful security layer by themselves, and shipping with slightly tighter
capabilities doesn't change this fact in practice. Again, saying that shipping
with slightly tighter capabilities (the particular capabilities system at
issue here) is better than not is a distinction without a difference when most
kernel-based root exploits are in innocuous system calls (e.g. networking
system calls) that you could never lock down in a default container setting.
If an application writer can't be bothered to make their software work without
root, there are evidently other things going on.[1] Again, the problem in
reality is software quality--bad kernel quality, bad application quality--and
papering over this doesn't help.

[1] A simple, single setuid call shortly after application startup would work
at least as well as narrowing process capabilities, and would likely be more
effective. Even better, write your application to inherit socket and file
descriptors so inetd or systemd can implement those privileged operations.
These are minimal things all _correctly_ written software should support. If
an application doesn't support at least the former, it's fundamentally broken
and it's foolhardy to believe this can be papered over. Also, newsflash,
systemd doesn't solve PID file races for uncooperative network daemons. That's
another instance where people believed somebody came up with some magic
software that inherently solved a problem without having to fix the underlying
software. The difference being PID file races aren't a comparable security
issue because they usually already require local, root capabilities; and
systemd's solution arguably makes such a race meaningfully more difficult to
exploit. So a misinformed engineer's decision in that context is unlikely to
be part of a chain of events leading to billions of dollars of valuation going
up in smoke.
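The setuid-shortly-after-startup pattern from [1] can be sketched in Python. This is illustrative only; the choice of uid/gid values, the ordering, and the final verification step are my own, not from any particular daemon:

```python
import os

def drop_privileges(uid, gid):
    """Drop root after privileged setup (e.g. after binding port 80).
    Returns True on a verified drop, False if there was nothing to drop.
    Order matters: supplementary groups first, then gid, then uid --
    once the uid is dropped we can no longer change the gid."""
    if os.getuid() != 0:
        return False              # not root: nothing to drop
    os.setgroups([])              # clear supplementary groups
    os.setgid(gid)
    os.setuid(uid)
    # Verify the drop is irreversible: regaining root must now fail.
    try:
        os.setuid(0)
    except PermissionError:
        return True
    raise RuntimeError("privilege drop failed: root was regained")
```

A daemon would call this right after its privileged startup work (binding low ports, opening log files) and before touching any untrusted input.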

~~~
viraptor
Why would you use pid files when you've got systemd available? All the process
management can be done directly with children or cgroups for some crazy
runaway processes.

~~~
JdeBP
Or, more generally:

> _Why would you use PID files when you've got proper service management
> available?_

as of course the flawed PID file mechanism was addressed long before systemd
was even an idea.

cgroups do not solve the problem of "crazy runaway processes", by the way.
cgroups are not jobs (in the Windows NT and VMS senses), and lack the
functionality of a true job mechanism.

When systemd terminates all of the processes in a cgroup, it doesn't issue a
single "terminate job" system call. There isn't such a thing. It sits in a
loop in application-mode code repeatedly scanning all of the process IDs in
the cgroup (by re-reading a file full of PID numbers) and sending signals to
new processes that it hasn't seen before. There are several problems with
this.

A first problem is that systemd can be slower than whatever is reaping child
processes within the process group, leading to the termination signals being
sent to completely the wrong process: one that just happened to re-use the
same process ID in between systemd reading the cgroup's process list file and
it actually getting around to sending signals to the list of processes. The
order of events is: (1) systemd reads PID N from the cgroup's process ID list,
then its timeslice happens to expire; (2) process N is scheduled and exits;
(3) process N's reaper (its parent, or a sub-reaper) cleans up the zombie in
the process table; (4) some vital part of the system is scheduled and spawns a
new process re-using ID N; (5) systemd is finally scheduled, works down the
list of PIDs that it read before, and merrily kills the new process.

A second problem is that a program that forks new processes quickly enough
within the cgroup can keep systemd spinning for a long time, in theory
indefinitely as long as suitable "weather" prevails, as at each loop iteration
there will be one more process to kill. Note that this _does not_ have to be a
fork bomb. It only has to fork enough that systemd sees _at least one more_
new process ID in the cgroup every time that it runs its loop.

A third problem is that systemd keeps process IDs of processes that it has
already signalled in a set, to know which ones it won't try to send signals to
again. It's possible that a process with ID N could be signalled, terminate,
and be cleaned out of the process table by a reaper/parent; and then something
within the cgroup fork a new process that is allocated the same process ID N
once again. systemd will re-read the cgroup's process ID list, think that it
has already signalled the new process, and not signal it at all.

These are addressed by a true "job" mechanism. But cgroups are not such.
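The first race above can be shown with a toy, in-memory simulation. No real processes or signals are involved; the tiny PID space and all names are illustrative:

```python
class FakeKernel:
    """Tiny in-memory stand-in for the kernel's PID allocator."""
    def __init__(self, max_pid=3):
        self.max_pid = max_pid   # tiny PID space to force quick reuse
        self.next_pid = 0
        self.procs = {}          # pid -> owner label

    def spawn(self, owner):
        # Allocate the next free PID, wrapping around like Linux does.
        for _ in range(self.max_pid):
            pid = self.next_pid % self.max_pid
            self.next_pid += 1
            if pid not in self.procs:
                self.procs[pid] = owner
                return pid
        raise RuntimeError("PID space exhausted")

    def exit(self, pid):
        del self.procs[pid]

k = FakeKernel()
victim = k.spawn("cgroup-member")       # systemd reads this PID...
snapshot = [victim]                     # ...into its kill list, then sleeps
k.spawn("cgroup-member")                # PIDs 1 and 2 get used up,
k.spawn("cgroup-member")                # so PID 0 is the next to be reused
k.exit(victim)                          # member exits and is reaped
bystander = k.spawn("vital-system-service")  # PID 0 is reused
# systemd is finally scheduled and works down its stale snapshot:
killed = [k.procs[pid] for pid in snapshot if pid in k.procs]
print(killed)  # ['vital-system-service'] -- the signal lands on the wrong process
```

The same stale-snapshot structure underlies the other two problems: the loop and the already-signalled set both key off PID numbers that the kernel is free to recycle.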

~~~
TheDong
You can also just freeze the whole cgroup, and then the kernel will cascade it
to all tasks within said cgroup, and then you can do process management.

Most of what you complain about, though, is solved by duplicate PIDs not being
assigned in short succession (which is generally true, though in reality it is
an implementation detail of Linux that is not promised).

~~~
JdeBP
I was expecting someone to mention the freezer. Not only does systemd not use
the freezer, but the systemd people explicitly describe it as having "broken
inheritance semantics of the kernel logic". You'll have to ask them what they
mean by that, but the freezer does not, for them, magically turn cgroups into
a job mechanism either.

Which leads to another point. If you think that this is my complaint, you
should read the systemd people on the subject over the years. They were rather
expecting (version 2) cgroups to be a true job mechanism, and they've
complained every time that it has once again turned out not to be.

Moreover: this is not to mention that Docker and others will manipulate the
freeze status of control groups for their own purposes, and there is no real
race-free mechanism for sharing this setting amongst multiple owners, such as
an atomic read-and-update for it.

------
jacques_chester
As they note, RunC is definitely a safer substrate for containers than fully-
dressed Docker.

I assume that we have some OpenShift engineers in the house -- is v3 running
directly on Docker and dropping capabilities, or does it run on RunC and then
add them back as needed?

Disclosure: I work for Pivotal, we're on the Cloud Foundry side of the PaaS
fence, which I guess makes us the Montagues.

------
quickben
Clickbait headline.

~~~
50CNT
It's also the actual headline, and probably used tongue-in-cheek.

------
AlexCoventry
Project Atomic (mentioned at the bottom of the OP) looks super-handy.

------
insaneirish
> Secure Your Containers with This One Weird Trick

tl;dr run SmartOS?

------
andrewvijay
So much talk about security, but the blog is still served over plain HTTP. SMH

------
lowbloodsugar
I'd find it a lot easier to believe this isn't a redhat FUD-job if the article
listed the entire list of privileges and contrasted those that docker gives by
default and those that require --privileged [1]. Also required would be
meaningful discussion about how other defaults, such as networking, affect
these capabilities.

> This one’s easy. If you have this capability, you can bind to privileged
> ports (e.g., those below 1024).

Sure, it can bind to port 80 on a virtual network device that is unique to
that docker container, _unless_ you give the container host network
privileges.

[1] [https://docs.docker.com/engine/reference/run/#/runtime-privi...](https://docs.docker.com/engine/reference/run/#/runtime-privilege-and-linux-capabilities)
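The privileged-ports rule itself is easy to demonstrate with a small Python sketch (my own example, not from the article): binding below 1024 needs CAP_NET_BIND_SERVICE or root, while port 0 asks the kernel for an unprivileged ephemeral port.

```python
import socket

def try_bind(port):
    """Attempt to bind a TCP socket on localhost; True on success,
    False if the kernel refuses for lack of privilege."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
        return True
    except PermissionError:
        # Without CAP_NET_BIND_SERVICE (or root), ports < 1024 are refused.
        return False
    finally:
        s.close()

if __name__ == "__main__":
    print("ephemeral port:", try_bind(0))   # permitted for any user
    print("port 80:", try_bind(80))         # needs CAP_NET_BIND_SERVICE
```

Note that inside a bridged Docker network the bind succeeds against the container's own network namespace, which is exactly the mitigation being discussed.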

~~~
kccqzy
The article told you how to find the entire list of privileges. Just type `man
7 capabilities` or go here: [http://man7.org/linux/man-pages/man7/capabilities.7.html](http://man7.org/linux/man-pages/man7/capabilities.7.html)

The `--privileged` flag is equivalent to giving your container all
capabilities:
[https://github.com/docker/docker/blob/97660c6ec55f45416cb2b2...](https://github.com/docker/docker/blob/97660c6ec55f45416cb2b2d4c116267864b62b65/daemon/exec_linux.go#L23)

I don't get what you mean by saying how networking affects capabilities.
Because it doesn't?

~~~
lowbloodsugar
The post chooses to emphasize the scary (ZOMG it allows port 22!) without the
necessary information that the default networking mode is bridged, such that
the container _won't_ be opening port 22 on the host. So the article implies
that this _capability_ allows a security risk, when the default networking
mitigates that risk.

