
So, `docker exec <container id> cat /path/to/file` is definitely something that works, is as safe as `docker exec`, and doesn't involve running any runtime code inside the container. The cost is that cat buffers the data, and the runtime then reads it from a pipe and re-buffers it. Is that really ludicrously expensive? It doesn't seem any worse than SCP, which is just SSH executing the SCP binary as the command and communicating over the stdin/stdout channels.

Doing it with cat is just an example, since cat may not be present inside the container. Instead of cat, and in the absence of fork-without-exec in go, you could execute a small C program injected similarly to `/dev/init` (or via memfd_create) that uses sendfd over a unix socket created by socketpair to pass back a file descriptor to the runtime so that the data is only buffered once.




A privileged user in the container could ptrace(2) the process and start messing with its output. If you have an IPC protocol (like the sendfd you bring up later) then you've opened a window for attacks on the container runtime. Yes, you could double (and triple) check the fd is not malicious, but so many programs already don't do this properly -- depending on it being done properly is not an ideal solution.

So, you don't want to join all the namespaces, only the mount namespace. But if you join the mount namespace then MS_MOVE might start messing with your resolution. So really, what you want to do is to just pivot_root(2) into the container's rootfs -- this is what "docker cp" already does today and it has a bug because it pivot_root(2)s into an attacker-controlled path. And all of these solutions will require a fork+exec with Go.

> Is that really ludicrously expensive?

If you're doing it once for every single filesystem syscall (every stat(2), readlink(2), open(2), getdents(2), and so on) then yes it gets very expensive. As I mentioned, the other solution is to run entire operations inside a pivot_root(2) but then your runtime code gets more complicated (Docker already does this and it's already hard enough to understand how things like chrootarchive actually work at the end of the day). Not to mention that (as this bug shows) even if you run inside a pivot_root(2) you might make a mistake that entirely negates the security benefits.


To be clear, docker exec is not currently vulnerable to this attack, right?

Currently, docker exec (somehow safely) fully enters the container's namespaces and cgroups, then exec's a command inside the container. My suggestion was basically to have a statically compiled C binary that executes in the fully untrusted context, which things can ptrace and manipulate all they want. The thought was that the C binary would open the file descriptor from inside the untrusted context, so that it is incapable of doing anything privileged, and then send the file descriptor back over the inherited unix socket via sendfd. I'd imagine the only way this could be vulnerable is if sendfd is vulnerable somehow, since this means 100% of the path resolution happens from a fully isolated context.

The performance argument makes plenty of sense, but it sounds like it'd be solvable by just doing a classic tar pipe where tar (or similar) is running in the fully untrusted context and writing its output to a pipe (with no unix sockets involved). You'd just need to get that statically compiled tar binary into the container, similar to how `/dev/init` is done. Would this be unreasonable? `kubectl cp` is already doing an actual tar pipe via docker exec; the missing bit is that it fails if tar does not exist inside the container, so you'd need to inject it in. This would fully remove the complexity of chrootarchive and any path checking, and you'd be able to rely entirely on the security constraints of docker exec.


My point wasn't that docker exec is vulnerable; it's that if you were to write a script like:

  % docker exec ctr tar cf - some/path | tar xf -
it would be vulnerable to attack, because the container process could ptrace(2) the tar process and then change its output to cause the host-side tar to start overwriting host files.

My point is that you have to be careful if you're designing a system where the security depends on running a process inside the container and trusting its output -- ideally you wouldn't run any process inside the container at all so that an attacking process can have no impact on it. And that's true even if we assume that the "tar" is not a binary that the container has direct control of.

This concern also applies if you aren't using tar and are instead using some other technique. Hell, even if you don't have an IPC system you could still probably attack a process that is just trying to do a single open(2) call -- you just need to make it pass a different fd.


My argument is that the kernel gives us namespaces, seccomp, selinux, apparmor, etc for isolation, and attempting to implement all of the path resolution and permission checking from a privileged context outside of the container defeats all of that and requires reimplementing all of those guards from userspace, which feels futile. By using tar, you're left with serialized path strings and file contents rather than file descriptors, and it should be far easier to sanitize those strings than deal with the Linux filesystem API.

I definitely recognize that the container process could ptrace the tar process, and with kubectl cp, it's even directly using whatever tar binary is in the container, so tar could easily be malicious from the start. But what it can never do is break out of the container onto the node, as long as the tar stream is not extracted on the node with the docker daemon's privileges, which is extremely important for multi-tenant environments.

If you executed your example command as root on the node, then yes, a vulnerability in the node's tar implementation could allow a malicious tar file to take over the node at extraction time, but tar does guard against this by default, as do standard POSIX user permissions: the tar extraction can happen in a completely unprivileged context.

I do view tar's extraction as a valid attack surface, since modern tar implementations are complex; however, that would require a tar CVE, and there's no reason that `docker cp`'s output target handling is any less vulnerable to the same problems. I really think the most important thing to guard against is at input time.


"kubectl cp" has had security bugs in the past[1] that are very much in line with what I just outlined (I didn't know this beforehand -- but I would've guessed it was vulnerable if they hadn't seen this issue before). In fact the fix in [1] doesn't look entirely complete to me -- it seems to me you could further mess with the output.

I agree that we should use defence in depth (and all of those kernel facilities are great), but actually joining the container itself is not a good idea -- you need to treat it as the enemy. I am not in favour of implementing them all in userspace, which is why I'm working on new kernel facilities to restrict path resolution.

[1]: https://github.com/kubernetes/kubernetes/pull/75037



