Not sure if there are other scenarios where this would hit as well.
One to be aware of, but as with most vulnerabilities, good to understand how it can be exploited, when you're assessing mitigations...
But, as I mentioned in TFA, the plan is to rework https://github.com/cyphar/filepath-securejoin to have a sane API that detects attacks on older kernels while using the new kernel bits (once merged).
Anyway, I switched to Caddy, even though I deeply love Apache and it has an amazing history behind it.
If the path contains symlinks, then we have to validate that the real parent directory of the symlink target doesn't allow that target to be replaced. E.g. we are following /foo/bar and bar is a symlink to /xyzzy/pop. If /xyzzy is writable to adversaries, pop can be replaced by a malicious symlink even if pop itself isn't writable to other users.
Anyway, the idea is that if a path we would like to operate on has any components vulnerable to manipulation, we fail the entire operation.
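A minimal sketch of that check, assuming "vulnerable" means "a directory on the walk is owned by an untrusted user or is group/world-writable". The names `entry_is_safe` and `unsafe_dirs` are made up for illustration, and the sketch deliberately has no exemption for sticky directories such as /tmp. Symlink targets are resolved in place, so the real parent directories of each target get checked as well.

```python
import errno
import os
import stat


def entry_is_safe(st, trusted_uids):
    """True if only trusted users can modify this directory."""
    if st.st_uid not in trusted_uids:
        return False
    # Group- or world-writable directories let other users swap entries.
    # (Assumption: no exemption for sticky directories such as /tmp.)
    return not (st.st_mode & (stat.S_IWGRP | stat.S_IWOTH))


def unsafe_dirs(path, trusted_uids, max_links=40):
    """Return every directory traversed while resolving `path` that an
    untrusted user could modify. An empty list means the walk looked sane."""
    if not os.path.isabs(path):
        raise ValueError("expected an absolute path")
    pending = [p for p in path.split("/") if p and p != "."]
    current, bad, links = "/", [], 0
    while pending:
        # The directory we are about to look inside must be tamper-proof.
        if current not in bad and not entry_is_safe(os.stat(current), trusted_uids):
            bad.append(current)
        name = pending.pop(0)
        if name == "..":
            # `current` never contains symlinks, so dirname is the real parent.
            current = os.path.dirname(current)
            continue
        candidate = os.path.join(current, name)
        if stat.S_ISLNK(os.lstat(candidate).st_mode):
            links += 1
            if links > max_links:
                raise OSError(errno.ELOOP, os.strerror(errno.ELOOP), path)
            target = os.readlink(candidate)
            if target.startswith("/"):
                current = "/"
            # Resolve the link target next, so its real parents get checked.
            pending = [p for p in target.split("/") if p and p != "."] + pending
        else:
            current = candidate
    return bad
```

A caller would fail the whole operation whenever `unsafe_dirs(path, {0, os.getuid()})` returns a non-empty list.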
Things can be simplified if we canonicalize the path (so it is free of symlinks) but that is often undesirable. Software should keep the paths it has been given as-is; the symlink abstraction belongs to the user and should be respected.
This combo exists exactly to avoid these kinds of issues.
Another way would be temporarily joining the container's mount namespace to obtain the source handle. But that can't really be done in go since goroutines don't play nicely with per-thread operations.
Edit: After looking through the go standard library it seems that there is an impedance mismatch. Go just does not expose the necessary pieces to do this properly. A dedicated docker-cp tool in C or Rust could probably handle this better. I could be wrong though, maybe it's just not part of the stdlib.
The point is that one could do this correctly today with the right rituals. openat2 wouldn't save you if you were still doing plain realpath+open across a security boundary even though openat+procfs or setns are available.
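Roughly what the openat+procfs ritual could look like, as a sketch and not what Docker or filepath-securejoin actually ship. `open_beneath` is a made-up name; it assumes Linux, a mounted and trustworthy /proc, and a kernel where fstat(2) works on O_PATH descriptors. It walks one component at a time relative to a directory descriptor, re-roots absolute symlink targets, and finally asks procfs where the descriptor really points.

```python
import errno
import os
import stat


def _split(path):
    return [p for p in path.split("/") if p and p != "."]


def open_beneath(root, unsafe_path, max_links=40):
    """Resolve `unsafe_path` inside `root`, returning an O_PATH descriptor."""
    root_fd = os.open(root, os.O_PATH | os.O_DIRECTORY)
    cur = os.dup(root_fd)
    try:
        root_real = os.readlink(f"/proc/self/fd/{root_fd}")
        pending, links = _split(unsafe_path), 0
        while pending:
            name = pending.pop(0)
            if name == ".." and os.path.samestat(os.fstat(cur), os.fstat(root_fd)):
                continue  # ".." at the root stays at the root, as in a chroot
            nxt = os.open(name, os.O_PATH | os.O_NOFOLLOW, dir_fd=cur)
            if stat.S_ISLNK(os.fstat(nxt).st_mode):
                os.close(nxt)
                links += 1
                if links > max_links:
                    raise OSError(errno.ELOOP, os.strerror(errno.ELOOP))
                target = os.readlink(name, dir_fd=cur)
                if target.startswith("/"):
                    # Absolute targets are re-rooted at `root`, not the host "/".
                    os.close(cur)
                    cur = os.dup(root_fd)
                pending = _split(target) + pending
                continue
            os.close(cur)
            cur = nxt
        # Ask procfs where we really ended up. A directory renamed during
        # the walk could have let ".." climb out of `root`.
        final = os.readlink(f"/proc/self/fd/{cur}")
        prefix = root_real.rstrip("/") + "/"
        if final != root_real and not final.startswith(prefix):
            raise PermissionError(f"path escaped the root: {final}")
        return cur
    except BaseException:
        os.close(cur)
        raise
    finally:
        os.close(root_fd)
```

The returned descriptor is O_PATH only; to read the file, reopen it through `/proc/self/fd/<fd>` instead of resolving the path a second time.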
The new syscall implementation would still need a fallback impl. for older kernels after all.
Yup, this is why I'm planning on getting a sane API available in <https://github.com/cyphar/filepath-securejoin> which projects can use so that the correct thing is done with both old and new kernels. Right now it has a (slightly) improved version of the code Docker has, but I'm rewriting it.
It should be noted that there are lots of examples of interfaces which are incredibly hard to make safe without openat2 -- such as O_CREAT.
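To illustrate only the easy half of the O_CREAT problem: once you already hold a directory descriptor, O_CREAT|O_EXCL will not follow a symlink planted at the final component. The hard part, getting a trustworthy descriptor for the parent directory in the first place, is not solved here. `create_new_in_dir` is a hypothetical helper, not an existing API.

```python
import os


def create_new_in_dir(dir_fd, name, mode=0o600):
    """Create `name` inside the already-opened directory `dir_fd`.

    Fails with FileExistsError if anything, including a dangling symlink,
    already occupies that name.
    """
    if "/" in name or name in ("", ".", ".."):
        raise ValueError("name must be a single path component")
    # O_EXCL together with O_CREAT makes open(2) refuse symlinks at the
    # final component; O_NOFOLLOW is kept as belt and braces.
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL | os.O_NOFOLLOW
    return os.open(name, flags, mode, dir_fd=dir_fd)
```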
It would probably require a redesign of how docker interacts with the host filesystem though, and obviously relies on something that isn't yet in the mainline Linux.
That said, docker does support `docker exec` which gives you a shell inside the container's namespaces and cgroups. I'd imagine they could do something similar and just not call exec once they've entered the container. This would be similar to calling `docker exec $containerid cat /path/to/file`
You also definitely wouldn't want to be running container runtime code inside all of the namespaces in the container -- this is hard to do safely and should be avoided (there are at least two CVEs related to the joining phase of just runc found in the past few years).
Doing it with cat is just an example, since cat may not be present inside the container. Instead of cat, and in the absence of fork-without-exec in go, you could execute a small C program injected similarly to `/dev/init` (or via memfd_create) that uses sendfd over a unix socket created by socketpair to pass back a file descriptor to the runtime so that the data is only buffered once.
So, you don't want to join all the namespaces, only the mount namespace. But if you join the mount namespace then MS_MOVE might start messing with your resolution. So really, what you want to do is to just pivot_root(2) into the container's rootfs -- this is what "docker cp" already does today and it has a bug because it pivot_root(2)s into an attacker-controlled path. And all of these solutions will require a fork+exec with Go.
> Is that really ludicrously expensive?
If you're doing it once for every single filesystem syscall (every stat(2), readlink(2), open(2), getdents(2), and so on) then yes, it gets very expensive. As I mentioned, the other solution is to run entire operations inside a pivot_root(2) but then your runtime code gets more complicated (Docker already does this and it's already hard enough to understand how things like chrootarchive actually work at the end of the day). Not to mention that (as this bug shows) even if you run inside a pivot_root(2) you might make a mistake that entirely negates the security benefits.
Currently, docker exec (somehow safely) fully enters the container's namespaces and cgroups, then exec's a command inside the container. My suggestion was basically to have a statically compiled C binary that executes in the fully untrusted context, which things can ptrace and manipulate all they want. The thought was that the C binary would open the file descriptor from inside the untrusted context so that it is incapable of doing anything privileged and then send the file descriptor back over the inherited unix socket via sendfd. I'd imagine the only way this could be vulnerable is if sendfd is vulnerable somehow since this means 100% of the path resolution happens from a fully isolated context.
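A toy version of the descriptor hand-off, in Python rather than the statically compiled C helper described above, and assuming Python 3.9+ for `socket.send_fds`/`socket.recv_fds`. `fetch_fd_via_helper` and its `enter_context` hook are invented for this sketch; in the real design the hook would be the step that joins the container, which is not implemented here.

```python
import os
import socket


def fetch_fd_via_helper(path, enter_context=lambda: None):
    """Fork a helper that opens `path` and passes the descriptor back
    over a socketpair using SCM_RIGHTS."""
    parent_sock, child_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    pid = os.fork()
    if pid == 0:  # helper process: would run in the untrusted context
        status = 1
        try:
            parent_sock.close()
            enter_context()  # hypothetical hook, e.g. join the container
            fd = os.open(path, os.O_RDONLY)
            socket.send_fds(child_sock, [b"F"], [fd])
            status = 0
        finally:
            os._exit(status)
    child_sock.close()
    try:
        _msg, fds, _flags, _addr = socket.recv_fds(parent_sock, 1, 1)
    finally:
        parent_sock.close()
        os.waitpid(pid, 0)
    if not fds:
        raise OSError("helper exited without passing a descriptor")
    return fds[0]
```

The receiving side still has to treat whatever descriptor arrives as untrusted input, since a process that can tamper with the helper can make it send a different one.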
The performance argument makes plenty of sense, but it sounds like it'd be solvable by just doing a classic tar pipe where tar (or similar) is running in the fully untrusted context and writing its output to a pipe (with no unix sockets involved). You'd just need to get that statically compiled tar binary into the container, similar to how `/dev/init` is done. Would this be unreasonable? `kubectl cp` is already doing an actual tar pipe via docker exec, the missing bit is that it fails if tar does not exist inside the container, so you'd need to inject it in. This would fully remove the complexity of chrootarchive and any path checking, and you'd be able to rely entirely on the security constraints of docker exec.
% docker exec ctr tar cf - some/path | tar xf -
My point is that you have to be careful if you're designing a system where the security depends on running a process inside the container and trusting its output -- ideally you wouldn't run any process inside the container at all so that an attacking process can have no impact on it. And that's true even if we assume that the "tar" is not a binary that the container has direct control of.
This concern also applies if you aren't using tar and are instead using some other technique. Hell, even if you don't have an IPC system you could still probably attack a process that is just trying to do a single open(2) call -- you just need to make it pass a different fd.
I definitely recognize that the container process could ptrace the tar process, and with kubectl cp, it's even directly using whatever tar binary is in the container so tar could easily be malicious from the start, but what it can never do is break out of the container onto the node when the tar file is not being extracted onto the node using the docker daemon's privileges, which is extremely important for multi-tenant environments.
If you executed your example command as root on the node, then yes, a vulnerability in the node's tar implementation could allow a malicious tar file to take over the node at extraction time, but tar does guard against this by default, as do standard posix user permissions: the tar extraction can happen in a completely unprivileged context.
I do view tar's extraction as a valid attack surface since modern tar implementations are complex, however, that would require a tar CVE and there's no reason that `docker cp`'s output target handling is any less vulnerable to the same problems. I really think the most important thing to guard against is at input time.
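For what it's worth, the extraction-side guard can be sketched with Python's `tarfile` reading from a pipe-like stream. This is an illustration of the idea, not how tar or `docker cp` implement it: `extract_stream` is a made-up helper that only materialises regular files and directories, and rejects any member whose resolved path lands outside the destination.

```python
import os
import shutil
import tarfile


def extract_stream(stream, dest):
    """Extract a tar stream (e.g. a pipe) into `dest`, refusing anything
    that is not a plain file or directory located beneath `dest`."""
    dest = os.path.realpath(dest)
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            target = os.path.realpath(os.path.join(dest, member.name))
            # Rejects absolute names, "..", and pre-existing symlinks
            # in `dest` that point somewhere else.
            if os.path.commonpath([dest, target]) != dest:
                raise ValueError(f"member escapes destination: {member.name}")
            if member.isdir():
                os.makedirs(target, exist_ok=True)
            elif member.isreg():
                os.makedirs(os.path.dirname(target), exist_ok=True)
                with tar.extractfile(member) as src, open(target, "wb") as out:
                    shutil.copyfileobj(src, out)
            else:
                # Links and device nodes are never created in this sketch.
                raise ValueError(f"unsupported member type: {member.name}")
```

Run as an unprivileged user, this would consume the output of the `docker exec ctr tar cf -` pipe from the earlier comment through the producer's stdout.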
I agree that we should use security in depth (and all of those kernel facilities are great), but actually joining the container itself is not a good idea -- you need to treat it as the enemy. I am not in favour of implementing them all in userspace, this is why I'm working on new kernel facilities to restrict path resolution.
I wonder how an O_PATH handle from a different namespace behaves once you switch back. If *at lookups are performed under its original namespace you only have to obtain it once.
We came up with this idea before on LKML and I'm trying to remember what the issue with this solution was. It's definitely better than the current method by Docker (and actually, I might be able to get this to work within Docker more simply). It wouldn't work with rootless containers, but you could fix that somewhat trivially by switching to the userns of the container's pid1. Since /proc/self doesn't exist, obvious attacks through that such as /proc/self/root won't work.
There is a somewhat esoteric problem, which is that you could trick the process into opening an O_PATH stashed by a bad container process -- but right now there is a pretty big flaw in how the kernel handles this problem anyway that I'm also trying to solve upstream.
O_PATH is a Linux Kernel 4.x thing; can't find it in POSIX. Won't compile on non-Linux POSIX operating systems or older kernels. It seems to be an optimization; if we omit it, we just get a "fatter" file descriptor that takes more cycles to set up.
It'll certainly be easier to get right with the newly proposed syscalls. But you can also get it right with the current ones.
> O_PATH is a Linux Kernel 4.x thing; can't find it in POSIX.
I don't think that is relevant to docker which relies on many linux-specific APIs anyway.
That way, even if the symlink somehow gets resolved to a bad path, it will refuse to read it because it does not exist from the point of view of the `docker cp` command. I.e. make the command use a "principle of least authority" where it cannot even see files outside of a certain set of paths, let alone be tricked into opening them by a symlink.
This is why the proposed patch to docker is pausing the container during the copy.
I don't think you understand. If we verify that the absolutized path has sane permissions from top to bottom, then nothing else can operate on it; nothing else that is not either the superuser, or our own user ID. I.e. no untrusted security context.
(If you think that the requirement is literally "nothing else", such as a different thread in the same application in exactly the same security context, then that's a whole different set of goalposts in another soccer field.)
I think a non-privileged user (or the use of user namespaces) would limit the scope of the attack based on permissions (though you have to make sure there are no suid binaries on the host as well... or use no-new-privileges for the container), however the attack still exists.
I don't think so -- Docker does all archive operations from the context of Docker (so, as root). Obviously with rootless Docker this is different, but I highly doubt anyone has started using it yet.
EDIT: Obviously also if Docker has an AppArmor profile applied or restrictive SELinux labels then it will also be limited by that.
We're told not to build our own since that's a waste of time and our version won't be as good.
At the same time we're mocked for not understanding what we're using.
What's the happy medium?
Knight, seeing what the student was doing, spoke sternly: “You cannot fix a machine by just power-cycling it with no understanding of what is going wrong.”
Knight turned the machine off and on.
The machine worked.
At some point you simply have to trust other developers, but don't be afraid to build your own, especially if it's smaller functionality (looking at you left-pad ...).
Dependency management is a skill all developers should learn but it's not one that is actively taught or encouraged. Most (including myself) only really start learning once you've felt the bite of that one bad dependency, or fought through dependency upgrade hell.
In short, make sure you're asking yourself "Do I really need this as a dependency or can I build something myself?"
Try to audit gcc if you’re using C :)
Then there’s the question of auditing the hardware you’re running on ;)
And "The Thirty Million Line Problem": https://www.youtube.com/watch?v=kZRE7HIO3vk
When you build and use a non-trivial piece of software, you also create a large body of non-trivial design and reliability issues, which will take you many iterations and possibly rewrites to get right. Mature software has generally had time to address those issues, and what isn't addressed is better understood by the users and implementors.
When you roll your own, you have to ask: compared to the thing I'm replacing, do I understand the problem domain well enough to anticipate the issues?
Well, with one exception. There is a cross-compilation build environment for an Intel XScale-based industrial computer which is deployed to several hundred remote locations. The previous developer was fond of working alone unchecked and insisted on creating a (pet snowflake) Docker host and a set of containers into which he installed the SDK for said industrial PC. God only knows where he pulled the original Docker base image from. He spent a year doing this unchecked (amongst a few other things) so that he could pad out his CV with the word "Docker". And after that was done, he left the business.
You start the Docker container and it fairly neatly builds the entire environment and creates a filesystem image which can be written to a CF card and installed in the physical industrial PCs, and handed to the field maintenance team.
My point is - the whole build process for this could run on a bare VM which would be under our regular configuration management. The Docker container and host really provide no benefit in this situation (except perhaps quick start up time for a fresh build environment) and are really just a hassle, because we'd rather not mess with Docker. The integration into our Jenkins instance was a complete nightmare - we spent hours poring over the Docker documentation which we found sub-par, or too new (the Docker version he had installed was ancient by this point) and running afoul of various Docker bugs.
We don't have enough other use cases for Docker to make it worthwhile for the rest of the team to learn in depth at this point. It's still on my TODO list to de-dockerize this build process and nuke that Docker host VM forever.
The professional will use and understand tools, it's the enthusiast who builds his own.
In a number of disciplines, you might learn the basics of building a tool that you use, but you elect to use that knowledge to buy a good tool rather than mastering that problem domain.
So what do you do with the one you built? Can you bring yourself to 'build one to throw away'? We struggle with this mightily. Once we've put effort into something and we have a 'thing' to show for it, we have trouble walking away from the thing. It's our form of hoarding.
Problem with that is that the code was full of bad documentation, obscure function names, unexpected behaviors and vast areas of missing functionality. Our people writing the software were less capable than the open source equivalents. With the support of some other people I was stumping for stewardship over authorship.
Basically, we should have been allowed to use smaller libraries of Open Source software as long as one or more employees were intimately familiar with the internals of that library. For the one I wanted to use, I already knew it fairly well so I could have reached that level with a month or two of work. Instead I had to keep adding features and bug fixes to our busted piece of crap, or (more often) reimplement functionality in the caller.
Sure! And this does not mean that once we are standing on them we can start to piss on their head. (e.g., the python programmers who despise C, or the matlab programmers who despise fortran). There is a worrying trend of ignorant programmers who dismiss "old-school" or "legacy" systems without realizing that they rely on them every day.
You can mock perl, for example, but then you buy a brand-new macbook pro and unless you run "file /usr/bin/*|grep -i perl|wc -l" you do not realize how much your computer depends on this language.
Possibly, not building your own, but investing some time researching how others have built already existing tech?
That helps if what's out there doesn't work for you - either if it breaks (and you have to fix someone else's stuff that you use) or lacks feature-wise (and you _have_ to build your own, either by extending or re-designing).
Also forget about sleeping or having any social or family life for the next eight years.
We're told not to build our own since we don't understand what we're using
MalwareBytes is actually reporting that something on that page is trying to make an outbound connection to port 50685 on ip address 188.8.131.52. I don't know what to make of this, but it smells fishy. It's very odd that this program will make a claim if there isn't something going on. I'm not going to risk visiting that page.
I'm not sure on port 50685 but when I load the page there are no connections on port 50685.
Are you connecting over HTTP or HTTPS? If the latter, have you checked who issued the SSL certificate?