
Escaping Docker container using waitid() – CVE-2017-5123 - Da5hes
https://www.twistlock.com/2017/12/27/escaping-docker-container-using-waitid-cve-2017-5123/
======
hacknat
This exploit is interesting, but if you are doing container security correctly
it’s actually not a big deal. In particular, if you are setting per-container
user namespaces, like you ought to be, then this exploit doesn’t do anything.
In fact you can give a user-namespaced container any capabilities (CAPs) you
want, because they are isolated to that container’s uid:gid offset.

Obviously, giving containers unnecessary CAP privileges is unwise, but if you
are following sound security practices then there would be multiple layers of
defense between you and this CVE. I think a strong AppArmor profile and
seccomp profile would also make this CVE moot.

Edit: Also, this exploit relies on you being able to fork up to a certain pid
value. You can and should take advantage of Linux’s per-cgroup process limits
(the pids controller). No container needs more than 255 threads (and if one
does, you can make a special exception for that application).

Edit2: Additionally, this CVE relies on the getuid syscall being available.
There is no reason to give a container this syscall; you should block it, à la
this guide: [https://rhelblog.redhat.com/2016/10/17/secure-your-container...](https://rhelblog.redhat.com/2016/10/17/secure-your-containers-with-this-one-weird-trick/)

I have to say I’m more than a little disappointed in Twistlock for not
pointing out what countermeasures you can employ against this and other CVEs.

~~~
oblio
As somewhat of a container noob, could you expand on "per-container user
namespaces"?

~~~
baq
Follow-up question: why doesn't Docker do that by default?

~~~
hacknat
Because the Docker project doesn’t make money off of security. It is actually
quite infuriating, because they have become the de facto container image
standard. Most of their security has actually come from Twistlock (I am not a
Twistlock employee, FYI). My recommendation to most Admins or Devs that are
serious about container security is to let your developers use docker, but run
your images with CRI-O on your servers: [http://cri-o.io/](http://cri-o.io/)

~~~
cpuguy83
There are trade-offs to using userns and many ppl don't like the current set
of trade-offs. In addition changing a default like this is a breaking change.
Admins can enable userns by default in a daemon, but making it a hard-coded
default is much more difficult.

It's not just a matter of enabling user namespaces. There is no support at the
VFS layer for uid/gid mapping. This means that in order to use it, images must
be chowned with the remapped IDs. Per-container mappings are not supported for
this reason (it would require copying and chowning the entire image for each
container mapping).
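
To make the uid:gid offset concrete: the kernel exposes a user namespace's mapping in /proc/<pid>/uid_map as lines of three numbers (first ID inside the namespace, first ID outside it, and the length of the range). Below is a minimal sketch of that translation, not Docker's or the kernel's actual code. `IdMapEntry` and `to_outside_id` are hypothetical names, and the 100000/65536 range in the note afterwards is only an assumed example value.

```cpp
#include <cstdint>
#include <vector>

// One line of /proc/<pid>/uid_map: "inside_start outside_start length".
struct IdMapEntry {
    std::uint32_t inside_start;   // first ID as seen inside the namespace
    std::uint32_t outside_start;  // first ID as seen from the parent namespace
    std::uint32_t length;         // number of consecutive IDs covered
};

// Hypothetical helper: translate an ID seen inside the namespace into the ID
// the parent namespace sees. Returns false if no entry covers inside_id.
bool to_outside_id(const std::vector<IdMapEntry>& mapping,
                   std::uint32_t inside_id, std::uint32_t& outside_id) {
    for (const IdMapEntry& e : mapping) {
        if (inside_id >= e.inside_start &&
            inside_id - e.inside_start < e.length) {
            outside_id = e.outside_start + (inside_id - e.inside_start);
            return true;
        }
    }
    return false;  // unmapped IDs have no identity outside the namespace
}
```

With an assumed mapping of `{0, 100000, 65536}`, ID 0 inside the container translates to 100000 outside. That is why a root-owned file in an image has to be chowned to the remapped ID before the container's root can own it.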

Do you care to qualify your statement about CRI-O?

~~~
ecnahc515
I recall seeing some patches submitted to make it possible to pass a uid/gid
offset to the mount syscall at one point, when people were implementing user
namespaces for container runtimes like Docker. So is this fixable without
having to make every file system implement this feature, or is there something
else holding back better support for uid shifting with user namespaces?

~~~
cpuguy83
That has not been accepted into the kernel. It's called "shiftfs", which
basically lets you perform the uid/gid shift on mount.

------
cirowrc
In the article the author states:

> CVE-2017-5123 was published earlier this year on Oct 12 — it was a Linux
> kernel vulnerability in the waitid() syscall for 4.12-4.13 kernel versions.

Does this mean that kernel versions prior to 4.12 are not affected? That's
what I understood from the related issue in the bug tracker
[https://bugzilla.redhat.com/show_bug.cgi?id=1500094](https://bugzilla.redhat.com/show_bug.cgi?id=1500094)

By the way, this is very important:

> In 2017 alone, 434 linux kernel exploits where found, and as you have seen
> in this post, kernel exploits can be devastating for containerized
> environments. This is because containers share the same kernel as the host,
> thus trusting the built-in protection mechanisms alone isn’t sufficient.
> Make sure your kernel is always updated on all of your production hosts.

Great article!

~~~
TheDong
That range isn't quite correct. It only impacted 4.13 kernels, and only
4.13.0-4.13.6 (inclusive, distro-dependent due to backports).

It was patched in 4.13.7 after being introduced in the 4.13.0 merge window.

See [https://lwn.net/Articles/736348/](https://lwn.net/Articles/736348/)

This issue shouldn't have happened at all, but it was caught and patched very
quickly, so relatively few real-world systems are or were affected.
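
For a vanilla upstream kernel, the range above (4.13.0 through 4.13.6) can be checked mechanically. This is only a sketch under that assumption: `KernelVersion`, `parse_kernel_version` and `in_upstream_affected_range` are hypothetical names, and a distro kernel's version string says nothing about backported fixes, so a match here is a hint to check the vendor advisory rather than proof of exposure.

```cpp
#include <cstdio>
#include <string>

// Field names follow the kernel Makefile: VERSION.PATCHLEVEL.SUBLEVEL.
struct KernelVersion {
    int version;
    int patchlevel;
    int sublevel;
};

// Parses the leading numbers of a release string such as "4.13.6".
// A missing sublevel ("4.13") is treated as 0. Returns false if unparsable.
bool parse_kernel_version(const std::string& release, KernelVersion& out) {
    out = KernelVersion{0, 0, 0};
    const int fields = std::sscanf(release.c_str(), "%d.%d.%d",
                                   &out.version, &out.patchlevel, &out.sublevel);
    return fields >= 2;
}

// True only for upstream 4.13.0 through 4.13.6, the range given above.
bool in_upstream_affected_range(const KernelVersion& v) {
    return v.version == 4 && v.patchlevel == 13 &&
           v.sublevel >= 0 && v.sublevel <= 6;
}
```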

~~~
Da5hes
it was introduced by this commit:
[https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4c48abe91be03d191d0c20cc755877da2cb35622)

which by itself is in a 4.12 vanilla code tree

~~~
TheDong
That commit is correct (4c48abe91be03d191d0c20cc755877da2cb35622), but it was
not in 4.12 as cut by Linus.

[https://github.com/torvalds/linux/commit/4c48abe91be03d191d0...](https://github.com/torvalds/linux/commit/4c48abe91be03d191d0c20cc755877da2cb35622)
(click the little '...' to expand tags it's in) or:

    $ git tag --contains 4c48abe91be03d191d0c20cc755877da2cb35622
    v4.13

What methodology led you to conclude that it is in the 4.12 tree?

~~~
Da5hes
You are right, I didn't actually check in git. My bad.

------
zenlikethat
Apparently Linux privilege escalation bugs are now "Docker container escapes"?
Thanks to Twistlock for a detailed article but call it what it is, a Linux
vulnerability not specific to Docker.

~~~
Xylakant
Since people rely on Docker containers as an isolation layer between
potentially unfriendly services, a Linux bug that allows breaking that
isolation barrier is a relevant thing. It’s worth calling it a “Docker
container escape”.

~~~
zenlikethat
People should not rely on that to any degree more than they would rely on
colocated processes on a VM being isolated. The easiest way to be safe is to
assume that all containers are already broken out of - what would you do then?
Make sure processes are running as non-root, use various protection layers
(pick your poison - SELinux, grsecurity, etc.), take away capabilities, and
don't run workloads you don't trust.

~~~
Xylakant
I’m not saying they should, but they are. And given the fact that containers
are often marketed as “lightweight VMs”, I’m not very surprised by that.

~~~
akvadrako
In this case the analogy is apt: VM isolation is also not very secure, though
exploits such as Rowhammer are usually more heavyweight.

~~~
Xylakant
sure, any Xen guest escape receives equal amounts of press for exactly that
reason: It's an isolation barrier breaking down. However, trivial exploits
breaking VM isolation have been relatively rare lately.

------
cbisnett
Just to clarify the terminology here:

- A vulnerability is a software bug that has particular behaviors and
ramifications that allow it to be used maliciously.

- An exploit is a crafted piece of input data that is designed to trigger a
vulnerability to execute arbitrary code, crash the target (Denial-of-Service),
etc.

> In 2017 alone, 434 linux kernel exploits where found, and as you have seen
> in this post, kernel exploits can be devastating for containerized
> environments.

There are a few places in the article like this one where the correct
terminology is vulnerability not exploit. cvedetails.com aggregates
vulnerabilities. Places like exploit-db.com aggregate exploits people have
written to take advantage of vulnerabilities to enable them to perform some
unintended action against the target.

Edit: formatting

------
alpb
Any ideas why this is branded as "Docker"? Are the same namespacing constructs
not being used by other Linux container runtimes? I think this should be
titled "Escaping Linux containers" as docker is not at fault here?

~~~
jo909
What do you mean by "branded"?

The author shows a concrete exploit of the kernel bug described in
CVE-2017-5123 as he has developed it in the context of the docker container
environment.

He shows how to use this bug to break out of docker, so he calls the blog post
"Escaping Docker ...".

Which is IMHO the most interesting container runtime to write such an exploit
for first because it is very widely deployed, but it might also just have been
what the author is most familiar with or what was easiest to develop for him.

~~~
chowyuncat
Think of it this way: what if the author had titled it "Escaping Ubuntu
containers" ?

~~~
dchest
Why? The article demonstrates exploitation of Docker containers.

~~~
oblio
Yeah, but is it limited to Docker containers? Can other container types be
attacked in the same way?

~~~
jo909
It is a reasonable _assumption_ that other container runtimes on linux might
be affected by the same kernel bug. The article does not explore that and the
author has no duty to do so just to avoid using a branded technology name.

How would you reasonably talk about "Linux containers" without having a very
exhaustive list of all existing implementations and testing all of them? If
one of them is not affected you are now factually wrong.

~~~
chowyuncat
The exploit overwrites kernel memory credentials of a task structure. That
structure is the lynchpin of kernel security, including SELinux.

------
mattmcknight
If anything, this points out that the use case of Docker for security
isolation, such as in a multi-tenant architecture, is probably still not a
good one.

In most use cases I see containers used for rapid and consistent deployment.
The isolation benefit with multiple containers on a host is that if you
install things with different library dependencies you don't run into
conflicts. As such, the comparison for the common use case is just software
installed directly on the host, which also is subject to this vuln.

------
snvzz
>In 2017 alone, 434 linux kernel exploits where found, and as you have seen in
this post, kernel exploits can be devastating for containerized environments.
This is because containers share the same kernel as the host, thus trusting
the built-in protection mechanisms alone isn’t sufficient.

More than one kernel exploit _per day_. Exploiting Linux is just a matter of
finding one such vulnerability and using it. This can be done in a single day.

There's just no fixing megabytes of buggy kernel code.

It really drives home the need for a proper OS based on a verified,
capability-enabled microkernel such as seL4.

~~~
mehrdadn
I'll surely get a lot of flak for this, but these kinds of bugs would be
_trivial_ to avoid in C++. All you need is to make the pointer arguments to
syscalls be some other data type (say, user_ptr<T>) that performs an access-
check upon conversion to a raw pointer. Then the compiler simply wouldn't
_let_ you bypass the access-check, so you simply _could not_ forget to do so.
That's the fundamental difference between C++ and C: one of them actually lets
you write code that _cannot_ contain many classes of mistakes, and the other,
well, doesn't. For the life of me I don't understand the stubbornness behind
sticking to the same languages and tools from decades ago.
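
A user-space sketch of that idea, assuming nothing about real kernel interfaces: `user_ptr<T>`, `fake_access_ok`, `fake_waitid`, `fake_siginfo` and the simulated `g_user_memory` region are all made up for illustration. The wrapper exposes no `operator*` and no implicit conversion, so the only route to a raw pointer goes through the check.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Simulated "user space": only addresses inside this buffer pass the check.
alignas(std::max_align_t) static unsigned char g_user_memory[4096];

// Stand-in for the kernel's access_ok(); the real check is architecture-specific.
bool fake_access_ok(const void* p, std::size_t size) {
    const std::uintptr_t begin = reinterpret_cast<std::uintptr_t>(g_user_memory);
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    return size <= sizeof(g_user_memory) && addr >= begin &&
           addr - begin <= sizeof(g_user_memory) - size;
}

template <typename T>
class user_ptr {
public:
    explicit user_ptr(T* untrusted) : raw_(untrusted) {}

    // The only way to get a usable pointer: nullptr unless the whole object
    // lies inside the simulated user range.
    T* checked() const {
        return fake_access_ok(raw_, sizeof(T)) ? raw_ : nullptr;
    }

private:
    T* raw_;  // never exposed without going through checked()
};

struct fake_siginfo {
    int signo;
    int code;
};

// Toy "syscall": returns 0 on success, -EFAULT for a pointer outside user memory.
int fake_waitid(user_ptr<fake_siginfo> infop) {
    fake_siginfo* dst = infop.checked();
    if (dst == nullptr)
        return -EFAULT;
    const fake_siginfo info = {17, 1};  // arbitrary example values
    std::memcpy(dst, &info, sizeof info);
    return 0;
}
```

Writing `infop->signo = 17;` inside `fake_waitid` would not compile, which is the property the comment is after: forgetting the check becomes a build error rather than a silent write.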

~~~
0xcde4c3db
According to Linus, programmers who prefer C++ are so bad that he would have
chosen C solely to avoid dealing with their "total and utter crap" code, and
C++ is only good for kernel development if you limit yourself to a C-like
subset anyway [1].

[1] [https://lwn.net/Articles/249460/](https://lwn.net/Articles/249460/)

~~~
user5994461
There are C++ mechanisms that do not work well in the kernel context. Like any
implicit memory allocation or exceptions.

Linus is not entirely crazy. The Windows kernel SDK only supported C++ in the
last decade and it has a lot of limitations.

------
DyslexicAtheist
Linux kernels in production (since we all now like to run docker there :))
without grsec/seccomp have always been pretty dangerous. What I dislike about
docker is their feature creep and lack of proactively steering their users to
accepting _more secure defaults_. The mindset towards security in the Linux
kernel community remains shockingly stubborn compared to the shift to "better
security", which is taking over the rest of the industry.

~~~
DyslexicAtheist
Actually, the most affected by this CVE would be medium-sized companies that
don't invest enough in internal development to pump out services fast with
secure defaults (maybe startups rushing to their MVP too). Companies running a
totally automated farm with Kubernetes or Docker Swarm usually don't have
containers with long uptimes.

------
saagarjha
In case the article author is here: The code snippets given aren't escaped, so
&, <, > show up as HTML entities instead.

------
mehrdadn
> The vulnerability is that the highlighted access_ok() check was missing in
> the waitid() syscall.

Why in the world does this class of vulnerabilities still exist in 2017? Why
are kernel maintainers not writing some kind of C linter that makes sure every
single pointer argument to every syscall is passed to a well-known function
like access_ok (Linux) or ProbeForRead (Windows)? Literally all you need is a
syntactic check; you don't even need to do any kind of semantic analysis...
since all you want is to flag the code so someone can inspect each spot
manually. Why is this not done?!

~~~
quotemstr
C++ would also make it harder to get it wrong. Its type system is powerful
enough to enforce rules like "you must call access_ok before writing through a
pointer": you just have access_ok transform an inaccessible pointer token of
some sort, passed in as a syscall parameter, into a different kind of object
through which you can write into memory.

The generated machine code would be identical to what's in the kernel today,
but it'd be both safer and cleaner. C++ still has to get over the bad gang-of-
four-1990s-era-object-goo reputation it has among systems people.

~~~
mehrdadn
> C++ would also make it harder to get it wrong.

Funny you mention this...
[https://news.ycombinator.com/item?id=16032324](https://news.ycombinator.com/item?id=16032324)

~~~
quotemstr
It's a thought a lot of people have, I bet. :-)

------
AgentME
Does this escape only work if they have root inside of the container? I
usually try to make it so my containers always contain a non-root process as
an extra layer of security.

~~~
hacknat
No, it doesn’t matter. If they have waitid and getuid then they are off to the
races.

~~~
zenlikethat
Running your containers as non-root is still great though! Shocking how
uncommon it is.

------
ttul
OpenBSD has randomized pids since the dawn of time. Why has Linux not taken
this basic step to improve security?

~~~
hacknat
Randomized pids wouldn’t necessarily help that much in this situation,
especially if the getuid syscall is available. However, I agree with your
general sentiment that there are basic security features that Linux could
implement to make a lot of CVEs impotent. I think the community is coming
round, but this stuff takes more work than most people may realize.

~~~
ttul
OpenBSD randomizes everything they possibly can. This is a Good Thing and so
cheap..

------
quotemstr
Related: [https://lwn.net/Articles/736348/](https://lwn.net/Articles/736348/)

------
upofadown
It has been general knowledge that escaping from a container is trivial since
forever. The article is merely an example.

Is there actually a counterpoint for this? Who is saying that containerization
can be used for isolation?

------
eeZi
This is precisely why we need projects like Grsecurity.

------
crb002
Linus really needs to start having more formal verification around patches.

------
Matt3o12_
This might be a bit off topic but I wonder why the vulnerability has been
patched this way:

    if (!access_ok(VERIFY_WRITE, infop, sizeof(*infop)))
        goto Efault;

Why doesn’t the if use curly brackets? I thought it has been established that
it is best practice to always use curly brackets even when they are optional,
especially after Apple's infamous goto bug[1].

Secondly, why does it use goto at all? I thought it has also been established
not to use goto unless it is the only performant solution (and performance is
important in that case). Sure, Efault will probably kill the program, but
wouldn’t it still be better to use a function call, considering that the
desired resolution should be the same?

[1]:
[https://www.imperialviolet.org/2014/02/22/applebug.html](https://www.imperialviolet.org/2014/02/22/applebug.html)

~~~
umanwizard
> I thought it has also been established not to use goto unless [...]

“Established” by whom? Certainly not by kernel developers — `goto` is very
common in all kernels I have looked at (xnu, Linux, bsd)

~~~
drchickensalad
And it has quite a consensus as the best solution to this problem. Goto being
considered harmful is a generally true statement. However, this usage is a
more specific exception to the rule, with objective benefits vs alternatives.
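
For readers unfamiliar with the idiom under discussion: the kernel coding style documents both brace-less single-statement branches and `goto`-based centralized exits. Here is a small user-space sketch of the pattern, written in the C-like subset of C++ to mirror that style. `sum_of_squares` and its scratch buffers are hypothetical, and idiomatic C++ would normally use destructors for this cleanup instead.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Sums the squares of n inputs using two scratch buffers.
// Returns 0 on success, -1 on failure. The labels unwind in reverse order of
// acquisition, so each exit path frees exactly what has been allocated so far.
int sum_of_squares(const int* in, std::size_t n, long* result) {
    int ret = -1;
    int* copy = nullptr;
    long* squares = nullptr;
    long total = 0;

    if (in == nullptr || result == nullptr || n == 0)
        goto out;  // nothing acquired yet

    copy = static_cast<int*>(std::malloc(n * sizeof *copy));
    if (copy == nullptr)
        goto out;

    squares = static_cast<long*>(std::malloc(n * sizeof *squares));
    if (squares == nullptr)
        goto out_free_copy;  // only `copy` needs releasing

    std::memcpy(copy, in, n * sizeof *copy);
    for (std::size_t i = 0; i < n; ++i) {
        squares[i] = static_cast<long>(copy[i]) * copy[i];
        total += squares[i];
    }
    *result = total;
    ret = 0;

    std::free(squares);  // success falls through into the shared unwind path
out_free_copy:
    std::free(copy);
out:
    return ret;
}
```

The alternative without `goto` is to repeat the `free` calls in every error branch or to nest each step one level deeper, which is where a forgotten release tends to creep in.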

~~~
user5994461
"goto considered harmful" is an old meme from 50 years ago, when control
blocks like _if_ and _for_ were invented.

