
Introducing rkt’s ability to detect privilege escalation attacks on containers - manacit
https://coreos.com/blog/rkt-detect-privilege-escalation.html
======
catern
I don't have the inclination to write a long comment about this. But reading
sections like this:

>This state can then be verified whenever a process performs an action
requiring a permissions check. For example, when a process requests that a
file be opened, the kernel now calls out to the hypervisor. The hypervisor is
then able to examine the process state and ensure that it remains consistent
with its internal representation of process state.

Makes me very uneasy. You're making the hypervisor more complicated,
duplicating the logic in the kernel, because you feel the kernel is insecure.
The only argument for doing this, AFAIK, is that the hypervisor is still
simpler overall, and therefore more secure. But when you start introducing
these kinds of complications, that argument becomes kind of farcical... Adding
another layer on top of the kernel and duplicating your permissions checks in
there, I do not think that will end well.

~~~
mjg59
You can take a look at the patch - there's very little additional complexity
in either the kernel or the hypervisor. The kernel permission checks remain
unmodified, with the hypervisor checks as a backup.

The argument for this isn't that the hypervisor is simpler, it's that the
hypervisor is running at a separate privilege level to the rest of the stack.
The hardware enforces an additional level of separation. This just adds
validation on the other side of that boundary.

(Disclaimer: author of the code)

~~~
xkxx
Hey, author, thanks for commenting. AFAIR, you also are the guy who criticizes
Ubuntu's policy regarding derivative works. Anyway, I have a question for you.

If I understand rkt's architecture correctly, the hypervisor runs multiple
Linux kernel instances, every kernel runs several processes. If somebody
exploits one of those processes and creates a root-owned process, then rkt
detects such an attempt and restarts the kernel. But what if the exploit
doesn't create any root-owned processes? You can get rights equal to those of
the superuser without getting uid=0.

~~~
mjg59
It also detects any bump in capabilities. Getting superuser level abilities
without modifying either your uid or your capabilities would require
modification of kernel code.
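
A minimal sketch of what "detecting a bump" could mean, assuming capabilities are modelled as a plain bitmask. This is not the actual patch: `Creds` and `is_escalation` are hypothetical names used only for illustration, and a real implementation would also need to update the recorded state on legitimate credential changes.

```python
from dataclasses import dataclass

# Bit positions follow Linux's capability numbering (linux/capability.h).
CAP_DAC_OVERRIDE = 1 << 1
CAP_NET_ADMIN = 1 << 12
CAP_SYS_ADMIN = 1 << 21


@dataclass(frozen=True)
class Creds:
    """Hypothetical snapshot of the credentials tracked for one process."""
    uid: int
    caps: int  # bitmask of effective capabilities


def is_escalation(recorded: Creds, current: Creds) -> bool:
    """Return True if `current` holds privileges that `recorded` did not."""
    became_root = current.uid == 0 and recorded.uid != 0
    gained_caps = current.caps & ~recorded.caps  # bits set now but not before
    return became_root or gained_caps != 0
```

Dropping capabilities is not flagged here; only a change that adds a bit or moves a non-root process to uid 0 counts.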

~~~
xkxx
> would require modification of kernel code

And it's impossible for an exploit to do because?..

EDIT: "?.." in this case means "please continue my sentence".

~~~
mjg59
Kernel code is mapped read only

~~~
xkxx
Thanks for taking time to answer my questions.

> Kernel code is mapped read only

What part of the system makes sure that read-only kernel code stays read-only?
Is it hardware or the hypervisor?

Another question. Is there any chance that you missed something and there's a
way to get privileged access without actually modifying your uid, bumping
your capabilities or modifying kernel code?

~~~
mjg59
The hardware - it's an attribute on the page table. You could certainly
monitor that from the hypervisor as well, although there'd be some performance
overhead. You can avoid all of these protections if you can inject new code
into an existing privileged process or SUID executable, which is how DirtyCOW
worked. The blog points out that this class of vulnerability isn't covered.
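
Monitoring that attribute from the hypervisor could look roughly like the sketch below. It is a toy model and not taken from the patch: page table entries are plain integers using the x86 flag layout (bit 0 is present, bit 1 is writable), and `writable_text_pages` plus the dictionary of entries are hypothetical.

```python
# x86 page table entry flag bits.
PTE_PRESENT = 1 << 0
PTE_WRITABLE = 1 << 1


def writable_text_pages(kernel_text_ptes: dict[int, int]) -> list[int]:
    """Return addresses of present kernel-text pages whose writable bit is set.

    `kernel_text_ptes` maps a page's virtual address to its raw entry; how a
    hypervisor would actually collect these entries is left out of the sketch.
    """
    return sorted(
        addr
        for addr, pte in kernel_text_ptes.items()
        if pte & PTE_PRESENT and pte & PTE_WRITABLE
    )
```

An empty result means every present kernel-text page is still read only. The performance overhead mentioned above would come from how often such a scan runs.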

~~~
xkxx
> The hardware - it's an attribute on the page table.

And the only way to make the page writable again is to clear the page first.
Am I correct?

> You can avoid all of these protections if you can inject new code into an
> existing privileged process or SUID executable, which is how DirtyCOW
> worked.

It can happen if either one of these is true: 1) the existing privileged
process or SUID executable is vulnerable; 2) the kernel is vulnerable.
DirtyCOW is interesting because it's case #2. Case #1 is not that
interesting, because targeting vulnerable privileged executables is a known
attack vector.

------
geertj
Sounds very similar to VMware's VMsafe [1], which apparently they EOL'd [2].

[1] [http://www.gabesvirtualworld.com/vmsafe-what-is-it-
exactly/](http://www.gabesvirtualworld.com/vmsafe-what-is-it-exactly/)

[2] [https://kb.vmware.com/kb/2058911](https://kb.vmware.com/kb/2058911)

------
buckhx
Who is using rkt in production? I'm working on a system from scratch and
planning on using k8s and just assumed docker would be the container engine of
choice, but the more I hear/read about rkt the more it seems like the better
choice.

~~~
rwvhp
BlaBlaCar have been using rkt since some really early version. Simon Lallemand
gave a talk in Berlin at the CoreOS meetup about it. I don't recall it being
recorded, though there is a blog entry covering some of the details[1].
Interesting how they built a lot of tooling for the container building and
management before other ways appeared, and the approach they took to it.

[1] [http://blablatech.com/blog/why-and-how-blablacar-went-
full-c...](http://blablatech.com/blog/why-and-how-blablacar-went-full-
containers)

~~~
robszumski
Video of a talk from BlaBlaCar in Paris on rkt:
[https://www.youtube.com/watch?v=dW7U2PZ16ek](https://www.youtube.com/watch?v=dW7U2PZ16ek)

And a blog post on their site: [http://blablatech.com/blog/why-and-how-
blablacar-went-full-c...](http://blablatech.com/blog/why-and-how-blablacar-
went-full-containers)

------
geofft
How close are we to a world without SUID binaries? It seems like you could
just replace sudo with RPC to some privileged daemon (which could be as simple
as the existing sshd and "ssh root@localhost", but another daemon would be
fine), and most of the other use cases (mount, pt-chown, dbus-daemon-launch-
helper, etc.) have been slowly disappearing.

That would avoid the weird special case for SUID binaries, which looks like
the sort of thing that someone is going to discover an attack on in a year or
two. (And it's a special case in lots of systems, not just this one.)

~~~
catern
>most of the other use cases (mount, pt-chown, dbus-daemon-launch-helper,
etc.)

Those listed examples could also be achieved with IPC to a privileged
daemon. But unfortunately not all setuid executables can be replaced that way.

The major design flaw with setuid executables is that they run in an execution
environment (view of the root filesystem, Linux namespace, etc.) provided by
the caller. A good explanation of this is [http://maxsi.org/blog/setuid-bit-
considered-harmful/](http://maxsi.org/blog/setuid-bit-considered-harmful/)

Unfortunately this is why not all setuid executables can be replaced with IPC.
setuid binaries like "nsenter" allow you to run processes with a selective
modification to one part of your execution environment, while still inheriting
everything else unmodified. You can't achieve this with IPC to a privileged
daemon because that daemon cannot inherit your current execution environment
when starting new processes.
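
A small illustration of "environment provided by the caller", with no setuid binary involved. It only shows that whatever a process execs starts with the working directory and environment variables the caller chose. The `child_view` helper and the `MARKER` variable are made up for this sketch.

```python
import os
import subprocess
import sys


def child_view(cwd: str, marker: str) -> tuple[str, str]:
    """Run a child process and return the (cwd, MARKER) values it observes."""
    env = dict(os.environ, MARKER=marker)  # caller-controlled environment
    code = "import os; print(os.getcwd()); print(os.environ['MARKER'])"
    out = subprocess.run(
        [sys.executable, "-c", code],
        cwd=cwd,  # caller-controlled working directory
        env=env,
        capture_output=True,
        text=True,
        check=True,
    ).stdout.splitlines()
    return out[0], out[1]
```

A setuid program starts from the same kind of caller-chosen state, which is the design problem described above. A daemon reached over IPC would instead see its own working directory and environment.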

Nevertheless, getting rid of setuid is very important. Kind of baffling to me
that it hasn't been done yet. The closest we've come is replacing setuid with
filesystem capabilities, but those have the same fundamental design problem.
I've been doing some work on genuinely getting rid of setuid myself. Happy to
collaborate with anyone else who wants to work on this.

~~~
derefr
> You can't achieve this with IPC to a privileged daemon because that daemon
> cannot inherit your current execution environment when starting new
> processes.

What if it could? An "execution environment" could be turned into some sort of
kernel token object that could be passed to another process over IPC. A
privileged daemon could receive one, fork(2) once, drop some privileges, and
then fork(2) again, this time with the execution-environment parameter—and end
up with a copy of itself running _under_ the IPC-sender, with environment
inherited from the sender but privileges inherited from the receiver (and,
presumably, an address space that contains all the mappings of both parents,
with the IP pointing into the receiver's code section, like a normal fork(2)
would.)

~~~
catern
Definitely, that is one possible design. Although that would be fairly tricky
to implement in the kernel.

One way you could do this is to heavily abuse the CRIU project.
[https://criu.org/Main_Page](https://criu.org/Main_Page) Just do the
following:

1\. fork

2\. send a request to a privileged daemon

3\. privileged daemon checkpoints your child

4\. privileged daemon modifies the saved state of your child in whatever way
you requested

5\. privileged daemon resurrects your child (which is possible because it is
privileged)

I am not really serious about this method but it is certainly a thing you
could do. :)
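
For what it's worth, steps 3 to 5 could be laid out as a plan like the one below. It only builds the command lines and does not run them. `criu dump -t` and `criu restore -D` are real CRIU invocations, but `checkpoint_plan` is a hypothetical helper, and step 4 is left as a placeholder because how the saved images would be edited is not specified here.

```python
from pathlib import Path
from typing import Optional


def checkpoint_plan(child_pid: int, images_dir: Path) -> list[tuple[str, Optional[list[str]]]]:
    """Return (step name, command) pairs for the daemon side of the scheme."""
    images = str(images_dir)
    return [
        # Step 3: checkpoint the child's process tree into images_dir.
        ("checkpoint", ["criu", "dump", "-t", str(child_pid), "-D", images]),
        # Step 4: modify the saved state as requested (no command given here).
        ("modify", None),
        # Step 5: restore the child from the (modified) images.
        ("restore", ["criu", "restore", "-D", images]),
    ]
```

Steps 1 and 2 (the fork and the request to the daemon) happen on the caller's side and are not modelled.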

------
benmmurphy
i think this is useful in that against unsophisticated attackers and people
who don't know it is there the protection will trigger. however, it sounds
like it only is going to be triggering when a process forks. so if you have
some exploit that raises your creds then just don't fork and you don't trigger
the detection. you can do everything evil in the system you want to do without
forking. and if you want the convenience of a shell you can just chmod suid a
binary and exec that. anyway, i think this general idea is interesting
because you can do a lot with it and if you keep what you are doing a secret
(because you are running a public cloud platform for example) then you can
have quite effective protection against attackers that don't have insider
information. but if you are running a public cloud probably one of your
biggest threats are insiders. so meh.

~~~
mjg59
It validates on every syscall that has a permission check. If you modify your
process status without forking and then make a syscall, it'll still trigger.
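
A toy model of that behaviour, assuming only the uid is tracked. `Monitor` stands in for the hypervisor and `ToyKernel` for the guest kernel; both classes and the `/root/` rule are invented for illustration and are not how the real code is structured.

```python
class EscalationDetected(Exception):
    """Raised when a process's uid no longer matches the recorded copy."""


class Monitor:
    """Stands in for the hypervisor: keeps its own copy of each process's uid."""

    def __init__(self):
        self._recorded = {}

    def record(self, pid, uid):
        self._recorded[pid] = uid

    def validate(self, pid, uid):
        if self._recorded[pid] != uid:
            raise EscalationDetected(f"pid {pid}: uid {self._recorded[pid]} -> {uid}")


class ToyKernel:
    def __init__(self, monitor):
        self.monitor = monitor
        self.uids = {}

    def spawn(self, pid, uid):
        self.uids[pid] = uid
        self.monitor.record(pid, uid)

    def sys_open(self, pid, path):
        # Permission-checked syscall: ask the monitor first, then apply the
        # kernel's own (toy) rule that only uid 0 may open paths under /root/.
        self.monitor.validate(pid, self.uids[pid])
        return self.uids[pid] == 0 or not path.startswith("/root/")
```

Overwriting `kernel.uids[pid]` directly, which models an exploit that never forks, makes the next `sys_open` call raise `EscalationDetected`.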

(disclaimer: author of this feature)

~~~
benmmurphy
Sorry. I completely misread the original post. That seems a lot more sound :)

------
lima
Glad to see that they acknowledge GRSecurity.

~~~
technofiend
GRSecurity has some interesting ideas, and if systemd is any indication some
Linux vendors will happily work with iconoclasts. However, much as Theo de
Raadt rubbing people the wrong way resulted in OpenBSD, GRSecurity's alleged
refusal to submit patches in a manner Linus will accept has resulted in their
rejection from the kernel mainline. Considering that for some applications
Linux needs all the hardening it can get, this is truly unfortunate.

~~~
SEJeff
You've got it a bit wrong. Brad Spengler (aka spender) and much of the
grsecurity team's goal is making Linux more secure by working on grsecurity.
They're not interested in totally rewriting large portions of Linux technical
debt to make Linux more secure, but upstream demands they do it. Is spender
often a bit of a prick, yes, but is he virtually 100% right on security
things, also yes. It wasn't really a rejection from kernel mainline so much as
they don't have any interest in submitting it. You can't have rejected what
you never submitted in the first place.

Fast forward to now and you have kernel heavyweights like Kees Cook (former
Canonical kernel/security lead now chromeos security badass extraordinaire)
working on bringing these features into mainline via the aforementioned KSP
projects. In the end, Linux is getting more secure, and it is in large part
due to the hard work of the entire GRSec team, even if they're a bit abrasive to
work with.

~~~
technofiend
I was specifically referring to this post by Linus himself
[http://article.gmane.org/gmane.linux.kernel/774824](http://article.gmane.org/gmane.linux.kernel/774824)
in which he states

>The apparent inability (and perhaps more importantly - total unwillingless)
from the PaX team to be able to see what makes sense in a long-term general
kernel and what does not, and split things up and try to push the sensible
things up (and know which things are too ugly or too specialized to make
sense), caused many PaX features to never be merged.

Although there are other posts since debating the same point such as this one
[https://lkml.org/lkml/2005/1/25/141](https://lkml.org/lkml/2005/1/25/141)

