
Rootless containers feature merged into runC - marcosnils
https://github.com/opencontainers/runc/pull/774
======
cyphar
One thing that really excites me about getting this into runC is that now we
can work on making other parts of container orchestration and management run
as an unprivileged user.

Huge props to the Cloud Foundry team, who have already taken rootless
containers and have some experimental support for them[1]. It'd be awesome if
we could do something similar for Kubernetes so that you could start clusters
as an unprivileged user (in my mind the networking is the hardest part and I
think the only way right now is to implement pseudo-bridge interfaces in
userspace). But I'm pretty excited about the possibilities. :P

[1]: [https://github.com/cloudfoundry/garden-runc-release/blob/develop/docs/rootless-containers.md](https://github.com/cloudfoundry/garden-runc-release/blob/develop/docs/rootless-containers.md)

~~~
jacques_chester
I'm sure if you ask nicely, Pivotal, IBM, SAP or Google would host a merge
party for you. All are present in Sydney.

Disclosure: I work for Pivotal.

~~~
cyphar
While it'd be really cool to go to said offices for a merge party, I feel like
that might be a bit too self-indulgent. Though meeting some of the Pivotal
team in Sydney might be fun. :P

~~~
contingencies
Hey, another Sydneysider here (currently based in China), did some early
implementations with LXC, eg.
[https://github.com/globalcitizen/lxc-gentoo](https://github.com/globalcitizen/lxc-gentoo)
(in US) and talked with the IBM guys at that time.

Nitpicks: About ~9min in your talk on a user namespace slide you talk about
device creation restrictions implying it is linked to user namespaces but I am
pretty sure that is the device cgroup's job and it's possible to bugger that
up if you're not careful. Similarly at 19:50 or so I believe the statements
about things not working only apply to some container systems, eg. IIRC
_mknod_ can be allowed and controlled on a per-_major:minor_ basis via the
device cgroup.

Another security area I looked at was docker grsec incompatibilities, eg.
[https://github.com/docker/docker/issues/20303](https://github.com/docker/docker/issues/20303)
though didn't get closure despite suggesting what IMHO seems a decent fix.
Meh.

Re: ~32:50 + crystal balling: My personal impression is that the whole
container based infrastructure / development trend will continue to snowball
into larger devops solutions that automatically secure systems in many ways
(via cgroup configurations, kernel security toolkit policies such as syscall
and device whitelisting, readonly path lockdowns, network traffic and
bandwidth restrictions, etc.) over the next few years. Existing tools like
_atime_, _fsnotify_ and network traffic dumps should make this pretty easy.
IMHO it should be a natural progression for obtaining low hanging security
enhancements currently unused due to configuration complexities. It will be
basically enabled through decent workflow (ie. CI/CD + testing), and may
result in a "safe CFLAGS"-like set of aggressiveness presets for new services.
Thought this out and implemented it to some extent while at Kraken
(~2011-2015). (Now working on a more hardware/mech eng. related business and
am no longer spending time in the area. Down for a drink next trip though,
email in profile!)

~~~
cyphar
> About ~9min in your talk on a user namespace slide you talk about device
> creation restrictions implying it is linked to user namespaces [...]

No, mknod() is also gated by a capable(CAP_MKNOD) check if the node is a
character or block device (see vfs_mknod). Which you can't have if you've
created an unprivileged namespace. Devices cgroup is an additional restriction
but you hit capable(CAP_MKNOD) long before that check.
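
A toy model of that ordering, written in Python purely for illustration: it is
not kernel code, `model_mknod` and its parameters are made-up names, and it
leaves out the other checks the real path performs (directory permissions, LSM
hooks, and so on). It only shows that a device node request fails on the
capability check before the devices cgroup is ever consulted.

```python
import errno
import stat


def model_mknod(mode, has_cap_mknod, devcgroup_allows):
    """Toy model of the check order described above.

    Returns (result, stage): result is 0 or a negative errno, stage names
    the check that decided the outcome. Both arguments after `mode` are
    hypothetical booleans standing in for kernel state.
    """
    is_device = stat.S_ISCHR(mode) or stat.S_ISBLK(mode)

    # 1. Capability check: only applies to character and block devices.
    if is_device and not has_cap_mknod:
        return (-errno.EPERM, "capable(CAP_MKNOD)")

    # 2. Devices cgroup check: only reached if the capability check passed.
    if is_device and not devcgroup_allows:
        return (-errno.EPERM, "devcgroup_inode_mknod")

    return (0, "ok")
```

With `has_cap_mknod=False`, a request for `stat.S_IFCHR | 0o600` is rejected at
the first stage regardless of what the devices cgroup would have allowed, while
a FIFO (`stat.S_IFIFO | 0o600`) passes because neither check applies to it.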

> Similarly at 19:50 or so I believe the statements about things not working
> only apply to some container systems

The entire talk is about rootless containers, so I'm specifically talking
about the case where you have only mapped a single user and don't have the
ability to modify cgroups. I'm well aware how to grant access to devices if
you have root -- the whole point of rootless containers is that you don't have
root. ;)

> Re: ~32:50 + crystal balling: [...]

Yeah, I wrote that slide about 10 minutes before my talk. My point was that
I'm really hoping one day we see containers (or specifically the sandboxing
capabilities of containers) be integrated into normal applications and
extended to the point where every user uses this stuff. And as you said, the
main benefit is that you can then apply a bunch of useful security profiles
that knock out the low-hanging fruit.

~~~
contingencies
_No, mknod() is also gated by a capable(CAP_MKNOD) check if the node is a
character or block device (see vfs_mknod). Which you can't have if you've
created an unprivileged namespace. Devices cgroup is an additional restriction
but you hit capable(CAP_MKNOD) long before that check._

_Namespaces_ (of which there are many, so 'unprivileged' must be qualified)
!= _cgroup controllers_ (of which there are similarly many) != _capabilities_
(of which there are also many).

IIRC these are _separate and no use of one implies the other_. Therefore I do
not understand your comment and believe it may be in error, as it seemed both
in your talk and in this reply that you are suggesting a link between them now
exists at the kernel level. I strongly suspect that your impression comes from
container userland runtime system specific code (in which case my "for some of
these comments it depends which container runtime you are using" comment
stands), and I strongly doubt it is kernel code, but am always more than
willing to be re-educated.

~~~
cyphar
> Namespaces (of which there are many, so 'unprivileged' must be qualified) !=
> cgroup controllers (of which there are similarly many) != capabilities (of
> which there are also many).

You're correct that namespaces != cgroups. But capabilities are actually part
of namespaces (well, specifically the user namespace but a lot of permission
checks in the kernel depend on other namespaces and having capabilities in
pinned namespaces). So while you could say that capabilities != namespaces
it's not really accurate since they are quite inter-related. Not to mention
that every process is in a set of cgroups and namespaces and has a set of
capability sets.

As for the specific example of mknod, here's a link to the relevant kernel
source[1]. You're not correct in your assumption.

cgroups and namespaces are separate when you are actually administering them,
and they are wholly separate subsystems. However, at the end of the day any
security policy will have to be applied to whatever syscalls they apply to.
capable(CAP_MKNOD) is a namespace check, and devcgroup_inode_mknod is a cgroup
check.

> I strongly suspect that your impression comes from container userland
> runtime system specific code

I read the kernel code for namespaces and cgroups all the time when debugging
issues. I'm well aware how containers work from both the userspace abstraction
side as well as the kernel side (and have written kernel code for containers
too).

> and I strongly doubt it is kernel code, but am always more than willing to
> be re-educated.

See the link[1] for the example of mknod. My point was that both cgroups and
namespaces gate the requirements for creating new devices. And that makes
logical sense because any user can create a new user namespace with full
capabilities, but the device cgroup rules aren't changed by doing that (so it
would allow for unprivileged users to do all sorts of bad things as a result).

[1]:
[https://github.com/torvalds/linux/blob/fe82203b63e598c34d96e...](https://github.com/torvalds/linux/blob/fe82203b63e598c34d96e846dea49679a726fc7a/fs/namei.c#L3660)

~~~
contingencies
Yes, _obviously_ mknod is capability checked in the kernel, as there is a
capability with that very name. Where else would it occur? I never suggested
otherwise.

My point was that in your talk and comments you make some suggestions which I
feel are misleading or inaccurate. In some cases this is probably because you
are discussing your own implementation. In other cases this is because I
believe your use of terminology is incorrect or misleading, eg. capabilities
are not 'part of' namespaces, any more than block devices are 'part of' the
filesystem. Similarly, the phrase 'unprivileged namespace' comes across as
exceptionally obtuse, given that most security checks occur in other kernel
subsystems, and a holistic conception is implied by the word 'unprivileged'.
Naming is hard, but kernel terminology should already be consistent.

I can't see us getting anywhere further with this discussion but an honest
thanks for taking the time to respond.

~~~
cyphar
> Yes, obviously mknod is capability checked in the kernel, as there is a
> capability with that very name.

Yes, and it's checked against the root user namespace, not your current one.
Which means that it's not as simple as "do I have this capability". It's "do I
have this capability _in this user namespace_". Which is what I said
originally, so I'm not sure what you're debating here.
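
To make the "in this user namespace" part concrete, here is a simplified Python
model of a namespace-scoped capability check. It is a sketch under stated
assumptions, not the kernel's code: `UserNS`, `Cred` and this `ns_capable` are
illustrative names, and the real logic has more cases. The model walks up from
the target namespace and grants the capability only if the process holds it in
that namespace or in an ancestor of it (or created a direct child namespace).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class UserNS:
    """Illustrative stand-in for a user namespace."""
    owner_uid: int                      # UID of the creator, as seen by the parent
    parent: Optional["UserNS"] = None   # None marks the root (initial) namespace


@dataclass
class Cred:
    """Illustrative stand-in for a process's credentials."""
    euid: int
    user_ns: UserNS
    effective: frozenset = frozenset()  # effective capability set


def ns_capable(cred, target_ns, cap):
    """Does `cred` hold `cap` with respect to `target_ns`? (simplified model)"""
    ns = target_ns
    while ns is not None:
        if ns is cred.user_ns:
            # Same namespace: the effective set decides.
            return cap in cred.effective
        if ns.parent is cred.user_ns and ns.owner_uid == cred.euid:
            # The creator of a child namespace holds all capabilities over it.
            return True
        ns = ns.parent
    # Reached the root without meeting the process's namespace:
    # the process lives in a descendant and has no power here.
    return False


root_ns = UserNS(owner_uid=0)
child_ns = UserNS(owner_uid=1000, parent=root_ns)
container = Cred(euid=1000, user_ns=child_ns, effective=frozenset({"CAP_MKNOD"}))
```

In this model `ns_capable(container, child_ns, "CAP_MKNOD")` is true while
`ns_capable(container, root_ns, "CAP_MKNOD")` is false, which is the situation
described for mknod: a full capability set inside the new namespace does not
satisfy a check made against the root namespace.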

> My point was that in your talk and comments you make some suggestions which
> I feel are misleading or inaccurate. In some cases this is probably because
> you are discussing your own implementation.

Can you give some examples? If I'm wrong about something, I'd love to see an
example of it so I can correct it in the future. In particular, I'd like to
know what parts of my discussion are related to runC's implementation of
containers.

> capabilities are not 'part of' namespaces, any more than block devices are
> 'part of' the filesystem

Capabilities are effectively 4 bitmasks that are scoped to user namespaces.
Yes, they aren't technically part of "struct user_namespace" but the way
they are used makes them quite intricately related. Much more intricately
related than block device inodes in a filesystem.
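
For readers who want to look at those bitmasks, the sketch below decodes the
`CapInh`, `CapPrm`, `CapEff` and `CapBnd` lines in the format used by
`/proc/<pid>/status`. The `SAMPLE` text is made up for illustration (it is not
output captured from a real system), and `parse_cap_sets` and `has_cap` are
hypothetical helper names. `CAP_MKNOD` is bit 27 in `<linux/capability.h>`.

```python
CAP_MKNOD = 27  # bit number from <linux/capability.h>

# Illustrative sample only, in the format of /proc/<pid>/status.
SAMPLE = (
    "Name:\texample\n"
    "CapInh:\t0000000000000000\n"
    "CapPrm:\t0000003fffffffff\n"
    "CapEff:\t0000003fffffffff\n"
    "CapBnd:\t0000003fffffffff\n"
)


def parse_cap_sets(status_text):
    """Return {set name: integer bitmask} for the four capability sets."""
    sets = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("CapInh", "CapPrm", "CapEff", "CapBnd"):
            sets[key] = int(value.strip(), 16)  # masks are printed in hex
    return sets


def has_cap(mask, cap):
    """True if bit number `cap` is set in `mask`."""
    return bool((mask >> cap) & 1)


cap_sets = parse_cap_sets(SAMPLE)
```

Here `has_cap(cap_sets["CapEff"], CAP_MKNOD)` is true and
`has_cap(cap_sets["CapInh"], CAP_MKNOD)` is false. Note that the masks alone do
not say which user namespace they are valid in; that scoping is the point being
made above.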

> given that most security checks occur in other kernel subsystems, and a
> holistic conception is implied by the word 'unprivileged'.

But those security checks generally will use either ns_capable (so they're all
hooked up to the user namespace scoping logic), basic UID/GID checks (which
are also hooked up to user namespace mapping logic) or some other security
framework that is effectively an additional ACL (and isn't the purview of
namespaces anyway).

The main complaint I would understand from the use of "unprivileged namespace"
is that the kernel doesn't actually mark namespaces as being privileged or
unprivileged, it just so happens that unprivileged namespaces act slightly
differently because of how they are set up. But AFAICS that's not the argument
you're making.

------
ptspts
What is the status of TCP/IP networking in rootless runC? How can incoming and
outgoing connections be restricted?

~~~
cyphar
You can't both use iptables and have access to host network interfaces (as an
unprivileged user), if that's what you mean. You can either create a network
namespace and manage network interfaces (but not have a bridge to the host) or
just deal with what access you're given. In the future my plan is to write a
CNI plugin that implements pseudo-bridges in userspace so that you could get
both (though I'm not sure if it'll work at the moment).

But if you're an administrator you can restrict it like any other process.

------
0x006A
Related talk cyphar gave at linux.conf.au 2017 on this topic:
[https://www.youtube.com/watch?v=r6EcUyamu94](https://www.youtube.com/watch?v=r6EcUyamu94)

------
minimaxir
A semi off-topic note:

As shown in the linked thread, this Hacker News submission was promoted via an
image of the submitted link in /newest. This is a modern form of vote
manipulation (although in this case perhaps unintentionally) which seems to be
the rage nowadays, especially on a certain other social-voting website. (The
content of the submission is good/important regardless, but others should keep
in mind that this form of voting manipulation isn't clever)

~~~
advisedwang
I'm not sure I understand how this is vote manipulation? Surely they are just
replying to the "somebody should post this to HN" comment by showing that they
did? Why does an image get illegitimate votes?

~~~
minimaxir
Posting an image makes it obvious what to upvote. (as opposed to replying "I
posted on Hacker News")

But even then that would technically be vote manipulation because it's drawing
attention to an immature submission. (and there is little genuine reason to do
so. The only counterexample I can think of is using HN as a comments section)

~~~
eriknstr
There were 15 participants in the GitHub issue. Hardly enough people to
manipulate anything, so obviously that was not the intention. Even though some
300 people are watching the repo as a whole I doubt that many of them would
see that particular comment. As the parent to your comment said, a screenshot
is a reasonable way to inform the others in the thread that the link has been
posted.

I am against vote-manipulation of course but I don't think that this is an
instance of it.

~~~
MichaelBurge
This thread has 71 points in 3 hours. Having 15 people upvote it all at once
could get it onto the front page, where it would then keep getting upvoted
naturally.

------
barbazfoo
This is huge, major props to cyphar and the runC folks. Congrats! The result
of 11mo of dev work. Dope AF.

~~~
cyphar
<3 The one thing I've really learned from all of this is just how many
different things can break if you take away root privileges from a process.
I'm kinda excited to see how much of cri-o and kubernetes we can make run as
an unprivileged user (especially given how Cloud Foundry has already taken
advantage of this).

~~~
SEJeff
And when we finally bridge the containers seriously with criu for container
migrations, wow.

~~~
jacques_chester
I've been nagging anyone who will listen that this ought to be done. (Easy for
me: I don't have to do it. I will of course take credit to the fullest extent
the law allows.)

------
falcolas
If a rootless container process runs as the root user and can't be switched,
is it considered to be "root" as far as the kernel is concerned? As in, does
it have access to root-only kernel features (like the root keychain)?

~~~
cyphar
No, the kernel knows what its "real" UID and GID are. It even knows what
unmapped UIDs and GIDs are. I haven't tried to access the root keychain inside
a rootless container, but if it does work I would consider that to be a
vulnerability in the kernel.
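
A small Python model of how a single-entry UID mapping behaves may help here.
It is a sketch, not the kernel's implementation: the function names are
hypothetical, the map text uses the `/proc/<pid>/uid_map` line format (ID
inside the namespace, ID outside, length), and host UID 1000 is an assumed
example value. 65534 is the default overflow UID shown for unmapped IDs.

```python
OVERFLOW_UID = 65534  # default value of /proc/sys/kernel/overflowuid


def parse_id_map(text):
    """Parse uid_map-style lines: '<inside> <outside> <count>'."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, count = (int(field) for field in line.split())
            entries.append((inside, outside, count))
    return entries


def to_host(id_map, inside_id):
    """Translate an in-container ID to the host ID, or None if unmapped."""
    for inside, outside, count in id_map:
        if inside <= inside_id < inside + count:
            return outside + (inside_id - inside)
    return None


def from_host(id_map, host_id):
    """Translate a host ID to what the container sees; unmapped IDs overflow."""
    for inside, outside, count in id_map:
        if outside <= host_id < outside + count:
            return inside + (host_id - outside)
    return OVERFLOW_UID


# Rootless-style map: container UID 0 is host UID 1000 (assumed), nothing else.
rootless_map = parse_id_map("0 1000 1\n")
```

With this map, `to_host(rootless_map, 0)` gives 1000, so "root" in the
container is still the unprivileged host user, and `from_host(rootless_map, 0)`
gives 65534, so the host's real root shows up inside as the overflow user.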

------
thinxer
Are rootless containers safe now? They are not turned on by default in Arch
Linux because of security concerns[1].

[1]:
[https://bugs.archlinux.org/task/36969](https://bugs.archlinux.org/task/36969)

------
arianvanp
What is the difference between runC and rkt? They both implement OCI, right?

~~~
cyphar
I'd compare runC to LXC as opposed to rkt. It just takes a path and
configuration and starts a container, without managing images or networking
topologies. rkt is on the same level as cri-o (or containerd) since it also
implements image handling as well as other related things.

Also, rkt (by default at least) uses systemd-nspawn to create containers. In
runC we have our own implementation called libcontainer.
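
As a rough illustration of the "path and configuration" input, the Python
sketch below builds only the fragment of an OCI runtime `config.json` that
relates to a single-user mapping. It is deliberately incomplete (a real config
also needs fields such as `ociVersion`, `root` and `process`), the helper names
are hypothetical, the choice of namespaces is an assumption, and the key names
follow the OCI runtime specification 1.0 rather than anything stated in this
thread.

```python
import json


def rootless_fragment(host_uid, host_gid):
    """Build the user-namespace part of an OCI config for one mapped user.

    Incomplete on purpose: this is only the 'linux' section fragment.
    """
    return {
        "linux": {
            # Which namespaces to list is an assumption for this sketch.
            "namespaces": [{"type": "user"}, {"type": "mount"}],
            # Container ID 0 maps to the invoking user's ID, size 1.
            "uidMappings": [{"containerID": 0, "hostID": host_uid, "size": 1}],
            "gidMappings": [{"containerID": 0, "hostID": host_gid, "size": 1}],
        }
    }


def to_json(fragment):
    """Serialize the fragment the way it would appear inside config.json."""
    return json.dumps(fragment, indent=2, sort_keys=True)


fragment = rootless_fragment(host_uid=1000, host_gid=1000)
```

`to_json(fragment)` produces the JSON text; the values 1000/1000 are example
IDs for the unprivileged user, not something runC requires.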

