I'm trying to get a better sense of how this approach differs technically from rootless Docker / usernetes. I understand that it's not there yet, for many reasons, and I see your FAQ about it, but it's clearly working towards the same goal, right?
I think what's going on is that you depend on shiftfs from Ubuntu, and SECCOMP_RET_USER_NOTIF (or something?), unprivileged user namespaces, cgroup namespaces, etc. from the upstream kernel, but the major missing parts in the upstream kernel are procfs and sysfs virtualization and making shiftfs feel seamless, and so you've written a syscall trapper and a FUSE filesystem that run on the host and emulate the things you need. Is that approximately right?
If so, I'd be really curious whether you see a path to get onto upstream runc at some future point. It seems like you'd need shiftfs to be upstreamed, but if an unprivileged procfs2 + sysfs2 shows up upstream, I think you can use that? And you'd probably fit in at approximately the place something like vpnkit fits in for managing shiftfs?
I have a use case for this sort of thing at work, and we've been exploring rootless Docker and unprivileged containers a bit. I'm trying to get a sense of why to prefer Sysbox EE instead of waiting for (or, ideally, contributing to) upstream support for namespaced procfs/sysfs, for shiftfs, and for properly teaching Kubernetes about user namespaces. I suppose the answer is that your solution works right now, and upstream support might take years?
I guess that puts you in a position much like OpenVZ and even LXC itself, which both had significant out-of-tree code in years past and seemed to be decently successful businesses as stuff slowly got upstreamed.
It seems like the major benefits of Sysbox EE are paid support and not using the same uid_map for each container?
I don't think the rootless approach is fully aligned with what we're doing right now. True, we both rely on user namespaces, and we both emphasize the security angle, but our goal is to expand the range of applications and functionality that can run in containers, which is something the rootless approach may struggle with for some time.
As for our dependencies, we can operate with or without shiftfs; user namespaces are used in both cases. The rest of your approximation is correct: we need most of what you mentioned in your second paragraph, which, by the way, is already available (thanks to the Canonical/LXD folks) starting with kernel 5.0+ on Ubuntu and 5.5+ on other distros. As you know, shiftfs is only present in Ubuntu at the moment, but as I said, we can live without it.
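For anyone who wants to check whether their host has shiftfs at all, here is a minimal Python sketch. It is not Sysbox code; the function names are made up for illustration. It assumes the kernel lists registered filesystem types in /proc/filesystems, one per line, optionally prefixed with "nodev". A filesystem built as a module only shows up there once the module is loaded, so a False result does not prove the kernel lacks it.

```python
def filesystem_registered(name, proc_filesystems_text):
    """Return True if `name` is listed as a filesystem type.

    `proc_filesystems_text` is the content of /proc/filesystems, where each
    line is either "nodev <fstype>" or just "<fstype>".
    """
    for line in proc_filesystems_text.splitlines():
        fields = line.split()
        # The filesystem type is always the last field on the line.
        if fields and fields[-1] == name:
            return True
    return False


def shiftfs_available(path="/proc/filesystems"):
    """Best-effort check for shiftfs on the running host (hypothetical helper)."""
    try:
        with open(path) as f:
            return filesystem_registered("shiftfs", f.read())
    except OSError:
        # /proc is not mounted or not readable: treat as "not available".
        return False


if __name__ == "__main__":
    print("shiftfs registered:", shiftfs_available())
```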
Which leads me to your question: why would you wait if the functionality you're after is already there? If having dockerd running as an unprivileged user is not a real must-have for you, then Sysbox provides a fairly secure solution while giving you all the functionality.
Sorry, I'm not familiar with vpnkit yet; I'll take a look.
Correct, those are some of the benefits Sysbox-EE offers at the moment. That, plus efficiency & scalability features and hardened testing.
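To make the uid_map point concrete, here is a small Python sketch (again, not Sysbox code, and the helper names are hypothetical). It assumes the standard /proc/<pid>/uid_map format, where each line holds three integers: the first ID inside the namespace, the first ID on the host, and the length of the range. Two containers are isolated from each other at the ID level only if their host-side ranges do not overlap.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside_start, host_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, host, length = (int(x) for x in fields)
        entries.append((inside, host, length))
    return entries


def host_ranges_overlap(map_a, map_b):
    """Return True if any host-side ID range in map_a intersects one in map_b."""
    for _, start_a, len_a in map_a:
        for _, start_b, len_b in map_b:
            # Half-open intervals [start, start + length) intersect
            # exactly when each one starts before the other ends.
            if start_a < start_b + len_b and start_b < start_a + len_a:
                return True
    return False


if __name__ == "__main__":
    # Illustrative values only: two containers with distinct 64k host ranges.
    container_1 = parse_id_map("0 100000 65536\n")
    container_2 = parse_id_map("0 165536 65536\n")
    print(host_ranges_overlap(container_1, container_2))
```

With a shared mapping, both containers would report the same host range and the check would return True; with exclusive per-container ranges like the illustrative ones above, it returns False.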
Thanks a lot for your detailed feedback @geofft. Please ping us on Slack anytime.
I mostly mean vpnkit in the sense that it's a thing that plugs into rootless Docker to provide networking. It seems like you could also be a plugin to upstream rootless Docker to provide sysbox-fs and your shiftfs management, at least in the long term.
Will try to remember to join the Slack next week, this is definitely a cool project :)