I worked on something similar for Windows for nearly 10 years, called Thinstall (acquired by VMware and renamed ThinApp). I love the idea of something like that for Linux.
We put a lot of effort into being able to run an application directly from a compressed single-file binary: you have a small executable with a big payload attached to the end of it. The payload essentially contains a mountable filesystem; however, because Windows doesn't support doing anything like that from an unprivileged account, we had to emulate many things Windows normally does, including execution of binary images.
We did some tricks where executable data was stored on disk in a format that could be directly memory-mapped and run without reading anything up front - this allowed us to launch most applications over a network without extracting anything to disk locally and achieve millisecond launch times. Pages would get pulled into local memory by the OS as needed while the application was running.
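The demand-paging idea - nothing is read from disk until a page is actually touched - can be sketched with a plain `mmap` on Linux (a much simplified illustration, not the Windows mechanism described above; file names here are made up):

```python
import mmap
import os
import tempfile

def map_payload(path):
    """Map a file read-only; the OS faults pages in only when touched."""
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size
    # No data is read here - each page is pulled in on first access.
    mm = mmap.mmap(fd, size, prot=mmap.PROT_READ)
    os.close(fd)  # the mapping keeps its own reference to the file
    return mm

# usage: write a fake 4 KiB payload, map it, touch a few bytes
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    f.write(b"\x7fELF" + b"\x00" * 4092)
mm = map_payload(f.name)
print(mm[:4])  # accessing the bytes is what triggers the page-in
```

Over a network filesystem the same mechanism means launch cost is roughly "pages touched", not "package size".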
The end result was that you could create a single EXE file that contained things like Python + a script, or something more complex like Photoshop or Word. Just go to a new computer from a guest account, launch the EXE, and you instantly start using that app.
I haven't been involved with it for the last 6 years, but I believe VMware still offers it.
Wow, hello! I worked in the UK on a short-lived product called SoftInstall.net which did something similar for .NET applications. We had our own file system and registry emulation layers, all inspired by Microsoft Research’s “Detours” paper. I remember being impressed with Thinstall at the time.
You gotta emulate CreateProcessEx? How did you approach that?
I've seen people do it by running something like calc.exe suspended and then remapping all the memory. The trouble is, this approach certainly wouldn't work for situations involving 64-bit Windows and 32-bit binaries, and well, it's ugly and confusing ("why is Photoshop running under calc.exe?") Surely, the solution had to be nicer than that?
Or, you know, you could just uncompress the application in a newly-created subfolder of C:\WINDOWS\TEMP\ and launch the binary that is to be run afterwards (which of course is conveniently named SETUP.EXE). The way it's always been done.
The problems with this approach are:
- It can take a long time to extract the package, especially if you are looking at gigabyte applications. We are able to launch huge applications instantly because we avoid the extraction step.
- Windows applications are typically hard-coded to read files from global locations like c:\windows\system32, side-by-side DLL locations, HKEY_LOCAL_MACHINE registry, font folders, services, etc. If you try to copy most applications to a single folder, they won't run. Some apps can be designed to run from a single folder, but this is the exception rather than the rule.
- Ideally you don't want the application writing to other parts of the filesystem or registry because it might screw up other apps. We solved this by doing copy-on-write access to other parts of the PC and redirecting the writes to a local folder.
Thinapp solved this by presenting a synthetic view of the filesystem and registry to the app. This view merged the contents of the real system with the contents of the package, so the app thought everything was where it expected when it ran. In addition, we provided a copy-on-write sandbox of this view, so any writes would be redirected to a local folder without the app knowing about it. If you delete the sandbox, the app goes back to its "first run" state.
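The merge-then-redirect behavior can be sketched as a toy path resolver (this is my own model of the idea, nothing like ThinApp's actual Win32 API hooking; the class and layer names are invented): reads see sandbox, then package, then the real system; the first write copies the file up into the sandbox.

```python
import os
import shutil
import tempfile

class CowView:
    """Toy merged, copy-on-write view: reads prefer sandbox, then
    package, then the real system; writes always land in the sandbox."""
    def __init__(self, package, sandbox, system="/"):
        self.layers = (sandbox, package, system)
        self.sandbox = sandbox

    def _candidates(self, path):
        rel = path.lstrip("/")
        return [os.path.join(layer, rel) for layer in self.layers]

    def resolve_read(self, path):
        for p in self._candidates(path):
            if os.path.exists(p):
                return p
        raise FileNotFoundError(path)

    def open_for_write(self, path):
        rel = path.lstrip("/")
        target = os.path.join(self.sandbox, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        if not os.path.exists(target):
            for p in self._candidates(path)[1:]:  # package, then system
                if os.path.isfile(p):
                    shutil.copy(p, target)  # copy-up before first write
                    break
        return open(target, "a")

# usage: the package provides /etc/app.ini; the app "edits" it
pkg, sb = tempfile.mkdtemp(), tempfile.mkdtemp()
os.makedirs(os.path.join(pkg, "etc"))
with open(os.path.join(pkg, "etc", "app.ini"), "w") as f:
    f.write("color=blue\n")
view = CowView(pkg, sb)
with view.open_for_write("/etc/app.ini") as f:
    f.write("color=red\n")
```

Deleting the sandbox directory restores the "first run" state, since the package layer was never modified.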
I see that it extracts files into `/tmp/...` inside the container before running it. It'd be better if it could read the files directly from the image, say, like a self-executable squashfs image.
Wouldn't in-place decompression for a "self-executable" squashfs require one to disable NX for that binary? That seems like it would be a terrible idea from a security perspective. (Even for pure data, in-place decompression seems like it would heavily interfere with e.g. demand paging of such data into RAM, using the binary itself as backing store. Disk space is so cheap these days - I don't think things like squashfs should be used nowadays unless truly necessary.)
NX does not need to be disabled[1] for the binary thanks to W^X[0]. As long as pages are never mapped as writeable and executable at the same time, you don't lose any protection.
[0] https://en.wikipedia.org/wiki/W%5EX
[1] AFAIK, on Linux NX isn't something that is enabled or disabled for individual programs anyways, it's up to the userland whether or not it takes advantage of the mmap(2) mapping flags
In general, you lose some protection, as a two-stage exploit might write some data in W state then flip to (or wait for) X state for execution. That being said, a self-decompressing binary would only do this at startup, before it consumes untrusted input, so given a way to drop map-executable permissions that wouldn't be a problem.
And yet we always find new ways to waste it. 16 GB was once an unimaginable amount of space, and yet one of my coworkers now takes it for granted that 16 GB isn't enough storage for a smartphone.
Agreed. We had a petabyte of storage for scratch data on our cluster. We were stuck at 95-100% capacity for the past month and change, until we recently added another half petabyte or so. We were constantly deleting data and automating warnings to people that their files would be removed and their access restricted if things weren't cleaned up. We ended up suspending the cluster for a day while we tried to figure out why we were exponentially losing space (one user had decided to start uploading terabytes of data to scratch, which is against policy).
No one seemed to understand that if data filled up completely, the entire cluster would go offline because _we wouldn’t have the space for anything_.
Cynically, I think it's not that they don't understand, it's that they don't care - and that's probably why you are there: to care for them and solve it. I have similar engineers and it's a struggle. Didn't Douglas Adams invent the concept of the SEP Field to explain this (somebody else's problem)?
Yes, it's part of system management to prevent users interfering with one another -- and at least detect problems that are difficult to prevent, like activity crippling a parallel filesystem, and sorting it out.
It doesn't have to be in-place, or even compressed; it just has to be automatic and OS-integrated. With something like squashfs (with or without compression) the kernel handles all the paging, mapping, and decompression (if any). Of course there are the classical benefits like resource usage being more efficient, but more importantly, less userspace code that needs to worry/care about filesystems and file layout etc.
> It doesn't have to be in-place, or even compressed; it just has to be automatic and OS-integrated. With something like squashfs (with or without compression) the kernel handles all the paging, mapping, and decompression (if any).
Sure, but I don't think you can do that within a self-contained binary, the kernel doesn't really support that. It would have to just be a loopback image, and lose the whole "self-contained binary" angle that makes this different from all sorts of other container tech.
> Sure, but I don't think you can do that within a self-contained binary, the kernel doesn't really support that.
An executable binary can set up a mount namespace, mount /proc/self/exe or argv[0] (with some offset) as a loopback device within the namespace, and then execute a process within the mount.
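Whatever ends up doing the mount, the stub first has to know where the appended image starts inside its own file. A hedged sketch of one way to record and recover that offset (the footer format and `MAGIC` marker here are made up; the loop mount itself still needs privileges or a setuid helper, as noted below):

```python
import os
import struct
import tempfile

MAGIC = b"SQFSPKG!"  # hypothetical 8-byte marker written at build time

def append_payload(binary_path, image_bytes):
    """Append an image plus a footer recording where it starts.
    Resulting layout: [stub][image][offset:u64 little-endian][MAGIC]."""
    offset = os.path.getsize(binary_path)
    with open(binary_path, "ab") as f:
        f.write(image_bytes)
        f.write(struct.pack("<Q", offset) + MAGIC)

def payload_offset(binary_path):
    """Read the 16-byte footer back and return the payload offset."""
    with open(binary_path, "rb") as f:
        f.seek(-16, os.SEEK_END)
        footer = f.read(16)
    if footer[8:] != MAGIC:
        raise ValueError("no payload footer found")
    return struct.unpack("<Q", footer[:8])[0]

# usage: build a fake package and read the offset back
with tempfile.NamedTemporaryFile(delete=False) as stub:
    stub.write(b"#!stub\n")  # 7-byte stand-in for the launcher
append_payload(stub.name, b"squashfs-image-bytes")
```

The stub would then hand that offset to the mount step, roughly `mount -o loop,offset=<offset>,ro /proc/self/exe /mnt/app`.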
Physical pages can be mapped more than once; it is the mapping that describes protection bits. Implementing an in-place FUSE filesystem, for example, would have no bearing on whether other mappings of the same file lacked execute bits, and in the case of the FUSE interface, the kernel does not even know the original source of the data.
I think this could also be achieved with a loopback device set up to cover part of the file - again, no relevance to NX etc. - but that would require setuid or similar to configure the device.
> Well judging by the original GitHub issue about unprivileged runc containers, the largest group of commenters is from the scientific community who are restricted to not run certain programs as root.
Which is why we use Singularity[0]. Apologies if I’m missing something, but to me the problem they’re trying to solve has already been solved. As stated in the README:
> Singularity is an open source container platform designed to be simple, fast, and secure. Singularity is optimized for EPC and HPC workloads, allowing untrusted users to run untrusted containers in a trusted way.
Singularity ‘requires’ root or elevated privileges to create and modify containers, but running them is done in the user’s namespace with that user’s permission set. And privileges can’t be elevated during the container’s runtime, as there are kernel-level catches that prevent sudo and su from working inside the container.
In terms of the ‘static’ part, Singularity uses SquashFS to produce a single binary image, and uses the SIF (Singularity Image File) format to define applications, their dependencies, and runscripts, on top of being able to run anything inside the container with the `exec` subcommand.
You can also create sandbox development images that are just a directory on disk for testing things out and installing/building tools manually, then convert it into a SIF container after for cluster deployment. Or be nicer and port your process into a recipe/definition file.
Most scientific/HPC related centers have started utilizing Singularity for their container workflow. Tools like Docker are essentially banned because of their perma-root privileges[1].
[0] https://github.com/sylabs/singularity
[1] I’m an admin who deployed Singularity on our university’s cluster. With a heavy reliance on filesystem integration (which Singularity solves with great configuration options), containers which run with root privileges would be a massive security vulnerability.
I started the rootless containers project and have had long email threads with the Singularity folks.
> Which is why we use Singularity[0]. Apologies if I’m missing something, but to me the problem they’re trying to solve has already been solved. Singularity ‘requires’ root or elevated privileges to create and modify containers,
Well, Singularity came out at the same time as runc's rootless containers support was developed -- so at the time it wasn't solved at all. It required (and still requires for several operations and features) root privileges in order to work (such as suid helpers in addition to explicit root requirements).
It's not acceptable for a usecase where you cannot run anything as root (not even installation scripts). That was the use-case, and rootless runc (and now thanks to Akihiro and Giuseppe, rootless Docker and Kubernetes) in theory can now work without the need for any suid helpers -- though these days there is an increasing usage of newuidmap and newgidmap (which isn't mandated for the rootless runc implementation).
At the time, LXC's unprivileged containers was the closest thing and it had a few (mostly optional) suid binaries -- I wanted absolutely none.
> but running them is done in the user’s namespace with that user’s permission set. And privileges can’t be elevated during the containers runtime as there are kernel level catches that prevent sudo and su from working inside the container.
This last part really isn't revolutionary at all. All it really takes is a syscall and three files to write to. The hard part is getting everything else to work without root privileges.
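That "syscall and three files" is `unshare(CLONE_NEWUSER)` plus writes to `setgroups`, `uid_map`, and `gid_map` in procfs. A hedged sketch (the procfs paths and flag value are real; the helper functions are mine, and whether the `unshare` succeeds depends on kernel config):

```python
import ctypes
import os

CLONE_NEWUSER = 0x10000000  # from <linux/sched.h>

def map_line(inside, outside, count=1):
    """Line format shared by /proc/<pid>/uid_map and gid_map."""
    return f"{inside} {outside} {count}\n"

def become_root_in_userns():
    """The 'one syscall and three files': unshare(CLONE_NEWUSER), then
    write setgroups, uid_map and gid_map so our IDs appear as 0 inside."""
    uid, gid = os.getuid(), os.getgid()
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(CLONE_NEWUSER) != 0:
        raise OSError(ctypes.get_errno(), "unshare(CLONE_NEWUSER) failed")
    with open("/proc/self/setgroups", "w") as f:
        f.write("deny")  # required before an unprivileged gid_map write
    with open("/proc/self/uid_map", "w") as f:
        f.write(map_line(0, uid))
    with open("/proc/self/gid_map", "w") as f:
        f.write(map_line(0, gid))
```

After calling `become_root_in_userns()` in a process, `os.getuid()` reports 0 there, while the kernel still treats it as the original unprivileged user for filesystem access. The hard part, as said above, is everything that comes after.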
Are you a dev on binctr? If so, the following statements are based on the assumption of yes.
> Well, Singularity came out at the same time as runc's rootless containers support was developed -- so at the time it wasn't solved at all. It required (and still requires for several operations and features) root privileges in order to work (such as suid helpers in addition to explicit root requirements).
Alright, but I wasn’t asking about a “back then” scenario, I was mostly trying to figure out what this offers _today_. From a Singularity end-user’s perspective, there’s only one function that requires elevated permissions, and that’s building.
> It's not acceptable for a usecase where you cannot run anything as root (not even installation scripts). ... At the time, LXC's unprivileged containers was the closest thing and it had a few (mostly optional) suid binaries -- I wanted absolutely none.
Now I understand what this offers and what its overarching goal is. In my specific situation (which is the angle I’m looking at this from), that’s not really a big deal due to the way we handle global and local software installation. I understand this is not a universal situation.
> This last part really isn't revolutionary at all.
I don’t think I made it sound like it was revolutionary, I was just sharing what a commonly used tool in HPC does for a security measure.
Based off of the binctr README:
> Create fully static, including rootfs embedded, binaries that pop you directly into a container. _Can be run by an unprivileged user._
My main confusion stems from that. In comparison to Singularity, it doesn’t seem to offer anything new or noteworthy that would make me take a second look at it. The first phrase just makes me think I’m saving myself a few keystrokes, as I wouldn’t have to type `singularity shell <container>` to shell into my container of choice. It just looks like yet another container solution. I still find these things cool (and they’re above my head in dev terms), but it feels like programming languages, a new one popping up every so often.
I’m primarily looking to see what advantages (or differences in approach) this offers over Singularity, Shifter, or Charliecloud, etc for my context. At least in terms of security and efficiency.
No, but I'm a maintainer of runc and implemented all of the core features that binctr uses (or rather, that Jessie hacked together for a proof of concept). I've also implemented rootless support in quite a few other tools to the point where now an unprivileged user can download, extract, and run a rootless container. In addition, I'm working with some other folks who joined later to get Kubernetes (and Docker) to be completely rootless.
> From a Singularity end-user’s perspective, there’s only one function that requires elevated permissions, and that’s building.
(Most forms of) execution still requires setuid binaries, which means that you have to have root permissions in order to install Singularity. You can use rootless containers as a completely unprivileged user (meaning that if you have unprivileged shell access to a random box you can use containers). Singularity cannot do this.
> I was just sharing what a commonly used tool in HPC does for a security measure.
My point was that basically any container runtime can do what you described.
> I’m primarily looking to see what advantages (or differences in approach) this offers over Singularity, Shifter, or Charliecloud, etc for my context. At least in terms of security and efficiency.
Many of those tools require setuid binaries that are developed as part of their container runtime -- which means that all of the security is contingent on there being no vulnerabilities in their setuid binaries (something that is hard to do even for the developers of programs like sudo).
Rootless containers are a project that requires absolutely no privileged codepaths for any container operation (though these days you can optionally use standard setuid binaries -- by default they are not used). This is something that none of the projects you listed (as far as I'm aware) can do.
Charliecloud specifically doesn't require setuid, but many HPC systems won't have user namespaces enabled. (It removed the setuid component which allowed it to work on RHEL6, for instance.) You can also easily build a root for it under proot, for instance -- and proot itself is probably OK for running computationally-intensive work. HPC systems already have a privilege escalation mechanism available from the resource manager, but I don't know of a problem with using it for this sort of thing. Shifter controls the images, so you're not at risk from malformed filesystems, at least.
Setuid in Singularity should definitely worry people. It has a non-stellar security record, and has been less than transparent about issues. The last time I looked at the code, for instance, it still had many calls unchecked for error returns (including all mallocs), and wrote uninitialized memory to image files. It was a mistake to get it into Fedora, and HPC people should be a bit more circumspect. That said, resource managers probably provide most of the attack surface in HPC systems, in some cases completely trivially.
By the way, it's often possible just to run programs from an unpacked filesystem of another distribution just by setting PATH and LD_LIBRARY_PATH.
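A hedged sketch of that environment setup (the rootfs path and program are hypothetical; whether a given program actually runs this way depends on how hard-coded its paths are):

```python
import os

def env_for_rootfs(rootfs, base_env=None):
    """Build an environment that prefers binaries and libraries from an
    unpacked foreign rootfs; often enough to run its programs directly."""
    env = dict(os.environ if base_env is None else base_env)
    env["PATH"] = f"{rootfs}/usr/bin:{rootfs}/bin:" + env.get("PATH", "")
    env["LD_LIBRARY_PATH"] = f"{rootfs}/usr/lib:{rootfs}/lib"
    return env

# usage (hypothetical unpacked tree at /opt/alpine):
#   subprocess.run(["busybox", "sh"], env=env_for_rootfs("/opt/alpine"))
env = env_for_rootfs("/opt/alpine", {"PATH": "/usr/bin:/bin"})
```

This breaks down once the program expects a different dynamic loader or absolute config paths, which is where the merged-view/namespace approaches above come in.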
Not sure when you last looked at Singularity's code base, but it was completely re-written this year into Go, so some of the issues you saw may have been solved. Not sure, just letting you know if you haven't been following its development.
> It was a mistake to get it into Fedora, and HPC people should be a bit more circumspect.
I never installed it from EPEL, as building from source was relatively easy in 2.x (trivially easy in 3.x) and allowed buildtime configurations specific to a cluster. If you pay for support you get custom repos, but that's up to the cluster maintainers to decide.
I know it's been re-written, but I don't know why that would restore my faith in it and Sylabs (for multiple reasons). It emphasizes the point about inclusion in Fedora (and other distributions?) for which there's some expectation of security and stability, even if you don't care about that.
Singularity started up in late 2015 and started seeing widespread adoption in HPC centers by late 2016. A product for the industry, by the industry.
In all the container research I’ve done, I’ve never seen binctr mentioned in discussion or as a link from a search engine. Today is the first time I’ve come across it. Age doesn’t mean much for modern search results and isn’t the only factor for adoption. And it certainly doesn’t explain its feature set or security.
> Just saying binctr is not some new project just hitting the scene.
Gotcha. Your original statement left a lot of room for interpretation; I wasn’t sure what you were suggesting.
I didn’t mean to imply binctr was a new project, I meant it relative to me as I’ve never come across it before in comparison to the well known established solutions that exist today. The more “household” project names. I probably should have made that more clear.
I'd add that a ratio of ~2000 stars to ~50 forks suggests that the blog author is a very popular person and that almost no-one is using binctr for serious work. Nonetheless, I starred it as a code reference for interacting with the Docker codebase.
Is there any way to do what it does without them? Genuine question, there’s still a lot about Linux system/user space that I’m not up to snuff on yet (part of a team here, and I don’t handle users, mostly tools). I’m trying to compare/contrast the goals of binctr to existing solutions that appear to do the same or similar thing(s) already, so I’m open to learning something new today!
The only new thing I’m seeing from this solution is the ability to create containers as an unprivileged user, which for most scenarios I tend to view as a non-issue. Or am I misunderstanding that?
> I’m trying to compare/contrast the goals of binctr to existing solutions that appear to do the same or similar thing(s) already, so I’m open to learning something new today!
binctr is a proof of concept that was written almost 3 years ago. These days, rootless containers are being integrated into many tools. Right now, under [1] we have some tooling which allows you to run Docker, cri-o and Kubernetes in rootless mode. Cloud Foundry's default configuration now uses rootless containers.
So really, it's not a choice between binctr and something else (binctr was a proof-of-concept) -- in fact please don't run binctr in production. But runc and quite a lot of other tooling has rootless support now.
Singularity doesn't do rootless containers, it still requires suid binaries in several cases. But I'm biased, having had some pretty bad arguments with the lead developers in the past (mainly for misrepresenting runc).
Thanks! This helps me understand a bit better. So away from binctr, which should just be considered a reference, a rootless system isn’t just about being able to run i.e. execute a container, but also go into and modify it, like using the container’s package manager and fiddling about. Without needing root access to do so (due to the way tools like yum are built to work). But also being able to use systems like Docker, which typically require sudo/root for nearly everything, without needing it.
For us, one of Singularity’s primary advantages is the host filesystem integration capabilities.
Does a rootless container system have any potential drawbacks when dealing with host filesystem integrations and user permissions? Or should this be considered up to the implementer’s discretion and a separate issue?
This may be overthinking it (probably due to my lack of deeper understanding), but an event I can see playing out is a user on the cluster installing a rootless container system, building their own containers, mounting host directories at runtime that they shouldn't have access to (ones that typically require root access), and then modifying them. Is this a possible situation?
I often end up getting caught up in tooling and missing the bigger picture/goal, so apologies for the long back and forth :)
> a rootless system isn’t just about being able to run i.e. execute a container
This is part of it, but I want to repeat again that Singularity (and most projects that you've mentioned) only appear to not require privileges -- they use setuid binaries. That isn't the same as not requiring privileges (in my opinion -- because requiring setuid binaries means you need root to install something on the host).
Rootless containers can be used without any root access at any point -- even during container runtime installation.
> but also go into and modify it, like using the container’s package manager and fiddling about. Without needing root access to do so (due to the way tools like yum are built to work). But also being able to use systems like Docker, which typically require sudo/root for nearly everything, without needing it.
You have "root" inside the container, but this is a user namespace so it's actually an unprivileged user on the host (and you don't need to have privileges to create a user namespace). Package managers get a bit complicated, but we have techniques to get around their lovely issues (or you can use newuidmap/newgidmap, which map a range of UIDs/GIDs, and then it all works without issue -- these are standard setuid programs you get with distributions, and most distributions will give users mappings by default).
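For reference, newuidmap/newgidmap decide which ID ranges a user may map by consulting `/etc/subuid` and `/etc/subgid`, whose lines are `user:start:count`. A small parser sketch (the function name and example users are mine; the file format is the standard one):

```python
def parse_subid(text, user):
    """Parse /etc/subuid (or /etc/subgid) content and return the
    (start, count) subordinate-ID ranges granted to `user`."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        name, start, count = line.split(":")
        if name == user:
            ranges.append((int(start), int(count)))
    return ranges

# e.g. a typical distro default grants each user 65536 subordinate IDs:
#   alice:100000:65536
#   bob:165536:65536
```

With such a range mapped, UIDs 0-65535 inside the container all correspond to real, distinct host UIDs owned by the user, which is what makes package managers behave.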
Docker is a bit complicated (to run as an unprivileged user or inside a container), but with some patches (which are mostly upstream now, from memory) you can run Docker inside a user namespace with a bit of finagling, thanks to the rootless work done by myself and Akihiro.
> one of Singularity’s primary advantages is the host filesystem integration capabilities.
One of my problems with Singularity is that they keep creating new words to define old concepts -- can you explain what you mean by "host filesystem integration"? Are you talking about bind-mounts? If yes, then yes you can use them.
> This may be thinking too much (but probably due to lack of deeper understanding) but an event I see playing out is a user on the cluster installing a rootless container system, building their own containers, runtime mounting host directories they shouldn't be that typically require root access and then modifying them. Is this a possible situation?
Yes, this is precisely the long-term usecase we want to have. Right now you can run Kubernetes unprivileged (in a mode we call "usernetes"[1]) and the idea is that you will be able to login as any unprivileged user on a cluster machine and set up Kubernetes (or just run containers directly if you don't like Kubernetes). There are some complications with networking, but we're working on it.
> Are you talking about bind-mounts? If yes, then yes you can use them.
Yes, I’m referring to the bind mount capabilities. By default they are enabled, but Singularity’s config file allows us to disable user-defined mounts and set global mounts that will always be used.
> Yes, this is precisely the long-term usecase we want to have.
I’m confused. The situation I described was a negative one, where a user-run container bind-mounts directories they shouldn’t have access to and modifies them. Like mounting host /etc to container /mnt and then editing the host system’s configuration files. I may be misinterpreting your statement, but I don’t see how that event should be a long-term goal...
> Yes, I’m referring to the bind mount capabilities.
Okay. Every existing container runtime supports bind-mounts. Having a global configuration for them is mostly a UX thing (you could make a wrapper in half an hour that does that).
> The situation I described was a negative where a user run container bind mounts directories they shouldn’t have access to and modifying it.
Oh I misunderstood your question. No, that's not possible. "Root" inside a user namespace cannot access files that the underlying user would not be able to access -- this is guaranteed by the kernel (which is why I misunderstood your question -- it's a kernel security policy which affects every process).
I thought you were asking whether it's possible to run a cluster as an unprivileged user. An unprivileged user can bind-mount whatever they want, but actual file access is blocked when you would expect.
As someone who doesn't know a lot of the nitty details of the kernel or containerizing, I appreciate you for taking the time explaining some of this.
So just to make it completely clear in my head (sorry!), when utilizing a user namespace rootless (non-setuid) container system:
* The user has complete read/write/execute to everything in the 'base' container. Anything bind-mounted in assumes the given privileges from the host system.[0] The user's context is essentially mapped to root inside the container, but not a 'true' root in the context of the host system.
* The user can create/modify/run a container without needing root privileges to communicate with a daemon or any other tooling.
* A user can install a rootless system to their owned directories on a host system and operate its services/tools without needing elevated privilege access.
[0] In the situation where a user maps a root-owned directory to its corresponding directory in the container (/usr:/usr), the container root privilege access is no longer applicable. Other permissions (rwxrwxrwx user group) also apply inside the container from bind-mounts.
If the above is all true, then I think I've got it. All in all, I do think the idea of a rootless system is pretty cool. Obviously you need to be running a 3.8 or later kernel (unless namespaces were backported) with them enabled, but it's cool tech. Thank you for taking the time to help me understand it better. This probably would have gone a lot faster in verbal conversation, so I appreciate you sticking it through :)
If you don't mind me asking, what came up in your conversations with the Sylabs team? Was it a case of "we'd like to wait and see as this is new[1], not thoroughly vetted tech", not developed in-house, or they just didn't want the tool to work that way? Those are just speculations I can come up with off the top of my head. If you don't want to talk about it, that's fine, too.
[1] Relatively speaking at the time. It's clearly not brand-spanking new.
> The user has complete read/write/execute to everything in the 'base' container. Anything bind-mounted in assumes the given privileges from the host system.[0] The user's context is essentially mapped to root inside the container, but not a 'true' root in the context of the host system.
Sure, but it's actually much simpler than this. User namespaces allow you to change the UID mapping of users -- and an unprivileged user can map themselves to "root" (however, on the host, the user is still the same -- this is similar in concept to a PID namespace). However, in a user namespace, filesystem access (plus quite a lot of other things) is restricted based on your real UID/GID.
A container image (in order to be used by a user namespace) needs to be extracted such that the owner is the user that you want to execute as. If you were using "root-ful" user namespaces then you could just chown an existing image, but for rootless you need to extract it. My OCI image tool umoci[1] has supported this extraction (it's a bit more complicated than you might think) for about 2 years now.
The point is that a container image is generally just a directory (maybe it's a complicated mount hierarchy, but at the end of the day there is a directory that you can "chroot" into -- note that we actually use pivot_root but that's a different topic).
> The user can create/modify/run a container without needing root privileges to communicate with a daemon or any other tooling.
Yes. The idea is that if you are dropped on a box (with a new enough kernel) with just an unprivileged user, and none of the tools you need are installed, you can still run containers. So all aspects of the container runtime must not require privilege -- and runc has supported this mode since 2016 (I implemented it).
> If you don't mind me asking, what came up in your conversations with the Sylabs team?
The main disagreement is that they misrepresent runc on their project page[2] -- which is hosted on a .gov website, no less. After that there were some further disagreements about whether rootless was something they actually supported (they don't -- and in fact I was fairly sure I saw several vulnerabilities in their setuid helpers, but I didn't bother looking further into it). Their documentation now says that they've supported rootless containers since 2016, but I don't agree with that statement (their support is also quite restricted)[3].
I got (somewhat unfairly) angry with them over this whole exchange, so I don't really like discussing Singularity that much anymore. In my first talk about rootless containers, someone actually asks about Singularity[4].
> The only new thing I’m seeing from this solution is the ability to create containers as an unprivileged user, which for most scenarios I tend to view as a non-issue. Or am I misunderstanding that?
I believe you're mislabeling it as a non-issue, it seems to be the core problem they're trying to solve:
From the blog[0]:
> Yes, we all know that containers run unprivileged processes; but creating and running the containers themselves requires root privileges at some point.
> What is the awesome sauce we all gain from this?
> Well judging by the original GitHub issue about unprivileged runc containers, the largest group of commenters is from the scientific community who are restricted to not run certain programs as root.
I did read the blog post, but it all seemed to be about running containers as unprivileged users. Being able to create is what I’m saying is a non-issue, because it only saves the single convenience step of not having to upload a container built elsewhere to a cluster. And tools like Singularity have online services (Sylabs Library, Singularity Hub) which offer remote builders for users that don’t have privileges anywhere (which is hard to find) and don’t want to build containers locally. You can then pull and run the produced container from the Singularity executable without needing root access.
Alternatively, a cluster could set up a single machine that gives users sudo access only for the Singularity executable and prevents bind mounts, creating a relatively secure on-premise build environment. However, security matters more than convenience on HPC clusters, so we're more inclined to tell users "do it yourself" than to give them any loophole in the system.
Singularity already solves the unprivileged running of containers. Unless binctr is doing something that provides a different kind of unprivileged access?
I’m not trying to be thick headed, but I’m just not connecting the dots on what’s so special about this.
The Docker work is a direct derivative of the rootless containers work I started more than 2 years ago (and others have been working on before and since), which is what this blog post refers to.
Singularity didn't exist in any meaningful way at the time, and setuid binaries (even a small number of them) are completely unacceptable for the use cases I had.
If it takes a container to get static linking for dynamic languages, that's fine by me.
I have one Python project where I'm just using Docker to produce a single, self-contained lump so I don't have to fuck around when installing. This would be a step forward for me.
And then I have a Python library and command-line tool [1] where an awful lot of the issues people report are library version nonsense. I tried to help the first few, but at this point I've decided it's just not my job to make up for the Python community's packaging issues and/or issues related to other open-source projects and OSes. I'd love to be able to produce the equivalent of a statically linked binary for those people.
It doesn't, really. Since there are umpteen ways to do things in GNU/Linux, developers often end up with less-than-ideal implementations for what they are trying to do using a single user account.
The single-user approach creates clashes between the network devs who want to build empires of containers they "own", and the stack-level bare metal purists who want the system to be as clean and secure as possible by isolating things where they should be isolated (to a single user instance for that purpose alone). This is not a new problem, nor a very well thought-out solution.
Containers are always a less-than-ideal implementation for people running Linux natively. The ideal way to sandbox in Linux is to create a user account, download and test whatever code, see what breaks or oversteps that user's privileges, and delete the user when done.
The problem with using separate user accounts for isolation is that various applications assume they either run as root (like various package managers) or need to start child processes under different user ids. Sometimes this can be worked around, but overall the amount of effort required is non-trivial.
So people decided that instead of fixing the apps it was easier to fix the kernel. But this resulted in a lot of complexity: namespaces, capabilities, cgroups, etc.
I have to admit that I'm quite confused about this comment. Are you saying simply running a command under its own uid is enough to provide the same isolation that containers do, and that the latter were only created because people are not running Linux natively?
I think he is saying there are a lot of 80% solutions like user based isolation that could have been made more secure, but instead people invented a new solution that has its own problems, and that the fractured landscape of solutions we see now is due to the freedom of open source.
POSIX doesn't speak to the problem at all (except for chroot). It's just not part of the standard.
With linux, it's because the tools are exposed piecewise. Creating a process namespace isn't the same thing as creating a virtual network device. You can use cgroups to stop a container all at once, sure, but you can also use them for other useful purposes and the code is the same. Some of these APIs are simple, some less so. But they're tools.
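That piecewise exposure is visible directly in /proc: each namespace type is a separate kernel object with its own handle, which you can create, enter, or share independently of the others (via unshare(2)/setns(2)). A minimal Linux-only sketch that just enumerates them:

```python
import os

def namespace_ids(pid="self"):
    """Return the namespace identifiers for a process, one per type.

    Each entry under /proc/<pid>/ns is an independent handle -- pid,
    net, mnt, uts, ipc, user, cgroup... -- which is exactly why the
    container tooling is exposed as separate pieces rather than one
    monolithic "make me a container" call.
    """
    ns_dir = f"/proc/{pid}/ns"
    if not os.path.isdir(ns_dir):          # non-Linux fallback
        return {}
    return {
        name: os.readlink(os.path.join(ns_dir, name))
        for name in sorted(os.listdir(ns_dir))
    }

if __name__ == "__main__":
    for name, ident in namespace_ids().items():
        print(f"{name:18s} -> {ident}")    # e.g. "mnt -> mnt:[4026531840]"
```

Two processes are in the same namespace of a given type exactly when these identifiers match, which is also how tools like `nsenter` decide what to join.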
I know, I know, now you're asking "well, why didn't someone just put all the tools together in one box that would do it with one sane UI?!"
Rkt is pretty much dead as a standalone technology. It was influential in a number of ways though (for example, it forced the CRI to come into existence in Kubernetes).
I know. And the linked project is an effort in that direction too. The point was that a knee-jerk "sandboxing in linux is hard" argument is just silly on its face. Because Docker. Just go read any of the bajillion tutorials and testimonials about how to get your process to run in a sandbox.
Fundamentally they share a kernel with the host. This has its own implications.
Beyond that it is very difficult to get things right, and there's definitely some information leakage (e.g. see the output of `mount` in a docker container... not a docker specific problem).
The main issue is, having a VM layer is always going to be "more secure" (because it's another security boundary) than not having a VM layer. Of course VM's have their own issues.
So if you have a need to run multi-tenant/untrusted workloads it's kind of a CYA situation where if you don't use this extra security boundary and there is a problem sometime down the line then you'll have to answer questions like "why didn't you do ..."
The reality is you can do a lot to lock things down with seccomp+selinux/apparmor.
Secure for what? I mean, sandboxing is more or less inherently a security technique. An app running in a Docker (or whatever) container is less exposed than one running on a big shared host. Is it as firm an isolation as you get from a proper virtual machine layer? No. But then, as we've discovered over the past year the hardware itself can have bugs that poke holes in that protection.
No free lunch. But yeah, Linux's suite of "container" tools is basically as secure as any other OS's "containers", no matter what spin you're reading elsewhere.
Interestingly, there was a post in HN recently about the next build of Windows introducing sandboxing. It really feels like there has been a steady stream of new security features coming out of Microsoft lately, which is great to see.
Sandboxing in general is thought of as a policy thing that should be implemented in userspace. Kernel provides basic tools (namespaces, seccomp, whatever...), and userspace decides how to combine them to make a sandbox, whatever the developer wants that to be.
Linux was never designed for it, that's why. It's not a first class kernel feature like in other operating systems. Talk to Linus (who doesn't care about security, so don't lead with that as the killer feature)
Linus cares about security; after all, he has merged features like seccomp and SELinux, which exist purely for security, and he supports the kernel hardening work, etc.
He has spoken out against the modern security industry and security theatre, both of which are distinct from security.
So basically we've come full circle to running a multitude of single-binary executable processes, as was done on UNIX systems back in the '90s, only with more complexity (and this effort is an attempt to reduce that complexity, which is in itself telling).
And still nowhere near the flexibility, security, the power and simplicity of illumos zones and SMF.
Freezing python scripts+runtime to generate static executables has long been possible. Of course, all such methods are specific to python and have python-specific caveats. One could say static-containerisation is just a general form of the practice of freezing runtimes with scripts.
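The simplest stdlib flavour of this is zipapp, which freezes scripts (though not the interpreter itself) into one runnable archive. A throwaway sketch that builds and runs such an archive:

```python
import pathlib
import subprocess
import sys
import tempfile
import zipapp

def build_and_run():
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "app"
        src.mkdir()
        # The "application": a single entry-point script.
        (src / "__main__.py").write_text('print("hello from a frozen app")\n')
        target = pathlib.Path(tmp) / "app.pyz"
        # Pack the directory into one executable archive.
        zipapp.create_archive(src, target)
        # Run it exactly like a normal script.
        out = subprocess.run([sys.executable, str(target)],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()

if __name__ == "__main__":
    print(build_and_run())
```

Tools like PyInstaller go one step further and bundle the interpreter too -- with exactly the python-specific caveats mentioned above.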
For Python and other interpreters one does not need containers. LD_PRELOAD and similar hacks are sufficient to redirect reading of script files into reading from a file system embedded in the executable.
LD_PRELOAD in combination with LD_LIBRARY_PATH is meant to be used by a programmer debugging and testing his or her own shared object library, not for using software in production.
By "similar hacks" I meant something like embedding a custom loader into a fully static binary that runs the interpreter and loads its libraries from the embedded filesystem while patching libc calls.
Why would you bother with such difficult and fragile hacks when you could instead simply have the kernel itself remap those file-related calls with a mount namespace or chroot?
Why would you involve the kernel at all, since you have access to Python's code itself and can just make your own version of Python that will source files from wherever you want? And your package will be portable to Windows, too.
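That's straightforward to sketch in pure Python with a meta-path importer serving module source from an in-memory dict standing in for the embedded filesystem (the `EMBEDDED` dict and `greeter` module are made up for illustration):

```python
import importlib.abc
import importlib.util
import sys

# Stand-in for a filesystem embedded in the executable's payload.
EMBEDDED = {
    "greeter": "def hello(name):\n    return 'hello, ' + name\n",
}

class EmbeddedFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Serves module source straight from the EMBEDDED dict, so plain
    `import` statements resolve without touching the real filesystem."""

    def find_spec(self, fullname, path=None, target=None):
        if fullname in EMBEDDED:
            return importlib.util.spec_from_loader(fullname, self)
        return None            # defer to the normal import machinery

    def exec_module(self, module):
        exec(EMBEDDED[module.__name__], module.__dict__)

sys.meta_path.insert(0, EmbeddedFinder())

import greeter                  # resolved from EMBEDDED, not from disk
print(greeter.hello("world"))   # hello, world
```

No kernel involvement, and the same approach works anywhere Python runs.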
Exactly. Collectively we solved the same problem again by introducing many more layers of abstraction and complexity. The levels of abstraction between what appears to be happening and what is really happening grows. When your app is down in prod, simple and easy to understand has benefits.
I'm not too excited about embedding the data (rootfs) within the image. I'd rather have a .preinit_array section that applies a seccomp-bpf filter, calls unshare, etc., with the filter and namespace data stored in their own sections (this allows zeroing a section to get the original behaviour).