Sandboxing and workload isolation (fly.io)
158 points by sudhirj 7 days ago | 46 comments

I just cannot shake the feeling that everything containers have promised us, I was promised back in 1991 with pre-emptive multitasking and protected memory.

I also can't help but think we're going to end up with a microkernel with some sort of nested cgroups for user processes. Which is itself going to look a little bit like Erlang...

Containers are the API hack we put in because we couldn't change POSIX. The isolation and packaging were available in '90s-era capability-based OSes (and '70s mainframe OSes), but backwards compatibility is more important than anything else, so we build matryoshkas of OSes inside OSes.

If I'm not mistaken, the current stack for a containerized multi-host environment could be roughly described as:

UNIX-like Linux kernel -> POSIX environment -> Kubernetes API -> Docker/CRI-O runtime -> POSIX environment -> Applications

From the developer's perspective only the last POSIX layer is relevant; all the layers below it are efforts to strong-arm old UNIX paradigms into 21st century compute/storage/network architecture. Could this huge layered stack be replaced with something lighter that is better designed for the current compute/storage/networking paradigm in the cloud/data center and still supports a POSIX compatibility layer for applications? For example:

Harvey OS -> orchestration API -> APEX -> Applications

Plan 9 -> orchestration API -> APE -> Applications

Clients could still rely on the desktop Linux distribution, Android or some other OSes.

Docker promised containers like we knew them from the past, and improved building and packaging procedures.

In the past, we never had an easy way to package various applications (including all needed resources and executables). Each language ecosystem invented its own way of packaging, distributing and installing applications, and this always relied on a pre-existing environment.

With Docker you don't care which versions of the language runtime or dependencies each application uses; you can run them all on the same machine. There are no incompatible dependencies between multiple Docker containers, and everything almost always works.

The same goes for building: image layers made building images a lot easier, and incremental changes are fast. Of course there are downsides: image size can really grow if you are not careful, and you need to regularly rebuild images from the ground up to ensure everything is patched.

Is that a surprise? Containers can mostly be summed up by file system isolation on top of process isolation. I think this is why we're seeing some work done on capability based OSes like Google's Fuchsia.

Sorta, but I think you're missing the value proposition forest for the feature trees.

I'm going to speak to the docker-style containers because that's what I'm most familiar with.

Filesystem, process, and even network isolation facilitate the illusion of an application having a fresh install of an OS to a degree. This includes details like /dev/, a package manager, things sitting in /etc/.

Added to that there's decent tooling for sharing, composing/layering, shipping, baking images. They're declarative (enough-ish) too.

This is nice because that's a pretty decent compatibility layer across a ton of software like databases and webservers that have always run expecting a full host, services written with bare metal in mind, and new things too.

The kernel isolation features are powerful enablers, but don't undervalue how important it was to have things sorted out for you, like ldloader quirks with different glibc versions against the same kernel. Real issue I had.

Source: Too much futzing around with chroots.

> I also can't help but think we're going to end up with a microkernel with some sort of nested cgroups for user processes. Which is itself going to look a little bit like Erlang...

Also sounds like https://redox-os.org, a microkernel that has namespaces, is written in Rust, and is actively under development.

When I was at VMware I was pretty excited about this; it made a lot of sense (and still does):


Azul systems introduces a JVM that uses the VT instructions directly to vastly accelerate GC.

I think some of their custom hardware was also a 'bare metal' VM arrangement for years, but I never actually met anyone who was a customer.

Which is why vendors selling compliant JavaSE for embedded deployment with real time support still exist, despite Android.

In the 90s there was no compute-as-service and the difficulty in sanely running arbitrary yet potentially deliberately adversarial code was less well-understood. The latter is maybe still not all that well-understood.

Was "tomcat" and EJB the 90s version thereof?

Not really, it would fall under the 'language runtime' category in the article. EJB was not really designed with the goal of isolating entirely arbitrary, intentionally adversarial code. That's a problem that's gone from 'near-future interesting' to 'current, important and practical' since the 90s - both on the client (browsers, phones) and the server.

Sure it was, that is what I have done several times with WebSphere.

You ran intentionally adversarial code with WebSphere?

I intentionally use WebSphere's security capabilities to confine EARs and WARs to their own little universe, alongside JEE security capabilities and managers.

Everything that Docker sells was already possible with JEE application containers, just Java only.

To be fair, there were a lot of CVEs because part of the sandbox in Java was capabilities-based, and the Achilles' heel of capabilities is that if you can see it, you can use it, and feature factory work has a bad habit of accidentally making things visible by exposing some parent object in an attempt to be helpful.

I did like the permissions inheritance model quite a bit, but the hybrid ACL/capabilities system was always a bit schizophrenic.
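
To make the "if you can see it, you can use it" point concrete, here is a minimal sketch of how a well-meaning parent reference widens a capability. Every class and file name below is hypothetical, and this is not how the Java sandbox or WebSphere was implemented. Python does not enforce capability discipline either (closures can be introspected), so read it only as an illustration of the object graph.

```python
class Directory:
    """Hypothetical resource that holds several files."""

    def __init__(self, files):
        self._files = dict(files)

    def read(self, name):
        return self._files[name]


class FileCapability:
    """Grants read access to exactly one file in a directory."""

    def __init__(self, directory, name):
        # Only a closure over the single permitted read is kept.
        self._read = lambda: directory.read(name)

    def read(self):
        return self._read()


class HelpfulFileCapability(FileCapability):
    """Same capability, plus a 'convenient' reference to the parent."""

    def __init__(self, directory, name):
        super().__init__(directory, name)
        self.parent = directory  # the accidental exposure


def untrusted_code(cap):
    """Stands in for sandboxed code that was handed one capability."""
    parent = getattr(cap, "parent", None)
    if parent is None:
        return None  # nothing else is reachable through the public surface
    return parent.read("secrets.txt")


home = Directory({"notes.txt": "hello", "secrets.txt": "hunter2"})
tight = FileCapability(home, "notes.txt")
leaky = HelpfulFileCapability(home, "notes.txt")
```

Handing `tight` to `untrusted_code` reaches nothing beyond `notes.txt` through the public interface, while `leaky` gives up the whole directory without any permission check being bypassed.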

Still far fewer than in C-based technology, and besides, most of them were related to Java Applets, a completely different stack from application containers like WebSphere.

No, it wasn't; you couldn't keep code running in a WAR off the network. For a number of years, practically every sitewide pentest I did was gameovered by SSRF off Java application servers. I never even bothered to figure out how tight the isolation between co-resident applications was on those things.

Sure you can, by packing the network sensitive code into another Java library and using process internal bean calls across domains.

Now, if those guys didn't understand the WebSphere capabilities, that is another matter.

The idea behind sandboxed execution environments is to make them resilient to insecure application code. You can't address that problem by stipulating that your application is secure!

I didn't say that either. Rather, WebSphere does provide the tooling to create such sandboxes, provided one goes through the effort of designing the application architecture so that those features are put to use.

It doesn't do magic if the users don't bother to use the tools put at their disposal.

This isn't sandboxing at the click of a button.

See previous comment.

For me we are kind of there already with hypervisors and managed language runtimes.

The irony of having monoliths like Linux running type 1 hypervisors, on Intel CPUs running ME.

Love what the folks at fly.io are doing with edge deployed containers.

Interesting comparison with Cloudflare's isolate approach which they expanded on earlier this week: https://blog.cloudflare.com/mitigating-spectre-and-other-sec...

Seems like there is an edge compute option for any requirement these days!

What's the cutoff for something to be considered "Edge"? It looks like Fly has only about 20 POPs.

The ability to have an auto scaling auto distributive compute mesh, I’d say. Once a company builds that they can add POPs as fast as they can spend (or raise) money.

By that definition, AWS and GCP are both "edges" then. Doesn't sound like a useful definition...

We've deployed nsjail for our file conversion pipeline (i.e. ImageMagick) and it's been great -- very nice configuration language and strong isolation properties, with a manageable performance hit. Definitely easy to write a configuration that would not securely sandbox you, though, which seems like a strong point towards Docker or other more high-level solutions.

I might look at nsjail. I run headless Chrome inside cgroups (for CPU and memory restrictions), under a runtime-generated no-login user (for privilege restrictions), and with a custom group id (for iptables bandwidth and routing restrictions). Plus I monitor the pool of Chromes with a shell script, and if any exceed a resource threshold, I use cpulimit and then pkill to take care of it.

I could just use docker (and I have docker images of this app that some people like to run) but I think this way is more lightweight, gives me more explicit control, and leans into the OS level security features, like privileges and cgroups with a couple of Unix commands.
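
For anyone curious what the threshold check in such a watchdog might look like, here is a rough sketch of just the selection step. The `ProcSample` type, the field names, and the limits are all made up for illustration; the sampling itself and the cpulimit/pkill calls are deliberately left out, and the actual setup described above is a shell script, not Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcSample:
    """One hypothetical resource reading for a browser process."""
    pid: int
    cpu_percent: float
    rss_mb: float


def over_threshold(samples, max_cpu_percent, max_rss_mb):
    """Return the pids whose CPU or memory reading exceeds a limit."""
    return [
        s.pid
        for s in samples
        if s.cpu_percent > max_cpu_percent or s.rss_mb > max_rss_mb
    ]


# Illustrative readings only; a real script would collect these from the OS.
pool = [
    ProcSample(pid=101, cpu_percent=12.0, rss_mb=300.0),
    ProcSample(pid=102, cpu_percent=97.5, rss_mb=280.0),
    ProcSample(pid=103, cpu_percent=5.0, rss_mb=2100.0),
]
offenders = over_threshold(pool, max_cpu_percent=90.0, max_rss_mb=1024.0)
```

The returned pids are the ones a watchdog would then throttle or kill, with the cgroup limits remaining the hard backstop.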

I thought I was the only one running workloads in headless browsers! We should start a club for just us NaN.

OK, come join my club! https://github.com/dosyago send me an email :)

I've worked on unikernels for several years, and I've recently had some insight into low-overhead emulation. Depending on your use case, a tiny emulator may have less overhead than everything else for the construction/destruction part. So, imagine that you have a programmable web server and each request needs to be served quickly. Then reducing the overhead of forking the master machine is everything. Given the cost of turning a few emulated instructions into system calls (which largely model what you already have to check normally), you are going to have lots of room before you hit the point where something heavier makes sense, like a JIT-enabled emulator or rv8, which has millisecond overhead.

I am working on such a thing now and it has a 6-microsecond overhead in a highly concurrent scenario (60k req/s). So, I think there are cases for everything.
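
A back-of-the-envelope way to read those two numbers, assuming the overhead is CPU time paid once per request (that assumption, the function name, and the exact 1 ms comparison figure are not from the measurements above):

```python
def overhead_cores(requests_per_second, overhead_seconds):
    """CPU-seconds of pure setup overhead spent per wall-clock second."""
    return requests_per_second * overhead_seconds


# Figures from the comment: 6 microseconds per request at 60k req/s.
tiny_emulator = overhead_cores(60_000, 6e-6)

# Assumed comparison point: exactly 1 millisecond per request.
millisecond_class = overhead_cores(60_000, 1e-3)
```

Under that assumption the tiny emulator spends roughly a third of one core on setup, while a millisecond-class setup cost at the same request rate would need on the order of sixty cores just for overhead.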

So light-weight virtual machines on a security-hardened code base.

Have we come full circle?

Probably since we're steadily reinventing all the techniques that people used to run stuff on mainframes.

> But you’re not running OpenBSD, so, moving on.

But.. I am running OpenBSD ;(

There are also unikernels, several of which work under Firecracker.

"The first, and really the big problem for the whole virtualization approach, is that you need bare metal servers to efficiently do lightweight virtualization;" <- this isn't really true anymore as unikernels can be deployed to {aws, gcloud, azure} as machine images and so can be as light-weight as your program needs.

We've been playing a lot with OSv. But unikernels aren't strictly sandboxing or isolation.

If you're deploying a unikernel to ec2, you're just using their underlying isolation. You can't safely deploy multiple unikernels on the same ec2 instances without figuring out your own, nested isolation.

Sure - the hypervisor is the isolation vehicle - the same as firecracker.

The only? reason I can see for wanting to deploy N unikernels to one AWS instance is if you want full control on the orchestration of things like multi-tenancy, however, it is very much possible and it is possible on GCloud as well. It should be noted that you are in the same boat as firecracker here. They both use virtualization.

The usual reason to want to do N things with 1 piece of hardware is cost savings.

Anyone else having a hard time even reading this article because of the lack of contrast from text to background?

Did you try Reader View[1]?

Actually it's rather ugly for this specific page for some reason, with weird boxes around every paragraph.

But at least the contrast isn't an issue.

[1] https://support.mozilla.org/en-US/kb/firefox-reader-view-clu...

No, but I am not sure why you got downvoted for asking.

It's against the guidelines to do so.

uhm, nope
