
In terms of security/isolation, processes, users, containers and virtualization are all essentially the same thing. I wish the people working on these things would step back and see the forest for the trees.

Whatever the ultimate "isolation unit" ends up being, it needs to be recursive. That means being able to run processes within your process (essentially as libraries), create users within your user account (for true first-class multi-tenancy), or VMs within your VM (without compounding overhead).
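To make "recursive" concrete, here is a toy model in Python. It is only a sketch of the idea, not a real isolation mechanism: the `Domain` class, its `spawn` method and the permission names are all hypothetical, and nothing here is enforced by the OS or hardware. The point is just that every level uses the same primitive, and a child can never hold more than its parent.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Domain:
    """A hypothetical isolation unit holding a set of named permissions."""

    name: str
    permissions: frozenset[str]
    children: list[Domain] = field(default_factory=list)

    def spawn(self, name: str, permissions: set[str]) -> Domain:
        """Create a nested domain; it may only receive what the parent holds."""
        requested = frozenset(permissions)
        if not requested <= self.permissions:
            extra = sorted(requested - self.permissions)
            raise PermissionError(f"{self.name} cannot grant {extra}")
        child = Domain(f"{self.name}/{name}", requested)
        self.children.append(child)
        return child

    def allows(self, permission: str) -> bool:
        """Check whether this domain holds the given permission."""
        return permission in self.permissions


# The same primitive plays the role of VM, user, and in-process library.
root = Domain("host", frozenset({"net", "disk", "gpu"}))
vm = root.spawn("vm", {"net", "disk"})
user = vm.spawn("alice", {"disk"})
library = user.spawn("png-decoder", set())
```

Here `user.spawn("leaky", {"net"})` raises `PermissionError`, because `alice` was never given `net` in the first place.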

It turns out that this author also wrote "Docker: Not Even a Linker"[1] which was also deeply insightful about unconscious/accidental architecture decisions. I'm impressed by his insight and disturbed that most people don't seem to understand it.

[1] https://news.ycombinator.com/item?id=9809912

Let me take that one step further:

>processes, users, containers and virtualization are all essentially the same thing.

...and so are modules/objects/whatever your language of choice calls them. Abstraction boundaries, to be precise. Abstraction, security, and type-safety, are all very closely related.

These language-specific mechanisms for isolation are recursive - trivially so. And language runtimes and compilers make security cheap - so cheap that it's ubiquitous.

Processes, users, containers and virtualization all rely on an operating system for security, which in turn relies on hardware features. Specifically, virtual memory and privileged instructions. And those hardware features are slow, and more importantly: they're not recursive!

But hardware-based isolation does have one key advantage over language-based isolation: It works for arbitrary languages, and indeed, arbitrary code.

I completely agree that recursive isolation is necessary. We need to figure out rich enough hardware primitives and get them implemented; or we need to migrate everything to a single language runtime, like the JVM.

Great point. The JVM tried for this position and failed IMHO (I think it abstracted too much). Now the browser is slowly homing in on it, and it might succeed (mostly due to sheer inertia). As opposed to the JVM, I like to call the ultimate goal the "C Virtual Machine" (just process isolation++).

I think moving isolation out of hardware is really important (both to make it recursive and portable). NaCl is an interesting step in that direction. If you could use something like it to protect kernelspace (instead of ring 0), syscalls could be much, much faster.

There's another problem with language-based isolation: it makes your language/compiler/runtime security-critical. Conversely, NaCl has a tiny, formally proven verifier that works regardless of how the code was actually generated, which seems like a much saner approach.

I'll also say that I don't think it's reasonable to expect every object/module/whatever within a complex program to be fully isolated (in mainstream languages at least). There's no need for it, and it will have too much overhead (in a world where objects in many languages already have too much overhead). Better to start relatively coarse-grained (today the state of the art is basically QubesOS), and gradually improve.

"NaCl is an interesting step in that direction. If you could use something like it to protect kernelspace (instead of ring 0), syscalls could be much, much faster."

It's actually partly inspired by how old security kernels work mixed with SFI. The first secure kernels used a combination of rings, segments, tiny stuff in kernel space, limited manipulation of pointers, and a ton of verification. Here are the original ones:


A Burroughs guy who worked with Schell et al on GEMSOS and other projects was the Intel guy who added the hardware isolation mechanisms. They were originally uninterested in that. Imagine the world if we were stuck on legacy code doing the tricks no isolation allows. Glad it didn't happen. :)

Eventually, that crowd went with separation kernels to run VM's and such that market was demanding. They run security-critical components directly on the tiny kernel.


The SFI people continued doing their thing. The brighter ones realized it wasn't working. They started trying to make compiler or hardware assisted safety checking cost less with clever designs. One, like NaCl and older kernels, used segments to augment SFI. Others started looking at data flow more. So, here's some good work from that crowd:




So, have fun with those. :)


...this is extremely relevant to what you're saying. A talk worth watching.

I don't think I would call JEE and Spring a failure.

They mark the turning point I stopped worrying about UNIX deployments.

An application server has all the features I care about from a container, including fine grain control over which apis are accessible to the hosted applications.

It doesn't matter if the application server is running on the OS, a hypervisor, a container or even bare metal.

Back in 2011 we were already using AWS Beanstalk for production deployments.

Also OS/400 is like that, user space is bytecode based. For writing kernel space native code, or privileged binaries you need the appropriately called Metal C compiler.

It's been done:



The latter runs FreeBSD on an FPGA. There are others that use crypto to do similar things with sealed RAM. What we need isn't tech to be invented so much as solutions to be funded and/or bought. Neither most suppliers nor the demand side want to make the sacrifices necessary to get the ball rolling. Most prototypes are done at a loss by CompSci people. There are a few niche companies doing it commercially. High-assurance security w/ hardware additions is popular in the smartcard market, for example. Rockwell-Collins does it in defense with the AAMP7G processor, but you can bet the order volume is really low. Price goes up as a result.

"It's been done" and "What we need isn't tech to be invented so much as solutions to be funded and/or bought." are somewhat dismissive overstatements.

If we had a way to implement a capability-secure runtime on Linux on Intel CPUs, in a way that improved performance rather than making it worse, and was straightforwardly usable with existing programs, then we could make a ton of progress and easily be commercially successful.

But that is technologically difficult. Maybe we can still figure it out, though. Or maybe we just need to figure out the right trade-offs to get something viable out the door and widely used.

"in a way that improved performance rather than making it worse"

This part is not realistic if it's apples to apples. Performance will always drop because the high performance came specifically by doing unsafe things that enable attackers. Adding checks or restrictions slows it down. The question is whether it can be done without slowing things down too much. I'm hopeful about this given many people are comfortably using the net on 8-year-old PCs w/ Core 2 Duos that still run pretty well. That was 65nm tech. Matching it or at least a smartphone is what I'd aim at for first run.

>Performance will always drop because the high performance came specifically by doing unsafe things that enable attackers.

No, this is not true. Strong static typing and other compile-time information can (at least theoretically) allow improving both security and performance.

For example, look at single address space operating systems. Switching between security domains is cheaper when you don't have to switch address spaces. That's an improvement to security that allows increased efficiency.

We were talking about C code. That doesn't have strong, static typing with built-in safety. C developers also rejected things like Cyclone and Clay. So, my comment assumes a C-based platform (esp OS).

Intel had a wonderful processor for that, iAPX 432, and they botched it.

Every time I read about it I wonder how things would have turned out, if they managed to do a proper job with the CPU.

Don't forget i960 which had some of 432's fundamental protections with decent performance. It was used later in high-end embedded market.


This. We know how to build secure systems, but VCs won't fund the effort unless there's profit to be made and existing software will run on it. That second criterion especially means you're going to have to dumb down either your security or your performance to the point that there's no real value added.

I've come to the conclusion that the only realistic approach to security is to begin a slow, methodical replacement of every line of C code in the Linux kernel with Rust. No VC is going to pay for that, so some other funding mechanism will be required.

Not only Rust, but every native compiled memory safe language is a good alternative to replace user space applications, especially if they aren't dependent on the usual C memory pointer optimization tricks for their use case.

Hence my schizophrenic posts regarding Go: although I dislike some of the design decisions, every userspace application written in Go is one application less written in C. And if bare metal approaches like GERT[0] take off, even better.

[0] - https://github.com/ycoroneos/G.E.R.T

I think modern hardware is "recursive" in the relevant sense.

In an ethereal sense, every Turing-complete machine is recursive, because by definition you can implement another Turing-complete VM on it. CPUs with kernel/user separation took it further by allowing vanilla instructions to run on bare metal within a virtual world defined by the kernel. Modern CPU virtualisation features extend this to the control instructions that an OS would use.

What else can you ask for?

I actually don't know much about nested virtualization. Can it be done to arbitrary depths with modern hardware?

One issue with being "recursive" is efficiency. Ideally, running 500 levels deep would be just as fast as running at the top level. In language-based systems this can be achieved with inlining and optimization, but it's difficult in hardware, where there is much less semantic information available about the running code.

In terms of missing the forest for the trees, I agree, but perhaps we are looking at different forests.

If port 22 is not privileged, then what is to prevent my daemon from listening on that port and collecting all the credentials of other users trying to log into the machine? Nothing. This is why users don't get to bind to privileged ports -- it's why privileged ports exist. The workaround is for every user to get their own ssh daemon under their control, and to request that it handle their own login by specifying their own virtual network address: alice@alice.com and bob@bob.com -- instead of the current solution of using shared hosts with system services: alice@sharedhost and bob@sharedhost.
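For anyone who wants to see the restriction in action, a small probe in Python follows. It is a sketch, and `try_bind` is a name made up for illustration. It assumes a Unix-like system where ports below 1024 are reserved; on Linux the outcome also depends on things like `CAP_NET_BIND_SERVICE` and the `net.ipv4.ip_unprivileged_port_start` sysctl, so the result on your machine may differ.

```python
import socket


def try_bind(port: int, host: str = "127.0.0.1") -> str:
    """Report whether this process may listen on host:port.

    Returns "ok", "denied" (not permitted), or "unavailable"
    (some other error, e.g. the port is already taken).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except PermissionError:
            return "denied"       # typically a privileged port and no privilege
        except OSError:
            return "unavailable"  # e.g. the real daemon already owns the port
        return "ok"


if __name__ == "__main__":
    for port in (22, 80, 8022):
        print(port, try_bind(port))
```

Run as a regular user, the low ports will usually come back as "denied" or "unavailable", while a high port like 8022 is normally free for anyone to grab, which is exactly the credential-stealing scenario described above.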

What you cannot do is have system services (a shared host) with user control over daemons that fulfill those services. It has to be system control over shared services and user control over user services.

But every user having their own ssh daemon and their own hostname/IP is certainly looking a lot like the virtualization/containerization solution, no? The opposite of the virtualization option is not "get rid of privileged ports", but "have privileged ports" -- e.g. have resources controlled by the system and not any particular user.

The real complaint here is that using custom port numbers is unwieldy and we need more robust mappings from custom domains to shared domains with custom ports. For example, make it easier for users to set up their own virtual hostnames to map to a shared host with a custom port. Getting rid of privileged ports doesn't solve this problem at all.

For the same reason, users can't bind to port 80, because then my webserver could steal credentials to your site, as both our sites use the same common webserver. So either none of us controls the webserver or we each have our webserver, and with our webserver we'll need our own copies of other system libs, which again puts us back on the containerization path.

Again, the choice is of using system libs versus duplicating and then isolating user libs.

It seems to me that this is the fundamental trade off, and focusing on privileged ports as a problem, when they are one side of a fundamental trade off, is not really insightful at all.

Those are all accidental limitations from the architecture as it exists, not as it could exist. There is no fundamental reason why the IP address is a combination of network and host address. There is no fundamental reason why a host is presumed to have only one IP address. There is no fundamental reason why alice@alice.com and bob@bob.com can’t be the same daemon listening on different IP addresses, but anybody connecting to the alice.com interface gets a different certificate and no access to the bob.com resources.
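A rough Python sketch of the "same daemon, different identity per address" idea: one listener, and the tenant is chosen from the local address the client actually dialed. The `TENANTS` table and `serve_once` are hypothetical names, the loopback addresses stand in for the alice.com and bob.com interfaces, and the certificate selection a real daemon would do at this point is left out.

```python
import socket

# Hypothetical mapping from the local address a client dialed to a tenant.
TENANTS = {"127.0.0.1": "alice", "127.0.0.2": "bob"}


def serve_once(listener: socket.socket) -> str:
    """Accept one connection and greet it as the tenant owning the local IP."""
    conn, _peer = listener.accept()
    with conn:
        # getsockname() on the accepted socket is the address the client
        # connected to, even when the listener is bound to the wildcard.
        local_ip = conn.getsockname()[0]
        tenant = TENANTS.get(local_ip, "nobody")
        conn.sendall(f"hello from {tenant}\n".encode())
    return tenant
```

With a listener bound to `0.0.0.0`, a client connecting to 127.0.0.1 is greeted as alice. Anything not in the table gets "nobody", which is where a real daemon would refuse access to the other tenant's resources.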

I think the systemd socket activation with declarative configuration files, and the Serverless cloud computing fad, are hints of how it is possible to control exactly what program is running, and even have some custom code, without having to duplicate and maintain all the binaries. Too bad they’re doing it on Linux, so they still have all those accidental limitations.
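For reference, the receiving side of socket activation is tiny. The supervisor binds the (possibly privileged) port itself and hands the already-open descriptor to the service, announcing it via the `LISTEN_PID` and `LISTEN_FDS` environment variables, with descriptors starting at 3. Below is a minimal Python sketch of that handshake; `listen_fds` is a name chosen for illustration, and the optional `LISTEN_FDNAMES` variable is ignored.

```python
from __future__ import annotations

import os

SD_LISTEN_FDS_START = 3  # first inherited descriptor in the systemd protocol


def listen_fds(environ=os.environ, pid: int | None = None) -> list[int]:
    """Return descriptor numbers passed by a socket-activating supervisor."""
    pid = os.getpid() if pid is None else pid
    try:
        if int(environ.get("LISTEN_PID", "")) != pid:
            return []  # the variables were meant for another process
        count = int(environ.get("LISTEN_FDS", ""))
    except ValueError:
        return []  # variables missing or malformed
    return list(range(SD_LISTEN_FDS_START, SD_LISTEN_FDS_START + max(count, 0)))
```

Each returned number can then be wrapped with `socket.socket(fileno=fd)` and used like any listening socket. The service never needs the privilege to bind the port, only the descriptor it was given.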

Hosts are not presumed to have only one IP address. This is a mistake that people make. (They often made it with djbdns, hence https://cr.yp.to/djbdns/ifconfig.html .) But it has never actually been the case. Indeed, one can find discussions of hosts that have multiple IP addresses in RFC 1122 and discussion that IP addresses do not have a 1:1 correspondence with network interfaces in RFC 791.

Details. In practice, few applications do interesting things when binding to multiple IP addresses. It’s like a special case of single IP address.

Perhaps I should rephrase: there is no fundamental reason why IP addresses are associated with hosts rather than users, or even services. There is no fundamental reason why you need to be root to listen to privileged ports, which includes many of the most useful ports.

There's implications for development whenever we have a sandbox wall, too. There are serious differences in the development experience of a binary language with no runtime model, a binary language with one, a language wrapped in a high level, fully sandboxed VM, and a language compiled to another language. Some of them allow productivity, others restrict entire application categories.

Once you have a runtime, dynamic linking suffers. Once you have a VM, you lose vast swathes of control over I/O and resources. And once you target a different language you end up with lowest-common-denominator semantics and debugging capabilities.

In some respects, the JVM-style sandboxed language runtime is an "original mistake" because it's an OS-like environment that doesn't have access to OS-like features, leading to a lot of friction at the edges and internal bloating as more and more features are requested. If we had similar access to memory, devices etc. everywhere the friction wouldn't be experienced as such, even if there were protections enforced that hurt performance in certain cases. You'd design to a protocol, and either the device and OS would support it or it wouldn't. That's how the Internet itself managed to scale.

But as it is, the stuff we have to work with in practical systems continues to assert that certain line of coder machismo: unsafe stuff will always be unsafe and You Should Know What You Are Doing, and anyone who wants safety is a Newbie Who Should Trust A Runtime.

Apparently C developers still don't know what they are doing, after 50 years.

"12. Trust the programmer, as a goal, is outdated in respect to the security and safety programming communities. While it should not be totally disregarded as a facet of the spirit of C, the C11 version of the C Standard should take into account that programmers need the ability to check their work."


"In terms of security/isolation, processes, users, containers and virtualization are all essentially the same thing."

This is not true. They each are defined by different security boundaries, have vastly differing properties of isolation and communication, contain different data, and are contained by different components.

From a security perspective they are not the same, though from a functional perspective they each solve similar problems.

I feel like you're speaking past each other. I think you're describing what they happen to be empirically, whereas the parent is describing what they are fundamentally (isolation mechanisms). I'm pretty sure the parent realizes you can't substitute "user" for "container" in the same sentence, and that each one has different use cases and implications... but what is being claimed is a little more abstract than that.

That may be, though I do not think that is the case as the parent made a very narrow qualified statement.

The parent made a specific statement about the security properties of various isolation mechanisms, equating them all from a security perspective.

Can you give a specific example of something that e.g. processes absolutely must support which users absolutely cannot?

Consider that in Linux, processes and threads are implemented via the same abstraction (tasks). This abstraction actually leaks in some unfortunate cases, but it's generally considered "good enough."

The abstraction may be good enough functionally. My comment was a statement about security, not functionality.

In the case you mention, your choice of abstraction may affect your threat model, depending on if there is shared state and what data may require isolation.

I'm assuming that the underlying isolation mechanism is formally proven (or at least as good as possible). With a single set of reasonable features, it should be able to provide isolation between processes, users, containers and VMs. What am I missing?

For general purpose operating systems formal verification of security mechanisms should not always be assumed.

I was not talking about ideal security but that certain pre-existing mechanisms do not have equivalent security postures, as the parent had mentioned. The point isn't that with enough work isolation can be achieved but that that work has not in fact been done and the various mechanisms are distinct and their security values should not be conflated.

Fuchsia is looking like a neat contender for tackling this problem. I've read through most of their design docs, and if I've understood them correctly, it should allow for fully recursive processes. The sandboxing [0] and namespaces [1] design docs are a good starting point.

As an example of this in action, the test runner [2] creates a new application environment for each test.

[0] https://fuchsia.googlesource.com/docs/+/HEAD/sandboxing.md

[1] https://fuchsia.googlesource.com/docs/+/HEAD/namespaces.md

[2] https://fuchsia.googlesource.com/test_runner/

> Whatever the ultimate "isolation unit" ends up being, it needs to be recursive.

Thank you. I've always thought about this and figured I must be crazy since nobody else seems to care about it.
