
> I know for sure that both MirageOS and Erlang-on-Xen (unikernels) can be easily deployed as AWS AMIs.

I’m not familiar with this particular deployment process, so I don’t want to speak out of line... but this being the internet, why not.

I’m a bit skeptical that Amazon would let you run anything in pure kernel mode; it’s more likely a VM/sandbox wrapping an OS that’s operating in kernel mode, which would negate much of the performance benefit.

Second, you mentioned two specific images, and I’ll assume they work fine on AWS, but they are just two, and if you’re working with them, you probably have very specific needs, not suitable for general development.

Third, who knows if these images will work on other cloud providers. Once you get your kernel-mode app working on one, you’re locked in.

Fourth, what are you doing that requires this level of local-machine performance operating in the cloud? It’s probably almost always better to invest your optimization time/dollars elsewhere.

Fifth, if this was a good / easy idea, many people would be doing it, but they aren’t. Either you’ve stumbled upon some secret enlightened approach, or you’re probably wrong.




AWS has different instance types. Some run on Nitro, which is very close to bare metal, and they have so-called "metal" instance types, which involve no virtualization between you and the hardware, as the name implies.

And it's quite possible to create new images for these instances to suit your needs.

You're right about the complexity and lock-in. Using such services is often a trade-off of dev complexity vs. speed. If you're following a well-defined process that doesn't change often and that's going to peg the CPU of a 72-core machine for days, it might be worth it to eke out every speed improvement you can. If, on the other hand, you're constantly iterating and updating your code, and responding to a handful of user requests that barely cause a blip on a single-core VM, CPU speed is probably not your primary concern.

People are doing it; I'm one of them. It's a niche, though, and most assuredly not for every case.


I think you're misunderstanding: I'm talking about instances that are operating in kernel mode in a VM (which is ring 0, just with the MMU and IOMMU pre-configured by the dom0 to not let the domU have complete control over memory or peripherals.) For most IaaS providers, VM instances are all they'll let you run anyway†. Normally, people are running userland processes on these VMs. That's two context switches for every system call: domU user process → syscall to domU kernel → hypercall to dom0 kernel. And it's two address space mappings you have to go through, making a mess of the cache-coherence of the real host-hardware TLB.
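
(To make that per-syscall cost concrete, here's a rough userland micro-benchmark in C. It's a sketch, not a rigorous benchmark; getpid is just a convenient near-no-op syscall, and the numbers will vary with the machine and with how deep the virtualization stack under you goes.)

    /* Time the user->kernel round trip of a cheap syscall. Using
     * syscall(SYS_getpid) rather than getpid() sidesteps any libc
     * caching or vDSO shortcut, so every iteration is a real context
     * switch. Inside a domU, each of these also pays the second hop
     * (the hypercall to dom0) described above. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void) {
        const long iters = 1000000;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            syscall(SYS_getpid);   /* forced user->kernel->user transition */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                  + (t1.tv_nsec - t0.tv_nsec);
        printf("~%.1f ns per syscall round trip\n", ns / iters);
        return 0;
    }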

Writing your code to be run as the kernel of the VM, on the other hand, reduces this to one context switch and one page translation, as your application is just making hypercalls directly and directly using "physical memory" (≣ dom0 virtual memory.)
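
(For flavor, here's a minimal C sketch of the "application as kernel" idea for a Xen PV guest. To be clear about the assumptions: it presumes Xen's public headers and a mini-OS-style HYPERVISOR_console_io hypercall stub, and it omits all the boot and page-table scaffolding that the real unikernel frameworks generate for you. It is not a bootable image on its own.)

    /* Sketch only: a guest "kernel" whose entire job is one hypercall.
     * Assumes the Xen public headers (for CONSOLEIO_write) and a
     * hypercall stub like the one mini-os provides. */
    #include <xen/xen.h>   /* CONSOLEIO_write */

    extern int HYPERVISOR_console_io(int cmd, int count, char *buf);

    /* Entry point reached from the PV boot path: no libc, no userland.
     * "System calls" here are hypercalls straight to the hypervisor --
     * the single context switch described above. */
    void start_kernel(void) {
        char msg[] = "hello from guest ring 0\n";
        HYPERVISOR_console_io(CONSOLEIO_write, sizeof(msg) - 1, msg);
        for (;;)
            ;   /* a real unikernel would enter its event loop here */
    }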

Think of it this way: from the hypervisor's perspective, its VMs are a lot like processes. A hypervisor offers its VMs all the same benefits of stability and isolation that an OS offers processes. In fact, the only reason they aren't just regular OS processes (containers, essentially), is that IaaS compute has been set up with the expectation that users will want to run complete boot-images of existing OSes as their "process", and so a process ABI (the hypercall ABI) is exposed that makes this work.

But, if you are already getting the stability+isolation benefits just from how the IaaS compute provider's hypervisor is managing your VM-as-workload—then why would you add any more layers? You've already got the right abstraction! A kernel written against a hypercall interface is effectively equivalent to a userland process of the hypervisor, just one written against a strange syscall ABI (the hypercall ABI.)

(And, of course, it's not like you can choose to run directly as a host OS userland process instead. IaaS compute providers don't bother to provide such a service, for several reasons‡.)

> Third, who knows if these images will work on other cloud providers.

Hypercall ABIs are part of the "target architecture" of a compiler. You don't have to take one into account in your source code; compilers handle this for you. You just tell clang or ocamlopt or rustc or whatever else that you're targeting "the Xen hypercall ABI", or "the ESXi ABI", and it spits out a binary that'll run on that type of hypervisor.

(Admittedly, it's a bit obtuse to figure out which hypervisor a given cloud provider is using for a given instance-type; they don't tend to put this in their marketing materials. But it's pretty common knowledge floating around the internet, and there are only four-or-so major hypervisors everyone uses anyway.)

> Fifth, if this was a good / easy idea, many people would be doing it, but they aren’t.

I'm from a vertical where this is common (HFT.) I'm just here trying to educate you.

---

† there are in fact "bare-metal clouds", which do let you deploy code directly on ring 0 of the host CPU, with the same "rent by the second" model of regular IaaS compute. (They accomplish this by relying on the server's BMC—ring -1!—to provide IaaS lifecycle functions like wiping/deploying images to boot disks.) It's on these providers where a Linux-kernel-based (or other FOSS-kernel-based) unikernel approach would shine, actually, as you would need specialized drivers for this hardware that Linux has and the "unikernel frameworks" don't. See http://rumpkernel.org/ for a solution targeting exactly this use-case, using NetBSD's kernel.

‡ Okay, this is a white lie. Up until recently none of the big IaaS providers wanted to provide such a service, because they didn't trust container-based virtualization technology to provide enough isolation. Google built gVisor to increase that isolation, though, and so you can run "ring-3 process on shared direct-metal host" workloads on their App Engine, Cloud Functions, and Cloud Run services. But even then, gVisor—despite avoiding ring-0 context switches—still has a lot of overhead from the user's perspective, almost equivalent to that of a ring-0 application in a VM. The only benefits come from lowered per-workload book-keeping overhead on the host side, meaning Google can overprovision more workloads per host, meaning that "vCPU hours" are cheaper on these services.


> I'm talking about instances that are operating in kernel mode in a VM.... Normally, people are running userland processes on these VMs. That's two context switches for every system call... Writing your code to be run as the kernel of the VM, on the other hand, reduces this to one context switch and one page translation

Thanks for the clarification, this does indeed make sense. If your app is already sandboxed by the VM, introducing a second kernel/userland sandbox within the existing sandbox doesn't make as much sense.

That said, I think there are better ways to fix this issue than putting all of your code into a VM's kernel space. For instance, imagine a hypervisor that could lock down and "trust" the code running in a VM's kernel space, and could thus put the VM's kernel space into the same address space as the hypervisor. This could also potentially reduce the two memory translations down to one.

Another solution is to rely more on specialized virtualization hardware that could conceivably do the two memory translations (VM user -> VM kernel -> hypervisor) as fast as a single translation.

The main reason these alternative approaches may be desirable is that asking developers to move their programs from userland to the kernel is a big ask. There's a lot of configuration that needs to be done, and few general software developers have experience working within unprotected kernel space. Simple bugs that would normally just crash a single process could bring down the entire VM, and could potentially affect other VMs on a network (for example, imagine a bug that accidentally overwrote a network driver's memory).
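
(That blast-radius point is easy to demonstrate from the userland side. A contrived C sketch, with the bad pointer standing in for, say, an off-by-one in buffer handling:)

    /* In userland the MMU turns this wild write into a SIGSEGV that
     * kills only this one process. The same store executed in the VM's
     * kernel, where that page might belong to the network driver, can
     * silently corrupt state instead of faulting cleanly. */
    #include <stdio.h>

    int main(void) {
        volatile int *wild = (int *)0xdeadbeef;  /* stand-in for a pointer bug */
        puts("about to scribble on an unmapped address...");
        *wild = 42;   /* userland: clean SIGSEGV; guest ring 0: maybe not */
        puts("never reached");
        return 0;
    }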

I'm sure there are performance gains to be had here, but they may be insignificant. Projects like these are cool, but they raise big red flags of potential over-optimization and premature optimization.



