I admittedly haven't studied the whole unikernel space yet, but intuitively they do seem unfit for production unless we spend a decade rebuilding tooling (debuggers, process diagnostics tools, etc.). And even then, other downsides apply, as laid out in the Joyent article.
Happy to change my mind over time if it proves to be the other way around, but for now I'm very skeptical.
So work needs to be done to make unikernels profileable & debuggable. I wouldn't claim that this is impossible.
System calls might be additional overhead when there is a hypervisor, but hypervisors are unnecessary when we have containers. You stand to eliminate much more overhead by eliminating the hypervisor than you do by eliminating the syscalls. Some of that overhead is internal fragmentation from memory partitioning, duplication of driver code, potential double caching, etcetera.
The industry is in the early stages of a transition from hardware virtualization to containers because containers are a better abstraction than hardware virtualization. Joyent offers Illumos zones, Swisscom offers Docker containers with Flocker (full disclosure: my employer is the author of Flocker), Microsoft has deployed Drawbridge on their Azure cloud, etcetera. We will only see more of this in the future.
Once the transition is complete, I see no advantage to unikernels. You could use them in UNIX binary mode, but that makes them little more than a standard process on a traditional system. That is a very different role than the one that their creators intended for them.
Can an application unikernel on a hypervisor (which is really a lightweight OS that nowadays supports many pass-through features) beat the performance of an application on a regular OS? (I didn't mention containers, since they should ultimately be irrelevant to hot path performance.) So can it? With a lot of work, I bet it can.
So who has the better price/performance? That's going to depend on how much engineering work it is to adopt, fix, and use unikernels when they are competing with an established ecosystem around Linux and containers. And that may be where unikernels actually lose on price/performance, where price includes total cost of ownership. We'll see!
> You stand to eliminate much more overhead by eliminating the hypervisor than you do by eliminating the syscalls.
So my hypervisor application talks directly to devices, thanks to pass-through. What's that about syscalls again?
No double caching: most of our applications have an in-memory working set, and no disk state. Some do (eg, Cassandra databases).
> from a lack of a global page replacement algorithm
Oh, so it's more efficient to be running where paging (aka swapping) is allowed? Sure, for memory footprint, but for runtime performance you're banking on it reaching a state where paging is minimal. The amount of memory saved depends on the working set - maybe a lot, maybe a little. One downside is you're paying a small CPU tax to manage this (maintaining kswapd lists, and scanning them).
I think this would sometimes be a benefit, and sometimes not. And if not, is there anything stopping a Unikernel -- which must manage its own memory anyway -- from implementing its own pager?
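Nothing, in principle. As a toy illustration (my own sketch in Python, not code from any real unikernel), the classic second-chance/clock algorithm that such a self-managed pager could use fits in a few dozen lines:

```python
class ClockPager:
    """Toy second-chance (clock) page replacement - the kind of pager a
    unikernel that already manages its own memory could implement itself."""

    def __init__(self, nframes):
        self.nframes = nframes
        self.frames = []        # resident page numbers
        self.ref = {}           # page -> reference bit
        self.hand = 0           # clock hand: index into self.frames
        self.faults = 0

    def access(self, page):
        if page in self.ref:
            self.ref[page] = 1  # hit: give the page a second chance
            return
        self.faults += 1
        if len(self.frames) < self.nframes:
            self.frames.append(page)
            self.ref[page] = 0
            return
        # Evict: sweep the hand, clearing reference bits, until we find
        # a page whose bit is already clear - that one is the victim.
        while self.ref[self.frames[self.hand]]:
            self.ref[self.frames[self.hand]] = 0
            self.hand = (self.hand + 1) % self.nframes
        victim = self.frames[self.hand]
        del self.ref[victim]
        self.frames[self.hand] = page
        self.ref[page] = 0
        self.hand = (self.hand + 1) % self.nframes


pager = ClockPager(3)
for p in [1, 2, 3, 1, 4]:
    pager.access(p)
# Page 1 was re-referenced, so the sweep skips it and evicts page 2.
```

The real work, of course, is not the algorithm but wiring it into the page tables and the backing store - but that's engineering, not an impossibility.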
The inefficiency isn't technical resources, but human resources: having Unikernel engineers reinvent what modern kernels already do.
> You also have internal fragmentation from memory partitioning, which prevents you from running as many applications and/or reduces memory available for cache.
Again, usually no file system cache in use. And most apps are started with a fixed heap size that consumes all of memory. There's no left-over/wasted memory that could be used by other apps.
If you want to page out cold memory to make room, uh, sure, but see previous comment. I bet that sometimes works, and sometimes doesn't.
I see no disadvantage for unikernel on hypervisor setups versus applications on a container host setups in applications where there is no disk state. However, I see no advantage either. The techniques used to talk to hardware directly work in userland too. netmap is a fantastic example of this.
I had expected unikernels on hypervisors to have a disadvantage against a container on a traditional kernel, but after reading your remarks, I think that the two ought to perform identically (at least where there is no file system IO), with neither being theoretically better. However, the world is adopting containers in traditional kernels and unless a unikernel on a hypervisor can be better, I do not see much value in devoting resources to unikernels too.
> Oh, so it's more efficient to be running where paging (aka swapping) is allowed? Sure, for memory footprint, but for runtime performance you're banking on it reaching a state where paging is minimal. The amount of memory saved depends on the working set - maybe a lot, maybe a little. One downside is you're paying a small CPU tax to manage this (maintaining kswapd lists, and scanning them).
I was referencing cache efficiency when I talked about page replacement algorithms rather than paging to disk. Imagine a global ARC algorithm in a traditional system versus each unikernel having its own. The global hit rate would be better with a global algorithm than it would be with a local algorithm in each unikernel.
Even if your application does its own cache, the principle of a global algorithm being best ought to apply to filesystem metadata.
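To make the intuition concrete, here is a toy Python simulation of my own construction (LRU standing in for ARC, with a synthetic two-tenant workload): one tenant has a large, skewed working set, the other a tiny one, and a shared cache beats a fixed 50/50 partition of the same total size because the slack behind the small tenant gets reassigned:

```python
import random
from collections import OrderedDict

class LRU:
    """Minimal LRU cache that counts hits (a stand-in for ARC)."""
    def __init__(self, capacity):
        self.cap, self.data, self.hits = capacity, OrderedDict(), 0

    def access(self, key):
        if key in self.data:
            self.data.move_to_end(key)
            self.hits += 1
        else:
            self.data[key] = True
            if len(self.data) > self.cap:
                self.data.popitem(last=False)  # evict least recently used

random.seed(0)
trace = []
for _ in range(20000):
    if random.random() < 0.5:
        # Tenant A: 100 keys, skewed toward the low ones (min of two draws).
        trace.append(("A", min(random.randrange(100), random.randrange(100))))
    else:
        # Tenant B: tiny 4-key working set.
        trace.append(("B", random.randrange(4)))

# Partitioned: each unikernel gets a fixed half of 40 total slots.
part_a, part_b = LRU(20), LRU(20)
# Global: one shared cache of the same total size.
shared = LRU(40)
for tenant, key in trace:
    (part_a if tenant == "A" else part_b).access((tenant, key))
    shared.access((tenant, key))

part_hits = part_a.hits + part_b.hits
# Tenant B only ever needs 4 slots, so the shared cache effectively
# gives tenant A ~36 slots instead of 20 - more total hits.
```

This is deliberately simplistic (real ARC adapts, and real workloads shift), but it shows the basic efficiency argument for a global replacement algorithm.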
> Again, usually no file system cache in use. And most apps are started with a fixed heap size that consumes all of memory. There's no left-over/wasted memory that could be used by other apps.
This is not the sort of application that I had in mind. I am still skeptical that unikernels are better, but I agree that they are not worse here. In this case, it seems to me that they are (theoretically) just a different way of doing things and are not better or worse.
Unikernels allow more experimentation. The interface that the hypervisor provides is generally lower level (especially with hardware passthrough) than the traditional operating system's interface.
Unikernel 'programs' would normally use a library as an abstraction layer to bridge the gap. These libraries are easier to swap and change and experiment with than traditional OSs. (At least that was the whole justification for exokernels in the 90s, the approach our current hypervisors grew out of.)
By the way, I am a fan of rump kernels, which also offer the ability to do experimentation. Rumprun is apparently a unikernel design, while rump kernels are building blocks. Rump kernels need not be used in unikernels. They can be used wherever you want, with unikernels being one place they can go.
LPARs/LDOMs are a much more secure abstraction for "sharing resources among potentially hostile customers". Those physically partition at the hardware level. LPARs are used on IBM mainframes and are "EAL5 Certified". LDOMs are the SPARC equivalent, but I do not know their EAL. Both traditional kernels and various hypervisors are EAL4 (some are called EAL4+), which is not as secure.
In a unikernel setup abstractions can live much more comfortably in libraries.
CC evaluations without a protection profile are worthless in the eyes of most governments and CC schemes, but kudos to IBM product management and marketing for creating competitive FUD.
As of a year ago, LDOMs (Oracle VM for SPARC) hadn't had a CC evaluation, and I'm not seeing anything currently in evaluation. Solaris Zones have been evaluated under the Solaris OSPP EAL4 + extensions evaluation.
The biggest reason that virtualization technologies haven't had a CC evaluation with a protection profile is that no US NIAP approved protection profile existed and the draft ones that were circulated were crap.
Assurance levels (EAL) are deprecated in the newest NIAP protection profiles, as the higher assurance levels (EAL4) were cost- and time-prohibitive for vendors to complete before the product was outdated. Many people wrongly think Common Criteria is a security evaluation (free of bugs) - it's not - it's a security architecture evaluation (is the documented behavior working correctly).
There is a schism in CC - everything is changing - anything we know today is wrong and will change.
TL;DR: Common Criteria is a joke and doesn't actually mean what you think it does.
A Lisp Machine on Xen would be one model.
Side-thought: Can Android be dockerized?
Why? Aren't Android apps already sufficiently sandboxed?
Just as an aside: Docker doesn't seem very security-focused, I would not [yet] count on its containers being properly sandboxed. :>
Wasn't Dalvik deprecated & replaced by ART (Android Runtime)? ART compiles apps AoT - IIRC, upon installation pre-Marshmallow, and while charging/idle from Marshmallow onward.
It's almost enough to make me stop cursing Android developers and their children's children. Unfortunately version updates for non-Google devices are rare and everybody is still stuck supporting the majority of devices that are pre-Lollipop. Also it didn't make the APIs any better >:(
Rather than just being a competing virtualization solution, "Unikernels" are really about eschewing the existing OS paradigm altogether. For example, the Mirage folks seem to have asked themselves about how they could create a "safe" OS and landed on the solution that they could achieve that by trusting the OCaml compiler and runtime for "safety" and so wrote a brand new OS from scratch in OCaml. That is a very different thing than a reaction to the "tire fire" that you are describing!
Similarly, for rump kernels Antti Kantee (with the help of others, I presume) took several years to re-architect the NetBSD kernel to minimize the inter-dependency of different components of the kernel through the creation of a "hypercall" interface and a carefully thought out separation of concerns. One of the end results of this architecture is that you can run NetBSD drivers outside of the NetBSD kernel "just" by implementing the rump kernel hypercall interface. Want to write your own OS (in a "safe" language like OCaml, for instance) but don't want to write a TCP stack or a filesystem implementation or USB driver from scratch? Rump kernels could be a solution to that problem. Again, that is a very different problem space than the "tire fire".
It is how the safe OSes from Burroughs, DEC, Xerox PARC, ETHZ and many others used to work.
Those OSes were written in strongly typed systems programming languages, the whole stack.
Part of their security was based on the language type system.
This is just the next step. I've got an app which needs communication channels and possibly persistent storage - isolate everything else. This is what unikernels provide. If they get rid of some of the redundant system parts, that's just a cherry on top.
OpenWRT/LEDE will happily work on a system with 4MB of storage:
QNX had a graphical environment, a web browser, a web server, a text editor, image viewer, various games, a package manager, etcetera on a 1.44MB floppy:
Less is definitely more, but you do not need a Unikernel to achieve such sizes, and you lose observability by going with a Unikernel. If something goes wrong with your application, such as it becoming non-responsive, you need to attach gdb or get a core dump like a kernel developer would to understand what happened. Your production systems are likely EC2 instances that lack such functionality, which means debugging is much harder with a unikernel than it would have been with a monolithic, hybrid or micro kernel. Furthermore, disk space is cheap, which is why few opt for OpenWRT/LEDE over more full featured Linux distributions in datacenters.
If you want the experience of a single address space and little more code than your application, you could run FreeDOS, which also fits on a floppy and has a code base that is mature. There are guides for doing this online. Here is one for doing a web server:
The world moved away from such designs because the observability and stability were awful. We might have "safe" languages now that improve stability of the application, but those could just run as a process in an environment where proper debugging can be done when something goes wrong. The few percentage points of performance that you get from eliminating the mechanisms that enable you to understand what went wrong do not justify discarding them.
Also, you lose the advantage of a shared memory pool with unikernels, which are generally intended to run in VMs. Partitioning memory in VMs causes internal fragmentation, which artificially lowers the density of applications per machine. It also can lower block IO efficiency from double caching between the host and guest. Hardware virtualization is a useful technology, but it is an inefficiency that we need to eliminate with containers, rather than one that we should embrace with unikernels.
I think there is a lot of design space here that is unexplored, so I'm not so sure it is as clear cut as you say. You might like this talk given earlier this year at Compose Conference, entitled "Composing Network Operating Systems" (I was a speaker at Compose and I <3'd this talk a lot.)
It is not just about performance in all cases. Mirage is the particular case in question here - but with OCaml functors, it becomes possible to compose components of kernel in truly modular ways. I was continuously surprised by this talk.
Something that needs to write to a block device only needs an abstract functor describing the interface to the device and some primitives to read or write to it. There are many implementations of this interface.
This seems quite obvious but it allows powerful ideas. For example, in the talk, you can see examples similar to this. But what if you want to test your kernel? You can simply substitute in a new implementation that has failure modes. You can write a block device that randomly ignores every 100th write; one that has unexpectedly high latencies, one that outright hangs on all I/O requests... Doing this kind of fault injection today is possible, but it's conceptually a lot nicer if it's just a "Mock" at the "block device" level that you can easily control and extend. You can do all kinds of other things; like have your system timer freak out, skew in random ways, run in reverse.
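The talk does this with OCaml functors, but the shape of the idea translates to any language with interfaces. A hypothetical Python sketch (names are mine, not Mirage's actual API): a minimal block-device interface, a real in-memory implementation, and a fault-injecting wrapper that silently drops every 100th write.

```python
class BlockDevice:
    """Minimal block-device interface - a stand-in for Mirage's BLOCK signature."""
    def read(self, sector): raise NotImplementedError
    def write(self, sector, data): raise NotImplementedError

class MemDisk(BlockDevice):
    """The 'real' implementation: blocks live in a dict."""
    def __init__(self):
        self.sectors = {}
    def read(self, sector):
        return self.sectors.get(sector, b"\x00" * 512)
    def write(self, sector, data):
        self.sectors[sector] = data

class FlakyDisk(BlockDevice):
    """Fault-injecting wrapper: silently drops every nth write."""
    def __init__(self, inner, n=100):
        self.inner, self.n, self.count = inner, n, 0
    def read(self, sector):
        return self.inner.read(sector)
    def write(self, sector, data):
        self.count += 1
        if self.count % self.n == 0:
            return  # injected fault: pretend the write succeeded
        self.inner.write(sector, data)

disk = FlakyDisk(MemDisk(), n=100)
for i in range(200):
    disk.write(i, b"x" * 512)
# Writes 100 and 200 (sectors 99 and 199) were silently dropped.
lost = [i for i in range(200) if disk.read(i) != b"x" * 512]
```

Because the consumer only sees the interface, the same filesystem code runs unmodified against the real disk or the flaky one - that substitutability is the whole point.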
You mention observability, but when your systems are truly modular, this is nothing more than an obvious follow up. An example in the talk is interposing "Irmin", which is a distributed, Git-esque storage system, into the network subsystem of your kernel driver. Any time interface properties of the device change, you write entries into the append-only Irmin log which are distributed. Irmin also has a git interface for read-only analysis.
The short story is that means in the talk, there is a live example where you can query a git repository to get a read-only changelog of all the networking state in your application. In the particular example, I believe it was interposed into the ARP implementation; every ARP packet and ARP response was logged into Irmin, and every system change propagated as a result was logged too. This gives you really amazing levels of persistent analysis and introspection with very low developer cost. It's true you could do something similar in a system today; but this is truly modular, works for any application built to use a particular Functorised-API, etc. It's a programming interface! And in theory there's also nothing stopping conventional tools like `ocamldebug` from working either.
Mirage also abstracts over the true underlying runtime. So that same device API can be switched with one that just talks to a POSIX-compliant filesystem, you get an ELF executable, etc. This all works on normal systems too; Unikernels are merely a different deployment target (for the most part).
This is not to say that Unikernels are the future or we should abandon our stable systems we have now (I definitely won't be doing so anytime in the future). But I found myself very surprised at what was quite easily possible, and I wouldn't so quickly write it all off as a fad. Maybe for Huge Enterprise, yeah... Operations experience separate from development is very useful, and a lot easier to find. But there's definitely some really cool uses for these things, especially in helping rethink and improve on some previous ideas.
> Also, you lose the advantage of a shared memory pool with unikernels, which run in VMs. Partitioning memory in VMs causes internal fragmentation, which lowers densities. It also can cause double caching between the host and guest, which lowers block IO efficiency.
This is a good point that's often overlooked. But I don't look to Unikernels for outright performance, either; to me, they are more interesting for researching newer operating system designs with a much better ROI than previous methods. I'm glad to see that happening, personally. And I might even take a performance loss if it meant winning some other guarantees in return.
I guess my point is that the unikernel is always going to be the equivalent of a userland process. The question is whether your bare-metal kernel is going to be a traditional one or a hypervisor. They have definite performance advantages over a traditional kernel when your bare metal kernel is a hypervisor, but I believe that is the wrong abstraction when I consider overhead.
> OpenWRT/LEDE will happily work on a system with 4MB of storage:
> QNX had a graphical environment, a web browser, a web server, a text editor, image viewer, various games, a package manager, etcetera on a 1.44MB floppy
This is all true, but you skipped the sentence following the one you quoted: "Depending on your needs, you can go down into the kilobyte range - and that's not just the app - that's everything".
UNIX is a great OS for sharing a machine amongst many users and many programs. It's not that great when your app is made of thousands of asynchronous programs. Last decade's tools need to be reinvented regardless of microkernels.
With a monolithic kernel in the way you have to make some black-box concessions.
Or to put it more charitably, since cloud compute services are based around booting VM images based on this model, we'll just go with it instead of trying to use an abstraction that is actually designed for this.
Correct me if I'm wrong, but it seems to me that the first thing any unikernel is going to do when it boots is switch the (virtualized) CPU out of x86 Real Mode (which all x86 machines boot into for legacy reasons, but virtually no one has needed since circa 1995) into protected mode.
Is it just me or does this seem a little bit crazy?
1. The job of an OS is to ensure that multiple programs can run on a single box without interfering with each other.
2. The job of a hypervisor is to ensure that multiple OSes can run on a physical box without interfering with each other.
3. In many cloud deployments, a single VM instance only runs a single user-defined program, which is programmed to a higher-level runtime than the OS (eg. Node.js, JVM, Rails/Django, SQL).
4. Why do we need #1 then?
IMHO, the real interesting stuff happens when you start re-implementing the APIs that we actually program to, without the OS. For example, what if:
1. You could take any command-line ELF executable and build an AMI out of it. This AMI would have an HTTP interface that only accepted connections from certain security groups. It would take in the command-line args via query params, and let you construct a virtual filesystem containing only the files you operate on via request body. Imagine say a compile server that runs Clang on user-defined code and serves the executable back, to be run on its own VM. And the crucial part is - there is no persistent storage on the box, nor any code that would be worth attacking. If there's a bug in the executable and an attacker pwns the box, the worst he can do is corrupt the request. There is no shell. There is no filesystem. There is no TCP stack to make outgoing connections with.
2. You could re-implement Node.js for stateless webservers. Again, you'd have no filesystem; once the initial program starts, it's guaranteed to never touch disk, since it has no disk access. Node does its own scheduling, and this way Node's scheduler doesn't need to fight the OS scheduler. You could store preformatted HTTP packets or response fragments in read-only memory pages and send them out directly via RDMA.
3. You could do a database or search engine that bypasses the filesystem entirely, instead writing directly to raw disk blocks. It can choose these disk blocks based on locality, since it knows the particular index structure and access pattern for the data, and doesn't have to fight the OS's attempts to hide the disk blocks under a file abstraction.
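A minimal sketch of the third idea, assuming a flat raw device and a hypothetical extent allocator of my own invention (real engines are vastly more involved):

```python
BLOCK = 512

class RawDisk:
    """Pretend raw device: a flat byte array addressed by block number."""
    def __init__(self, nblocks):
        self.buf = bytearray(nblocks * BLOCK)
        self.next_free = 0

    def alloc_extent(self, n):
        """Hand out n *contiguous* blocks, so one index's pages stay
        adjacent on disk instead of being scattered by a filesystem
        allocator that knows nothing about the access pattern."""
        start = self.next_free
        self.next_free += n
        return list(range(start, start + n))

    def write_block(self, blk, data):
        assert len(data) == BLOCK
        self.buf[blk * BLOCK:(blk + 1) * BLOCK] = data

    def read_block(self, blk):
        return bytes(self.buf[blk * BLOCK:(blk + 1) * BLOCK])

disk = RawDisk(1024)
# The engine knows this index is scanned sequentially, so it reserves
# one contiguous extent up front rather than asking a filesystem for
# space piecemeal.
index_blocks = disk.alloc_extent(8)
for i, blk in enumerate(index_blocks):
    disk.write_block(blk, bytes([i]) * BLOCK)

# Adjacent logical pages landed on adjacent physical blocks.
assert index_blocks == list(range(index_blocks[0], index_blocks[0] + 8))
```

The payoff in real hardware terms is that a sequential index scan becomes a sequential device read, with no file abstraction remapping blocks underneath you.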
The point of unikernels is to take away stuff - it's not about which mode the CPU boots into, it's about removing all the code that is on a typical cloud computing image but has nothing to do with the job the instance is actually doing. All of this - shell, filesystem, DNS resolvers, etc. - is attack surface for a potential hacker, and it's often overhead when processing.
In the parent post, does AMI mean Amazon Machine Image, or some Application M____ Interface?
Right now, much of the research on unikernels focuses on implementing a POSIX API. In other words, it replaces libc so that instead of eg. write() making a syscall into a kernel, write() inlines the code that the kernel would've run and talks directly to the hardware.
IMHO, the real wins for unikernels come when they start implementing higher-level interfaces, eg. Node or Rails or Django or HTTP or SQL or the JVM. Many programs are already written to these frameworks, with no knowledge of (or in some cases, access to) the underlying POSIX APIs, and the frameworks themselves often re-implement a large portion of the OS to create better domain-specific abstractions. Node or Python's asyncio, for example, implement their own schedulers that each run inside a single OS thread. Databases work in terms of pages, built on top of a filesystem; they effectively try to recreate the abstraction of a block device on top of a stream on top of a real block device. Websites often have large quantities of text that are sent back with every request (think of page layout in a templating engine, or JS bundles for a SPA). This data is usually copied and concatenated multiple times within a framework, while a bare-metal-aware web framework would store it in a buffer somewhere and write it out directly to the network card.
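That last point can be sketched concretely (a toy of my own, not any real framework's API): encode the static fragments to bytes once at startup, so per request only the dynamic part is encoded and the static buffers are reused verbatim.

```python
# Static page shell encoded to bytes once, at startup - not re-rendered
# and re-encoded on every request as a naive templating engine would do.
HEADER = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html><body><h1>"
FOOTER = b"</h1></body></html>"

def respond(name: str) -> bytes:
    # Per request, only the dynamic part gets encoded; a bare-metal-aware
    # framework would go further and hand the static buffers straight to
    # the network card instead of concatenating at all.
    return b"".join((HEADER, name.encode("utf-8"), FOOTER))

resp = respond("hello")
```

Even in-process this saves repeated encoding and copying; the bare-metal version of the idea is to pin those buffers and point the NIC at them directly.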
And yes, I meant Amazon Machine Image. Doesn't have to be Amazon, but I'm focused on the pragmatics of how you might deploy a real unikernel to solve problems, and wanted to make the point that you're going to be loading it into Xen or some other cloud hypervisor at the end.
A unikernel can cut out the middle man here.
But containers are basically the same thing but with better debug support and a more familiar OS environment. Problem is containers need to be deployed on metal to be effective, not VMs. Unfortunately not many providers do this yet.
So yeah it is all kinda crazy.
Samsung just acquired Joyent, which provides multi-tenant container hosting on bare metal via Illumos and LX-branded zones. So to me, the acquisition further validates that approach.
That is my understanding of part of the premise of unikernels. Another is security from having less code, although nothing stops you from having less code with Linux. LEDE/OpenWRT are Linux distributions that are often smaller than the sizes that are advertised for unikernels.
I consider containers using syscalls on a kernel that operates on bare metal to be a better abstraction.
> Correct me if I'm wrong, but it seems to me that the first thing any unikernel is going to do when it boots is switch the (virtualized) CPU out of x86 Real Mode (which all x86 machines boot into for legacy reasons, but virtually no one has needed since circa 1995) into protected mode.
That is only on x86/amd64 systems. It is different on other architectures.
> Is it just me or does this seem a little bit crazy?
The more I learn about unikernels, the more skeptical I become of them.
The nice thing about ring-0 is that on modern hardware with SR-IOV, a VM can be associated with devices and the multiplexing that had to be done within a kernel or hypervisor can now be done in hardware.
Still, there are worse container/host interfaces. It could be the full suite of POSIX system calls.
Why would a Java application server provided by the host be better than the full suite of POSIX system calls?
I'm not following why you think turning on protected mode is crazy. Can you elaborate?
I was happy to see UniK, because I've long seen unikernels as an "architecture-buster" for Cloud Foundry. Yet in practice the shift to Diego made it much less painful than expected.
Disclaimer: I work for Pivotal, the majority contributor of engineers to Cloud Foundry. EMC is a major shareholder in Pivotal.
I'm very happy about more Go ports, I've done the arm64 and the Solaris ports, and now I am finishing the sparc64 port, but ports need to live upstream.
And I think this is a very big problem: for them it's some magic pizza box, and they complain when things don't perform the way they expect.
Good read though thanks!
Can I use Google Cloud or AWS?
You could - although you won’t write much more than a toy app - not until things are changed.
DeferPanic offers managed services for both public and private cloud environments, and its platform targets KVM, Xen, bare metal, and ESX.
Perhaps that falls under the "unfit" statement about these cloud providers, but that seems pretty nebulous for such a technical discussion.
Google Compute Engine runs a lot more than just Docker images. It allows you to run arbitrary x86 VMs, just like EC2. It is not based on Xen, however (it is a combination of KVM and a non-QEMU VMM about which I wish I could say a whole lot more, but I don't think we're prepared to do that just now).
However, what they hand you is a Docker container, I believe, so provided there's a Docker target for whatever rump kernel, it should theoretically just work. No?
It sounds like you work on GCE?
There's also GKE which is managed Kubernetes complete with Docker containers.
(And yes, I work on the virtual machine monitor backing GCE)
It's just that it's a bit fiddly to make it happen. My guess is that it's the fiddliness that Ian is suggesting is impractical.
2) We are implementing support for user-supplied images, which will let you run mostly anything in the very near future. We plan to be completely agnostic.
One reason MirageOS uses OCaml, for instance, is for its memory safety properties. A truly staggering amount of vulnerabilities (e.g. Heartbleed) are due to abusing unintended ways of accessing memory in programs which face the public Internet. Since we've proven over and over again at this point that we can't reliably write safe C code, there's a reason folks are interested in eliminating as much of it as possible, all the way down to the hypervisor level. Since so many devices will be Internet-connected soon, having a way to write apps without even a possibility of "Oops" bugs like this is even more critical.
The one-language library-centric ideology of e.g. MirageOS especially is really orthogonal to questions of provisioning data centers. It is truly a huge step in the right direction, and before the unikernel-container convergence, could be applied to the host OS of a container rig.
Maybe I'm missing something?
(garbage collection would seem to be an issue with a bare-metal language or?)