Regarding your first link I had questions about the following passages from it:
>"People like to say “Docker isn’t a security boundary”, but that’s not so true anymore, though it once was."
Could you explain why that wasn't true and why it is now?
>"Systems security people spent almost a decade dunking on Docker because of all the gaps in this simplified container model. But nobody really runs containers like this anymore."
Could you elaborate? People don't run containers like what? It isn't clear from the preceding paragraph.
Not OP, but most public clouds need stronger isolation than what Docker provides. GKE uses gVisor, AWS uses Firecracker now. When they started, Lambda code was scheduled on a different EC2 instance per customer. VM isolation is just so much stronger. Firecracker hopes to make launching VMs as quick as launching containers while retaining the great isolation benefits of a VM.
Edit: I guess container-optimized Chromium OS and gVisor are complementary (that is, you're welcome to run your app in gVisor sandboxed containers on a guest VM running Google's container-optimized OS). Ref: https://go-review.googlesource.com/c/playground/+/195983
Thanks. I remember mrkurt saying fly.io evaluated gVisor but eventually settled on Firecracker. Though I must ask: can gVisor on its own achieve the same level of protection as micro-VMs? Even if not, can the protection offered by application kernels like gVisor and Kata be considered enough for multi-tenant workloads (like the ones fly.io runs) inside the same guest (VM)?
Also, if I may, in what cases would one prefer a Firecracker-managed micro-VM running gVisor-sandboxed containers? I'd imagine gVisor slows things down, so it might not be for everyone, but I'm curious what value-add might make gVisor worth that.
That's a tough question to answer! I spent some time on it in the blog post, and a lot more time that I didn't write about hacking on little bits of gVisor while researching that post. My take is, no, I don't think gVisor is as secure as Firecracker; it's way more ambitious and has a larger attack surface. But I buy that it is substantially better than OS-sandboxed container runtimes, and I would probably trust it for multitenant workloads.
(Kata, as I understand it, is mostly about microvm containers; I like Firecracker almost as much for its implementation, which I think is pretty gorgeous, as for the microvm design choice, but I like 'em both fine I guess. Also, there's a Firecracker-based Kata that people use, right?)
With containers, both the kernel and the hypervisor are shared. With VMs, only the hypervisor is shared.
It's a matter of having a smaller attack surface. There are plenty of container images that run with root access by default, which is almost full access to the kernel. This means that if the application running in the container is compromised, you need to rely on the kernel enforcing the sandbox between containers. This is a relatively new threat (root not being fully trusted), so beyond there simply being more attack surface, there are likely to be more bugs/vulns out there to be discovered. With effort and care you can safely run this, but reducing attack surface is a good idea for defense in depth.
If one only allows the container to run as a non-root user (with no user namespace either), with all privileges dropped, strong mount isolation, and some form of syscall filtering, then the attack surface is similar to that of hypervisors, if not smaller, while the performance is significantly better (a rough sketch of such an invocation is below).
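To make that concrete, here is a minimal sketch (in Go, just shelling out to the Docker CLI) of the kind of locked-down invocation described above. The image name, UID, and seccomp profile path are placeholders, not anything from this thread:

```go
// hardened_run.go - a minimal sketch of launching a locked-down container.
// The image name, UID, and seccomp profile path below are hypothetical.
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("docker", "run", "--rm",
		"--user", "65534:65534", // non-root user inside the container
		"--cap-drop", "ALL", // drop every Linux capability
		"--security-opt", "no-new-privileges", // block escalation via setuid binaries
		"--security-opt", "seccomp=seccomp-profile.json", // syscall filtering
		"--read-only",       // immutable root filesystem
		"--tmpfs", "/tmp",   // writable scratch space only where needed
		"example-image:latest")
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
```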
But yes, quite a few services assume they have root privileges and do not work as-is in such containers, like recent OpenSSH. For those cases, VM isolation makes for a much smaller attack surface.
The security models really aren't comparable. Again, see the blog post, which offers two examples of attacks that break the model you're proposing. I don't want to get into too much detail in this thread because I really just wanted to add some data to questions Julia Evans specifically asked in her post.
The blog post incorrectly states that Go is a memory-safe language; it is not in its standard configuration. And it does not mention that hypervisor bugs that allow escaping from a VM are typically due to wrong logic, not memory safety. Using a memory-safe language does not protect against those, which in turn reflects the complexity of hypervisor interfaces in modern CPUs. And with a VM one gets much greater exposure to hardware bugs, since the VM has access to more instructions.
Do you know libkrun? github.com/containers/libkrun
"libkrun is a dynamic library that allows programs to easily acquire the ability to run processes in a partially isolated environment using KVM Virtualization."
Julia if you're reading, I'm a big fan. One request: could you add the date/time to the post, preferably near the title?
I do see a `<time>` tag in the HTML, but it doesn't render any human readable text. The datetime is also part of the URL which is "good enough," but it always takes me a minute to remember that and Ctrl+F won't find it.
I'm a little ADD about knowing when things were published, so far from an average person. If you like it the way it is, then don't worry about me. I just wanted to throw it out there.
On reading this, I went to see if the time tag renders in reader mode, assuming that’s part of why it’s there, but at least on my mobile Safari it surprisingly does not.
I’ll second the request. And also second that I’m a big fan of Julia Evans.
The blog post doesn't mention it, but Firecracker was originally based on crosvm [0], built at Google by the ChromeOS team for a WSL-like Linux sandbox on top of ChromeOS, running debian-buster containers viz. penguin (afaik) in a gentoo-based VM viz. termina [1]. crosvm in turn is part of a much bigger crostini project [2], which I find to be super fascinating, as it supports UI workloads (over Wayland and X).
If you're using ChromeOS to run any Linux app, you're using crostini, which launches those apps in crosvm-managed sandboxes (a container inside a VM) in seconds. In the not-so-distant future, it looks like Android will sandbox platform workloads (running outside the Android framework?) managed by crosvm, to considerably improve security [3].
I looked for but couldn't find ChromeOS GCP instances. I mean, ChromeOS might be a great platform to run multi-tenant server workloads at this point.
> ... launches those apps in crosvm sandboxes in seconds.
Technically true, but the initial VM launch is not all that quick. E.g., on a Pixelbook i7 launching a terminal session without the VM started, it takes about 20 seconds for the VM itself to initiate, and another 30 seconds for things like volume mounting and starting the container running Debian, for a total of 50 seconds.
Subsequent launches once the VM is started are much quicker, just a few seconds.
> I looked for but couldn't find ChromeOS GCP instances
When I first heard about KVM on Android, I thought Android was going to run every app in its own micro-VM managed by crosvm... But that'd be too resource intensive for mobile devices, I think?
You can use QEMU snapshots to start a VM in under a second (but more than 125ms).
I used this in 2009 for an IRC bot that safely evaluated arbitrary shell commands for demonstration purposes, and a fork thereof is still chugging along to this day.
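For readers who haven't used this trick: a rough sketch of what the restore side looks like, assuming an internal snapshot named "ready" was saved earlier into the qcow2 disk via the QEMU monitor's `savevm ready` command. The disk path and snapshot name are placeholders:

```go
// restore_snapshot.go - sketch: boot a VM directly from a saved qcow2 snapshot
// instead of cold-booting. Assumes `savevm ready` was run earlier in the QEMU
// monitor; the disk path and snapshot name are placeholders.
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	cmd := exec.Command("qemu-system-x86_64",
		"-enable-kvm",
		"-m", "512",
		"-nographic",
		"-drive", "file=disk.qcow2,format=qcow2",
		"-loadvm", "ready") // resume from the saved snapshot
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
```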
I wrote a similar post some weeks ago, going to similar depths to script launching a VM with Firecracker.
My main goal then was to provide the necessary automation to help using cloud images for the VMs, so you can easily leverage a wide array of existing images. Most of the credit is due to cloud-init, which helps automate instance configuration after boot.
I just got this working in a Windows preview build too. If you enable KVM and rebuild the WSL2 kernel, then you can follow the Linux Firecracker demo on GitHub step by step and it works. I was able to launch 400 concurrent Firecracker VMs on my laptop in 60 seconds.
Yep. At Fly.io, we run customer containers on our own hardware around the world --- the normal workflow just pushes Docker containers to our registry --- by converting them into root filesystems and running them as Firecracker instances.
May I ask why you use Firecracker, especially when you already have Docker images in your registry? Do you need root and/or a kernel in your containers for your application?
Because of multitenancy. It isn't safe to run jobs from different customers alongside each other in namespaced OS containers; instead, we give every customer instance its own VM and its own OS. This is the same model that Lambda and Fargate use (of course, that's what Amazon implemented Firecracker for).
I linked to a blog post we wrote about the rationales here, upthread.
Typically, a Docker filesystem doesn't include a proper init (Docker will inject a "tiny init") - so you would probably have to add a kernel, and an init, somehow.
I think I'd prefer to just create a VM, rather than reuse the pre-built Docker image.
It's not a big secret or anything but it's changed recently and Jerome would do a better job of describing it than I would; apart from the filesystem stuff, we have an init we wrote in Rust that does a bunch of the lifting.
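For readers wondering what an init inside one of these VMs actually has to do, here's a minimal Go sketch of the bare essentials: mount the pseudo-filesystems, start the workload, and reap zombies. To be clear, this is not Fly.io's init (theirs is written in Rust and does far more); the workload path is a placeholder:

```go
// tiny_init.go - a sketch of the bare minimum a PID 1 inside a microVM does:
// mount /proc, start the real workload, and reap orphaned children.
// Not Fly.io's actual init; the workload path is hypothetical.
package main

import (
	"log"
	"os/exec"
	"syscall"
)

func main() {
	// Mount /proc so the workload and tools like ps behave normally.
	if err := syscall.Mount("proc", "/proc", "proc", 0, ""); err != nil {
		log.Printf("mount /proc: %v", err)
	}

	// Start the actual application (hypothetical path).
	app := exec.Command("/app/server")
	if err := app.Start(); err != nil {
		log.Fatalf("start workload: %v", err)
	}

	// As PID 1, reap every child that exits, not just our direct one.
	for {
		var ws syscall.WaitStatus
		pid, err := syscall.Wait4(-1, &ws, 0, nil)
		if err != nil {
			break // ECHILD: nothing left to wait for
		}
		if pid == app.Process.Pid {
			break // workload exited; a real init would now power off the VM
		}
	}
}
```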
Even though the startup times are fast, the IO performance is poor compared to native. This is primarily because of IO emulation. We've been working on a new hypervisor that can directly run isolated containers (no VMs). Email me if you are interested in learning more.
I am not so sure about that. Amazon uses Firecracker because of its performance and how lightweight it is. QEMU VMs used to be heavier, and I think the QEMU devs recently started a project to have a lightweight version like Firecracker.
Linux containers are containers, not VMs. They are more like Docker (although lxd/lxc are typically used more like jails/VMs - a "full" userland, rather than just an application binary, as with a Docker container wrapping a service implemented in Go).
Technically, Docker/LXC use kernel namespaces to isolate a process tree - Firecracker starts up a virtual machine. When a VM context switch happens, the CPU uses extensions like Intel VMX to isolate the virtual machine code from the host code. Usually the hypervisor also forces a cache flush to mitigate CPU vulnerabilities.
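To make the namespace half of that concrete, here's a minimal Linux-only Go sketch that starts a shell in fresh PID, UTS, and mount namespaces - roughly the primitive Docker/LXC build on, with none of the rest of their machinery (needs root or a user namespace for the clone flags to be permitted):

```go
// namespaces.go - minimal Linux-only sketch of the namespace primitive that
// Docker/LXC build on: run a shell in its own PID, UTS, and mount namespaces.
package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	cmd := exec.Command("/bin/sh")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// New PID, UTS (hostname), and mount namespaces for the child.
		Cloneflags: syscall.CLONE_NEWPID | syscall.CLONE_NEWUTS | syscall.CLONE_NEWNS,
	}
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
```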
I probably agree about container runtimes, but containers as a specification for a preconfigured unit of compute have lots of value outside the runtimes. We don't use namespaced containers at all at Fly.io, and run everything inside of jailed Firecrackers. But we get enormous value out of container tooling, and the most common and simplest way of deploying something to Fly is simply to have our tooling push a Docker image over to Fly's repo.
I think containers have been largely a force for good. They're not as much a force for good as some might claim, but that's a different argument than that they were a waste of time.
Re: whether there's something similar for macOS - it's built-in. The new Virtualization framework in macOS 11 (Big Sur) is very, very quick. It runs in user space with support for the Virtio spec.
Firecracker would be better from the security isolation point of view. It can use everything that Docker has to offer in the security domain and some more.
https://fly.io/blog/sandboxing-and-workload-isolation/
Two useful links from my Pinboard research for that post:
A person at Red Hat optimizing QEMU boot time:
http://oirase.annexia.org/tmp/paper.pdf
An Intel deck talking about qemu-lite:
http://events17.linuxfoundation.org/sites/events/files/slide...
(The other thing to follow up on if you're interested in the background on this stuff is kvmtool).
In both cases, a big part of the answer seems to be eliminating BIOS overhead; getting rid of oproms appears to have been the single biggest win for Intel. But the Red Hat article also finds lots of overhead in QEMU itself, and both pieces talk about kernel config issues (for instance, scrubbing the kernel you boot of subsystems that have expensive initcalls).
By comparison: Firecracker is purpose-built in Rust for this one task, provides no BIOS, and offers only network, block, keyboard, and serial device support --- with tiny drivers (the serial support is less than 300 lines of code).
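To give a feel for how small that interface is, here's a rough Go sketch of configuring and booting a microVM through Firecracker's REST API over its Unix socket, loosely following the upstream getting-started demo. The socket, kernel, and rootfs paths are placeholders, and a firecracker process must already be listening on the socket:

```go
// boot_microvm.go - rough sketch of driving Firecracker's REST API over its
// Unix socket to configure and boot a microVM. Paths are placeholders.
package main

import (
	"context"
	"log"
	"net"
	"net/http"
	"strings"
)

func main() {
	const sock = "/tmp/firecracker.socket" // placeholder socket path

	// HTTP client that dials the Unix socket instead of TCP.
	client := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return net.Dial("unix", sock)
			},
		},
	}

	put := func(path, body string) {
		req, err := http.NewRequest(http.MethodPut, "http://localhost"+path, strings.NewReader(body))
		if err != nil {
			log.Fatal(err)
		}
		req.Header.Set("Content-Type", "application/json")
		resp, err := client.Do(req)
		if err != nil {
			log.Fatal(err)
		}
		resp.Body.Close()
		log.Printf("PUT %s -> %s", path, resp.Status)
	}

	// Guest kernel plus a stripped-down command line (serial console only, no PCI probing).
	put("/boot-source", `{"kernel_image_path": "vmlinux", "boot_args": "console=ttyS0 reboot=k panic=1 pci=off"}`)
	// A single block device as the root filesystem.
	put("/drives/rootfs", `{"drive_id": "rootfs", "path_on_host": "rootfs.ext4", "is_root_device": true, "is_read_only": false}`)
	// Kick off the boot.
	put("/actions", `{"action_type": "InstanceStart"}`)
}
```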