Two useful links from my Pinboard research for that post:
A person at Red Hat optimizing QEMU boot time:
An Intel deck talking about qemu-lite:
(The other thing to follow up on if you're interested in the background on this stuff is kvmtool).
In both cases, a big part of the answer seems to be eliminating BIOS overhead; getting rid of option ROMs (oproms) appears to have been the single biggest win for Intel. But the Red Hat article also finds lots of overhead in QEMU itself, and both pieces discuss kernel config issues (for instance, scrubbing the kernel you boot of subsystems with expensive initcalls).
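To illustrate the kernel-config side: a hedged sketch of a guest-kernel config fragment. The symbols are real Kconfig options, but which ones are worth disabling depends entirely on your workload (this selection is my example, not either article's list):

```
# Illustrative trims for a microvm guest kernel -- drop subsystems a
# VM guest doesn't need, keep the virtio devices it does.
# CONFIG_DRM is not set
# CONFIG_SOUND is not set
# CONFIG_USB is not set
# CONFIG_WIRELESS is not set
CONFIG_VIRTIO=y
CONFIG_VIRTIO_BLK=y
CONFIG_VIRTIO_NET=y
CONFIG_VIRTIO_CONSOLE=y
```

Booting with `initcall_debug` on the kernel command line makes the kernel log how long each initcall takes, which is how you find the expensive ones in the first place.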
By comparison: Firecracker is purpose-built in Rust for this one task, provides no BIOS, and offers only network, block, keyboard, and serial device support, with tiny drivers (the serial support is under 300 lines of code).
>"People like to say “Docker isn’t a security boundary”, but that’s not so true anymore, though it once was."
Could you explain why that wasn't true and why it is now?
>"Systems security people spent almost a decade dunking on Docker because of all the gaps in this simplified container model. But nobody really runs containers like this anymore."
Could you elaborate? People don't run containers like what? It isn't clear from the preceding paragraph.
Another commenter on this thread points out that GKE uses a trimmed-down Chromium OS? https://news.ycombinator.com/item?id=25884851
Edit: I guess container-optimized Chromium OS and gVisor are complementary (that is, you're welcome to run your app in gVisor sandboxed containers on a guest VM running Google's container-optimized OS). Ref: https://go-review.googlesource.com/c/playground/+/195983
Also, if I may: in what cases would one prefer a Firecracker-managed micro-VM running gVisor-sandboxed containers? I'd imagine gVisor slows things down, so it might not be for everyone, but I'm curious what value-add might make gVisor worth that.
(Kata, as I understand it, is mostly about microvm containers; I like Firecracker almost as much for its implementation, which I think is pretty gorgeous, as for the microvm design choice, but I like 'em both fine I guess. Also, there's a Firecracker-based Kata that people use, right?)
It's a matter of having a smaller attack surface. Plenty of container images run with root access by default, which is nearly full access to the kernel. That means that if the application running in the container is compromised, you have to rely on the kernel to enforce the sandbox between containers. This is a relatively new threat model (root not being fully trusted), so beyond there simply being more attack surface, there are likely more bugs/vulns out there waiting to be discovered. With effort and care you can run this safely, but reducing attack surface is a good idea for defense in depth.
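To make the "root by default" point concrete, here's a sketch of standard Docker hardening flags that shrink what a compromised app can reach (the image name is a placeholder):

```shell
# Run as an unprivileged user, drop all capabilities, forbid privilege
# re-escalation, and make the root filesystem read-only.
# "myapp:latest" is a placeholder image name.
docker run --rm \
  --user 65534:65534 \
  --cap-drop=ALL \
  --security-opt=no-new-privileges \
  --read-only \
  myapp:latest
```

None of this changes the fundamental situation (one shared kernel), but it means an attacker starts from an unprivileged process instead of in-container root.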
But yes, quite a few services assume they have root privileges and don't work as-is in such containers (recent OpenSSH, for example). For those cases, VM isolation makes for a much smaller attack surface.
"libkrun is a dynamic library that allows programs to easily acquire the ability to run processes in a partially isolated environment using KVM Virtualization."
Julia if you're reading, I'm a big fan. One request: could you add the date/time to the post, preferably near the title?
I do see a `<time>` tag in the HTML, but it doesn't render any human-readable text. The datetime is also part of the URL, which is "good enough," but it always takes me a minute to remember that, and Ctrl+F won't find it.
I'm a little ADD about knowing when things were published, so far from an average person. If you like it the way it is, then don't worry about me. I just wanted to throw it out there.
I’ll second the request. And also second that I’m a big fan of Julia Evans.
Same here: everything she writes is grounded, and she digs deep.
If you're using ChromeOS to run any Linux app, you're using crostini, which launches those apps in crosvm-managed sandboxes (a container inside a VM) in seconds. In the not-so-distant future, it looks like Android will sandbox platform workloads (running outside the Android framework?) managed by crosvm, to considerably improve security.
I looked for but couldn't find ChromeOS GCP instances. I mean, ChromeOS might be a great platform to run multi-tenant server workloads at this point.
One can run custom Linux containers (other than the Debian-based penguin) but not VMs (other than termina) at this point: https://chromeos.dev/en/linux/linux-on-chromeos-faq
Technically true, but the initial VM launch is not all that quick. E.g., on a Pixelbook i7, launching a terminal session without the VM started takes about 20 seconds for the VM itself to start, and another 30 seconds for things like volume mounting and starting the Debian container, for a total of about 50 seconds.
Subsequent launches once the VM is started are much quicker, just a few seconds.
> I looked for but couldn't find ChromeOS GCP instances
You're looking for Container-Optimized OS: https://cloud.google.com/container-optimized-os/ , which is Chromium OS based.
It's the default OS for e.g. GKE nodes, so Google probably runs pretty large numbers of them.
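If you want to try one directly (outside GKE), something like this should work; a sketch assuming the `gcloud` CLI, with the instance name and zone as placeholders:

```shell
# Boot a Container-Optimized OS VM on GCE. "cos-test" and the zone
# are placeholders; cos-stable/cos-cloud are the public image family
# and project.
gcloud compute instances create cos-test \
  --zone us-central1-a \
  --image-family cos-stable \
  --image-project cos-cloud
```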
When I first heard about KVM on Android, I thought Android was going to run every app in its own micro-VM managed by crosvm... But that'd be too resource intensive for mobile devices, I think?
What they are instead doing with this project, led by Will Deacon, is more towards isolating non-Android workloads that OEMs run (like radios): https://news.ycombinator.com/item?id=10905643, https://news.ycombinator.com/item?id=24109856, https://news.ycombinator.com/item?id=14859602
However, I really like all the security efforts they put into it (which is also a reason why the NDK is so constrained).
Thanks for the links.
I used this in 2009 for an IRC bot that safely evaluated arbitrary shell commands for demonstration purposes, and a fork thereof is still chugging along to this day.
For a generic build like you'll find in a Linux distro, QEMU startup is slow mostly because it links to dozens of shared libraries (103 on Fedora 33).
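This is easy to check on your own machine. A small sketch; the demo call below uses /bin/sh since QEMU may not be installed everywhere, so point it at your distro's qemu-system binary instead:

```shell
# Count the shared libraries a binary links against. Point this at
# e.g. /usr/bin/qemu-system-x86_64 to reproduce the Fedora number.
count_shared_libs() {
  ldd "$1" | grep -c '=>'
}
count_shared_libs /bin/sh
```

Each of those libraries has to be mapped and relocated before QEMU's own code even starts running.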
My main goal there was to provide the automation needed to use cloud images for the VMs, so you can easily leverage a wide array of existing images. Most of the credit goes to cloud-init, which automates instance configuration after boot.
"Automation to run VMs based on vanilla Cloud Images on Firecracker": https://ongres.com/blog/automation-to-run-vms-based-on-vanil...
So this makes it comparable to containers where speed is concerned.
Is anybody using this instead of typical VMs in production (and not being Amazon)?
Typically, a Docker filesystem doesn't include a proper init (Docker will inject a "tiny init"), so you would probably have to add a kernel, and an init, somehow.
I think I'd prefer to just create a VM, rather than reuse the pre-built Docker image.
It creates a block device using `qemu-img`, adds an empty filesystem using `mkfs.ext4` and then simply mounts it and copies in the files.
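In shell terms, the steps just described look roughly like this (sizes and paths are illustrative, and the mount step needs root):

```shell
qemu-img create -f raw rootfs.ext4 1G           # create the block device image
mkfs.ext4 -q rootfs.ext4                        # put an empty ext4 fs on it
mkdir -p /tmp/my-rootfs
sudo mount -o loop rootfs.ext4 /tmp/my-rootfs   # mount it...
sudo cp -a ./fs/. /tmp/my-rootfs/               # ...and copy the files in
sudo umount /tmp/my-rootfs
```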
The previous posts cover this topic in more detail:
Day 43: Building VM images - https://jvns.ca/blog/2021/01/21/day-43--building-vm-images/
Day 44: Building my VMs with Docker - https://jvns.ca/blog/2021/01/22/day-44--got-some-vms-to-star...
We create a hard link to the resulting device for the root drive inside firecracker.
I linked to a blog post we wrote about the rationales here, upthread.
Technically, Docker/LXC use kernel namespaces to isolate a process tree, while Firecracker starts up a virtual machine.
Even though the startup times are fast, the I/O performance is poor compared to native, primarily because of I/O emulation. We've been working on a new hypervisor that can directly run isolated containers (no VMs). Email me if you're interested in learning more.
If we can get VMs to launch even remotely as fast as containers, that possibly means the end of containers due to VMs offering far superior isolation.
I think containers have been largely a force for good. They're not as much a force for good as some might claim, but that's a different argument than that they were a waste of time.