It's always great to see different ways folks are using Firecracker. Snapshot-and-restore is a particularly cool capability, especially if you solve the data movement problems (like these folks have).
One challenge to clone-and-restore that they don't talk about here is making sure that clones don't behave too similarly (like returning the same cryptographic random numbers). We wrote a paper about that a while back (https://arxiv.org/abs/2102.12892), and the Linux kernel community has been doing some great work in that area recently too.
Does Firecracker not support virtio-rng? I won't comment on other uniqueness issues, but I would naively expect that you could fix random number generation by outsourcing it to the host. Or does Linux not pull from the provided RNG on every use, resulting in a gap right after restore where your per-VM RNG isn't unique? I suppose you could fix that by making the VM kernel aware that it was just restored? And now I see why it's not trivial :P
You could hotplug different hardware, but having a unique MAC address isn't very important if you're on a virtual network where you only talk to the host to get your traffic routed. A unique MAC is only important if you put the cloned VMs on the same network segment.
What do MAC addresses have to do with rng? Are you thinking of the old-style UUIDs that used the machine's MAC address and system time? What a terrible idea that was.
I assume not reusing a MAC address falls into the same bucket of "make sure the VMs are not too similar" rather than anything specific to random number generation.
Hit that exact bug with a customer at work in libvirt. Two machines booted at approximately the same time generated VMs with the same MAC, due to a very poor choice of random seed: XORing the boot time with the PID, which made it even less random.
It was since fixed though I never updated the bug.
src/util/virrandom.c:virRandomOnceInit seeds the random number generator using this formula:
unsigned int seed = time(NULL) ^ getpid();
This seems to be a popular method after a quick Google, but it's easy to see how it can be problematic. The time is only in seconds, and during boot of relatively identical systems these numbers are both likely to be similar across machines, which is quite likely in cloud-like environments. Secondly, a bitwise XOR of two correlated values creates only a small difference, so when the inputs line up it's easy to get colliding seeds.
Beyond the basic logic, I also tested this with a small test program: trying 67,921 unique combinations of time() and getpid() produced only 5,693 distinct seeds, using a PID range of 6799-6810 and a time() range of 1502484340 to 1502489999.
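That collapse is easy to reproduce. A quick Python sketch of the same experiment (the exact counts depend on the ranges chosen, so they may not match the numbers above exactly):

```python
# Simulate libvirt's old seeding formula: seed = time(NULL) ^ getpid().
# Roughly the same ranges as the test above: PIDs 6799-6810,
# times 1502484340-1502489999.
pids = range(6799, 6811)
times = range(1502484340, 1502490000)

seeds = {t ^ p for t in times for p in pids}
combos = len(pids) * len(times)
print(f"{combos} (time, pid) combinations -> {len(seeds)} distinct seeds")

# XOR cancels correlated low bits: bumping the clock by one second while
# the PID is one lower produces the exact same seed.
assert (1502484340 ^ 6801) == (1502484341 ^ 6800)
assert len(seeds) < combos  # far fewer distinct seeds than inputs
```

Seeding from a coarse clock and a small PID range means machines booted together share seeds; pulling from /dev/urandom instead avoids the problem entirely.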
Thanks for sharing the paper, that is super interesting. This is indeed a challenge. The fact that we run development workloads instead of production workloads makes this easier for our use case. We do some rehydration, and when forking across team/organization boundaries we don't clone if there are secrets set. But for production workloads this would not be ready yet.
Once uniqueness has been solved though, VM cloning would become a real solution for serverless hosting (and many other cases). Exciting prospect!
Could you share why you're seeing such a slowdown? I've mainly experienced a slowdown when loading pages for the first time, since new pages need to be page faulted into memory from disk first, but after those pages have been loaded into memory I don't experience any slowdown compared to fresh starts. That said, that's only based on how I've experienced it with the type of workloads that we run on the VMs.
It's the page fault+copy latency, together with some secondary effects from the page tables being updated (seems to briefly halt all cores). The actual copying of a page of RAM is almost free compared to the time spent in all the kernel code for a page fault.
If your RAM is file backed, you end up spending lots of time in the filesystem code too - I used anonymous mappings which really helped there, and called clone() on the VM process to keep them shared.
I suspect if you use huge pages you might see lots of the impact vanish, but obviously that has other downsides.
Right, that makes sense. Once the memory page is in memory it should be fast though. We use shared mapping right now, and practically the pages stay in memory during the lifetime of a VM once they've been loaded, but we need to do more testing when there's more memory pressure.
I've been looking at huge pages recently, I'm going to do some more testing with transparent huge pages today and see if it changes performance. Unfortunately we cannot use reserved huge pages because that doesn't work with shared mmap on say an XFS FS.
Another idea is to make clones use the same memory base layer of their parent, then the pages are already prefaulted and it would deduplicate overall memory usage. Many things to discover still..
Not to diminish this work, but I think it's worth noting that it's increasingly possible to launch new VMs extremely quickly too. FreeBSD/Firecracker can reach userland in 33 ms, and the OSv unikernel boots in under 10 ms.
I think increasingly we'll see Firecracker used with EC2-like setups of "create a disk image with everything preinstalled and then boot it" rather than using snapshots of running (suspended) VMs.
I’m kind of curious if AWS is ever going to launch a firecracker as a service thing independent from lambda. It would be wonderful for CI or other tasks where you want to rapidly spin up a box and you don’t know how long it needs to be up. EC2 and Fargate take enormous amounts of time to provision compared to firecracker.
From testing a couple years ago (things are likely different now), image pull/setup made a pretty noticeable difference. A 1GB container was about 20 seconds slower than a 500MB one -- I assume I/O, since Fargate instance size didn't make a difference.
On the other hand, ECS still seems slow compared to k8s, where things are nearly instant unless you're measuring closely, so ECS control-plane speed might be part of the issue, too.
This is still a thing, Fargate pull times are super slow: https://github.com/aws/containers-roadmap/issues/696. We run all of our workloads on fargate, and it's really annoying when you're trying to iterate on something and you have to sit there waiting on "Provisioning..." for 1-2 minutes every time you launch a task. I don't think the control plane is that slow, as EC2 based ECS launches tasks really fast if the images are already cached on the machine.
People have mentioned image loading but one other shockingly slow thing is allocating ENIs (this also affects Lambda, VPC endpoints, etc.). I've had a few times where I've looked at the logs and it's basically been like 5 minutes to launch something where 4 of those were waiting for the ENI.
I'd also like to see a Firecracker powered EC2 (with some constraints, of course), but ~6s provision time of current EC2 is already pretty awesome and TBH I don't care about 6s for CI things much.
We use Azure DevOps at work for our CI/CD, and although they provide an ephemeral runner setup (where you can run the agent with a --once flag, and it will exit after a single job runs so you know to destroy the container/VM), jobs will fail if there are no runners in the pool when the build starts. If we could get VM starts down to milliseconds or a second at most in AWS, we could scale our CI runners down to zero and use a webhook (for PR/commit) from ADO to trigger a VM launch on AWS, and by the time the pipeline actually started, there would be an agent ready to take the job.
A very specific use case, I know, but if I could have the CI runners run as needed, we could get instances that are way bigger so our builds run faster, and pay around the same amount since they don't have to sit around when they aren't being used.
Well, that's going to be a very expensive CI, when virt-lightning spawns a VM in less than 10 seconds with virtio, and you can have plenty on a dedicated server, which you probably have for CI anyway because CI runs faster on dedicated hardware.
I would love to see this as well. I currently can launch a Linux VM in milliseconds, but EC2 takes ~6s before the first user-provided instruction gets to run.
Worth noting that a small hello-world C unikernel can load in a ridiculously small amount of time, but some multi-gigabyte JVM unikernel might take 100s of ms.
If you need super fast boot times firecracker is definitely worth looking at but should be taken with caveats of what precisely you are going to run there.
I think you may be ignoring the aspect of cloning the codebase and handling writes transparently and then being able to quickly clone/snapshot that VM.
I'm very eager to see more developments in the fresh start times!
The main reason why snapshotting became interesting for us is that we're running development servers defined by our users. A development server could take a long time to start, sometimes minutes.
So even if we can start the VM fast, the most important speedup for us is on the user code that we cannot control.
Say the user code initiates a download, what happens if we clone during the run of the operation? Will the clone be able to finish the download?
The opposite case - say the user code binds to an IP:port to run a service. Will the clone try to step over the parent, binding to a port that is already taken?
The TCP connection gets "paused", it doesn't get broken but packets don't arrive. The packets that don't arrive are seen as packet loss, and so they get resent. If the connection stays frozen too long it will lead to disconnection (at least of the websocket connection to the VM).
For IP uniqueness, we give every VM the same IP, but we put every VM in its own network namespace. Then we have iptables rules to rewrite the src/dest IP on every packet that enters the network namespace.
You know I'm not sure... TCP is stream oriented and supposed to handle lost packets so I'd think the TCP layer itself would handle the pause. If the sender doesn't get an ACK for a packet then it'll resend that packet later (TCP has sequence numbers so the stream can be reconstructed from out-of-order delivery and resends).
I revisited my proof-of-concept test scripts when I wrote the previous comment. I'll try in the next week to add some additional tests in there to determine stream reliability and packet delay/loss.
UDP of course doesn't have the same benefits.
I'm using ECMP + Anycast in a project I've been developing for the last couple of years (K18S or Keep It Simples Stupids) to effectively replace Kubernetes functionality with standard protocols and tooling that is in almost all distros.
We started out with the challenge of replacing the major parts of CNIs and that is where the ECMP + Anycast work arose from.
Native IPv6 with only VLANs and direct routing (no messing about with IPv4, NAT or overlay networks), ECMP + Anycast gives load-balanced routing to pods with automatic detection of lost hosts. Pods exposed to public get public IPv6 address in addition to a ULA (Unique Local Address, formerly called site-local). ULAs used for private routing.
Systemd-networkd is configured automatically by systemd-nspawn so there doesn't need to be a massive, foreign, orchestration control system.
Systemd-nspawn/systemd-machined to manage container lifecycles with OCI-compliant images, or leverage nspawn's support for overlayfs to build machine images from several different file-system images (rather like Docker's layers, but always kept separate, never combined), which can be used in a pick-and-mix fashion to assemble a container that has several related but separately packaged components.
Configs for /etc/ of each container mapped in from external storage using the same overlayfs method. In most cases everything is read-only but some hosts/pods can be allowed to write into the /etc/ overlay and those changes can be optionally committed to the external storage.
Adopting IPV6 and dropping IPv4 was the best thing we ever did in terms of keeping things simple and straightforward and relying on the existing network protocols and layers, instead of re-inventing it all (badly).
At the time we started Kubernetes didn't even have IPv6 support and even once it did many CNIs couldn't handle it properly.
I know nothing about VMs or filesystems but I absolutely enjoyed this article. The language was very clear and easy to follow. Would be following the blog from now on.
I have a question about the copy-on-write example involving VM A and VM B. It says that VM B will directly use all the data from VM A, and for any change it copies the block, writes into it, and reads from it after that.
But what if, say, block 2 is changed by VM A and was never written to by VM B? Wouldn't VM B read the changed block 2? Clearly, it doesn't happen cause a fork is a copy, but an explanation of how this is tackled is appreciated!
Yes, and this is also the biggest challenge. Right now we use XFS to enable CoW, but that quickly leads to filesystem fragmentation. I'm still looking at a way that we can quickly let both VM A and VM B use the same base snapshot, and write new changes to either anonymous memory or a file.
So let's say VM A is already running, and it's cloned to VM B. When that happens, we freeze the memory of VM A, and link it to VM B. For both VM A and VM B, any new write will be done to a new layer.
So the logic is to check if VM A has a new fork. If yes, then start CoW to a new layer of blocks, and leave the current layer to be linked with VM B. If no, just don't use CoW.
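To make the layering concrete, here's a toy Python sketch (all names made up, nothing to do with the actual implementation): the base layer is frozen at fork time, and both the parent and the clone write to their own private overlays, so neither can see the other's changes.

```python
class CowDisk:
    """Minimal copy-on-write layer: reads fall through to a shared,
    frozen base unless this VM has written the block itself."""

    def __init__(self, base):
        self.base = base      # shared with all clones, treated as read-only
        self.overlay = {}     # this VM's private diff layer

    def read(self, block):
        return self.overlay.get(block, self.base[block])

    def write(self, block, data):
        self.overlay[block] = data  # never touches the shared base


base = ["b0", "b1", "b2"]                # frozen at fork time
vm_a, vm_b = CowDisk(base), CowDisk(base)

vm_a.write(2, "a-change")                # VM A diverges...
assert vm_b.read(2) == "b2"              # ...but VM B still sees the frozen block
assert vm_a.read(2) == "a-change"
```

This is why the question above doesn't bite: after the fork, VM A's writes also land in a fresh overlay, never in the shared base that VM B reads from.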
The "fork" feature pauses the current VM for cloning - does it mean that your environment can have unpredictable pauses because you never know when someone will press "Fork" and how often?
Yes it's a good point, this is one of the reasons that we wanted the fork time to be low. If we keep it low enough, the connection won't break and other users won't notice it. That said, for some things (like terminal access) it's impossible to hide it.
Practically, 99% of the forks will be done from the `main`/`master` branch of the repo, which is read-only for everyone on the team. So the mini-pause isn't breaking in those cases.
This sounds a lot like SnowFlock[1], a U of T project to fork Xen VMs from ~12 years ago. Are they related? Or is this an independent re-discovery of the same principles?
Came here to say this. I was a grad student at U of T working with some of the original SnowFlock authors back in 2009. (Although I had no hand in the development of SnowFlock itself, I did contribute to some of the follow up work.) Skimming through the article, it looks like the high level idea is the same.
The major difference seems to be that SnowFlock would start a proprietary server which is responsible for sending memory pages over the network on demand whenever the clone reads them. Some follow up work also added several different prefetching strategies to improve the performance of the cloned VMs while they were still fetching remote memory.
SnowFlock was really targeted at compute-heavy applications. The idea was that you could mostly set up your application in the single VM, clone it, and then after cloning, it could be fairly easy to configure the clones to continue working on the problem in parallel.
My Masters thesis made use of SnowFlock to clone relational databases on demand.
There's something seriously wrong with the page on Firefox. The whole browser locks up for a few seconds when you scroll up/down. Couldn't possibly read the article.
Works fine for me, Firefox on Ubuntu (not installed with snap). I use uBlock Origin and it seems to have blocked a few external URLs on that page, so it's probably some bad JS slowing you down.
I think this skipped over the downside of using Firecracker: the host underneath needs to be either bare metal (AWS) or a VM with nested virtualization support (GCP). This creates additional complexity in managing such a setup in production. Moreover, since both nested-virt VMs and bare metal come at a very high spec by default, the economics only make sense if you are at a scale that can saturate (or over-saturate, like the article hinted at) these resources.
Very interesting blog post nevertheless. Looking forward to reading more!
When QEMU saves a snapshot, it tries to be "smart" about memory, only saving the memory in use[1]. This trades off CPU at snapshot time for I/O at transfer time. How compatible is Firecracker's virtual memory subsystem with doing something like that?
In our case we changed Firecracker to use a shared mmap instead of a private mmap, so the dirtied pages were synced back automatically to the backing memory file. The main reason for this was to reduce I/O at snapshot time. I'm also looking at other ways we can do this, because using a shared mmap fragments the underlying XFS filesystem pretty fast. Maybe we can batch writes more instead of writing single pages.
It could be shared memory all the way through, but the memory of the original VM should become read-only once a cloned VM starts reading from it. So then both VMs (the original VM and the new VM) should put their writes in a new CoW layer.
Using XFS with CoW has been the easiest way to enable this, but if there's a way that we can do this purely in-memory, that would be even faster.
That said, for hibernation we would still have to persist to disk, but timing is less important there.
This is something I will look into! I'm thinking it could reduce start time because we have to copy the mem snap from disk to tmpfs, essentially loading it into memory, but I'm going to try this!
This is really cool. I've also been working with Firecracker, but for isolated CI runners with Docker and KinD/K3s support, starting with GitHub Actions [1]. I've also had interest in making OpenFaaS use pause/resume from Gatsby.js, who wanted to reduce their hosting costs. The main challenges were around the networking: if you use CNI and the Go SDK [2] then restores simply don't work. Not sure if you're working with netlink and IPAM directly to get around it?
My question is how are you guaranteeing uniqueness, or do you only clone snapshots for a single tenant? [3]
CodeSandbox is one of the most impressive engineering teams I'm aware of
For most of us who are consumers only of these more fundamental infrastructure projects, there's something deeply satisfying about seeing people push these boundaries (very appropriate for HN too). Fly is another similar team/blog
Not to take credit away, but interestingly enough, both of those heavily rely on Firecracker VM, which certainly solves a huge “fundamental” infrastructure problem.
Tangential to the topic: I look forward to the day that fast snapshotting and snapshot restoring becomes a thing for all VPS providers like hetzner, digital ocean, and vultr (and all the others).
Especially if a machine is snapshotted, restored, snapshotted again, restored, and the cycle continues. Even if what's stored doesn't get much larger, the subsequent snapshot+restore processes take a little longer each time. Each provider has different timelines, with Vultr saying it can take up to 60 minutes for a snapshot to restore.
My use case is similar but different to code sandbox. I use a beefy remote machine for development and to keep costs low I fire it up and tear it down on demand and pay only for the hours the machine was up. It works fine for me but I just wish snapshotting+restoring was faster on these services. That would make it perfect.
Because you have to pay for stopped instances. That’s why I snapshot and restore. It’s the same with all VPS providers (at least the reliable ones I know)
Yeesht. I had never tried this out with AWS so I had just completely made assumptions about their stopped instance pricing. Thanks for the correction. At the same time, the pricing is rough. Instances with similar configs cost anywhere between 3-7x the price. My current bill of between 2 to 3 USD a month would go up to about 15-25 USD. In absolute terms that's not big but over the months that is going to add up.
great to hear more details about snapshot/restore in the wild, plenty about firecracker but seemingly much less about this exciting feature in real usecases.
looking forward to the unwritten details / future posts too, particularly:
- How to handle network and IP duplicates on cloned VMs
and
- Turning a Dockerfile into a rootfs for the MicroVM (quickly)
Thank you! Right now our main use case is cloning development environments so we can provide a fresh running dev env for every branch and PR. However there are many other interesting applications, like speeding up CI jobs with VMs that start from a snapshot.
I'll make sure we write about the other topics as well. For the network, we run the VM in its own network namespace on the host, and we give every VM the same IP. We then use iptables rules to rewrite every incoming and outgoing packet to the IP that the host has assigned to the VM.
Yeah the CI case is really interesting, it’s generally reproducible and declarative so a good fit that way, and time waiting for things to start is a big deal in CI.
Another use case I was thinking of was stateful compilers like scala where warming up the compiler is expensive, often a CI task too.
Regarding turning Dockerfiles into a MicroVM: https://gruchalski.com/posts/2021-03-23-introducing-firebuil..., on GitHub: https://github.com/combust-labs/firebuild. This could get you started. Plenty of moving parts in that problem. Many root OS’s, many inits, … Difficult to pull this off by one person without any particular reason so I kinda suspended the project but who knows, seems like people want it so might be a good idea to reboot it.
The bit about turning a dockerfile into a rootfs. A docker image is just a tarball of tarballs. We do something like this:
- you can dump the image using `docker save <name>`.
- you can then get a list of the tarballs in this image by extracting this tarball and reading the file `manifest.json`; `Config` -> `Layers` will give you a list of tarballs (see undocker for how to do this: https://github.com/larsks/undocker)
- Untar these in a directory and use linux tools to convert this dir to a rootfs.
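Roughly, the layer-unpacking step looks like this in Python (a simplified sketch: it applies layers in manifest order but ignores whiteout files, the `.wh.*` entries that mark deletions):

```python
import json
import tarfile

def extract_rootfs(image_tar: str, dest: str) -> None:
    """Unpack a `docker save` tarball into a flat rootfs directory.
    Later layers simply overwrite earlier ones."""
    with tarfile.open(image_tar) as image:
        manifest = json.load(image.extractfile("manifest.json"))
        for layer_name in manifest[0]["Layers"]:
            # Each layer is itself a tar archive nested inside the image tar.
            with tarfile.open(fileobj=image.extractfile(layer_name)) as layer:
                layer.extractall(dest)
```

From there you still need to turn the directory into something bootable, e.g. write it into an ext4 image that the MicroVM mounts as its root disk.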
also interested in the upper limit of a micro vm, like how big can it get? 64gb memory? not really micro any more and maybe a traditional VM would be a better fit.
I have to imagine that Fargate on Windows doesn't use Firecracker though, right? Firecracker needs kernel level changes to work properly, and the open source version doesn't let you run anything but Linux.
I have no idea how it works under the hood. Knowing what I know about Firecracker from watching the publicly available videos, I was shocked and thought it would never happen.
On the other hand, CodeBuild has supported Windows containers for years and at least CodeBuild for Linux is based on Fargate, so the service team figured something out. (I had to figure out how to word that. I can’t say “they figured it out” since I work for the same company. But I couldn’t say “we” since I’m so far removed from any service team in the consulting department that it would be disingenuous)
The biggest VM we've been running for dev environments have 12GB RAM, 8vCPUs and 30GB disk. I've also done some tests with 16GB RAM and that worked well too. Have yet to find an upper limit.
Another (unrelated) test we've done is on overprovisioning memory. We were able to run 200 VMs (all running Vite dev server where a file was changed every second) with 2GB RAM per VM, on a node with 128GB RAM. Because we were mapping the memory files on disk directly to the VM, the VM would automatically "swap" the memory back to the memory file when it had memory pressure. The bottleneck here was CPU.
The "micro" in microvms is less about size and more about resources. A typical virtual machine under Xen or KVM (para)virtualizes a lot of hardware and emulates a lot of devices, so that the operating system sees it as a normal machine.
The microVM emulates the minimal possible set of devices needed to run, such as disks and network devices, in the specific case of Firecracker through the virtio model. So it can theoretically use huge amounts of memory or a large vCPU count and still be a microVM.
We cloned VMware vSphere VMs in under 5 seconds 4 years ago. Relatively easy with proper storage integration and things like reflink copies on the storage.
Problem is the VM takes twice that time to boot so it's not as impressive ;-)
(yes, it's a different idea to the OP but still pretty neat)
This is a great post! Love that the CEO goes into such technical detail.
Looking forward to reading about the networking. That, I think, is also technically interesting, and it has been a challenge for us for a bit. Getting into VMs and lower-level topics like the kernel or Linux networking has been really fun for me. Weirdly, things feel much simpler the lower you go, for some reason. Probably less abstraction?
A bit of shameless self-promo: we are using Firecracker to create interactive onboarding for devs. We did one for Prisma.
We start a Firecracker clone when you visit the website. Everything you do happens in your Firecracker VM. You have access to the terminal and can play around with Prisma.
Not so new age... but on IRC, the evalbot in #bash used something similar with QEMU:
> Initially, Qemu is booted and its state is saved. On each evaluated command, this state is loaded (giving a usable shell in less than one second), a command is fed on stdin and the output read on stdout.
As the side-note in the article mentions, the core idea of "copy-on-write" has been around for ages. In context of virtualization, check out QEMU's "qcow2" format and its notion of "backing files" and "overlays".
A quick example to take offline, instantaneous "disk snapshots" (QEMU can do this for live VMs too): let's assume you already have a disk image of a clean Linux distro, call it _base.raw_. Then you can create an instantaneous "snapshot"[1] this way:
$> qemu-img create -f qcow2 -b ./base.raw -F raw overlay1.qcow2
[The "-F raw" specifies the file format of the backing file; it is good practice to mention this explicitly when creating overlay files.]
Once you do this and boot the VM with overlay1.qcow2, all new guest writes will go to overlay1.qcow2. And whenever the guest needs to refer to some old data, it is copied over from the backing file, base.raw, into the overlay1.qcow2 file. This lets you take a backup of the base image, or make more "snapshots" (overlays) based on it.
To take an instantaneous disk snapshot while the guest is running, refer to the docs here[2].
[1] The term "snapshot" here is a bit of a misnomer; it is actually called an "overlay", because the overlay file "refers" to its backing file, which becomes read-only once you create the overlay.
This is very cool! This technology will open the door to preview environments in CI/CD.
If you also have a versioned filesystem that can efficiently create lots of snapshots storing VM images differentially, you can introduce branchable/versioned environments for the whole backend and tie them to the repo commit hash.
Exactly! We're looking at the second use case, to have a VM running tied to the git history. Right now we do this on a branch level, so you could say "connect to branch X on microservice Y" from another microservice, and you can test APIs quickly. There's a lot of new things this enables!
Yes! We're watching cloud-hypervisor as well, it might also be more suitable for our use cases.
Yes, we did look at live migrations since there's a lot written about it and it's the closest to cloning a running VM. Lots of development in that space!
Dumb question. How is keeping a lambda warm different than just running a VM? Does the warm lambda instance respond quickly, and when it starts getting saturated, then additional Lambdas come online via a cold start process?
But what devoutsalsa said is true: you will still get a cold start when a request comes in while the already "warm" lambda is processing another request. This is one of the gotchas that not everyone seems to understand about these cloud functions. Sure, you'll scale quite horizontally, but just one executor at a time. If you have a long cold start, this can significantly increase latency. Some providers let you pay for concurrency, so that you can scale out more quickly.
What runtime are you using? A custom runtime can get cold start times in milliseconds, as long as you're not loading a large language runtime or a container. Try a custom runtime that has only a single statically linked binary in it and nothing else.
Very cool, snapshotting has come a long way in recent years.
I don't see anything about graphics in the article - could this approach also be used to clone a VM running a desktop window manager like Gnome or KDE? Or would that rely on GPU memory which is not included in the dumps?
I don't believe Firecracker currently supports GPU (latest what I saw about it is here: https://github.com/firecracker-microvm/firecracker/issues/11...). But I wouldn't be surprised if there's another MicroVM manager that would support GPU + snapshotting.
I'd like to see some of these techniques put into a directly usable project. This stuff seems a lot like where the VM world is heading, yet doing these things is still a lot harder than it'd need to be.
TBH having worked on live- and self-migration on top of Xen, as well as on MicroVMs at Bromium, the startup that coined the term, this feels like history repeating, just with Rust instead of C. I was once told by one of the Bromium founders that they tried to sell the MicroVM idea to AWS, but the AWS guys just took everything they learned in the meeting as inspiration and built Firecracker instead.
TL;DR: At a fundamental level, the idea is that most VMs don't change much relative to their large images, so you can lazily work only with diffs. Copy-on-write lets blocks of your new file point to blocks of the original, while maintaining a diff of the blocks that get changed.