It's always great to see different ways folks are using Firecracker. Snapshot-and-restore is a particularly cool capability, especially if you solve the data movement problems (like these folks have).
One challenge to clone-and-restore that they don't talk about here is making sure that clones don't behave too similarly (like returning the same cryptographic random numbers). We wrote a paper about that a while back (https://arxiv.org/abs/2102.12892), and the Linux kernel community has been doing some great work in that area recently too.
Does Firecracker not support virtio-rng? I won't comment on other uniqueness issues, but I would naively expect that you could fix random number generation by outsourcing it to the host. Or does Linux not pull from the provided RNG on every use, resulting in a gap right after restore where your per-VM RNG isn't unique? I suppose you could fix that by making the VM kernel aware that it was just restored? And now I see why it's not trivial :P
You could hotplug different hardware, but having a unique MAC address isn't very important if you're on a virtual network where you only talk to the host to get your traffic routed. A unique MAC is only important if you put the cloned VMs on the same network segment.
What do MAC addresses have to do with rng? Are you thinking of the old-style UUIDs that used the machine's MAC address and system time? What a terrible idea that was.
I assume not reusing a MAC address falls into the same bucket of "make sure the VMs are not too similar" rather than anything specific to random number generation.
Hit that exact bug with a customer at work in libvirt. Two machines booted at approximately the same time generated VMs with the same MAC, due to a very poor choice of random seed: XORing the boot time with the PID, which made it even less random.
It was since fixed though I never updated the bug.
src/util/virrandom.c:virRandomOnceInit seeds the random number generator using this formula:
unsigned int seed = time(NULL) ^ getpid();
This seems to be a popular method after a quick Google, but it's easy to see how it can be problematic. The time is only in seconds, and during boot of relatively identical systems these numbers are both likely to be similar across machines, which is quite likely in cloud-like environments. Secondly, a bitwise XOR of two correlated values creates only a small difference, so when the inputs line up it's easy to get colliding seeds.
Beyond the basic logic, I also tested this with a small test program: trying 67,921 unique combinations of time() and getpid() produced only 5,693 distinct seeds, using a PID range of 6799-6810 and a time() range of 1502484340 to 1502489999.
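That collapse is easy to reproduce. A quick Python sketch of the same experiment (the exact counts depend on the ranges chosen, so they may not match the numbers above exactly):

```python
# Simulate libvirt's old seeding formula: seed = time(NULL) ^ getpid().
# Roughly the same ranges as the test above: PIDs 6799-6810,
# times 1502484340-1502489999.
pids = range(6799, 6811)
times = range(1502484340, 1502490000)

seeds = {t ^ p for t in times for p in pids}
combos = len(pids) * len(times)
print(f"{combos} (time, pid) combinations -> {len(seeds)} distinct seeds")

# XOR cancels correlated low bits: bumping the clock by one second while
# the PID is one lower produces the exact same seed.
assert (1502484340 ^ 6801) == (1502484341 ^ 6800)
assert len(seeds) < combos  # far fewer distinct seeds than inputs
```

Seeding from a coarse clock and a small PID range means machines booted together share seeds; pulling from /dev/urandom instead avoids the problem entirely.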
Thanks for sharing the paper, that is super interesting. This is indeed a challenge. The fact that we run development workloads instead of production workloads makes this easier for our use case. We do some rehydration, and when forking across team/organization boundaries we don't clone if there are secrets set. But for production workloads this would not be ready yet.
Once uniqueness has been solved though, VM cloning would become a real solution for serverless hosting (and many other cases). Exciting prospect!
Could you share why you're seeing such a slowdown? I've mainly experienced a slowdown when loading pages for the first time, since new pages need to be page faulted into memory from disk first, but after those pages have been loaded into memory I don't experience any slowdown compared to fresh starts. That said, that's only based on how I've experienced it with the type of workloads that we run on the VMs.
It's the page fault+copy latency, together with some secondary effects from the page tables being updated (seems to briefly halt all cores). The actual copying of a page of RAM is almost free compared to the time spent in all the kernel code for a page fault.
If your RAM is file backed, you end up spending lots of time in the filesystem code too - I used anonymous mappings which really helped there, and called clone() on the VM process to keep them shared.
I suspect if you use huge pages you might see lots of the impact vanish, but obviously that has other downsides.
Right, that makes sense. Once the memory page is in memory it should be fast though. We use shared mapping right now, and practically the pages stay in memory during the lifetime of a VM once they've been loaded, but we need to do more testing when there's more memory pressure.
I've been looking at huge pages recently, I'm going to do some more testing with transparent huge pages today and see if it changes performance. Unfortunately we cannot use reserved huge pages because that doesn't work with shared mmap on say an XFS FS.
Another idea is to make clones use the same memory base layer of their parent, then the pages are already prefaulted and it would deduplicate overall memory usage. Many things to discover still..
Not to diminish this work, but I think it's worth noting that it's increasingly possible to launch new VMs extremely quickly too. FreeBSD/Firecracker can reach userland in 33 ms, and the OSv unikernel boots in under 10 ms.
I think increasingly we'll see Firecracker used with EC2-like setups of "create a disk image with everything preinstalled and then boot it" rather than using snapshots of running (suspended) VMs.
I’m kind of curious if AWS is ever going to launch a firecracker as a service thing independent from lambda. It would be wonderful for CI or other tasks where you want to rapidly spin up a box and you don’t know how long it needs to be up. EC2 and Fargate take enormous amounts of time to provision compared to firecracker.
From testing a couple years ago (things are likely different now), image pull/setup made a pretty noticeable difference. A 1GB container was about 20 seconds slower than a 500MB one -- I assume I/O, since Fargate instance size didn't make a difference.
On the other hand, ECS still seems slow compared to k8s, where things are nearly instant unless you're measuring closely, so ECS control-plane speed might be part of the issue, too.
This is still a thing, Fargate pull times are super slow: https://github.com/aws/containers-roadmap/issues/696. We run all of our workloads on fargate, and it's really annoying when you're trying to iterate on something and you have to sit there waiting on "Provisioning..." for 1-2 minutes every time you launch a task. I don't think the control plane is that slow, as EC2 based ECS launches tasks really fast if the images are already cached on the machine.
People have mentioned image loading but one other shockingly slow thing is allocating ENIs (this also affects Lambda, VPC endpoints, etc.). I've had a few times where I've looked at the logs and it's basically been like 5 minutes to launch something where 4 of those were waiting for the ENI.
I'd also like to see a Firecracker powered EC2 (with some constraints, of course), but ~6s provision time of current EC2 is already pretty awesome and TBH I don't care about 6s for CI things much.
We use Azure DevOps at work for our CI/CD, and although they provide an ephemeral runner setup (where you can run the agent with a --once flag, and it will exit after a single job runs so you know to destroy the container/VM), jobs will fail if there are no runners in the pool when the build starts. If we could get VM starts down to milliseconds or a second at most in AWS, we could scale our CI runners down to zero and use a webhook (for PR/commit) from ADO to trigger a VM launch on AWS, and by the time the pipeline actually started, there would be an agent ready to take the job.
A very specific use case, I know, but if I could have the CI runners run as needed, we could get instances that are way bigger so our builds run faster, and pay around the same amount since they don't have to sit around when they aren't being used.
Well, that's going to be a very expensive CI, when virt-lightning spawns a VM in less than 10 seconds with virtio, and you can have plenty on a dedicated server, which you probably have for CI anyway because CI runs faster on dedicated hardware.
I would love to see this as well. I currently can launch a Linux VM in milliseconds, but EC2 takes ~6s before the first user-provided instruction gets to run.
Worth noting that a small hello-world C unikernel can load in a ridiculously small amount of time, but some multi-gigabyte JVM unikernel might take 100s of ms.
If you need super fast boot times firecracker is definitely worth looking at but should be taken with caveats of what precisely you are going to run there.
I think you may be ignoring the aspect of cloning the codebase and handling writes transparently and then being able to quickly clone/snapshot that VM.
I'm very eager to see more developments in the fresh start times!
The main reason why snapshotting became interesting for us is that we're running development servers defined by our users. A development server could take a long time to start, sometimes minutes.
So even if we can start the VM fast, the most important speedup for us is on the user code that we cannot control.
Say the user code initiates a download, what happens if we clone during the run of the operation? Will the clone be able to finish the download?
The opposite case - say the user code binds to an IP:port to run a service. Will the clone try to step over the parent, binding to a port that is already taken?
The TCP connection gets "paused", it doesn't get broken but packets don't arrive. The packets that don't arrive are seen as packet loss, and so they get resent. If the connection stays frozen too long it will lead to disconnection (at least of the websocket connection to the VM).
For IP uniqueness, we give every VM the same IP, but we put every VM in its own network namespace. Then we have iptables rules to rewrite the src/dest IP on every packet that enters the network namespace.
You know I'm not sure... TCP is stream oriented and supposed to handle lost packets so I'd think the TCP layer itself would handle the pause. If the sender doesn't get an ACK for a packet then it'll resend that packet later (TCP has sequence numbers so the stream can be reconstructed from out-of-order delivery and resends).
I revisited my proof-of-concept test scripts when I wrote the previous comment. I'll try in the next week to add some additional tests in there to determine stream reliability and packet delay/loss.
UDP of course doesn't have the same benefits.
I'm using ECMP + Anycast in a project I've been developing for the last couple of years (K18S or Keep It Simples Stupids) to effectively replace Kubernetes functionality with standard protocols and tooling that is in almost all distros.
We started out with the challenge of replacing the major parts of CNIs and that is where the ECMP + Anycast work arose from.
Native IPv6 with only VLANs and direct routing (no messing about with IPv4, NAT or overlay networks), ECMP + Anycast gives load-balanced routing to pods with automatic detection of lost hosts. Pods exposed to public get public IPv6 address in addition to a ULA (Unique Local Address, formerly called site-local). ULAs used for private routing.
Systemd-networkd is configured automatically by systemd-nspawn so there doesn't need to be a massive, foreign, orchestration control system.
Systemd-nspawn/systemd-machined to manage container lifecycles with OCI-compliant images, or leverage nspawn's support for overlayfs to build machine images from several different file-system images (rather like Docker's layers, but always kept separate, never combined), which can be used in a pick-and-mix fashion to assemble a container that has several related but separately packaged components.
Configs for /etc/ of each container mapped in from external storage using the same overlayfs method. In most cases everything is read-only but some hosts/pods can be allowed to write into the /etc/ overlay and those changes can be optionally committed to the external storage.
Adopting IPV6 and dropping IPv4 was the best thing we ever did in terms of keeping things simple and straightforward and relying on the existing network protocols and layers, instead of re-inventing it all (badly).
At the time we started Kubernetes didn't even have IPv6 support and even once it did many CNIs couldn't handle it properly.
I know nothing about VMs or filesystems but I absolutely enjoyed this article. The language was very clear and easy to follow. Would be following the blog from now on.
I have a question about the copy-on-write example involving VM A and VM B. It says that VM B will directly use all the data from VM A, and for any change it copies the block, writes into it, and reads from it after that.
But what if, say, block 2 is changed by VM A and was never written to by VM B? Wouldn't VM B read the changed block 2? Clearly, it doesn't happen cause a fork is a copy, but an explanation of how this is tackled is appreciated!
Yes, and this is also the biggest challenge. Right now we use XFS to enable CoW, but that quickly leads to filesystem fragmentation. I'm still looking at a way that we can quickly let both VM A and VM B use the same base snapshot, and write new changes to either anonymous memory or a file.
So let's say VM A is already running, and it's cloned to VM B. When that happens, we freeze the memory of VM A, and link it to VM B. For both VM A and VM B, any new write will be done to a new layer.
So the logic is to check if VM A has a new fork. If yes, then start CoW to a new layer of blocks, and leave the current layer to be linked with VM B. If no, just don't use CoW.
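To make the layering concrete, here's a toy Python sketch (all names made up, nothing to do with the actual implementation): the base layer is frozen at fork time, and both the parent and the clone write to their own private overlays, so neither can see the other's changes.

```python
class CowDisk:
    """Minimal copy-on-write layer: reads fall through to a shared,
    frozen base unless this VM has written the block itself."""

    def __init__(self, base):
        self.base = base      # shared with all clones, treated as read-only
        self.overlay = {}     # this VM's private diff layer

    def read(self, block):
        return self.overlay.get(block, self.base[block])

    def write(self, block, data):
        self.overlay[block] = data  # never touches the shared base


base = ["b0", "b1", "b2"]                # frozen at fork time
vm_a, vm_b = CowDisk(base), CowDisk(base)

vm_a.write(2, "a-change")                # VM A diverges...
assert vm_b.read(2) == "b2"              # ...but VM B still sees the frozen block
assert vm_a.read(2) == "a-change"
```

This is why the question above doesn't bite: after the fork, VM A's writes also land in a fresh overlay, never in the shared base that VM B reads from.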
The "fork" feature pauses the current VM for cloning - does it mean that your environment can have unpredictable pauses because you never know when someone will press "Fork" and how often?
Yes it's a good point, this is one of the reasons that we wanted the fork time to be low. If we keep it low enough, the connection won't break and other users won't notice it. That said, for some things (like terminal access) it's impossible to hide it.
Practically, 99% of the forks will be done from the `main`/`master` branch of the repo, which is read-only for everyone on the team. So the mini-pause isn't breaking in those cases.
This sounds a lot like SnowFlock[1], a U of T project to fork Xen VMs from ~12 years ago. Are they related? Or is this an independent re-discovery of the same principles?
Came here to say this. I was a grad student at U of T working with some of the original SnowFlock authors back in 2009. (Although I had no hand in the development of SnowFlock itself, I did contribute to some of the follow up work.) Skimming through the article, it looks like the high level idea is the same.
The major difference seems to be that SnowFlock would start a proprietary server which is responsible for sending memory pages over the network on demand whenever the clone reads them. Some follow up work also added several different prefetching strategies to improve the performance of the cloned VMs while they were still fetching remote memory.
SnowFlock was really targeted at compute-heavy applications. The idea was that you could mostly set up your application in the single VM, clone it, and then after cloning, it could be fairly easy to configure the clones to continue working on the problem in parallel.
My Masters thesis made use of SnowFlock to clone relational databases on demand.
There's something seriously wrong with the page on Firefox. The whole browser locks up for a few seconds when you scroll up/down. Couldn't possibly read the article.
Works fine for me, Firefox on Ubuntu (not installed with snap). I use uBlock Origin and it seems to have blocked a few external URLs on that page, so it's probably some bad JS slowing you down.
I think this skipped over the downside of using Firecracker: the host underneath needs to be either bare metal (AWS) or a VM with nested virtualization support (GCP). This creates additional complexity in managing such a setup in production. Moreover, since both nested-virt VMs and bare metal come at a very high spec by default, the economics only make sense if you are at a scale that can saturate (or over-saturate, like the article hinted at) these resources.
Very interesting blog post nevertheless. Looking forward to reading more!
When QEMU saves a snapshot, it tries to be "smart" about memory, only saving the memory in use[1]. This trades off CPU at snapshot time for I/O at transfer time. How compatible is Firecracker's virtual memory subsystem with doing something like that?
In our case we changed Firecracker to use a shared mmap instead of a private mmap, so the dirtied pages were synced back automatically to the backing memory file. The main reason for this was to reduce I/O at snapshot time. I'm also looking at other ways we can do this, because using a shared mmap fragments the underlying XFS filesystem pretty fast. Maybe we can batch writes more instead of writing single pages.
It could be shared memory all the way through, but the memory of the original VM should become read-only once a cloned VM starts reading from it. So then both VMs (the original VM and the new VM) should put their writes in a new CoW layer.
Using XFS with CoW has been the easiest way to enable this, but if there's a way that we can do this purely in-memory, that would be even faster.
That said, for hibernation we would still have to persist to disk, but timing is less important there.
This is something I will look into! I'm thinking it could reduce start time because we have to copy the mem snap from disk to tmpfs, essentially loading it into memory, but I'm going to try this!
This is really cool. I've also been working with Firecracker, but for isolated CI runners with Docker and KinD/K3s support, starting with GitHub Actions [1]. I've also had interest in making OpenFaaS use pause/resume from Gatsby.js, who wanted to reduce their hosting costs. The main challenges were around the networking: if you use CNI and the Go SDK [2] then restores simply don't work. Not sure if you're working with netlink and IPAM directly to get around it?
My question is how are you guaranteeing uniqueness, or do you only clone snapshots for a single tenant? [3]
CodeSandbox is one of the most impressive engineering teams I'm aware of
For most of us who are consumers only of these more fundamental infrastructure projects, there's something deeply satisfying about seeing people push these boundaries (very appropriate for HN too). Fly is another similar team/blog
Not to take credit away, but interestingly enough, both of those heavily rely on Firecracker VM, which certainly solves a huge “fundamental” infrastructure problem.
Tangential to the topic: I look forward to the day that fast snapshotting and snapshot restoring becomes a thing for all VPS providers like hetzner, digital ocean, and vultr (and all the others).
Especially if a machine is snapshotted, restored, snapshotted again, restored, and the cycle continues. Even if what's stored doesn't get much larger, the subsequent snapshot+restore processes take a little longer each time. Each provider has different timelines, with Vultr saying it can take up to 60 minutes for a snapshot to restore.
My use case is similar but different to code sandbox. I use a beefy remote machine for development and to keep costs low I fire it up and tear it down on demand and pay only for the hours the machine was up. It works fine for me but I just wish snapshotting+restoring was faster on these services. That would make it perfect.
Because you have to pay for stopped instances. That’s why I snapshot and restore. It’s the same with all VPS providers (at least the reliable ones I know)
Yeesht. I had never tried this out with AWS so I had just completely made assumptions about their stopped instance pricing. Thanks for the correction. At the same time, the pricing is rough. Instances with similar configs cost anywhere between 3-7x the price. My current bill of between 2 to 3 USD a month would go up to about 15-25 USD. In absolute terms that's not big but over the months that is going to add up.
great to hear more details about snapshot/restore in the wild, plenty about firecracker but seemingly much less about this exciting feature in real usecases.
looking forward to the unwritten details / future posts too, particularly:
- How to handle network and IP duplicates on cloned VMs
and
- Turning a Dockerfile into a rootfs for the MicroVM (quickly)
Thank you! Right now our main use case is cloning development environments so we can provide a fresh running dev env for every branch and PR. However there are many other interesting applications, like speeding up CI jobs with VMs that start from a snapshot.
I'll make sure we write about the other topics as well. For the network, we run the VM in its own network namespace on the host, and we give every VM the same IP. We then use iptables rules to rewrite every incoming and outgoing packet to the IP that the host has assigned to the VM.
Yeah the CI case is really interesting, it’s generally reproducible and declarative so a good fit that way, and time waiting for things to start is a big deal in CI.
Another use case I was thinking of was stateful compilers like scala where warming up the compiler is expensive, often a CI task too.
Regarding turning Dockerfiles into a MicroVM: https://gruchalski.com/posts/2021-03-23-introducing-firebuil..., on GitHub: https://github.com/combust-labs/firebuild. This could get you started. Plenty of moving parts in that problem. Many root OS’s, many inits, … Difficult to pull this off by one person without any particular reason so I kinda suspended the project but who knows, seems like people want it so might be a good idea to reboot it.
The bit about turning a dockerfile into a rootfs. A docker image is just a tarball of tarballs. We do something like this:
- you can dump the image using `docker save <name>`.
- you can then get a list of the tarballs in this image by extracting this tarball and reading the file `manifest.json`; `Config` -> `Layers` will give you a list of tarballs (see undocker for how to do this: https://github.com/larsks/undocker)
- Untar these in a directory and use linux tools to convert this dir to a rootfs.
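Roughly, the layer-unpacking step looks like this in Python (a simplified sketch: it applies layers in manifest order but ignores whiteout files, the `.wh.*` entries that mark deletions):

```python
import json
import tarfile

def extract_rootfs(image_tar: str, dest: str) -> None:
    """Unpack a `docker save` tarball into a flat rootfs directory.
    Later layers simply overwrite earlier ones."""
    with tarfile.open(image_tar) as image:
        manifest = json.load(image.extractfile("manifest.json"))
        for layer_name in manifest[0]["Layers"]:
            # Each layer is itself a tar archive nested inside the image tar.
            with tarfile.open(fileobj=image.extractfile(layer_name)) as layer:
                layer.extractall(dest)
```

From there you still need to turn the directory into something bootable, e.g. write it into an ext4 image that the MicroVM mounts as its root disk.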
also interested in the upper limit of a micro vm, like how big can it get? 64gb memory? not really micro any more and maybe a traditional VM would be a better fit.
I have to imagine that Fargate on Windows doesn't use Firecracker though, right? Firecracker needs kernel level changes to work properly, and the open source version doesn't let you run anything but Linux.
I have no idea how it works under the hood. Knowing what I know about Firecracker from watching the publicly available videos, I was shocked and thought it would never happen.
On the other hand, CodeBuild has supported Windows containers for years and at least CodeBuild for Linux is based on Fargate, so the service team figured something out. (I had to figure out how to word that. I can’t say “they figured it out” since I work for the same company. But I couldn’t say “we” since I’m so far removed from any service team in the consulting department that it would be disingenuous)
The biggest VM we've been running for dev environments have 12GB RAM, 8vCPUs and 30GB disk. I've also done some tests with 16GB RAM and that worked well too. Have yet to find an upper limit.
Another (unrelated) test we've done is on overprovisioning memory. We were able to run 200 VMs (all running Vite dev server where a file was changed every second) with 2GB RAM per VM, on a node with 128GB RAM. Because we were mapping the memory files on disk directly to the VM, the VM would automatically "swap" the memory back to the memory file when it had memory pressure. The bottleneck here was CPU.
The "micro" in microvms is less about size and more about resources. A typical virtual machine under Xen or KVM (para)virtualizes a lot of hardware and emulates a lot of devices, so that the operating system sees it as a normal machine.
The microVM emulates the minimal possible set of devices needed to run, such as disks and network devices, in the specific case of Firecracker through the virtio model. So it can theoretically use huge amounts of memory or a large vCPU count and still be a microVM.
We cloned VMware vSphere VMs in under 5 seconds 4 years ago. Relatively easy with proper storage integration and things like reflink copies on the storage.
Problem is the VM takes twice that time to boot so it's not as impressive ;-)
(yes, it's a different idea to the OP but still pretty neat)
This is a great post! Love that the CEO goes into such technical detail.
Looking forward to reading about the networking. That, I think, is also technically interesting, and it has been a challenge for us for a bit. Getting into VMs and lower-level topics like the kernel or Linux networking has been really fun for me. Weirdly, things feel much simpler the lower you go, for some reason. Probably less abstraction?
A bit of shameless self-promo: we are using Firecracker to create interactive onboarding for devs. We did one for Prisma.
We start a Firecracker clone when you visit the website. Everything you do happens in your Firecracker VM. You have access to the terminal and can play around with Prisma.
Not so new age... but on IRC, the evalbot in #bash used something similar with QEMU:
> Initially, Qemu is booted and its state is saved. On each evaluated command, this state is loaded (giving a usable shell in less than one second), a command is fed on stdin and the output read on stdout.
As the side-note in the article mentions, the core idea of "copy-on-write" has been around for ages. In context of virtualization, check out QEMU's "qcow2" format and its notion of "backing files" and "overlays".
A quick example to take offline, instantaneous "disk snapshots" (QEMU can do this for live VMs too): let's assume you already have a disk image of a clean Linux distro, call it _base.raw_. Then you can create an instantaneous "snapshot"[1] this way:
$> qemu-img create -f qcow2 -b ./base.raw -F raw overlay1.qcow2
[The "-F raw" specifies the file format of the backing file; it is good practice to mention this explicitly when creating overlay files.]
Once you do this and boot the VM with overlay1.qcow2, all new guest writes will go to overlay1.qcow2. And whenever the guest needs to refer to some old data, it is copied over from the backing file, base.raw, into the overlay1.qcow2 file. This lets you take a backup of the base image, or make more "snapshots" (overlays) based on it.
To take an instantaneous disk snapshot while the guest is running, refer to the docs here[2].
[1] The term "snapshot" here is a bit of a misnomer; it is actually called an "overlay", because the overlay file "refers" to its backing file, which becomes read-only once you create the overlay.
This is very cool! This technology will open the door to preview environments in CI/CD.
If you also have a versioned filesystem that can efficiently create lots of snapshots storing VM images differentially, you can introduce branchable/versioned environments for the whole backend and tie them to the repo commit hash.
Exactly! We're looking at the second use case, to have a VM running tied to the git history. Right now we do this on a branch level, so you could say "connect to branch X on microservice Y" from another microservice, and you can test APIs quickly. There's a lot of new things this enables!
Yes! We're watching cloud-hypervisor as well, it might also be more suitable for our use cases.
Yes, we did look at live migrations since there's a lot written about it and it's the closest to cloning a running VM. Lots of development in that space!
Dumb question. How is keeping a lambda warm different than just running a VM? Does the warm lambda instance respond quickly, and when it starts getting saturated, then additional Lambdas come online via a cold start process?
But what devoutsalsa said is true: you will still get a cold start when a request comes in while the already "warm" lambda is processing another request. This is one of the gotchas that not everyone seems to understand about these cloud functions. Sure, you'll scale quite horizontally, but just one executor at a time. If you have a long cold start, this can significantly increase latency. Some providers let you pay for concurrency, so that you can scale out more quickly.
What runtime are you using? A custom runtime can get cold start times in milliseconds, as long as you're not loading a large language runtime or a container. Try a custom runtime that has only a single statically linked binary in it and nothing else.
Very cool, snapshotting has come a long way in recent years.
I don't see anything about graphics in the article - could this approach also be used to clone a VM running a desktop window manager like Gnome or KDE? Or would that rely on GPU memory which is not included in the dumps?
I don't believe Firecracker currently supports GPU (latest what I saw about it is here: https://github.com/firecracker-microvm/firecracker/issues/11...). But I wouldn't be surprised if there's another MicroVM manager that would support GPU + snapshotting.
I'd like to see some of these techniques put into a directly usable project. This stuff seems a lot like where the VM world is heading, yet doing these things is still a lot harder than it'd need to be.
TBH having worked on live- and self-migration on top of Xen, as well as on MicroVMs at Bromium, the startup that coined the term, this feels like history repeating, just with Rust instead of C. I was once told by one of the Bromium founders that they tried to sell the MicroVM idea to AWS, but the AWS guys just took everything they learned in the meeting as inspiration and built Firecracker instead.
TL;DR: At a fundamental level, the idea is that most VMs don't change much relative to their large images, so you can lazily work only with diffs. Copy-on-write lets blocks of your new file point to blocks of the original, while maintaining a diff of the blocks that get changed.