Krunvm is extremely easy to use and is packed with some interesting ideas.
One of the biggest advantages of this VMM is that programs have access to the network inside the VM without the admin having to set up complex virtual bridges and the like on the host in advance, or use something like slirp. This is accomplished via TSI (Transparent Socket Impersonation).
Basically, sockets in the guest are bridged to AF_VSOCK by a patched Linux kernel (applied when you build libkrunfw.so) whenever they communicate outside the VM. See https://www.youtube.com/watch?v=EGV03THGrrw for more info on TSI.
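To make that concrete, here is a hypothetical guest-side sketch (not libkrun's actual code; the port number is made up) of the vsock primitive the intercepted traffic ends up riding on: the guest connects to the host's well-known CID, and the host side then makes the real TCP connection on its behalf.

    /* vsock_client.c - hypothetical sketch of the AF_VSOCK primitive that
     * TSI forwards guest traffic onto. The port (1234) is made up for
     * illustration; this is not libkrun's actual wire protocol.
     * Build with: cc vsock_client.c -o vsock_client */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <linux/vm_sockets.h>

    int main(void)
    {
        struct sockaddr_vm addr = {
            .svm_family = AF_VSOCK,
            .svm_cid    = VMADDR_CID_HOST, /* CID 2: the host/hypervisor side */
            .svm_port   = 1234,            /* made-up port for the example   */
        };
        int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

        if (fd < 0) {
            perror("socket(AF_VSOCK)");
            return 1;
        }
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            close(fd);
            return 1;
        }
        printf("connected to the host over vsock\n");
        close(fd);
        return 0;
    }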
My only concern is that TSI is not currently a feature available in mainline Linux. When do the authors plan to upstream it? My understanding is that this was planned for 2021, but it is now 2022...
Maybe instead of patching the kernel, the "init" process of the VM could set up a seccomp-notify sandbox to handle the socket syscalls in userspace, backing the TCP/UDP sockets with a vsock (I think read/write and sendmsg/recvmsg would work without userspace handling because they already operate on the vsock fd).
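A rough sketch of that idea (my own illustration, assuming libseccomp >= 2.5 and a kernel with seccomp user notification): trap only socket(), log it in a supervisor thread, and let it continue. A real shim would instead open an AF_VSOCK socket in the supervisor and inject it into the caller with SECCOMP_IOCTL_NOTIF_ADDFD.

    /* tsi_sketch.c - intercepting socket() with seccomp user notification
     * instead of a patched kernel. Only a sketch: the intercepted call is
     * logged and allowed to continue unchanged.
     * Build with: cc tsi_sketch.c -lseccomp -lpthread */
    #include <pthread.h>
    #include <seccomp.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
    #define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
    #endif

    /* Supervisor thread: receives one notification for socket() and lets
     * it continue. A TSI-style shim would open an AF_VSOCK socket here
     * and inject it with the SECCOMP_IOCTL_NOTIF_ADDFD ioctl instead. */
    static void *supervise(void *arg)
    {
        int nfd = *(int *)arg;
        struct seccomp_notif *req;
        struct seccomp_notif_resp *resp;

        seccomp_notify_alloc(&req, &resp);
        if (seccomp_notify_receive(nfd, req) == 0) {
            printf("socket(domain=%llu, type=%llu) intercepted\n",
                   (unsigned long long)req->data.args[0],
                   (unsigned long long)req->data.args[1]);
            resp->id = req->id;
            resp->error = 0;
            resp->val = 0;
            resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE; /* run it normally */
            seccomp_notify_respond(nfd, resp);
        }
        seccomp_notify_free(req, resp);
        return NULL;
    }

    int main(void)
    {
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
        pthread_t tid;
        int nfd, sock;

        /* Trap only socket(); everything else is allowed through. */
        seccomp_rule_add(ctx, SCMP_ACT_NOTIFY, SCMP_SYS(socket), 0);
        seccomp_load(ctx);
        nfd = seccomp_notify_fd(ctx);

        pthread_create(&tid, NULL, supervise, &nfd);

        /* This call blocks until the supervisor thread responds. */
        sock = socket(AF_INET, SOCK_STREAM, 0);
        printf("got fd %d back from the intercepted socket()\n", sock);

        pthread_join(tid, NULL);
        close(sock);
        return 0;
    }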
As far as I understand, any protocol that runs over IP (TCP, UDP, ICMP, ...) should be supported by slirp (e.g. slirp4netns), since it sets up a TAP device.
Mmmmmm, Rust and Go, working together. Shout out to the buildah team and the krunvm folks. This is very cool. The icing on the cake is the aarch64 support for Apple silicon. Kudos to you all. I can now run experimental Linux images as VMs and see if they'll blow up.
Since it's a VM, it's ideal for workloads with a fixed amount of resource use that need strong isolation guarantees. Regular containers are better for sharing a pool of resources whose usage varies widely, and when you don't need strong isolation. Depending on how I/O is handled, container I/O can be very slow, whereas a dedicated disk snapshot without CoW/overlays would be much faster. Since this uses TSI for networking, you need a patched guest kernel to get any networking in the guest, and raw sockets don't work at all.
Container file I/O is very slow. The runtime unpacks the OCI image layers onto the regular host filesystem, then adds overlay filesystems, does copy-on-write, and references files between each layer. For example, doing 10 containerized nodejs app builds simultaneously will swamp the host with iowait. A common hack is to put the OCI file tree / overlays on a dedicated disk with much higher iops than the boot disk.
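For reference, the layering described above boils down to an overlayfs mount roughly like the sketch below (all paths made up; in practice the container runtime issues the equivalent mount(2) with the read-only image layers as lowerdirs and a per-container writable upperdir):

    /* overlay_sketch.c - rough illustration of the overlay mount a container
     * runtime sets up. Paths are made up; needs CAP_SYS_ADMIN and existing
     * directories to actually run. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* Image layers are read-only lowerdirs; writes go to upperdir (CoW). */
        const char *opts =
            "lowerdir=/var/lib/layers/app:/var/lib/layers/base,"
            "upperdir=/var/lib/containers/c1/upper,"
            "workdir=/var/lib/containers/c1/work";

        if (mount("overlay", "/var/lib/containers/c1/merged",
                  "overlay", 0, opts) < 0) {
            perror("mount(overlay)");
            return 1;
        }
        printf("overlay mounted\n");
        return 0;
    }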
> It unpacks the OCI image layers onto the regular host filesystem, then adds overlay filesystems, does copy-on-write, and references files between each layer.
That's just Docker though, right? Does LXC or systemd-nspawn do that?
Thank you, I'll have to look into this. I was thinking from a file namespacing perspective there shouldn't be overhead, but it makes sense that adding the overlay filesystems and mounts would impact performance.
That depends on the microVM. Firecracker has no support for devices like GPUs, which is part of what makes it suitable for multitenant workloads. Something like QEMU has far more device support but is also significantly easier to escape from.
Very cool looking project. I love their approach to networking, which patches the Linux kernel to intercept socket operations and defer them to the host.
I’ve been working in a similar area recently and networking is an unfortunate stumbling block.
If it is OCI compatible, it technically means you could use Kubernetes or another container orchestrator to orchestrate these microVMs. I wonder if krunvm already works with Kubernetes.
If I'm understanding correctly, it's OCI compatible in the other direction - it consumes OCI compatible images, but it doesn't expose an OCI compatible layer on top for orchestration.
kube-virt[1] is a thing, though, that provides k8s orchestration for VMs. I don't see why you couldn't use krunvm microVMs with that.