> I spoke to the GitHub Actions engineering team, who told me that using an ephemeral VM and an immutable OS image would solve the concerns.
That doesn't solve them all. The main problem is secrets: if a job has access to an API token that can modify your code or access a cloud service, a PR can abuse that token to modify things it shouldn't. A second problem is that even if no secrets are exposed, a PR can run a crypto miner and waste your money. Finally, a self-hosted runner is a foothold in your private network and can be used for attacks, which Firecracker can help mitigate but never eliminate.
The best solution to these problems is: 1) don't allow repos to trigger your CI unless the user is trusted or the change has been reviewed, 2) always use least privilege and zero trust for all access (yes, even for dev services), 3) apply basic constraints by default to all running jobs to prevent misuse, and finally 4) provide strong isolation in addition to ephemeral environments.
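A minimal sketch of point 1, assuming a small gate that runs before a job is handed to a self-hosted runner. The GitHub REST endpoints used here exist; `should_run`, the token handling, and the surrounding dispatch plumbing are hypothetical:

```python
import os
import requests

API = "https://api.github.com"
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def is_collaborator(owner, repo, user):
    # GitHub returns 204 if the user is a collaborator, 404 otherwise.
    r = requests.get(f"{API}/repos/{owner}/{repo}/collaborators/{user}", headers=HEADERS)
    return r.status_code == 204

def has_approved_review(owner, repo, pr_number):
    r = requests.get(f"{API}/repos/{owner}/{repo}/pulls/{pr_number}/reviews", headers=HEADERS)
    r.raise_for_status()
    return any(review["state"] == "APPROVED" for review in r.json())

def should_run(owner, repo, pr_number, author):
    # Only hand the job to a self-hosted runner if the author is trusted
    # or the change has already been reviewed and approved.
    return is_collaborator(owner, repo, author) or has_approved_review(owner, repo, pr_number)
```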
Facebook had this exact problem recently with the pytorch repo. Their self hosted CI runners would run on all PRs, and it could leak all sorts of stuff.
It’s written specifically to host the Linux kernel, and doesn’t use a BIOS or a boot loader. If you backported that into another hypervisor, it would probably have to be something like “are we loading a compatible Linux? If so, switch to Firecracker mode”. But of course you can do that yourself, with a small shell script that either starts the traditional VM or Firecracker.
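A rough sketch of that "small shell script" idea, written in Python for illustration: if the image is an uncompressed ELF vmlinux (which is what Firecracker boots directly on x86_64), start Firecracker; otherwise fall back to a traditional QEMU VM. The paths, config file, and QEMU flags are placeholders:

```python
import subprocess
from pathlib import Path

def is_vmlinux(path):
    # Firecracker on x86_64 boots an uncompressed ELF vmlinux directly,
    # with no BIOS or boot loader, so check for the ELF magic bytes.
    with open(path, "rb") as f:
        return f.read(4) == b"\x7fELF"

def start_vm(kernel_or_image, firecracker_config, disk_image):
    if is_vmlinux(kernel_or_image):
        # Firecracker accepts a JSON machine config via --config-file.
        subprocess.run(["firecracker", "--config-file", str(firecracker_config)], check=True)
    else:
        # Fall back to a full VM with a BIOS/boot loader (flags illustrative).
        subprocess.run(
            ["qemu-system-x86_64", "-enable-kvm", "-m", "2048",
             "-drive", f"file={disk_image},format=raw"],
            check=True,
        )
```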
> The recommended way to trigger a guest-initiated shut down is by generating a triple-fault, which will cause the VM to initiate a reboot
Doesn’t that mean it can’t distinguish an intentional triple fault to trigger a reboot from an accidental triple fault caused by a guest kernel bug that corrupts the IDT? I think it would be better if there were some kind of call the guest could make to the hypervisor to reboot; one is less likely to invoke that service by accident than to triple fault by accident.
I'm used to QEMU VMs being slow and annoying to work with due to them being full VMs, so I was quite surprised to see that this is really just as fast as Firecracker!
Hi, I'd not heard of webapp.io before so thanks for mentioning it.
Actuated is not a preview branch product, that's an interesting area but not the problem we're trying to solve.
Actuated is not trying to be a CI system, or a replacement for one like webapp.io.
It's a direct integration with GitHub Actions, and as we get interest from pilot customers for GitLab etc., we'll consider adding support for those platforms too.
Unopinionated, without lock-in. We want to create the hosted experience, with safety and speed built in.
Hey, yeah this looks somewhat similar to what we're building at https://webapp.io (née LayerCI, YC S20)
We migrated to a fork of firecracker, but we're a fully hosted product that doesn't directly interact with GHA at all (similar to how CircleCI works), so there's some positioning difference between us and OP at the very least.
Something I've increasingly wondered is whether the model of CI where a totally pristine container (or VM) gets spun up for each change and each test set imposes a floor on how fast CI can run.
Each job will always have to run a clone, always pay the cost of either bootstrapping a toolchain or downloading a giant container with the toolchain, and always have to download a big remote cache.
If I had infinity time, I'd build a CI system that routed each job to a test runner that maintained some state (gasp!) about the build: most of the local build cache already downloaded, the source code cloned, and the toolchain bootstrapped.
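A sketch of that stateful-runner idea, assuming each runner keeps a persistent workspace between jobs; the paths and repo URL are placeholders:

```python
import subprocess
from pathlib import Path

WORKSPACE = Path("/var/ci/workspace/myrepo")   # persists across jobs (placeholder path)
REPO_URL = "https://github.com/example/myrepo.git"  # placeholder repo

def checkout(ref):
    if (WORKSPACE / ".git").exists():
        # Warm runner: fetch only what's new instead of re-cloning everything.
        subprocess.run(["git", "-C", str(WORKSPACE), "fetch", "origin", ref], check=True)
    else:
        # Cold runner: pay the full clone cost once.
        WORKSPACE.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(["git", "clone", REPO_URL, str(WORKSPACE)], check=True)
        subprocess.run(["git", "-C", str(WORKSPACE), "fetch", "origin", ref], check=True)
    subprocess.run(["git", "-C", str(WORKSPACE), "checkout", "FETCH_HEAD"], check=True)
    # The build then runs incrementally on top of whatever cache and toolchain
    # the runner already has on disk.
```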
You'd love a service like that, until you have some weird stuff working in CI but not locally (or vice versa). That's why things are built from scratch all the time: to prevent any such issues from happening.
Npm was (still is?) famously bad at installing dependencies, where sometimes the fix is to remove node_modules and simply reinstall. Back when npm was more brittle (yes, that's possible) it was nearly impossible to maintain caches of node_modules directories, as they ended up being different from what you'd get by reinstalling with no existing node_modules directory.
I think Nix could be leveraged to resolve this. If the dependencies aren't perfectly matched it downloads the _different_ dependencies, but can reuse anything it has already downloaded locally.
So infra concerns are identical. Remove any state your application itself uses (clean slate, like a local DB), but your VM can functionally be persistent (perhaps you shut it off when not in use to reduce spend)?
I mean, given that my full build takes hours but my incremental build takes seconds--and given that my build system itself tends to only mess up the incremental build a few times a year (and mostly in ways I can predict), I'd totally be OK with "correctness once a day" or "correctness on demand" in exchange for having the CI feel like something that I can use constantly. It isn't like I am locally developing or testing with "correctness each and every time", no matter how cool that sounds: I'd get nothing done!
This really depends a lot on context and there's no right or wrong answer here.
If you're working on something safety critical you'll want correctness every time. For most things short of that it's a trade-off between risk, time, and money—each of which can be fungible depending on context.
A small change in a dependency essentially bubbles, or chains, to all dependent steps. I.e., a change to the fizzbuzz source means we must run the fizzbuzz tests. This cascades into your integration tests: we must run the integration tests that include fizzbuzz, but those now need all the other components involved. So it bubbles or chains out to all reverse dependencies (i.e., we need to build the bazqux service, since it is in the integration test with fizzbuzz), and now I'm building a large portion of my dependency graph.
And in practice, to keep the logic in CI reasonably simple … the answer is "build it all".
(If I had better content-aware builds, I could cache them: I could say, ah, bazqux's source hashes to $X, and we already have a build for that hash, excellent. In practice, this is really hard. It would work if all of bazqux were limited to some subtree, but inevitably one file decides to include some source from outside the spiritual root of bazqux, and now bazqux's hash is effectively "the entire tree", which by definition we've never built.)
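A sketch of that content-hash cache, under the optimistic assumption that each component's inputs really are confined to one subtree (which, as noted above, is exactly the hard part); the cache location and the `build` callback are made up:

```python
import hashlib
from pathlib import Path

CACHE = Path("/var/ci/artifact-cache")  # placeholder shared artifact store

def tree_hash(root):
    """Hash every file path + contents under root, in a stable order."""
    h = hashlib.sha256()
    for p in sorted(Path(root).rglob("*")):
        if p.is_file():
            h.update(str(p.relative_to(root)).encode())
            h.update(p.read_bytes())
    return h.hexdigest()

def cached_build(component_dir, build):
    """Return a cached artifact for this source tree, building only on a miss."""
    component_dir = Path(component_dir)
    key = tree_hash(component_dir)
    artifact = CACHE / component_dir.name / key
    if not artifact.exists():
        artifact.parent.mkdir(parents=True, exist_ok=True)
        build(component_dir, artifact)   # user-supplied build step; hypothetical signature
    return artifact
```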
I work in games; our repository is ~100 GB (20m download) and a clean compile takes 2 hours on a 16-core machine with 32 GB of RAM (c6i.4xlarge, for any AWS friends). Actually building a runnable version of the game takes two clean compiles (one editor and one client) plus an asset-processing task that takes about another 2 hours from clean.
Our toolchain install takes about 30 minutes (although that includes making a snapshot of the EBS volume to make an AMI out of it).
That's ~7 hours for a clean build.
We have a somewhat better system than this - our base AMI contains the entire toolchain, and we do an initial clone on the AMI to get the bulk of the download done too. We store all the intermediates on a separate drive and we just mount it, build incrementally, and unmount again. Sometimes we end up with duplicated work, but overall it works pretty well. Our full builds are down from 7 hours (in theory) to about 30 minutes, including artifact deployments.
This is how CI systems have always behaved traditionally. Just install a Jenkins agent on any computer/VM and it will maintain a persistent workspace on disk for each job to reuse in incremental builds. There are countless other tools that work the same way. This also solves the problem of isolating builds, if your CI only checks out the code and then launches a constrained Docker container to execute the build. It can easily be extended to use persistent network disks and scaled-up workers, but that is usually not worth the cost.
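A sketch of that pattern, assuming the agent has already checked out the code into a persistent workspace and then runs the build in a throwaway, constrained container; the image name and resource limits are placeholders:

```python
import subprocess

WORKSPACE = "/var/jenkins/workspace/myjob"   # persists between runs (placeholder)

def run_build():
    subprocess.run([
        "docker", "run", "--rm",
        "--network=none",            # no network: the build uses the workspace's existing deps
        "--cpus=2", "--memory=4g",   # basic resource constraints on the job
        "-v", f"{WORKSPACE}:/src",   # mount the reused checkout and incremental build dir
        "-w", "/src",
        "build-image:latest",        # placeholder toolchain image
        "make", "build",
    ], check=True)
```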
It's baffling to see this new trend of YAML actions running on pristine workers, re-downloading the whole npm universe from scratch on every change, birthing hundreds of startups trying to "solve" CI by presenting solutions to non-problems and then wrapping things in even more layers of lock-in and micro-VMs, detaching you from the integration.
While Jenkins might not be the best tool in the world, the industry needs a wake-up call on how to simplify and keep in touch with reality, not hide behind layers of SaaS abstractions.
Agreed, this is more or less the inspiration behind Depot (https://depot.dev). Today it builds Docker images with this philosophy, but we'll be expanding to other more general inputs as well. Builds get routed to runner instances pre-configured to build as fast as possible, with local SSD cache and pre-installed toolchains, but without needing to set up any of that orchestration yourself.
- Watch which files are read (at the OS level) during each step, and snapshot the entire RAM/disk state of the MicroVM
- When you next push, just skip ahead to the latest snapshot
In practice this makes a generalized version of "cache keys" where you can snapshot the VM as it builds, and then restore the most appropriate snapshot for any given change.
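A sketch of that snapshot-selection logic, assuming the tooling has already recorded which files each step read; the data structures here are made up for illustration:

```python
import hashlib
from pathlib import Path

def file_digest(path):
    p = Path(path)
    if not p.exists():
        return None   # a deleted input invalidates any snapshot that read it
    return hashlib.sha256(p.read_bytes()).hexdigest()

class Snapshot:
    def __init__(self, snapshot_id, step, reads):
        self.snapshot_id = snapshot_id  # handle to the saved RAM/disk state (hypothetical)
        self.step = step                # how far through the pipeline this snapshot got
        self.reads = reads              # {relative path: digest} of files read up to this step

    def still_valid(self, workspace):
        # Resumable only if nothing the snapshotted steps read has changed.
        return all(file_digest(Path(workspace) / p) == d for p, d in self.reads.items())

def best_snapshot(snapshots, workspace):
    """Pick the furthest-along snapshot whose recorded inputs are unchanged."""
    valid = [s for s in snapshots if s.still_valid(workspace)]
    return max(valid, key=lambda s: s.step, default=None)
```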
I have zero experience with Bazel, but I believe it offers mechanisms similar to this? Or a mechanism that makes this "somewhat safe"?
Yes it does, but one should be warned that adopting Bazel isn't the lightest decision to make. But yeah, the CI experience is one of its best attributes.
We are using Bazel with GitHub self-hosted runners, and have consistently low build times with a growing codebase and test suite, as Bazel will only re-build and re-test what is affected by a change.
The CI experience, compared to e.g. doing naive caching of some directories with GitHub-managed runners, is amazing, and it's probably the most reliable build/test setup I've had. The most common failure we see in the build system itself (which is still rare, at ~once a week) is network issues with one of the package managers, rather than quirks introduced by one of the engineers (and there would be a straightforward path to preventing those failures, we just haven't bothered to set that up yet).
> Each job will always have to run a clone, always pay the cost of either bootstrapping a toolchain or downloading a giant container with the toolchain, and always have to download a big remote cache.
Couldn’t this be addressed if every node had a local caching proxy server container/VM, and all the other containers/VMs on the node used it for Git checkouts, image/package downloads, etc?
I'm using Buildkite, which lets me run the workers myself. These are long-lived Ubuntu systems set up with the same code we use in dev and production, running all the same software dependencies. Tests are fast and it works pretty nicely.
Self-hosted runners are brilliant, but have a poor security model for running containers or building them within a job. Whilst we're focusing on GitHub Actions at the moment, the same problems exist for GitLab CI, Drone, Bitbucket and Azure DevOps. We explain why in the FAQ (link in the post).
That is a benefit over DIND and socket sharing, however it doesn't allow for running containers or K8s itself within a job. Any tooling that depends on running "docker" (the CLI) will also break or need adapting.
Good article. Firecracker is something that has definitely piqued my interest when it comes to quickly spinning up a throwaway environment for either development or CI. I run a CI platform [1], which currently uses QEMU for the build environments (Docker is also supported but currently disabled on the hosted offering). Startup times are OK, but having a boot time of 1-2s is definitely highly appealing. I will have to investigate Firecracker further to see if I could incorporate it into what I'm doing.
Julia Evans has also written about Firecracker in the past [2][3].
Thanks for commenting, and your product looks cool btw.
Yeah a lot of people have talked about Firecracker in the past, that's why I focus on the pain and the problem being solved. The tech is cool, but it's not the only thing that matters.
People need to know that there are better alternatives to sharing a docker socket or using DIND with K8s runners.
Firecracker is nice but still very limited in what it can do.
My gripe with all CI systems is that, as an industry standard, we've universally sacrificed performance for hermeticity and re-entrancy, even when it doesn't really give us a practical advantage. Downloading and re-running containers and VMs, endlessly checking out code, and installing deps over and over is just a waste of time, even with caching, COW, and other optimizations.
> My gripe with all CI systems is that, as an industry standard, we've universally sacrificed performance for hermeticity and re-entrancy, even when it doesn't really give us a practical advantage.
The perceived practical advantage is the incremental confidence that the thing you built won't blow up in production.
> even with caching, COW, and other optimizations
Many CI systems do employ caching. For example, Circle.
Hermeticity is precisely what allows you to avoid endlessly downloading and building the same dependencies. Without hermeticity you can't rely on caching.
I feel like 90% of the computer industry is ignoring the lessons of Bazel and is probably going to wake up in 10 years and go "ooooooh, that's how we should have been doing it".
I think everyone agrees that the Bazel/Nix approach is correct, the problem is that Bazel/Nix/etc are insanely hard to use. For example, I spent a good chunk of last weekend trying to get Bazel to build a multiarch Go image, and I couldn't figure it out. Someone needs to figure out how to polish Bazel/Nix so they're viable for organizations that can't invest in a team to operate and provide guidance on Bazel/Nix/etc.
I’ve used Pants professionally and that was possibly the worst of the three in my experience. Support across build tools varies by language, but I didn’t get the impression that Pants was head-and-shoulders above other tools for any language ecosystem.
Can you elaborate on some of the lessons of Bazel? I've only just heard of it recently, and while I'm intrigued, my impression is this is similar to Facebook writing their own source control: different problems at massive scale. Can a SMB (~50 engineers) benefit from adopting Bazel?
> Can a SMB (~50 engineers) benefit from adopting Bazel?
We are ~8 engineers, and yes, definitely. However there should be good buy-in across the team (as it can be quite invasive), and depending on your choice of languages/tooling the difficulty of adoption may greatly vary.
I was the one who introduced Bazel to the company, and across my ~80 weeks there I spent maybe ~4 weeks on setting up and maintaining Bazel.
I don't know about your current setup and challenges you have with your CI system. However, compared to the generic type of build system I've seen at companies of that size, I would estimate that with 50 engineers having a single build systems/developer tooling engineer focused on setting up and maintaining Bazel should easily have a positive ROI (through increased development velocity and less time wasted on hunting CI bugs alone).
If you're doing golang in a large monorepo, in a company of 1000+ engineers, then maybe. If you're a mobile dev in a similar-sized company, then also maybe. If you have devops resources and SREs and dedicated personnel that understand Bazel, then maybe.
Personally I wouldn't touch it with a ten-foot pole. It's an opinionated task runner, with terrible docs, that will just hurt your dev process if you don't configure it correctly.
Yes absolutely (depending on what you do exactly).
The core idea behind Bazel is to make build steps truly hermetic, so you know exactly what inputs they are using. This means you can rely on caching, incremental builds, distributed builds and so on.
I'm sure if you've had any experience with Make or similar systems you've encountered "a clean build fixed it". The root cause of that is a mistake somewhere in your build system where you forgot to declare a dependency on something; it just happens to work most of the time, but then one time the dependency changes, Make doesn't know it has to rebuild some stuff, and the build breaks.
That's basically why almost everyone's CI system builds everything from scratch every time. Nobody trusts incremental builds.
Bazel goes to great lengths to make it so that you have to declare dependencies, otherwise you simply can't access them. That includes:
* Cleaning environment variables for build steps
* Running build steps in sandboxes
* Storing intermediate artefacts in random directories
* Including tools themselves (e.g. compilers) as part of the dependency tree
Honestly it's not a perfectly hermetic environment, e.g. your build steps can still read the current time, RNGs, etc. so you can still have indeterminacy in your build, but it goes a lot further than anything else.
So ultimately the upside is that you can do things like have CI only build and test things that possibly could have been affected by a change. Fix a typo in a README? It won't have to build or test anything.
My current company spends 300 compute-hours and 2-6 wall-clock hours on every CI run, even for fixing doc typos. Bazel can prevent that.
There are downsides though - Bazel was the first system to do this, so it has rough edges. And all that sandboxing means there is extra effort to make debugging and IDE integration work. Also, because it is super conservative about rebuilding things, it can sometimes rebuild even when it doesn't need to. So I probably wouldn't use it on really small projects, like ones where CI time is under 10-20 minutes anyway.
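A toy version of the "only build and test what a change could affect" idea (the concept behind querying reverse dependencies, not Bazel's actual implementation): walk the reverse-dependency graph from the changed targets and run only what is reachable.

```python
from collections import defaultdict, deque

def affected_targets(deps, changed):
    """deps: {target: [targets it depends on]}; changed: set of edited targets.
    Returns every target that transitively depends on something that changed."""
    rdeps = defaultdict(set)
    for target, its_deps in deps.items():
        for d in its_deps:
            rdeps[d].add(target)

    affected, queue = set(changed), deque(changed)
    while queue:
        t = queue.popleft()
        for dependent in rdeps[t]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected

# Hypothetical targets: fixing a typo in //docs:readme triggers nothing else.
deps = {
    "//app:server": ["//lib:core"],
    "//lib:core": [],
    "//docs:readme": [],
}
print(affected_targets(deps, {"//docs:readme"}))  # {'//docs:readme'}
print(affected_targets(deps, {"//lib:core"}))     # {'//lib:core', '//app:server'}
```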
There are a load of newer build systems that use the same idea: Pants, Buck, Pants 2, Please.build, etc. But obviously they don't have the momentum of Bazel.
I guess my issue is not hermeticity but ephemeral containers and the stateless approach to things. Maybe a "one-way" hermeticity... a semi-permeability, if you will.
for the frontend space, Nx gives you Bazel-like caching / features. It just doesn't cache dependencies, but in my experience with GitHub Actions, running `pnpm install` or `yarn install` is not the slowest operation; it's running the tools afterwards.
Honestly, I've never missed the shared mutable environment approach one bit. It might have been marginally faster, but I'd trade a whole bunch of performance for consistency (and the optimizations mean there's not much of a performance difference). Moreover, most of the time spent in CI is not container/VM overhead, but rather crappy Docker images, slow toolchains, slow tests, etc.
When you say it's limited in what it can do, what are you comparing it to? And what do you wish it could do?
Fly has a lot of ideas here, and we've also been able to optimize how things work in terms of downloads. As for boot-up speed, it's less than 1-2s before a runner is connected.
There isn't any NixOS-like tooling that isn't incredibly burdensome. I think Nix and NixOS have the right vision, but there's way too much friction for most orgs to use them. Containers are imperfect, but they're way easier to work with.
Oh yeah, I use it - I get it, lol. But as someone who uses NixOS: for all its flaws, the community is also quite passionate and pushes out quite a few features, ideas, etc. There are little experiments in all aspects of the ecosystem.
I'm just kinda surprised some Docker-esque thing hasn't stuck. Something that works with Docker, but transforms it to all the advantages of NixOS.
CI pipelines are just so rough and repetitive in plain Docker, which is what we use.
> for all its flaws, the community is also quite passionate and pushes out quite a few features, ideas, etc.
This hasn't been my experience. There have been significant issues with Nix since its inception and very little progress has been made. Here are a few off the top of my head:
* The Nix expression language is dynamically typed and there are virtually no imports that would point you in the right direction, so it's incredibly difficult to figure out what kind of data a package requires (you typically have to find the call site and recurse backwards to figure out what kind of data is provided, or follow the data down the call stack [recurse forwards], just to discern the 'type' of the data).
* The nix expression language is really hard to learn. It's really unfamiliar to most developers, which is a big deal because everyone in an organization that uses Nix has to interface with the expression language (it's not neatly encapsulated such that some small core team can worry about it). This is an enormous cost with no tangible upside.
* Package defs in nixpkgs are horribly documented.
* Nixpkgs is terribly organized (I think there is finally some energy around reorganizing, but I haven't discerned any meaningful progress yet).
I can fully believe that the community is responsive to improvements in individual packages, but there seems to be very little energy/enthusiasm around big systemic improvements.
> I'm just kinda surprised some Docker-esque thing hasn't stuck. Something that works with Docker, but transforms it to all the advantages of NixOS.
Using something like Nix to build Docker images is conceptually great. Nix is great at building artifacts efficiently and Docker is a great runtime. The problem is that there's no low-friction Nix-like experience to date.
It sounds like your issues with Nix stem from its steep adoption curve, rather than any technical concern. This _is_ a concern for a team that needs to manage it - I agree.
I'm quite diehard in terms of personal Nix/NixOS use, but I hesitate to recommend to colleagues as a solution because the learning curve would likely reduce productivity for quite some time.
That said - I do think that deterministic, declarative package/dependency management is the proper future, especially when it comes to runtime environments.
> It sounds like your issues with Nix stem from its steep adoption curve, rather than any technical concern
Not only is it difficult to learn (although that's a huge problem), but it's also difficult to use. For instance, even once you've "learned Nix", inferring data types is an ongoing problem because there is no static type system. These obstacles are prohibitive for most organizations (because of the high-touch nature of build tooling).
> This _is_ a concern for a team that needs to manage it
The problem is that there isn't "one team that needs to manage it"; every team needs to touch the build definitions or else you're bottlenecking your development on one central team of Nix experts which is also an unacceptable tradeoff. If build tools weren't inherently high-touch, then the learning curve would be a much smaller problem.
Sorry, I wasn't clear - I wasn't implying there should be a central team to manage it. One of the beauties of Nix is providing declarative dev environments in repositories, which means to fully embrace it each individual team should own it for themselves.
At best a central team would be useful for managing an artifactory/cache + maybe company-wide nixpkgs, but in general singular teams need to decide for themselves if Nix is helpful + then manage it themselves.
Agreed. It's just that when every team has to own their stuff, usability issues become a bigger problem and afaict the Nix team is not making much progress on usability (to the extent that it seems like they don't care about the organizational use case--as is their prerogative).
Firecracker is very cool, I wish/hope tooling around it matures enough to be super easy. I'd love to see the technical details on how this is run. It looks like it's closed source?
The need for baremetal for Firecracker is a bit of a shame, but it's still wicked cool. (You can run it on a DO droplet but nested virtualization feels a bit icky?)
I run a CI app myself, and have looked at Firecracker. Right now I'm working on moving some compute to Fly.io and its Machines API, which is well suited to on-demand compute.
We're running a pilot and looking for customers who want to make CI faster for public or self-hosted runners, want to avoid the side-effects and security compromises of DIND / sharing a Docker socket, or need to build on ARM64 for speed.
The article does not say what a MicroVM is. From what I can gather, it's using KVM to virtualize specifically a Linux kernel. In this way, Firecracker is somewhat intermediate between Docker (which shares the host kernel) and Vagrant (which is not limited to running Linux). Is that accurate?
Is it possible to use a MicroVM to virtualize a non-Linux OS?
It is, but is also a very low-level tool, and there is very little support around it. We've been building this platform since the summer and there are many nuances and edge cases to cater for.
But if you just want to try out Firecracker, I've got a free lab listed in the blog post.
I hear Podman desktop is also getting some traction, if you have particular issues with Docker Desktop.
Hey thanks for the feedback. We may do some more around this. What kinds of things do you want to know?
To get hands-on, you can run my Firecracker lab that I shared in the blog post; adding a runner can then be done with "arkade system install actions-runner"
Not the poster you were replying to, but I've looked at your Firecracker init lab (cool stuff!) and I'm just wondering how that fits together with a control plane. It would be cool to see how the orchestration happens in terms of messaging between host/guest and how I/O is provisioned on the host dynamically.
Wondering if it would be possible to run macOS. The hosted runners of GitHub Actions for macOS are really, really horrible; our builds easily take 2x to 3x more time than on the hosted Windows and Linux machines.
The interesting part of this is that the client supplies the most difficult resource to get for this setup. As in, a machine on which Firecracker can run.
Users provide a number of hosts and run a simple agent. We maintain the OS image, Kernel configuration and control plane service, with support for ARM64 too.
Great stuff, undeniably. There's not much going on in the open-source space around multi-host scheduling for Firecracker. So that's a mountain of work.
With regards to the host, I made that remark because of Firecracker's requirements around virtualisation. Running Firecracker is a no-brainer when an org maintains a fleet of its own hardware.
Hm. Well, assuming they have no network and don't otherwise encrypt bits that an attacker could get ahold of, it's probably fine.
The bigger issue would be something like spawning a bunch of servers that share the same RNG state, which can then be manipulated by an attacker (and therefore encrypt different data with the same key+nonce and such).
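A small illustration of that failure mode, using Python's userspace RNG as a stand-in for the cloned entropy state of a snapshotted VM: two clones restored from the same snapshot produce the same "random" nonce unless they re-seed from fresh entropy after restore.

```python
import copy
import os
import random

# Simulate snapshotting a VM whose RNG state is captured in the image.
snapshot_rng = random.Random(1234)
clone_a = copy.deepcopy(snapshot_rng)
clone_b = copy.deepcopy(snapshot_rng)

# Both restored clones draw the same value, so a key+nonce derived from it repeats.
assert clone_a.getrandbits(96) == clone_b.getrandbits(96)

# Mitigation: mix fresh entropy back in after restore (what virtio-rng / guest
# reseeding is for in real VMs), so the clones diverge.
clone_b.seed(os.urandom(32))
assert clone_a.getrandbits(96) != clone_b.getrandbits(96)  # differs with overwhelming probability
```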
It seems like BuildJet is competing directly with GitHub on price (GitHub now has bigger runners available, pay per minute), and GitHub will always win because Microsoft owns both GitHub and Azure, so I'm not sure what their USP is, and I worry they will get commoditised and then lose their market share.
Actuated is hybrid, not self-hosted. We run actuated as a managed service and scheduler, you provide your own compute and run our agent, then it's a very hands-off experience. This comes with support from our team, and extensive documentation.