Making EC2 boot time faster

jedberg · 2024-05-23T18:39:59 1716489599

Boot time is the number one factor in your success with auto-scaling. The smaller your boot time, the smaller your prediction window needs to be. Ex. If your boot time is five minutes, you need to predict what your traffic will be in five minutes, but if you can boot in 20 seconds, you only need to predict 20 seconds ahead. By definition your predictions will be more accurate the smaller the window is.

But! Autoscaling serves two purposes. One is to address load spikes. The other is to reduce costs with scaling down. What this solution does is trade off some of the cost savings by prewarming the EBS volumes and then paying for them.

This feels like a reasonable tradeoff if you can justify the cost with better auto-scaling.

And if you're not autoscaling, it's still worth the cost if the trade off is having your engineers wait around for instance boots.

fhuici · 2024-05-24T06:41:51 1716532911

Fully agree, doing reactive autoscaling when the actual boot time is slow is an inherently hard problem. We've done years of research into building specialized VMs (unikernels) and fast controllers to be able to provide infra that allows VMs/containers to cold start, and thus autoscale/scale to zero in milliseconds (eg, a simple Node app cold starts in ~50 ms). If interested, you can try it out at kraft.cloud, or check out info about the tech in our blogs (https://unikraft.io/blog/) or the corresponding LF OSS project (www.unikraft.org).

deivid · 2024-05-24T09:57:56 1716544676

Unikraft is really cool, but Linux is not necessarily the blocker. You can boot to PID1 in firecracker in ~6ms, see my experiments: https://blog.davidv.dev/minimizing-linux-boot-times.html

pradn · 2024-05-24T15:47:00 1716565620

I assume the majority of people are autoscaling merely for diurnal and weekly traffic swings, where the signal window could be as high as 30 min or 1 hour. Do folks really see sub-minutely autoscaling?

Are customers that are so cost sensitive served well by public clouds and attendant infrastructure?

jedberg · 2024-05-24T16:08:50 1716566930

Netflix scales on short intervals for reliability. Scale up quickly to handle a spike in traffic, scale down slowly to save money.

They also do proactive scaling to address the issues you brought up, since they can predict with fairly high accuracy the normal viewing patterns.

For example scaling up on Saturday morning ahead of all the kids waking up.

sfilmeyer · 2024-05-23T19:32:33 1716492753

>By definition your predictions will be more accurate the smaller the window is.

Small nit, and this doesn't detract from your points. I don't think this is universally true by definition, even if it is almost always true. You could come up with some rare conditions where your traffic at t+5 minutes is actually easier to predict than at t+20 seconds. Of course, even in that case you're better off (ceteris paribus) being able to spin things up in 20 seconds.

jedberg · 2024-05-23T19:36:40 1716493000

I can come up with a lot of examples where it is easier to predict further out[0], but that also means I can predict them 20 seconds out. :)

[0] For example I can tell you exactly when spikes will happen to Netflix's servers on Saturday morning (because the kids all get up at the same time). And I can tell you there will be spikes on the hour during prime time as people shift from linear TV to streaming (or at least they did a lot more 10 years ago!). I can also tell you when spikes to Alexa will be because I already know what times peoples alarms are set for.

foota · 2024-05-24T02:25:19 1716517519

Hm... I think there are things with one sided uncertainty, that make the ops example possible.

For example, things rarely start before they're scheduled, so if you know something is supposed to happen at 5 and last an hour, it's reasonable to assume it will be happening at 5:10, so if it's 4:59 right now, you're more certain that it will happen in 11 minutes than in one minute.

Of course, as time progresses your uncertainty curve flattens out (at 5:09 you're pretty certain it'll happen any moment now), but that's only in the future.

Even then, it's quite reasonable to say that there's no time leading up to the event where you're more confident it will start imminently than you're confident it will be occurring further out.

bushbaba · 2024-05-24T08:49:53 1716540593

Autoscaling also makes performance insights easier. It keeps the resources per processed request relatively consistent over time. Whereas resizing not automatically can leads to a lot of operational complexity in understanding how your service will react under different loads.

This dang age everyone should use autoscaling. relying on ODCR (capacity reservations) for guaranteeing resources exist.

cricketlover · 2024-05-24T04:55:06 1716526506

won't there be more noise while predicting just 20s in advance? The longer the duration, the less effects we will see of temporary events like network blips etc. no? sorry I'm new to software engineering and just trying to learn.

wongarsu · 2024-05-24T09:45:12 1716543912

However with a smaller prediction interval you can dampen your autoscaling more. If you predict 20s into the future, react, and 20s later you see how that changed the situation you can afford to spin very few instances up and down each 20s. If you have to predict 5m into the future you might have to take much stronger actions because any effect is delayed by the 5m startup interval.

cyberpunk · 2024-05-24T05:08:43 1716527323

There’s no one answer for it, you need to learn your traffic / resource usage patterns and tune the scaling to match your situation.

No shortcuts really, although a lot of web applications behave “kinda” similar.

Start conservatively and tweak from there.

diroussel · 2024-05-24T22:02:25 1716588145

Faster feedback is the shortcut. And it always works. Faster boot time is lower latency to serve the queue. Faster feedback is stabilising.

necovek · 2024-05-23T17:43:28 1716486208

From a technical perspective, Amazon has actually optimized this but turned that into "serverless functions": their ultra-optimized image paired with Firecracker achieve ultra-fast boot-up of virtual Linux machines. IIRC from when Firecracker was being introduced, they are booting up in sub-second times.

I wonder if Amazon would ever decide to offer booting the same image with the same hypervisor in EC2 as they do for lambdas?

20thr · 2024-05-23T20:04:07 1716494647

100% -- EC2's general purpose nature is not in my opinion the best fit for ephemeral use-cases. You'll be constantly fighting the infrastructure as the set of trade-offs and design goals are widely different.

This is why CodeSandbox, Namespace, and even fly.io built special-purpose architectures to guarantee extremely start-up time.

In the case of Namespace it's ~2sec on cold boots with a set of user-supplied containers, with storage allocations.

(Disclaimer, I'm with Namespace -- https://namespace.so)

arianvanp · 2024-05-23T17:51:33 1716486693

And AWS now has a product to spin up Lambdas for GitHub Actions CI runners

https://docs.aws.amazon.com/codebuild/latest/userguide/actio...

fhuici · 2024-05-24T06:36:03 1716532563

[Disclaimer: I'm with KraftCloud] For what it's worth, Firecracker/the VMM is only one part of the boot process. Among others, there's also the controller and the VM/OS itself that typically slow things down. In other words, it's not enough to just switch in Firecracker and expect cold starts to immediately drop to sub-second levels.

On kraft.cloud we've fundamentally redesigned the cloud stack to be able to cold start containers/VMs in milliseconds (eg, about 20 millis for nginx, about 50 millis for a basic Node app), and also scale them to zero and autoscale them in milliseconds. If interested there's more info about the tech in our blog posts https://unikraft.io/blog/ .

cr125rider · 2024-05-23T18:20:10 1716488410

Fargate is an alternative that runs on Firecracker as well. It's hidden behind ECS and EKS, however.

mochomocha · 2024-05-24T03:15:59 1716520559

According to [1] Fargate is actually not using Firecracker, but probably something closer to a single container running in a single-tenant ec2 VM. If true, this makes VM boot-time optimizations and warm pooling even more important for such product.

[1]: https://justingarrison.com/blog/2024-02-08-fargate-is-not-fi...

plopz · 2024-05-24T04:53:05 1716526385

Fargate is too slow without the container cache you can get with ec2

scarface_74 · 2024-05-24T02:05:15 1716516315

And CodeBuild…

solarengineer · 2024-05-24T14:29:07 1716560947

Not an AWS example, but on my Illumos Zones on an i5 at Hetzner, I get from zero to ssh in under 50 ms. I am certain of the numbers since I have used DTrace to measure. It is unfortunate that Ilumos is not popular enough for a multitude of reasons.

necovek · 2024-05-24T17:35:14 1716572114

Wow, that's amazing!

I wonder if I should try out Illumos as I am rebuilding my home server, but I am afraid that due to lack of time, it'd take me ages to replicate my libvirt-based setup with around 30 services.

How well are Linux containers and VMs supported in Illumos? How about nested KVM? That could help with the transition as I am heavily leaning into GNU tools and KVM.

FWIW, when I first laid my eyes on a Sun workstation back in the 90s and seen the Control key where it rightly belongs (in the place of Caps Lock in broken layouts :), I said "duh" and have never moved back from remapped Caps-as-Ctrl.

solarengineer · 2024-05-25T03:56:20 1716609380

You could use SmartOS to achieve the same.

Linux containers no longer run natively and instead need to run within VMs.

oblio · 2024-05-28T11:12:27 1716894747

> It is unfortunate that Ilumos is not popular enough for a multitude of reasons.

It's complicated but you can blame Sun for a lot of that, the primary thing was their explicit choice of license to make it incompatible (or hard-to-make-compatible with Linux).

solarengineer · 2024-05-24T14:46:51 1716562011

I should add - this is a full Zone with a number of services. One of my half-finished projects was to trace the call graph of the various services started and disable those not needed.

amluto · 2024-05-23T15:16:45 1716477405

I don’t use EC2 enough to have played with this, but a big part here is the population of the AMI into the per-instance EBS volume.

ISTM one could do much better with an immutable/atomic setup: set up an immutable read-only EBS volume, and have each instance share that volume and have a per-instance volume that starts out blank.

Actually pulling this off looks like it would be limited by the rules of EBS Multi-Attach. One could have fun experimenting with an extremely minimal boot AMI that streams a squashfs or similar file from S3 and unpacks it.

edit: contemplating a bit, unless you are willing to babysit your deployment and operate under serious constraints, EBS multi-attach looks like the wrong solution. I think the right approach would be build a very very small AMI that sets up a rootfs using s3fs or a similar technology and optionally puts an overlayfs on top. Alternatively, it could set up a block device backed by an S3 file and optionally use it as a base layer of a device-mapper stack. There’s plenty of room to optimize this.

Szpadel · 2024-05-23T17:36:16 1716485776

we used s3fs in production. please don't use it, it's unreliable, unpredictable failure modes, can bring whole instance down. if you really need something like that use rclone mount

mdaniel · 2024-05-23T15:32:57 1716478377

I believe they addressed this in their post because one cannot (currently?) `aws ec2 run-instances --volume-id vol-cafebabe`, rather one can only tell AWS what volume parameters to use when they create the root device. Your theory may still be sound about using some kind of super bare bones AMI but there will be no such outcome of "hey, friend, use this existing EBS as your root volume, don't create a new one"

stingraycharles · 2024-05-23T17:20:45 1716484845

Isn’t EBS multi-attach only available for the (very expensive) io1 / io2 volume types?

amluto · 2024-05-23T17:42:52 1716486172

Hmm, it does look like it, although one could carefully use large IO.

But the bigger issue might be durability. Most EBS types have rather low quoted durability, and, for a shared volume like this, that’s a problem. Using S3 instead would be better all around except for the smallish engineering effort and deployment effort needed.

Getting a tool like mkosi to generate a boot-from-S3 setup should be straightforward. Converting most any bootable container should also be doable, even automatically. Converting an AMI would involve more heuristics and be more fragile, but it ought to work reliably with most modern Linux distros.

attentive · 2024-05-23T19:47:18 1716493638

That's reinventing ebs/ami/snapshots. They are already doing it i.e. data goes lazily from s3 to ebs/ec2.

amluto · 2024-05-23T23:13:16 1716505996

It’s not, though. The way to boot a transient OS (just like a transient instance of a container on a machine/instance with a container runtime) is to give userspace read-only access to the image. It can be outright read-only or it can be an actual efficient overlay mechanism (qcow, overlayfs, device-mapper snapshot, etc). EBS, as the article notes, can’t actually do a read-only mount of a snapshot at all, and it’s very inefficient at instantiating a volume from a snapshot.

antihero · 2024-05-24T07:48:44 1716536924

Could you make a snapshot of the booted instance then and boot other instances from that?

amluto · 2024-05-24T14:54:27 1716562467

That seems like it would have exactly the same problem. The problem is that EBS volumes load very inefficiently from snapshots. (They’re also unnecessarily expensive: you pay for (number of instances times size) despite the fact that what you actually want is for each instance to read exactly the same data.)

Fundamentally, EBS volumes are designed to work more or less like actual disks, whereas modern scaled workloads want something closer to net-booted diskless machines.

fduran · 2024-05-23T22:26:07 1716503167

So I've created ~300k ec2 instances with SadServers and my experience was that starting an ec2 VM from stopped took ~30 seconds and creating one from AMI took ~50 seconds.

Recently I decided to actually look at boot times since I store in the db when the servers are requested and when they become ready and it turns out for me it's really bi-modal; some take about 15-20s and many take about 80s, see graph https://x.com/sadservers_com/status/1782081065672118367

Pretty baffled by this (same region, same pretty much everything), any idea why?. Definitively going to try this trick in the article.

paranoidrobot · 2024-05-24T00:47:41 1716511661

My guess is probably related to AWS Spot capacity.

The second and third spikes at 80 and 140 seconds lines up nicely with this kind of behavior.

The second spike would be optimised workloads that can respond to spot interruption in under 60 seconds.

The third spike would be Spot workloads that are being force-terminated.

The reason it's falling on those bounds is because of whatever is trying to schedule your workload only re-checks for free capacity once a minute.

I used to be able to spin up spot instances and basically never get interruptions. They'd stay on for weeks/months.

In my experience, it used to be fairly safe to have Spot instances for most workloads. You'd almost never get Spot interruptions. Now, some regions and instance types are difficult to run Spot instances at all.

fduran · 2024-05-24T14:49:43 1716562183

Thanks, pot capacity being scheduled differently would explain the behavior.

Almost all my ec2 instances are spot, and actually I can compare the distribution with the on-demand ones.

My spot instances are very short lived (15-30 mins max) and AFAIK I've never seen a spot instance force-terminated (this would be hard to find I think).

paranoidrobot · 2024-05-25T20:09:54 1716667794

When I say "force-terminated" I mean when you don't voluntarily shut down in response to a SpotInterruption event.

When the event is sent, they give you two minutes to shut down.

If you either don't subscribe to the events, or don't shut down fast enough, they kill the instance.

fletchowns · 2024-05-23T22:47:00 1716504420

Perhaps in one case you are getting a slice of a machine that is already running, versus AWS powering up a machine that was offline and getting a slice of that one?

fduran · 2024-05-23T23:21:22 1716506482

Yes, some internal (AWS operation) explanation like the one you suggest makes sense.

crohr · 2024-05-23T18:09:17 1716487757

> while we can boot the Actions runner within 5 seconds of a job starting, it can take GitHub 10+ seconds to actually deliver that job to the runner

This. I went the same route with regards to boot time optimisations for [1] (cleaning up the AMI, cloud-init, etc.), and can boot a VM from cold in 15s (I can't rely on prewarming pools of machines -- even stopped -- since RunsOn doesn't share machines with multiple clients and this would not make sense economically).

But the time taken by the official runner binary to load and then get assigned a job by GitHub always takes around 8s, which is more than half of the VM boot time :( At some point it would be great if GitHub could give us a leaner runner binary with less legacy stuff, and tailored for ephemeral runners (that, or reverse-engineer the protocol).

[1] https://runs-on.com

develatio · 2024-05-23T15:32:30 1716478350

Maybe AWS should actually take a look into this. I know comparing AWS to other (smaller) cloud providers is not totally fair given the size of AWS, but for example creating / booting an instance in Hetzner takes a few seconds.

torginus · 2024-05-23T17:25:36 1716485136

It also takes a few seconds on AWS. The guy is comparing setting up a whole new machine from an image, with network and all, to turning on a stopped EC2 instance.

The latter takes a few seconds, the former is presumably longer. This is the great relevation of this blog post.

dylan604 · 2024-05-23T18:01:41 1716487301

wait, restarting a stopped machine is faster than launching an AMI from scracth is a great revelation?

That's like saying waking your MacbookPro is faster than booting from powered off state. Of course it is, and that's precisely why the option exists.

jpambrun · 2024-05-23T19:51:50 1716493910

I think this is unexpected. I expected that once created, my boot volume would have the same performance on the first boot than on the second. It's really not obvious that the volume is really empty and lazily loaded from S3. The proposed work around is also a bit silly: read all blocks one by one even tho maybe 1% of the block have something in them on a new VM. This is actually a revelation.

mdeeks · 2024-05-23T19:31:28 1716492688

If you aren't familiar with how EBS works and how volumes are warmed, then yes, this is an interesting blog post. Not everyone is an expert. They become experts by reading things like this and learning.

If you didn't know about this EBS behavior it would be logical to assume that booting from scratch is roughly equivalent to starting/stopping/starting again.

pluies · 2024-05-24T10:43:49 1716547429

It's definitely news to me!

Intuitively, I would have expected AWS to send back the EBS volume backing a powered-off machine into the Whatever Long-Term Storage Is Behind EBS, and therefore start-up time to be ~identical to a fresh start as the steps would be the same: retrieve data from long-term storage into a readable EBS volume, start VM etc.

It's very interesting that it is not the case, and that keeping the EBS volume around after a first boot makes a second boot so much faster.

dylan604 · 2024-05-24T13:37:01 1716557821

If that were the case, then there would be absolutely no benefit from having a stop vs terminate. Unless you've written data to the EBS volume, but in these cases where the boot time is critical most of them are some sort of read-only volume anyways. The fact that an EC2 can be stopped or terminated should immediately suggest speed might be a difference in reaching the running state. In the EBS docs, it clearly states that if you keep an EBS volume around either attached to a stopped EC2 or detached and left in your pool of volumes, you will be charged for that volume.

everfrustrated · 2024-05-23T16:23:43 1716481423

Hetzner does not offer a network block storage comparable to EBS that can be used as a root (bootable) file system. AWS local-attached ephemeral disk are also immediately available but cannot be seeded with data (same as Hetzner they are wiped clean ahead of boot).

andersa · 2024-05-23T16:33:36 1716482016

This is an advantage. EBS is terrible! Literally orders of magnitude slower than modern SSDs.

stingraycharles · 2024-05-23T17:29:49 1716485389

Depends on your definition of slow. Throughput-wise, I think it’s fairly decent — we typically set up 4 EBS volumes in raid0 and get 4GB/sec for a really decent price.

Nextgrid · 2024-05-23T18:16:22 1716488182

Sequential throughput can be fine. Random access is always going to be orders of magnitude slower than a direct-attach disk.

Remember why we switched from spinning hard drives to SSDs? Well EBS is like going back to a spinning drive.

flybarrel · 2024-05-24T13:22:52 1716556972

Depends on what volume types you are using. For random access - with io2 you can get 256K IOPS per volume, and if you do RAID0, on the largest instance you can get 400K IOPS. Directly-attached vs over-the-network storage can be fairly different, but going back to a spinning drive is not a fair statement.

Nextgrid · 2024-05-24T19:14:45 1716578085

You are talking about concurrent IOPS which is fair enough, you can indeed scale that. But every individual one of those IOPS would still have far higher latency than direct-attach storage.

This is a problem when IO operations have to be sequential and can't be parallelized such as when they depend on each other (database has to first read the index to know where the actual data is, then do another IO to actually read said data).

Having lots of IOPS could allow you to make multiple of these queries in parallel (assuming locking/etc doesn't get in the way), but it still means every individual query would be slower than on direct-attach storage.

Aeolun · 2024-05-24T01:30:32 1716514232

I think AWS is aware of this. Which is why they have instances with attached SSD’s

Nextgrid · 2024-05-24T19:17:28 1716578248

However, you can't boot off them, and they don't build anything to facilitate their usage as a boot drive, even though it's perfect for use-cases like these where the machine is ephemeral and its local drive doesn't actually matter.

They should have a feature where you can provide it an S3 URL or EBS snapshot when launching an instance, and the control plane would write it to the local ephemeral direct-attach volume on launch.

Currently if you wanted to do that you'd have to roll your own with a minimal EBS as your root volume containing iPXE or some other minimal OS that is capable of formatting the direct-attach NVME and populating it before passing control to it (via kexec/etc?).

But I guess since EBS incurs additional cost, there's no incentive for them to make it easier to move away. I bet that in aggregate they're making quite good money off mostly unnecessary EBS volumes that don't actually contain any useful data.

tekla · 2024-05-23T16:41:27 1716482487

EBS is great for workloads that dont require SSDs, which most don't.

If it does, you can do provisioned which will get you alot more or go NVME.

Nextgrid · 2024-05-23T18:17:46 1716488266

Even provisioned won't get you the access times of a direct-attached SSD. Speed of light and all that - EBS is using the network under the hood, it's not a direct connection to the host.

tekla · 2024-05-23T18:31:25 1716489085

Yes I know, and? Thats why I mentioned NVME

flaminHotSpeedo · 2024-05-24T07:15:19 1716534919

At least if you're using the Ec2 optimized amis, Ec2 instances frequently boot fast enough that they'll be executing your username initialization before you get the run instances response.

Though there's a long tail, so sometimes there can be a gap on the order of a couple second between the sync response and when the bootloader transfers over to the kernel.

tekla · 2024-05-23T16:05:54 1716480354

They have and I know this because I've hammered them on this because we demand thousands of instances to autoscale very aggressively in 1-3 minutes. Very few people give a shit about initialization times. They care more about instance ready times which is constrained by the OS that is running.

attentive · 2024-05-23T19:48:49 1716493729

It depends on instance type and OS and can be real short on ec2.

matt-p · 2024-05-23T15:41:29 1716478889

What's size got to do with boot time? Serious question.

develatio · 2024-05-23T16:03:26 1716480206

By "the size" I meant to say "the size of the infrastructure", meaning that AWS has to manage orders of magnitude more instances than Hetzner. This might as well contribute to "things" being slower.

londons_explore · 2024-05-23T16:21:10 1716481270

Arguably it can also make things faster. A small provider might need to migrate other instances around to make space for your new instance, whereas a big provider almost certainly can satisfy your request from existing free capacity, and it should therefore be a matter of milliseconds to identify the physical machine your new VM will run on.

matt-p · 2024-05-23T23:52:04 1716508324

I understand that, apologies for not making that clear.

I guess I'm just struggling to see how having more VMs means it takes longer to provision one? A database query to find the space and get an allocation should be milliseconds in either case.

RationPhantoms · 2024-05-23T15:45:28 1716479128

More employed eyes on an issue or ability to compensate the best-in-class engineers to take a look.

matt-p · 2024-05-23T23:47:47 1716508067

Exactly, so a large provider should theoretically have faster boot times if anything, no? More chance of having a customer who cares deeply, more wasted CPU cycles (and therefore incentive to optimise) as it's multiplied by 10 million rather than 10 thousand instance starts. More likely to have larger quantities of engineers available, more monitoring/data/telematics.

ldoughty · 2024-05-24T10:57:32 1716548252

The article mentions hydration (a process to reduce the penalty of first access of a data block). This is done by essentially reading the entire EBS volume with the tool like fio or dd. The time it takes to complete this process is relative to the amount of dirty blocks. Therefore, it will take twice as long to hydrate 20 GB of data compared to 10 GB.

playingalong · 2024-05-23T16:08:12 1716480492

Likely they mean that following Conway's law in AWS there are more abstraction layers involved.

matt-p · 2024-05-23T23:55:08 1716508508

Most likely true, but things are going deeply wrong if fresh Debian VM boot times are a function of the organisational structure of your hosting provider.

CaptainOfCoit · 2024-05-23T15:52:46 1716479566

Smaller companies are faster and more nimble than larger corporations.

matt-p · 2024-05-23T23:50:06 1716508206

Yes, but why specifically does that mean boot times of a VM are faster? One has to be 'nimble' to improve that?

mnutt · 2024-05-23T19:50:22 1716493822

They talk about the limitations of the EC2 autoscaler and mention calling LaunchInstances themselves, but are there any autoscaler service projects for EC2 ASGs out there? The AWS-provided one is slow (as they mention), annoyingly opaque, and has all kinds of limitations like not being able to use Warm Pools with multiple instance types etc.

mdaniel · 2024-05-24T00:03:00 1716508980

I am a little confused by your mention of "EC2 autoscaler" and then "EC2 ASG" autoscaler, but if I'm hearing you correctly and you'd want "self managed ASGs," then you may have some success adapting Keda <https://github.com/kedacore/keda#readme> (or your-favorite-event-driven-gizmo) to monitor the metrics that interest you and driving ec2.LaunchInstances on the other side, since as very best I can tell that's what ASGs are doing just using their serverless-event-something-or-other versus your serverless-event-something-or-other. I would suspect you could even continue to use the existing ec2.LaunchTemplate as the "stamp out copies of these" system, since there doesn't appear to be anything especially ASG-y about them, just that is the only(?) consumer thus far

After having typed all that out, I recalled that Open Stack exists and thus they may get you ever further toward your goal since they are trying to be on-prem AWS: https://docs.openstack.org/auto-scaling-sig/latest/theory-of...

flaminHotSpeedo · 2024-05-24T07:17:27 1716535047

Yeah that's basically what asg does, you can see the createFleet requests in cloudtrail

everfrustrated · 2024-05-23T16:34:50 1716482090

It's too bad that EBS doesn't natively support Copy-On-Write.

Snapshots are persisted into S3 (transparently to the user) but it means each new EBS volume spawned doesn't start at full IOPS allocation.

I presume this is due to EBS volumes being specific-AZ so to be able to launch an AMI-seeded EBS volume in any AZ it needs to go via S3 (multi-AZ)

Twirrim · 2024-05-23T16:45:19 1716482719

EBS volumes are "expensive" compared to S3, due to the limitations of what you can do with live block volumes + replicas, vs S3. It takes more disk space to have an image be a provisioned volume ready to be used for copy-on-write, vs having it as something backed up in S3. So the incentives aren't there vs just trying to make the volume creation process as smooth and fast as possible.

I'd guess it's likely that EBS is using a tiered caching system, where they'll keep live volumes around for Copy-on-write cloning for the more popular images/snapshots, with slightly less popular images maybe stored in an EBS cache of some form, before it goes all the way back to S3. You're just not likely to end up getting a live volume level of caching until you hit a certain threshold of launches.

Nextgrid · 2024-05-23T18:24:13 1716488653

I don't get why they're using EBS here to begin with. EBS trades off cost and performance for durability. It's slow because it's a network-attached volume that's most likely also replicated under the hood. You use this for data that you need high durability for.

It looks like their use-case fetches all the data it needs from the network (in the form of the GH Actions runner getting the job from GitHub, and then pulling down Docker containers, etc).

What they need is a minimal Linux install (Arch Linux would be good for this) in a squashfs/etc and the only thing in EBS should be an HTTP-aware boot loader like IPXE or a kernel+initrd capable of pulling down the squashfs from S3 and run it from memory. Local "scratchspace" storage for the build jobs can be provided by the ephemeral NVME drives which are also direct-attach and much faster than EBS.

jedberg · 2024-05-23T18:35:33 1716489333

By using EBS they don't have to wait for disk to fill from network on second+ boot.

Nextgrid · 2024-05-23T18:38:21 1716489501

Ah so they are keeping the machines around? Do they need to do that - does the GH runner actually persist anything worth keeping in between runs?

jedberg · 2024-05-23T18:42:37 1716489757

They keep the instances in a "stopped" state, which means keeping the EBS volume around (and paying for it) but not paying for the instance (which could be another machine when turn it back on, which is why you can't load it into scratch space and then stop it).

What's on the EBS is their docker image, so they don't have to load it back up again.

Nextgrid · 2024-05-23T18:53:18 1716490398

Makes sense. I still think it would be cheaper to just reload it from S3 (straight into memory, not using EBS at all) on every boot. The entire OS shouldn't be more than a gigabyte which is quite fast to download as a bulk transfer straight into RAM.

jedberg · 2024-05-23T18:56:24 1716490584

Yes it would be cheaper, but the whole point of this article is trading off cost for faster boot times. They address your points in the article, how it's faster to boot off a warm EBS instead of loading from scratch.

solatic · 2024-05-24T10:04:49 1716545089

Makes me wonder why Depot isn't moving to on-prem hardware. When you're reselling compute with a better API, you give up a substantial proportion of your profits to the hyperscaler while offering worse performance (due to being held hostage to the hyperscaler's design decisions, like lazy loading root EBS from S3).

Surely an optimized approach here looks something like booting customer CI workloads directly from the hypervisor, using an ISO/squashfs/etc. stored directly on the hypervisor, where the only networked disks are the ones with the customers' BuildKit caches?

maccard · 2024-05-23T15:20:47 1716477647

I don't use GHA as some of our code is stored in Perforce, but we've faced the same challenges with EC2 instance startup times on our self managed runners on a different provider.

We would happily pay someone like depot for "here's the AMI I want to run & autoscale, can you please do it faster than AWS?"

We hit this problem with containers too - we'd _love_ to just run all our CI on something like fargate and have it automatically scale and respond to our demand, but the response times and rate limting are just _so slow_ that it means instead we just end up starting/stopping instances with a lambda which feels so 2014.

CaptainOfCoit · 2024-05-23T15:54:00 1716479640

> We would happily pay someone like depot for "here's the AMI I want to run & autoscale, can you please do it faster than AWS?"

Change that to "here's the ISO/IMG I want to run & autoscale, can you please do it faster than AWS?" and you'll have tons of options. Most platforms using Firecracker would most likely be faster, maybe try to use that as a search vector.

maccard · 2024-05-23T19:52:24 1716493944

Can you maybe share some examples? We're fine to use other image formats, but a lot of the value of AWS is that the services interact, IAM works nicely together, etc.

Fly.io comes up often [0] on HN, but there's an overwhelming amount of "it's a nice idea, but it just doesn't work" feedback on it.

[0] https://news.ycombinator.com/item?id=39363499

abatilo · 2024-05-24T13:04:34 1716555874

Depot also does remote docker builds using a remote build kit agent. It was actually their original product. If you could feasibly put everything into a Dockerfile, including running your tests, then you could use that product and get the benefits.

maccard · 2024-05-24T15:37:49 1716565069

I actually didn't know this. We've had some teething issues _building_ in docker, but we actually run our services in containers. I'm sure a few hours of banging my head against a wall would be worth it here.

> including running your tests, "thankfully", we use maven which means that our tests are part of the build lifecycle. It's a bit annoying because our CI provider has some neat parallelism stuff that we could lean on if we could separate out the test phase from the build phase. We use docker-compose inside our builders for dev dependencies (we run our tests against a real database running in docker) but I think they should be our only major issues here.

But, Thanks for the heads up.

Szpadel · 2024-05-23T17:45:39 1716486339

I'm not fully investigated fargate limitations but I think it would be possible to use any k8s native CI on eks + fargate, maybe even use kubevirt for VM creation? from my exploration of fargate with eks, aws provisioned capacity in around 1s region

maccard · 2024-05-23T19:50:42 1716493842

> AWS offers something very similar to this approach called warm pools for EC2 Auto Scaling. This allows you to define a certain number of EC2 instances inside an autoscaling group that are booted once, perform initialization, then shut down, and the autoscaling group will pull from this pool of compute first when scaling up.

> While this sounds like it would serve our needs, autoscaling groups are very slow to react to incoming requests to scale up. From experimentation, it appears that autoscaling groups may have a slow poll loop that checks if new instances are needed, so the delay between requesting a scale up and the instance starting can exceed 60 seconds. For us, this negates the benefit of the warm pool.

I pulled this from the article, but it's the same problem. Technically yes, eks + fargate works. In practice the response times from "thing added to queue" to "node is responding" is minutes with that setup.

Szpadel · 2024-05-25T08:36:23 1716626183

This isn't my experience with eks + fargate

My theory is that they keep nodes booted up and ready and when kube-scheduler cannot assign node then Aws will just add this ready instance to your vpc and ask it to join your cluster.

From user perspective it looked like you always have available capacity on your cluster

everfrustrated · 2024-05-23T16:29:00 1716481740

Out of curiosity what CI system are you using with Perforce?

maccard · 2024-05-23T19:48:41 1716493721

We use buildkite with a customised verison of https://github.com/improbable-eng/perforce-buildkite-plugin/

Our game code is in P4, but our backend services are on GH. Having a single CI system means we get easy interop e.g. game updates can trigger backend pipelines and vice versa.

In the past I've used TeamCity, Jenkins, and ElectricCommander(!)

bingemaker · 2024-05-23T16:48:44 1716482924

Curious, how do you measure the time taken for those 4 steps listed in "What takes so long?" section?

uavoperator · 2024-05-23T23:22:05 1716506525

This is really only tangentially related to the article, but

>If AWS responds that there is no current capacity for m7a instances, the instance is updated to a backup type (like m7i) and started again

Any ideas why m7i would be chosen as the backup type rather than the other way around? m7a seems to be more expensive than m7i, so maybe there's some performance advantage or something else I'm missing that makes AMD CPU containing instances preferable to Intel ones?

tenplusfive · 2024-05-24T07:43:20 1716536600

At least with other instance types (m5,m6,t3) it was the case that the AMD processors were cheaper. As it turns out, this does not seem to be a general rule.

It seems like performance wise the AMD processors are (in certain workloads) quite a bit faster than their Intel equivalent: https://www.phoronix.com/review/aws-m7a-ec2-benchmarks/2 (in later pages it seems to be a little bit more mixed)

crohr · 2024-05-24T07:28:00 1716535680

m7i CPU is in the same ballpark figure than m7a (https://runs-on.com/benchmarks/aws-ec2-instances/). When you look at the interruption percentage for m7a I think m7i (not m7i-flex if you don't want burstable instances) is probably the better choice. But I suppose it depends on availability in their specific zones.

waiwai933 · 2024-05-23T17:14:39 1716484479

I believe this is similar to EC2 Fast Launch which is available for Windows AMIs, but I don't know exactly how that works under the hood.

https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/win-a...

mcbain · 2024-05-24T07:02:12 1716534132

It does launch an instance and take a snapshot but what's happening is the sysprep and OOBE stuff that can take 10 mins or so (you can find it in the console and startup logs). That's a lot more overheard than just hydrating an EBS volume.

https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/win-a...

cmckn · 2024-05-23T16:44:41 1716482681

You can enable fast restore on the EBS snapshot that backs your AMI: https://docs.aws.amazon.com/ebs/latest/userguide/ebs-fast-sn...

It’s not cheap, but it speeds things up.

stingraycharles · 2024-05-23T17:27:39 1716485259

$540/month per EBS volume per AZ. And it’s still fairly limited, at a maximum of 8 credits, it wouldn’t nearly cover the use case described in the article (launching 50 instances quickly).

suryao · 2024-05-23T18:23:02 1716488582

This is very cool optimization.

I make a similar product offering fast Github actions runners[1] and we've been down this rabbit hole of boot time optimization.

Eventually, we realized that the best solution is to actually build scale. There are two factors in your favor then: 1) Spikes are less pronounced and the workloads are a lot more predictable. 2) The predictability means that you have a decent estimate of the workload to expect at any given time, within reason for maintaining an efficient warm pool.

This enables us to simplify the stack and not have high-maintenance optimizations while delivering great user experience.

We have some pretty heavy use customers that enable us to do this.

[1] https://www.warpbuild.com

matt-p · 2024-05-23T23:59:40 1716508780

This is almost always the answer, adding an instance should be a fairly rare event.

immibis · 2024-05-23T15:22:34 1716477754

There's something to say about building a tower of abstractions and then trying to tear it back down. We used to just run a compiler on a machine. Startup time: 0.001 seconds. Then we'd run a Docker container on a machine. Startup time: 0.01 sections. Fine, if you need that abstraction. Now apparently we're booting full VMs to run compilers - startup time: 5 seconds. But that's not enough, because we're also allocating a bunch of resources in a distributed network - startup time: 40 seconds.

Do we actually need all this stuff, or does it suffice to get one really powerful server (price less than $40k) and run Docker on it?

iudqnolq · 2024-05-23T21:24:22 1716499462

That doesn't solve the same problem.

GitHub actions in the standard setup needs to run untrusted code and so you essentially need a VM.

You can lock it down at the cost of sacrificing features and usability, but that's a tradeoff.

fhuici · 2024-05-24T14:21:49 1716560509

We don't need all of those layers and abstractions of course. But if we do things right we also don't need to go the bare metal server route -- cloud platforms, if done right, can provide both strong, hardware-level (read: vm) isolation plus fast starts.

On kraft.cloud (shameless plug) we build extremely specialized VMs (aka unikernels) where most of the code in them is the application code, and pair this with a fast, custom controller and other perf tweaks. We use Dockerfiles to build from, but when deploying we eliminate all of those layers you mention. Cold boot times are in milliseconds (e.g., nginx 20ms, a basic node app ~50ms), as are scale to zero and autoscale.

mike_hearn · 2024-05-23T17:37:45 1716485865

A really powerful server should not cost you anywhere near $40k unless you're renting bare metal in AWS or something like that.

Getting rid of the overhead is possible but hard, unless you're willing to sacrifice things people really want.

1. Docker. Adds a few hundred msec of startup time to containers, configuration complexity, daemons, disk caches to manage, repositories .... a lot of stuff. In rigorously controlled corp environments it's not needed. You can just have a base OS distro that's managed centrally and tell people to target it. If they're building on e.g. the JVM then Docker isn't adding much. I don't use it on my own companies CI cluster for example, it's just raw TeamCity agents on raw machines.

2. VMs. Clouds need them because they don't trust the Linux kernel to isolate customers from each other, and they want to buy the biggest machines possible and then subdivide them. That's how their business model works. You can solve this a few ways. One is something like Firecracker where they make a super bare bones VM. Another would be to make a super-hardened version of Linux, so hardened people trust it to provide inter-tenant isolation. Another way would be a clean room kernel designed for security from day one (e.g. written in Rust, Java or C#?)

3. Drives on a distributed network. Honestly not sure why this is needed. For CI runners entirely ephemeral VMs running off read only root drive images should be fine. They could swap to local NVMe storage. I think the big clouds don't always like to offer this because they have a lot of machines with no local storage whatsoever, as that increases the density and allows storage aggregation/binpacking, which lowers their costs.

Basically a big driver of overheads is that people want to be in the big clouds because it avoids the need to do long term planning or commit capital spend to CI, but the cloud is so popular that providers want to pack everyone in as tightly as possible which requires strong isolation and the need to avoid arbitrary boundaries caused by physical hardware shapes.

immibis · 2024-05-24T02:48:14 1716518894

$40k to buy the server, not to rent per month.

If you know who's using your build server, you probably don't need isolation stronger than Docker, because they can to to jail for hacking it.

necovek · 2024-05-23T17:46:06 1716486366

How do you get Docker container startup time of 0.01s with any real-life workload (yes, I know they are just processes, so you could build a simple "hello world" thing, but I'd be surprised if even that runs this fast)?

Do you have an example image and network config that would demonstrate that?

(I'd love to understand the performance limits of Docker containers, but never played with them deeply enough since they are usually in >1s space which is too slow for me to care)

fhuici · 2024-05-24T14:26:54 1716560814

On kraft.cloud we use Dockeffiles to build into extremely specialized VMs for deployment. With this in place, we can have say an nginx server cold started and ready to serve at a public URL in about 20 millis (not quite the 10ms you mention, but in the right ballpark, and we're constantly shaving that down). Heavier apps can take longer of course, but not too much (e.g., node/next < 100ms). Autoscale and scale to zero also operate in those timescales.

Underneath, we use specialized VMs (unikernels), a custom controller and load balancer, as well as a number of perf tweaks to achieve this. But it's (now) certainly possible.

necovek · 2024-05-24T17:25:53 1716571553

Thanks, that is very interesting.

Still, that mostly confirms my experience: to achieve this level of performance, you need to do optimizations on a lower level, and this is not really achievable with docker out of the box (plain Linux host with usual Docker runtime).

cjk2 · 2024-05-23T15:28:43 1716478123

I'm mostly just running the (Go) compiler on my laptop which is considerably faster than on docker and considerably cheaper than the server...

I mean an ass end M3 macbook has the same compile time as an i9-14900k. God knows what an equivalent Xeon/Epyc costs...

immibis · 2024-05-23T15:36:45 1716478605

Maybe your container isn't set up right - Docker contains run directly on the host, just partitioned off from accessing stuff outside of themselves with the equivalent of chroot. Or it could be a Mac-specific thing. Docker only works that way on Linux, and has to emulate Linux on other platforms.

yjftsjthsd-h · 2024-05-23T16:05:46 1716480346

Right, they said they're on a macbook so unless they're going out of their way to run Linux bare-metal it has to use a VM. And AIUI there are extra footguns in that situation, especially that mapping volumes from the host is slower because instead of just telling the kernel to make the directory visible you have to actually share from the host to the VM.

See also: https://reece.tech/posts/osx-docker-performance/

See also: https://docs.docker.com/desktop/settings/mac/

> Shared folders are designed to allow application code to be edited on the host while being executed in containers. For non-code items such as cache directories or databases, the performance will be much better if they are stored in the Linux VM, using a data volume (named volume) or data container.

cjk2 · 2024-05-23T16:20:43 1716481243

Why would I use docker? You don't have to use it. I'm just generating static binaries.

Does anyone understand how to do stuff without containers these days?

rfoo · 2024-05-23T18:46:11 1716489971

Because you just said:

> which is considerably faster than on docker

And we are curious why it is like so because we not only understand how to do stuff without containers, we also understand how containers work and your claim sounds off.

cjk2 · 2024-05-23T19:46:29 1716493589

I don't understand what you are saying.

I'm saying it is slower on docker due to container startup, pulling images, overheads, working out what incantations to run, filesystem access, network weirdness, things talking to other things, configuration required, pull limits, API tokens, all sorts.

Versus "go run"

rfoo · 2024-05-24T14:41:49 1716561709

Wow! You are right! Running go build on host instead of container is 1.16 times faster! A whopping 435ms difference! Amazing!

    /tmp/gitea $ hyperfine -p 'go clean -cache' 'make backend' 'docker run --rm -v $PWD:/build -v $HOME/.go:/go -w /build golang:1.22.3 make backend'
    Benchmark #1: make backend
      Time (mean ± σ):      2.766 s ±  0.021 s    [User: 8.429 s, System: 1.590 s]
      Range (min … max):    2.732 s …  2.800 s    10 runs

    Benchmark #2: docker run --rm -v $PWD:/build -v $HOME/.go:/go -w /build golang:1.22.3 make backend
      Time (mean ± σ):      3.201 s ±  0.034 s    [User: 9.9 ms, System: 7.7 ms]
      Range (min … max):    3.135 s …  3.235 s    10 runs

    Summary
      'make backend' ran
        1.16 ± 0.01 times faster than 'docker run --rm -v $PWD:/build -v $HOME/.go:/go -w /build golang:1.22.3 make backend'

(For incremental build, tested with `hyperfine --warmup 1 'make backend' 'docker run --rm -v $PWD:/build -v $HOME/.go:/go -v $HOME/.cache/go-build:/root/.cache/go-build -w /build golang:1.22.3 make backend'` it's 812.9 ms vs 1.146s.)

immibis · 2024-05-24T02:50:58 1716519058

But usually it's not "considerably". Obviously setting up the container environment takes time but it should be well under a second per build.

skydhash · 2024-05-23T17:44:03 1716486243

I’m using VMs these day because of conflicts and inconsistencies between tooling. But the VM is dedicated to one project and I set it up just like a real machine (GUI, browser, and stuff). No file sharing. It’s been a blast.

benwaffle · 2024-05-23T19:16:34 1716491794

reminds me of https://world.hey.com/dhh/we-re-moving-continuous-integratio...

cjk2 · 2024-05-23T19:41:49 1716493309

Yep.

And you usually get lumbered with some shitty thing like github actions which consumes one mortal full time to keep it working, goes down twice a month (yesterday wasn't it this week?), takes bloody forever to build anything and is impossible to debug.

Edit: and MORE YAML HELL!

paulddraper · 2024-05-23T19:09:46 1716491386

> From a billing perspective, AWS does not charge for the EC2 instance itself when stopped, as there's no physical hardware being reserved; a stopped instance is just the configuration that will be used when the instance is started next. Note that you do pay for the root EBS volume though, as it's still consuming storage.

Shutdown standbys absolutely the way to do it.

Does AWS offer anything for this, because it's very tedious to set this up.

tekla · 2024-05-23T19:12:19 1716491539

Warm pools

paulddraper · 2024-05-23T19:38:39 1716493119

yep, that's it, thank you kind person

orf · 2024-05-24T01:10:21 1716513021

It seems that you want to make your root volume as small as possible, and use it to only attach a pre-warmed pool of EBS volumes at launch time that contain the actual config/data you need?

You can launch a stripped down distribution with what, a 200mb disk? Then attach the “useful” EBS volume, and “do stuff” with that - launch a container, or whatever.

nathants · 2024-05-24T07:48:17 1716536897

in the us-west-2-lax-1a local zone, i just booted 100 r5.xlarge spot instances as fortnite like game servers[1]. 1 to be a central server, 99 to be fake players. the server broadcasts x100 write amplified data from every player to every player. the 101st serve is my local pc.

the server broadcasts at 200 MB/s[2]. the whole setup costs me $3-4 usd/hour and by far the slowest part of boot is my game compiling on the central server, whether i store ccache data in s3 or not. i've booted this every day for the last 6 months, to test the game.

if your system can't handle 30s vm boots, your system should improve.

1. https://r2.nathants.workers.dev/ec2_snitch.png

2. https://r2.nathants.workers.dev/ec2_boot.mp4

albert_e · 2024-05-24T05:18:11 1716527891

AWS will (should) make this an optional feature.

Often the technology is the easier part.

The difficult part is how to name the feature intuitively, adding to an ocean of jargon and documentation, and making the configuration knobs intuitive both in UI and CLI/SDK.

Amazon Simple Compute Service :) ?

kylegalbraith · 2024-05-24T07:07:29 1716534449

Other founder of Depot here. AWS is pretty close to this idea with their Warm Pools [0]. But for our use case, they're just too slow to react to changes. We observed 60s+ to notice a change and actually start the machine. That doesn't work when we need to launch the machine as quickly as possible in reaction to a pending GHA job.

That said, I think this is a problem they could likely solve with that functionality, and we'd love to use it.

[0] https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-au...

ldoughty · 2024-05-24T10:46:53 1716547613

The article talks about fio to warm the drive... That's basically fast snapshot restore[0]. This reduces the "first access" penalty for "dirty" blocks. This is probably the slowest part of the entire article (it's about 10 seconds per dirty GB to fio the disk).

[0]: https://docs.aws.amazon.com/ebs/latest/userguide/ebs-fast-sn...