
So I've created ~300k EC2 instances with SadServers, and my experience was that starting an EC2 VM from stopped took ~30 seconds and creating one from an AMI took ~50 seconds.

Recently I decided to actually look at boot times, since I store in the db when the servers are requested and when they become ready, and it turns out for me it's really bimodal: some take about 15-20s and many take about 80s. See the graph at https://x.com/sadservers_com/status/1782081065672118367

Pretty baffled by this (same region, pretty much same everything). Any idea why? Definitely going to try this trick in the article.




My guess is that it's related to AWS Spot capacity.

The second and third spikes at 80 and 140 seconds line up nicely with this kind of behavior.

The second spike would be optimised workloads that can respond to spot interruption in under 60 seconds.

The third spike would be Spot workloads that are being force-terminated.

The reason it falls on those boundaries is that whatever is scheduling your workload only re-checks for free capacity once a minute.

I used to be able to spin up spot instances and basically never get interruptions. They'd stay on for weeks/months.

In my experience, it used to be fairly safe to have Spot instances for most workloads. You'd almost never get Spot interruptions. Now, some regions and instance types are difficult to run Spot instances at all.


Thanks, Spot capacity being scheduled differently would explain the behavior.

Almost all my EC2 instances are Spot, so I can actually compare the distribution with the on-demand ones.

My Spot instances are very short-lived (15-30 mins max), and AFAIK I've never seen a Spot instance force-terminated (though this would be hard to find, I think).


When I say "force-terminated" I mean when you don't voluntarily shut down in response to a SpotInterruption event.

When the event is sent, they give you two minutes to shut down.

If you either don't subscribe to the events, or don't shut down fast enough, they kill the instance.
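The interruption notice is exposed via the instance metadata endpoint (http://169.254.169.254/latest/meta-data/spot/instance-action), which returns a small JSON payload with the action and the deadline. As a minimal sketch of working out how long you have left — using a hypothetical sample payload rather than a live metadata call:

```python
import json
from datetime import datetime, timezone

# Sample payload in the shape returned by the Spot instance-action
# metadata endpoint (values here are made up for illustration).
sample = '{"action": "terminate", "time": "2024-04-21T12:00:00Z"}'

def seconds_until_termination(notice_json, now):
    """Seconds remaining before the instance is force-terminated."""
    notice = json.loads(notice_json)
    deadline = datetime.strptime(
        notice["time"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()

# If the notice arrived just now, you get the full two-minute warning.
now = datetime(2024, 4, 21, 11, 58, 0, tzinfo=timezone.utc)
print(seconds_until_termination(sample, now))  # 120.0
```

In practice a shutdown handler would poll that endpoint every few seconds (it returns 404 until a notice is issued) and start draining work as soon as the payload appears.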


Perhaps in one case you are getting a slice of a machine that is already running, versus AWS powering up a machine that was offline and getting a slice of that one?


Yes, some internal (AWS operations) explanation like the one you suggest makes sense.



