
My take at making AWS EC2 cheaper by automating SPOT instances with AutoScaling - alien_
https://mcristi.wordpress.com/2016/04/21/my-approach-at-making-aws-ec2-affordable-automatic-replacement-of-autoscaling-nodes-with-equivalent-spot-instances/
======
vgt
I'm going to plug Google Cloud's Preemptible VMs as a simpler alternative to
Spot Instances:

\- Preemptible VMs are sold at a fixed 70% discount, removing pricing
volatility entirely

\- Google Cloud's Compute Engine has far fewer VM types, thus making it much
easier to construct the resources you want (the exception being GPUs; you
don't need a "network optimized" instance to get fast network, nor do you need
a "storage optimized" instance to get fast/large storage - these things are
modular on GCE).

(disc: work on Google Cloud)

~~~
gcr
I'm still waiting for _any_ serious competitor to come along with GPU
instances. Google Cloud doesn't have them, Azure _still_ hasn't made them
available despite announcing them over a year ago, and other cloud providers
make you _call a salesperson_ to get a quote on a service that's much more
expensive than what you could run yourself with a trip down to Best Buy.

Amazon is still _eating everyone's lunch_ on the GPU front.

~~~
vgt
Good point on the IaaS front.

One major use case for GPUs is ML. Google Cloud externalizes its ML through
serverless APIs. At that point what matters is your ability to derive value
from ML, rather than having exposure to building blocks.

~~~
lmeyerov
That's an anti-competitive position for an IaaS to take. "We already compete
with your kind of software here, so we won't sell you the hardware"?

I work on some analytics software that includes GPU ML algorithms & non-ML GPU
algorithms, neither of which Google makes. Our peers do a lot of visual
computing in AWS. Really weird thing to hear from a Google rep.

~~~
boulos
Please don't take it that way, Leo! How about:

"Yes, we know! Sorry! If you just need some ML things and can use the Cloud ML
or Cloud Vision services we've got something to tide you over.

If not, we're always excited to get direct 'I want to do X' feedback that we
can translate into 'Customers are demanding Y, and willing to pay Z'".

Please don't take the comment to mean we're trying to box people out of this
space. We're not. We do think (most? many?) people don't want to actually roll
their own, but we love everyone.

Disclosure: I work on Compute Engine.

~~~
vgt
what he said :)

------
kiwidrew
Why would you bother going to all this trouble when you could just use
Amazon's official Spot Fleet service/API?

[http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-
flee...](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-fleet.html)

~~~
dmourati
He started before the feature was released and acknowledges they came up with
an in-product solution:

"and it seems they now have a full fledged solution for the problem, based on
pretty much a reimplementation of AutoScaling, using machine learning and with
a beautiful UI and they are really successful with it. Funnily enough, they
even contacted me to sell that solution to my company and we are seriously
evaluating it"

~~~
kiwidrew
Ah, okay, I missed that part!

~~~
alien_
Yes, but I actually did start a few weeks before the spot fleet was launched.

The problems with the spot fleet: 1) it's kind of awkward to use, 2) it has
statically defined capacity so you can't scale it, 3) it has a static bid
price, so if at some point you are outbid on all the group's bids, you end up
with no capacity, and 4) among other things it lacks integration with the ELB,
so you can't really use it for many use cases.

My solution is simpler, better integrated with the rest of AWS and more
resilient and once I iron out the bugs and get it production-ready, it should
be a better choice.

------
ecesena
One issue that we had in the past was the following - hope this is resolved
now.

Say you have an autoscaling group spanning zones A and B, and say you
have 1 machine in zone A and 1 machine in zone B.

Now, the price in zone A goes up, and your machine in zone A dies.

The issue (bug?) was that the autoscaling group was trying to re-instantiate a
new VM in zone A. Of course, since the price was high, the new VM was
basically immediately dying. And so on.

Edit: that issue aside, it's a great way to save money, especially if you
have a group of VMs whose computation can be interrupted/restarted relatively
cheaply.

~~~
alien_
With my approach the AutoScaling group would always replace failed instances
with the on-demand ones identical to those initially defined on the group's
launch configuration.

All I do is I later attempt to replace them with whatever I can buy from the
spot market.

On the other hand, the spot bidding implemented out of the box in AutoScaling
will fail if you are outbid in all Availability Zones at the same time, since
it doesn't fall back to on-demand instances. I've seen people often use a
second on-demand AutoScaling group that would scale out when you get outbid on
the spot one, but then you have a problem defining scaling policies so that
they can scale nicely, and/or shifting the capacity between them. Someone had
a nice talk at re:invent about how they do all that.
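
A rough sketch of that fallback idea in Python, not the actual code of the tool: the function names, the bid and the prices below are all made up for illustration. It only swaps an on-demand node for spot when some zone is currently under the bid (and cheaper than on-demand), and otherwise keeps the on-demand node, which also avoids relaunching over and over into a zone where the price has spiked.

```python
def pick_spot_zone(spot_prices, bid, on_demand_price):
    """Return the cheapest zone whose current spot price is at or below
    the bid and below the on-demand price, or None if there is none.

    spot_prices is a hypothetical {zone_name: price_per_hour} mapping.
    """
    affordable = {
        zone: price
        for zone, price in spot_prices.items()
        if price <= bid and price < on_demand_price
    }
    if not affordable:
        return None
    return min(affordable, key=affordable.get)


def plan_replacement(spot_prices, bid, on_demand_price):
    """Decide what to do with one on-demand node in the group."""
    zone = pick_spot_zone(spot_prices, bid, on_demand_price)
    if zone is None:
        # Outbid everywhere: keep the on-demand capacity the group launched.
        return ("keep-on-demand", None)
    return ("replace-with-spot", zone)


# Made-up prices: zone "a" has spiked, zone "b" is still cheap.
print(plan_replacement({"us-east-1a": 0.50, "us-east-1b": 0.03},
                       bid=0.10, on_demand_price=0.12))
```

With these invented numbers the plan picks zone `us-east-1b`; if every zone were above the bid, the plan would be to keep the on-demand node.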

------
flaviotruzzi
Looks like
[https://github.com/chaordic/tiopatinhas](https://github.com/chaordic/tiopatinhas)

~~~
alien_
I've looked in a bit more detail and it seems to attach the nodes to the ELB
used by the AutoScaling group, while I am attaching them to the group itself,
which indirectly adds them to the load balancer.

I'm curious how that tool handles the scaling of the group, since the
nodes are actually outside the group and can't contribute to group-wide
metrics like average CPU usage, often used for scaling out.
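
To make the metric concern concrete, here is a toy calculation in Python with invented CPU numbers. The variable names are hypothetical, and it assumes the spot nodes really do sit outside the group and are not counted in its metrics.

```python
def average(values):
    """Plain arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Invented CPU utilisation figures, in percent.
in_group_cpu = [80.0, 80.0]   # nodes that are members of the group
external_cpu = [20.0, 20.0]   # spot nodes attached only to the load balancer

group_view = average(in_group_cpu)                 # what a group-wide metric sees
fleet_view = average(in_group_cpu + external_cpu)  # all nodes behind the ELB

print(group_view, fleet_view)
```

A scale-out policy keyed on the group average would react to 80% here, although the whole fleet behind the load balancer averages 50%.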

------
voltagex_
I've got
[https://github.com/voltagex/junkcode/tree/master/Python/spot...](https://github.com/voltagex/junkcode/tree/master/Python/spotprices)
and
[https://github.com/voltagex/junkcode/tree/master/CSharp/spot...](https://github.com/voltagex/junkcode/tree/master/CSharp/spotprices)
but I was never really happy enough with either to take them past my junkcode
folder.

If I could make the Python one work a bit better, I'd probably use it with
Ansible - [http://snappishproductions.com/2014/03/24/Spot-Instances-
Wit...](http://snappishproductions.com/2014/03/24/Spot-Instances-With-
Ansible.html)

~~~
x5n1
Funny, Amazon acquired a startup for millions that did just this sort of thing.

~~~
voltagex_
Who knows, maybe my junk code is gold!

------
boulos
Why do you only update / rebalance every 30 minutes? Is that because of the
per-hour billing (so replacing an on-demand one only partway through its life
is a mistake)? If so, it still seems like you'd want to inspect every 5
minutes or so (keep the on-demand instance until it is more than 60 minus the
check frequency minutes into its billing hour).
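
The "keep on-demand until late in the billing hour" rule could be sketched like this in Python, assuming per-hour billing that starts at launch time. The function names and the 5-minute default are illustrative, not taken from the tool.

```python
from datetime import datetime


def minutes_into_billing_hour(launch_time, now):
    """Minutes elapsed in the instance's current (assumed) billing hour."""
    elapsed_minutes = (now - launch_time).total_seconds() / 60
    return elapsed_minutes % 60


def should_replace_now(launch_time, now, check_interval_min=5):
    """True once the instance is more than 60 - check_interval_min minutes
    into its billing hour, i.e. the next check would land in a new paid hour.
    """
    return minutes_into_billing_hour(launch_time, now) > 60 - check_interval_min


launched = datetime(2016, 4, 21, 12, 0)
print(should_replace_now(launched, datetime(2016, 4, 21, 12, 30)))  # mid-hour
print(should_replace_now(launched, datetime(2016, 4, 21, 12, 56)))  # late in the hour
```

Checking every 5 minutes this way, an on-demand node is only swapped out in the last few minutes of an hour that has already been paid for.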

Disclosure: I work on Compute Engine (and launched our preemptible VMs
product).

~~~
alien_
It's because so far I am only considering replacing the nodes slowly, one at a
time, mostly in order not to hit any soft limits that may be defined on the
account (like total number of instances), but also in order to give the user
the chance to stop it in case things go south for whatever reason during the
evaluation, since as I warned, this thing is likely full of bugs at this
point.

