
Creating Your Own EC2 Spot Market - Chris911
http://techblog.netflix.com/2015/09/creating-your-own-ec2-spot-market.html
======
cperciva
Does this save much money compared to using Amazon's spot capacity? In the
early days I remember the spot price closely tracked the "per operating hour"
RI prices.

It would be a very nice feature if EC2 could incorporate each user's pool of
reserved instances into their view of the spot market, but maybe that would
create too much complexity on Amazon's side.

~~~
samstave
These are different models.

The spot pool is Amazon's unused capacity vs their allotment of RIs.

RIs always get priority capacity over spot, and spot instances always have
highest bidder priority.

If there is a ton of spot capacity one hour, and then Netflix scales up in that
zone to launch jobs on their RIs, the RIs get launched and spot instances get
whacked with the message "capacity over subscribed".

The RI is a guarantee that when you choose to launch that instance, capacity
exists. Period. (Unless your instance limit for that type/zone/region is hit)
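
The priority rule described above can be sketched as a toy model. This is only an illustration under stated assumptions: a fixed-size pool per zone, one eviction per RI launch, and a hypothetical `Pool` class. None of these names are AWS APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Toy model of one zone's capacity (hypothetical, not an AWS API)."""
    capacity: int
    ri_running: int = 0
    spot_bids: list = field(default_factory=list)  # bids of running spot instances

    def used(self) -> int:
        return self.ri_running + len(self.spot_bids)

    def launch_spot(self, bid: float) -> bool:
        """Spot only gets capacity that is currently unused."""
        if self.used() >= self.capacity:
            return False
        self.spot_bids.append(bid)
        return True

    def launch_ri(self) -> list:
        """Launch an RI-backed instance; return the bids of any evicted spot instances."""
        evicted = []
        if self.used() >= self.capacity:
            if not self.spot_bids:
                # Assumption: in this toy model the pool can fill up with RIs.
                raise RuntimeError("pool is full of RI-backed instances")
            lowest = min(self.spot_bids)  # the lowest bidder loses first
            self.spot_bids.remove(lowest)
            evicted.append(lowest)
        self.ri_running += 1
        return evicted


pool = Pool(capacity=3)
for bid in (0.10, 0.25, 0.15):
    pool.launch_spot(bid)
evicted = pool.launch_ri()  # pool was full, so the 0.10 bidder is evicted
```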

So no, Amazon could never do that.

However, AFAIK, Netflix can do what it wants with the RIs...

Think of Heroku: millions of sub-leases on their RIs to user apps.... But
Amazon cannot meddle with the availability of the RIs that Heroku has
purchased.

~~~
cperciva
I know RIs are different from spot instances. But it should be possible to
have a spot instance API front-end which pulls from unused RIs first -- just
like on-demand instances do, in fact.

~~~
samstave
But that is how it already works...

The difference is that AWS spot instances pull from capacity that is unused,
irrespective of RIs that _you_ don't own... But when _I_ come to launch my
_RI_ and capacity is full, your spot instance gets whacked.

What Netflix is doing is an internal meta-spot-market on _THEIR_ idle RIs.
This compute capacity will never be available to anyone but them unless they
create a service to sublet the capacity via the same tools they offer their
internal teams... From Amazon's perspective that capacity is already paid for
regardless of consumption...

Forgive me if I am articulating this poorly...

In the end, I just find the numbers disclosed in here to be awesome and I
suspect they are but a fraction of the total capacity Netflix requires.

That they disclose 12k idle trough RIs is very telling about efficiencies -
but since I don't have the whole picture of their footprint, or instance
demographic, I can't tell how much of this completes the picture of their
landscape.

I'll assume 12k represents perhaps 80% of the machines they use... They have
control-plane, orchestration, DR, private peering nodes (can't recall the name,
but like Backblaze pods), and any number of varied stage test debut envs...

So from a pure "stream or encode my shit" _maybe_ 80% works, hopefully it's
better - but could be worse...

Netflix's original mandate to push all DC ops and whatnot out of the building
is not only super awesome, but is one of the linchpins in AWS's business model.
There should be whole career paths in college designed around this model...

~~~
cperciva
You're not understanding me. If I have a Reserved Instance already paid for
and tell Amazon that I want a Spot Instance, I'll be charged the Spot Instance
price for it. What I'm saying is that launching a spot instance should use my
idle RI capacity _and bill me as if I had launched those instances as
on-demand_.

~~~
samstave
I understand you, but that is an incorrect assumption.

AWS has resources. They are priced at X.

If you buy on demand, you pay X.

If you buy an RI, you pay X-[RI discount].

If you launch an instance, and you have an RI that applies, you get the
discount.

Spot instances are a different game. Spot and RI are mutually exclusive when
costing.

Spot is only affected by capacity when it comes to RIs.

If you have RIs, you can't swap a running instance between spot or RI or OD
pricing...

AWS calls this "lifecycle" - spot or normal. RI discount "coupons" only apply
to an instance in the same zone on normal lifecycle that can receive the
available RI pricing discount.

They do not overlap...
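
A minimal sketch of the rule as described in this comment, for illustration only. The `Instance` and `Reservation` types and the `billing_mode` function are hypothetical names, and matching on instance type as well as zone is an assumption about what "an RI that applies" means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    instance_type: str
    zone: str
    lifecycle: str  # "normal" or "spot"


@dataclass(frozen=True)
class Reservation:
    instance_type: str
    zone: str


def billing_mode(instance: Instance, idle_reservations: list) -> str:
    """Return how a launch would be billed under the rule described above."""
    # Spot lifecycle never consumes an RI discount, even if one is idle.
    if instance.lifecycle == "spot":
        return "spot"
    # Normal lifecycle: the discount applies only if a matching RI is idle.
    for r in idle_reservations:
        # Assumption: a match requires the same instance type and zone.
        if r.instance_type == instance.instance_type and r.zone == instance.zone:
            return "reserved"
    return "on-demand"
```

Under this model, what cperciva is asking for would amount to removing the early return for the spot lifecycle, so that a spot request falls through to the idle-RI check first.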

~~~
cperciva
I _know_ they don't overlap! I'm saying they _should_ overlap.

~~~
samstave
Had this discussion with AWS multiple times. Never going to happen.

------
samstave
This is amazingly good info.

First - I've been seeking insight into how Netflix handles their footprint and
what instance types they use. I dealt with these problems on a MUCH smaller
scale (1K machines) but the challenges are the same...

Here is what I can derive from this info:

So given just 1500 idle Personalization R3.4xlarge instances for a given month
-- if they did not use them, they were losing $265,140 per month on these
instances (assuming a 36-month RI on these instances...). If they had these as
no-upfront RIs -- then they are wasting $475,200 per month on these instances.

Obviously Netflix gets significant discounts on their RIs -- as far as I am
aware the common customer will only get a discount of up to 5% off RIs after
their initial $500K RI buy... I'm sure Netflix has even better...

Anyway - as we don't know what the other instance types are that they use, if
we were only to assume that the other 12K instances were the same and their
utilization was only 50% of the day -- they would be wasting as little as
$2MM/mo to as much as $3.8MM per month...
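
The napkin math can be reproduced roughly as follows. This is a sketch only: the 720-hour (30-day) month and the function names are assumptions, and the hourly rates are backed out of the dollar figures quoted above rather than taken from a price list.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month


def implied_hourly_rate(monthly_cost, instances, hours=HOURS_PER_MONTH):
    """Back out the per-instance hourly rate implied by a monthly total."""
    return monthly_cost / (instances * hours)


def monthly_idle_cost(instances, hourly_rate, idle_fraction=1.0, hours=HOURS_PER_MONTH):
    """Cost of paying for instances that sit idle for a fraction of the month."""
    return instances * hours * idle_fraction * hourly_rate


# Rates implied by the 1500-instance figures quoted above.
ri_rate = implied_hourly_rate(265_140, 1500)
no_upfront_rate = implied_hourly_rate(475_200, 1500)

# Scale to 12,000 instances, fully idle and idle half of the day.
for label, rate in (("36-month RI", ri_rate), ("no-upfront RI", no_upfront_rate)):
    full = monthly_idle_cost(12_000, rate)
    half = monthly_idle_cost(12_000, rate, idle_fraction=0.5)
    print(f"{label}: ${rate:.4f}/hr, fully idle ${full:,.0f}/mo, 50% idle ${half:,.0f}/mo")
```

With these assumptions, the fully idle case comes out at about $2.1MM and $3.8MM per month, which appears to be where the quoted range comes from; idle only 50% of the day, the same rates give roughly half of that.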

Finally - it's great to see them mention specifically that the granularity of
AWS is only hourly for instances - but their internal granularity is to the
minute. The hourly granularity of AWS has been a thorn for some time. I
certainly hope that Netflix will open source their spot manager at some point
(Recall that AWS employees went out and started ClusterK with their spot
manager, the Balancer, and then AWS came along and bought that little company
up)....

There is _significant_ value in being able to manage the spot market (your
meta-version, like Netflix, and AWS)....

What I would REALLY find interesting is if the Netflix Spot Manager were able
to split and combine RIs and move them from zone to zone.

*Disclaimer: the napkin math above obviously does not take into account that the RIs and instances are in all regions - so the pricing will be much different and by region... but it does illustrate, even by those 1500 RIs how much money can be wasted on under-utilized instances.

Netflix could start a service offering batch jobs to other companies to
utilize capacity if they wanted. 12K instances unused for even an hour is
going to be say, $6,000/hour that could potentially be wasted...

~~~
eloff
Interestingly enough Google Cloud bills by the minute after the first ten
minutes.

~~~
deftnerd
For a moment, I thought you meant that the first 10 minutes were free. I had
all sorts of fun ideas brewing about minimal OS's and trying to take advantage
of free resources.

I looked into it and they do billing by the minute with a minimum of 10
minutes billed.

Darn!
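
For comparison, here is a small sketch of the two billing granularities mentioned in this thread: hourly rounding versus per-minute billing with a 10-minute minimum. The function names are made up for illustration, and rounding partial minutes up is an assumption.

```python
import math


def hourly_billed_minutes(runtime_minutes: float) -> int:
    """Hourly granularity: every started hour is billed in full."""
    return math.ceil(runtime_minutes / 60) * 60


def per_minute_billed_minutes(runtime_minutes: float, minimum: int = 10) -> int:
    """Per-minute granularity with a minimum charge (10 minutes by default)."""
    # Assumption: a partial minute is rounded up to a whole minute.
    return max(minimum, math.ceil(runtime_minutes))


# Compare both schemes for a few job lengths (in minutes).
for runtime in (3, 10, 61):
    print(runtime, hourly_billed_minutes(runtime), per_minute_billed_minutes(runtime))
```

A 61-minute job is the painful case for hourly billing: it is charged as 120 minutes, versus 61 under per-minute billing.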

