
The Cost Savings Of Netflix's Internal Spot Market - yarapavan
http://highscalability.com/blog/2017/12/4/the-eternal-cost-savings-of-netflixs-internal-spot-market.html
======
kcorbitt
> Video encoding is 70% of Netflix’s computing needs, running on 300,000 CPUs
> in over 1000 different autoscaling groups.

This feels crazy to me. They'll have to encode all their content into 15-20
different formats for various devices/bitrates, but surely they cache the
output for each of those, so it only has to be done once per show/movie.
Compare that to serving content, which happens thousands or millions of times
for each title. This isn't YouTube; there's not much of a "long tail" of
content that's only watched a few times but has to be encoded anyway.

~~~
mgummelt
> Stranger Things season 2 is shot in 8K and has nine episodes. The source
> files are terabytes and terabytes of data. Each season required 190,000 CPU
> hours to encode.

High-resolution video is larger than you might expect.
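
For a sense of scale, a back-of-the-envelope sketch (the bit depth and frame rate here are my assumptions, not Netflix's numbers):

```python
# Rough raw-footage math for 8K video; assumptions are mine, not Netflix's.
width, height = 7680, 4320      # 8K UHD resolution
bytes_per_pixel = 3             # assumed 8-bit RGB; real camera raw varies
fps = 24

frame_bytes = width * height * bytes_per_pixel   # bytes per frame
hour_bytes = frame_bytes * fps * 3600            # one hour of footage

print(f"{frame_bytes / 1e6:.0f} MB/frame, {hour_bytes / 1e12:.1f} TB/hour")
# -> roughly 100 MB/frame and ~8.6 TB/hour of source footage
```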

~~~
abtinf
Is there such a thing as a video, image, or audio format that is inefficient
for viewing, but is radically more efficient for transcoding? Such a technique
would be useful for a number of applications.

~~~
Retric
Sure, raw video is one example of that: it takes insane bandwidth and storage,
but you can load it from an SSD fast enough that this isn't an issue. Really,
codecs need to be closely related before an encoding in format X helps you
produce an encoding in format Y.

~~~
tomjen3
For comparison: my camera uses 25 megabytes of storage per raw frame. I have
no idea how much raw sound takes up, but you would have to move around a lot
of SSDs to work with that.

~~~
Retric
It can get worse: 100+ megapixels at 60 fps:
[http://www.forzasilicon.com/2014/04/forza-silicon-introduces...](http://www.forzasilicon.com/2014/04/forza-silicon-introduces-advanced-cmos-modular-video-camera-platform/)

Though I have no idea what network connections they would use. I mean, even
in black and white this would saturate a 100 Gbps link.
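
A quick sanity check on that claim (the sensor bit depth is my assumption):

```python
# Uncompressed bitrate of a 100+ megapixel, 60 fps monochrome sensor.
megapixels = 100e6
bits_per_pixel = 12    # assumed monochrome sensor bit depth
fps = 60

gbps = megapixels * bits_per_pixel * fps / 1e9
print(f"{gbps:.0f} Gbps uncompressed")
# -> 72 Gbps at exactly 100 MP; past ~140 MP (or at higher bit depths)
#    it exceeds a 100 Gbps link
```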

------
gourou
The article is about improving resource utilization, which is usually very low
for cloud users (between 6% and 12%, according to McKinsey). It's too bad the
article fails to mention resource utilization numbers at Netflix.

Companies like Google and Twitter have been co-locating workloads using
containers for a while now (with Borg and Mesos, respectively), but their
utilization is still below 40%.

[https://www.sallan.org/pdf-docs/McKinsey_Data_Center_Efficie...](https://www.sallan.org/pdf-docs/McKinsey_Data_Center_Efficiency.pdf)

[http://www.pdl.cmu.edu/PDL-FTP/CloudComputing/googletrace-so...](http://www.pdl.cmu.edu/PDL-FTP/CloudComputing/googletrace-socc2012.pdf)

~~~
Aissen
It's probably because these cloud companies consider this type of data a trade
secret. I'm actually surprised they talk so freely about all this. Maybe it's
because they're years ahead of the competition and show no sign of slowing
down their innovation.

~~~
dvanduzer
I recall seeing a technical presentation by one of these larger organizations
(probably Netflix) where they explained that they have to do these tech talks
in order to spread the knowledge widely enough that they can find people to
hire to work on their stuff. Staffing was _explicitly_ the reason they were
speaking so freely about details other shops typically consider proprietary.

~~~
pc86
What sorts of things does one need to learn to get a software engineering job
at a place like Netflix? Where would you even go to learn that type of thing
(other than Netflix)?

As someone who spends most of their day writing C# web apps for clients,
something like a job at Netflix seems like an impossibility at this point.

~~~
eropple
Netflix is, mostly, just a Java shop. They follow good practices in a lot of
ways (circuit-breaker patterns, abstracted access to data stores, etc.) but
it's still a Java shop.
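
For a concrete example of one of those patterns, here is a minimal circuit-breaker sketch in Python. The names and thresholds are made up; Netflix's actual implementation is their Hystrix library, which does far more (fallbacks, metrics, thread isolation):

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency for a cool-down period
    instead of hammering it while it's down."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open; failing fast")
            self.opened_at = None      # cool-down over: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0              # a success fully closes the circuit
        return result
```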

Their infrastructural/cloud ops teams are broader in the proper nouns they
wrangle, but that's a different kettle of fish and is totally learnable
outside of Netflix: understand AWS (when to build versus buy), understand how
systems talk to one another (networking, HTTP for online queries versus queues
for offline jobs), and how systems scale and repair after failure (horizontal
versus vertical scaling, quorums for decision-making, when to use CP versus AP
datastores, that kind of thing). And Netflix open-sources enough of their
internal tooling that, once you have a baseline in the infrastructural field,
you can probably do a good job of figuring out what the missing bits are.

Having not worked there, but having worked on fairly large systems, I'm sure
Netflix has its own wacky problems, but they're the kind of thing a baseline
understanding of this stuff enables you to learn. It's all just digital
plumbing.

------
peterwwillis
If you weren't running in the cloud, you would just use baremetal machines
with prioritized job queues. You can keep a low-priority job queue doing video
encoding 24/7, and keep all other jobs and applications at a higher priority.
The result is continuous resource utilization at near-peak with little
contention.

This reduces complexity and overhead significantly. With a cloud model, each
level of abstraction requires new methods for networking, orchestration,
configuration, execution, debugging, etc. On a single baremetal machine, all
of the resources are essentially the same things running on one layer. But you
can still get fancy with cgroups, namespaces and quotas if you need enforced
isolation or resource limits.
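
A minimal sketch of that prioritized-queue idea in Python (job names are hypothetical; a real setup would add cgroups-style resource limits on top):

```python
import heapq

# Interactive work always runs first; encoding soaks up leftover capacity.
HIGH, LOW = 0, 1          # lower number = higher priority

queue, counter = [], 0    # counter keeps equal-priority jobs FIFO

def submit(priority, job):
    global counter
    heapq.heappush(queue, (priority, counter, job))
    counter += 1

def next_job():
    # Low-priority encode work is only dequeued when nothing urgent waits.
    return heapq.heappop(queue)[2] if queue else None

submit(LOW, "encode episode_01.mkv")
submit(HIGH, "serve user request")
assert next_job() == "serve user request"
assert next_job() == "encode episode_01.mkv"
```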

~~~
mark242
Plus the capex costs of constantly having to buy new hardware and stick it in
a datacenter for when your workload goes up.

Plus the costs of maintaining that hardware both from an opex and capex point
of view.

Plus the capex and opex costs of creating and maintaining the high speed
network required to ship those files back and forth within the datacenter
space that you're renting.

Plus the massive hit in how nimble your team can be. (Want to support a new
codec? Either buy and install new hardware, or set the priority on these re-
encodes lower and tell your boss to wait six months)

When you have an extremely variable workload like Netflix's, even though
AWS/GCP/Azure are more expensive on a unit-cost basis, the savings in
operational expenditure more than make up for the difference. Not having to
buy and maintain a bunch of baremetal machines is a massive, massive cost
savings.

Lambda / Google Cloud Functions / Azure Functions are a natural extension of
this: your write/test/deploy cycle can be much, much faster and you don't even
have to worry about maintaining any infrastructure. The brainpower savings
more than make up for the extra cost per CPU cycle.
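
For illustration, a function in that model is just a stateless handler. This Python sketch uses AWS Lambda's (event, context) handler signature, but the event fields are hypothetical:

```python
# Sketch of a stateless function-as-a-service worker. Only the
# (event, context) signature is Lambda's; the event fields are hypothetical.
def handler(event, context):
    video_id = event["video_id"]                 # hypothetical input field
    profile = event.get("profile", "h264_720p")  # hypothetical input field
    # ... perform or enqueue one small unit of encoding work here ...
    return {"status": "done", "video_id": video_id, "profile": profile}
```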

~~~
peterwwillis
Well, first of all, you can _always_ pay someone else for baremetal server
hosting and support, just like with Amazon. So you don't need to buy and
maintain, and you can still save costs: you're renting, just like with
Amazon.

Second, even if Netflix's "workload" is variable, they're still using reserved
instances. There's no practical difference between those and baremetal
machines.

Finally, I am _highly_ skeptical of anyone telling me that five more layers of
abstraction make it easier or quicker to write, test and deploy code. Code is
code. Servers are servers. Microservices are microservices. Different people
are (or should be) maintaining all of these, and if they do their jobs right
it should not incur a performance penalty on anyone else.

I will also state the obvious: if you don't hire the right people to manage
the right pieces, you will waste huge amounts of time and money trying to
figure out what other people have known for decades. This is universal, and
has nothing to do with what technology you use.

~~~
voltagex_
> 300,000 CPUs

You better have a bloody big datacentre.

------
thisisit
I have tried reading this twice but I can't figure out how they're leveraging
the internal spot market. Are they re-selling unused capacity to a third party?

~~~
nickcw
> I have tried reading this twice but I can't figure out how they're leveraging
> the internal spot market. Are they re-selling unused capacity to a third party?

They have reserved a lot of instances in EC2.

[https://aws.amazon.com/ec2/pricing/reserved-instances/](https://aws.amazon.com/ec2/pricing/reserved-instances/)

You can get a steep discount for reserving instances. A reserved instance is
typically reserved for a number of years.

When these aren't in use, they can put them into their video encoding pool,
where they are essentially free.

I guess they reserve lots of instances to make sure they have the capacity at
peak times and to get some serious discounts.
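
A back-of-the-envelope sketch of why this works, with made-up prices (not real AWS rates):

```python
# Hypothetical prices, NOT real AWS rates.
on_demand_per_hour = 0.10   # pay only while running
reserved_per_hour = 0.06    # effective rate, but paid for all 8,760 h/year

break_even = reserved_per_hour / on_demand_per_hour
print(f"A reserved instance wins if it's busy more than {break_even:.0%} "
      f"of the year ({break_even * 8760:.0f} hours).")
# Idle reserved hours are already paid for, which is exactly why filling
# them with encoding work is "essentially free".
```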

~~~
hinkley
The one universal truth in corporate America is that if your boss’s boss spent
a million dollars on something, they’re going to make you ‘use’ it if it’s the
last thing they do.

Because they worry it will be, if they fail. See also: multi-year contracts
for terrible software.

~~~
pc86
Netflix needs _n_ instances to operate at peak capacity. The article is about
them utilizing those pre-paid, _necessary_ compute resources when the actual
streaming they do requires only a small fraction of _n_. I'm not sure how that
correlates to multi-year software contracts.

~~~
fragmede
Reserved instances are a 1- or 3-year commitment in exchange for the discount,
compared to the more expensive on-demand instances with no minimum commitment.

------
naturalgradient
So they built...a scheduler?

~~~
mnx
Well, a multi-machine scheduler.

~~~
robnagler
In the HPC world, there are many of these: SLURM, Torque, PBS.

~~~
toomuchtodo
Don't forget Globus and Condor!

~~~
pnutjam
How could you overlook (son of) Grid Engine?

------
llama052
I can't help but think that a hybrid cloud/on-premise solution would be
cheaper in the long run for them. Given the millions of dollars they pour into
AWS, it seems like they could colocate/contract out some heavy on-premise
hardware for the standard load and then offload peak load to spot instances.

~~~
remus
It's hard to say without seeing the numbers, as they're almost certainly
getting a very steep discount from AWS (in return for extolling the virtues of
AWS, being featured in AWS promo material, etc.). In addition, I would assume
there's some sort of exclusivity clause in there.

------
robnagler
Something is wrong with the math:

* Video encoding is 70% of Netflix's computing needs

* Stated cost savings on encoding is 92%

That works out to 0.70 × 0.92 ≈ a 64% savings on "Netflix's computing needs",
which must be a good chunk of their budget.

~~~
jedberg
What's wrong with the math? They saved a significant amount of money doing
this, which is why it was worth doing.

~~~
robnagler
I would think it would be a significant shareholder event to save 64% of their
total computing budget. Their 2016 10K

[https://ir.netflix.com/secfiling.cfm?filingID=1628280-17-496...](https://ir.netflix.com/secfiling.cfm?filingID=1628280-17-496&CIK=1065280)

says their 3rd-party cloud computing costs increased by $23M in 2015 and a
further $27M in 2016. More importantly, their total technology and development
cost increased from $650M in 2015 to $850M in 2016. That includes engineers,
so it's a bit tricky to figure out what their computing costs really are, but
they didn't go down appreciably in 2016, so this effect had to be in 2017.
Looking at their first three 10-Qs shows their Technology and Development
budget keeps going up. I don't see how a streaming service could save 64% of
its total computing needs without it showing up in any of its SEC filings.

In 2016 they completed a seven-year move of all their computing infrastructure
to AWS. When they sized the system, they bought reserved instances (probably
at a significant discount). By the math in this article, they bought 64% more
than they needed, if they just discovered they could save 64% of their EC2
budget. That seems like a big math error.

~~~
jedberg
I'm a former insider so I can't get into details, but if you look you'll see
that almost the entirety of their cost is content acquisition. The next
biggest cost is salary.

Servers barely even register as far as costs go. Also, it's a 92% savings
over what the cost would have been without the system, not an absolute 92%.
The farm keeps growing all the time as movies become higher resolution and
more encodings are supported.

~~~
robnagler
Technology and Development (T&D) is directly related to computing, not content
acquisition or sales. Computing costs are significant in this category, since
they are called out explicitly in the notes. A 64% savings would show up,
since we know that 3rd-party cloud costs increased by $50M over the last two
years. Given the number of instances they must be using, the costs must be
running well over $100M.

~~~
jedberg
The number you don't know is how much the encoding workload grew in the time
it took them to develop the system.

Let's use your numbers. Say that two years ago computing costs were $50M, and
encoding was $46M of that. Now say their costs are currently $100M, but the
encoding workload grew 6x. Under the old system that would have cost $276M,
but under the new system it is only $22M. That would be a 92% savings, and
would be totally in line given that in the last few years they have
drastically increased their machine learning output, which would have
overtaken encoding work.

------
mac01021
> There’s lots of encoding work to do. Netflix supports 2200 devices and lots
> of different codecs, so video is needed in many different file formats.

Why? Don't they write/control their own client apps and video players? Why
isn't a single codec with a choice of 3 or 4 different resolutions sufficient?

~~~
erinaceousjones
On most platforms that aren't powerful games consoles or personal computers,
the video decoding is done by proprietary hardware; these hardware decoders
usually only support a few families of codecs and are difficult or impossible
to reprogram.

Even if most things support, say, H.264 out of the box, each video decoding
SoC has different quirks which means either A) the same encoding looks
different on different hardware or B) the encoding fails to decode on
different hardware, so even the same codec can have a bunch of parameters
which can/need to be adjusted for different devices.

One smart TV might have a Broadcom SoC with a GPU which supports h.264
decoding like the Raspberry Pis do, an Android tablet might have a Mali GPU
which can decode VP9 and HEVC, some weird older set top box which has netflix
built in might only support MPEG...
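
To make that concrete, here is roughly what per-device parameter differences look like as ffmpeg invocations, wrapped in Python. The device names, profiles, and levels are made-up examples, not Netflix's actual encode ladder:

```python
import subprocess

# Hypothetical per-device H.264 constraints; a weak hardware decoder
# needs a lower profile/level than a modern smart-TV SoC.
DEVICE_PROFILES = {
    "old_set_top_box": {"profile": "baseline", "level": "3.0", "size": "1280:720"},
    "smart_tv":        {"profile": "high",     "level": "4.1", "size": "1920:1080"},
}

def encode_for(src, device):
    p = DEVICE_PROFILES[device]
    subprocess.run([
        "ffmpeg", "-i", src,
        "-c:v", "libx264",
        "-profile:v", p["profile"],   # restricts which H.264 features are used
        "-level", p["level"],         # caps decoder resource requirements
        "-vf", f"scale={p['size']}",
        f"{src}.{device}.mp4",
    ], check=True)

encode_for("episode_01.mov", "old_set_top_box")
```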

------
ploggingdev
Are there any posts that analyse how the economics of offering a $10/month
streaming plan work out? Since they are hosted on AWS, where data transfer is
expensive, and given the cost of licensing the content they serve, how do
they manage to offer streaming at such a low price point?

~~~
anderiv
Netflix doesn’t stream from AWS. Rather, they’ve built their own video CDN[1]
with PoPs all over the globe. All of their user-facing streaming happens from
their CDN. Their AWS infrastructure is used for billing, transcoding,
playlists, etc. Also, they perform some seeding of the CDN from within AWS,
but the seed boxes themselves can seed each other as well.

All in all, they clearly _do_ push huge amounts of data out of AWS, but that
data accounts for a very small percentage of the total that gets pushed out
to clients via their CDN.

[1] [https://openconnect.netflix.com/en/](https://openconnect.netflix.com/en/)

~~~
anonacct37
I talked with the CenturyLink tech who hooked up my fiber and he mentioned
they had just wired up some 40 gigabit appliances. This was in Utah. They are
all over.

------
exabrial
Every time I read articles about Netflix running on EC2, I really, really,
really want to know who's winning more. Is Netflix really saving money by
using EC2 vs. ownership? Or is Amazon raking in cash from Netflix, with
Netflix basically subsidising the build-out costs of EC2?

~~~
moxious
It could be both. Just because they're paying Amazon an arm and a leg doesn't
mean they have the ability or the time to do a better job for cheaper.

------
tbirrell
Can someone explain what a spot market is in this context?

~~~
samstave
Spot, in the AWS market, as you likely know, is the ability to bid for
underutilized compute capacity in a regional data center: compute that is not
otherwise reserved or dedicated to a particular AWS client account. The risk
is that if I outbid you, I can slurp your instances out from under you, based
on whether they are needed from the inventory to complete my bid request.

In the context of Netflix: they purchased tons of reserved instances in AWS.
These reserved instances never enter the AWS spot pool, so they are compute
power dedicated to Netflix.

However, Netflix may not use those reserved instances to their full capacity,
so they have idle time that Netflix is still paying for.

To make efficient use of the tons of money paid to reserve those AWS
instances exclusively for Netflix's needs, when they were only needed for
streaming to clients for, say, 30% of the day, they built an internal compute
bidding market at Netflix, so that other jobs could say “hey Netflix, I need
to run this encoding job and I need a bunch of boxes; out of all our reserved
instances, gimme some that are not currently utilized”.

So they layered their own internal compute request service on top, to
maximize the utilization of their tons of reserved instances, instead of
spinning up additional AWS instances to complete these other jobs...
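
A toy sketch of what such an internal market might look like (all names, bids, and numbers are hypothetical; the article doesn't describe Netflix's actual allocation algorithm):

```python
import heapq

def allocate(idle_instances, bids):
    """bids: list of (job_name, bid, instances_wanted); higher bid wins."""
    ranked = [(-bid, name, want) for name, bid, want in bids]
    heapq.heapify(ranked)                      # min-heap on negated bids
    allocation = {}
    while ranked and idle_instances > 0:
        _, name, want = heapq.heappop(ranked)
        granted = min(want, idle_instances)    # partial fills allowed
        allocation[name] = granted
        idle_instances -= granted
    return allocation

# Off-peak, 500 reserved instances sit idle; encoding outbids analytics.
print(allocate(500, [("encode_season", 9, 400), ("nightly_analytics", 3, 300)]))
# -> {'encode_season': 400, 'nightly_analytics': 100}
```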

------
ericfrederich
If they have this internal spot instance market they should schedule crypto
mining at a super low priority.

