Hacker News
The Cost Savings Of Netflix's Internal Spot Market (highscalability.com)
157 points by yarapavan 9 months ago | 84 comments

> Video encoding is 70% of Netflix’s computing needs, running on 300,000 CPUs in over 1000 different autoscaling groups.

This feels crazy to me. They'll have to encode all their content into 15-20 different formats for various devices/bitrates, but surely they cache the output for each of those so it'll only have to be done once for each show/movie. Compare that to serving content, which happens thousands or millions of times for each stream. This isn't Youtube, there's not much of a "long tail" of content that's only watched a few times but has to be encoded anyway.

> Stranger Things season 2 is shot in 8K and has nine episodes. The source files are terabytes and terabytes of data. Each season required 190,000 CPU hours to encode.

High resolution video is larger than you might expect.

Is there such a thing as a video, image, or audio format that is inefficient for viewing, but is radically more efficient for transcoding? Such a technique would be useful for a number of applications.

Sure, raw video is one example of that. It takes insane bandwidth and storage, but you can load it from SSD fast enough that that's not an issue. Really, codecs need to be closely related before encoding X helps make encoding Y.

To compare: my camera uses 25 megabytes of storage for each single raw frame. I have no idea how much raw sound takes up, but you would have to move around a lot of SSDs to use that.

It can get worse: 100+ megapixels at 60 fps: http://www.forzasilicon.com/2014/04/forza-silicon-introduces...

Though, I have no idea what network connections they would use. I mean even in black and white this would saturate a 100Gbps link.
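For a rough sense of scale, the raw data rate is just pixels per frame times frame rate times bits per pixel. Here is a back-of-envelope sketch. The bit depths are assumptions (I don't know the sensor's actual output format), and the function name is made up for illustration.

```python
def raw_video_gbps(pixels_per_frame: int, fps: int, bits_per_pixel: int) -> float:
    """Uncompressed video data rate in gigabits per second (1 Gb = 1e9 bits)."""
    return pixels_per_frame * fps * bits_per_pixel / 1e9

# 100 megapixels at 60 fps, monochrome, at two assumed bit depths.
for depth in (8, 16):
    print(depth, "bits/pixel:", raw_video_gbps(100_000_000, 60, depth), "Gbps")
```

Under these assumptions, 8-bit monochrome comes to about half of a 100Gbps link, and it only gets close to saturating the link at around 16 bits per pixel.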

Interesting idea. Could there be a way of making a format that requires some sort of extra work to access? So sort of like a bitcoin setup where the local device is a miner (but actually doing useful work) and it gets access to the video as part of that computation. Effectively the content creator would trade computation for content.

I'm sure it's totally impossible in theory and in practice.

In many ways - older video codecs. Newer video codecs gain more compression by doing more and more complicated encoding and decoding. Trading CPU for Size.

I'm assuming that's because most of the content serving happens in Open Connect [0]. And while there may be a ton of server work involved in billing, API, auth, analytics, etc., it probably pales in comparison to the raw transcoding once you exclude actually serving the encoded files to clients. From the article, "9,570 different video, audio, and text files" are generated for one show, from 8K resolution source files.

Seems reasonable to me.

[0] https://openconnect.netflix.com/en/

The article is about improving resource utilization, usually very low for cloud users (between 6% and 12% according to McKinsey). It's too bad the article fails to mention resource utilization numbers at Netflix.

Cloud facilities like Google and Twitter have been co-locating workloads using containers for a while now (using Borg or Mesos) but they're still below 40%.



It's probably because these cloud companies consider this type of data to be trade secrets. I'm actually surprised they talk so freely about all this. Maybe it's because they're years ahead of the competition, and do not seem to stop innovating.

I recall seeing a technical presentation by one of these larger organizations (probably Netflix), where they explain that they have to do these tech talks in order to spread the knowledge enough so that they can find people to hire to work on their stuff. Staffing was explicitly the reason they were speaking so freely about details other shops typically consider proprietary.

What sorts of things does one need to learn to get a software engineering job at a place like Netflix? Where would you even go to learn that type of thing (other than Netflix)?

As someone who spends most of their day writing C# web apps for clients, something like a job at Netflix seems like an impossibility at this point.

Netflix is, mostly, just a Java shop. They follow good practices in a lot of ways (circuit-breaker patterns, abstracted access to data stores, etc.) but it's still a Java shop.

Their infrastructural/cloud ops teams are broader in the proper nouns they wrangle, but that's a different kettle of fish and is totally learnable outside of Netflix: understand AWS (when to build versus buy), understand how systems talk to one another (networking, HTTP for online queries versus queues for offline jobs), and how systems scale and repair after failure (horizontal versus vertical scaling, quorums for decisionmaking, when to use CP versus AP datastores, that kind of thing). And Netflix open-sources enough of their internal tooling that, once you have a baseline in the infrastructural field, you can probably do a good job of figuring out what the missing bits are.

I'm sure, having not worked there but worked on fairly large systems, that Netflix has its own wacky problems, but they're the things that baseline understanding of this stuff enables you to learn. It's all just digital plumbing.

We have many different teams that work in different domains and use different technologies. You can take a look at https://jobs.netflix.com to see if there's anything you would love doing.

My team, for example, builds the user interface for TV devices. It is a single page app that we craft using JavaScript and React. The same code runs on hundreds of different devices, including game consoles, smart TVs, cable boxes, and other HDMI-connected devices.

The TV UI team looks for seasoned software engineers who build UIs and we come from a variety of backgrounds: web, desktop, mobile, embedded devices.

So their plan for getting people who better understand their (likely best) practices for admin is to share their knowledge of how to administer a system like theirs?

That's actually pretty interesting, and loads of respect to them for it.

Publish date is 2008 on that McKinsey doc, 2012 on the Google one. Decent chance it's a different world now.

The article was published on Dec 4th, 2017.

If you weren't running in the cloud, you would just use baremetal machines with prioritized job queues. You can keep a low-priority job queue doing video encoding 24/7, and keep any other jobs or applications at a higher-priority. The result is continuous resource utilization at near-peak with little contention.

This reduces complexity and overhead significantly. With a cloud model, each level of abstraction requires new methods for networking, orchestration, configuration, execution, debugging, etc. On a single baremetal machine, all of the resources are essentially the same things running on one layer. But you can still get fancy with cgroups, namespaces and quotas if you need enforced isolation or resource limits.
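A minimal sketch of the prioritized job queue idea, assuming a single in-process queue where lower numbers mean higher priority. The class and job names are hypothetical, and a real setup would also need workers, preemption, and persistence.

```python
import heapq
import itertools

class PriorityJobQueue:
    """Jobs with a lower priority number run first; ties run in FIFO order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def submit(self, priority: int, name: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), name))

    def next_job(self):
        """Return the name of the next job to run, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

queue = PriorityJobQueue()
queue.submit(10, "encode-episode-1")   # low priority background work
queue.submit(10, "encode-episode-2")
queue.submit(0, "serve-api-request")   # high priority, jumps the queue
order = [queue.next_job() for _ in range(3)]
print(order)
```

The encoding jobs only run once nothing more urgent is waiting, which is what keeps the machine busy around the clock.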

Plus the capex costs of constantly having to buy new hardware and stick it in a datacenter for when your workload went up.

Plus the costs of maintaining that hardware both from an opex and capex point of view.

Plus the capex and opex costs of creating and maintaining the high speed network required to ship those files back and forth within the datacenter space that you're renting.

Plus the massive hit in how nimble your team can be. (Want to support a new codec? Either buy and install new hardware, or set the priority on these re-encodes lower and tell your boss to wait six months)

When you have an extremely variable workload such as Netflix et al, even though AWS/GCP/Azure are more expensive on a unit cost, your savings from operational expenditure will more than make up for that difference. Not having to buy and maintain a bunch of "just using baremetal machines" is a massive, massive cost savings.

Lambda / Google Cloud Functions / Azure Functions are a natural extension of this-- your write/test/deploy cycle can be much, much faster and you don't have to even worry about maintaining any infrastructure. The brainpower savings of this more than make up for the extra cost for CPU cycles.

Well first of all, you can always pay someone else for baremetal server hosting and support, just like with Amazon. So you don't need to buy and maintain, and you can still save costs - you're renting, just like with Amazon.

Second, even if Netflix's "workload" is variable, they're still using reserved instances. There's no practical difference from those to baremetal machines.

Finally, I am highly skeptical of anyone telling me that five more layers of abstraction make it easier or quicker to write, test and deploy code. Code is code. Servers are servers. Microservices are microservices. Different people are (or should be) maintaining all of these, and if they do their jobs right it should not incur a performance penalty on anyone else.

I will also state the obvious: if you don't hire the right people to manage the right pieces, you will waste huge amounts of time and money trying to figure out what other people have known for decades. This is universal, and has nothing to do with what technology you use.

> 300,000 CPUs

You better have a bloody big datacentre.

What happens if you need to encode a lot of videos during peak streaming?

The beauty of the cloud is that when that happens, they can burst up and use on-demand instances to increase the overall pool. With bare metal you either have to keep a bunch of spares or not be able to handle that situation.

They would turn down the amount of encoding they do during peak streaming. They use their existing servers instead of paying for more servers to encode their Netflix Originals, but encoding isn't time sensitive so can just run on a "spot instance" when their reserved instances are under-utilised.

They'll need 4K, 1080p, 720p etc encoded versions ready for the release of the series, but they might have copies of some of the episodes ready for encoding a few months before the release, so can slowly encode them on their spare resources.

I worked there so I'm well aware of how it works. My point was sometimes they need to do a massive encode during peak, and having the cloud allows them to do that when necessary.

I agree 100%. The cloud is basically the only way to do massive parallel processing in a very short time window without the up front costs. But if this event is uncommon, you can do your massive encode on the cloud, and shut off the cloud resources after peak and go back to bare metal.

I think bare metal is simpler, and that simpler often means better, but there are limits to bare metal. Those limits are where the cloud can come in. You can (and many people do) use both.

I have tried reading this twice but can't find how they are leveraging the internal spot markets. Are they re-selling unused capacity to a third party?

> I have tried reading this twice but can't find how they are leveraging the internal spot markets. Are they re-selling unused capacity to a third party?

They have reserved a lot of instances in EC2.


You can get a steep discount for reserving instances. A reserved instance is typically reserved for a number of years.

When these aren't in use, then they can put them into their video encoding pool and they are essentially free.

I guess they reserve lots of instances to make sure they have the capacity at peak times and to access some serious discounts.

Yes, but bursting -/-> spot market.

The one universal truth in corporate America is that if your boss’s boss spent a million dollars on something, they’re going to make you ‘use’ it if it’s the last thing they do.

Because they worry it will be if they fail. See also multi year contracts for software that’s terrible.

Netflix needs n instances to operate at peak capacity. The article is about them utilizing those pre-paid, necessary compute resources when the actual streaming they do requires a small fraction of n. I'm not sure how that has any correlation to multi year software contracts.

Reserved instances are a 1 or 3 year commitment for the discount, compared to more expensive on-demand instances with no minimum contract.

It's not real clear in the article at all, but it appears that they are simply re-allocating unused computing power (on non-video encoding reserved instances) to video encoding tasks on demand. I think that the author is just calling it an "internal spot market" for the sake of reusing terminology familiar to their audience (or to try to attract clicks).

Or, y'know, they implemented an internal spot market, where encoding jobs can say "I'm willing to pay $X to encode right now". Other jobs bid $Y, and the scheduler can prioritize appropriately.

If true, that would be pretty interesting - priority is a common feature to implement on top of a naive scheduler, but 'nice' values are harder to connect to dollar amounts.

Unfortunately, the article doesn't go into any detail if that is, indeed, what they implemented.
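To make the bidding idea concrete, here is a toy sketch of what "highest bid goes first" could look like. This is purely hypothetical and not what the article describes; the job names, prices, and the `allocate` function are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Bid:
    job: str
    dollars_per_hour: float
    instances_wanted: int

def allocate(free_instances: int, bids: list[Bid]) -> dict[str, int]:
    """Grant instances to the highest bidders first until none are left."""
    grants = {}
    for bid in sorted(bids, key=lambda b: b.dollars_per_hour, reverse=True):
        granted = min(bid.instances_wanted, free_instances)
        if granted > 0:
            grants[bid.job] = granted
            free_instances -= granted
    return grants

bids = [
    Bid("re-encode-back-catalog", 0.02, 80),
    Bid("new-release-encode", 0.10, 50),
    Bid("thumbnail-generation", 0.05, 30),
]
grants = allocate(100, bids)
print(grants)
```

The lowest bidder still gets whatever is left over, which is the behaviour you'd want from a spot-style pool: cheap work soaks up the remainder.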

If you'd like to learn about an open-source scheduler that does mimic some of the properties of an internal spot market, check out Cook: https://github.com/twosigma/Cook

It's unlikely that Netflix implemented anything this complex for video encoding, since Cook is only valuable when you're scheduling jobs from parties with conflicting incentives, rather than a singular purpose like "encode all Netflix video".

Looking closer at the mission to "encode all Netflix video", a less idealized version involves people at Netflix and outside it, who have signed contracts with deadlines and monetary incentives and everyone wants to go first, but how many are willing to put their money where their mouth is, so to speak.

It's expensive to operate thousands of instances in AWS, so I presume there's also a relentless push internally to drive down costs.

Thanks for the link to Cook, that's fascinating. The fact that it plugs into Mesos seems quite useful!

Holy shit, Two Sigma, the hedge fund, developed that (Cook).

>Cook is a powerful batch scheduler, specifically designed to provide a great user experience when there are more jobs to run than your cluster has capacity for.

I designed the audio transcoding infrastructure and scaling plan for an audio-heavy startup a number of years ago, but unfortunately the mobile app was delayed and they ran out of money before public launch.

Using other servers as lower priority job servers is a great way to increase utilization during off-peak times and save on server costs. You have to think about what happens if either a batch job or the server's main task runs away with all the system resources, so it can get tricky. A spot market might actually be an interesting solution to some of those problems, though for most of us, just using disk quotas and nice levels will be enough.
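For the nice-level part, a minimal sketch, assuming a Unix host and Python's standard `os.nice` and `subprocess.run`. The `run_low_priority` helper is a made-up name, and this only covers CPU priority, not disk quotas or memory limits.

```python
import os
import subprocess
import sys

def run_low_priority(cmd: list[str], niceness: int = 19) -> int:
    """Run cmd as a child process with its niceness raised (Unix only)."""
    def lower_priority():
        os.nice(niceness)  # runs in the child just before exec
    return subprocess.run(cmd, preexec_fn=lower_priority).returncode

# Stand-in for a batch job: a tiny Python process that reports its own niceness.
exit_code = run_low_priority([sys.executable, "-c", "import os; print(os.nice(0))"])
```

The server's main task keeps its normal priority, so the kernel scheduler gives it the CPU whenever both want to run.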

> Video encoding is 70% of Netflix’s computing needs, running on 300,000 CPUs in over 1000 different autoscaling groups.


> But if you look, encoding has a bursty workload pattern. That means the farm would be underutilized a lot of the time.

> ...they built their own internal spot market to process the chunked encoding jobs. The Engineering Tools team built an API exposing real time unused reservations at the minute level.

They're running multiple encoding tasks at once (an understatement) and when one of these is using less CPU, they shift another encoding task to that same machine.

Agreed the article is unclear. My understanding is they’re using unused capacity on servers that are already performing other tasks. So they’re not really creating a spot market for unutilized servers, so much as for unutilized capacity on the servers they’re already using.

For example if one of their servers is using 50% CPU, they can fill the remaining 50% with video encoding tasks.

But obviously I don’t know any better than you. That’s just my takeaway from the vaguely worded article.

The article says they have a lot of reserved instances which at some points are not fully used.

So if you have 50 CPU hours of reserved capacity per hour and during parts of the day you're using only 10h, they can allocate the remaining 40h, which they reserved and paid for upfront, to encoding.
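The same arithmetic as a small sketch, using the 50 CPU-hour reservation from the example. The hourly usage trace and the function name are made up for illustration.

```python
def idle_cpu_hours(reserved_per_hour: int, used_by_hour: list[int]) -> list[int]:
    """Reserved-but-unused CPU hours available for encoding in each hour."""
    return [max(reserved_per_hour - used, 0) for used in used_by_hour]

# Hypothetical usage over four hours against a 50 CPU-hour reservation.
usage = [10, 25, 50, 45]
idle = idle_cpu_hours(50, usage)
print(idle)       # per-hour capacity that can go to encoding
print(sum(idle))  # total over the window
```

Since the reservation is paid for either way, every idle hour handed to encoding is compute that would otherwise have been bought on demand.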

They probably rent servers / instances for a fixed price per year and use those to do video conversion when they're not busy, instead of renting additional or on-demand instances when there's encoding to be done. So just reusing existing systems instead of spinning up new ones.

So they built...a scheduler?

Well, a multi-machine scheduler.

In the HPC world, there are many of these: SLURM, Torque, PBS.

Don't forget Globus and Condor!

How could you overlook (son of) Grid Engine

Well yes, but the elegance here is how they layered their scheduler to be on top of scheduling the resources they paid for at AWS with a meta-scheduler, which when they originally did this was rather novel in the consumer market.

No, the underlying instances were reserved. It reduces to the same problem as scheduling on bare metal.

So, are you saying this is simplistic?

What is your solution?

It's not simplistic. It's simple. And a fine solution.

Ah yes, good correction of the statement.

I think we both agree, just that I was more clumsy with my language.


Scheduling implies this is simply done on a clock, while this is more about real-time maximization of CPU resources with feedback based on measured load.

> Scheduling implies this is simply done on a clock

That's a very outdated definition of scheduling in regards to cloud computing. The schedulers of Mesos, Omega, Borg, etc. all do allow for CPU-load based scheduling, and have schedulers for other scenarios.

I can't help but think that a hybrid cloud/on premise solution would be cheaper in the long run for them. The millions of dollars that they pour into AWS, it seems like they could colocate/contract out some heavy on-premise hardware for standard load and then offload peak load to spot instances.

It's hard to say without seeing the numbers as they're almost certainly getting a very steep discount from AWS (in return for extolling the virtues of AWS, being pushed in AWS promo material etc.). In addition I would assume there's some sort of exclusivity clause in there.

Something is wrong with the math:

* Video encoding is 70% of Netflix's computing needs

* Stated cost savings on encoding is 92%

This equals a 64% savings on "Netflix's computing needs", which must be a good chunk of their budget.
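Spelling out where the 64% comes from, assuming the 70% share applies to cost and not just to CPU count (the article doesn't say which):

```python
encoding_share = 0.70        # encoding as a fraction of computing needs (from the article)
savings_on_encoding = 0.92   # stated savings on encoding (from the article)

# Overall savings is the share of the budget affected times the savings on that share.
overall_savings = encoding_share * savings_on_encoding
print(f"{overall_savings:.1%}")
```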

What's wrong with the math? They saved a significant amount of money doing this, which is why it was worth doing.

I would think it would be a significant shareholder event to save 64% of their total computing budget. Their 2016 10K says their 3rd party cloud computing costs increased by $23M in 2015 and increased a further $27M in 2016. More importantly, their total technology and development cost increased from $650M in 2015 to $850M in 2016. That includes engineers so it's a bit tricky to figure out what their costs really are, but they didn't go down appreciably in 2016 so this effect had to be in 2017. Looking at their first three 10Q's shows their Technology and Development budget keeps going up. I don't see how a streaming service could save 64% of its total computing needs, and it not show up on any of its SEC filings.

In 2016 they completed a seven year move of all their computing infrastructure to AWS. When they sized the system, they bought reserved instances (probably at a significant discount). From the math in this article, they bought 64% more than they needed if they just discovered they could save 64% of their EC2 budget. That seems like a big math error.

I was a former insider so I can't get into details, but if you look you'll see that almost the entirety of their cost is content acquisition. The next biggest cost is salary.

Servers barely even register as far as costs go, and also, it's a 92% savings over what the cost would have been without the system, not an absolute 92%. The farm keeps growing all the time as movies become higher resolution and more encodings are supported.

Technology and Development (T&D) is directly related to computing, not content acquisition or sales. Computing costs are significant in this category, since they point them out explicitly in the notes. 64% savings would show up, since we know that 3rd party cloud costs increased by $50M over the last two years. For the number of instances they must be using, the costs must be running well over $100M.

The number you don't know is how much the encoding workload grew in the time it took them to develop the system.

Let's use your numbers. Say that two years ago computing costs were $50M, and encoding was $46M of that. Now say that their costs are currently $100M, but the encoding workload grew 6X. Under the old system, that would have cost $276M, but under the new system it is only $22M. That would be a 92% savings, and would totally be in line given that in the last few years they have drastically increased their machine learning output, which would have overtaken encoding work.
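The same worked example as a sketch, so the counterfactual is explicit. The figures are the made-up ones from this comment, in millions of dollars, and the function name is invented for illustration.

```python
def counterfactual_savings(old_encoding_cost: float, growth: float, savings_rate: float):
    """Return (cost without the new system, cost with it), in the same units as the input."""
    without_system = old_encoding_cost * growth
    with_system = without_system * (1 - savings_rate)
    return without_system, with_system

# Made-up figures from the comment above, in millions of dollars.
without_system, with_system = counterfactual_savings(46, 6, 0.92)
print(round(without_system), round(with_system))
```

The 92% is measured against the $276M that never got spent, not against last year's bill, which is why the reported totals can keep rising.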

1. Keep in mind "potential savings" is different from "actual savings".

If the EC2 hosts have already been reserved then Netflix "cloud costs" would be relatively stable. What you might see is a reduced rate of increase y/y.

2. Keep in mind Amdahl's Law (or rather a slight variant). The absolute reduction has to be weighted against the percentage of resource usage that has been reduced.

(All numbers are made up) If previously Netflix was paying 3MM for encoding, and now they are paying 0MM; Compared to an annual 30MM on streaming with a ~25% y/y growth, you wouldn't notice the missing 3MM unless it was pointed out.
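Running those made-up numbers (in millions) year over year, as a quick sketch:

```python
def total_cost(streaming: float, encoding: float) -> float:
    """Total yearly cost as the sum of the two line items."""
    return streaming + encoding

# All numbers are the made-up ones from the comment above, in millions.
last_year = total_cost(streaming=30, encoding=3)
this_year = total_cost(streaming=30 * 1.25, encoding=0)
print(last_year, this_year)
```

Even with encoding dropping to zero, the total still goes up, because the 25% growth on the larger line item outweighs the 3MM that disappeared.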

Their capacity keeps growing so the savings would be over the same amount of computing, but overall the total amount of computing they do keeps going up as shows move from being shot in 4K to 8K, and they acquire more and more shows to broadcast.

It’s 64% they didn’t need most of the time, but would need during peak usage. The “spot market” uses this buffer capacity only when it isn’t serving customers, preventing slowdowns and outages.

In addition to what everyone else has said, I think the wording is quite important. My reading is that encoding takes up 70% of their CPU hours. A 92% saving on that is obviously very good, but there's a lot of other ways you can spend money on aws.

> There’s lots of encoding work to do. Netflix supports 2200 devices and lots of different codecs, so video is needed in many different file formats.

Why? Don't they write/control their own client apps and video players? Why isn't a single codec with a choice of 3 or 4 different resolutions sufficient?

On most platforms that aren't powerful games consoles or personal computers, the video decoding is done by proprietary hardware; these hardware decoders usually only support a few families of codecs and are difficult or impossible to reprogram.

Even if most things support, say, H.264 out of the box, each video decoding SoC has different quirks which means either A) the same encoding looks different on different hardware or B) the encoding fails to decode on different hardware, so even the same codec can have a bunch of parameters which can/need to be adjusted for different devices.

One smart TV might have a Broadcom SoC with a GPU which supports h.264 decoding like the Raspberry Pis do, an Android tablet might have a Mali GPU which can decode VP9 and HEVC, some weird older set top box which has netflix built in might only support MPEG...

Three things stand out to me. One and two, some devices have hardware support for certain formats, and Netflix optimized for playtime (battery) and bandwidth so you have to handle all those.

Scaling a video on the client looks like crap and is slow. So you have to deal with all the fiddly screen resolutions, of which the distinct count used to be truly awful but now is much better. However, you have to add tablets in, which cancels out some of the gains in sanity.

On the bandwidth thing, I wouldn’t put it past Netflix to reencode popular movies with more time-intensive compression settings to achieve better aggregate bandwidth usage but not hold up new releases.

Because software video decoding of a codec that doesn't have native hardware decoding is slow (software 4K decoding taxes even many modern machines) and most importantly is incredibly resource intensive and therefore battery draining.

You've no realistic option other than to support whatever formats the hardware has native decoding support for, and you'd prefer H.265 over H.264 (bandwidth costs money) if it's available, and there are many different HDR formats.
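A rough sketch of that selection logic: pick the most bandwidth-efficient codec the device can decode in hardware. The preference order, the device table, and the codec identifiers are all assumptions for illustration, not anything Netflix has published.

```python
# Assumed preference order, most bandwidth-efficient first. Illustrative only.
PREFERENCE = ["hevc", "vp9", "h264", "mpeg2"]

def pick_codec(hardware_decoders: set[str]):
    """Return the most preferred codec the device can decode in hardware, else None."""
    for codec in PREFERENCE:
        if codec in hardware_decoders:
            return codec
    return None

# Hypothetical devices and their hardware decoders.
devices = {
    "smart-tv": {"h264"},
    "android-tablet": {"h264", "vp9", "hevc"},
    "old-set-top-box": {"mpeg2"},
}
choices = {name: pick_codec(decoders) for name, decoders in devices.items()}
print(choices)
```

Every codec that shows up as somebody's best option has to exist as an encoded file, which is how the format count balloons.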

That may depend on what's supported on the devices. I worked on smart TVs years ago and there was a large variability in what things could play.

About resolutions, different resolutions can serve different bandwidths. Not everyone can stream crystal-clear HD at all times.

Their clients are used in different smart TVs, and each brand/model can have different codec chips.

Are there any posts that analyse how the economics of offering $10/month streaming plans work out? Since they are hosted on AWS, data transfer is expensive, and coupled with the licensing of the content they serve, how do they manage to offer streaming at such a low price point?

Netflix doesn’t stream from AWS. Rather, they’ve built their own video CDN[1] with PoPs all over the globe. All of their user-facing streaming happens from their CDN. Their AWS infrastructure is used for billing, transcoding, playlists, etc. Also, they perform some seeding of the CDN from within AWS, but the seed boxes themselves can seed each other as well.

All in all, they clearly do push huge amounts of data out of AWS, but that data accounts for a very small percentage of the total data that gets pushed out to clients via their CDN.

[1] https://openconnect.netflix.com/en/

I talked with the CenturyLink tech who hooked up my fiber and he mentioned they had just wired up some 40 gigabit appliances. This was in Utah. They are all over.

Netflix likely doesn't transfer most of their data through AWS, but instead Netflix Open Connect: https://openconnect.netflix.com/

This was super interesting. Thanks for sharing!

> Open Connect is the name of the global network that is responsible for delivering Netflix TV shows and movies to our members worldwide. This type of network is typically referred to as a “Content Delivery Network” or “CDN” because its job is to deliver internet-based content efficiently by bringing the content that people watch close to where they’re watching it. The Open Connect network shares some characteristics with other CDNs, but also has some important differences.

Netflix began the Open Connect initiative in 2011, as a response to the ever-increasing scale of Netflix streaming. We started the program for two reasons:

1) As Netflix grew to be a significant portion of overall traffic on consumer Internet Service Provider (ISP) networks, it became important to be able to work with those ISPs in a direct and collaborative way.

2) Creating a content delivery solution customized for Netflix allowed us to design a proactive, directed caching solution that is much more efficient than the standard demand driven CDN solution, reducing the overall demand on upstream network capacity by several orders of magnitude.


One doesn't host bulk media in AWS. Netflix runs everything except that on AWS. I've contracted for two streamers that also used AWS for most things, both of them used CDNs rather than AWS for serving the big film files.

Every time I read articles about Netflix running on EC2 I really really really want to know who's winning more. Is Netflix really saving money by using EC2 vs ownership? Or is Amazon raking cash in from Netflix and Netflix is basically subsidising the build out costs for EC2?

It could be both. Just because they're paying Amazon an arm and a leg doesn't mean they have the ability or the time to do a better job for cheaper.

Can someone explain what a spot market is in this context?

Spot, in the AWS market, as you likely know, is the ability to bid for a regional data center's underutilized compute capacity... compute that is not otherwise reserved or dedicated to a particular AWS client account. The risk is that if I outbid you, then I can slurp your instances out from under you if they are needed from the inventory to complete my bid request.

In context of Netflix, they purchased tons of reserved instances in AWS - these reserved instances will never enter into AWS spot pool, so are dedicated compute power to Netflix.

However, Netflix may not use that reserved instance to its full capacity and so it has downtime that Netflix was paying for.

To make efficient use of the tons of money paid to reserve those AWS instances exclusively for Netflix's needs, and seeing that they were only needed to stream to clients, say, 30% of the day, they built an internal compute bidding market at Netflix, so that other jobs could say “hey Netflix, I need to run this encoding job and I need a bunch of boxes; out of all our reserved instances, gimme some that are not currently utilized.”

So they layered their own internal compute request service to maximize the util of their tons of reserved instances, instead of going to additional AWS instances to complete these other jobs...
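Here is a toy model of that layering, assuming the primary workload always has first claim on the reserved instances and batch jobs get whatever is idle. The `ReservedPool` class and its methods are invented for illustration and are not Netflix's actual service.

```python
class ReservedPool:
    """Tracks reserved instances lent to batch jobs when streaming doesn't need them."""

    def __init__(self, reserved: int):
        self.reserved = reserved
        self.streaming_in_use = 0
        self.lent_to_batch = 0

    def idle(self) -> int:
        return self.reserved - self.streaming_in_use - self.lent_to_batch

    def borrow(self, wanted: int) -> int:
        """Lend up to `wanted` idle instances to a batch job; return how many were lent."""
        granted = min(wanted, self.idle())
        self.lent_to_batch += granted
        return granted

    def set_streaming_demand(self, needed: int) -> int:
        """Streaming has first claim; return how many batch instances were reclaimed."""
        needed = min(needed, self.reserved)
        overflow = max(needed + self.lent_to_batch - self.reserved, 0)
        self.lent_to_batch -= overflow
        self.streaming_in_use = needed
        return overflow

pool = ReservedPool(reserved=100)
pool.set_streaming_demand(30)              # off-peak: streaming needs 30
granted = pool.borrow(80)                  # encoding asks for 80, gets the 70 idle
reclaimed = pool.set_streaming_demand(90)  # peak: streaming takes back 60
print(granted, reclaimed, pool.lent_to_batch)
```

Reclaiming on demand is what makes it spot-like: batch jobs can lose their boxes at any time, so the work has to be chunked and restartable.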

If they have this internal spot instance market they should schedule crypto mining at a super low priority.
