nickcw · on Dec 5, 2017

> I have tried reading this twice but can't find how are they leveraging the internal spot markets? Are they re-selling unused capacity to a third party?

They have reserved a lot of instances in EC2.

https://aws.amazon.com/ec2/pricing/reserved-instances/

You can get a steep discount for reserving instances. A reserved instance is typically reserved for a number of years.

When these aren't in use, then they can put them into their video encoding pool and they are essentially free.

I guess they reserve lots of instances to make sure they have the capacity at peak times and to access some serious discounts.

mgummelt · on Dec 5, 2017

Yes, but bursting -/-> spot market.

hinkley · on Dec 5, 2017

The one universal truth in corporate America is that if your boss’s boss spent a million dollars on something, they’re going to make you ‘use’ it if it’s the last thing they do.

Because they worry it will be if they fail. See also multi year contracts for software that’s terrible.

pc86 · on Dec 5, 2017

Netflix needs n instances to operate at peak capacity. The article is about them utilizing those pre-paid, necessary compute resources when the actual streaming they do requires a small fraction of n. I'm not sure how that has any correlation to multi year software contracts.

fragmede · on Dec 5, 2017

Reserved instances are a 1 or 3 year commitment for the discount, compared to more expensive on-demand instances with no minimum contract.

jasongill · on Dec 5, 2017

It's not really clear in the article at all, but it appears that they are simply re-allocating unused computing power (on non-video-encoding reserved instances) to video encoding tasks on demand. I think that the author is just calling it an "internal spot market" for the sake of reusing terminology familiar to their audience (or to try to attract clicks).

fragmede · on Dec 5, 2017

Or, y'know, they implemented an internal spot market, where encoding jobs can say "I'm willing to pay $X to encode right now". Other jobs bid $Y, and the scheduler can prioritize appropriately.

If true, that would be pretty interesting - priority is a common feature to implement on top of a naive scheduler, but 'nice' values are harder to connect to dollar amounts.

Unfortunately, the article doesn't go into any detail if that is, indeed, what they implemented.
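To make the idea concrete, here is a minimal sketch of what bid-based prioritization could look like. It is purely illustrative: the `BidScheduler` class, the job names, and the dollar figures are all made up for this example, and nothing here reflects what Netflix actually built.

```python
import heapq
from itertools import count


class BidScheduler:
    """Toy scheduler (hypothetical): the highest bid runs first."""

    def __init__(self, free_cpus):
        self.free_cpus = free_cpus
        self._heap = []
        self._order = count()  # tie-breaker: earlier submissions win equal bids

    def submit(self, job_id, bid, cpus):
        # heapq is a min-heap, so negate the bid to pop the highest bid first.
        heapq.heappush(self._heap, (-bid, next(self._order), job_id, cpus))

    def schedule(self):
        """Start every job that fits in the free capacity, highest bid first."""
        started, skipped = [], []
        while self._heap:
            entry = heapq.heappop(self._heap)
            _, _, job_id, cpus = entry
            if cpus <= self.free_cpus:
                self.free_cpus -= cpus
                started.append(job_id)
            else:
                skipped.append(entry)  # stays queued for the next round
        for entry in skipped:
            heapq.heappush(self._heap, entry)
        return started


scheduler = BidScheduler(free_cpus=8)
scheduler.submit("trailer", bid=0.10, cpus=4)       # made-up bids in dollars
scheduler.submit("new-release", bid=0.50, cpus=6)
scheduler.submit("back-catalog", bid=0.02, cpus=2)
print(scheduler.schedule())  # ['new-release', 'back-catalog']
```

The "trailer" job bids more than "back-catalog" but does not fit in the 2 CPUs left after "new-release", so it waits. A real market would also need some way to set a clearing price and charge teams, which this sketch ignores.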

mgummelt · on Dec 5, 2017

If you'd like to learn about an open-source scheduler that does mimic some of the properties of an internal spot market, check out Cook: https://github.com/twosigma/Cook

It's unlikely that Netflix implemented anything this complex for video encoding, since Cook is only valuable when you're scheduling jobs from parties with conflicting incentives, rather than a singular purpose like "encode all Netflix video".

fragmede · on Dec 6, 2017

Looking closer at the mission to "encode all Netflix video", a less idealized version involves people inside and outside Netflix who have signed contracts with deadlines and monetary incentives. Everyone wants to go first, but how many are willing to put their money where their mouth is, so to speak?

It's expensive to operate thousands of instances in AWS, so I presume there's also a relentless push internally to drive down costs.

Thanks for the link to Cook, that's fascinating. The fact that it plugs into Mesos seems quite useful!

bob_theslob646 · on Dec 6, 2017

Holy shit, Two Sigma, the hedge fund, developed that (Cook).

>Cook is a powerful batch scheduler, specifically designed to provide a great user experience when there are more jobs to run than your cluster has capacity for.

nitrogen · on Dec 6, 2017

I designed the audio transcoding infrastructure and scaling plan for an audio-heavy startup a number of years ago, but unfortunately the mobile app was delayed and they ran out of money before public launch.

Using other servers as lower priority job servers is a great way to increase utilization during off-peak times and save on server costs. You have to think about what happens if either a batch job or the server's main task runs away with all the system resources, so it can get tricky. A spot market might actually be an interesting solution to some of those problems, though for most of us, just using disk quotas and nice levels will be enough.
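As a rough sketch of the "nice levels" approach, a batch job can be launched through the POSIX `nice` utility so that it yields CPU to the server's main task. The helper names below are hypothetical, and this assumes a Unix-like host with `nice` on the PATH.

```python
import subprocess


def low_priority_command(cmd, niceness=19):
    """Prefix cmd with `nice -n` so the job runs at reduced CPU priority."""
    if not 0 <= niceness <= 19:
        raise ValueError("niceness must be between 0 and 19")
    return ["nice", "-n", str(niceness), *cmd]


def run_batch_job(cmd, niceness=19):
    """Run the batch job at low priority; raises CalledProcessError on failure."""
    return subprocess.run(low_priority_command(cmd, niceness), check=True)


print(low_priority_command(["echo", "transcode"]))
```

Note that niceness only influences CPU scheduling. It does nothing about a job that eats all the memory or disk, which is where quotas or other limits come in.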

LeifCarrotson · on Dec 5, 2017

> Video encoding is 70% of Netflix’s computing needs, running on 300,000 CPUs in over 1000 different autoscaling groups.

...

> But if you look, encoding has a bursty workload pattern. That means the farm would be underutilized a lot of the time.

> ...they built their own internal spot market to process the chunked encoding jobs. The Engineering Tools team built an API exposing real time unused reservations at the minute level.

They're running multiple encoding tasks at once (an understatement) and when one of these is using less CPU, they shift another encoding task to that same machine.

chatmasta · on Dec 5, 2017

Agreed the article is unclear. My understanding is they’re using unused capacity on servers that are already performing other tasks. So they’re not really creating a spot market for unutilized servers, so much as for unutilized capacity on the servers they’re already using.

For example if one of their servers is using 50% CPU, they can fill the remaining 50% with video encoding tasks.

But obviously I don’t know any better than you. That’s just my takeaway from the vaguely worded article.

LusoTycoon · on Dec 5, 2017

The article says they have a lot of reserved instances which at some points are not fully used.

So if you have 50 CPU-hours of reserved capacity per hour and during parts of the day you're using only 10, they can allocate the remaining 40, which they already reserved and paid for upfront, to encoding.
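The arithmetic can be sketched in a few lines. The 50 and 10 come from the example above; the other usage samples and the function name are invented for illustration.

```python
def unused_capacity(reserved, used):
    """Reserved-but-idle CPU-hours for each sample; never negative."""
    return [max(reserved - u, 0) for u in used]


reserved_per_hour = 50                 # CPU-hours reserved each hour
used_per_hour = [10, 10, 35, 50, 60]   # hypothetical hourly usage samples

print(unused_capacity(reserved_per_hour, used_per_hour))  # [40, 40, 15, 0, 0]
```

When usage exceeds the reservation (the last sample), there is nothing left over for encoding, and the excess would presumably have to come from on-demand instances.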

Cthulhu_ · on Dec 5, 2017

They probably rent servers / instances for a fixed price per year and use those to do video conversion when they're not busy, instead of renting additional or on-demand instances when there's encoding to be done. So just reusing existing systems instead of spinning up new ones.