I have read this twice but can't find how they are leveraging the internal spot market. Are they re-selling unused capacity to a third party?
The one universal truth in corporate America is that if your boss’s boss spent a million dollars on something, they’re going to make you ‘use’ it if it’s the last thing they do.
Because they worry it will be if they fail. See also multi year contracts for software that’s terrible.
Netflix needs n instances to operate at peak capacity. The article is about them utilizing those pre-paid, necessary compute resources when the actual streaming they do requires a small fraction of n. I'm not sure how that has any correlation to multi year software contracts.
It's not really clear from the article, but it appears that they are simply re-allocating unused computing power (on reserved instances not otherwise used for video encoding) to video encoding tasks on demand. I think the author is just calling it an "internal spot market" for the sake of reusing terminology familiar to their audience (or to try to attract clicks).
Or, y'know, they implemented an internal spot market, where encoding jobs can say "I'm willing to pay $X to encode right now". Other jobs bid $Y, and the scheduler can prioritize appropriately.
If true, that would be pretty interesting - priority is a common feature to implement on top of a naive scheduler, but 'nice' values are harder to connect to dollar amounts.
Unfortunately, the article doesn't go into any detail about whether that is, indeed, what they implemented.
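To make the bidding idea concrete, here is a minimal sketch of a scheduler that places the highest bidders first. It is purely illustrative: the article does not describe Netflix's implementation, and `BidScheduler`, `submit`, and `allocate` are names invented for this example.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class _Entry:
    # Ordering uses (neg_bid, seq): highest bid first, ties broken by arrival.
    neg_bid: float
    seq: int
    job_id: str = field(compare=False)
    cpus: int = field(compare=False)


class BidScheduler:
    """Hypothetical bid-ordered queue; not taken from the article."""

    def __init__(self):
        self._heap = []
        self._seq = count()

    def submit(self, job_id, bid_dollars, cpus):
        # Negate the bid because heapq is a min-heap.
        entry = _Entry(-bid_dollars, next(self._seq), job_id, cpus)
        heapq.heappush(self._heap, entry)

    def allocate(self, free_cpus):
        """Place highest bidders that fit; jobs that do not fit stay queued."""
        placed, skipped = [], []
        while self._heap and free_cpus > 0:
            entry = heapq.heappop(self._heap)
            if entry.cpus <= free_cpus:
                free_cpus -= entry.cpus
                placed.append(entry.job_id)
            else:
                skipped.append(entry)
        for entry in skipped:
            heapq.heappush(self._heap, entry)
        return placed
```

With jobs bidding $1, $5, and $3 for 4, 4, and 8 CPUs, a call to `allocate(8)` places the $5 job, skips the $3 job because it no longer fits, and backfills with the $1 job. A real market would also need pricing and preemption, which this sketch leaves out.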
If you'd like to learn about an open-source scheduler that does mimic some of the properties of an internal spot market, check out Cook: https://github.com/twosigma/Cook
It's unlikely that Netflix implemented anything this complex for video encoding, since Cook is only valuable when you're scheduling jobs from parties with conflicting incentives, rather than a singular purpose like "encode all Netflix video".
Looking closer at the mission to "encode all Netflix video", a less idealized version involves people inside and outside Netflix who have signed contracts with deadlines and monetary incentives. Everyone wants to go first, but how many are willing to put their money where their mouth is, so to speak?
It's expensive to operate thousands of instances in AWS, so I presume there's also a relentless push internally to drive down costs.
Thanks for the link to Cook, that's fascinating. The fact that it plugs into Mesos seems quite useful!
Holy shit, Two Sigma, the hedge fund, developed that (Cook).
>Cook is a powerful batch scheduler, specifically designed to provide a great user experience when there are more jobs to run than your cluster has capacity for.
I designed the audio transcoding infrastructure and scaling plan for an audio-heavy startup a number of years ago, but unfortunately the mobile app was delayed and they ran out of money before public launch.
Using other servers as lower priority job servers is a great way to increase utilization during off-peak times and save on server costs. You have to think about what happens if either a batch job or the server's main task runs away with all the system resources, so it can get tricky. A spot market might actually be an interesting solution to some of those problems, though for most of us, just using disk quotas and nice levels will be enough.
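For the simple "nice levels" approach, a rough sketch (Unix only) of launching a batch job at lower CPU priority from Python might look like the following. `run_niced` is a made-up helper name, and the increment of 10 is an arbitrary choice for illustration.

```python
import os
import subprocess
import sys


def run_niced(cmd, increment=10):
    """Run cmd with its niceness raised by `increment` (Unix only)."""
    return subprocess.run(
        cmd,
        # os.nice runs in the child just before exec, so only the
        # batch job is deprioritized, not the calling process.
        preexec_fn=lambda: os.nice(increment),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # The child reports its own niceness; os.nice(0) reads without changing it.
    result = run_niced([sys.executable, "-c", "import os; print(os.nice(0))"])
    print("child niceness:", result.stdout.strip())
```

This only affects CPU scheduling. It does nothing about memory, disk, or network, which is where the runaway-job problem gets tricky.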
> Video encoding is 70% of Netflix’s computing needs, running on 300,000 CPUs in over 1000 different autoscaling groups.
...
> But if you look, encoding has a bursty workload pattern. That means the farm would be underutilized a lot of the time.
> ...they built their own internal spot market to process the chunked encoding jobs. The Engineering Tools team built an API exposing real time unused reservations at the minute level.
They're running multiple encoding tasks at once (an understatement) and when one of these is using less CPU, they shift another encoding task to that same machine.
Agreed, the article is unclear. My understanding is they’re using unused capacity on servers that are already performing other tasks. So they’re not really creating a spot market for unutilized servers so much as for unutilized capacity on the servers they’re already using.
For example if one of their servers is using 50% CPU, they can fill the remaining 50% with video encoding tasks.
But obviously I don’t know any better than you. That’s just my takeaway from the vaguely worded article.
The article says they have a lot of reserved instances which at certain times are not fully used.
So if you have 50 CPU-hours of reserved capacity per hour and during parts of the day you're using only 10, the remaining 40, which are already reserved and paid for upfront, can be allocated to encoding.
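The arithmetic can be sketched in a few lines. The hourly usage figures below are made up for illustration (only the 50 reserved and 10 used come from the example above), and `unused_reserved` is a hypothetical helper.

```python
def unused_reserved(reserved, used_by_hour):
    """CPU-hours of reserved capacity left idle in each hour.

    Hours where usage exceeds the reservation contribute zero spare
    capacity (the overflow would have to come from somewhere else).
    """
    return [max(reserved - used, 0) for used in used_by_hour]


# Assumed usage profile for six consecutive hours, in CPU-hours.
usage = [10, 10, 25, 50, 60, 30]
spare = unused_reserved(50, usage)

print(spare)       # [40, 40, 25, 0, 0, 20]
print(sum(spare))  # 125 CPU-hours available for encoding
```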
They probably rent servers / instances for a fixed price per year and use those to do video conversion when they're not busy, instead of renting additional or on-demand instances when there's encoding to be done. So just reusing existing systems instead of spinning up new ones.