This feels crazy to me. They'll have to encode all their content into 15-20 different formats for various devices/bitrates, but surely they cache the output for each of those, so it only has to be done once per show/movie. Compare that to serving content, which happens thousands or millions of times for each stream. This isn't YouTube; there's not much of a "long tail" of content that's only watched a few times but has to be encoded anyway.
High resolution video is larger than you might expect.
Though, I have no idea what network connections they would use. I mean even in black and white this would saturate a 100Gbps link.
I'm sure it's totally impossible in theory and in practice.
Seems reasonable to me.
Cloud-scale operators like Google and Twitter have been co-locating workloads using containers for a while now (via Borg or Mesos), but they're still below 40% utilization.
As someone who spends most of their day writing C# web apps for clients, something like a job at Netflix seems like an impossibility at this point.
Their infrastructural/cloud ops teams are broader in the proper nouns they wrangle, but that's a different kettle of fish and is totally learnable outside of Netflix: understand AWS (when to build versus buy), understand how systems talk to one another (networking, HTTP for online queries versus queues for offline jobs), and how systems scale and repair after failure (horizontal versus vertical scaling, quorums for decision-making, when to use CP versus AP datastores, that kind of thing). And Netflix open-sources enough of their internal tooling that, once you have a baseline in the infrastructural field, you can probably do a good job of figuring out what the missing bits are.
I'm sure, having not worked there but worked on fairly large systems, that Netflix has its own wacky problems, but they're the things that baseline understanding of this stuff enables you to learn. It's all just digital plumbing.
The TV UI team looks for seasoned software engineers who build UIs and we come from a variety of backgrounds: web, desktop, mobile, embedded devices.
That's actually pretty interesting, and loads of respect to them for it.
This reduces complexity and overhead significantly. With a cloud model, each level of abstraction requires new methods for networking, orchestration, configuration, execution, debugging, etc. On a single baremetal machine, all of the resources are essentially the same things running on one layer. But you can still get fancy with cgroups, namespaces and quotas if you need enforced isolation or resource limits.
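To make the cgroups point concrete, here's a minimal sketch (my own illustration, not anything from the article) of the cgroup v2 control-file values you would write to cap a batch workload on a bare-metal host. The function just computes the file-to-value plan; on a real Linux box you'd write these into `/sys/fs/cgroup/<group>/`.

```python
# Sketch (illustrative, not Netflix's actual setup): cgroup v2 control-file
# values for enforcing CPU and memory limits on a co-located batch job.
# cpu.max takes "<quota> <period>" in microseconds; memory.max is in bytes.

def cgroup_limits(cpu_cores: float, mem_gib: float, period_us: int = 100_000) -> dict:
    """Return cgroup v2 file -> value pairs for the requested limits."""
    quota_us = int(cpu_cores * period_us)     # e.g. 2 cores -> 200000us per 100000us period
    mem_bytes = int(mem_gib * 1024 ** 3)
    return {
        "cpu.max": f"{quota_us} {period_us}", # hard CPU ceiling
        "memory.max": str(mem_bytes),         # hard memory ceiling
    }

# Cap a hypothetical encoding worker at 2 cores and 4 GiB:
limits = cgroup_limits(cpu_cores=2, mem_gib=4)
for filename, value in limits.items():
    print(filename, "=", value)
```

The point being: on one machine, "isolation" is just a couple of kernel knobs rather than a whole orchestration layer.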
Plus the costs of maintaining that hardware both from an opex and capex point of view.
Plus the capex and opex costs of creating and maintaining the high speed network required to ship those files back and forth within the datacenter space that you're renting.
Plus the massive hit in how nimble your team can be. (Want to support a new codec? Either buy and install new hardware, or set the priority on these re-encodes lower and tell your boss to wait six months)
When you have an extremely variable workload like Netflix's, even though AWS/GCP/Azure are more expensive on a unit-cost basis, the savings in operational expenditure will more than make up the difference. Not having to buy and maintain a fleet of bare-metal machines is a massive, massive cost savings.
Lambda / Google Cloud Functions / Azure Functions are a natural extension of this: your write/test/deploy cycle can be much, much faster and you don't even have to worry about maintaining any infrastructure. The brainpower savings of this more than make up for the extra cost of CPU cycles.
Second, even if Netflix's "workload" is variable, they're still using reserved instances. There's no practical difference between those and bare-metal machines.
Finally, I am highly skeptical of anyone telling me that five more layers of abstraction make it easier or quicker to write, test and deploy code. Code is code. Servers are servers. Microservices are microservices. Different people are (or should be) maintaining all of these, and if they do their jobs right it should not incur a performance penalty on anyone else.
I will also state the obvious: if you don't hire the right people to manage the right pieces, you will waste huge amounts of time and money trying to figure out what other people have known for decades. This is universal, and has nothing to do with what technology you use.
You better have a bloody big datacentre.
The beauty of the cloud is that when that happens, they can burst up and use on-demand instances to increase the overall pool. With bare metal you either have to keep a bunch of spares or not be able to handle that situation.
They'll need 4K, 1080p, 720p etc encoded versions ready for the release of the series, but they might have copies of some of the episodes ready for encoding a few months before the release, so can slowly encode them on their spare resources.
I think bare metal is simpler, and that simpler often means better, but there are limits to bare metal. Those limits are where the cloud can come in. You can (and many people do) use both.
They have reserved a lot of instances in EC2.
You can get a steep discount for reserving instances. A reserved instance is typically reserved for a number of years.
When these aren't in use, then they can put them into their video encoding pool and they are essentially free.
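A back-of-envelope sketch of that "essentially free" claim, with made-up prices: a reservation is paid for whether or not it runs work, so the marginal cost of filling idle reserved hours with encoding is zero, and only overflow beyond the idle pool would cost on-demand rates.

```python
# Hypothetical prices -- not real AWS rates.
ON_DEMAND_RATE = 0.40  # $/instance-hour if you spill past your reservation

def encoding_cost(idle_reserved_hours: float, needed_hours: float) -> float:
    """Encoding is free up to the idle reservation; on-demand beyond it."""
    overflow = max(0.0, needed_hours - idle_reserved_hours)
    return overflow * ON_DEMAND_RATE

print(encoding_cost(idle_reserved_hours=40, needed_hours=30))  # fits in idle capacity: $0
print(encoding_cost(idle_reserved_hours=40, needed_hours=55))  # 15 hours spill to on-demand
```

Since the reservation is a sunk cost either way, any encoding you can fit into the idle hours genuinely costs nothing extra.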
I guess they reserve lots of instances to make sure they have the capacity at peak times and to access some serious discounts.
Because they worry it will be if they fail. See also multi year contracts for software that’s terrible.
If true, that would be pretty interesting - priority is a common feature to implement on top of a naive scheduler, but 'nice' values are harder to connect to dollar amounts.
Unfortunately, the article doesn't go into any detail if that is, indeed, what they implemented.
It's unlikely that Netflix implemented anything this complex for video encoding, since Cook is only valuable when you're scheduling jobs from parties with conflicting incentives, rather than a singular purpose like "encode all Netflix video".
It's expensive to operate thousands of instances in AWS, so I presume there's also a relentless push internally to drive down costs.
Thanks for the link to Cook, that's fascinating. The fact that it plugs into Mesos seems quite useful!
>Cook is a powerful batch scheduler, specifically designed to provide a great user experience when there are more jobs to run than your cluster has capacity for.
Using other servers as lower priority job servers is a great way to increase utilization during off-peak times and save on server costs. You have to think about what happens if either a batch job or the server's main task runs away with all the system resources, so it can get tricky. A spot market might actually be an interesting solution to some of those problems, though for most of us, just using disk quotas and nice levels will be enough.
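The nice-level idea reduces to a priority queue: batch work only runs when nothing higher-priority is waiting. Here's a toy sketch of that (my own illustration, not Cook's actual scheduler), using the Unix convention that a lower number means higher priority.

```python
# Toy priority scheduler: serving work always preempts queued encode jobs.
import heapq

class Scheduler:
    def __init__(self):
        self._q = []    # (priority, submission order, job name)
        self._seq = 0

    def submit(self, name: str, priority: int) -> None:
        # seq breaks ties so equal-priority jobs run first-come-first-served
        heapq.heappush(self._q, (priority, self._seq, name))
        self._seq += 1

    def next_job(self):
        return heapq.heappop(self._q)[2] if self._q else None

s = Scheduler()
s.submit("encode-ep1", priority=10)   # batch job, like nice +10
s.submit("serve-stream", priority=0)  # main task: always wins
print(s.next_job())  # serve-stream
print(s.next_job())  # encode-ep1
```

The runaway-resource problem mentioned above is exactly what this doesn't solve: you still need cgroups/quotas underneath so a batch job can't starve the main task once it's running.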
> But if you look, encoding has a bursty workload pattern. That means the farm would be underutilized a lot of the time.
> ...they built their own internal spot market to process the chunked encoding jobs. The Engineering Tools team built an API exposing real time unused reservations at the minute level.
They're running multiple encoding tasks at once (an understatement) and when one of these is using less CPU, they shift another encoding task to that same machine.
For example if one of their servers is using 50% CPU, they can fill the remaining 50% with video encoding tasks.
But obviously I don’t know any better than you. That’s just my takeaway from the vaguely worded article.
So if you have 50 CPU-hours of reserved capacity per hour, and during parts of the day you're only using 10, they can allocate the remaining 40, which they reserved and paid for upfront, to encoding.
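That arithmetic, as code: reserved CPU-hours are paid for up front, so whatever streaming doesn't use in a given hour can go to the encoding pool at no extra cost.

```python
# Numbers from the example above: 50 reserved CPU-hours per wall-clock hour.
RESERVED_CPU_HOURS = 50

def spare_for_encoding(streaming_usage: float) -> float:
    """CPU-hours left over for encoding after streaming takes its share."""
    return max(0.0, RESERVED_CPU_HOURS - streaming_usage)

print(spare_for_encoding(10))  # off-peak: 40 CPU-hours free for encoding
print(spare_for_encoding(48))  # near peak: only 2 left over
```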
What is your solution?
I think we both agree, just that I was more clumsy with my language.
That's a very outdated definition of scheduling in regards to cloud computing. The schedulers of Mesos, Omega, Borg, etc. all do allow for CPU-load based scheduling, and have schedulers for other scenarios.
* Video encoding is 70% of Netflix's computing needs
* Stated cost savings on encoding is 92%
This equals a 64% savings on "Netflix's computing needs", which must be a good chunk of their budget.
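Checking that arithmetic: a 92% saving on a workload that is 70% of total compute is a 64% reduction overall.

```python
# Savings on the whole budget = (share of budget) x (savings on that share)
encoding_share = 0.70
encoding_savings = 0.92
total_savings = encoding_share * encoding_savings
print(round(total_savings, 3))  # 0.644, i.e. ~64%
```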
says their 3rd-party cloud computing costs increased by $23M in 2015 and a further $27M in 2016. More importantly, their total technology and development cost increased from $650M in 2015 to $850M in 2016. That includes engineers, so it's a bit tricky to figure out what their compute costs really are, but they didn't go down appreciably in 2016, so this effect had to be in 2017. Looking at their first three 10-Qs shows their Technology and Development budget keeps going up. I don't see how a streaming service could save 64% of its total computing needs without it showing up in any of its SEC filings.
In 2016 they completed a seven year move of all their computing infrastructure to AWS. When they sized the system, they bought reserved instances (probably at a significant discount). From the math in this article, they bought 64% more than they needed if they just discovered they could save 64% of their EC2 budget. That seems like a big math error.
Servers barely even register as far as costs go, and also, it's a 92% savings over what the cost would have been without the system, not an absolute 92%. The farm keeps growing all the time as movies become higher resolution and more encodings are supported.
Let's use your numbers. Say that two years ago computing costs were $50M, and encoding was $46M of that. Now say that their costs are currently $100M, but the encoding workload grew 6X. Under the old system, that would have cost $276M, but under the new system it is on $22M. That would be a 92% savings, and would totally be in line given that in the last few years they have drastically increased their machine learning output, which would have overtaken encoding work.
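Re-running those made-up numbers to show the 92% is relative, not absolute:

```python
# All figures in $M, taken from the hypothetical in the comment above.
old_encoding_cost = 46
growth = 6                                    # encoding workload grew 6x
would_have_cost = old_encoding_cost * growth  # 276 under the old system
actual_cost = 22                              # under the new system
savings = 1 - actual_cost / would_have_cost
print(round(savings, 2))  # 0.92 -> the claimed 92%
```

So total spend can rise from $50M to $100M while the encoding system still legitimately claims a 92% saving versus the counterfactual.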
If the EC2 hosts have already been reserved then Netflix "cloud costs" would be relatively stable. What you might see is a reduced rate of increase y/y.
2. Keep in mind Amdahl's Law (or rather a slight variant): the absolute reduction has to be weighted against the percentage of resource usage that has been reduced.
(All numbers are made up.) If previously Netflix was paying $3MM for encoding and now they are paying $0MM, then compared to an annual $30MM on streaming with ~25% y/y growth, you wouldn't notice the missing $3MM unless it was pointed out.
Why? Don't they write/control their own client apps and video players? Why isn't a single codec with a choice of 3 or 4 different resolutions sufficient?
Even if most things support, say, H.264 out of the box, each video decoding SoC has different quirks which means either A) the same encoding looks different on different hardware or B) the encoding fails to decode on different hardware, so even the same codec can have a bunch of parameters which can/need to be adjusted for different devices.
One smart TV might have a Broadcom SoC with a GPU which supports h.264 decoding like the Raspberry Pis do, an Android tablet might have a Mali GPU which can decode VP9 and HEVC, some weird older set top box which has netflix built in might only support MPEG...
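A sketch of what that device matrix implies (device names and codec lists here are illustrative, not Netflix's real matrix): each client gets the best encoding its SoC can actually decode, with newer codecs preferred because they save bandwidth.

```python
# Hypothetical decode capabilities per device family.
DEVICE_CODECS = {
    "smart-tv-broadcom": ["h264"],
    "android-mali":      ["vp9", "hevc", "h264"],
    "legacy-stb":        ["mpeg2"],
}

# Preference order: newer codecs first, since they cut bandwidth costs.
CODEC_PREFERENCE = ["hevc", "vp9", "h264", "mpeg2"]

def pick_codec(device: str) -> str:
    """Best codec this device supports, by our preference order."""
    supported = DEVICE_CODECS[device]
    return next(c for c in CODEC_PREFERENCE if c in supported)

print(pick_codec("android-mali"))       # hevc
print(pick_codec("smart-tv-broadcom"))  # h264
print(pick_codec("legacy-stb"))         # mpeg2
```

Multiply this by per-SoC quirks, resolutions, and bitrate ladders and you get to the 15-20+ encodings mentioned at the top of the thread.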
Scaling a video on the client looks like crap and is slow, so you have to deal with all the fiddly screen resolutions. The number of distinct resolutions used to be truly awful but is much better now; however, tablets add more back in, which cancels out some of the gains in sanity.
On the bandwidth thing, I wouldn't put it past Netflix to re-encode popular movies with more time-intensive compression settings to achieve better aggregate bandwidth usage, without holding up new releases.
You've no realistic option but to support whatever formats the hardware can decode natively, and you'd prefer H.265 over H.264 (bandwidth costs money) where it's available; on top of that, there are many different HDR formats.
All in all, they clearly do push huge amounts of data out of AWS, but that data account for a very small percentage of the total data that gets pushed out to clients via their CDN.
> Open Connect is the name of the global network that is responsible for delivering Netflix TV shows and movies to our members worldwide. This type of network is typically referred to as a "Content Delivery Network" or "CDN" because its job is to deliver Internet-based content efficiently by bringing the content that people watch close to where they're watching it. The Open Connect network shares some characteristics with other CDNs, but also has some important differences.
>
> Netflix began the Open Connect initiative in 2011, as a response to the ever-increasing scale of Netflix streaming. We started the program for two reasons:
>
> 1) As Netflix grew to be a significant portion of overall traffic on consumer Internet Service Provider (ISP) networks, it became important to be able to work with those ISPs in a direct and collaborative way.
>
> 2) Creating a content delivery solution customized for Netflix allowed us to design a proactive, directed caching solution that is much more efficient than the standard demand-driven CDN solution, reducing the overall demand on upstream network capacity by several orders of magnitude.
In the context of Netflix, they purchased tons of reserved instances in AWS. These reserved instances never enter the AWS spot pool, so they are dedicated compute capacity for Netflix.
However, Netflix may not use those reserved instances to full capacity, so there is idle time that Netflix is paying for anyway.
To make efficient use of the tons of money paid to reserve those AWS instances exclusively for Netflix, given that they were only needed for streaming, say, 30% of the day, they built an internal compute bidding market. Other jobs can say: "hey Netflix, I need to run this encoding job and I need a bunch of boxes; out of all our reserved instances, give me some that aren't currently utilized."
So they layered their own internal compute-request service on top to maximize the utilization of their tons of reserved instances, instead of paying for additional AWS instances to complete these other jobs...
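A toy version of that internal "spot market" (the real API isn't public, so this is purely illustrative): jobs request instances out of whatever reservations are idle this minute, and requests that don't fit simply wait for the next window.

```python
# Grant instance requests first-come-first-served out of the idle pool.
def allocate(idle_instances: int, requests: list) -> dict:
    """requests: list of (job_name, instances_wanted). Returns grants."""
    grants, remaining = {}, idle_instances
    for job, wanted in requests:
        granted = min(wanted, remaining)  # partial grants when the pool runs dry
        grants[job] = granted
        remaining -= granted
    return grants

# Suppose streaming leaves 100 reserved instances idle this minute:
grants = allocate(idle_instances=100,
                  requests=[("encode-4k", 60), ("encode-1080p", 70)])
print(grants)  # {'encode-4k': 60, 'encode-1080p': 40}
```

The article implies something richer (minute-level reservation data, priorities/bidding), but the core idea is just this: hand out already-paid-for capacity before ever touching extra on-demand instances.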