I get that this is, in part, a way to get people excited about working for Netflix. But the people working on this are probably pretty proud of what they achieved and like sharing it with us, their colleagues at other workplaces, and they have the freedom to do so because of our industry's great track record of knowledge sharing.
Is it really a competitive advantage though? I would think that their brand and content library are the main ones. A CDN server doesn't really count IMHO: other streaming services can create their own appliances, though I'm guessing they'll use Linux due to its ubiquity.
Though as a long-time FreeBSD fan and part-time user, I'm happy for the patches. I run a large Isilon OneFS cluster, so hopefully these changes will help with NFS performance. :)
There is another aspect to this contribution: the FreeBSD kernel code is ever-changing, so unless Netflix wants to carry this patch forward internally indefinitely, it is prudent to get it committed upstream so it can be cared for by the community at large.
It totally is. These things get installed into ISP racks, and ISPs have limited space and power for them. One box pushing 200Gbps in the same space is twice as good (although note they have 4x100Gbps NICs in this generation, so there's room for more perf here).
Reducing the number of nodes helps with management of the network as well.
Also a shame that it is TLS'd. The content already has its own DRM/encryption.
E.g. every 64k block contains a SHA256 hash of the next 64k block. When a user seeks, you send them, over a trusted channel, the hash of the first block they'll be receiving. (The hash for the very start of the stream can already be there, RSA-signed or whatever.)
But that's probably a pain when you have so many different devices to support.
(It also makes it harder to figure out what content people are watching... though I think they use variable bitrate encoding, which will fall to traffic analysis anyway.)
The important feature of the scheme that I mentioned is that there's no per-client encryption needed, except maybe a small operation on each client seek in the stream. In general, you just serve the file, and the file contents are self-authenticating against tampering.
You might use signed HTTP exchanges to push the first hash in the chain on a seek, I guess. Or just run it over TLS. Or just have a 64k block pre-signed at the beginning of the stream with the hashes of 2k chosen seek points that the client can store.
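A minimal sketch of what the client-side check for that self-authenticating chain could look like, assuming 64 KiB blocks that each carry the SHA-256 of the next block in their last 32 bytes (the block layout, function name, and use of OpenSSL are my own illustration, not anything Netflix actually does):

    /*
     * Hypothetical client-side check for the scheme sketched above: each
     * 64 KiB block ends with the SHA-256 of the next block, so once the
     * first hash arrives over a trusted channel the rest of the stream
     * authenticates itself.  Uses OpenSSL's SHA256() for illustration.
     */
    #include <openssl/sha.h>
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE (64 * 1024)
    #define HASH_LEN   SHA256_DIGEST_LENGTH

    /* Returns 0 if the block matches 'expected' and fills in the hash the
     * following block must match; returns -1 if it was tampered with. */
    static int
    verify_block(const uint8_t block[BLOCK_SIZE],
        const uint8_t expected[HASH_LEN], uint8_t next_expected[HASH_LEN])
    {
        uint8_t digest[HASH_LEN];

        SHA256(block, BLOCK_SIZE, digest);
        if (memcmp(digest, expected, HASH_LEN) != 0)
            return -1;                      /* block modified in transit */

        /* The last HASH_LEN bytes of each block name the next block. */
        memcpy(next_expected, block + BLOCK_SIZE - HASH_LEN, HASH_LEN);
        return 0;
    }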
You can go back and look at Netflix presentations from when they switched to TLS: they were no longer able to saturate their network cards (and here, they're still only at 50% of network capacity).
But (a) what other streaming services even have CDN appliances, and (b) of the ones that do, which are using FreeBSD as a base?
If none of Netflix's competitors will use this patch, then it makes no material difference whether they share it or not.
B) as far as I'm aware, everyone else is running Linux, except maybe one of the CDNs, but I don't know if that CDN is actually running appliances.
However, everyone and their uncle is launching a streaming service in the next 18 months. Some of those might get enough traffic to justify appliances, and they can build on Netflix's work here to get a competitive box right off the bat.
Also, at any point, one of the current big guys can say "hey, Netflix gets 200Gbps from one box, why do we have 4? Why don't we try FreeBSD on some of these?"
CDN appliances are pretty isolated, so it doesn't matter as much if most of their proprietary stack isn't ported. Chances are, they've got someone on staff with FreeBSD experience, and FreeBSD experience ages better, because there's less churn.
I believe technology as a whole is light-years ahead of where it would be if companies treated all of their findings and system designs as closely-guarded secrets. I'm relatively early in my career, and for the first time I get to open-source a project I've been working on; it feels great.
They might give the advantage away if they released everything, so that others could simply deploy it or offer it as a hosted solution. But pure documentation of what they did is more like an advertisement for interesting engineering work @ Netflix.
"Top U.S. CEOs say companies should put social responsibility above profit"
It's even more amazing to me that this is a successful tactic in light of the fact that there's no shortage of people willing to work for ad-tech firms. If working for a company with good values was such a draw, you'd think it would give them more of a competitive advantage. Maybe the real reason is that Netflix's "good values" are genuine, not just a draw for employees. I'm really not sure.
Worse yet, I've met some really cool people in the space, but it is still ad tech.
Netflix still has to work things out and get the last mile ISPs to play ball ($$$).
But this potentially saves Netflix a whole ton on transit.
It is a bit like Facebook sharing their work on server hardware and database improvements. Ultimately they don't think any of those technologies are their competitive advantage; their core business is social (and ad tracking, but that is another topic).
It's similar for Netflix, and it would be the same for Disney: their core business is not the delivery, but the content.
Having said that, it is still very nice of them to share their work. FreeBSD, Ruby on Rails, Chaos Engineering: there are lots of things I love about their tech stack.
edit: it looks like they hit 200Gbps with both Intel and AMD
This is about which NUMA node PCI-E devices are hooked up to, so you can avoid saturating the link between nodes. That link has a limited amount of bandwidth, and crossing nodes both eats into it and increases latency, which limits your total possible throughput.
RSS configuration doesn't require you to set up your hardware in any specific way; with this, you need to ensure that the disks and NICs are attached to the right domains. E.g. if you place the two NICs on the same NUMA node, no software configuration is going to fix that, and you'd have to go and physically rearrange things.
You might still use RSS to distribute the workload across multiple cores within that NUMA domain when using this setup.
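As a rough illustration of the software side, here's a minimal userspace sketch (my own, not from the slides) that pins the calling thread to the cores of one NUMA domain on FreeBSD with cpuset_setaffinity(2); the domain-to-CPU mapping is an assumption and would really come from the machine's topology:

    /*
     * Sketch: keep a worker thread on the NUMA domain that its NIC and
     * NVMe drives hang off, so its memory traffic doesn't cross the
     * socket interconnect.  Assumes domain 1 covers CPUs 16-31; check
     * the real layout with `cpuset -g` or kern.sched.topology_spec.
     */
    #include <sys/param.h>
    #include <sys/cpuset.h>
    #include <err.h>

    static void
    pin_to_domain1(void)
    {
        cpuset_t mask;
        int cpu;

        CPU_ZERO(&mask);
        for (cpu = 16; cpu < 32; cpu++)     /* assumed CPUs of domain 1 */
            CPU_SET(cpu, &mask);

        /* CPU_WHICH_TID with id -1 means "the calling thread". */
        if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
            sizeof(mask), &mask) != 0)
            err(1, "cpuset_setaffinity");
    }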
We talk about this stuff every once in a while, but it doesn't really make sense to do right now. (Disclaimer: I work at Netflix)
At least originally, it was because of the license.
However most systems aren't pushed to extremes, so these corner cases aren't seen very often, regardless of OS.
Thank you for your contributions. Great information.
Neither the CPU nor main memory sees any of the network packets as long as they stay on the happy path. Only connection setup, DMA orchestration, and the occasional TLS renegotiation have to be handled.
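In application terms, that happy path is basically FreeBSD's sendfile(2). A minimal sketch (mine, assuming the socket already has kTLS and, on capable NICs, inline TLS offload set up) looks something like this:

    /*
     * Zero-copy send of a byte range from a content file to a client
     * socket.  The payload goes from the page cache to the NIC by DMA;
     * with kTLS/inline TLS the encryption also happens below this call,
     * so userspace never touches the data.
     */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <err.h>

    static off_t
    serve_range(int filefd, int sockfd, off_t offset, size_t len)
    {
        off_t sent = 0;

        if (sendfile(filefd, sockfd, offset, len, NULL, &sent, 0) != 0)
            err(1, "sendfile");
        return sent;            /* bytes actually handed to the socket */
    }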
AMD Ryzen has a built-in crypto "decelerator" — a FreeBSD driver was written for the crypto engine, but it's disabled by default because it made everything slower than AES-NI. (Though I guess it would be funny to use it to mine bitcoin, since it supports SHA256. AMD — Advanced Mining Devices!)
The AMD Zen1 Crypto Co-Processor is indeed slower than AESNI; I think it's mostly used by stuff like SecureBoot, TPM, etc, and also used internally by the CPU to generate RDRAND/RDSEED data. It was probably never intended to be used by OS drivers and certainly not intended to be any kind of accelerator.
The part I know of that is built into the CPU is a DMA engine called I/OAT; it just does DMA and maybe basic checksum and RAID transformations. It is sometimes confused with QAT (I've personally confused the two...).
Now that ktls is upstream, we are looking at using the ccr crypto acceleration. We've already tested them in inline mode. TOE is not an option for us, since we do innovation in the TCP stack.
On Linux there's a spinlock in do_softirq that blocks even in non-blocking IO.
Still, though, I hate seeing needless waste. There's no reason the active thread needs to block on a softirq when there are unused cores that could process it.
It's been a few months, and I had intended to go back and try using a try_lock(). But I'm not normally a kernel dev.
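For what it's worth, the pattern I had in mind looks roughly like this userspace sketch (pthread spinlocks standing in for the kernel lock; this is not the actual do_softirq code):

    /*
     * Instead of blocking on a contended lock, try to take it and move
     * on if someone else already holds it, leaving the work for the
     * core that currently owns the lock.
     */
    #include <pthread.h>
    #include <errno.h>

    static pthread_spinlock_t work_lock;  /* init once with pthread_spin_init() */

    static void
    process_pending_work(void)
    {
        if (pthread_spin_trylock(&work_lock) == EBUSY)
            return;             /* someone else is draining the queue */

        /* ... drain the pending work here ... */

        pthread_spin_unlock(&work_lock);
    }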
Have you considered using FreeBSD? ;)
There are a lot of reasons why everything should be behind TLS.
Pretty sure Intel single socket of this generation is totally non-viable for this workload due to lack of PCIe lanes. Maybe viable when Intel gets gen4 PCIe.
We're in total agreement :-). Their dataflow model requires something like 2x that in PCIe bandwidth and 4x in memory bandwidth in the optimal case, as covered in the slides. 2x 200 Gbps = 400 Gbps, which is a bit more than 345 Gbps.
Maybe they could push 345/2 ≈ 172 Gbps out of a single Skylake-X, best case. For some workloads, that might be the right local optimum! They must have decided that the marginal cost of a 2P system was worth the extra ~25 Gbps to fully saturate the 200 Gbps pipe.
> they're doing it with two CPUs, two sets of NVMe devices, and two NICs, and a bunch of hacks to make the operating system pretend all this stuff is not in the same box. It seems way more straight-forward to me if they had made it all be actually NOT in the same box!
I've spoken with NFLX engineers in the past and my recollection is that in many installations, NFLX only get to install one box. (Or something like that. Might just be a cost thing.) So they need to make that one box fast.
I guess the other factor is the IP management overhead discussed in the slides. Two boxes necessitate the costly 2nd IP, as far as I know. It's hard to imagine the cost of an IP address dominating the marginal cost of a 2P system and a 2nd Xeon, but I guess AWS is friggin expensive.
The 4 nodes-per-socket (NPS) configuration gives the best performance, followed by 2 NPS, followed by non-NUMA. This surprised me as well.
But if CPU#1 wants to access a file that is on CPU#2's NVMe drives, NUMA allows you to share those files across memory (and it's a "local" file according to the OS), instead of over NFS or SMB.
And yes, as much as we like to pretend that there's no communication and everything scales horizontally... in practice, people like sharing files between systems. NUMA allows these files (and other resources, such as PCIe network cards or GPUs) to be shared between systems at the speed of DDR4 memory.
It may be a bit early to have a presentation on the latest Epyc processors. Most of the work was likely done with previous processors, but their slides said their AMD boxes are single socket.
 https://www.youtube.com/watch?v=vcyQBup-Gto (about the 12 minute mark)
Sure, they might run into some bug in -CURRENT, but what's the chance of a bunch of their OCAs all hitting it at the same time?
1) GPL vs BSD licenses. Companies like BSD licenses much more than GPL licenses. GPL adherents can whine all they want, but this is simply true.
2) BSD has a long history of having very good networking stacks--albeit on specific hardware. Linux supported everything initially--including really cheap crap--and consequently its networking stack was a lot more ad hoc. FreeBSD chose specific hardware for stability--but then supported that much more completely.
3) FreeBSD has a long history of being the servers in Internet infrastructure. There are specific architectural choices in the kernel because of this. There is probably still some inertia, too, in that the kind of old guard people who REALLY grok networking are still more comfortable on FreeBSD machines.
Consequently, it is hardly surprising that an advanced networking development would take place on FreeBSD.
Given that there are vendors that use FreeBSD for their appliances, they really don't want to have to send out techs to customers' sites to fix things. So when the appliance makers choose hardware, they talk to component vendors about quality.
It's no surprise that you see commits from Intel and Chelsio employees in the FreeBSD logs: companies like Netflix, Isilon, NetApp, and Juniper partner with them to make sure things aren't buggy.
These collaborations lead to point 3.