Serving Netflix Video at 400Gb/s on FreeBSD [pdf] (freebsd.org)
707 points by drewg123 34 days ago | 293 comments

These are the slides from my EuroBSDCon presentation. AMA

I eventually figured it out. But I would suggest maybe giving a brief 1-slide thingy on "lagg", and link aggregation? (Maybe it was clear from the presentation, but I only see the slides here soooooo...)

I'm not the best at network infrastructure, though I'm more familiar with NUMA stuff. So I was trying to figure out how you only got 1-IP address on each box despite 4-ports across 2-nics.

I assume some Linux / Windows devops people are just not as familiar with FreeBSD tools like that!

EDIT: Now that I think of it: maybe a few slides on how link-aggregation across NICs / NUMA could be elaborated upon further? I'm frankly not sure if my personal understanding is correct. I'm imagining how TCP-connections are fragmented into IP-packets, and how those packets may traverse your network, and how they get to which NUMA node... and it seems really more complex to me than your slides indicate? Maybe this subject will take more than just one slide?

Thanks for the feedback. I was hoping that the other words on the slide (LACP and bonding) would give enough context.

I'm afraid that my presentation didn't really have room to dive much into LACP. I briefly said something like the following when giving that slide:

Basically, the LACP link partner (router) hashes traffic consistently across the multiple links in the LACP bundle, using a hash of its choosing (typically an N-tuple, involving IP address and TCP port). Once it has selected a link for that connection, that connection will always land on that link on ingress (unless the LACP bundle changes in terms of links coming and going). We're free to choose whatever egress NIC we want (it does not need to be the same NIC the connection entered on). The issue is that there is no way for us to tell the router to move the TCP connection from one NIC to another (well, there is in theory, but our routers can't do it)

I hope that helps

> We're free to choose whatever egress NIC we want

Wait, I got lost again... You say you can "output on any egress NIC". So all four egress NICs have access to the TLS encryption keys and are cooperating through the FreeBSD kernel to get this information?

Is there some kind of load-balancing you're doing on the machine? Trying to see which NIC has the least amount of traffic and routing to the least utilized NIC?

LACP (which is a control protocol) is a standard feature that switches have supported forever. To put it simply, both the server and the switch see a single (fake) port that has physical Ethernet members, and a hashing algorithm puts the traffic on one of those physical Ethernet links based on the selected hash. The underlying sw/hw picks which physical link to put the packet on. The inputs to the hash can be source/destination MAC, port, or IP address. LACP handles the negotiation of what that hash is between the ends and also handles a link failure ("hey man, something broke, we have one less link now"). Any given single flow will always hash to the same link. So, for example, in a 4x10G lag (also called a port-channel in networking speak), the max bandwidth for a single flow would be 10G, the max of a single member. In an ideal world the hashing would be perfectly balanced; however, it is possible to have a set of flows all hash to the same link. Hope that helps.
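
To make the hashing idea concrete, here is a minimal sketch of per-flow link selection. The link names, the example addresses, and the use of CRC32 are all assumptions for illustration; real switches use their own (often configurable) hash, and this is not how any particular vendor implements it.

```python
import zlib
from collections import Counter

# Hypothetical 4-member bundle: 2 NICs with 2 ports each.
LINKS = ["nic0-port0", "nic0-port1", "nic1-port0", "nic1-port1"]

def pick_link(src_ip, src_port, dst_ip, dst_port, links=LINKS):
    """Map a flow's address/port tuple to one member link of the bundle."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    # CRC32 stands in for whatever hash the real device uses (an assumption).
    return links[zlib.crc32(key) % len(links)]

# The same flow always lands on the same link.
flow = ("203.0.113.7", 51514, "198.51.100.1", 443)
assert pick_link(*flow) == pick_link(*flow)

# Many flows spread across the links, but not necessarily evenly.
spread = Counter(
    pick_link("203.0.113.7", port, "198.51.100.1", 443)
    for port in range(50000, 50400)
)
print(spread)
```

Because the link is a pure function of the tuple, a single flow can never use more than one member's bandwidth, and an unlucky set of flows can pile onto one link.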

That's an excellent overview. I think I got the gist now.

There's all sorts of subtle details though that's probably just "implementation details" of this system. How and where do worker threads spawn up? Clearly sendfile / kTLS have great synergies, etc. etc.

It's a lot of detail, and an impressive result for sure. I probably don't have the time to study this on my own so of course, I've got lots and lots of questions. This discussion has been very helpful.


I think some of the "missing picture" is the interaction of sendfile / kTLS. It makes sense, but just studying these slides has solidified a lot for me as well: https://papers.freebsd.org/2019/EuroBSDCon/shwartsman_gallat...

Adding the NUMA things "on top" of sendfile/kTLS is clearly another issue. The hashing of TCP/Port information into particular links is absolutely important, because the "physical location" of the ports matters.

I think I have the gist at this point. But that's a lot of moving parts here. And the whole NUMA-fabric being the bottleneck just ups the complexity of this "simple" TLS stream going on...

EDIT: I guess some other bottleneck exists for Intel/Ampere's chips? There's no NUMA in those. Very curious.


Rereading the "Disk centric siloing" slides later on actually answers a lot of my questions. I think my mental model was disk-centric siloing and I just didn't realize it. Those slides work exactly how I think this "should" have worked, but it seems like that strategy was shown to be inferior to the strategy talked about in the bulk of this presentation.

Hmmm, so my last "criticism" to this excellent presentation. Maybe an early slide that lays out the strategies you tried? (Disk Siloing. Network-siloing. Software kTLS, and Hardware kTLS offload)?

Just one slide at the beginning saying "I tried many architectures" would remind the audience that many seemingly good solutions exist. Something I personally forgot in this discussion thread.

The whole paper is discussing bottlenecks, path optimisation between resources, and impacts of those on overall throughput. It's not a simple loadbalancing question being answered.

In terms of LACP in general, not for hw TLS. For HW TLS, the keys and crypto state are NIC specific.

If we are using LACP between router and server, it means we create a single logical link between them. We can use as many physical links as both the server and the router support. The server and router will treat them as a single link. Thus the ingress and egress of the packet doesn’t really matter.

What's happening at the switch(es) the NICs are connected to?

Although we’re free to choose the egress port in LACP, it’s still wise to maintain some kind of client or flow affinity, to avoid inadvertent packet reordering.

LACP/link aggregation are IEEE Ethernet standards concepts supported by nearly every hardware or software network stack. https://en.wikipedia.org/wiki/Link_aggregation

It’s an IEEE standard not a FreeBSD thing.


I build this stuff so it's so cool to read this, I can't really be public about my stuff. Are you using completely custom firmwares on your Mellanoxes? Do you have plans for nvmeOF? I've had so many issues with kernels/firmware scaling this stuff that we've got a team of kernel devs now. Also how stable are these servers? Do you feel like they're teetering at the edge of reliability? I think once we've ripped out all the OEM firmware we'll be in a much better place.

Are you running anything custom in your Mellanoxes? dpdk stuff

No, nothing custom, we don't have access to the source.

We like to run each server as an independent entity, so we don't run NVMEof.

They're pretty much rock solid.

If you'd like to discuss more, ping me via email. Use my last name at gmail . com.

It is hard to decipher your message, but just to clarify, firmware doesn't process packets (dataplane) it only manages and configures hardware (control plane). And no, you definitely won't be in "much better place" by "ripping it off" because modern NICs have very complex firmware with hundreds (if not thousands) man/years spent implementing and optimizing it.

Bud, I work on this stuff, I know all about Cavium and Mellanox firmware and its issues, specifically with things like the math/integers used to determine packetflow which is used to charge customers having issues on specific versions that have been internally patched. Do you think I just randomly typed this? It's even worse now that a huge chunk of their firmware teams have been lost in the shuffle of all of the competitors being bought and brought under one umbrella. Do you recall a bug on AWS years ago where they were incorrectly charging customers bandwidth usage? Everyone uses these types of nics, tons of PAAS/SAAS corps had that problem.

What an obnoxious, pedantic and naive response.

"And no, you definitely won't be in "much better place" by "ripping it off" because modern NICs have very complex firmware with hundreds (if not thousands) man/years spent implementing and optimizing it."

What do you think I just explained? I literally write this stuff. Modern nic firmware is still written in C and still has bugs. Do you seriously think drivers/firmware aren't going to have bugs? You just said yourself that they're high complexity. I can't believe I'm even needing to explain this. You've clearly never been a network engineer, do you have any idea how many bugs are in Juniper, Cisco, Palo, etc?

If you don't work on bare metal architecture/distributed systems, move along. This isn't sysadmin talk. Almost nothing used is stock, even k8s gets forked due to bugs. You can't resell PAAS with bugs that conflict with customer billing. NFLX isn't reselling bandwidth so they likely don't encounter these issues, they're using something like Cedexis to force their CDN providers at the edge to compete with one another down to the lowest cost/reliability and they're liable for things like this and can be sued/loss based on their contract. They (CDN customers) are acutely aware of when billing doesn't match up with realized BW usage. They'll drop the traffic to your CDN and split more of it over to a "better" CDN until those issues are mitigated - and while you're not getting that traffic you're not getting that customer's expected monthly payment... because now they're buying less bandwidth from you because Cedexis tells them that you're less performant/reliable.

ANY large customer buying bandwidth from a CDN does this, none of this is specific to NFLX whom I know nothing about beyond.. "this is how reselling PAAS works."

I bet the next response is "no way NFLX uses a CDN".. LOL. They all do, my friend, HBO, paramount, disney, etc. They aren't in the biz of edge caching.

You really overreacted here.

How would you characterize the economics of this vs. alternative solutions? Achieving 400Gb/s is certainly a remarkable achievement, but is it the lowest price-per-Gb/s solution compared to alternatives? (Even multiple servers.)

I'm not on the hardware team, so I don't have the cost breakdown. But my understanding is that flash storage is the most expensive line item, and it matters little if it's consolidated into one box, or spread over a rack, you still have to pay for it. By serving more from fewer boxes, you can reduce component duplication (cases, mobos, ram, PSUs) and more importantly, power & cooling required.

The real risk is that we introduce a huge blast radius if one of these machines goes down.

I almost fell for the hype of pcie gen4 after reading https://news.ycombinator.com/item?id=25956670, and it is quite interesting that pcie gen3 nvme drives can still do the job here. What would be the worst case disk I/O throughput while serving 400Gb/s?

If you look at just the pci-e lanes and ignore everything else, the NICs are x16 (gen4) and there's two of them. The NVMes are x4 (gen3) and there are 18 of them. Since gen4 is about twice the bandwidth of gen3, it's 32 lanes of gen4 NIC vs about 36 lanes of gen4 equivalent NVMe.

If we're only worried about throughput, and everything works out with queueing, there's no need for gen4 NVMes because the storage has more bandwidth than the network. That doesn't mean gen4 is only hype; if my math is right, you need gen4x16 to have enough bandwidth to run a dual 100G ethernet at line rate, and you could use fewer gen4 storage devices if reducing device count were useful. I think for Netflix, they'd like more storage, so given the storage that fits in their systems, there's no need for gen4 storage; gen4 would probably make sense for their 800Gbps prototype though.
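
A rough check of the lane arithmetic above, as a sketch. The per-lane rates are the raw PCIe signalling rates (8 GT/s for Gen3, 16 GT/s for Gen4, both with 128b/130b encoding); packet-level overhead is ignored, so real usable bandwidth is somewhat lower than these figures.

```python
# Raw signalling rate per lane in GT/s; Gen3 and Gen4 both use 128b/130b encoding.
GT_PER_LANE = {3: 8.0, 4: 16.0}
ENCODING = 128 / 130

def pcie_gbps(gen, lanes):
    """Approximate Gb/s for a PCIe link, ignoring packet (TLP) overhead."""
    return GT_PER_LANE[gen] * ENCODING * lanes

nic_total = 2 * pcie_gbps(4, 16)    # two Gen4 x16 NICs
nvme_total = 18 * pcie_gbps(3, 4)   # eighteen Gen3 x4 NVMe drives

print(f"NIC PCIe bandwidth:  {nic_total:.0f} Gb/s")
print(f"NVMe PCIe bandwidth: {nvme_total:.0f} Gb/s")
print(f"Gen3 x16: {pcie_gbps(3, 16):.0f} Gb/s, Gen4 x16: {pcie_gbps(4, 16):.0f} Gb/s")
```

Under these assumptions a Gen3 x16 slot tops out near 126 Gb/s, short of the 200 Gb/s a dual 100G NIC needs, while Gen4 x16 reaches about 252 Gb/s, which matches the claim that the NICs need Gen4 but the storage does not.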

In terms of disk I/O, either in the thread or the slides, drewg123 mentioned only about 10% of requests were served from page cache, leaving 90% served from disk, so that would make worst case look something like 45GB/sec (switching to bytes cause that's how storage throughput is usually measured). From previous discussions and presentations, Netflix doesn't do bulk cache updates during peak times, so they won't have a lot of reads at the same time as a lot of writes.
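
The 45GB/sec figure follows directly from the numbers quoted in this comment. A small sketch of the arithmetic, where the even spread of reads across the 18 drives is an assumption:

```python
def disk_read_gbytes_per_sec(network_gbps, page_cache_hit_ratio):
    """GB/s that must come from disk for a given network rate and cache hit ratio."""
    from_disk_gbps = network_gbps * (1.0 - page_cache_hit_ratio)
    return from_disk_gbps / 8  # bits -> bytes

total = disk_read_gbytes_per_sec(400, 0.10)
per_drive = total / 18  # 18 NVMe drives, assuming reads are spread evenly
print(f"{total:.1f} GB/s total, {per_drive:.2f} GB/s per drive")
```

At about 2.5 GB/s per drive, each drive stays under the roughly 3.9 GB/s that a Gen3 x4 link can carry.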

Thanks for the numbers. Perhaps hype is not the right word. It is just interesting to see that some older hardware can still be used to achieve the state of the art performance, as the bottleneck may lie elsewhere.

It's always balancing bottlenecks. Here, the bottleneck is memory bandwidth, limiting to (more or less) 32 lanes of network; the platform has 128 lanes, so using more lanes than needed at a slower rate works and saves a bit of cost (probably). On their Intel Ice Lake test machine, that only had 64 lanes which is also a bottleneck, so they used Gen4 NVMe to get the needed storage bandwidth into the lanes available.

thank you for this link, superuseful for my next project

Ouch. I guess there is always a trade off to be made…

One of the factors is ISPs that install these boxes have limited space and many companies that want to place hardware there. If Netflix can push more traffic in less space, that makes it a better option for ISPs that want to reduce external traffic and thus more likely to be installed benefiting their users.

If what I found is current, Netflix has 1U and 2U appliances, but FB does a 1U switch + groups of 4x 2U servers, so starting at 9U; but I was looking at a 2016 FB pdf that someone uploaded, they may have changed their deployments since then. 2U vs 9U can make a big difference.

I doubt that the 32-core EPYC they focused on is even the most economical solution in this situation.

If they're really RAM-bandwidth constrained, then the 24-core 74F3 (which still has all 256MBs of L3 cache) or even 8-core 72F3 may be better.

L3 won’t matter.

The more compute clusters the more PCI lanes in EPYC and the SSD lanes go direct per 8cores.

I recall when I was at Intel in the 90s and 3MB of L3 was a big deal.

The 8 core still has 128 lanes and 8 channels of memory, though.

Particularly, kTLS is a solution that fits here, on a single-box, but I wonder how things would look if the high-perf storage boxes sent video unencrypted and there was a second box that dealt only with the TLS. We'd have to know how many streams 400Gb/s represents though, and have a far more detailed picture of Netflix's TLS needs/usage patterns.

A proxy for TLS wouldn't help with this load.

That proxy would still need to do kTLS to reduce the required memory bandwidth to something the system can manage, and then you're at roughly the same place. The storage nodes would likely still have kTLS capable NICs because those are good 2x100G NICs anyway. It would be easier to NUMA align the load though, so there might be some benefit there. With the right control software, the proxy could pick an origin connection that was NUMA aligned with the client connection on the proxy and the storage on the origin. That's almost definitely not worth doubling the node count for though, even if proxy nodes don't need storage so they're probably significantly less expensive than flash storage nodes.

Could Netflix replace TLS with some in-house alternative that would push more processing to the client? Something that pre-encrypts the content on disk before sending, eliminating some of the TLS processing requirements?

If you pre-encrypt contents on disk, it wouldn't be a per-user unique key.

The content is already encrypted anyway for DRM.

I'd assume TLS is used in large part for privacy reasons (so ISPs can't sell info on what shows are popular)

This, can someone add to this it's crucial.

This level of architecture management on big server CPUs is amazing! I occasionally handle problems like this on a small scale, like minimizing wake time and peripheral power management on an 8 bit microcontroller, but there the entire scope is digestible once you get into it, and the kernel is custom-designed for the application.

However, in my case, and I expect in yours, requirements engineering is the place where you can make the greatest improvements. For example, I can save a few cycles and a few microwatts by sequencing my interrupts optimally or moving some of the algorithm to a look-up table, but if I can, say, establish that an LED indicator flash that might need to be 2x as bright but only lasts for a couple milliseconds every second is as visible as a 500ms LED on/off blink cycle, that's a 100x power savings that I can't hope to reach with micro-optimizations.

What are your application-level teams doing to reduce the data requirements? General-purpose NUMA fabrics are needed to move data in arbitrary ways between disc/memory/NICs, but your needs aren't arbitrary - you basically only require a pipeline from disc to memory to the NIC. Do you, for example, keep the first few seconds of all your content cached in memory, because users usually start at the start of a stream rather than a few minutes in? Alternatively, if 1000 people all start the same episode of Stranger Things within the same minute, can you add queues at the external endpoints or time shift them all together so it only requires one disk read for those thousand users?

> Alternatively, if 1000 people all start the same episode of Stranger Things within the same minute

It would be fascinating to hear from Netflix on some serious details of the usage patterns they see and particular optimizations that they do for that, but I doubt there's so much they can do given the size of the streams, the 'randomness' of what people watch and when they watch, and for the fact that the linked slides say the servers have 18x2TB NVME drives per-server and 256GB.

I wouldn't be surprised if the Netflix logo opener exists once on disk instead of being the first N seconds of every file though.

In previous talks Netflix has mentioned that due to serving so many 1000s of people from each box, that they basically do 0 caching in memory, all of the system memory is needed for buffers that are enroute to users, and they purposely avoid keeping any buffer cache beyond what is needed for sendfile()

Hey Drew, thanks for taking the time.

What would you rate the relative complexity of working with the NIC offloading vs the more traditional optimizations in the rest of the deck? Have you compared other NIC vendors before or has Mellanox been the go to that's always done what you've needed?

I wish the video was online... We tried another vendor's NIC (don't want to name and shame). That NIC did kTLS offload before Mellanox. However, they could not retain TLS crypto state in the middle of a record. That meant we had to coerce TCP into trying really hard to send at TLS record boundaries. Doing this caused really poor QoE metrics (increased rebuffers, etc), and we were unable to move forward with them.
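
A toy illustration of why sending only at TLS record boundaries can hurt. It assumes fixed-size records of 16384 bytes (the maximum TLS plaintext record size) and ignores record headers and authentication tags; the function and its parameters are hypothetical and do not describe how any NIC or the FreeBSD stack actually handles this.

```python
TLS_MAX_RECORD = 16384  # maximum TLS plaintext record size in bytes (2**14)

def sendable_at_record_boundary(stream_offset, window, record_size=TLS_MAX_RECORD):
    """Bytes that can be sent from stream_offset without ending mid-record.

    stream_offset: bytes of the stream already sent (hypothetical bookkeeping)
    window: bytes TCP would otherwise be willing to send right now
    """
    end = stream_offset + window
    aligned_end = (end // record_size) * record_size  # round down to a boundary
    return max(0, aligned_end - stream_offset)

# With a 40,000-byte window from the start of the stream, only two full
# records (32,768 bytes) go out; the remaining 7,232 bytes must wait.
print(sendable_at_record_boundary(0, 40_000))
# A window smaller than one record sends nothing at all.
print(sendable_at_record_boundary(0, 10_000))
```

In this simplified model, part of every send opportunity is left unused, which is one plausible way such a constraint could delay data and show up as rebuffers.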

Is there any way we can see a video of the presentation? I'm extremely interested

The videos should appear on the conference's youtube channel in a few weeks: https://www.youtube.com/eurobsdcon

Is there something I as a person can contribute to make the videos available sooner?

I have video editing skills and have also done some past-time subtitling of videos. I have all of the software necessary to perform both of these tasks and would be willing to do so free of charge.

Was FreeBSD your first choice? Or did you try with Linux first? What were the numbers for Linux-based solution, if there was one?

FreeBSD was selected at the outset of the Open Connect CDN (~2012 or so).

We did a bake off a few years ago, and FOR THIS WORKLOAD FreeBSD outperformed Linux. I don't want to get into an OS war, that's not productive.

It's important to consider that we've poured man-years into this workload on FreeBSD. Just off the top of my head, we've worked on in house, and/or contributed to or funded, or encouraged vendors to pursue:

- async sendfile (so sendfile does not block, and you don't need thread pools or AIO)
- RACK and BBR TCP in FreeBSD (for good QoE)
- kTLS (so you can keep using sendfile with TLS, saves ~60% CPU over reading data into userspace and encrypting there)
- NUMA
- kTLS offload (to save memory bandwidth by moving crypto to the NIC)
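
To see why moving crypto to the NIC saves memory bandwidth, here is a simplified model. The "touches per byte" counts are assumptions made for illustration (they are not stated in this thread): with software kTLS each byte is written to RAM by the disk, read by the CPU, written back encrypted, and read by the NIC; with NIC offload only the first and last steps remain.

```python
def memory_traffic_gbytes_per_sec(network_gbps, memory_touches_per_byte):
    """Memory bandwidth in GB/s needed to serve network_gbps of traffic."""
    return network_gbps / 8 * memory_touches_per_byte

SOFTWARE_KTLS_TOUCHES = 4  # assumed: disk->RAM, CPU read, CPU write, RAM->NIC
NIC_KTLS_TOUCHES = 2       # assumed: disk->RAM, RAM->NIC

sw = memory_traffic_gbytes_per_sec(400, SOFTWARE_KTLS_TOUCHES)
hw = memory_traffic_gbytes_per_sec(400, NIC_KTLS_TOUCHES)
print(f"software kTLS: {sw:.0f} GB/s, NIC kTLS offload: {hw:.0f} GB/s")
```

Under this model, offload halves the memory traffic needed for the same 400Gb/s of output.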

Not to mention tons of VM system and scheduler improvements which have been motivated by our workload.

FreeBSD itself has improved tremendously over the last few releases in terms of scalability

> FreeBSD itself has improved tremendously over the last few releases in terms of scalability

True. FreeBSD (or its variants) has always been a better performer than Linux in the server segment. Before Linux became popular (mostly due to better hardware support), xBSD servers were famous for their low maintenance and high uptime (and still are). This archived page of NetCraft statistics ( https://web.archive.org/web/20040615000000*/http://uptime.ne... ) provides an interesting glimpse into internet history of how 10+ years back, the top 50 servers with the highest uptimes were often xBSD servers, and how Windows and Linux servers slowly replaced xBSD.

(Here's an old HN discussion about a FreeBSD server that ran for 18 years - https://news.ycombinator.com/item?id=10951220 ).

a lot of this is in Linux now right ? i am asking a personal opinion and not necessarily a "why-dont-u-move-to-linux" question.

Genuinely curious on where u see state of art when it comes to Linux.

Yes. I ran it for a bake off ~2 years ago. At the time, the code in linux was pretty raw, and I had to fix a bug in their SW kTLS that caused data corruption that was visible to clients. So I worry that it was not in frequent use at the time, though it may be now.

My understanding is that they don't do 0-copy inline ktls, but I could be wrong about that.

Thank you for pushing kTLS!

Was licensing also a contributing factor, or was that irrelevant for you?

My understanding is that licensing did factor into the decision. However, I didn't join Netflix until after the decision had been made.

I recall reading that Netflix chose FreeBSD a decade ago because asynchronous disk IO on Linux was (and still is?) broken and/or limited to fixed block offsets. So nginx just works better on FreeBSD versus Linux for serving static files from spinning rust or SSD.

This used to be the case, but with io_uring, Linux has very much non-broken buffered async I/O. (Windows has copied io_uring pretty much verbatim now, but that's a different story.)

Could you expand more on the Windows io_uring bit please?

I have run Debian based Linux my entire life and recently moved circumstantially to Windows. I have no idea how its kernel model works and I find io_uring exciting.

Wasn't aware of any adoption of io_uring ideas in Windows land, sounds interesting

Windows has had “IO completion ports” since the 1990s which work well and are high performance async for disk/network/other IO operations.

This isn't the same as the old Windows async I/O. ptrwis' links are what I thought of (and it's essentially a 1:1 copy of io_uring, as I understand it).

By how much?

As an infrastructure engineer, these numbers are absolutely mind blowing to me!

Not sure if it’s ok to ask.. how many servers like this one does it take to serve the US clients?

I don't have the answer, and even if I did, I'm not sure I'd be allowed to tell you :)

But note that these are flash servers; they serve the most popular content we have. We also have "storage" servers with huge numbers of spinning drives that serve the longer tail. They are constrained by spinning rust speeds, and can't serve this fast.

I found somewhere that Netflix has ~74 million US/Canada subscribers. If we guesstimate half of those might be on at peak time, that's 37 million users. At 400k connections/server that's only about 93 servers to serve the connections, so I think the determining factor is the distribution of content people are watching.
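
The back-of-the-envelope estimate can be reproduced from the three figures quoted in this comment. One connection per concurrent viewer is an assumption; nothing here reflects actual Netflix deployment numbers.

```python
import math

subscribers = 74_000_000          # figure quoted in the comment
peak_fraction = 0.5               # the comment's guess
connections_per_server = 400_000  # figure quoted in the comment

concurrent = subscribers * peak_fraction
servers = math.ceil(concurrent / connections_per_server)
print(servers)
```

Connection count alone gives a small number, which supports the point that the spread of content being watched matters more than the raw number of viewers.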

What led you to investigate PCIe relaxed ordering? Can you suggest a book or other resource to learn more about PCIe performance?

To be honest, it was mostly the suggestion from AMD.

At the time, AMD was the only Gen4 PCIe available, and it was hard to determine if the Mellanox NIC or the AMD PCIe root was the limiting factor. When AMD suggested Relaxed Ordering, that brought its importance to mind.

How do you benchmark this ? Do you use real-life traffic, or have a fleet of TLS clients ? If you have a custom testsuite, are the clients homogeneous ? How many machines do you need ? Do the clients use KTLS ?

We test on production traffic. We don't have a testbench.

This is problematic, because sometimes results are not reproducible.

Eg, if I test on the day of a new release of a popular title, we might be serving a lot of it cached from RAM, so that cuts down memory bandwidth requirements and leads to an overly rosy picture of performance. I try to account for this in my testing.

Test in prod FTW!

However, how do you test the saturation point when dealing with production traffic? Won't you have to run your resources underprovisioned in order to achieve saturation? Doesn't that degrade the quality of service?

Or are these special non-ISP Netflix Open Connect instances that are specifically meant to be used for saturation testing, with the rest of the load spilling back to EC2?

We have servers in IX locations where there is a lot of traffic. It's not my area of expertise (being a kernel hacker, not a network architect), but our CDN load does not spill back to EC2.

The biggest impact I have to QoE is when I crash a box, but clients are architected to be resilient against that.

Thanks, it's an interesting set of tradeoffs.

Does FreeBSD `sendfile` avoid a context switch from userspace to kernelspace as well, or is it only zero-copy? I've worked with 100Gbps NICs and had to end up using both a userspace network stack and a userspace storage driver on Linux to avoid the context switch and ensure zero-copy.

Also, have you looked into offloading more of the processing to an FPGA card instead?

There is no context switch; like most system calls, sendfile runs in the thread context of the thread making the syscall.

FreeBSD has "async sendfile", which means that it does not block waiting for the data to be read from disk. Rather, the pages that have been allocated to hold the data are staged in the socket buffer and attached to mbufs marked "not ready". When the data arrives, the disk interrupt thread makes a callback which marks the mbufs "ready", and pokes the TCP stack to tell them they are ready to send.

This avoids the need to have many threads parked, waiting on disk io to complete.
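
For readers who have not used sendfile from an application, here is a minimal usage sketch in Python. It uses `socket.sendfile()`, which relies on the operating system's sendfile call where available and falls back to ordinary sends otherwise. The socket pair stands in for a real client connection, and the offsets are arbitrary; this shows only the calling pattern, not FreeBSD's async behaviour or kTLS.

```python
import os
import socket
import tempfile

payload = os.urandom(4096)

with tempfile.TemporaryFile() as f:
    f.write(payload)
    f.flush()

    server, client = socket.socketpair()  # stand-in for a real TCP connection
    with server, client:
        # Send bytes 1000..2999 of the file, like a small HTTP range request.
        sent = server.sendfile(f, offset=1000, count=2000)

        received = b""
        while len(received) < sent:
            received += client.recv(65536)

print(sent, received == payload[1000:3000])
```

The application never reads the file contents into its own buffers; it only names the file, the offset, and the length.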

To be clear, was FreeBSD used because of historical reasons or because similar performance can't be/harder to achieve on Linux?

I mean, most CDNs and FANG run on Linux. I think in that case it's kTLS that makes a big difference; the rest, not much.

Async sendfile is also an advantage for FreeBSD. It is specific to FreeBSD. It allows an nginx worker to send from a file that's cold on disk without blocking, and without resorting to threadpools using a thread-per-file.

The gist is that the sendfile() call stages the pages waiting to be read in the socket buffer, and marks the mbufs with M_NOTREADY (so they cannot be sent by TCP). When the disk read completes, a sendfile callback happens in the context of the disk ithread. This clears the M_NOTREADY flag and tells TCP they are ready to be sent. See https://www.nginx.com/blog/nginx-and-netflix-contribute-new-...
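
The mechanism described above can be modelled in a few lines. This is a toy simulation, not FreeBSD code: the class and method names are hypothetical, a boolean stands in for the M_NOTREADY flag, and a byte counter stands in for TCP transmission.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Mbuf:
    data_len: int
    ready: bool = False  # False models the M_NOTREADY flag

class SocketBuffer:
    def __init__(self):
        self.queue = deque()
        self.sent = 0

    def sendfile(self, nbytes, page_size=4096):
        """Stage not-ready mbufs and return immediately (no blocking on disk)."""
        staged = []
        while nbytes > 0:
            m = Mbuf(min(page_size, nbytes))
            self.queue.append(m)
            staged.append(m)
            nbytes -= m.data_len
        return staged

    def disk_read_done(self, mbufs):
        """Completion callback: mark the data ready, then poke TCP."""
        for m in mbufs:
            m.ready = True
        self.tcp_output()

    def tcp_output(self):
        """Send in order, stopping at the first mbuf that is not ready."""
        while self.queue and self.queue[0].ready:
            self.sent += self.queue.popleft().data_len

sb = SocketBuffer()
first = sb.sendfile(8192)
second = sb.sendfile(4096)
sb.disk_read_done(second)  # later read finishes first: nothing can go out yet
print(sb.sent)
sb.disk_read_done(first)   # head of the queue is ready: everything drains
print(sb.sent)
```

The caller of `sendfile` never waits for the disk, so no thread has to be parked per outstanding read; the completion callback is what moves data forward.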

sendfile() with splice and io_uring is similar? I know that this is very experimental on Linux.

The overall idea is to copy bytes from disk to the socket with almost no allocation and not blocking, this is the idea right?

Maybe. A few things in io_uring are implemented by letting a kernel task/thread block on doing the actual work. Which calls those are seems to change in every version, and might be tricky to find out without reading the kernel code.

I'd imagine scaling, licensing and overhead all had something to do with it, too.

Interesting. Are there any benchmarks that you would recommend to look at, regarding FreeBSD vs Linux networking performance?

For anyone interested, here are benchmarks from late 2018, comparing Fedora and FreeBSD performance: https://matteocroce.medium.com/linux-and-freebsd-networking-...

Why does the author put so much effort into testing VMs? Bare metal installations aren't even tried, so the article won't represent more typical setup (unless you want to run in cloud, in which case it would make sense to test in cloud).

If given the choice I'd never run anything on bare metal again. Let's say we have some service we want to run on a bare metal server. For not very much more hardware money, amortized, I can set up two or three VMs to do the same server, duplicated. Then if any subset of that metal goes bad, a replica/duplicate is already ready to go. There's no network overhead, etc.

I've been doing this for stuff like SMTPE authority servers and ntpd and things that absolutely cannot go down, for over a decade.

That doesn't really matter for benchmarking purposes. I think the parent comment was emphasizing that syscall benchmarks don't make sense when you're running through a hypervisor, since you're running tacitly different instructions than would be run on a bare-metal or provisioned server.

Because the cards were in PCI passthrough, so the performance was exactly the same as that of a physical system

The author mentions as much, saying that there was some indirectness, at least in the interrupts.

There are also VirtIO drivers involved, and according to the article, they had effect too.

It’s probably worth noting that there have been huge scalability improvements - including introduction of epochs (+/- RCU) - in FreeBSD over the last few years, for both networking and VFS.

Nginx and OpenSSL are open source. Give it a try and reproduce their results with Linux ;-).

IMO the question was reasonable, whereas the answers like yours have always sounded to me like "fuck you."

It was done and tested more than once. As I recall, it took quite a bit to get Linux to perform to the level that BSD was performing for (a) this use case and (b) the years of investment Netflix had already put into the BSD systems.

So, could Linux be tweaked and made as performant for _this_ use case? I expect so. The question to be answered is _why_.

sendfile + kTLS. I'm unaware of the in-kernel TLS implementation for Linux. Is there any around?

Yes, Linux has kTLS. When I tried to use it, it was horribly broken, so my fear is that it's not well used/tested, but it exists. Mellanox, for example, developed their inline hardware kTLS offload on Linux.

> When I tried to use it, it was horribly broken, so my fear is that it's not well used/tested, but it exists.

Do you have any additional references around this? I'm aware that most rarely used functionality is often broken and therefore usually don't recommend people to use it, but would like to learn about kTLS in particular. I think for Linux OpenSSL 3 now added support for it in userspace. But there's also the kernel components as well as drivers - all of them could have their set of issues.

I recall that simple transmits from offset 0..N in a file worked. But range requests of the form N..N+2MB led to corrupt data. It's been 2+ years, and I heard it was later fixed in Linux.

I've used sendfile + kTLS on Linux for a similar use case. It worked fine from the start, was broken in two (?) kernel releases for some use cases, and now works fine again from what I can tell. This is software kTLS, though; haven't tried hardware (not the least because it easily saturates 40 Gbit/sec, and I just don't have that level of traffic).

I once recommended a switch and router upgrade to allow for more, new WAPs for an office that was increasingly becoming dependent on laptops and video conferencing. I went with brand new kit, like just released earlier in the year because I'd heard good things about the traffic shaping, etc.

Well, the printers wouldn't pair with the new APs, certain laptops with fruit logos would intermittently drop connection, and so on.

I probably will never use that brand again, even though they escalated and promised patches quickly - within 6 hours they had found the issue and were working on fixing it, but the damage to my reputation was already done.

Since then I've always demanded to be able to test any new idea/kit/service for at least a week or two just to see if I can break it.

Interesting. I had imagined the range handling is purely handled by the reading side of things, and wouldn't care how the sink is implemented (kTLS, TLS, a pipe, etc). So I assumed the offset should be invisible for kTLS, which just sees a stream of data as usual.
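For readers who haven't used the call, here is a minimal sketch of what "sending a range with sendfile" looks like from userspace. It uses Python's `os.sendfile` wrapper over a plain local socket pair, not kTLS, so it does not reproduce the bug discussed above; the file contents, offsets and the `send_range` helper are made up for illustration.

```python
import os
import socket
import tempfile


def send_range(sock, path, start, length):
    """Send bytes [start, start + length) of a file over sock via sendfile(2)."""
    sent_total = 0
    with open(path, "rb") as f:
        while sent_total < length:
            # sendfile may transfer fewer bytes than requested, so loop.
            sent = os.sendfile(sock.fileno(), f.fileno(),
                               start + sent_total, length - sent_total)
            if sent == 0:  # reached end of file
                break
            sent_total += sent
    return sent_total


# Demo: a 64 KiB file; send the 4 KiB range starting at offset 8192.
payload = bytes(i % 251 for i in range(64 * 1024))
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

left, right = socket.socketpair()
sent = send_range(left, demo_path, 8192, 4096)

received = b""
while len(received) < sent:
    received += right.recv(sent - len(received))

left.close()
right.close()
os.unlink(demo_path)

range_ok = received == payload[8192:8192 + 4096]
print(sent, range_ok)
```

The offset is passed to the kernel along with the file descriptor, so the sending side does see it; with kTLS the kernel additionally has to encrypt whatever range it was handed.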

1. Could GPU acceleration help at all?

2. When serving video, do you use floating point operations at all? Could this workload run on a hypothetical CPU with no floating point units?

3. How many of these hardware platforms do you guys own? 10k? 100k?

1) No. Well, potentially as a crypto accelerator, but QAT and Chelsio T6 are less power hungry and more available. GPUs are so expensive/unavailable now that leveraging them in creative ways makes less sense than just using a NIC like the CX6-Dx, which has crypto as a low cost feature.

2) These are just static files, all encoding is done before it hits the CDN.

I wonder if one creative (and probably stupid) way to leverage GPUs might be just as an additional RAM buffer to get more RAM bandwidth.

Rather than DMA from storage to system RAM, and then from system RAM to the NIC, you could conceivably DMA to GPU RAM and then to the NIC for a subset of sends. Not all of the sends, cause of PCIe bandwidth limits. OTOH, DDR5 is coming soon and is supposed to bring double the bandwidth and double the fun.

The videos are precomputed. So no GPU required to stream

How exactly 'Constrained to use 1 IP address per host' helps eliminate cross-NUMA transfers?

If we could use more than 1 IP, then we could treat 1 400Gb box as 4 100Gb boxes. That could lead to the "perfect case" every time, since connections would always stay local to the numa node where content is present.

I wonder if you could steer clients away from connections where the NIC and storage nodes are mismatched.

Something like close the connection after N requests/N minutes if the nodes are mismatched, but leave it open indefinitely if they match.

There's of course a lot of ways for that to not be very helpful. You'd still have only a 25% chance of getting a port number that hashes to the right node the next time (assuming tcp port number is involved at all, if it's just src and dest ips then client connections from the same IP would always hash the same, and that's probably a big portion of your clients), and if establishing connections is expensive enough (or clients aren't good at it) then that's a negative. Also if a stream's files don't tend to stay on the same node, then churning connections to get to the right node doesn't help if the next segment is on a different node. I'm sure there are other scenarios too.
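A toy model of that 1-in-4 figure, assuming the router hashes the full 4-tuple uniformly across four links. The real hash is vendor-specific and chosen by the router; SHA-256, the documentation-range IP addresses and the `pick_link` helper are stand-ins for illustration only.

```python
import hashlib

NUM_LINKS = 4  # four 100Gb ports, one per NUMA node in this toy model


def pick_link(src_ip, src_port, dst_ip, dst_port):
    """Hypothetical stand-in for the router's N-tuple hash."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_LINKS


wanted_link = 2  # link attached to the node that holds the content (arbitrary)
ports = range(1024, 65536)

# Fraction of client source ports that would land on the wanted link.
hits = sum(
    pick_link("198.51.100.7", port, "203.0.113.10", 443) == wanted_link
    for port in ports
)
fraction = hits / len(ports)
print(f"{fraction:.3f}")

# The same 4-tuple always maps to the same link, so only a new connection
# (new source port) gets another roll of the dice.
first = pick_link("198.51.100.7", 50000, "203.0.113.10", 443)
second = pick_link("198.51.100.7", 50000, "203.0.113.10", 443)
stable = first == second
```

If the hash only covered source and destination IPs, every reconnect from the same client would land on the same link, which is the caveat in the parenthesis above.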

I know some other CDN appliance setups do use multiple IPs, so you probably could get more, but it would add administrative stress.

Is there a reason you can’t?

In a past life we broke LAGs up to use different subnets per port to prevent traffic crossing the NUMA bridge.

I’m sure there are good reasons you didn’t take this approach, be interesting to hear them.

There is no kTLS for IPv6? IPv6 space is abundant and most mobiles in the USA/Canada have IPv6. Won't that solve the problem?

IPv4 and IPv6 can both use kTLS. We offer service via V6, but most clients connect via IPv4. It differs by region, and even time of day, but IPv4 is still the vast majority of traffic.

I've had to blackhole Netflix IPv6 ranges on my router because Netflix would identify my IPv6 connection as being a "VPN" even though it's not.

If you could email me your ranges, I can try to look into it internally. Use my last name at gmail.com (or at freebsd.org)

I'm not GP; Hurricane Electric ipv6 always had Netflix think I was on a VPN, but now I have real ipv6 through the same ISP, I just pay more money so Netflix doesn't complain anymore.

Are you referring to HE's tunnel broker service?

If so, then yeah, that's a VPN.

> Mellanox ConnectX-6 Dx - Support for NIC kTLS offload

Wild, didn't know nVidia was side-eyeing such far-apart but still parallel channels for their ?GPUs?.

Was this all achievable using nVidia's APIs out-of-the-box, or did the firmware/driver require some in-house engineering :)

Mellanox was bought by nVidia 2 years ago, so while it's technically accurate to say it's an nVidia card, that elides their history. Mellanox has been selling networking cards to the supercomputing market since 1999. Netflix absolutely had to do some tuning of various counters/queues/other settings in order to optimize for their workload and get the level of performance they're reporting here, but Mellanox sells NICs with firmware/drivers that work out-of-the-box.

The architecture slides don't show any in-memory read caching of data? I guess there is at least some, but would it be at the disk side or the NIC side? I guess sendfile without direct IO would read from a cache.

Caching is left off for simplicity.

We keep track of popular titles, and try to cache them in RAM, using the normal page cache LRU mechanism. Other titles are marked with SF_NOCACHE and are discarded from RAM ASAP.

How much data ends up being served from RAM? I had the impression that it was negligible and that the page cache was mostly used for file metadata and infrequently accessed data.

It depends. Normally about 10-ish percent. I've seen well over that in the past for super popular titles on their release date.

in which node would that page cache be allocated? In the one where the disk is attached, or where the data is used? Or is this more or less undefined or up to the OS?

This is gone over in the talk. We allocate the page locally to where the data is used. The idea is that we'd prefer the NVME drive to eat any latency for the NUMA bus transfer, and not have the CPU (SW TLS) or NIC (inline HW TLS) stall waiting for a transfer.

This may be a naive question, but data is sent at 400Gb/s to the NIC, right? If so, is it fair to assume that data is actually sent/received at a similar rate?

I ask since I was curious why you guys opted not to bypass sendfile(2). I suppose it wouldn't matter in the event that the client is some viewer, as opposed to another internal machine.

We actually try really, really hard not to blast 400Gb/s at a single client. The 400Gb/s is in aggregate.

Our transport team is working on packet pacing, or really packet spreading, so that any bursts we send are small enough to avoid being dropped by the client, or an intermediary (cable modem, router, etc).
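The idea of spreading packets out instead of bursting can be sketched as simple arithmetic. This is only an illustration of evenly spaced send times at a target rate, not the kernel pacing implementation the comment refers to; the function name, packet size and rate are assumptions.

```python
def paced_send_times(num_packets, packet_bytes, rate_bps):
    """Return send times (in seconds) that space packets evenly at rate_bps."""
    gap = packet_bytes * 8 / rate_bps  # seconds between consecutive packets
    return [i * gap for i in range(num_packets)]


# 10 packets of 1500 bytes paced at 12 Mb/s: one packet per millisecond,
# instead of all 10 leaving back to back at line rate.
times = paced_send_times(10, 1500, 12_000_000)
gap_ms = (times[1] - times[0]) * 1000
print(round(gap_ms, 3))
```

The average rate is the same either way; pacing only changes how large the instantaneous burst is that a cable modem or home router has to absorb.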

Have you done any work to see whether the NIC hardware packet pacing mechanisms could improve QoE by reducing bursts?

Well it's not 1 client. It's thousands of viewers and streams. An individual stream will have whatever the maximum 4k bandwidth for Netflix is.

What architectures are you guys running FreeBSD on?

Would these techniques be applicable on arm64 and/or riscv64?

Slide 5. What is the difference of "mem bw" and "networking units"?

Networking tends to use bits per second instead of bytes per second, so in order to more easily compare the memory bandwidth to the rest of the values used in the presentation, the presenter multiplied the B/s values by 8 to get the corresponding b/s values.

Oh.. Networking unit is using "bit" instead of "byte".

There is no video for this available yet AFAIK, but for those interested there is a 2019 EuroBSDcon presentation online from the same speaker that focuses on the TLS offloading: https://www.youtube.com/watch?v=p9fbofDUUr4

The slides here look great. I'm looking forward to watching the recording.

Drew also gave this talk about NUMA optimization that I found enjoyable (from the same conf): https://www.youtube.com/watch?v=8NSzkYSX5nY

If this is one server, I can't imagine how much bandwidth Netflix pushes out of a single PoP - does anyone have those numbers, or at least estimates?

It may disappoint but the servers are big because it's cheaper than buying racks of servers not because it was the only way to get enough servers into a PoP. Their public IX table should give an idea that most are just a couple https://openconnect.netflix.com/en/peering/ (note they have bulk storage servers too for the less popular content not just 1 big box serving every file).

I've seen their boxes walking through Equinix facilities in Dallas and Chicago, and it is a bit jarring how small some of the largest infrastructure can be.

It's worth noting that they have many smaller boxes at ISP colo locations as well not just the big regional DCs like Equinix.

Seeing Hulu at equinix, as well as PlayStation Network, in large cages, and then our two small cages was rather eye opening. Some people have a lot of money to throw at a problem, others have smart money to throw at smart engineers.

Someone that worked at one of the former sort of organizations once described it to me as having a 'money hammer'. The money hammer is the solution to all problems.

Sometimes, for some organizations, spending 0.1 or 5 or 10 million dollars to solve a problem for now, right now is the smart and prudent choice.

I mean, there's gotta be some question about costs of hardware (plus colocation costs, etc.) versus the costs of engineers to optimise the stack.

I don't doubt there's a point at which it's cheaper to focus on reducing hardware and colocation costs, but for the vast majority engineers are the expensive thing.

This is only due to the fact that there is a good chance an engineer on github is already working on your problems and you just wait to integrate his work.

Companies that think engineers are expensive will continue to buy tons of hardware and scale rather badly. If you are not actively pushing the ceiling you gonna fall out. You should work on problems cause, who knows, it seems like there might be some value.

They colocate these servers inside ISPs typically. Would be interesting to know, for the traffic that does go to a Netflix primary pop - how much the hottest one consumes.

It feels like a luxury that they can fit their entire content library in all encoded formats into a single server (even though it's a massive server).

It’s not the entire content library. It’s a cache of the most watched at that time.

I saw a photo somewhere of a Netflix POP that showed dozens of servers and a huge chassis router with hundreds of 100G links, so that's terabits.

Capacity isn't indicative of bandwidth peak - but I'm sure peaks are well past the terabit measure if each server is being designed for 100-400Gbps.

There was some crazy number like 80% of internet traffic is Netflix. I have no idea of the validity of that, but without seeing actual numbers, that sounds like a lot.

I think it's funny that these servers use FreeBSD but if you zoom in on the CPU graphic in the slides it is reading a book called ‘Programming Linux’

Wow, I never zoomed in on that. I've been using that as clip art in my slides for years now. You should work for CSI

this guy enhances

That is funny. The title is pretty damn small. Here's what appears to be the source SVG: https://openclipart.org/download/28107/klaasvangend-processo...

And the text is in a <tspan>, so it could be appropriately re-titled if desired :)

> so it could be appropriately re-titled if desired

...to something like "How to not be too curious for your own good"?

Good catch.

Hilarious to see all the attempts at “but why not Linux”.

I wonder how many will read this and consider trying out FreeBSD. It’s a rather dope OS but I am a bit biased.

For those with know-how, where can I get a technical paper which describes this NUMA architecture? A server with multiple NICs with a preferred core association is of interest. Presumably this needs Linux support too.


1. Is there any likelihood Netflix's needs will migrate to ARM in the next few years? (I see your write-up at the end of the deck, curious if you’re seeing more advancements with ARM than x86 and as such, project ARM to surpass x86 for your needs in the foreseeable future)

2. Can you comment more on the 800 Gbps reference at the end of the deck?

We're open to migrating to the best technology for the use case. Arm beat Intel to Gen4 PCIe, which is one of the reasons we tried them. Our stack runs just fine on arm64, thanks to the core FreeBSD support for arm64

The 800Gb is just a science experiment.. a dual-socket milan with a relatively equal number of PCIe lanes from each socket. It should have been delivered ages ago, but supply chain shortages impacted the delivery dates of a one-off prototype like this.

Why was it being shipped to VA? I thought Netflix does all the dev in Los Gatos, CA.

We do all of our testing on production traffic, and we have servers in IX'es around the world.

For those wondering about NUMA (Non Uniform Memory Architecture) here is an explanation: https://m.youtube.com/watch?v=KmtzQCSh6xk

I hate and respect you at the same time.

Cool but how much does one of those servers cost... 120k?

Pretty crazy stuff though, would love to see something similar from cloudflare eng. since those workloads are extremely broad vs serving static bits from disk.

I think the described machines were designed for a very specific workload, even in the CDN world. As I guess, they serve exclusively fragments of on-demand videos, no dynamic content. So high bytes to number of requests ratio, nearly 100% hit ratio, very long cache TTL (probably without revalidation, almost no purge requests), small / easy to interpret requests, very effective HTTP keep alive, relatively small number of new TLS sessions (TLS handshake is not accelerated in hardware), low number of objects due to their size (much fewer stat()/open() calls that would block the nginx process). Not to mention the absence of other CDN functionality like workers, WAF, page rules, rewrites, custom cache keys, and no or very little logging of requests etc. That really simplifies things a lot compared to Cloudflare or Akamai.


For a EPYC 7502P 32-core / 256GB RAM / 2x Mellanox Connect-X 6 dual-nic, I'm seeing $10,000.

Then comes the 18x SSD drives, lol, plus the cards that can actually hold all of that. So the bulk of the price is SSD+associated hardware (HBA??). The CPU/RAM/Interface is actually really cheap, based on just some price-shopping I did.

The WD SN720 is an NVMe SSD so there are no HBAs involved it just plugs into the (many) PCIe lanes the Epyc CPU provides. CDW gives an instant anonymous buyer web price of ~$11,000.00 for all 18.

Surprisingly less it seems? Those NICs are only like a kilobuck each, the drives are like 0.5k, CPU is like 3k-5k. So maybe 15k-20k all in, with the flash comprising about half that?

Seems surprisingly cheap, but I’m not sure if that’s just great cost engineering on Netflix's part or a poor prior on my part … I’ll choose to blame Nvidia’s pricing in other domains for biasing me up

That’s in the right ballpark, I believe.

Will the presentation be available online anywhere? This topic is very interesting to me as I maintain a Plex server for my family and have always been curious how Netflix does streaming at scale.

There is a EuroBSDCon youtube channel. The last time I presented, it was a few weeks before the recordings were processed and posted to the channel. I'm not sure how soon they will be available this year.

I have to say that a lot of this may be overkill for plex. Do you know if plex even uses sendfile?

It's 100% overkill, haha! I was just asking because streaming architecture really interests me. I have two nodes (one is TrueNAS with ~50tb of storage), and the other is a compute machine (all SSDs) which consumes the files on the NAS and delivers the content. My biggest bottleneck right now is that my internal network isn't 10Gbps, so I have to throttle some services so that users don't experience any buffering issues.

TrueNAS also introduced me to ZFS and I have been amazed by it so far! I haven't dug too deep into FreeBSD yet, but that's next on my list.

One thing to remember with ZFS is that it does not use the page cache; it uses ARC. This means that sendfile is not zero-copy with ZFS, and I think that async sendfile may not work at all. We use ZFS only for boot/root/logs and UFS for content.

Hmm. While this represents a write-once-read-extremely-many (someone here mentioned how there's no FS cache (?) (https://news.ycombinator.com/item?id=28588682)) type of situation, I'm curious if bulk updates have brought any FS-specific Interestingness™ out of the woodwork in UFS?

(I'm also curious where the files are sourced from - EC2 would be horrendously expensiveish, but maybe you send to one CDN and then have it propagate the data further outwards (I'm guessing through a mesh topology).)

Very interesting, I had no idea. Would adding something like an intel optane help performance for large transfers with a lot of I/O? I manage my services with a simple docker-compose file and establish the data link to my NAS via NFS, which I’m sure is adding another layer of complexity (I believe NFSv4 is asynchronous?).

Short answer: no

There's a lot of blogs that go over this in detail for our usecase at home. You can do it but you will not see much improvement, if any at all.

From the tail end of the presentation I'd expect UFS2 isn't a potential bottleneck (I'd naively expect it to be good at huge files, and the kernel to be good at sendfile(2).) Is that your opinion as well, or are there efficiency gains hiding inside the filesystem?

At some point the presentations will come up here https://www.youtube.com/c/EuroBSDcon

How large is your family?

Just the closest /16 and a few friends.

Haha, not large at all! I've just built a couple of servers and their main responsibility is content storage and then content serving (transcoding). My main bottleneck right now is that I don't have 10Gbps switches (both servers have dual 10Gbps intel NICs ready to go). I have to do some throttling so that users don't experience any buffering issues.

Have you tried using a crossover cable and static IPs to link the two servers directly?

FYI if using modern hardware then you might not need a specific cable to try this thanks to Auto MDI-X: https://en.wikipedia.org/wiki/Medium-dependent_interface#Aut...

What’s amazing to me is how much data they’re pumping out — their bandwidth bills must be insane. Or they did some pretty amazing negotiating contracts considering Netflix is like what, $10-15/mo? And there are many who “binge” many shows likely consuming gigabytes per day (in addition to all the other costs, not least of which is actually making the programming)?

Netflix has many peering relationships with isps


The actual bandwidth is significant though, even compared to something like YouTube


there's a graph somewhere showing the bitrate for netflix before and after they paid out extortion fees to comcast.

Fast.com was originally a tool to measure if your line was throttling Netflix traffic.[0]

[0] https://qz.com/688033/netflix-launched-this-handy-speed-test...

Extortion fees? Netflix wanted Comcast to provide them with free bandwidth. While some ISPs may appreciate that arrangement as it offloads the traffic from their transit links, Comcast is under no obligation to do so.

You could argue then that Comcast wasn't upgrading their saturated transit links, which they weren't with Level 3, but to assume that every single ISP should provide free pipes to companies is absurd.

Comcast were purposefully keeping those links under capacity so they could double dip - get money from both their customers and Netflix.

So no business should ever have to pay for bandwidth because customers of said business are already paying? Or should I get free internet because businesses are already paying ISPs for their network access?

That sounds reasonable? If I'm an end-user, what I'm paying my ISP for is to get me data from Netflix (or youtube or wikipedia or ...); that is the whole point of having an ISP. If that means they need to run extra links to Netflix, then tough; that's what I'm paying them for.

You're paying them for access to the ISPs network. For an internet connection, this also comes with access to other networks through your ISPs peering and transit connections.

If you know of any ISP that would give me 100+ Gbps connectivity for my business, please let me know as I'd love to eliminate my bandwidth costs.

> You're paying them for access to the ISPs network.

I thought we moved past that model after AOL

Businesses have to pay for bandwidth to get data to the customer's ISP, but they generally don't pay the customer's ISP for the same bandwidth the customer has already paid for.

Netflix did not have to pay Comcast either. One of their ISPs Level 3 already had peering arrangements with Comcast to deliver the content. Instead of paying their own ISP, Netflix wanted to get free bandwidth from Comcast. There's a difference.

Consider it from a customer's point of view.

Comcast threatened to throttle customers' bandwidth, refusing to deliver the speeds they had promised. The data was available to Comcast, customers had paid for the service of delivering that data, but Comcast wouldn't provide the full service they had sold unless Netflix paid them more.

The deeper issue is that Comcast is lying to their customers, promising them more bandwidth than they are able to deliver, so when Comcast's customers wanted to use all the bandwidth they bought to watch Netflix, Comcast couldn't afford to honor their promises.

But Comcast has a monopoly in the markets they serve, while Netflix exists in a competitive market, so Comcast got away with it.

No ISP can guarantee speeds outside of their network. Once it leaves their network it's considered best effort. Comcast has about 30 million subscribers. If they were to guarantee bandwidth out of their network, and if every subscriber had a laughably slow 10Mbps connection, Comcast would need 300Tbps of connectivity to every single company and network. For this reason, every ISP in the world "throttles" in one way or another.
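The 300Tbps figure follows directly from the two numbers given (30 million subscribers at 10Mbps each); a quick check, with variable names chosen for illustration:

```python
subscribers = 30_000_000
per_subscriber_bps = 10 * 10**6  # 10 Mbps each

# Aggregate capacity needed if every subscriber used their full rate at once.
total_bps = subscribers * per_subscriber_bps
total_tbps = total_bps / 10**12
print(total_tbps)  # 300.0
```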

True but not relevant to this case. Netflix was willing to provide the high speed connectivity; the throttling was just intimidation.

Customer <> ISP <> ISP <> Customer

Why does one end of this example have to pay both ISPs in your view?

They don't, and I never suggested they did. Netflix had their own ISPs. Netflix wanted the Customer's ISP to give them free bandwidth so they could offload the traffic from their other ISPs.

If Netflix pays for the bandwidth, what do Comcast's customers pay for?

Customer service!

"I have people skills; I am good at dealing with people. Can't you understand that? What the hell is wrong with you people?"

I don't normally upvote jokes unless they're funny but also make a legitimate point. This was a good one :-D

The bandwidth paid for by their customers you mean?

Do you work for comcast or some other isp? it seems like you're biased in favor of the pipe companies here mate...

One of the funny truths about the business of networks, is that the one who "controls" the most bandwidth has the #1 negotiating spot at the table.

Take for example, if say Verizon decided to charge more for bandwidth to Netflix... if Netflix said "no" and went with another provider, then Verizon's customers would suffer from worse access times to Netflix.

Verizon has the advantage in that they have such a huge customer base that no one wants to piss off Verizon. So it cuts both ways. Bandwidth becomes not a cost at this scale, but instead a moat.

My hypothesis is that if Netflix/Youtube hadn't stressed out our bandwidth and forced the ISPs of the world to upgrade for the last decade, the world wouldn't have been ready for WFH of the covid world.

ISPs would have been more than happy to show the middle finger to the WFH engineers, but not to the binge-watching masses.

> My hypothesis is that if Netflix/Youtube hadn't stressed out our bandwidth and forced the ISPs of the world to upgrade for the last decade, the world wouldn't have been ready for WFH of the covid world.

Couldn't agree more.

We see the opposite when it comes to broadband monopolies: "barely good enough" DSL infrastructure, congested HFC, and adversarial relationships w.r.t. subscriber privacy and experience.

When it became worthwhile to invest in not just peering but also last-mile because poor Netflix/YouTube/Disney+/etc performance was a reason for users to churn away, they invested.

This isn't to say that this is all "perfect" for consumers either, but this tension has only been good for consumers vs. what we had in the 90's and early-mid 00's.

The US is still not ready. The vast majority of people have access to only over subscribed coaxial cable internet, with non existent upload allocation.

Anecdotally, I'm always very impressed with the connectivity I see in the US versus my home connection in the UK. I'm fairly certain that the last mile from my suburban home to the cabinet is made of corroded aluminium.

UK is certainly worse. Even in a well populated suburb 10min from Birmingham, I know a family that can only get ADSL or some other service like that at 2Mbps.

I have been spoiled with symmetric 1Gbps up and down fiber for a few years, and using the internet is like turning on the electric or gas or water, you do not ever have to think about it.

I live in central London (zone 2) and the fastest wired broadband service available at my house is slower than 20Mbps. (I'm playing with LTE now, but so far it's been a bit of a mixed bag.)

The binge-watching masses are easy to satisfy. All it takes for the stream to work is average speed of a few megabits per second, but there's so much caching at client end that high latency and few seconds of total blackout every now and then don't really matter.

Pandemics are temporary. Binge watching will never go away.

Not sure if you realize which ISP you picked, but Verizon and Netflix actually had peering disputes in 2014 which gave me quite the headache at my then-current employer.

Of course they are not paying $120/TB like AWS public pricing

I heard they are paying something between $0.20 (eu/us) to $10 (exotic) per TB based on the region of the world where the traffic is coming from

> I heard they are paying something between $0.20 (eu/us) to $10 (exotic) per TB based on the region of the world where the traffic is coming from

They're likely paying even less. $0.20/TB ($0.0002/GB) is aggressive but at their scale, connectivity and per-machine throughput, it's lower still.

A few points to take home:

- They [Netflix, YT, etc] model cost by Mbps - that is, the cost to deliver traffic at a given peak. You have to provision for peaks, or take a reliability/quality of experience hit, and for on-demand video your peaks are usually 2x your average.

- This can effectively be "converted" into a $/TB rate but that's an abstraction, and not a productive way to model. Serving (e.g.) 1000PB (1EB) into a geography at a peak of 3Tbps per day is much cheaper than serving it at a peak of 15Tbps.

- Netflix, more so than most others, benefits from having a "fixed" corpus at any given moment. Their library is huge, but (unlike YouTube) users aren't uploading content, they aren't doing live streaming or sports events, etc - and thus can intelligently place content to reduce the need to cache fill their appliances. Cheaper to cache fill if you can trickle most of it during the troughs as you don't need as big a backbone, peering links, etc. to do so.

- This means that Netflix (rightfully!) puts a lot of effort into per-machine throughput, because they want to get as much user-facing throughput as possible from the given (space, power, cost) of a single box. That density is also attractive to ISPs, as it means that every "1RU of space" they give Netflix has a better ROI in terms of network cost reductions vs. others, esp. when combined with the fact that "Netflix works great" is an attractive selling point for users.

Sorry for the naive question, but to offer those prices, there are two options: A) Amazon is losing money to keep Netflix as their client, or B) they are making a profit even at $0.20/TB, which means that of the $120/TB, at least $119.80 is profit. Wow.

Netflix has appliances installed at exchange points that cache most of Netflix. Each appliance peers locally and serves the streams locally.

The inbound data stream to fill the cache of the appliance is rate limited and time limited - see https://openconnect.zendesk.com/hc/en-us/articles/3600356180... The actual inbound data to the appliance will be higher than the fill because not everything is cached.

The outbound stream from the appliance serves consumers. In New Zealand for example, Netflix has 40Gbps of connectivity to a peering exchange in Auckland. https://www.peeringdb.com/ix/97

So although total Netflix bandwidth to consumers is massive, it has little in common with the bandwidth you pay for at Amazon.

Disclaimer: I am not a network engineer.

The $0.20/TB cost is not for their agreements with Amazon; the bulk of their video traffic goes over direct peering with ISPs. AWS largely serves their APIs and orchestration infrastructure.

Amazon and most cloud providers do overcharge for b/w. You can buy an OVH/Hetzner type box with guaranteed unmetered 1 Gbps public bandwidth for ~$120/month easily, which if fully utilized is equivalent to ~325TB/month, or roughly $0.37/TB, completely ignoring the 8/16 core bare metal server and attached storage you also get. These are SMB/self-service prices; you can get much better deals with basic negotiating and getting into a contract with a DC.
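A quick check of the monthly volume figure for a fully utilized 1 Gbps link, assuming a 30-day month and decimal terabytes (both are assumptions, not from the comment):

```python
link_bps = 10**9                     # 1 Gbps, fully utilized
seconds_per_month = 30 * 24 * 3600   # 30-day month

bytes_per_month = link_bps / 8 * seconds_per_month
tb_per_month = bytes_per_month / 10**12
print(tb_per_month)  # 324.0
```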

One thing to remember though: not all bandwidth is equal. CSPs like AWS provide a lot of features such as very elastic scale-up on demand, a lot of protection up to L4, and advanced SDN under the hood to make sure your VMs can leverage the b/w; that is computationally expensive and costly.

AWS {in,e}gress pricing strategy is not motivated by their cost of provisioning that service. Cloudflare had a good (if self-motivated) analysis of their cost structure discussed on here a while ago


Video data does not stream over AWS. AWS is used for everything else though.

AWS public pricing is comically high. No one even prices in that manner.

Hence, to a large extent, the devices described in the presentation. They’re (co)located with ISP hardware, so the bulk data can be transferred directly to the user at minimal / zero marginal cost

Enjoying the slides! Is there a small typo on slide 17?

>Strategy: Keep as much of our 200GB/sec of bulk data off the NUMA fabric [as] possible

No, 400Gb/s == 50 GB/s

With software kTLS, the data is moved in and out of memory 4x, so that's 200GB/s of bandwidth needed to serve 50GB/s (400Gb/s) of data.
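The arithmetic behind those two numbers, written out (the factor of 4 is the number of memory passes stated above; variable names are illustrative):

```python
network_gbits_per_s = 400
payload_gbytes_per_s = network_gbits_per_s / 8   # bits -> bytes

memory_passes = 4  # times each byte crosses the memory bus with software kTLS
memory_gbytes_per_s = payload_gbytes_per_s * memory_passes

print(payload_gbytes_per_s, memory_gbytes_per_s)  # 50.0 200.0
```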

How do you think this will need to be re-architected for HTTP3? Will you do kernel bypass with a userspace stack?

We've looked at http3 / quic, and have not yet seen any advantages for our workload. Just a large loss of CPU efficiency.

> Run kTLS workers, RACK / BBR TCP pacers with domain affinity

TCP BBR is in the FreeBSD upstream now? Cool.

As of 13.0

Out of curiosity, why TLS encrypt when it's already DRM? To prevent snooping on viewer habits?


Maybe they should add "But I already have DRM!" to the list. DRM solves a completely different problem.

More generically, “But my content is already encrypted!” I’m surprised it isn’t already there.

Dumb question, but what, specifically, does "fabric" mean in this context?

Fabric = data transmission infrastructure between CPUs or cores. This includes the wires which carry the data signals, as well as routing logic.

At 400Gb/s, how long would it take to read through the entire catalogue?

I'd love to know how many users could be served off one of these boxes

You can do reasonable 1080p at 3 Mbit/s with H.264; if these were live streams, you could fit around 130k sessions on the machine. But this workload is prerecorded bulk data, so I imagine the traffic is considerably spikier than in the live-streaming case, perhaps reducing the max sessions (without performance loss) by 2-3x.
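A quick sketch of that estimate, for anyone who wants to plug in their own numbers. The 3 Mbit/s bitrate and the 2-3x burstiness factor are the guesses from the comment above, not measured values, and the function name is made up for illustration.

```python
LINK_BPS = 400_000_000_000   # 400 Gb/s
STREAM_BPS = 3_000_000       # assumed 1080p H.264 bitrate from the comment


def max_sessions(link_bps: int, stream_bps: int, burst_factor: int = 1) -> int:
    """Sessions that fit if each one needs stream_bps * burst_factor of headroom."""
    return link_bps // (stream_bps * burst_factor)


steady = max_sessions(LINK_BPS, STREAM_BPS)
bursty_2x = max_sessions(LINK_BPS, STREAM_BPS, burst_factor=2)
bursty_3x = max_sessions(LINK_BPS, STREAM_BPS, burst_factor=3)

print(f"steady-rate upper bound: {steady}")
print(f"with 2-3x burst headroom: {bursty_3x} to {bursty_2x}")
```

This ignores protocol overhead and mixed bitrates, so treat it as an upper bound rather than a capacity plan.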

Slightly unrelated, but did they ever consider P2P? It scales really well and saves terabytes of bandwidth, so I wonder what the cons are, except that it's associated with illegal distribution.

P2P doesn't offer as many experience consistency guarantees and in general adds a lot of complexity (=cost) that doesn't exist in a client server model. Even if you went full bore on P2P you still have to maintain a significant portion of the centralized infrastructure both to seed new content as well as for clients that aren't good P2P candidates (poor connections, limited asymmetric connections, heavily limited data connections). Once you got through all of those technical issues even if you found you could overall save cost by reducing the core infrastructure... it rides on the premise customers are going to be fine with their devices using their upload (both active bandwidth and total data) at the same price as when it was a hosted service.

But yes I imagine licensing is definitely an issue too. Right now only certain shows can be saved offline to a device and only for very restricted periods of time for the same reason. It's also worth noting in many cases Netflix doesn't pay for bandwidth.

Thanks for the detailed answer!

It's already there. The piratebay already serves me movies at 200 MBps with almost zero infrastructure cost. It's probably more a licensing issue like you said.

It's "not already there" as what torrents do is not the same as what Netflix does even though both are related to media delivery. Try seeking on a torrent which only caches the local blocks and getting the same seek delay, try getting instant start and seek when 1 million people are watching a launch, try getting people to pay you the same amount to use their upload, try getting the same battery life on a torrent as a single HTTPS stream using burst intervals. As I said P2P doesn't offer as many experience consistency guarantees and has many other problems. Licensing is just one of many issues.

Of course, for free, many people are willing to live with slow starting and seeking, worse battery life, having to allocate the storage up front, using their upload, and so on. But again, the question is whether it can fit within the cost offset of some of the centralized infrastructure, not what you could do for free. I don't have anything against torrents; quite the opposite, I am a heavy user of BOTH streaming services and torrenting due to DRM restrictions on quality for some of my devices. But it isn't the single-issue problem you are pressing to make it.

For some internal distributed action Netflix has made IPFS into something they can use but not for end delivery https://blog.ipfs.io/2020-02-14-improved-bitswap-for-contain...

Residential connections have much lower upload speeds than download speeds. That can impact web browsing speeds because outgoing packets can saturate the upstream bandwidth and delay TCP handshake packets. This is a problem I've been having constantly with my Comcast internet connection. If the net feels slow, I check whether something is doing a large upload, even though the connection is gigabit. I've tried QoS, managed switches, etc.; none helped the situation. P2P is a no-no from that perspective, in addition to the other valid points, until symmetric up/down speeds become the norm (miss you, Sonic!).

wouldn't p2p mean that the client receiving somehow participates in the sharing too? that wouldn't go well with some internet plan caps

Maybe a dumb question, but does P2P work on native TV apps, Chromecast, etc.? I know it does if you run a client app on Windows or Mac.

If Netflix had made P2P the standard ten years ago then TVs and Chromecasts would support P2P today. But they didn't so they don't.

That's not P2P in quite the same sense the comment above was talking about. Chromecast is client/server; it's just that your device can become the server or instruct the Chromecast to connect to a certain server.

Right, but can that realistically be done?

have you considered dpdk?

- Associate network connections with NUMA nodes

- Allocate local memory to back media files when they are DMA’ed from disk

- Allocate local memory for TLS crypto destination buffers & do SW crypto locally

- Run kTLS workers, RACK / BBR TCP pacers with domain affinity

- Choose local lagg(4) egress port

All of this is upstream!

This sounds like some alien language

Basically, everything is about “pinning” the hardware on a low level.

You pin the network adapters to be handled by a specific set of cores.

You pin the filesystem to handle specific sets of files on specific cores.

You then ensure that the router in the rack distributes the HTTP requests across the network ports so that they always arrive on the network adapters that have those files pinned.

It’s not much different from partitioning / sharding and “smart” load balancing a cluster of servers, it’s just on a lower level of abstraction.
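To make the "pinning" idea a bit more concrete, here is a toy model of keeping a connection's work on one NUMA domain. It is only an illustration of the principle, not how the FreeBSD kernel does it: the port names, the two-domain topology, and both functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    name: str    # hypothetical interface name
    domain: int  # NUMA domain the NIC is attached to


# Hypothetical topology: 2 NUMA domains, 2 ports each, all in one lagg.
LAGG_PORTS = [
    Port("nic0p0", 0), Port("nic0p1", 0),
    Port("nic1p0", 1), Port("nic1p1", 1),
]


def pick_egress_port(ingress: Port, flow_hash: int, ports=LAGG_PORTS) -> Port:
    """Prefer a lagg port on the same NUMA domain the connection arrived on."""
    local = [p for p in ports if p.domain == ingress.domain]
    candidates = local or list(ports)  # fall back to any port if none is local
    return candidates[flow_hash % len(candidates)]


def plan_connection(ingress: Port, flow_hash: int) -> dict:
    """Keep buffers, crypto/pacing workers and egress on the ingress domain."""
    domain = ingress.domain
    return {
        "buffer_domain": domain,   # where file and TLS buffers are allocated
        "worker_domain": domain,   # where kTLS / pacer threads run
        "egress": pick_egress_port(ingress, flow_hash),
    }


plan = plan_connection(LAGG_PORTS[2], flow_hash=7)
print(plan["egress"].name, plan["buffer_domain"], plan["worker_domain"])
```

The point is that everything is keyed off the domain the connection landed on, so the bulk data never needs to cross between CPU sockets.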

here it is without acronyms, though not much clearer:

- Associate network connections with "Non-Uniform Memory Architecture" nodes [(a compute core with faster access to some memory, disks, and network-interface-cards)]

- Allocate local memory to back media files when they are direct-memory-access'ed [(directly copied to memory)] from disk

- Allocate local memory for transport-layer-security cryptography destination buffers & do software cryptography locally

- Run kernel-transport-layer-security workers, "Recent Acknowledgment" / "Bottleneck Bandwidth and Round-Trip-Time" Transmission-Control-Protocol pacers with domain affinity

- Choose local Link-Aggregation egress port

-- All of this is upstream!

Welcome to devops.

This is not devops. This is plain "high performance video delivery".

> plain "high performance video delivery".

Why is that not "devops"? Is it not quite a broad term?

Software development is at the core of devops, not video streaming.

Welcome to system operations (sysop). We've been doing this before devops was even a word.
