I'm not the best at network infrastructure, though I'm more familiar with the NUMA side of things. So I was trying to figure out how you got only one IP address on each box despite having 4 ports across 2 NICs.
I assume some Linux / Windows devops people are just not as familiar with FreeBSD tools like that!
EDIT: Now that I think of it: maybe the interaction of link aggregation across NICs and NUMA could be elaborated upon further, over a few slides? I'm frankly not sure if my personal understanding is correct. I'm imagining how TCP connections are broken into IP packets, how those packets traverse your network, and how they end up at a particular NUMA node... and it seems rather more complex to me than your slides indicate. Maybe this subject will take more than just one slide?
I'm afraid that my presentation didn't really have room to dive much into LACP. I briefly said something like the following when giving that slide:
Basically, the LACP link partner (router) hashes traffic consistently across the multiple links in the LACP bundle, using a hash of its choosing (typically an N-tuple, involving IP address and TCP port). Once it has selected a link for that connection, that connection will always land on that link on ingress (unless the LACP bundle changes in terms of links coming and going). We're free to choose whatever egress NIC we want (it does not need to be the same NIC the connection entered on). The issue is that there is no way for us to tell the router to move the TCP connection from one NIC to another (well, there is in theory, but our routers can't do it)
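Not our routers' actual algorithm, but a toy sketch of the kind of consistent N-tuple hash an LACP partner might apply (the mixing step and names are made up for illustration):

    #include <stdint.h>

    /* The same 4-tuple always yields the same link index, so a TCP connection
     * sticks to one member link for its lifetime (unless the bundle changes). */
    static unsigned int
    lacp_pick_link(uint32_t src_ip, uint32_t dst_ip,
                   uint16_t src_port, uint16_t dst_port, unsigned int nlinks)
    {
        uint32_t h = src_ip ^ dst_ip ^ (((uint32_t)src_port << 16) | dst_port);

        h ^= h >> 16;            /* crude bit mixing */
        return h % nlinks;       /* e.g. nlinks == 4 for a 4-port bundle */
    }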
I hope that helps
Wait, I got lost again... You say you can "output on any egress NIC". So all four egress NICs have access to the TLS encryption keys and are cooperating through the FreeBSD kernel to get this information?
Is there some kind of load-balancing you're doing on the machine? Trying to see which NIC has the least amount of traffic and routing to the least utilized NIC?
There are all sorts of subtle details, though those are probably just "implementation details" of this system. How and where do worker threads spawn? Clearly sendfile / kTLS have great synergies, etc. etc.
It's a lot of detail, and an impressive result for sure. I probably don't have the time to study this on my own, so of course I've got lots and lots of questions. This discussion has been very helpful.
I think some of the "missing picture" is the interaction of sendfile / kTLS. It makes sense, but just studying these slides has solidified a lot for me as well: https://papers.freebsd.org/2019/EuroBSDCon/shwartsman_gallat...
Adding the NUMA things "on top" of sendfile/kTLS is clearly another issue. The hashing of TCP/port information onto particular links is absolutely important, because the physical location of the ports matters.
I think I have the gist at this point. But that's a lot of moving parts here. And the whole NUMA-fabric being the bottleneck just ups the complexity of this "simple" TLS stream going on...
EDIT: I guess some other bottleneck exists for Intel/Ampere's chips? There's no NUMA in those. Very curious.
Rereading the "Disk centric siloing" slides later on actually answers a lot of my questions. I think my mental model was disk-centric siloing and I just didn't realize it. Those slides describe exactly how I thought this "should" have worked, but it seems like that strategy was shown to be inferior to the strategy talked about in the bulk of this presentation.
Hmmm, so my last "criticism" of this excellent presentation: maybe an early slide that lays out the strategies you tried (disk siloing, network siloing, software kTLS, and hardware kTLS offload)?
Just one slide at the beginning saying "I tried many architectures" would remind the audience that many seemingly good solutions exist. Something I personally forgot in this discussion thread.
Are you running anything custom in your Mellanoxes? DPDK stuff?
We like to run each server as an independent entity, so we don't run NVMEof.
They're pretty much rock solid.
If you'd like to discuss more, ping me via email. Use my last name at gmail . com.
What an obnoxious, pedantic and naive response.
"And no, you definitely won't be in "much better place" by "ripping it off" because modern NICs have very complex firmware with hundreds (if not thousands) man/years spent implementing and optimizing it."
What do you think I just explained? I literally write this stuff. Modern NIC firmware is still written in C and still has bugs. Do you seriously think drivers/firmware aren't going to have bugs? You just said yourself that they're high complexity. I can't believe I'm even needing to explain this. You've clearly never been a network engineer; do you have any idea how many bugs are in Juniper, Cisco, Palo, etc.?
If you don't work on bare-metal architecture/distributed systems, move along. This isn't sysadmin talk. Almost nothing used is stock; even k8s gets forked due to bugs. You can't resell PaaS with bugs that conflict with customer billing. NFLX isn't reselling bandwidth, so they likely don't encounter these issues; they're using something like Cedexis to force their CDN providers at the edge to compete with one another down to the lowest cost/reliability, and they're liable for things like this and can be sued/take losses based on their contract. They (CDN customers) are acutely aware of when billing doesn't match up with realized BW usage. They'll drop the traffic to your CDN and split more of it over to a "better" CDN until those issues are mitigated - and while you're not getting that traffic, you're not getting that customer's expected monthly payment... because now they're buying less bandwidth from you, because Cedexis tells them that you're less performant/reliable.
ANY large customer buying bandwidth from a CDN does this; none of this is specific to NFLX, whom I know nothing about beyond "this is how reselling PaaS works."
I bet the next response is "no way NFLX uses a CDN".. LOL. They all do, my friend: HBO, Paramount, Disney, etc. They aren't in the biz of edge caching.
The real risk is that we introduce a huge blast radius if one of these machines goes down.
If we're only worried about throughput, and everything works out with queueing, there's no need for gen4 NVMes because the storage has more bandwidth than the network. That doesn't mean gen4 is only hype; if my math is right, you need gen4 x16 to have enough bandwidth to run dual 100G Ethernet at line rate (dual 100GbE needs roughly 25GB/s, gen3 x16 tops out around 16GB/s, and gen4 x16 gives about 32GB/s), and you could use fewer gen4 storage devices if reducing device count were useful. I think for Netflix, they'd like more storage, so given the storage that fits in their systems, there's no need for gen4 storage; gen4 would probably make sense for their 800Gbps prototype though.
In terms of disk I/O, either in the thread or the slides, drewg123 mentioned only about 10% of requests were served from page cache, leaving 90% served from disk, so that would make the worst case look something like 45GB/sec (switching to bytes because that's how storage throughput is usually measured). From previous discussions and presentations, Netflix doesn't do bulk cache updates during peak times, so they won't have a lot of reads at the same time as a lot of writes.
If what I found is current, Netflix has 1U and 2U appliances, but FB does a 1U switch + groups of 4x 2U servers, so starting at 9U; but I was looking at a 2016 FB pdf that someone uploaded, so they may have changed their deployments since then. 2U vs 9U can make a big difference.
If they're really RAM-bandwidth constrained, then the 24-core 74F3 (which still has all 256MB of L3 cache) or even the 8-core 72F3 may be better.
The more compute clusters, the more PCIe lanes in EPYC, and the SSD lanes go direct to each group of 8 cores.
That proxy would still need to do kTLS to reduce the required memory bandwidth to something the system can manage, and then you're at roughly the same place. The storage nodes would likely still have kTLS capable NICs because those are good 2x100G NICs anyway. It would be easier to NUMA align the load though, so there might be some benefit there. With the right control software, the proxy could pick an origin connection that was NUMA aligned with the client connection on the proxy and the storage on the origin. That's almost definitely not worth doubling the node count for though, even if proxy nodes don't need storage so they're probably significantly less expensive than flash storage nodes.
I'd assume TLS is used in large part for privacy reasons (so ISPs can't sell info on what shows are popular)
However, in my case, and I expect in yours, requirements engineering is the place where you can make the greatest improvements. For example, I can save a few cycles and a few microwatts by sequencing my interrupts optimally or moving some of the algorithm to a look-up table, but if I can, say, establish that an LED indicator flash that might need to be 2x as bright but only lasts for a couple milliseconds every second is as visible as a 500ms LED on/off blink cycle, that's a 100x power savings that I can't hope to reach with micro-optimizations.
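For what it's worth, the rough arithmetic behind that "100x" (assuming "a couple of milliseconds" means about 2 ms, and the 500 ms blink cycle means the LED is lit roughly half of each second):

$$\frac{500\,\text{ms} \times P}{2\,\text{ms} \times 2P} = 125 \approx 100\times$$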
What are your application-level teams doing to reduce the data requirements? General-purpose NUMA fabrics are needed to move data in arbitrary ways between disc/memory/NICs, but your needs aren't arbitrary - you basically only require a pipeline from disc to memory to the NIC. Do you, for example, keep the first few seconds of all your content cached in memory, because users usually start at the start of a stream rather than a few minutes in? Alternatively, if 1000 people all start the same episode of Stranger Things within the same minute, can you add queues at the external endpoints or time shift them all together so it only requires one disk read for those thousand users?
It would be fascinating to hear from Netflix on some serious details of the usage patterns they see and particular optimizations that they do for that, but I doubt there's so much they can do given the size of the streams, the 'randomness' of what people watch and when they watch it, and the fact that the linked slides say the servers have 18x 2TB NVMe drives per server and 256GB of RAM.
I wouldn't be surprised if the Netflix logo opener exists once on disk instead of being the first N seconds of every file though.
What would you rate the relative complexity of working with the NIC offloading vs the more traditional optimizations in the rest of the deck? Have you compared other NIC vendors before or has Mellanox been the go to that's always done what you've needed?
I have video editing skills and have also done some subtitling of videos in my spare time. I have all of the software necessary to perform both of these tasks and would be willing to do so free of charge.
We did a bake off a few years ago, and FOR THIS WORKLOAD FreeBSD outperformed Linux. I don't want to get into an OS war, that's not productive.
Not to mention tons of VM system and scheduler improvements which have been motivated by our workload.
FreeBSD itself has improved tremendously over the last few releases in terms of scalability
True. FreeBSD (or its variants) has always been a better performer than Linux in the server segment. Before Linux became popular (mostly due to better hardware support), xBSD servers were famous for their low maintenance and high uptime (and still are). This archived page of Netcraft statistics ( https://web.archive.org/web/20040615000000*/http://uptime.ne... ) provides an interesting glimpse into internet history: how, 10+ years back, the top 50 servers with the highest uptimes were often xBSD servers, and how Windows and Linux servers slowly replaced xBSD.
(Here's an old HN discussion about a FreeBSD server that ran for 18 years - https://news.ycombinator.com/item?id=10951220 ).
Genuinely curious where you see the state of the art when it comes to Linux.
My understanding is that they don't do 0-copy inline ktls, but I could be wrong about that.
I have run Debian-based Linux my entire life and recently moved, circumstantially, to Windows. I have no idea how its kernel model works, and I find io_uring exciting.
Wasn't aware of any adoption of io_uring ideas in Windows land, sounds interesting
Not sure if it’s ok to ask.. how many servers like this one does it take to serve the US clients?
But note that these are flash servers; they serve the most popular content we have. We also have "storage" servers with huge numbers of spinning drives that serve the longer tail. They are constrained by spinning rust speeds, and can't serve this fast.
At the time, AMD was the only Gen4 PCIe available, and it was hard to determine if the Mellanox NIC or the AMD PCIe root was the limiting factor. When AMD suggested Relaxed Ordering, that brought its importance to mind.
This is problematic, because sometimes results are not reproducible.
Eg, if I test on the day of a new release of a popular title, we might be serving a lot of it cached from RAM, so that cuts down memory bandwidth requirements and leads to an overly rosy picture of performance. I try to account for this in my testing.
However, how do you test the saturation point when dealing with production traffic? Won't you have to run your resources underprovisioned in order to achieve saturation? Doesn't that degrade the quality of service?
Or are these special non-ISP Netflix Open Connect instances that are specifically meant to be used for saturation testing, with the rest of the load spilling back to EC2?
The biggest impact I have to QoE is when I crash a box, but clients are architected to be resilient against that.
Also, have you looked into offloading more of the processing to an FPGA card instead?
FreeBSD has "async sendfile", which means that it does not block waiting for the data to be read from disk. Rather, the pages that have been allocated to hold the data are staged in the socket buffer and attached to mbufs marked "not ready". When the data arrives, the disk interrupt thread makes a callback which marks the mbufs "ready", and pokes the TCP stack to tell them they are ready to send.
This avoids the need to have many threads parked, waiting on disk io to complete.
The gist is that the sendfile() call stages the pages waiting to be read in the socket buffer, and marks the mbufs with M_NOTREADY (so they cannot be sent by TCP). When the disk read completes, a sendfile callback happens in the context of the disk ithread. This clears the M_NOTREADY flag and tells TCP they are ready to be sent. See https://www.nginx.com/blog/nginx-and-netflix-contribute-new-...
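A toy model of that handshake (the real code lives in FreeBSD's sys/kern/kern_sendfile.c and uses different names; this just shows the "stage not-ready, flip to ready in the disk callback" shape):

    #include <stdbool.h>
    #include <stddef.h>

    struct mbuf_stub {
        bool  ready;                 /* stands in for the M_NOTREADY flag  */
        void *sockbuf;               /* the socket buffer it is queued on  */
    };

    /* sendfile() stages the pages on the socket buffer before the read completes */
    static void
    stage_pages(struct mbuf_stub *m, void *sockbuf)
    {
        m->ready   = false;          /* TCP must not transmit these yet */
        m->sockbuf = sockbuf;
    }

    /* Disk-completion callback, run from the disk interrupt thread:
     * flip the mbufs to ready and poke TCP, with no thread ever sleeping. */
    static void
    disk_read_done(struct mbuf_stub *m)
    {
        m->ready = true;
        /* in the kernel, tcp_output() on the owning socket runs at this point */
        (void)m->sockbuf;
    }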
The overall idea is to copy bytes from disk to the socket with almost no allocation and without blocking; is that the idea?
I've been doing this for stuff like SMPTE authority servers and ntpd and things that absolutely cannot go down, for over a decade.
There are also VirtIO drivers involved, and according to the article, they had effect too.
So, could Linux be tweaked and made as performant for _this_ use case. I expect so. The question to be answered is _why_.
Do you have any additional references around this? I'm aware that rarely used functionality is often broken, and I therefore usually don't recommend that people use it, but I would like to learn about kTLS in particular. I think for Linux, OpenSSL 3 now added support for it in userspace. But there are also the kernel components as well as drivers - all of them could have their own set of issues.
Well, the printers wouldn't pair with the new APs, certain laptops with fruit logos would intermittently drop connection, and so on.
I probably will never use that brand again, even though they escalated and promised patches quickly - within 6 hours they had found the issue and were working on fixing it, but the damage to my reputation was already done.
Since then I've always demanded to be able to test any new idea/kit/service for at least a week or two just to see if I can break it.
2. When serving video, do you use floating point operations at all? Could this workload run on a hypothetical CPU with no floating point units?
3. How many of these hardware platforms do you guys own? 10k?100k?
2) These are just static files, all encoding is done before it hits the CDN.
Rather than DMA from storage to system RAM, and then from system RAM to the NIC, you could conceivably DMA to GPU RAM and then to the NIC for a subset of sends. Not all of the sends, cause of PCIe bandwidth limits. OTOH, DDR5 is coming soon and is supposed to bring double the bandwidth and double the fun.
Something like close the connection after N requests/N minutes if the nodes are mismatched, but leave it open indefinitely if they match.
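A hypothetical sketch of that heuristic (none of this is from the presentation; the threshold is a placeholder):

    #include <stdbool.h>

    /* Decide whether to keep a client connection alive after a request,
     * based on whether it landed on the NUMA node that holds the content. */
    static bool
    keep_connection(int conn_domain, int content_domain, int requests_served)
    {
        if (conn_domain == content_domain)
            return true;             /* aligned: let it live indefinitely */
        /* misaligned: close after a few requests and hope the client's next
         * connection (new source port) hashes onto the right link/node */
        return requests_served < 4;  /* placeholder threshold */
    }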
There's of course a lot of ways for that to not be very helpful. You'd still have only a 25% chance of getting a port number that hashes to the right node the next time (assuming the TCP port number is involved at all; if it's just src and dest IPs then client connections from the same IP would always hash the same, and that's probably a big portion of your clients), and if establishing connections is expensive enough (or clients aren't good at it) then that's a negative. Also, if a stream's files don't tend to stay on the same node, then churning connections to get to the right node doesn't help if the next segment is on a different node. I'm sure there are other scenarios too.
I know some other CDN appliance setups do use multiple IPs, so you probably could get more, but it would add administrative stress.
In a past life we broke LAGs up to use different subnets per port to prevent traffic crossing the NUMA bridge.
I’m sure there are good reasons you didn’t take this approach, be interesting to hear them.
If so, then yeah, that's a VPN.
Wild, didn't know nVidia was side-eyeing such far-apart but still parallel channels for their ?GPUs?.
Was this all achievable using nVidia's APIs out-of-the-box, or did the firmware/driver require some in-house engineering :)
We keep track of popular titles, and try to cache them in RAM, using the normal page cache LRU mechanism. Other titles are marked with SF_NOCACHE and are discarded from RAM ASAP.
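For anyone curious what that looks like at the syscall level, SF_NOCACHE is a flag to FreeBSD's sendfile(2); a minimal sketch (not Netflix's actual code, error handling omitted):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Serve a byte range of an unpopular title and hint that its pages
     * should be dropped from the page cache as soon as possible, so it
     * doesn't evict genuinely hot content from RAM. */
    static int
    serve_cold_title(int filefd, int sockfd, off_t offset, size_t len)
    {
        off_t sent = 0;

        return sendfile(filefd, sockfd, offset, len, NULL, &sent, SF_NOCACHE);
    }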
I ask since I was curious why you guys opted not to bypass sendfile(2). I suppose it wouldn't matter in the event that the client is some viewer, as opposed to another internal machine.
Our transport team is working on packet pacing, or really packet spreading, so that any bursts we send are small enough to avoid being dropped by the client, or an intermediary (cable modem, router, etc).
Would these techniques be applicable on arm64 and/or riscv64?
The slides here look great. I'm looking forward to watching the recording.
I've seen their boxes while walking through Equinix facilities in Dallas and Chicago, and it is a bit jarring how small some of the largest infrastructure can be.
It's worth noting that they have many smaller boxes at ISP colo locations as well, not just the big regional DCs like Equinix.
Sometimes, for some organizations, spending 0.1 or 5 or 10 million dollars to solve a problem for now, right now is the smart and prudent choice.
I don't doubt there's a point at which it's cheaper to focus on reducing hardware and colocation costs, but for the vast majority engineers are the expensive thing.
Companies that think engineers are expensive will continue to buy tons of hardware and scale rather badly. If you are not actively pushing the ceiling, you're going to fall behind. You should work on these problems because, who knows, there might be some value there.
It feels like a luxury that they can fit their entire content library in all encoded formats into a single server (even though it's a massive server).
And the text is in a <tspan>, so it could be appropriately re-titled if desired :)
...to something like "How to not be too curious for your own good"?
I wonder how many will read this and consider trying out FreeBSD. It’s a rather dope OS but I am a bit biased.
1. Is there any likelihood Netflix will need to migrate to ARM in the next few years? (I see your write-up at the end of the deck; curious if you're seeing more advancements with ARM than x86 and, as such, project ARM to surpass x86 for your needs in the foreseeable future.)
2. Can you comment more on the 800 Gbps reference at the end of the deck?
The 800Gb box is just a science experiment: a dual-socket Milan with a relatively equal number of PCIe lanes from each socket. It should have been delivered ages ago, but supply chain shortages impacted the delivery dates of a one-off prototype like this.
Pretty crazy stuff though, would love to see something similar from cloudflare eng. since those workloads are extremely broad vs serving static bits from disk.
For an EPYC 7502P 32-core / 256GB RAM / 2x Mellanox ConnectX-6 dual-port NICs, I'm seeing $10,000.
Then come the 18x SSDs, lol, plus the cards that can actually hold all of that. So the bulk of the price is the SSDs + associated hardware (HBA??). The CPU/RAM/interface is actually really cheap, based on just some price-shopping I did.
Seems surprisingly cheap, but I'm not sure if that's just great cost engineering on Netflix's part or a poor prior on my part … I'll choose to blame Nvidia's pricing in other domains for biasing me upward.
I have to say that a lot of this may be overkill for plex. Do you know if plex even uses sendfile?
TrueNAS also introduced me to ZFS and I have been amazed by it so far! I haven't dug too deep into FreeBSD yet, but that's next on my list.
(I'm also curious where the files are sourced from - EC2 would be horrendously expensiveish, but maybe you send to one CDN and then have it propagate the data further outwards (I'm guessing through a mesh topology).)
There are a lot of blogs that go over this in detail for our use case at home. You can do it, but you will not see much improvement, if any at all.
The actual bandwidth is significant though, even compared to something like YouTube
You could argue then that Comcast wasn't upgrading their saturated transit links, which they weren't with Level 3, but to assume that every single ISP should provide free pipes to companies is absurd.
If you know of any ISP that would give me 100+ Gbps connectivity for my business, please let me know as I'd love to eliminate my bandwidth costs.
I thought we moved past that model after AOL
Comcast threatened to throttle customers' bandwidth, refusing to deliver the speeds they had promised. The data was available to Comcast, customers had paid for the service of delivering that data, but Comcast wouldn't provide the full service they had sold unless Netflix paid them more.
The deeper issue is that Comcast is lying to their customers, promising them more bandwidth than they are able to deliver, so when Comcast's customers wanted to use all the bandwidth they bought to watch Netflix, Comcast couldn't afford to honor their promises.
But Comcast has a monopoly in the markets they serve, while Netflix exists in a competitive market, so Comcast got away with it.
Why does one end of this example have to pay both ISPs in your view?
Take, for example, if Verizon decided to charge more for bandwidth to Netflix... if Netflix said "no" and went with another provider, then Verizon's customers would suffer from worse access times to Netflix.
Verizon has the advantage in that they have a huge customer base, so no one wants to piss off Verizon. So it cuts both ways. Bandwidth becomes not a cost at this scale, but instead a moat.
ISPs would have been more than happy to show the middle finger to the WFH engineers, but not to the binge-watching masses.
Couldn't agree more.
We see the opposite when it comes to broadband monopolies: "barely good enough" DSL infrastructure, congested HFC, and adversarial relationships w.r.t. subscriber privacy and experience.
When it became worthwhile to invest in not just peering but also last-mile because poor Netflix/YouTube/Disney+/etc performance was a reason for users to churn away, they invested.
This isn't to say that this is all "perfect" for consumers either, but this tension has only been good for consumers vs. what we had in the 90's and early-mid 00's.
I have been spoiled with symmetric 1Gbps up and down fiber for a few years, and using the internet is like turning on the electric or gas or water, you do not ever have to think about it.
I heard they are paying something between $0.20 (EU/US) and $10 (exotic) per TB, based on the region of the world where the traffic is coming from.
They're likely paying even less. $0.20/TB ($0.0002/GB) is aggressive, but given their scale, connectivity, and per-machine throughput, it's likely lower still.
A few points to take home:
- They [Netflix, YT, etc] model cost by Mbps - that is, the cost to deliver traffic at a given peak. You have to provision for peaks, or take a reliability/quality-of-experience hit, and for on-demand video your peaks are usually 2x your average.
- This can effectively be "converted" into a $/TB rate, but that's an abstraction, and not a productive way to model (see the rough arithmetic after this list). Serving (e.g.) 1000PB (1EB) into a geography at a daily peak of 3Tbps is much cheaper than serving it at a daily peak of 15Tbps.
- Netflix, more so than most others, benefits from having a "fixed" corpus at any given moment. Their library is huge, but (unlike YouTube) users aren't uploading content, they aren't doing live streaming or sports events, etc - and thus they can intelligently place content to reduce the need to cache fill their appliances. It's cheaper to cache fill if you can trickle most of it during the troughs, as you don't need as big a backbone, peering links, etc. to do so.
- This means that Netflix (rightfully!) puts a lot of effort into per-machine throughput, because they want to get as much user-facing throughput as possible from the given (space, power, cost) of a single box. That density is also attractive to ISPs, as it means that every "1RU of space" they give Netflix has a better ROI in terms of network cost reductions vs. others, esp. when combined with the fact that "Netflix works great" is an attractive selling point for users.
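To make the peak-vs-volume point concrete, here is a back-of-envelope conversion; the per-Mbps-month price p is a placeholder (not anyone's actual rate), and peak = 2x average as above:

$$1\ \text{Mbps sustained for a month} \approx \frac{10^6}{8}\ \text{B/s} \times 2.6\times 10^6\ \text{s} \approx 0.33\ \text{TB}$$

$$\frac{\$}{\text{TB}} \approx \frac{\text{peak (Mbps)} \times p}{\text{avg (Mbps)} \times 0.33\ \text{TB}} = \frac{\text{peak}}{\text{avg}} \cdot \frac{p}{0.33} \approx 6p \quad (\text{peak} = 2\times \text{avg})$$

So the same petabytes get cheaper per TB as the traffic profile flattens, which is exactly why modeling by peak Mbps rather than $/TB is the productive framing.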
The inbound data stream to fill the cache of the appliance is rate limited and time limited - see https://openconnect.zendesk.com/hc/en-us/articles/3600356180... The actual inbound data to the appliance will be higher than the fill because not everything is cached.
The outbound stream from the appliance serves consumers. In New Zealand for example, Netflix has 40Gbps of connectivity to a peering exchange in Auckland. https://www.peeringdb.com/ix/97
So although total Netflix bandwidth to consumers is massive, it has little in common with the bandwidth you pay for at Amazon.
Disclaimer: I am not a network engineer.
Amazon and most cloud providers do overcharge for b/w. You can buy an OVH/Hetzner-type box with guaranteed un-metered 1 Gbps public bandwidth for ~$120/month easily, which if fully utilized is equivalent to ~325TB/month, or $3-4/TB, completely ignoring the 8/16-core bare-metal server and attached storage you also get. These are SMB/self-service prices; you can get much better deals with basic negotiating and getting into a contract with a DC.
One thing to remember, though: not all bandwidth is equal. CSPs like AWS provide a lot of features such as very elastic on-demand scale-up, a lot of protection up to L4, and advanced SDN under the hood to make sure your VMs can leverage the b/w; that is computationally expensive and costly.
>Strategy: Keep as much of our 200GB/sec of bulk data off the NUMA fabric [as] possible
With software kTLS, the data is moved in and out of memory 4x (DMA write from disk into RAM, CPU read of the plaintext for encryption, CPU write of the ciphertext, NIC DMA read), so that's 200GB/s of memory bandwidth needed to serve 50GB/s (400Gb/s) of data.
TCP BBR is in the FreeBSD upstream now? Cool.
Maybe they should add "But I already have DRM!" to the list. DRM solves a complete different problem.
But yes I imagine licensing is definitely an issue too. Right now only certain shows can be saved offline to a device and only for very restricted periods of time for the same reason. It's also worth noting in many cases Netflix doesn't pay for bandwidth.
It's already there. The piratebay already serves me movies at 200 MBps with almost zero infrastructure cost. It's probably more a licensing issue like you said.
Of course, for free, many people are willing to live with starting and seeking being slow, or battery life being worse, or having to allocate the storage up front, or using their upload, and so on; but again, the question is whether it can fit in the cost offset of some of the centralized infrastructure, not what you could do for free. I don't have anything against torrents; quite the opposite, I am quite a heavy user of BOTH streaming services and torrenting, due to DRM restrictions on quality for some of my devices, but it isn't a single-issue problem like you are pressing to make it be.
For some internal distribution work, Netflix has made IPFS into something they can use, but not for end delivery: https://blog.ipfs.io/2020-02-14-improved-bitswap-for-contain...
This sounds like some alien language
You pin the network adapters to be handled by a specific set of cores.
You pin the filesystem to handle specific sets of files on specific cores.
You then ensure that the router in the rack distributes the HTTP requests to the network ports in exactly such a way that they always arrive on the network adapters that have those files pinned.
It’s not much different from partitioning / sharding and “smart” load balancing a cluster of servers, it’s just on a lower level of abstraction.
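As a concrete flavor of the "pinning" part, here's a userspace-style sketch using FreeBSD's cpuset(2) API; which cores belong to which NUMA domain (and whether you'd do this from userspace rather than in the kernel/driver) is an assumption for illustration:

    #include <sys/param.h>
    #include <sys/cpuset.h>

    /* Pin the calling thread to cores [first, first + count), e.g. the cores
     * that share a memory controller and PCIe root complex with a given NIC
     * and set of drives. */
    static int
    pin_to_domain(int first, int count)
    {
        cpuset_t mask;

        CPU_ZERO(&mask);
        for (int i = 0; i < count; i++)
            CPU_SET(first + i, &mask);
        return cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
                                  sizeof(mask), &mask);
    }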
- Associate network connections with "Non-Uniform Memory Architecture" nodes [(a compute core with faster access to some memory, disks, and network-interface-cards)]
- Allocate local memory to back media files when they are direct-memory-access'ed [(directly copied to memory)] from disk
- Allocate local memory for transport-layer-security cryptography destination buffers & do software cryptography locally
- Run kernel-transport-layer-security workers, "Recent Acknowledgment" / "Bottleneck Bandwidth and Round-Trip-Time" Transmission-Control-Protocol pacers with domain affinity
- Choose local Link-Aggregation egress port
-- All of this is upstream!
Why is that not "devops"? Is it not quite a broad term?