I posted Serving Netflix Video Traffic at 800Gb/s and Beyond [1] in 2022. For those unaware of the context, you may want to read the previous PDF and thread. Now we have an update; quote:
> Important Performance Milestones:
2022: First 800Gb/s CDN server 2x AMD 7713, NIC kTLS offload
2023: First 100Gb/s CDN server consuming only 100W of power, Nvidia Bluefield-3, NIC kTLS offload
My immediate question is whether the 2x AMD 7713 actually consumes more than 800W of power, i.e. more watts per Gbps. Even if it does, it is based on 7nm Zen 3 with DDR4 and came out in 2021. Would a Zen 5 / DDR5 system outperform Bluefield in watts per Gbps?
I haven’t read the post yet, but I used to work in this space @cable. The nice bit with BlueField-3 is you can run nginx on the on-board ARM cores, which have sufficient RAM for the live HLS use case. You can use a Liqid PCIe fabric and support 20 BlueField NICs off a single-CPU 1U box. You essentially turn the 1U box into a mid-tier cache for the NICs. Doing this I was able to generate 120 gig off each NIC off a 1U HP + the PCIe fabric/cards. I worked with Liqid and the HP lab here in Colorado prototyping it. Edit: I ran the CDN edge directly on the NIC using Yocto Linux.
Note that your power consumption is more than just the CPUs (combined TDP of 2x225W [0]). You also have to consider the SSDs (16x20W when fully loaded [1]), the NICs (4x24W [2]), and the rest of the system itself (e.g. cooling, backplane).
Or, you can skip all the hand calculations and just fiddle with Dell's website to put together an order for a rack while trying to mirror the specs as closely as possible (I only included two NICs, since it complained that the configuration didn't have enough low-profile PCIe slots for four):
Well, I am assuming memory and SSD are the same. The only difference should be CPU + NIC, since the Bluefield itself is the NIC. Maybe Drewg123 could expand on that (if he is allowed to).
That is a fair point, as the 2x CPU + 4x NIC are "only" about 550W put together. There's probably more overhead for cooling (as much as 40% of the datacenter's power -- multiplying by 1.5x pushes you just over that 800W number).
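To put rough numbers on that, here is a back-of-envelope sketch using the TDP figures quoted above; these are rated numbers, not measured draw, so the real system will differ:

    # Back-of-envelope power budget from the TDP figures quoted in this thread;
    # actual draw will differ (boost behavior, idle SSDs, PSU efficiency, etc.).
    cpu = 2 * 225    # W, two EPYC 7713 at 225 W TDP each
    nic = 4 * 24     # W, four NICs at 24 W each
    ssd = 16 * 20    # W, sixteen SSDs at 20 W when fully loaded

    compute_only = cpu + nic                  # ~546 W, the "about 550W" above
    with_cooling = round(compute_only * 1.5)  # ~819 W with ~40% facility overhead
    full_loadout = cpu + nic + ssd            # ~866 W if every SSD is at max draw

    print(compute_only, with_cooling, full_loadout)  # 546 819 866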
That said, being able to do 800G in an 800W footprint doesn't automatically mean that you can drive 100G in a 100W footprint. Not every ISP needs that 800G footprint, so being able to deploy smaller nodes can be an advantage.
Also: I was assuming that 100W was the whole package (which is super impressive if so), since the Netflix serving model should have most of the SSDs in standby most of the time, and so you're allowed to cheat a little bit in terms of actual power draw vs max rating of the system.
You're not wrong, but it's still a nominally usable lower bound for the actual power draw of the chip, and a reasonable proxy for how much heat you need to dissipate via your cooling solution.
Another observation: using the host CPU to manage the NVMe storage for VOD content is also a bottleneck. You can use the Liqid 8x NVMe PCIe cards and address them directly from processes on the BlueField-3 ARM cores using NVMe-oF. You are then just limited by PCIe 5 switch capacity among the 10-20 shared NVMe/BlueField-3 cards.
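For anyone curious what "address them directly using NVMe-oF" looks like mechanically, here is a hypothetical sketch using nvme-cli from the DPU side; the transport, address, and NQN below are made-up placeholders, not the actual Liqid/BlueField configuration:

    # Hypothetical NVMe-oF attach using nvme-cli via subprocess;
    # all values below are illustrative placeholders.
    import subprocess

    subprocess.run([
        "nvme", "connect",
        "--transport=rdma",                       # NVMe-oF over RDMA
        "--traddr=192.0.2.10",                    # placeholder target address
        "--trsvcid=4420",                         # conventional NVMe-oF port
        "--nqn=nqn.2024-01.example:vod-shelf-0",  # placeholder subsystem NQN
    ], check=True)
    # The remote namespace then appears as a local /dev/nvme* device on the ARM
    # cores, so content can be served without bouncing through the host CPU.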
Sometimes it’s still hard to tackle the psychology of people who are used to the “comfort” of the “sta[b]le” branches.
So at work I came up with the following observation: if you are a consumer and are afraid of unpredictable events at the head of the master/main branch, then by using the head of master/main from 21 days ago, you get 3 weeks of completely predictable and modifiable future.
Any cherry-picks are made during the build process, so there is no branch divergence; if the fix gets upstreamed, it is not cherry-picked anymore.
Thus, unlike with stable branches, by default it converges back to master.
“But what if the fix is not upstreamed?” Then it stays there, and depending on the nature of the code it bears a bigger or smaller ongoing cost, which reflects the technical debt as it is.
This has worked pretty well for the past 4-5 years and is now used for quite a few projects.
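For concreteness, a minimal sketch of the builder logic described above; the branch name, helper names, and the cherry-pick list are hypothetical placeholders, not the actual tooling:

    # Pin the build to main as it was 21 days ago, then apply any local cherry-picks.
    # Picks that have since been upstreamed become unnecessary and get dropped,
    # so the build converges back to main by default.
    import subprocess

    def git(*args: str) -> str:
        return subprocess.run(("git", *args), check=True,
                              capture_output=True, text=True).stdout.strip()

    def pinned_commit(branch: str = "origin/main", days: int = 21) -> str:
        # Newest commit on the branch that is at least `days` old.
        return git("rev-list", "-1", f"--before={days} days ago", branch)

    def prepare_tree(cherry_picks: tuple[str, ...] = ()) -> None:
        git("fetch", "origin")
        git("checkout", "--detach", pinned_commit())
        for sha in cherry_picks:
            git("cherry-pick", sha)   # applied at build time, so no long-lived branch diverges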
This is how OS updates have worked at every company I've been at. Either you have a handful of devices that get them immediately and you scream-test, or you simply wait 3 weeks and then roll them out. (Minus security CVEs, of course.)
The one additional bonus of this scheme that popped up somewhat as a side effect is that one can deploy a new feature into the “prod” build even before it hits master: code it up, add the instruction to cherry-pick the change into the builder's branch, test it, and if it works then merge that instruction into the “master” builder.
Then, the feature will remain in prod builds as a custom cherry-pick until it’s merged upstream.
And if it isn’t merged, then one has a recurring reminder of the technical debt they have incurred by diverging. Once adopted, it became a pretty cool way to instill the “right” long-term incentives without door-stopping the short-term “urgent” deliverables.
Competition is good (not everyone using Linux, I mean), and I've run FreeBSD on my desktop and server for a few years.
But whenever Netflix's use of FreeBSD comes up, I never come away with a concrete understanding of: is the cost/performance Netflix gets with FreeBSD really not doable on Linux?
I'm trying to understand if it's inertia, or if not, why more cloud companies with similar traffic (all the CDN companies for example) aren't also using FreeBSD somewhere in their stack.
If it were just the case that they like FreeBSD that would be fine and I wouldn't ask more. But they mention in the slides FreeBSD is a performance-focused OS which seems to beg the question.
The community is such that if one of either FreeBSD or Linux really outperformed the other in some capacity, they'd rally and get the gap closed pretty quickly. There are benchmarks that flatter both but they tend to converge and be pretty close.
Netflix has a team of, I think it was 10 from the PDF, engineers working on this platform. A substantial number of them are FreeBSD contributors with relationships to the project. It's a very special team. That's the difference maker here. If it was a team of former Red Hat guys, I'm sure they'd be on Linux. If it was a team of old Solaris guys, I wouldn't be surprised if they were on one of the Solaris offspring. Netflix knew that to make it work at their scale, they had to build this stuff out. That was something they figured out pretty quickly; they found the right team and then built around them. It's far more sophisticated than "we replaced Red Hat Enterprise with FreeBSD and reaped immediate performance advantages." At their scale, I don't think there is an off-the-shelf solution.
I think you're close but missing an element. FreeBSD is a centrally designed and implemented OS from kernel to libc to userland. The entire OS is in the source tree, with updated and correct documentation, with coherent design.
Any systems-level team will prefer this form of tightly integrated solution to the systems-layer problem if they are responsible for the highly built-out, specialized distributed application we call Netflix. The reasons for design choices going all the way back to FreeBSD 1 are available on the mailing list in a central place. Everything is there.
Trying to maintain your own Linux distro is insanely difficult; in fact Google still uses a 2.2 kernel with all custom back-ported updates of the last 30 years.
The resources needed to match the relatively small FreeBSD design and implementation team are minuscule compared to the infinite sprawl of Linuxes; a team of 10 FreeBSD sysprogs is basically the same number of people responsible for designing and building the entire thing.
It comes down to the sources of truth. In the world of FreeBSD, that's the single FreeBSD repo and the single mailing list. For a Linux, that could be thousands of sources of truth across hundreds of platforms.
I understand the point you're trying to make, and I agree that FreeBSD tends to have cleaner code and better documentation at different levels, but I don't think that makes it that much more difficult. If you dropped in from a different world and had zero experience, then I think you're right and a systems team would almost always pick BSD. Any actual experience pretty quickly swings it the other way, though; there are also companies dedicated to helping you fill in those gaps on the Linux side.
I've built a couple of embedded projects on Linux; when you're deep on a hard problem, the mailing lists and various people are nice, but the "source of truth" is your logic analyzer and you debug the damn thing. Or your hardware vendor might have some clues, as they know more of the bugs in their hardware.
Fair points taken; I was a bit zealous in my use of the word "any", the word "many" is more correctly applicable.
In regard to sources of truth, I mean from the design-considerations point of view. For instance, why does the scheduler behave a certain way? We can analyze the logic and determine the truth of the code's behavior, but determining the reason that design was selected for implementation is far more difficult.
These days, yes, off-the-shelf Linux will do just fine at massively scaling an application. When Netflix started building, Blockbuster was still a huge scary thing to be feared and respected. Linux was still a hobby project if you didn't fork out 70% of a commercial Unix contract over to RHEL.
The team came in with the expectation and understanding that they would be making massive changes for their application that may never be merged upstream. The chances of getting a merge in are higher if the upstream is a smaller, centralized team. You can also just ask the person who was in charge of, say, the init system design and implementation. Or, oh, that scheduler! Why does it deprioritize x and y over z? Is there more to why this was chosen than what is on the mailing list?
The pros go on and on and the cons are difficult to imagine, unless you make a vacuum and compare 2024 linux to 2004 linux.
> Any systems level team will prefer this form of tightly integrated solution to the systems layer problem if they are responsible for the highly built out specialized distributed application we call netflix
If any team would prefer this, then as gp asked: why is Linux overrepresented at other Non-Netflix content-delivery shops?
Practically this just doesn't matter all that much. You can prefer one approach to the other and that's all fine, but from a "serious engineering" perspective it just doesn't really matter.
> Trying to maintain your own Linux distro is insanely difficult; in fact Google still uses a 2.2 kernel with all custom back-ported updates of the last 30 years.
2.2? Colour me skeptical on that.
But it's really not that hard to make a Linux distro. I know, because I did it, and tons of other people have. It's little more than "get kernel + fuck about with Linux booting madness + bunch of userland stuff". The same applies to FreeBSD by the way, and according to the PDF "Netflix has an internal “distro”".
The problems Google has are because they maintain extensive patchsets, not because they built their own distro. They would have the same problems with FreeBSD, or any other complex piece of software.
It's hard to do an apples-to-apples comparison, because you'd need two excellent, committed teams working on this.
I'm a FreeBSD supporter, but I'm sure you could get things to work in Linux too. I haven't seen any posts like 'yeah, we got 800 gbps out of our Linux CDN boxes too', but I don't see a lot of posts about CDN boxes at all.
As someone else downthread wrote, using FreeBSD gives a lot of control, and IMHO provides a much more stable base to work with.
Where I've worked, we didn't follow -CURRENT, and tended to avoid .0 releases, but it was rare to see breakage across upgrades, and it was typically easy to track down what changed because there usually weren't a lot of changes in the suspect system. That's not really been my experience with Linux.
A small community probably helps get their changes upstreamed regularly too.
The truth is that some senior engineer at Netflix chose to use FreeBSD and they have stuck with that decision ever since. FreeBSD is not better; it just happens to be the solution they chose.
All the benefits they added to FreeBSD could have been added the same way to Linux if they were missing.
YouTube / Google CDN is much bigger than Netflix and runs 100% on Linux; you can make pretty much everything work on a modern solution / language / framework.
> YouTube / Google CDN is much bigger than Netflix
YouTube and Netflix are on par. According to Sandvine, Netflix sneaked past YouTube in volume in 2023 [1]. I believe their 2024 report shows them still neck and neck.
> you can make pretty much everything work on a modern solution
Presenting a false equivalence without evidence is not convincing. "You could write it in SNOBOL and deploy it on TempleOS". Netflix didn't choose something arbitrary or by mistake. They chose one of the world's highest performing and most robustly tested infrastructure operating systems. It's the reason a FreeBSD derivative lies at the core of Juniper routers, Isilon and Netapp storage heads, every Playstation 3/4/5, and got mashed into NeXTSTEP to spawn Darwin and thence macOS, iOS etc.
It continues to surprise me how folks in the tech sector routinely fail to notice how much BSD is deployed in the infrastructure around them.
> All the benefits they added to FreeBSD could be added the same way in Linux
They cannot. Case in point, the bisect analysis described in the presentation above doesn't exist for Linux, where userland distributions develop independently from the kernel. Netflix is touting here the value of FreeBSD's unified release, since the bisect process fundamentally relies on a single dimension of change (please ignore the mathematicians muttering about Schur's transform).
> There's a reason it's the core of Juniper routers, Isilon and Netapp storage heads, every Playstation 3/4/5, and got mashed into NeXTSTEP to spawn macOS.
Licensing?
FreeBSD is a fine OS that surely has some advantages here and there, but I'm also inclined to think that big companies can make stuff work if they want to.
PHP at Meta seems like a pretty good example of this.
The permissive, freewheeling nature of the BSD license is touted by some as an advantage for infrastructure use but in practice, Linux-based devices and services aren't particularly encumbered by the GPL, so to me it's a wash.
Could be lawyers at some companies just didn't want anything to do with the GPL, especially if they're fiddling with the kernel itself. Maybe they're not even correct about the details of it, just fearful. "Lawyers overly cautious" is not an uncommon thing to see.
Having seen the screams from some people when you use the GPL legally, I can't blame the lawyers. You might be correct, but you will still get annoying people yelling that you are not. There is also the cost of verifying that you are correct. (We put in the wrong address for where to send requests for source; fortunately a tester caught it before release, but it was still expensive, as the lawyers forced the full recall process, meaning we had to send a tech to upgrade the testers' units instead of just telling them where to download the fix.)
> […] Linux-based devices and services aren't particularly encumbered by the GPL, so to me it's a wash.
Linux uses GPL 2.x. If we are talking about GPL 3.x, things may be different:
> One major danger that GPLv3 will block is tivoization. Tivoization means certain “appliances” (which have computers inside) contain GPL-covered software that you can't effectively change, because the appliance shuts down if it detects modified software.
If you want to ship (embedded) GPL 2 Linux-based devices things are fine, but if there's software that is GPL 3 then that can cause issues. Linus on GPLv3:
I think you've hit the nail on the head. I used to work for a very, very large company, and they had a very strong preference for BSD-licensed software if we were building something using outside software. A few years ago, Steve Ballmer and others spent a lot of time spreading FUD by calling Linux and the GPL a "cancer". Believe it or not, that stuff had a massive impact. Not on small shops, but on large-company lawyers.
Over the years, Steve left Microsoft and Microsoft has become a lot more Linux friendly. And the cancer fear has subsided. But it was very very real.
On a side note, if I recall correctly, Steve Jobs wanted to use Linux for MacOS. But he had licensing concerns and called up Linus and asked him to change a few things. Linus gave him the middle finger and that is how we got MacOS with BSD.
Fun fact! NeXT was a commercial AT&T Unix fork. The transition from the Unix base to the BSD base did in fact happen at Apple after the acquisition. The value of NeXT was in its application library, which would eventually become the Mac foundation libraries like CoreAudio and Cocoa etc. The earliest releases of Rhapsody are very illuminating about the architecture of XNU/OSX. I don't doubt that Linux was considered. There's a specific time when the actual move of Rhapsody to a FreeBSD base occurred, and it was at Apple sometime in '97 or '98 IIRC.
It's a mix of licensing and performance/stability. The licensing ensures that a company can implement their own IP and not have to contribute that back to the community as open source.
The performance/stability of FreeBSD is no joke; that's why, as mentioned previously, storage vendors like Isilon and NetApp chose FreeBSD as the base. They can contribute upstream when needed, but they don't have to provide any source for their storage appliance software.
Be all that as it may, this is still a “why not Linux” line of thinking, rather than “why FreeBSD”, which is the more interesting question. And it is not a binary choice.
Because FreeBSD is a perfectly fine OS for server type stuff, and maybe it's better for this, that, or the other thing. So someone probably picked it and it has worked ok for them.
Just to be clear, you don't need to convince me - I've been preferring BSDs for my own infrastructure since the 90s. But I'm mindful that the discussion can quickly become polarised and winds up with negative framing or binary assumptions, which isn't helpful for others asking the same question.
> Incorrect. Case in point, the bisect analysis described in the presentation above doesn't exist for Linux, where userland distributions develop independently from the kernel.
You can certainly bisect the Linux kernel changes though. And the bug in question was a kernel bug. For a project like this, IMHO, most of the interesting bugs are going to be kernel bugs.
Probably, if you were doing this on Linux, you'd follow someone's Linux tree, and not run Debian unstable with updates to everything.
You might end up building a lightweight distribution yourself; some simple init, enough bits and bobs for in place debugging, and the application and its monitoring tools.
Anyway, if you did come across a problem where both userland and kernel changed and it wasn't clear where the breakage was, the process is simple: test the new kernel with the old userland and the old kernel with the new userland, and see which part broke. Then bisect if you still can't tell by inspection.
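And the mechanical part of that bisect is easy to script. A hedged sketch follows; the good/bad revisions and the test script name are placeholders, not anything from the slides:

    # Sketch of driving `git bisect run`; revisions and script are placeholders.
    import subprocess

    def git(*args: str) -> None:
        subprocess.run(("git", *args), check=True)

    git("bisect", "start", "v6.9", "v6.6")   # placeholder bad and good revisions
    # git re-runs the script at each step: exit 0 marks the revision good,
    # 125 skips it, and other small non-zero codes mark it bad.
    git("bisect", "run", "./build-deploy-and-benchmark.sh")
    git("bisect", "reset")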
> It's the reason a FreeBSD derivative lies at the core of Juniper routers
Juniper is a particularly poor example to start the list with, as they have been moving away from FreeBSD for years now, with Junos OS Evolved being Linux-based, and they have talked specifically about how being based on FreeBSD was no longer a benefit to them.
A reason. Licensing also plays a role. Some may say the most important one.
>Case in point
Not really. This is an advantage, yes, but it was inherent to the BSD development style, not an addition they made. I assume GP refers to other presentations, which talk about the optimizations needed to get the performance Netflix needs from FreeBSD.
>doesn't exist for Linux
But it can be done by putting the kernel and userland in the same repo as modules.
Could they have contributed async sendfile(2) to Linux as well? Probably. In 2024 these advantages seem to be moot: eBPF, io_uring, more maturity in the Linux network stack, plus FreeBSD losing more and more vendor support by the day.
Might be a manpower thing. By hiring a bunch of FreeBSD Core devs, Netflix might be able to get a really talented team for cheaper than they might get a less ideologically flavored team. (I say this as I set up i3 on FreeBSD on my Thinkpad X280, I'm a big fan!)
They also get much more control. If they employ most of the core FreeBSD devs they basically have their own OS that they can do what they like with. If they want some change that benefits their use case at the detriment of other people they can pretty much just do it.
The flip side is of course that in this scenario they would have to finance most improvements. In Linux land the cost of development is shared across many, which one might expect to yield a generally better product.
When Netflix was founded, the only viable commercial Linux vendor was RHEL, and the support contract would have been about the same as just hiring the FreeBSD core team at salary.
People really do not remember the state of early Linux. Raise a hand if you have ever had to compile a kernel module from a flash drive to make a laptop's wifi work, then imagine how bad it was 20 years before you tried, and possibly failed, at getting networking on Linux to work, let alone behave in a consistent manner.
The development costs were not yet shared back then; most of the Linux users at the vendor-support level were still smaller, unproven businesses, and most importantly, if you intend to build something that nobody else has, you do not exactly want to be super tied to a large community that can just take your work and directly go clone your product.
Hiring FOSS devs and putting them under NDA for everything they write that doesn't get upstreamed is an excellent way to get nearly everything upstreamed as well, and the cost to competitors of porting these merged upstream changes back down into their Linux is not nothing, so this gives a competitive moat advantage.
Netflix, the online service, launched in 2007. The previous business, doing DVD mailing, had no such high bandwidth serving requirements. You're exaggerating the timeline quite a lot.
> When Netflix was founded, the only viable commercial Linux vendor was RHEL, and the support contract would have been about the same as just hiring the FreeBSD core team at salary.
AFAIK, currently, Netflix only uses FreeBSD for their CDN appliances; their user facing frontend and apis live (or lived) in AWS on Linux. I don't know what they were running on before they moved to the cloud.
I don't think they started doing streaming video until 2007 and they didn't start deploying CDN nodes until 2011 [1]. They started off with commercial CDNs for video distribution. I don't know what the linux vendor marketplace looked like in 2007-2011, but I'm sure it wasn't as niche as in 1997 when Netflix was founded. I think they may have been using Linux for production at the time that they decided to use FreeBSD for CDN appliances.
> Hiring FOSS devs and putting them under NDA for everything they write that doesn't get upstreamed is an excellent way to get nearly everything upstreamed as well, and the cost to competitors of porting these merged upstream changes back down into their Linux is not nothing, so this gives a competitive moat advantage.
I don't think Netflix is particularly interested in a software moat; or they wouldn't be public about what they do and how, and they wouldn't spend so much time upstreaming their code into FreeBSD. There's an argument to be made that upstreaming reduces their total effort, but that's less true the less often you merge. Apple almost never merges in upstream changes from FreeBSD back into mac os; so not bothering to upstream their changes saves them a lot of collaborative work at the cost of making an every 10 years process a little bit longer.
At WhatsApp, I don't think we ever had more than 10 active patches to FreeBSD, and they were almost all tiny; it wasn't a lot of effort to port those forward when we needed to, and certainly less effort than sending and following up on getting changes upstream. We did get a few things upstreamed though (nothing terribly significant IMHO; I can remember some networking things like fixing a syncookie bug that got accidentally introduced and tweaking the response for icmp needs frag to not respond when the requested mtu was at or bigger than the current value; nothing groundbreaking like async sendfile or kTLS).
The only patch to FreeBSD I maintained was hacking up FreeBSD 2.x to embed Hughes mSQL to back getpwnam etc. and RADIUS/TACACS. I also needed to exceed the UID limit and came up with a hacky scheme using the UID for the customer and the GID for the account, letting each customer have the 30k or whatever account IDs. I shared the idea with friends at Teleport, an ISP in Portland, who ended up doing something similar with Oracle, I think. I think the same sort of thing led to the vservers that best.net used with BSDi, and eventually FreeBSD jails were developed, which were significantly better than my hack job, so I migrated the ISP over to using jails.
This development, but not all of the other things that go into a kernel and distro. If they pocket the whole core team, then they end up paying for all of the work, not just the few optimizations they really care about.
> In Linux land the cost of development is shared across many, which one might expect to yield a generally better product.
And you have more cooks in the kitchen tweaking things that you also want to work on, so there's a higher chance of conflicts, but also of folks who want to go in a completely different direction technically/philosophically.
Are there papers out there from other companies that detail what performance levels have been achieved using Linux in a similar device to the Netflix OCA? Maybe they just use two devices that have 90% of the performance?
> Had we moved between -stable branches, bisecting 3+ years of changes could have taken weeks
Would it really? Going by their number of 4 hours per bisect step, you get 6 bisections per day, which cuts the range to 1/64th of the original. The difference between "three years of changes" and "three weeks of changes" is a factor of 50x. I.e. within one day, they'd already have identified the problematic three week range. After that, the remaining bisection takes exactly as long as this bisection did.
Even if they're limited to doing the bisections only during working hours in one timezone for some reason, you'd still get those six bisection steps done in just three days. It still would not add weeks.
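Roughly, the step counts behind that argument look like this; the commit volumes are ballpark estimates, not figures from the slides:

    # Ballpark bisection-step math; commit counts are rough estimates.
    from math import ceil, log2

    commits_3_weeks = 500                 # ~3 weeks of FreeBSD main, per a comment below
    commits_3_years = commits_3_weeks * 52

    steps_3_weeks = ceil(log2(commits_3_weeks))   # ~9 steps
    steps_3_years = ceil(log2(commits_3_years))   # ~15 steps

    # At ~4 hours per step, the extra ~6 steps are roughly one more day of
    # wall-clock time, assuming each step merges as cleanly as when tracking head.
    print(steps_3_weeks, steps_3_years)           # 9 15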
Author here:
Note that the 4 hours per bisection is the time to ramp the server up and down and re-image it. It does not count the time to actually do the merge for each bisection step. That's because in our case of following head, the merging & conflicts were trivial for each step of the 3-week bisection. E.g., the bisections we're doing are far simpler than the bisections we'd be doing if we didn't track the main branch and had a far larger patchset that had to be re-merged at each bisection point. I'm cringing just imagining the conflicts.
Does "the server" imply you're only using a single server for this? I would have expected that at Netflix's scale it wouldn't be too difficult to do the whole bisect-and-test on a couple dozen servers in parallel.
You could speculate in both directions, and test three revisions in parallel... However, if there's a lot of manual merging to do, you might not want to do the extra merge that's involved there. You might also get nerd sniped into figuring out if there's a better pattern if you're going to test multiple revisions in parallel.
That's a very good point, but merging in three years of changes would also have pulled in a lot of minor performance changes, and possibly some incompatible changes that would require some code updates. That would slow down each bisection step, and also make it harder to pinpoint the problem.
If you know that some small set of recent changes caused a 7% regression, you can be fairly confident there's a single cause. If 3 years of updates cause a 6% or 8% regression (say), it's not obvious that there's going to be a single cause. Even if you find a commit that looks bad, it might already have been addressed in a later commit.
Edit to clarify: you're technically correct (the best kind of correct!) but I'd still much prefer to merge 3 weeks rather than 3 years, even though their justification isn't quite right.
If a bisection takes a day, it would probably take longer to automate than just find it manually. For performance bugs, you may need to look at non-standard metrics or profiling that would otherwise be a one-off and don’t necessarily make sense to automate.
The full bisection taking just a day doesn't seem compatible with the parameters of the story.
Three weeks of FreeBSD changes seems to be about 500 commits. That's about 9 bisection steps. At two steps / day (the best you can do without automation), that's a full work week. It seems obvious that this is worth automating.
lol, came here to say this, armed with the identical log2(53) == 5.7 :-) The replies to your comment are, of course, spot on, though. Finding an 8% performance regression in three years of commits could have taken a looooong time.
Why would you alphabetically order initializations?
Every complex system I've ever worked on that had a large number of initializations was sensitive to orders.
Languages with module support like Wirth's Modula-2 ensure that if module A uses B, B's initialization will execute before A's. If there is no circular dependency, that order will never be wrong.
The reverse order could work too, but it's a crapshoot then. Module dependency doesn't logically entail initialization order dependency: A's initializations might not require B's initializations to have completed.
If you're initializing by explicit calls in a language that doesn't automate the dependencies, the baseline safest thing to do is to call things in a fixed order that is recorded somewhere in the code: array of function addresses, or just an init procedure that calls others directly.
If you sort the init calls, it has to be on some property linked to dependency order, otherwise don't do it. Unless you've encoded something related to dependencies into the module names, lexicographic order is not right.
In the firmware application I work on now, all modules have a statically allocated signature word that is initially zero and set to a pattern when the module is initialized. The external API functions all assert that the pattern has the correct value, which is strong evidence that the module had been initialized before use.
On one occasion I debugged a static array overrun which trashed these signatures, causing the affected modules to assert.
Having a consistent ordering avoids differences in results from inconsistent ordering by construction. IIUC, Alpha sort was/is used as a tie breaker after declared dependencies or other ordering information.
In this case, two (or more) modules indicated they could handle the same hardware, and there was no priority information if both were present. Probably this should be detected / raise a fault, but under the previous regime of alpha sort it was handled nicely, because the preferred drivers happened to sort first.
A topological sort of the dependency graph is just as consistent as any other sort, so long as you have a deterministic tie breaking mechanism for the case of multiple valid toposorts (which can just be another sort based on some unique property).
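A minimal sketch of that kind of deterministic ordering follows; the module names and dependency map are made up for illustration, not taken from the FreeBSD linker sets:

    # Topological sort of an init-dependency graph with an alphabetical tie-break,
    # so the result is both dependency-correct and fully deterministic.
    import heapq

    def init_order(deps: dict[str, set[str]]) -> list[str]:
        # deps maps each module to the set of modules it depends on;
        # every module must appear as a key.
        indegree = {m: len(needs) for m, needs in deps.items()}
        users: dict[str, list[str]] = {m: [] for m in deps}
        for mod, needs in deps.items():
            for dep in needs:
                users[dep].append(mod)
        ready = [m for m, n in indegree.items() if n == 0]
        heapq.heapify(ready)                 # name order breaks ties among ready modules
        order: list[str] = []
        while ready:
            mod = heapq.heappop(ready)
            order.append(mod)
            for user in users[mod]:
                indegree[user] -= 1
                if indegree[user] == 0:
                    heapq.heappush(ready, user)
        if len(order) != len(deps):
            raise ValueError("circular dependency among modules")
        return order

    # Example: init_order({"pci": set(), "ahci": {"pci"}, "nvd": {"pci"}})
    # -> ["pci", "ahci", "nvd"]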
Besides not being able to do zero-copy on ZFS (yet), it probably also has to do with them not using RAID for content, and single-drive ZFS doesn't make much sense in that scenario.
Single drive ZFS brings snapshots and COW, as well as bitrot detection, but for Netflix OCAs, snapshots are not used, and it's mostly read-only content, so not much use for COW, and bitrot is less of a problem with media. Yes, you may get a corrupt pixel every now and then (assuming it's not caught by the codec), but the media will be reseeded again every n days, so the problem will solve itself.
I assume they have ample redundancy in the rest of the CDN, so if a drive fails, they can simply redirect traffic to the next nearest mirror, and when a drive is eventually replaced, it will be seeded again by normal content seeding.
"Yet" does not suggest any sort of impending completion.
We haven't been to the center of the galaxy yet.
We haven't achieved immortality yet.
Both are valid sentences without any fixed timeline, and in the case of the first, a date that is hundreds of thousands of years in the future at soonest.
"yet" just means "up until this time" (in conext, sometimes it means "by the time you're talking about" - e.g. I'm scheduled on-call on the 13th, but i won't be back from my PTO yet).
Netflix has been developing in kernel and userland for this; if ZFS for content was a priority, they could make 0-copy sendfile work. Yes, it's not trivial at all. It will probably happen eventually, by someone who needs 0-copy and ZFS; or by ZFS people who want to support more use cases.
In addition to the lack of zero-copy sendfile from ZFS, we also have the problem that ZFS is lacking async sendfile support (e.g., the ability to continue the sendfile operation from the disk interrupt handler when the blocks arrive from disk).
I agree, but: the best thing about this is that the work is actually done by a small number of people instead of an overengineered system with a custom solution. The thriller is great, but the efficiency powering all of Netflix's CDN is also refreshing.
I'm guessing only a small fraction of the subscription cost goes to delivering the video. Copyright is a government enforced monopoly on content, and money goes to monopolies. As Thiel said, "Competition is for losers."
Hey @drewg. In hindsight, would looking at the power draw chart combined with the CPU utilization have given you a clue that it was running at a lower clock speed, and helped you narrow in on a cause rather than having to go the bisect route? Or a system diff of sysctls?
Do you mean moving away from using Nginx, like Cloudflare moved to a custom replacement? [1]
I don't think that's as needed for Netflix. My understanding is their CDN nodes are using Nginx to serve static files --- their content management system directs off-peak transfers, and routes clients to nodes that it knows have the files. They don't run a traditional caching (reverse) proxy, and they most likely don't hit a lot of the things Cloudflare was hitting, because their use case is so different.
(I haven't seen a discussion of how Netflix handles live events, that's clearly a different process than their traditional prerecorded media)
Yes, I'm talking about the Nginx running on those BSD boxes; they have such a custom design that writing your own static-file server would have made sense.
Getting an HTTP server to work with the diversity of HTTP clients in the world that Netflix supports is not going to be fun, and NGINX is right there.
As I understand it, they've made some changes to NGINX, but I don't think they've made a lot, and I don't think anything where the structure of NGINX was not conducive to it or limiting.
I'm not one to shy away from building a custom design, but it's a lot easier when you control the clients, and Netflix has to work with everything from browsers to weird SDKs in TVs.
Netflix OCA performance seems mostly bottlenecked on I/O / memory bandwidth (and CPU overhead for pacing?), and any sensible HTTP server for static files isn't going to use a lot of memory bandwidth processing inbound requests and calling sendfile on static files. So why spend the limited time of a small team building something that's not going to make a big difference?
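For anyone unfamiliar with the pattern being described, here is a toy sketch of the sendfile path; Python is used only for illustration, and this is not how nginx or the OCA software is actually written:

    # Toy static-file response using sendfile(2): after the headers are written,
    # the kernel streams the file to the socket without copying it through userspace.
    import os
    import socket

    def serve_file(conn: socket.socket, path: str) -> None:
        size = os.path.getsize(path)
        conn.sendall(
            b"HTTP/1.1 200 OK\r\n"
            b"Content-Length: " + str(size).encode() + b"\r\n"
            b"Connection: close\r\n\r\n"
        )
        with open(path, "rb") as body:
            conn.sendfile(body)   # wraps os.sendfile() where available
        conn.close()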
Most of us also aren't running Netflix, or anything of the sort. All the big companies that heavily use Linux at scale also employ tons of Linux kernel engineers.
What are they bisecting with? FreeBSD uses CVS and Perforce.
These items in the slide don't add up:
> Things had worked accidentally for years, due to linkerset alphabetical ordering
In other words, the real bug is many years old. Yet, on the last slide:
> Since we found the bug within a week or so of it hitting the tree, the developer responsible was incredibly responsive. All the details were fresh in his memory.
What? Wasn't what hit the tree the ordering change which exposed the bug? It doesn't seem like that being fresh in anyone's mind is of much help.
The unintentional ordering change being fresh in Colin's memory was really helpful, as he quickly realized that he'd actually changed the ordering and sent me a patch to fix it. If it had been 3 years in the past, I suspect he would have been less responsive (I know I would have been).
* One issue was that the new sort handled ties differently. They adjusted that, and they could have stopped there.
* The other was that the correct drivers were sensitive to loading order, which had been masked by the old sort. They handled this and another driver bug, amdtemp.
If they'd found this years later, even investigating the first set would be a slower process - it wouldn't be fresh in minds, and it likely would then have other code relying on it, so adjusting it would be trickier.
[1] https://news.ycombinator.com/item?id=32519881