Serving 100 Gbps from an Open Connect Appliance (medium.com)
146 points by drewg123 109 days ago | 75 comments

This is a fantastic read, I highly recommend it if you like kernel performance stuff.

I did much the same sort of work in SunOS ~25 years ago. Very similar, I fixed the file system so we could run at platter speed and then the VM system couldn't keep up. I wrote over a dozen experimental pageout daemons before I came to the conclusion the stock one was as good as anything I could come up with.

I ended up "solving" the problem in the read ahead logic, when memory was starting to get tight I "freed behind" so this file wouldn't swamp memory. It was a crappy answer but it worked well enough. I suspect there is still an LMXXX in ufs_getpage() about it.

One thing that I did, that never shipped, was to keep track of how many clean/dirty pages were associated with a particular vnode. I wrote a "topvn" that worked like top but looked at vnodes instead of processes. Shannon nixed it, actually took it out of the kernel, because I added a vp->basename that was "a" basename of the vnode. He didn't like that hardlinks created confusion so he shit canned the whole thing. If anyone has the SCCS history I'm pretty sure it's in there, though it might be 4.x (SunOS) instead of Solaris. I think the full set of things I added was

    vp->last_fault;    // timestamp of the last time we mapped/faulted/whatever this page
The advantage of that info is you can very very quickly find pages to be released. If a vnode is all dirty pages you skip that one, if it was all clean pages and last_fault is old, dump 'em all. It's way the heck faster than scanning each page.
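
To make the idea above concrete, here is a minimal sketch of such a per-vnode scan. It is not the original SunOS code: the struct, the clean/dirty counter names, and the `max_idle` cutoff are all assumptions made for illustration, and only `last_fault` comes from the comment above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-vnode summary. Only last_fault is named in the thread;
 * the other fields are illustrative. */
struct vnode_summary {
    uint64_t clean_pages;  /* pages that can be dropped without I/O */
    uint64_t dirty_pages;  /* pages that would need writing back first */
    uint64_t last_fault;   /* timestamp of the last map/fault on this vnode */
};

/*
 * Mark vnodes whose pages can be dumped wholesale: no dirty pages, at least
 * one clean page, and no fault for at least max_idle ticks. Sets reclaim[i]
 * to 1 for each selected vnode and returns the total pages that would be
 * freed. Cost is one check per vnode instead of one per page.
 */
uint64_t
pick_idle_clean_vnodes(const struct vnode_summary *vns, size_t n,
    uint64_t now, uint64_t max_idle, unsigned char *reclaim)
{
    uint64_t freed = 0;

    assert(vns != NULL && reclaim != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Treat a timestamp from the future as "just touched". */
        uint64_t idle = (now >= vns[i].last_fault) ?
            now - vns[i].last_fault : 0;

        reclaim[i] = 0;
        if (vns[i].dirty_pages != 0)
            continue;   /* any dirty pages: skip this vnode */
        if (vns[i].clean_pages == 0)
            continue;   /* nothing cached, nothing to free */
        if (idle < max_idle)
            continue;   /* touched recently: keep it */
        reclaim[i] = 1;
        freed += vns[i].clean_pages;
    }
    return freed;
}
```

This sketch also skips vnodes with a mix of clean and dirty pages, which is a simplification; a real pageout path would presumably handle those separately.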

Thanks. That is a heck of a compliment coming from you! I never used BitKeeper, but I've used lmbench since the 90s.

Did you get so far as to implement a pageout daemon that scanned vnodes like that? We send only via sendfile, and we give the kernel hints about what should be freed using the SF_NOCACHE flag. This helps a lot. When we think we're serving something cold based on various weighted popularity rankings, we pass the SF_NOCACHE flag to sendfile(). This causes the page to be released immediately when the last mbuf referencing it is freed.
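
A rough sketch of what passing that hint might look like from userland. The popularity score, the threshold, and both function names are hypothetical; the actual ranking logic is not described in the thread. The `sendfile()` call uses the FreeBSD signature and is compiled only on FreeBSD, and the fallback `#define` is just a placeholder so the policy function builds elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef SF_NOCACHE
/* Placeholder so the sketch compiles off FreeBSD (assumed value);
 * on FreeBSD the definition from <sys/socket.h> is used instead. */
#define SF_NOCACHE 0x10
#endif

/* Hypothetical policy: anything scoring below cold_threshold is "cold". */
int
sendfile_flags_for(double popularity, double cold_threshold)
{
    assert(cold_threshold >= 0.0);
    return (popularity < cold_threshold) ? SF_NOCACHE : 0;
}

#ifdef __FreeBSD__
/* Send nbytes of file fd, starting at offset, to socket s. */
int
send_chunk(int fd, int s, off_t offset, size_t nbytes, double popularity,
    double cold_threshold, off_t *sent)
{
    int flags = sendfile_flags_for(popularity, cold_threshold);

    /* No header/trailer vectors; flags carry the cache hint. */
    return sendfile(fd, s, offset, nbytes, NULL, sent, flags);
}
#endif
```

Keeping the policy separate from the system call makes it easy to change how "cold" is decided without touching the send path.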

No, I lost interest when Shannon pulled it out. I'm dancing with a company that might hire me to do some performance work in this area so maybe I'll redo that code. I don't think it is hard and I'd do it in FreeBSD. And bring back topvn.

We've got so much memory these days it might help.

But..... back in the day the top vnode was always swap because everyone used the same swap vnode. Howard Chartok (Mr Swapfs) and I discussed at length an idea of a swap vnode per process group. You want the set of processes that work together to use the same vnode so you have some sort of idea if it has gone idle. Just imagine the stats that the pageout daemon looks at being summarized in the vnode, you want atime, mtime, dirty, clean, etc.

I suspect for your workload the swap vnode isn't an issue.

If the pageout daemon is still like it was then it's crazy. 4K pages, 128GB of ram, that's ~33 million pages to scan. If you summarize that you can find stuff to free really fast. And probably drop per file hints in there for the pageout daemon (like you are doing).
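
The page count above is simple to check. A small sketch, assuming "128GB" means 128 GiB and pages are 4 KiB (the function name is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Number of pages a per-page scanner would have to walk. */
uint64_t
pages_in_ram(uint64_t ram_bytes, uint64_t page_bytes)
{
    assert(page_bytes > 0);
    return ram_bytes / page_bytes;
}
```

`pages_in_ram(128ULL << 30, 4096)` gives 33,554,432, i.e. 2^37 / 2^12 = 2^25, which matches the "~33 million" figure.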

Be happy to talk it over over a beer or something. I'm at the usual email addresses.

Thanks. This is the one I was thinking of, unchanged decades later:


200 pages before the pager starts running maybe made sense back then, needs to be rethought now. I'm reading the code to see if anyone other than UFS uses it. Something called UDFS does but that looks like DVDs, not that interesting.

God, I love human ingenuity. You are real engineers. I only have experience doing real high level stuff -- the easy parts, so I didn't follow most of the text. It's impressive that there are people who understand the boxes we all take for granted and can fine tune them.

It's also awesome how it all isn't a huge, inexplicable mess. I cannot make a CRUD PHP app without dirty hacks, you mess with 40 years of programming effort by thousands of people and it still behaves sanely.

When the AI comes, will it appreciate how hard we tried? I sure hope so.

Sorry for being off topic. IT is great and it's stuff like this that reminds me of it.

If it helps you at all, I just wrote this up for a guy I work with who hasn't done kernel programming. I started learning how to whack on the kernel about 30 years ago but I still remember the horror of being stuck at an adb prompt and having absolutely no clue what to do next :)

I'm reading through the zfs code and I can see why the kernel is intimidating, all this state you have to gather up to make sense of it. One thing that helps is there are patterns. Just like device drivers, file systems all lock mostly the same way, have a certain pattern. You can blindly follow that and get stuff done. Eventually you have to understand what you are doing but you'd be amazed at how far you can go faking it. That's what I did while I was learning and I did tons of useful work sort of "blind". Eventually stuff comes into focus, the architecture comes first, then the arcane details (usually). And even though I was working in the file system code, there was some stuff (the whole hat_ layer) that I never bothered to learn/memorize, it just worked, I wasn't changing it, shrug. I have a pretty good idea what it was doing at the general level but would have to go learn the details if I wanted to change it.

Kernel hacking is fun and apparently isn't that common a skill any more, people like the comfort of userland. I'm no rocket scientist and I got pretty comfortable in SunOS, IRIX, Sys III, Sys V, etc. Unless you are trying to rewrite the whole thing in a clean room, it's really not that hard. It's hard to know all the details about everything but it is rare that you need to (and even more rare to find someone who knows all that stuff).

If this sort of thing seems interesting, you should grab a kernel and figure out how to build and install it, make a new syscall called im_a_stud() that does some random thing, add it, call it. Off you go :)

This is fabulous, you need to post it somewhere more permanent :)

If one guy does what I suggested I'll start a blog. But people mostly just read, they don't do. I'll do if they do, I'd love to be helping people do more, I'm old, it's the kids that need to take on the task. So to be clear, I'll blog if someone adds a syscall and figures out how to call it. FreeBSD, Linux, Solaris, hell, Windows (but I'll need to be educated enough to know that they did it), whatever OS makes you happy.

BTW, I was being sort of a bitch about that. If someone wants to add a syscall and figure out how to call it, I'll help. I'll have to look up the details but I've done it before, it's not that hard. So hit me up if you want to do it and/or make me blog.

When AI comes it will do things with cellular hardware in a massively distributed way which has almost nothing to do with CPUs and code as you know them.

Provably optimal classical computing substrate is a hardware cellular automaton. We know this, we've known this for 50 years at least, we still don't go there directly.

Cool result! I've done similar things in the past with previous company's units[1] using Mellanox cards. We worked quite a bit with Chelsio as well, and have shown nice results there. I am surprised on the spinning disk front though, as I was showing off about 5.5-6GB/s for previous company's 60 drive bay units 3 years ago[2]. This was a single PCIe gen3 x8 NIC, we were bounded by the IB network (56Gb) performance, and had about 2GB/s extra headroom on the pure spinning disk systems.

[1] https://scalability.org/2016/03/not-even-breaking-a-sweat-10...

[2] https://scalability.org/2014/10/massive-unapologetic-firepow...

Line-rating 100g DC networking is close to plug and play today with Chelsio's zero copy stuff inside a LAN/MAN. Single digit CPU util.

The needs of long haul TCP are a bit harder. Check out the TCP RACK RFC and packet pacing to get an idea of how many timers you have firing off. Also HTTP and TLS vs block protos. Netflix numbers are very impressive with all that in light.

Curious, do you know of anyone using Chelsio in production? Their cards look awesome and have tons of features to put them on par with Mellanox, but I haven't really heard of anyone using them.

A financial *aaS provider whose infrastructure we built used them.

The cards are very good, I am looking at them for current (platform) projects.

Generally they are one of two choices for very high performance networking on multiple OSes. The other being Mellanox of course.

Mainly used in HFT, but I believe Cloudflare use them.

Author here, willing to answer questions..

I'm curious as to the specific reasoning behind perpetually pushing towards higher per-machine bandwidth. Is it primarily backed by squeezing more usefulness (and therefore decreasing costs elsewhere) out of existing hardware, or is it more of a "Hey let's try and do this" exercise?

The reason I ask is that for most smaller operations I'd assume that you'd be more cost effective with dev time to just throw a few more servers at the problem than to go through all the effort to squeeze a few more percentage points of efficiency out of hardware, although I suppose when you're operating at that scale a 1% efficiency gain is enough to justify an entire developer's salary.

Because it is fun! But the real answer is: To improve density. Eg, less networking, less power and less cooling and rack space.

My understanding is that IX interconnects at 40Gb/s cost not much less than interconnects at 100Gb/s. And if we're going to connect a machine at 100Gb/s anyway, it is better to be running it at 100Gb/s than 50 or 60Gb/s. We need fewer network ports.

These OC appliances are deployed in ISP racks, and based on the specs they A) aren't cheap to make and B) are probably 4U based on the support of 44 SATA SSDs. I don't work for Netflix or an ISP, but part of the goal is probably to reduce the total expense (since they just GIVE these to ISPs who have enough Netflix traffic) as well as minimize the footprint. A single 100GbE connection could easily plug right into an ISP's core router or distribution switch and take up minimal space in a datacenter, but you'd need to eat 3 40GbE ports to get the same performance and X*2U amount of extra rack space to give a horizontal solution.

When you pass, say, $100k/m in OpEx, be it cloud, colo, whatever, it's time to start thinking about this. I'd say the problem is the exact opposite: many top tech businesses have done an abysmal shareholder value thing by ignoring cloud/server sprawl.. tech leadership deficit.

Thanks for contributing to FreeBSD!

Are you considering AMD EPYC processors with their huge number of PCIe lanes?

The PCIe lanes are very exciting. Unfortunately it is not that simple, since the EPYC series is highly dependent on NUMA, and FreeBSD is pretty far behind Linux and other OSes in NUMA support.

Fantastic article: a couple of questions from someone spending a lot of time reading perf-tools/performance counter outputs:

1 -

"From looking at VTune profiling information, we saw that ISA-L was somehow reading both the source and destination buffers, rather than just writing to the destination buffer."

This part intrigued me. I have only used VTune superficially, but how did you profile the destination addresses and correlate them to application level buffers?

2 -

How hard was it to profile kernel code? My experience is always hit and miss when using linux/perf-tool.

3 -

A more general question: looking at the optimizations needed to saturate a 100G NIC, it seems to me that you guys are fundamentally approaching the scalability limits of the IO/networking model of current OSes. How much do you think the current network stack can be squeezed (200G, 300G?) before a complete redesign is necessary? Is there any OS out there already designed specifically for this kind of IO/network performance (AIX, Solaris maybe?)

Could you discuss how you selected the Mellanox cards? Were other cards such as Solarflare, Broadcom, Chelsio considered as well?

I should also mention that in addition to being the first to market with 100GbE NICs, Mellanox has been just phenomenal to work with. The RSS assisted LRO mentioned in the blog is just one example of the help they've given us. They have also driven the adoption of hardware packet pacing into FreeBSD, as well as NIC queue backpressure, and a few other things we're currently working on. They've been a huge help to FreeBSD networking.

>" They have also driven the adoption of hardware packet pacing into FreeBSD, as well as NIC queue backpressure,"

Could you elaborate on these? What is "packet pacing"? I am not familiar with this term. Also how does the NIC queue backpressure work? I don't see references to these in the post. Thanks.

Not OP, but I'd wager that one factor is that Mellanox has high-quality, well-maintained (by Mellanox themselves), open-source drivers for their 100 GbE NICs in the FreeBSD kernel.

I'm not sure if the Chelsio cxgbe driver supports 100 GbE yet (I don't think it does), and I'm pretty certain that there aren't yet any drivers for any 100 GbE NICs from any of the others. I don't have a need for anything close to 100 GbE so I haven't really kept up on this. There could be by now but, even if there are, Mellanox was first.

I'll probably get some scowls for this but Mellanox sales and support has rubbed me the wrong way numerous times. I'm trying to mend that at the moment for my company.

Chelsio has a more technically interesting product to me so far, although they weren't as quick to market for 25/100g. Their T6 line does the same line speeds as ConnectX4/5. Chelsio prices are haggle free and puzzlingly low, whereas I think vendors like Broadcom, Intel and Mellanox put a ton of wiggle room in their NIC pricing to give your procurement person a nice softball win on how much they "saved" on MSRP. Meh, I'd rather not play games.

Nah, I've heard similar about both Mellanox and Solarflare sales people. Are you running the Chelsio cards in production then? Any feedback there? Can you recommend them?

I have many thousand in revenue service, as does Netflix (https://openconnect.netflix.com/en/hardware/). It's the only NIC we buy for 2+ years at $LLNW. T6 requires a little bit of care on Linux, the LTS kernels don't have all the bells and whistles but should pass packets due to common mailbox API. I think kernel 4.13 is fairly complete. It's plug and play on supported FreeBSD versions (10.4, 11.1, -STABLE, HEAD). We run fbsd HEAD for main product line and some Ubuntu 16.04 for other product lines where I packaged up the out of tree driver for kernel 4.4.

They were the first to market with 100GbE NICs by quite a while, and when we were starting, they were the only game in town.

Not only first, but they have a dual 100Gbps, which is more than PCIe currently supports. They're way ahead of everyone else.


The question that popped into my head was why FreeBSD? I stopped using it for anything back in 2006. Since much of this tuning/debugging/analysis relates to the OS/kernel, shouldn't the article start there?

That decision was made long before I joined Netflix. My understanding is that the initial decision was made due to the license. After that, our team attracted a large number of FreeBSD committers, so it just kept its momentum.

I've done kernel work in both OSes, and I'm pretty much agnostic myself. I find the biggest drawback to FreeBSD is hardware support. Netflix has actually driven a fair amount of vendors to provide FreeBSD support due to our using FreeBSD.

I will say that FreeBSD has a few things that Linux doesn't, specifically async sendfile. This ends up being far more efficient than either a thread pool, or aio daemons for handling IO completions. Async sendfile is also one of the things that drove us to implement TLS encryption in the kernel, and is the foundation that pretty much our entire stack rests upon.

> Netflix has actually driven a fair amount of vendors to provide FreeBSD support due to our using FreeBSD.

Yes, and the rest of us FreeBSD users greatly appreciate you for that! Please keep using FreeBSD and pushing these vendors and this household will keep throwing 2 x $10/month at Netflix. :-)

I have heard it described like this. Suppose you want to hire some real kernel talent.

So you go to a Linux conference looking for people. When you get there the conference is crowded with idiots, advertisers, and companies. No way to find the smart people.

Now you go to a FreeBSD conference. It is almost empty but the majority of the people know what they are doing. You only need to be interesting, have a clue, and buy a beer to get all the attention you want. ;-)

Netflix have been heavy users of FreeBSD for a long time.

It also had very good tracing tools before Linux did, so reasoning about performance bottlenecks is much easier.

The networking stack has also handily beat Linux for as long as I can remember.

From an objective standpoint, the Linux stack is actually better in a lot of ways. They are generally much, much, much tighter in terms of cache misses and per-packet costs. They have a lot of features that FreeBSD doesn't, and a lot of innovation happens there (BBR).

And they have a huge number of very smart people working on it. I've had the privilege of working with some of them at Google, and have nothing but respect for them.

So basically, from a kernel/systems programmer standpoint, there is no reason to use BSDs today apart from the license?

From a business perspective, I think it's easier to pick up *BSD talent. Although the pool is smaller, the pool is more talent dense. If you can afford to not care about the kernel (serverless etc) Linux is fine, if you have to care about the kernel and hardware I'd pick FreeBSD 100% of the time. Also much easier to influence overall project direction and get patches integrated.

>"The networking stack has also handily beat Linux for as long as I can remember."

Can you quantify what "beat" means here? Do you have a citation for this? I am doubting this is true in 2017.

> The networking stack has also handily beat Linux for as long as I can remember.

None of the TOP500 use FreeBSD [1]. I would have thought this would be a primary consideration for a supercomputer. Most of them are using Linux (99.6%!).

1. https://www.top500.org/statistics/details/osfam/4

HPC generally doesn't use IP, so they're using Linux but not the Linux network stack.

Yeah I don't see FreeBSD entering the HPC market ever, except maybe for I/O nodes. Nothing explicitly wrong with that, you have to choose battles.

Probably licensing. You can do whatever you want with BSD-licensed code, not so much with GPL.

For internal purposes like these, they're pretty much indistinguishable. Remember that the GPL only compels you to distribute the source to whoever you gave the binaries, not to everyone else. And distribution inside your own organization doesn't count.

These Netflix boxes get distributed to ISPs, so they may regard this as distribution (legally perhaps unclear).

I'm curious why the LRO improvement gave such a large overall boost, as the workload should be more about sending than receiving. Are the incoming TCP ACKs or TLS control traffic so frequent that you can get 15% more total throughput with this "2 segments in one batch" driver improvement? Naively you'd think that receiving wouldn't even be 15% of the work.

At 90Gb/s, we're sending about 8Mpps (which is mitigated by TSO). However, we're receiving at about 2.2Mpps. Until the RSS based LRO, about 2Mpps was going all the way up the stack into TCP. After the RSS based LRO, that was cut in half.
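
As a back-of-the-envelope check on those rates, the average size of a sent packet follows from the bit rate and the packet rate. This is only arithmetic on the rounded figures quoted above (the function name is made up), not a measurement:

```c
#include <assert.h>

/* Average bytes per packet implied by a bit rate and a packet rate. */
double
avg_packet_bytes(double bits_per_sec, double packets_per_sec)
{
    assert(packets_per_sec > 0.0);
    return bits_per_sec / 8.0 / packets_per_sec;
}
```

`avg_packet_bytes(90e9, 8e6)` comes to 1406.25 bytes, so the outgoing packets are close to full-sized segments. On the receive side, the quoted 2.2Mpps is roughly one incoming packet for every 3.6 sent, and halving the roughly 2Mpps that reached TCP leaves about 1Mpps going up the stack.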

I know you can't use vpp/dpdk, but would you expect similar performance improvements, or maybe even easier ones if your network stack was in user space? Have you considered any hardware offloads, like TSO?

Edit: just noticed someone else asked the dpdk one, so feel free to ignore that.

Hardware supported TSO and LRO are on by default for any worthwhile NIC in FreeBSD.

Any thoughts on Chelsio TLS offload?

It sounds very promising. In-line TLS is the holy grail for us, since it would cut our memory bandwidth usage by over 40%, and memory bandwidth is our biggest bottleneck.

However, I'm not very bullish on just plain crypto accelerator mode. We've tried other vendor's accelerators, and while they save some CPU, memory bandwidth is the real issue for us. The data still needs to be encrypted, so it still needs to be read from memory by DMA rather than a CPU read, and written via a DMA write rather than a CPU write.

(BTW: Huge kudos to Netflix for driving vendors to support FreeBSD)

Unfortunately exposing my naivety on both NICs and TLS, I've never understood why NICs couldn't support the crypto in a way that it can just send the packet straight to the wire after encryption. Is there some deep (unfixable) reason in the TLS/TCP protocol or just a lack of foresight in the NIC design?

What makes in-line TLS difficult is that the NIC has to understand the TLS framing and the TCP segmentation for TSO. On top of that, what do you do if you have a re-transmission? The NIC has to re-download the entire 16KB TLS frame to re-send just a 1448b portion of it. So there are some challenges in terms of being able to find the entire TLS record, making sure to reference the entire thing so that it is all still there, etc. The "1 mbuf per TLS record" work I've done will make that a bit simpler, I hope.
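
To illustrate the retransmission cost described above, here is a sketch that computes how many bytes would have to be fetched again to re-encrypt one lost segment. It assumes every TLS record occupies exactly `record_bytes` of the TCP stream and that the first record starts at stream offset 0. Both are simplifications, real record sizes vary, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of plaintext that must be re-read to re-send the segment covering
 * stream offsets [seg_off, seg_off + seg_len). A whole record has to be
 * re-encrypted even if only part of it was lost, and a segment can
 * straddle two records.
 */
uint64_t
bytes_to_refetch(uint64_t seg_off, uint64_t seg_len, uint64_t record_bytes)
{
    assert(record_bytes > 0 && seg_len > 0);

    uint64_t first = seg_off / record_bytes;                 /* first record touched */
    uint64_t last = (seg_off + seg_len - 1) / record_bytes;  /* last record touched */

    return (last - first + 1) * record_bytes;
}
```

With 16384-byte records, a lost 1448-byte segment inside one record costs 16384 bytes, roughly 11 times the data actually re-sent, and a segment that crosses a record boundary costs 32768.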

>download the entire 16KB TLS frame to re-send just a 1448b portion of it.

The bigger thing is that TLS is far from ideal for video content, and TCP transmissions in general.

It is much better to incorporate a degree of redundant coding, and tolerance to lost packets, than to rely on re-transmission.

As you deal with real-time content, there is no real need for flow control. You either get packets in time, or your video feed cuts off.

Another point: for any genuinely live transmission, multicast beats everything for efficiency, but why it is so hard to run multicast over the open Internet is a discussion on its own.

Given that packet loss on the Internet tends to be very low, like 0.1%, retransmission is a lot more efficient than coding. Being able to build up your buffer faster than real time also helps with burst losses.

How is retransmitting 16kb and bigger TLS frames efficient?

I assume everyone is using TCP correctly so that only missing 1500-byte segments are retransmitted.

> why NICs couldn't support the crypto in a way that it can just send the packet straight to the wire after encryption.

That's how the Chelsio T6 works. On the transmit path the NIC needs to perform encryption, TLS framing, and TCP/IP/Ethernet segmentation in that order; on receive it needs to perform TCP reordering and such before decrypting. Most NICs just don't want to add that much functionality.

Thanks, that looks pretty good. I wonder if it will work with TLS 1.3 as well (from my limited understanding it looks like it might).

Here's dreaming of a future where Ed25519/ChaCha20/Poly1305 is supported in hardware (it's much cheaper for clients that don't have AES so I want servers to use it).

I'm starting to test it in co-processor mode. The main limit is.. Intel is dragging their feet on PCIe bandwidth (probably to curtail GPGPU a bit), and you need a lot of BW for co-processor up and down and then back out the pipe. But it's enough for my needs at the moment.


At least for Broadwell based Xeons with just 2 memory controllers, our main bottleneck is memory bandwidth. And co-processor cards haven't really helped us much. The same memory accesses need to happen. The difference is that with a co-processor, they're done via DMA rather than via memory reads and writes. Intel has gotten so good with crypto in the recent chips that, at least for GCM, we didn't find much advantage to using add-in accelerators.

Have you considered Skylake? The 6 memory channels should easily put you above 100Gbps now.

Interesting read, thank you!

Have you considered moving into userland? E.g DPDK/netmap/etc + BSD stack extracted?

A team from Cambridge wrote an interesting paper for SIGCOMM 2017, "Disk|Crypt|Net: rethinking the stack for high performance video streaming". The main gain they get out of netmap, etc, is tighter latencies, which allows them to make more effective use of DDIO (I/O caching in L3) to free up memory bandwidth. However, their results are on synthetic workloads, and the general feeling is that we would not see nearly as much of a benefit in the "real world".

Interesting. Usually moving networking operations into userland allows your application to "own" the NICs and the stack and reduce lock contention in the packet path down to virtually zero.

My impression was that profiling userland applications is easier too, but I haven't done any serious kernel profiling so I might be wrong.

The hardest part is, of course, ripping the stack out and keeping it up to date with the mainline kernel afterwards if you need TCP.

There is a port of the FreeBSD network stack called libuinet so that's not an unsolved problem, but it would need a bit more care and feeding. There are other issues like.. you need to fetch content off disks, and then you need to move it between address spaces, and then you need buffer types to pass around and then and then.. at some point you are reinventing a lot of wheels for the same goal: saturate some hardware limit like CPU, memory bandwidth, bus bandwidth, drive bandwidth and latency. The FreeBSD kernel works well for all this and is proven.

Userspace networking is a really big win for packet processing which doesn't have any of the above concerns. On FreeBSD with Netmap and VALE you can chain things together in really interesting ways where you can still use kernel networking where advantageous and userland networking where it's advantageous.

> general feeling is that we would not see nearly as much of a benefit in the "real world".

Wouldn't any benefit be interesting? Or is the engineering overhead just too expensive?

I was playing around with KVM and I was getting 54gigs/sec on the bus between VMs on virtio. This was 4 years ago. So 52gigs is very close to my results. I was OC'ed to 4GHz on an ASUS board but I wasn't going PCI Express, just virtio to the VMs.

Thanks for the write up! some interesting stuff

DPDK (userspace networking) and SPDK (userspace storage) seem like a perfect fit for this. Both even support FreeBSD!

memory speed as bottleneck ...
