I did much the same sort of work in SunOS ~25 years ago. Very similar: I fixed the file system so we could run at platter speed, and then the VM system couldn't keep up. I wrote over a dozen experimental pageout daemons before I came to the conclusion that the stock one was as good as anything I could come up with.
I ended up "solving" the problem in the read-ahead logic: when memory was starting to get tight, I "freed behind" so the file being streamed wouldn't swamp memory. It was a crappy answer but it worked well enough. I suspect there is still an LMXXX in ufs_getpage() about it.
One thing that I did, that never shipped, was to keep track of how many clean/dirty pages were associated with a particular vnode. I wrote a "topvn" that worked like top but looked at vnodes instead of processes. Shannon nixed it, actually took it out of the kernel, because I added a vp->basename that was "a" basename of the vnode. He didn't like that hard links created confusion, so he shit-canned the whole thing. If anyone has the SCCS history I'm pretty sure it's in there, though it might be 4.x (SunOS) instead of Solaris. I think the full set of things I added was:
vp->last_fault; // timestamp of the last time we mapped/faulted/whatever this page
Did you get so far as to implement a pageout daemon that scanned vnodes like that? We send only via sendfile, and we give the kernel hints about what should be freed using the SF_NOCACHE flag. This helps a lot. When we think we're serving something cold, based on various weighted popularity rankings, we pass SF_NOCACHE to sendfile(). This causes the page to be released immediately when the last mbuf referencing it is freed.
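In code, a cold-file send looks roughly like this (FreeBSD sendfile(2); the descriptor names and the is_cold predicate are made up for the sketch):

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <err.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Send a whole file down a connected socket. If our popularity
     * ranking says the title is cold, SF_NOCACHE tells the kernel to
     * drop the pages as soon as the last mbuf referencing them is
     * freed. */
    static int
    serve_file(int file_fd, int sock_fd, off_t len, int is_cold)
    {
        off_t sbytes = 0;
        int flags = is_cold ? SF_NOCACHE : 0;

        if (sendfile(file_fd, sock_fd, 0, len, NULL, &sbytes,
            flags) == -1) {
            warn("sendfile");
            return (-1);
        }
        printf("sent %jd bytes\n", (intmax_t)sbytes);
        return (0);
    }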
We've got so much memory these days it might help.
But..... back in the day the top vnode was always swap, because everyone used the same swap vnode. Howard Chartok (Mr Swapfs) and I discussed at length an idea of a swap vnode per process group. You want the set of processes that work together to use the same vnode, so you have some sort of idea if it has gone idle. Just imagine the stats the pageout daemon looks at being summarized in the vnode: you want atime, mtime, dirty, clean, etc.
I suspect for your workload the swap vnode isn't an issue.
If the pageout daemon is still like it was, then it's crazy. 4K pages, 128GB of RAM, that's ~33 million pages to scan. If you summarize that per vnode, you can find stuff to free really fast. And you can probably drop per-file hints in there for the pageout daemon (like you are doing).
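To make the summary idea concrete, the shape of it might be something like this (every name here is hypothetical; no shipping kernel has these fields):

    #include <stdint.h>
    #include <time.h>

    /* Hypothetical per-vnode page summary, illustration only. The
     * fault and dirty paths bump the counters and timestamps, so the
     * pageout daemon walks a list of vnodes instead of tens of
     * millions of pages, and frees from the all-clean, long-idle
     * vnodes first. */
    struct vnode_pagesum {
        uint64_t vps_clean;       /* resident clean pages */
        uint64_t vps_dirty;       /* resident dirty pages */
        time_t   vps_last_fault;  /* last map/fault on any page */
        time_t   vps_last_dirty;  /* last time a page was dirtied */
        int      vps_hint;        /* per-file hint, e.g. free-behind */
    };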
I'd be happy to talk it over, over a beer or something. I'm at the usual email addresses.
200 pages before the pager starts running maybe made sense back then; it needs to be rethought now. I'm reading the code to see if anyone other than UFS uses it. Something called UDFS does, but that looks like DVDs, not that interesting.
It's also awesome how it all isn't a huge, inexplicable mess. I cannot make a CRUD PHP app without dirty hacks, yet you can mess with 40 years of programming effort by thousands of people and it still behaves sanely.
When the AI comes, will it appreciate how hard we tried? I sure hope so.
Sorry for being off topic. IT is great and it's stuff like this that reminds me of it.
I'm reading through the zfs code and I can see why the kernel is intimidating, all this state you have to gather up to make sense of it. One thing that helps is there are patterns. Just like device drivers, file systems all lock mostly the same way, have a certain pattern. You can blindly follow that and get stuff done. Eventually you have to understand what you are doing but you'd be amazed at how far you can go faking it. That's what I did while I was learning and I did tons of useful work sort of "blind". Eventually stuff comes into focus, the architecture comes first, then the arcane details (usually). And even though I was working in the file system code, there was some stuff (the whole hat_ layer) that I never bothered to learn/memorize, it just worked, I wasn't changing it, shrug. I have a pretty good idea what it was doing at the general level but would have to go learn the details if I wanted to change it.
Kernel hacking is fun and apparently isn't that common a skill any more, people like the comfort of userland. I'm no rocket scientist and I got pretty comfortable in SunOS, IRIX, Sys III, Sys V, etc. Unless you are trying to rewrite the whole thing in a clean room, it's really not that hard. It's hard to know all the details about everything but it is rare that you need to (and even more rare to find someone who knows all that stuff).
If this sort of thing seems interesting, you should grab a kernel and figure out how to build and install it, make a new syscall called im_a_stud() that does some random thing, add it, call it. Off you go :)
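On FreeBSD you can even do it from a loadable module without rebuilding the kernel. A minimal sketch along the lines of the handbook's example syscall module (treat it as a starting point, not gospel):

    #include <sys/types.h>
    #include <sys/param.h>
    #include <sys/proc.h>
    #include <sys/module.h>
    #include <sys/sysent.h>
    #include <sys/kernel.h>
    #include <sys/systm.h>

    /* The syscall itself: just prove we got into the kernel. */
    static int
    im_a_stud(struct thread *td, void *arg)
    {
        printf("im_a_stud() called\n");
        return (0);
    }

    static struct sysent im_a_stud_sysent = {
        .sy_narg = 0,
        .sy_call = (sy_call_t *)im_a_stud
    };

    /* Let the kernel pick a free slot in the syscall table. */
    static int offset = NO_SYSCALL;

    static int
    load(struct module *module, int cmd, void *arg)
    {
        int error = 0;

        switch (cmd) {
        case MOD_LOAD:
            uprintf("im_a_stud loaded at syscall %d\n", offset);
            break;
        case MOD_UNLOAD:
            uprintf("im_a_stud unloaded from %d\n", offset);
            break;
        default:
            error = EOPNOTSUPP;
            break;
        }
        return (error);
    }

    SYSCALL_MODULE(im_a_stud, &offset, &im_a_stud_sysent, load, NULL);

kldload it, note the slot number it prints, and call it from userland with syscall(2).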
The provably optimal classical computing substrate is a hardware cellular automaton. We know this, we've known it for at least 50 years, and we still don't go there directly.
The needs of long-haul TCP are a bit harder. Check out the TCP RACK RFC and packet pacing to get an idea of how many timers you have firing off. Also consider HTTP and TLS vs. block protocols. Netflix's numbers are very impressive in that light.
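On FreeBSD, if the RACK stack module is loaded, a socket can opt into it per connection via TCP_FUNCTION_BLK. A rough sketch, error handling omitted:

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <string.h>

    /* Switch one TCP socket onto the RACK stack, which does the
     * pacing/timer work mentioned above. Assumes tcp_rack.ko is
     * loaded; otherwise the setsockopt fails. */
    static int
    use_rack(int s)
    {
        struct tcp_function_set tfs;

        memset(&tfs, 0, sizeof(tfs));
        strlcpy(tfs.function_set_name, "rack",
            sizeof(tfs.function_set_name));
        return (setsockopt(s, IPPROTO_TCP, TCP_FUNCTION_BLK,
            &tfs, sizeof(tfs)));
    }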
The cards are very good, I am looking at them for current (platform) projects.
Generally they are one of two choices for very high performance networking on multiple OSes. The other being Mellanox of course.
The reason I say that is, for most smaller operations, I'd assume it's more cost-effective in dev time to just throw a few more servers at the problem than to go through all the effort of squeezing a few more percentage points of efficiency out of the hardware. Although I suppose when you're operating at that scale, a 1% efficiency gain is enough to justify an entire developer's salary.
My understanding is that IX interconnects at 40Gb/s don't cost much less than interconnects at 100Gb/s. And if we're going to connect a machine at 100Gb/s anyway, it is better to run it at 100Gb/s than at 50 or 60Gb/s: we need fewer network ports.
Are you considering AMD EPYC processors with their huge number of PCIe lanes?
"From looking at VTune profiling information, we saw that ISA-L was somehow reading both the source and destination buffers, rather than just writing to the destination buffer."
This part intrigued me. I have only used VTune superficially, but how did you profile the destination addresses and correlate them to application-level buffers?
How hard was it to profile kernel code? My experience is always hit and miss when using Linux perf.
A more general question: looking at the optimizations needed to saturate a 100G NIC, it seems to me that you guys are fundamentally approaching the scalability limits of the I/O and networking model of current OSes. How much do you think the current network stack can be squeezed (200G, 300G?) before a complete redesign is necessary? Is there any OS out there already designed specifically for this kind of I/O and network performance (AIX, Solaris maybe?)
Could you elaborate on these? What is "packet pacing"? I'm not familiar with the term. Also, how does the NIC queue backpressure work? I don't see references to these in the post. Thanks.
I'm not sure the Chelsio cxgbe driver supports 100 GbE yet (I don't think it does), and I'm pretty certain there aren't yet any drivers for 100 GbE NICs from any of the others. I don't have a need for anything close to 100 GbE, so I haven't really kept up on this; there could be by now, but even if there are, Mellanox was first.
Chelsio has the more technically interesting product to me so far, although they weren't as quick to market for 25/100G. Their T6 line does the same line speeds as the ConnectX-4/5. Chelsio prices are haggle-free and puzzlingly low. I think vendors like Broadcom, Intel, and Mellanox put a ton of wiggle room in their NIC pricing to give your procurement person a nice softball win on how much they "saved" off MSRP. Meh, I'd rather not play games.
I've done kernel work in both OSes, and I'm pretty much agnostic myself. I find the biggest drawback to FreeBSD is hardware support. Netflix has actually driven a fair amount of vendors to provide FreeBSD support due to our using FreeBSD.
I will say that FreeBSD has a few things that Linux doesn't, specifically async sendfile. This ends up being far more efficient than either a thread pool, or aio daemons for handling IO completions. Async sendfile is also one of the things that drove us to implement TLS encryption in the kernel, and is the foundation that pretty much our entire stack rests upon.
Yes, and the rest of us FreeBSD users greatly appreciate you for that! Please keep using FreeBSD and pushing these vendors and this household will keep throwing 2 x $10/month at Netflix. :-)
So you go to a Linux conference looking for people. When you get there the conference is crowded with idiots, advertisers, and companies. No way to find the smart people.
Now you go to a FreeBSD conference. It is almost empty, but the majority of the people know what they are doing. You only need to be interesting, have a clue, and buy a beer to get all the attention you want. ;-)
It also had very good tracing tools before Linux did, so reasoning about performance bottlenecks is much easier.
The networking stack has also handily beat Linux for as long as I can remember.
And they have a huge number of very smart people working on it. I've had the privilege of working with some of them at Google, and have nothing but respect for them.
Can you quantify what "beat" means here? Do you have a citation for this? I am doubting this is true in 2017.
None of the TOP500 use FreeBSD. I would have thought performance would be a primary consideration for a supercomputer. Most of them are running Linux (99.6%!).
Edit: just noticed someone else asked the DPDK one, so feel free to ignore that.
However, I'm not very bullish on just plain crypto accelerator mode. We've tried other vendor's accelerators, and while they save some CPU, memory bandwidth is the real issue for us. The data still needs to be encrypted, so it still needs to be read from memory by DMA rather than a CPU read, and written via a DMA write rather than a CPU write.
Unfortunately exposing my naivety on both NICs and TLS: I've never understood why NICs couldn't support the crypto in a way that just sends the packet straight to the wire after encryption. Is there some deep (unfixable) reason in the TLS/TCP protocols, or just a lack of foresight in the NIC designs?
The bigger thing is that TLS is far from ideal for video content, and for TCP transmission in general.
It is much better to incorporate a degree of redundant coding and tolerance to lost packets than to rely on retransmission.
Since you deal with real-time content, there is no real need for flow control: you either get packets in time, or your video feed cuts out.
Another point: for any genuinely live transmission, multicast beats everything for efficiency, but why it is so hard to run multicast over the open Internet is a discussion of its own.
That's how the Chelsio T6 works. On the transmit path the NIC needs to perform encryption, TLS framing, and TCP/IP/Ethernet segmentation in that order; on receive it needs to perform TCP reordering and such before decrypting. Most NICs just don't want to add that much functionality.
Here's dreaming of a future where Ed25519/ChaCha20/Poly1305 is supported in hardware (it's much cheaper for clients that don't have AES, so I want servers to use it).
Have you considered moving into userland? E.g., DPDK/netmap/etc. plus the BSD stack extracted?
My impression was that profiling userland applications is easier too, but I haven't done any serious kernel profiling so I might be wrong.
The hardest part is, of course, ripping the stack out and keeping it up to date with the mainline kernel afterwards if you need TCP.
Userspace networking is a really big win for packet processing, which doesn't have any of the above concerns. On FreeBSD with netmap and VALE you can chain things together in really interesting ways, using kernel networking where it's advantageous and userland networking where that wins; see the sketch below.
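For a taste, the classic netmap receive loop is tiny. A sketch using the nm_open() wrappers (the interface name is just an example, error handling trimmed):

    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <poll.h>
    #include <stdio.h>

    /* Count packets arriving on em0 via netmap. "netmap:em0" detaches
     * the NIC from the host stack; "vale0:p0" would attach to a VALE
     * software switch instead, leaving the kernel stack in place for
     * the ports that need it. */
    int
    main(void)
    {
        struct nm_desc *d;
        struct nm_pkthdr h;
        struct pollfd pfd;
        unsigned long count = 0;

        d = nm_open("netmap:em0", NULL, 0, NULL);
        if (d == NULL)
            return (1);
        pfd.fd = NETMAP_FD(d);
        pfd.events = POLLIN;
        for (;;) {
            poll(&pfd, 1, -1);
            while (nm_nextpkt(d, &h) != NULL)  /* next rx buffer */
                count++;
            printf("%lu packets so far\n", count);
        }
    }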
Wouldn't any benefit be interesting? Or is the engineering overhead just too expensive?