
Serving 100 Gbps from an Open Connect Appliance - drewg123
https://medium.com/netflix-techblog/serving-100-gbps-from-an-open-connect-appliance-cdb51dda3b99
======
luckydude
This is a fantastic read; I highly recommend it if you like kernel performance
stuff.

I did much the same sort of work in SunOS ~25 years ago. Very similar: I fixed
the file system so we could run at platter speed, and then the VM system
couldn't keep up. I wrote over a dozen experimental pageout daemons before I
came to the conclusion that the stock one was as good as anything I could come
up with.

I ended up "solving" the problem in the read-ahead logic: when memory was
starting to get tight I "freed behind" so the file being streamed wouldn't
swamp memory. It was a crappy answer but it worked well enough. I suspect
there is still an LMXXX comment in ufs_getpage() about it.

One thing that I did, which never shipped, was to keep track of how many
clean/dirty pages were associated with a particular vnode. I wrote a "topvn"
that worked like top but looked at vnodes instead of processes. Shannon
nixed it, actually took it out of the kernel, because I had added a vp->basename
that was "a" basename of the vnode. He didn't like that hard links created
confusion, so he shit-canned the whole thing. If anyone has the SCCS history
I'm pretty sure it's in there, though it might be 4.x (SunOS) instead of
Solaris. I think the full set of things I added was:

    
    
        vp->basename;      /* "a" basename of the vnode */
        vp->pages_clean;   /* count of clean pages backed by this vnode */
        vp->pages_dirty;   /* count of dirty pages backed by this vnode */
        vp->last_fault;    /* timestamp of the last time we mapped/faulted a page of this vnode */
    

The advantage of that info is that you can very, very quickly find pages to be
released. If a vnode is all dirty pages you skip that one; if it's all clean
pages and last_fault is old, dump 'em all. It's way the heck faster than
scanning each page.
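
Roughly, the pageout pass I had in mind would have looked something like this
(a sketch from memory, it never shipped; the struct layout, the v_next link,
and vnode_free_all_clean() are made up for illustration):

    #include <sys/types.h>
    #include <time.h>

    /* Hypothetical vnode carrying the summary fields from the list above. */
    struct vnode {
        struct vnode *v_next;    /* next vnode on the active list (made up) */
        char *basename;          /* "a" basename of the vnode */
        u_long pages_clean;      /* clean pages backed by this vnode */
        u_long pages_dirty;      /* dirty pages backed by this vnode */
        time_t last_fault;       /* last time a page of this vnode was mapped/faulted */
    };

    void vnode_free_all_clean(struct vnode *vp);  /* made up: release every clean page at once */

    /* Sketch of a vnode-level pageout pass: decide per file, not per page. */
    void
    pageout_scan_vnodes(struct vnode *active, time_t now, time_t idle_secs)
    {
        for (struct vnode *vp = active; vp != NULL; vp = vp->v_next) {
            if (vp->pages_clean == 0)
                continue;                          /* all dirty: skip this one */
            if (vp->pages_dirty == 0 && now - vp->last_fault > idle_secs)
                vnode_free_all_clean(vp);          /* all clean and idle: dump 'em all */
        }
    }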

~~~
drewg123
Thanks. That is a heck of a compliment coming from you! I never used
BitKeeper, but I've used lmbench since the 90s.

Did you get so far as to implement a pageout daemon that scanned vnodes like
that? We send only via sendfile, and we give the kernel hints about what
should be freed using the SF_NOCACHE flag. This helps a _lot_. When we think
we're serving something cold based on various weighted popularity rankings, we
pass the SF_NOCACHE flag to sendfile(). This causes the page to be released
immediately when the last mbuf referencing it is freed.
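
For anyone who hasn't used the interface, the pattern is roughly this (a
minimal sketch, not our production code; error handling trimmed):

    /*
     * Serve a file over a socket with FreeBSD's sendfile(2), asking for the
     * pages to be dropped once the last mbuf referencing them is freed.
     */
    #include <sys/types.h>
    #include <sys/socket.h>   /* SF_NOCACHE */
    #include <sys/uio.h>      /* sendfile(2) */
    #include <fcntl.h>
    #include <unistd.h>

    static int
    serve_cold_file(const char *path, int sock)
    {
        off_t sent = 0;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return (-1);
        /* nbytes == 0 means "send the whole file"; SF_NOCACHE says we don't
         * expect this content to be requested again soon. */
        int error = sendfile(fd, sock, 0, 0, NULL, &sent, SF_NOCACHE);
        close(fd);
        return (error);
    }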

~~~
luckydude
No, I lost interest when Shannon pulled it out. I'm dancing with a company
that might hire me to do some performance work in this area so maybe I'll redo
that code. I don't think it is hard and I'd do it in FreeBSD. And bring back
topvn.

We've got so much memory these days it might help.

But... back in the day the top vnode was always swap because everyone used
the same swap vnode. Howard Chartok (Mr Swapfs) and I discussed at length the
idea of a swap vnode per process group. You want the set of processes that
work together to use the same vnode so you have some idea of whether it has
gone idle. Just imagine the stats the pageout daemon looks at being
summarized in the vnode: you want atime, mtime, dirty, clean, etc.

I suspect for your workload the swap vnode isn't an issue.

If the pageout daemon is still like it was, then it's crazy. 4K pages, 128GB of
RAM, that's ~33 million pages to scan. If you summarize that, you can find
stuff to free really fast. And probably drop per-file hints in there for the
pageout daemon (like you are doing).

I'd be happy to talk it over over a beer or something. I'm at the usual email
addresses.

------
iUsedToCode
God, I love human ingenuity. You are real engineers. I only have experience
doing real high-level stuff -- the easy parts -- so I didn't follow most of the
text. It's impressive that there are people who understand the boxes we all
take for granted and can fine-tune them.

It's also awesome how it all isn't a huge, inexplicable mess. I can't make a
CRUD PHP app without dirty hacks, yet you mess with 40 years of programming
effort by thousands of people and it still behaves sanely.

When the AI comes, will it appreciate how hard we tried? I sure hope so.

Sorry for being off topic. IT is great and it's stuff like this that reminds
me of it.

~~~
luckydude
If it helps you at all, I just wrote this up for a guy I work with who hasn't
done kernel programming. I started learning how to whack on the kernel about 30
years ago, but I still remember the horror of being stuck at an adb prompt and
having absolutely no clue what to do next :)

I'm reading through the ZFS code and I can see why the kernel is intimidating:
all this state you have to gather up to make sense of it. One thing that helps
is that there are patterns. Just like device drivers, file systems all mostly
lock the same way and follow a certain pattern. You can blindly follow that and
get stuff done. Eventually you have to understand what you are doing, but you'd
be amazed at how far you can go faking it. That's what I did while I was
learning, and I did tons of useful work sort of "blind". Eventually stuff comes
into focus; the architecture comes first, then the arcane details (usually). And
even though I was working in the file system code, there was some stuff (the
whole hat_ layer) that I never bothered to learn/memorize; it just worked, I
wasn't changing it, shrug. I have a pretty good idea what it was doing at the
general level but would have to go learn the details if I wanted to change it.

Kernel hacking is fun and apparently isn't that common a skill any more;
people like the comfort of userland. I'm no rocket scientist and I got pretty
comfortable in SunOS, IRIX, Sys III, Sys V, etc. Unless you are trying to
rewrite the whole thing in a clean room, it's really not that hard. It's hard
to know all the details about everything, but it is rare that you need to (and
even more rare to find someone who knows all that stuff).

If this sort of thing seems interesting, you should grab a kernel and figure
out how to build and install it, make a new syscall called im_a_stud() that
does some random thing, add it, call it. Off you go :)
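
On Linux, for example, the kernel side of that exercise is only a few lines;
the fiddly part is picking a number and wiring it into the syscall table
(arch/x86/entry/syscalls/syscall_64.tbl on amd64). A rough sketch:

    /* Kernel side: a do-nothing syscall, Linux flavor. Wiring it up also
     * means adding a line to the syscall table and rebuilding the kernel. */
    #include <linux/kernel.h>
    #include <linux/syscalls.h>

    SYSCALL_DEFINE1(im_a_stud, int, n)
    {
        pr_info("im_a_stud(%d) called\n", n);
        return n + 1;
    }

From userland, until a libc wrapper exists, you call it by number with
syscall(2), e.g. syscall(<whatever number you assigned>, 41).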

~~~
kev009
This is fabulous, you need to post it somewhere more permanent :)

~~~
luckydude
If one guy does what I suggested, I'll start a blog. But people mostly just
read; they don't do. I'll do it if they do; I'd love to be helping people do
more. I'm old; it's the kids that need to take on the task. So to be clear,
I'll blog if someone adds a syscall and figures out how to call it. FreeBSD,
Linux, Solaris, hell, Windows (but I'll need to be educated enough to know
that they did it), whatever OS makes you happy.

~~~
luckydude
BTW, I was being sort of a bitch about that. If someone wants to add a syscall
and figure out how to call it, I'll help. I'll have to look up the details, but
I've done it before; it's not that hard. So hit me up if you want to do it
and/or make me blog.

------
hpcjoe
Cool result! I've done similar things in the past with a previous company's
units[1] using Mellanox cards. We worked quite a bit with Chelsio as well, and
have shown nice results there. I am surprised on the spinning disk front
though, as I was showing off about 5.5-6GB/s for that company's 60-drive-bay
units 3 years ago[2]. This was a single PCIe gen3 x8 NIC; we were bound
by the IB network (56Gb) performance, and had about 2GB/s of extra headroom on
the pure spinning-disk systems.

[1] https://scalability.org/2016/03/not-even-breaking-a-sweat-10gbs-write-to-single-node-forte-unit-over-100gb-net-realhyperconverged-hpc-storage/

[2] https://scalability.org/2014/10/massive-unapologetic-firepower-part-2-the-dashboard/

~~~
shaklee3
Curious, do you know of anyone using Chelsio in production? Their cards look
awesome and have tons of features that put them on par with Mellanox, but I
haven't really heard of anyone using them.

~~~
hpcjoe
A financial *aaS provider whose infrastructure we built used them.

The cards are very good, I am looking at them for current (platform) projects.

Generally they are one of two choices for very high performance networking on
multiple OSes. The other being Mellanox of course.

------
drewg123
Author here, willing to answer questions..

~~~
redm
The question that popped to my head was why FreeBSD? I stopped using it for
anything back in 2006. Since much of this tuning/debugging/analysis relates to
the OS/kernel shouldnt the article start there?

~~~
philjohn
Netflix have been heavy users of FreeBSD for a long time.

It also had very good tracing tools before Linux did, so reasoning about
performance bottlenecks is much easier.

The networking stack has also handily beaten Linux's for as long as I can
remember.

~~~
jaak
> The networking stack has also handily beaten Linux's for as long as I can
> remember.

None of the TOP500 use FreeBSD [1]. I would have thought this would be a
primary consideration for a supercomputer. Most of them are using Linux
(99.6%!).

[1] https://www.top500.org/statistics/details/osfam/4

~~~
wmf
HPC generally doesn't use IP, so they're using Linux but not the Linux network
stack.

------
IronWolve
I was playing around with KVM and I was getting 54 gigs/sec on the bus between
VMs on virtio. This was 4 years ago, so 52 gigs is very close to my results. I
was OC'ed to 4GHz on an Asus board, but I wasn't going over PCI Express, just
virtio to the VMs.

------
bhhaskin
Thanks for the write-up! Some interesting stuff.

------
alexforster
DPDK (userspace networking) and SPDK (userspace storage) seem like a perfect
fit for this. Both even support FreeBSD!

------
z3t4
memory speed as bottleneck ...

