
NUMA Siloing in the FreeBSD Network Stack [pdf] - ksec
https://people.freebsd.org/~gallatin/talks/euro2019.pdf
======
mosselman
Isn't it great that we work in an industry where a company like Netflix, which
has increasing competition lately, just shares potential competitive
advantages like this?

I get that, in part, this is a way to get people excited to work for Netflix. But the people working on this are probably pretty proud of what they achieved and like to share it with us, their colleagues at other workplaces, and they have the freedom to do so because of our industry's great track record of knowledge sharing.

~~~
robbyt
Realistically, Disney, or whichever competitor, is not going to have enough
technical expertise to deploy FreeBSD.

~~~
o-__-o
I trained the entire Disney offshore replacement team. They understand FreeBSD
very well.

~~~
m11r
Out of curiosity, does this mean that they’re using (or have used) FreeBSD in
their infrastructure or for some other purpose, or just a high level of Unix
competence in that team? Any interesting details that you’re able to share?

------
drewg123
Author here: The talk will be on YouTube eventually, and a lot of points are
explained in more detail in the actual talk. I was just going to bed in
advance of traveling back to the States tomorrow, but I'll try to answer any
questions in the morning.

~~~
floatboth
Why is the hw.model on EPYC redacted, when the chips are listed in the slides?
Secret engineering sample? :)

~~~
cperciva
He said in the talk that yes, the chips used in these measurements were not
publicly released. IIRC he said that they were slightly lower clock speed than
the publicly available chips.

------
bluedino
Nice to see AMD replacing Intel; they've gone with EPYC 7551 & 7502P, from 2x
Intel “Skylake” / “Cascade Lake” Xeons.

edit: it looks like they hit ~200 Gb/s with both Intel and AMD

~~~
morning_gelato
It's disappointing to hear that AMD has not provided adequate performance
monitoring tools though. I would have thought by the second generation of Zen
that this would be an area that AMD had given attention to.

~~~
greglindahl
AMD's engineering team is much smaller, and they've always been behind on
performance tools. HPC people have been complaining ever since Opteron. It's
probably not going to change.

------
toast0
I wonder how this would compare with Receive Side Scaling (RSS), which you can
use to pin NIC queues to CPUs, and then pin the rest of the handling to the
same CPU, avoiding a significant amount of interprocess communication. NUMA
concerns may be more important than CPU pinning within a domain, though.

~~~
cthalupa
RSS is more about spreading a workload over multiple cores to not overwhelm
any single core, particularly in high PPS situations.

This is about PCI-E devices being hooked up to a NUMA node to avoid saturating
the link between nodes. There's a limited amount of bandwidth, and crossing
nodes saturates this and increases latency, both of which will have limiting
effects on your total possible throughput.

RSS configuration doesn't require you to set up your hardware in any specific
way; with this, you need to ensure that each set of disks and its NIC are
hooked up to the same NUMA domain. E.g., if you place the two NICs on the same
NUMA node, no software configuration is going to fix that, and you'd have to
go and physically rearrange things to fix it.

You might still use RSS to distribute the workload across multiple cores
within that NUMA domain when using this setup.
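
Loosely illustrating the software side of that (this is not from the talk): on FreeBSD a worker thread can ask the kernel which CPUs belong to a given NUMA domain and pin itself to them via cpuset(2). The domain number and the idea of matching it to whatever domain the NIC/NVMe hang off are assumptions for the sketch.

    /*
     * Hedged sketch: pin the current thread to the CPUs of one NUMA domain
     * so the connections it services stay local to that domain's NIC and
     * NVMe drives.  Domain 0 is a placeholder; a real setup would look the
     * domain up from the device topology.
     */
    #include <sys/param.h>
    #include <sys/cpuset.h>
    #include <err.h>

    int
    main(void)
    {
        cpuset_t mask;
        const int domain = 0;       /* assumed: domain the NIC hangs off */

        CPU_ZERO(&mask);
        /* Ask which CPUs belong to that NUMA domain (CPU_WHICH_DOMAIN)... */
        if (cpuset_getaffinity(CPU_LEVEL_WHICH, CPU_WHICH_DOMAIN, domain,
                sizeof(mask), &mask) != 0)
            err(1, "cpuset_getaffinity");
        /* ...and restrict this thread (-1 = current tid) to those CPUs. */
        if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
                sizeof(mask), &mask) != 0)
            err(1, "cpuset_setaffinity");
        return (0);
    }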

------
amelius
If they are just serving/streaming static data from a single server, wouldn't
an entirely different architecture make more sense? For example, why even use
a CPU? They have the financial resources to build their own hardware, e.g.
encryption ASICs, and they can do interesting things like bundle the pipeline
for multiple viewers watching the same movie at the same timepoint.

~~~
virtuallynathan
Commodity server gear is really cheap, and is more or less "fast enough". The
wins from serving, say, >1 Tbps from a 1U box with an ASIC or FPGA are negated
by the downsides of having such a large failure domain, and the cost of
development for what would be a pretty low-volume part.

We talk about this stuff every once in a while, but it doesn't really make
sense to do right now. (Disclaimer: I work at Netflix)

------
mikece
Is it still accurate that, for network-bound servers, FreeBSD outperforms
Linux?

~~~
GhettoMaestro
I think once you get to a certain traffic level you are forced to do kernel-
bypass stuff like DPDK, regardless of whether Linux or FreeBSD is the kernel.

~~~
the8472
If it's just serving files you don't necessarily need DPDK/XDP. For server-
grade hardware there are now P2P DMA and TLS accelerators which can offload
everything to peripherals while still using normal socket APIs. You get NVMe
-(PCIe)-> crypto accelerator -(PCIe)-> ethernet for the bulk of the data.

Neither the CPU nor main memory sees any of the network packets as long as
they stay on the happy path. Only connection setup, DMA orchestration and
occasional TLS renegotiation have to be handled.
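
For the curious, a minimal sketch of what that "normal socket API" side can look like on FreeBSD: the server just sendfile(2)s the file over an already-connected socket and lets the kernel (plus kTLS / NIC or accelerator offload, where configured) move the bytes. The socket/file setup and the offload configuration are assumed to happen elsewhere.

    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #include <err.h>

    /*
     * Send `len` bytes of `filefd` starting at `off` down `sockfd`.
     * Blocking socket assumed; a real server would handle EAGAIN and
     * partial sends.  Zero-copy: no read()/write() bounce through
     * userspace buffers.
     */
    static void
    serve_range(int filefd, int sockfd, off_t off, size_t len)
    {
        off_t sent = 0;

        if (sendfile(filefd, sockfd, off, len, NULL, &sent, 0) == -1)
            err(1, "sendfile");
    }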

~~~
floatboth
I know Chelsio has crypto directly on the NIC, but are dedicated crypto
accelerator cards a thing and are they ever worth it? Why leave the CPU idle
when the CPU itself is a good crypto accelerator (AES-NI, ARMv8 crypto)?

AMD Ryzen has a built-in crypto "decelerator" — a FreeBSD driver was written
for the crypto engine, but it's disabled by default because it made everything
slower than AES-NI. (Though I guess it would be funny to use it to mine
bitcoin, since it supports SHA256. AMD — Advanced Mining Devices!)

~~~
loeg
Intel has a product line called QAT ("QuickAssist Technology") that does
crypto acceleration, as well as compression. I don't know how performant it
is. There are definitely several older crypto accelerators that were faster
than CPUs of the time; I don't know if any of them (outside of QAT) are still
relevant.

The AMD Zen1 Crypto Co-Processor is indeed slower than AES-NI; I think it's
mostly used by stuff like Secure Boot, TPM, etc., and also used internally by
the CPU to generate RDRAND/RDSEED data. It was probably never intended to be
used by OS drivers, and certainly not intended to be any kind of accelerator.

~~~
shaklee3
Supposedly QAT is built into the chipsets of Skylake and above now. I've never
seen anyone try it, though.

~~~
loeg
By chipset, you mean northbridge? Or the CPU?

The part I know of that is built into the CPU is a DMA engine called I/OAT; it
just does DMA and maybe basic checksum and RAID transformations. It is
sometimes confused with QAT (I've personally confused the two...):

[https://www.intel.com/content/www/us/en/wireless-network/acc...](https://www.intel.com/content/www/us/en/wireless-network/accel-technology.html)

~~~
shaklee3
The northbridge. My understanding is that they no longer sell the discrete
cards to perform these tasks, and instead offload it to chips that come on the
boards.

~~~
loeg
That might be true, but you can still buy the cards from 3rd-party resellers,
e.g., [https://www.newegg.com/p/2AS-006R-00046](https://www.newegg.com/p/2AS-006R-00046).

------
jdsully
The Linux network stack has been the bane of my existence trying to squeeze
more performance out of KeyDB. I really hope it gets this kind of love in the
future.

On Linux there’s a spinlock in do_softirq that blocks even with non-blocking
IO.

~~~
shaklee3
Why not use DPDK? All of these problems go away, and people have reported
hitting 1Tbps on a single node.
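
(For readers who haven't seen it, the DPDK side is roughly the shape below - a bare-bones poll-mode RX loop in the spirit of DPDK's own "skeleton" example, not anything KeyDB or Netflix actually runs. Port 0 and the pool sizing are arbitrary, and TX plus most error handling are omitted.)

    #include <stdint.h>
    #include <stdlib.h>

    #include <rte_eal.h>
    #include <rte_debug.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define NUM_MBUFS   8191
    #define MBUF_CACHE  250
    #define RX_RING     1024
    #define BURST       32

    int
    main(int argc, char **argv)
    {
        struct rte_eth_conf port_conf = { 0 };
        struct rte_mempool *pool;
        struct rte_mbuf *bufs[BURST];
        uint16_t port = 0, nb, i;

        if (rte_eal_init(argc, argv) < 0)
            rte_exit(EXIT_FAILURE, "EAL init failed\n");

        /* Packet buffer pool, allocated on the port's NUMA socket. */
        pool = rte_pktmbuf_pool_create("mbuf_pool", NUM_MBUFS, MBUF_CACHE, 0,
            RTE_MBUF_DEFAULT_BUF_SIZE, rte_eth_dev_socket_id(port));
        if (pool == NULL)
            rte_exit(EXIT_FAILURE, "mbuf pool failed\n");

        /* One RX queue, no TX queues, default config. */
        rte_eth_dev_configure(port, 1, 0, &port_conf);
        rte_eth_rx_queue_setup(port, 0, RX_RING,
            rte_eth_dev_socket_id(port), NULL, pool);
        rte_eth_dev_start(port);

        for (;;) {
            /* Poll-mode receive: no interrupts, no softirqs. */
            nb = rte_eth_rx_burst(port, 0, bufs, BURST);
            for (i = 0; i < nb; i++)
                rte_pktmbuf_free(bufs[i]);  /* ...process, then free. */
        }
    }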

~~~
jdsully
We have!

[https://docs.keydb.dev/blog/2019/06/17/blog-
post/](https://docs.keydb.dev/blog/2019/06/17/blog-post/)

Still, though, I hate seeing needless waste. There’s no reason the active
thread needs to block on a softirq when there are unused cores that could
process them.

~~~
shaklee3
Can you move all softirqs onto different cores? That's usually one of the DPDK
tuning steps.

~~~
jdsully
My reading of the source code is that there is no way to prevent this. The
spinlock is guarding the check which determines if there is work to do.

It’s been a few months and I had intended to go back and try using a
try_lock(). But I’m not normally a kernel dev.
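
(A generic illustration of the pattern jdsully describes, in plain pthreads terms rather than the actual kernel code; the names and the `work_pending` flag are invented for the sketch.)

    #include <pthread.h>
    #include <stdbool.h>

    static pthread_spinlock_t work_lock;   /* pthread_spin_init()'d at startup */
    static bool work_pending;

    /*
     * Instead of spinning on the lock that guards the "is there work?"
     * check, a trylock lets the active thread skip the check and keep
     * running when another core already holds the lock.
     */
    static void
    maybe_drain_work(void)
    {
        if (pthread_spin_trylock(&work_lock) != 0)
            return;             /* someone else is draining; don't block */
        if (work_pending) {
            /* ... drain the queue ... */
            work_pending = false;
        }
        pthread_spin_unlock(&work_lock);
    }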

------
thomasjudge
This may be a dumb question, but why would streaming video need to be
encrypted? Is this just part of the "encrypt everything" best practice these
days? Is there metadata accompanying the video data that shouldn't be
unencrypted? Is it just so it's not possible to eavesdrop on the fact that I'm
watching "Marvelous Mrs. Maisel"?

~~~
0xdeadb00f
The main reason is probably DRM.

~~~
floatboth
This is about TLS. DRM would probably be another layer of encryption inside?

~~~
o-__-o
TLS can be used as a form of DRM, preventing MITM caching attempts. However,
that also requires a client that cares.

~~~
loeg
That isn't what's going on here.

------
bt848
So why is it important to have a multisocket NUMA machine? Why not just save
yourself a lot of hassle by having one socket? I know that the previous
generation AMD machine had unavoidable NUMA but the new one doesn't.

~~~
loeg
This talk is about Zen+ Epyc, not Zen2 (which is where the non-cache memory
gets uniform). I don't know if they have release quality Epyc 7003 (Zen2)
samples available yet, and if they do, NFLX probably isn't allowed to publish
benchmarks about them. There's almost certainly still some value in their
existing NUMA work even on Zen2, as things like L1/L2/L3 cache have locality
even if memory and PCIe does not.

Pretty sure Intel single socket of this generation is totally non-viable for
this workload due to lack of PCIe lanes. Maybe viable when Intel gets gen4
PCIe.

~~~
bt848
Skylake-X has 44 PCIe 3.0 lanes; that's 352 GT/s, or about 345 Gbps of
application bandwidth. It's certainly more than enough to push 100 Gbps from
disk to net. These guys are pushing 200 Gbps, but they're doing it with two
CPUs, two sets of NVMe devices, two NICs, and a bunch of hacks to make the
operating system pretend all this stuff is not in the same box. It seems way
more straightforward to me if they had made it all actually NOT be in the same
box!

~~~
loeg
> Skylake-X has 44 PCIe 3.0 lanes; that's 352 GT/s, or about 345 Gbps of
> application bandwidth. It's certainly more than enough to push 100 Gbps from
> disk to net. These guys are pushing 200 Gbps

We're in total agreement :-). Their dataflow model requires something like 2x
that in PCIe bandwidth and 4x in memory bandwidth in the optimal case, as
covered in the slides. 2x 200 Gbps = 400 Gbps, which is a bit more than 345
Gbps.

Maybe they could push 345/2 = ~172 Gbps out of a single Skylake-X, best case.
For some workloads, that might be the right local optimum! They must have
decided that the marginal cost of a 2P system was worth the extra ~25 Gbps to
saturate the 200 Gbps pipe fully.
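
(Spelling out the arithmetic behind those figures, assuming PCIe 3.0's 8 GT/s
per lane with 128b/130b encoding and ignoring TLP/protocol overhead:

    44 \times 8\ \mathrm{GT/s} \times \tfrac{128}{130} \approx 346\ \mathrm{Gb/s},
    \qquad \tfrac{346\ \mathrm{Gb/s}}{2} \approx 173\ \mathrm{Gb/s}

where the factor of 2 is the disk-to-RAM plus RAM-to-NIC double traversal of
the PCIe fabric.)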

> they're doing it with two CPUs, two sets of NVMe devices, and two NICs, and
> a bunch of hacks to make the operating system pretend all this stuff is not
> in the same box. It seems way more straight-forward to me if they had made
> it all be actually NOT in the same box!

I've spoken with NFLX engineers in the past and my recollection is that in
many installations, NFLX only gets to install one box. (Or something like
that. Might just be a cost thing.) So they need to make that one box fast.

I guess the other factor is the IP management overhead discussed in the
slides. Two boxes necessitate the costly 2nd IP, as far as I know. It's hard
to imagine the cost of an IP address dominating the marginal cost of a
2-socket system and a 2nd Xeon, but I guess AWS is friggin expensive.

------
byefruit
I assumed these were cache servers that went into ISPs; does anyone know what
they mean by "Increases AWS cloud management overhead" on a couple of the
slides?

~~~
tiernano
Guessing that they have managed services in AWS for maintaining all the nodes
and tracking where they are. If each box has 2x the IPs, they would need more
mgmt resources to keep track of what is on that box, whether it's online,
stats, etc.

~~~
drewg123
Yes, exactly. When we first started the project, I was told that multiple IPs
were not an option for that reason.

~~~
bluedino
Was that a technical or administrative restriction? I'm interested in how an
organization makes a decision like that, and then what channels are used to
reverse it.

------
alinspired
What progress! In 2015 they were serving only ~9 Gb/s of HTTPS per server:
[https://people.freebsd.org/~rrs/asiabsd_2015_tls.pdf](https://people.freebsd.org/~rrs/asiabsd_2015_tls.pdf)

------
betaby
Are those changes in 12.1? If not, what would be a time frame?

~~~
loeg
Not all of them have landed in CURRENT yet, much less stable/12. They'll be in
13.0; I can't speak to any future stable 12.2. On the other hand, CURRENT is
pretty solid. A lot of folks, Netflix included, just run CURRENT.

~~~
yjftsjthsd-h
Wait, they run CURRENT _in production?_ Is that... safe?

~~~
morning_gelato
There's at least one video online [1] that talks about Netflix's process for
internal FreeBSD releases. They do 5 weeks of development and then 5 weeks of
testing.

[1] [https://www.youtube.com/watch?v=vcyQBup-Gto](https://www.youtube.com/watch?v=vcyQBup-Gto) (about the 12 minute mark)

------
tiffanyh
It’s my understanding that many organizations prefer *BSDs over Linux not
because BSD is more performant ... but because you’re way more likely to get
your organization's patch accepted upstream than you are with Linux.

~~~
bsder
I doubt that is the case. The FreeBSD guys can be just as hardass about
submissions as the Linux folks. However, there are considerations:

1) GPL vs BSD licenses. Companies like BSD licenses much more than GPL
licenses. GPL adherents can whine all they want, but this is simply true.

2) *BSD has a long history of having very good networking stacks--albeit on
specific hardware. Linux supported _everything_ initially--including really
cheap crap--and consequently its networking stack was a lot more ad hoc.
FreeBSD chose specific hardware for stability--but then supported that much
more completely.

3) FreeBSD has a long history of powering servers in Internet infrastructure.
There are specific architectural choices in the kernel because of this. There
is probably still some inertia, too, in that the kind of old-guard people who
_REALLY_ grok networking are still more comfortable on FreeBSD machines.

Consequently, it is hardly surprising that an advanced networking development
would take place on FreeBSD.

~~~
throw0101a
Re: 1 and 2.

Given that there are vendors that use FreeBSD for their appliances, they
really don't want to have to send out techs to customers' sites to fix things.
So when the appliance makers choose hardware, they talk to component vendors
about quality.

It's no surprise that you see commits from Intel and Chelsio employees in the
FreeBSD logs: companies like Netflix, Isilon, NetApp, and Juniper partner with
them to make sure things aren't buggy.

These collaborations lead to point 3.

