
Some notes on high speed networking on PCs - fanf2
https://mailman.nanog.org/pipermail/nanog/2018-June/095728.html
======
walterbell
_> the kernel community is working on AF_XDP_

See recent Intel presentation on Time Sensitive Networking,
[https://schd.ws/hosted_files/elciotna18/b6/ELC-2018-USA-TSNonLinux.pdf](https://schd.ws/hosted_files/elciotna18/b6/ELC-2018-USA-TSNonLinux.pdf)

~~~
lossolo
TSN is interesting for special use cases like the one presented there, but I
have the impression that the two are solving different problems: TSN focuses
on deterministic, precise packet timing and low latency while sacrificing
throughput, whereas AF_XDP focuses mainly on throughput.
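
For a flavour of the AF_XDP side, here is a rough sketch of the socket setup
against the <linux/if_xdp.h> UAPI merged around Linux 4.18. The interface was
still settling when this thread was written, so treat constants and layouts
as illustrative; the interface name and ring sizes here are arbitrary:

    /* Rough AF_XDP setup sketch (Linux 4.18-era UAPI). */
    #include <linux/if_xdp.h>
    #include <net/if.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #ifndef AF_XDP              /* libc headers may predate AF_XDP */
    #define AF_XDP 44
    #endif
    #ifndef SOL_XDP
    #define SOL_XDP 283
    #endif

    #define NUM_FRAMES 4096
    #define FRAME_SIZE 2048

    int main(void)
    {
        /* Page-aligned UMEM: packet buffer memory shared with the kernel. */
        void *umem_area;
        if (posix_memalign(&umem_area, getpagesize(),
                           (size_t)NUM_FRAMES * FRAME_SIZE))
            return 1;

        int fd = socket(AF_XDP, SOCK_RAW, 0);
        if (fd < 0)
            return 1;

        /* Register the UMEM, then size the fill and RX rings. */
        struct xdp_umem_reg umem = {
            .addr       = (uint64_t)(uintptr_t)umem_area,
            .len        = (uint64_t)NUM_FRAMES * FRAME_SIZE,
            .chunk_size = FRAME_SIZE,
            .headroom   = 0,
        };
        setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &umem, sizeof(umem));
        int entries = 1024;
        setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, &entries, sizeof(entries));
        setsockopt(fd, SOL_XDP, XDP_RX_RING, &entries, sizeof(entries));

        /* Bind to one RX queue of one interface ("eth0" is a placeholder). */
        struct sockaddr_xdp sxdp = {
            .sxdp_family   = AF_XDP,
            .sxdp_ifindex  = if_nametoindex("eth0"),
            .sxdp_queue_id = 0,
        };
        bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp));

        /* From here the rings are mmap()ed and packets are received by
         * consuming descriptors with no per-packet syscall; that zero-copy
         * path is where the throughput comes from. */
        return 0;
    }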

------
voltagex_
Do you think we'll ever see 10G networking "just work" at home? I mean, I can
plug a couple of PCs into a relatively cheap switch with relatively cheap
cabling and get somewhere approximating 1 gigabit right now.

~~~
_jal
I know what you mean, but it has gotten a lot cheaper. There are sub-$1K 10G
switches and consumer-ish onboard-10G motherboards.

Probably, something popular that demands 10G-in-the-home needs building, in
order to get volume up so pricing can come down a bit. VR, if it actually
does hit this time, is a candidate.

As far as fragility goes, copper is the way to go at home, unless you're
entirely pet- and child-free or have your own cage in the garage.

~~~
benaadams
> Probably, something popular that demands 10G-in-the-home needs building, in
> order to get volume up so pricing can come down a bit.

HDMI 2.0 for 4K is 14.4 Gbit/s

HDMI 2.1 for 8K is 42.7 Gbit/s
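
Back-of-the-envelope, counting uncompressed 24-bit RGB active pixels only
(blanking and line coding add more on the wire):

    #include <stdio.h>

    int main(void)
    {
        struct { const char *name; double w, h, fps; } m[] = {
            { "4K60", 3840, 2160, 60 }, { "8K60", 7680, 4320, 60 },
        };
        for (int i = 0; i < 2; i++)
            printf("%s: %.1f Gbit/s of pixel data\n", m[i].name,
                   m[i].w * m[i].h * 24 * m[i].fps / 1e9);
        return 0;
    }
    /* 4K60: 11.9 Gbit/s -- with blanking (4400x2250 total timing) it is
     *       ~14.3 Gbit/s, which is where the HDMI 2.0 figure comes from.
     * 8K60: 47.8 Gbit/s -- more than HDMI 2.1's 42.7 Gbit/s payload,
     *       hence DSC compression or 4:2:0 subsampling for 8K60. */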

~~~
exikyut
...how cheap is HDMI capture?

Receiving non-visual data from the GPU has the interesting property that the
shaders and video RAM are generally quite close to the HDMI port, so I could
potentially be "sending" the result of running a shader on data in video
RAM.

Something something using VRAM for caching...?

I've been meaning to figure out how to capture HDMI, at line speed, for a
while. My current thinking is to use a PCIe FPGA capable of bus-master DMA, a
horribly-hacked-apart kernel that doesn't touch a few GB of RAM, and a kernel
driver to "chase" after the gigantic ring buffer the FPGA copies into.

Hmm, I probably wouldn't be able to copy at HDMI 2.1 line rate.

Unless there are very, very scary FPGA contraptions that use TWO PCI-e slots?
:D:D [EDIT: I now realize this would need two PCI-e root complexes. Whoops]

~~~
pjc50
HDMI capture devices are still in the >$100 range, it seems. That usually runs
it straight into an MPEG encoder though.

I believe any FPGA with a decent PCIe implementation should be able to bus-
master, and if you set up the drivers properly you don't have to do anything
with the kernel, it'll just allocate a block of transfer RAM for you.

You can use multiple PCIe _lanes_ easily; if you need more than an x16 slot
I'd be very surprised.

~~~
exikyut
> _HDMI capture devices are still in the >$100 range, it seems. That usually
> runs it straight into an MPEG encoder though._

Right. I wouldn't be surprised if that encoder is a single chip designed to
ingest TMDS, so the 10-40Gb/s of raw data only travels a few mm.

And MPEG encoding is the only reasonable way to handle data of this size,
considering that it compresses 10Gbps down to <80Mbps and typically <10Mbps.

> _I believe any FPGA with a decent PCIe implementation should be able to bus-
> master, and if you set up the drivers properly you don't have to do anything
> with the kernel, it'll just allocate a block of transfer RAM for you._

As for the first part, cool.

With the second part, you're seeing my complete lack of hardware knowledge :)

I was envisaging having the host system "ignore" several GB of RAM - having
Linux's memory manager simply not touch it - and then turning it into a
multi-gigabyte circular buffer to copy into.

Then, the processing/do-whatever code is written as a kernel module that
"chases" wherever the "head" is in the circular buffer (presumably a pointer
written to a fixed memory location). Because the buffer is gigabytes deep,
the kernel driver can stall (for whatever reason) for multiple seconds and
have wild swings in "chase performance" before there are any real issues.
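
A minimal sketch of that "chaser", assuming the FPGA publishes a free-running
byte counter at a fixed address and the reserved region has already been
mapped into the kernel; process_bytes() and all other names here are
hypothetical:

    #include <linux/kernel.h>
    #include <linux/sched.h>
    #include <linux/types.h>

    static void process_bytes(u8 *p, size_t n);  /* hypothetical consumer */

    /* Chase the FPGA's write head around a huge circular buffer. Assumes
     * the RAM was hidden from the allocator (e.g. via a memmap= boot
     * parameter) and mapped with memremap(). */
    static void chase(volatile u64 *head_ptr, u8 *buf, size_t buf_size)
    {
        u64 tail = 0;

        for (;;) {
            u64 head = READ_ONCE(*head_ptr); /* bytes written so far */

            if (head == tail) {
                cond_resched();              /* nothing new; yield */
                continue;
            }
            dma_rmb();       /* order: see the data behind the counter */

            while (tail != head) {
                size_t off   = tail % buf_size;
                size_t chunk = min_t(size_t, head - tail, buf_size - off);

                process_bytes(buf + off, chunk);
                tail += chunk;   /* wraparound falls out of the modulo */
            }
        }
    }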

This design enables an important goal I wanted to implement: pressing PrtScr
or hitting a hardware button flushes the last few seconds of whatever was on
the screen to disk - i.e., you can achieve hardware-level, pixel-perfect
video capture. Frame glitch? Saved. "LOOK, the GPU displayed a single frame
wrong again, agh, you missed it"? Saved. Weird timing glitch in graphics stack
you can't reproduce? Saved. Weird timing/race-condition-related graphics
issues that only happen when your test code is removed and are too fast to
diagnose? Saved. Suddenly you can just spray debug data into the corner of the
screen and analyze the captures later.

Ideally, some simple compare-while-copying code running in the FPGA could
store frame deltas and distil whole-screen updates into a list of changed
rects, as sketched below. _Maybe_ you could even do that on-CPU, as a kernel
task.
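
Something like this tile-diff sketch, say - plain C, with the tile size and
rect bookkeeping invented for illustration:

    #include <stdint.h>
    #include <string.h>

    #define TILE 64

    typedef struct { int x, y, w, h; } rect;

    /* Compare two w x h 24-bit RGB frames tile by tile; record dirty tiles.
     * Returns the number of dirty tiles (up to max_out written to out). */
    static int diff_frames(const uint8_t *prev, const uint8_t *cur,
                           int w, int h, rect *out, int max_out)
    {
        int n = 0, stride = w * 3;

        for (int ty = 0; ty < h; ty += TILE) {
            int th = ty + TILE > h ? h - ty : TILE;
            for (int tx = 0; tx < w; tx += TILE) {
                int tw = tx + TILE > w ? w - tx : TILE;
                for (int row = 0; row < th; row++) {
                    const uint8_t *a = prev + (ty + row) * stride + tx * 3;
                    const uint8_t *b = cur  + (ty + row) * stride + tx * 3;
                    if (memcmp(a, b, (size_t)tw * 3)) {
                        if (n < max_out)
                            out[n] = (rect){ tx, ty, tw, th };
                        n++;
                        break;  /* tile already known dirty; next tile */
                    }
                }
            }
        }
        return n;
    }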

But here's the thing. In the worst case, with no compression/reduction code
present - uncompressed 24-bit RGB at 60fps - you'd be able to store 1 minute
of 1080p in 20.85GB of RAM, and 6 minutes in 125.15GB. 1 minute of 3840x2160
uses 83.42GB, 3 minutes uses 250.28GB. 1 minute of 8K takes 333GB :D
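
Those figures check out for uncompressed 24-bit RGB at 60fps, with GB read
as GiB; a quick check:

    #include <stdio.h>

    int main(void)
    {
        const double GiB = 1024.0 * 1024.0 * 1024.0;
        struct { const char *name; double w, h; } m[] = {
            { "1080p", 1920, 1080 }, { "4K", 3840, 2160 }, { "8K", 7680, 4320 },
        };
        for (int i = 0; i < 3; i++) {
            /* 3 bytes/pixel * 60 fps * 60 s */
            double per_min = m[i].w * m[i].h * 3 * 60 * 60;
            printf("%-6s %7.2f GiB/min (%4.2f GiB/s to flush in real time)\n",
                   m[i].name, per_min / GiB, per_min / 60 / GiB);
        }
        return 0;
    }
    /* 1080p:  20.85 GiB/min (0.35 GiB/s)
     * 4K:     83.42 GiB/min (1.39 GiB/s)
     * 8K:    333.66 GiB/min (5.56 GiB/s, ~44.5 Gbit/s sustained) */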

Of course, flushing to disk is a major operation; it would likely require
multiple PCIe NVMe devices, either in RAID or driven via a threaded storage
engine, to flush the memory to storage fast enough - in the 8K example, you
need to copy 333GB of RAM to disk in 1 minute, so your solution needs to
sustain about 5.6GB (44.4Gb) per second, no performance spikes/dips. :P

(This is where the gigantic circular buffer comes into play; the idea is that
the wraparound hits RAM with new data _just_ as the old data got saved.)

What's potentially interesting are the situations - desktop-type work,
small-scale tests, etc. - where not all of the screen is being updated and
the reduction system can diff the updates very effectively. In those
situations, if you're careful, you may go from minutes to tens of minutes or
even hours of recording time on a 128GB or 256GB RAM system. And that's only
spending a few hundred $. (For the RAM, that is.)

If you're doing game development - where every frame is effectively new data
- and you need pixel-perfect HDMI capture, well, 2TB RAM workstations are
only low/mid-5-figures now...

> You can use multiple PCIe _lanes_ easily; if you need more than an x16 slot
> I'd be very surprised.

The GP mentioned that HDMI 2.1 is 42.7Gbps. Elsewhere in this thread it's
mentioned that PCI-e's limit is 50Gbps. [EDIT: Just noticed the other comment
about PCI-e speed. This makes things a little easier!]

~~~
pjc50
Have a look at [https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt](https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt)
- "write into host buffer(s)" is a very common feature, and I don't see why
you can't just define a giant buffer.
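
A minimal sketch of that approach, using the coherent-allocation side of the
HOWTO; the device and names are hypothetical, and a truly multi-gigabyte
contiguous allocation will usually fail, which is where scatter-gather lists
(or the reserved-memory trick above) come in:

    #include <linux/dma-mapping.h>
    #include <linux/pci.h>

    static void *capture_buf;       /* kernel virtual address */
    static dma_addr_t capture_dma;  /* bus address to program into the FPGA */
    #define CAPTURE_BUF_SZ (256UL << 20)    /* 256 MB */

    static int capture_probe(struct pci_dev *pdev,
                             const struct pci_device_id *id)
    {
        int err = pci_enable_device(pdev);
        if (err)
            return err;

        /* Declare which addresses the device can reach. */
        err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
        if (err)
            return err;

        /* One physically contiguous, cache-coherent region the
         * bus-mastering FPGA can write into; no manual cache syncs
         * are needed on a coherent mapping. */
        capture_buf = dma_alloc_coherent(&pdev->dev, CAPTURE_BUF_SZ,
                                         &capture_dma, GFP_KERNEL);
        if (!capture_buf)
            return -ENOMEM;

        pci_set_master(pdev);   /* let the device initiate DMA */
        /* ... program capture_dma and CAPTURE_BUF_SZ into the FPGA ... */
        return 0;
    }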

------
nickpsecurity
Back when I looked into high-speed networking in small spaces, my idea was
doing it directly over PCI, bypassing Ethernet. I really wanted to do
something like SGI's NUMAlink, but PCI would be cheaper, with commodity parts
available. You'd need a switch for it, though. I ended up finding a company
selling them.

So, there's always that kind of thing to consider. I wonder if 10-100Gbps
Ethernet has better price/performance than TCP/IP over PCI by now. I can't
even remember the company or the tech's name.

~~~
daveguy
I wanted to do the same thing with PCI!! I thought the bandwidth was large
enough that it should make a great direct interface. Unfortunately,
synchronizing between two PCI masters was going to be tricky, and the
high-frequency development required for PCIe calls for some pricey diagnostic
equipment. I ran across those off-the-shelf switches/interfaces, but they
were a lot more expensive than 10Gb Ethernet at the time, and they weren't
even available for high-lane-count PCIe 2 and PCIe 3.

I think now that 10Gb Ethernet prices are coming down to about $100 per card
and switch port, there's no way specialized direct PCIe would compete off the
shelf.

Still, I think it would be cool to have a direct 32-lane PCIe 3 or 4 link. It
could operate as fast as RAM speeds. If you connected two 2P beasts it would
be close to having a 4P system.

Of course, there would probably be protocol issues and overhead, as the
no-two-masters issue means you can't use the native protocol directly.

~~~
namibj
Don't worry, the chip you are searching for is branded ExpressFabric and is
sold/made by Broadcom. It's AFAIK only PCIe3, and you can get ones with,
IIRC, up to 96 lanes per chip. They can be connected into a mesh, and even
adding shared PCIe devices like network cards, GPUs, NVMe and SAS/SATA HBAs
is possible, though you might have difficulties with the GPUs in particular.
The others, if you are careful when buying, should do well if they can handle
SR-IOV, as (from their view) the mesh switch they are attached to is the
root, and the computers attached to the mesh take the place a VM would
normally occupy.

If you could get the smallest chip they make, it should not be hard to get a
PCB made for it that fans out to e.g. a set of USB-C receptacles. One such
receptacle can handle 2 PCIe3 lanes, plus some power/USB-2 on the side. The
main reason for USB-C is its relatively low price compared to e.g. mini SAS
HD, which has a similar density and twice the lanes per connector, but also a
slightly higher target impedance at the upper end of the PCIe spec, whereas
USB-C specifically targets the right PCIe impedance.

~~~
justinclift
A quick search on Digi-Key shows several entries for ExpressFabric which seem
like what you mean:

[https://www.digikey.com/products/en?keywords=expressfabric](https://www.digikey.com/products/en?keywords=expressfabric)

They all have a part status of "Discontinued at Digi-Key" though. Not seeing
any further info on where they _are_ available either. :/

~~~
namibj
You found the chips I was referring to. If you want them, you should probably
ask Broadcom, as they should either be able to point you to where you can
still buy them, or to a replacement. And in case they don't help you, you can
still ask sales at competitors, as some might have something suitable. It
certainly looks like they'd do a run for you if you want some, albeit with a
more significant MOQ (probably in increments of one wafer, with the best
price if you agree to take all the good dies of those they make for you).
According to their website [0], these are still active products; it just
seems that no one really wants to buy them, due to them being somewhat weird.
There is a reference platform [1]: a 32-port 1U TOR switch with, as far as I
can tell, PCIe3 x4 on QSFP ports. It goes with 2-port cards that go into the
servers and contain re-timers. Broadcom claims one can connect the switch to
the server cards using optics or copper, but the switch is $11k at Mouser.
The technology apparently offers virtual Ethernet NICs and 8 QoS classes.

Did you want to buy such technology?

[0]: [https://www.broadcom.com/products/pcie-switches-bridges/expressfabric/#tab-PCIe1](https://www.broadcom.com/products/pcie-switches-bridges/expressfabric/#tab-PCIe1)

[1]: [https://www.broadcom.com/applications/datacenter-networking/expressfabric](https://www.broadcom.com/applications/datacenter-networking/expressfabric) (bottom of the page, grep "ExpressFabric Reference Platform")

~~~
justinclift
> Did you want to buy such technology?

Nope, just curious.

For most of the use cases I can think of that these would suit, Infiniband
seems like it would also work, and that's fairly widely available already.

That being said, there are probably use cases this would suit better. They're
just not coming to mind easily. :)

~~~
namibj
Cheaper than Infiniband, I assume - even if you buy the TOR switch, from what
I can tell.

Also, the latency is much lower than Infiniband's, and it is nice to e.g.
combine blades with NICs or similar configurations without the NICs being
special multi-host ones like Mellanox offers.

------
Spooky23
10Gb on VDI is fun when your storage can absorb line-rate spikes.

Things break in new ways and are fun to troubleshoot.

~~~
exikyut
This sounds kind of interesting. What happened?

~~~
Spooky23
The constraints moved!

We had a write-intensive workload for a user community that would generate
short (~5s) periods of high traffic (6-9Gb/s) against a file server - this
would impact everyone because sessions would be interrupted. It was
devilishly difficult to troubleshoot because it was lost in statistics
aggregated in 30-100s chunks.
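
The arithmetic shows why averaging hides it (numbers invented, but
representative):

    #include <stdio.h>

    int main(void)
    {
        /* A 5 s burst at 8 Gbit/s inside an otherwise ~1 Gbit/s minute. */
        double burst_s = 5, burst_gbps = 8, base_gbps = 1, window_s = 60;
        double avg = (burst_s * burst_gbps +
                      (window_s - burst_s) * base_gbps) / window_s;
        printf("60 s average: %.2f Gbit/s\n", avg);  /* ~1.58 Gbit/s */
        return 0;
    }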

Fixing that revealed another bottleneck, this time in the file server process
itself.

It's one of those things that was interesting because storage with fancy
SSDs can handle the workload, no sweat. In the old days, monitoring would
have caught the IO constraint first.

~~~
exikyut
Hmm. This highlights the importance of statistics AND insane threshold edge
detection. Good one to file away, thanks.

~~~
Spooky23
Yeah, we're working on a way to grab real-time stats during peak load
conditions to help correct for outliers lost in averages.

Another key thing is to be aware of where the heavy write-load generators are
on SMB, and to scope the SMB shares around them to limit failure domains.
There are limits, and the vendors (Microsoft or 3rd-party NAS) have a hard
time finding them too.

------
wemdyjreichert
Well, that was... bellicose. No company is your friend, but some are still
better than others. It is true that companies work in their own
self-interest, but those interests may differ. Apple has done quite well
without selling user data, and has cultivated a brand that makes money doing
something else. Facebook has not. The enemy of my enemy is my friend, even if
they work in their own self-interest. It is in Apple's self-interest to keep
customers, and they might lose more than they gain by selling user info.
Again, no company is perfect. "Applying constant pressure" isn't perfect
either, as governments are at least as bad as any company - their motivation
is to stay in office. Apple is at least a better company than, say, Facebook.

~~~
benaadams
Um, wrong topic?

This one is about "Some notes on high speed networking on PCs" - not sure how
Apple comes into it?

~~~
saagarjha
Looks like a comment destined for
[https://news.ycombinator.com/item?id=17276184](https://news.ycombinator.com/item?id=17276184)

