
High Speed Networking: Open Sourcing our Kernel Bypass Work - XzetaU8
https://www.bbc.co.uk/rd/blog/2018-04-high-speed-networking-open-source-kernel-bypass
======
john37386
Many people don't understand how or why the Linux kernel is the bottleneck at
speeds higher than 10 Gbps.

The problem is that the linux kernel can't process many simultaneous small
connections.

Just to be clear:

- Linux can easily transfer at 40 Gbps
- Linux chokes at around 1M packets per second per CPU socket.

That's right.

So Linux can easily transfer one or a few simultaneous flows at 40 Gbps.

But! It can't transfer many small flows at 40 Gbps.

The bottleneck is the number of packets per second it can inspect.

So if you are sending 100 big transfers, you will reach 40 Gbps.

If you send 5M small transfers, linux will die.

This is why netmap is handy.

It offloads the packets from linux directly to the app

The BBC is sending xM small requests per second, hence millions of packets per
second.

In conclusion, netmap is good if you need a lot of small simultaneous
connections.

I reached 40M connections on a Chelsio 40 Gbps NIC using netmap on FreeBSD 11.
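
To put the packets-versus-bits point in numbers, here is a minimal sketch that takes the 1M packets per second figure above at face value and shows what throughput that budget allows at different packet sizes. The function name and the chosen packet sizes are illustrative assumptions, and the arithmetic ignores multi-core scaling and NIC offloads.

```python
def throughput_gbps(packets_per_second, packet_bytes):
    """Throughput in Gbps for a fixed packet rate and packet size."""
    return packets_per_second * packet_bytes * 8 / 1e9


# Assumed per-socket budget, taken from the comment above rather than measured.
PPS_BUDGET = 1_000_000

for size in (64, 1500, 9000):
    print(f"{size:>5} B packets: {throughput_gbps(PPS_BUDGET, size):.3f} Gbps")
```

With a fixed packet budget, throughput scales linearly with packet size, which is why a few large flows can fill the link while many tiny requests cannot.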

~~~
sengork
Thanks, I've always wondered under which conditions BSD networking performs
better compared to Linux. It's the number of simultaneous streams as opposed
to raw bandwidth for fewer streams.

~~~
john37386
Well BSD kernel will also choke at more or less the same limits as linux
kernel.

The key point here is that a netmap driver bypasses the kernel and therefore
opens the door to many millions of packets per second, or millions of requests
per second. Not many Gbps.

------
dsr_
Next time management asks why the driver for your product should be open
source, point out that it makes your product much more attractive to people
who will improve and promote your product for free.

That is, sales will improve.

~~~
imtringued
Sales improve because the potential userbase has increased, not because of
free contributions.

Imagine you are google and you have created your own custom build system. You
now want to hire more people that are familiar with it. Unfortunately the
potential applicant pool is exactly 0 because nobody outside of google can
even start to learn how to use the google specific tools because they simply
don't exist outside of google.

~~~
dsr_
The big barriers to adoption of a new product, service, or whatever are:

1. People knowing about it.

2. People needing it.

3. People trusting it.

Like a fire needs heat, oxygen and fuel, new products survive only when they
get all three corners of the triangle. If your product is cheap enough, some
people will try it when they think that they have a need for something like
that; trust comes eventually. Open source builds trust that in the worst case,
people using your product will have a glide path out instead of a sudden brick
wall.

In the instant case, Mellanox gets a highly visible, reputable customer
telling everyone in their industry that Mellanox NICs are high-performance,
are trusted, and can be adapted to their needs. Anyone who reads this article
and thinks about high-performance NICs will have a bit more trust in the
Mellanox brand as a flexible system.

------
kierank
Lots of people are missing the point in this discussion: the goal of this is to
send uncompressed video over closed IP networks, in a studio for example. I
talked about some of the problems at Demuxed last year (referencing BBC R&D's
work):
[https://www.youtube.com/watch?v=A4L5xEXXlas](https://www.youtube.com/watch?v=A4L5xEXXlas)

~~~
ckdarby
I don't think many people are missing this point. I got the point from the article
itself and even before this when I saw BBC I figured it was going to be
related to video.

------
decasia
One of the sibling articles explains that part of the project involves trying
to cut labor costs on covering massively multi-sited events:

[https://www.bbc.co.uk/rd/projects/nearly-live-production](https://www.bbc.co.uk/rd/projects/nearly-live-production)

The idea is explicitly that you would run the equivalent of a TV control room
on a web app that knows how to switch between video feeds (potentially from
fixed cameras without human operators) and then can output video to... whoever
is your audience.

The "moving window" of near-live editorial decisions is pretty interesting as
well, and seems geared towards giving non-professional editors a chance to fix
mistakes.

~~~
stephen_g
How does a web app know how to switch between video feeds, or cameras know
where to point?

Maybe we're talking about different kinds of events, but there's actually a
lot of creativity that goes into camera framing, movement, shot selection,
timing of cuts, etc. to end up with something that's actually interesting to
watch...

~~~
birdman3131
The app is not choosing. The operator/director is choosing. The thing is that
you give yourself a buffer before broadcasting.

"The user can pause the action at any point in time and seek back through the
session to fine tune edit decisions using a visual representation of the
programme timeline. On resuming, the play-head seeks forward to the end of the
timeline. This functionality is made possible by ensuring the time shift
window (DVR window) of the camera feed is infinite so that all live footage is
recorded and can be randomly accessed by seeking.

Once the operator has established a sufficient buffer of edit decisions, they
can begin broadcasting them. [...]

A research goal is to investigate how big the window of time should be between
the broadcast play head and edit play head to ensure the operator has enough
time to perform edits without feeling rushed and whilst keeping the programme
as close to a live broadcast as possible. We call this the ‘near-live window’"
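
The two play heads described in that quote can be modelled in a few lines. This is only a sketch of the idea, not the BBC's implementation: the class name, the methods, and the rule that a cut can be changed only while it is still ahead of the broadcast play head are all assumptions made for illustration.

```python
import bisect


class NearLiveTimeline:
    """Edit decisions (cuts) that stay editable until the broadcast head passes them."""

    def __init__(self, window_s):
        self.window_s = window_s  # gap between edit head and broadcast head
        self.times = []           # sorted cut times, in seconds
        self.cameras = []         # camera selected at each cut

    def broadcast_head(self, live_s):
        # Negative means broadcasting has not started yet.
        return live_s - self.window_s

    def set_cut(self, live_s, at_s, camera):
        """Add or revise a cut at at_s, given the current live time live_s."""
        if at_s <= self.broadcast_head(live_s):
            raise ValueError("that part of the timeline has already been broadcast")
        i = bisect.bisect_left(self.times, at_s)
        if i < len(self.times) and self.times[i] == at_s:
            self.cameras[i] = camera
        else:
            self.times.insert(i, at_s)
            self.cameras.insert(i, camera)

    def camera_at(self, t_s):
        """Camera on air at time t_s: the most recent cut at or before t_s."""
        i = bisect.bisect_right(self.times, t_s) - 1
        return self.cameras[i] if i >= 0 else None
```

A larger `window_s` gives the operator more room to revise cuts, at the cost of the programme running further behind live.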

~~~
petetttt
More on building similar web apps here [1] and on getting a computer to play
director here [2]

1 - [https://www.bbc.co.uk/rd/blog/2017-07-compositing-mixing-vid...](https://www.bbc.co.uk/rd/blog/2017-07-compositing-mixing-video-browser)

2 - [https://www.bbc.co.uk/rd/projects/ai-production](https://www.bbc.co.uk/rd/projects/ai-production)

------
kalleboo
> unique challenge here involves handling IP packets (around 1500 bytes each)
> at data rates of between 1 and 8 Gigabits per second

This reminded me of something. At this point it seems like Jumbo frames are
never going to be widely adopted, are they? Otherwise this seems like the
perfect application - massive datarates, controlled hardware/software, high-
quality wiring...

~~~
cremp
I've been working with Mellanox cards for a few weeks, and the MTU matters a
lot. The difference is an iperf test measuring 14 Gbits/s on MTU 1500 vs. 40
Gbits/s on MTU 9000.

We got a couple of ConnectX-5 cards, which allow switchless connections, akin
to a ring topology. A lot of neat things, at just stupid line speeds, and
latency levels I haven't seen in software, ever.
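
A rough way to see why the MTU matters is to compare packet rate and wire efficiency at the two sizes. The sketch below assumes standard untagged Ethernet framing and IPv4 plus UDP headers with no options; the function names are made up for illustration, and an actual iperf run over TCP would have slightly different header sizes.

```python
# Per-frame Ethernet cost on the wire, in bytes:
# preamble + SFD (8) + header (14) + FCS (4) + inter-frame gap (12), no VLAN tag.
ETH_OVERHEAD = 38
# IPv4 header (20) + UDP header (8), assuming no IP options.
IP_UDP_HEADERS = 28


def packets_per_second(link_bps, mtu):
    """Packets per second needed to fill the link with MTU-sized packets."""
    return link_bps / ((mtu + ETH_OVERHEAD) * 8)


def payload_efficiency(mtu):
    """Fraction of wire bytes that is application payload."""
    return (mtu - IP_UDP_HEADERS) / (mtu + ETH_OVERHEAD)


for mtu in (1500, 9000):
    pps = packets_per_second(40e9, mtu)
    print(f"MTU {mtu}: {pps / 1e6:.2f} Mpps, {payload_efficiency(mtu):.1%} payload")
```

Under these assumptions, jumbo frames gain only a few percent in wire efficiency but cut the packet rate almost sixfold, so the benefit shows up mainly where per-packet processing is the limit.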

~~~
kevstev
Interesting. I bought a "real" netgear switch a few years back (geared to the
small business market) and did some tests with jumbo frames enabled, and saw
only a slightly under 10% difference in performance pushing 10GB files from my
laptop to my NAS.

Everything else was consumer level though. I did ensure that everything was
flipped to jumbo frames. I/O shouldn't have been the factor since it was SSD
on both sides.

Mellanox has some interesting stuff, I used to work with it back in my HFT
days.

------
jwbensley
BBC R&D gave a talk on this very project a couple of years ago at a UKNOF
meeting, it was very interesting:

[https://m.youtube.com/watch?v=yLL8wl8YUwA](https://m.youtube.com/watch?v=yLL8wl8YUwA)

------
ausjke
The last kernel hacking I did involved 10Gbps NICs, so 80Gbps is just
so attractive to my eyes.

There are different pass-through/fastpath patches for different chips to avoid
memcpy, or to do zero-copy, but they are all kernel patches, kernel-only.

An alternative would be BPF/XDP or DPDK, for which you need to modify the
kernel drivers somehow for good performance. I wonder whether netmap does that
already or has nothing to do with them.

All of them address data-plane packet movement. I hope I can have an
environment to experience these close-to-100Gbps networks in my next projects.

In the meantime, I am wondering, why do you pass 4K uncompressed video using
IP packets...

------
corndoge
Wonder why they didn't use DPDK instead of rolling their own MLX specific
thing.

~~~
xjia
DPDK is a very recent thing...

~~~
lttlrck
Also netmap is _extremely_ easy to get going. It’s a man page and some great
examples and the devs are superbly helpful. I found DPDK daunting in
comparison (it really is a different beast though, a real toolkit).

One nice thing about Netmap is it can fall-back to emulation mode and work
with any NIC. This can be really helpful if you want a single codepath but
don’t always need the performance. DPDK might support this too now, I haven’t
looked at it for quite a while.

~~~
wmf
Netmap is easier, but when _everyone other than you_ is on DPDK that's kind of
a hint that you're painting yourself into a corner.

~~~
kierank
Except for minor services such as Verisign's DNS which uses netmap, sure.

~~~
shaklee3
Source? Couldn't find anything saying that.

------
chatmasta
Can someone ELI5 the major challenges of delivering "broadcast quality" video
over IP compared to cable? It seems crazy that pushing HD content over a
standard aux cable is faster than downloading HD over IP. We have been pushing
HD content to our TVs for 10+ years, but many people still have trouble
streaming an HD movie from Netflix.

Is this simply due to the overhead of IP transport supporting bidirectional
communication? That is, a TV broadcast only needs to support a fixed set of N
unidirectional flows (channels), but IP needs to support a dynamic set of N
bidirectional flows?

~~~
corndoge
As you said, broadcast is usually done with IP multicast, so it's not really
bandwidth intensive. The netflix example is different, there you have a large
group of people who all want different streams. At netflix's scale serving
that many simultaneous connections from any reasonable amount of datacenters
has the capability to saturate DC uplinks at peak hours. Hence why they've
resorted to caching appliances in IXs.

~~~
chatmasta
Actually, I didn't realize that broadcast was done with IP multicast. I was
thinking it used a separate, older technology. So does that mean that
generally speaking, modern ISPs that bundle Internet + Cable push the cable
content over the IP transport using multicast in the layer 2 DOCSIS network?

~~~
discreditable
I used to have a fiber + TV package. It was definitely multicast. I found that
when I swapped their router for my own I had to make sure IGMP was handled in
order for TV and on-demand to work.
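
For context on what "handling IGMP" involves: a receiver asks for a multicast stream by joining a group, and the operating system then announces that membership with IGMP, which the router has to pass along. A minimal sketch using the standard socket API follows; the group address and port in the usage note are placeholders, not anything from a real TV service.

```python
import socket
import struct


def build_mreq(group, iface="0.0.0.0"):
    """Pack an ip_mreq: multicast group address, then local interface address."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(iface))


def join_group(group, port, iface="0.0.0.0"):
    """Open a UDP socket and join a multicast group on the given interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Joining makes the kernel send an IGMP membership report for the group.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    build_mreq(group, iface))
    return sock
```

Something like `join_group("239.1.1.1", 5004)` followed by `recvfrom()` would receive the stream, provided every router on the path forwards the membership reports and the multicast traffic.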

~~~
sofaofthedamned
Yes, I did the IGMPProxy dance too!

I was amazed that a bit of software that is basically in all home routers is
so old, unloved and misunderstood.

Multicast is a funny thing. For years people thought it was the future until
they realised 1) it requires the WAN end to support it and 2) people to watch
stuff at the same time.

------
peterwwillis
There's a lot of orgs out there developing broadcast IP tech in house. Would
be interesting to see an industry wide consortium on this

~~~
brensmith
There are some consortiums that are starting to form:
[http://www.videoservicesforum.org/activity_groups/RIST_poste...](http://www.videoservicesforum.org/activity_groups/RIST_poster_for_VidTrans2018Feb25.pdf)

------
adrianmonk
So if this lossless video over ethernet thing ever trickles down to consumers,
what happens?

Can my computer monitor become just another network device plugged into
Ethernet like my printer already is?

In my living room, can I ditch the video routing part of my AV receiver and
just plug everything (streaming devices, video games, cable box) into a very
fast ethernet switch?

~~~
fulafel
Quickest Ethernet is up to 400 Gbps now; you can probably get by with a
pedestrian 10 Gbit switch :)

Hard to see uncompressed consumer video catching on though since compression
works so well.

~~~
e12e
Afaik thunderbolt is packet switched?

------
greyfox
Am I reading this right? Are they coming up with a way to use IP to broadcast
"television", in the sense that, if this was finished and open source,
"anyone" could feasibly run their own "television channel/station" (edit: over
IP)?

~~~
cjensen
Not exactly. This is about how to pass video between interoperable machines.

Traditionally within a broadcast house, video+audio was sent between machines
using HD-SDI using BNC coax[1]. For the first generation of HD, this ran at
1.5Gb/s. For 1080p/59.94Hz, a new standard was developed to run at 3Gb/s. For
4K at 59.94Hz, there is a 12Gb/s standard.

HD-SDI routers are extremely expensive. Each input and output from a device
has to be individually cabled to a router and some devices can have many
dozens of I/Os. There has been a push to switch from using ridiculous numbers
of cables to using IP solutions and off-the-shelf IP routers. SMPTE 2110
provides a framework for doing this using RTP [2].

4K needs roughly 12Gb/s, so to carry a single video, you need at least 25Gb/s
networking. If you want to carry many signals, the network bandwidth goes up
fast -- that's why BBC is interested in 100Gb/s links.
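
The figures above can be sanity-checked from the picture format. The sketch below assumes 10-bit 4:2:2 sampling (20 bits per pixel on average) and counts only active picture, so it comes out lower than the SDI rates, which also carry blanking intervals. The function name and the sampling assumption are illustrative.

```python
def active_video_bps(width, height, fps, bits_per_pixel=20):
    """Bit rate of the active picture only; 20 bpp assumes 10-bit 4:2:2."""
    return float(width * height * bits_per_pixel * fps)


FPS_59_94 = 60000 / 1001  # the exact "59.94Hz" frame rate

hd = active_video_bps(1920, 1080, FPS_59_94)
uhd = active_video_bps(3840, 2160, FPS_59_94)

print(f"1080p59.94: {hd / 1e9:.2f} Gb/s")
print(f"2160p59.94: {uhd / 1e9:.2f} Gb/s")
# Upper bound on streams per link, ignoring packet headers and headroom.
print(f"4K streams on a 100 Gb/s link: {int(100e9 // uhd)}")
```

Even before packet overhead, a single 100Gb/s link holds only a handful of uncompressed 4K signals under these assumptions.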

Also keep in mind the traffic is not "bursty". These links may be nearly
saturated 24/7. Most off-the-shelf routers are not designed with enough
buffering to handle fully saturated links on all connectors.

[1]
[https://en.wikipedia.org/wiki/Serial_digital_interface](https://en.wikipedia.org/wiki/Serial_digital_interface)

[2]
[https://en.wikipedia.org/wiki/SMPTE_2022](https://en.wikipedia.org/wiki/SMPTE_2022)
(older IP standard)

------
gnufx
If this is for a more-or-less closed system, I wonder why it uses IP, and not
an RDMA-type network (Infiniband etc.) which already is kernel-bypass and
works routinely at around 100G, 1μs latency. There's nothing special about the
Mellanox drivers amongst the Openfabrics ones in having free drivers in Linux,
but they generally require blobs or separate firmware.

------
apazgo
Wouldn't products like openvpn benefit from using this? Or am I missing
something?

------
jlebrech
All paid by
[https://en.wikipedia.org/wiki/Television_licensing_in_the_Un...](https://en.wikipedia.org/wiki/Television_licensing_in_the_United_Kingdom)

~~~
lainga
I never understood why they stick with this model instead of just levying a
tax on citizens (like my beloved CBC). The BBC has been more than just a TV
station for a long time; and I hear people in the UK are dead tired of having
BBC license "collectors" coming round every 2 months to harass them into
paying fees for the flatscreen TV they have hooked up to an XBox.

~~~
Symbiote
If you tell the BBC licence collectors (phone/letter/in person) that you're
not watching TV, they stop coming.

The licence collectors remain a favourite grumble of anti-government or
anti-tax people, but most people aren't affected. (Most people buy the licence;
most people who don't need to instead tell the BBC roughly every 3 years.)

~~~
jlebrech
After two years they will come back, and if they have reason to believe you
are watching TV they can still come.

People can watch Netflix and still be bothered by those people.

------
jacksmith21006
I will be curious to see how well the new Google kernel, Zircon, performs
compared to Linux.

I have my doubts a micro kernel can come close to the Linux kernel efficiency.

Zircon is the new micro kernel that is part of Fuchsia.

[https://github.com/fuchsia-mirror/zircon](https://github.com/fuchsia-mirror/zircon)

~~~
e12e
Isn't this a bit of an odd comment, on a story about kernel bypass for high
performance networking?

~~~
jacksmith21006
Why? Think it is very relevant.

~~~
e12e
You doubt a microkernel can approach Linux kernel efficiency, yet here a
typical kernel task is delegated to user-space for efficiency under Linux ; an
approach that is similar to how networking might be handled by a microkernel?

I agree that it'll be interesting to see how fuchsia turns out.

~~~
jacksmith21006
Yes definitely have my doubts. But hope it works out. Just love Flutter and
hope Zircon works out.

But we have to see it actually happen.

------
Y_Y
Cool.

