
Improving Linux networking performance - sciurus
https://lwn.net/Articles/629155/
======
zelos
I love reading articles like this: _" So, for example, a cache miss on
Jesper's 3GHz processor takes about 32ns to resolve. It thus only takes two
misses to wipe out the entire time budget for processing a packet."_

Then I go back to adding another layer of abstraction to my bloated Java code
and die a little inside.
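
For anyone who wants to sanity-check that quote, here's a back-of-the-envelope sketch (assuming 10GbE at line rate with minimum-size 64-byte frames, which is the scenario the article uses):

```c
#include <stdio.h>

int main(void) {
    /* Smallest Ethernet frame on the wire:
     * 64 B frame + 8 B preamble + 12 B inter-frame gap = 84 B */
    const double wire_bits = 84 * 8;
    const double link_bps  = 10e9;   /* 10GbE */
    const double budget_ns = wire_bits / link_bps * 1e9;
    const double miss_ns   = 32.0;   /* one cache miss on a ~3GHz CPU */

    printf("per-packet budget: %.1f ns\n", budget_ns);    /* ~67.2 ns */
    printf("two cache misses:  %.1f ns\n", 2 * miss_ns);  /* 64.0 ns  */
    return 0;
}
```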

~~~
gretful
yeah, it's easy to cry when you see a real artist working on something, and
you have to go back to your etch-a-sketch.

~~~
npsimons
Real art may inspire the soul, but road signs keep you from dying.

------
joshbaptiste
Indeed, there have been a lot of "bypass the kernel" campaigns in the last
couple of months. Robert Graham's 2013 Shmoocon talk gives a great
introduction to the whys and hows of this movement.

[https://www.youtube.com/watch?v=D09jdbS6oSI](https://www.youtube.com/watch?v=D09jdbS6oSI)

Facebook had a job posting that showed up on HN for a position to help speed
up Linux's networking stack. While I doubt these improvements will surpass the
kernel-bypassing model, I'm glad a developer has decided to tackle this
head-on and help overall efficiency.

~~~
mobiplayer
It seems to be pretty uncommon for Linux setups to have, e.g., TOE enabled,
but in my humble opinion it's an easy performance win on 10G networks.

~~~
wmf
TOE isn't supported under Linux and it was only a win if your TCP stack was
slow (and by "your" I mean Windows). TSO is enabled by default in Linux and is
indeed an easy win.
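
If you want to verify it's on for a given NIC, here's a minimal sketch using the legacy ethtool ioctl (the interface name "eth0" is just a placeholder; `ethtool -k eth0` from the shell shows the same information):

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void) {
    /* Any socket works as a handle for the ethtool ioctl. */
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return 1;

    struct ethtool_value ev = { .cmd = ETHTOOL_GTSO };  /* query TSO state */
    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);  /* placeholder name */
    ifr.ifr_data = (char *)&ev;

    if (ioctl(fd, SIOCETHTOOL, &ifr) == 0)
        printf("TSO on %s: %s\n", ifr.ifr_name, ev.data ? "on" : "off");
    else
        perror("SIOCETHTOOL");

    close(fd);
    return 0;
}
```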

------
fulafel
Too bad improvements in network technology haven't found their way to the
consumer level. It's probably related to stagnant broadband speeds: last-mile
bandwidth improvements slowed to a crawl many years ago, and concurrently
device connectivity actually moved to a slower networking tech, wi-fi. Now
people are happy to use wi-fi for home desktop computers since their last-mile
net connection is so slow anyway.

It's been 10 years since motherboard-integrated 1G became commonplace in
regular PCs; the same for 10G is nowhere in sight...

~~~
skuhn
10 gig still isn't even commonly available on server motherboards, because of
power / space / cost. There also aren't many copper 10 gig top-of-rack
switches; only the Cisco Nexus 3064T and Arista 7050T come to mind. Juniper
doesn't even have one.

It's easier for a lot of places to use twinax with 10 gig SFP+ switches rather
than going copper 10 gig. That is definitely not going to trickle down to the
consumer level.

It will probably be another 1-2 years before 10 gig is ubiquitous at the
server level, and another 2-3 years after that before it is commonly on
consumer equipment. Or maybe it never will be, and things will go in another
direction.

~~~
donavanm
> There also aren't many copper 10 gig top-of-rack switches, just the Cisco
> Nexus 3064T and Arista 7050T come to mind. Juniper doesn't even have one.

I might be missing your definition of ToR. The Juniper QFX5100 series has the
48T, which does 48x 10GBASE-T plus 6x QSFP; the 5100-96S does 96x SFP+ and 8x
QSFP. There are plenty of other cheap merchant-silicon platforms that look
similar. Personally I'm happy with DAC on SFP+ ports.

~~~
skuhn
Oh yeah, I always forget about the QFX series. Seems like that would do the
job.

List prices are utter nonsense for switches, but the QFX does come in above
the other two I mentioned. Perhaps because of its Fibre-Channel-y nature that
no one (I hope) cares about.

QFX5100-48T: $24,000
Nexus 3064-T: $13,000
7050T-52: $20,000

Still, any of these would work if you get the right deal. I could see an
advantage for 10g copper if I had mixed racks where not all of my hosts needed
10g on the server side, but that's a big premium to pay over 1g TOR if you
aren't using lots of 10 gig ports.

For me, I just use copper on 1 gig racks and DAC on 10 gig racks. So far, so
good.

------
arca_vorago
Am I the only one who thinks we need to start with a re-evaluation of BSD
sockets first? I know Apple tried and gave up, but it just seems like so many
of the building-block pieces we use every day could really use a major polish
or some good competition.

~~~
wmf
This article is mostly about the lower parts of the network stack, like QoS
and talking to the NIC; the user API is kind of orthogonal but equally
important. There have been several research projects on improved networking
APIs; my favorite is IX, which gets line-rate performance while retaining
kernel/user protection.
[https://www.usenix.org/conference/osdi14/technical-sessions/...](https://www.usenix.org/conference/osdi14/technical-sessions/presentation/belay)

------
jalcazar
This reminds me of MegaPipe. Basically it creates a pipe between kernel and
user space, and it uses batching too.
[https://www.usenix.org/conference/osdi12/technical-sessions/...](https://www.usenix.org/conference/osdi12/technical-sessions/presentation/han)

~~~
trentnelson
It bugs me that the source isn't available for stuff like this. Makes it tough
to objectively evaluate things.

------
alricb
Possible contrast & compare: the presentations on OpenBSD's network stack at
[http://openbsd.org/papers/](http://openbsd.org/papers/)

~~~
scott_karana
The first relevant-seeming presentation is from 2009[1] and doesn't really get
into the pitfalls of low-latency switching/handling like this article does.

I'm definitely interested to see how other operating systems handle this,
though. In particular: Windows (is networking in user-mode?) and
Solaris-likes.

1: [http://quigon.bsws.de/papers/2009/eurobsdcon-faster_packets/](http://quigon.bsws.de/papers/2009/eurobsdcon-faster_packets/)

------
xenadu02
Why are we still using 1500 byte packets at 100G again? Seems like there won't
be any tricks left to make 1000G work. Does that count as technical debt?

~~~
donavanm
Pretty much everything has supported 9k jumbos for over a decade. The
internet's mostly 1.5k MTU, but you normally aren't doing multi-gigabit
streams over public connectivity. The other argument is TSO: your kernel's
probably writing a 64K "packet" to the NIC driver, and when segmentation etc.
is handled by the hardware, why do you care about the MSS? On the
network-device side the SerDes are the issue, and we're already running
parallel lanes there; 40 is 4x 10 lanes and 100 is 4x 25 lanes. Why not 10x
100 in a couple of years?
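
(For reference, moving a host to 9k jumbos is just an MTU change; a minimal sketch of the standard SIOCGIFMTU/SIOCSIFMTU ioctls, with "eth0" as a placeholder name, and with the caveat that every device on the path has to agree:)

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return 1;

    struct ifreq ifr;
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);  /* placeholder */

    if (ioctl(fd, SIOCGIFMTU, &ifr) == 0)
        printf("current MTU: %d\n", ifr.ifr_mtu);

    ifr.ifr_mtu = 9000;  /* jumbo frames; NIC, switch, and peer must all agree */
    if (ioctl(fd, SIOCSIFMTU, &ifr) != 0)
        perror("SIOCSIFMTU (needs CAP_NET_ADMIN)");

    close(fd);
    return 0;
}
```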

~~~
nitrogen
_When segmentation etc is handled by the hardware why do you care about the
MSS?_

Because of Ethernet's mandatory minimum inter-packet spacing.

~~~
donavanm
Ok... So looking at the IFG as 96 bits, or 12 bytes, of "overhead", that's
0.8% or 0.13% for 1.5k and 9k frames respectively. Why do I care about 0.67%
of throughput? And pretty much all silicon from the past decade does line rate
at 1k anyway. Or if it's latency, a hypothetical higher-clocked lane would be
something like 1ns instead of ~3ns per frame? That's the difference between 2
clock cycles and 6 cycles of latency.

So what is your shorter/faster IFG buying in practice?
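
(Spelling out that arithmetic, with the IFG at 96 bit times = 12 bytes:)

```latex
\frac{12}{1500} \approx 0.80\%, \qquad
\frac{12}{9000} \approx 0.13\%, \qquad
0.80\% - 0.13\% \approx 0.67\%
```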

~~~
nitrogen
I've seen an audio bus that had to shorten the gap to have enough bandwidth,
but that was with 100mbit.

------
arjn
Interesting article. Reminds me of my time in grad school :-)

Looks like Jesper's recommended approach is to improve, or find a way to
bypass, the memory management (slab allocation) subsystem.

There should be a way to tack a more network-optimized memory-management layer
or allocator onto the regular one.

Could turn out to be a good research project.
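
The shape of that idea in a userspace sketch (purely illustrative; this is not how the kernel's slab allocator works, just the recycling pattern such a layer would add): a small pool that caches fixed-size packet buffers so the common path skips the general allocator entirely.

```c
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE 2048   /* room for a 1500-byte frame plus headroom */
#define POOL_MAX 256    /* buffers cached before falling back to free() */

struct pool {
    void *free_list[POOL_MAX];
    int   count;
};

static void *pool_alloc(struct pool *p) {
    if (p->count > 0)
        return p->free_list[--p->count];  /* fast path: recycle a buffer */
    return malloc(BUF_SIZE);              /* slow path: general allocator */
}

static void pool_free(struct pool *p, void *buf) {
    if (p->count < POOL_MAX)
        p->free_list[p->count++] = buf;   /* cache for the next packet */
    else
        free(buf);
}

int main(void) {
    struct pool p = { .count = 0 };
    void *a = pool_alloc(&p);  /* first use: hits malloc() */
    pool_free(&p, a);          /* cached, not freed */
    void *b = pool_alloc(&p);  /* recycled: no malloc() this time */
    printf("recycled: %s\n", a == b ? "yes" : "no");
    pool_free(&p, b);
    return 0;
}
```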

------
icantthinkofone
The first thing they should do is what Facebook is doing: turn to FreeBSD for
ideas in their attempt to make Linux's networking as good as FreeBSD's:
[http://www.theregister.co.uk/2014/08/07/facebook_wants_linux...](http://www.theregister.co.uk/2014/08/07/facebook_wants_linux_networking_as_good_as_freebsd/)

~~~
riffraff
IIRC netmap was a cool concept born on FreeBSD but also available for Linux.

[http://info.iet.unipi.it/~luigi/netmap/](http://info.iet.unipi.it/~luigi/netmap/)

------
fideloper
So...we're all upvoting this hoping _someone else_ understands networking at
this level, right? And that maybe they'll do something awesome with it.

~~~
wmf
I understood this article and it is relevant to my interests since 25G NICs
are coming this year.

~~~
agrover
source? I thought the next step was 40G?

~~~
wmf
[http://25gethernet.org/](http://25gethernet.org/)

40G has been out for a few years but it's fairly expensive since it uses four
lanes. 25G will be the best option if you need something faster than 10G IMO.

~~~
mcpherrinm
To expand on that: on a switch chip today, like the common Trident 2, you have
10 and 40 gig interfaces. The 40 gig ones are just four lanes bonded together
(10 gig being one lane). These 25 gig products run each lane at 25G instead of
10, so you get a 25G port at the same density you used to have 10G, 50G at
double the density of 40G, and 100 Gbit/s at the old 40 gig density.

I think this is largely being driven by the server folk, who want to connect
at 25G instead of 10.
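
In lane terms, the port speeds are just multiples of the per-lane rate:

```latex
40\,\mathrm{G} = 4 \times 10\,\mathrm{G}, \qquad
50\,\mathrm{G} = 2 \times 25\,\mathrm{G}, \qquad
100\,\mathrm{G} = 4 \times 25\,\mathrm{G}
```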

~~~
justincormack
There was an attempt to do that with 2.5G, but it has been relegated to
backplanes and was never formally standardised.

~~~
donavanm
The Atom server boards released last year actually have 4x 2.5G lanes. As far
as I've seen, everyone just uses 4x 1G SerDes on them instead of the hybrid 10.

