

10Gbit/s full TX wirespeed smallest packet size on a single CPU core - eloycoto
http://netoptimizer.blogspot.com/2014/10/unlocked-10gbps-tx-wirespeed-smallest.html

======
jjoonathan
How typical is this benchmark for a single CPU? If this is (approximately)
cutting-edge on the software side, then it's a pretty big deal, because
cutting-edge hardware can handle 100Gbit/s ethernet with 33Gbit/s out of each
transceiver [1], which would mean the CPU is the bottleneck!

[1] [http://www.xilinx.com/products/silicon-devices/fpga/virtex-u...](http://www.xilinx.com/products/silicon-devices/fpga/virtex-ultrascale-plus.html)

~~~
mtanski
In the test case in the article they are sending the smallest possible
Ethernet frame, 84 bytes on the wire. The further caveat is that they can do
this on a single CPU only at the kernel layer. Once you involve userspace at
the same packet size, you need 11 CPUs to drive it.
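
Back-of-the-envelope, that 84 bytes on the wire works out to roughly 14.88
million packets per second at 10 Gbit/s. A quick sketch (assuming the standard
7B preamble + 1B SFD + 12B inter-frame gap on top of a 64-byte frame; my
numbers, not the article's):

    /* Quick packet-rate check: 84 bytes on the wire per minimum-size
     * frame (64B frame + 7B preamble + 1B SFD + 12B inter-frame gap).
     * These constants are assumptions, not taken from the article. */
    #include <stdio.h>

    int main(void)
    {
        const double link_bps   = 10e9;  /* 10 Gbit/s link            */
        const double wire_bytes = 84.0;  /* minimum frame on the wire */
        const double pps = link_bps / (wire_bytes * 8.0);

        printf("%.2f Mpps at minimum frame size\n", pps / 1e6);
        /* prints ~14.88 Mpps, the usual 10GbE worst-case figure */
        return 0;
    }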

If you're using a more realistic payload (e.g. larger packets) you should be
able to scale it further, e.g. to 40Gbit.

The bottleneck in this ongoing story has been the CPU, and that's because of
how the software was written. The issues have been packet scheduling,
synchronization (locking overhead), and higher-level software (the TCP layer
and packet filtering).

Here are more details about the original problem and the test case:
[https://lwn.net/Articles/629155](https://lwn.net/Articles/629155)

~~~
wyldfire
> they are sending the smallest possible Ethernet frame, 84 bytes on the wire

It's impressive, but I can't see its application. Does that mean that it will
scale well to many different types of traffic? Are there extremely low-latency
applications that need these tiny frames and yet want to use 10GbE?

> If you're using a more realistic payload (e.g. larger packets) you should be
> able to scale it further, e.g. to 40Gbit

Yes, I would think so. I was able to saturate four 10GbE links simultaneously
(using 9000-byte frames, across four separate CPU cores spanning two memory
nodes). That was on ~2010 hardware running Linux 2.6.27.

~~~
wtallis
Most of the processing overhead is per-packet rather than per-byte so they're
testing the worst-case scenario by minimizing the size of the packets and
frames, thereby maximizing the number of packets and frames. It doesn't have
much direct applicability, but it does mean that the optimizations will be
sufficient to achieve wire speed for _any_ traffic pattern.
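
A rough illustration of the per-packet point (my own numbers, assuming 84-byte
and 1538-byte wire sizes for minimum-size and 1500-byte-MTU frames):

    /* Why minimum-size frames are the worst case: per-packet cost
     * dominates, and smaller frames mean many more packets per second.
     * Wire sizes below (84B min, 1538B for a 1500-byte MTU) are my
     * assumptions, not figures from the post. */
    #include <stdio.h>

    static double pps(double link_bps, double wire_bytes)
    {
        return link_bps / (wire_bytes * 8.0);
    }

    int main(void)
    {
        const double link = 10e9;                        /* 10 Gbit/s */
        printf("minimum frames: %.2f Mpps\n", pps(link,   84.0) / 1e6);
        printf("1500-byte MTU : %.2f Mpps\n", pps(link, 1538.0) / 1e6);
        /* ~14.88 vs ~0.81 Mpps: roughly 18x less per-packet work */
        return 0;
    }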

------
dchichkov
Impressive. Could be useful out of the box for cheap 10GbE load-testing.

I'm also curious how you did the test to confirm you actually achieved line
speed. In my book, line speed means that you _don't_ have millisecond-level
gaps between packets from time to time, and that is fairly difficult both to
test and to achieve on a stock kernel.
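
For a sense of scale, here's a rough sketch (my arithmetic, not the article's
measurement method) of the per-packet time budget at 10GbE wire speed and what
a single millisecond stall costs:

    /* Per-packet time budget at 10GbE wire speed with minimum-size
     * frames, and what a single 1 ms stall costs.  My arithmetic, not
     * the article's measurement method. */
    #include <stdio.h>

    int main(void)
    {
        const double pps      = 10e9 / (84.0 * 8.0);  /* ~14.88 Mpps      */
        const double gap_ns   = 1e9 / pps;            /* ns between pkts  */
        const double miss_1ms = pps * 1e-3;           /* pkts lost per ms */

        printf("inter-packet budget: %.1f ns\n", gap_ns);
        printf("packets missed per 1 ms stall: %.0f\n", miss_1ms);
        /* prints ~67.2 ns and ~14881 packets */
        return 0;
    }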

~~~
signa11
> Could be useful out of the box for cheap 10GbE load-testing.

honest question: why not use dpdk / netmap for this ?

------
DiabloD3
So, according to this, with 9000-byte MTUs I could drive 4x40Gbit (not 160Gbit,
since the oversubscribed cards top out at about 128Gbit of PCI-E bandwidth)
using about 2.9% of my CPU on a quad core (assuming the same CPU speed; mine is
probably about a third faster than theirs, too).

That's pretty sexy.
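
A rough way to sanity-check that figure, assuming the CPU cost is essentially
per-packet and scales linearly from the article's ~14.88 Mpps on one core (all
constants below are my own assumptions, not measurements):

    /* Rough sanity check of the "~2.9% of a quad core" estimate above.
     * Assumes per-packet CPU cost scaling linearly from the article's
     * ~14.88 Mpps on a single core; every constant here is my own
     * assumption, not a measurement. */
    #include <stdio.h>

    int main(void)
    {
        const double core_pps   = 10e9 / (84.0 * 8.0);  /* ~14.88 Mpps per core    */
        const double wire_bytes = 9000.0 + 18.0 + 20.0; /* 9000B MTU + 18B eth
                                                            hdr/FCS + 20B gap      */
        const double link_bps   = 128e9;                /* PCIe-limited aggregate  */
        const double need_pps   = link_bps / (wire_bytes * 8.0);
        const double one_core   = need_pps / core_pps;

        printf("~%.1f%% of one core, ~%.1f%% of a quad core\n",
               one_core * 100.0, (one_core / 4.0) * 100.0);
        /* prints roughly 11.9% and 3.0% */
        return 0;
    }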

