
Capturing Millions of Packets per Second in Linux without Third-Party Libraries - andreygrehov
http://kukuruku.co/hub/nix/capturing-packets-in-linux-at-a-speed-of-millions-of-packets-per-second-without-using-third-party-libraries
======
cm3
In case you're wondering about the different Linux kernel-bypass mechanisms,
here's the relevant slide from a recent talk:
[https://lh3.googleusercontent.com/TO1UdUicn1wuF4jIAhskikO6ML...](https://lh3.googleusercontent.com/TO1UdUicn1wuF4jIAhskikO6MLaQgUuurORG_l9Zxr_L_XRvwvVhuZmF5vWTGGkauRFVT2daOOM=w1439-h1079-no)

Sorry, I haven't found the actual slides yet; that's why it's a photo taken by
someone who attended the talk.

------
revelation
This must be the 10th blog post on this topic to land on HN, and they all
walk through the same steps and all use the same hardware (ixgbe), which, by
the way, is a hard prerequisite for most of these strategies to be effective.

In any case, stop reinventing the wheel; just use a purpose-made library:

[http://dpdk.org/](http://dpdk.org/)

~~~
lukego
I am a Snabb hacker and I see things differently. Ethernet I/O is
fundamentally a simple problem, DPDK is taking the industry in the wrong
direction, and application developers should fight back.

Ethernet I/O is simple at heart. You have an array of pointer+length packets
that you want to send, an array of pointer+length buffers where you want to
receive, and some configuration like "hash across these 10 rings" or "pick a
ring based on VLAN-ID." This should not be more work than, say, a JSON parser.
(However, if you aren't vigilant you could easily make it as complex as a C++
parser.)
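
To make that concrete, here is a rough sketch in C of the heart of such an
interface: rings of pointer+length descriptors shared with the NIC. The names
are purely illustrative (not any vendor's actual layout); real hardware adds
doorbell registers, completion queues, and so on.

```c
#include <stdint.h>

/* One packet descriptor: where the buffer lives and how long it is. */
struct desc {
    uint64_t addr;  /* physical address of the packet buffer */
    uint16_t len;   /* packet length (TX) or buffer size (RX) */
    uint16_t flags; /* e.g. a "descriptor done" bit set by the NIC */
};

/* A ring of descriptors: software fills entries and advances tail;
 * the NIC consumes them and advances head. Writing the new tail to a
 * device "doorbell" register tells the NIC there is work to do. */
struct ring {
    struct desc *descs; /* DMA-visible array, fixed power-of-two size */
    uint32_t size;
    uint32_t head; /* next descriptor the NIC will process */
    uint32_t tail; /* next descriptor software will fill */
};
```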

DPDK has created a direct vector for hardware vendors to ship code into
applications. Hardware vendors have specific interests: they want to
differentiate themselves with complicated features, they want to get their
product out the door quickly even if that means throwing bodies at a
complicated implementation, and they want to optimize for the narrow cases
that will look good on their marketing literature. They are happy for their
complicated proprietary interfaces to propagate throughout the software
ecosystem. They also focus their support on their big customers via account
teams and aren't really bothered about independent developers or people on
non-mainstream platforms.

Case in point: We want to run Snabb on Mellanox NICs. If we adopt the vendor
ecosystem then we are buying into four (!) large software ecosystems: Linux
kernel (mlx5 driver), Mellanox OFED (control plane), DPDK (data plane built on
OFED+kernel), and Mellanox firmware tools (mostly non-open-source, strangely
licensed, distributed as binaries that only work on a few distros). In
practice it will be our problem to make sure these all play nice together and
that will be a challenge e.g. in a container environment where we don't have
control over which kernel is used. We also have to accept the engineering
trade-offs that the vendor engineering team has made which in this case seems
to include special optimizations to game benchmarks [1].

I say forget that for a joke.

Instead we have done a bunch more work up front to first successfully lobby
the vendor to release their driver API [2] and then to write a stand-alone
driver of our own [3] that does not depend on anything else (kernel, ofed,
dpdk, etc). This is around 1 KLOC of Lua code when all is said and done.

I would love to hear from other people who want to join the ranks of self-
sufficient application developers. Honestly our ConnectX driver has been a lot
of work but it should be much easier for the next guy/gal to build on our
experience. If you needed a JSON parser you would not look for a 100 KLOC
implementation full of weird vendor extensions, so why do that for an ethernet
driver?

[1] [http://dpdk.org/ml/archives/dev/2016-September/046705.html](http://dpdk.org/ml/archives/dev/2016-September/046705.html)

[2] [http://www.mellanox.com/related-docs/user_manuals/Ethernet_Adapters_Programming_Manual.pdf](http://www.mellanox.com/related-docs/user_manuals/Ethernet_Adapters_Programming_Manual.pdf)

[3] [https://github.com/snabbco/snabb/blob/mellanox/src/apps/mellanox/connectx4.lua](https://github.com/snabbco/snabb/blob/mellanox/src/apps/mellanox/connectx4.lua)

~~~
grive
> DPDK is taking the industry in the wrong direction, and application
> developers should fight back.

DPDK is doing the exact same work you did: making hardware vendors release
their driver APIs and abstracting them away so that application developers can
stay independent of them.

You "successfully lobbyied" for one API to be released. Now do that for any
number of hardware, NICs versions, and in the end you will have to release a
generic API, which is effectively a new DPDK.

Completely independent applications will only go so far. You are left with
vendor lock-in and a very high upfront cost if you ever need to evolve your
hardware.

~~~
lukego
I understand your perspective. If you are satisfied with using a vendor-
provided software stack to interface with hardware then you are well catered
for by DPDK and do not have to care what is under the hood.

I feel that the hardware-software interface is fundamental and that vendors
should not control the software. I see an analogy to CPUs. I am really happy
that CPU vendors document their instruction sets and support independent
compiler developers. I would be disappointed if they started keeping their
instruction sets confidential, available only under NDA, and told everybody to
just use their LLVM backend without understanding it.

~~~
grive
That is effectively the case. See for example DDIO with Intel, which can only
be enabled for specific devices with full cooperation between Intel and that
particular vendor.

You cannot compete with a DDIO-enabled device, which of course all Intel
devices are.

See also the Intel multi-buffer crypto library, which was specialized and tuned
for Intel CPUs. No one else could write code at that level of optimization,
because we do not have the internal design documents and simulator that Intel
works with.

So yeah, you are talking to sophisticated hardware that will have firmware
blobs and undocumented features. If you only rely on general instruction sets
you will only get so far. When we are talking about nanoseconds of latency and
these levels of bandwidth, such features make the difference between competing
stacks.

The push for smart NICs will increasingly blur the line between the software
and hardware layers. We can either pool our efforts so as to avoid rewriting an
abstraction layer on top of it, or rewrite one for each vendor-specific API
(OFED is but one example; there will be others).

~~~
lukego
I will respectfully disagree :).

We have taken Intel's reference code ([https://github.com/lukego/intel-ipsec/blob/master/code/avx2/gcm_avx_gen4.asm](https://github.com/lukego/intel-ipsec/blob/master/code/avx2/gcm_avx_gen4.asm))
for high-speed AES-GCM encryption and used DynASM
([https://luajit.org/dynasm.html](https://luajit.org/dynasm.html)) to refactor
it as a much smaller program
([https://github.com/snabbco/snabb/blob/master/src/lib/ipsec/aes_128_gcm_avx.dasl](https://github.com/snabbco/snabb/blob/master/src/lib/ipsec/aes_128_gcm_avx.dasl)).
I see this as highly worthwhile: we are working on making the software
ecosystem simpler and tighter just because we are hackers, while Intel are
working primarily on selling CPUs and whatever is best for their bottom line.

I disagree with this characterization of DDIO, but I don't think the Hacker
News comments are the best venue for such low-level discussions. Hope to chat
with you about it in some more suitable forum some time :) that would be fun.

~~~
JoachimSchipper
FWIW, I would be quite interested in your view on DDIO - anything you can
link?

~~~
lukego
Intel DDIO FAQ:
[http://www.intel.com/content/dam/www/public/us/en/documents/faqs/data-direct-i-o-faq.pdf](http://www.intel.com/content/dam/www/public/us/en/documents/faqs/data-direct-i-o-faq.pdf)

My understanding is that DDIO is an internal feature of the processor and
works transparently with all PCI devices. Basically Intel extended the
processor "uncore" to serve PCIe DMA requests via the L3 cache rather than
directly to memory.

------
pavel_odintsov
Thanks for resurrecting my article! :) It was originally written for a Russian
site and I do not have the spare time to translate it.

------
Damogran6
There's still not a lot of oomph left over to do anything with the
traffic... or is that not the point of the exercise? You're not going to be
comparing it or writing it to disk at these levels.

This traffic is a bit above the levels I've dealt with, but I've seen
cloud-datacenter levels of traffic that, as far as I know, you can't
practically log/monitor/IPS/SIEM... or am I misinformed?

~~~
nmjohn
> you can't practically log/monitor/IPS/SIEM...or am I misinformed?

It depends on the hardware you're using (specifically the router), but using
NetFlow / sFlow / IPFIX [0] you can get pretty good visibility even on
high-bandwidth networks. This only gets you "metadata" rather than a full
packet capture, but for monitoring and the like, the metadata can be far more
useful.

I'm not entirely sure what level of traffic you're talking about, but I know
it's possible with the right hardware to use NetFlow on 100GbE links without
having to sample (i.e., recording flows for every packet, not 1 in every n
packets).
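
To give a sense of how compact that metadata is, here is a sketch of the
widely documented NetFlow v5 flow record: roughly 48 bytes per flow, no matter
how many packets the flow carried (field names follow common convention, not
any particular collector's headers).

```c
#include <stdint.h>

/* Approximate NetFlow v5 flow record (48 bytes on the wire,
 * all multi-byte fields big-endian). */
struct nf5_record {
    uint32_t srcaddr, dstaddr, nexthop; /* IPv4 addresses */
    uint16_t input, output;             /* SNMP interface indices */
    uint32_t dPkts, dOctets;            /* packet and byte counts */
    uint32_t first, last;               /* sysUptime at flow start/end */
    uint16_t srcport, dstport;
    uint8_t  pad1, tcp_flags, prot, tos;
    uint16_t src_as, dst_as;
    uint8_t  src_mask, dst_mask;
    uint16_t pad2;
};
```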

[0]: Good sFlow vs. NetFlow breakdown:
[http://networkengineering.stackexchange.com/a/1335](http://networkengineering.stackexchange.com/a/1335)

------
bsder
That's great, but how do you then get that many packets to _disk_ so that you
can do something with them?

Presumably you need flash drives, and probably an append-only filesystem?

~~~
signa11
> but how do you then get that many packets to disk

it _may_ be possible to do disk i/o at that high a rate, e.g. with pci-e or a
dedicated appliance for dumping the entire stream, but you would run out of
storage pretty fast.

for example, a quick back-of-the-envelope calculation, where you dump the
packet stream from 4x10gbps cards at the minimal 84-byte frame size (on
ethernet), shows that you would exhaust the storage in approx. 4.5 minutes :)
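
For concreteness, a minimal sketch of that back-of-the-envelope math; the
comment doesn't state the assumed disk size, so the ~1.35 TB figure below is
simply what 4.5 minutes at 40 Gbps works out to:

```c
#include <stdio.h>

/* Back-of-the-envelope: how long until a disk fills up when capturing
 * 4x10GbE at line rate (framing overhead ignored). */
int main(void) {
    double link_gbps   = 4 * 10.0;            /* 4x 10 Gbps NICs */
    double bytes_per_s = link_gbps * 1e9 / 8; /* ~5 GB/s */
    double disk_bytes  = 1.35e12;             /* assumed disk size (~1.35 TB) */
    printf("fill time: %.1f minutes\n", disk_bytes / bytes_per_s / 60.0);
    return 0;
}
```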

~~~
bsder
40 Gigabits per second is roughly 4 Gigabytes per second.

4 Gigabytes per second times 86400 seconds per day is 345,600 Gigabytes per
day.

Roughly: 345 Terabytes per day.

Large, but not stupidly so.

~~~
m-app
40 Gbps would actually be _exactly_ 5 Gigabytes per second (divided by 8).

~~~
bsder
While I don't know the exact overhead of 10GigE, there is likely still some
overhead.

At the lower speeds, things like 8b/10b encoding and Reed-Solomon ECC added
enough overhead that dividing by 10 was more accurate than dividing by 8.

~~~
signa11
> While I don't know the exact overhead of 10GigE, there is likely still some
> overhead.

on 10gige pipes, at the max ethernet mtu (1500 bytes), approx. 94% of the
bandwidth is available for user data (accounting for things like the
inter-frame gap, crc checksums etc). with jumbo frames that number goes to 99%.
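
A sketch of where those numbers plausibly come from, assuming the 94% also
discounts the 40 bytes of IPv4+TCP headers (the comment doesn't spell out what
counts as "user data"). Each frame carries a fixed 38 bytes of wire overhead:
7 preamble + 1 start-of-frame delimiter + 14 Ethernet header + 4 CRC + 12
inter-frame gap.

```c
#include <stdio.h>

/* Fraction of wire bandwidth left for TCP payload at a given MTU. */
static double efficiency(double mtu) {
    double wire    = mtu + 38.0; /* bytes on the wire per frame */
    double payload = mtu - 40.0; /* minus IPv4 + TCP headers */
    return payload / wire;
}

int main(void) {
    printf("1500 MTU: %.1f%%\n", 100 * efficiency(1500)); /* ~94.9% */
    printf("9000 MTU: %.1f%%\n", 100 * efficiency(9000)); /* ~99.1% */
    return 0;
}
```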

~~~
bsder
Okay, so call it 10% overhead (actually closer to 6%, going by the 94% figure)
if we're taking a WAG (wild *ss guess).

That would mean that I would need to divide by roughly 8.5.

Sorry, I can't do divide-by-8.5 quickly in my head. I can do divide by 10
though, and my error is only around 15%.

------
hacknat
If anyone is interested, I wrote a lock-free, C/C++-free implementation of an
AF_PACKET socket abstraction in Golang:

[https://github.com/nathanjsweet/zsocket](https://github.com/nathanjsweet/zsocket)

I haven't implemented fan-out at all, but if anybody is interested in adding
it, I'd happily merge their pull request.
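
For anyone curious what adding fan-out would involve: here is a minimal C
sketch of the underlying kernel feature, PACKET_FANOUT, that a Go
implementation would wrap. Sockets that join the same fanout group have the
kernel spread packets across them; error handling is trimmed and the program
needs CAP_NET_RAW to run.

```c
#include <sys/socket.h>
#include <linux/if_packet.h>
#include <linux/if_ether.h>
#include <arpa/inet.h>

/* Open an AF_PACKET socket and join fanout group `group_id`, letting the
 * kernel distribute packets across group members by flow hash. */
static int fanout_socket(int group_id) {
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0)
        return -1;
    /* low 16 bits: group id; high 16 bits: fanout algorithm */
    int arg = group_id | (PACKET_FANOUT_HASH << 16);
    if (setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg)) < 0)
        return -1;
    return fd;
}

int main(void) {
    /* Two sockets in group 42: each flow is hashed to exactly one of them. */
    int a = fanout_socket(42);
    int b = fanout_socket(42);
    return (a < 0 || b < 0) ? 1 : 0;
}
```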

------
user5994461
What's the point of capturing and storing 40 Gb/s of network traffic?

~~~
feld
analysis, IDS, etc.

The inability to analyze traffic at this rate is a serious problem. How do you
study it to see how protocols can be improved? A lab environment cannot
compare to real-world traffic. How do you detect attacks (not DoS!) if they're
hidden in a link operating at this capacity?

~~~
user5994461
Even if you can capture the traffic at wire speed, the CPU doesn't have the
power to analyse the stream. I thought traffic analysers had to be built with
FPGAs/ASICs because of that.

~~~
feld
My manager did his thesis on this. Endace NICs split the traffic up and send
it to a cluster of IDS servers, which lets you actually do line-rate analysis.
No need for an FPGA/ASIC.

