
Turtles on the Wire: Understanding How the OS Uses the Modern NIC (2016) - nz
http://dtrace.org/blogs/rm/2016/09/15/turtles-on-the-wire-understanding-how-the-os-uses-the-modern-nic/
======
drewg123
The biggest innovation that I've seen in quite some time in terms of making
creative use of hardware offloads is Hans Petter Selasky's RSS-assisted LRO in
FreeBSD.

On our workloads (~100K connections on a 16-core / 32-HTT FreeBSD-based 100GbE
CDN server), LRO was rather ineffective because there were roughly 3K
connections per RX queue. Even with large interrupt coalescing parameters and
large ring sizes, the odds of encountering 2 packets from the same connection
within a few packets of each other, even in a group of 1000 or more, are
rather small.

The first idea we had was to use a hash table to aggregate flows. This helped,
but had the drawback of a much higher cache footprint.

Hps had the idea that we could sort packets by RSS hash ID _before_ passing
them to LRO. This puts packets from the same connection adjacent to each
other, allowing LRO to work without a hash table. Our LRO aggregation rate
went from ~1.1:1 to well over 2:1, and we reduced CPU use by roughly 10%.

This code is in FreeBSD-current right now (see tcp_lro_queue_mbuf())
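
A minimal userland sketch of the idea (not the actual tcp_lro_queue_mbuf()
code; the struct and function names below are made up for illustration):
buffer a batch of received packets, sort them by RSS hash so same-flow packets
become adjacent, then run the simple "does this continue the previous packet?"
merge over the sorted batch.

```c
/*
 * Sketch of RSS-assisted LRO: sort a batch of received packets by the
 * NIC-computed RSS hash (keeping arrival order within a flow), then
 * merge adjacent packets that share a hash.  Illustrative names only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct rx_pkt {
	uint32_t rss_hash;	/* flow hash computed by the NIC */
	uint64_t seq;		/* arrival order, keeps the sort stable */
	/* ... headers and payload would live here ... */
};

static int
pkt_cmp(const void *a, const void *b)
{
	const struct rx_pkt *pa = a, *pb = b;

	if (pa->rss_hash != pb->rss_hash)
		return (pa->rss_hash < pb->rss_hash ? -1 : 1);
	/* Same flow: preserve arrival order so TCP sequencing stays intact. */
	return (pa->seq < pb->seq ? -1 : (pa->seq > pb->seq ? 1 : 0));
}

static void
lro_flush_batch(struct rx_pkt *pkts, size_t n)
{
	size_t i;

	/* Sort by RSS hash; same-connection packets become adjacent. */
	qsort(pkts, n, sizeof(pkts[0]), pkt_cmp);

	for (i = 0; i < n; i++) {
		if (i > 0 && pkts[i].rss_hash == pkts[i - 1].rss_hash)
			printf("merge pkt %zu into previous LRO segment\n", i);
		else
			printf("start new LRO segment at pkt %zu\n", i);
	}
}

int
main(void)
{
	/* Toy batch: three flows interleaved, as they arrive off the ring. */
	struct rx_pkt batch[] = {
		{ 0xaaaa, 0 }, { 0xbbbb, 1 }, { 0xcccc, 2 },
		{ 0xaaaa, 3 }, { 0xbbbb, 4 }, { 0xaaaa, 5 },
	};

	lro_flush_batch(batch, sizeof(batch) / sizeof(batch[0]));
	return (0);
}
```

The real code of course still validates the flow before merging; the sort just
makes the compare-against-the-previous-packet approach hit far more often.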

~~~
pebblexe
FreeBSD also has netmap
[https://www.freebsd.org/cgi/man.cgi?query=netmap&sektion=4](https://www.freebsd.org/cgi/man.cgi?query=netmap&sektion=4)

------
policedemil
Great article! A lot of this is way beyond me, but I'm generally interested in
the process of how a NIC filters based on MAC addresses.

I'm in the humanities and certain scholars working with culture and technology
love to make a huge deal about data leakage and how intertwined we all are
precisely because you can put a NIC in promiscuous mode and capture packets
that weren't meant for you. The whole point is that, because your NIC is
constantly receiving data meant for others (i.e., it's the NIC itself that
does the MAC-address filtering), something like privacy on networks is always
problematic. I've always found that point somewhat overstated.

So, could anyone explain real quick the process of how a NIC decides whether a
packet/frame is actually bound for it or link some good resources? For
example, does the NIC automatically store the frame/packet in a buffer, then
read the header, and then decide to discard? Or can it read the header before
storing the rest of the frame? How much has been read at the point the NIC
decides to drop it or move it up the stack? Reading all of every packet seems
improbable to me because if it were the case, laptop 1 (awake but not
downloading anything) would experience significant battery drain due to
constantly filtering network traffic that was meant for laptop 2. I'm not sure
that really maps to my experience. Also, I assume there are differences
between wired LAN and WiFi?

Any help on the matter would be greatly appreciated! I've tried google diving
on this question many times before and it's really hard to find much on it.

~~~
javajosh
The serial nature of the physical connection cannot be overstated. Bits flow
to your NIC one at a time.

A 1Gb/s NIC is detecting a billion wiggles in voltage per second. Structure is
imposed on these wiggles in stages: first, the A/D conversion happens, making
the voltage wiggles 1's and 0's, then ethernet framing, then IP packet
parsing, then TCP packet/ordering, then the application handles IO (and can,
and often does, define even more structure, such as HTTP). You might look up
the OSI network layering model (or the OSI 7-layer burrito, as I call it).

My understanding is that MAC filtering happens after Ethernet framing and
before the frame is put into the ring via DMA, and frames failing that test do
not generate interrupts. Your NIC hardware is _choosing_ to ignore packets not
addressed to it because, generally, it's pretty useless to listen in on other
people's packets, especially these days when you're most likely to capture
HTTPS-encrypted data.
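
To make the decision concrete, here is a rough software sketch of the
acceptance test a NIC's receive filter applies to a frame's destination MAC
before it bothers to DMA the frame and raise an interrupt. Real hardware does
this with exact-match registers, a multicast hash table, and a promiscuous
bit; all the names below are invented for illustration.

```c
/*
 * Hypothetical sketch of a NIC receive filter's accept/drop decision.
 * Frames that fail this test are dropped in hardware: no DMA, no
 * interrupt, no CPU time spent.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ETHER_ADDR_LEN 6

struct nic_filter {
	uint8_t station_addr[ETHER_ADDR_LEN];	/* the NIC's own MAC */
	bool promiscuous;			/* accept everything? */
};

static const uint8_t broadcast_addr[ETHER_ADDR_LEN] =
    { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };

/* Stand-in for the multicast hash lookup real hardware performs. */
static bool
multicast_subscribed(const struct nic_filter *f, const uint8_t *dst)
{
	(void)f;
	(void)dst;
	return (false);
}

static bool
accept_frame(const struct nic_filter *f, const uint8_t *dst_mac)
{
	if (f->promiscuous)
		return (true);	/* e.g. a sniffer enabled promiscuous mode */
	if (memcmp(dst_mac, f->station_addr, ETHER_ADDR_LEN) == 0)
		return (true);	/* unicast frame addressed to this NIC */
	if (memcmp(dst_mac, broadcast_addr, ETHER_ADDR_LEN) == 0)
		return (true);	/* broadcast (ARP requests and the like) */
	if ((dst_mac[0] & 0x01) && multicast_subscribed(f, dst_mac))
		return (true);	/* multicast group this host has joined */
	return (false);		/* drop: not for us */
}

int
main(void)
{
	struct nic_filter f = {
		.station_addr = { 0x02, 0x00, 0x00, 0x12, 0x34, 0x56 },
		.promiscuous = false,
	};
	const uint8_t someone_else[ETHER_ADDR_LEN] =
	    { 0x02, 0x00, 0x00, 0xaa, 0xbb, 0xcc };

	/* A frame for another host never makes it past the filter. */
	printf("accept frame for other host? %s\n",
	    accept_frame(&f, someone_else) ? "yes" : "no");
	return (0);
}
```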

~~~
policedemil
Thank you! This is starting to make a lot more sense! I guess what's still
confusing to me is the seriality (continuity?) of the process vs. the OSI
model, which makes it sound like there are discrete stages.

Does each stage of the model need to complete for the whole packet before
moving on to the next stage? For example, does A/D conversion run until all of
the packet's bits are converted, after which the whole binary blob is framed
with its header... and then we filter on the MAC address and move up the stack
in discrete, consecutive stages? Or are the voltages read off the line and,
once there is enough information to reconstruct the header, compared, with the
NIC then choosing either to keep reading the rest from the line or to stop the
A/D conversion because it would just be a waste of energy? The latter makes a
lot more sense to me.

EDIT: words

~~~
mschuster91
> Does each stage of the model need to complete for the whole packet before
> moving on to the next stage?

Usually not. Massive core switches could not work if they had to wait for
every frame to be fully buffered before beginning to transmit it out the
correct port. All a core switch needs to look at is the destination MAC
address.

Simple math explains why: the MTU (maximum payload size) is usually 1500 bytes
(because most packets originate on Ethernet). The destination MAC sits right
at the start of the Ethernet frame (bytes 9 through 14 on the wire, just after
the 8-byte preamble and start-of-frame delimiter), which means it would be an
absolute waste of time to hold off forwarding until the remaining ~1500 bytes
are in the buffer.

Let's calculate this with your ordinary 100 Mbit/s home connection (to keep
the numbers in reasonable magnitudes). 100 Mbit/s means 0.00000001 s/bit (or
0.01 us/bit, or 0.08 us/byte). If forwarding starts after roughly the first 16
bytes, the delay introduced by the equipment is 1.28 us (and only a handful of
bytes ever need to be buffered between the start and the end of the packet).
Waiting for the full 1500 bytes would introduce 120 us (or 0.12 ms) of
latency, as well as require 1500 bytes of buffer for the duration of the
transmission.
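
The same arithmetic as a tiny C program, for anyone who wants to play with the
numbers (the 16-byte cut-through threshold, the 1500-byte frame, and the 10
hops are just the figures used above):

```c
/*
 * Serialization delay per byte at a given link speed, the delay a
 * cut-through switch adds (forwarding once it has seen the destination
 * MAC, taken here as the first 16 bytes on the wire), and the delay a
 * store-and-forward switch adds (buffering the whole frame first).
 */
#include <stdio.h>

int
main(void)
{
	double link_bps = 100e6;			/* 100 Mbit/s */
	double us_per_byte = 8.0 / link_bps * 1e6;	/* 0.08 us/byte */
	int cut_through_bytes = 16;	/* enough to read the dst MAC */
	int frame_bytes = 1500;		/* typical Ethernet MTU */
	int hops = 10;

	double cut_through_us = cut_through_bytes * us_per_byte;
	double store_forward_us = frame_bytes * us_per_byte;

	printf("per byte:          %.2f us\n", us_per_byte);
	printf("cut-through delay: %.2f us per hop\n", cut_through_us);
	printf("store-and-forward: %.2f us per hop (%.2f ms over %d hops)\n",
	    store_forward_us, store_forward_us * hops / 1000.0, hops);
	return (0);
}
```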

~~~
policedemil
Excellently put! That's exactly what I was looking for! Thank you!

~~~
mschuster91
If you want to read further, this is a part of network delay
([https://en.wikipedia.org/wiki/Network_delay](https://en.wikipedia.org/wiki/Network_delay)).
A highly interesting field.

By the way, one thing I forgot: forwarding immediately (cut-through) has much
less latency (obviously), but the switch cannot retract packets that were
corrupted on the way in at the ingress port, simply because the frame checksum
can only be verified once the whole packet is in the buffer.

So basically you choose between safety (corrupted packets do not travel as
far, because they get dropped before reaching the final station) and latency
(e.g. 10 hops at 0.12 ms each = 1.2 ms of delay at 100 Mbit/s), and you also
pay for the safety in buffer memory.

------
en4bz
This is why I'm really hoping RDMA [1] will catch on soon. It would be great
if there were a cloud provider that enabled this feature on some of their
offerings. Amazon has done something similar by allowing kernel bypass via
DPDK [2] with their ENA offering, but kernel bypass is inferior to RDMA in so
many ways IMO.

At this point we have 200 Gbit/s NICs being offered by Mellanox [3]. CPUs
aren't getting any faster, and the scale-out approach is extremely difficult
to get right without going across NUMA domains [4]. Based on the progression
of CPUs lately, there just isn't going to be enough time to process all these
packets AND have time left over to actually run your application. There's a
lot of work focusing on data locality at the moment, but at this point it's
still not foolproof, and the work that has been done is woefully
underdocumented.

As the article mentions, we've already added a bunch of hardware offloads.
RDMA is just a continuation of these offloads, but unfortunately it requires
some minor changes on the application side to take advantage of, which is
probably why it's been slow to be adopted.

RDMA has so many great applications for data transfer between backend
services, whether it's queries between a web server and a DB,
replication/clustering of DBs, or a micro-service fabric with microsecond
latencies. Overall there's a lot of low-hanging fruit that could be optimized
with RDMA.

[1]
[https://en.wikipedia.org/wiki/Remote_direct_memory_access](https://en.wikipedia.org/wiki/Remote_direct_memory_access)

[2] [http://dpdk.org/doc/nics](http://dpdk.org/doc/nics)

[3]
[http://www.mellanox.com/page/products_dyn?product_family=266&mtag=connectx_6_en_card](http://www.mellanox.com/page/products_dyn?product_family=266&mtag=connectx_6_en_card)

[4]
[http://rhelblog.redhat.com/2015/09/29/pushing-the-limits-of-kernel-networking/](http://rhelblog.redhat.com/2015/09/29/pushing-the-limits-of-kernel-networking/)

~~~
kev009
One other point: these cloud companies like Amazon and Google do a great deal
of self-congratulation over some pretty embarrassing architectures.
Necessitating million-server datacenters is as much of a WTF to me as being
proud of millions of lines of code when someone else is doing it with orders
of magnitude less. 40G networking has been commercially viable since the start
of the decade, and 100G (or partial variants) for a couple of years now, while
ENA tops out at 20G. They are basically asleep at the wheel because their
architectures are predicated on poor assumptions, shirking an understanding of
computer architecture in favor of tens of thousands of SREs and "full stack
developers".

Dovetailing a bit: back in the commercial UNIX and mainframe markets it is
pretty common to see 80-90% system utilization. In the Linux world, it's
usually single digits. For some reason (I guess it's more inviting to think
holistically given the base OS model), we are getting those 90% figures in the
BSD community. See drewg's comment, WhatsApp, Isilon, LLNW:

I led the creation of an OS team to look, from first principles, at where we
were spending CPU/bus/IO bandwidth, focusing on north-south delivery instead
of horizontal scale. With a team of 5, we were able to deliver significant
shareholder value
[http://investors.limelight.com/file/Index?KeyFile=38751761](http://investors.limelight.com/file/Index?KeyFile=38751761).

~~~
en4bz
Yeah I suppose what I really want is physical procurement of hardware with the
ease of cloud.

~~~
kev009
I haven't used them, but it looks compelling:
[https://www.packet.net/](https://www.packet.net/)

------
tedunangst
Buggy firmware with edge cases is putting it mildly. I suppose a checksum of
0000 or ffff is technically an edge case, but it's not all that uncommon, and
it's a pretty popular thing to get wrong.
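
For anyone wondering why 0x0000 and 0xffff in particular: below is a minimal
sketch of the 16-bit one's-complement Internet checksum (simplified, not a
kernel implementation). In one's-complement arithmetic both bit patterns are
representations of zero, so the final complemented sum can legitimately come
out as either, and rules like UDP's "a computed checksum of 0x0000 must be
transmitted as 0xffff, since 0 means no checksum" are exactly where offload
engines tend to slip up.

```c
/*
 * Simplified 16-bit one's-complement Internet checksum, showing how
 * the result can land on either representation of zero.
 */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

static uint16_t
in_cksum(const uint16_t *data, size_t nwords)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < nwords; i++) {
		sum += data[i];
		if (sum > 0xffff)	/* fold the carry back in */
			sum = (sum & 0xffff) + 1;
	}
	return ((uint16_t)~sum);	/* one's complement of the sum */
}

int
main(void)
{
	/* Words summing to 0xffff: the checksum comes out as 0x0000. */
	uint16_t zero_case[] = { 0xff00, 0x00ff };
	/* Words summing to 0x0000: the checksum comes out as 0xffff. */
	uint16_t ffff_case[] = { 0x0000, 0x0000 };

	printf("checksum A: 0x%04x\n", in_cksum(zero_case, 2));
	printf("checksum B: 0x%04x\n", in_cksum(ffff_case, 2));
	return (0);
}
```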

~~~
gonzo
Even the standards got it wrong.
[https://tools.ietf.org/html/rfc1624](https://tools.ietf.org/html/rfc1624)

------
bluetech
Interesting article.

Another related article I found interesting, which discusses some of the
queues in the Linux network stack:
[https://www.coverfire.com/articles/queueing-in-the-linux-network-stack/](https://www.coverfire.com/articles/queueing-in-the-linux-network-stack/)

------
ams6110
As an aside, is anyone using the Joyent cloud stuff in production? Any good
comparisons to OpenStack? Looking for something easier to manage.

------
pthreads
This is a very useful write-up. I thoroughly enjoyed it and found it
informative. Thank you.

