
Packet copies: Expensive or cheap? - luu
https://github.com/SnabbCo/snabbswitch/issues/648
======
abelsson
If you think the cost of the instructions needed to express the copy is the
main problem, I'd call your mental model flawed. That is not the thing to
worry about: when moving data, the issued instructions matter very little.

No, the thing to worry about is the memory subsystem. How does the data
interact with the caches? What DRAM access patterns do I have? How do I flow
data as efficiently as possible?

~~~
nl
Not if you are writing a software based switch, especially one designed to be
scriptable.

Which the author is.

Context is everything.

~~~
wmf
Can you elaborate? Is a software-based switch exempt from the memory
subsystem?

~~~
pmelendez
I'm not the OC, but reading the author's project list, it seems to me that he
is working heavily on network optimization in cases where packets are small
enough that they won't be evicted from cache before processing is done.

If that were true, then the memory subsystem has little part to play in this.
However, users of this still have to take memory into consideration when
sending packets, hence the OC's comment that context is everything.

------
Animats
The big question is where the packet came from. If a program just produced it,
it's probably in cache. If it came in from an I/O device, it probably isn't.

This is why message copying isn't a big cost in modern microkernels. Create
message->IPC->copy message->process message is an in-cache operation, if the
scheduler and the message passing system are on speaking terms.

~~~
aristidb
There is a technology that allows incoming I/O data to be put directly into
cache. Pretty useful! :)

I think this is the original introduction:
[http://web.stanford.edu/group/comparch/papers/huggahalli05.p...](http://web.stanford.edu/group/comparch/papers/huggahalli05.pdf)

------
orm
New to this, but basing such a calculation purely on CPU capacity would imply
a bandwidth to memory (or some cache) of

64 B / inst * 2 (read and writeback) * 3 inst / cycle * 3 Gcycle/sec > 1000
gigabytes/sec, which is insane considering 50GB/s is the norm to RAM right
now.

*edit: Haswell's L1 cache seems to allow a peak bandwidth of one 64-byte load plus one 32-byte store per cycle, so as other comments point out, this would match the CPU capacity. On the other hand, there is also latency, and it is hard for me to tell how latency factors into this.

Could someone explain how they would decide whether copying packets is cheap
or not?

~~~
jsprogrammer
>Could someone explain how they would decide whether copying packets is cheap
or not?

Run two machines: one runs packet-copying software, the other runs
non-packet-copying software. Both run on the same data set. Record the amount
of electricity used, multiply by the marginal electricity cost, and compare.
If the number produced by the packet-copying machine is not significantly
different from the machine that didn't copy packets, copying packets is
cheap.

------
lukego
Note to Russian HN readers: I'm giving a talk about Snabb Switch in Moscow on
Tuesday afternoon at HighLoad++.

------
jhallenworld
I worked on a router with multicast and we avoided packet copying with an
interesting reference counting scheme:

When a packet arrives, its handle gets 64K references and its 16-bit reference
counter starts at 0.

When we are done with a handle (packet transmitted), the references in the
handle are added to the counter. If the result is 0 (adding all 64K references
wraps a 16-bit counter back to exactly 0), the packet can be freed.

When we duplicate a packet, it's always the case that there is a parent handle
which gets split into children. So what do you do? You divide the references
among the children: now you have two handles with 32K references each. When
the first one is done, the counter holds 32K. When the second one is done, the
counter wraps to 0 and the packet can be freed.

The advantage of this scheme is that there is only exactly one ADD to
reference counter per duplicated packet.

For large packets this scheme will be faster than copying. For small packets,
maybe not: even the single ADD is probably a miss and involves moving a cache
line. (Anyway, you usually want the packet header to be in the handle, so
small packets fit in the handle and large packets use buffers.)

------
XMPPwocky
Isn't that instruction timing data assuming an L1 cache hit?

------
orm
Thinking about it again, from my calculation below: if those packets are
coming from RAM then you are bound to, say, 25 GB/s incoming and 25 GB/s
outgoing.

Since Haswell's L1 cache seems to allow a peak bandwidth of a 64-byte load
plus a 32-byte store per cycle, this would match the CPU throughput. That
suggests that if the lifecycle is "bring from RAM", "copy in and out of L1
cache", "put back in RAM", then those copies may well be free, given you are
choking on RAM bandwidth to begin with.

