
Ffwd: Delegation is much faster than you think - ingve
https://blog.acolyer.org/2017/12/04/ffwd-delegation-is-much-faster-than-you-think/
======
jsnell
This felt like enough of a "devil is in the details" case that I needed to
read the original rather than a summary. Clicked through to the paper, to be
greeted by an ACM paywall. Blech. But wait, what's this... "article made
available by the morning paper", with a tooltip "special arrangement between
ACM and blog.acolyer.com". So that's cool, I guess :)

One odd bit about the design seemed to be that the responses were allocated on
a per-socket rather than per-core basis. The paper implies (but doesn't seem
to state outright) that this is done to pack as much data as possible into each
store buffer. Does that sound right?
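Roughly the layout I'm imagining (my own guess at a struct, not taken from the
paper; assumes 8-byte return values and 64-byte lines):

    #define CLIENTS_PER_SOCKET 16
    #define SLOTS_PER_LINE 8   /* 8-byte return slots per 64-byte line */

    struct resp_line {
        /* single writer: the server; each client only reads its slot */
        volatile long slot[SLOTS_PER_LINE];
    };

    /* one compact array per socket, instead of one padded line per
       core, so consecutive responses can coalesce in the server's
       store buffer before the lines are flushed out */
    struct resp_line responses[CLIENTS_PER_SOCKET / SLOTS_PER_LINE];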

~~~
naasking
> This felt like enough of a "devil is in the details" case that I needed to
> read the original rather than a summary.

Seriously, I was very surprised that delegation+combining is slower than ffwd's
delegation alone. I definitely want to see those details.

~~~
jeriksson
Combining is a type of delegation, where threads alternate taking on the role
of the "server" in a conventional client-server delegation scheme.

~~~
naasking
Right, but my point was that combining appears to be slower than delegation
alone. Perhaps it's due to cache effects, since new servers are unlikely to
have the right data in cache, whereas a dedicated thread will almost certainly
have the most commonly accessed addresses in cache.

------
xaedes
ffwd: fast, fly-weight delegation

One should define such unfamiliar acronyms on first use, not in the middle of
the article after the acronym has already been used a dozen times.

------
eptcyka
I could just move a hashmap into its own thread, but then I have to deal with
the problem of communicating with that thread - this will inevitably involve
some kind of lock-free queue or a plain old array with a lock. Otherwise, what
is the point? Or am I missing something fundamental about this?

~~~
jsnell
There are two points.

One is that having just one thread access the data structure gives optimal
cache locality.

The other is that locking causes full serialization: if 8 threads want to
access a data structure protected by a lock at the same time, you need 8 full
cache coherency round-trips. Locking also causes lots of ping-pong on the
cache line that contains the lock. This scheme does away with the
serialization by allowing batching of operations. If 8 threads need to access
a data structure at the same time, their requests can be handled in a single
round-trip. It does away with the ping-pong by making every cache line
have just one writer. (I think it doesn't need locks or atomic ops, just TSO.)
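
Here's a toy sketch of the single-writer pattern as I understand it (names and
layout are mine, not the paper's; assumes x86-TSO and volatile-only accesses;
real ffwd packs several responses per line, per socket, but one line per
client keeps the sketch short):

    #define NCLIENTS 15

    typedef long (*op_fn)(long);

    struct request {
        op_fn volatile fn;              /* operation the server should run */
        volatile long arg;
        volatile unsigned char toggle;  /* flipped by client to post */
        char pad[64 - sizeof(op_fn) - sizeof(long) - 1];
    };

    struct response {
        volatile long ret;
        volatile unsigned char toggle;  /* flipped by server when ret valid */
        char pad[64 - sizeof(long) - 1];
    };

    static struct request  req[NCLIENTS];   /* writer: owning client */
    static struct response resp[NCLIENTS];  /* writer: server only   */

    /* client: post a request, spin until the server answers */
    long delegate(int id, op_fn fn, long arg) {
        unsigned char t = req[id].toggle ^ 1;
        req[id].fn  = fn;
        req[id].arg = arg;
        req[id].toggle = t;             /* posted last; TSO keeps the order */
        while (resp[id].toggle != t)
            ;                           /* spin on a line only the server writes */
        return resp[id].ret;
    }

    /* server: busy-poll every request line, forever */
    void server_loop(void) {
        for (;;)
            for (int i = 0; i < NCLIENTS; i++)
                if (req[i].toggle != resp[i].toggle) {
                    long r = req[i].fn(req[i].arg);
                    resp[i].ret = r;                /* write result first */
                    resp[i].toggle = req[i].toggle; /* then publish it    */
                }
    }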

Which isn't to say it's a free lunch. Two probable downsides:

You need to care about whether you have the right mix of threads doing real
work and threads that are just thin shims on some data structure.

I bet that the data structure threads have to be polling for requests in a
busy-loop for this scheme to have good performance. This would be a major
problem for most applications.

~~~
jeriksson
You are right: the "data structure threads", which we call servers, poll for
requests in a busy loop. If your data structures aren't very busy, dedicating
a thread to serving the data structures may not be a good trade-off, in which
case you don't want to use delegation.

Note, however, that a single server can host any number of data structures,
core counts are growing rapidly, and hyperthreading doubles the number of
available hardware threads, making the trade-off easier to swallow.
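
As a tiny illustration of the first point, building on the toy delegate()
sketch upthread (the ops here are hypothetical): a request carries the
operation itself, so one polling server can front any number of structures
with no extra dispatch logic.

    long table_get(long key);    /* op on a hash table */
    long stack_pop(long unused); /* op on a stack      */

    void client(int id) {
        long v = delegate(id, table_get, 42); /* structure #1 */
        long s = delegate(id, stack_pop, 0);  /* structure #2 */
        (void)v; (void)s;
    }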

------
signa11
honestly, this paper seems to err on the side of too many (scary) numa
details, and not enough on the simplicity of the idea itself, i.e. 'only 1
thread for the data-structure'. having used this approach in fast-forwarding
dataplane applications, the only thing that comes to mind is the high amount
of discipline required for this kind of thing. and humans seem to be quite
bad at it.

immutable functional programming nerds may now gloat wisely ;)

