
Scaling Synchronization in Multicore Programs - xwvvvvwx
http://queue.acm.org/detail.cfm?id=2991130
======
arielweisberg
Right off the bat it presents three possible queue arrangements, none of which
is the one I prefer.

A single MPSC queue per core works very well, and this is what VoltDB does. For
most sane task sizes this works fine, and it scales because a single core can
only process so many tasks, so queue contention isn't an issue.
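
The shape I have in mind is roughly the following (illustrative Java, not
VoltDB's actual code; ConcurrentLinkedQueue stands in for a purpose-built MPSC
queue such as JCTools' MpscArrayQueue):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Sketch of one MPSC queue per core: any thread may submit to any
    // core's queue, but each queue is drained by exactly one worker.
    final class PerCoreExecutor {
        private final List<ConcurrentLinkedQueue<Runnable>> queues = new ArrayList<>();

        PerCoreExecutor(int cores) {
            for (int i = 0; i < cores; i++) {
                ConcurrentLinkedQueue<Runnable> q = new ConcurrentLinkedQueue<>();
                queues.add(q);
                Thread worker = new Thread(() -> {
                    while (true) {
                        Runnable task = q.poll();   // single consumer for this queue
                        if (task != null) task.run();
                        else Thread.onSpinWait();   // idle strategy is a separate policy choice
                    }
                }, "worker-" + i);
                worker.setDaemon(true);
                worker.start();
            }
        }

        // Many producers, one consumer: any thread may offer to any core's queue.
        void submit(int core, Runnable task) {
            queues.get(core).offer(task);
        }
    }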

If you want to crank things to 11 you can do an SPSC queue between every pair
of cores, which is what ScyllaDB does and what the article suggests. This does
relieve CAS failure, but outside of microbenchmarks I am not sure it is a real-
world issue.
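
In that arrangement each queue has exactly one producer core and one consumer
core, so each index has a single writer and neither side ever issues a CAS. A
rough sketch of one cell of that mesh (my own illustration, not ScyllaDB's
actual code; padding and false-sharing concerns are omitted):

    import java.util.concurrent.atomic.AtomicLong;

    // One SPSC ring buffer; a full mesh needs cores * cores of these.
    // Capacity must be a power of two.
    final class SpscRing<T> {
        private final Object[] buffer;
        private final int mask;
        private final AtomicLong tail = new AtomicLong(); // written only by producer
        private final AtomicLong head = new AtomicLong(); // written only by consumer

        SpscRing(int capacityPow2) {
            buffer = new Object[capacityPow2];
            mask = capacityPow2 - 1;
        }

        boolean offer(T item) {                   // called by the single producer
            long t = tail.get();
            if (t - head.get() == buffer.length) return false;  // full
            buffer[(int) (t & mask)] = item;
            tail.lazySet(t + 1);                  // ordered store publishes the element
            return true;
        }

        @SuppressWarnings("unchecked")
        T poll() {                                // called by the single consumer
            long h = head.get();
            if (h == tail.get()) return null;     // empty
            T item = (T) buffer[(int) (h & mask)];
            buffer[(int) (h & mask)] = null;
            head.lazySet(h + 1);
            return item;
        }
    }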

I had never considered the advantages of atomic instructions that never fail.
In the context of things like the Disruptor or Agrona those choices start to
make more sense.
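
For example (illustrative Java): a CAS loop can fail and retry under
contention, while getAndAdd is a single unconditional atomic (typically LOCK
XADD on x86) that always succeeds, which is the kind of primitive the
Disruptor's multi-producer sequence claiming leans on:

    import java.util.concurrent.atomic.AtomicLong;

    final class Counters {
        private final AtomicLong counter = new AtomicLong();

        long nextWithCas() {
            long current;
            do {
                current = counter.get();
                // Under contention this compareAndSet can fail repeatedly,
                // and each failure wastes the read above.
            } while (!counter.compareAndSet(current, current + 1));
            return current + 1;
        }

        long nextWithFetchAdd() {
            // getAndAdd succeeds in one shot; contended threads are
            // serialized by the hardware rather than retrying in software.
            return counter.getAndAdd(1) + 1;
        }
    }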

~~~
jbapple
Have you benchmarked MPSC-per-core and SPSC-per-pair-of-cores against any of
the three concurrent queues in the article?

~~~
arielweisberg
I haven't benchmarked them comparatively, but when I benchmarked and profiled
VoltDB there was very little time spent on CAS failures according to Flight
Recorder. It could be that the profiler is wrong, so maybe instrumenting makes
more sense just to fully prove the point.
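
By instrumenting I mean something along these lines (illustrative, not VoltDB's
queue code): count failed CAS attempts at the contention point itself rather
than trusting the profiler's sampling:

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.atomic.LongAdder;

    final class InstrumentedCounter {
        private final AtomicLong value = new AtomicLong();
        private final LongAdder casFailures = new LongAdder();

        long increment() {
            while (true) {
                long current = value.get();
                if (value.compareAndSet(current, current + 1)) return current + 1;
                casFailures.increment();   // record every failed CAS
            }
        }

        long failures() { return casFailures.sum(); }
    }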

A single queue is only being asked to handle low hundreds of thousands of
events per second, which isn't that much. This is because there is substantial
(for some definition of substantial) work associated with each task. If you are
just looking up a value in a hash map then yes, it matters, but if you are
doing something more substantial it doesn't. The research is definitely
interesting and it solves real-world problems, but I don't think you
necessarily need to move to groups of input task queues directly.

If you look at the graph, it's at five million operations/second and 1 CAS with
two threads. If you aren't going to push more than 1 million ops through a
single queue/core, I think you will be fine.

------
RMarcus
It's very interesting to see a lot of the traditional assumptions behind
system design research change with hardware. In-memory databases, choosing
O(n) cache-sensitive algorithms over O(log n) random-access ones, and single
nodes that are starting to look more like shared-nothing distributed systems
were all major themes at VLDB this year.

