
Comparison of different concurrency models: Actors, CSP, Disruptor and Threads  - r4um
http://java-is-the-new-c.blogspot.com/2014/01/comparision-of-different-concurrency.html
======
noelwelsh
Disagree with the characterization of CSP. CSP as I understand it has channels
as the main building block for interprocess communication. Channels can be
synchronous or asynchronous, and bounded or unbounded. The important point is
that a channel is a value (you can pass it to a method, return it from a
method) and usually has a type.

Actors are like a simplified CSP where each (lightweight) thread has a single
input channel. In the case of Akka this means you lose type information,
because control messages are mingled with data messages and you can't assign
any useful type to them.
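The distinction can be sketched in Go (used here only as an illustration; the names are mine): a CSP-style channel is a typed first-class value you can hand to a function, while an actor-style single mailbox that carries both data and control messages ends up untyped and needs runtime dispatch.

```go
package main

import "fmt"

// CSP style: the channel is a typed value, passed to and from functions.
func producer(out chan<- int) {
	for i := 1; i <= 3; i++ {
		out <- i
	}
	close(out)
}

func sum(in <-chan int) int {
	total := 0
	for v := range in {
		total += v
	}
	return total
}

// Actor style (roughly): one mailbox per lightweight thread. Control
// messages (stop) mingle with data (int), so the channel's element type
// degrades to interface{} and the compiler can no longer help.
type stop struct{}

func actor(mailbox <-chan interface{}, done chan<- int) {
	total := 0
	for msg := range mailbox {
		switch m := msg.(type) {
		case int:
			total += m
		case stop:
			done <- total
			return
		}
	}
}

func main() {
	ch := make(chan int, 3)
	go producer(ch)
	fmt.Println(sum(ch)) // 6

	mb := make(chan interface{}, 4)
	done := make(chan int)
	go actor(mb, done)
	mb <- 1
	mb <- 2
	mb <- stop{}
	fmt.Println(<-done) // 3
}
```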

Disruptor is mainly a pattern and implementation for high efficiency -- big
queues, minimum number of threads, and some tricks using CAS operations and
the like. I wouldn't call it a model of concurrency -- it's basically a
particular implementation of CSP.

~~~
calibraxis
Yes, authors cited their sources, so we can understand their perspective
better.

~~~
calibraxis
_facepalm_ That should be "authors _should_..." Which is BTW an absurdly
trivial observation to post in retrospect, on top of it.

------
kenrose
Can anyone elaborate on what happens with Akka after 5 cores? In all of the
timings, Akka shows the same exponential drop as all of the other systems -
until it hits 5 cores. At that point, it levels off or goes up. Is there
anything inherent in the Akka implementation that would cause this?

------
stcredzero
There should be a rethink of how our multicore computers are architected. Here
is what I have to do in order to write a multicore parallel soft real-time
application.

1) I have to scrutinize the way I use memory, such that no two threads are
going to stomp on the same cache line too often. (False sharing)

2) The above includes the way I use locking primitives and higher level
constructs that are built on them! (Lock convoying)

3) If I am using a sophisticated container like a PersistentMap, which is
supposed to make it easy to think about concurrency, I still have to think
about 1 & 2 above at the level of the container's API/contract, as well as
think about how they might interleave contended data within their
implementations. (Yes, Concurrency is not Parallelism. Here we see why.)

4) Garbage Collection -- Now I have to think about if the GC is in a separate
worker thread and think about how that can result in cache line contention.

5) Even if you do all of the above correctly, the OS can still come along and
do something counterproductive, like trying to schedule all your threads on as
few cores/sockets as possible. (This is even nicknamed "Stupid Scheduling" in
Linux by people who have to contend with it.) This entails yet more work.

6) Profiling all of the above is, as far as I can tell, still a hazy art.
Nothing is a smoking gun for one of the pitfalls of multicore parallelism. One
is only left with educated guesses, which means that you have to increase data
gathering. Is there something like QEMU that can simulate multicore machines
and provide statistics on cache misses? Apparently, there are ways to get this
information from hardware as well.

It would seem that Erlang has an advantage with regards to multicore
parallelism, because its model is "distributed by default," so contention is
severely limited, which is great for parallelism. However, coordination is
severely limited as well! (I need to look at Supervisors and see what they can
and cannot do.)

It would also seem like there's room for languages that combine these recently
popular advanced concurrency tools with enough low-level power to navigate the
above pitfalls, combined with a memory management abstraction that increases
productivity without requiring the complications for parallelism entailed by
GC. Rust, C++, and Objective-C are the only languages that somewhat fit this
bill. (If only Rust were not quite so new!) Go, with its emphasis on pass by
value semantics might also work for certain applications, despite its reliance
on GC.

~~~
vertex-four
What do you mean by coordination being limited in Erlang, exactly?

Supervisors are essentially in charge of one thing, and that's process
lifecycle management; starting up processes, restarting them when they die,
and bailing out if something goes wrong.

~~~
stcredzero
_What do you mean by coordination being limited in Erlang, exactly?_

I have an algorithm for an absolute occupancy grid that can handle multicore
parallelism. Basically, the grid is subdivided into smaller parts that just do
their thing independently, but if they detect that a move takes a player out
of their boundaries, they queue it up locally for the global grid. Once all of
the subgrid threads are done, they wait while the global grid does its thing.
(Aggregating all the local subgrid queues, then processing those moves
requiring global coordination.)
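The algorithm described above can be sketched roughly like this in Go (illustrative only; names and data shapes are mine): subgrid workers run in parallel, queue boundary-crossing moves locally, then a barrier lets the global pass adjudicate the escapes.

```go
package main

import (
	"fmt"
	"sync"
)

type move struct{ player, from, to int }

// processSubgrid applies moves that stay inside [lo, hi) locally (omitted
// here) and returns the moves that leave the subgrid, to be queued for the
// global grid.
func processSubgrid(lo, hi int, moves []move) []move {
	var escaped []move
	for _, m := range moves {
		if m.to < lo || m.to >= hi {
			escaped = append(escaped, m) // needs global coordination
		}
	}
	return escaped
}

func main() {
	subgrids := [][2]int{{0, 10}, {10, 20}}
	work := [][]move{
		{{1, 2, 5}, {2, 8, 12}},   // second move escapes subgrid 0
		{{3, 11, 19}, {4, 15, 3}}, // second move escapes subgrid 1
	}

	escapes := make([][]move, len(subgrids))
	var wg sync.WaitGroup
	for i, sg := range subgrids {
		wg.Add(1)
		go func(i, lo, hi int, moves []move) {
			defer wg.Done()
			escapes[i] = processSubgrid(lo, hi, moves)
		}(i, sg[0], sg[1], work[i])
	}
	wg.Wait() // barrier: all subgrid workers finish first

	total := 0
	for _, e := range escapes {
		total += len(e) // global pass adjudicates cross-boundary moves
	}
	fmt.Println(total) // 2
}
```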

I don't see how I can do that in Erlang. The closest thing I'm aware of (and I
know almost nothing about Erlang) is that moves are done optimistically, then
collisions can be detected after the fact and rolled back. Maybe that's what
I'll need to do: port to Erlang and use rollbacks.

 _Supervisors are essentially in charge of one thing, and that's process
lifecycle management_

I thought they could do more than just that, maybe.

~~~
vertex-four
That should be reasonably possible, surely? Your global grid is one process,
which sends messages containing the smaller parts to worker processes, waits
for them to reply with another message, and then aggregates the replies. It's
all just message buses; you can push whatever stuff you want around however
you want.

Imagine how you'd implement your system if each part of the system had to
communicate with the others over a network socket, and you're most of the way
to implementing it in Erlang.
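The message-passing shape being proposed is a plain scatter/gather. A minimal sketch in Go channels rather than Erlang mailboxes (names are mine): the "global grid" hands each part to a worker and aggregates the replies, with no shared state between them.

```go
package main

import "fmt"

// worker processes one part independently and replies with its result.
func worker(part []int, reply chan<- int) {
	sum := 0
	for _, v := range part {
		sum += v
	}
	reply <- sum
}

func main() {
	parts := [][]int{{1, 2}, {3, 4}, {5}}
	reply := make(chan int, len(parts))
	for _, p := range parts {
		go worker(p, reply) // scatter: one message per part
	}
	total := 0
	for range parts {
		total += <-reply // gather: aggregate the replies
	}
	fmt.Println(total) // 15
}
```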

It's probably not optimal to do things involving lots of calculations in
Erlang, though, as the VM is fairly slow. Akka implements a similar system,
but for the JVM (Scala).

~~~
stcredzero
_waits for them to reply with another message, and then aggregates the
replies_

Basically, you are proposing that the world actor aggregate the entire world's
data? That wouldn't scale. Or, maybe the world actor just becomes responsible
for adjudicating moves between subgrids. Maybe.

~~~
MetaCosm
Isn't the fundamental point of concurrency the players?

Couldn't you have X players moving on a global grid? I have built systems with
literally tens of millions of players (Erlang processes) moving on a grid (and
doing way more, localized threat detection, decision making, moving away or
towards, and coordination with other units in X range).

~~~
stcredzero
Can you give a high level description of how you implement the global grid?
It's really only absolute coordination of a checkerboard-like grid that I'm
wrestling with. I've verified that my algorithm works, and is nicely stable.
I've also scrutinized it with VisualVM, and on Linux and OS X, the threads are
either in wait or running, in the pattern I'd expect. (The subgrid worker-
group threads have to wait for the global grid to do its coordinating.) I'm
also seeing expected use of the thread pools. However, for some reason I can't
seem to scale beyond 250 users.

One complication is that my grid is for a procedurally generated world with
2^120 locations in it. This is why I generate subgrids. A degenerate case is
one subgrid per user. However, these subgrids are organized in load-balanced
groups, each of which has their own thread pool, caches, and locks.

Also, rollbacks are problematic, though the real problems are arguably corner
cases.

Erlang might be a win because each garbage collector only has to deal with its
own local memory.

EDIT: It turns out my algorithm is somewhat similar to Pikko:

[http://www.erlang-factory.com/upload/presentations/297/PikkoServerErlangconference.pdf](http://www.erlang-factory.com/upload/presentations/297/PikkoServerErlangconference.pdf)

One big difference is that my algorithm doesn't move or reconfigure masts;
instead, it dynamically creates subgrids, which are then grouped into
"workgroups", each of which is supposed to be processed by a different CPU
socket. Instead of there being an API, it's more that the subgrids stop what
they are doing and their information is briefly managed by the global grid
code. (The procedurally generated map is rigged so that there are many
opportunities for crossing from one subgrid to another.)

~~~
MetaCosm
Going to do my best to put a quick summary (from memory) on something that
took us a long time to get right and was a bit convoluted.

Our units would report their maximum interaction boundaries to intermediaries,
which would then be passed to processes to create something like a mast,
something like a subgrid -- a dynamic interaction zone. All our units had hard
constraints (max speed, etc.) and worked in global ticks that represented real
time. Then we would talk to global to stretch all the interaction zones to
fill empty space and report back boundaries. Our units would then work in
their little worlds until one crossed a threshold, which caused a rezoning
among them and their neighbors. So initially everything would have to be
parsed out into interaction zones, but then they could ignore each other for
periods of time until a unit strayed across an edge, and then rezoning took
place.

Not sure how well it would work with amped up movement (making it have to go
all the way up to global more) and not certain how it would work at the scale
of 1 undecillion 329 decillion 227 nonillion 995 octillion 784 septillion 915
sextillion 872 quintillion 903 quadrillion 807 trillion 60 billion 280 million
344 thousand 576 points!

~~~
stcredzero
I woke up with the realization that my problem is probably GC pressure. The
system is currently written in idiomatic Clojure, so everything it does
generates garbage. What's more, the garbage is created in one tick and
released in a subsequent tick, so I'm losing the benefit of generational GC.

I think I can make my subgrids algorithm work, but it will have to be using
_mutable_ data structures.

------
eternalban
Disruptor is an _application_ of the pre-emptive threading model. It is
certainly interesting, but to put it on the same pedestal as CSP and the Actor
model is wrong. (Also, the stability of the Disruptor approach in the face of
competing applications on the same machine was an issue last I checked.)

------
rdtsc
It is interesting: since Erlang data structures are functional (immutable),
actor mailboxes could just as well be implemented to share data instead of
copying it. Large binaries are handled that way: they live in a binary memory
area and are referenced via pointers. The rest of the messages are not. At
some point it was deemed better to actually make the copy.

~~~
ANTSANTS
Even if everything is immutable, sharing data between threads adds memory
management overhead that isn't worth it for most objects. For GC, you have to
walk the environment of every thread to free anything in the shared memory. I
dunno how big of a difference that makes for fancy concurrent GCs, but for
regular ones, you'd have to stop every thread during collection. For reference
counting, it's a bit simpler; so long as each thread keeps its own reference
count, you can decouple the reference counting from freeing and have a hybrid
scheme where you "GC" the shared object heap by scanning for objects with
reference counts of 0 in all threads. Still not free, though.
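The hybrid scheme described above can be modeled in a few lines (a toy, single-threaded illustration; all names are mine): each thread keeps its own reference counts for shared objects, and a sweep frees any object whose count is zero in _every_ thread.

```go
package main

import "fmt"

// sharedHeap is a toy model: a live-object set plus one reference-count
// table per thread. No synchronization is shown; a real implementation
// would stop or snapshot the threads before sweeping.
type sharedHeap struct {
	live   map[int]bool  // object id -> still allocated
	counts []map[int]int // per-thread reference counts, indexed by thread
}

// sweep frees every object that no thread references any more and returns
// how many objects were freed.
func (h *sharedHeap) sweep() int {
	freed := 0
	for id := range h.live {
		total := 0
		for _, c := range h.counts {
			total += c[id] // sum this object's count across all threads
		}
		if total == 0 {
			delete(h.live, id)
			freed++
		}
	}
	return freed
}

func main() {
	h := &sharedHeap{
		live: map[int]bool{1: true, 2: true},
		counts: []map[int]int{
			{1: 1}, // thread 0 still references object 1
			{},     // thread 1 references nothing
		},
	}
	fmt.Println(h.sweep(), len(h.live)) // 1 1
}
```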

------
vanderZwan
> _For maximum performance one would create one large job for each Core of the
> CPU used._

Dmitry Vyukov has suggested otherwise in a similar scenario using Go:

> _If you split the image into say 8 equal parts, and then one of the
> goroutines /threads/cores accidentally took 2 times more time to complete,
> then whole processing is slowed down 2x. The slowdown can be due to OS
> scheduling, other processes/interrupts, unfortunate NUMA memory layout,
> different amount of processing per part (e.g. ray tracing) and other
> reasons. [...] size of a work item must never be dependent on input data
> size (in an ideal world), it must be dependent on overheads of
> parallelization technology. Currently a reference number is ~100us-1ms per
> work item. So you can split the image into blocks of fixed size (say 64x64)
> and then distribute them among threads/goroutines. This has advantages of
> both locality and good load balancing._

[https://groups.google.com/d/msg/golang-nuts/CZVymHx3LNM/esYkA_YoB-MJ](https://groups.google.com/d/msg/golang-nuts/CZVymHx3LNM/esYkA_YoB-MJ)

Or to put it this way: imagine that there was zero concurrency overhead. Then
splitting out jobs to their minimal size would be the ideal option, as that
would allow for the most smoothed out division of labour where every processor
does work all the time and they are all doing work until the entire task is
completed.
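Vyukov's fixed-size-tile suggestion can be sketched as a small Go worker pool (the 64-pixel tile size and image dimensions are illustrative): workers pull tiles from a shared channel, so a slow tile delays only itself rather than a whole per-core slice.

```go
package main

import (
	"fmt"
	"sync"
)

type tile struct{ x, y, w, h int }

// tiles splits a width x height image into fixed-size blocks, clipping the
// blocks on the right and bottom edges.
func tiles(width, height, size int) []tile {
	var out []tile
	for y := 0; y < height; y += size {
		for x := 0; x < width; x += size {
			w, h := size, size
			if x+w > width {
				w = width - x
			}
			if y+h > height {
				h = height - y
			}
			out = append(out, tile{x, y, w, h})
		}
	}
	return out
}

func main() {
	jobs := make(chan tile)
	var wg sync.WaitGroup
	var mu sync.Mutex
	pixels := 0

	for w := 0; w < 4; w++ { // small worker pool pulls tiles as it goes
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range jobs {
				mu.Lock()
				pixels += t.w * t.h // stand-in for per-tile processing
				mu.Unlock()
			}
		}()
	}
	for _, t := range tiles(200, 100, 64) {
		jobs <- t
	}
	close(jobs)
	wg.Wait()
	fmt.Println(pixels) // 20000
}
```

Load balancing falls out of the pull model: whichever worker finishes first simply takes the next tile from the channel.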

