Actors are like a simplified CSP where each (lightweight) thread has a single input channel. In the case of Akka this means you lose type information: control messages are mingled with data messages, so you can't assign any useful type to them.
Disruptor is mainly a pattern and implementation for high efficiency -- big queues, minimum number of threads, and some tricks using CAS operations and the like. I wouldn't call it a model of concurrency -- it's basically a particular implementation of CSP.
As long as we're in a well-actually: in the actor model each channel has exactly one receiving thread. "Each thread has one input channel" is weaker: it wouldn't keep different threads from consuming from the same channel, which is a different model. Also, the original papers described implementations spawning threads to handle each message -- even two threads per message, one to compute the messages sent in response and one to compute the next state for handling the next message, etc.
More comparison of the two: http://en.wikipedia.org/wiki/Communicating_sequential_proces...
1) I have to scrutinize the way I use memory, such that no two threads are going to stomp on the same cache line too often. (False sharing)
2) The above includes the way I use locking primitives and higher level constructs that are built on them! (Lock convoying)
3) If I am using a sophisticated container like a PersistentMap, which is supposed to make it easy to think about concurrency, I still have to think about 1 & 2 above at the level of the container's API/contract, as well as think about how they might interleave contended data within their implementations. (Yes, Concurrency is not Parallelism. Here we see why.)
4) Garbage Collection -- Now I have to think about whether the GC runs in a separate worker thread, and about how that can result in cache line contention.
5) Even if you do all of the above correctly, the OS can still come along and do something counterproductive, like trying to schedule all your threads on as few cores/sockets as possible. (This is even nicknamed "Stupid Scheduling" in Linux by people who have to contend with it.) This entails yet more work.
6) Profiling all of the above is, as far as I can tell, still a hazy art. Nothing is a smoking gun for any one of the pitfalls of multicore parallelism. One is only left with educated guesses, which means you have to gather more data. Is there something like an emulator (QEMU, say) that can simulate multicore machines and provide statistics on cache misses? Apparently there are also ways to get this information from hardware performance counters.
It would seem that Erlang has an advantage with regards to multicore parallelism, because its model is "distributed by default," so contention is severely limited, which is great for parallelism. However, coordination is severely limited as well! (I need to look at Supervisors and see what they can and cannot do.)
It would also seem like there's room for languages that combine these recently popular advanced concurrency tools with enough low-level power to navigate the above pitfalls, combined with a memory management abstraction that increases productivity without requiring the complications for parallelism entailed by GC. Rust, C++, and Objective-C are the only languages that somewhat fit this bill. (If only Rust were not quite so new!) Go, with its emphasis on pass by value semantics might also work for certain applications, despite its reliance on GC.
Supervisors are essentially in charge of one thing, and that's process lifecycle management; starting up processes, restarting them when they die, and bailing out if something goes wrong.
I have an algorithm for an absolute occupancy grid that can handle multicore parallelism. Basically, the grid is subdivided into smaller parts that just do their thing independently, but if they detect that a move takes a player out of their boundaries, they queue it up locally for the global grid. Once all of the subgrid threads are done, they wait while the global grid does its thing. (Aggregating all the local subgrid queues, then processing those moves requiring global coordination.)
I don't see how I can do that in Erlang. The closest thing I'm aware of (and I know almost nothing about Erlang) is that moves are done optimistically, then collisions can be detected after the fact and rolled back. Maybe that's what I'll need to do: port to Erlang and use rollbacks.
> Supervisors are essentially in charge of one thing, and that's process lifecycle management
I thought they could do more than just that, maybe.
Imagine how you'd implement your system if each part of the system had to communicate with the others over a network socket, and you're most of the way to implementing it in Erlang.
It's probably not optimal to do things involving lots of calculations in Erlang, though, as the VM is fairly slow. Akka implements a similar system, but for the JVM (Scala).
Basically, you are proposing that the world actor aggregate the entire world's data? That wouldn't scale. Or, maybe the world actor just becomes responsible for adjudicating moves between subgrids. Maybe.
Couldn't you have X players moving on a global grid? I have built systems with literally tens of millions of players (Erlang processes) moving on a grid (and doing way more, localized threat detection, decision making, moving away or towards, and coordination with other units in X range).
One complication is that my grid is for a procedurally generated world with 2^120 locations in it. This is why I generate subgrids. A degenerate case is one subgrid per user. However, these subgrids are organized in load-balanced groups, each of which has their own thread pool, caches, and locks.
Also, rollbacks are problematic, though the real problems are arguably corner cases.
Erlang might be a win because each garbage collector only has to deal with its own local memory.
EDIT: It turns out my algorithm is somewhat similar to Pikko:
One big difference is that my algorithm doesn't move or reconfigure masts; instead it dynamically creates subgrids, which are then grouped into "workgroups," each of which is supposed to be processed by a different CPU socket. Instead of there being an API, it's more that the subgrids stop what they are doing and their information is briefly managed by the global grid code. (The procedurally generated map is rigged so that there are many opportunities for crossing from one subgrid to another.)
Our units would actually report their maximum interaction boundaries to intermediaries, which would then be passed to processes to create something somewhat like a mast, something like a subgrid -- a dynamic interaction zone. All our units had hard constraints (max speed, etc.) and worked in global ticks that represented real time. Then we would talk to global to stretch all the interaction zones to fill empty space and report back boundaries. Then our units would work in their little worlds until one crossed a threshold, at which point we caused a rezoning among them and their neighbors. So initially everything would have to be parsed out into interaction zones, but then the zones could ignore each other for periods of time until a unit strayed across an edge, and then rezoning took place.
Not sure how well it would work with amped-up movement (forcing more moves all the way up to global), and not certain how it would work at the scale of 1 undecillion 329 decillion 227 nonillion 995 octillion 784 septillion 915 sextillion 872 quintillion 903 quadrillion 807 trillion 60 billion 280 million 344 thousand 576 points!
I think I can make my subgrids algorithm work, but it will have to be using mutable data structures.
Dmitry Vyukov has suggested otherwise in a similar scenario using Go:
> If you split the image into say 8 equal parts, and then one of the goroutines/threads/cores accidentally took 2 times more time to complete, then whole processing is slowed down 2x. The slowdown can be due to OS scheduling, other processes/interrupts, unfortunate NUMA memory layout, different amount of processing per part (e.g. ray tracing) and other reasons. [...] size of a work item must never be dependent on input data size (in an ideal world), it must be dependent on overheads of parallelization technology. Currently a reference number is ~100us-1ms per work item. So you can split the image into blocks of fixed size (say 64x64) and then distribute them among threads/goroutines. This has advantages of both locality and good load balancing.
Or to put it this way: imagine there was zero concurrency overhead. Then splitting jobs down to their minimal size would be ideal, since that allows the most evenly smoothed-out division of labour: every processor does work all the time, and all of them stay busy until the entire task completes.