It's arguing that it's hard to implement the structures and methods behind concurrency APIs, not that it's hard to use them.
The majority of tasks are "embarrassingly parallel", as they say, which just means there are a lot of standalone tasks that need to be done. For example, any web server is embarrassingly parallel. You don't need to be clever; you just need the processing power available to handle the requests.
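A minimal Rust sketch of that shape (the port and the thread-per-connection model are just for illustration): every connection is independent, so you can throw a thread at each one with no shared state at all.

```rust
use std::io::{Read, Write};
use std::net::TcpListener;
use std::thread;

fn main() -> std::io::Result<()> {
    // Each connection is a standalone task: no coordination, no shared
    // state, "embarrassingly parallel" in the most literal sense.
    let listener = TcpListener::bind("127.0.0.1:8080")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        thread::spawn(move || {
            let mut buf = [0u8; 1024];
            let _ = stream.read(&mut buf); // read (and ignore) the request
            let _ = stream.write_all(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok");
        });
    }
    Ok(())
}
```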
I used GPU.js for some frontend number crunching, and it turns out even a built-in GPU (integrated graphics) can give a small speedup if you move some data into a more parallel format.
>I guess it's doing some sort of transformation
>Presumably there's a not-significant overhead in both starting it up
That has been the goal; we're really careful to measure it with each release.
>and getting data back from the gpu.js function
That is correct as well; it's about as costly as it would be without GPU.js. However, you can enable `pipeline` mode on kernels so that they communicate via textures, which is significantly faster since we never transfer back to the CPU.
>Is there a way to "compile" a gpu.js function to remove the translation step? That could be really useful.
It is built into every kernel. You simply call `kernel.toString(...args)`.
Writing parallel code isn't the hardest part (it is hard, but good teachers help with good abstractions over problems); parallel debugging is - or at least the second pass over the same codebase.
Being able to write decent parallel code is often easier than debugging a system that behaves inconsistently, with no way to line the timing bugs up the right way around.
The difference between "I fixed a bug" and "I massively lowered its probability" is hard to tell.
I spent 7 months chasing an issue that happened on average once per billion requests (or rather, once every 2-3 days across a web farm). The fix was easy, but the debugging slowly drove me insane over a whole summer.
And after that, there's the efficiency lost to parallelization - you need enough parallel tracks to actually come out faster, and more threads are often slower at low thread counts.
As an anecdote, there were enough servers at Yahoo with SMT turned off because the extra locking introduced by running 2 threads on a core ate up the extra CPU those threads provided.
We need better IDE tools for debugging parallel threads. I don't like how VS won't let you open a separate Autos/Locals window for each thread, or easily step each thread - and setting a breakpoint in a function currently executing on multiple threads is a usability nightmare.
- did you have a method/tool for debugging (from a book, or from chatting with others)?
- are there still CPU designs dedicated to parallel workloads? (I assume the usual desktop Intel/AMD parts, however brilliant, aren't free of hiccups under high parallelism)
Oh, one last thing: do you use dedicated compilers for parallelization, or something "mainstream" like OpenMP?
The issue with x86 is simple: the memory model pretty much requires loads of broadcasts on the SMP interconnect. IIRC Power9 is nice due to (1) a memory model that doesn't require this broadcasting on the interconnect (much), and (2) remote atomics like "fetch the current 64-bit value pointed to by register1 and atomically add register2 to it", which prevent the cache line from bouncing between cores or even sockets. (Power9 scales to about 5-10 kW TDP at 8-20 W TDP per core with 4 threads per core; there are versions that combine core pairs to get 8 threads using (almost) the resources of 2 cores, with some overhead / diminishing returns.)
I can't really recommend OpenMP. If the architectural restrictions aren't a problem, you might want to check out Intel's TBB, which uses C++ templates instead of compiler pragmas. Unfortunately, you'd ideally want multithreading-aware polyhedral optimization, and _that_ has mostly stalled AFAIK.
You'd likely write C++, or now maybe Rust, with low-level primitives / intrinsics to build libraries. I don't know of any actually good "automatic" frameworks for general shared-memory parallel programming. (There's Chapel, but it seems fairly niche; still going strong, though, by the looks of it.) If you can express the problem efficiently as loop dataflow, you might want to try timely-dataflow (Rust; handles thread/process/node parallelism), as it integrates nicely with normal Rust code.
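For a flavor of timely-dataflow, here's roughly the hello-world from its README (from memory, so treat it as a sketch and check the repo for the current API): a single-worker dataflow that streams the numbers 0..10 through an operator.

```rust
use timely::dataflow::operators::{Inspect, ToStream};

fn main() {
    // Build and run a single-worker dataflow: turn the range into a
    // stream and attach an inspect operator that prints each record.
    timely::example(|scope| {
        (0..10)
            .to_stream(scope)
            .inspect(|x| println!("seen: {:?}", x));
    });
}
```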
2) I can't find anything about Power9 remote atomics outside of GPU memory. Any pointers?
The remote atomics are a GPU-independent feature on the SMP interconnect. The GPU thing is mostly icing on the cake, by exposing this feature to NVLink GPUs (iirc. Volta+, maybe Pascal+).
Specifically for this issue, I ended up forking valgrind to add multi-threaded watchpoints in code (this was for debugging shared-memory refcounts, so I had to extend it to log only negative refcounts).
The origin of those patches is here - https://valgrind.org/downloads/variants.html?rjw
I'm not sure how you can say that.
Programming is a fairly hard activity, but the general consensus is that by adding parallelism to any given programming task, you make that task much, much harder. I don't think you'll find any version of parallel programming that isn't hard.
Parallelism - your commands happening in an unpredictable order, not synchronously - inherently, by itself, makes thinking about a parallel program harder than a non-parallel one.
Edit: Parallel programming is hard enough that the advised method for most such programs is one of a few standard patterns - which are infinitely easier than rolling your own plan, but still contain serious gotchas as your program's complexity increases.
Of course, it depends a lot on what your programming language has to offer. For example, in my experience Ada makes it easier to get parallelism correct than many other languages, primarily because Ada tasks have precisely defined and specified semantics and because Ada's rendezvous makes certain kinds of errors impossible. The same is sometimes claimed about Rust.
In more permissive languages like Go, however, almost every parallel program is wrong in one way or another. For example, it is extremely common for parallel Go libraries to end up in deadlock. The situation is so bad that I have stopped using concurrent Go libraries, except for the well-implemented standard libraries. Even then, many package authors don't tell you what they do internally, and their libraries lock up in unexpected places. For example, I recently went looking for a parallel tree-walker library in Go, and every package I tried deadlocked on my file system, presumably due to the interplay between error handling and spawning goroutines. Ironically, even many of the goroutine-scheduling libraries I've tried have locked themselves up.
As someone who has made that assumption, what's wrong with it?
When you are missing the synchronization chain, the issue you can run into is that the values of other memory locations are undefined. The sequence of accesses to any single location does form a total order (due to cache coherence), but it is not generally the case that different threads will agree on the ordering of accesses to different locations.
E.g., why would reading a shared mutable value be wrong in some specific circumstances? In particular: if the value is a word (and the architecture guarantees atomic writes for word-sized values); if the reader doesn't care whether it reads the previous or the new value (e.g. it reads a monotonic counter in a loop, so even if it reads the current value this time, it will read the new value next time or after some number of iterations); and if the language is sensible (i.e. Java, not C++ - although AFAIK even C++ frowns upon "values out of thin air" and is trying to modify the standard to formally prohibit them; in general the issue with C++ is the compiler, not the platform, if we assume x86).
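Here's that scenario made concrete in Rust, where relaxed atomics give exactly the guarantees described (word-sized atomic access, no ordering) without relying on any language lawyer's blessing of a plain racy read. A sketch, not a claim about what Java or C++ formally permit:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let counter = Arc::new(AtomicU64::new(0));

    let c = Arc::clone(&counter);
    let writer = thread::spawn(move || {
        for _ in 0..1_000 {
            // A monotonic counter: it is only ever incremented.
            c.fetch_add(1, Ordering::Relaxed);
        }
    });

    // The reader doesn't care whether it sees the previous or the new
    // value; a stale read is fine, it catches up on a later poll.
    // Relaxed guarantees an untorn, real value (no "out of thin air"),
    // but establishes no ordering with any other memory location.
    for _ in 0..10 {
        println!("progress: {}", counter.load(Ordering::Relaxed));
    }

    writer.join().unwrap();
}
```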
Now, it is known that if your program is correctly synchronized, the existing hardware models are indistinguishable from sequentially consistent execution. This gives rise to the data-race-free model that underlies all modern language-level programming models: data races can only exist in the absence of proper synchronization, so we declare that behavior undefined, and our code gets sequential consistency back.
The primary issue is that there are two entities playing games with your code to make it run faster: both the compiler and the hardware make assumptions to speed things up. From the hardware's perspective (i.e., why using "volatile" to stop the compiler from playing games isn't good enough), different processors may hold the values in their caches. While cache coherency requires that, at any given point in time, every processor agree on the value of every memory location, there is great leeway to reorder loads and stores so that they execute in a different order than the instruction sequence implies. In the extreme, the load of *p may happen before the load of p itself (this is the Alpha memory model, and comes from the existence of uncoordinated cache banks).
Okay, there's a slight untruth here: C++11 "relaxed atomics" are effectively data races as observable by hardware, but are not undefined in the C++11 sense. This is where it's important to point out that we have been struggling for over a decade to come up with a suitably precise definition of relaxed atomics that we're happy with - it is by no means a solved problem.
But it's also the case that caches avoid writing back to main memory as much as possible, since bandwidth to main memory is quite limited. However, cache coherency protocols make all the caches agree on the value, so that committing the value to cache is equivalent to committing it to main memory.
Using pure functional programming on an embarrassingly parallel workload comes pretty close.
You still need to create all the parallel threads/processes/fibers/whatever, divide the work between them and collect or merge the results.
And if you work in an environment where state is mutable and shared by default (i.e. your typical shared-memory multithreaded Java/C#/C++/Ruby program), then it's still very easy to mess up somewhere.
Just serialize, pass the data to multiple processes, then synchronize and wait for all of them to return their data to the main process. This is one kind of parallel programming that isn't hard.
But I cheated: I created expensive copies and processes and let the OS handle all the nightmare.
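In Rust that fork-join shape looks something like this, with threads standing in for the OS processes (the copies stay just as expensive, and nothing is shared):

```rust
use std::thread;

fn main() {
    let data: Vec<u64> = (0..1_000_000).collect();

    // Deep-copy a chunk for each worker so nothing is shared: the
    // "expensive copies" trade-off described above.
    let handles: Vec<_> = data
        .chunks(250_000)
        .map(|chunk| {
            let owned: Vec<u64> = chunk.to_vec(); // the copy
            thread::spawn(move || owned.iter().sum::<u64>())
        })
        .collect();

    // Synchronize: wait on every worker and merge the results.
    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("{total}");
}
```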
You're much better off just picking a paradigm and sticking with it. I'd say it doesn't matter which one, but there are a few paradigms that are clear winners in my experience: immutability always wins, and Erlang-style messaging seems to work very easily. If you have the opportunity to determine the architecture of a parallel project, most languages can imitate Erlang-style parallelism with builtin libraries. But no paradigm is so much better that it's worth switching paradigms mid-project.
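As a sketch of what that imitation can look like with builtin libraries, here's the Erlang-ish shape in Rust using std::sync::mpsc (the worker names are just for illustration): each worker owns its state and communicates only through messages.

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // One shared mailbox: workers send messages instead of touching
    // shared memory.
    let (tx, rx) = mpsc::channel::<String>();

    for id in 0..4 {
        let tx = tx.clone();
        thread::spawn(move || {
            tx.send(format!("worker {id} done")).unwrap();
        });
    }
    drop(tx); // close the channel so the receive loop can finish

    // The receiving "actor" drains its mailbox one message at a time.
    for msg in rx {
        println!("{msg}");
    }
}
```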
I don't entirely disagree, but most environments (or "paradigms") simply hide all of that away from the user, which is generally a good thing. Chances are you _are_ using a lot of those primitives in any given parallel or concurrent program; you're just only aware of one or two of them. For example, a channel for message passing between threads (or tasks, in a cooperative multitasking context) might contain a semaphore and an atomically reference-counted pointer, which in turn contains atomic integers. See  for an example.
And while concurrency and parallelism are not the same thing, you often end up having to consider the same synchronization problems in your code whether it's concurrent or parallel. Cooperative multitasking does involve true parallelism when the scheduler uses multiple threads on a multi-core processor: code there is not just executed concurrently, it is (possibly, though not guaranteed to be fairly) executed in parallel.
Okay, but I think it's fairly clear from context that "using" in my post meant "directly using" or "using in such a way that you have to understand how they work". There are a bunch of ways Erlang's message queues could work, but I don't need to know which way they work. If they work with locks you could say I'm using locks, but that's clearly not the point of what I'm saying.
> And while concurrency and parallelism are not the same thing, you often end up having to consider the same synchronization problems in your code whether it's concurrent or parallel.
Often, sure. But also, often not, which is why I said "it's quite possible that you've never written a line of parallel code."
There's a comment above about Haskell and how you don't really have to put any thought into making things parallel. That's great for 99% of cases, but other times you need higher performance, and then it matters how the concurrency is implemented.
At one extreme of the spectrum, you can do everything under a big lock and be effectively single-threaded. At the other extreme, you can use lots of mutexes and fine grained locking and end up with a mess of potential deadlocks and races and "here be dragons" because the original programmer did not anticipate that you sometimes need data not to change for a bit longer than a short and obvious critical section.
What I'm working on at work right now, oh goodness how many races and problems I've found around locking and multi-threading in general. It's a nightmare.
What I'm saying is, don't limit yourself to Mutexes; there's a whole zoo of synchronization primitives, and each of them has a specific use case it's ideal for.
Recently I've dealt with a lot of synchronization in Rust, and I would say it's been as pleasant as it could be, considering the subject matter. Deadlocks are still possible, but at least the compiler stops you from accidentally unlocking a Mutex too early.
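For anyone who hasn't seen it, a minimal sketch of that property: the lock is tied to a guard value, and unlocking happens when the guard is dropped, so there's no unlock() call to misplace.

```rust
use std::sync::Mutex;

fn main() {
    let shared = Mutex::new(vec![1, 2, 3]);

    {
        // Locking returns a guard; the data is only reachable through it.
        let mut guard = shared.lock().unwrap();
        guard.push(4);
        // The mutex is released here, when `guard` goes out of scope.
    }

    // Holding a reference into the Vec past the guard's lifetime is a
    // compile error, which is what stops you from "unlocking too early".
    println!("{:?}", shared.lock().unwrap());
}
```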
Last time I checked, there was no "best parallel sorting algorithm". It all depends on your architecture (cores, memory, network).
bonus point: to an extent, sequential imperative programming is an abstraction layer over the concurrent ICs implementing your computer
FWIW, I've used the original post title ("Parallel Programming: December 2019 Update") to focus on the updates - partly driven by the fear of comments written right after reading just the book title, "Is Parallel Programming Hard, And, If So, What Can You Do About It?", that focus solely on answering the question while completely ignoring the content... :-]
Almost all problems and solutions stem from one limitation: only one writer at any given time. Immutable means one writer. Locking means one writer. Splitting a list and processing each half means one writer. The actor model means one writer. All are valid solutions to your problems.
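The "splitting a list" case is nice to see concretely, because Rust's borrow checker enforces the one-writer-per-region rule at compile time; a sketch:

```rust
use std::thread;

fn main() {
    let mut data = vec![1u64; 8];

    // Split the list: each half has exactly one writer, so the halves
    // need no synchronization while the threads run.
    let (left, right) = data.split_at_mut(4);

    thread::scope(|s| {
        s.spawn(|| left.iter_mut().for_each(|x| *x *= 2));
        s.spawn(|| right.iter_mut().for_each(|x| *x *= 3));
    }); // the scope joins both writers before `data` is touched again

    println!("{:?}", data);
}
```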
A piece of personal advice: avoid acquiring more than one lock at a time if possible. The solutions are usually incredibly messy. The dining philosophers problem is a pretty good example: every problem involving multiple "per object" locks requires a hand-crafted solution, and changing the order in which the forks are picked up does not generalize well to other problems.
If you have global locks, you may get away with always acquiring them in the same order, but if you make a single mistake you're opening yourself up to deadlocks.
Of course, this only works if all the locks you need to take are exposed and not hidden behind some abstraction. Then again, when they are hidden, it's hard to make sure you only take one lock at a time anyway...
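For what it's worth, here's what "always acquiring them in the same order" can look like; `Account` and `transfer` are hypothetical names, and the ordering key is simply "smaller id first":

```rust
use std::sync::Mutex;

struct Account {
    id: u64, // a stable key that defines the global lock order
    balance: Mutex<i64>,
}

// Always lock the account with the smaller id first, so two concurrent
// transfers can never wait on each other in a cycle. Assumes `from` and
// `to` are distinct accounts; note that one violation of the ordering
// rule anywhere in the codebase reintroduces deadlock.
fn transfer(from: &Account, to: &Account, amount: i64) {
    let (first, second) = if from.id < to.id { (from, to) } else { (to, from) };
    let mut first_bal = first.balance.lock().unwrap();
    let mut second_bal = second.balance.lock().unwrap();
    if from.id < to.id {
        *first_bal -= amount; // `first` is `from`
        *second_bal += amount;
    } else {
        *first_bal += amount; // `first` is `to`
        *second_bal -= amount;
    }
}

fn main() {
    let a = Account { id: 1, balance: Mutex::new(100) };
    let b = Account { id: 2, balance: Mutex::new(0) };
    transfer(&a, &b, 25);
    println!("{} {}", a.balance.lock().unwrap(), b.balance.lock().unwrap());
}
```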
But more likely, a hashmap is too fine a granularity for queues. If a hashmap can be owned by a single process, that will perform better.
If you can shard and distribute the data neatly and evenly so that each process handles a completely separate subset, then that is the most reliable long term solution.
With shared memory, it's difficult to control who has access to what parts and when - and that can make it difficult to detect and analyse bottlenecks.
If you know you're going to need 2 locks and you acquire both of them atomically, at the same time, deadlock is impossible. The trouble happens when you acquire one lock, wait a bit, then acquire a second one. Maybe that's what you meant and I just misread it.
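C++ ships std::lock for exactly this. In Rust you can sketch the same all-or-nothing idea with try_lock, so you never block on a second lock while holding a first one:

```rust
use std::sync::{Mutex, MutexGuard};

// Take the first lock, *try* the second, and on failure release both
// and start over. No deadlock cycle can form, though a real version
// would add backoff to avoid livelock.
fn lock_both<'a, T>(
    m1: &'a Mutex<T>,
    m2: &'a Mutex<T>,
) -> (MutexGuard<'a, T>, MutexGuard<'a, T>) {
    loop {
        let g1 = m1.lock().unwrap();
        if let Ok(g2) = m2.try_lock() {
            return (g1, g2);
        }
        // `g1` is dropped here, releasing the first lock before retrying.
    }
}

fn main() {
    let (m1, m2) = (Mutex::new(1), Mutex::new(2));
    let (g1, g2) = lock_both(&m1, &m2);
    println!("{} {}", *g1, *g2);
}
```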
You can track development here: https://github.com/jasonswearingen/godot-csharp-tech
I'm just starting the implementation, but I've done a similar system before so pretty sure I'm not in over my head :)
I would say that for concurrency, FRP (or various incremental programming techniques) is essential. The problem is that the space of possible control flows / traces is just too large to characterize by hand. Another problem is that control flow can be bidirectional: if you join two streams together and wish to be push-driven, then when you reach the merge you need to be able to pull from the other side. Virtually all of Futures/Tokio, the hot JS thing du jour, etc. fail at this. Even forks are a bit hard to do in manual CPS, as you only have one continuation. Blocking calls fail because of the inflexibility of linear one-input, one-output, but this is better known.
From a certain POV computers are always doing things in parallel (8-, 16-, 32-, 64-bits at a time, and so on, eh?)
Has anyone mentioned Pi Calculus yet?
Until Elixir and BEAM, parallel programming had been difficult for me because the competing mental models in other languages pretty much turn the textbook picture into fantasy.
I'm sure some folks just hit rage land, "Pardon me, you contemptuously ignorant and vile buffoon, what did you say?" but if you'd be so kind as to let me qualify it...
Parallelism on computers has more or less evolved from things being asynchronous, which gave us the effect of parallelism (yay, concurrency), to real parallelism with the proliferation of multi-core computers. This jump from emulated parallelism to real parallelism hasn't really been thoughtfully designed into most languages, and for those starting out, depending on your language and runtime/VM, you might be in for quite an odyssey. We live in a world where almost every language you use has NOT been designed with parallelism as a first-class problem.
I'm going to try to give an abridged version of my odyssey. Please note I love everything I seem to hate on in this tirade.
First was Ruby. Pure and simple, it has (system) processes (forking), threads, fibers, and libraries implementing actors and event loops :naive-heart-eyes: but _none_ of them offer real parallelism inside the runtime itself. When you write code with one of those mental models, it falls apart at the metal (hey, GIL), and eventually you're left scratching your head, disillusioned with the world, realizing "why am I even doing this if my work is CPU bound? It won't matter."
Next up, I knew the JVM had "REAL THREADS!" I needed experience with another language and thought "oh hell yeah, baby, I'm gonna rock the shit outta these 1... 2... 4 CORES" - watch out, CPU-bound tasks. Then came reality, creeping in like Santa Claus in July... "Creating threads is expensive? Oh. No shit?" and "oh wow, so, like, I now have between 1 and N threads that I must schedule, and now I somehow lock up my system." I was told to apply locks until the burning goes away, but it never does...
And in that drunken haze of threading hell, one night I saw what I thought was a beacon of hope: a JVM-based actor framework named Akka. "I get the actor model, how could this go wrong?" So. Many. Ways. First and foremost was choking this newfangled "thread pool" Akka had given me with my CPU-bound tasks. How does one then gloriously obliterate CPU-bound tasks? Oh shit... one goes right back to where one started... in a separate thread. Or one can do as I did and GALAXY BRAIN the solution: create a separate pool just for my CPU-bound tasks. This also introduces a new gang of friends: "back pressure," "priority," and my favorite, "contention."
Eventually, I kind of accepted fate, figuring that some mental model, even if totally contrived, was better than no mental model - right? Still, I believed there had to be something better, and I kept reading about how awesome Erlang was for this sort of thing; Elixir had just come out at the time.
Peanut butter, meet pickles. This shit was good.
Erlang the language and its runtime/VM, BEAM, were designed to work together to solve a problem space: asynchronous, fault-tolerant, distributed systems. At that point of realization, I blacked out from astonishment that people could write asynchronous code in a language and runtime designed to do so. Turns out its mental model was, for me, the be-all end-all, because they (language/VM) were - and this is my big point - DESIGNED TOGETHER TO SOLVE THE PROBLEM. On BEAM a "process" can be preempted (cue choir of angels) and thus _everything_ is non-blocking. The actor pattern was finally able to hit the metal in my head, and my beard gained substantial majesty.
I write Elixir nearly all day long right now, and while I love it, I wish there were other options for me to use (I should probably check out Pony again), but honestly, I don't know of any. I'm beyond thankful for Erlang and BEAM, since they really helped my software development game and my understanding of how parallelism works.
P.S. Anyone know if there's a language for GPUs designed to make programming easier even if it's slightly slower to execute - e.g., the Erlang/OTP/BEAM of GPUs?
BEAM is actually not great at CPU-bound tasks, but that's a different story.
> P.S. Anyone know if there's a language for GPUs designed to make programming easier even if it's slightly slower to execute - e.g., the Erlang/OTP/BEAM of GPUs?
Try Julia. I wish its GPU model were more actor-ish, but it's really easy to do GPU programming in it. You just take your existing code and re-annotate the types for the GPU.
If I had a lot of time, I would try to make a Julia/Elixir bridge, since I feel the two languages would play nicely with each other (Julia for computation, Elixir for orchestration).
on gpu: https://neanderthal.uncomplicate.org/articles/tutorial_openc...
Deep Learning for Programmers: An Interactive Tutorial with CUDA, OpenCL, DNNL, Java, and Clojure
Numerical Linear Algebra for Programmers: An Interactive Tutorial with GPU, CUDA, OpenCL, MKL, Java, and Clojure
Both books progress really well, and you can subscribe now to read the drafts and get the complete books when they're ready. There's no middleman, and 100% of the proceeds go toward funding the development of open-source Clojure HPC/GPU/ML/DL libraries.
Do you have any ideas why Akka/JVM is so much worse than Elixir/BEAM? I thought that as long as you stick to Akka (i.e. don't use global mutable state, and wrap everything in actors), it should work pretty much the same way... Maybe it has something to do with BEAM being able to preempt a "process"? Though AFAIK JVM threads can be preempted as well...
BEAM can interrupt (preempt) a running "process" (actor) if it is taking too long at any point, and allow another process (actor) to run for a bit before the first continues. This mechanism is how BEAM provides soft real-time guarantees and allows "hundreds of thousands of processes" to be "running" on the system.
The JVM (and consequently Akka), on the other hand, will preempt for a higher-priority process but will not actually interrupt a running process just to guarantee that all "tasks" or "processes" eventually get a chance to complete. It schedules everything cooperatively, so at some point a long-running task must complete before a new one can run: "ey mate, can I use that core when you're done?"
Example 1: at some point you could have a low-priority pool that _never_ gets run, because something more important is always coming in and interrupting it.
Example 2: a queue of tasks at the same priority builds up until the tail of the queue starts timing out faster than tasks complete. This is made especially horrible when the queue that's timing out is higher priority than your other queues.
Again, on BEAM you never have to worry about a lot of this, because everything is non-blocking by default. As a concrete example: on BEAM you can write a recursive loop/function, and calling it won't lock up the system. You could have 1000 processes running that same loop, and it still wouldn't lock up the system; BEAM just hums along all the same.
Hopefully that helps explain it, or at least lines up with your knowledge of how the JVM works? Again, there are dark parts of the JVM I'm certainly ignorant of, and perhaps some of the implementation has changed since I last grokked it.
With all that said, Akka is still pretty fucking great, and the JVM is faster than BEAM for a lot of tasks. If I remember right, Akka is built on pretty much the JVM equivalent of Grand Central Dispatch (libdispatch) - or it was years ago, when I was really into it.
Tangentially, IMHO, GCD is the only sane way to do parallelism and async in OOP.
Basically, Akka takes tasks and runs them on a limited number of executors, each on its own JVM thread. A task runs until it's done, at which point another task is chosen to execute. This is akin to GCD's queue-based dispatch.
Technically, this has nothing to do with the JVM per se - JVM threads are OS threads and thus can be preempted, but only by the OS, so when the thread continues it's still running the same Akka task! An alternative Akka implementation could run each task in its own (OS/JVM) thread, which would solve the preemption issue, but OS threads are quite expensive (though getting less so recently), so there would be a large(r) overhead. The Erlang/BEAM runtime (and Golang, Haskell, and maybe others) solves this by implementing preemptive user-space threads, which bring the best of both worlds (assuming a good implementation): the ability to run millions of tasks (each with tiny overhead) while avoiding starvation (by making tasks preemptible / re-schedulable).
I guess, IMHO, it does have something to do with the JVM, because that's how it chose to implement threads! :) Not to say that approach doesn't have its merits.
Finally, yeah, the whole OS scheduler preemption thing has me curious these days as I dig more into the "system" itself. I'm wondering, as you pointed out, what sort of things I might get for free that emulate the Erlang way of doing things, if I look deep enough.
Golang in particular dealt with this issue extensively, not just because of starvation but also because of GC: in order to garbage-collect global objects, all threads need to be paused and their local variables and stacks snapshotted. Doing that efficiently is hard. I think they were trying to make threads arbitrarily preemptible with OS signals, but that requires extreme care in the implementation (you need to know where local variables are stored at every instruction!) and is not portable (no signals on Windows).
I'm at almost fifty years of programming (APL was an early highlight; punched cards weren't) in many languages. Haskell isn't the be all / end all of languages, but when I want to add ten lines to a program and have it use every core, I don't know a better choice than Haskell. Nothing else comes close.
GHC Haskell's choice of call-by-need ("on-use") as its non-strict evaluation strategy means that it cannot simply evaluate all function arguments in parallel: on-use means that if you don't use a value, it never gets evaluated, and some code relies on that. Plus, as the trope goes, Haskell is implemented in a very imperative way, which can lead to some fun surprises when parallelizing aggressively.
There are something like three different flavors of parallel programming in Haskell, differing in how much performance you trade off against having to know about compilation, memory management, and execution details to make non-trivial parallelism run efficiently.
The main issue I found was that the language and libraries are really hard to pick up as a beginner. A library for Ruby or Python will typically start off with a few examples showing how the most common use case works, followed by full API docs. Haskell libraries usually provide only a function name and a type signature, and if you're lucky you get a one-liner telling you what the function does, but not how to use it.
This seems not to be an issue for Haskell experts, because "PathMultiPiece ps => (ps -> Writer LiteApp ()) -> Writer LiteApp ()" is all the info they need, but I found the libraries much harder to use because of it.
I also found that many of the libraries were totally inaccessible unless you had an understanding of abstract mathematical constructs. I wanted to do some basic XML parsing in Haskell, and the prerequisite was reading some huge book on the theory of the math behind how the library works. I then switched to Ruby, and Nokogiri's website had basically the exact code snippet I needed.
In Matlab it's a breeze. Just put "par" in front of the thing (`for` becomes `parfor`). Done.
If you wanna shell out money to enable that functionality, that is.