Nit, the relevant property you're looking for is associativity (you can omit parentheses), not commutativity (you can swap left and right); there are some operations that are the latter but not the former, such as the average:
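For example, with avg(a, b) = (a + b) / 2 you get avg(1, 5) = avg(5, 1) = 3, but avg(avg(1, 5), 9) = 6 while avg(1, avg(5, 9)) = 4. A quick sketch:

```python
def avg(a, b):
    """Pairwise average: commutative but not associative."""
    return (a + b) / 2

# Commutative: swapping operands doesn't change the result.
assert avg(1, 5) == avg(5, 1) == 3

# Not associative: regrouping changes the result.
print(avg(avg(1, 5), 9))  # 6.0
print(avg(1, avg(5, 9)))  # 4.0
```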
That's a good example. My favorite has always been rock-paper-scissors. If `vs` is a binary operator that gives you the winner according to the game, then
rock vs paper = paper vs rock
but
(rock vs paper) vs scissors != rock vs (paper vs scissors)
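A toy encoding of that `vs` operator (my own sketch; any encoding works) makes both properties easy to check:

```python
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def vs(a, b):
    """Winner of one rock-paper-scissors round (a tie just returns a)."""
    return a if BEATS[a] == b else b

# Commutative: the winner doesn't depend on operand order.
assert vs("rock", "paper") == vs("paper", "rock") == "paper"

# Not associative: grouping changes the outcome.
print(vs(vs("rock", "paper"), "scissors"))  # paper vs scissors -> scissors
print(vs("rock", vs("paper", "scissors")))  # rock vs scissors  -> rock
```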
Much of parallelism toward a single goal is trying to break the computation into units whose combining operation is associative and commutative. Then you can combine the results regardless of order.
Trying to decompose a problem in this way ranges from hard to impossible.
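When it does work out, the shape is usually something like this minimal sketch (plain chunked summation, where integer addition really is associative and commutative): partial results can be folded back together in whatever order the workers finish.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def chunk_sum(chunk):
    # Each worker reduces its own slice independently.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]

    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(chunk_sum, c) for c in chunks]
        # Results arrive in whatever order the workers finish; the
        # associativity and commutativity of + make that order irrelevant.
        total = sum(f.result() for f in as_completed(futures))

    assert total == sum(data)
    print(total)
```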
My thoughts exactly. His sequential program example is easily parallelizable. Maybe one of the takeaways should be that we need to work on converting more serial problems to parallel ones.
Unfortunately, many interesting problems are believed to be impossible to parallelize efficiently (the P-complete problems). This includes some very broad ones, like evaluating a Boolean circuit or simulating a Turing machine.
People have been beating on that problem for decades, with modest success.
It's amazing how many different problems people are solving with GPUs.
There are at least four main ways to break apart a problem into parallel work units:
Search-type problems, where you want to do a lot of computations and discard most of them, are easy to parallelize. The extreme case is a crypto key tester. A Bitcoin miner and the WWII "bombe" are examples.
The amount of data that comes out is tiny compared to the work done.
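In its simplest form it's an embarrassingly parallel filter. A toy sketch, with a made-up check() standing in for the expensive test:

```python
from concurrent.futures import ProcessPoolExecutor

def check(candidate):
    # Stand-in for the expensive test (a key trial, a hash target, ...);
    # almost every candidate fails it.
    return candidate % 250_000 == 0

if __name__ == "__main__":
    candidates = range(1_000_000)
    with ProcessPoolExecutor() as pool:
        # chunksize keeps per-task overhead low for tiny work items.
        results = pool.map(check, candidates, chunksize=10_000)
        survivors = [c for c, ok in zip(candidates, results) if ok]
    print(survivors)  # [0, 250000, 500000, 750000]: tiny output from a lot of work
```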
Then there are problems which decompose spatially. Space is what keeps everything from being in the same place. Fluid flow, weather prediction, and hydrodynamics of nuclear explosions are all like that. The space is divided into regions and the regions talk to their neighbors. So most communication is local. Those are classic supercomputer problems. Machine learning, of course. Units are connected to local neighbors. So that is, in a sense, spatial.
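A 1D toy version of that structure (a simple averaging stencil, with the domain split into strips that only exchange boundary cells with their neighbors) might look like:

```python
import numpy as np

def step(region, left_halo, right_halo):
    # Update one strip using only its own cells plus one ghost cell from
    # each neighbor, so all communication is local.
    padded = np.concatenate(([left_halo], region, [right_halo]))
    return 0.5 * padded[1:-1] + 0.25 * (padded[:-2] + padded[2:])

field = np.random.rand(16)
regions = list(np.split(field, 4))  # four "processors", each owning a strip of space

for _ in range(10):
    halos = [(regions[i - 1][-1] if i > 0 else 0.0,
              regions[i + 1][0] if i < 3 else 0.0)
             for i in range(4)]
    # Once halos are exchanged, the four updates are independent and could
    # run in parallel on separate workers.
    regions = [step(r, lh, rh) for r, (lh, rh) in zip(regions, halos)]

print(np.concatenate(regions))
```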
Graphics rendering does large numbers of triangles in parallel. The combining operation has to be associative and commutative. Z-buffers make that happen for opaque and transparent surfaces. Translucency is harder, but there are tricks to pull it off without a depth sort. See "order independent transparency".
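For opaque geometry the per-pixel combine is essentially "keep the nearest fragment", i.e. a min over depth, which is where the associativity and commutativity come from. A toy sketch:

```python
def merge(frag_a, frag_b):
    """Keep the fragment closest to the camera; fragments are (depth, color)."""
    return frag_a if frag_a[0] <= frag_b[0] else frag_b

f1, f2, f3 = (0.8, "red"), (0.3, "green"), (0.5, "blue")

# Order and grouping don't matter for opaque surfaces:
assert merge(merge(f1, f2), f3) == merge(f1, merge(f2, f3)) == (0.3, "green")
assert merge(f2, f1) == merge(f1, f2)
```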
There are assembly-line approaches, where data progresses forward through multiple stages, with little or no data moving backwards. Audio and video processing work like that. So do some kinds of vision systems. This is the oldest approach - it's how large punched card tabulator shops worked, with trays of punched cards being carried from machine to machine.
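In code, the assembly-line shape is just stages connected by queues, with data only flowing forward. A minimal two-stage sketch:

```python
import queue
import threading

def stage(inbox, outbox, work):
    # One worker per stage; items move strictly downstream.
    while (item := inbox.get()) is not None:
        outbox.put(work(item))
    outbox.put(None)  # pass the shutdown signal along

q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(q_in, q_mid, lambda x: x * 2)).start()
threading.Thread(target=stage, args=(q_mid, q_out, lambda x: x + 1)).start()

for item in range(5):
    q_in.put(item)
q_in.put(None)

while (result := q_out.get()) is not None:
    print(result)  # 1, 3, 5, 7, 9
```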
Most things you can do on a GPU seem to fit one of those four patterns. If you can hammer a problem into one of them, a GPU might help. If not, well...
It should be noted that operations like addition on floating point numbers are not associative (even though addition on real numbers is). Round-off error changes depending on the order of operations. The worst-case error of divide-and-conquer summation (which you might see in parallelized addition) is much better than that of sequential addition, although not as good as more elaborate algorithms like Kahan summation.
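A quick demonstration in double precision (pairwise_sum and kahan_sum here are sketch implementations, not library calls):

```python
def pairwise_sum(xs):
    """Divide-and-conquer summation: error grows roughly with log n."""
    if len(xs) <= 2:
        return sum(xs)
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

def kahan_sum(xs):
    """Compensated summation: carries the rounding error forward explicitly."""
    total, comp = 0.0, 0.0
    for x in xs:
        y = x - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return total

# Floating-point addition is not associative:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

# One large value followed by many small ones; the exact answer is 1e16 + 1000.
xs = [1e16] + [1.0] * 1000
exact = 1e16 + 1000
for name, f in [("sequential", sum), ("pairwise", pairwise_sum), ("kahan", kahan_sum)]:
    print(name, f(xs) - exact)  # sequential loses all thousand 1.0s
```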
Generally, recurrent sequences cannot be computed in parallel, but the Fibonacci sequence is a special case because it's a linear recurrence. There's a closed-form equation for the Nth number.
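Binet's closed form gives the Nth number directly (within float precision), and because it's a linear recurrence you can also compute it exactly from powers of [[1, 1], [1, 0]]; matrix multiplication is associative, so the power takes O(log n) multiplications and can even be split across workers. A sketch of both:

```python
import math

def fib_closed_form(n):
    """Binet's formula; exact only while float precision holds (roughly n <= 70)."""
    phi = (1 + math.sqrt(5)) / 2
    return round(phi ** n / math.sqrt(5))

def fib_matrix(n):
    """Exact: n-th power of [[1, 1], [1, 0]] by repeated squaring."""
    def mul(a, b):
        return [
            [a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]],
            [a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]],
        ]
    result, base = [[1, 0], [0, 1]], [[1, 1], [1, 0]]
    while n:
        if n & 1:
            result = mul(result, base)
        base = mul(base, base)
        n >>= 1
    return result[0][1]

expected = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
assert [fib_closed_form(n) for n in range(10)] == expected
assert [fib_matrix(n) for n in range(10)] == expected
```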
Linked-list walks and other tree traversals are sequential.
Raytracing (which is a tree traversal) is a notable exception because there are so many millions of rays being launched that it becomes efficient to group them back together and parallelize the traversals all at once.
So if you have "enough" of them going at once, it's back to GPU superiority. But a *SINGLE* linked list or *SINGLE* tree search will always be best done on a CPU.
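To illustrate the contrast (a toy dict-based tree, purely a sketch): a single search is a chain of dependent pointer loads, but a big batch of independent searches maps trivially across workers:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def insert(node, key):
    # A tiny binary search tree stored as dicts; traversal is pointer chasing.
    if node is None:
        return {"key": key, "left": None, "right": None}
    side = "left" if key < node["key"] else "right"
    node[side] = insert(node[side], key)
    return node

def search(node, key):
    # Each step depends on the previous load, so a single search is inherently serial.
    while node is not None and node["key"] != key:
        node = node["left"] if key < node["key"] else node["right"]
    return node is not None

root = None
keys = list(range(0, 1000, 2))
random.shuffle(keys)
for k in keys:
    root = insert(root, k)

# ...but thousands of *independent* searches batch trivially (illustrative only;
# in CPython the GIL means real speedups need processes, numpy, or a GPU).
with ThreadPoolExecutor() as pool:
    hits = sum(pool.map(lambda q: search(root, q), range(1000)))
print(hits)  # 500 of the 1000 queries find their key
```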
For that example, couldn't you do exactly that? Multiplication is commutative.