Nit, the relevant property you're looking for is associativity (you can omit parentheses), not commutativity (you can swap left and right); there are some operations that are the latter but not the former, such as the average:
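For example, with avg(a, b) = (a + b) / 2 you get avg(1, 5) = avg(5, 1) = 3, but avg(avg(1, 5), 9) = 6 while avg(1, avg(5, 9)) = 4. A quick sketch:

```python
def avg(a, b):
    """Pairwise average: commutative but not associative."""
    return (a + b) / 2

# Commutative: swapping operands doesn't change the result.
assert avg(1, 5) == avg(5, 1) == 3

# Not associative: regrouping changes the result.
print(avg(avg(1, 5), 9))  # 6.0
print(avg(1, avg(5, 9)))  # 4.0
```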
That's a good example. My favorite has always been rock-paper-scissors. If `vs` is a binary operator that gives you the winner according to the game, then
rock vs paper = paper vs rock
but
(rock vs paper) vs scissors != rock vs (paper vs scissors)
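A toy encoding of that `vs` operator (my own sketch; any encoding works) makes both properties easy to check:

```python
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def vs(a, b):
    """Winner of one rock-paper-scissors round (a tie just returns a)."""
    return a if BEATS[a] == b else b

# Commutative: the winner doesn't depend on operand order.
assert vs("rock", "paper") == vs("paper", "rock") == "paper"

# Not associative: grouping changes the outcome.
print(vs(vs("rock", "paper"), "scissors"))  # paper vs scissors -> scissors
print(vs("rock", vs("paper", "scissors")))  # rock vs scissors  -> rock
```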
Much of parallelism toward a single goal is trying to break the computation into units whose combining operation is associative and commutative. Then you can combine the results regardless of order.
Trying to decompose a problem in this way ranges from hard to impossible.
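When it does work out, the shape is usually something like this minimal sketch (plain chunked summation, where integer addition really is associative and commutative): partial results can be folded back together in whatever order the workers finish.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def chunk_sum(chunk):
    # Each worker reduces its own slice independently.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]

    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(chunk_sum, c) for c in chunks]
        # Results arrive in whatever order the workers finish; the
        # associativity and commutativity of + make that order irrelevant.
        total = sum(f.result() for f in as_completed(futures))

    assert total == sum(data)
    print(total)
```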
My thoughts exactly. His sequential program example is easily parallelizable. Maybe one of the takeaways should be that we need to work on converting more serial problems to parallel ones.
Unfortunately, many interesting problems are believed to be impossible to parallelize efficiently (the P-complete problems). This includes some very broad ones, like evaluating a Boolean circuit or simulating a Turing machine.
People have been beating on that problem for decades, with modest success.
It's amazing how many different problems people are solving with GPUs.
There are at least four main ways to break apart a problem into parallel work units:
Search-type problems, where you want to do a lot of computations and discard most of them, are easy to parallelize. The extreme case is a crypto key tester. A Bitcoin miner and the WWII "bombe" are examples.
The amount of data that comes out is tiny compared to the work done.
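In its simplest form it's an embarrassingly parallel filter. A toy sketch, with a made-up check() standing in for the expensive test:

```python
from concurrent.futures import ProcessPoolExecutor

def check(candidate):
    # Stand-in for the expensive test (a key trial, a hash target, ...);
    # almost every candidate fails it.
    return candidate % 250_000 == 0

if __name__ == "__main__":
    candidates = range(1_000_000)
    with ProcessPoolExecutor() as pool:
        # chunksize keeps per-task overhead low for tiny work items.
        results = pool.map(check, candidates, chunksize=10_000)
        survivors = [c for c, ok in zip(candidates, results) if ok]
    print(survivors)  # [0, 250000, 500000, 750000]: tiny output from a lot of work
```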
Then there are problems which decompose spatially. Space is what keeps everything from being in the same place. Fluid flow, weather prediction, and hydrodynamics of nuclear explosions are all like that. The space is divided into regions and the regions talk to their neighbors. So most communication is local. Those are classic supercomputer problems. Machine learning, of course. Units are connected to local neighbors. So that is, in a sense, spatial.
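A 1D toy version of that structure (a simple averaging stencil, with the domain split into strips that only exchange boundary cells with their neighbors) might look like:

```python
import numpy as np

def step(region, left_halo, right_halo):
    # Update one strip using only its own cells plus one ghost cell from
    # each neighbor, so all communication is local.
    padded = np.concatenate(([left_halo], region, [right_halo]))
    return 0.5 * padded[1:-1] + 0.25 * (padded[:-2] + padded[2:])

field = np.random.rand(16)
regions = list(np.split(field, 4))  # four "processors", each owning a strip of space

for _ in range(10):
    halos = [(regions[i - 1][-1] if i > 0 else 0.0,
              regions[i + 1][0] if i < 3 else 0.0)
             for i in range(4)]
    # Once halos are exchanged, the four updates are independent and could
    # run in parallel on separate workers.
    regions = [step(r, lh, rh) for r, (lh, rh) in zip(regions, halos)]

print(np.concatenate(regions))
```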
Graphics rendering does large numbers of triangles in parallel. The combining operation has to be associative and commutative. Z-buffers make that happen for opaque and transparent surfaces. Translucency is harder, but there are tricks to pull it off without a depth sort. See "order independent transparency".
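For opaque geometry the per-pixel combine is essentially "keep the nearest fragment", i.e. a min over depth, which is where the associativity and commutativity come from. A toy sketch:

```python
def merge(frag_a, frag_b):
    """Keep the fragment closest to the camera; fragments are (depth, color)."""
    return frag_a if frag_a[0] <= frag_b[0] else frag_b

f1, f2, f3 = (0.8, "red"), (0.3, "green"), (0.5, "blue")

# Order and grouping don't matter for opaque surfaces:
assert merge(merge(f1, f2), f3) == merge(f1, merge(f2, f3)) == (0.3, "green")
assert merge(f2, f1) == merge(f1, f2)
```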
There are assembly-line approaches, where data progresses forward through multiple stages, with little or no data moving backwards. Audio and video processing work like that. So do some kinds of vision systems. This is the oldest approach - it's how large punched card tabulator shops worked, with trays of punched cards being carried from machine to machine.
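In code, the assembly-line shape is just stages connected by queues, with data only flowing forward. A minimal two-stage sketch:

```python
import queue
import threading

def stage(inbox, outbox, work):
    # One worker per stage; items move strictly downstream.
    while (item := inbox.get()) is not None:
        outbox.put(work(item))
    outbox.put(None)  # pass the shutdown signal along

q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(q_in, q_mid, lambda x: x * 2)).start()
threading.Thread(target=stage, args=(q_mid, q_out, lambda x: x + 1)).start()

for item in range(5):
    q_in.put(item)
q_in.put(None)

while (result := q_out.get()) is not None:
    print(result)  # 1, 3, 5, 7, 9
```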
Most things you can do on a GPU seem to fit one of those four patterns. If you can hammer a problem into one of them, a GPU might help. If not, well...
It should be noted that operations like addition on floating point numbers are not associative (even though addition on real numbers is). Round-off error changes depending on the order of operations. The worst-case error of divide-and-conquer summation (which you might see in parallelized addition) is much better than that of sequential addition, although not as good as more elaborate algorithms like Kahan summation.
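A quick demonstration in double precision (pairwise_sum and kahan_sum here are sketch implementations, not library calls):

```python
def pairwise_sum(xs):
    """Divide-and-conquer summation: error grows roughly with log n."""
    if len(xs) <= 2:
        return sum(xs)
    mid = len(xs) // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

def kahan_sum(xs):
    """Compensated summation: carries the rounding error forward explicitly."""
    total, comp = 0.0, 0.0
    for x in xs:
        y = x - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return total

# Floating-point addition is not associative:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

# One large value followed by many small ones; the exact answer is 1e16 + 1000.
xs = [1e16] + [1.0] * 1000
exact = 1e16 + 1000
for name, f in [("sequential", sum), ("pairwise", pairwise_sum), ("kahan", kahan_sum)]:
    print(name, f(xs) - exact)  # sequential loses all thousand 1.0s
```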
Generally, recurrent sequences cannot be computed in parallel, but the Fibonacci sequence is a special case because it's a linear recurrence. There's a closed-form equation for the Nth number.
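Binet's closed form gives the Nth number directly (within float precision), and because it's a linear recurrence you can also compute it exactly from powers of [[1, 1], [1, 0]]; matrix multiplication is associative, so the power takes O(log n) multiplications and can even be split across workers. A sketch of both:

```python
import math

def fib_closed_form(n):
    """Binet's formula; exact only while float precision holds (roughly n <= 70)."""
    phi = (1 + math.sqrt(5)) / 2
    return round(phi ** n / math.sqrt(5))

def fib_matrix(n):
    """Exact: n-th power of [[1, 1], [1, 0]] by repeated squaring."""
    def mul(a, b):
        return [
            [a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]],
            [a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]],
        ]
    result, base = [[1, 0], [0, 1]], [[1, 1], [1, 0]]
    while n:
        if n & 1:
            result = mul(result, base)
        base = mul(base, base)
        n >>= 1
    return result[0][1]

expected = [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
assert [fib_closed_form(n) for n in range(10)] == expected
assert [fib_matrix(n) for n in range(10)] == expected
```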
Linked-list walks and other tree traversals are sequential.
Raytracing (which is a tree traversal) is a notable exception because there are so many millions of rays being launched that it becomes efficient to group them back together and parallelize the traversals all at once.
So if you have "enough" of them going at once, it's back to GPU superiority. But a *SINGLE* linked list or *SINGLE* tree search will always be best done on a CPU.
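To illustrate the contrast (a toy dict-based tree, purely a sketch): a single search is a chain of dependent pointer loads, but a big batch of independent searches maps trivially across workers:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def insert(node, key):
    # A tiny binary search tree stored as dicts; traversal is pointer chasing.
    if node is None:
        return {"key": key, "left": None, "right": None}
    side = "left" if key < node["key"] else "right"
    node[side] = insert(node[side], key)
    return node

def search(node, key):
    # Each step depends on the previous load, so a single search is inherently serial.
    while node is not None and node["key"] != key:
        node = node["left"] if key < node["key"] else node["right"]
    return node is not None

root = None
keys = list(range(0, 1000, 2))
random.shuffle(keys)
for k in keys:
    root = insert(root, k)

# ...but thousands of *independent* searches batch trivially (illustrative only;
# in CPython the GIL means real speedups need processes, numpy, or a GPU).
with ThreadPoolExecutor() as pool:
    hits = sum(pool.map(lambda q: search(root, q), range(1000)))
print(hits)  # 500 of the 1000 queries find their key
```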
For that example, couldn't you do exactly that? Multiplication is commutative.