The study of GPU algorithms is completely different and independent of regular CPU algorithms.
Sorting: Parallel Bitonic Sort networks. Radix Sort. Mergepath.
Hashing: Parallel Cuckoo Hash.
Graphs: Strangely similar to CPUs. Depth-first search to save space, breadth-first to generate parallelism. You'll need a parallel SIMD-stack / SIMD-queue built on prefix sums to efficiently push/pull items onto/off the stack/queue (see the sketch after this list), but it's not too difficult, actually (just non-obvious to the CPU programmer).
Trees: BVH traversal (aka: Raytracing) is extremely common, showing that high-throughput / parallel tree traversal is possible on GPUs.
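Here's a minimal sketch of that SIMD-queue idea in CUDA, assuming a hypothetical BFS-style frontier push (the names push_frontier, flags, items, out_queue are mine, not from any library): each thread flags whether it has an item, a block-level prefix sum assigns each flagged item a unique slot, and one atomic per block reserves space in the global queue.

    // Hypothetical kernel: push this block's flagged items onto a global queue.
    // flags[i] = 1 if thread i has an item to push, 0 otherwise.
    // Launch with 256 threads per block; *out_count must start at 0.
    __global__ void push_frontier(const int* flags, const int* items, int n,
                                  int* out_queue, int* out_count)
    {
        __shared__ int scan[256];             // one slot per thread in the block
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;

        int flag = (gid < n) ? flags[gid] : 0;
        scan[tid] = flag;
        __syncthreads();

        // Hillis-Steele inclusive scan over the block: log2(blockDim.x) steps.
        for (int offset = 1; offset < blockDim.x; offset *= 2) {
            int v = (tid >= offset) ? scan[tid - offset] : 0;
            __syncthreads();
            scan[tid] += v;
            __syncthreads();
        }

        // One atomic per block reserves a contiguous range in the global queue.
        __shared__ int block_base;
        if (tid == blockDim.x - 1) block_base = atomicAdd(out_count, scan[tid]);
        __syncthreads();

        // Exclusive position = inclusive sum minus own flag.
        if (flag) out_queue[block_base + scan[tid] - flag] = items[gid];
    }

The same flag / scan / scatter pattern is the stream-compaction building block behind SIMD stacks, BFS frontiers, filtering, and so on.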
--------
The problem is that GPUs are designed for 10,000 "threads" (really, SIMD-lanes). In fact, I can almost certainly say that any program with fewer than 1,000 threads will run faster on a CPU than on a GPU (unless you have a rare, purely memory-bandwidth-bound problem, like memcpy or memset, because GPU RAM is way faster than standard DDR4).
There are all sorts of techniques to discover more "threads" (or parallel streams of computation) through the use of prefix sums. But ultimately, a "practical" GPU application must perform tens of thousands of calculations in parallel to beat a CPU.
Indeed: something like parallel Bitonic Sort is technically less efficient: it does O(n * log^2(n)) work instead of the O(n * log(n)) of a good sequential sort. However, because it creates so many parallel threads of compute, it's a great fit as a simple GPU sorting algorithm. (GPU Mergepath sort stays within a log2(processors) factor of the sequential version's total work, and is therefore considered almost as efficient as the original sequential algorithm.)
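For a concrete picture, here's a rough CUDA sketch of the classic bitonic compare-exchange pass (kernel and host function names are my own; in practice you'd reach for a library sort such as Thrust's). Every element gets its own thread on every pass, which is exactly why it maps so well to a GPU despite the extra total work:

    // One compare-exchange pass of bitonic sort, launched once per (k, j) stage.
    __global__ void bitonic_step(float* data, int n, int j, int k)
    {
        unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned partner = i ^ j;                  // element to compare against
        if (i >= (unsigned)n || partner <= i) return;

        bool ascending = ((i & k) == 0);           // direction of this subsequence
        float a = data[i], b = data[partner];
        if ((a > b) == ascending) {                // swap if out of order
            data[i] = b;
            data[partner] = a;
        }
    }

    // Host side: n must be a power of two. O(n log^2 n) total comparisons,
    // but every comparison within a pass runs in parallel.
    void bitonic_sort(float* d_data, int n)
    {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        for (int k = 2; k <= n; k <<= 1)
            for (int j = k >> 1; j > 0; j >>= 1)
                bitonic_step<<<blocks, threads>>>(d_data, n, j, k);
        cudaDeviceSynchronize();
    }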
That's the thing: in most cases, a parallel program ends up doing slightly more total work than the original sequential algorithm.
Matrix Multiplication is "great" because the parallel version has exactly the same number of additions and multiplications as the original sequential version. So everyone loves it as a simple parallelism example. It's even stronger: the GPU algorithm not only has the same number of computations, but also the same number of memory reads/writes. It's the ideal algorithm to run in parallel.
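A minimal CUDA sketch of what that looks like (naive, untiled version; the names are mine): one thread per output element, and the arithmetic is exactly the N*N*N multiply-adds of the sequential triple loop.

    // Naive parallel matrix multiply: one thread per element of C.
    // Same N*N*N multiply-adds as the sequential triple loop; the
    // parallelism costs no extra arithmetic.
    // Example launch: matmul<<<dim3((N+15)/16, (N+15)/16), dim3(16, 16)>>>(dA, dB, dC, N);
    __global__ void matmul(const float* A, const float* B, float* C, int N)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= N || col >= N) return;

        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }

(A production kernel would tile A and B through __shared__ memory for bandwidth, but that's an optimization detail; the operation count doesn't change.)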
Other cases, such as sorting, graph algorithms, or whatever, will end up using slightly more overall compute (so there's a tradeoff), or traversing in a different order (CPUs favor depth-first search; GPUs will go far more breadth-first before their tens of thousands of threads fill up, so you end up traversing the graph / tree in a different order and getting slightly different results).
As such, other algorithms require the programmer to make a judgement: is it worth paying a log2(processors) factor of extra total work in order to use a GPU and perform the calculation in parallel? Almost certainly: with 10,000 processors that's only a ~13x work increase, spread across 10,000 lanes. But other algorithms blow up by O(processors) or worse, at which point a purely sequential version (or a low thread-count one: like 32 threads) would be better than 10,000 threads.
GPU lanes are much slower: they access memory at far higher latencies, and GPUs are in-order processors (lacking the branch prediction, out-of-order execution, and superscalar tricks of the CPU world). As such, GPU threads are individually much, much slower than CPU threads. You just get a ton of GPU "threads" to make up the difference.
Do you know of any good books or references where I could learn more about these things?
The books that are usually recommended seem to be CUDA-centric and out of date. I'm interested in learning the more general concepts you talk about in your answer, so that I can effectively write e.g. Monte Carlo simulations on arbitrary GPU hardware. (I don't have an Nvidia GPU!)
> The books that are usually recommended seem to be CUDA-centric and out of date.
The CUDA ones are usually the best, because at least they use a modern API. The other recommendations I've gotten are the 80s stuff on vector computers and the CM-2.
It turns out that a huge community of high-performance programmers experimented with all of these concepts in the 1970s, 80s, 90s, and 00s, long before GPUs existed. All of their concepts still work today on modern GPUs.
Looking up the right keywords, such as "CREW-PRAM algorithms" (Concurrent-Read, Exclusive-Write Parallel RAM model), immediately gives you plenty of results for some things. (Ex: I just searched 'CREW-PRAM DFS' and got: https://core.ac.uk/download/pdf/82490222.pdf).
The key is knowing what the "old word" for the GPU was. That's PRAM, the Parallel-RAM model: that's what programmers in the 1970s, 1980s, and 1990s called this GPU-style of computing when they wrote algorithms for it back then.
Newer articles / books talk about GPUs directly.
--------------
I'd say the fundamentals were covered in the 1970s, in papers such as "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations" by Kogge and Stone, which led to the development of the prefix sum (as it is called today).
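To make that concrete, here's a small sketch of the idea in that paper in modern CUDA/Thrust terms (the Affine / Compose names and the coefficients are mine, purely for illustration): a first-order recurrence x[i] = a[i]*x[i-1] + b[i] is a chain of affine maps, and composition of affine maps is associative, so the whole chain can be evaluated as a parallel prefix "sum" over that operator.

    // Kogge & Stone's point, in modern terms: solve x[i] = a[i]*x[i-1] + b[i]
    // with a parallel prefix scan over affine-map composition.
    #include <thrust/device_vector.h>
    #include <thrust/scan.h>
    #include <cstdio>

    struct Affine { float a, b; };             // represents x -> a*x + b

    struct Compose {                           // apply p first, then q
        __host__ __device__
        Affine operator()(const Affine& p, const Affine& q) const {
            return { q.a * p.a, q.a * p.b + q.b };
        }
    };

    int main() {
        // Made-up coefficients: x[0] = 1, x[i] = 0.5*x[i-1] + 1.0
        int n = 8;
        thrust::device_vector<Affine> maps(n, Affine{0.5f, 1.0f});
        thrust::device_vector<Affine> prefix(n);

        // Parallel prefix over an associative operator (not just "+").
        thrust::inclusive_scan(maps.begin(), maps.end(), prefix.begin(), Compose());

        // prefix[i] composes maps[0..i], so x[i+1] = prefix[i].a * x0 + prefix[i].b
        float x0 = 1.0f;
        for (int i = 0; i < n; ++i) {
            Affine m = prefix[i];              // copies one element back to the host
            std::printf("x[%d] = %f\n", i + 1, m.a * x0 + m.b);
        }
        return 0;
    }

Swap the operator for ordinary + and you get the plain prefix sum that GPU code leans on everywhere today.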
Of course, CUDA code is clearer than a theoretical CREW-PRAM discussion. So perhaps it's easier if you just read NVidia's GPU Computing Gems books to cover the same material. Still, I find that the 1970s writing is often better written for a beginner (back then, fewer programmers knew how parallel programming worked, but the ones who did were clearly more math-heavy than today), so reading old articles like that one helps shape my own thinking.
---------
Ah right, the recommendation. I'd say "Data Parallel Algorithms" by Hillis and Steele (Communications of the ACM, 1986) is an excellent introduction to general-purpose SIMD compute. It was written for the Connection Machine, an ancient supercomputer that no longer exists, but the PRAM style applies directly to GPU algorithms today.
It's only about 15 pages, but it really opens your eyes to the possibilities. A lot of CUDA material is very specific to NVidia GPUs (important details like bank conflicts and shared memory, which you absolutely should learn about... but those details are best studied after you've learned the way of parallel thinking / the PRAM model / etc.).
Oh: in case it isn't clear... CUDA is just the most popular GPGPU programming language right now. There's also ISPC, OpenCL, OpenACC, DPC++, SYCL, and many others.
They are all closely related to the PRAM model, however. So the study of algorithms isn't really about learning CUDA details, but about learning the generic (and cross-language) concepts of parallel programming.
------
So it really doesn't matter whether you're reading CUDA or C-star (1980s code for the old CM-2 supercomputer). They're both PRAM in concept, and therefore somewhat interchangeable in your brain.
It helps to know the GPU quirks for the highest levels of optimization (wavefront programming, uniform branches, __shared__ memory), but you can learn those details after learning the generic PRAM material that has been known for decades.
Hey, dragon, thanks so much for your thoughtful replies. This thread turned into a bonanza of revelations about the future of graphics APIs. And I have to concur: CPUs are getting so inexpensive and powerful that proprietary renderers (designed to run on farms as well) simply target vector extensions for parallelism.
Regarding learning GPU architectures and programming, the Graphics Gems books usually have an introductory section devoted to compute. But you are on your own regarding streaming, tracing, tuning, and multi-GPU. All still very much dark arts ;)