
A sketch of string unescaping on GPGPU - panic
https://raphlinus.github.io/personal/2018/04/25/gpu-unescaping.html
======
epberry
It's late and I may not be thinking clearly or be well informed enough but...

I wonder if this type of thing - implementing algorithms on GPUs that have
been the bread and butter of CPUs for generations (parsing regular languages?)
- begets a new style of using computer hardware. The sketch shows several
clever techniques to get a modest performance gain on the GPU, but what
happens when GPU hardware keeps getting better and CPUs stagnate?

I keep thinking back to that Feynman lecture on computers using the analogy
of an office filing system. There the final step was a filing clerk so dumb he
could only add one to things, but he could do it so fast that he was still
better than the filing clerk who could reason out all the operations. Can we
extend that analogy to have 8 filing clerks all adding one to different wrong
indexes and then choosing the right one at the end? (I'll be honest, my
understanding of how GPUs work is rudimentary.)

Perhaps the gains we get from doing the wrong things in parallel actually
start to outweigh CPUs doing the right things in sequence (e.g. branch
mispredictions).

EDIT: My ramblings aside, kudos to the author for a thought provoking piece.

~~~
gmueckl
Some GPU-based methods are faster when they rely on raw parallel processing
power, even if that causes a huge amount of work to be discarded. These chips
are well suited to regular, embarrassingly parallel, dumb work. That advantage
drops sharply as branching and unordered memory accesses increase, although
GPUs keep getting better at that. So right now it can be better to do the dumb
thing instead of the smart thing.

I think that sorting on the GPU is a good example: a parallel quicksort is
hard to implement and maps badly to GPU threads. A bitonic sort algorithm is
likely to be much faster, although it is a very dumb algorithm.
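
To make that concrete, here is a minimal bitonic sort sketch in CUDA (my own
illustration, not code from the article): it assumes a single thread block and
a power-of-two chunk that fits in shared memory. Every thread runs the same
fixed compare-and-swap schedule regardless of the data, which is exactly the
kind of regular, dumb work GPUs are good at.

    // bitonic_sort.cu -- sort one power-of-two chunk in shared memory
    #include <cstdio>
    #include <cuda_runtime.h>

    constexpr int N = 1024;  // must be a power of two and fit in one block

    __global__ void bitonic_sort(float *data) {
        __shared__ float buf[N];
        int tid = threadIdx.x;
        buf[tid] = data[tid];
        __syncthreads();

        // Build and merge bitonic sequences of doubling length k.
        for (int k = 2; k <= N; k <<= 1) {
            for (int j = k >> 1; j > 0; j >>= 1) {
                int partner = tid ^ j;  // fixed butterfly pattern, data-independent
                if (partner > tid) {
                    bool ascending = ((tid & k) == 0);
                    float a = buf[tid], b = buf[partner];
                    if ((a > b) == ascending) {  // swap if the pair is out of order
                        buf[tid] = b;
                        buf[partner] = a;
                    }
                }
                __syncthreads();
            }
        }
        data[tid] = buf[tid];
    }

    int main() {
        float host[N];
        for (int i = 0; i < N; ++i) host[i] = float((i * 2654435761u) % 1000);

        float *dev;
        cudaMalloc(&dev, N * sizeof(float));
        cudaMemcpy(dev, host, N * sizeof(float), cudaMemcpyHostToDevice);
        bitonic_sort<<<1, N>>>(dev);
        cudaMemcpy(host, dev, N * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);

        for (int i = 1; i < N; ++i)
            if (host[i - 1] > host[i]) { printf("not sorted at %d\n", i); return 1; }
        printf("sorted %d elements\n", N);
        return 0;
    }

A real GPU sort would sort many such chunks in parallel and then merge them
across blocks, but the chunk-level kernel already shows the shape of the work.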

~~~
Coding_Cat
In general, all algorithms seem to get pushed more and more towards
optimizing for memory access rather than for computation. GPUs are no
exception.

Personally, I wonder more what effect widespread HBM (with speeds
realistically expressed in TB/s for future iterations) will have on the
algorithms we design.

~~~
gmueckl
Well, GPUs expose the memory hierarchy to the programmer and have atrocious
latency when hitting global memory with unordered memory access patterns. Your
only chance to make a GPU outperform a CPU is to design your algorithm
completely around memory usage. The difference between doing it right and
wrong can be several orders of magnitude.
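
To put rough numbers on the access-pattern point, here is a sketch (my own,
with arbitrary sizes): both kernels touch exactly the same elements, but in
the first one neighbouring threads read neighbouring addresses, so a warp's
loads coalesce into a few wide transactions, while the second one scatters
them across the whole buffer.

    // coalescing.cu -- same work, different global memory access patterns
    #include <cstdio>
    #include <cuda_runtime.h>

    constexpr int N = 1 << 24;     // 16M floats; size is an arbitrary choice
    constexpr int STRIDE = 1021;   // odd stride, coprime to N, defeats coalescing

    __global__ void copy_coalesced(const float *in, float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) out[i] = in[i];  // adjacent threads, adjacent addresses
    }

    __global__ void copy_strided(const float *in, float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) {
            int j = (int)(((long long)i * STRIDE) % N);  // same elements, scattered order
            out[j] = in[j];
        }
    }

    int main() {
        float *in, *out;
        cudaMalloc(&in, N * sizeof(float));
        cudaMalloc(&out, N * sizeof(float));
        cudaMemset(in, 0, N * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        int blocks = (N + 255) / 256;
        float ms = 0.0f;

        cudaEventRecord(start);
        copy_coalesced<<<blocks, 256>>>(in, out);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        printf("coalesced copy: %.2f ms\n", ms);

        cudaEventRecord(start);
        copy_strided<<<blocks, 256>>>(in, out);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);
        printf("strided copy:   %.2f ms\n", ms);

        cudaFree(in);
        cudaFree(out);
        return 0;
    }

On typical hardware the strided version is several times slower despite doing
the same amount of work, and genuinely random access patterns fare even worse.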

Even CPUs show huge performance gains when all of an algorithm's data fits in
the cache. This can easily be a factor of 10 or more when you get it right.
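
And for the CPU-cache half of that, a tiny stand-alone host-side C++ sketch
(my own illustration; the matrix size is arbitrary): both loops sum the same
4096 x 4096 matrix, but the column-major walk jumps a full row between
accesses, so nearly every load misses the cache.

    // cache_walk.cpp -- sum a matrix in cache-friendly vs. cache-hostile order
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 4096;
        std::vector<float> m(size_t(n) * n, 1.0f);  // 64 MB, far bigger than cache

        auto bench = [&](bool row_major) {
            auto t0 = std::chrono::steady_clock::now();
            double sum = 0.0;
            for (int i = 0; i < n; ++i)
                for (int j = 0; j < n; ++j)
                    sum += row_major ? m[size_t(i) * n + j]   // walks memory linearly
                                     : m[size_t(j) * n + i];  // jumps a row per access
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                          std::chrono::steady_clock::now() - t0).count();
            printf("%s: sum=%.0f in %lld ms\n",
                   row_major ? "row-major" : "col-major", sum, (long long)ms);
        };

        bench(true);
        bench(false);
        return 0;
    }

The exact ratio depends on the machine, but the linear walk is reliably the
fast one.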

