
Sorting with GPUs: A Survey - lainon
https://arxiv.org/abs/1709.02520
======
oelmekki
Cool to see research in that field. I realized this year that for gaming, GPU
and VRAM are actually more important than CPU and RAM; that probably says
something about the underuse of GPUs in general computing.

There's one big blocker as far as I can tell, though: portability. When it
comes to gaming, neural networks, cryptomining, etc., I always see "only
Nvidia cards supported" or "works best on AMD". If we were to use the GPU in
just any kind of application, we would need a hardware abstraction library
that supports any kind of GPU, including Intel chips.

Is such an effort already being worked on, at any stage of completion?

~~~
idle_zealot
Something like
[https://en.wikipedia.org/wiki/OpenCL](https://en.wikipedia.org/wiki/OpenCL) ?

~~~
oelmekki
Awesome, thanks. I've seen the name mentioned several times, but I didn't know
what was behind it.

It seems (unsurprisingly for a hardware abstraction) that performance is a
problem, but at least it's a problem that is being worked on. Maybe at some
point we will stop thinking "we can't implement that, it's way too massive to
perform properly" and start thinking "this is a job for the GPU" :) At the
very least, databases come to mind, with their sorting/filtering of massive
data.

~~~
roel_v
The problem with OpenCL isn't performance per se, but performance portability
(well, it's only a problem for those who need such a thing, of course - many
people don't). When you write OpenCL code and tweak it for one CPU or GPU, it
might run at 1/10th the speed on another. This is something you don't have
with an API that works only on GPUs from one vendor, although even there,
different generations of hardware might prefer different parameters or
tradeoffs.

Now you can write OpenCL kernels that automatically tweak themselves to run as
fast as possible on different hardware, but that requires significant extra
work over just getting it to work at all.
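The auto-tuning idea can be sketched in plain Python (the kernel and the
candidate work-group sizes here are stand-ins, not real OpenCL calls): time
the same kernel under each candidate launch parameter and keep the fastest.

```python
import time

# Sketch of kernel auto-tuning: benchmark one kernel under several candidate
# launch parameters and keep the fastest. Real tuners time actual OpenCL
# kernel launches; run_kernel below is a hypothetical stand-in.

def run_kernel(data, work_group_size):
    # Stand-in for launching a kernel with a given work-group size.
    # We chunk the work so each configuration has a measurable cost.
    out = []
    for i in range(0, len(data), work_group_size):
        out.extend(sorted(data[i:i + work_group_size]))
    return out

def autotune(data, candidates):
    best_size, best_time = None, float("inf")
    for size in candidates:
        start = time.perf_counter()
        run_kernel(data, size)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_size, best_time = size, elapsed
    return best_size

best = autotune(list(range(10000, 0, -1)), [32, 64, 128, 256])
```

A real tuner would also re-run each candidate several times and cache the
winning configuration per device.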

And finally, CUDA has a bunch of hand-tweaked libraries for common numerical
operations (matrix multiply, FFT, ...) that are (partly) written in NVIDIA
GPU assembly (PTX), so those operations will be faster on CUDA than on
OpenCL.

CUDA is also (a bit) easier to write/use than OpenCL code and the tooling is
better, so that's another reason people often default to CUDA.

~~~
14113
The LIFT project ([http://www.lift-project.org/](http://www.lift-project.org/))
is specifically trying to solve the problem of performance portability. Our
approach relies on a high-level model of computation (think of something like
a functional, or pattern-based, programming language) coupled with a
rewrite-based compiler that explores the space of OpenCL programs with which
to implement a computation.

We get really quite good results over a number of benchmarks - check out our
papers!
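To illustrate the rewrite-based idea in miniature (this is a toy, not LIFT's
actual IR or rules): represent programs as nested terms and apply a semantics-
preserving rewrite, such as map fusion, to move between equivalent programs
the compiler could emit.

```python
# Toy rewrite-based compilation sketch (not LIFT's actual rules): programs
# are nested tuples, and the map-fusion rule rewrites
# ("map", f, ("map", g, xs)) into ("map", f.g, xs), so the fused program
# traverses its input once instead of twice.

def fuse_maps(prog):
    # Apply the map-fusion rewrite wherever it matches.
    if isinstance(prog, tuple) and prog[0] == "map":
        _, f, inner = prog
        inner = fuse_maps(inner)
        if isinstance(inner, tuple) and inner[0] == "map":
            _, g, xs = inner
            return ("map", lambda x, f=f, g=g: f(g(x)), xs)
        return ("map", f, inner)
    return prog

def evaluate(prog):
    # Reference interpreter used to check that rewrites preserve meaning.
    if isinstance(prog, tuple) and prog[0] == "map":
        _, f, xs = prog
        return [f(x) for x in evaluate(xs)]
    return prog

prog = ("map", lambda x: x + 1, ("map", lambda x: x * 2, [1, 2, 3]))
fused = fuse_maps(prog)
# Both forms compute the same result; a real compiler benchmarks the
# rewritten candidates and keeps the fastest one for the target device.
```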

~~~
geokon
How does it compare to SYCL, which someone else mentioned in another comment?

It sounds like it's trying to do a similar thing.

------
exDM69
Sorting is a fundamental problem, so this is important stuff, but I can't come
up with a practical problem involving large sorting operations right now. Can
anyone come up with a practical application for this? I have no doubt they
exist.

I would think the breakeven point for using the GPU (assuming inputs and
results are on the CPU) is several megabytes, or millions of elements, at
least.
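A back-of-envelope model makes the breakeven point concrete. All the
throughput and overhead numbers below are illustrative assumptions, not
measurements:

```python
# Back-of-envelope GPU-sort breakeven sketch. The constants are assumed,
# round numbers for 4-byte keys, not benchmarks.

PCIE_GB_S = 12.0       # assumed effective host<->device bandwidth, GB/s
CPU_MKEYS_S = 300.0    # assumed CPU sort throughput, million keys/s
GPU_MKEYS_S = 1000.0   # assumed GPU sort throughput, million keys/s
OVERHEAD_S = 1e-3      # assumed fixed launch/driver overhead, seconds

def breakeven_holds(n_keys, key_bytes=4):
    # The GPU pays to copy the input in and the results back out,
    # plus a fixed launch overhead that dominates for small inputs.
    transfer_s = 2 * n_keys * key_bytes / (PCIE_GB_S * 1e9)
    gpu_s = OVERHEAD_S + transfer_s + n_keys / (GPU_MKEYS_S * 1e6)
    cpu_s = n_keys / (CPU_MKEYS_S * 1e6)
    return gpu_s < cpu_s
```

With these particular numbers the crossover lands in the high hundreds of
thousands of keys; real crossovers depend heavily on the hardware and on
whether the data already lives on the GPU.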

Writing this paper must have been a lot of effort. There are something like 50
different methods reviewed here. Good thing that papers like this exist.

~~~
ben-schaaf
Rendering transparent models in real time (i.e., for games) requires a sorting
step when they overlap. I can imagine a scene containing a couple hundred
thousand transparent models that need sorting.

~~~
pjc50
Isn't that usually done with the depth buffer?

~~~
panic
Depth buffers only work well for opaque objects, which cover each other
completely. With partial coverage or blending, multiple objects could end up
contributing to a pixel in an order-dependent way (e.g., if you're viewing an
anti-aliased leaf edge under the surface of water through a window). The depth
buffer, which only stores a single depth, doesn't solve this problem directly
-- you can use it to render each transparent layer one-by-one ("depth
peeling"), but this can be slow.
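The order dependence comes from alpha blending not being commutative, which is
why transparent surfaces are typically sorted back to front (the painter's
algorithm) before blending. A single-channel sketch, with made-up distances
and colors:

```python
import math

# Alpha blending is order-dependent, so transparent layers are sorted
# back-to-front before compositing (painter's algorithm). One color
# channel only; positions, colors, and alphas are made up.

def blend(dst, src, alpha):
    # The standard "over" operator for one channel.
    return src * alpha + dst * (1.0 - alpha)

def composite(background, layers, camera):
    # layers: (position, color, alpha); draw the farthest layer first.
    ordered = sorted(
        layers,
        key=lambda layer: math.dist(camera, layer[0]),
        reverse=True,
    )
    color = background
    for _, c, a in ordered:
        color = blend(color, c, a)
    return color

camera = (0.0, 0.0, 0.0)
far_leaf = ((0.0, 0.0, 2.0), 1.0, 0.5)    # bright layer behind
near_water = ((0.0, 0.0, 1.0), 0.0, 0.5)  # dark layer in front
# Same result regardless of input order, because composite sorts by depth.
result = composite(0.0, [near_water, far_leaf], camera)
```

Blending those two layers front-to-back instead would give a visibly
different (wrong) pixel, which is exactly the bug the sort prevents.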

------
gnarbarian
I'm using GPU.js for an n-body gravity simulation in three.js. So far I'm
liking GPU.js, but it has its limitations. Only being able to return a single
float from any function causes a lot of redundancy. For example, in order to
get the 3D acceleration vector for one body, I have to call the function once
per dimension, recomputing all of the temporary variables each pass. Then I
have to make another pass for collision detection, so it ends up being about
4n^2 operations vs just n^2.

Nonetheless, it's still about an order of magnitude faster than the pure CPU
implementation I have.
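The redundancy described above can be sketched in plain Python standing in for
GPU.js kernels (the function names and units are made up): a kernel that can
only return one float has to be invoked once per axis, recomputing the
pairwise geometry each time.

```python
# Sketch of the one-float-per-kernel redundancy: accel_component is called
# once per axis per body, recomputing the pairwise distances each call.

G = 1.0  # gravitational constant, arbitrary units

def accel_component(bodies, i, axis):
    # bodies: list of (x, y, z, mass) tuples. One call per axis.
    xi = bodies[i]
    total = 0.0
    for j, bj in enumerate(bodies):
        if j == i:
            continue
        dx = [bj[k] - xi[k] for k in range(3)]    # recomputed every call
        r2 = dx[0] ** 2 + dx[1] ** 2 + dx[2] ** 2
        total += G * bj[3] * dx[axis] / r2 ** 1.5
    return total

def accel(bodies, i):
    # Three "kernel launches" per body: roughly 3x the pairwise work that
    # a kernel returning a whole vector would need.
    return [accel_component(bodies, i, axis) for axis in range(3)]
```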

In the future they plan to add WebGL 2.0 and OpenCL support, which should
improve the flexibility of the library.

[http://thedagda.co:9000/?stars=true&bodyCount=1000](http://thedagda.co:9000/?stars=true&bodyCount=1000)

You can toggle CPU n-body computation with ?CPU=true in the URL; remove it for
GPU computation.

You can set the number of planets with the bodyCount variable; the default is
1000.

If you have a beefy computer, try bodyCount=4000.

gamepad=true works if you have an Xbox controller hooked up.

On GitHub:

[https://github.com/ubernaut/spaceSim](https://github.com/ubernaut/spaceSim)

~~~
Zelizz
Why not calculate the unchanging intermediate results in separate functions so
that you can pass them as arguments to the final function?

~~~
gnarbarian
Easier said than done. I can't get superkernels to work at all. I've tried.

------
leecarraher
Sorting seems like a memory-hard problem where the computationally optimal
solution, merge sort, is heavily branched: two things GPUs have never been
terribly good at. Furthermore, the synchronized program counter per block of
threads presents a considerable roadblock to achieving optimal thread
occupancy, specifically in the CUDA architecture.
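This is why GPU sorts usually favor sorting networks over merge sort: a
bitonic network performs the same fixed compare-exchange pattern no matter
what the input is, so lockstep (SIMT) threads never diverge on data-dependent
branches. A plain-Python sketch of the network (a GPU version would run the
inner loop's compare-exchanges in parallel):

```python
# Bitonic sorting network: a fixed, input-independent pattern of
# compare-exchange steps, which suits lockstep GPU execution.
# The input length must be a power of two.

def bitonic_sort(a):
    a = list(a)
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:       # compare-exchange distance within each merge
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # The direction depends only on the index, never on
                    # the data, so every thread takes the same path.
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

The trade-off is extra work: the network always does O(n log^2 n) comparisons,
but every one of them is branch-uniform and has a statically known memory
access pattern.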

------
frozenport
The table at the end would benefit from a column indicating the availability
of the source code.

