
World's fastest radix sort? 1B 32bit keys a second using a stock GTX 480 - junkbit
http://code.google.com/p/back40computing/wiki/RadixSorting
======
jacquesm
GPUs are scary fast if you have the 'right' problem.

~~~
maximilian
What I keep seeing is that the Fermi video game cards (GTX480) are faster than
the scientific-computing-minded (Tesla C20__) cards. According to those
graphs, the difference isn't trivial either, which is kinda surprising.

~~~
wmf
For this _integer_ program the difference is not due to accuracy, DP, or
memory models. The GTX480 has 480 "cores" at 1.4 GHz while the C2050 has 448
"cores" at 1.15 GHz. I can't explain why Nvidia's more expensive card has
lower performance, other than that they can get away with it.
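In back-of-envelope terms, using just the core counts and clocks quoted above (Python as a calculator; this ignores memory bandwidth and everything else that actually matters):

```python
# Raw ALU throughput = CUDA cores * clock (GHz), numbers from the comment above.
gtx480 = 480 * 1.40
c2050 = 448 * 1.15

ratio = gtx480 / c2050
print(f"GTX480/C2050 raw ALU throughput: {ratio:.2f}x")  # ~1.30x
```

So on paper the GeForce has roughly a 30% edge before drivers or ECC enter the picture at all.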

~~~
manvsmachine
In the case of the Fermi-class cards, you're paying for more memory (3 or 6GB
vs 1 or 2), stability, tweaked drivers, and (possibly most importantly) error
correction. It's really the same argument as for workstation graphics cards:
consumer-class cards are designed to sacrifice floating-point accuracy for
maximum speed, and in some cases that's unacceptable.

That said, the 20-series is _really_ freaking expensive. When I got my
C1060's, they were ~$1500 each retail; I think the C2070's are nearly twice
that.

~~~
wmf
It seems like Nvidia would have less of an image problem if the Teslas had
the same performance _and_ more RAM _and_ ECC. (Also, the value of
ECC is diminished if a Tesla is over 3x the price of a GeForce, because then
you can run three times and vote.)
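The "run three times and vote" idea, sketched in Python with a hypothetical `run` callable standing in for whatever kernel you'd launch on a non-ECC card (assumption: errors are rare and independent, so two runs agreeing almost certainly means both are correct):

```python
from collections import Counter

def vote(run, data):
    """Run an unreliable computation three times and take the majority result.

    `run` is a stand-in for a kernel launch on a non-ECC card; a bit flip
    in one run is outvoted by the other two.
    """
    results = [tuple(run(data)) for _ in range(3)]
    winner, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no two runs agreed; rerun or fail")
    return list(winner)
```

For example, `vote(sorted, [3, 1, 2])` returns `[1, 2, 3]`. Of course, as the reply below points out, this trades three cards' worth of power and rack space for the ECC you didn't buy.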

~~~
jacquesm
> It seems like Nvidia would have less of an image problem if the Teslas had
> the same performance and more RAM and ECC.

That might be hard while still going for reliability. It's a bit like
overclocking your CPU: you can do it, but it voids the warranty. In this case
the overclockers are the third-party board manufacturers, trying to gain an
edge over their competitors that they can paste on the box, so that a customer
comparing two boxes in the store will pick theirs.

> (Also, the value of ECC is diminished if a Tesla is over 3x the price of a
> GeForce, because then you can run three times and vote.)

That's not really true. In most situations where Tesla boards are used you'll
find multiple boards in one box already. The limiting factor here is how many
boards you can pack into a system: more = better. So if you used the 'voting'
principle you'd be wasting tons of power, rack space and host machines in
order to offset the price of the ECC, not just the two extra cards.

~~~
wmf
_In this case the overclockers are the third party board manufacturers that
are trying to get an advantage over their competitors_

I don't think so; most GeForce cards appear to use stock clocks and Nvidia
reference PCBs and coolers.

------
whakojacko
As with lots of GPU-accelerated benchmarks, unfortunately, this one ignores
the time to transfer the 1B keys from main memory to the GPU and back.

~~~
jacquesm
That's not entirely fair. If you benchmark a sorting algorithm you normally
don't measure the time it takes to bring the records in from the disk either.

Of course there is overhead, that's obvious; there would be overhead in any
co-processor-driven setup. But that overhead depends to a large extent on the
host machine and the bus used to connect them, so including those figures in
the timing would reduce the value of the benchmark.

Also, there are plenty of applications where the input to the radix sort would
come from other kernels and/or where the output would go to other kernels.

In those cases there is no overhead.
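The distinction being drawn is roughly this (a sketch, with plain Python callables standing in for the host-to-device copy, the sort kernel, and the device-to-host copy; a real CUDA benchmark would bracket its memcpys and kernel launches the same way):

```python
import time

def benchmark(keys, to_device, sort_on_device, to_host):
    """Time a co-processor sort two ways: device-only (what the benchmark
    reports) vs. end-to-end including host<->device transfer."""
    t0 = time.perf_counter()
    d_keys = to_device(keys)           # host -> GPU over the bus
    t1 = time.perf_counter()
    d_sorted = sort_on_device(d_keys)  # the part the benchmark measures
    t2 = time.perf_counter()
    result = to_host(d_sorted)         # GPU -> host
    t3 = time.perf_counter()
    return result, t2 - t1, t3 - t0    # sorted keys, kernel time, total time

# With stand-ins for the GPU steps, the structure is the same:
result, kernel_s, total_s = benchmark([3, 1, 2], list, sorted, list)
```

When the input comes from another kernel and the output feeds a third, only the middle interval exists, which is the "no overhead" case above.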

~~~
pbhjpbhj
Can you, jacquesm, or some other knowledgeable sort give an example of a
situation which could call for this level of sorting?

Sorting a billion of anything in 1s seems likely to run up against other
bottlenecks.

Just curious (I'm not asking why you'd want to, but you can tell me that too
if you like).

~~~
_delirium
There's a handful of ML algorithms where sorting is a bottleneck, or where the
algorithm could have some complex data-structures ripped out if sorting were
an order of magnitude faster. They're often hard to implement entirely on the
GPU, though, so you'd have to be careful with the transfer to/from the GPU if
you were using it solely for the sorting.

------
patrickgzill
I have often wondered if you couldn't put the needed database indexes and
other associated data onto a GPU, have the GPU handle the optimization for
the query, run the query, and then just return to the database server which
blocks to read to get the data; the indexes could be synced to disk, of
course, but they would be served from the GPU.

~~~
nl
There's a paper by some people who got MySQL to sort using the GPU, with a
great speedup. It was quite old and I can't find it now.

However, a more recent paper is _Accelerating SQL Database Operations on a GPU
with CUDA_ :
[http://www.cs.virginia.edu/~skadron/Papers/bakkum_sqlite_gpg...](http://www.cs.virginia.edu/~skadron/Papers/bakkum_sqlite_gpgpu10.pdf)

The references in that paper are pretty interesting, too.

~~~
lsb
That looks pretty cool, and the best performance seems to come from
aggregating and doing floating point math.

SQLite is designed for embedding into cell phones and web browsers and other
tiny devices, so I'd expect that there's quite a bit of room for optimizing
the floating point math on server-grade hardware.

------
sgt
Radix sort is my favorite sort algorithm. Here, I made a little Radix sort
video with sound: <http://rasterburn.org/~sgt/stuff3/radixsort.avi>
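For reference, here's what a least-significant-digit radix sort on 32-bit keys looks like in plain Python (8-bit digits, four stable passes; the GPU implementation in the linked article is organized very differently, but the digit-by-digit idea is the same):

```python
def radix_sort32(keys):
    """LSD radix sort for unsigned 32-bit keys: four stable counting-sort
    passes over 8-bit digits, least significant digit first."""
    for shift in (0, 8, 16, 24):
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[(k >> shift) & 0xFF].append(k)
        # Concatenating in bucket order is a stable pass, so earlier
        # (less significant) digit orderings are preserved.
        keys = [k for b in buckets for k in b]
    return keys
```

Because each pass is O(n) and the number of passes is fixed by the key width, the whole sort is O(n) regardless of how the keys are distributed, which is what makes it such a good fit for throughput-oriented hardware.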

------
profquail
There's an ongoing thread in the CUDA forums about it:
<http://forums.nvidia.com/index.php?showtopic=175238>

