

 Sorting data in parallel CPU vs GPU - AlexeyBrin
http://solarianprogrammer.com/2013/02/04/sorting-data-in-parallel-cpu-gpu/

======
defrost
I do enjoy these 21st Century partial rediscoveries of the good old trusty
Fibonacci Polyphase merge sort. It's like the late 70's / early 80's all over
again, only with commodity rather than custom hardware.

There's some online discussion here:
<http://flylib.com/books/en/3.55.1.115/1/>

The book references are: External Sorting, Section 5.4 in The Art of Computer
Programming, Vol. 3: Sorting and Searching by Donald Knuth, Addison-Wesley,
1973.

and External Sorting, Chapter 13 in Algorithms by Robert Sedgewick, Addison-
Wesley, 1983.

Sorting and merging blocks is the beginning of the journey; learning the
various tricks to doing it optimally is where it gets interesting, although
perhaps it's no longer relevant. 30 years ago the OP's approach might take
nearly a day, with polyphase tweaks reducing the grind down to under half an
hour; these days it's all over in a fraction of a second and more or less bound
by I/O throughput - does optimising sort times still matter?

~~~
tolos
> does optimising sort times still matter?

Yes, not all new computers are fast (calculators, cars, µC, etc).

------
jey
Tangential factoid: GNU libstdc++ has a parallel mode that includes parallel
std::sort,
[http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode...](http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html)

------
jbert
Does anyone have an idea what might cause the small bump/spike in performance
on the CPU-8 graph at approx 1e6 elements? It looks to be worth around
20-25% of the performance.

If it's related to a sizing parameter of the parallelisation, that suggests
that choosing that parameter well could be important.

~~~
supergirl
It looks like a sudden degradation of performance. It might be the point where
the caches become full. That CPU has 6 MB L3 + 256 KB L2 × 8, which means ~8 MB
in total. The bump is at 1043786 elements, which require 8154 KB of memory.

------
aseidl
I've rerun the tests on some slightly more recent hardware: a Tesla K20c and
Xeon E5-2630. See the 'new benchmarks' section on:
<https://github.com/aseidl/Sort_data_parallel>

------
xedarius
Interesting, yet I'd have liked a little more of a breakdown of the results.
For example, the GPU will probably need to transfer the sorted data to and from
the card. This will have a dramatic effect on performance.

~~~
moreaccounts
Take a look at CUDA's pinned memory.

~~~
exDM69
Pinned memory doesn't help a whole lot. It's a page-aligned memory buffer for
DMA transfer, so it can only avoid one memcpy on the CPU. It does not improve
the latency over the PCIe bus, which is a much bigger factor than the memcpy
that can be avoided with pinned memory.

~~~
moreaccounts
I've used it to hide the latency of the copy back (GPU -> CPU) inside the
actual algorithm so I didn't need to suffer the large stall (4-5ms). It helped
tremendously because most of the data was fully computed before the algorithm
was actually done running.

------
coolj
Please post the numbers for that box with vanilla (non-parallel) std::sort, on
the whole array, as well! I'm curious!

------
onedognight
It looks like the OP should be using std::inplace_merge instead of
std::merge/std::copy.

------
pavanky
Using a 445M to compare against an i7 is ridiculous (even if the i7 is 3
generations old).

~~~
octopus
You think the average programmer (user) has a Tesla computing device in his
computer?

~~~
benmanns
You can buy an hour with two Tesla GPUs for $2.10 on EC2.

~~~
octopus
Are you allowed to install software on an On-Demand Cluster GPU Instance, or
must you use what is provided by Amazon?

~~~
pavanky
CUDA comes preinstalled on GPU instances, but you can install software as
well.

