Sorting data in parallel CPU vs GPU

defrost · on Feb 4, 2013

I do enjoy these 21st Century partial rediscoveries of the good old trusty Fibonacci Polyphase merge sort. It's like the late 70's / early 80's all over again, only with commodity rather than custom hardware.

There's some online discussion here: http://flylib.com/books/en/3.55.1.115/1/

The book references are: External Sorting, Section 5.4 in The Art of Computer Programming, Vol. 3: Sorting and Searching by Donald Knuth, Addison-Wesley, 1973.

and External Sorting, Chapter 13 in Algorithms by Robert Sedgewick, Addison-Wesley, 1983.

Sorting and merging blocks is the beginning of the journey, learning the various tricks to doing it optimally is where it gets interesting, although perhaps no longer relevant; 30 years ago the OPs approach might take nearly a day with polyphase tweaks reducing the grind down to under half an hour, these days it's all over in a fraction of a second and more or less bound by I/O throughput speed - does optimising sort times still matter?

tolos · on Feb 4, 2013

> does optimising sort times still matter?

Yes, not all new computers are fast (calculators, cars, µC, etc).

jey · on Feb 4, 2013

Tangential factoid: GNU libstdc++ has a parallel mode that includes parallel std::sort, http://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode...

jbert · on Feb 4, 2013

Does anyone have an idea what might cause the small bump/spike in performance is on the CPU-8 graph at approx 1e6 elements? It looks to be worth around 20-25% of the performance.

If it's related to a sizing parameter of the parallelisation, that suggests that choosing that parameter well could be important.

supergirl · on Feb 4, 2013

It looks like a sudden degradation of performance. Might be the point when the caches become full. That CPU has 6MB L3 + 256KB L2 x 8 which means ~8MB. The bump is at 1043786 elements which require 8154KB of memory.

aseidl · on Feb 4, 2013

I've rerun the tests on some slightly more recent hardware: a Tesla K20c and Xeon E5-2630. See the 'new benchmarks' section on: https://github.com/aseidl/Sort_data_parallel

xedarius · on Feb 4, 2013

Interesting, yet I'd have liked a little more of a break down of the results. For example the GPU will probably need to transfer the sorted data to and from the card. This will have a dramatic effect on performance.

exDM69 · on Feb 4, 2013

> This will have a dramatic effect on performance.

Agreed, but the effect is bigger on latency than throughput. If the problem set is bigger than available GPU memory, the latency is at least partially hidden by streaming as the GPU can be kept busy while DMA is still transferring. The same thing applies if you will be doing more computations on the same data in the GPU so it won't have to be transferred back and forth.

moreaccounts · on Feb 4, 2013

Take a look at CUDA's pinned memory.

exDM69 · on Feb 4, 2013

Pinned memory doesn't help a whole lot. It's a page-aligned memory buffer for DMA transfer, it can only avoid doing one memcpy on the CPU. It does not improve the latency over the PCI-e bus, which is a lot bigger factor than the memcpy that can be avoided with pinned memory.

moreaccounts · on Feb 5, 2013

I've used it to hide the latency of the copy back (GPU -> CPU) inside the actual algorithm so I didn't need to suffer the large stall (4-5ms). It helped tremendously because most of the data was fully computed before the algorithm was actually done running.

coolj · on Feb 4, 2013

Please post the numbers for that box with vanilla (non-parallel) std::sort, on the whole array, as well! I'm curious!

onedognight · on Feb 4, 2013

It looks like the OP should be using std::inplace_merge instead of std::merge/std::copy.

pavanky · on Feb 4, 2013

Using 445M to compare against an i7 is ridiculous (Even if the i7 is 3 generations old).

jholman · on Feb 4, 2013

Yeah. To contextualize that a little...

The i7-740QM was introduced almost 3 years ago, June 2010, at release time costing $378 (if you bought in lots of 1000).

I didn't find historical pricing data for the 445M video card, but here's a relevant data point: I bought my GeForce 470 at almost exactly that time, for $420 retail (so the bulk price would be around $400, I think?). Measured in raw FLOPS, the GF470 is over three times as fast as the 445M (the video card tested).

So dollar for dollar, TFA undervalues the GPU by about a factor of three?

Sources: http://en.wikipedia.org/wiki/Comparison_of_Intel_processors http://en.wikipedia.org/wiki/Nehalem_%28microarchitecture%29 http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_p...

aseidl · on Feb 4, 2013

I'm working on running the tests on 2x Xeon E5-2630 and a Tesla K20c. Will post the results here in a couple hours.

octopus · on Feb 4, 2013

You think the average programmer (user) has a Tesla computing device on his computer ?

pavanky · on Feb 4, 2013

The 445M is a mobile GPU and quite low end at that. In fact the Teslas are consistently slower than the GTX cards for single precision (until their k20 line that is). So something like a 480 would be ok. 580 would be ideal.

benmanns · on Feb 4, 2013

You can buy an hour with two Tesla GPUs for $2.10 on EC2.

octopus · on Feb 4, 2013

Are you allowed to install software on an On-Demand Cluster GPU Instances or you must use what is provided by Amazon ?

pavanky · on Feb 6, 2013

cuda comes pre installed with gpu instances. but you can install sotware as well.

allanmac · on Feb 4, 2013

A 445M is an sm_21 Fermi device with 3 SMs (144 cores). This is a low-end GPU.