I do enjoy these 21st Century partial rediscoveries of the good old trusty Fibonacci Polyphase merge sort. It's like the late 70's / early 80's all over again, only with commodity rather than custom hardware.
The book references are:
External Sorting, Section 5.4 in The Art of Computer Programming, Vol. 3: Sorting and Searching by Donald Knuth, Addison-Wesley, 1973.
and
External Sorting, Chapter 13 in Algorithms by Robert Sedgewick, Addison-Wesley, 1983.
Sorting and merging blocks is the beginning of the journey, learning the various tricks to doing it optimally is where it gets interesting, although perhaps no longer relevant; 30 years ago the OPs approach might take nearly a day with polyphase tweaks reducing the grind down to under half an hour, these days it's all over in a fraction of a second and more or less bound by I/O throughput speed - does optimising sort times still matter?
Does anyone have an idea what might cause the small bump/spike in performance is on the CPU-8 graph at approx 1e6 elements? It looks to be worth around 20-25% of the performance.
If it's related to a sizing parameter of the parallelisation, that suggests that choosing that parameter well could be important.
It looks like a sudden degradation of performance. Might be the point when the caches become full. That CPU has 6MB L3 + 256KB L2 x 8 which means ~8MB. The bump is at 1043786 elements which require 8154KB of memory.
Interesting, yet I'd have liked a little more of a break down of the results. For example the GPU will probably need to transfer the sorted data to and from the card. This will have a dramatic effect on performance.
> This will have a dramatic effect on performance.
Agreed, but the effect is bigger on latency than throughput. If the problem set is bigger than available GPU memory, the latency is at least partially hidden by streaming as the GPU can be kept busy while DMA is still transferring. The same thing applies if you will be doing more computations on the same data in the GPU so it won't have to be transferred back and forth.
Pinned memory doesn't help a whole lot. It's a page-aligned memory buffer for DMA transfer, it can only avoid doing one memcpy on the CPU. It does not improve the latency over the PCI-e bus, which is a lot bigger factor than the memcpy that can be avoided with pinned memory.
I've used it to hide the latency of the copy back (GPU -> CPU) inside the actual algorithm so I didn't need to suffer the large stall (4-5ms). It helped tremendously because most of the data was fully computed before the algorithm was actually done running.
The i7-740QM was introduced almost 3 years ago, June 2010, at release time costing $378 (if you bought in lots of 1000).
I didn't find historical pricing data for the 445M video card, but here's a relevant data point: I bought my GeForce 470 at almost exactly that time, for $420 retail (so the bulk price would be around $400, I think?). Measured in raw FLOPS, the GF470 is over three times as fast as the 445M (the video card tested).
So dollar for dollar, TFA undervalues the GPU by about a factor of three?
The 445M is a mobile GPU and quite low end at that. In fact the Teslas are consistently slower than the GTX cards for single precision (until their k20 line that is). So something like a 480 would be ok. 580 would be ideal.
There's some online discussion here: http://flylib.com/books/en/3.55.1.115/1/
The book references are: External Sorting, Section 5.4 in The Art of Computer Programming, Vol. 3: Sorting and Searching by Donald Knuth, Addison-Wesley, 1973.
and External Sorting, Chapter 13 in Algorithms by Robert Sedgewick, Addison-Wesley, 1983.
Sorting and merging blocks is the beginning of the journey, learning the various tricks to doing it optimally is where it gets interesting, although perhaps no longer relevant; 30 years ago the OPs approach might take nearly a day with polyphase tweaks reducing the grind down to under half an hour, these days it's all over in a fraction of a second and more or less bound by I/O throughput speed - does optimising sort times still matter?