Singe: Leveraging Warp Specialization for High Performance on GPUs [pdf] (stanford.edu)
36 points by eslaught on April 28, 2014 | 4 comments



My first reaction: Some Nvidia cards can synchronize between warps? Nice!

I've been living in the OpenCL world, which is pretty much everyone except Nvidia (Nvidia intentionally ignores OpenCL and cripples its support for it), so I had unfortunately missed this development.

On the other hand, the particular use case in the article was to circumvent other limitations of the architecture, such as the relatively small register file and the single program counter shared by the 32 threads of a warp. Clever and interesting nevertheless.
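
To make the "synchronize between warps" idea a bit more concrete, here is a minimal warp-specialization sketch in CUDA (my own toy example, not anything from the paper): one warp per block acts as a producer that stages a tile through shared memory, and the remaining warps consume it. For simplicity it synchronizes with the block-wide __syncthreads(); as I understand it, the Singe-generated code instead uses named barriers (PTX bar.sync / bar.arrive with an explicit barrier id) so that only the warps involved in a particular hand-off have to wait.

  // Minimal warp-specialization sketch (toy example, assumptions noted above).
  // Launch with THREADS (= 256) threads per block, i.e. 8 warps:
  // warp 0 is the producer, warps 1..7 are consumers.
  #define TILE    256   // elements staged per block per iteration
  #define THREADS 256

  __global__ void warp_specialized_saxpy(const float *x, const float *y,
                                         float *out, float a, int n)
  {
      __shared__ float sx[TILE], sy[TILE];

      const int warp_id   = threadIdx.x / warpSize;   // role: 0 = producer, else consumer
      const int lane      = threadIdx.x % warpSize;
      const int consumers = THREADS - warpSize;        // number of consumer threads

      for (int base = blockIdx.x * TILE; base < n; base += gridDim.x * TILE) {
          if (warp_id == 0) {
              // Producer warp: stage the next tile from global into shared memory.
              for (int i = lane; i < TILE; i += warpSize)
                  if (base + i < n) { sx[i] = x[base + i]; sy[i] = y[base + i]; }
          }
          __syncthreads();   // hand-off: the tile is now visible to the consumers

          if (warp_id != 0) {
              // Consumer warps: do the arithmetic on the staged tile.
              for (int i = threadIdx.x - warpSize; i < TILE; i += consumers)
                  if (base + i < n) out[base + i] = a * sx[i] + sy[i];
          }
          __syncthreads();   // make sure everyone is done before the buffer is reused
      }
  }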


There are rumours that Nvidia will sooner rather than later support OpenCL 1.2 [0] - apparently CUDA 6 contains a stub library that has OpenCL 1.2 symbols and more.

[0] http://www.phoronix.com/scan.php?page=news_item&px=MTY2OTg
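
For anyone who wants to check what their installed driver actually advertises, a small host-side probe using the standard OpenCL platform API looks like the sketch below (plain C/C++, buildable with nvcc or anything that links against libOpenCL); the version string it prints is exactly what the rumoured update would bump to 1.2.

  // Print the OpenCL version string each installed platform reports.
  // Build with e.g.:  gcc check_cl.c -lOpenCL   (or nvcc check_cl.cu -lOpenCL)
  #include <CL/cl.h>
  #include <stdio.h>

  int main(void) {
      cl_platform_id platforms[16];
      cl_uint count = 0;
      clGetPlatformIDs(16, platforms, &count);

      for (cl_uint i = 0; i < count && i < 16; ++i) {
          char name[256] = {0}, version[256] = {0};
          clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
          clGetPlatformInfo(platforms[i], CL_PLATFORM_VERSION, sizeof(version), version, NULL);
          // Nvidia currently reports something like "OpenCL 1.1 CUDA ...";
          // a string starting with "OpenCL 1.2" here would confirm the rumour.
          printf("%s : %s\n", name, version);
      }
      return 0;
  }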


As a guess, this paper is appearing here not only because it's cool, but also because the "Functional Programming Principles in Scala" course (by Martin Odersky) has just restarted on Coursera.

He mentioned GPU-related DSLs in his OSCON Java 2011 keynote (see http://www.youtube.com/watch?v=3jg1AheF4n0 at ~14m20s), which was one of the listed 'Learning Resources'. However, the Stanford group he was involved with was doing 'Liszt' and this is 'Singe' (and his name isn't on the paper), so I'm wondering if there isn't some kind of internal race going on...


It appears that these algorithms aren't a good candidate for a GPU. They require complicated producer-consumer arrangements, and only run on the GPU in a reduced mode, which calls the scientific merit into question. One wonders why the authors couldn't have arbitrarily expanded or reduced their data set, or perhaps done their algorithm in two passes.



