

Singe: Leveraging Warp Specialization for High Performance on GPUs [pdf] - eslaught
http://theory.stanford.edu/~aiken/publications/papers/ppopp14.pdf

======
sharpneli
My first reaction: Some Nvidia cards can synchronize between warps? Nice!

I've been living in the OpenCL world, which is pretty much everyone except
Nvidia (because Nvidia intentionally ignores OpenCL and cripples their support
for it), so I have unfortunately missed this development.

On the other hand, the particular use case in the article was to circumvent
other limitations of the architecture, such as the relatively small register
file and the shared instruction counter across a warp of 32 threads. Clever
and interesting nevertheless.
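For the curious, the primitive that makes this possible is PTX's named barriers (`bar.sync id, count`), which let specific warps in a block rendezvous with each other. A minimal, untested sketch of producer/consumer warp specialization built on it (kernel name, sizes, and the doubling computation are all illustrative, not from the paper):

```cuda
// Hedged sketch: one producer warp stages data into shared memory, one
// consumer warp computes on it, synchronized via a PTX named barrier.
// Assumes the kernel is launched with a block of exactly 64 threads.
__global__ void warp_specialized(const float *in, float *out) {
    __shared__ float buf[32];
    int warp = threadIdx.x / 32;   // warp 0 produces, warp 1 consumes
    int lane = threadIdx.x % 32;

    if (warp == 0) {
        // Producer warp: stage a 32-element tile into shared memory.
        buf[lane] = in[lane];
        // Named barrier 1, spanning 64 threads (both warps).
        asm volatile("bar.sync 1, 64;");
    } else {
        // Consumer warp: wait until the tile is staged, then compute.
        asm volatile("bar.sync 1, 64;");
        out[lane] = buf[lane] * 2.0f;
    }
}
```

With only two warps a plain `__syncthreads()` would do the same thing; the value of named barriers is that `count` can cover a subset of the block's warps, so producers and consumers can rendezvous pairwise while other warps keep running, which is what the paper's warp-specialized pipelines rely on.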

~~~
oneofthose
There are rumours that Nvidia will sooner rather than later support OpenCL 1.2
[0]; apparently CUDA 6 contains a stub library that has OpenCL 1.2 symbols
and more.

[0]
[http://www.phoronix.com/scan.php?page=news_item&px=MTY2OTg](http://www.phoronix.com/scan.php?page=news_item&px=MTY2OTg)

------
mdda
As a guess, this paper is appearing here not only because it's cool, but also
because the "Functional Programming Principles in Scala" course (by Martin
Odersky) has just restarted on Coursera.

He mentioned GPU-related DSLs in his OSCON Java 2011 keynote (see :
[http://www.youtube.com/watch?v=3jg1AheF4n0](http://www.youtube.com/watch?v=3jg1AheF4n0)
at ~14m20s), which was one of the listed 'Learning Resources'. However, the
Stanford group he was involved with was doing 'Liszt', while this is 'Singe'
(and his name isn't on the paper) - so I'm wondering if there isn't some kind
of internal race going on...

------
frozenport
It appears that these algorithms aren't good candidates for a GPU. They
require complicated producer-consumer structures, and only run on the GPU in a
reduced mode, which calls the scientific merit into question. One wonders why
the authors couldn't have expanded or reduced their data set, or perhaps run
their algorithm in two passes.

