
I've done some work with CUDA, so I'll tell you ahead of time: it will probably be decently fast, but nowhere near the kind of speedup you'd get from something like matrix multiplication. The problem is going to be that you need to fit the data you're hashing into "shared memory" (a small software-managed scratchpad on each multiprocessor, closer to an L1 cache than an L2). The standard shared memory size in CUDA is 16 KB, and because it's allocated per block, you're not going to be able to get many string permutations into it at once.
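
If you'd rather not trust my memory on that figure, the CUDA runtime will tell you what your card actually has; a minimal sketch, assuming device 0:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        // Query device 0's properties instead of hard-coding the 16 KB figure.
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("%s: %zu bytes shared mem per block, %d multiprocessors\n",
               prop.name, prop.sharedMemPerBlock, prop.multiProcessorCount);
        return 0;
    }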

However, if you're going to use CUDA (which I was thinking about doing as well), I think a good approach might be to load one candidate word from the dictionary into shared memory, and then have each thread compute the hash of one capitalization permutation of that string. This way you're mostly working out of the "cache", and that's about the best thing you can do on CUDA.
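
Something like this is what I mean; it's only a sketch, with a plain FNV-1a hash standing in for whatever digest you're actually attacking, and MAX_WORD_LEN is an assumed cap on word length:

    #include <stdint.h>

    #define MAX_WORD_LEN 16

    // Stand-in hash (FNV-1a); swap in the real target digest.
    __device__ uint32_t fnv1a(const char *s, int len) {
        uint32_t h = 2166136261u;
        for (int i = 0; i < len; ++i) {
            h ^= (uint8_t)s[i];
            h *= 16777619u;
        }
        return h;
    }

    // One dictionary word is staged in shared memory per block; each
    // thread's global index doubles as a bitmask choosing which letters
    // to uppercase, covering all 2^len capitalization variants.
    __global__ void hash_caps(const char *word, int len, uint32_t *out) {
        __shared__ char base[MAX_WORD_LEN];
        if (threadIdx.x < len)          // assumes blockDim.x >= len
            base[threadIdx.x] = word[threadIdx.x];
        __syncthreads();

        unsigned int mask = blockIdx.x * blockDim.x + threadIdx.x;
        if (mask >= (1u << len)) return;

        char variant[MAX_WORD_LEN];
        for (int i = 0; i < len; ++i) {
            char c = base[i];
            if ((mask & (1u << i)) && c >= 'a' && c <= 'z')
                c -= 'a' - 'A';         // uppercase this position
            variant[i] = c;
        }
        out[mask] = fnv1a(variant, len);
    }

Launched with something like hash_caps<<<((1 << len) + 255) / 256, 256>>>(d_word, len, d_hashes), every variant of an 8-letter word (256 of them) fits in a single block.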



I thought the 16 KB of shared memory was on a per-core basis? It's very possible I misunderstood, in which case: crap! Either way, the algorithm I have in mind looks pretty matrixy; if you want, I can keep you updated as I test things out.


It's 16 KB per multiprocessor, shared by whatever blocks are resident there; I believe each multiprocessor has 8 scalar cores, and threads are scheduled onto it in 'blocks' of up to 512. So, for example, a kernel that declares 4 KB of shared memory per block can have at most four blocks resident on one multiprocessor at a time.

I actually started a thread in the CUDA general discussion forum, perhaps you could come by there and discuss it:

http://forums.nvidia.com/index.php?showtopic=102228



