
I've done some work with CUDA, so I'll tell you ahead of time: it will probably be decently fast, but nowhere near the kind of speedup you'd get from something like matrix multiplication. The problem is going to be that you need to fit the data you're hashing into "shared memory" (a small software-managed scratchpad on each multiprocessor, closer to an L1 cache than an L2). The standard shared memory size in CUDA is 16 KB, and because it's allocated per block, you're not going to be able to get many string permutations into it at once.
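
If you'd rather not trust my memory on that figure, the CUDA runtime will tell you what your card actually has; a minimal sketch, assuming device 0:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        // Query device 0's properties instead of hard-coding the 16 KB figure.
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("%s: %zu bytes shared mem per block, %d multiprocessors\n",
               prop.name, prop.sharedMemPerBlock, prop.multiProcessorCount);
        return 0;
    }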

However, if you're going to use CUDA (which I was thinking about doing as well), I think a good approach might be to load one candidate word from the dictionary into shared memory, and then have each thread compute the hash of one capitalization permutation of that string. This way you're mostly working out of the "cache", and that's about the best thing you can do on CUDA.
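
Something like this is what I mean; it's only a sketch, with a plain FNV-1a hash standing in for whatever digest you're actually attacking, and MAX_WORD_LEN is an assumed cap on word length:

    #include <stdint.h>

    #define MAX_WORD_LEN 16

    // Stand-in hash (FNV-1a); swap in the real target digest.
    __device__ uint32_t fnv1a(const char *s, int len) {
        uint32_t h = 2166136261u;
        for (int i = 0; i < len; ++i) {
            h ^= (uint8_t)s[i];
            h *= 16777619u;
        }
        return h;
    }

    // One dictionary word is staged in shared memory per block; each
    // thread's global index doubles as a bitmask choosing which letters
    // to uppercase, covering all 2^len capitalization variants.
    __global__ void hash_caps(const char *word, int len, uint32_t *out) {
        __shared__ char base[MAX_WORD_LEN];
        if (threadIdx.x < len)          // assumes blockDim.x >= len
            base[threadIdx.x] = word[threadIdx.x];
        __syncthreads();

        unsigned int mask = blockIdx.x * blockDim.x + threadIdx.x;
        if (mask >= (1u << len)) return;

        char variant[MAX_WORD_LEN];
        for (int i = 0; i < len; ++i) {
            char c = base[i];
            if ((mask & (1u << i)) && c >= 'a' && c <= 'z')
                c -= 'a' - 'A';         // uppercase this position
            variant[i] = c;
        }
        out[mask] = fnv1a(variant, len);
    }

Launched with something like hash_caps<<<((1 << len) + 255) / 256, 256>>>(d_word, len, d_hashes), every variant of an 8-letter word (256 of them) fits in a single block.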



I thought the 16 KB of shared memory was on a per-core basis? It's very possible I misunderstood, in which case: crap! Either way, the algorithm I have in mind looks pretty matrixy; if you want, I can keep you updated as I test things out.


It's 16 KB per multiprocessor, shared by whatever blocks are resident there; I believe each multiprocessor has 8 scalar cores, and threads are scheduled onto it in 'blocks' of up to 512. So, for example, a kernel that declares 4 KB of shared memory per block can have at most four blocks resident on one multiprocessor at a time.

I actually started a thread in the CUDA general discussion forum, perhaps you could come by there and discuss it:

http://forums.nvidia.com/index.php?showtopic=102228



