
CUDA, Supercomputing for the Masses - prakash
http://www.ddj.com/article/printableArticle.jhtml;jsessionid=GKN50BK2JE4J4QSNDLOSKHSCJUNN2JVN?articleID=207200659&dept_url=/hpc-high-performance-computing/
======
Readmore
"Are you interested in getting orders-of-magnitude performance increases over
standard multi-core processors, while programming with a high-level language
such as C?"

Ha! That's funny :)

~~~
wmf
ATI was promoting assembly language GPU programming for a while, so I guess
CUDA could be considered high-level in comparison. But it's still funny.

~~~
Readmore
Yeah it made me chuckle. This article is really interesting though. I'm really
excited about the mentioned Python article.

------
tricky
is anyone on here hacking on cuda right now? I have an idea i want to test out
but i don't know anyone who has a clue about working in a massively parallel
environment... would love to hear from a hacker who does.

~~~
apathy
_is anyone on here hacking on cuda right now?_

I am not working on cuda (though I intend to now that I see what it can do),
but have experience with LAM, MPI, etc.

 _i don't know anyone who has a clue about working in a massively parallel
environment..._

The basic point of any parallel programming is to decompose a problem so that
it has no crippling dependencies during execution, parcel the tasks out
amongst the nodes/cores/pipelines/whatever, and have them report their
results as they complete. There is much more to it in a practical situation,
but a GPU is very nearly an idealized situation, so this may be all you need
to know.

Matrix computations are a classic example -- multiplying an i x k matrix by a
k x j matrix yields i x j output entries, each an independent dot product
with no dependence on the others. Lather, rinse, repeat.

Parallel programming is superficially simple. Your goal is to find serializing
bottlenecks and kill them one at a time, from the tightest loop outwards. What
makes it complicated, of course, is the fact that not all programming is a
series of matrix multiplications, nor are all serializing bottlenecks trivial
to decompose.

The reason GPUs are an obvious target for friendly parallelizing toolkits is
that 3D graphics _are_ inherently representable as series of 4x4 matrices
(superficially you'd expect 6x6 per point/observer, but there's a cute
trick -- homogeneous coordinates -- that reduces the dimensionality to
4x4... take a 3D graphics course if you are interested in this sort of
thing)

Whether this should interest you depends intimately upon your problem domain.
If you have a lot of processes talking to each other in sequence, maybe not.
If there are a lot of tasks that don't really depend on each other that are
slowing your program down, maybe so. (Sometimes even MCMC and MLEM simulations
can be artfully decomposed so that they turn from the former into the
latter... that's the fun part)

sorry for the editing laziness, I need to get back to my simulations. hope
this helps a little

