
How GPU came to be used for general computation - urlwolf
http://igoro.com/archive/how-gpu-came-to-be-used-for-general-computation/
======
wazoox
Now I'd like to know more about what general uses there are for these beasts.
I see the point for simulating fluids etc., but apart from climate researchers
and aerospace engineers, who needs this sort of tool nowadays? Sincerely
wondering.

~~~
xenthral
I've used it for my Conway's Game of Life implementation, making it run
really fast with a naive brute-force approach. <http://vimeo.com/9516535>

So add toy programs to that list :)
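
For the curious, a brute-force step looks roughly like the sketch below. It's
a minimal illustration in CUDA rather than the OpenCL I actually used, with
made-up names and board size, not the real code: one thread per cell, each
thread counting its eight neighbours straight from global memory.

    #include <cstdio>
    #include <cstdlib>
    #include <utility>
    #include <vector>

    // Naive brute-force Game of Life step: one thread per cell, each thread
    // reads its eight neighbours directly from global memory.
    __global__ void life_step(const unsigned char* in, unsigned char* out,
                              int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;

        int alive = 0;
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx) {
                if (dx == 0 && dy == 0) continue;
                int nx = (x + dx + w) % w;   // toroidal wrap-around
                int ny = (y + dy + h) % h;
                alive += in[ny * w + nx];
            }

        unsigned char cell = in[y * w + x];
        out[y * w + x] = (alive == 3 || (cell && alive == 2)) ? 1 : 0;
    }

    int main()
    {
        const int w = 512, h = 512;
        std::vector<unsigned char> board(w * h);
        for (int i = 0; i < w * h; ++i) board[i] = rand() & 1;

        unsigned char *d_in, *d_out;
        cudaMalloc(&d_in,  w * h);
        cudaMalloc(&d_out, w * h);
        cudaMemcpy(d_in, board.data(), w * h, cudaMemcpyHostToDevice);

        dim3 block(16, 16);
        dim3 grid((w + 15) / 16, (h + 15) / 16);
        for (int step = 0; step < 100; ++step) {
            life_step<<<grid, block>>>(d_in, d_out, w, h);
            std::swap(d_in, d_out);          // ping-pong the two boards
        }
        cudaMemcpy(board.data(), d_in, w * h, cudaMemcpyDeviceToHost);
        printf("cell (0,0) after 100 steps: %d\n", board[0]);

        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }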

~~~
anigbrowl
Very nice! I would love to mess with that if/when you're ready to release
source.

~~~
xenthral
Done, MIT license, knock yourself out :)
<http://github.com/xenthral/Conway-OpenCL> You need OpenCL-enabled drivers,
the SDL library and OpenGL.

I suppose this is my first open source project. Hurray for me (:

------
Aron
This is a good introduction. I'm interested in finding articles that
speculate about just how aligned the design requirements are between graphics
and most matrix-based scientific and data-mining computation.

For instance, Nvidia has introduced double-precision support and an L1 cache,
which have marginal value in traditional graphics. This is going to hurt
their profitability on the Fermi chip compared to the simpler ATI
alternatives.

I am gonna enjoy watching how all this plays out.

~~~
liuliu
I was puzzled about how this can impact data mining or machine learning as a
whole, too. The difference between data mining algorithms and image
processing/string matching algorithms is huge: you need more data to get
meaningful intermediate results in one compute kernel. For example, a typical
scenario in my research is to compare the performance of many proposed
features (tens of thousands) against a large volume of data and pick the best
one. It is an embarrassingly parallel problem, but the data throughput is
huge. On supercomputers it is easy, since every node can keep a local copy of
either the feature set or the data set. But on a GPGPU there is no way for
each core to have a local copy of either set, so to compare them it has to go
back and forth to shared memory, and the limited bandwidth may hurt
performance badly.

Disclaimer: I am not very experienced in the GPGPU field, so my worry may
well prove wrong.

~~~
andrewcooke
it's horribly hard to accurately predict what will work and what won't, and
they do have some caching ability (the same hardware that would cache a
texture map when rendering an image).

but what you are perhaps missing is that it's ok for gpus to read memory, as
long as you have enough threads. they can switch context _very_ quickly, so
one set of threads can request memory (hopefully a contiguous chunk) and then
drop into the background and let another set of threads do some work (on the
same processing unit). this is critical to their efficiency and is very
different to a cpu, which instead relies on cache and "sits doing nothing" if
it needs to read data from "afar" (obviously there are trade-offs - there's
only so much local memory for state, for example).
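
to make that concrete, here's the kind of kernel i mean - a toy sketch in
cuda that i'm making up for illustration, nothing from the article. adjacent
threads read adjacent addresses, so each warp's loads coalesce into a few
wide transactions, and because you launch far more threads than there are
cores, the scheduler always has another warp ready to run while earlier ones
wait on memory.

    // Grid-stride loop: thread i handles elements i, i + stride, i + 2*stride...
    // Adjacent threads touch adjacent addresses, so a warp's 32 loads coalesce
    // into a few wide memory transactions, and oversubscribing the cores with
    // threads lets the scheduler hide the latency of each read.
    __global__ void scaled_sum(const float* x, float* y, float a, int n)
    {
        int stride = blockDim.x * gridDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            y[i] = a * x[i] + y[i];
    }

    // launched with something like:
    //   scaled_sum<<<256, 256>>>(d_x, d_y, 2.0f, n);   // 65,536 threads in flight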

i worked on a problem that was not as "nice" as you might hope - the memory
access was unpredictable to some degree. but i still got a speed up of "tens"
on a cheap ($200) graphics card, compared to a meaty xeon. it's more robust
than you might expect.

~~~
liuliu
That's true, it is hard to predict. But I am really interested to see what
the most optimistic prediction is for my particular problem on a GPGPU. In
this case I don't think an LRU cache will help much, since the access pattern
is uniform (every piece of data has to be examined against every proposed
feature). However, you remind me that maybe a load-ahead style of caching
would help: if the needed data is loaded into the cache with some
synchronization to guarantee that every currently running kernel examines
that piece of data against its own feature, there may be a performance gain.
I'm actually going to spend this weekend trying it out.

~~~
andrewcooke
i don't really get what you're doing, but have you considered making one
dimension of your work vary over features? if you arrange that correctly then
you only need to scan the memory once (all features read the first byte of
memory; then all features read the next...)
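
something like this, roughly - a cuda sketch i'm improvising, where the
per-feature test is just a placeholder threshold and the names are made up.
it also folds in the load-ahead idea from your comment: one thread per
feature, the block stages a tile of data into shared memory, synchronises,
and then every thread scores its own feature against the same tile, so the
data set is read from global memory only once per pass.

    #define TILE 256   // data elements staged per iteration; launch 256 threads/block

    // Stand-in for the real per-feature test: here a "feature" is just a
    // threshold, and its score counts how many samples exceed it.
    __device__ float score_feature(float threshold, float sample)
    {
        return sample > threshold ? 1.0f : 0.0f;
    }

    // One thread per candidate feature. The whole block walks the data set in
    // lockstep: stage a tile into shared memory, synchronise, then every
    // thread scores its own feature against that tile. The data is read from
    // global memory once per pass instead of once per feature.
    __global__ void evaluate_features(const float* data, int n_samples,
                                      const float* features, float* scores,
                                      int n_features)
    {
        __shared__ float tile[TILE];
        int f = blockIdx.x * blockDim.x + threadIdx.x;
        float my_feature = (f < n_features) ? features[f] : 0.0f;
        float acc = 0.0f;

        for (int base = 0; base < n_samples; base += TILE) {
            int i = base + threadIdx.x;                       // cooperative load
            tile[threadIdx.x] = (i < n_samples) ? data[i] : 0.0f;
            __syncthreads();

            int limit = min(TILE, n_samples - base);
            for (int j = 0; j < limit; ++j)
                acc += score_feature(my_feature, tile[j]);    // shared-mem broadcast
            __syncthreads();
        }

        if (f < n_features)
            scores[f] = acc;
    }

    // launched with e.g.
    //   evaluate_features<<<(n_features + 255) / 256, 256>>>(d_data, n_samples,
    //                                                        d_features, d_scores,
    //                                                        n_features);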

