Ask HN: Best place to learn GPU programming? - hubatrix
======
rdslw
My advice: 1. Buy a CUDA-capable GTX card from Nvidia (second-hand off eBay would do)

2. Install PyCUDA:
[https://mathema.tician.de/software/pycuda/](https://mathema.tician.de/software/pycuda/)

3. Dive into the excellent documentation and tutorials:
[https://documen.tician.de/pycuda/](https://documen.tician.de/pycuda/)

Literally within 2 hours, I had my first code running on my GTX on standard Ubuntu. It being Python (plus the obvious CUDA code in C) made it much easier to grasp.
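
For a flavor of it, here is a minimal sketch in the spirit of the PyCUDA tutorial's introductory example (written from memory, not copied from the docs; see the links above for the canonical version):

    import numpy as np
    import pycuda.autoinit  # sets up a CUDA context on the first GPU
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    # The kernel itself is plain CUDA C, compiled at runtime.
    mod = SourceModule("""
    __global__ void double_them(float *a)
    {
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        a[idx] *= 2.0f;
    }
    """)
    double_them = mod.get_function("double_them")

    a = np.random.randn(400).astype(np.float32)
    out = a.copy()
    # drv.InOut copies the array to the GPU before launch and back after.
    double_them(drv.InOut(out), block=(400, 1, 1), grid=(1, 1))
    assert np.allclose(out, 2 * a)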

~~~
hubatrix
I already have an NVIDIA GM107 [GeForce GTX 750 Ti] (rev a2) to work on. I just have no clue about CUDA, so I thought reading about GPU architecture and learning the basics from a tutorial would be the right way to start!

------
greydius
Free udacity course: [https://www.udacity.com/course/intro-to-parallel-programming...](https://www.udacity.com/course/intro-to-parallel-programming--cs344)

~~~
hubatrix
I looked into it before, but thought it's too slow and more aimed at people with no prior computing background. Sorry, that's purely my own impression, so I asked for a more advanced tutorial here.

~~~
yread
You can just skip the first lessons; the chapter "Squaring Numbers Using CUDA Part 1" has some code. Then the third lesson covers some more advanced techniques (actually they are really basic, but you might not have heard about them if you're not into parallel programming or networking hardware). I recommend it, btw.

~~~
hubatrix
Thanks, I sure will give it a try!

------
vhffm
There's a nice CUDA tutorial on Dr. Dobb's. It may be slightly outdated by now, but it was a great help in getting me started.

[http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-...](http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/207200659)

~~~
hubatrix
Thank you!

------
cpcat
I'm afraid there isn't a single best place. What's important is first to learn how GPU architecture differs from a traditional CPU's before you start learning CUDA or linear algebra. Take a look at these sources for absolute beginners.

[https://thebookofshaders.com/](https://thebookofshaders.com/)

[http://duriansoftware.com/joe/An-intro-to-modern-OpenGL.-Tab...](http://duriansoftware.com/joe/An-intro-to-modern-OpenGL.-Table-of-Contents.html)

[http://learnopengl.com/](http://learnopengl.com/)

Edit: also, when you are ready to pick a language and start writing code, I advise using the universal OpenGL over CUDA.

~~~
valine
Would you recommend OpenGL? I was under the impression that OpenCL was the equivalent of CUDA and that OpenGL was specifically for 2D/3D rendering.

~~~
saltcured
I imagine it was a typo, but... I personally have found PyOpenCL and the
vispy.gloo OpenGL APIs to be very convenient together (along with Numpy and
Scipy of course).

If it fits your goals, a useful mental model is that you are programming Numpy
ndarray data, and all these tools just fit together to let you manage the data
elements, apply transforms, and view them. You should be thinking about the
math that solves your domain problem, using alternative mathematical
formulations to optimize your algorithms, and only then worrying about using a
tuned parallel implementation of that math.
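
To make that concrete, here is a minimal PyOpenCL sketch of that "program the ndarray data" model (my own toy example, deliberately trivial):

    import numpy as np
    import pyopencl as cl

    # Pick whatever OpenCL device is available (a GPU, or a CPU driver).
    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    x = np.random.rand(1000000).astype(np.float32)

    prg = cl.Program(ctx, """
    __kernel void square(__global const float *src, __global float *dst)
    {
        int gid = get_global_id(0);
        dst[gid] = src[gid] * src[gid];
    }
    """).build()

    mf = cl.mem_flags
    src_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
    dst_buf = cl.Buffer(ctx, mf.WRITE_ONLY, x.nbytes)

    # One work-item per element; let the driver pick the work-group size.
    prg.square(queue, x.shape, None, src_buf, dst_buf)

    result = np.empty_like(x)
    cl.enqueue_copy(queue, result, dst_buf)
    assert np.allclose(result, x * x)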

I prototype new ideas and visualize results and intermediate data very easily,
with easy transitions between using numpy/scipy routines, custom OpenCL
kernels, and OpenGL shaders. I've even found myself using OpenGL to do the
visual equivalent of "printf debugging" to look at intermediate array results
when developing new array operations and wondering what I've done wrong or
misunderstood about the problem. It's very instructive to create little sub-
problems you can run as independent scripts during dev/test/microbenchmarking
cycles. You should iterate on many small experiments, not assume you can
design a high-performance solution in one top-down adventure.

For high performance processing, you eventually need to understand
architecture limitations and the impact of different problem decompositions.
Non-trivial, multi-dimensional problems need to be decomposed into smaller
blocks to get better cache locality for the vectorized/parallel code that will
process each block. Otherwise, you won't enjoy the benefits of parallel
hardware as all the compute units are stalled waiting for data fetches. I find
myself doing more and more meta-programming in Python, where I reason about
array shapes and sizes, compute slicing geometries, and then loop to extract
and dispatch chunks of data to OpenCL jobs which in turn use a dense form of
SIMD or SPMD execution on that block. Python is great for this kind of
marshaling work.
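
The slicing side of that is plain Python/Numpy; a toy sketch (block_slices is a hypothetical helper of mine, and the block size is arbitrary):

    import numpy as np

    def block_slices(shape, block=(1024, 1024)):
        """Yield slice pairs that tile a 2-D array into blocks."""
        for r in range(0, shape[0], block[0]):
            for c in range(0, shape[1], block[1]):
                yield (slice(r, min(r + block[0], shape[0])),
                       slice(c, min(c + block[1], shape[1])))

    data = np.random.rand(4096, 4096).astype(np.float32)
    for sl in block_slices(data.shape):
        chunk = np.ascontiguousarray(data[sl])
        # ...copy `chunk` into a device buffer and enqueue an OpenCL job here...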

Also, PyOpenCL makes it trivial to switch between backend drivers. The Intel
OpenCL driver is quite good on recent laptop and server CPUs. You might be
surprised how much performance you can get out of recent i5/i7 or Xeon E3/E5
CPUs when you use all cores and all SIMD units effectively. Plus, the CPU
backend is able to use all system RAM and has a more uniform cache behavior,
which makes it more forgiving of badly decomposed problems than a GPU is.
This can be a big help for prototyping as well as for limited-run problems
where you don't have time to invest in fully tuning your data structures for
GPU requirements.

~~~
joe_the_user
If you read the OP, their reference to OpenGL isn't a typo; they are linking to books and articles on OpenGL and recommending learning it before CUDA.

------
greggman
What does "GPU programming" mean?

Does it mean you've never done anything GPU related ever and you want to start
at "put a triangle on the screen" with OpenGL or WebGL?

Does it mean something like GPGPU: you want to use GPUs to do computation?

Does it mean something like Vulkan/DX12/Metal: you want to do low-level stuff?

Maybe someone can give answers to all of those questions, but they're arguably different things that all fit under "GPU programming".

~~~
hubatrix
Hello, sorry for my vague question. I want to learn CUDA so I can increase the performance of an existing framework, with pure performance and throughput time in consideration. And no, I don't have any prior experience working with GPUs.

~~~
exDM69
> increase performance of an existing framework

Cautionary tale following...

I don't know the background on this, but your goal might not be realistic unless the rest of the framework is architected in a GPU-friendly way.

I was once hired for such a project (scientific applications in chemistry). They had a clear idea of what needed to be improved: their Python framework called into C code for some non-trivial matrix operations. This was at the top of the profile, so making it faster would have improved the overall performance.

However, each matrix operation was just tens to hundreds of kilobytes in size.
Doing them with the GPU was 100x faster, but the latency of getting the data
to the GPU and back eliminated any gains.

It's clear that doing these on the GPU would have improved overall performance
if the operations could be batched. But the framework was tens of thousands of
LOC of scientist-written Python code that was mission critical and could not
be rewritten in the time and budget we had.

GPU programming is high throughput at the expense of high latency. It is very difficult to retrofit into a CPU-only framework or application. You need to consider memory management and synchronization for the GPU from the ground up.

~~~
hubatrix
Thanks for the insight. Well, I am working on a research project to redesign a CPU architecture framework into a GPU one, and luckily I am working on real-time batch-processed data, so as of now I have some plans as to what should go on the GPU, and I want to try it out myself and see if there are ways to cut those bus transmission latencies between CPU and GPU. Not sure how exactly, but I just started learning CUDA now. One more thing: how will I know if I should use OpenCL or CUDA?

~~~
exDM69
Latency hiding is the key for high throughput GPU applications. Make sure the
GPU never waits for CPU and vice versa. Keep transfers to/from GPU flowing
constantly and a queue of jobs scheduled. Each job should preferably be
several megabytes in size.

Memory bandwidth is the bottleneck for most practical tasks. Bandwidths you should know: CPU 50 GB/s, GPU 300 GB/s, PCIe DMA 16 GB/s (v3.x) or 32 GB/s (v4.x).
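
A sketch of what that pipelining can look like with CUDA streams in PyCUDA (illustrative only: the kernel is a placeholder, and a real pipeline would consume each result before reusing its buffers):

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void scale(float *a, int n)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n) a[i] *= 2.0f;
    }
    """)
    scale = mod.get_function("scale")

    n = 1 << 20  # ~4 MB per job, in line with the advice above
    jobs = [np.random.rand(n).astype(np.float32) for _ in range(8)]
    streams = [drv.Stream() for _ in range(2)]
    # Page-locked host memory is required for truly asynchronous copies.
    host = [drv.pagelocked_empty(n, np.float32) for _ in range(2)]
    dev = [drv.mem_alloc(jobs[0].nbytes) for _ in range(2)]

    for i, job in enumerate(jobs):
        s, h, d = streams[i % 2], host[i % 2], dev[i % 2]
        s.synchronize()  # wait for the previous job using this buffer pair
        h[:] = job
        drv.memcpy_htod_async(d, h, stream=s)  # upload overlaps the other stream
        scale(d, np.int32(n), block=(256, 1, 1), grid=(n // 256, 1), stream=s)
        drv.memcpy_dtoh_async(h, d, stream=s)  # fetch the result asynchronously

    for s in streams:
        s.synchronize()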

~~~
hubatrix
Thanks again. When I reach the point of actually implementing it, I am sure I'll be asking you a question or two! But HN unfortunately does not have friend/follow functionality.

------
ksml
I found Wen-mei Hwu's "Heterogeneous Parallel Programming" course on Coursera
to be an excellent class:
[https://www.coursera.org/course/hetero](https://www.coursera.org/course/hetero)

~~~
hubatrix
Thank you. Does he also teach CUDA in it?

~~~
khrm
Yes. It's quite CUDA-focused.

~~~
hubatrix
The course seems to be unavailable now.

------
woodcut
Get a book on linear algebra, a book on C/C++, and download the OpenGL spec
and build a graphics pipeline that can render a textured polygon, then add
lights, shadows, animations... this will take you a month perhaps.

~~~
inDigiNeous
One month, sure :D Realistically, if you have no GPU or OpenGL experience, building something like that with lighting and shadows, and really understanding what is happening... In one month, if you already know C++/your language of choice well, you can maybe grasp the idea of how to draw a triangle with some lighting.

Shadows, animation, no way in a month. The whole OpenGL stack is pretty
complex.

My suggestion would be the same though, but maybe reserve one year to really
just grind the basics until you have a clear understanding of what is
happening.

On the other hand, if you just want CUDA, there are frameworks for that for
enabling parallel calculations without having to understand graphics
programming.

~~~
hubatrix
Yes, I am simply concentrating on parallel calculations and I don't need any graphics to work on.

------
jonbaer
[https://developer.nvidia.com/educators/existing-courses](https://developer.nvidia.com/educators/existing-courses)

~~~
hubatrix
Thank you!

------
fitzwatermellow
If you want to dive in deep right away, an interesting summer project would be to create hybrid client/server GPU clusters using WebGL and the browser!

Install node-opencl and CL.js, and check out how Graphistry is using scalable distributed GPUs for visualization of massive data sets ;)

CL.js:

[https://github.com/graphistry/cljs](https://github.com/graphistry/cljs)

Node-OpenCL

[https://github.com/mikeseven/node-opencl](https://github.com/mikeseven/node-opencl)

Graphistry

[https://www.graphistry.com/](https://www.graphistry.com/)

------
p0nce
My advice: learn OpenCL, but not OpenGL. The API is much simpler and very applicable. It maps onto CUDA quite well too.

~~~
hubatrix
I am looking into performance optimization in a data processing framework. Would OpenCL help, or should I just learn CUDA from the links others suggested?

~~~
exDM69
OpenGL is most likely not what you want.

For compute, your choices are CUDA, OpenCL, or compute shaders in Vulkan or
D3D. CUDA is most likely the best choice if a single vendor API is acceptable.

------
syedrezaali
This app allows you to code GLSL programs to create images, movies, etc.; it's a great way to get started:

[http://www.syedrezaali.com/store/fragment-osx-app](http://www.syedrezaali.com/store/fragment-osx-app)

Here are a bunch of sketches (GPU Programs):

[https://github.com/rezaali/FragmentSketches](https://github.com/rezaali/FragmentSketches)

------
hubatrix
One more thing: how will I know if I should use OpenCL or CUDA?

~~~
corysama
Are you more interested in open or practical? OpenCL is open. That's important
for a lot of people. But, on the practical side, Nvidia has invested more in
improving CUDA than all actors combined have put into OpenCL. It's reached the
point that people are implementing open-source CUDA compilers to bring CUDA to
non-Nvidia platforms.

------
beagle3
I last did GPU programming around the transition from just-shaders to gpgpu...
And I want to come back. However, I don't want to buy new hardware just yet.
In addition to all the study materials posted, can anyone recommend a good
hosted GPU setup for experiments? E.g., what kind of EC2/GCE instance would be a cost-effective way to [re]learn GPU programming?

~~~
gtani
People commonly spend US$80+/month training deep learning models on EC2, when you can now buy a brand-new 980 Ti for $440 from Newegg, and the very capable 970 for $280. If you don't have a motherboard/PSU that supports 2 video cards at 16 PCIe 3.0 lanes each, it's a worthy investment.

There are wrinkles: if you're doing double-precision arithmetic, consumer cards from Nvidia since the Titan Black are 12:1 fp32:fp64 or higher, i.e. double precision is relatively slow. Radeons/OpenCL might be better for that purpose.

~~~
beagle3
Thanks. For sure, if I do anything serious I will buy my own hardware. But I'd like to play a little to get a feel; I'm sure $20 over one month will pay dividends in understanding how/what/whether comes next. I'm just asking for advice about those $20 (or $40 or $80 or however much they end up being).

~~~
gtani
Well, there's not much price pressure from Azure; you could try the others in Nvidia's list:
[https://www.reddit.com/r/MachineLearning/comments/46ict5/bes...](https://www.reddit.com/r/MachineLearning/comments/46ict5/best_cloud_server_hosts_for_machine_learning/)

Also watch spot instance prices; go on weekends: [https://ec2price.com/?product=Linux/UNIX&type=g2.2xlarge&reg...](https://ec2price.com/?product=Linux/UNIX&type=g2.2xlarge&region=us-east-1&window=10)

------
gravypod
What kind of operation and system is this meant to be used in? Are you doing
some kind of DSP?

If you are, and you need extreme performance, you may want to consider using an FPGA and doing it all "in hardware".

------
hacker42
This course looks pretty good:
[https://news.ycombinator.com/item?id=11902172](https://news.ycombinator.com/item?id=11902172)

~~~
hubatrix
Yes, thanks for the link, but there seem to be no video lectures.

------
estefan
Nvidia?

