Ask HN: Best place to learn GPU programming?
155 points by hubatrix on June 29, 2016 | 54 comments



My advice: 1. buy a CUDA-capable GTX card from NVIDIA (second-hand off eBay or similar would do)

2. install pycuda https://mathema.tician.de/software/pycuda/

3. dive into the excellent documentation and tutorials: https://documen.tician.de/pycuda/

Literally within 2 hours I had my first code running on my GTX on standard Ubuntu. It being Python (plus the obvious CUDA code in C) made it much easier to grasp.
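
To give a flavour of how little code step 3 involves, here is a minimal PyCUDA sketch along the lines of the tutorial's first example (the kernel name double_them is just illustrative):

    import numpy as np
    import pycuda.autoinit            # creates a context on the first GPU
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    # The CUDA C part: a trivial kernel that doubles every element.
    mod = SourceModule("""
    __global__ void double_them(float *a)
    {
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        a[idx] *= 2.0f;
    }
    """)
    double_them = mod.get_function("double_them")

    a = np.random.randn(256).astype(np.float32)
    double_them(drv.InOut(a), block=(256, 1, 1), grid=(1, 1))
    print(a)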


I already have an NVIDIA Corporation GM107 [GeForce GTX 750 Ti] (rev a2) to work on; I just have no clue about CUDA, so I thought reading about GPU architecture and learning the basics from a tutorial is the right way to start!!



I looked into it before, but thought it's too slow and more aimed at people with no prior idea about computers, sorry, these are purely my feelings alone. So I asked for a more advanced tutorial here!?


You can just skip the first lessons; the chapter "Squaring Numbers Using CUDA Part 1" has some code. Then the third lesson covers some more advanced techniques (actually they are really basic, but you might not have heard about them if you're not into parallel programming or networking hardware). I recommend it, btw.
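
If it helps to see what that squaring exercise boils down to outside the course environment, here is a rough PyCUDA equivalent using gpuarray (my own adaptation, not the course's code):

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray

    a = np.random.randn(1024).astype(np.float32)
    a_gpu = gpuarray.to_gpu(a)        # copy to the device
    squared = (a_gpu * a_gpu).get()   # element-wise square on the GPU, copy back
    assert np.allclose(squared, a * a)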


Thanks, sure will give it a try !!


The beginning is slow but then it picks up pace. I liked it, and they present the topic in a way that is easily understandable.


There's a nice CUDA tutorial on Dr. Dobb's. It may be slightly outdated by now, but it was a great help to get me started.

http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-...


thank you !!


I'm afraid there isn't a single best place. What's important is to first learn how GPU architecture differs from a traditional CPU before you start learning CUDA or linear algebra. Take a look at these sources for absolute beginners.

https://thebookofshaders.com/

http://duriansoftware.com/joe/An-intro-to-modern-OpenGL.-Tab...

http://learnopengl.com/

edit: also, when you are ready to pick a language and start writing code, I advise using the universal OpenGL over CUDA.


Would you recommend OpenGL? I was under the impression that OpenCL was the equivalent of CUDA and OpenGL was specifically for 2D/3D rendering.


I imagine it was a typo, but... I personally have found PyOpenCL and the vispy.gloo OpenGL APIs to be very convenient together (along with Numpy and Scipy of course).

If it fits your goals, a useful mental model is that you are programming Numpy ndarray data, and all these tools just fit together to let you manage the data elements, apply transforms, and view them. You should be thinking about the math that solves your domain problem, using alternative mathematical formulations to optimize your algorithms, and only then worrying about using a tuned parallel implementation of that math.

I prototype new ideas and visualize results and intermediate data very easily, with smooth transitions between using numpy/scipy routines, custom OpenCL kernels, and OpenGL shaders. I've even found myself using OpenGL to do the visual equivalent of "printf debugging" to look at intermediate array results when developing new array operations and wondering what I've done wrong or misunderstood about the problem. It's very instructive to create little sub-problems you can run as independent scripts during dev/test/microbenchmarking cycles. You should iterate on many small experiments, not assume you can design a high-performance solution in one top-down adventure.

For high performance processing, you eventually need to understand architecture limitations and the impact of different problem decompositions. Non-trivial, multi-dimensional problems need to be decomposed into smaller blocks to get better cache locality for the vectorized/parallel code that will process each block. Otherwise, you won't enjoy the benefits of parallel hardware as all the compute units are stalled waiting for data fetches. I find myself doing more and more meta-programming in Python, where I reason about array shapes and sizes, compute slicing geometries, and then loop to extract and dispatch chunks of data to OpenCL jobs which in turn use a dense form of SIMD or SPMD execution on that block. Python is great for this kind of marshaling work.
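
As a rough illustration of that kind of marshaling (the block_slices helper and the CPU stand-in for the OpenCL job are mine, purely for illustration):

    import numpy as np
    from itertools import product

    def block_slices(shape, block):
        """Yield tuples of slices that tile an N-D shape into block-sized chunks."""
        ranges = [range(0, s, b) for s, b in zip(shape, block)]
        for starts in product(*ranges):
            yield tuple(slice(st, min(st + b, s))
                        for st, b, s in zip(starts, block, shape))

    data = np.random.rand(4096, 4096).astype(np.float32)
    out = np.empty_like(data)

    for sl in block_slices(data.shape, (512, 512)):
        chunk = np.ascontiguousarray(data[sl])
        # Stand-in for enqueueing an OpenCL kernel on this chunk;
        # here we just square it on the CPU.
        out[sl] = chunk * chunk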

Also, PyOpenCL makes it trivial to switch between backend drivers. The Intel OpenCL driver is quite good on recent laptop and server CPUs. You might be surprised how much performance you can get out of recent i5/i7 or Xeon E3/E5 CPUs when you use all cores and all SIMD units effectively. Plus, the CPU backend is able to use all system RAM and has more uniform cache behavior, which makes it more forgiving of badly decomposed problems than a GPU. This can be a big help for prototyping, as well as for limited-run problems where you don't have time to invest in fully tuning your data structures for GPU requirements.
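
For example, preferring a CPU device and running a trivial kernel on it looks roughly like this (a sketch, assuming some OpenCL driver is installed; the kernel name square is just illustrative):

    import numpy as np
    import pyopencl as cl

    # Prefer a CPU device to exercise the CPU backend; fall back to whatever exists.
    devices = [d for p in cl.get_platforms() for d in p.get_devices()]
    cpu = [d for d in devices if d.type & cl.device_type.CPU]
    ctx = cl.Context([cpu[0] if cpu else devices[0]])
    queue = cl.CommandQueue(ctx)

    prg = cl.Program(ctx, """
    __kernel void square(__global const float *a, __global float *out)
    {
        int gid = get_global_id(0);
        out[gid] = a[gid] * a[gid];
    }
    """).build()

    a = np.random.rand(1 << 16).astype(np.float32)
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    prg.square(queue, a.shape, None, a_buf, out_buf)
    out = np.empty_like(a)
    cl.enqueue_copy(queue, out, out_buf)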


If you read the OP, their reference to OpenGL isn't a typo; they are linking to books and articles on OpenGL and recommending learning this before CUDA.


Yes my thoughts are the same, so I decided to learn CUDA and port to OpenCL if necessary


I'm just starting to learn GPGPU programming myself, but I'm replying because your advice seems wildly off-base. Your links are full books on graphics-processing algorithms, not the introduction to architecture that must come before using CUDA or doing GPGPU programming effectively.

That you don't seem to know the difference between OpenGL and OpenCL hardly makes your claims more plausible.

Anyway, CUDA is a general-purpose framework, as is OpenCL. OpenGL is a 3D graphics framework and different from both. Looking at their structure, CUDA is architected to allow straightforward general-purpose parallel computing: one has to know the broad structure of a GPU, but one doesn't have to know all the principles of graphics programming. OpenCL is similar but more complex due to its efforts to take multiple processors into account. As far as I can tell, using CUDA is the simplest path to general-purpose GPU computing (the OP didn't do any favors by not really saying what kind of GPU computing he wanted to learn, he did say CUDA later, but your post seems confused regardless of what thing someone ultimately wants to learn).


what does "GPU programming" mean?

Does it mean you've never done anything GPU related ever and you want to start at "put a triangle on the screen" with OpenGL or WebGL?

Does it mean something like GPGPU, i.e. you want to use GPUs to do computation?

Does it mean something like Vulkan/DX12/Metal, i.e. you want to do low-level stuff?

Maybe someone can give answers to all of those questions, but they're arguably different things that all fit under "GPU programming".


Hello, sorry for my vague question. I want to learn CUDA so I can increase the performance of an existing framework, with pure performance and throughput time in consideration. And no, I don't have any prior experience working with GPUs.


> increase performance of an existing framework

Cautionary tale following...

I don't know the background on this, but your goal might not be realistic unless the rest of the framework is architected in a GPU-friendly way.

I was once hired for such a project (some scientific applications in chemistry). They had a clear idea of what needed to be improved: their Python framework called into C code for some non-trivial matrix operations. This was at the top of the profile, so making it faster would have improved the overall performance.

However, each matrix operation was just tens to hundreds of kilobytes in size. Doing them with the GPU was 100x faster, but the latency of getting the data to the GPU and back eliminated any gains.

It's clear that doing these on the GPU would have improved overall performance if the operations could be batched. But the framework was tens of thousands of LOC of scientist-written Python code that was mission critical and could not be rewritten in the time and budget we had.

GPU programming is high throughput at the expense of high latency. It is very difficult to retrofit into a CPU-only framework or application; you need to consider memory management and synchronization for the GPU from the ground up.
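
A quick way to see this effect for yourself is to time the copy and the compute separately, e.g. with PyCUDA events (a rough sketch; the exact numbers depend entirely on your hardware):

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray

    a = np.random.randn(128, 128).astype(np.float32)   # a "small" matrix, ~64 KB

    start, uploaded, computed = drv.Event(), drv.Event(), drv.Event()

    start.record()
    a_gpu = gpuarray.to_gpu(a)      # host -> device copy
    uploaded.record()
    b_gpu = a_gpu * a_gpu           # trivial element-wise work on the device
    computed.record()
    computed.synchronize()

    print("upload  ms:", uploaded.time_since(start))
    print("compute ms:", computed.time_since(uploaded))

    # For small inputs the transfer often dominates the kernel time.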


Thanks for the insight. Well, I am working on a research project to redesign a CPU architecture framework into a GPU one, and luckily I am working on real-time batch-processed data, so as of now I have some plans as to what should go on the GPU, and I want to try it out myself and see if there are ways to cut those bus transmission latencies between the CPU and GPU. Not sure how exactly, but I just started learning CUDA now. One more thing: how will I know if I should use OpenCL or CUDA?


Latency hiding is the key for high-throughput GPU applications. Make sure the GPU never waits for the CPU and vice versa. Keep transfers to/from the GPU flowing constantly and a queue of jobs scheduled. Each job should preferably be several megabytes in size.

Memory bandwidth is the bottleneck for most practical tasks. Rough bandwidths you should know: CPU memory ~50 GB/s, GPU memory ~300 GB/s, PCIe x16 DMA ~16 GB/s (gen 3) or ~32 GB/s (gen 4).
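
In PyCUDA terms, the building blocks for that are page-locked host buffers, async copies and streams. A single-stream sketch follows (my own example; in practice you would rotate several streams and a queue of chunks so copies for one chunk overlap kernels for another):

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void scale(float *a, int n)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n) a[i] *= 2.0f;
    }
    """)
    scale = mod.get_function("scale")

    n = 1 << 22
    stream = drv.Stream()

    # Page-locked host memory is required for truly asynchronous copies.
    h_a = drv.pagelocked_empty(n, np.float32)
    h_a[:] = np.random.randn(n)
    d_a = drv.mem_alloc(h_a.nbytes)

    drv.memcpy_htod_async(d_a, h_a, stream)
    scale(d_a, np.int32(n),
          block=(256, 1, 1), grid=((n + 255) // 256, 1), stream=stream)
    drv.memcpy_dtoh_async(h_a, d_a, stream)
    stream.synchronize()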


Thanks again. When I reach the point of actually implementing it, I am sure I'll be asking you a question or two!! But HN unfortunately does not have friend/follow functionality.


Yes, agreed completely. I had a similar experience in a streaming context, and wrote a paper about it: http://www.scott-a-s.com/files/debs2010.pdf


The following blog series may provide additional insight into the process of applying CUDA to existing code:

http://blog.marcgravell.com/2016/05/cudagetting-started-in-n...


thanks for the resource !!


I found Wen Mei Hwu's "Heterogeneous Parallel Programming" course on Coursera to be an excellent class: https://www.coursera.org/course/hetero


thank you, does he also teach CUDA in it ?


Yes. It's quite CUDA-focused.


The course doesn't seem to be available anymore.


Get a book on linear algebra, a book on C/C++, and download the OpenGL spec and build a graphics pipeline that can render a textured polygon, then add lights, shadows, animations... this will take you a month perhaps.


One month, sure :D Realistically, if you have no GPU or OpenGL experience, building something like that, with lighting, shadows... and really understanding what is happening? In one month, if you already know C++/your language of choice well, you can maybe grasp how to draw a triangle with some lighting.

Shadows and animation, no way in a month. The whole OpenGL stack is pretty complex.

My suggestion would be the same though, but maybe reserve one year to really just grind the basics until you have a clear understanding of what is happening.

On the other hand, if you just want CUDA, there are frameworks for that for enabling parallel calculations without having to understand graphics programming.


Yes, I am simply concentrating on parallel calculations and I don't need any graphics to work on.


I am working on a data processing framework project, so I don't think I have a month's time, so as of now I started reading http://analog.nik.uni-obuda.hu/ParhuzamosProgramozasuHardver...


Ah, OK, then you're better off learning OpenCL. If you don't have access to a beefy graphics card, just rent one by the hour off of Amazon. You can prototype on your laptop's CPU, then run the real simulation on a remote machine.


That's an issue: I already have an NVIDIA Corporation GM107 [GeForce GTX 750 Ti] (rev a2) to work on. But why OpenCL and not CUDA?


edit: that's not an issue, I already have an NVIDIA Corporation GM107 [GeForce GTX 750 Ti] (rev a2) to work on. But why OpenCL and not CUDA?


OpenCL vs CUDA is a pretty boring debate; both run on the same hardware and so have similar performance. The difference is in the tooling and ecosystems: you can run OpenCL on FPGAs, for example.


Out of curiosity, has anyone successfully deployed some OpenCL code across very different platforms?

It seems neat that it'll compile and run on a GPU, CPU or FPGA, but it seems like code written for one style of architecture would be appallingly slow on the others.



Thank you !!


If you want to dive in deep right away, an interesting summer project would be to create hybrid client/server GPU clusters using WebGL and the browser!

Install node-opencl and CL.js and check out how Graphistry is using scalable distributed GPUs for visualization of massive data sets ;)

CL.js:

https://github.com/graphistry/cljs

Node-OpenCL:

https://github.com/mikeseven/node-opencl

Graphistry:

https://www.graphistry.com/


My advice: learn OpenCL, but not OpenGL. The API is much simpler and very applicable. It maps onto CUDA quite well too.


I am looking into performance optimization in a data processing framework. Would OpenGL help? Or should I just learn CUDA from the links others suggested!?


OpenGL is most likely not what you want.

For compute, your choices are CUDA, OpenCL, or compute shaders in Vulkan or D3D. CUDA is most likely the best choice if a single-vendor API is acceptable.


This app allows you to code GLSL programs to create images, movies, etc.; it's a great way to get started:

http://www.syedrezaali.com/store/fragment-osx-app

Here are a bunch of sketches (GPU Programs):

https://github.com/rezaali/FragmentSketches


One more thing: how will I know if I should use OpenCL or CUDA ?


Are you more interested in open or practical? OpenCL is open. That's important for a lot of people. But, on the practical side, Nvidia has invested more in improving CUDA than all actors combined have put into OpenCL. It's reached the point that people are implementing open-source CUDA compilers to bring CUDA to non-Nvidia platforms.


I last did GPU programming around the transition from just-shaders to gpgpu... And I want to come back. However, I don't want to buy new hardware just yet. In addition to all the study materials posted, can anyone recommend a good hosted GPU setup for experiments? e.g. what kind of EC2/GCE instance would be a cost effective way to [re]learn GPU programming?


People commonly spend US$80+/month training deep learning models on EC2, when you can now buy a brand-new 980 Ti for $440 from Newegg, and the very capable 970 for $280. If you don't have a motherboard/PSU that supports two video cards at 16 PCIe 3.0 lanes each, upgrading is a worthy investment.

There are wrinkles: if you're doing double-precision arithmetic, consumer cards from Nvidia since the Titan Black are 12:1 fp32:fp64 or higher, i.e. double precision is relatively slow. Radeons/OpenCL might be better for that purpose.


Thanks. For sure, if I do anything serious I will buy my own hardware. But I'd like to play a little to get a feel; I'm sure $20 over one month will pay dividends in understanding how/what/whether comes next. I'm just asking for advice about those $20 (or $40 or $80 or however much they end up being).


Well, there's not much price pressure from Azure; you could try the others in Nvidia's list: https://www.reddit.com/r/MachineLearning/comments/46ict5/bes...

Also watch spot instance prices, and go on weekends: https://ec2price.com/?product=Linux/UNIX&type=g2.2xlarge&reg...


What kind of operation and system is this meant to be used in? Are you doing some kind of DSP?

If you are, you may want to consider using an FPGA and doing this all "in hardware" if you need extreme performance.


This course looks pretty good: https://news.ycombinator.com/item?id=11902172


Yes, thanks for the link, but there seem to be no video lectures.


Nvidia?



