Wen-mei Hwu's lectures on "Advanced Algorithmic Techniques for GPUs" (first lecture slides: http://iccs.lbl.gov/assets/docs/2011-01-24/lecture1_computat...) are a gold mine of GPU programming techniques. I believe he has published several books on the topic too, and released a benchmark suite (Parboil http://impact.crhc.illinois.edu/parboil/parboil.aspx) optimized with these techniques.
Plus, the final exam was extremely harsh, so I wouldn't have called it a joke.
Many things were missing from that class, including how to improve performance by ensuring that each warp receives data in the optimal size, for example by using float4 for vectorized loads.
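For anyone curious what the float4 trick looks like in practice, here's a minimal sketch (hypothetical kernel names; assumes n is divisible by 4 and the buffer is 16-byte aligned, which cudaMalloc guarantees). Each thread issues one wide 128-bit load/store instead of four narrow ones, so a warp needs fewer memory transactions:

```cuda
// Scalar version: one float per thread.
__global__ void scale_scalar(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Vectorized version: four floats per thread via float4.
// n4 = n / 4; launch with a quarter as many threads.
__global__ void scale_vec4(float4 *x, float a, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = x[i];                      // one 128-bit load
        v.x *= a; v.y *= a; v.z *= a; v.w *= a;
        x[i] = v;                             // one 128-bit store
    }
}
```

Whether this wins depends on the architecture and whether the kernel is bandwidth-bound, so it's worth profiling both variants.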
I could have learned the same stuff by looking at his MOOC - which is what OP got bored of doing.
I am really interested in the class outcome, and would love to hear how the students in the class feel about this arrangement ....
I can see the good things about this. It gives the instructor/TA students an opportunity to grow while giving a peer-learning atmosphere to the students in class. Plus, the students in class will learn from peers who have the latest working knowledge of CUDA fresh in their heads, and this arrangement also frees up a faculty member (or two) from having to prepare the course, so that they can do their faculty/research work (prepping and teaching a class, especially an interesting and engaging one, is a really draining experience on the part of the faculty as well).
The only downside I can see is managing the class well enough that class time is efficiently utilized. But I believe this should be covered by the faculty member in the supervisory position ....
The motivation behind the student-taught class is that it allows more classes to be taught than could happen otherwise.
As a student: Like any other class, the quality greatly depends on the work put in by the instructors. I think a student instructor is more likely to care about the quality of teaching, but also more likely to be overworked and not have enough time to dedicate to the course. I didn't think the course was particularly good when I took it due to lack of time from the instructors, but I'm glad the course was offered and that I took it as it got my feet wet with GPU programming.
After taking the course, I did an internship doing GPU programming. Doing this internship, I learned a ton and had a lot of ideas about how to improve the course. This put the idea of teaching the course in my head.
As an instructor: One other student and I designed the curriculum, gave the lectures, made the problem sets, did everything. We had a third student who helped with grading. Teaching the course was hugely valuable to me, and also a ton of work: I learned a great deal about GPU programming just by teaching it and answering questions. Since part of my motivation for teaching was to make the course more like what I thought it should be, I didn't reuse many materials from the year before and spent many hours making lecture slides and problem sets. Towards the end of the course, I fell short on time and the lectures and problem sets weren't as good as they could have been. We made the class have a large final project of the students' choosing, and a few awesome things were made. Overall, I'm glad I taught the class, and I think I mostly accomplished what I wanted in improving the learning outcomes for students.
I'm not sure how common such courses are, but where I went they were called "Student Directed Seminars" and allowed for some interesting courses that normally wouldn't get offered.
If you want to really master CUDA, Nvidia GPUs and the various programming model tradeoffs, the best thing is to write a GEMM kernel and a sort kernel from scratch. To take it even further, write two of each: one that optimizes large GEMMs/sorts, and one that optimizes for batches of small GEMMs (or large GEMMs with tiny (<16 or <32) `k` or another dim) / batches of small sorts. Specialization for different problem configurations is often the name of the game.
For GEMM, you can work through the simple GEMM example in the CUDA documentation, then take a look at the Volkov GEMM from 2008, then the MAGMA GEMM, then the Junjie Lai / INRIA GEMM, then eventually the Scott Gray / Nervana SASS implementation, in increasing order of complexity and state-of-the-art-ness.
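For a sense of where that progression starts, here is a sketch of the shared-memory tiled SGEMM in the spirit of the simple example from the CUDA programming guide (C = A * B, row-major; assumes M, N, K are multiples of TILE to keep the sketch short). The more advanced kernels in the list above add register blocking, wider loads, double buffering, and instruction scheduling on top of this basic structure:

```cuda
#define TILE 16

// Each thread block computes one TILE x TILE tile of C, staging
// tiles of A and B through shared memory so that each global-memory
// element is loaded once per tile rather than once per multiply.
__global__ void sgemm_tiled(const float *A, const float *B, float *C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // row of C
    int col = blockIdx.x * TILE + threadIdx.x;   // column of C
    float acc = 0.0f;

    for (int t = 0; t < K; t += TILE) {
        // Each thread stages one element of the A tile and one of B.
        As[threadIdx.y][threadIdx.x] = A[row * K + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

Launch with dim3 block(TILE, TILE) and dim3 grid(N/TILE, M/TILE). Benchmarking this against cuBLAS is humbling and instructive: it shows just how much headroom the Volkov/MAGMA/Nervana-style optimizations buy.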
That said, I thought the practical nature of the class was a refreshing switch from the heavily theoretical foundation of my other CS coursework experiences.
For anybody following along, there are two other books, the Wrox Professional CUDA Programming and CUDA for Engineers, which would ease entry for those who aren't versed in HPC (PDE solvers, BLAS/LAPACK, Fourier transforms, etc.). The Storti/Yurtoglu book is the best intro I've seen to the topic; the Wrox book covers a lot of the material in Wilt's Handbook, not as exhaustively, but more up to date (Kepler vs. Fermi).
There's other course material online too, from UIUC and Oxford (the latter especially good, IMO).
If I were NVIDIA I'd probably donate scores of servers+GPUs to schools like Caltech in order to inspire curriculum just like this.
I've looked at CUDA and OpenCL for my own development. I'd love to use OpenCL on the principle of openness and avoiding lock-in to NVidia.
But as far as I can tell, OpenCL is nothing like what a single developer would want to use for a large-scale GPGPU program. It seems to be a layer that emulates all the varying features of all the manufacturers' chips to avoid advantaging any single manufacturer: tons of boilerplate, and none of the detailed "how to do X with CUDA" examples and manuals that NVidia has produced. It's not just that it's kind of hard and unfeatured; given that it will basically continue to suck for the purposes of most GPGPU development, one can expect it to die or be suddenly replaced by something better. Basically, "open standards" from consortiums of manufacturers have only existed for other large companies to produce just a few applications on top of. I think any developer wants something they can "just program", and CUDA is miles ahead on that and seems likely to stay that way.
Another reason is that I can run that on my AMD GPU :)