I already have an NVIDIA Corporation GM107 [GeForce GTX 750 Ti] (rev a2) to work on, I just have no clue about CUDA, so I thought reading about GPU architecture and learning the basics from a tutorial would be the right way to start!
I looked into it before, but thought it was too slow and more aimed at people with no prior experience with computers; sorry, that's purely my own feeling. So I asked for a more advanced tutorial here.
You can just skip the first lessons; the chapter "Squaring Numbers Using CUDA Part 1" has some code. Then the third lesson covers some more advanced techniques (actually they are really basic, but you might not have heard of them if you're not into parallel programming or networking hardware). I recommend it, btw.
I'm afraid there isn't a single best place. What's important is to first learn how GPU architecture differs from traditional CPU architecture before you start learning CUDA or linear algebra. Take a look at these sources for absolute beginners.
I imagine it was a typo, but... I personally have found PyOpenCL and the vispy.gloo OpenGL APIs to be very convenient together (along with Numpy and Scipy of course).
If it fits your goals, a useful mental model is that you are programming Numpy ndarray data, and all these tools just fit together to let you manage the data elements, apply transforms, and view them. You should be thinking about the math that solves your domain problem, using alternative mathematical formulations to optimize your algorithms, and only then worrying about using a tuned parallel implementation of that math.
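To make that concrete, here is a toy NumPy illustration (the sizes and the problem are made up) of what "alternative mathematical formulations" means in practice: the same answer computed two ways, with vastly different amounts of work, long before any parallel tuning enters the picture.

```python
import numpy as np

# Projecting a handful of vectors through two linear maps.
rng = np.random.default_rng(0)
A = rng.standard_normal((512, 512))
B = rng.standard_normal((512, 512))
X = rng.standard_normal((512, 4))   # only 4 vectors

# Formulation 1: materialize the 512x512 product first (~512^3 flops).
Y1 = (A @ B) @ X
# Formulation 2: apply the maps right-to-left (~2 * 512^2 * 4 flops).
Y2 = A @ (B @ X)

# Same math, same result, roughly two orders of magnitude less work.
assert np.allclose(Y1, Y2)
```

Only after choosing the cheaper formulation does it make sense to reach for a tuned parallel implementation of it.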
I prototype new ideas and visualize results and intermediate data very easily, with easy transitions between using numpy/scipy routines, custom OpenCL kernels, and OpenGL shaders. I've even found myself using OpenGL to do the visual equivalent of "printf debugging" to look at intermediate array results when developing new array operations and wondering what I've done wrong or misunderstood about the problem. It's very instructive to create little sub-problems you can run as independent scripts during dev/test/microbenchmarking cycles. You should iterate on many small experiments, not assume you can design a high-performance solution in one top-down adventure.
For high performance processing, you eventually need to understand architecture limitations and the impact of different problem decompositions. Non-trivial, multi-dimensional problems need to be decomposed into smaller blocks to get better cache locality for the vectorized/parallel code that will process each block. Otherwise, you won't enjoy the benefits of parallel hardware as all the compute units are stalled waiting for data fetches. I find myself doing more and more meta-programming in Python, where I reason about array shapes and sizes, compute slicing geometries, and then loop to extract and dispatch chunks of data to OpenCL jobs which in turn use a dense form of SIMD or SPMD execution on that block. Python is great for this kind of marshaling work.
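A minimal pure-Python sketch of that meta-programming style (the function names are my own invention; plain NumPy stands in for the OpenCL jobs): reason about shapes in Python, compute a slicing geometry, then loop to extract and dispatch dense per-block work.

```python
import numpy as np

def iter_blocks(shape, block):
    """Yield slice tuples that tile a 2-D shape into block-sized chunks."""
    rows, cols = shape
    br, bc = block
    for r0 in range(0, rows, br):
        for c0 in range(0, cols, bc):
            yield (slice(r0, min(r0 + br, rows)),
                   slice(c0, min(c0 + bc, cols)))

a = np.arange(12.0).reshape(3, 4)
out = np.empty_like(a)
for sl in iter_blocks(a.shape, (2, 2)):
    # "Dispatch" the dense kernel on this block; in the real setup this
    # would be an OpenCL job operating on one cache-friendly chunk.
    out[sl] = a[sl] * 2.0

assert np.array_equal(out, a * 2.0)
```

The block size is the tuning knob: it gets chosen to fit the cache or local memory of whatever is executing the inner kernel.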
Also, PyOpenCL makes it trivial to switch between backend drivers. The Intel OpenCL driver is quite good on recent laptop and server CPUs. You might be surprised how much performance you can get out of recent i5/i7 or Xeon E3/E5 CPUs when you use all cores and all SIMD units effectively. Plus, the CPU backend is able to use all system RAM and has a more uniform cache behavior, which makes it more forgiving for badly decomposed problems compared to GPU. This can be a big help for prototyping as well as for limited-run problems where you don't have time to invest in fully tuning your data structures for GPU requirements.
If you read the OP, their reference to OpenGL isn't a typo; they are linking to books and articles on OpenGL and recommending learning it before CUDA.
I'm just starting to learn GPGPU programming myself, but I'm replying because your advice seems wildly off-base. Your links are full books on graphics-processing algorithms, not introductions to architecture, which must come before using CUDA or doing GPGPU programming effectively.
That you don't seem to know the difference between OpenGL and OpenCL hardly makes your claims more plausible.
Anyway, CUDA is a general-purpose framework, as is OpenCL. OpenGL is a 3D graphics framework and different from both. Looking at their structure, CUDA is architected to allow straightforward general-purpose parallel computing: one has to know the broad structure of a GPU, but one doesn't have to know all the principles of graphics programming. OpenCL is similar but more complex due to its efforts to take multiple kinds of processors into account. As far as I can tell, CUDA is the simplest path to general-purpose GPU computing (the OP didn't do any favors by not really saying what kind of GPU computing he wanted to learn; he did say CUDA later, but your post seems confused regardless of what thing someone ultimately wants to learn).
Hello, sorry for my vague question. I want to learn CUDA so I can increase the performance of an existing framework, with pure performance and throughput time in consideration. And no, I don't have any prior experience working with GPUs.
I don't know the background on this, but your goal might not be realistic unless the rest of the framework is architected in a GPU-friendly way.
I was once hired for such a project (some scientific applications in chemistry). They had a clear idea what needed to be improved, their Python framework called into C code for some non-trivial matrix operations. This was at the top of the profile so making it faster would have improved the overall performance.
However, each matrix operation was just tens to hundreds of kilobytes in size. Doing them with the GPU was 100x faster, but the latency of getting the data to the GPU and back eliminated any gains.
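Some back-of-envelope arithmetic (illustrative numbers of my own choosing, not measurements from that project) shows how a 100x kernel speedup can evaporate on operations this small:

```python
op_bytes     = 200e3            # a few hundred KB per matrix operation
cpu_time     = 100e-6           # assume the CPU does one op in ~100 us
gpu_kernel   = cpu_time / 100   # the kernel itself really is 100x faster
pcie_bw      = 16e9             # ~16 GB/s, PCIe 3.0 x16
per_xfer_ovh = 15e-6            # rough per-transfer latency (driver + DMA setup)

# Data has to cross the bus twice: to the GPU and back.
transfer  = 2 * (op_bytes / pcie_bw + per_xfer_ovh)
gpu_total = gpu_kernel + transfer
speedup   = cpu_time / gpu_total
print(round(speedup, 1))        # ~1.8x end to end, not 100x
```

The transfer term is fixed per operation, which is exactly why batching many small operations into one big one changes the outcome.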
It's clear that doing these on the GPU would have improved overall performance if the operations could be batched. But the framework was tens of thousands of LOC of scientist-written Python code that was mission critical and could not be rewritten in the time and budget we had.
GPU programming buys high throughput at the expense of high latency. It is very difficult to retrofit into a CPU-only framework or application; you need to design memory management and synchronization for the GPU from the ground up.
Thanks for the insight. Well, I am working on a research project to redesign a CPU-based framework into a GPU one, and luckily I am working on real-time batch-processed data, so as of now I have some plans for what should go on the GPU. I want to try it out myself and see if there are ways to cut those bus transmission latencies between CPU and GPU. Not sure exactly how, but I just started learning CUDA now. One more thing: how will I know whether I should use OpenCL or CUDA?
Latency hiding is the key for high throughput GPU applications. Make sure the GPU never waits for CPU and vice versa. Keep transfers to/from GPU flowing constantly and a queue of jobs scheduled. Each job should preferably be several megabytes in size.
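A toy pure-Python sketch of that producer/consumer shape (a thread and a bounded queue stand in for the CUDA stream and job scheduler): the CPU side keeps the queue full so the "GPU" worker never sits idle waiting for the next job.

```python
import queue
import threading

jobs = queue.Queue(maxsize=4)   # bounded queue of scheduled work
results = []

def gpu_worker():
    # Stand-in for the GPU side: drain jobs until the sentinel arrives.
    while True:
        job = jobs.get()
        if job is None:
            break
        results.append(job * job)   # "kernel" plus transfer back

t = threading.Thread(target=gpu_worker)
t.start()
for j in range(8):                  # CPU side keeps feeding work
    jobs.put(j)
jobs.put(None)                      # sentinel: no more jobs
t.join()

assert results == [j * j for j in range(8)]
```

In real CUDA code the same overlap comes from pinned host buffers, multiple streams, and asynchronous copies, but the queueing discipline is the same.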
Memory bandwidth is the bottleneck for most practical tasks. Rough figures to keep in mind: CPU main memory ~50 GB/s, GPU memory ~300 GB/s, PCIe x16 DMA ~16 GB/s (gen 3) or ~32 GB/s (gen 4).
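Quick arithmetic with those rough figures makes the imbalance obvious: here is how long moving 1 GB takes through each path (approximate numbers, peak rates, no overheads).

```python
GB = 1e9
bandwidth = {           # bytes per second, approximate peak rates
    "CPU RAM":     50e9,
    "GPU RAM":    300e9,
    "PCIe 3 x16":  16e9,
    "PCIe 4 x16":  32e9,
}
for name, bw in bandwidth.items():
    print(f"{name}: {1e3 * GB / bw:.1f} ms per GB")
```

The PCIe link is the narrowest pipe by far, which is why jobs should be several megabytes each and kept streaming rather than round-tripped one small buffer at a time.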
Thanks again. When I reach the point of actually implementing it, I am sure I'll be asking you a question or two! But HN, unfortunately, does not have friend/follow functionality.
Get a book on linear algebra, a book on C/C++, and download the OpenGL spec and build a graphics pipeline that can render a textured polygon, then add lights, shadows, animations... this will take you a month perhaps.
One month, sure :D Realistically, if you have no GPU or OpenGL experience, to build something like that, lighting, shadows... and to really understand what is happening? In one month, if you already know C++ (or your language of choice) well, you can maybe grasp how to draw a triangle with some lighting.
Shadows, animation, no way in a month. The whole OpenGL stack is pretty complex.
My suggestion would be the same though, but maybe reserve one year to really just grind the basics until you have a clear understanding of what is happening.
On the other hand, if you just want CUDA, there are frameworks for that for enabling parallel calculations without having to understand graphics programming.
Ah, OK, then you're better off learning OpenCL. If you don't have access to a beefy graphics card, just rent one by the hour off Amazon. You can prototype on your laptop's CPU, then run the real simulation on a remote machine.
OpenCL vs. CUDA is a pretty boring debate; both run on the same hardware and so have similar performance. The difference is in the tooling and ecosystems: you can run OpenCL on FPGAs, for example.
Out of curiosity, has anyone successfully deployed some OpenCL code across very different platforms?
It seems neat that it'll compile and run on a GPU, CPU or FPGA, but it seems like code written for one style of architecture would be appallingly slow on the others.
If you want to dive in deep right away, an interesting summer project would be to create hybrid client/server GPU clusters using WebGL and the browser!
Install node-opencl and CL.js, and check out how Graphistry is using scalable distributed GPUs for visualization of massive data sets ;)
I am looking into performance optimization in a data processing framework. Would OpenGL help? Or should I just learn CUDA from the links others suggested?
For compute, your choices are CUDA, OpenCL, or compute shaders in Vulkan or D3D. CUDA is most likely the best choice if a single vendor API is acceptable.
Are you more interested in open or practical? OpenCL is open; that's important for a lot of people. But on the practical side, Nvidia has invested more in improving CUDA than all other actors combined have put into OpenCL. It's reached the point that people are implementing open-source CUDA compilers to bring CUDA to non-Nvidia platforms.
I last did GPU programming around the transition from just-shaders to gpgpu... And I want to come back. However, I don't want to buy new hardware just yet. In addition to all the study materials posted, can anyone recommend a good hosted GPU setup for experiments? e.g. what kind of EC2/GCE instance would be a cost effective way to [re]learn GPU programming?
People commonly spend US$80+/month training deep learning models on EC2, when you can now buy a brand-new 980 Ti for $440 from Newegg, or the very capable 970 at $280. Even if you don't have a motherboard/PSU that supports two video cards at 16 PCIe 3.0 lanes each, it's a worthy investment.
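The breakeven arithmetic on those quoted prices (ignoring electricity and resale value) is short:

```python
ec2_per_month = 80        # US$/month quoted above for EC2 training
card_980ti    = 440       # US$, 980 Ti from Newegg
card_970      = 280       # US$, GTX 970

print(card_980ti / ec2_per_month)  # 5.5 months to break even vs a 980 Ti
print(card_970 / ec2_per_month)    # 3.5 months vs a 970
```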
There are wrinkles: if you're doing double-precision arithmetic, consumer cards from Nvidia since the Titan Black are 12:1 fp32:fp64 or higher, i.e., double precision is relatively slow. Radeons/OpenCL might be better for that purpose.
Thanks. For sure, if I do anything serious I will buy my own hardware. But I'd like to play a little first to get a feel; I'm sure $20 over one month will pay dividends in understanding how/what/whether comes next. I'm just asking for advice about those $20 (or $40, or $80, or however much they end up being).
2. install pycuda https://mathema.tician.de/software/pycuda/
3. dive into the excellent documentation and tutorials: https://documen.tician.de/pycuda/
Literally within 2 hours, I had my first code running on my GTX on standard Ubuntu. It being Python (plus the obvious CUDA code in C) made it much easier to grasp.