2. Install PyCUDA: https://mathema.tician.de/software/pycuda/
3. Dive into the excellent documentation and tutorials: https://documen.tician.de/pycuda/
Literally within two hours, I had my first code running on my GTX on standard Ubuntu. Since it's Python (plus, of course, the CUDA kernel code in C), it was much easier to grasp.
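For flavor, here's a minimal sketch in the spirit of PyCUDA's introductory hello-world examples (the kernel name and array sizes are just illustrative):

    # Double every element of an array on the GPU.
    import numpy as np
    import pycuda.autoinit          # creates a CUDA context on import
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    # The kernel itself is still plain CUDA C, compiled at runtime.
    mod = SourceModule("""
    __global__ void double_it(float *a)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        a[i] *= 2.0f;
    }
    """)
    double_it = mod.get_function("double_it")

    a = np.random.randn(256).astype(np.float32)
    expected = a * 2

    # drv.InOut copies the array to the device and back around the launch.
    double_it(drv.InOut(a), block=(256, 1, 1), grid=(1, 1))
    assert np.allclose(a, expected)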
edit: also, when you are ready to pick a language and start writing code, I advise using the universal OpenGL over Cuda.
If it fits your goals, a useful mental model is that you are programming Numpy ndarray data, and all these tools just fit together to let you manage the data elements, apply transforms, and view them. You should be thinking about the math that solves your domain problem, using alternative mathematical formulations to optimize your algorithms, and only then worrying about using a tuned parallel implementation of that math.
I prototype new ideas and visualize results and intermediate data very easily, with easy transitions between using numpy/scipy routines, custom OpenCL kernels, and OpenGL shaders. I've even found myself using OpenGL to do the visual equivalent of "printf debugging" to look at intermediate array results when developing new array operations and wondering what I've done wrong or misunderstood about the problem. It's very instructive to create little sub-problems you can run as independent scripts during dev/test/microbenchmarking cycles. You should iterate on many small experiments, not assume you can design a high-performance solution in one top-down adventure.
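As a concrete taste of that numpy-to-OpenCL transition, here's a minimal PyOpenCL sketch (the kernel and sizes are invented for illustration):

    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)

    a_np = np.random.rand(1024).astype(np.float32)
    a_dev = cl_array.to_device(queue, a_np)

    # A custom elementwise kernel, written as OpenCL C in a Python string.
    prg = cl.Program(ctx, """
    __kernel void square(__global float *a)
    {
        int gid = get_global_id(0);
        a[gid] = a[gid] * a[gid];
    }
    """).build()

    prg.square(queue, a_np.shape, None, a_dev.data)

    # Pull the result back and check it against the pure-numpy version.
    assert np.allclose(a_dev.get(), a_np ** 2)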
For high performance processing, you eventually need to understand architecture limitations and the impact of different problem decompositions. Non-trivial, multi-dimensional problems need to be decomposed into smaller blocks to get better cache locality for the vectorized/parallel code that will process each block. Otherwise, you won't enjoy the benefits of parallel hardware as all the compute units are stalled waiting for data fetches. I find myself doing more and more meta-programming in Python, where I reason about array shapes and sizes, compute slicing geometries, and then loop to extract and dispatch chunks of data to OpenCL jobs which in turn use a dense form of SIMD or SPMD execution on that block. Python is great for this kind of marshaling work.
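A pure-Python sketch of that marshaling pattern, with process_block standing in for an OpenCL dispatch (the helper name and block sizes here are mine, purely illustrative):

    import itertools
    import numpy as np

    def block_slices(shape, block):
        """Yield tuples of slices tiling an array of `shape` into `block`-sized chunks."""
        ranges = [range(0, n, b) for n, b in zip(shape, block)]
        for starts in itertools.product(*ranges):
            yield tuple(slice(s, min(s + b, n))
                        for s, b, n in zip(starts, block, shape))

    def process_block(chunk):
        # Stand-in for enqueueing an OpenCL kernel on this dense block.
        return chunk * 2.0

    a = np.random.rand(1000, 1000).astype(np.float32)
    out = np.empty_like(a)

    # Python reasons about the slicing geometry; the kernel sees dense blocks.
    for sl in block_slices(a.shape, (256, 256)):
        out[sl] = process_block(a[sl])

    assert np.allclose(out, a * 2.0)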
Also, PyOpenCL makes it trivial to switch between backend drivers. The Intel OpenCL driver is quite good on recent laptop and server CPUs. You might be surprised how much performance you can get out of recent i5/i7 or Xeon E3/E5 CPUs when you use all cores and all SIMD units effectively. Plus, the CPU backend can use all system RAM and has more uniform cache behavior, which makes it more forgiving of badly decomposed problems than a GPU. This can be a big help for prototyping, as well as for limited-run problems where you don't have time to invest in fully tuning your data structures for GPU requirements.
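Switching backends is just a question of which device you build the context on. A sketch using PyOpenCL's platform/device enumeration, preferring a CPU device when one is available:

    import pyopencl as cl

    # See what drivers and devices are installed.
    for platform in cl.get_platforms():
        for device in platform.get_devices():
            print(platform.name, "->", device.name)

    # Prefer a CPU device (e.g. the Intel OpenCL driver) if present.
    cpu_devices = []
    for platform in cl.get_platforms():
        try:
            cpu_devices += platform.get_devices(device_type=cl.device_type.CPU)
        except cl.RuntimeError:
            pass  # this platform has no CPU devices
    ctx = (cl.Context(devices=cpu_devices[:1]) if cpu_devices
           else cl.create_some_context())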
That you don't seem to know the difference between OpenGL and OpenCL hardly makes your claims more plausible.
Anyway, Cuda is a general-purpose framework, as is OpenCL. OpenGL is a 3D graphics framework and different from both. Looking at their structure, Cuda is architected to allow straightforward general-purpose parallel computing: one has to know the broad structure of a GPU, but one doesn't have to know all the principles of graphics programming. OpenCL is similar but more complex due to its efforts to take multiple kinds of processors into account. As far as I can tell, using Cuda is the simplest path to general-purpose GPU computing (the OP didn't do any favors by not really saying what kind of GPU computing he wanted to learn; he did say Cuda later, but your post seems confused regardless of what someone ultimately wants to learn).
Does it mean you've never done anything GPU related ever and you want to start at "put a triangle on the screen" with OpenGL or WebGL?
Does it mean something like GPGPU, where you want to use GPUs to do computation?
Does it mean something like Vulkan/DX12/Metal, where you want to do low-level stuff?
Maybe someone can give answers to all of those questions but they're arguably different things and all fit under "GPU Programming"
Cautionary tale following...
I don't know the background on this, but your goal might not be realistic unless the rest of the framework is architected in a GPU-friendly way.
I was once hired for such a project (some scientific applications in chemistry). They had a clear idea of what needed to be improved: their Python framework called into C code for some non-trivial matrix operations. This was at the top of the profile, so making it faster would have improved the overall performance.
However, each matrix operation was just tens to hundreds of kilobytes in size. Doing them on the GPU was 100x faster, but the latency of getting the data to the GPU and back eliminated any gains.
It's clear that doing these on the GPU would have improved overall performance if the operations could be batched. But the framework was tens of thousands of LOC of scientist-written Python code that was mission critical and could not be rewritten in the time and budget we had.
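For illustration, the batching idea looks roughly like this, with numpy standing in for the GPU side; in practice, the single batched call is where one large transfer and one launch (e.g. via cuBLAS's batched GEMM) would replace thousands of tiny round trips:

    import numpy as np

    rng = np.random.default_rng(0)
    # Thousands of independent small products, tens of kilobytes each.
    a = rng.standard_normal((5000, 64, 64)).astype(np.float32)
    b = rng.standard_normal((5000, 64, 64)).astype(np.float32)

    # Anti-pattern: one host<->device round trip per small matrix, paying
    # the fixed transfer latency 5000 times:
    #   for i in range(len(a)):
    #       c[i] = gpu_matmul(a[i], b[i])
    #
    # Batched: ship both (5000, 64, 64) tensors once and run every product
    # in a single dispatch. numpy's matmul broadcasts over the leading axis
    # in exactly this shape.
    c = np.matmul(a, b)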
GPU programming is high throughput at the expense of high latency. It is very difficult to retrofit into a CPU-only framework or application; you need to design memory management and synchronization for the GPU from the ground up.
Memory bandwidth is the bottleneck for most practical tasks. Bandwidths you should know, roughly: CPU 50 GB/s, GPU 300 GB/s, PCIe DMA 16 GB/s (v3.x) or 32 GB/s (v4.x).
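Those numbers support a quick back-of-envelope model. A sketch using the figures above, plus an assumed ~10 µs of fixed launch/DMA latency per transfer (the latency figure is my own rough guess, not from the comment):

    # When does shipping data over PCIe pay off for a bandwidth-bound op?
    CPU_BW  = 50e9    # bytes/s, host memory bandwidth
    GPU_BW  = 300e9   # bytes/s, device memory bandwidth
    PCIE_BW = 16e9    # bytes/s, PCIe 3.x DMA
    LATENCY = 10e-6   # s, assumed fixed cost per transfer

    def times(nbytes, passes):
        cpu = passes * nbytes / CPU_BW
        # GPU pays two transfers (there and back) plus device-memory traffic.
        gpu = 2 * (LATENCY + nbytes / PCIE_BW) + passes * nbytes / GPU_BW
        return cpu, gpu

    for mb in (0.1, 1.0, 10.0, 100.0):
        cpu, gpu = times(mb * 1e6, passes=10)
        print(f"{mb:6.1f} MB: CPU {cpu*1e3:7.3f} ms, GPU {gpu*1e3:7.3f} ms")

With these assumptions, the 100 KB case loses on the GPU even though its raw compute is far faster, which is exactly the trap described in the chemistry story above.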
Shadows, animation, no way in a month. The whole OpenGL stack is pretty complex.
My suggestion would be the same, though: maybe reserve a year to really just grind the basics until you have a clear understanding of what is happening.
On the other hand, if you just want CUDA, there are frameworks for that which enable parallel computation without having to understand graphics programming.
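Numba's CUDA target is one such framework (my example; the comment doesn't name one). Kernels are written directly as Python functions:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def add_one(x):
        i = cuda.grid(1)        # absolute thread index
        if i < x.size:
            x[i] += 1.0

    x = np.zeros(1024, dtype=np.float32)
    d_x = cuda.to_device(x)

    threads = 256
    blocks = (x.size + threads - 1) // threads
    add_one[blocks, threads](d_x)

    assert np.all(d_x.copy_to_host() == 1.0)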
It seems neat that it'll compile and run on a GPU, CPU or FPGA, but it seems like code written for one style of architecture would be appallingly slow on the others.
Install node-opencl and CL.js and check out how Graphistry is using scalable distributed GPUs for visualization of massive data sets ;)
For compute, your choices are CUDA, OpenCL, or compute shaders in Vulkan or D3D. CUDA is most likely the best choice if a single-vendor API is acceptable.
Here are a bunch of sketches (GPU Programs):
There are wrinkles: if you're doing double-precision arithmetic, consumer cards from Nvidia since the Titan Black are 12:1 fp32:fp64 or higher, i.e. double precision is relatively slow. Radeons/OpenCL might be better for that purpose.
Also watch spot instance prices; go on weekends: https://ec2price.com/?product=Linux/UNIX&type=g2.2xlarge&reg...
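On the double-precision point above, a quick PyOpenCL sketch to see which devices advertise fp64 at all (via the standard cl_khr_fp64 extension):

    import pyopencl as cl

    # List which devices advertise double-precision support.
    for platform in cl.get_platforms():
        for device in platform.get_devices():
            print(device.name, "fp64:", "cl_khr_fp64" in device.extensions)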
If you are, and you need extreme performance, you may want to consider using an FPGA and doing this all "in hardware".