Hacker News
An Even Easier Introduction to CUDA (nvidia.com)
247 points by jonbaer on Feb 4, 2017 | 59 comments

Does anyone familiar with the state of GPU programming think OpenCL will eventually 'win' over CUDA? Although CUDA has more adoption, I don't like the idea of using it and being locked into a specific vendor. Of course nVidia is only supporting outdated versions of OpenCL for now. Am I a fool for hoping OpenCL eventually becomes the standard?

In 2013 we started GPU programming at the company I work for. We carefully evaluated CUDA and OpenCL and decided to go with OpenCL because it was a standard and we could choose between two GPU vendors. I can tell you that in 2017 we do not regret our choice. It is great to be able to run our code on both AMD and NVidia GPUs, and to let our customers choose whichever GPU vendor they prefer.

Many people criticise OpenCL because, coming from C++, it seems like a lot of work. It is true that OpenCL has an API influenced by OpenGL and is verbose. However, it is not difficult to write a small framework, specific to your needs and domain, that factors out much of this verbosity.

NVidia does everything it can to hide the fact that their devices support OpenCL. People think that only ancient versions of OpenCL run on NVidia devices. That is not true: 1.2 is not ancient; it is still, as of today, the main version of OpenCL in use. OpenCL 1.2 is fully supported, and NVidia quietly tells its large customers who refuse to use CUDA that it will soon start supporting some OpenCL 2.0 features.

To answer your question, I am not sure either will win, but they will both exist for a long time.

Thank you, this is very helpful information I was hoping to hear.

Honestly for those of us in machine learning, I think something like XLA will likely win over both paradigms. (Disclaimer, I work on XLA.)


XLA much more closely matches what you want for ML than CUDA/OpenCL. Which isn't a surprise; it was designed specifically for ML.

Kernel launches are expensive, so any fast CUDA system has to let you compose computations into a single kernel (e.g. multiply by 5 and then take tanh). It's possible to do this in CUDA, but it requires heroic C++ template metaprogramming. It's not uncommon to have files that take ten minutes to compile. Whereas in XLA kernel fusion is no big deal, because it's a JIT.

Also, because XLA is generating GPU code after it's seen your model, it can specialize computations specifically to your model. In regular TensorFlow (and I presume other ML frameworks, although I'm not at all familiar with them), you have to compile all of your kernels upfront. This means that the framework probably doesn't have the ideal set of kernels for your model, because the framework's set of kernels needs to be generic. For example, the framework probably isn't going to have a "multiply by 5 and then take tanh" kernel -- if you're lucky, it might have a "multiply by X and then take tanh", but notice that this may be slower because X is now not a constant.

In contrast, not only can XLA specialize for your weird X==5 case, but it can also specialize all of the dimensions of your arrays. This is a really big advantage in many cases.

As just one example, it's common for kernels to do something like

  int index = some computation based on threadIdx and blockIdx;
  if (index < array_len) { ... }
But in XLA we know the size of the kernel, so we know the possible values for threadIdx and blockIdx, and we know the exact value of array_len. We can therefore often optimize out the if entirely.

Justin, XLA sounds interesting. Do you assume you always have CUDA sources for ML operations in XLA? I was under the impression that closed-source libraries like cuDNN were used.

Is it possible to accurately evaluate the profitability of fusing two kernels in CUDA (effects of increased register pressure; shared memory)? On the other hand, the generic kernel and its launch parameters were probably hand tuned for performance.

> Do you assume you always have CUDA sources for ML operations in XLA? I was under the impression that closed-source libraries like cuDNN were used.

Yes, XLA calls into cudnn and cublas. It's not a fundamental architectural thing, though; those are just the fastest matmul etc. kernels we currently have access to.

> Is it possible to accurately evaluate the profitability of fusing two kernels in CUDA (effects of increased register pressure; shared memory)?

For a human, yes, sure, just time both options. The system doesn't currently do this in an automated fashion, though. In a fashion similar to a CPU compiler's inliner, it has heuristics and makes its best guess. In general fusion is very profitable.

> On the other hand, the generic kernel and its launch parameters were probably hand tuned for performance.

Yes, and this is one of the ways that XLA can lose to (say) vanilla TensorFlow today. But it's just a matter of tuning; the system is very young.

Why no AMD GPU support?

> Why no AMD GPU support?

I think it just reflects the team's internal priorities. Patches are welcome; we want people to use this system.

It wouldn't even be tremendously hard. The XLA IR --> LLVM IR backend is relatively simple, and LLVM already has support for compiling to AMD GPUs. You'd have to split out the nvidia-isms in the generated IR. I think the biggest challenge would just be one of software engineering, namely figuring out a way to specialize the GPU backend for each of the two architectures while allowing it to share code in general.

I've been evaluating Cuda and OpenCL while trying to produce some target independent code.

My impression is that while Cuda might not win, OpenCL will almost certainly lose. OpenCL seems to be a monster compromise interface which takes into account all the architectures of the members of a large consortium. It's the sort of designed-by-committee API that a developer has to fight against to accomplish anything. Naturally it's many years behind Cuda in features, etc.

An open-source library with qualities equivalent to Cuda's is needed - i.e., a library intended to aid developers, allow abstract C++ to easily become parallel code, provide reasonable tools and documentation, etc.

One promising example is AMD's HIP:

"HIP allows developers to convert CUDA code to portable C++. The same source code can be compiled to run on NVIDIA or AMD GPUs."


Hadn't heard of this before. This is pretty cool. Does this project have official AMD support?

What I remember from this being on HN months ago is that this is an official AMD project aiming to compete with nVidia.

I think that the closed nature of CUDA will be its undoing. I think that a standard, like C++ AMP or OpenMP 4.5, will be the ultimate winner.

I liked OpenCL but it seems to be dying.

Apple seems to have abandoned OpenCL in favor of Metal, which speaks to your case of it dying.

I found Metal Compute Shaders to be very nice to work with, though. Was much easier for me to understand than OpenCL.

I also like Metal, but it is not yet performant for high performance computing, which is more my wheelhouse. I'm also skeptical it will be popular if it does not get picked up by the GPGPU folks, but time will tell.

Is C++ AMP still going? I learned it about five years ago back when I was involved in GPU programming, haven't heard a thing about it since then.

Looks like it might be dead/dying:


Although perhaps would still be useful in its stale state, not sure.

I'm not familiar with those other standards, thanks for mentioning them. I'll check them out.

After completing the basic tutorials I hit a mental wall whenever I want to GPU-adapt some "real" code. The hard part isn't going from CPU to GPU but making the CPU code branch-free and friendly to a GPU before actually adapting it to the GPU. Something that is fairly straightforward in normal CPU code, such as a tree traversal, becomes a nightmare of sparse execution masks and inefficient lone threads executing.

OK so basic background here: CUDA processing usually looks like some dimensional array of data (1D, 2D, 3D, etc). Then you have a series of "warps" which tessellate their way through your data space, processing a chunk of elements at a time. The warps can be organized into larger "blocks" to share data between parts of the warp. Many blocks make up a "grid", which is more or less synonymous with "the processing elements of a kernel". A kernel is a GPU program.

Blocks can't communicate between each other since they may be on different SMX processor engines (SIMD units). Also, kernels can't communicate either according to spec. CUDA doesn't guarantee the order of kernel scheduling - but it is possible via undefined behavior with spinlocks.

Generally speaking - larger problem sizes should be better for you. GPUs suck at small individual tasks, starting and stopping the kernels [from the CPU] is expensive. They are good when they are doing as big a task as possible (asymptotically to a limit). Memory size will limit how big a data set you can work on, which will limit your total speedup. So overall, less memory usage = better speed.

You run lots and lots of threads at any time. GPUs are designed around the idea of massive threading, easily run dozens of threads per actual core. This covers up the massive latency when you need to go off-chip to load from global memory. You might run 10,000 threads in a program, and most of them will be sleeping while waiting for their data to load. When all threads in a warp are in READY state, the warp is scheduled and will execute.

As you note, GPUs don't work well when the threads are doing different stuff. For example, any threads that don't follow an "if" statement will just idle - because all threads in a warp execute in lockstep. They are masked off and their instructions don't affect their registers. If there are N different paths through the code, you will run it N times.

Architecture is critical to understand because this is actually bare-metal programming, like a microcontroller. There are very few niceties here. Memory is not zeroed between runs (actually not even during a soft PC restart). There is no virtual memory segmentation. Illegal accesses may not even throw, or they may trash your OS's viewport, crash the drivers, etc. And if you don't code around the architecture's limitations, your performance will suck balls.


In terms of general advice: a lot of times, scanning your data to pre-process and select "active" areas of the problem is a viable strategy. Streaming data sequentially across a warp is a pretty efficient operation thanks to warp coalescing, you have mega amounts of bandwidth, etc.

Think real heavily about your data layout. Structure of arrays is often really good because it gives you an efficient stride of 1 as much as is possible when reading/writing. That maximizes your efficiency when coalescing warps. If you are having every thread fire off its own request with no coalescing - your IOPS will trash the memory controller's performance.

As an extremely broad stroke, the best general-purpose approach to GPU programming is to convert your task into a sorting or searching task. GPUs are really, really good at sorting, and there are many good algorithms out there, so you don't have to handle the low-level stuff until you get up to a big problem size (i.e. you are maxing out GPU memory). Pay very close attention to the Thrust "histogram.cu" example because it demonstrates these techniques.

So, one good approach is to find your active elements first. You can sort the active elements to the front of the array. Or, you can use something like a prefix scan/sum or a thrust::copy_if to pull out indexes of "active" elements efficiently, and then scatter your operations across the indexes. If your indexes are sequential, then you will get the maximum amount of warp coalescing that is possible. That may not be much if your "active" elements are very sparse and widely distributed, but at least you're trying, and you're ensuring that all your elements are active as much as possible.

Obviously, wherever possible you want to avoid redundant operations, just like on CPUs. Structure your data to avoid redundant sorting, consider whether you want in-place or stable-sorts, etc. But overall sorting is very efficient on GPUs. You avoid thread divergence, you align memory access, etc.

Another approach is "dynamic parallelism". So you scan your data, figure out where "hot spots" are that have a lot of data that needs processing, and you launch more compute resources there (your kernel can launch additional kernel instances where needed). Also, in some situations you may be able to do the above approach of picking out indexes that need processing and doing them all at once - but you do it into registers or shared RAM. That way you are still keeping your cores processing instead of idling, but you avoid the round-trip to global RAM. The downside is you increase pressure on your registers/SRAM, which are very very limited resources.

If a thread can't find an element to process in a particular place - there's actually no problem with having some of your threads continue on to the next area that the warp was going to process. Assuming a random distribution - on average most of your elements will be in approximately the same area, so you still get some coalescing, and there is really no reason to have the rest of the threads halt/diverge and wait for the active elements.

Another cute dynamic parallelism trick - most of the overhead from starting/stopping kernels is the overhead of syncing the device up to the CPU. Put your main loop in a kernel by itself, and have the kernel launch more processing kernels. Overhead gone, now the GPU is running 100% on its own. However - if you really do need to talk to the CPU, then you will have to spinlock and poll, which is undefined behavior. Again, possible but iffy.

I really fucking hate CURAND. It's absolute garbage to use, it eats tons of global memory, it eats tons of SRAM, it is very not good. Instead, I really like Random123. Essentially instead of a "stateful" generator like Mersenne Twister, it's based on encryption algorithms. If you accept the concept that the output of an encryption algorithm is uncorrelated to a changing input, then essentially the encryption key becomes your "seed", and encrypting the value 0 becomes the first output from the RNG, 1 becomes the second, etc.

The advantage of doing this is that you don't waste your precious memory bandwidth and SRAM on CURAND, and instead you get to use compute cycles. Paradoxically, GPUs have absolutely insane bandwidth, but bandwidth is the second most precious resource. The only thing more important is SRAM, because you get like 100 bytes per core (note: not per thread, per core, for all threads) or something like that, for all your registers, cache, and shared variables. Compute cycles are cheaper than dirt. If you can possibly compute something from some data you already have loaded, that will usually be more efficient than loading it from global memory.

Use some property of your data (say, an index, or a uid) as your key value for Random123 and you get essentially infinite RNGs for free. If you need to have different results across different runs (stochastic simulations) then just add the actual seed to the uid-key-value. By storing a single counter (the max counter value any single element has taken) you can maintain the individual states for every single generator in your set. Not only that, but you can seek to arbitrary places in your RNG sequence. Let's say you generate some property of your data randomly. You don't actually need to store that for each element - you can just store the counter value you used to generate that, you have the index of the data element you're working on, just re-generate it in place wherever you need it. It's free money. Wait no, free global memory, which means you can scale your program up, which means it runs faster. So basically free money. Even better, you can force it to be cached in every SRAM bank using the __constant__ keyword.

I have a really idiosyncratic style for CUDA. I typically start with Thrust (basically the C++ STL for CUDA), writing high-level functional operations. Then I figure out where I can squish operations together, which I move into functors (pass them the index of elements they're working on, plus the array head pointers, they do operations on memory). Functors are nice because Thrust will auto-balance the grid for you for good occupancy. You can then start porting stuff into raw __device__ functions, and then finally translate it to a __global__ function that allows warp and grid level collective operations.

Once you've got the high-level stuff done, you need to tune the low-level kernel behavior. As much as possible - avoid global-atomic operations, since they kill your performance (you bypass cache and operate directly on global memory, incurring latency with every call, and CAS updates tend to cause contention/spinning). Pre-process in your shared RAM as much as possible. CUB (Cuda UnBound) provides warp-level and block-level collective operations that are useful - for example, a prefix-sum can give you the output targets for each thread in a warp that has variable amounts of data (0, 1, many) that it needs to output, which replaces a whole bunch of atomic operations. etc.

However, again a caveat: writing these collective operations can often involve "sync points", like thread fences. These warp/block/global sync points are really expensive in terms of processing, since you will have a bunch of cores idling to wait up for the stragglers. In some cases it's again possible to avoid an explicit sync operation by clever exploitation of the CUDA scheduler (as above, with inter-grid communication: it's not really that smart). But this is obviously very much undefined behavior too.

Texture cache can sometimes also be helpful. Basically it lets you align data in multiple dimensions rather than just one - so you can have a 3D kernel reading values, and from the GPU perspective it looks like they're all aligned, even though you're reading chunks that are hugely off in flat memory space. But there's some caveats, IIRC you really need to set it up before you run a kernel (can't do it on the fly), and IIRC it's read-only.

Also, you can cleverly abuse the texture interpolation for free math sometimes. That's typically the best gains you'll get out of texture memory, but it comes at the cost of lots of extra latency.

In newer revisions of CUDA you can transparently page stuff from host memory and it will kinda try to keep the two memory spaces synced up or whatever. This is a really bad idea, you should think real carefully before using that feature (basically never). Your 300 GB/s memory system is suddenly limited to 16 GB/s over PCIe, and memory bandwidth is precious. Explicitly manage your device memory, explicitly say when you want stuff copied and fsync'd, and don't let the autopilot handle it.


As for your specific problem of tree searching: this is really bad for GPUs. As you noticed, naive tree algorithms are pretty much the worst case; they lead to lots of divergence, which GPUs suck at. As much as possible you want to convert things into internal "while" loops that can keep moving across your dataset if they don't find something in a specific place. Don't recurse, loop. But generally - the structures which work well for CPUs don't necessarily work well for GPUs. Especially if you insist on doing one operation at a time. Searching for one element in a tree sucks. Doing range queries or searching a couple hundred values is going to be a lot better.

I have always been fascinated with the idea of probabilistic data structures and GPUs. Maybe you don't know for sure where an element is stored, but with 2000 cores you can probably find it even if there's a few dozen places it might be. That avoids some of the traditional problems of lock contention/etc on traditional data structures. And when you need to rebalance - GPUs are good at that sort of thing, since it's more or less sorting.

Also, I feel like GPUs could be an interesting model for Erlang. Lots of threads idling with low overhead? That's Erlang. Efficient message passing would be a trick though, and the use-cases would be diametrically opposite. You would have high latency and efficient numerical processing.

I also think I should be able to implement EpiSimdemics with a similar model to this one, but that model isn't open source and Keith Bissett, the guy at Virginia Tech who runs that program, refused to return my calls when I asked for disease model parameters to validate against. Ah, academia.


Ton of words here, and it's been years since I touched any of this stuff (couldn't find a job in my specialty and ended up programming Java - ugh) but you've inspired me to actually finally put the code for my grad thesis on github. It might be a worthwhile example of a real-world problem for you. Be gentle, it's my first time. I haven't touched it in years and there are a few minor things I know I screwed up (noted in the readme.md).

Repo: https://github.com/holvs/PandemicThrust

Thesis: http://scholarworks.wmich.edu/masters_theses/525/

IEEE conference paper (not very good IMO): http://ieeexplore.ieee.org.sci-hub.ac/document/7041000/


Please see also:

Quick-start docs for the Thrust library, the actual easiest introduction to CUDA that you will ever find, literally 10 lines for a hello-world program: https://thrust.github.io/

Thrust example programs (again, see "histogram.cu"): https://github.com/thrust/thrust/tree/master/examples

If I format this up nicely as a blog post: I'd like to draw some spatial diagrams. I'm a compsci programmer, not a math prof.

I need to draw 2D and 3D spaces, like a 3x3x3 cube, or an arbitrary sized space, with selectable highlighting for each unit-cube in the space.

Can someone please help me with an appropriate tool here? I'm sure there's got to be some Python module out there or something. I don't even know what term to look for there.

In my experience, I've found it easy to make figures with Wolfram Mathematica (the Python analogue would be matplotlib) and with TikZ (http://www.texample.net/tikz/examples/all/)

If you want to check out what might be possible within the Mathematica system, you could try out https://www.wolframalpha.com/

Further, hand-drawing on a tablet using a stylus is highly underrated.

Have you tried Blender? It's a 3D modeling tool with a python interface. Might work nicely for what you want to do.

it's not as simple as I'd prefer for 2d but that's exactly what I want for 3D. Thank you.

I hope many people will realize what a superb comment this is.

Thanks for that support, I suppose I should just keep trying. Realistically perhaps I should do GPU code for vector problems rather than trying to do it in anger on "hard" problems with tons of branching.

I think part of the problem is also that I don't know C++ (and more or less refuse to learn it, old dogs etc...). Usually I have some higher level code and wish to speed up parts of it.

You should clean up that comment and add some code and make it a blog post about converting a non-trivial algorithm to CUDA. A lot of the tutorials show the tools more than the craft and just do a matrix multiplication or something similar. Your blog post would reach HN front page for sure.

CMU has a few lectures on this open to the public: http://15418.courses.cs.cmu.edu/fall2016/lectures Check out Lecture 7: GPU Architecture and CUDA Programming; it starts 16 minutes in, after some review.

Udacity also has a parallel image processing algorithms w/CUDA course though I haven't done it https://www.udacity.com/course/intro-to-parallel-programming...

Well, IMO it's much harder to take legacy code and port it to GPU. I see lots of tasks where people take like one or two parts of the problem, push it to GPU, do a few operations, and pull it back.

Frankly I think that's the wrong approach to begin with - you don't get good speedups that way. Pushing everything to-and-fro across a PCIe bus that is less than half the speed of DDR3 let alone DDR4 is not a recipe for success. Literally the only place where that's even successful is when you can do a sort-and-search or something similar that the GPU is really super good at.

You really need to be doing almost everything in VRAM as much as possible, and really carefully picking what goes across the bus, because that will bottleneck you, no question. And the problem is that a lot of legacy code is not written with any of these ideas in mind. They're not memory efficient.

I originally inherited legacy C-code that was at least third-hand (and the reason the prof wanted help was because nothing worked right), and took about a year of part time work to reverse-engineer it into a new C implementation that was actually workable, then thread it with OpenMP. The GPU conversion was year 2-3 of this project.

I'm certainly not going to say the reference/OpenMP implementation was a masterwork, and I didn't squeeze it for every drop of memory or performance. But I have zero question that the CUDA implementation was much better. From what I remember it consumed at most half the memory if not less, and was easier to scale up with more processor resources. The functional-esque style with structure-of-arrays worked really really well for that and I actually ended up backporting some features like the "sort-and-search" approach that helped speed the OpenMP implementation up somewhat too (wasn't huge but it was some).

Side note, the Thrust library can target OpenMP as a __device__ back end. So if you write using the Functor-style I outlined, you can write Thrust programs and run them on your CPU for debug/etc. That was another reason I went that route that I didn't really get a chance to explore.

Anyway, what I'm saying here is that from what I've seen, the approach of trying to plug GPUs into a key part of a complex legacy app is doomed to fail. You get like 1-3x speedup at most, often a slowdown. This is embedded programming, you need to boil your problem down to the absolute minimum possible problem, squeeze it as small as possible to maximize your VRAM (problem size), keep everything on the GPU and do as much processing as possible, and minimize your transfers over your bottlenecks. When you do your transfers - do them in bulk instead of one at a time.

It's a very different model from "strong" cores like a CPU, and you have to factor in that it's across a pretty slow bus (APUs with cache-coherent busses are a promising model, as is Knight's Landing). Offloading stuff to a co-processor isn't trivial to begin with, let alone when it has a weird programming model like a GPU that's very different from "strong" CPU cores.

Others have said that too, I really will try to clean this up and repost it. It'll probably end up being a series because a full explanation of each of those chunks will be a couple pages.

Thanks for the great comment. You should write all this up somewhere, it sounds like a lot of hard-earned wisdom!

Thanks for the comment, I really should and I will try to do it sometime before it all falls out of my head any further. I miss doing it, I've just been burned out on trying to unsnarl legacy outsourced Java code for the past 2 years.

Like I said, I was actually really jazzed about trying to implement another model in GPU. This model basically consumed zero SRAM, I think I could easily extend it to a fine-grained temporal model like EpiSimdemics, and I had a neat model in mind. I even documented the idea on my IP agreement at my current job; I just got burned out by not being able to get a disease model for validation and having to do actual work. Especially Java.

Also, I just wanted to chime in here with a compliment for past-me. I tried to comment throughout, and I made a big push to document everything before I handed it off. I've spent the past couple hours looking back through that code, and even though I haven't touched a lick of C code in almost 2.5 years, between the README.md and the comments I feel like I'm doing pretty well comprehending past-me's code.

Document your fucking code, people. Future-you will thank you. Especially if it's C.

(AFAIK the handoff never actually happened though, my advisor just had a baby, and this is now officially dead code, so if you want to do a thing, by all means go for it!)

If anyone else has questions, by all means chime in on my gigapost, I'll try to answer.

This is excellent. You should put your contact details in your profile.

We do disease (and other) predictive modeling and I'm looking for people interested in the field...

Edit: my contact details are in my profile. My group funds and does engineering for work like https://arxiv.org/abs/1609.08283

It's really a shame that OpenCL doesn't have the market share that CUDA does (or kudos are owed to NVidia's marketing and foresight in investing so heavily in the tooling around its hardware...) because the raw compute performance of AMD hardware is superior to that of Nvidia and often cheaper.

"...superior to that of Nvidia.." /edit

Isn't another big reason because OpenCL is harder to program in?

Harder in what sense? There is nothing (or very little) that makes OpenCL significantly harder by nature!

OpenCL developer tools and libraries are however a disadvantage compared to NVIDIA's CUDA stack. That's partly thanks to AMD's rather poor tools (I still hope that their OSS initiative might change that). Intel's half-assed attitude towards OpenCL support didn't help either. Most importantly, NVIDIA's attitude of intentionally crippling OpenCL on their hardware by providing piss poor dev tools, only v1.2 support, no extensions that would allow making use of their hardware's features etc. has surely contributed to successfully holding back the adoption of the OpenCL standard.

I hope the community wakes up sooner rather than later.

Not with dynamic languages for the host code. Check out http://clojurecl.uncomplicate.org. Full speed with much less code.

This is great. But it seems like almost every great tutorial has a step zero that is left out. In this case, for me at least, what is missing is: what's a good guide to choosing or building a CUDA system? Preferably a Linux non-laptop. Mostly for playing around with something that offers a bit more power than my day-to-day (very non-CUDA-capable) laptop. Anyone have suggestions?

I think there might be an EC2 solution, but I'm more interested in buying or building my own hardware, as crazy as that might be, just to have a relatively fixed cost (other than electricity) and to skip the overhead of any EC2 learning curve there might be.

You have to pay attention to a few components. So

- fullsize case where a big 3-fan, 13" card won't run into the hard drive or DVD drive cables, if you choose to go that route (it seems to be easier to have multiple Founders edition cards than multiple OEM's, cooling-wise).

- for 2 or 3 fan OEM cards, you have to be moving a lot of air thru the case

- X99 /Z170 / Z97 motherboard (X99 has the highest allocation of PCI-e3 lanes to 2 or more GPU cards, and newegg does a good job of standardizing how they report lane allocations)

- and a beefy powersupply, 750+ watts and a bunch of PCI-e 6 or 8 pin connectors, and you shouldn't hit any constraints.

Also I recommend "CUDA for Engineers" by Storti/Yurtoglu as a good first CUDA book, and the Wrox "Professional CUDA C Programming" as a second book. There's another that just came out, "Programming Massively Parallel Processors", Third Edition by Kirk/Hwu, that looks good but I haven't read it.

> Preferably a Linux non-laptop.

You've made this really easy on yourself. Cuda really isn't very picky once you get an NVidia GPU in it. And even a generation old mid-range NVidia GPU will give you plenty of compute performance to keep you busy for a long long time. And when it no longer meets your requirements, you'll have a very good idea why, and what your bottlenecks are.

Shop just like you would for any other Linux non-laptop, just with the requirement of an NVidia GPU. My only other recommendation is to steer away from items marketed too strongly at the performance gaming market, as they are sometimes clocked beyond their reliability range.

As the other poster said, any NVIDIA GPU would do. You probably want the latest architecture (Pascal), and check what PSU you have and what PCIe power options it has. Depending on that (no PCIe power, single 6 pin, dual 8 pin, ...) you can see how far up the range you can go and still have the card fit.

If you are really interested, and willing to spend the time, you can get utterly fabulous perf/$

You can google the specifics, but you can build a powerful and stable system for about 300-400 dollars (a WHOLE system, including a CUDA compatible GPU, not just the GPU)

ALL of the following parts can be purchased from eBay. (The minimums are taken from actual lists I've taken down while writing this post. There might be some errors -- you have been warned, so don't blindly hit purchase if you're not sure. If anyone has the patience, please correct me.)

{{Stuff}} are alternatives

CPU: Xeon - $12 - $50

Motherboard: $35-$60

RAM (24 gigs): $35-$50

Power supply: $40-$60 (don't skimp on this. Buy name-brand. Trust me on this one.)

Case: $34-$80 (Funny how this might cost more than any of the other parts I've listed until now. Pro tip: I've done builds without a case, so this is optional if you want to save ~$50 and put it toward slightly better parts.)

GPU: GTX 1050 :$110-$120 (brand new!)

{{GTX 970 : $160-$180}}

HDD 320GB: $20

{{HDD 1TB: $40

SSD 128GB: $41}}

So adding up all the minimum prices, minus the case and including the GPU, it's around $250. I don't think you can buy a good phone for that price (a Nexus 5X goes for around $270).

The above build price minimums are pretty absolute (with links below as proof) but I'd suggest spending around $250 for everything minus the GPU, since some parts might bottleneck performance of the GPU if you're handling lots of data.

A single-processor Xeon that doesn't have problems with Ubuntu Server 16.04.1 (google the processor model number to check) would be rock solid. Don't ever listen to ANYONE saying to install Arch Linux, CentOS, etc. The community + commercial recognition of Ubuntu's LTS versions is unparalleled. (Red Hat/CentOS beats Ubuntu in commercial support, but for regular community support on Stack Overflow and debugging via Google? Ubuntu's for you.)

Once you get comfortable with your device, get comfortable with Ubuntu (install xubuntu-desktop if you want to attach a physical keyboard and mouse), then get the GPU when you think you're almost ready to handle coding for CUDA and Linux tools.
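Once the card is in, the Ubuntu-side setup is short. A sketch using Ubuntu's own packages (NVIDIA's .deb repository is an alternative if you want a newer toolkit release):

```shell
# Install the CUDA toolkit (pulls in nvcc and the libraries)
# from Ubuntu's repositories
sudo apt-get update
sudo apt-get install -y nvidia-cuda-toolkit

# Sanity checks
nvidia-smi        # should list the GPU and driver version
nvcc --version    # should report the toolkit release
```

If `nvidia-smi` doesn't see the card, sort out the proprietary driver first before touching any CUDA code.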

Links for verification here:







PS: Here's a recent post that's a great read. I suggest checking out the comments too:


Wow great intro, thanks! I've done builds before but it's been a while so this is really helpful.

Wow Thanks so much!

To learn CUDA programming? Just buy any NVIDIA GPU. Period.

I think there's a small error in the first code sample -- where the comment says:

    Run kernel on 1M elements on the GPU
... The call to `add` isn't a call to a function that gets thrown onto the GPU. Based on the definition in that code sample, `add` is very much CPU-only (at least at that point!).

The first sample is meant to use the CPU.

Yes. My point is that the comment treats `add` like it's GPU code.
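For context, a sketch of what that first sample's `add` looks like (from memory of the tutorial, minus the surrounding setup): with no `__global__` qualifier and no `<<<...>>>` launch, it's an ordinary host function, whatever the comment above the call says.

```cpp
// Host-only version: nothing here involves the GPU.
// The tutorial only moves this to the device later, by adding
// __global__ and launching with add<<<blocks, threads>>>(...).
void add(int n, float *x, float *y) {
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];  // runs serially on the CPU
}
```

So at that point in the article, `add(N, x, y);` is just a plain function call.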

You may also enjoy my video from last year's cppcon about CUDA, which is in some ways higher level, and in others much lower level: https://www.youtube.com/watch?v=KHa-OSrZPGo

Anyone know why the CUDA toolkit is 1.2 GB? That seems extremely large just to get started. In comparison, Vulkan is only 130 MB.

Vulkan leverages the compiler in the graphics drivers, like OpenCL. CUDA comes with a separate compiler that plugs into the system C/C++ compiler (this is often a pain). CUDA also ships with dozens of premade libraries for common compute tasks.

Drivers, a lot of libraries, and a bunch of sdk/profiling tools. I mean they bundle a custom eclipse, just to give some context.

Any suggestions on a cheap cloud compute engine to play with cuda that won't cost me a fortune as I learn?

I have macbook pro. Is it better to just buy a nvidia GPU and throw it in?

It's like 60 cents an hour to run it on AWS. Just shut it down when you're not using it.

Which era of MacBook Pro? Many of them have Nvidia GPUs in them.

Is there a preprocessor in the chain? Because

    add<<<1, 1>>>(N, x, y);
isn't regular C++. Sorry if I missed something.

CUDA C++ is technically its own language, which is mostly implemented using a preprocessor; nvcc performs some translation and then passes generated C++ to your compiler of choice. The kernel launch syntax, along with a few implicit includes and macros for __device__ and __global__ are (afaik) the only things that really distinguish it from vanilla C++.

The triple angle bracket syntax is used to specify execution details when the device code's sent to the GPU -- the details are outlined under "Picking up the threads" in the OP.
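Putting the pieces from this subthread together, a minimal complete example of the launch syntax might look like the following (a sketch along the lines of the OP's code; it needs nvcc and an NVIDIA GPU, and a plain C++ compiler will reject the `<<<...>>>` line):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __global__ marks a function that runs on the device and is
// launched from the host with the <<<blocks, threads>>> syntax,
// which nvcc rewrites into runtime calls before handing the
// result to the host C++ compiler.
__global__ void add(int n, float *x, float *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = x[i] + y[i];
}

int main() {
  const int N = 1 << 20;  // 1M elements
  float *x, *y;
  cudaMallocManaged(&x, N * sizeof(float));  // unified memory, as in the OP
  cudaMallocManaged(&y, N * sizeof(float));
  for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

  add<<<(N + 255) / 256, 256>>>(N, x, y);  // the non-standard part
  cudaDeviceSynchronize();                 // wait for the kernel to finish

  printf("y[0] = %f\n", y[0]);  // expect 3.0
  cudaFree(x);
  cudaFree(y);
  return 0;
}
```

Everything outside the qualifier, the launch line, and the built-in index variables is ordinary C++, which is why the preprocessor-style translation works.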

You are also not using a regular C++ compiler ;)

I'd love to learn CUDA but even the darn 'Hello World' examples don't compile.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact