Compiling Rust for GPUs (theincredibleholk.org)
79 points by AndrewDucker on Dec 19, 2013 | 39 comments

This article is about a year old by now. Since then, a significant amount of progress has been made. Full details are available in this research paper published at the HIPS conference this year.


This describes in some detail how they used the PTX backend and linked the host and GPU to perform matrix operations and so forth. Awesome stuff!

This is very cool. I really hope that GPU programming becomes increasingly tightly integrated into more languages over time. If we could reach a point where every major compiler has GPU support built into the language itself, it would be so much easier for a thriving ecosystem of libraries and other supporting software to emerge (not that this doesn't exist already, but right now it's mostly divided into different groups for each GPU vendor, sadly).

Unfortunately the GPU vendor divide still exists (mainly CUDA for NVIDIA, and OpenCL for AMD/Intel), with no immediately obvious solution. For example, even the Rust-to-PTX compiler from this article will fail if you try to run it on AMD, Intel, or anything but NVIDIA, because PTX is an NVIDIA-only pseudo-assembly intermediate format. Projects like "gpuocelot" promise to translate PTX to run on other GPUs, but until such translation layers mature (without sacrificing performance in the process of translation), OpenCL will remain the only viable "intermediate" format that is truly GPU-agnostic.

Which is really a shame, because quite frankly OpenCL is horrible for developing across multiple GPUs: each GPU vendor has its own OpenCL compiler with a slightly different interpretation of the spec (not to mention severe performance differences between OpenCL and CUDA on NVIDIA GPUs), plus its own proprietary extensions. It's like writing C code that has to run on N different, immature, and mostly quirky compilers, each with its own subset of features, where N is completely unknown to you -- in fact, N will grow over time after you release your application.

> It's like writing C code that has to run on N different, immature, and mostly quirky compilers, each with its own subset of features, where N is completely unknown to you -- in fact, N will grow over time after you release your application.

In other words, it's like writing C code :-) (At least, if you require fancy features such as inline assembly, threads, string literals longer than 509 characters, 64-bit integers, ...)

i don't understand what's so horrible about OpenCL across all the vendors? aside from beta platforms, i've had all my codes work across platforms that claim to support the full profile.

also, gpuocelot is a terrible idea. it's a reimplementation of a closed standard, entirely controlled by nvidia, so it will always be playing catch-up with CUDA. the thing is, we already have a portable binary solution in OpenCL: SPIR. it's just a dialect of LLVM IR, in fact.

projects like this (that use CUDA/closed standards) are just deepening the divide between the GPU vendors, and moving us away from a single open standard.

CUDA doesn't deepen any divide. It's not a closed standard at all; it's a platform-specific standard. Nvidia has OpenCL support on their cards too, and it's within 10% of the performance of CUDA. CUDA is just very optimized for the architecture, is all.

so how does one go about proposing additions or changes to CUDA?

ironically, i've been able to get more performance from OpenCL than CUDA (mostly due to OpenCL's dual-source model, which lets me compile runtime constants into kernels).
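
to illustrate, a minimal sketch of the dual-source trick (the kernel and constant names are made up): the host formats the runtime value into the build options, so the kernel compiler sees it as a literal it can fold.

    /* kernel source, passed to clCreateProgramWithSource; SCALE is
       defined at build time rather than passed as an argument */
    const char* src =
        "__kernel void apply(__global float* data) {\n"
        "    data[get_global_id(0)] *= SCALE;\n"
        "}\n";

    /* host side: bake the runtime value in as a compile-time constant */
    char options[64];
    snprintf(options, sizeof(options), "-DSCALE=%ff", scale);
    clBuildProgram(program, 1, &device, options, NULL, NULL);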

CUDA is no more "optimised" than OpenCL. it's the same compiler that generates kernels. enqueueRead/Write calls end up being cudaMemcpy, etc.

You can compile runtime constants into CUDA too. CUDA has both a driver API AND the standard C toolkit.
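
For instance, a rough sketch of the driver API route (the file and kernel names are made up): generate source with the constants baked in, compile it to PTX, and load the module at runtime.

    /* CUDA driver API: load a PTX module that was generated with the
       runtime constants already baked in, then launch a kernel from it */
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "generated.ptx");   /* JIT-compiled by the driver */
    cuModuleGetFunction(&fn, mod, "myKernel");
    CUdeviceptr devPtr;                    /* assume allocated via cuMemAlloc */
    void* args[] = { &devPtr };
    cuLaunchKernel(fn, 256,1,1, 128,1,1, 0, NULL, args, NULL);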

CUDA is a lot more optimized than OpenCL when you use it right. Many of the library functions perform better than their OpenCL equivalents, or take different types. Having used both, I've got to say that OpenCL is craptastic compared to CUDA. One big example is the fact that you can't set a kernel-wide global in OpenCL; you have to pass it to the kernel every time as a buffer. In CUDA you can look up the global by symbol, set it, and it remains persistent. That means the CUDA kernel func will have a slightly smaller stack size from not having all these craptastic persistent args passed to the kernel every time. Also, instead of dereferencing a pointer passed as a kernel arg, you are directly using the global -- again, more optimized code. You probably haven't used CUDA enough. I understand how on the surface they appear similar, but on the whole, CUDA is better thought out.
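
Concretely, it looks something like this (a sketch; the symbol name is made up):

    /* a device-side global: set once from the host by symbol, persistent
       across launches, used directly in the kernel with no extra arg */
    __device__ int threshold;

    __global__ void filter(int* data) {
        if (data[threadIdx.x] < threshold)
            data[threadIdx.x] = 0;
    }

    /* host side */
    int t = 42;
    cudaMemcpyToSymbol(threshold, &t, sizeof(int));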

it's not the standard programming model in CUDA.

the library functions compile down to the same PTX; it's exactly the same toolchain whether you use nvcc or clBuildProgram -- if you could provide an example indicating otherwise, i'd very much like to see it.

you most certainly can set a kernel-wide global in OpenCL, and it does not need to be a buffer object. and you don't need to re-set it between kernel invocations (this would be true even if it were a buffer object).

OpenCL is certainly verbose, but honestly, given that you may have multiple implementations of it on the same machine, i don't really see how it could be done any simpler (at least from the C API perspective).

i use CUDA and OpenCL every working day. it is literally what i am employed to do.

Standard programming model? Please stop. You have two options with CUDA; no one is tying your hands in a knot, and Nvidia doesn't have a committee setting standards on which of their two flavors of toolkit you need to use. You are so biased it hurts.

Considering you use OpenCL at work every day, it would help all the hackers here if you didn't spread mistruths. There are no statics in OpenCL unless they are in constant memory and set in the program source. And we both know constant memory is quite limited, and this fact would require you to interpolate your data into a byte array and sprintf it into the source.

There are no persistent globals in OpenCL. You must pass all your globals to your kernel. We both know setKernelArg is persistent across kernel calls -- but who cares? In CUDA you don't need an argument, and you don't need to dereference it either. Way better.


as in, it's not how any tutorial or walkthrough i've ever seen has tried to teach the user about CUDA, and it's not something nvidia primarily talk about in their documentation of the language. if it's so common, why isn't it being used in rust? please, continue to make fallacious statements about me and what i think. i really don't care.

regardless, there are three (not two) programming models in CUDA -- high level (cuBLAS/cuFFT), the CUDA runtime, and the driver API.

constant memory is 64k on an M2050, a GPU from 3 years ago. if you want to set more than 16 thousand floating-point constants... well, you probably want to put that in main memory.
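
for reference, the same trade-off in OpenCL C looks something like this (a sketch, names made up):

    /* __constant data is limited by CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
       (64k on the M2050 above); anything bigger belongs in __global */
    __kernel void apply(__constant float* coeffs,   /* small cached table */
                        __global float* data)       /* main memory        */
    {
        size_t i = get_global_id(0);
        data[i] *= coeffs[i % 16];
    }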

i use OpenCL and CUDA, every day. and sorry for misunderstanding what you were saying, but really? you think passing in an argument is such a hassle? yes, setKernelArg is persistent across calls - so i'm confused why you're saying you don't care? it does exactly what you've been complaining it can't do (and, yes, it requires an extra kernel argument, oh my).

you've been constantly bashing OpenCL -- nothing more than a document, the contents of which are decided by a committee of people, none of whom are me -- while implicitly praising CUDA, and you call me biased? you don't even know what i think about either of them.

Tutorials... that's your frame of reference? How can you expect to learn much when the quality of tutorials is, well... in the gutter.

Not only an extra arg, but dereferencing everywhere the arg is used.

The constant memory is quite easy to exceed; 64k is nothing. Still, you can't modify constant memory from within the kernel, so it's really a different beast entirely. CUDA has clean persistence -- really, that's the big bonus for using it. Dynamic parallelism too, on certain cards. And cuFFT and the various libs.

Really, I don't know what you expect when you talk shit on CUDA as if it's anti-competitive and monopolistic. Recall that it came before OpenCL, and much of OpenCL was based on it and on the recommendations Nvidia gave to the committee. If you use Fermi devices like the Titan and Tesla, CUDA is going to give you more power, in terms of both the code and the architecture output of nvcc, to tune for performance. CUDA is awesome and so is OpenCL, but don't shit on CUDA because open source has your panties in a knot.

no, my frame of reference is all the talks, workshops, and presentations i've attended, papers i've read, and real code i've seen and worked with.

setting a single scalar argument requires no extra dereferencing in the kernel. if you're setting an array of values, then you'd need to dereference (via array indexing), but you'd need to do that in CUDA too. besides, even if you did have to dereference it, writing it into a private variable would save future dereferences, and even a basic optimising compiler would do this by default.
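
i.e. something like this (a sketch, names made up):

    __kernel void scale(float alpha,               /* scalar arg: used directly */
                        __global const float* tbl,
                        __global float* out)
    {
        size_t i = get_global_id(0);
        float t = tbl[0];        /* one dereference, cached privately;  */
        out[i] = alpha * t + t;  /* reuse costs no further global loads */
    }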

and you talk about persistent global state as if you'd want that. global variables (in any language) are generally considered bad practice.

OpenCL 2.0 has dynamic parallelism, too.
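
device-side enqueue looks roughly like this in 2.0 (a sketch per the spec, untested since implementations are still scarce):

    /* OpenCL 2.0: a kernel enqueues a child grid itself, via a block */
    __kernel void parent(__global int* data) {
        enqueue_kernel(get_default_queue(),
                       CLK_ENQUEUE_FLAGS_WAIT_KERNEL,
                       ndrange_1D(64),
                       ^{ data[get_global_id(0)] += 1; });
    }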

... are you being serious? maybe monopoly means something different to me? exactly how many brands of GPU have official support for CUDA? yes, CUDA came before OpenCL, and nvidia are (still) a part of "the committee" -- khronos. but it was apple that created OpenCL, handing it to khronos after they came up with 1.0.

that said, it was neither apple nor nvidia that came up with the idea of compute on the GPU. BrookGPU is the real project that kick-started GPGPU.

i hate to nit-pick, but the Titan is kepler, not fermi. "architecture output of nvcc"? i don't even know what you mean. for the codes in which i've written both OpenCL and CUDA, the performance difference is almost measurement noise.

my complaint about using CUDA in rust is as follows: rust is an open language, being created by an "open source" company. it makes absolutely no sense to me that a company with free-software principles should make their language dependent on a specific vendor. i don't think that's an outlandish complaint. imagine if rust only compiled for intel CPUs, or only worked on windows.

  > it makes absolutely no sense to me, that a company with 
  > free software principles should make their language 
  > dependent on a specific vendor
None of this work has been done by any Mozilla employee. One of the side-effects of being open-source is that anyone is free to take your code and make whatever extensions they want to it.

sorry, you are correct. i've taken much less time in reading the article, than i have in expressing my arguments in these comments -_-


    /* OpenCL: the value has to come in through a kernel argument */
    __kernel void myKernel(__global int* theMem) {
        *theMem = 0xDEAD;
    }

    /* CUDA: a module-scope global, assigned directly */
    __device__ int theMem;

    __global__ void myKernel(void) {
        theMem = 0xBEEF;
    }
nvcc has a lot of nice flags to set the max register count as well as the sm architecture (sm_xx) or compute architecture (compute_xx). It's a little bit more flexible than OpenCL. http://camolab.googlecode.com/svn/trunk/mycode/cuda/nvcc-hel...

Anyhow, Mozilla's decisions are Mozilla's alone; you shouldn't be angry at anyone _but_ Mozilla for taking the bait.

you are mistaken.

    __kernel void myKernel(int val) { /* ... */ }

    int val = 10;
    clSetKernelArg(kernel, 0, sizeof(int), &val);
similarly, for architecture/max reg count,

    clBuildProgram(..., "-cl-nv-arch sm_35 -cl-nv-maxrregcount 20",...);
i'm not angry at anyone. i disagree with mozilla's decision. that is all.

Setting a kernel argument in this manner can only be used for inputs to the kernel. Any output you want to read (either in a subsequent kernel or from the host program) must be written to a buffer or an image. In your case, that means you need to create a single-element buffer and pass the buffer to the kernel.
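
For example (a sketch with error handling omitted; the context, queue, and kernel are assumed to exist already):

    /* outputs must go through a buffer: one cl_int of device memory */
    cl_int result;
    size_t gsize = 1;
    cl_mem out = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                sizeof(cl_int), NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &out);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL,
                           0, NULL, NULL);
    clEnqueueReadBuffer(queue, out, CL_TRUE, 0, sizeof(cl_int),
                        &result, 0, NULL, NULL);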

well, either way, in the snippet of code you just posted, your assignment is still a dereference. CUDA's &theMem is a pointer to somewhere in memory, and OpenCL's theMem is a pointer to somewhere in memory. in the back end of CUDA, it's doing the same thing as what you would explicitly do in OpenCL.

we could talk about minor language differences for days. really, there is no major difference between OpenCL and CUDA (and that's one of the main reasons i'll write CUDA code -- when i know my code will only ever run on nvidia GPUs).

but when i write OpenCL code, i know it will run on all GPUs, and CPUs, and any accelerator that has an OpenCL stack.

It's actually not the same thing. I've looked at the PTX disassembly, and the difference is that the OpenCL code must perform a dereference while the CUDA code can use a fixed address (relocatable, of course). The address is relocated based on the loaded base address of the kernel module. So there is a huge gain for certain types of code: saving an instruction, and possibly warp divergence, a cache miss, register pressure, etc.

Right, OpenCL runs on Knights Corner/Bridge (the Phi -- what a piece of crap, btw), CPUs, and many other devices; it has its place. And CUDA does too. I write CUDA when I'm on Nvidia devices because of the dereferencing performance, as well as the slight speed gain from the double-buffering of workgroups that CUDA supports slightly better than OpenCL.

All in all, they're pretty much on par. I'm fond of CUDA a little more due to the clarity of its driver API and its non-evented model; it's a little bit easier to work with.

Anyhow, I think we're done here. I got you to admit that CUDA isn't a piece of monopolistic smudgenry. Which is really all I wanted :-)

I wouldn't say that CUDA is more "optimized", but last I checked it does have better support for pointers, recursion, spawning new threads from inside existing threads, etc. This allows for more sophisticated data structures and algorithms that are impossible to replicate exactly in OpenCL. A Barnes-Hut simulation [1], for example, is way tougher to build in OpenCL.

[1] http://en.wikipedia.org/wiki/Barnes%E2%80%93Hut_simulation

i'm not sure what you mean about better support for pointers.

dynamic parallelism is in OpenCL now.

I remember looking into this a couple months ago and using pointers was still strongly discouraged due to massive performance impacts.

The great thing about LLVM is its multitude of backends. I bet this Rust-to-PTX compiler would require just a bit of work to run using the R600 LLVM backend that targets AMD GPUs.

You forgot Renderscript for Android. Google doesn't seem very keen on supporting OpenCL.

i'm glad to see that GPUs are becoming more accessible!

given that they use OpenCL to run kernels, i'm really confused to see that they're compiling kernels to PTX rather than to intermediate OpenCL C. that way they could run on any device that supports OpenCL.

in the future, i would hope to at least see SPIR used in place of PTX.

I really feel like game development is the ideal space for Rust to carve out a name for itself, and going after GPU support is a great place to focus.

That's great. This would be an edge over Go. The Go community has been working on a similar problem for some time: https://groups.google.com/forum/#!topic/golang-nuts/8OJ6etdl...

I don't know how many times I have to say "Rust and Go are not competing and are in totally different spaces" before it sinks in. This only works because Rust is a low-level design that does not rely on garbage collection; Go is a high-level design.

I don't think you read the thread properly. It's mentioned here, https://groups.google.com/d/msg/golang-nuts/8OJ6etdl6WY/0FKM..., that you can access the CUDA architecture from Go. And as mentioned here, http://commandcenter.blogspot.it/2012/06/less-is-exponential..., Go was designed to replace C++. Any modern language that offers flexibility for both high- and low-level use will live; if you don't think so, go advertise assembly. If Rust offered CUDA support and an HTTP server, one could easily build a project like http://www.rescale.com.

Because Go pretends to be in the same space as Rust even though nobody actually uses it that way.

Does the Rust team take into account what AMD and ARM are trying to do with HSA? What about the changes in OpenCL 2.0? I think they should keep these in mind now, before the Rust design spec is finalized, and, if possible, try to optimize for them too.

I am wondering how difficult it is to implement HSA when the CPU and GPU are connected via the PCI bus. It should be easy for SoCs with everything on the same die, but for a combination like an Intel CPU + NVIDIA GPU it will be a long shot.

better yet, use SPIR and OpenCL.

HSA is polluting the GPGPU space, along with CUDA. we need less proprietary crap, and more open standards.

What's not open about HSA? The fact that Intel and Nvidia haven't joined it (yet) doesn't mean it's not open. Nvidia treats OpenCL as a second-class citizen compared to CUDA anyway, and I don't think Intel is that interested in OpenCL anymore, now that they have the Phi. So if neither party is too interested in OpenCL, then I'd rather have HSA than the actually proprietary CUDA/Phi solutions.

there are fewer vendors behind it.

nvidia certainly do treat OpenCL as a second-class citizen, and it's incredibly frustrating. however, Intel are very interested in OpenCL; it's their preferred programming model for the Phi.

HSA is a standard like many others in the industry.

OpenCL cannot offer what CUDA does in terms of tooling.

there are far fewer vendors behind HSA than SPIR. AMD and ARM seem to be pushing it hard (and argue it complements, rather than competes with, SPIR). but it's just muddying the waters.

hmm, you'll need to be specific about what OpenCL lacks, as there are a lot of tools out there (and a lot of them support CUDA as well)

> there are far fewer vendors behind HSA than SPIR. AMD and ARM seem to be pushing it hard (and argue it complements, rather than competes with, SPIR). but it's just muddying the waters.

It will also be the official GPGPU solution for Java.

> hmm, you'll need to be specific about what OpenCL lacks, as there are a lot of tools out there (and a lot of them support CUDA as well)

The ability to write kernels in C++, Fortran, or any other language that targets PTX.
