
Compiling Rust for GPUs - AndrewDucker
http://blog.theincredibleholk.org/blog/2012/12/05/compiling-rust-for-gpus/
======
antonyme
This article is about a year old by now. Since then, a significant amount of
progress has been made. Full details are available in this research paper
published at the HIPS conference this year.

[http://www.cs.indiana.edu/~eholk/papers/hips2013.pdf](http://www.cs.indiana.edu/~eholk/papers/hips2013.pdf)

This describes in some detail how they used the PTX backend and linked host and
GPU code together to perform matrix operations and so forth. Awesome stuff!

------
electrograv
This is very cool. I really hope that GPU programming becomes increasingly
tightly integrated into more languages over time. If we could reach a point
where every major compiler has GPU support built into the language itself, it
would be so much easier for a thriving ecosystem of libraries and other
supporting software to emerge (not that this doesn't exist already, but right
now it's mostly divided into different groups for each GPU vendor, sadly).

Unfortunately the GPU vendor divide still exists (mainly CUDA for NVIDIA, and
OpenCL for AMD/Intel), with no immediately obvious solution. For example, even
the Rust-to-PTX compiler from this article will fail if you try to run it on
AMD or Intel or anything but NVIDIA, because PTX is an NVIDIA-only pseudo-
assembly intermediate format. Projects like "gpuocelot" promise to translate
PTX to run on other GPUs, but until such translation layers reach maturity
(without sacrificing performance in the process of translation), OpenCL will
remain the only viable "intermediate" format that is truly GPU-agnostic.

Which is really a shame, because quite frankly OpenCL is horrible for
developing across multiple GPUs: each GPU vendor has their own OpenCL compiler
with a slightly different interpretation of the spec (not to mention severe
performance differences between OpenCL and CUDA on NVIDIA GPUs), plus their
own proprietary extensions. It's like writing C code that has to run on N
different, immature, and mostly quirky compilers, each with its own subset of
features, where N is completely unknown to you -- in fact N will grow over
time after you release your application.

~~~
foxhill
i don't understand what's so horrible about OpenCL across all the vendors?
aside from beta platforms, i've had all my codes work across platforms that
claim to support the full profile.

also, gpuocelot is a terrible idea. it's a reimplementation of a closed
standard, entirely controlled by nvidia. it will _always_ be catching up to
CUDA. the thing is, we already have a portable binary solution in OpenCL:
SPIR. it's just a dialect of llvm-ir, in fact.

projects like this (that use CUDA/closed standards) are just deepening the
divide between the GPU vendors, and moving us away from a single open standard.

~~~
goldenkey
CUDA doesn't deepen any divide. It's not a closed standard at all; it's a
platform-specific standard. Nvidia has OpenCL support on their cards too, and
it's within 10% of the performance of CUDA. CUDA is just very optimized for the
architecture, is all.

~~~
foxhill
so how does one go about proposing additions or changes to CUDA?

ironically, i've been able to get more performance from OpenCL than CUDA (due
in part to OpenCL's dual-source model, which lets me compile runtime constants
into kernels).

CUDA is no more "optimised" than OpenCL. it's the same compiler that generates
kernels. enqueueRead/Write calls end up being cudaMemcpy, etc.
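
to illustrate what i mean by compiling runtime constants into kernels, a
minimal sketch (made-up kernel; assumes a context ctx, device dev, and a float
factor are already set up - clCreateProgramWithSource and clBuildProgram are
the real calls):

    
    
        /* the factor is only known at run time, but it becomes a
           compile-time literal inside the kernel source */
        const char *src =
            "__kernel void scale(__global float *data) {\n"
            "    data[get_global_id(0)] *= FACTOR;\n"
            "}\n";
    
        char opts[64];
        snprintf(opts, sizeof(opts), "-DFACTOR=%ff", factor);
    
        cl_int err;
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &dev, opts, NULL, NULL);
    

the compiler can then constant-fold FACTOR wherever it appears, which a kernel
argument can't give you.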

~~~
goldenkey
You can compile runtime constants into CUDA too. CUDA has both a driver API
AND the standard C toolkit.

CUDA is a lot more optimized than OpenCL when you use it right. Many of the
library functions perform better than their OpenCL equivalents, or take
different types. Having used both, I've got to say that OpenCL is craptastic
compared to CUDA. One big example is the fact that you can't set a kernel-wide
global in OpenCL; you have to pass it to the kernel every time as a buffer. In
CUDA you can look up the global by symbol, set it, and it remains persistent.
That means the CUDA kernel func will have a slightly smaller stack size from
not having all these craptastic persistent args passed to the kernel every
time. Also, instead of dereferencing a pointer that is passed as a kernel arg,
you are directly using the global -- again, optimized code. You probably
haven't used CUDA enough. I understand how on the surface they appear similar.
But on the whole, CUDA is better thought out.
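
A rough sketch of the pattern I mean (names are made up; cudaMemcpyToSymbol is
the real API):

    
    
        __device__ int threshold;  // module-scope global, persists across launches
    
        __global__ void filter(const int *in, int *out, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                out[i] = (in[i] > threshold) ? in[i] : 0;  // used directly, no arg
        }
    
        // host side: set it once by symbol, then launch as often as you like
        int t = 42;
        cudaMemcpyToSymbol(threshold, &t, sizeof(int));
    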

~~~
foxhill
it's not the standard programming model in CUDA.

the library functions compile down to the same PTX; it's exactly the same
toolchain whether you use nvcc or clBuildProgram - if you could provide an
example indicating otherwise, i'd very much like to see it.

you most certainly can set a kernel-wide global in OpenCL, and it does not
need to be a buffer object. and you don't need to re-set it between kernel
invocations (this would be true even if it were a buffer object).

OpenCL is certainly verbose, but honestly, given that you may have multiple
implementations of it on the same machine, i don't really see how it could be
done any simpler (at least from the C API perspective).
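
e.g. the first thing any OpenCL host program has to do is discover what's
installed at runtime (a sketch):

    
    
        /* every vendor ships its own platform; you have to pick one at runtime */
        cl_platform_id platforms[8];
        cl_uint n;
        clGetPlatformIDs(8, platforms, &n);
    
        cl_device_id dev;
        clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    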

i use CUDA and OpenCL every working day. it is literally what i am employed to
do.

~~~
goldenkey
Standard programming model? Please stop. You have two options with CUDA; no
one is tying your hands in a knot. NVidia doesn't have a committee setting
standards on which of their 2 flavors of toolkit you need to use. You are so
biased it hurts.

Considering you use OpenCL at work every day, it would help all the hackers
here if you didn't spread untruths. There are no statics in OpenCL unless they
are in constant memory and set in the program source. And we both know
constant memory is quite limited, a fact which would require you to
interpolate your data into a byte array and sprintf it into the source.

There are no persistent globals in OpenCL. You must pass all your globals to
your kernel. We both know setKernelArg is persistent across kernel calls --
but who cares? In CUDA you don't need an argument and you don't need to
dereference it either. Way better.

[http://www.khronos.org/message_boards/showthread.php/6437-Global-variables-in-OpenCL](http://www.khronos.org/message_boards/showthread.php/6437-Global-variables-in-OpenCL)
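
For the record, the only program-scope variables OpenCL 1.x allows look like
this (a made-up sketch) -- read-only, and fixed when the source is built:

    
    
        // __constant is the only legal address space at program scope
        __constant float coeffs[4] = { 0.25f, 0.5f, 0.75f, 1.0f };
    
        __kernel void apply(__global float *data) {
            size_t i = get_global_id(0);
            data[i] *= coeffs[i % 4];
        }
    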

~~~
foxhill
as in, it's not how any tutorial or walkthrough i've ever seen has tried to
teach the user about CUDA; it's not something nvidia talk about prominently in
their documentation of the language. if it's so common, why isn't it being
used in rust? please, continue to make fallacious statements about me and what
i think. i really don't care.

regardless, there are three (not two) programming models in CUDA - high level
(CUBLAS/FFT), CUDA runtime, and the driver API.

constant memory is 64k on an M2050, a GPU from 3 years ago. if you want to set
more than 16 thousand floating-point constants... well, you probably want to
put that in main memory.

i use OpenCL _and_ CUDA, every day. and sorry for misunderstanding what you
were saying, but really? you think passing in an argument is such a hassle?
yes, setKernelArg is persistent across calls - so i'm confused why you're
saying you don't care? it does exactly what you've been complaining it can't
do (and, yes, it requires an extra kernel argument, oh my).

you've been constantly bashing OpenCL - which is nothing more than a document,
the contents of which are decided by a committee of people, none of whom is me
- while implicitly praising CUDA, and you call me biased? you don't even know
what i think about _either_ of them.

~~~
goldenkey
Tutorials... That's your frame of reference? How can you expect to learn much
when the quality of tutorials is, well... in the gutter?

Not only an extra arg, but dereferencing everywhere the arg is used.

The constant memory is quite easy to exceed; 64k is nothing. Still, you can't
modify constant memory from within the kernel, so it's really a different
beast entirely. CUDA has clean persistence. Really, that's the big bonus for
using it. Dynamic parallelism too, on certain cards. And cuFFT and the various
libs.

Really, I don't know what you expect when you talk shit on CUDA as if it's
anti-competitive and monopolistic. Recall that it came before OpenCL, and much
of OpenCL was based on it and on the recommendations Nvidia gave to the
committee. If you use fermi devices like the Titan and tesla, CUDA is going to
give you more power in terms of both code and architecture output of nvcc to
tune for performance. CUDA is awesome and so is OpenCL, but don't shit on CUDA
because open-source has your panties in a knot.

~~~
foxhill
no, my frame of reference is all the talks, workshops, and presentations i've
attended, papers i've read, and real code i've seen and worked with.

setting a single scalar argument requires no extra dereferencing in the
kernel. if you're setting an array of values, then you'd need to dereference
(via array indexing), but you'd need to do that in CUDA too. besides, even if
you _did_ have to dereference it, writing it into a private variable would
save future dereferences, and even a basic optimising compiler would do this
by default.
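
e.g. (a made-up sketch):

    
    
        __kernel void clamp_all(__global float *data, __global const float *hi_buf) {
            const float hi = *hi_buf;  /* one dereference into a private variable */
            size_t i = get_global_id(0);
            data[i] = (data[i] > hi) ? hi : data[i];  /* reused from a register */
        }
    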

and you talk about persistent global state as if you'd _want_ that. global
variables (in any language) are generally considered bad practice.

OpenCL 2.0 has dynamic parallelism, too.

... are you being serious? maybe monopoly means something different to me?
exactly how many brands of GPU have official support for CUDA? yes, CUDA came
before OpenCL, and nvidia are (still) a part of "the committee" - khronos.
but it was apple that created OpenCL, handing it to khronos after they came up
with 1.0.

that said, it was neither apple nor nvidia that came up with the idea of
compute on the GPU. BrookGPU is the real project that kick-started GPGPU.

i hate to nit-pick, but the Titan is kepler, not fermi. architecture output of
nvcc? i don't even know what you mean. for the codes in which i've written
both OpenCL and CUDA, the performance difference is almost measurement noise.

my complaint about using CUDA _in rust_ is as follows - rust is an open
language, being created by an "open source" company. it makes absolutely no
sense to me that a company with free-software principles should make their
language dependent on a specific vendor. i don't think that's an outlandish
complaint. imagine if rust only compiled for intel CPUs, or only worked on
windows.

~~~
goldenkey
OpenCL:

    
    
        __kernel void myKernel(__global int* theMem){
          *theMem = 0xDEAD;
        }
    

CUDA:

    
    
        __device__ int theMem;
    
        __global__ void myKernel(void){
          theMem = 0xBEEF;
        }
    

nvcc has a lot of nice flags to set the max register count (--maxrregcount) as
well as the sm or compute architecture (e.g. -arch=sm_35 or -arch=compute_35).
It's a little bit more flexible than OpenCL.
[http://camolab.googlecode.com/svn/trunk/mycode/cuda/nvcc-help.txt](http://camolab.googlecode.com/svn/trunk/mycode/cuda/nvcc-help.txt)

Anyhow, Mozilla's decisions are Mozilla's alone; you shouldn't be angry at
anyone _but_ Mozilla for taking the bait.

~~~
foxhill
you are mistaken.

    
    
        //kernel
        __kernel void myKernel(int val){
            //...
        }
    
        //host
        int val = 10;
        clSetKernelArg(kernel, 0, sizeof(int), &val);
    

similarly, for architecture/max reg count,

    
    
        clBuildProgram(..., "-cl-nv-arch sm_35 -cl-nv-maxrregcount=20", ...);
    

i'm not angry at anyone. i disagree with mozilla's decision. that is all.

~~~
goldenkey
Setting a kernel argument in this manner can only be used for inputs to the
kernel. Any output you want to read (either in a subsequent kernel or from the
host program) must be written to a buffer or an image. In your case, that
means you need to create a single-element buffer and pass the buffer to the
kernel.
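
In other words (a sketch; ctx, queue, and kernel are assumed to be set up
already):

    
    
        cl_int err;
        cl_mem result_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                           sizeof(cl_int), NULL, &err);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &result_buf);
        /* ... enqueue the kernel ... */
        cl_int result;
        clEnqueueReadBuffer(queue, result_buf, CL_TRUE, 0, sizeof(cl_int),
                            &result, 0, NULL, NULL);
    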

~~~
foxhill
well, either way, in the snippet of code you just posted, your assignment is
still a dereference. CUDA's &theMem is a pointer to somewhere in memory;
OpenCL's theMem is a pointer to somewhere in memory. in the back end of CUDA,
it's doing the same thing as what you would explicitly do in OpenCL.

we could talk about minor language differences for days. really, there is no
major difference between OpenCL and CUDA (and that's one of the main reasons
i'll write CUDA code - when i know my code will only ever run on nvidia GPUs).

but when i write OpenCL code, i know it will run on all GPUs, and CPUs, and
any accelerator that has an OpenCL stack.

~~~
goldenkey
It's actually not the same thing. I've looked at the PTX disassembly, and the
difference is that the OpenCL code must perform a dereference while the CUDA
code can use a fixed address (relocatable, of course). The address is
relocated based on the loaded base address of the kernel module. So there is a
huge gain for certain types of code in saving an instruction and possibly a
warp stall, cache miss, register pressure, etc.

Right, OpenCL runs on Knights Corner/Bridge (PHI - what a piece of crap btw),
CPUs, and many other devices; it has its place. And CUDA does too. I write
CUDA when I'm on Nvidia devices because of the dereferencing performance, as
well as the slight speed gain from the double-buffering of workgroups, which
CUDA supports slightly better than OpenCL.

All in all, they're pretty much on par. I'm fond of CUDA a little more due to
the clarity of its driver API and non-evented model; it's a little bit easier
to work with.

Anyhow, I think we're done here. I got you to admit that CUDA isn't a piece of
monopolistic smudgenry. Which is really all I wanted :-)

------
foxhill
i'm glad to see that GPUs are becoming more accessible!

given that they use OpenCL to run kernels, i'm really confused to see that
they're compiling kernels to PTX, and not just to intermediate OpenCL C. that
way they could run on any OpenCL-supporting device.

in the future, i would hope to at least see SPIR used in place of PTX.

------
Pxtl
I really feel like game development is the ideal space for Rust to carve out a
name for itself, and going after GPU support is a great place to focus.

------
okpatil
That's great. This would be an edge over Golang. The Go community has been
working on a similar problem for some time:
[https://groups.google.com/forum/#!topic/golang-nuts/8OJ6etdl6WY](https://groups.google.com/forum/#!topic/golang-nuts/8OJ6etdl6WY)

~~~
pcwalton
I don't know how many times I have to say "Rust and Go are not competing and
are in totally different spaces" before it sinks in. This only works because
Rust is a low-level design that does not rely on garbage collection; Go is a
high-level design.

~~~
Pxtl
Because Go _pretends_ to be in the same space as Rust even though nobody
actually uses it that way.

~~~
zellyn
No, it doesn't.
[http://commandcenter.blogspot.it/2012/06/less-is-exponentially-more.html](http://commandcenter.blogspot.it/2012/06/less-is-exponentially-more.html)

------
salient
Does the Rust team take into account what AMD and ARM are trying to do with
HSA? What about the changes in OpenCL 2.0? I think they should keep these in
mind now, before the Rust design spec is finalized, and, if possible, try to
optimize for them too.

~~~
foxhill
better yet, use SPIR and OpenCL.

HSA is polluting the GPGPU space, along with CUDA. we need less proprietary
crap, and more open standards.

~~~
salient
What's not open about HSA? Just because Intel and Nvidia haven't joined it
(yet) doesn't mean it's not open. Nvidia treats OpenCL as a 2nd-class citizen
compared to CUDA anyway, and I don't think Intel is that interested in OpenCL
anymore, now that they have Phi. So if neither party is too interested in
OpenCL, then I'd rather have HSA than the actually proprietary CUDA/Phi
solutions.

~~~
foxhill
there are fewer vendors behind it.

nvidia certainly do treat OpenCL as a second-class citizen, and it's
incredibly frustrating. however, Intel are very interested in OpenCL - it's
their preferred programming model for the Phi.

