
CUDA to x86 compiler, project Ocelot - jacquesm
http://code.google.com/p/gpuocelot/
======
newhouseb
For those not familiar with CUDA vocabulary:

PTX = Parallel Thread Execution is a pseudo-assembly language used in nVidia's
CUDA programming environment. The 'nvcc' compiler translates code written in
CUDA, a C-like language, into PTX, and the graphics driver contains a compiler
which translates the PTX into something which can be run on the processing
cores.

(source - wikipedia)
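
For example (a minimal sketch; the kernel and file names are made up), a trivial kernel and the nvcc flag that stops at the PTX stage:

    // saxpy.cu - hypothetical example kernel
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard the tail
            y[i] = a * x[i] + y[i];
    }

Running "nvcc --ptx saxpy.cu" emits saxpy.ptx, the intermediate form that the driver later JITs for whatever GPU is actually installed.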

------
andrewcooke
so what happens if you go ptx -> llvm -> opencl -> nvidia gpu? how does the
speed change?

~~~
Raphael_Amiard
this shouldn't happen, but it would be an interesting comparison nevertheless

~~~
andrewcooke
it seems possible from the diagram on the linked page. why shouldn't it
happen?

i am curious whether the analysis stages can improve the code.

~~~
sparky
I see that "OpenCL" is listed above the LLVM box on the Ocelot page, but I'm
not sure why; It is known that several OpenCL toolchains (Nvidia, ATI,
RapidMind) make use of LLVM, but it is unclear in what capacity they are used.
For the sequence you described (PTX->OpenCL->GPU), there would have to be an
OpenCL backend for LLVM. As far as I know, no such backend is publically
available, and another compiler would be necessary to take the OpenCL source
code down to PTX (from whence it came) and then the driver would JIT that PTX
for your specific GPU model.

> i am curious whether the analysis stages can improve the code.

The conventional wisdom is that any series of analyses that takes you from
representation X, through one or more other representations, and back to X can
only make things worse, assuming the JIT from PTX to GPU machine code isn't
horrendous. This is because analyses, optimizations, and transformations must
be conservative to maintain correctness, and high-level semantic information
about the parallelism inherent in the application is usually lost in each
translation step. In this particular case it might not be so bad, as long as
the LLVM IR is rich enough to faithfully represent the Cooperative Thread
Array (CTA) semantics in PTX rather than flattening them to SPMD code. My
intuition, however, is that it's not; LLVM was designed as a fairly generic
virtual machine that faithfully represents most CPU-like execution models, and
hardware CTAs (whose threads execute in groups Nvidia calls 'warps') are
mostly a GPU-only phenomenon. CPUs have SIMD units (e.g. SSE, MMX, Altivec,
NEON), but the execution model there is fundamentally different from the
GPU's.
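
To make the CTA point concrete, here's a rough sketch (all names invented) of
the kind of block-level cooperation that has no direct equivalent in a flat,
single-thread view of the code:

    // hypothetical kernel: one partial sum per CTA. the threads of a
    // block cooperate through shared memory and barriers - exactly the
    // semantics that flatten badly into plain SPMD code.
    // assumes blockDim.x == 256 (a power of two).
    __global__ void block_sum(const float *in, float *out) {
        __shared__ float s[256];
        int t = threadIdx.x;
        s[t] = in[blockIdx.x * blockDim.x + t];
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            __syncthreads();            // the whole CTA meets at this barrier
            if (t < stride)
                s[t] += s[t + stride];  // tree reduction within the block
        }
        if (t == 0)
            out[blockIdx.x] = s[0];     // thread 0 publishes the CTA's result
    }

Whether LLVM IR can carry that barrier semantics through a round trip intact
is exactly the question.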

~~~
jacquesm
Once the GT300 series hits the shelves that problem will be largely mitigated,
though: the cores are supposed to be mostly independent, with a 'variable
warp' size.

Of course that will introduce a new level of complexity to the optimization
problem.

~~~
andrewcooke
i don't understand this either. "variable warp size" sounds like a small
efficiency fix for when things aren't multiples of 32, or when they exceed
512. a "variable warp size" doesn't alter the fundamentally SIMD approach -
you've still got a multiprocessor with slave processors that are doing very
similar work.

for me, the big advances in fermi are a unified address space and some kind of
cache for the global memory. neither of those changes the paradigm, but they
may make life significantly simpler when programming the thing.
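
to make the multiples-of-32 point concrete, a sketch with made-up numbers
(kernel and data are hypothetical): today you round the launch up to warp
granularity and let the kernel mask off the excess:

    // n = 1000 elements, 256 threads per block (a multiple of the
    // 32-wide warp). blocks is rounded up, so the last block carries
    // 24 idle threads - the kernel has to check i < n itself.
    int n = 1000;
    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // ceiling division -> 4 blocks
    kernel<<<blocks, threads>>>(n, data);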

~~~
jacquesm
Multiples of 32 are nice, multiples of _1_ are better :)

Let's hope it goes that far down; that would make things a lot easier as well.

By unified address space I assume you mean across multiple GPUs? Global memory
cache is a double-edged sword that eats into the transistor budget at a very
rapid pace; effectively you already have a cache, you just have to fill it
yourself.
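
That "fill it yourself" amounts to staging data through shared memory by hand.
A rough sketch (all names invented):

    // hypothetical kernel: shared memory used as a manually-filled
    // cache. assumes blockDim.x <= 128.
    __global__ void stage(const float *g_in, float *g_out, int n) {
        __shared__ float cache[128];      // the "cache" you fill yourself
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            cache[threadIdx.x] = g_in[i]; // explicit fill from global memory
        __syncthreads();                  // make the fill visible block-wide
        if (i < n)
            g_out[i] = 2.0f * cache[threadIdx.x]; // work out of fast memory
    }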

GPU programming is definitely a step back in the ease with which you can write
programs, but _if_ your problem maps well onto a GPU the speed increases are
simply astounding. What would have taken you a cluster of 100 boxes now sits
under your desk and consumes 250 watts, tops. That's really very impressive.

The way Intel seems to be edging into GPU territory and Nvidia into CPU
territory will make for some interesting developments over the next couple of
years.

~~~
cjenkins
I believe the unified address space refers to #6 in the PDF linked below, and
the caches to #4. I agree that the unified memory address space will be
wonderful, as managing all the various hierarchies by hand is a pain.

http://www.nvidia.com/content/PDF/fermi_white_papers/D.Patterson_Top10InnovationsInNVIDIAFermi.pdf

~~~
jacquesm
It's going to be really hard to graft that on there, given that a lot of the
computational horsepower is directly related to the bandwidth to the 'local'
memory store. That would mean the local memory store somehow has to be turned
into a cache that stays coherent across many hundreds of processing units.

I'm not sure that's impossible, it just seems very hard.

If Nvidia manages to crack that nut then the only thing you'll still need to
keep in mind is how big your cache footprint is (as on every other CPU with a
cache) in order to maximize throughput.

~~~
andrewcooke
my original comment (about unified address space) was poorly thought out (it's
not clear how much fermi will help, and how much is down to opencl being
"cross-platform"). but the ideas isn't that you no longer need to care about
the memory hierarchy; only that pointers can be expected to work correctly.
currently (particularly in opencl) there are various restrictions on pointers
that make some code more complex than it needs to be. for example, you can
only allocate 1/4 of the memory in a single chunk, and pointers are local to
chunks, so patching together chunks of memory to get one large array is messy.
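
what "patching together" ends up looking like, roughly (a made-up sketch in
cuda-style c; the 1/4 limit above is the opencl per-allocation restriction i
mean):

    // hypothetical: a "big array" faked out of four separate chunks,
    // because no single allocation may span all of memory. every
    // access pays an indirection to find the right chunk first.
    #define NCHUNKS 4
    __device__ float *chunk[NCHUNKS];  // four independent allocations
    __device__ size_t chunk_len;       // elements per chunk

    __device__ float get(size_t i) {   // logical index -> (chunk, offset)
        return chunk[i / chunk_len][i % chunk_len];
    }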

~~~
jacquesm
Ok, I see what you're getting at now.

That would definitely be a good thing.

I've spent about two months in total now (spread out over the last year)
understanding how this whole GPGPU thing fits in with the rest of computing.
It is much like a specialty tool: harder to master, more work to get right
once you have mastered it, and subject to change on shorter notice than most
other solutions (because of the close tie to the hardware). But if you need
it, you need it badly, and the pay-off is tremendous.

