
GPUCC – An Open-Source GPGPU Compiler - haberman
http://research.google.com/pubs/pub45226.html
======
haberman
I don't know much about this (it's not my area of expertise), but I thought
this G+ post was interesting:
[https://plus.google.com/u/0/+VincentVanhoucke/posts/6RQmgqcm...](https://plus.google.com/u/0/+VincentVanhoucke/posts/6RQmgqcmx2d)

It says that much of the reason TensorFlow initially lagged in performance is
that many of the performance issues only manifested under NVCC, whereas Google
had been using GPUCC internally.

------
namtrac
This is part of llvm trunk (upcoming 3.9 release) now:
[http://llvm.org/docs/CompileCudaWithLLVM.html](http://llvm.org/docs/CompileCudaWithLLVM.html)
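
To give a flavor of the workflow, here is a minimal sketch (the file name,
sm_35 arch, and CUDA install path are assumptions; the flags follow the
linked doc):

    // axpy.cu -- build with something like:
    //   clang++ axpy.cu -o axpy --cuda-gpu-arch=sm_35 \
    //     -L/usr/local/cuda/lib64 -lcudart_static -ldl -lrt -pthread
    #include <cstdio>
    #include <cuda_runtime.h>

    // Device kernel: each thread scales one element.
    __global__ void axpy(float a, float* x, float* y) {
      y[threadIdx.x] = a * x[threadIdx.x];
    }

    int main() {
      const int n = 4;
      float hx[] = {1, 2, 3, 4}, hy[4];
      float *dx, *dy;
      cudaMalloc(&dx, n * sizeof(float));
      cudaMalloc(&dy, n * sizeof(float));
      cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
      axpy<<<1, n>>>(2.0f, dx, dy);  // one block of n threads
      cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
      for (int i = 0; i < n; ++i) printf("%g\n", hy[i]);
      cudaFree(dx);
      cudaFree(dy);
      return 0;
    }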

~~~
svensken
Thanks for the link! Pretty exciting stuff.

Can anyone comment on the following quote:

 _The list below shows some of the more important optimizations for GPUs... A
few of them have not been upstreamed due to lack of a customizable target-
independent optimization pipeline._

So the LLVM version of gpucc will be incomplete? Will there be a release of
the original stand-alone gpucc?

~~~
wujingyue
Thanks for your interest, and hope you like it!

Yes, it is currently incomplete, but I'd say at least 80% of the optimizations
are upstreamed already. Also, folks in the LLVM community are actively working
on that. For example, Justin Lebar recently pushed
[http://reviews.llvm.org/D18626](http://reviews.llvm.org/D18626) that added
the speculative execution pass to -O3.

Regarding performance, one thing worth noting is that missing one optimization
does not necessarily cause significant slowdown on the benchmarks you care
about. For example, the memory-space alias analysis only noticeably affects
one benchmark in the Rodinia benchmark suite.
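
(To illustrate what that pass buys you: pointers into __shared__ memory can
never alias pointers into global memory, and the memory-space alias analysis
lets the optimizer prove that instead of conservatively assuming they might.
A hypothetical kernel, not one of the benchmarks:)

    // Deliberately tiny illustration of the fact the pass exploits: an
    // address in __shared__ memory can never equal an address in global
    // memory. (In real code the win shows up where the baseline analysis
    // loses track of the pointers.)
    __device__ float gin[256];   // statically allocated global memory
    __device__ float gout[256];

    __global__ void k() {
      __shared__ float tile[256];            // per-block shared memory
      tile[threadIdx.x] = gin[threadIdx.x];  // store to shared
      gin[threadIdx.x] = 0.0f;               // store to global: provably
                                             // cannot touch tile[]
      gout[threadIdx.x] = tile[threadIdx.x]; // so this load can reuse the
                                             // value stored above
    }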

Regarding your second question, the short answer is no. The Clang/LLVM version
uses a different architecture (as mentioned in
[http://wujingyue.com/docs/gpucc-talk.pdf](http://wujingyue.com/docs/gpucc-talk.pdf)) from the internal version. The LLVM version offers better
functionality and compilation time, and is much easier to maintain and improve
in the future. It would cost even more effort to upstream the internal version
than to make all optimizations work with the new architecture.

~~~
jlebar
In fact I think at the moment almost everything, other than the memory-space
alias analysis and a few pass tuning tweaks, is in. I know the former will be
difficult to land, and I suspect the latter may be as well.

I don't have a lot of benchmarks at the moment, so I can't say how important
they are. And it of course depends on what you're doing.

clang/llvm's CUDA implementation shares most of the backend with gpucc, but
it's an entirely new front-end. The front-end works for TensorFlow, Eigen, and
Thrust, but I suspect if you try hard enough you'll be able to find something
nvcc accepts that we can't compile. At the moment we're pretty focused on
making it work well for TensorFlow.

------
EliRivers
I see Eli Bendersky's name on this; his site (
[http://eli.thegreenplace.net/](http://eli.thegreenplace.net/) ) has a number
of interesting C++ articles, some of which I've even carefully printed out and
taped into my notebook of really useful things. If you're a C++ programmer,
there are a lot of useful reads on there.

I don't see anything specifically about this in the archives, but maybe that's
something to look forward to.

------
wmf
One wonders why they didn't invest that effort in making an awesome OpenCL 2.1
compiler instead.

~~~
joe_the_user
I'm looking at building a GPGPU program.

When I look at CUDA code, it seems to be a big loop targeting GPU memory with
standard C code, allocating memory with standard functions and specifying
where code lives with simple annotations.

When I look at OpenCL, it is... I don't know what it is. I haven't figured it
out after considerable scanning. And that has cemented my decision to avoid
it, because I don't have infinite time to scan obscurity.

For example, here is a standard "first OpenCL program" (linked at the end of
this comment): ~200 lines of boilerplate _and_ no simple example of many cores
working together to do something brutally simple and useful like adding two
vectors. Just "hello world" from the GPU.
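
Compare the CUDA version of exactly that, as a minimal sketch (host-side
setup elided):

    // Add two vectors: each thread handles one element.
    __global__ void vadd(const float* a, const float* b, float* c, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) c[i] = a[i] + b[i];
    }
    // The host side is the usual handful of calls: cudaMalloc the three
    // buffers, cudaMemcpy a and b over, launch
    // vadd<<<(n + 255) / 256, 256>>>(a, b, c, n), and cudaMemcpy c back.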

As far as I can tell, as the product of a multitude of vendors, all of which
have different stuff, OpenCL is a monstrosity where a wide variety of
functionalities is supported but none of them is guaranteed to be present,
hence the 200 lines of boilerplate. Kind of like the umpteen Unix flavors back
in the day: "open standards" that bridge only semi-compatible hardware have
generally been doomed efforts, discarded in favor of a single best approach
that all vendors are forced to adopt.

So it seems like the best thing is jettisoning the monstrosity and cloning
CUDA for other hardware.

[https://www.fixstars.com/en/opencl/book/OpenCLProgrammingBoo...](https://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/first-opencl-program/)

~~~
dman
I don't completely understand the inclination to evaluate a technical stack
by the brevity of its hello world.

Use the OpenCL C++ wrapper if brevity is important to you and C++ is an
option. The hello world example here is noticeably shorter:

[http://simpleopencl.blogspot.com/2013/06/tutorial-simple-sta...](http://simpleopencl.blogspot.com/2013/06/tutorial-simple-start-with-opencl-and-c.html)

~~~
joe_the_user
That is a much more useful example program, thank you.

The problem is that the "canonical example" pretty much remains what I
showed. And what's bad about that example isn't simply its length but the way
the creation and manipulation of kernels and threads remains entirely opaque
(in contrast to your example, I think).

------
wiso
GPUCC: An Open-Source GPGPU Compiler - A Preview
[http://images.nvidia.com/events/sc15/SC5105-open-source-cuda...](http://images.nvidia.com/events/sc15/SC5105-open-source-cuda-compiler.html)

------
yzh
Not a compiler guy but a GPU programmer. This is exciting! I attended a
lecture by one of the authors a while ago. Although at this point I assume
gpucc is super-optimized for deep learning (by which I mean dense matrix
multiplication), this is very good for the community: people can work on
versions that focus on better general performance, or on different feature
sets for specific applications, in the future.

------
cjbprime
So, uh, if it's an open-source GPGPU compiler, where's the source code?

~~~
jpgvm
The code will be submitted to Clang.

~~~
cjbprime
Announcing that they'll throw a patchbomb at Clang at some indeterminate point
in the future seems to satisfy neither the "you can get source now" nor the
"this is developed in a participatory way" definitions of Open Source.

~~~
DannyBee
Except, we didn't. Instead, what's happened is that a discussion was started
on the clang and llvm mailing lists about the best way to upstream this stuff,
and as those discussions have reached consensus, patches have started flowing.

See, for example, the StreamExecutor thread.

Also, outside of that, they've been upstreaming the non-controversial smaller
stuff that is part of this for many months now.

(Seriously, of all the companies you might complain about here, Google's
interactions with clang and llvm are worth a look before throwing stones; we
are actually one of the only folks who work completely upstream at all
times.)

~~~
cjbprime
You're right, I'm not familiar with the culture of how best to contribute to
clang and llvm.

But I know some things about what words mean, and publishing a paper
describing an open source project in March and not having any code available
for download in April is just kind of _weird_, no?

It's good to talk about working out the precise mechanics of upstreaming code.
But in an open source project, you'd expect to publish your fork so that other
people can play an informed part in that conversation.

~~~
dgacmu
As a meta-note, your comments are coming across in a very hostile way, in case
you didn't intend them that way.

In an open source project, the best approach is to play by the rules of the
existing project and try to integrate your changes in the way that works well
with it. There's no One True Open Source way - there are a lot of projects,
each with their own cultures.

What Google's done is take an internally developed thingy and transition it to
LLVM. That's a pretty non-trivial effort for any company. I don't see why
having it be open source has any requirement for a dump of the internal
version. Earlier commenters noted that a large fraction of the code is
_already_ present in LLVM and has been streaming in for some time now, so why
the hostility?

Collectively, I don't think "our" (the wider community) goal is necessarily to
have a bunch of junk forks out there that can't be compiled or used. Working,
thoughtfully contributed code is much more likely to be widely used and have a
big impact, and that's a standard we should be happy if companies meet. LLVM
isn't a quick hack project - it's a foundational bit of tech that millions of
people depend on directly or indirectly, and that millions of people benefit
from improvements to.

------
Alphasite_
Just as a point of interest, is there any limitation to supporting CUDA on AMD
hardware (were this to be compiled with the AMDGPU backend), beyond the
obvious lack of libraries, etc.?

~~~
slizard
AMD's new Boltzmann initiative includes an LLVM-based compiler which has been
posted online. I'm not sure what are the plans around an OpenCL fronted, but
the backend should be there, so I think an OpenCL support in LLVM for AMD GPUs
could be a realistic goal.

[http://gpuopen.com/compute-product/hcc-heterogeneous-compute...](http://gpuopen.com/compute-product/hcc-heterogeneous-compute-compiler)
[https://github.com/RadeonOpenCompute/hcc](https://github.com/RadeonOpenCompute/hcc)

------
fooblaster
I suspect that this compiler is generating PTX and not true native binaries
for NVIDIA's architectures. NVIDIA's proprietary compiler stack is still
heavily involved in the conversion of PTX IR to native binaries. Essentially,
this isn't a full open-source stack.
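
You can check the split yourself with clang's CUDA support. A sketch (the
file name, arch, and exact flags are assumptions):

    // pipeline.cu -- a trivial kernel for inspecting the two stages.
    //
    // Open-source half (clang/gpucc), emitting PTX for the device side:
    //   clang++ --cuda-device-only -S pipeline.cu -o pipeline.ptx
    // Proprietary half (NVIDIA's ptxas, or the driver JIT at load time):
    //   ptxas -arch=sm_35 pipeline.ptx -o pipeline.cubin
    __global__ void inc(float* p) {
      p[threadIdx.x] += 1.0f;
    }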

~~~
magicalist
> _I suspect that this compiler is generating PTX and not true native
> binaries for NVIDIA's architectures_

It would take all of getting to page 2 of the article to confirm this instead
of speculating...

OTOH, there is an intriguing footnote that

> _We are also experimenting [with] compiling [virtual ISA] PTX to [NVIDIA's
> proprietary Shader ASSembler] SASS before program execution and embedding
> the SASS directly into the resultant binary_

but the paper mentions in the conclusion that a SASS spec is not publicly
available. It would be interesting for someone involved to comment more on
that. Experiments on reverse engineering the compiled PTX results?

If implementing a replacement for nvcc gave these gains, I would imagine being
able to control an offline version of the (normally JIT) compilation to SASS
would also yield large benefits. It would likely be incredibly architecture
dependent, but for the big machine learning projects that still might be worth
the expense.

~~~
jrk
In addition to open source drivers, there has been work to reverse engineer
the binary formats and write open source assemblers for recent versions of
SASS (e.g.,
[https://github.com/NervanaSystems/maxas](https://github.com/NervanaSystems/maxas)).

------
rsp1984
What are the target GPUs for this? Will it run only on NVIDIA cards? What
about mobile GPUs?

~~~
maaku
I presume it will run everywhere CUDA is supported. Draw your own conclusions.

------
varelse
Clang crashed upon impact trying to compile some of my CUDA code, as in the
very first .cu file. Not a good start, IMO.

------
hsivonen
Can this LLVM back end be used with Rust?

~~~
wcrichton
Yes, and it has been. See:
[https://www.cs.indiana.edu/~eholk/papers/hips2013.pdf](https://www.cs.indiana.edu/~eholk/papers/hips2013.pdf)

