
Nvidia Opens CUDA Platform, Releases Compiler Source Code
http://developer.nvidia.com/content/cuda-platform-source-release
======
melonakos
IMO, open sourcing their GPU libraries would be a much bigger deal than only
open sourcing the compiler. I would like to see CUBLAS, CUFFT, CUSPARSE,
CURAND, etc all get opened up to the community.

The pain is not in compiling GPU code; rather, the pain is in writing good GPU
code. The major difference between NVIDIA and AMD (and the major edge NVIDIA
has over AMD) is not as much the compiler as it is the libraries.

Of course, I'm biased, because I work at AccelerEyes and we do GPU consulting
with our freely available, but not open source, ArrayFire GPU library, which
has both CUDA and OpenCL versions.

~~~
dxbydt
> the pain is in writing good GPU code

A viable alternative is to not write the GPU code yourself. Write a code
generator in Scala that spits out GPU code in C. For details see Claudio
Rebbi's work, which uses Scala as a higher-level code generator for CUDA to
solve the Dirac-Wilson equation on the lattice (
[http://wwwold.jlab.org/conferences/lattice2008/talks/poster/...](http://wwwold.jlab.org/conferences/lattice2008/talks/poster/claudio_rebbi.pdf)
). In finance, we are actively looking at CUDA for derivative pricing problems
in risk analytics. None of us wants to actually write GPU code in C, and we
already have a considerable amount of risk analytics work being done in Scala,
so a code generator might actually be the way to go.
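
For what it's worth, here is a hypothetical sketch of the kind of flat CUDA C
such a generator might emit for a toy payoff, one thread per scenario. The
kernel and all names below are purely illustrative, not taken from Rebbi's
talk:

    // payoff.cu -- illustrative output of a hypothetical code generator:
    // one thread per Monte Carlo scenario, evaluating a discounted call payoff.
    #include <cuda_runtime.h>

    __global__ void payoff_kernel(const float* spot, float strike,
                                  float discount, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float p = spot[i] - strike;
            out[i] = discount * (p > 0.0f ? p : 0.0f);
        }
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *spot = 0, *out = 0;
        cudaMalloc((void**)&spot, n * sizeof(float));
        cudaMalloc((void**)&out,  n * sizeof(float));
        // ... fill `spot` with simulated terminal prices (omitted) ...
        payoff_kernel<<<(n + 255) / 256, 256>>>(spot, 100.0f, 0.95f, out, n);
        cudaDeviceSynchronize();
        cudaFree(spot);
        cudaFree(out);
        return 0;
    }

The generator's job is then just stamping out many variations of kernels like
this from a higher-level description, instead of hand-writing each one.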

~~~
melonakos
Over the last 5 years, I've seen a ton of hot air blown about auto-GPU code
generation. The latest hot air is about how magical directives make everything
run fast.

Truth is, compilers and code generators are crappy.

If you really want to get good performance, you either have to write your own
low-level GPU kernels, or use a library of functions that have already been
written at a low-level.

All other hot air, while interesting, has yet to be proven at scale on more
than a few limited use cases.

Another disclaimer: I work on this, <http://accelereyes.com/arrayfire>

~~~
sharpneli
There are two parts to writing good GPU code: parallelizing the algorithm and
writing the kernels. Automating one part will not save time on the other.

In my practical experience the compilers are pretty good nowadays. The fine
details of the kernel do not matter that much. The performance issues tend to
revolve around use of local memory, bank conflicts, and how much work one
kernel instance does; those require hand tuning, and in those cases the
compilers underperform. Thankfully, poor kernels are 'just' a constant factor
in the overall time complexity of the algorithm.
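
To make the hand tuning concrete, here is a minimal sketch (tile size and
kernel name are illustrative) of a tiled transpose where one extra column of
padding on the shared-memory tile, i.e. local memory in OpenCL terms, removes
the bank conflicts:

    #define TILE 32

    __global__ void transpose_tiled(const float* in, float* out,
                                    int width, int height)
    {
        // The +1 column of padding keeps threads in a warp from hitting
        // the same shared-memory bank when the tile is read column-wise.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];

        __syncthreads();

        // Swap block indices so the global writes stay coalesced.
        x = blockIdx.y * TILE + threadIdx.x;
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }

That one-element pad is exactly the kind of constant-factor detail the
compiler will not add for you.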

At a higher level, the most important thing is how the actual algorithm is
described. If the algorithm is described as a serial one, there is no automated
way (and most likely never will be a general way) of parallelizing it, short of
running it to check the data dependencies, at which point you already have the
result; and since the dependencies can change based on the inputs, the result
of one run cannot be generalized.

This could probably be proved by a method similar to the halting problem:
construct a program that asks the auto-parallelizer about its own two parts;
if the parallelizer says there is no data dependency between them, the program
makes them dependent, and if it says there is one, it makes them independent.

Thus let it be clear: there is no way whatsoever to take the hard part away
(thinking in parallel). Nothing will take a bunch of serial code in and spit
parallel programs out.

------
japaget
The title of this post is slightly misleading. The actual article does not
state that Nvidia has released the source code yet, but only that they are
planning to do so in the near future. A signup form is provided so that you
can be sent an e-mail when Nvidia actually does release the source code.

------
srean
There have been a few comments about using specialized code generators, for
example Theano [1], written in Python, and QUDA, as mentioned in another
comment. I do not have the background to understand them well, but I find them
very interesting.

One question that I have is whether anyone has looked at adapting or using the
IF2 backend of the Sisal programming language [2] for these. I ask because
some of the optimization that Theano does reminds me of things that IF2 is
supposed to be doing too. Sisal was written with the old school vector
machines and supercomputers in mind but has a backend that depends only on the
availability of pthreads. I suspect that it might be possible to add support
for SSE and its ilk.

[1] <http://deeplearning.net/software/theano/>

[2] <http://sourceforge.net/projects/sisal/>

------
varelse
This answers the #1 objection to using CUDA instead of OpenCL: vendor lock-in.

What it doesn't answer is who's going to write the compilers and if they will
ever happen.

But it does prove NVIDIA is still a player in the many-core game and that
there are still a few more rounds to go before there's a winner.

------
binarycrusader
Key wording to observe here -- they said they'd release the source code, not
that it would be under an open source license.

They're "opening the platform". We'll see what they actually do.

------
danieldk
Unfortunately, it does not say what license will be used, which is probably
relevant if they want to create an ecosystem around the compiler.

~~~
exDM69
I agree that the exact licensing terms are somewhat relevant if you intend to
depend on this software.

However, it's worth noting that the compiler in question is LLVM based. So you
can construct your own compiler frontend that generates LLVM IR code that can
be compiled for CUDA by their backend. It's very likely that there are some
CUDA-specific LLVM intrinsics, so the frontend will not be entirely
independent of CUDA compiler licensing terms but at least now you have a
somewhat open interchange format to use between your frontend and the CUDA
backend.

------
DiabloD3
Until Mesa/Gallium implements a CUDA stack, I see no point in caring what
Nvidia does or doesn't do with their source code.

And, most likely, CUDA will never be done by Mesa/Gallium unless quite a few
people porting legacy CUDA get together and make it happen.

OpenCL is an actual multi-vendor standard; even Nvidia is part of the Khronos
OpenCL group, which slightly implies that even Nvidia has admitted defeat.

------
justincormack
We just need documentation to understand what the generated code does, then,
since AFAIK the output is code for undocumented hardware.

~~~
sparky
There's a good chance the LLVM backend will emit PTX, not machine code. PTX is
well documented [1]. Under such a system, the generated PTX would be JITed at
runtime by the driver.

Note that LLVM already has a (very experimental and not complete) PTX backend
[2].

[1]
[http://developer.download.nvidia.com/compute/cuda/3_0/toolki...](http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/ptx_isa_2.0.pdf)

[2] <http://llvm.org/releases/3.0/docs/ReleaseNotes.html#whatsnew>
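
Assuming it does work that way, a minimal sketch of the runtime side with the
CUDA driver API, where the driver JITs a PTX image loaded from disk (the file
name "kernel.ptx" and the kernel name "my_kernel" are placeholders, and error
checking is omitted):

    #include <cuda.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Read the PTX text produced by some frontend; the driver wants it as a
    // NUL-terminated string.
    static char* read_file(const char* path)
    {
        FILE* f = fopen(path, "rb");
        fseek(f, 0, SEEK_END);
        long size = ftell(f);
        fseek(f, 0, SEEK_SET);
        char* buf = (char*)malloc(size + 1);
        fread(buf, 1, size, f);
        buf[size] = '\0';
        fclose(f);
        return buf;
    }

    int main(void)
    {
        cuInit(0);
        CUdevice dev;   cuDeviceGet(&dev, 0);
        CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

        char* ptx = read_file("kernel.ptx");              // placeholder path
        CUmodule mod;   cuModuleLoadData(&mod, ptx);      // driver JITs the PTX here
        CUfunction fn;  cuModuleGetFunction(&fn, mod, "my_kernel"); // placeholder name

        CUdeviceptr d_out;
        cuMemAlloc(&d_out, 256 * sizeof(float));
        void* args[] = { &d_out };
        cuLaunchKernel(fn, 1, 1, 1, 256, 1, 1, 0, NULL, args, NULL);
        cuCtxSynchronize();

        cuMemFree(d_out);
        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        free(ptx);
        return 0;
    }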

~~~
paxswill
I'm pretty sure this is the case, based on playing with the OpenCL side of
CUDA. If the '--version' flag is passed to the OpenCL compiler (at least the
one shipped with CUDA 3.0), it dumps info from an LLVM build from a year ago.
The '-cl-nv-verbose' flag is also documented to pass '--verbose' to the ptxas
assembler.
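
For reference, a minimal sketch of passing that flag from OpenCL host code and
reading back the build log (assumes an already-created cl_program and
cl_device_id; error checking omitted):

    #include <CL/cl.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* "-cl-nv-verbose" asks NVIDIA's OpenCL compiler to put ptxas statistics
       (registers, shared memory per kernel, etc.) into the build log. */
    void build_with_nv_verbose(cl_program program, cl_device_id device)
    {
        clBuildProgram(program, 1, &device, "-cl-nv-verbose", NULL, NULL);

        size_t log_size = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              0, NULL, &log_size);

        char* log = (char*)malloc(log_size);
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              log_size, log, NULL);
        printf("%s\n", log);
        free(log);
    }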

------
adrianscott
This sounds very exciting! I guess it's not totally related, but I hope VLC
Player will get better Nvidia hardware acceleration soon...!

~~~
ajross
It's pretty much not related at all. VLC is a player UI client; it doesn't
have codecs of its own. You should be wishing for better GPU acceleration in
libavcodec, if anything (but even that isn't implemented with CUDA).

~~~
keeperofdakeys
VLC is more than a UI: they have to implement the decoders in libavcodec, and
they do a lot of work packaging things underneath. FFmpeg also supports VDPAU
(the Nvidia Linux video acceleration API), but it would still be some work
for VLC to make use of it.

~~~
mappu
VLC does use VA-API on linux, though. I guess the rationale is that people
with high-end AMD and nVidia GPUs are likely to have plenty of CPU horsepower,
and acceleration is mostly needed for people with those intel IGPs that VA-API
supports.

(EDIT: the real reason VA-API is used over VDPAU or XvBA is probably pragmatic
and related to driver stability)

~~~
keeperofdakeys
After having a look at VA-API vs VDPAU, I must say VDPAU is much nicer. VDPAU
allows you to define times when frames will be shown, so vsync is handled
fully in hardware; more than one transparent sub-picture can also be shown at
one time.

