
High-Performance GPU Computing in the Julia Programming Language - ceyhunkazel
https://devblogs.nvidia.com/parallelforall/gpu-computing-julia-programming-language/
======
jlebar
> This is in part because of the work by Google on the NVPTX LLVM back-end.

I'm one of the maintainers at Google of the LLVM NVPTX backend. Happy to
answer questions about it.

As background, Nvidia's CUDA ("CUDA C++?") compiler, nvcc, uses a fork of LLVM
as its backend. Clang can also compile CUDA code, using regular upstream LLVM
as its backend. The relevant backend in LLVM was originally contributed by
nvidia, but these days the team I'm on at Google is the main contributor.

I don't know much (okay, anything) about Julia except what I read in this blog
post, but the dynamic specialization looks a lot like XLA, a JIT backend for
TensorFlow that I work on. So that's cool; I'm happy to see this work.
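
If I follow the post correctly, that dynamic specialization boils down to
something like the sketch below. I'm going only by the article, so the
CUDAnative.jl names (@cuda, CuArray, threadIdx, code_ptx) and launch syntax
are assumptions that may differ between versions:

    using CUDAnative, CuArrays

    # One generic kernel definition; no element type appears anywhere.
    function axpy_kernel(y, a, x)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(y)
            @inbounds y[i] += a * x[i]
        end
        return nothing
    end

    xf = CuArray(rand(Float32, 1024)); yf = CuArray(zeros(Float32, 1024))
    xd = CuArray(rand(Float64, 1024)); yd = CuArray(zeros(Float64, 1024))

    # Each launch specializes the same Julia method on the concrete argument
    # types, lowers it to LLVM IR, and hands that to the NVPTX back-end.
    @cuda (4, 256) axpy_kernel(yf, 2.0f0, xf)   # Float32 specialization
    @cuda (4, 256) axpy_kernel(yd, 2.0, xd)     # Float64 specialization

    # The generated PTX for a given specialization can then be inspected, e.g.
    # CUDAnative.code_ptx(axpy_kernel, Tuple{typeof(yf), Float32, typeof(xf)})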

 _Full debug information is not supported by the LLVM NVPTX back-end yet, so
cuda-gdb will not work yet._

We'd love help with this. :)

 _Bounds-checked arrays are not supported yet, due to a bug [1] in the NVIDIA
PTX compiler._ [0]

We ran into what appears to be the same issue [2] about a year and a half ago.
nvidia is well aware of the issue, but I don't expect a fix except by
upgrading to Volta hardware.

[0]
[https://julialang.org/blog/2017/03/cudanative](https://julialang.org/blog/2017/03/cudanative)
[1]
[https://github.com/JuliaGPU/CUDAnative.jl/issues/4](https://github.com/JuliaGPU/CUDAnative.jl/issues/4)
[2]
[https://bugs.llvm.org/show_bug.cgi?id=27738](https://bugs.llvm.org/show_bug.cgi?id=27738)

~~~
syllogism
Does this mean we could hook Cython up to NVPTX as the backend?

I've always thought it weird that I'm writing all my code in this language
that compiles to C++, with semantics for every type declaration and so on, and
then I write chunks of code in strings, like an animal.

~~~
nicwilson
IDK about Cython, but I remember a blog post using Python's AST reflection to
JIT to LLVM -> NVPTX -> PTX. It's relatively simple to do; I've done it for
LDC/D/DCompute [1,2,3]. It's a little trickier if you want to be able to
express shared memory, surfaces & textures, but it should still be doable.

[1]
[https://github.com/ldc-developers/ldc](https://github.com/ldc-developers/ldc)
[2] dlang.org
[3]
[http://github.com/libmir/dcompute](http://github.com/libmir/dcompute)

------
dragontamer
In my experience, CUDA / OpenCL are actually rather easy to use.

The hard part is optimization, because the GPU architecture (SIMD / SIMT) is
so alien compared to normal CPUs.

Here's a step-by-step example of one guy optimizing a Matrix Multiplication
scheme in OpenCL (specifically for NVidia GPUs):
[https://cnugteren.github.io/tutorial/pages/page1.html](https://cnugteren.github.io/tutorial/pages/page1.html)

Just like how high-performance CPU computing requires a deep understanding of
cache and stuff... high-performance GPU computing requires a deep
understanding of the various memory-spaces on the GPU.

------------

Now granted: deep optimization of routines on CPUs is similarly challenging,
and actually follows a very similar process of partitioning your problem into
L1-sized blocks. But high-performance GPU code not only has to consider the L1
cache, but also "Shared" (or OpenCL __local) memory and "Register" (or OpenCL
__private) memory as well. Furthermore, GPUs in my experience have far less
memory per thread/shader than CPUs. For example: an Intel "Sandy Bridge" CPU
has 64 KB of L1 cache per core, shared by at most 2 threads if hyperthreading
is enabled. A "Pascal" GPU has 64 KB of "Shared" memory per SM, which is
extremely fast like an L1 cache, but that 64 KB is shared between 64 FP32
cores!
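
In the Julia/CUDAnative terms of the article, the shared-memory tiling step
from that tutorial looks roughly like the sketch below. I'm assuming the
@cuStaticSharedMem / sync_threads names from that package, so details may
differ:

    using CUDAnative, CuArrays

    const TILE = 16

    # C = A * B for N×N Float32 matrices, N divisible by TILE.
    function matmul_tiled!(C, A, B, N)
        tx = threadIdx().x; ty = threadIdx().y
        row = (blockIdx().y - 1) * TILE + ty
        col = (blockIdx().x - 1) * TILE + tx

        # Each block stages TILE×TILE tiles of A and B in fast shared memory,
        # the GPU analogue of blocking a CPU algorithm for the L1 cache.
        sA = @cuStaticSharedMem(Float32, (TILE, TILE))
        sB = @cuStaticSharedMem(Float32, (TILE, TILE))

        acc = 0.0f0
        for t in 0:(N ÷ TILE - 1)
            @inbounds sA[ty, tx] = A[row, t * TILE + tx]
            @inbounds sB[ty, tx] = B[t * TILE + ty, col]
            sync_threads()                 # wait until the tile is loaded
            for k in 1:TILE
                @inbounds acc += sA[ty, k] * sB[k, tx]
            end
            sync_threads()                 # wait before overwriting the tile
        end
        @inbounds C[row, col] = acc
        return nothing
    end

    # N = 512
    # A, B = CuArray(rand(Float32, N, N)), CuArray(rand(Float32, N, N))
    # C = similar(A)
    # @cuda ((N ÷ TILE, N ÷ TILE), (TILE, TILE)) matmul_tiled!(C, A, B, N)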

Furthermore, not all algorithms run faster on GPGPUs either. For example:

[https://askeplaat.files.wordpress.com/2013/01/ispa2015.pdf](https://askeplaat.files.wordpress.com/2013/01/ispa2015.pdf)

This paper reports that their accelerator implementation (on a Xeon Phi, a
many-core coprocessor rather than a GPU) was slower than the CPU
implementation! Apparently, the game of "Hex" is hard to parallelize /
vectorize.

---------------

Now don't get me wrong, this is all very cool and stuff. Making various
programming tasks easier is always welcome. Just be aware that GPUs are no
silver bullet for performance. It takes a lot of work to get "high-performance
code", regardless of your platform.

And sometimes, CPUs are faster.

~~~
ViralBShah
Absolutely. The goal with Julia is to make it easy to use whatever hardware is
best suited for the problem you are solving. This work, IMO, reduces the
barrier to entry for writing code for GPUs and gives Julia users more options.

------
gravypod
> Julia has recently gained support for syntactic loop fusion, where chained
> vector operations are fused into a single broadcast

Wow. That's very impressive.
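
If I'm reading the post right, that means a whole chain of dotted operations
like the one below compiles down to a single GPU kernel with no temporary
arrays. A sketch using the CuArrays-style GPU arrays from the post, so I may
have details wrong:

    using CuArrays

    x   = CuArray(rand(Float32, 10^6))
    y   = CuArray(rand(Float32, 10^6))
    out = similar(x)

    # Every dotted call below fuses into one broadcast, so the GPU runs a
    # single kernel instead of one launch (and one temporary) per operation.
    out .= 2f0 .* x .+ sin.(y) ./ (1f0 .+ abs.(x))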

I hope one day we get this sort of tooling with AMD GPUs.

~~~
one-more-minute
Ask and ye shall receive:
[https://github.com/JuliaGPU/CLArrays.jl](https://github.com/JuliaGPU/CLArrays.jl)

~~~
gravypod
That's amazing. I'm very excited about the prospect of auto-magically
transpiling code into GPU code. This sort of tech will make GPUs approachable
to many more scientists and programmers.

------
jernfrost
How does the Julia approach compare to the alternatives in performance and
ease of use? Can e.g. Python or R do this in any way?

~~~
wallnuss
The big difference is that Julia can handle user-defined structs and
higher-order functions, e.g. you can pass a Julia function to your GPU kernel
and that function will get compiled for the GPU without you having to declare
it GPU-compatible.
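
For example, something like this works (a rough sketch from memory, so names
may be slightly off):

    using CuArrays

    struct Spring            # plain user-defined struct, no GPU annotations;
        k::Float32           # isbits, so it can live in GPU memory
        rest::Float32
    end

    # Ordinary Julia function; never declared "GPU-compatible" anywhere.
    force(s::Spring, x) = -s.k * (x - s.rest)

    springs = CuArray([Spring(rand(Float32), 0f0) for _ in 1:1000])
    xs      = CuArray(rand(Float32, 1000))

    forces = force.(springs, xs)   # compiled for the GPU on first use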

~~~
ChrisRackauckas
The key difference here is that, while Python and R have a lot of their
standard library written in other languages (C), Julia's is mostly written in
Julia. Same with Julia's packages. This means that you can throw a lot of
library functions at this and they will compile for the GPU just fine, because
the whole stack is Julia all the way down (in many cases; there are of course
exceptions).
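
E.g., roughly (a sketch from memory; details may differ by package version):

    using CuArrays

    # "Library" code: generic Julia that knows nothing about GPUs.
    sigmoid(x) = one(x) / (one(x) + exp(-x))

    data = CuArray(randn(Float32, 10_000))

    a = clamp.(data, -1f0, 1f0)   # Base function, compiled for the GPU
    b = sigmoid.(data)            # generic user/library code, same story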

~~~
kxyvr
I keep hearing this, but each time I look at the links on HN, I see that the
high-performance libraries being cited are still written in C, C++, or some
other low-level language. For example, even in this link, the code ties into
things like cuBLAS, which is definitely not Julia code. For me,
high-performance linear algebra routines are important, and I just checked
here:

[https://docs.julialang.org/en/latest/stdlib/linalg/](https://docs.julialang.org/en/latest/stdlib/linalg/)

It looks like Julia uses a combination of LAPACK and SuiteSparse. These are
good choices, but they're not Julia code, and these routines are callable from
all sorts of other languages like Python, MATLAB, and Octave. As such, it
still appears as though Julia is operating more like a glue language rather
than a "write all of your numerical libraries in Julia" language, which is
fine, but I don't feel like that's what it's being sold as.

~~~
ViralBShah
We use BLAS, LAPACK and SuiteSparse because they are incredibly high-quality
libraries. For example, if you translated LAPACK or SuiteSparse into Julia,
you would get the same performance. BLAS is a different story (and while it's
not impossible to have a Julia one, the effort to build one would be better
deployed elsewhere for now).

The benefit comes from user code, which in many dynamic languages is
interpreted and is much slower than the built-in C libraries. For example,
look at the Julia `sum`: it is written in Julia. Or the fact that we are in
the process of replacing openlibm (based on the FreeBSD libm) with a pure
Julia implementation. Or any of the fused array kernels (arithmetic, indexing,
etc.). Our entire sparse matrix implementation (except for the solvers) is in
pure Julia.
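
As a toy illustration of the `sum` point (not a careful benchmark): a generic
reduction written in plain Julia specializes on the element type and compiles
to tight native code, so user code isn't second-class next to a built-in C
routine.

    # Generic, pure-Julia reduction; works for any element type.
    function mysum(xs)
        acc = zero(eltype(xs))
        @inbounds @simd for i in eachindex(xs)
            acc += xs[i]
        end
        return acc
    end

    xs = rand(10^7)
    mysum(xs)    # compiles a Float64 specialization on first call
    # @time mysum(xs) and @time sum(xs) should be in the same ballpark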

~~~
kxyvr
To be sure, I agree and think it's the right thing to do to hook into external
libraries when they provide the functionality we need. That's just an
extension of the right tool for the right job philosophy.

Alright, so I write numerical codes professionally. Though it's not quite
fair, I tend to lump things into glue languages and computation languages. In
a glue language, we combine all of our numerical drivers and produce an
application. For example, optimization solvers don't really need to be written
in a low-level language, since their parallelism and computation are primarily
governed by the function evaluations, derivatives, and linear system solvers.
As long as these are fast, we can use something like Python to code it and it
runs at about the same speed, and in parallel, as a C or C++ code. On the
other hand, we have the computation languages, where we code the low-level and
parallel routines like linear algebra solvers. Typically, this is done in
C/C++/Fortran, but I'm curious to see how Rust can fit in with these
languages. For me, the primary focus of a computation language is, one, that
it's fast and, two, that it's really, really easy to hook into glue languages.
Since just about every language has a C API, that's our pathway forward.

Alright, so now we have Julia. Is it a glue language? Is it a computation
language? Maybe it's designed to be both. However, at the end of the day, most
of the examples I see of Julia on HN are using Julia as a glue language. To
me, we have lots of glue languages that already hook into whatever other stuff
we care about, be it plotting tools or database readers or whatever. If Julia
is designed to be a computation language, great. However, that means we should
be seeing people writing the next generation of things like parallel
factorizations and then hooking them into a more popular glue language like
Python or MATLAB or whatever. Maybe these examples exist and I haven't seen
them. However, until this is more clear, I personally stay away from Julia and
I advise my clients to as well.

And, to be clear, Julia may be wonderfully suited for these things. Mostly, I
wanted to express my frustration of what I see as an ambiguity in the
marketing.

~~~
sixbrx
I think the biggest reason that Julia might not satisfy your definition of
"computation language" is just that Julia has a significant runtime, as a
garbage-collected language. So it's not really suited to writing something as
a library and then using it from glue languages, as you're proposing for
"computation languages", at least currently. I think that would remain true
even if it had the speed and flexibility and developer resources to not need
to call out to native libraries for its own purposes.

Which reminds me a bit of Java, where the speed is either there or getting
there for tight loops, but it just doesn't play well with others at all when
they want to do the driving.

~~~
kxyvr
That's fair. And, certainly, there's nothing wrong with a glue language geared
toward computation. Then, from my perspective, the question becomes whether
Julia provides good resources for the end application: stuff like good
plotting, reading from databases and diverse file formats, easy-to-generate
GUIs, etc. Honestly, that's part of why I think Python became popular in the
computation world. Personally, I dislike the language, but I support it
because there's code floating around to do just about anything for the end
application, and that's hugely useful.

There's one other domain where, depending, Julia may fit well. At the moment,
I prototype everything in MATLAB/Octave because the debugger drops us into a
REPL where we can perform arbitrary computations on terms easily. Technically,
this is possible in something like Python, but it's moderately hateful
compared to MATLAB/Octave, because factorization, spectral analysis, and
plotting can be done extremely easily in MATLAB/Octave. That said, I tend not
to keep my codes there since MATLAB/Octave are not good, in my opinion, for
developing large, deliverable applications. As such, in my business, where I
quickly develop one-off prototype codes on a tight deadline, maybe it would be
a reasonable choice.

Though, thinking about it, there may be licensing problems. The value in
MATLAB is that they provide the appropriate commercial license for codes like
FFTW and the good routines out of SuiteSparse rather than the default GPL. I'm
looking now and it's not clear to me Julia provides the same kind of cover.
This complicates the prototyping angle.

