
An Introduction to GPU Programming in Julia - simondanisch
https://nextjournal.com/sdanisch/julia-gpu-programming
======
maxbrunsfeld
> GPUArrays never had to implement automatic differentiation explicitly to
> support the backward pass of the neural network efficiently. This is
> because Julia's automatic differentiation libraries work for arbitrary
> functions and emit code that can run efficiently on the GPU. This helps a
> lot to get Flux working on the GPU with minimal developer effort - and makes
> Flux GPU support work efficiently even for user-defined functions. That this
> works out of the box without coordination between GPUArrays + Flux is a
> pretty unique property of Julia.

Every time I read about Julia, I’m amazed. What a game changing tool.
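
For a concrete taste of what the quoted paragraph means, here is a minimal
sketch (assuming the Tracker-era Flux API of Julia 0.7/1.0, i.e. `gpu` and
`back!`; `myact` is a made-up user function):

    using Flux, CuArrays
    
    # A user-defined activation function; nothing GPU-specific about it.
    myact(x) = ifelse(x > 0, x, 0.01f0 * x)
    
    m = Chain(Dense(10, 5, myact), Dense(5, 2)) |> gpu  # weights onto the GPU
    x = gpu(rand(Float32, 10))                          # input onto the GPU
    
    l = sum(m(x))   # the forward pass runs on the GPU
    Flux.back!(l)   # so does the backward pass, including through myact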

------
daenz
GPGPU (general-purpose GPU) programming is pretty cool. A while back, I wrote
a utility that lets you do it in JavaScript, in the browser:
[https://github.com/amoffat/gpgpu.js](https://github.com/amoffat/gpgpu.js)

The thing to note about GPU programming is that the vast majority of the
overhead comes from data transfer. If your data set and results are very
large, it is sometimes faster overall to do the computation on the CPU, even
though the GPU performs each calculation faster on average thanks to its
parallelism. To illustrate, look at the benchmarks of gpgpu.js running a
simple kernel:

    
    
      CPU: 6851.25ms
      GPU Total: 1449.29ms
      GPU Execution: 30.64ms
      GPU IO: 1418.65ms
      Theoretical Speedup: 223.59x
      Actual Speedup: 4.73x
    

The theoretical speedup excludes data transfer while actual speedup includes
it. The longer you can keep your data set on the GPU to do more calculations
(avoiding back and forth IO), the bigger your net speed gains are.
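
The same principle in Julia terms, since that's what the article covers (a
rough CuArrays.jl sketch, with no measured numbers):

    using CuArrays
    
    a = rand(Float32, 10^7)
    
    # Wasteful: upload, one operation, download. The IO dominates.
    b = Array(cu(a) .* 2f0)
    
    # Better: upload once, chain many operations on the device, download once.
    ga = cu(a)
    for _ in 1:100
        ga = ga .* 2f0 .+ 1f0   # stays on the GPU
    end
    b = Array(ga)               # a single transfer back to the host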

~~~
pjmlp
WebGL compute shaders are being discussed; they might be a possible
improvement to your library once they are finally adopted.

~~~
johndough
Is there any news on WebGL compute shaders?

I remember OpenCL for Firefox being discussed and discarded in favor of
compute shaders about 7 years ago, and then when WebGL 2.0 was finally
released 5 years later, compute shaders were not part of it.

Additionally, SIMD.js was developed and then killed again, and support for
SIMD in WebAssembly has been delayed, so I don't believe that we'll be able to
implement efficient numerical methods in the browser any time soon.

~~~
pjmlp
Here is the proposal.

[https://docs.google.com/document/d/1EwhDJO_lBH1mGMMwheQUXGhh...](https://docs.google.com/document/d/1EwhDJO_lBH1mGMMwheQUXGhhFk9yoC98Ant3TPqwmmA/view#)

------
Athas
I'm a bit surprised to see that GPU Mandelbrot is at best only x75 faster
than (sequential?) CPU. Does Julia just generate _really fast_
(multicore/vectorized?) CPU code? Does it also count communication costs?
Fractal computations like that are extremely GPU friendly because they involve
no memory accesses at all, except for writing the final result. I would expect
at least two orders of magnitude improvement over a straightforwardly written
C implementation.

~~~
Athas
I had a suspicion that the problem was that _maxiters_ was set fairly low
(16). For the Mandelbrot sets I'm used to, this would result in relatively
little computation (over a hundred iterations is more common). To investigate
this hunch, I wrote a Futhark program as a proxy for a hand-written GPU
implementation (didn't quite have the patience for that this evening):

    
    
        import "lib/github.com/diku-dk/complex/complex"
    
        module c32 = mk_complex f32
        type c32 = c32.complex
    
        let juliaset (maxiter: i32) (z0: c32) =
          let c = c32.mk (-0.5) 0.75
          let (_, i) = loop (z, i) = (z0, 0) while i < maxiter && c32.mag z <= 4 do
                       (c32.(z * z + c), i+1)
          in i
    
        let main (n: i32) (maxiter: i32) =
          let N = 2**n
          let (w, h) = (N, N)
          let q = tabulate_2d w h (\i j -> let i = 1 - r32 i*(2/r32 w)
                                           let j = -1.5 + r32 j*(3/r32 h)
                                           in c32.mk i j)
          in map (map (juliaset maxiter)) q
    

I don't have the tools or skills to easily generate nice graphs, but I get
about a x271 speedup when this code is compiled to OpenCL and run on a Vega 64
GPU, versus when it is compiled to C. The C code runs in 272ms, which is very
close to the number reported for Julia here[0] (I assume that the page has a
typo and that the N=2²⁴ column actually means N=2¹², because an N=2²⁴ image
would take up dozens of TiB in GPU memory. Also, the benchmark code linked
only goes up to N=2¹².)

Unfortunately, if I change _maxiters_ to 256, the speedup actually _drops_,
to x182. So much for that theory.

Edit: also tried on an NVIDIA GTX 780Ti, and the results are the same. My new
theory is that the Julia code also counts the cost of moving the resulting
array back to the host (which can easily be dominant for a kernel that runs
for just a millisecond or two).
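
One way to check that theory from the Julia side is to time the kernel with
an explicit synchronization and compare against a timing that includes the
device-to-host copy. A sketch, using a juliaset function along the lines of
the article's and assuming CUDAdrv's `synchronize`:

    using CuArrays, CUDAdrv
    
    function juliaset(z0, maxiter)
        c = ComplexF32(-0.5, 0.75)
        z = z0
        for i in 1:maxiter
            abs2(z) > 4f0 && return (i - 1) % UInt8
            z = z * z + c
        end
        return maxiter % UInt8
    end
    
    q = cu(rand(ComplexF32, 2^12, 2^12))
    
    kernel_only = @elapsed begin
        r = juliaset.(q, 16)    # the broadcast kernel launch is asynchronous
        CUDAdrv.synchronize()   # wait for the kernel itself to finish
    end
    
    with_transfer = @elapsed Array(juliaset.(q, 16))  # includes the copy back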

[0]:
[https://github.com/JuliaGPU/GPUBenchmarks.jl/blob/master/res...](https://github.com/JuliaGPU/GPUBenchmarks.jl/blob/master/results/results.md#juliaset-1)

------
currymj
While having a Torch-esque GPU ndarray is great, the ability to easily write
your own kernels without having to compile gnarly C++ code is what sets Julia
apart from competitors IMO. Not sure if there's any other dynamic language
offering anything like this.
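
For reference, a hand-written kernel in plain Julia via CUDAnative.jl looks
like this (a minimal sketch along the lines of CUDAnative's examples):

    using CUDAnative, CuArrays
    
    # The kernel is ordinary Julia code; no C++ involved.
    function increment!(a)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(a)
            @inbounds a[i] += 1f0
        end
        return
    end
    
    a = cu(zeros(Float32, 1024))
    @cuda blocks=4 threads=256 increment!(a)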

~~~
dragandj
ClojureCL and ClojureCUDA have been doing it for 4 years now.

[https://clojurecl.uncomplicate.org](https://clojurecl.uncomplicate.org)

[https://clojurecuda.uncomplicate.org](https://clojurecuda.uncomplicate.org)

~~~
KenoFischer
Am I missing something? I was excited about other languages doing the same as
Julia, but the first example for ClojureCUDA has

    
    
        (def kernel-source
          "extern \"C\"
           __global__ void increment (int n, float *a) {
             int i = blockIdx.x * blockDim.x + threadIdx.x;
             if (i < n) {
               a[i] = a[i] + 1.0f;
             }
           };")
    

Is there an example where the kernel is written in Clojure?

~~~
dragandj
You can of course write Clojure/Lisp code that generates kernel code, but I
am strongly against that as the default option, since I find plain C much
better suited than either Lisp or Julia to small kernels close to the
hardware (which are only a small proportion of the overall code).

Please note that there is no need for an out-of-process compilation step. You
compile this by calling a Clojure function, interactively from the REPL.

~~~
KenoFischer
Sure, but that's a completely separate thing from what Julia is doing here.
The whole point of the GPU backend is that you get almost the full flexibility
of the Julia language at your disposal, without having to fall back to C code.
As a result, not only is your code more familiar to Julia programmers, but you
also get to share large amounts of code between CPU/GPU/TPU. If all you're
doing is super small kernels that may not matter much, but GPUs are so
powerful these days that it's perfectly sensible to offload rather large
programs that you may not want to write in C anymore.
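
To make the code-sharing point concrete, a minimal sketch:

    using CuArrays
    
    # One generic Julia function...
    f(x) = 3x^2 + 2x + 1
    
    a  = rand(Float32, 1024)
    ga = cu(a)
    
    f.(a)   # ...runs on the CPU,
    f.(ga)  # ...and the exact same code compiles for and runs on the GPU.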

~~~
3rdAccount
It's not just easier to read for Julia programmers. Any Pythonista can mostly
read Julia without significant issues. The two languages are high-level and
syntactically similar enough. I know Python is a lot more OO, but a short
Python script looks a lot like Julia minus the type annotations.

------
pjmlp
Love it! So much more fun than being stuck with C derived languages for GPGPU
programming.

~~~
tanderson92
You can do GPGPU programming in Fortran, which is definitely not C-derived;
it predates C by about 15 years.

~~~
3rdAccount
I find Fortran pretty pleasant for scientific work (what it was designed for)
and take pleasure in knowing it hasn't changed a whole lot over the years:
F77, F90, etc.

------
eigenspace
It seems kinda weird to tout how great it is that we have CuArrays and
CLArrays when CLArrays hasn't been updated for 1.0 and only claims
experimental support for 0.6.

Really hoping we see some movement on CLArrays in the near future.

~~~
keldaris
It is clearly mentioned in a separate paragraph which reads:

> For this article I'm going to choose CuArrays, since this article is written
> for Julia 0.7 / 1.0, which still isn't supported by CLArrays.

~~~
eigenspace
My bad, I should really finish reading things before I comment.

------
eghad
If anyone wants to try out a free GPU using Google Colab/Jupyter (a K80; you
might run into RAM allocation issues if you're not one of the lucky users who
get to use the full amount), here's a quick guide to getting a Julia kernel
up and running: [https://discourse.julialang.org/t/julia-on-google-colab-free...](https://discourse.julialang.org/t/julia-on-google-colab-free-gpu-accelerated-shareable-notebooks/15319)

~~~
simondanisch
Or just edit the article directly ;) That will also give you a K80, and you
can run the code right inside the article.

------
IshKebab
It doesn't really describe the fundamental difference between a GPU and a
4000-core CPU, which is that the GPU has a _shared program counter_. All the
cores must execute the same instruction at each cycle.

~~~
why_only_15
Are you sure? I'm pretty sure I can branch on core-specific values like (in
CUDA) threadIdx and blockIdx.
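
For example, this CUDAnative.jl kernel branches on the thread index (a
sketch; under SIMT the hardware runs both sides of a divergent branch
serially within a warp, masking off inactive lanes, so divergence costs time
but is perfectly legal):

    using CUDAnative, CuArrays
    
    function branchy!(a)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(a)
            # A divergent branch: even and odd threads do different work.
            @inbounds a[i] = iseven(i) ? 2f0 : -1f0
        end
        return
    end
    
    a = cu(zeros(Float32, 64))
    @cuda threads=64 branchy!(a)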

~~~
0-_-0
Yes but can you do that from Julia without writing CUDA/OpenCL code?

I'm happy that Julia supports GPU programming for simple code, but I don't see
how you can run algorithms with inter-thread communication.

~~~
simondanisch
Here's how you do it:

[https://github.com/JuliaGPU/GPUArrays.jl/blob/master/src/lin...](https://github.com/JuliaGPU/GPUArrays.jl/blob/master/src/linalg.jl#L29)

Or directly with cuda shfl intrinsics:
[https://github.com/JuliaGPU/CUDAnative.jl/blob/b249dfd145501...](https://github.com/JuliaGPU/CUDAnative.jl/blob/b249dfd145501632695cc6e03cdf98e49f5c8708/examples/reduce/reduce.jl#L22)
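
For a self-contained flavor of what inter-thread communication looks like in
pure Julia, here is a block-level sum reduction through shared memory (a
sketch using CUDAnative's intrinsics, not the exact code behind those links):

    using CUDAnative, CuArrays
    
    function block_sum!(out, a)
        tid = threadIdx().x
        shared = @cuStaticSharedMem(Float32, 256)
        shared[tid] = a[tid]
        sync_threads()              # barrier: all loads visible to the block
        s = 128
        while s >= 1
            if tid <= s
                shared[tid] += shared[tid + s]
            end
            sync_threads()
            s ÷= 2
        end
        if tid == 1
            out[1] = shared[1]
        end
        return
    end
    
    a = cu(rand(Float32, 256))
    out = cu(zeros(Float32, 1))
    @cuda threads=256 block_sum!(out, a)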

~~~
0-_-0
Oh cool! Thanks!

