
Dynamic Automatic Differentiation of GPU Broadcast Kernels [pdf] - g0wda
http://www.mit.edu/~jvielma/publications/Dynamic-Automatic-Differentiation.pdf
======
jrevels
Author here; the arXiv version can be found at
[https://arxiv.org/abs/1810.08297](https://arxiv.org/abs/1810.08297). It's not
much different from OP's linked version, but it includes citations to other
interesting Julia AD/TPU-related papers that utilize this technique.

Happy to answer any questions, at least until I turn in for the night :)

~~~
joe_the_user
Interesting. I'm still researching GPU AD stuff and just skimmed your article.

AD is basically a code transformation method.

What's the most notable way the GPU in particular comes into play?

How does caching come into play? What about intrinsic condensing functions?

~~~
jrevels
I'll forward this to some of my GPU-expert coauthors in the morning to see if
they have a take on your questions. I think there are a few interesting facets
here, though, so here's my take.

> What's the most notable way the GPU in particular comes into play?

Forward-mode and reverse-mode AD have different expressibility constraints on
the kinds of programs that each can efficiently target in the face of
dynamism, and GPUs also have fairly constrained programming models compared to
the CPU. For me, a big part of the paper was the exploration of the
intersection of these two sets of programmability constraints.

Section 2.2.4 and the experimental sections explain some of this in detail,
but I think one of the more surprising results was that the benefits of fusing
dynamic control flow into the broadcasted derivative kernel outweighed the
potential detriments, e.g. warp divergence. It turns out newer GPU
architectures give you more leeway in that regard than any of us on the team
expected.
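
To make that concrete, here's a toy sketch of what a fused, branchy broadcast
derivative looks like. This isn't the paper's benchmark code; `f` and
`dualize` are just illustrative names, and I'm assuming the CuArrays/CUDAnative
stack the paper targets:

    using CuArrays, ForwardDiff  # CUDAnative handles the GPU codegen underneath

    # scalar kernel with data-dependent control flow (a classic source
    # of warp divergence)
    f(x, y) = x > y ? sin(x) * y : x * cos(y)

    # seed each element with a unit perturbation so the broadcast
    # carries the derivative along with the primal value
    dualize(x) = ForwardDiff.Dual(x, one(x))

    xs = cu(rand(Float32, 1024))
    ys = cu(rand(Float32, 1024))

    # the whole expression fuses into a single GPU kernel, branch and all
    out    = f.(dualize.(xs), ys)
    primal = ForwardDiff.value.(out)
    dfdx   = ForwardDiff.partials.(out, 1)

The branch executes inside the single fused kernel rather than forcing the
computation apart into separate launches, which is exactly the trade against
divergence that turned out to be worth it.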

> How does caching come into play?

Depends on which kind of "caching" you're referring to.

If you mean tape-level partial derivative caching/memory usage:

Broadcasting a forward-mode derivative operator, as presented in this paper,
can save on memory when it enables better fusion than reverse-mode on
complicated kernels (resulting in fewer temporaries).
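
To illustrate the temporaries point with plain broadcasts, AD aside — this is
just stock Julia dot-syntax fusion:

    # unfused: every broadcast materializes an intermediate array
    t1  = sin.(xs)
    t2  = t1 .* ys
    out = exp.(t2)              # three kernels, two temporary arrays

    # fused: one kernel; intermediates live only in registers
    out = exp.(sin.(xs) .* ys)

The forward-mode broadcast rides along with this same fusion machinery, so
the partial derivative computation gets folded into that single kernel too.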

However, there is also a question of _when_ this technique should actually be
employed: during the forward pass, or during the reverse pass? If employed in
the forward pass, then the primal and partial derivative calculations can be
fused, reducing compute cost. However, doing so means that the memory required
to store the partial derivatives is held captive until those derivatives can
be backpropagated in the reverse pass. Conversely, employing the technique in
the reverse pass allows you to free the partial derivative storage quickly,
but incurs some redundant computation. Section 2.2.3 of the paper discusses
this a bit.
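
Schematically, for a unary kernel (hypothetical names; `ȳ` stands in for the
incoming adjoint from backprop):

    using ForwardDiff

    g(x) = sin(x) * x                        # some unary kernel
    dualize(x) = ForwardDiff.Dual(x, one(x))

    xs = rand(Float32, 1024)                 # same story on a CuArray
    ȳ  = ones(Float32, 1024)                 # incoming adjoint

    # Strategy 1: fuse primal + partials in the forward pass. One
    # fused kernel, but `dy` stays allocated until the reverse pass.
    duals = g.(dualize.(xs))
    y     = ForwardDiff.value.(duals)
    dy    = ForwardDiff.partials.(duals, 1)
    # ...later, in the reverse pass:
    x̄ = ȳ .* dy

    # Strategy 2: primal only in the forward pass; recompute the
    # partials during the reverse pass, so no extra buffer is held
    # alive across the tape, at the cost of redundant primal work.
    y = g.(xs)
    # ...later, in the reverse pass:
    x̄ = ȳ .* ForwardDiff.partials.(g.(dualize.(xs)), 1)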

If you mean instruction-level caching, i.e. efficient pipelining of memory
into registers:

On the CPU, it's quite easy to thrash the cache for high-arity dual number
calculations (i.e. calculations where dual number instances carry around a
long stack-allocated array of partial derivatives). Our experiment in Section
3.4.1 tries to characterize the analogous GPU behavior by measuring how
occupancy scales with target calculation arity.
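
For context, an "arity N" dual number is just a primal value plus a
stack-allocated length-N tuple of partials, so every scalar operation touches
the whole tuple:

    using ForwardDiff

    # a 3-partial dual seeded for d/dx₁: value 2.0, partials (1, 0, 0)
    d = ForwardDiff.Dual(2.0, 1.0, 0.0, 0.0)

    sin(d)  # a Dual whose value is sin(2.0) and whose 3 partials all
            # get chain-ruled through; that per-element working set is
            # what pressures registers/cache as arity grows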

Also, there was definitely a bit of implementation work to ensure that loads
from our GPU-backed "dual number" arrays coalesce properly, that indexing
calculations get compiled away when possible, etc. The cool part is that the
dual numbers themselves are just the implementation provided by the
ForwardDiff package
([https://github.com/JuliaDiff/ForwardDiff.jl](https://github.com/JuliaDiff/ForwardDiff.jl)),
which contains no GPU-specific specialization, and they're automagically JIT-
compiled for the GPU by CUDAnative
([https://github.com/JuliaGPU/CUDAnative.jl](https://github.com/JuliaGPU/CUDAnative.jl)).
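
To give a flavor of the "automagic" part: you can write a completely ordinary
CUDAnative kernel over ForwardDiff's generic `Dual` type and it just compiles.
A toy sketch (not our actual kernels; `dsin_kernel!` is a made-up example):

    using CUDAnative, CuArrays, ForwardDiff

    # a plain CUDAnative kernel; `Dual` here is ForwardDiff's stock CPU
    # implementation, with no GPU-specific code anywhere in sight
    function dsin_kernel!(out, xs)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(xs)
            d = ForwardDiff.Dual(xs[i], one(xs[i]))
            out[i] = ForwardDiff.partials(sin(d), 1)  # d/dx sin(x) == cos(x)
        end
        return nothing
    end

    xs  = cu(rand(Float32, 1024))
    out = similar(xs)
    @cuda threads=256 blocks=4 dsin_kernel!(out, xs)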

> What about intrinsic condensing functions?

Hmm...I'm not positive I know what "intrinsic condensing functions" are.
Apologies!

