Would anyone care to tell me whether there is motivation to learn this topic for an ordinary ML engineer such as myself? It seems the ideas presented in this paper are really helpful for those who develop DL frameworks like PyTorch, but what about those who only use the frameworks?
Regardless of how useful these ideas are for me, I really respect researchers who publish excellent papers.
AD is important for ML practitioners to understand in the same way that compilers are important for programmers to understand. You can get away without knowing all the details, but it helps to understand where your gradients come from. However, this paper is probably not a good place to start if you're new to AD. If you want a better introduction, here are a few good resources:
This paper reinterprets AD through the lens of category theory, an abstraction for modeling a wide class of problems in math and CS. It provides a language to describe these problems in a simple and powerful way, and is the foundation for a lot of work in functional programming (if you're interested in that kind of stuff). There was a thread on HN recently that discusses why category theory is useful: https://news.ycombinator.com/item?id=18267536
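To make the "where your gradients come from" point concrete, here is a minimal forward-mode AD sketch using dual numbers. This is a generic illustration in pure Python, not code from the paper or from any framework; the `Dual` class and `f` are invented for the example:

```python
class Dual:
    """A dual number (value, derivative) for forward-mode AD."""

    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: (u * v)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__


def f(x):
    return x * x + 3 * x + 1  # f'(x) = 2x + 3


y = f(Dual(2.0, 1.0))  # seed dx/dx = 1
print(y.val, y.dot)    # 11.0 7.0, i.e. f(2) and f'(2)
```

Every overloaded operator carries the derivative along with the value, which is the core trick behind forward-mode AD; reverse mode (what PyTorch uses) instead records the operations and replays them backwards.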
You actually want to know the gist of how these autodiff libraries work, so that you can tell:
a) which approaches are fast and which approaches lead to giant complex gradient graphs.
b) which approaches are stable and which lead to numerically unstable gradients.
You would be surprised how much code is out there (even for influential papers) whose gradient graphs are obviously bad, or could easily be fixed for better stability, because people don't think about what the gradient looks like when they should.
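To illustrate point b), here is a generic example (not from the paper) of how two mathematically equivalent formulas can behave very differently numerically: the classic max-shift trick for softmax. If the forward pass overflows, the gradients flowing back through it are garbage too:

```python
import math

def softmax_naive(xs):
    # Naive formula: math.exp overflows for inputs around 710 and above.
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(xs):
    # Mathematically identical, but shifting by the max keeps every
    # exponent <= 0, so nothing can overflow.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

xs = [1000.0, 1001.0, 1002.0]
# softmax_naive(xs) raises OverflowError
print(softmax_stable(xs))  # ≈ [0.090, 0.245, 0.665]
```

Frameworks apply the same idea internally (e.g. fused log-softmax ops exist largely for this reason), which is why knowing which graph your code builds matters even if you never write a framework yourself.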