
Backpropagation is a leaky abstraction - nafizh
https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b#.lbzzq2acs
======
fizixer
Backpropagation is a leaky abstraction in the sense that every
algorithm/physics-principle/mathematical-theorem is a leaky abstraction. Take
'sort.' If you use a sort API for large N in a performance-critical section of
your code without knowing if the implementation of that sort is an insertion-
sort or a quicksort, "you would be nervous." Hence you are dealing with a
leaky abstraction, per this article.

I would much rather respond to the students who ask about why they need to
know the details, by saying "because you registered for this class."

Essentially the question "if you’re never going to write backward passes once
the class is over, why practice writing them?" is another way of asking "why
reinvent the wheel?". And the best answer is "because learning is all about
reinventing the wheel". The "code reuse as much as possible" mantra applies to
when you're "using" a technique to do something else, not when you're
"learning" the technique itself. They might as well register for Calculus and
ask "why learn integrals and derivatives when mathematica can do them for
you", or take an Aerodynamics class and ask "why learn fluid mechanics and
dynamics, heck newton's laws, when an airplane can run on autopilot." I doubt
"because calculus is a leaky abstraction" or "because fluid dynamics is a
leaky abstraction" is a good answer to that.

~~~
skybrian
I think you're sweeping an important distinction under the rug.

If every major language provides an O(n log n) sort function, is it still a
leaky abstraction? I'd say no. You can use it without worrying much about the
details.

But it sounds like the situation with back-propagation is different, since the
internal details of the algorithm affect whether you get a usable answer at
all.

A borderline case might be something like a SQL database, where in theory,
creating an index shouldn't change query results, but in practice, the
performance can change so much that it effectively does.

Unlike with back-propagation, you can tune database performance without
worrying about queries returning different results (assuming they don't time
out). So it's still a useful guarantee.

~~~
sirclueless
Speaking from experience, I've had to worry about what implementation of sort
is being used in many languages, from Java to C++, even Go and Python.

There are a lot of details to get right: how are elements compared? Is the
sort stable? Is it efficient for small N? Is it efficient for nearly-sorted
arrays? Is it efficient when almost all the elements compare equal? Is it
guaranteed O(n log n), or only on average? If only on average, is there an
input that reliably triggers O(n^2) behavior, making it a denial-of-service
vector?

Anything is a leaky abstraction when you care enough.
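To make the stability point concrete, here's a tiny Python check (my own toy
data, not from the thread). Python's built-in sort is documented to be stable,
so records with equal keys keep their original relative order:

    records = [("alice", 2), ("bob", 1), ("carol", 2), ("dave", 1)]
    by_grade = sorted(records, key=lambda r: r[1])
    print(by_grade)
    # [('bob', 1), ('dave', 1), ('alice', 2), ('carol', 2)] -- bob stays ahead of dave

If your sort doesn't guarantee that, code that silently depends on it breaks.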

~~~
im3w1l
Those properties should ideally be part of the documentation so that the
abstraction stops leaking.

~~~
4ad
One could argue that if these details are necessary for correct/performant
operation, then it _is_ a leaky abstraction.

~~~
im3w1l
To me a leaky abstraction is an abstraction that does not expose all the
relevant details. So if those details are written on the spec sheet of the
black box then there is no leak.

If the box is only labelled with O(n log n) without specifying constants, then
there is a leak.

------
tsomctl
> Why do we have to write the backward pass when frameworks in the real world,
> such as TensorFlow, compute them for you automatically?

Why do you have to learn to calculate integrals and derivatives in school, or
how compilers work internally? Same answer. But seriously, the CS231n class is
excellent, and Andrej is an excellent teacher. You can follow along at home
(which is what I am doing). The syllabus (at
[http://cs231n.stanford.edu/syllabus.html](http://cs231n.stanford.edu/syllabus.html))
has the course notes and the assignments. The assignments are self-grading, so
you know when you have programmed them correctly. The lectures are here:
[https://www.youtube.com/playlist?list=PLlJy-eBtNFt6EuMxFYRiNRS07MCWN5UIA](https://www.youtube.com/playlist?list=PLlJy-eBtNFt6EuMxFYRiNRS07MCWN5UIA)

~~~
gime_tree_fiddy
Sort of off topic. I have been planning to do the same course (CS231n), and
was wondering: is it okay if I do Andrew Ng's Machine Learning course
(Coursera) in parallel, or is it more of a prerequisite?

~~~
tsomctl
I only did the first several videos of Andrew Ng's class, and didn't like it.
The only prerequisites to CS231n are linear algebra, calculus, and Python.

------
bmh100
After reading the article, my summary of the message is not "backpropagation
is a leaky abstraction" but rather "if you don't understand how the
derivatives are being calculated, it will come back to bite you". The author
covers the issues that arise when sigmoid-style activation functions saturate
and produce near-zero gradients (part of the historical reason for moving away
from these functions), the vanishing/exploding gradient problem in RNNs caused
by repeated multiplication, the issues with clipping gradients as a fix, and
the dead, zero-valued gradients of ReLUs. This would be
equivalent to being a DBA without understanding indexes, or being a web
developer without understanding the DOM. Yes, you can get by for a while,
coasting on your tools. But when the stakes are high and you need to get it
right, that ignorance will hamstring you. You don't want to be put in the
situation of building a NN for someone and having no good idea about why it
isn't working yet.
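To make the saturation point concrete, here is a quick numpy sketch (my own,
not from the article) of the sigmoid's local gradient collapsing for large
inputs:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for z in [0.0, 2.0, 10.0]:
        s = sigmoid(z)
        print(f"z={z:5.1f}  sigmoid={s:.6f}  local grad={s * (1 - s):.6f}")
    # z=  0.0  sigmoid=0.500000  local grad=0.250000
    # z=  2.0  sigmoid=0.880797  local grad=0.104994
    # z= 10.0  sigmoid=0.999955  local grad=0.000045

Whatever gradient arrives from above gets multiplied by that local term, so a
saturated unit passes almost nothing back to its weights.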

------
tnecniv
The phrase "raw numpy" strikes me as funny. I would figure that's about as
abstract as you could get while still working with the math (discarding
symbolic engines).

~~~
p1esk
Yes, after implementing a simple neural network in C (with AVX, and pthreads),
"raw numpy" does sound funny!

On the other hand, try implementing a convnet in numpy, especially the
backprop, and you might start feeling some of its "rawness" :-)

~~~
TPCrow
As someone who also hand-coded a neural network implementation, forward and
back prop as well as an RNN, in C, yeah, "raw numpy" is a joke.

Something that has always bothered me about people who ask "why do I have to
write backprop when TF does it for me?"

Here's why: you go to a company, and they want you to incorporate machine
learning into their C++ engine. Have fun using numpy. You said you knew machine
learning, right? Implement backprop for me, you can do that, right?

------
alxmdev
Questions like _" Why do we have to write X, when framework Y does it for
you?"_ are why I dislike the reinventing the wheel analogy, especially when it
finds its way in education. There's no substitute for the deep understanding
you get by solving a complex problem yourself from beginning to end. Students
complaining about implementing a foundational algorithm instead of using a
framework is depressing.

Not to mention that computer science and software engineering are such young
fields that it seems unhealthy to take readily available abstractions as
absolute givens. Everything stands to improve, even products and concepts that
have been around for decades and that everyone uses.

~~~
posterboy
> There's no substitute for the deep understanding you get by solving a
> complex problem yourself from beginning to end

Yes there is: watch someone else do it. In fact, as the saying goes, there are
three ways to learn: trial and error, copying, and insight. I'd be hard pressed
to explain the difference between trial and error and insight, but I wouldn't
confuse them either, because only one of them is painful.

~~~
tnecniv
I disagree. To paraphrase the intro to my Linear Systems book, "math is a
contact sport." This applies to most intricate topics. If an expert takes you
through a one-hour tour of a subject, you will get the salient points, but
there's a lot of intuition that gets lost if you don't struggle with the
material on your own. That's why we have homework.

------
aibottle
People complaining about having to implement backprop in an ML class? Nice one.
I had to hand in this exact assignment last Friday. A valuable lesson, that's
for sure. Generally, I love how a proper course on machine learning really
contradicts the "universal power weapon" narrative the media have put on
machine learning in the last few years. It isn't so magical anymore once it
boils down to proper algorithm use.

~~~
vertex-four
> People complaining about having to implement backprop in a ML class?

Wouldn't that suggest that other classes the students take have not been
academic enough - i.e. that they focus too much on "things you might use day-
to-day" vs "this is how/why things work"?

~~~
tnecniv
This attitude is not uncommon in programming / CS. It's similar to those who
complain about learning half a dozen search algorithms when they rarely need
to implement them in practice.

------
annnnd
For anyone trying to learn backpropagation but having trouble with the math, I
can't recommend Matt Mazur's "Step by Step" guide [0] enough. What is great is
that he uses real numbers, so you can check your implementation for
correctness.

[0] [https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/](https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/)
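In the same spirit of checking against real numbers, here is a minimal
numerical gradient check in Python/numpy (my own sketch; f stands in for
whatever loss your hand-written backward pass computes gradients of):

    import numpy as np

    def numerical_grad(f, w, eps=1e-5):
        # Centered differences, one coordinate at a time.
        grad = np.zeros_like(w)
        for i in range(w.size):
            w_plus, w_minus = w.copy(), w.copy()
            w_plus.flat[i] += eps
            w_minus.flat[i] -= eps
            grad.flat[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
        return grad

    # Example: f(w) = sum(w**2) has analytic gradient 2*w.
    w = np.array([0.5, -1.2, 3.0])
    f = lambda v: np.sum(v ** 2)
    print(np.allclose(numerical_grad(f, w), 2 * w))  # True

Comparing the two gradients is the standard way to catch bugs in a hand-rolled
backward pass.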

~~~
cjf101
I used this recently when I was relearning NN theory (the last time I had
looked at it before 2015 was in the late '90s), and I agree. It's a thorough
walkthrough of the math that turned the lightbulbs back on enough for me to
write a simple Swift MLP NN without consulting other implementations.

------
vazamb
Backprop is such a fundamental part of neural networks that I am very surprised
anyone would complain about having to know how it works. It is true that once
you have grokked the principle, implementing it for more than 2 layers is
pretty tedious. I would propose, however, that without having done the tedious
work at least once you cannot truly understand it.
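For what it's worth, this is roughly what the exercise looks like for a toy
2-layer network in numpy (my own sketch with made-up shapes, not the course
assignment):

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(4, 3)            # 4 samples, 3 features
    y = np.random.randn(4, 1)            # regression targets
    W1, W2 = np.random.randn(3, 5), np.random.randn(5, 1)

    # Forward pass
    h_pre = X @ W1                        # (4, 5)
    h = np.maximum(0, h_pre)              # ReLU
    pred = h @ W2                         # (4, 1)
    loss = np.mean((pred - y) ** 2)

    # Backward pass: the chain rule, layer by layer
    d_pred = 2 * (pred - y) / y.size      # dL/dpred
    d_W2 = h.T @ d_pred                   # dL/dW2
    d_h = d_pred @ W2.T                   # dL/dh
    d_h_pre = d_h * (h_pre > 0)           # ReLU gate zeroes gradient where input < 0
    d_W1 = X.T @ d_h_pre                  # dL/dW1

Every extra layer just repeats the last few lines, which is exactly where the
tedium comes from.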

------
conjectures
Much discussion of backprop could be avoided by recalling that the bit that
does the work is the chain rule from calculus.

Error terms represent a sum and product of derivatives. The product of a bunch
of terms will tend to get really big or really small.

The rest is detail: are the terms confined to some interval? Which one? How
many are we multiplying? How many are we summing over? Do we doctor the sum
after we get it?
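A quick numeric illustration of that blow-up/die-out behavior (my own, assuming
50 layers of identical local gradients):

    import numpy as np

    for scale in [0.9, 1.0, 1.1]:
        terms = np.full(50, scale)    # 50 layers' worth of local gradients
        print(scale, np.prod(terms))
    # 0.9 -> ~0.005 (vanishes), 1.0 -> 1.0, 1.1 -> ~117 (explodes)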

------
geuis
Edit: hmm not sure why this comment is getting downvoted.

Backprop isn't just a leaky abstraction. There's not really any evidence yet
that biological neural networks use anything like backprop. So it's important
that students be taught the low-level aspects of the current state of the art
so that better architectures can be invented in the future.

(Note: at the very end of this comment, I am leaving a link to one hypothesis
about how something like backprop may happen at the dendritic level.)

I've been doing a deep dive on ML and bio neurons over the last week or so to
try to get up to speed. The feedforward networks with backprop that are doing
such awesome work right now have almost nothing in common with actual
neurons.

Just some differences: (neuron == bio, node == artificial)

Terms: Neuron resting potential is about -70mV. Action potential (threshold)
is about -55mV.

1) Neurons are spiky and don't have weights. A neuron sums all of its inputs
to reach a threshold. When that threshold is reached, an electric current
(action potential) is spread equally to all synaptic terminals. A node has a
weight value for each input node that it multiplies by the value received from
the input node. All of these values are then summed and passed to the
connected nodes in the next layer.

2) There are multiple types of neurons, but for simplification there are 2
types, called "excitatory" and "inhibitory". A neuron can _mostly_ only
emit "positive" or "negative" signals via the synapse. Different
neurotransmitters either add to or subtract from the voltage potential at
each dendritic input. The total sums work together to cross the threshold
voltage. I say mostly because there is some evidence that a minority of
neurons can release neurotransmitters from both groups.

3) Artificial networks are generally summation-based with no notion of time.
Biological networks use summation and summation over time together to determine
whether they should fire. If I recall correctly, a neuron can repeatedly
fire about 300 times per second. A dendritic input can last for up to 1.5
milliseconds. So if a neuron gets enough positive inputs at the same time, or
collects enough over time, it will fire. (A toy sketch contrasting the two
kinds of summation follows this list.)
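A toy Python sketch of that contrast (my own simplification; the constants are
illustrative only, with the -70/-55 mV values taken from the terms above):

    import numpy as np

    # Artificial node: a single weighted sum of inputs, then a nonlinearity.
    inputs, weights = np.array([0.2, 0.7, 0.1]), np.array([0.5, -1.0, 2.0])
    activation = np.tanh(inputs @ weights)

    # Spiking-style node: leaky integration toward a threshold over time steps.
    v_rest, v_threshold = -70.0, -55.0    # mV
    v, leak = v_rest, 0.9
    for t, input_current in enumerate([3.0, 4.0, 5.0, 6.0, 2.0]):
        v = v_rest + leak * (v - v_rest) + input_current
        if v >= v_threshold:
            print(f"spike at step {t}")   # fires once enough input has accumulated
            v = v_rest                    # reset after firing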

I haven't yet found any hypotheses or experiments that try to explain how
reinforcement learning takes place in neurons in the absence of
backpropagation. I'm pretty sure that information is out there, but I haven't
run across it.

Overall I think that there will be multiple engineering approaches to AI, just
like there are to construction and flight. We understand how birds and bees
fly, but we don't build planes the same way.

It's important to remember that cognition is based in physics just like any
other physical system. Once the principles are well understood, there are
multiple avenues to using them.

Here is a short collection of links that I've been finding helpful.

[https://en.wikipedia.org/wiki/Neurotransmission](https://en.wikipedia.org/wiki/Neurotransmission)

[http://neuroscience.uth.tmc.edu/s1/introduction.html](http://neuroscience.uth.tmc.edu/s1/introduction.html)

[https://www.quora.com/Can-a-synapse-change-from-being-inhibitory-to-excitatory-What-is-the-mechanism-behind-it](https://www.quora.com/Can-a-synapse-change-from-being-inhibitory-to-excitatory-What-is-the-mechanism-behind-it)

[https://en.wikipedia.org/wiki/Dale's_principle](https://en.wikipedia.org/wiki/Dale's_principle)

[https://en.wikipedia.org/wiki/Neural_backpropagation](https://en.wikipedia.org/wiki/Neural_backpropagation)

~~~
mattkrause
In terms of big picture stuff, you're absolutely right that many DNNs are more
"inspired by" the brain and less a faithful model of it. However, a lot of the
things mentioned in your post are either overstated or outright wrong. For
example:

1. Neurons, or more specifically, connections between neurons ("synapses"),
_absolutely_ do have weights, and the "strength" of synapses can be adjusted
by a variety of properties that act on a scale of seconds to hours or days. At
the "semi-permanent" end of the spectrum, the location of a synapse matters a
lot: input arriving far from the cell body has much less influence on the
cell's spiking. The number (and location?) of receptors on the cell surface can
also
affect the relative impact of a given input. Receptors can be trafficked
to/from the membrane (a fairly slow process), or switched on and off more
rapidly by intracellular processes. You may want to read up on long-term
potentiation/depression (LTP/LTD), which are activity-dependent changes in
synaptic strength. There are a whole host of these processes, and even some
(limited) evidence that the electric fields generated by active neurons can
"ephaptically" affect their neighbors, even without making direct contact,
which would allow for millisecond-scale changes.

2. While you can _start_ by dividing neurons into excitatory and inhibitory
populations, there's a lot more going on. On the glutamate (excitatory) side,
AMPA receptors let glutamate rapidly excite a cell and make it more likely to
fire. However, it also controls NMDA channels that, under certain
circumstances, allow calcium into a cell. These calcium ions are involved in
all sorts of signaling cascades (and are involved--we think--in tuning
synaptic weights). GABA typically hyperpolarizes cells (i.e., makes them less
likely to fire) and is secreted by cells called interneurons. However,
there's a huge diversity of interneurons. Some seem to "subtract" from
excitatory activity, others can affect it more strongly in a divisive sort of
way or even cancel it completely. Furthermore, there's a whole host of other
neurotransmitters. Dopamine, which is heavily involved in reward, can have
excitatory or inhibitory effects, depending on whether it activates D1 or D2
receptors.

3. While the textbook feed-forward neural networks certainly have "instant"
signal propagation, there are lots of other computational models that do
include time. Time-delay neural networks are essentially convnets extended
over time instead of space. Reservoir computing methods like liquid state
machines also handle time, but in a much more complicated way.

4. I chuckled at the idea of finding a biological analog of
reinforcement learning, since reinforcement learning was initially inspired by
the idea of reinforcement in psychology/animal behavior. People have shown
that brain areas--and individual neurons within them--encode action values,
state estimates, and other building blocks of reinforcement learning. Clearly,
we have a lot to discover still, but the general idea isn't at all
implausible.

Finally, some people are fairly skeptical that the fields have much to learn
from each other; Jürgen Schmidhuber said this a lot at NIPS last year.
However, other, equally-smart people (e.g., Geoff Hinton) seem to think that
there may be a common mechanism, or at least a useful source of inspiration
there. But, if you want to work on something like this (and it is awesomely
interesting), it really helps to have a solid grounding in both.

~~~
geuis
This is _exactly_ the kind of response I was hoping for. Thanks Matt! If it's
not an inconvenience, could you drop any links to the topics you referenced,
especially the ones that differed from what I've been studying, to
charles@geuis.com? I'm going a bit deeper now and reading some studies from
the early '90s and some that are more recent. It's kind of a crapshoot as to
what I can google for, so a guided search would be immensely helpful.

~~~
mattkrause
Hmmm...it's hard to do entire fields justice, but here's an attempt.

There are a couple of standard neurobiology textbooks, like Kandel, Jessell,
and Schwartz's _Principles of Neural Science_, Purves et al.'s _Neuroscience_,
and Squire et al.'s _Fundamental Neuroscience_. These are huge books that
cover a bit of everything, and you should know that they exist, but I wouldn't
necessarily start there.

If you're specifically interested in computation, I would start with David
Marr's _Vision_. It's quite old, but worth reading for the general approach he
takes to problem-solving. He proposes attacking a problem along three lines:
at the computational level ("what operations are performed?"), the algorithmic
level ("how do we do those operations?"), and the implementation level ("how
is the algorithm implemented").

From there, it depends on what you're interested in. At the single-cell level,
Christof Koch has a book called _The Biophysics of Computation_ that "explains
the repertoire of computational functions available to single neurons, showing
how individual nerve cells can multiply, integrate, and delay synaptic input"
(among other things). Michael London and Michael Häusser have a 2005 _Annual
Reviews in Neuroscience_ article about dendritic computation that hits on some
similar themes (here:
[https://www.researchgate.net/publication/7712549_Dendritic_c...](https://www.researchgate.net/publication/7712549_Dendritic_computation)
), along with this short review
([http://www.nature.com/neuro/journal/v3/n11s/full/nn1100_1171...](http://www.nature.com/neuro/journal/v3/n11s/full/nn1100_1171.html))
by Koch and Segev, and a 2014 review by Brunel, Hakim, and Richardson
([http://www.sciencedirect.com/science/article/pii/S0959438814...](http://www.sciencedirect.com/science/article/pii/S0959438814000130)).
Larry Abbott has also done interesting work in this space, as have Haim
Sompolinsky and many others. Gordon Shepherd and his colleagues maintain NEURON
(a simulation package/platform) and a database of associated models (ModelDB)
here: [https://senselab.med.yale.edu/](https://senselab.med.yale.edu/) if you
want something to download and play with (they also do good original work
themselves!)

Moving up a bit, the keywords for "weight adjustment" are something like
synaptic plasticity, long-term potentiation/depression (LTP/LTD), and perhaps
spike-timing dependent plasticity. The scholarpedia article on spike-timing
dependent plasticity is pretty good
([http://www.scholarpedia.org/article/Spike-timing_dependent_plasticity](http://www.scholarpedia.org/article/Spike-timing_dependent_plasticity));
Scholarpedia is actually a pretty good
resource for most of these topics. The intro books above will have pretty good
treatments of this, though maybe not explicitly computational ones.

More to come; however, I also just found this class from a bunch of heavy
hitters at NYU:
[http://www.cns.nyu.edu/~rinzel/CMNSF07/](http://www.cns.nyu.edu/~rinzel/CMNSF07/)
Those papers are a good place to start!

~~~
geuis
Great info. Definitely have plenty to read over the next couple weeks now.

------
cr0sh
I do think the complaint about having to write the backward pass seems
especially shallow; finding out they were working with numpy makes it even more
so (since numpy takes the pain out of the matrix operations). IIRC, when I took
the ML Class in 2011, we used Octave, but Ng had us first write stuff "the hard
way" - so we'd understand what was going on later when we used Octave's
methods.

Something about this article as a whole, though, does raise a question that
I've been wondering about, and I want to present it here for a bit of
discussion (maybe it needs its own thread?):

Does anyone else here think that the current approach to neural networks has
some fundamental flaws?

Now - I'm not an expert; call me an interested student right now, with only
the barest of experience (beyond the ML Class, I also took Udacity's CS373
course, and I am also currently enrolled in their Self-Driving Car Engineer
nanodegree program).

I understand that what we currently have and know does work. What I mean by
that is the basic idea of an artificial neural network using forward and back-
prop, multiple layers, etc (and all the derivatives - RNN, CNN, deep learning,
etc). I understand the need and reasoning behind using activation functions
based around calculus and derivatives and the chain-rule, etc (though I admit
I need further education in these items).

But something nags at me.

All of this, despite the fact that it works and works well (provided all your
tuning and such is right, etc), just seems like it is over-complicated. Real
neurons don't use calculus and activation functions, nor back-propagation, etc
in order to learn. All of those things in an ANN are just abstractions and
models around what occurs in nature.

Maybe (probably?) I am wrong - but it seems like what nature does is simpler.
Much less power is used, for instance, and the package is much more compact. I
just have this feeling that in some manner we may have gone down a path that,
while it has produced a working representation, has left that representation
overly complex, and had we taken another approach (whatever that might be?),
our ANNs would look and work much differently - perhaps even more efficiently.

About the only alternatives I have heard about otherwise have been things like
spike-train neural networks, and some of the other "closer to nature"
simulation (of ion pumps and real synapses, etc). Still, even those, while
seemingly closer, also have what appears to be too much complexity.

I'm probably just talking out of my nether regions as a general n00b to the
field. I do wonder, though, if there might be another solution, seemingly out
in "left-field" that might push things forward, if someone was willing to look
and experiment. It is something I plan to look into myself, as I find time and
such between lessons and other work for my current learning experience.

~~~
Retra
>Real neurons don't use calculus and activation functions, nor back-
propagation, etc in order to learn.

This sounds like a (common) failure to understand how abstractions work.
Bridges don't do calculus, but the bridge builder uses calculus to understand
what bridges _do_ use (the laws of nature), and so the calculus abstraction
is used to encode the behavior of bridges. Thus you can _model_ bridges using
calculus.

Similarly, neurons are modeled by calculus. Abstractions are abstract
precisely because they are _not_ the concrete thing they model: they are
necessarily approximations. They give us the power to simplify at the cost of
gaining the capacity to be wrong.

The point being this: you can literally use any abstraction you desire to
model anything you like. Some will work better than others, and the better
they work, the more closely the structure of your abstraction matches the
structure of the concretion being modeled.

~~~
pekk
If you fit some data with a very flexible function approximator, that does not
imply any kind of isomorphism between the function approximator itself and the
process generating the data.

Some people cannot understand this, and believe that if you can closely fit
the output of a process with a neural network, then the process
itself must in some way be related to neural networks.

------
thecity2
Backprop is just automatic differentiation. The end.
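For anyone who wants to see that claim concretely, here is a minimal scalar
reverse-mode autodiff sketch in Python (my own toy code; the Value class and
its names are made up). Calling backward() on the output is exactly a backward
pass:

    class Value:
        def __init__(self, data, parents=()):
            self.data, self.grad = data, 0.0
            self._parents, self._backward = parents, lambda: None

        def __add__(self, other):
            out = Value(self.data + other.data, (self, other))
            def backward():
                self.grad += out.grad            # d(a+b)/da = 1
                other.grad += out.grad           # d(a+b)/db = 1
            out._backward = backward
            return out

        def __mul__(self, other):
            out = Value(self.data * other.data, (self, other))
            def backward():
                self.grad += other.data * out.grad   # d(a*b)/da = b
                other.grad += self.data * out.grad   # d(a*b)/db = a
            out._backward = backward
            return out

        def backward(self):
            # Topologically sort the graph, then apply the chain rule in reverse.
            topo, seen = [], set()
            def visit(v):
                if v not in seen:
                    seen.add(v)
                    for p in v._parents:
                        visit(p)
                    topo.append(v)
            visit(self)
            self.grad = 1.0
            for v in reversed(topo):
                v._backward()

    x, y = Value(2.0), Value(3.0)
    z = x * y + x                # z = xy + x
    z.backward()
    print(x.grad, y.grad)        # 4.0 2.0  (dz/dx = y + 1, dz/dy = x)

Frameworks like TensorFlow do essentially this over tensor operations instead
of scalars.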

~~~
ww520
That's a succinct way of putting it.

------
thatsadude
Isn't backprop a clever application of chain rule in multivariate calculus?

------
godmodus
[https://m.youtube.com/watch?v=1SrHzSGn-I8](https://m.youtube.com/watch?v=1SrHzSGn-I8)

Feynman pretty much settles it.

------
rahrahrah
> “Why do we have to write the backward pass when frameworks in the real
> world, such as TensorFlow, compute them for you automatically?”

How many more times do you need to see the same phenomenon under different
guises before you stop asking stupid questions? "Hey teach, why do I need to
learn how to multiply if I can just use a calculator?"

~~~
posterboy
Isn't it a somewhat reasonable question, given the relatively recent advent of
TensorFlow compared to ML curricula? The stress is on: why don't we learn
TensorFlow / Caffe / etc.?

~~~
tnecniv
Because frameworks come and go. The important things are the abstract concepts.

At that level, they assume you are pretty smart and capable of figuring out
something like an API on your own time as needed. They'd rather you know what
all these funky things in these APIs are doing at a core level so that you can
employ them in an effective manner.

~~~
HelloNurse
Letting the computer do the menial work of constructing formulas and
generating code from user specifications is a fairly important framework
feature and "abstract concept".

