Backpropagation is a leaky abstraction (medium.com/karpathy)
365 points by nafizh on Dec 19, 2016 | 101 comments

Backpropagation is a leaky abstraction in the sense that every algorithm/physics-principle/mathematical-theorem is a leaky abstraction. Take 'sort.' If you use a sort API for large N in a performance-critical section of your code without knowing if the implementation of that sort is an insertion-sort or a quicksort, "you would be nervous." Hence you are dealing with a leaky abstraction, per this article.

I would much rather respond to the students who ask about why they need to know the details, by saying "because you registered for this class."

Essentially the question "if you’re never going to write backward passes once the class is over, why practice writing them?" is another way of asking "why reinvent the wheel?". And the best answer is "because learning is all about reinventing the wheel". The "code reuse as much as possible" mantra applies to when you're "using" a technique to do something else, not when you're "learning" the technique itself. They might as well register for Calculus and ask "why learn integrals and derivatives when Mathematica can do them for you", or take an Aerodynamics class and ask "why learn fluid mechanics and dynamics, heck, Newton's laws, when an airplane can run on autopilot." I doubt "because calculus is a leaky abstraction" or "because fluid dynamics is a leaky abstraction" is a good answer to that.

I think you're sweeping an important distinction under the rug.

If every major language provides an O(n log n) sort function, is it still a leaky abstraction? I'd say no. You can use it without worrying much about the details.

But it sounds like the situation with back-propagation is different, since the internal details of the algorithm affect whether you get a usable answer at all.

A borderline case might be something like a SQL database, where in theory, creating an index shouldn't change query results, but in practice, the performance can change so much that it effectively does.

Unlike with back-propagation, you can tune database performance without worrying about queries returning different results (assuming they don't time out). So it's still a useful guarantee.

Speaking from experience, I've had to worry about what implementation of sort is being used in many languages, from Java to C++, even Go and Python.

There are a lot of details to get right: how are elements compared? Is the sort stable? Is it efficient for small N? Is it efficient for nearly-sorted arrays? Is it efficient when almost all the elements compare equal? Is it guaranteed O(n log n) or average? If average, is there an input that reliably triggers n^2 behavior making it a DDoS vector?

Anything is a leaky abstraction when you care enough.

Those properties should ideally be part of the documentation so that the abstraction stops leaking.

One could argue that if these details are necessary for correct/performant operation, then it is a leaky abstraction.

To me a leaky abstraction is an abstraction that does not expose all the relevant details. So if those details are written on the spec sheet of the black box then there is no leak.

If the box is only labelled with O(n log n) without specifying constants, then there is a leak.

I feel that if you care that much about the internals of the abstraction then you don't necessarily want an abstraction anymore, but rather the concrete thing that is being abstracted over. I think abstraction is an admirable quality that has numerous advantages, but if you need guarantees on the properties of what's under the abstraction then an abstraction is probably not what's needed at that point.

Except that for nontrivial use cases, the many different O(n log n) sort algorithms have trade-offs which matter. A friend just wrote his own sort implementation optimized to maximize cache hits, because of certain quirks of his data that he knows ahead of time.

In much the same way, much of the interesting work with machine learning requires deep specialist knowledge in which selecting the best approach requires digging beneath the abstraction. Many specific applications have their own quirks, which is where significant gains are often made. (To be honest, I think the last year was more exciting for the advances in applied ML than in theoretical work.)

A co-worker wrote a library for distributed training of DNNs for a specialized use case. Because of certain quirks of our use case (think sparsity and consistent patterns in the data), there were certain optimizations he could make to the training process that gave nice linear scaling in the number of training nodes.

What is O(n log n)? Time complexity? What about memory requirements? Last month I decreased the memory requirement from 16GB to 100MB in one implementation of a "stable&fast" algorithm. Really, using a function just because it's O(n log n) without understanding its characteristics is like a blind surgeon randomly amputating limbs because maybe it will help. But that's fine for me: I have more work then, and fixing software bugs made by incompetent* programmers is quite nice.

*incompetent - those who think they know something and are not willing to learn anything new

> What is O(n log n)? Time complexity?

And then worst case? Best case? Average case? Which applies to random input? Nearly-sorted input? What's the overhead for a "short sort", can/should I use this to sort small sequences in a tight-ish loop or is it only for large sequences?

> And then worst case? Best case? Average case?

I was under the illusion that Big-O was always asymptotic complexity (i.e. worst case) and that other notations (little-o, big-omega, big-theta, etc.) were used for best/average/etc case. Perhaps I'm wrong, however.

> Which applies to random input? Nearly-sorted input? What's the overhead for a "short sort", can/should I use this to sort small sequences in a tight-ish loop or is it only for large sequences?

Indeed. The details matter.

My favourite example is how a naive linear search can perform better than a non-linear search with a better (in O() terms) algorithm if, for example, the linear search can do most of its work in cache (e.g. small input sizes, or otherwise regular access patterns that are predictable for the prefetcher).

> I was under the illusion that Big-O was always asymptotic complexity (i.e. worst case) and that other notations (little-o, big-omega, big-theta, etc.) were used for best/average/etc case. Perhaps I'm wrong, however.

No, asymptotic notation only refers to the behaviour of the function (or algorithm) as you take the limit N->inf, i.e. as the size of the input grows. But the nature of the input often changes the behavior as well.

Thanks for the correction!

You can have Big-O of worst case and Big-O of average case. "Asymptotic complexity" itself does not imply worst case; the assumptions you make to construct the function itself determine worst case or average case.

Why do you assume the commenter doesn't understand the time and memory characteristics of quicksort?

Nobody mentioned quicksort. You're assuming quicksort, which is what the issue is. Many languages don't actually use quicksort, fyi. The practical performance considerations of just quicksort itself also change depending on how you've implemented the small details - are you moving memory around, or just changing pointers? Huge difference in the real world, no difference in the algorithm.

I oversimplified my response and it lost the point I was making. You are certainly correct that many languages don't use quicksort. My bad.

I was responding to a comment that I felt was needlessly dismissive. The default sort algorithm of any major language's standard library is usually not worth trying to beat, and calling someone "incompetent" for not reinventing the wheel each time they encounter data that needs to be sorted is absurd. Note: I said "usually", not "never". A competent programmer doesn't fix problems that don't exist, and you would absolutely have to profile code before convincing me that the choice of sorting algorithm was the source of a performance problem.

As far as the details go... In only a few languages (e.g. C) might you really even consider moving memory around rather than pointers. The semantics of standard library sort in Java, C#, Ruby, Python, Javascript... is sorting an array of primitives by value or objects by reference.

There is no quicksort mentioned.

> If every major language provides an O(n log n) sort function, is it still a leaky abstraction? I'd say no.

Great example :) You get different results if you use merge sort or quicksort.

The former is stable; the latter is not.
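
To make the stability point concrete, here is a minimal sketch. The records and the `selection_sort` helper are made up for illustration; Python's built-in `sorted` is documented as stable, while a textbook swap-based selection sort (standing in here for "some unstable sort") is not.

```python
from operator import itemgetter

def selection_sort(records, key):
    """Textbook swap-based selection sort: orders correctly, but is not stable."""
    out = list(records)
    for i in range(len(out)):
        # Index of the smallest remaining key (first one wins on ties).
        smallest = min(range(i, len(out)), key=lambda j: key(out[j]))
        out[i], out[smallest] = out[smallest], out[i]
    return out

# (priority, name) pairs; "a" and "b" share the same priority.
records = [(2, "a"), (2, "b"), (1, "c")]

stable = sorted(records, key=itemgetter(0))
unstable = selection_sort(records, key=itemgetter(0))

print(stable)    # [(1, 'c'), (2, 'a'), (2, 'b')]  equal keys keep input order
print(unstable)  # [(1, 'c'), (2, 'b'), (2, 'a')]  equal keys got swapped
```

Both outputs are "sorted by priority", so a contract that only says "puts items in order" cannot tell them apart.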

Okay...so you can be more specific with your API and specify whether stability is guaranteed. Eventually you can be specific enough that you essentially characterize the implementation.

> If every major language provides an O(n log n) sort function, is it still a leaky abstraction?

Yes, if the performance is still not acceptable and you look into the problem and discover that your scenario could benefit from radix sort, or one of partial sorting, or some kind of intermittent sorting, all of which would require investigating the specific case at a "white box" level, ignoring the existence of a black-box O(n-log-n) sort API call.
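
A rough sketch of what two of those "white box" alternatives can look like. The data and the `counting_sort` helper are hypothetical (counting sort is used as a simple stand-in for the radix-sort idea); `heapq.nsmallest` is from the Python standard library.

```python
import heapq
import random

def counting_sort(values, max_value):
    """Sort non-negative ints <= max_value in O(n + max_value), with no comparisons."""
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1
    out = []
    for v, c in enumerate(counts):
        out.extend([v] * c)
    return out

random.seed(0)
data = [random.randrange(256) for _ in range(10_000)]

# Partial sorting: only the 10 smallest items are needed, so skip the full sort.
top10 = heapq.nsmallest(10, data)

# Counting sort: exploits the known key range (0..255) instead of comparing elements.
by_count = counting_sort(data, max_value=255)
```

Neither trick is available if all you know about your data is "it is comparable", which is the point: the gain comes from knowledge that sits outside the generic sort API.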

The implementation is a black box but the performance characteristics could very well be part of the API contract. When you require and it's feasible to use radix sort you probably know not to use the standard sort function.

Yes, but reasoning about the API contract is exactly what being taught sorting algorithms helps you navigate, just like learning the guts of backprop helps the ML practitioner understand why things can go wrong.

> You can use it without worrying much about the details.

I invite you to start sorting data which is often already sorted (or often all identical) and tell me you didn't need to worry about the details.
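
As an illustration only: the sketch below is a deliberately naive quicksort that always picks the first element as pivot. It is an assumption made for the example, not a claim about how any particular standard library sorts. It counts comparisons so that the effect of already-sorted or all-identical input is visible.

```python
import random

def quicksort_comparisons(items):
    """Sort a copy with first-element-pivot quicksort; return (sorted_list, comparisons)."""
    data = list(items)
    comparisons = 0
    stack = [(0, len(data) - 1)]  # explicit stack avoids deep recursion on bad inputs
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot = data[lo]
        i = lo  # data[lo+1 .. i] holds the elements smaller than the pivot
        for j in range(lo + 1, hi + 1):
            comparisons += 1
            if data[j] < pivot:
                i += 1
                data[i], data[j] = data[j], data[i]
        data[lo], data[i] = data[i], data[lo]  # put the pivot in its final place
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return data, comparisons

n = 200
random.seed(0)
shuffled = list(range(n))
random.shuffle(shuffled)

result, shuffled_cmp = quicksort_comparisons(shuffled)
_, sorted_cmp = quicksort_comparisons(range(n))   # already sorted input
_, equal_cmp = quicksort_comparisons([7] * n)     # all elements identical

print(shuffled_cmp)  # on the order of n log n
print(sorted_cmp)    # n * (n - 1) / 2 = 19900
print(equal_cmp)     # n * (n - 1) / 2 = 19900
```

Same algorithm, same "average O(n log n)" label, and two very ordinary inputs push it to the quadratic case.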

Perhaps the distinction would be between things that are "fundamentally a heuristic with some math behind it" versus things that are "mostly theoretically settled with some corner cases you have to worry about".

And neural networks using backpropagation to minimize an error function are definitely the former while the "go to" techniques of mathematics are generally the latter.

I think this is generally acknowledged, in fact.

Sooo it seems logical that students of these modern machine learning techniques should be impelled to "get their hands dirty" with the numerical processes involved, since this will add to the data they use to drive the intuitions which allow them to avoid the multiple pitfalls of neural networks.

I think Andrej is arguing over and above the reasons you cite. Not only should you learn backprop because of the same reason you learn to do 2+2, but you should learn backprop ALSO because it's a leaky abstraction.

This is a non-trivial statement, because there are other things which are not leaky. For example, he's not arguing that deep learning practitioners should also learn assembly programming or go into how cuBLAS implements matrix multiplication. Although these things are nice to learn, you probably won't need them 99.9% of the time. Backprop knowledge, however, is much more crucial to design novel deep learning systems.

> Backprop knowledge, however, is much more crucial to design novel deep learning systems.

I would argue that it's not just for that. You need to understand what is happening inside a DNN if you want to construct it properly.

> "because you registered for this class."

The Authoritative Argument is not very convincing when your authority over the course material is what's being challenged. Maybe there is some irony or deeper truth, lost on me. Are you perhaps a leaky abstraction?


> The "code reuse as much as possible" mantra applies to when ...

... you are able to reasonably compromise, got it.

> The Authoritative Argument is not very convincing

Neither is a straw man argument.

He wasn't making an argument from authority. He was essentially saying "because you are learning, not doing".


No, implicitly and very subtly. The essence of the argument was the student in a voluntary class. The implications are manifold, but not stating the implications has a meaning on the meta level: that the student's understanding is not expected, but the student's expectation of an authoritative answer has to be satisfied anyway by stating the obvious to establish or retain authority.

If someone wants to know a little calculus for a small area of engineering, do they really need to learn various manual techniques for calculating integrals? Why can't they use Mathematica? (What if they're just taking calculus as a requirement of CS or premed?) They're not planning on developing new integration techniques. Does a pilot need to learn fluid dynamics like an airplane engineer?

Maybe education should only require what's actually practical. That can include things that can be solved by computers if they teach ideas that have practical value.

> If someone wants to know a little calculus for a small area of engineering, do they really need to learn various manual techniques for calculating integrals?

Yes, but the logic goes the other way: if they are unable to use the technique manually, then one may conclude they do not understand the idea well enough to use it, whether or not Mathematica is available. The two, manual application and understanding/intuition, go hand in hand and can't be separated. My understanding is that this has even been studied more formally, there is a paper I remember reading by Kahneman and somebody else [1] about the development of intuition and the tension between heuristics-and-biases and naturalistic decision making. In short, don't trust people so much, if they say they understand something but are unable to actually do it, be a bit more skeptical.

[1] Found it: https://www.ncbi.nlm.nih.gov/pubmed/19739881 On second thought, it's probably not as relevant as I remembered it.

Performance is the same thing as correctness only in hard real time.

A better analogy might be stable sort versus unstable. If you need a stable sort, an unstable one is wrong.

(Not sure if that counts as an abstraction leak, though! It's just the semantics. Unstable sort doesn't leak that it is unstable; that's its behavior. If a sort is just documented as putting items in order, an actual implementation does leak information about whether or not it is stable.)

"because fluid dynamics is a leaky abstraction"

No pun intended!

> Why do we have to write the backward pass when frameworks in the real world, such as TensorFlow, compute them for you automatically?

Why do you have to learn to calculate integrals and derivatives in school, or how compilers work internally? Same answer. But seriously, the CS231n class is excellent, and Andrej is an excellent teacher. You can follow along at home (which is what I am doing). The syllabus (at http://cs231n.stanford.edu/syllabus.html) has the course notes and the assignments. The assignments are self-grading: you know when you have it programmed correctly. The lectures are here: https://www.youtube.com/playlist?list=PLlJy-eBtNFt6EuMxFYRiN...

Sort of off topic. I have been planning to do the same course (CS231n), and was wondering, is it okay if I do the Andrew Ng Machine Learning course (Coursera) in parallel, or is it more of a prerequisite?

I only did the first several videos of Andrew Ng's class, and didn't like it. The only prerequisites to CS231n are linear algebra, calculus, and Python.

As long as you know a lil bit of math and coding, you're good to go. If you're watching the cs231n videos and don't understand something, you can always check out their amazing class notes.

The cs231n course covers a lot of material, including some (or most) of the material from Andrew Ng's Coursera course. You can totally do them in parallel though.

After reading the article, my summary of the message is not "backpropagation is a leaky abstraction" but instead "if you don't understand how the derivatives are being calculated, it will come back to bite you". The author talks about issues that arise with s-curve activation functions having saturated outputs and minimized gradients (part of the historical reason for moving away from these functions), the vanishing/exploding gradient problem in RNNs caused by repeated multiplication, the issue with clipping gradients as a solution, and the issue with zero-valued gradients in ReLUs. This would be equivalent to being a DBA without understanding indexes, or being a web developer without understanding the DOM. Yes, you can get by for a while, coasting on your tools. But when the stakes are high and you need to get it right, that ignorance will hamstring you. You don't want to be put in the situation of building a NN for someone and having no good idea about why it isn't working yet.
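
A small numeric sketch of two of those failure modes, using only the standard `math` module. The input values are arbitrary and chosen just to show the effect.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """d/dx sigmoid(x) = s * (1 - s); it peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """d/dx max(0, x): 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

for x in (0.0, 2.0, 10.0):
    print(f"sigmoid'({x}) = {sigmoid_grad(x):.6f}")

# Saturated sigmoid: the local gradient is tiny, so little signal flows backward.
saturated = sigmoid_grad(10.0)

# "Dead" ReLU: if the pre-activation is negative for every input, every local
# gradient is zero and the unit's weights never get updated.
pre_activations = [-3.2, -0.7, -1.5]
dead_unit_grads = [relu_grad(z) for z in pre_activations]
print(dead_unit_grads)
```

A framework will happily compute these zeros and near-zeros for you; noticing that they are the reason training stalled is the part that requires knowing what the backward pass does.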

The phrase "raw numpy" strikes me as funny. I would figure that's about as abstract as you could get while still working with the math (discarding symbolic engines).

Yes, after implementing a simple neural network in C (with AVX, and pthreads), "raw numpy" does sound funny!

On the other hand, try implementing a convnet in numpy, especially the backprop, and you might start feeling some of its "rawness" :-)

As someone who also hand-coded a neural network implementation, forward and back prop as well as an RNN in C, yea "raw numpy" is a joke.

Something I've always hated about people that say "why do I have to write a backprop when TF does it for me?"

Here's why: you go to a company, they want you to incorporate machine learning into their C++ engine. Have fun using numpy. You said you knew machine learning, right? Implement backprop for me; you can do that, right?

I know what you mean. To take partial derivatives with respect to the filter parameters from a correlation (or convolution), it's simpler to go down to the component level. However, it's hard to get back up to the matrix/vector level after doing so (to write the operations in NumPy).

I'm developing a model (not exactly a convnet) that uses a correlation step. Because of the above problem and its resulting pure-python loops, I may have to cythonize or use the NumPy C API for the gradient evaluation. Do you know of any examples I could check out that implement partial derivatives w.r.t. a correlation (or convolution) in "raw numpy"?
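
Not a pointer to an existing implementation, but here is a minimal 1-D sketch of the component-level derivation, written in plain Python and checked against finite differences. The signal `x`, filter `w`, and toy loss are made up for illustration. If I read the NumPy docs correctly, for real-valued 1-D arrays the two sums below should correspond to `numpy.correlate(x, grad_y, mode="valid")` and `numpy.convolve(grad_y, w, mode="full")`, but treat that mapping as an assumption to verify against the loops.

```python
def correlate_valid(x, w):
    """y[k] = sum_n x[k + n] * w[n], for every k where the filter fits."""
    return [sum(x[k + n] * w[n] for n in range(len(w)))
            for k in range(len(x) - len(w) + 1)]

def correlate_backward(x, w, grad_y):
    """Given dL/dy, return (dL/dw, dL/dx) for y = correlate_valid(x, w)."""
    # dL/dw[n] = sum_k grad_y[k] * x[k + n]   (a correlation of x with grad_y)
    grad_w = [sum(grad_y[k] * x[k + n] for k in range(len(grad_y)))
              for n in range(len(w))]
    # dL/dx[m] = sum_k grad_y[k] * w[m - k]   (a full convolution of grad_y with w)
    grad_x = [0.0] * len(x)
    for k in range(len(grad_y)):
        for n in range(len(w)):
            grad_x[k + n] += grad_y[k] * w[n]
    return grad_w, grad_x

def loss(x, w):
    """Toy loss: half the sum of squared outputs, so dL/dy = y."""
    return 0.5 * sum(v * v for v in correlate_valid(x, w))

def numeric_grad(f, values, eps=1e-6):
    """Central finite differences of f with respect to each entry of values."""
    grads = []
    for i in range(len(values)):
        up = list(values)
        down = list(values)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

x = [0.5, -1.0, 2.0, 0.25, 1.5]
w = [0.2, -0.4, 0.1]
grad_w, grad_x = correlate_backward(x, w, correlate_valid(x, w))

numeric_grad_w = numeric_grad(lambda w_: loss(x, w_), w)
numeric_grad_x = numeric_grad(lambda x_: loss(x_, w), x)
```

The nice part is that once the index gymnastics are done, both gradients turn out to be correlations/convolutions themselves, which is what lets you climb back up to the vectorized level.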

The phrase seems appropriate to me. The students are still working with matrices and linear algebra, focusing on the major algorithms. Going into the element-by-element linear algebra algorithms, for example in matrix multiplication, would be more appropriate for a high-performance, numerical computing class. The right level of detail is given when he talks about considering the behavior of individual gradient elements.

Questions like "Why do we have to write X, when framework Y does it for you?" are why I dislike the reinventing the wheel analogy, especially when it finds its way in education. There's no substitute for the deep understanding you get by solving a complex problem yourself from beginning to end. Students complaining about implementing a foundational algorithm instead of using a framework is depressing.

Not to mention that computer science and software engineering are such young fields that it seems unhealthy to take readily available abstractions as absolute givens. Everything stands to improve, even products and concepts that have been around for decades and that everyone uses.

> There's no substitute for the deep understanding you get by solving a complex problem yourself from beginning to end

Yes there is, watch someone else do it. In fact there are three ways to learn, as the saying goes: trial and error, copying, and insight. I'd be hard pressed to explain the difference of trial and error vs insight, but I wouldn't confuse them either, because only one of them is painful.

I disagree. To paraphrase the intro to my Linear Systems book, "math is a contact sport." This applies to most intricate topics. If an expert takes you through a one-hour tour of a subject, you will get the salient points, but there's a lot of intuition that gets lost when you don't struggle with the material on your own. That's why we have homework.

> Yes there is, watch someone else do it.

This is not as good. There's a lot of evidence from neuroscience that this leads to an illusion of competence. I.e. it's much easier to follow along through a sample solution than to craft a solution yourself. Until you craft the solution yourself, the knowledge won't actually be chunked as firmly in your brain.

I've seen this time and again, as a teacher, and it was mentioned explicitly in Coursera's Learning How To Learn course: https://www.coursera.org/learn/learning-how-to-learn

NOTE: watching someone is a good way to get started, but until you do it on your own, you haven't learned it deeply, and it may be difficult to recall in a real situation.

I think there are intangible things that get lost when you watch someone else do something as opposed to doing it yourself. The exercise is about going through the mental motions of understanding the problem, designing a solution, and iterating on it until it's correct. The last part is all about learning from your own mistakes, seeing what specific things trip you up, so you know to improve on them. That's not something you can get by watching others.

I agree that there isn't enough time in a life to learn everything you'd want to first hand or from a low level of abstraction, but school should be a place to do as much of it as possible. Just my 2c.

Watching someone else do it is still less effective than bashing your head against the problem yourself before watching them do it.

Whether or not you succeed in solving it on your own, it will emotionally invest you in the problem and its solution while showing you what didn't work and having a better handle on the shape of the problem. This lets you get more out of seeing someone else work out the solution.

People complaining about having to implement backprop in a ML class? Nice one. Had to hand in this exact assignment last Friday. A valuable lesson, that's for sure. Generally I love how a proper lesson on Machine Learning really contradicts the "universal power weapon" narrative media put on machine learning in the last few years. Everything is not so magic anymore when it boils down to proper algorithm use.

> People complaining about having to implement backprop in a ML class?

Wouldn't that suggest that other classes the students take have not been academic enough - i.e. that they focus too much on "things you might use day-to-day" vs "this is how/why things work"?

This attitude is not uncommon in programming / CS. It's similar to those who pine about learning a half dozen search algorithms when they rarely need to implement them in practice.

Hear, hear. Indeed, "welcome" to everybody who takes the time to understand what (s)he is talking about.

Agreed. The most important lesson I took away from my ML class is that I have to learn more of optimization theory.

For anyone trying to learn backpropagation but having trouble with math, I can't recommend Matt Mazur's "Step by Step" guide [0] enough. What is great is that he is using real numbers so that one can check implementation for correctness.

[0] https://mattmazur.com/2015/03/17/a-step-by-step-backpropagat...
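
In the same spirit of checking an implementation against concrete numbers, here is a minimal gradient-check sketch. The one-hidden-unit network and all parameter values are arbitrary choices for illustration (they are not taken from the guide above): the hand-derived backward pass is compared with central finite differences.

```python
import math

def forward(params, x, t):
    """Tiny network: one sigmoid hidden unit, linear output, squared-error loss."""
    w1, b1, w2, b2 = params
    h = 1.0 / (1.0 + math.exp(-(w1 * x + b1)))
    y = w2 * h + b2
    return 0.5 * (y - t) ** 2, h, y

def backward(params, x, t):
    """Hand-derived gradients via the chain rule, in the order [w1, b1, w2, b2]."""
    w1, b1, w2, b2 = params
    _, h, y = forward(params, x, t)
    dy = y - t                    # dL/dy
    dz = dy * w2 * h * (1.0 - h)  # dL/d(pre-activation of the hidden unit)
    return [dz * x, dz, dy * h, dy]

def numeric_grad(params, x, t, eps=1e-6):
    """Central finite differences of the loss with respect to each parameter."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, x, t)[0] - forward(down, x, t)[0]) / (2 * eps))
    return grads

params = [0.5, -0.3, 0.8, 0.1]  # arbitrary illustrative values
analytic = backward(params, x=1.5, t=0.25)
numeric = numeric_grad(params, x=1.5, t=0.25)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_error)  # should be tiny if the backward pass is right
```

If you flip a sign or drop the `h * (1 - h)` factor in `backward`, the mismatch shows up immediately, which is the whole value of having real numbers to compare against.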

I used this recently when I was relearning NN theory (the last time I looked at it before 2015 was the late '90s), and I agree. It's a thorough walkthrough of the math that turned the lightbulbs back on enough for me to write a simple Swift MLP NN without consulting other implementations.

Backprop is such a fundamental part of Neural Networks, I am very surprised how anyone can complain about having to know how it works. It is true that once you have grokked the principle, implementing it for more than 2 layers is pretty tedious. I would propose, however, that without having done the tedious work at least once you cannot truly understand it.

Much discussion of backprop could be avoided by recalling that the bit that does the work is the chain rule from calculus.

Error terms represent a sum and product of derivatives. The product of a bunch of terms will tend to get really big or really small.

The rest is detail: are the terms in some interval? Which? how many are we multiplying? how many are we summing over? do we doctor the sum after we get it?
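
A toy illustration of the "product of a bunch of terms" point, under the simplifying assumption that every layer (or time step) contributes the same scalar factor. The `clip` helper is one hypothetical way of "doctoring" the result.

```python
def backprop_product(factor, steps):
    """Magnitude of a gradient that is multiplied by `factor` at each of `steps` layers."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

def clip(value, limit):
    """Clamp a scalar gradient to [-limit, limit]."""
    return max(-limit, min(limit, value))

vanishing = backprop_product(0.9, 100)  # terms slightly below 1: about 2.7e-5
exploding = backprop_product(1.1, 100)  # terms slightly above 1: about 1.4e4
clipped = clip(exploding, 5.0)          # bounded, but the true scale is lost

print(vanishing, exploding, clipped)
```

Factors of 0.9 and 1.1 look harmless one at a time; a hundred of them in a row differ by roughly nine orders of magnitude.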

Edit: hmm not sure why this comment is getting downvoted.

Backprop isn't just a leaky abstraction. There's not really any evidence yet that biological neural networks use anything like backprop. So it's important that students be taught the low-level aspects of the current state of the art so that better architectures can be invented in the future.

(Note: at the very end of this comment, I am leaving a link to one hypothesis about how something like backprop may happen at the dendritic level.)

I've been doing a deep dive on ML and bio neurons in the last week or so to try and get up to speed. The feedforward networks with backprop that are doing such awesome work right now share almost nothing in common with actual neurons.

Just some differences: (neuron == bio, node == artificial)

Terms: Neuron resting potential is about -70mV. Action potential (threshold) is about -55mV.

1) Neurons are spiky and don't have weights. A neuron sums all of its inputs to reach a threshold. When that threshold is reached, an electric current (action potential) is spread equally to all synaptic terminals. A node has a weight value for each input node that it multiplies by the value received from the input node. All of these values are then summed and passed to the connected nodes in the next layer.

2) There are multiple types of neurons, but for simplification there are 2 types called "excitatory" and "inhibitory". A neuron can mostly only emit "positive" or "negative" signals via the synapse. There are different neurotransmitters that either add to or subtract from the voltage potential for each dendritic input. The total sums work together to cross the threshold voltage. I say "mostly" because there is some evidence that a minority of neurons can release neurotransmitters from both groups.

3) Artificial networks are generally summation based with no inclusion of time. Bio networks use summation and summation over time together to determine if they should fire or not. If I recall correctly, a neuron can repeatedly fire about 300 times per second. A dendritic input can last for up to 1.5 milliseconds. So if a neuron gets enough positive inputs at the same time, or collects enough over time, it will fire.
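
A toy sketch of the contrast drawn in the list above. The resting and threshold values come from the terms given earlier (-70mV and -55mV); everything else, including the `leak` factor, the input sizes, and the use of a leaky integrate-and-fire model at all, is an assumption made for illustration and not a faithful model of a real neuron.

```python
def node_output(inputs, weights):
    """Artificial node: a weighted sum with no notion of time."""
    return sum(x * w for x, w in zip(inputs, weights))

def integrate_and_fire(input_mv, rest=-70.0, threshold=-55.0, leak=0.5):
    """Toy leaky integrate-and-fire neuron; returns the time steps at which it spikes.

    `leak` is the (assumed) fraction of the offset from rest kept at each step.
    """
    potential = rest
    spikes = []
    for step, delta in enumerate(input_mv):
        # Decay toward rest, then add this step's net synaptic input (in mV).
        potential = rest + (potential - rest) * leak + delta
        if potential >= threshold:
            spikes.append(step)
            potential = rest  # reset after the action potential
    return spikes

# One large input crosses the threshold on its own.
single = integrate_and_fire([0.0, 16.0, 0.0, 0.0])
# Several smaller inputs arriving close together add up over time and fire.
burst = integrate_and_fire([9.0, 9.0, 9.0, 9.0, 9.0])
# The same small inputs spread out leak away before they can add up.
sparse = integrate_and_fire([9.0, 0.0, 0.0, 9.0, 0.0, 0.0, 9.0])

print(single, burst, sparse)  # [1] [2] []
```

The artificial node would give the same answer for `burst` and `sparse` if you just summed the inputs; the timing only matters in the second model.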

I haven't found any hypotheses or experiments that try to explain how reinforcement learning takes place in neurons yet in the absence of back propagation. Pretty sure that information is out there but I haven't run across it.

Overall I think that there will be multiple engineering approaches to AI, just like there are to construction and flight. We understand how birds and bees fly, but we don't build planes the same way.

It's important to remember that cognition is based in physics just like any other physical system. When the principles are well understood, there are multiple avenues to using them.

Here is a short collection of links that I've been finding helpful.






In terms of big picture stuff, you're absolutely right that many DNNs are more "inspired by" the brain and less a faithful model of it. However, a lot of the things mentioned in your post are either overstated or outright wrong. For example:

1. Neurons, or more specifically, connections between neurons ("synapses"), absolutely do have weights, and the "strength" of synapses can be adjusted by a variety of properties that act on a scale of seconds to hours or days. At the "semi-permanent" end of the spectrum, the location of a synapse matters a lot: input arriving far from the cell body has much less influence on the cell's spiking. The number (and location?) of receptors on the cell surface can also affect the relative impact of a given input. Receptors can be trafficked to/from the membrane (a fairly slow process), or switched on and off more rapidly by intracellular processes. You may want to read up on long-term potentiation/depression (LTP/LTD), which are activity-dependent changes in synaptic strength. There are a whole host of these processes, and even some (limited) evidence that the electric fields generated by nearby neurons can "ephaptically" affect nearby neurons, even without making direct contact, which would allow for millisecond-scale changes.

2. While you can start by dividing neurons into excitatory and inhibitory populations, there's a lot more going on. On the glutamate (excitatory) side, AMPA receptors let glutamate rapidly excite a cell and make it more likely to fire. However, it also controls NMDA channels that, under certain circumstances, allow calcium into a cell. These calcium ions are involved in all sorts of signaling cascades (and are involved--we think--in tuning synaptic weights). GABA typically hyperpolarizes cells (i.e., makes them less likely to fire) and is secreted by cells called interneurons. However, there's a huge diversity of interneurons. Some seem to "subtract" from excitatory activity, others can affect it more strongly in a divisive sort of way or even cancel it completely. Furthermore, there's a whole host of other neurotransmitters. Dopamine, which is heavily involved in reward, can have excitatory or inhibitory effects, depending on whether it activates D1 or D2 receptors.

3. While the textbook feed-forward neural networks certainly have "instant" signal propagation, there are lots of other computational models that do include time. Time-delay neural networks are essentially convnets extended over time instead of space. Reservoir computing methods like liquid state machines also handle time, but in a much more complicated way.

4. I chuckled at the idea of finding a biological correlate for reinforcement learning, since reinforcement learning was initially inspired by the idea of reinforcement in psychology/animal behavior. People have shown that brain areas--and individual neurons within them--encode action values, state estimates, and other building blocks of reinforcement learning. Clearly, we have a lot to discover still, but the general idea isn't at all implausible.

Finally, some people are fairly skeptical that the fields have much to learn from each other; Jürgen Schmidhuber said this a lot at NIPS last year. However, other, equally-smart people (e.g., Geoff Hinton) seem to think that there may be a common mechanism, or at least a useful source of inspiration there. But, if you want to work on something like this (and it is awesomely interesting), it really helps to have a solid grounding in both.

This is exactly the kind of response I was hoping for. Thanks Matt! If it's not an inconvenience, could you drop any links to the topics you referenced, especially the ones that differed from what I've been studying, to charles@geuis.com? I'm going a bit deeper now and reading some studies from the early 90's and some that are more recent. It's kind of a crapshoot of what I can google for, so a guided search would be immensely helpful.

Hmmm...it's hard to do entire fields justice, but here's an attempt.

There are a couple of standard neurobiology textbooks, like Kandel, Jessell, and Schwartz's Principles of Neural Science, Purves et al.'s Neuroscience and Squire et al.'s Fundamental Neuroscience. These are huge books that cover a bit of everything, and you should know that they exist, but I wouldn't necessarily start there.

If you're specifically interested in computation, I would start with David Marr's Vision. It's quite old, but worth reading for the general approach he takes to problem-solving. He proposes attacking a problem along three lines: at the computational level ("what operations are performed?"), the algorithmic level ("how do we do those operations?"), and the implementation level ("how is the algorithm implemented").

From there, it depends on what you're interested in. At the single-cell level, Christof Koch has a book called Biophysics of Computation that "explains the repertoire of computational functions available to single neurons, showing how individual nerve cells can multiply, integrate, and delay synaptic input" (among other things). Michael London and Michael Häusser have a 2005 Annual Review of Neuroscience article about dendritic computation that hits on some similar themes (here: https://www.researchgate.net/publication/7712549_Dendritic_c... ), along with this short review (http://www.nature.com/neuro/journal/v3/n11s/full/nn1100_1171...) by Koch and Segev, and a 2014 review by Brunel, Hakim, and Richardson (http://www.sciencedirect.com/science/article/pii/S0959438814...). Larry Abbott has also done interesting work in this space, as have Haim Sompolinsky and many others. Gordon Shepherd and his colleagues maintain NEURON (a simulation package/platform) and a database of associated models (ModelDB) here: https://senselab.med.yale.edu/ if you want something to download and play with (they also do good original work themselves!)

Moving up a bit, the keywords for "weight adjustment" are something like synaptic plasticity, long-term potentiation/depression (LTP/LTD), and perhaps spike-timing dependent plasticity. The Scholarpedia article on spike-timing dependent plasticity is pretty good (http://www.scholarpedia.org/article/Spike-timing_dependent_p...). Scholarpedia is actually a pretty good resource for most of these topics. The intro books above will have pretty good treatments of this, though maybe not explicitly computational ones.
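For a concrete feel of what a spike-timing dependent rule looks like, here is a minimal sketch of the common pairwise exponential form. The function name and all parameter values (a_plus, a_minus, and the time constants) are illustrative assumptions, not numbers taken from any of the references above.

```python
import math

def stdp_delta_w(dt_ms, a_plus=0.01, a_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """Weight change for one pre/post spike pair.

    dt_ms is t_post - t_pre in milliseconds. All constants are
    illustrative assumptions, not fitted to data.
    """
    if dt_ms > 0:
        # Presynaptic spike came first: strengthen the synapse (LTP-like).
        return a_plus * math.exp(-dt_ms / tau_plus)
    if dt_ms < 0:
        # Postsynaptic spike came first: weaken the synapse (LTD-like).
        return -a_minus * math.exp(dt_ms / tau_minus)
    return 0.0

for dt in (-40, -10, 10, 40):
    print(dt, stdp_delta_w(dt))
```

The sign of the change depends only on spike order, and the size of the change decays as the two spikes get further apart in time.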

More to come, however I also just found this class from a bunch of heavy-hitters at NYU: http://www.cns.nyu.edu/~rinzel/CMNSF07/ Those papers are a good place to start!

Great info. Definitely have plenty to read over the next couple weeks now.


I probably should have led with this, but there's been a lot of interest in backprop-like algorithms in the brain:

* Geoff Hinton has a talk (and slide deck) about how back-propagation might be implemented in the brain. (Slide deck: https://www.cs.toronto.edu/~hinton/backpropincortex2014.pdf Video: http://sms.cam.ac.uk/media/2017973?format=mpeg4&quality=720p )

* As always, the French part of Canada has its own, slightly different version of things, care of Yoshua Bengio (slide deck from NIPS 2015: https://www.iro.umontreal.ca/~bengioy/talks/NIPS2015_NeuralS... preprint: https://arxiv.org/abs/1502.04156 )

* Here is another late 2015 take on back-prop in the brain by Whittington and Bogacz (http://biorxiv.org/content/early/2015/12/28/035451) This one is interesting because they view the brain as a predictive coding device which is continuously estimating the future state of the world and then updating its predictions. (I think the general predictive coding idea is cool and probably under-explored).

* There's a much older paper by Pietro Mazzoni, Richard Andersen, and Michael I. Jordan attempting to derive a more biologically plausible learning rule (here: http://www.pnas.org/content/88/10/4433.full.pdf) This work is particularly neat because it builds on earlier work by Zipser and Andersen (https://www.vis.caltech.edu/documents/54-v331_88.pdf), who trained a three-layer network (via back-prop) to transform data from gaze-centered ('retinotopic') to head-centered coordinates, and noticed that the hidden units performed transforms that look a lot like the work done by individual neurons in Area 7A. The Mazzoni paper then replaces the backprop with a learning procedure that is more biologically plausible.

For backprop, you need some sort of error signal. Wolfram Schultz's group has done a lot of work demonstrating that dopamine neurons encode something like "reward prediction error." (e.g., this: http://jn.physiology.org/content/80/1/1, but they have lots of similar papers: http://www.neuroscience.cam.ac.uk/directory/profile.php?Schu...). For reinforcement learning, you might also want to maintain some sort of value estimate. There are tons of studies looking at value representation in orbitofrontal cortex (OFC), using mostly humans and monkeys, but occasionally rats. Here's a review from Daeyeol Lee and his postdoc Hyojung Seo describing neural mechanisms for reinforcement learning (http://onlinelibrary.wiley.com/doi/10.1196/annals.1390.007/f... ) The Lee lab has done a lot of interesting value-related things too.
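To make "reward prediction error" concrete, here is a minimal temporal-difference sketch in plain Python. The two-step "cue then reward" task, the state names, and the learning-rate and discount values are all assumptions made up for illustration; this is a toy model, not a reproduction of any experiment linked above.

```python
def td_update(values, state, next_state, reward, alpha=0.1, gamma=0.9):
    """One TD(0) step; returns the prediction error delta."""
    # delta = what happened (reward + discounted future) minus what was expected
    delta = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * delta
    return delta

# Hypothetical task: a cue is always followed by a reward of 1.0.
values = {"cue": 0.0, "reward_delivery": 0.0, "end": 0.0}
errors_at_reward = []
for trial in range(200):
    td_update(values, "cue", "reward_delivery", 0.0)
    errors_at_reward.append(td_update(values, "reward_delivery", "end", 1.0))

print(errors_at_reward[0], errors_at_reward[-1], values)
```

In this toy setup the error at reward delivery starts large and shrinks toward zero as the reward becomes predicted, while the value estimate migrates back to the cue.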

Switching gears slightly, there is also considerable interest around unsupervised learning and related methods for finding "good" representations for things. This is potentially interesting because it would allow for improvements within individuals and even across individuals (e.g., by evolution).

Olshausen and Field kicked this off by demonstrating that maximizing the sparseness of a linear code for natural images produces a filter bank that resembles the way neurons in primary visual cortex process images. (http://courses.cs.washington.edu/courses/cse528/11sp/Olshaus...)

Michael Lewicki has done similar things in a variety of sensory modalities. Here's a recent paper from him looking at coding in the retina (http://journals.plos.org/ploscompbiol/article?id=10.1371/jou...) but he has similar work in the auditory system, as well as work building on the Olshausen and Field paper linked above to explain complex cells (and more!) in visual cortex. Bill Geisler has also done a lot of work looking at the statistics of natural scenes and how the brain (and behavior) appears to be adapted for them.

@geouis asked you to email him some reading material on this in a sibling comment. If you do send them something, could you post it here too?

One hypothesis is that the neural units in a CNN model hundreds or thousands of individual neurons, including some neurons used to produce/transmit error signals.

This seems reasonable, but I doubt there is any single simple and general abstraction that can describe the learning algorithms used by physical neurons. It seems more likely that the brain uses many highly specialized algorithms for different regions of the brain, each shaped by a ~billion years of evolution.

If you're interested in Reinforcement Learning and Spiking Neural Networks, you should look into Izhikevich's work on dopamine-modulated STDP.


I do think the complaint on having to write the backward pass seems especially shallow; finding out they were working with numpy makes it even more so (since numpy takes the pain out of the matrix operations). IIRC, when I took the ML Class in 2011, we used Octave, but Ng had us first write stuff "the hard way" - so we'd understand what was going on later when we used Octave's methods.

Something about this article as a whole, though, does raise a question that I've been wondering about, and I want to present it here for a bit of discussion (maybe it needs its own thread?):

Does anyone else here think that the current approach to neural networks has some fundamental flaws?

Now - I'm not an expert; call me an interested student right now, with only the barest of experience (beyond the ML Class, I also took Udacity's CS373 course, and I am also currently enrolled in their Self-Driving Car Engineer nanodegree program).

I understand that what we currently have and know does work. What I mean by that is the basic idea of an artificial neural network using forward and back-prop, multiple layers, etc (and all the derivatives - RNN, CNN, deep learning, etc). I understand the need and reasoning behind using activation functions based around calculus and derivatives and the chain-rule, etc (though I admit I need further education in these items).

But something nags at me.

All of this, despite the fact that it works and works well (provided all your tuning and such is right, etc), just seems like it is over-complicated. Real neurons don't use calculus and activation functions, nor back-propagation, etc in order to learn. All of those things in an ANN are just abstractions and models around what occurs in nature.

Maybe (probably?) I am wrong - but it seems like what nature does is simpler. Much less power is used, for instance, and the package is much more compact. I just have this feeling that in some manner we may have gone down a path that, while it has produced a working representation, has made that representation overly complex, and had we taken another approach (whatever that might be?), our ANNs would look and work much differently - perhaps even more efficiently.

About the only alternatives I have heard about otherwise have been things like spike-train neural networks, and some of the other "closer to nature" simulation (of ion pumps and real synapses, etc). Still, even those, while seemingly closer, also have what appears to be too much complexity.

I'm probably just talking out of my nether regions as a general n00b to the field. I do wonder, though, if there might be another solution, seemingly out in "left-field" that might push things forward, if someone was willing to look and experiment. It is something I plan to look into myself, as I find time and such between lessons and other work for my current learning experience.

>Real neurons don't use calculus and activation functions, nor back-propagation, etc in order to learn.

This sounds like a (common) failure to understand how abstractions work. Bridges don't do calculus, but the bridge builder uses calculus to understand what bridges do use (the laws of nature), and thus the calculus abstraction is used to encode the behavior of bridges. Thus you can model bridges using calculus.

Similarly, neurons are modeled by calculus. Abstractions are abstract precisely because they are not the concrete thing they model: they are necessarily approximations. They give us the power to simplify at the cost of gaining the capacity to be wrong.

The point being this: you can literally use any abstraction you desire to model anything you like. Some will work better than others, and the better they work, the more closely the structure of your abstraction matches the structure of the concretion being modeled.

If you fit some data with a very flexible function approximator, that does not imply any kind of isomorphism between the function approximator itself and the process generating the data.

Some people cannot understand this, and believe that if you can closely fit the output of a process with a neural network that it implies the process itself is in some way related to neural networks.

Don't get me wrong - I may not fully comprehend the math behind what is going on (as I noted, I have little understanding of calculus, and it is one of my failings that I am working to improve on) - but I do understand the need for the abstraction; I understand that it allows us to model the activation function and workings of a neuron, to a certain level of accuracy. I understand that it may -not- be the same way that a real neuron does it, but that it is close enough and works well enough to be useful.

At the same time, I wonder if there isn't a simpler way of doing all of this - if there isn't a simpler model for the abstraction of a neuron that doesn't require back-propagation or calculus? In other words, have we become so used to the current concepts and models of ANNs that we have become hesitant or resistant to imagining other possibilities?

Yes - what we have works, and it works well. Honestly, from what I know, and what I have learned (while I may not fully grasp the mathematical and calculus underpinnings of a neural network, I do have a good feel for how both forward and back-prop are implemented), our current general knowledge on ANNs (ie "how to implement a simple neural network") isn't super-complex. I only question whether it could be made simpler (and I'm not talking about a network of perceptrons or ReLUs), or if, because of the early work (Pitts and McCulloch mainly), we are going down a less-than-optimum path (to use an ML analogy, we are stuck in a local minimum) - one that fortunately works, but maybe there's something better out there?

Again - I recognize as a non-expert in this field (I only consider myself a student and hobbyist so far) - I am likely very well off in la-la land. Even so, we know that quite often in engineering, the optimum solution tends to be the simplest solution; sometimes, that solution is simpler than nature (an airplane vs a bird, for instance). I have a feeling that may be the case with ANNs as well.

I am not wedded to this idea, though - I just want to put it out there for consideration and maybe discussion. As I noted, what we currently have works well, and isn't a complexity nightmare, and is understandable to anyone willing to understand it. It very well could be it -is- the simplest explanation.

Very well said. It took me a long time to "get it" but once I crossed that threshold I started viewing different mathematical techniques as hammers, some more suited to modeling certain phenomena than others.

I think you are assuming the human brain is significantly simpler and better than it is. To start with: the number of neurons in the brain is ~100 billion, which is roughly 10,000 times more than in a large ANN; we train the human brain for years before it does anything remotely intelligent (as opposed to the expectation that neural nets will get somewhere useful in hours or days); and, most importantly, our brains probably have strong priors due to millions of years of evolution encoded in our DNA. On top of this, not all brains are as good as each other, especially when you think about animals who do not have a real system of speech.

I don't know a huge amount about neurology (or neural nets), but...

Adult humans seem to learn faster when there is some combination of theory, examples, and experience (aka feedback). I'm a scientist and I have very little interest in neural nets for science because it contorts away the kind of equation-based systematics that we rely on to understand our world. The theory component is missing.

I'm more interested in expert system-type AI, where the learning may help us infer more about the world in the systematic way that scientists try to grapple with it--logical rules and frameworks, physics equations, statistics, etc. But these seem to be far less effective or efficient than ANNs, at least as currently implemented.

That being said, I did see a great talk last week on using ANNs to accelerate viscoelasticity calculations--those sorts of things are definitely valuable for science by reducing the time required for simulations by orders of magnitude. And some of the discussion after the talk had to do with how a graph of biases and weights is, philosophically and practically, a fundamentally different way of representing the problem and its solution than a set of differential equations. There is no doubt that it's a useful way of solving the calculations, but it's unclear how we can build upon those weights and biases in a theoretically-meaningful way. How do we learn from machine learning?

There are quite a few neurons in an adult human, and they aren't just in some kind of undifferentiated randomly connected neuron soup. So when we talk about things that adult humans do in terms of neurons, it is a bit like talking about Linux in terms of transistors. It is not that transistors are irrelevant to Linux, but...

Now replace the neurons with some tangentially related abstraction that is massively different from real neurons, well...

> I did see a great talk last week on using ANNs to accelerate viscoelasticity calculations

I'd love a reference for this. Link, please?

I think everyone agrees that ANN are a useful abstraction of what happens in real neural networks. A "closer" abstraction in some sense, as you mention, is what is being called Spiking Neural Networks, so you may want to read up on that if you're interested. I don't think they are strictly more powerful in any way, for what it's worth, as they more or less just trade continuous-domain discrete-time computations for digital-domain continuous-time computations; although there's some interesting hardware work being done to implement them.

That said, don't underestimate the complexity of what is happening in real neurons. It's true that an individual neuron is fairly simple (more or less! some may perform very complex encodings of sensory data!), but the complexity of the "computations" they perform goes up very quickly, and you have a _lot_ of them that are highly connected, so it may be that ANN are a good abstraction in the sense that they reduce the complexity of spike trains to a more abstract but more powerful representation.

It's true though that they probably don't perform back-propagation, at least in the mathematical sense; I don't know the literature but I think an inhibitory mechanism is probably a better biological model for how "error detectors" can be used to suppress useless or noisy output.

ANNs would be useful or not completely apart from the relationship to anything "neural." I often wish that the association had never been drawn, it just doesn't seem very useful and it is always misleading so many people.

There is some work in this direction. See Bengio's Biologically Plausible Deep Learning (https://arxiv.org/pdf/1502.04156.pdf) and its references for a starting point. So far, they don't achieve SOTA on anything.

Nature uses much less power in a much more compact package because it took the path of inventing molecular nanotechnology. I agree, we should do that too! But in the meantime, the neural networks we build are pretty good for what they do and the requirements they meet.

Coming from a background in neuroscience, IMO, neurons are extremely complicated. There's tons of molecular machinery beyond their inputs and outputs (which in and of themselves are complicated). They are complicated enough that neuroscience has been a field for sometime now, and we still only know very little about the human brain. That's even if we are just talking about a single neuron, ignoring the enormous numbers of neurons and glial cells we have, all interconnected.

> finding out they were working with numpy makes it even more so (since numpy takes the pain out of the matrix operations)

hmm, I did the cs231n homework and had the opposite experience. It was really easy to complete it by ignoring numpy's provided matrix methods (just write a bunch of for loops in python) but that solution was really slow. If you could use numpy's matrix methods, however, the code executed a lot faster (the loops and control flow now run in C instead of python, so to speak). The hard part of the assignments was expressing my python code, with all its ad-hoc mixture of nested loops and conditionals, in numpy.

I found that in the ML Class, we could complete many of the assignments by doing the calcs via "for-loops" in Octave - and as you noted, it was really slow.

But I think Ng wanted us to understand what was going on under the hood in Octave when you used its in-built vector primitives, and how to think about the problems in such a way to understand how to "vectorise" them so that the solutions would be amenable to using those primitives.

There was a time where I wasn't "getting it" - and proceeded with my own implementation. In time though, it "clicked" for me, and I could put away those routines and use the vector math more fully (and quickly). That said, I wouldn't have wanted it to be left as a "black box" - it was nice to have the understanding of the operations it was doing behind the curtain.

Which is also why I was disappointed that the math wasn't described more in either of those courses; that part was left as "black boxes" and only hinted at a bit (ie "here's the derivative of the function - but you don't have to worry about it, but if you know about the math, you might find it interesting").

In this latest course I am taking, though, they are diving right into the math - and I have found that they assume you already know what a derivative is and how it is formed from the initial function. Unfortunately, I don't have the education needed, so I am fumbling through it (and doing what I can to read up on the relevant topics - I even bought a book on teaching yourself calculus which was rec'd for me by a helpful individual).

Yes, my point is that even though numpy provides a "black box" set of functions, using them is so unnatural (at least for me) that I was forced to completely understand what the numpy functions were doing internally to have any hope of using them.

In fact, now that I think about it, the first assignment asked us to write a 2-loops-in-python version of some function (batch linear classifier, I think), then a 1-loop version, then a 0-loop version, and I often repeated this procedure on subsequent harder questions.

That was a useful thing I wasn't expecting to learn from the course - how to vectorize code for numpy (including using strange features like broadcasting and reshape)
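Here is a rough sketch of that 2-loop, 1-loop, 0-loop progression for the scores of a batch linear classifier. It is not the actual cs231n assignment code; the function names, shapes, and random data are assumptions chosen just to show the pattern.

```python
import numpy as np

def scores_two_loops(X, W, b):
    """Explicit loops over examples and classes."""
    n, c = X.shape[0], W.shape[1]
    out = np.zeros((n, c))
    for i in range(n):
        for j in range(c):
            out[i, j] = np.dot(X[i], W[:, j]) + b[j]
    return out

def scores_one_loop(X, W, b):
    """Loop over examples only; each row is one vector-matrix product."""
    out = np.zeros((X.shape[0], W.shape[1]))
    for i in range(X.shape[0]):
        out[i] = X[i] @ W + b
    return out

def scores_no_loops(X, W, b):
    """Fully vectorized: b of shape (c,) is broadcast across all n rows."""
    return X @ W + b

# Made-up data: 5 examples, 4 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))
W = rng.standard_normal((4, 3))
b = rng.standard_normal(3)

print(np.allclose(scores_two_loops(X, W, b), scores_no_loops(X, W, b)))
```

All three versions compute the same array; the only difference is how much of the looping is handed over to numpy.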

If I can offer some advice as a former Calc I student, since you are focused on ML, ignore integrals and the fundamental theorem of calculus. Instead, understand the connection between derivatives and antiderivatives and become practiced with the rules of differentiation: product rule, quotient rule, trigonometric functions (tanh is part of hyperbolic trigonometry), exponentiation, logarithms, and the chain rule. Using derivatives to find local minima and maxima will also be useful. You should be able to look at the graph of a function and quickly visualize the graph of its derivative.

That's why I enjoyed the "hackers guide to neural networks" so much - it builds everything from the ground up

> Real neurons don't use calculus and activation functions, nor back-propagation, etc in order to learn. All of those things in an ANN are just abstractions and models around what occurs in nature.

But do you think that real neurons are less complicated than artificial neurons? Look at the molecular structure of a single ion channel. It's crazy complicated!

How do we know that neurons don't do calculus? Chemical gradients, summation, etc.

Backprop is just automatic differentiation. The end.

That's a succinct way of putting it.

Isn't backprop a clever application of chain rule in multivariate calculus?
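As a small illustration of the chain-rule view, here is a sketch of a hand-written backward pass for a single tanh unit with a squared-error loss, checked against finite differences. The network, the variable names, and the input values are all made up for the example.

```python
import math

def forward(w, b, x, t):
    """loss = (tanh(w*x + b) - t)^2, returning the intermediates too."""
    z = w * x + b
    y = math.tanh(z)
    loss = (y - t) ** 2
    return z, y, loss

def backward(w, b, x, t):
    """Apply the chain rule from the loss back to w and b."""
    z, y, loss = forward(w, b, x, t)
    dloss_dy = 2.0 * (y - t)       # derivative of the squared error
    dy_dz = 1.0 - y ** 2           # derivative of tanh(z)
    dloss_dz = dloss_dy * dy_dz    # chain rule
    return dloss_dz * x, dloss_dz  # dL/dw, dL/db

def numeric_grad(f, v, eps=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (f(v + eps) - f(v - eps)) / (2 * eps)

w, b, x, t = 0.5, -0.3, 2.0, 1.0
dw, db = backward(w, b, x, t)
dw_num = numeric_grad(lambda v: forward(v, b, x, t)[2], w)
db_num = numeric_grad(lambda v: forward(w, v, x, t)[2], b)
print(dw, dw_num, db, db_num)
```

Note the dy_dz factor: it goes to zero as y approaches 1 or -1, so a saturated unit passes almost no gradient back, whatever the loss says.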


Feynman pretty much settles it.

> “Why do we have to write the backward pass when frameworks in the real world, such as TensorFlow, compute them for you automatically?”

How many more times do you need to see the same phenomenon under different guises before you stop asking stupid questions? "Hey teach, why do I need to learn how to multiply if I can just use a calculator?"

Isn't it a somewhat reasonable question, given the relatively recent advent of TensorFlow, compared to ML curricula? The stress is on, why don't we learn TensorFlow / Caffe / etc.

Because frameworks come and go. The important things are the abstract concepts.

At that level, they assume you are pretty smart and capable of figuring out something like an API on your own time as needed. They'd rather you know what all these funky things in these APIs are doing at a core level so that you can employ them in an effective manner.

Letting the computer do the menial work of constructing formulas and generating code from user specifications is a fairly important framework feature and "abstract concept".

I could use that same abstract argument, just that with a framework you wanna see how far you can get, not how low. From the schools perspective it's all the same, just some test on your mental powers.

With a framework it's hard to think outside the frame. With a low level core it's hard to do anything really. It's a matter of compromise. No one starts writing asm to begin with, although it is interesting, e.g. nand2tetris being a famous example.

There's at least two ways to go about learning ML, and I think ideally one should do both. One is to use a high level tool like TensorFlow or Azure ML Studio to experiment with what the tools can do with data. This can get you sufficient competency to use these algorithms in a practical way.

The other is to learn the foundations of those algorithms so you can best understand how to tune, apply and extend them. This is the path to mastery.

Are you seriously asking that? Have you seen the example of learning multiplication vs using a calculator? What would you say in that case?

It's pretty much the same as looking at a multiplication table on a request basis.

Ironically, mere exposure to data is enough to learn from in ML, so why not here. Although, I'm not sure about the pedagogic aspect. I'd assume calculation by heart would be learned along the way, despite sending the initial message that it wasn't needed. Maybe starting slow is important, because it's that fundamental. But in hindsight, I really was good after half the elementary training.

I have a similar anecdote: I wasn't good at handwriting and always claimed I wouldn't need to. Now I don't need to, indeed, except for exams. But I actually have a hard time with calligraphy and that's a shame.

> I have a similar anecdote: I wasn't good at handwriting and always claimed I wouldn't need to. Now I don't need to, indeed, except for exams. But I actually have a hard time with calligraphy and that's a shame.

Your anecdote is irrelevant because there is no _understanding_ to be gained by handwriting as opposed to typing.

I really can't be bothered with this conversation, sorry.

Writing is geometric and calculation is pretty much mechanic.
