Theories of Error Back-Propagation in the Brain (cell.com)
160 points by nopinsight 73 days ago | 33 comments



> "The relationship of spike-time-dependent plasticity to other models requires further clarifying work"

- I got something for that, in fact I think I discovered that backpropagation engenders STDP:

https://github.com/guillaume-chevalier/Spiking-Neural-Networ...


It was interesting to read how you did a spiking neural network in PyTorch, but it seems your neurons' states are coupled continuously in time, whereas in the brain it would be the opposite, i.e. the spike timing carries the information rather than the state values.

> backprop engenders STDP

This is backwards I think, but definitely an interesting association to make


There's a cool package called Nengo from Chris Eliasmith's group at UWaterloo which hasn't received as much attention as I think it should: https://www.nengo.ai/nengo-dl/examples/tensorflow-models.htm...



It’s interesting that most AI models focus on learning, but in biology the heavy lifting was all done in the developmental evolution and neurogenesis. You’re not going to teach your dog relativity with back propagation -- yet we treat BP like it was going to solve all the problems in AI. Until we start focusing more on neural architecture I have no worries about AI taking over the world.


> You’re not going to teach your dog relativity with back propagation

What do you mean? RNN's can be taught to do math using BP[1]. Heck, you can do a simple version yourself in less than an hour[2]. They can't do tensor calculus yet (AFAIK), but there doesn't seem to be any reason why BP in particular would be the stumbling block; the difficulty is finding the right representation.
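
Not the actual code from [2], just a sketch of the idea assuming TensorFlow/Keras is available (sizes, layer widths and training settings are arbitrary): a character-level seq2seq RNN trained with plain backprop to map "123+456" to "579".

    import numpy as np
    import tensorflow as tf

    chars = "0123456789+ "
    char_to_idx = {c: i for i, c in enumerate(chars)}
    DIGITS = 3
    MAXLEN = 2 * DIGITS + 1        # e.g. "123+456"
    OUT_LEN = DIGITS + 1           # sums go up to 4 digits

    def encode(s, length):
        # one-hot encode a padded string, one timestep per character
        x = np.zeros((length, len(chars)), dtype=np.float32)
        for i, c in enumerate(s.ljust(length)):
            x[i, char_to_idx[c]] = 1.0
        return x

    questions, answers = [], []
    for _ in range(20000):
        a, b = np.random.randint(0, 10**DIGITS, size=2)
        questions.append(f"{a}+{b}")
        answers.append(str(a + b))

    X = np.array([encode(q, MAXLEN) for q in questions])
    Y = np.array([encode(a, OUT_LEN) for a in answers])

    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(128, input_shape=(MAXLEN, len(chars))),  # read the question
        tf.keras.layers.RepeatVector(OUT_LEN),                        # one step per answer char
        tf.keras.layers.LSTM(128, return_sequences=True),
        tf.keras.layers.Dense(len(chars), activation="softmax"),      # predict each answer char
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    model.fit(X, Y, batch_size=128, epochs=10, validation_split=0.1)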

> yet we treat BP like it was going to solve all the problems in AI

Except for reinforcement learning, which does not use BP to solve the Bellman equations[3], AlphaZero, which uses minimax[4], AutoML based on non-gradient methods such as Bayesian optimization[5], etc. Research into CNNs and RNNs for image, text, and voice processing does tend to focus on BP, but not inappropriately so, considering that that approach continues to make rapid progress in those domains.

> Until we start focusing more on neural architecture I have no worries about AI taking over the world.

I can't tell if you're arguing for or against. Are you suggesting that we should focus more on "neural architecture" so that AI can take over the world? Or are you approving of the current narrow focus for reasons of safety?

[1] https://arxiv.org/pdf/1809.08590.pdf

[2] https://machinelearningmastery.com/learn-add-numbers-seq2seq...

[3] https://towardsdatascience.com/introduction-to-various-reinf...

[4] https://www.depthfirstlearning.com/2018/AlphaGoZero

[5] https://en.wikipedia.org/wiki/Bayesian_optimization


>> RNN's can be taught to do math using BP

The paper you cite, and every single paper on artificial neural nets learning to "do maths" or "do arithmetic" etc. that has ever been published, are only showing neural nets learning [1] the results of specific operations between numbers up to a certain value. There is no work that shows neural nets learning to do arbitrary arithmetic operations on arbitrary numbers.

To put it very clearly: neural nets can't learn to "do maths" in any general sense.

Neural nets are infamously incapable of generalising beyond their training dataset and the OP is very reasonably skeptical of comparisons between artificial and natural neural networks, given that only the latter are known to be able to generalise from few examples seen very few times.

______________

[1] As in overfitting to.


>Neural nets are infamously incapable of generalising beyond their training dataset

In what sense? DNNs can certainly generalize to examples outside the training set. I agree that you get no guarantees on performance for samples drawn from a distribution that is different than the training/test set. Not having guaranteed expected performance isn't the same as "incapable" of generalizing to a new distribution.


Sorry for the late reply.

To clarify, when I say "training dataset" I mean the entire dataset used for training. Not the training partition in a cross-validation training/test split.

Neural nets can interpolate between the data points in their training dataset, but cannot extrapolate to cover data points outside this dataset. This is not a matter of architecture. DNNs are no exception. Neural nets are just very bad at generalising.

This lack of generalisation ability is why neural nets need to be trained with huge amounts of data, the more the better. Because they can't generalise, the only way to get them to recognise more instances of a class is to show them more examples of that class.

Here's a longer discussion of this:

https://blog.keras.io/the-limitations-of-deep-learning.html
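
To make the interpolation-vs-extrapolation point concrete, here is a toy sketch (assuming TensorFlow/Keras; the function and the ranges are arbitrary choices of mine): fit y = x^2 on inputs in [0, 1], then query far outside that range.

    import numpy as np
    import tensorflow as tf

    # train on y = x^2 for x in [0, 1] only
    x_train = np.random.uniform(0.0, 1.0, size=(10000, 1)).astype("float32")
    y_train = x_train ** 2

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(1,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(loss="mse", optimizer="adam")
    model.fit(x_train, y_train, epochs=20, batch_size=128, verbose=0)

    print(model.predict(np.array([[0.5], [10.0]], dtype="float32")))
    # inside the training range (0.5 -> ~0.25) the fit is close; far outside it
    # (10.0 -> should be 100) a ReLU net just extends its last linear piece,
    # so the prediction is nowhere near the true value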


You’re really proving my point here.


This is a valuable observation. I think the question, though, comes down to which of these two things should be mastered first or if they should be done in parallel. In general, the explosion of successes in DL in the last few years has been driven by AD and backprop rising to the occasion to make novel architectures explorable for the first time.


Well, at the moment we are not teaching dogs. We are trying to create dogs.


That’s not what I see.


I have been on similar kicks for drawing similarities between the two, but it never pans out to anything useful, i.e. how does this move the science forward?

I can play devil's advocate (aka classic Hacker News commenter) and say: what about inhibitory neurons? Tonic neurons? What about the ability of the brain to recall a name after hearing it once? Procedural memory? Does the relationship you point out help you understand any of these things?


As a neurobiologist, this interests me. I'm leaving this comment because I don't have time to read the article rn, but I want to come back and see what HN thinks about it. Briefly, and naive to the content of the article, here is my concern. Consider this network architecture...

You have 3000 input neurons that each makes ~10 synaptic connections with many downstream neurons, but let's focus on just one of those downstream neurons. This downstream neuron thus has 30k synaptic connections along its dendrite. At some time t its cell body (specifically the axon hillock area) receives enough graded electrical potential from some number of those synapses to fire an 'action potential' (a single electrical impulse). Some of those inputs contributed bursts of weak inputs, some may have contributed several strong inputs, some of the inputs were from synapses on very distal regions of the dendrite (therefore the graded signal was significantly weakened by the time it reached the cell body), and some were from synapses very near the cell body. But they all integrate at the cell body, at which point it doesn't matter where they came from - they've arrived is all that matters now - and together their strength is enough to evoke an action potential.

Let's say this action potential impulse resulted in an error of some sort, somewhere downstream in the network. If you're playing by the rules of supervised learning with backprop, the synapses that evoked the signal producing the error-impulse should be made weaker. How? In biological NNs this is the impossible question. Signals that could target individual synapses do not propagate up axons, through cell bodies, up dendrites, and back to synapses (you do have slow signaling for homeostatic scaling, but this is thought to simply scale network input up or down across all synapses).

This means you will need another group of neurons to send error signals back. Maybe these recurrent projections exist, and say they did: these recurrent projections would need to (1) form a 3rd party connection at every synapse and (2) know which synapses were to blame for the error. Those don't seem like trivial phenomena from a neurophysiology perspective. I've read theories on local molecular tags or peptide synthesis, but they all seemed very much hand-wavy. Nothing that has gained traction on those hypotheses has stood up to scrutiny.

The simplest theory to me is that cortical networks on the scale of human brains don't explicitly 'forget'. That is, they only learn. When you call someone Bob and they correct you, saying their name is Mike, your neural nets don't erase Bob and replace it with Mike. They remember the old name, the new name, the embarrassing situation, all of it. Biological neural nets don't need to explicitly forget because forgetting is (unfortunately?) a byproduct of being a biological entity. Also, neurons have finite resources (they are essentially zero-sum), so you simply will not see infinite run-up of synaptic strengths (which you might in some artificial NN without setting certain hyperparameters). Together, this suggests that recency is a fairly dominant factor when it comes to associative learning (and what gets forgotten) in biological NN.
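
To make the integration step above concrete, here is a cartoon in plain numpy (every constant is made up, and real dendrites are far richer than a single leaky sum): many weak, attenuated inputs leak away over time, and the cell fires only when their sum at the cell body crosses threshold.

    import numpy as np

    rng = np.random.default_rng(0)
    n_syn = 30000                              # synapses onto one downstream neuron
    w = rng.exponential(0.02, n_syn)           # synaptic strengths (some weak, some strong)
    attenuation = rng.uniform(0.2, 1.0, n_syn) # distal synapses arrive weakened

    v = 0.0            # graded potential at the axon hillock (arbitrary units)
    threshold = 3.0
    leak = 0.9         # fraction of the potential surviving each time step
    spike_times = []

    for t in range(1000):
        active = rng.random(n_syn) < 0.001     # which synapses deliver input this step
        v = leak * v + np.sum(w[active] * attenuation[active])
        if v >= threshold:                     # it no longer matters where the inputs
            spike_times.append(t)              # came from, only that their sum crossed
            v = 0.0                            # threshold: fire and reset

    print(spike_times[:10])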


Why do people love to do this... at least read the article before commenting.


Yeah, my bad. That said, I've read the article now. Indeed, the issues I've mentioned above are, more or less, echoed in the article.

Me: "these recurrent projections would need to (1) form a 3rd party connection at every synapse and (2) know which synapses were to blame for the error. "

Article: "Although the dendritic error network makes significant steps to increase the biological realism of predictive coding models, it also introduces extra one-to-one connections (dotted arrow in Box 4) that enforce the interneurons to take on similar values to the neurons in next layer and thus help them to predict the feedback from the next level."

The takeaway for me was that this article wasn't attempting to explain how biological NN actually function, but instead took the position of "for biological NN to implement a backprop-type algorithm, this is what it would entail..." and then went on to detail several models that demonstrate just how complex the bio NN architecture would have to be, considering only just a few major constraints (there are still many constraining factors of ancillary importance that would have to be explained, should any of these models make it through the initial gauntlet).


How is your concern with the lack of symmetric backward connections related to the last paragraph about the brain not forgetting? The backward pass is used to both strengthen and weaken weights, so in BP the forgetting and learning happen at the same time.


Hebbian plasticity in biological NN assumes strengthening happens during the forward pass (the only pass). It is a local phenomenon at individual synapses that detect a coincidence (two inputs occur simultaneously, or nearly so) and is mediated by calcium influx. NMDA receptors will pass calcium but only if two things happen: (1) they are currently bound to glutamate neurotransmitter (input from their own upstream axon), and (2) magnesium is not blocking their calcium channel (Mg will be ejected briefly from the Ca channel if the neuron is depolarized - meaning it is currently receiving input from elsewhere). If you are receiving sensory input from both my voice and my face, some set of neurons is detecting this coincidence and strengthening that connection, so the next time you hear my voice it becomes easier to picture my face out of all the different faces you have seen.

Here is a good article explaining not just in theory how LTP (long term potentiation) works in neurons but why a particular protocol always works irl (I can attest to the validity of these statements based on first hand experience):

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3843869/

To summarize, the most reliable way to induce LTP is to "take control of the postsynaptic membrane with intracellular Cs+ to block K+ channels, which allows the experimenter to hold the cell at a constant membrane potential and induce a minimal ‘pairing’ protocol to induce LTP: depolarizing the cell to 0 mV while stimulating synapses."

Holding the cell at 0 mV ensures Mg is always ejected, so any upstream stimulation will always be seen as a coincidence.
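
As a cartoon of that coincidence rule (plain numpy; every number is made up, and the 'pairing protocol' is mimicked by simply clamping the depolarization flag): a synapse only potentiates when its own presynaptic input and postsynaptic depolarization coincide, and holding the cell depolarized makes every stimulus count as a coincidence.

    import numpy as np

    rng = np.random.default_rng(1)
    n_syn = 100
    lr = 0.01                       # how much a calcium influx strengthens a synapse

    def step(w, depolarized):
        glutamate = rng.random(n_syn) < 0.05   # synapses receiving presynaptic input
        if depolarized:                        # Mg block relieved -> Ca can flow
            w = w + lr * glutamate             # only coincident synapses potentiate
        return np.clip(w, 0.0, 1.0)

    # normal conditions: depolarization is rare, so LTP is slow and selective
    w = np.full(n_syn, 0.5)
    for _ in range(1000):
        w = step(w, depolarized=rng.random() < 0.05)

    # 'pairing protocol': hold the cell depolarized, so every presynaptic
    # stimulus counts as a coincidence and LTP is induced reliably
    w_paired = np.full(n_syn, 0.5)
    for _ in range(1000):
        w_paired = step(w_paired, depolarized=True)

    print(w.mean(), w_paired.mean())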


Yes, but also in Hebbian learning you must have some weakening of weights, otherwise the weights would just grow indefinitely? One example I guess is Oja's rule. The difference from BP is just how to select which weights to strengthen and which to weaken, and based on what information. Forgetting and learning must always happen one way or the other. Or am I not getting your point?
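
For reference, a minimal numpy sketch of Oja's rule on toy data (nothing beyond the textbook rule; the data and learning rate are arbitrary): the -y^2*w decay term keeps the weight vector bounded, so Hebbian strengthening of some weights implicitly comes at the expense of others.

    import numpy as np

    rng = np.random.default_rng(0)
    # toy 2-D inputs whose dominant direction of variance is roughly (1, 1)
    s = rng.normal(size=5000)
    X = np.stack([s + 0.3 * rng.normal(size=5000),
                  s + 0.3 * rng.normal(size=5000)], axis=1)

    w = rng.normal(size=2)
    lr = 0.01
    for x in X:
        y = w @ x                        # postsynaptic activity (Hebbian: pre * post)
        w += lr * y * (x - y * w)        # growth term + normalizing decay (-y^2 * w)

    print(w, np.linalg.norm(w))          # |w| stays near 1; w aligns with the top PC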


I agree. Above I mention that neurons have finite resources so synaptic strength is essentially zero-sum. When a new set of synapses becomes strengthened, it implies that all the other synapses must be weakened by some amount.

Here is a nice animation of signal propagation in biological neural nets:

https://youtu.be/WCqNn9PEELw

To simulate the dynamics of synaptic strength I created a 3D mesh of a dendrite segment with several synapses...

https://youtu.be/tDKUU0SqbSA

Then I simulated the diffusion of AMPA receptors on the surface (the number of AMPARs in a synapse is proportional to its strength)...

https://youtu.be/6ZNnBGgea0Y

I don't have animations of this process, but you can imagine what happens when one synapse holds onto receptors longer than the others (i.e. has a reduced particle diffusion rate) while there is only a finite number of receptors.
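
A crude 1-D stand-in for that idea (plain numpy; the ring of membrane positions, the fixed receptor pool, and the single 'sticky' synapse with reduced hop probability are all invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n_sites = 100                  # positions along a 1-D ring of membrane
    n_receptors = 1000             # finite receptor pool (the zero-sum part)
    synapses = [10, 30, 50, 70, 90]
    sticky = 50                    # this synapse holds onto receptors longer

    pos = rng.integers(0, n_sites, n_receptors)
    for _ in range(20000):
        hop_prob = np.where(pos == sticky, 0.05, 0.5)  # reduced mobility at the sticky synapse
        moving = rng.random(n_receptors) < hop_prob
        steps = rng.choice([-1, 1], n_receptors)
        pos = np.where(moving, (pos + steps) % n_sites, pos)

    counts = np.bincount(pos, minlength=n_sites)
    print({s: int(counts[s]) for s in synapses})       # the sticky synapse ends up 'stronger',
                                                       # and only at the others' expense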



> In cell culture studies, they added neurons to astrocytes that overexpressed ephrin-B1 and were able to see synapse removal, with the astrocytes "eating up" the synapses... "We think that astrocytes expressing too much of ephrin-B1 can attack neurons and remove synapses"

Yeah, that would be a problem. Maybe glial cells play a role, maybe they don't; opinions vary. https://streamable.com/1vi7r


“You have 3000 input neurons that each makes ~10 synaptic connections with many downstream neurons.” Am I interpreting this correctly: can one neuron make 10 connections to the dendrite of a downstream neuron? Can you share a reference article for this?


> recurrent projections would need to (1) form a 3rd party connection at every synapse and (2) know which synapses were to blame for the error

The most striking result in this regard is that you can get backprop-like learning with random backprojections. It works regardless because, on average, the backpropagated error vector will be less than 90° from the true error vector (which is good enough for hillclimbing), and the dynamics play out in a way that the learned weights adjust to the random projections:

https://www.nature.com/articles/ncomms13276
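
For intuition, here is a numpy sketch of that feedback-alignment setup (not code from the paper; the toy student-teacher task, the sizes, and the learning rate are all invented): the backward pass uses a fixed random matrix B instead of the transpose of the forward weights, and the loss still goes down.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid, n_out, batch = 20, 64, 5, 32
    W1 = rng.normal(0, 0.1, (n_hid, n_in))
    W2 = rng.normal(0, 0.1, (n_out, n_hid))
    B = rng.normal(0, 0.1, (n_hid, n_out))       # fixed random feedback matrix

    # a fixed random 'teacher' network provides the targets to be learned
    T1 = rng.normal(0, 1 / np.sqrt(n_in), (n_hid, n_in))
    T2 = rng.normal(0, 1 / np.sqrt(n_hid), (n_out, n_hid))
    lr = 0.02

    for step in range(5001):
        x = rng.normal(size=(n_in, batch))
        y = T2 @ np.tanh(T1 @ x)                 # targets
        h = np.tanh(W1 @ x)                      # forward pass
        y_hat = W2 @ h
        e = y_hat - y                            # output error
        dh = (B @ e) * (1 - h ** 2)              # random feedback, NOT W2.T @ e
        W2 -= lr * e @ h.T / batch
        W1 -= lr * dh @ x.T / batch
        if step % 1000 == 0:
            print(step, float(np.mean(e ** 2)))  # the loss still falls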

That being said, it does seem to be the case that the brain simply memorizes an awful lot, which must work by a different mechanism than backprop, because backprop cannot do one-shot learning. I think one-shot learning is how the brain gets past large discontinuities in the model fitness landscape: it can learn linguistic and logical rules, fragments of general computations, and facts that are discovered by exploration (which includes learning about whether it was Bob or Mike) and passed on culturally. The brain basically outsources the problem of tunneling through large discontinuities to cultural/individual trial-and-error and episodic memory. The greatest consequence is that these bodies of knowledge can concern the improvement of the organization of knowledge itself, resulting in a positive feedback loop in model fitness (especially science/Bayesian updating).

Such bodies of knowledge still evolve under a soft constraint of learnability by hillclimbing, which implies they often form a neat latent space where similar codes belong to similar meanings/representations; this can easily be learned by stochastic hillclimbing (repetition), because each time the brain processes related information it is nudged towards the latent space that is meant to be learned. Many parts of the world happen to be learnable in this way because everything is kinda smooth and continuous: small causes tend to have small effects, as everything consists of a myriad of small particles that affect each other in smooth ways if you squint at them. Though obviously not everything can be learned this way (implying large discontinuities), which is where brute memorization based on reward and punishment comes in handy.


> brain simply memorizes an awful lot which must work by a different mechanism besides backprop because backprop cannot do one-shot learning

If you look through a neuroscience textbook section on memory systems, it's commonly suggested that the hippocampus does the one-shot learning and transfers it over time to the cortex. This is backed up by clinical case studies.

> The brain basically outsources the problem of tunneling through large discontinuities, to cultural/individual trial-and-error and episodic memory

That seems like a good strategy. It also reminds me of AlphaGo's Monte Carlo search + neural network training setup. Since the search is non-differentiable, you do lots of simulations and apply a differentiable DL model to the results to approximate a possibly discontinuous landscape.


> If you look through a neuroscience textbook section on memory systems, it's commonly suggested that the hippocampus does the one shot learning and transfers that over time to the cortex. This is backed up by clinical case studies.

HC's role in episodic memory and consolidation via dreams seems kinda plausible, though I would not put much weight on it. I think dreams are a way of training a GAN-like discrimination between reality and imagination:

http://gershmanlab.webfactional.com/pubs/GenerativeAdversari...

Repetition of any kind likely does improve the model, even if it's merely simulation/dreaming.

> AlphaGo's Monte Carlo search + neural network

I think, in effect, MCTS amounts to something like bagging/boosting/mixture of experts, as it computes a weighted average of the predictions when exploring different branches. But sure, the search mechanism implements a function which a recurrent neural network could probably not discover, as it hides behind substantial discontinuities in the fitness landscape (it's not a structure you can uncover step by step; you immediately need tree structure, a search recursion, etc.). The RNN would likely need to conceptualize the search process (subvocally but) linguistically like humans do, which requires structure for the sequential composition of stable prototypes (symbols), which likely requires a one-shot sequential memory. I think even the human mind does not literally do MCTS (that would require an overhead the brain is just not capable of), but some heuristic approximation thereof. The brain can simulate MCTS by linguistic means, though, even if it's just words of wisdom like "take counsel with your pillow", which literally means: explore the hypothesis space some more and let the temporal differences back up better value estimates.


Also very interesting is that you can send the error directly to intermediate layers through sparse random projections, without the need for any layerwise backpropagation. This relaxation of the structure of the backward pass makes BP even more plausible from a biological perspective.

https://arxiv.org/abs/1609.01596 https://arxiv.org/abs/1903.02083
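
Roughly, instead of passing the error back layer by layer, each hidden layer receives the output error through its own fixed random matrix. A numpy sketch of the update (shapes and sizes are made up, and this shows the dense rather than sparse variant):

    import numpy as np

    rng = np.random.default_rng(0)
    sizes = [20, 64, 64, 5]                      # input, two hidden layers, output
    W = [rng.normal(0, 0.1, (sizes[i + 1], sizes[i])) for i in range(3)]
    B = [rng.normal(0, 0.1, (sizes[i + 1], sizes[-1])) for i in range(2)]  # per-layer feedback

    def dfa_step(x, y, W, B, lr=0.02):
        h1 = np.tanh(W[0] @ x)
        h2 = np.tanh(W[1] @ h1)
        y_hat = W[2] @ h2
        e = y_hat - y                            # output error
        d1 = (B[0] @ e) * (1 - h1 ** 2)          # error sent directly to layer 1...
        d2 = (B[1] @ e) * (1 - h2 ** 2)          # ...and directly to layer 2, no chain of
        n = x.shape[1]                           # layer-to-layer backward messages needed
        W[2] -= lr * e @ h2.T / n
        W[1] -= lr * d2 @ h1.T / n
        W[0] -= lr * d1 @ x.T / n
        return float(np.mean(e ** 2))

    # usage: one update on a random minibatch of 32 (real targets would come from a task)
    loss = dfa_step(rng.normal(size=(20, 32)), rng.normal(size=(5, 32)), W, B)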


"Were guessing plausible ways the brain might work in ways similar to the machine learning world, but have no data to back it up"


Backpropagation feels like a pretty gross hack to me anyway. It is a powerful component that still needs a large amount of additional hand-engineering to do more than recognize shapes it has seen before. And it's quite mathematical, which seems kind of difficult to implement in a brain.

Edit: The article and code try to show how backpropagation might work given neuron behavior. I'll leave discussing the quality of the model to biologists.


> And it's quite mathematical, which seems kind of difficult to implement in a brain.

Another commenter pointed out that you don't need perfect backprop; some random backprojections are sufficient.


> "The relationship of spike-time-dependent plasticity to other models requires further clarifying work"

- I got something for that, in fact I think I discovered that backpropagation engenders STDP:

https://github.com/guillaume-chevalier/Spiking-Neural-Networ...


The problem with the brain is that you find what you go looking for.



