If the results hold, they seem significant enough to me that I'd go so far as to say the authors of the paper would end up getting an important award at some point, not just for unifying the fields of biological and artificial intelligence, but also for making it trivial to train models in a fully distributed manner, with all learning done locally -- if the results hold.
Here's the paper: "Predictive Coding Approximates Backprop along Arbitrary Computation Graphs"
I'm making my way through it right now.
"Relaxing the Constraints on Predictive Coding Models" (https://arxiv.org/abs/2010.01047), from the same authors. Looks at ways to remove neurological implausibility from PCM and achieve comparable results. Sadly they only do MNIST in this one, and are not as ambitious in testing on multiple architectures and problems/datasets, but the results are still very interesting and it covers some of the important theoretical and biological concerns.
"Predictive Coding Can Do Exact Backpropagation on Convolutional and Recurrent Neural Networks" (https://arxiv.org/abs/2103.03725), from different authors. Uses an alternative formulation that means it always converges to the backprop result within a fixed number of iterations, rather than approximately converges "in practice" within 100-200 iterations. Not only is this a stronger guarantee, it means they achieve inference speeds within spitting distance of backprop, levelling the playing field. (Edit: also noted by eutropia)
It'd be interesting to see what a combination of these two could do, and at this point I feel like a logical next step would be to provide a setting in popular ML libraries so that backprop can be swapped for PCM. Being able to verify this research just by adding a single extra line for the PCM version, and perhaps replicating state-of-the-art architectures, would be quite valuable.
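For intuition about what such a switch would replace, here is a minimal numpy sketch of the predictive coding inference-then-update loop for a toy two-layer linear network. This is my own illustrative code, not the authors' implementation: sizes and learning rates are arbitrary, there are no nonlinearities, and it uses the paper's "fixed prediction assumption" (predictions held at their feedforward values during inference), under which the local errors match the backprop deltas.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2
W1 = rng.normal(0, 0.3, (n_hid, n_in))
W2 = rng.normal(0, 0.3, (n_out, n_hid))
x0 = rng.normal(size=n_in)       # input (clamped)
t = rng.normal(size=n_out)       # target (clamped at the output)

# Feedforward predictions, held fixed during inference
# (the "fixed prediction assumption").
mu1 = W1 @ x0
mu2 = W2 @ mu1
e2 = t - mu2                      # output error (constant here)

# Inference phase: relax the hidden value node to minimise the energy.
x1 = mu1.copy()
for _ in range(300):
    e1 = x1 - mu1                 # local prediction error, hidden layer
    x1 += 0.2 * (-e1 + W2.T @ e2) # purely local update
e1 = x1 - mu1                     # equilibrium error

# Weight updates use only locally available errors and activities.
dW1_pc = np.outer(e1, x0)
dW2_pc = np.outer(e2, mu1)

# Backprop descent direction for the loss 0.5*||y - t||^2, for comparison.
delta2 = t - mu2
dW2_bp = np.outer(delta2, mu1)
dW1_bp = np.outer(W2.T @ delta2, x0)

print(np.allclose(dW1_pc, dW1_bp), np.allclose(dW2_pc, dW2_bp))
```

Note that each layer's update only touches its own error and its neighbours' activities; nothing here waits for a global backward sweep.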
Unfortunately most of these papers are heavy on theory but light on empirical evidence. If we follow the path of natural sciences, theory has to agree with evidence. Otherwise it's just another theory unconstrained by reality, or worse, pseudo-science.
cs702's original comment above is excessively hyperbolic: the compositional structure of Bayesian inversion is well known and is known to coincide structurally with the backward/forward structure of automatic differentiation. And there have been many papers before this one showing how predictive coding approximates backprop in other cases, so it is no surprise that it can do so on graphs, too. I agree with the ICLR reviewers that this paper is borderline and not in itself a major contribution. But that does not mean that this whole endeavour, of trying to find explicit mathematical connections between biological and artificial learning, is ill motivated.
/u/tsmithe's results on that are well known, now? I can scarcely find anyone to collaborate with who understands them!
The breakthrough seems really limited to showing it holds for graphs. We already knew this was practically true though anyway.
But the authors successfully show how to train CNNs, RNNs, and LSTM RNNs without backpropagation, i.e., every layer learning only via local rules, without having to wait for gradients to be backpropagated to all layers before the entire model can move on to the next sample.
As I understand it, this work has paved a path for training very large networks in massively parallel, fully distributed hardware -- in the not too distant future.
The basic version of this was shown in , as mentioned by the ICLR review:
"Specifically, the original paper by Whittington & Bogacz (2017) demonstrated that for MLPs, predictive coding converges to backpropagation using local learning rules."
That Whittington & Bogacz paper didn't extend to complex ANN architectures, but it would have been very surprising if what they showed didn't carry over to other ANNs.
OTOH, while local-only updates are great it doesn't help much if the overall algorithm needs vastly more iterations. Again, from the ICLR review: "The increase in computational cost (of 100x) is mentioned quite late and seems to be glossed over a bit."
This work opens the door for using new kinds of massively parallel "neuromorphic" hardware to implement orders of magnitude more layers and units, without requiring greater communications bandwidth between layers, because the model no longer needs to wait until gradients have back-propagated from the last to the first layer before moving on to the next sample.
Scaling backpropagation to GPT-3 levels and beyond (think trillions of dense connections) is very hard -- it requires a lot of complicated plumbing and bookkeeping.
Wouldn't you want to be able to throw 100x, 1000x, or even 1Mx more fully distributed computing power at problems? This work has paved a path pointing in that direction :-)
> also for making it trivial to train models in a fully distributed manner, with all learning done locally
seems like a really huge development.
At the same time I remain pretty skeptical of claims of unifying the fields of biological and artificial intelligence. I think the recent tremendous successes in AI & ML lead to an unjustified overconfidence that we are close to understanding how biological systems must work.
Saying "we have no idea" is just being lazy.
For the latest and greatest see
Once you start pulling that thread you'd be surprised how much we do know.
There is a ton of work on this, both theory and empirical evidence, here are just a few:
"Navigating cognition: Spatial codes for human thinking" https://science.sciencemag.org/content/362/6415/eaat6766.abs...
"Organizing conceptual knowledge in humans with a gridlike code"
"The Hippocampus Encodes Distances in Multidimensional Feature Space" https://www.sciencedirect.com/science/article/pii/S096098221...
"A non-spatial account of place and grid cells based on clustering models of concept learning"
"A learned map for places and concepts in the human MTL" https://www.biorxiv.org/content/10.1101/2020.06.15.152504v1....
"What Is a Cognitive Map? Organizing Knowledge for Flexible Behavior" https://www.sciencedirect.com/science/article/pii/S089662731...
"A map of abstract relational knowledge in the human hippocampal–entorhinal cortex" https://elifesciences.org/articles/17086
"Map-Like Representations of an Abstract Conceptual Space in the Human Brain"
"Knowledge Across Reference Frames: Cognitive Maps and Image Spaces" https://www.sciencedirect.com/science/article/pii/S136466132...
"Concept formation as a computational cognitive process" https://www.sciencedirect.com/science/article/pii/S235215462...
"Efficient and flexible representation of higher-dimensional cognitive variables with grid cells" https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...
"The cognitive map in humans: spatial navigation and beyond" https://www.nature.com/articles/nn.4656
"A general model of hippocampal and dorsal striatal learning and decision making" https://www.pnas.org/content/117/49/31427.short
"On the Integration of Space, Time, and Memory" https://www.sciencedirect.com/science/article/pii/S089662731...
With backprop, you can sort of assume that, given enough scale, your algorithm will identify these important features. With local learning, wouldn't you get a tendency to identify the easily identifiable features many times? Is there a need for some middleman, a one-armed-bandit kind of thing, that decides to spawn and despawn child nodes to explore the space more?
(meaning it might also actually cause one or more constellations to perform worse than if it wasn't contributing, and realistically, you'll never know)
It's why they're so problematic: you can determine the propagation functions of individual nodes perfectly, and that knowledge tells you exactly nothing about all the many things its values contribute to. There is no "concrete thing" at the node level: a single node fundamentally can't see a wing, or a color, or anything else; that's only emergent behaviour of node constellations, and one node can contribute to many constellations simultaneously.
Heck, there often isn't even a "concrete thing" at many of the constellation levels; concrete things don't start to emerge until you're looking at the full state of all end nodes.
In any case, speaking of them as representing singular features is an appropriate simplification. Maybe it's not one node that codes for legs, but two nodes that code for legs like this and legs like that; that's not relevant to the point.
What's the one-armed bandit? (Besides a slot machine.)
My knowledge of this field is rusty, but I actually wrote my MSc thesis on novel ways to get Genetic Algorithms to more efficiently explore the space without getting stuck, so it sounds up my alley.
Although I guess a single-armed bandit would be something akin to the secretary problem.
The type of research in  (an exhaustive analytic study of various parameters in RL training) is clearly beyond a typical academic environment, and probably beyond normal industry labs too. Note the paper was from Google Brain.
A study like this consumes a lot of people's time and computing time. It's no doubt very useful and valuable. But I don't think it should be judged by the same group of reviewers against other work from normal universities.
This paper extends recent work (Whittington & Bogacz, 2017, Neural computation, 29(5), 1229-1262) by showing that predictive coding (Rao & Ballard, 1999, Nature neuroscience 2(1), 79-87) as an implementation of backpropagation can be extended to arbitrary network structures. Specifically, the original paper by Whittington & Bogacz (2017) demonstrated that for MLPs, predictive coding converges to backpropagation using local learning rules. These results were important/interesting as predictive coding has been shown to match a number of experimental results in neuroscience and locality is an important feature of biologically plausible learning algorithms.
The reviews were mixed. Three out of four reviews were above threshold for acceptance, but two of those were just above. Meanwhile, the fourth review gave a score of clear reject. There was general agreement that the paper was interesting and technically valid. But, the central criticisms of the paper were:
Lack of biological plausibility: The reviewers pointed to a few biologically implausible components of this work. For example, the algorithm uses local learning rules in the same sense that backpropagation does, i.e., the algorithm is local only if we assume that there exist feedback pathways with weights symmetric to the feedforward pathways. Similarly, it is assumed that there are paired error neurons, which is biologically questionable.
Speed of convergence: The reviewers noted that this model requires many more iterations to converge on the correct errors, and questioned the utility of a model that involves this much additional computational overhead.
The authors included some new text regarding biological plausibility and speed of convergence. They also included some new results to address some of the other concerns. However, there is still a core concern about the importance of this work relative to the original Whittington & Bogacz (2017) paper. It is nice to see those original results extended to arbitrary graphs, but is that enough of a major contribution for acceptance at ICLR? Given that there are still major issues related to (1) in the model, it is not clear that this extension to arbitrary graphs is a major contribution for neuroscience. And, given the issues related to (2) above, it is not clear that this contribution is important for ML. Altogether, given these considerations, and the high bar for acceptance at ICLR, a "reject" decision was recommended. However, the AC notes that this was a borderline case.
The core reason is that the proposed model lacks biological plausibility. Or, if ignoring this weakness, the model is then computationally more intensive.
I HAVE NOT read the paper, but the review seems mostly based on "feeling"; i.e., the reviewers feel that this work is not above the bar. Note that I am not criticizing the reviewers here. In my past reviewing career, maybe 100+ papers (which I did until 6 years ago), most submissions were junk. For the ones that were truly good work, checking all the boxes (new result, hard problem, solid validation), it was easy to accept.
A few other papers fell into the "feeling" category: everything looked right, but they were always borderline, and the review results could vary substantially based on the reviewers' own backgrounds.
See Professor Edmund T. Rolls books on biologically plausible neural networks:
"Brain Computations: What and How" (2020) https://www.amazon.com/gp/product/0198871104
"Cerebral Cortex: Principles of Operation" (2018) https://www.oxcns.org/b12text.html
"Neural Networks and Brain Function" (1997) https://www.oxcns.org/b3_text.html
From the linked article.
It's probably far too late to change the name for computational neural nets, but I agree. Something like a "differentiable learning graph" would be better.
If anyone is interested in the reader's digest version of the original paper check out https://www.youtube.com/watch?v=LB4B5FYvtdI
This is an excellent, concise explanation. It sounds intuitive as something that could work. Would love to try and dabble with this. Any resources?
"Backprop" == "feedback" of a non-linear dynamical system. Feedback is a mathematical description of the behavior of systems, not a literal one.
I don't know that BNNs are incapable of backprop any more than an RLC filter is incapable of "feedback", when analyzing the latter's ODE tells you that there's a feedback path (which is what, physically? The return path for charge?)
So what makes BNNs incapable of feedback? Are they mechanically and electrically insulated from each other? How do they share information, and what is the return path?
Other than that I wish more unification was done on ML algorithms and dynamical systems, just in general. There's too much crossover to ignore.
Check out this work, "Deep relaxation: partial differential equations for optimizing deep neural networks" by Pratik Chaudhari, Adam Oberman, Stanley Osher, Stefano Soatto & Guillaume Carlier.
There is simply no evidence for this global feedback loop, or global error correction, or delta rule training in neurophysiological data collected in the last 80 years of intensive research. 
As for "why": biological learning is primarily shaped by evolution, driven by energy-expenditure constraints and survival of the most efficient adaptation engines. One can speculate that iterative optimization akin to the one run by GPUs in ANNs is way too energy-inefficient to be sustainable in a living organism.
A good discussion of the biological constraints on learning (from a CompSci perspective) can be found in Leslie Valiant's book .
Prof. Valiant is the author of PAC, one of the few theoretically sound models of modern ML, so he's worth listening to.
It's well known in dynamics that feed-forward networks are no longer feed-forward when outputs are coupled to inputs, an example of which would be a hypothetically feed-forward network of neurons in an animal and environmental conditioning teaching it the consequences of actions.
I'm very curious on the biological constraints, but I'd reiterate my point above that feedback is a mathematical or logical abstraction for analyzing the behavior of the things we call networks - which are also abstractions. There's a distinction between the physical behavior of the things we see and the mathematical models we construct to describe them, like electromechanical systems where physically no such coupling from output-to-input appears to exist, yet its existence is crucially important analytically.
> The backpropagation algorithm requires information to flow forward and backward along the network. But biological neurons are one-directional. An action potential goes from the cell body down the axon to the axon terminals to another cell's dendrites. An action potential never travels backward from a cell's terminals to its body.
The point of the research here is that backpropagation turns out not to be necessary to fit a neural network, and that it can be approximated with predictive coding, which does not require end-to-end backwards information flow.
>An action potential goes from the cell body down the axon to the axon terminals to another cell's dendrites.
How do you figure that doesn't allow backprop?
A neuronal bit is a loop of neurons. Information absolutely can back-propagate. If it couldn't, how does anyone think it'd be at all possible to learn how to get better at anything?
Neuron fires dendrite to axon, secondary neuron fires dendrite to axon, axon branches back to the previous neuron's dendrites, rinse, repeat, or add more intervening neurons... Trying to rule out backprop based on the morphology of a single neuron is... kinda missing the point.
It's all about the level of connection between neurons and how long, or whether, a signal returns unmodified to the progenitor, which affects the stability of the encoded information or behavior. At least to the best I've been able to plausibly model it. I haven't exactly figured out how to shove a bunch of measuring sticks in there to confirm or deny, but I just can't see how a unidirectional action-potential-forwarding element implies a lack of backprop in a graph of connections fully capable of developing cycles.
When you backprop through a linear layer (a matrix W), you need to multiply by W.transpose, which is impossible if connections are not bidirectional.
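That transpose step is easy to verify numerically. A toy check (sizes and values arbitrary): the gradient of a squared-error loss with respect to a linear layer's input is the output error carried back through W.T, which matches a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
t = rng.normal(size=3)

def loss(x):
    y = W @ x                     # forward pass through the linear layer
    return 0.5 * np.sum((y - t) ** 2)

# Analytic backward pass: the output error comes back through W.T.
grad_analytic = W.T @ (W @ x - t)

# Central finite-difference estimate of the same gradient.
eps = 1e-6
grad_numeric = np.array([
    (loss(x + eps * np.eye(4)[i]) - loss(x - eps * np.eye(4)[i])) / (2 * eps)
    for i in range(4)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))
```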
> Information absolutely can back- propagate. If it couldn't, how does anyone think it'd be at all possible to learn how to get better at anything?
Local error aggregation can have a similar effect to backprop, but you can run each layer in parallel; you don't need to wait for the signal to reach the loss function and then for gradients to come all the way back.
An interesting read: Decoupled Neural Interfaces using Synthetic Gradients (DeepMind, 2017) http://proceedings.mlr.press/v70/jaderberg17a/jaderberg17a.p...
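Very roughly, the synthetic-gradient idea looks like the following. This is my own toy linear sketch, not the paper's architecture: a small module M learns to predict the gradient at a hidden layer, so that layer can update immediately instead of waiting for the true gradient to arrive from the loss.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid, n_out = 4, 5, 3
W1 = rng.normal(0, 0.3, (n_hid, n_in))
W2 = rng.normal(0, 0.3, (n_out, n_hid))
M = np.zeros((n_hid, n_hid))    # synthetic-gradient module (a linear map here)

x = rng.normal(size=n_in)       # one fixed toy example, for simplicity
t = rng.normal(size=n_out)

errs = []
for _ in range(2000):
    h = W1 @ x
    g_hat = M @ h                     # predicted dL/dh: layer 1 updates NOW,
    W1 -= 0.01 * np.outer(g_hat, x)   # without waiting for the backward pass
    y = W2 @ h
    g_true = W2.T @ (y - t)           # true gradient, available only later
    W2 -= 0.01 * np.outer(y - t, h)
    M -= 0.01 * np.outer(g_hat - g_true, h)  # regress module onto true grad
    errs.append(float(np.sum((g_hat - g_true) ** 2)))

print(errs[-1] < errs[0])   # the module's gradient predictions improve
```

The decoupling is the point: once M is a decent predictor, the layer below never has to stall on the global backward sweep.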
The map is not the territory.
Are we talking about the brain, or biologically plausible neural nets?
Personally, when I've come to the point where I'm thinking to myself "that must be it, what else can it be?", I am at the point where I need to do more work to answer the latter part of the question.
Presumably part of the feedback loop (at least for things like motor skills and rote memorisation) is external to the brain. Our brain causes us to act, which alters our perceptions, which causes the brain to adjust.
This means brains must have a bloody good update rule. At roughly 1 billion operations per second, you get about 4e17 operations by age 12 (12 years is roughly 4e8 seconds), which works out to about 2 million training steps per neuron, or about half that if you account for sleep. You cannot get to the level of a 12-year-old in 4e17 operations: GPT-3 used more, and while it's impressive, it doesn't have anything on a 12-year-old.
In keeping with the No-Free-Lunch theorem, it's also highly desirable in general to have a variety of approaches at hand for solving certain predictive coding problems. Yes, this makes ML (as a field) cumbersome, but it also prevents us from painting ourselves into a corner.
I wrote something about this here https://github.com/adamnemecek/adjoint
I'm working on it; I'll send you an e-mail. Things quickly turned out to be more general than I realized last year.
ANNs have deviated widely from their biological inspiration, most notably in the way information flows, since backpropagation requires two-way flow and biological axons are one-directional.
If predictive coding and backpropagation are shown to have similar power, then there's a rough idea that the way that ANNs work isn't too far from how brains work (with lots and lots of caveats).
So many caveats that I don't even really think that is a true statement.
I thought I saw a Matlab explanation of that '99 paper but have not found it again.
Biological neurons don't just emit constant 0...1 float values; they communicate using time-sensitive bursts of voltage known as "spike trains". Spiking Neural Networks (SNNs) are a closer approximation of natural networks than typical ML ANNs.  gives a quick overview.
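The usual minimal model of that spiking behaviour is a leaky integrate-and-fire neuron. A quick sketch (unitless, with illustrative constants of my choosing): the membrane potential leaks, integrates input, and emits a spike when it crosses a threshold.

```python
def lif_spikes(input_current, v_thresh=1.0, v_reset=0.0, leak=0.9):
    """Leaky integrate-and-fire neuron (illustrative constants, unitless)."""
    v, spikes = 0.0, []
    for i in input_current:
        v = leak * v + i          # leaky integration of the input
        if v >= v_thresh:         # threshold crossing emits a spike...
            spikes.append(1)
            v = v_reset           # ...and resets the membrane potential
        else:
            spikes.append(0)
    return spikes

train = lif_spikes([0.3] * 20)    # constant drive -> a periodic spike train
print(train)
```

A constant input produces a regular spike train; the information is in the timing and rate of the spikes, not in a single float.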
Spike-Timing-Dependent Plasticity is a local learning rule experimentally observed in biological neurons. It's a form of Hebbian learning, aka "neurons that fire together wire together."
Summary from . The top graph gives a clear picture of how the rule works.
> With STDP, repeated presynaptic spike arrival a few milliseconds before postsynaptic action potentials leads in many synapse types to Long-Term Potentiation (LTP) of the synapses, whereas repeated spike arrival after postsynaptic spikes leads to Long-Term Depression (LTD) of the same synapse.
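A pair-based STDP rule matching that description can be sketched as follows (constants are illustrative, not fitted to any data): the weight change decays exponentially with the gap between pre- and postsynaptic spikes, with the sign set by which came first.

```python
import math

def stdp_dw(t_pre, t_post, a_plus=0.01, a_minus=0.012,
            tau_plus=20.0, tau_minus=20.0):
    """Pair-based STDP weight change for one pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt > 0:        # pre fired before post -> potentiation (LTP)
        return a_plus * math.exp(-dt / tau_plus)
    elif dt < 0:      # pre fired after post -> depression (LTD)
        return -a_minus * math.exp(dt / tau_minus)
    return 0.0

print(stdp_dw(10.0, 15.0) > 0)    # pre leads post: weight grows
print(stdp_dw(15.0, 10.0) < 0)    # post leads pre: weight shrinks
```

Note the rule is entirely local: it needs only the spike times of the two cells at the synapse, which is exactly the property the plot in the quoted summary illustrates.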
This reminds me of a Slate Star Codex article on Friston.
2) Has a CNN version of this been implemented in PyTorch?
 https://arxiv.org/pdf/1512.03385.pdf (Figure 2)