Hacker News
Predictive coding has been unified with backpropagation (lesswrong.com)
311 points by cabalamat on April 5, 2021 | 85 comments

EDIT: Before you read my comment below, please see https://news.ycombinator.com/item?id=26702815 and https://openreview.net/forum?id=PdauS7wZBfC for a different view.


If the results hold, they seem significant enough to me that I'd go as far as saying the authors of the paper would end up getting an important award at some point, not just for unifying the fields of biological and artificial intelligence, but also for making it trivial to train models in a fully distributed manner, with all learning done locally -- if the results hold.

Here's the paper: "Predictive Coding Approximates Backprop along Arbitrary Computation Graphs"


I'm making my way through it right now.

Note that the paper was rejected for publication in ICLR 2021:


That is an awesome site, thanks for posting. I had no idea there was a place with that much transparent review (shows how much I've been publishing).

Yes, I linked to that same page at the top of my comment :-)

Oops, sorry- I noticed the link but I thought it was a HN url, like the one before... and I wondered why it was greyed-out (visited). But still I didn't check it out. My very bad.

No worries :-)

Interesting follow up reading:

"Relaxing the Constraints on Predictive Coding Models" (https://arxiv.org/abs/2010.01047), from the same authors. Looks at ways to remove neurological implausibility from PCM and achieve comparable results. Sadly they only do MNIST in this one, and are not as ambitious in testing on multiple architectures and problems/datasets, but the results are still very interesting and it covers some of the important theoretical and biological concerns.

"Predictive Coding Can Do Exact Backpropagation on Convolutional and Recurrent Neural Networks" (https://arxiv.org/abs/2103.03725), from different authors. Uses an alternative formulation that means it always converges to the backprop result within a fixed number of iterations, rather than approximately converges "in practice" within 100-200 iterations. Not only is this a stronger guarantee, it means they achieve inference speeds within spitting distance of backprop, levelling the playing field. (Edit: also noted by eutropia)

It'd be interesting to see what a combination of these two could do, and at this point I feel like a logical next step would be to provide some setting in popular ML libraries such that backprop can be switched for PCM. Being able to verify this research just by adding a single extra line for the PCM version, and perhaps replicating state-of-the-art architectures, would be quite valuable.

Here's a more recent paper (March, 2021) which cites the above paper: https://arxiv.org/abs/2103.04689 "Predictive Coding Can Do Exact Backpropagation on Any Neural Network"

Yup. I'd expect to see many more citations going forward. In particular, I'd be excited to see how this ends up getting used in practice, e.g., training and running very large models on distributed, massively parallel "neuromorphic" hardware.

I’m going to personally flog any researcher who titles their next paper “Predictive Coding Is All You Need”. You’ve been warned.

There are already 60+ of those, and counting, all but one of them since Vaswani et al's transformer paper:


The thing is, about every week a paper is published with groundbreaking claims, and this question in particular is very popular: trying to unify neuroscience and deep learning in some way, in search of the computational foundations of AI. Mostly this is driven by the success of DL in certain industrial applications.

Unfortunately most of these papers are heavy on theory but light on empirical evidence. If we follow the path of natural sciences, theory has to agree with evidence. Otherwise it's just another theory unconstrained by reality, or worse, pseudo-science.

The paper (arxiv:2103.04689) linked by eutropia above has some empirical evidence on the ML side, showing that performance of predictive coding is not so far off backprop. And there is no shortage of suggestions for how neural circuits might work around the strict requirements of backprop-like algorithms.

cs702's original comment above is excessively hyperbolic: the compositional structure of Bayesian inversion is well known and is known to coincide structurally with the backward/forward structure of automatic differentiation. And there have been many papers before this one showing how predictive coding approximates backprop in other cases, so it is no surprise that it can do so on graphs, too. I agree with the ICLR reviewers that this paper is borderline and not in itself a major contribution. But that does not mean that this whole endeavour, of trying to find explicit mathematical connections between biological and artificial learning, is ill motivated.

>the compositional structure of Bayesian inversion is well known

/u/tsmithe's results on that are well known, now? I can scarcely find anyone to collaborate with who understands them!

Not only light on evidence, but essentially practicality-free. There's no "there" there. Literally nothing useful will come from this.

I don’t think anyone familiar with the field is in any way surprised by these results.

The breakthrough seems really limited to showing it holds for graphs. We already knew this was practically true though anyway.

Agree, no one is surprised.

But the authors successfully show how to train CNNs, RNNs, and LSTM RNNs without backpropagation, i.e., every layer learning only via local rules, without having to wait for gradients to be backpropagated to all layers before the entire model can move on to the next sample.

As I understand it, this work has paved a path for training very large networks in massively parallel, fully distributed hardware -- in the not too distant future.
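To make "local rules" concrete, here is a toy sketch of the predictive-coding scheme on a tiny linear network (my own illustration, not the authors' code: the function name `pc_step`, the learning rates, sizes, and iteration counts are all made up, and note the feedback still uses symmetric weights `W2.T`, one of the biological-plausibility caveats the ICLR review raises):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-4-2 linear network (real models use nonlinear activations; this
# keeps the sketch short). All hyperparameters here are arbitrary.
W1 = rng.normal(scale=0.5, size=(4, 3))   # hidden <- input
W2 = rng.normal(scale=0.5, size=(2, 4))   # output <- hidden

def pc_step(x_in, y_target, W1, W2, n_iters=100, lr_x=0.1, lr_w=0.05):
    """One predictive-coding training step on a single sample.

    Inference phase: hidden value nodes relax to minimise prediction
    errors. Learning phase: each weight updates from purely local
    quantities (its layer's error times its presynaptic values), with
    no global backward sweep of gradients.
    """
    x1 = W1 @ x_in                        # initialise hidden values via a forward pass
    for _ in range(n_iters):              # inference: relax hidden values
        e1 = x1 - W1 @ x_in               # prediction error at the hidden layer
        e2 = y_target - W2 @ x1           # prediction error at the (clamped) output
        x1 += lr_x * (-e1 + W2.T @ e2)    # uses only neighbouring errors
    e1 = x1 - W1 @ x_in
    e2 = y_target - W2 @ x1
    W1 += lr_w * np.outer(e1, x_in)       # local, Hebbian-style updates
    W2 += lr_w * np.outer(e2, x1)
    return W1, W2

x = rng.normal(size=3)
y = rng.normal(size=2)
initial_loss = np.linalg.norm(y - W2 @ (W1 @ x))
for _ in range(300):
    W1, W2 = pc_step(x, y, W1, W2)
final_loss = np.linalg.norm(y - W2 @ (W1 @ x))
print(initial_loss, final_loss)  # the loss shrinks using only local updates
```

The point of the papers is that, at the fixed point of the inference phase, these local updates approximate the backprop gradients.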

> But the authors successfully show how to train CNNs, RNNs, and LSTM RNNs without backpropagation, i.e., every layer learning only via local rules

The basic version of this was shown in [1], as mentioned by the ICLR review:

"Specifically, the original paper by Whittington & Bogacz (2017) demonstrated that for MLPs, predictive coding converges to backpropagation using local learning rules."

The Whittington & Bogacz paper didn't extend to complex ANN architectures, but it would have been very surprising if what they showed didn't extend to other ANNs.

OTOH, while local-only updates are great it doesn't help much if the overall algorithm needs vastly more iterations. Again, from the ICLR review: "The increase in computational cost (of 100x) is mentioned quite late and seems to be glossed over a bit."

[1] https://pubmed.ncbi.nlm.nih.gov/28333583/

In my view, there's a big difference between successfully training, say, LSTM RNNs, versus successfully training "vanilla" MLPs.

This work opens the door for using new kinds of massively parallel "neuromorphic" hardware to implement orders of magnitude more layers and units, without requiring greater communications bandwidth between layers, because the model no longer needs to wait until gradients have back-propagated from the last to the first layer before moving on to the next sample.

Scaling backpropagation to GPT-3 levels and beyond (think trillions of dense connections) is very hard -- it requires a lot of complicated plumbing and bookkeeping.

Wouldn't you want to be able to throw 100x, 1000x, or even 1Mx more fully distributed computing power at problems? This work has paved a path pointing in that direction :-)

My background is as an interested amateur, but

> also for making it trivial to train models in a fully distributed manner, with all learning done locally

seems like a really huge development.

At the same time I remain pretty skeptical of claims of unifying the fields of biological and artificial intelligence. I think the recent tremendous successes in AI & ML lead to an unjustified overconfidence that we are close to understanding the way biological systems must work.

Indeed, it's worth mentioning we still have absolutely no idea how memory works.

We know a lot about memory, but most AI researchers are simply ignorant of neuroscience or cognitive psychology and stick with their comfort zone.

Saying "we have no idea" is just being lazy.

No. We really have no idea what is going on. We only know some basic psychology about it (holding 7 things in short term, etc.) If we knew something about implementation, we could implement human-like memory.

I suggest starting with the works by Howard Eichenbaum on memory and Edvard & May-Britt Moser (and John O'Keefe and Lynn Nadel) on place & grid cells.

For the latest and greatest see

Once you start pulling that thread you'd be surprised how much we do know.

Literally nothing you posted surprises me in the least, and literally none of this work shows that we know anything at all about how memory is implemented. Perhaps read some of the many takedowns of so-called "grid cells", which show that it is completely unsurprising and not at all interesting or noteworthy that activity in some parts of the brain correlates with location information. The important questions always remain unanswered.

We know a fair bit about how cognitive maps work in 2D and 3D Euclidean environments. We know damn little about how nontrivial manifold structure can be learned, particularly in spaces with more than three dimensions.

Spatial cognitive maps used for navigation are extendable to arbitrarily high-dimensional spaces for abstract concept representation, using pretty much the same machinery.

There is a ton of work on this, both theory and empirical evidence, here are just a few:

"Navigating cognition: Spatial codes for human thinking" https://science.sciencemag.org/content/362/6415/eaat6766.abs...

"Organizing conceptual knowledge in humans with a gridlike code" https://science.sciencemag.org/content/352/6292/1464

"The Hippocampus Encodes Distances in Multidimensional Feature Space" https://www.sciencedirect.com/science/article/pii/S096098221...

"A non-spatial account of place and grid cells based on clustering models of concept learning" https://www.nature.com/articles/s41467-019-13760-8

"A learned map for places and concepts in the human MTL" https://www.biorxiv.org/content/10.1101/2020.06.15.152504v1....

"What Is a Cognitive Map? Organizing Knowledge for Flexible Behavior" https://www.sciencedirect.com/science/article/pii/S089662731...

"A map of abstract relational knowledge in the human hippocampal–entorhinal cortex" https://elifesciences.org/articles/17086

"Map-Like Representations of an Abstract Conceptual Space in the Human Brain" https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7884611/

"Knowledge Across Reference Frames: Cognitive Maps and Image Spaces" https://www.sciencedirect.com/science/article/pii/S136466132...

"Concept formation as a computational cognitive process" https://www.sciencedirect.com/science/article/pii/S235215462...

"Efficient and flexible representation of higher-dimensional cognitive variables with grid cells" https://journals.plos.org/ploscompbiol/article?id=10.1371/jo...

"The cognitive map in humans: spatial navigation and beyond" https://www.nature.com/articles/nn.4656

"A general model of hippocampal and dorsal striatal learning and decision making" https://www.pnas.org/content/117/49/31427.short

"On the Integration of Space, Time, and Memory" https://www.sciencedirect.com/science/article/pii/S089662731...

Unless you know of working implementations of memory algorithms I tend to agree that we have no clue how memory works.

that's called being lazy

I'm trying to imagine how that works. Imagine you've got a neural net. One node identifies the number of feet. One node identifies the number of wings. One node identifies color. This feeds into a layer that tries to predict what animal it is.

With backprop, you can sort of assume that given enough scale your algo will identify these important features. With local learning, wouldn't you get a tendency to identify the easily identifiable features many times? Is there a need for a sort of middleman like a one arm bandit kind of thing that makes a decision to spawn and despawn child nodes to explore the space more?

The fallacy there is the idea that "one node" does anything useful on its own. Each node optimizes itself in a way such that you have _no idea_ what it actually codes for; at the emergent level you see it contribute to coding for wing detection, or color detection, or, more likely, seventeen supposedly unrelated things at once. It just happens to generate values that somehow contribute to a result for the features the various constellations detect.

(meaning it might also actually cause one or more constellations to perform worse than if it wasn't contributing, and realistically, you'll never know)

That's, at best, pedantically true. You can determine the function of individual components of a network, and they will correspond to concrete things. It's just that the utility of doing this is low in the scheme of things, and the function of individual components is going to be fuzzier than the nice constructs humans like to think in. If you wanted to take painstaking steps to align the functionality of nodes to identify specific features per the example, you could do this and the network would work just fine. It's an appropriate way to simplify an explanation of how the model works.

That's literally what we cannot do in a neural net.

It's why they're so problematic: you can determine the propagation functions of individual nodes perfectly, and that knowledge tells you exactly nothing about all of the many things its values contribute to. There is no "concrete thing" at the node level: a single node fundamentally can't see a wing, or a color, or anything else; that's only emergent behaviour of node constellations, and one node can contribute to many constellations simultaneously.

Heck, there often isn't even a "concrete thing" at many of the constellation levels; concrete things don't start to emerge until you're looking at the full state of all end nodes.

That's the machine learning 101 worldview. Initial level nodes are likely just minor inscrutable transformations. Later layers will be coding for features that, at least to some degree if it's a tractable problem, humans can understand and agree with as useful features. They'll be fuzzy and not as clearly defined as a human would frame the problem, but their purpose can be explored and generally identified.

In any case, speaking of them as representing singular features for simplification is appropriate. Maybe it's not one node that codes for legs, but two nodes that code for legs like this and legs like that, but that's not relevant to the point.

> Is there a need for a sort of middleman like a one arm bandit kind of thing that makes a decision to spawn and despawn child nodes to explore the space more?

What's the one-armed bandit? (Besides a slot machine.)

My knowledge of this field is rusty, but I actually wrote my MSc thesis on novel ways to get Genetic Algorithms to more efficiently explore the space without getting stuck, so it sounds up my alley.

I wonder if you thought of it as a type of optimal-stopping problem locally on each node and explore-exploit (multi-armed bandit) globally? For example, if each node knows when to halt when it hits a [probably local] minimum, the results can be shared at that point and the best-performing models can be cross-pollinated, or whatever the mechanism is. Since copying the models and continuing without gaining ground are both wastes of time, you want to dial in that local halting point precisely. An overseeing scheduler would record epoch-level results and make the decisions, of course.
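For reference, the explore-exploit scheduler alluded to above is essentially a multi-armed bandit. A minimal epsilon-greedy sketch (purely illustrative; the arm means, noise level, and parameters are arbitrary values I chose):

```python
import random

# Epsilon-greedy multi-armed bandit: the "overseer" repeatedly picks an
# arm (e.g. a model variant), mostly exploiting the best estimate so
# far, sometimes exploring at random.
def epsilon_greedy(true_means, steps=5000, eps=0.1, seed=0):
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n
    values = [0.0] * n                    # running mean reward per arm
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.randrange(n)        # explore a random arm
        else:
            arm = max(range(n), key=values.__getitem__)  # exploit the best
        reward = rng.gauss(true_means[arm], 1.0)         # noisy payoff
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values

counts, values = epsilon_greedy([0.1, 0.5, 0.9])
print(counts)  # the highest-mean arm ends up pulled most often
```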

Haha, sorry, I meant multi-armed bandit, which I'd presume you're familiar with.

Although I guess a single-armed bandit would be something akin to the secretary problem.

Interesting discussion on the ICLR openreview, resulting in a reject:


The review is great, it contains all the interesting points and counterpoints, in a much more succinct format than the article itself.

Another well received paper [1], but I want to point out that ICLR should really have an industry track.

The type of research in [1] (an exhaustive analytic study of various parameters in RL training) is clearly beyond a typical academic environment, and probably also beyond normal industry labs. Note the paper was from Google Brain.

The study consumes a lot of people's time, and computing time. It's no doubt very useful and valuable. But I don't think it should be judged by the same group of reviewers as the other work from normal universities.

[1] https://openreview.net/forum?id=nIAxjsniDzg

While it wouldn't hurt, I don't think it is necessary. As with any large ML conference, many reviewers are in industry. I don't know the mix of industry to academic reviewers, but would not be surprised if it was biased towards industry supported research.

Copied from this URL, the final review comment, which 1) summarizes the other reviews and 2) describes the rationale for rejection:

``` This paper extends recent work (Whittington & Bogacz, 2017, Neural computation, 29(5), 1229-1262) by showing that predictive coding (Rao & Ballard, 1999, Nature neuroscience 2(1), 79-87) as an implementation of backpropagation can be extended to arbitrary network structures. Specifically, the original paper by Whittington & Bogacz (2017) demonstrated that for MLPs, predictive coding converges to backpropagation using local learning rules. These results were important/interesting as predictive coding has been shown to match a number of experimental results in neuroscience and locality is an important feature of biologically plausible learning algorithms.

The reviews were mixed. Three out of four reviews were above threshold for acceptance, but two of those were just above. Meanwhile, the fourth review gave a score of clear reject. There was general agreement that the paper was interesting and technically valid. But, the central criticisms of the paper were:

Lack of biological plausibility: The reviewers pointed to a few biologically implausible components to this work. For example, the algorithm uses local learning rules in the same sense that backpropagation does, i.e., if we assume that there exist feedback pathways with symmetric weights to feedforward pathways then the algorithm is local. Similarly, it is assumed that there are paired error neurons, which is biologically questionable.

Speed of convergence: The reviewers noted that this model requires many more iterations to converge on the correct errors, and questioned the utility of a model that involves this much additional computational overhead.

The authors included some new text regarding biological plausibility and speed of convergence. They also included some new results to address some of the other concerns. However, there is still a core concern about the importance of this work relative to the original Whittington & Bogacz (2017) paper. It is nice to see those original results extended to arbitrary graphs, but is that enough of a major contribution for acceptance at ICLR? Given that there are still major issues related to (1) in the model, it is not clear that this extension to arbitrary graphs is a major contribution for neuroscience. And, given the issues related to (2) above, it is not clear that this contribution is important for ML. Altogether, given these considerations, and the high bar for acceptance at ICLR, a "reject" decision was recommended. However, the AC notes that this was a borderline case. ```

The core reason is that the proposed model lacks biological plausibility. Or, ignoring this weakness, the model is computationally more intensive.

I HAVE NOT read the paper, but the review seems mostly based on "feeling"; i.e., the reviewers feel that this work is not above the bar. Note that I am not criticizing the reviewers here: in my past reviewing career (maybe 100+ papers, which I did until 6 years ago), most submissions were junk. For the ones that were truly good work, which checked all the boxes (new result, hard problem, solid validation), it was easy to accept.

For yet a few other papers, which all seem to fall into the "feeling" category, everything looks right, but they are always borderline. And the review results can vary substantially based on the reviewers' own backgrounds.

I'm glad people are talking about this, and the similarity between predictive coding and the action of biological neurons is interesting. But we shouldn't fetishize predictive coding. There's a wider discussion going on, and several theories as to how back propagation might work in the brain.



There is no evidence of back-propagation in the brain.

See Professor Edmund T. Rolls books on biologically plausible neural networks:

"Brain Computations: What and How" (2020) https://www.amazon.com/gp/product/0198871104

"Cerebral Cortex: Principles of Operation" (2018) https://www.oxcns.org/b12text.html

"Neural Networks and Brain Function" (1997) https://www.oxcns.org/b3_text.html

"There is just one problem: [biological neural networks] are physically incapable of running the backpropagation algorithm."

From the linked article.

I read that sentence. The article is not the only source of truth in brain function, and its author may be too certain about the brain. In any case, there will always be dissimilarities between biological neurons and computations on silicon, which probably shouldn't be called neurons, in order to avoid confusion.

I agree. Really don't appreciate the level at which researchers are willing to make these comparisons right now. They're moving fast and publishing things.

It's probably far too late to change the name for computational neural nets - but I agree. Something like a "differentiable learning graph" or something would be better.

This is an overly strong claim for the paper (which is good!) backing it.

If anyone is interested in the reader's digest version of the original paper check out https://www.youtube.com/watch?v=LB4B5FYvtdI

nice video

> Predictive coding is the idea that BNNs generate a mental model of their environment and then transmit only the information that deviates from this model. Predictive coding considers error and surprise to be the same thing. Hebbian theory is a specific mathematical formulation of predictive coding.

This is an excellent, concise explanation. It sounds intuitive as something that could work. Would love to try and dabble with this. Any resources?

I don't know enough about biology or ML to know if what I'm posting below is totally wrong, but here goes.

"Backprop" == "Feedback" of a non-linear dynamical system. Feedback is mathematical description of the behavior of systems, not a literal one.

I don't know if BNNs are incapable of backprop any more than an RLC filter is incapable of "feedback", when analyzing the ODE of the latter tells you that there's a feedback path (which is what, physically? The return path for charge?).

So what makes BNNs incapable of feedback? Are they mechanically and electrically insulated from each other? How do they share information, and what is the return path?

Other than that I wish more unification was done on ML algorithms and dynamical systems, just in general. There's too much crossover to ignore.

> Other than that I wish more unification was done on ML algorithms and dynamical systems, just in general. There's too much crossover to ignore.

Check out this work, "Deep relaxation: partial differential equations for optimizing deep neural networks" by Pratik Chaudhari, Adam Oberman, Stanley Osher, Stefano Soatto & Guillaume Carlier.


The back-prop learning algorithm requires information non-local to the synapse to be propagated from the output of the network backwards, to affect neurons deep in the network.

There is simply no evidence for this global feedback loop, or global error correction, or delta rule training in neurophysiological data collected in the last 80 years of intensive research. [1]

As for "why": biological learning is primarily shaped by evolution, driven by energy-expenditure constraints and the survival of the most efficient adaptation engines. One can speculate that iterative optimization akin to the one run by GPUs in ANNs is way too energy-inefficient to be sustainable in a living organism.

A good discussion of the biological constraints on learning (from a CompSci perspective) can be found in Leslie Valiant's book [2]. Prof. Valiant is the author of PAC [3], one of the few theoretically sound models of modern ML, so he's worth listening to.

[1] https://news.ycombinator.com/item?id=26700536

[2] https://www.amazon.com/Circuits-Mind-Leslie-G-Valiant/dp/019...

[3] https://en.wikipedia.org/wiki/Probably_approximately_correct...

I think it's worth pointing out that "there is no feedback path in the brain" is not at all equivalent to "learning by feedback is not possible in the brain."

It's well known in dynamics that feed-forward networks are no longer feed-forward when outputs are coupled to inputs, an example of which would be a hypothetically feed-forward network of neurons in an animal and environmental conditioning teaching it the consequences of actions.

I'm very curious on the biological constraints, but I'd reiterate my point above that feedback is a mathematical or logical abstraction for analyzing the behavior of the things we call networks - which are also abstractions. There's a distinction between the physical behavior of the things we see and the mathematical models we construct to describe them, like electromechanical systems where physically no such coupling from output-to-input appears to exist, yet its existence is crucially important analytically.

The article says this:

> The backpropagation algorithm requires information to flow forward and backward along the network. But biological neurons are one-directional. An action potential goes from the cell body down the axon to the axon terminals to another cell's dendrites. An axon potential never travels backward from a cell's terminals to its body.

The point of the research here is that backpropagation turns out not to be necessary to fit a neural network, and that it can be approximated with predictive coding, which does not require end-to-end backwards information flow.

So... I don't understand.

>An action potential goes from the cell body down the axon to the axon terminals to another cell's dendrites.

How do you figure that doesn't allow backprop?

A neuronal bit is a loop of neurons. Information absolutely can back-propagate. If it couldn't, how does anyone think it'd be at all possible to learn how to get better at anything?

Neuron fires dendrite to axon, secondary neuron fires dendrite to axon, axon branches back to the previous neuron's dendrites; rinse, repeat, or add more intervening neurons... Trying to rule out backprop based on the morphology of a single neuron is... kinda missing the point.

It's all about the level of connection between neurons, and how long it takes (or whether) a signal returns unmodified to the progenitor, which affects the stability of the encoded information or behavior. At least, that's the best I've been able to plausibly model it. I haven't exactly figured out how to shove a bunch of measuring sticks in there to confirm or deny, but I just can't see how a unidirectional action-potential-forwarding element implies a lack of backprop in a graph of connections fully capable of developing cycles.

> How do you figure that doesn't allow backprop?

When you backprop through a linear layer (a matrix W), you need to multiply with W.transpose which is impossible if connections are not bidirectional.

> Information absolutely can back- propagate. If it couldn't, how does anyone think it'd be at all possible to learn how to get better at anything?

Local error aggregation can have a similar effect to backprop, but you can run each layer in parallel: you don't need to wait for the signal to reach the loss function and then for gradients to come all the way back.

An interesting read: Decoupled Neural Interfaces using Synthetic Gradients (DeepMind, 2017) http://proceedings.mlr.press/v70/jaderberg17a/jaderberg17a.p...
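To make the transpose point concrete (my own numerical sketch, not from the linked paper): for y = Wx with a squared-error loss, the chain rule gives dL/dx = Wᵀ(dL/dy), and that Wᵀ is exactly the backward path in question. A quick finite-difference check:

```python
import numpy as np

# Purely illustrative: verify that the gradient through y = W x really
# uses W.T, against a central finite-difference estimate.
rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
t = rng.normal(size=3)                    # arbitrary target

def loss(x):
    return 0.5 * np.sum((W @ x - t) ** 2)

grad_y = W @ x - t                        # dL/dy at the layer's output
grad_x = W.T @ grad_y                     # chain rule: dL/dx = W.T @ dL/dy

eps = 1e-6
num = np.array([(loss(x + eps * np.eye(4)[i]) - loss(x - eps * np.eye(4)[i])) / (2 * eps)
                for i in range(4)])
print(np.max(np.abs(grad_x - num)))       # tiny: W.T is the correct backward map
```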

You're conflating the implementation with the principle. There is no matrix math with neurons; quite the opposite, we posit the existence of the matrix math from the behavior we've observed with neural systems governed by a sigmoid function. The equations we've derived are secondary to the initial implementation. Just as you tweak the error factor in backprop, so too do weights between intersecting neuron networks adjust until thought and intention falls into line with eventual perception and execution.

The map is not the territory.

> There is no matrix math with neurons

Are we talking about the brain, or biologically plausible neural nets?

>> If it couldn't, how does anyone think it'd be at all possible to learn how to get better at anything?

Personally, when I've come to the point where I'm thinking to myself "that must be it, what else can it be?", I am at the point where I need to do more work to answer the latter part of the question.

> If it couldn't, how does anyone think it'd be at all possible to learn how to get better at anything?

Presumably part of the feedback loop (at least for things like motor skills and rote memorisation) is external to the brain. Our brain causes us to act, which alters our perceptions, which causes the brain to adjust.

Annnd... what exactly encodes that perception thing? The brain! There's nothing magic about sense data except that it's how we're used to tweaking things. It is still, in the end, just more neurons telling other neurons what's off. Sure, some are hooked up to photoreceptors; I could swap those out with about anything else and still get learning. Unfortunately, the practical benefit of mastery of stuff you imagine tends to be less useful than it feels like it should be IRL.

Yeah, but then you run into the problem of computation speed. Any given neuron in the middle of your brain does 1 computation per second absolute maximum, and 1 per 10 seconds is more realistic. Toward the outside (the vast majority of your brain), 1 per 100 seconds is a lot. And it slows down as you age.

This means brains must have a bloody good update rule. You just can't update a neural network in 1 billion operations per second, or 4e17 operations by the time you're 12: that's about 2 million training steps per neuron, or about half that assuming you sleep. You cannot get to the level of a 12-year-old in 4e17 operations, because GPT-3 does more, and while it's impressive, it doesn't have anything on a 12-year-old.
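Back-of-the-envelope check of that budget (my own arithmetic, assuming the ~1e9 updates/second figure stated above):

```python
# ~1e9 updates per second, sustained for 12 years.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
ops_budget = 1e9 * 12 * SECONDS_PER_YEAR
print(f"{ops_budget:.2e}")  # ~3.79e17, i.e. the "4e17 operations" quoted above
```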

Yeah, I don't like this title. Coding for backprop is worth getting excited about, but please don't assume it supersedes all forms of "predictive coding". Plenty of predictive learning techniques do just fine without it, including our own brains.

In keeping with the No-Free-Lunch theorem, it's also highly desirable in general to have a variety of approaches at hand for solving certain predictive coding problems. Yes, this makes ML (as a field) cumbersome, but it also prevents us from painting ourselves into a corner.

Is this "coding for backprop", or "coding for the same results as backprop"?

I think that this sort of forward backward thing is a very general idea. There’s a one to many relationship called the adjoint, and a many to one relationship called the norm.

I wrote something about this here https://github.com/adamnemecek/adjoint

In fact, the compositional structure underlying that of predictive coding [0,1] is abstractly the same as that underlying backprop [2]. (Disclaimer: [0,1] are my own papers; I'm working on a more precise and extensive version of [1] right now!)

[0] https://arxiv.org/abs/2006.01631 [1] https://arxiv.org/abs/2101.10483 [2] https://arxiv.org/abs/1711.10455

Hurry and publish before I have manuscripts ready applying these results.

Hey, Eli :-)

I'm working on it; I'll send you an e-mail. Things quickly turned out to be more general than I realized last year.

That would make sense. The whole ACT/categorical cybernetics community has been working out how massively general optics are :-).

What were you going to say about Young tableaux?

Dynamic programming and reinforcement learning are just diagonalizations of the Young tableau. This is related to the spectral theorem.

Interesting. Is there some concrete realization you had in mind?

At scale, Evolutionary Strategies (ES) are a very good approximation of the gradient as well. I don't recommend jumping to conclusions and unifications just yet.
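For context, the antithetic ES gradient estimator the parent is alluding to looks roughly like this; the toy loss and hyperparameters here are mine, purely for illustration:

```python
import numpy as np

def es_gradient(f, theta, sigma=0.1, n_pairs=1000, seed=0):
    """Antithetic evolution-strategies gradient estimate:
    grad ~= E[(f(theta + sigma*eps) - f(theta - sigma*eps)) / (2*sigma) * eps],
    with eps ~ N(0, I). Needs only loss evaluations, no backprop."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.standard_normal(theta.shape)
        grad += (f(theta + sigma * eps) - f(theta - sigma * eps)) / (2 * sigma) * eps
    return grad / n_pairs

f = lambda v: float((v ** 2).sum())   # toy loss with known gradient 2v
theta = np.array([1.0, -2.0, 0.5])
est, true = es_gradient(f, theta), 2 * theta
```

With enough perturbation pairs the estimate lines up closely with the analytic gradient, which is the sense in which ES "approximates the gradient" without being biologically motivated.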

The author's point is that predictive coding is a plausible mechanism by which biological neurons work. ES are not.

ANNs have deviated widely from their biological inspiration, most notably in the way information flows: backpropagation requires two-way flow, while biological axons are one-directional.

If predictive coding and backpropagation are shown to have similar power, then there's a rough idea that the way that ANNs work isn't too far from how brains work (with lots and lots of caveats).

> If predictive coding and backpropagation are shown to have similar power, then there's a rough idea that the way that ANNs work isn't too far from how brains work (with lots and lots of caveats).

So many caveats that I don't even really think that is a true statement.

Is this approach "more local", in the sense that you could build hardware where local units get work done with less communication? That would have potential. It's feasible to build ICs with a few million simple compute units if they don't have to talk to each other or to memory much. GPUs are a few hundred or a few thousand parallel units that talk to memory a lot.

Does anyone know of a simple code example that demonstrates the original predictive coding concept from 1999? Ideally applied to some type of simple image/video problem.
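Not a full answer, but here's a tiny numpy sketch in the spirit of the 1999 Rao & Ballard idea: a latent representation is relaxed to minimize the bottom-up prediction error, and the generative weights learn from the same local error signal. Dimensions, learning rates, and step counts are arbitrary choices of mine, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)              # an "image patch" to explain
U = 0.1 * rng.standard_normal((16, 4))   # top-down generative weights
r = np.zeros(4)                          # latent causes / representation

err0 = np.linalg.norm(x - U @ r)         # initial prediction error
for _ in range(1000):
    err = x - U @ r                      # bottom-up prediction error
    r += 0.05 * (U.T @ err)              # inference: error updates the latents
    U += 0.01 * np.outer(err, r)         # learning: local Hebbian-like rule
err_final = np.linalg.norm(x - U @ r)
print(err0, err_final)                   # error shrinks as U @ r explains x
```

Both updates use only the local error `err`, which is the part that maps onto the "prediction error neurons" story; a faithful reproduction of the 1999 model would add hierarchy and sparsity priors on top of this.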

I thought I saw a Matlab explanation of that 99 paper but have not found it again.

The paper was posted on HN a few months ago: https://news.ycombinator.com/item?id=24693609

But is predictive coding perceived as a valid theory of how cortical neurons function? There was a paper from 2017 drawing similar conclusions about backprop approximation with Spike-Timing-Dependent Plasticity: https://arxiv.org/abs/1711.04214 It looks more grounded in current models of neuronal functioning. Nevertheless, it has changed nothing in the field of deep learning since then.

Some general background on STDP for the thread:

Biological neurons don't just emit constant 0...1 float values, they communicate using time-sensitive bursts of voltage known as "spike trains". Spiking Neural Networks (SNN) are a closer approximation of natural networks than typical ML ANNs. [0] gives a quick overview.

Spike-Timing-Dependent Plasticity is a local learning rule experimentally observed in biological neurons. It's a form of Hebbian learning, aka "Neurons that fire together wire together."

Summary from [1]. The top graph gives a clear picture of how the rule works.

> With STDP, repeated presynaptic spike arrival a few milliseconds before postsynaptic action potentials leads in many synapse types to Long-Term Potentiation (LTP) of the synapses, whereas repeated spike arrival after postsynaptic spikes leads to Long-Term Depression (LTD) of the same synapse.


[0]: https://towardsdatascience.com/deep-learning-versus-biologic...

[1]: http://www.scholarpedia.org/article/Spike-timing_dependent_p...
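To make the quoted rule concrete, here's a minimal sketch of the classic exponential STDP window; the amplitudes and time constant are illustrative placeholders, not fitted values:

```python
import math

def stdp_dw(dt_ms, a_plus=0.01, a_minus=0.012, tau_ms=20.0):
    """Weight change for a single pre/post spike pair.

    dt_ms = t_post - t_pre. Pre-before-post (dt > 0) gives LTP
    (potentiation); post-before-pre (dt < 0) gives LTD (depression),
    with the effect decaying exponentially as the gap widens.
    """
    if dt_ms > 0:
        return a_plus * math.exp(-dt_ms / tau_ms)    # LTP branch
    return -a_minus * math.exp(dt_ms / tau_ms)       # LTD branch

# pre fires 5 ms before post -> strengthen; 5 ms after -> weaken
print(stdp_dw(5.0), stdp_dw(-5.0))
```

Note the rule is entirely local in time and space (one synapse, two spike times), which is what makes it attractive as a biologically plausible learning primitive.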

As long as the model requires a delta rule, or "teacher signal"-based error correction, it is not biologically plausible.

This was already shown for MLPs some years ago, and it is not really that surprising that it applies to many other architectures. Note that while learning can take place locally, it does still require an upward and downward stream of information flow, which is not supported by the neuroanatomy in all cases. So while it is an interesting avenue of research, I don't think it's anywhere near as revolutionary as this blog post makes it out to be.

> Predictive coding is the idea that BNNs generate a mental model of their environment and then transmit only the information that deviates from this model. Predictive coding considers error and surprise to be the same thing.

This reminds me of a Slate Star Codex article on Friston[1].

[1] https://slatestarcodex.com/2018/03/04/god-help-us-lets-try-t...
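The "transmit only what deviates from the model" idea is the same trick classic predictive/delta coding uses in signal compression. A toy sketch (the signal values are made up, and the "model" here is just "predict the previous sample"):

```python
import numpy as np

signal = np.array([10, 12, 13, 13, 14, 20, 21, 21])

# Predict each sample as the previous one and transmit only the
# prediction errors ("surprises"); most are small or zero.
residuals = np.diff(signal, prepend=0)   # [10, 2, 1, 0, 1, 6, 1, 0]

# The receiver runs the same predictor and adds the errors back,
# reconstructing the signal exactly.
reconstructed = np.cumsum(residuals)
```

The residual stream carries exactly the information the model failed to predict, which is the sense in which "error" and "surprise" coincide.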

1) How similar is this to creating a ResNet? [1] What are some key differences and similarities?

2) Has a CNN version of this been implemented in PyTorch?

[1] https://arxiv.org/pdf/1512.03385.pdf (Figure 2)
