- I've got something for that; in fact, I think I discovered that backpropagation engenders STDP:
> backprop engenders STDP
This is backwards I think, but definitely an interesting association to make
What do you mean? RNNs can be taught to do math using BP. Heck, you can do a simple version yourself in less than an hour. They can't do tensor calculus yet (AFAIK), but there doesn't seem to be any reason why BP in particular would be the stumbling block; the difficulty is finding the right representation.
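For instance, here is a minimal toy sketch (my own, not from any paper) of gradient descent learning addition. Addition is linear, so even a single linear unit trained with the same gradient machinery BP uses picks it up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: pairs of numbers in [0, 10) and their sums.
X = rng.uniform(0, 10, size=(1000, 2))
y = X.sum(axis=1)

# A single linear unit, y_hat = X @ w + b, trained by gradient descent on MSE.
w = rng.normal(size=2)
b = 0.0
lr = 0.01
for _ in range(20000):
    err = X @ w + b - y
    w -= lr * (X.T @ err) / len(X)
    b -= lr * err.mean()

print(w, b)  # w heads toward [1, 1] and b toward 0: it has "learned" addition
```

The learned weights generalize to any pair of inputs because the representation (a weighted sum) happens to match the task exactly; the hard part for harder math is finding such a representation.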
> yet we treat BP like it was going to solve all the problems in AI
Except for reinforcement learning, which does not use BP to solve the Bellman equations; AlphaZero, which uses minimax; AutoML based on non-gradient methods such as Bayesian optimization; etc. Research into CNNs and RNNs for image, text, and voice processing does tend to focus on BP, but not inappropriately so, considering that that approach continues to make rapid progress in those domains.
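To make the RL point concrete: tabular Q-learning applies the Bellman update directly, with no gradients or backprop anywhere. A toy sketch (the 5-state chain environment is my own invention):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 5-state chain: actions move left/right; reaching state 4 pays reward 1.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9

def step(s, a):
    s2 = max(0, min(n_states - 1, s + (1 if a == 1 else -1)))
    return s2, float(s2 == n_states - 1), s2 == n_states - 1

for _ in range(500):              # episodes of purely random behavior
    s = 0
    for _ in range(50):
        a = int(rng.integers(n_actions))
        s2, r, done = step(s, a)
        # The Bellman update: a moving average toward r + gamma * max Q(s').
        # No gradient, no backprop -- and it is off-policy, so it recovers
        # the optimal policy even from random exploration.
        Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s2].max()) - Q[s, a])
        s = s2
        if done:
            break

print(Q.argmax(axis=1)[:4])  # learned greedy policy: always move right
```

In this deterministic toy MDP the Q-values converge to the exact optimal values (Q[3, right] → 1, Q[2, right] → 0.9, and so on, discounted by gamma per step).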
> Until we start focusing more on neural architecture I have no worries about AI taking over the world.
I can't tell if you're arguing for or against. Are you suggesting that we should focus more on "neural architecture" so that AI can take over the world? Or are you approving of the current narrow focus for reasons of safety?
The paper you cite, and every single paper on artificial neural nets learning to "do maths" or "do arithmetic" etc. that has ever been published, only shows neural nets learning the results of specific operations between numbers up to a certain value. There is no work that shows neural nets learning to do arbitrary arithmetic operations on arbitrary numbers.
To put it very clearly: neural nets can't learn to "do maths" in any general sense.
Neural nets are infamously incapable of generalising beyond their training dataset and the OP is very reasonably skeptical of comparisons between artificial and natural neural networks, given that only the latter are known to be able to generalise from few examples seen very few times.
As in, overfitting to it.
In what sense? DNNs can certainly generalize to examples outside the training set. I agree that you get no guarantees on performance for samples drawn from a distribution different from the training/test distribution, but not having guaranteed expected performance isn't the same as being "incapable" of generalizing to a new distribution.
To clarify, when I say "training dataset" I mean the entire dataset used for training. Not the training partition in a cross-validation training/test split.
Neural nets can interpolate between the data points in their training dataset, but cannot extrapolate to cover data points outside this dataset. This is not a matter of architecture. DNNs are no exception. Neural nets are just very bad at generalising.
This lack of generalisation ability is why neural nets need to be trained with huge amounts of data, the more the better. Because they can't generalise, the only way to get them to recognise more instances of a class is to show them more examples of that class.
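The interpolation-vs-extrapolation claim is easy to demonstrate on a toy problem (my own sketch; the architecture and numbers are illustrative). A small tanh network fit to y = x² on [-2, 2] tracks the parabola inside that interval, but its tanh units saturate outside it, so the prediction flatlines while the true function keeps growing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fit a small tanh network to y = x^2 on [-2, 2] with plain gradient descent.
X = np.linspace(-2, 2, 200)[:, None]
y = X ** 2

H = 32
W1 = rng.normal(scale=1.0, size=(1, H)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(H, 1)); b2 = np.zeros(1)

lr = 0.01
for _ in range(40000):
    h = np.tanh(X @ W1 + b1)
    g = 2 * ((h @ W2 + b2) - y) / len(X)   # dMSE/dprediction
    gh = (g @ W2.T) * (1 - h ** 2)         # backprop through tanh
    W2 -= lr * h.T @ g;  b2 -= lr * g.sum(axis=0)
    W1 -= lr * X.T @ gh; b1 -= lr * gh.sum(axis=0)

def f(x):
    return (np.tanh(x[:, None] @ W1 + b1) @ W2 + b2).ravel()

grid = np.linspace(-2, 2, 50)
inside = np.abs(f(grid) - grid ** 2).max()
print(inside)              # small: the net interpolates well inside [-2, 2]
print(f(np.array([5.0])))  # nowhere near 25: outside the data, the fit flatlines
```

None of this tells you whether brains work the same way, but it shows the failure mode being discussed: the model is good between its data points and poor beyond them.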
Here's a longer discussion of this:
I can play devil's advocate (aka the classic Hacker News commenter) and ask: what about inhibitory neurons? Tonic neurons? What about the brain's ability to recall a name after hearing it once? Procedural memory? Does the relationship you point out help you understand any of these things?
You have 3000 input neurons that each make ~10 synaptic connections with many downstream neurons, but let's focus on just one of those downstream neurons. This downstream neuron thus has 30k synaptic connections along its dendrite. At some time t, its cell body (specifically the axon hillock area) receives enough graded electrical potential from some number of those synapses to fire an 'action potential' (a single electrical impulse). Some of those inputs contributed bursts of weak signals, some may have contributed several strong inputs, some came from synapses on very distal regions of the dendrite (so the graded signal was significantly weakened by the time it reached the cell body), and some came from synapses very near the cell body. But they all integrate at the cell body, at which point it doesn't matter where they came from - that they've arrived is all that matters now - and together their strength is enough to evoke an action potential.
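The integration described above is often modeled as a leaky integrate-and-fire neuron. A minimal sketch (all parameters are illustrative, not physiological; dendritic location is reduced to a per-synapse attenuation factor):

```python
import numpy as np

rng = np.random.default_rng(0)

# Leaky integrate-and-fire: synaptic inputs sum at the soma; where along the
# dendrite each input arrived is modeled only as an attenuation factor.
n_syn = 100
weights = rng.uniform(0.1, 1.0, n_syn)      # synaptic strengths
attenuation = rng.uniform(0.2, 1.0, n_syn)  # distal synapses arrive weakened

v = 0.0                 # membrane potential (arbitrary units)
v_rest, v_thresh = 0.0, 5.0
tau = 20.0              # leak time constant, in time steps
spikes = []

for t in range(1000):
    # Each synapse fires sparsely and independently this time step.
    inputs = (rng.random(n_syn) < 0.02) * weights * attenuation
    v += inputs.sum() - (v - v_rest) / tau
    if v >= v_thresh:
        spikes.append(t)  # the soma doesn't care which synapses contributed
        v = v_rest        # reset after the action potential

print(len(spikes))
```

Note the point made above: once the graded potentials have summed past threshold, the spike carries no record of which of the 100 synapses pushed it over.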
Let's say this action potential impulse resulted in an error of some sort, somewhere downstream in the network. If you're playing by the rules of supervised learning with backprop, the synapses that evoked the signal producing the error-impulse should be made weaker. How? In biological NNs this is the impossible question. Signals that could target individual synapses do not propagate up axons through cell bodies up dendrites and back to synapses (you do have slow signaling that is for homeostatic scaling, but this is thought to simply scale network input up or down across all synapses).
This means you would need another group of neurons to send error signals back. Maybe these recurrent projections exist, and supposing they did, they would need to (1) form a third-party connection at every synapse and (2) know which synapses were to blame for the error. Those don't seem like trivial phenomena from a neurophysiology perspective. I've read theories about local molecular tags or peptide synthesis, but they all seemed very hand-wavy. Whatever has gained traction on those hypotheses hasn't stood up to scrutiny.
The simplest theory to me, is that cortical networks on the size of human brains don't explicitly 'forget'. That is, they only learn. When you call someone Bob, and they correct you saying their name is Mike, your neural nets don't erase Bob and replace it with Mike. They remember the old name, the new name, the embarrassing situation, all of it. Biological neural nets don't need to explicitly forget because forgetting is (unfortunately?) a byproduct of being a biological entity. Also, neurons have finite resources (they are essentially zero-sum), so you simply will not see infinite run-up of synaptic strengths (which you might in some artificial NN without setting certain hyperparameters). Together, this suggests that recency is a fairly dominant factor when it comes to associative learning (and what gets forgotten) in biological NN.
Me: "these recurrent projections would need to (1) form a 3rd party connection at every synapse and (2) know which synapses were to blame for the error. "
Article: "Although the dendritic error network makes significant steps to increase the biological realism of predictive coding models, it also introduces extra one-to-one connections (dotted arrow in Box 4) that enforce the interneurons to take on similar values to the neurons in next layer and thus help them to predict the feedback from the next level."
The takeaway for me was that this article wasn't attempting to explain how biological NNs actually function, but instead took the position of "for biological NNs to implement a backprop-type algorithm, this is what it would entail..." and then went on to detail several models that demonstrate just how complex the bio NN architecture would have to be, considering only a few major constraints (there are still many constraining factors of ancillary importance that would have to be explained, should any of these models make it through the initial gauntlet).
Here is a good article explaining not just how LTP (long-term potentiation) works in neurons in theory, but why a particular protocol always works in practice (I can attest to the validity of these statements from first-hand experience):
To summarize, the most reliable way to induce LTP is to "take control of the postsynaptic membrane with intracellular Cs+ to block K+ channels, which allows the experimenter to hold the cell at a constant membrane potential and induce a minimal ‘pairing’ protocol to induce LTP: depolarizing the cell to 0 mV while stimulating synapses."
Holding the cell at 0 mV ensures Mg is always ejected, so any upstream stimulation will always be seen as a coincidence.
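The logic of that pairing protocol can be sketched as a toy coincidence detector (my own simplification; the -40 mV figure and unit current are illustrative, not measured values). NMDA-receptor current requires both presynaptic glutamate and enough postsynaptic depolarization to eject the Mg2+ block, so clamping the cell at 0 mV turns every presynaptic stimulus into a "coincidence":

```python
# Toy NMDA-receptor coincidence detection: current flows only when the
# presynaptic terminal releases glutamate AND the postsynaptic membrane is
# depolarized enough to relieve the Mg2+ block. All numbers illustrative.

def mg_block_open(v_mV):
    # Simplified: the Mg2+ block is relieved above roughly -40 mV.
    return v_mV > -40

def nmda_current(glutamate_bound, v_mV):
    return 1.0 if (glutamate_bound and mg_block_open(v_mV)) else 0.0

# At rest (-70 mV), presynaptic input alone does nothing:
print(nmda_current(True, -70))   # Mg2+ still blocks the channel

# Pairing protocol: clamp the cell at 0 mV, so every presynaptic stimulus
# coincides with depolarization and can drive LTP:
print(nmda_current(True, 0))
```

The protocol works reliably precisely because it removes one of the two conditions from chance: depolarization is guaranteed by the experimenter, so presynaptic stimulation alone decides when current flows.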
Here is a nice animation of signal propagation in biological neural nets:
To simulate the dynamics of synaptic strength, I created a 3D mesh of a dendrite segment with several synapses...
Then I simulated the diffusion of AMPA receptors on the surface (the number of AMPARs in a synapse is proportional to its strength)...
I don't have animations of this process, but you can imagine what happens when one synapse holds onto receptors longer than the others (i.e., has a reduced particle diffusion rate) while there is only a finite number of receptors to go around.
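The zero-sum dynamic can be sketched with a crude compartment model (my own, with invented rates; nothing like the 3D mesh simulation above): a fixed pool of receptors hops between synapses, and the one that releases them more slowly ends up hoarding them at the others' expense:

```python
import numpy as np

rng = np.random.default_rng(0)

# Crude compartment model: 1000 receptors hop between 5 synapses and a shared
# dendritic pool. Synapse 0 releases receptors more slowly ("holds on" longer).
n_syn, n_receptors, steps = 5, 1000, 5000
release_rate = np.array([0.01, 0.05, 0.05, 0.05, 0.05])  # per-step escape prob
capture_rate = 0.05                                      # pool -> random synapse

# location: -1 = free in the pool, 0..4 = bound at that synapse
loc = np.full(n_receptors, -1)

for _ in range(steps):
    free = loc == -1
    # Free receptors may be captured by a uniformly random synapse.
    captured = free & (rng.random(n_receptors) < capture_rate)
    loc[captured] = rng.integers(n_syn, size=captured.sum())
    # Bound receptors escape at their synapse's release rate.
    bound = loc >= 0
    escape = bound & (rng.random(n_receptors) < release_rate[np.clip(loc, 0, None)])
    loc[escape] = -1

counts = np.bincount(loc[loc >= 0], minlength=n_syn)
print(counts)  # synapse 0 ends up with the lion's share, at the others' expense
```

At equilibrium each synapse's receptor count is proportional to 1/release_rate, so no synapse can strengthen without weakening the rest - the finite pool is what rules out runaway growth.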
Yeah that would be a problem. Maybe Glial cells play a role, maybe they don't, opinions vary. https://streamable.com/1vi7r
The most striking result in this regard is that one can get backprop with random backprojections. It works regardless because, on average, the error vector will be less than 90° from the true error vector (which is good enough for hillclimbing) and the dynamics play out in a way that the learned weights adjust to the random projections:
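A toy demonstration of that result (feedback alignment, after Lillicrap et al.; the task and all sizes here are my own invention): the backward pass sends the error through a fixed random matrix B instead of the transpose of the forward weights, and learning still works because the forward weights drift into alignment with B:

```python
import numpy as np

rng = np.random.default_rng(0)

# Feedback alignment: the backward pass uses a FIXED random matrix B in place
# of W2.T. Toy task: learn a random linear map with a one-hidden-layer net.
n_in, n_hid, n_out = 10, 64, 2
W1 = rng.normal(scale=0.1, size=(n_in, n_hid))
W2 = rng.normal(scale=0.1, size=(n_hid, n_out))
B = rng.normal(scale=0.1, size=(n_out, n_hid))  # random backprojection
T = rng.normal(scale=0.3, size=(n_in, n_out))   # target linear map to learn

def mse(X):
    return (((np.tanh(X @ W1) @ W2) - X @ T) ** 2).mean()

X_test = rng.normal(size=(256, n_in))
before = mse(X_test)

lr = 0.05
for _ in range(3000):
    X = rng.normal(size=(64, n_in))
    h = np.tanh(X @ W1)
    err = (h @ W2) - X @ T
    dh = (err @ B) * (1 - h ** 2)     # B, not W2.T, carries the error back
    W2 -= lr * h.T @ err / len(X)
    W1 -= lr * X.T @ dh / len(X)

after = mse(X_test)
print(before, after)  # error drops substantially despite the random feedback
```

The random error vector only needs to stay within 90° of the true gradient on average, and training itself pushes the weights into a configuration where that holds.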
That being said, it does seem to be the case that the brain simply memorizes an awful lot, which must work by a mechanism other than backprop, because backprop cannot do one-shot learning. I think one-shot learning is how the brain gets past large discontinuities in the model fitness landscape: it can learn linguistic and logical rules, fragments of general computations, and facts that are discovered by exploration (which includes learning whether it was Bob or Mike) and passed on culturally. The brain basically outsources the problem of tunneling through large discontinuities to cultural/individual trial-and-error and episodic memory. The greatest consequence is that these bodies of knowledge can concern the improvement of the organization of knowledge itself, resulting in a positive feedback loop in model fitness (especially science/Bayesian updating).

Such bodies of knowledge still evolve under a learnability-by-hillclimbing soft constraint, which implies they often form a neat latent space where similar codes map to similar meanings/representations, since that structure is easy to learn by stochastic hillclimbing (repetition): each time the brain processes related information, it is nudged towards the latent space that is meant to be learned. Many parts of the world happen to be learnable this way because everything is kinda smooth and continuous - small causes tend to have small effects, since everything consists of a myriad of small particles that affect each other in smooth ways if you squint at them. But obviously not everything can be learned this way (implying large discontinuities), which is where brute memorization based on reward and punishment comes in handy.
If you look through a neuroscience textbook section on memory systems, it's commonly suggested that the hippocampus does the one-shot learning and transfers it over time to the cortex. This is backed up by clinical case studies.
> The brain basically outsources the problem of tunneling through large discontinuities, to cultural/individual trial-and-error and episodic memory
That seems like a good strategy. It also reminds me of AlphaGo's Monte Carlo tree search + neural network training setup. Since the search is non-differentiable, you run lots of simulations and fit a differentiable DL model to the results to approximate a possibly discontinuous landscape.
HC's role in episodic memory and consolidation via dreams seems kinda plausible, though I would not put much weight on it. I think dreams are a way of training a GAN-like discrimination between reality and imagination:
Repetition of any kind likely does improve the model, even if it's merely simulation/dreaming.
> AlphaGo's Monte Carlo search + neural network
I think, in effect, MCTS amounts to something like bagging/boosting/mixture of experts, as it computes a weighted average of the predictions made when exploring different branches. But sure, the search mechanism implements a function which a recurrent neural network could probably not discover, as it hides behind substantial discontinuities in the fitness landscape (it's not a structure you can uncover step by step; you immediately need a tree structure, a search recursion, etc.). The RNN would likely need to conceptualize the search process (subvocally but) linguistically, like humans do, which requires structure for the sequential composition of stable prototypes (symbols), which likely requires a one-shot sequential memory. I think even the human mind does not literally do MCTS (that would require an overhead the brain is just not capable of), but some heuristic approximation thereof. The brain can simulate MCTS by linguistic means, though, even if it's just words of wisdom like "take counsel with your pillow", which literally means: explore the hypothesis space some more and let the temporal differences back up better value estimates.
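The "weighted average of predictions over branches" core can be shown with a one-ply Monte Carlo search, which is MCTS stripped of the tree and the UCT weighting (the counting game below is my own toy example):

```python
import random

random.seed(0)

# Toy game: players alternately add 1 or 2 to a running total; whoever reaches
# 10 (or more) first wins. One-ply Monte Carlo search values each candidate
# opening move as the average outcome of random rollouts played out behind it.

WIN = 10

def rollout(total, player):
    # Random play to the end; returns the index of the winning player.
    while True:
        total += random.choice((1, 2))
        if total >= WIN:
            return player
        player = 1 - player

def mc_value(first_move, n=4000):
    # Player 0 opens with `first_move`; estimate P(player 0 wins) under
    # uniformly random play from there on. This average IS the prediction.
    wins = sum(rollout(first_move, 1) == 0 for _ in range(n))
    return wins / n

values = {m: mc_value(m) for m in (1, 2)}
print(values)  # opening with 1 (a losing position for the opponent) scores higher
```

Even with purely random rollouts, the averages separate the good opening from the bad one; full MCTS adds the recursion and the exploration weighting on top of exactly this averaging.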
Edit: The article and code try to show how backpropagation might work given neuron behavior. I'll leave discussing the quality of the model to biologists.
Another commenter pointed out that you don't need perfect backprop; some random backprojections are sufficient.