
Dopamine and Temporal Difference Learning - magoghm
https://deepmind.com/blog/article/Dopamine-and-temporal-difference-learning-A-fruitful-relationship-between-neuroscience-and-AI
======
BSVogler
I am currently working on my master's thesis on the topic of applying this
type of knowledge to build a reinforcement learning system with spiking neural
networks. The role of dopamine is crucial in learning.

By combining spike-timing-dependent plasticity (STDP) with the reward
(R-STDP), it is possible to address the spatial eligibility trace (which
synapse caused the reward?). However, the design of the reward/utility
function is one core issue, where this paper probably advances the field. I am
currently looking into ways to balance the reward. Without balancing, each
trial causes more long-term depression than long-term potentiation, so that
after a while the network dies. A different reward function results in
the opposite. Another issue is the distal reward problem: which event
corresponds to which activity? I would be very glad to discuss related
questions or exchange ideas with other experts or newcomers to this growing
field. My mail is in my profile.
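
To make the R-STDP idea concrete, here is a minimal sketch of one synapse whose STDP events are stored in an eligibility trace and only turned into weight changes when a reward signal arrives. It is an illustration under stated assumptions, not the setup from the thesis or the paper: the function name `run_rstdp`, the time constants, the amplitudes `a_plus` / `a_minus`, and the clipping of the weight to [0, 1] are all hypothetical choices.

```python
import math


def run_rstdp(pre_spikes, post_spikes, rewards, dt=1.0,
              tau_pre=20.0, tau_post=20.0, tau_elig=200.0,
              a_plus=1.0, a_minus=1.0, lr=0.01, w0=0.5):
    """Simulate one synapse under reward-modulated STDP (illustrative only).

    pre_spikes, post_spikes: sequences of 0/1, one entry per time step.
    rewards: sequence of reward values per time step (dopamine-like signal).
    Returns the final weight and the history of the eligibility trace.
    """
    x_pre = 0.0   # decaying trace of recent presynaptic spikes
    x_post = 0.0  # decaying trace of recent postsynaptic spikes
    elig = 0.0    # eligibility trace: memory of recent STDP events
    w = w0
    elig_history = []
    for pre, post, r in zip(pre_spikes, post_spikes, rewards):
        # All traces decay exponentially between events.
        x_pre *= math.exp(-dt / tau_pre)
        x_post *= math.exp(-dt / tau_post)
        elig *= math.exp(-dt / tau_elig)
        # STDP writes into the eligibility trace instead of the weight.
        if post:
            elig += a_plus * x_pre    # pre before post: candidate potentiation
        if pre:
            elig -= a_minus * x_post  # post before pre: candidate depression
        x_pre += pre
        x_post += post
        # The reward gates the actual weight change (assumed clipping to [0, 1]).
        w += lr * r * elig
        w = min(1.0, max(0.0, w))
        elig_history.append(elig)
    return w, elig_history


if __name__ == "__main__":
    steps = 100
    pre = [0] * steps
    post = [0] * steps
    reward = [0.0] * steps
    pre[10], post[15], reward[50] = 1, 1, 1.0  # reward arrives 35 steps later
    w_final, _ = run_rstdp(pre, post, reward)
    print("weight after delayed reward:", w_final)
```

In this sketch the slow decay of `elig` is what bridges the gap between the spike pairing and the later reward. The ratio of `a_minus` to `a_plus` is one place where the depression/potentiation balance mentioned above could be tuned.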

This paper also got me thinking if this mechanism might explain why we enjoy
listening to music. Music follows and breaks rules, which our brain
continuously tries to predict and most of the time it succeeds.

~~~
longtom
> This paper also got me thinking if this mechanism might explain why we enjoy
> listening to music. Music follows and breaks rules, which our brain
> continuously tries to predict and most of the time it succeeds.

The link between predictability and art has been raised various times in the
literature, even _before_ Schmidhuber, but he had particularly _interesting_
ideas about it:
[http://people.idsia.ch/~juergen/beauty.html](http://people.idsia.ch/~juergen/beauty.html)

~~~
jcims
Great episode with Schmidhuber on Lex Fridman’s podcast.

[https://youtu.be/3FIo6evmweo](https://youtu.be/3FIo6evmweo)

I’m coming at this from a career in infosec, never heard of the guy but really
enjoyed his thought processes.

------
cs702
Great blog post on great research. Worth reading in its entirety.

Summarizing at a very high level of abstraction: This work compares a mechanism
used for learning probability distributions of expected rewards in deep
reinforcement learning systems to the dopamine reward mechanism in mice
brains.

This passage near the end, in particular, caught my eye:

 _> ...our final question was if we could decode the reward distribution from
the firing rates of dopamine cells [in mice brains]. As shown in Figure 5, we
found that it was indeed possible, using only the firing rates of dopamine
cells, to reconstruct a reward distribution (blue trace) which was a very
close match to the actual distribution of rewards (grey area) in the task that
the mice were engaged in. This reconstruction relied on interpreting the
firing rates of dopamine cells as the reward prediction errors of a
distributional TD model, and performing inference to determine what
distribution that model had learned about._

In other words, mice brains seem to be using the same mechanism, and it
appears we can decode the probability distribution of expected rewards learned
by those brains by measuring only the firing rate of dopamine cells.
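
As a rough illustration of what "distributional TD" means here, the sketch below trains a small population of value predictors on a one-step task, where the prediction error reduces to `reward - value`. Each predictor scales positive and negative errors differently, so "optimistic" units settle above the mean reward and "pessimistic" units below it. The function name `train_distributional_td`, the asymmetry parameter `tau`, the learning rate, and the three-valued reward distribution are all assumptions chosen for the example, not values from the paper.

```python
import random


def train_distributional_td(sample_reward, taus, n_steps=20000, lr=0.01, seed=0):
    """Train one value predictor per asymmetry level tau (illustrative only).

    sample_reward: function taking a random.Random and returning a reward.
    taus: values in (0, 1); tau > 0.5 is optimistic, tau < 0.5 pessimistic.
    Returns the learned value of each predictor.
    """
    rng = random.Random(seed)
    values = [0.0 for _ in taus]
    for _ in range(n_steps):
        r = sample_reward(rng)
        for i, tau in enumerate(taus):
            delta = r - values[i]  # reward prediction error (one-step task)
            # Asymmetric scaling: positive errors weighted by tau,
            # negative errors by (1 - tau).
            scale = tau if delta > 0 else 1.0 - tau
            values[i] += lr * scale * delta
    return values


def hypothetical_reward(rng):
    """Made-up reward distribution: 1, 2 or 10 with equal probability."""
    return rng.choice([1.0, 2.0, 10.0])


if __name__ == "__main__":
    taus = [0.1, 0.5, 0.9]
    learned = train_distributional_td(hypothetical_reward, taus)
    for tau, v in zip(taus, learned):
        print(f"tau={tau}: learned value {v:.2f}")
```

The unit with `tau = 0.5` behaves like ordinary TD and tracks the mean reward. The spread of the other units carries information about the shape of the distribution, which is what makes decoding it from the population plausible.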

Very exciting!

~~~
keenmaster
Can we use brain scanners and ML to A/B test online lectures to perfection?

For example, you can show the top 20 Calculus 2 courses to groups of 50 people
each, all donning brain scanners, and create a “brain activation map” for each
class from each professor. Among students of the top 10% of professors (as
measured by exam results and brain activation), we can analyze the most
engaging moments in each course, and hybridize them into a master course.
Furthermore, we can analyze differential learning outcomes in males, females,
students of different races, and K-clustered psychographic profiles (based on
DMN activity and other neurological measures taken before the course).

If the learning outcomes are significantly different, then it may be more
appropriate to create several different master courses for the people that
showed different learning outcomes. A class can be recommended, Netflix style,
based on your demographics, neural activation patterns, and learning velocity
from past coursework.

Everyone would have the best Calculus 2 class, in perpetuity, after one
cycle of experimentation and master class creation. The same can be done for
all canonical coursework, from Kindergarten through the top undergrad majors
to Med/Law/MBAs. The aforementioned learning outcome differentials would be
lessened by the “no child left behind” effect of each kid getting a solid
grasp of math and science early on with neurologically tailored coursework.

~~~
taurath
Why not brain scanners to optimize ads? Oh right they are :(

~~~
keenmaster
Neural marketing uses EEG headsets and eye tracking. I searched for EEG +
focus detection, and found the 2018 study linked below. Excerpt: “the best
obtained classification accuracies were 77% and 83%, respectively, using SVM
binary classifiers.” They used ML for attentiveness classification. I’m sure
greater accuracy can be achieved with time, but that’s a good start. It would
be even better if we had wearable MRIs.

[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6263653/#!po=0....](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6263653/#!po=0.675676)

~~~
taurath
Maybe instead, we just freaking shouldn’t allow applications of this sort of
research at all. This is stuff that would have been a scandal if the CIA did
it. Just because it makes money doesn’t mean it’s okay.

~~~
keenmaster
Don’t get me wrong, I am not a fan of neural marketing. I just mentioned it in
response to your comment, and only to indicate that the same scanning hardware
can be used in the education space.

------
afarrell
I can partly understand some of this based on EE101-level control theory, a
High School level model of how neurons work, and 3blue1brown's intro to neural
nets[1]. However, I have a strong personal interest in developing a much
deeper understanding of dopamine neurons and their role in Executive Function.

Can anyone recommend a good curriculum which can take a random web developer
from "Knows what a myelinated axon, a sigmoid function, and a feedback loop
are" to having a solid enough background to dive into the research on this?

[1]
[https://www.youtube.com/watch?v=aircAruvnKk](https://www.youtube.com/watch?v=aircAruvnKk)

~~~
westoncb
Kind of an odd title and book, but I think it may do a good job of what you're
looking for:

"Principles of Neural Design": [https://www.amazon.com/Principles-Neural-Design-MIT-Press/dp...](https://www.amazon.com/Principles-Neural-Design-MIT-Press/dp/0262534681/)

~~~
afarrell
That looks like _exactly_ the sort of thing I'm after.

------
afarrell
So far this seems to address only learning in a context where the reward
appears shortly after the behavior which caused it. That is valuable to
understand, but it does not yet seem to explain how Executive Functions
work.
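
For reference on the timing question, here is a minimal sketch of how plain tabular TD(0) passes a delayed reward back to earlier states over repeated episodes. It assumes a hypothetical chain of five states with a single reward on the last step; the function name `td0_chain` and all parameters are illustrative and say nothing about how Executive Functions work.

```python
def td0_chain(n_states=5, reward=1.0, gamma=0.9, lr=0.1, episodes=500):
    """Tabular TD(0) on a chain where only the final step is rewarded."""
    values = [0.0] * (n_states + 1)  # extra entry is the terminal state, fixed at 0
    for _ in range(episodes):
        for s in range(n_states):
            r = reward if s == n_states - 1 else 0.0
            # TD error: reward plus discounted next-state value, minus current estimate.
            delta = r + gamma * values[s + 1] - values[s]
            values[s] += lr * delta
    return values[:n_states]


if __name__ == "__main__":
    for state, v in enumerate(td0_chain()):
        print(f"state {state}: value {v:.3f}")
```

After enough episodes, each state's value approaches the reward discounted by `gamma` once for every step separating it from the rewarded one, so states far from the reward still learn a (smaller) prediction.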

> What happens if an individual's brain “listens” selectively to optimistic
> versus pessimistic dopamine neurons? Does this give rise to impulsivity, or
> depression?

My intuition is that impulsivity would arise as a result of giving much
greater weight to the signals of a very recently-trained network than to a
less-recently trained network.

This all raises a few questions for me:

1) How does a brain _recognize_ reward in order to fire the signal which
trains a dopamine network? It seems straightforward for the taste of food, a
hug from a fellow-tribesman, or a bell from playing a video game for 5 hours
straight (in a simulated Atari environment).

How does a brain recognize reward while it is (for example) writing a
teacher-assigned essay for an unclear audience or a Python program without
Test-Driven Development?

What does the brain use as its leading-KPIs?

2) How does a brain select how much it listens to different networks which
predict rewards from different actions so it is "robust to changes in the
environment or changes in the policy"? How does the brain adjust its attention
in response to changing context?

~~~
pizza
Mere speculation but maybe there are multiplexed temporal-difference-ish
networks that correlate different reward frequencies per basis (like the
different x[k] for the Fourier transform of a signal x[n])

------
bradknowles
Is anyone else having problems with just getting a blank page when trying to
load that site?

Or is it just me on iOS?

~~~
riwsky
I hit this, and the solution for me was to turn off my content blocker.

~~~
keyle
"content" blocker :)

------
alexcnwy
Awesome work!!

