
An algorithm that learns through rewards may show how our brain does too - highfrequency
https://www.technologyreview.com/s/615054/deepmind-ai-reiforcement-learning-reveals-dopamine-neurons-in-brain/
======
wdabney
First author of the paper here. If the article piques your interest, you can
read the paper in question here: [http://rdcu.be/b0mtA](http://rdcu.be/b0mtA)

~~~
highfrequency
Hi Will, thanks for the reply. One thing I'm curious about: this paper
discusses how a neuron that uses a different slope for positive and negative
updates will converge to an expectile of the reward distribution. And the
behavior is very interpretable: a slope of 3 for negative updates and a slope
of 2 for positive updates will lead the neuron to converge to the 40th
expectile (2 / (2 + 3)), if my understanding is correct.
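The asymmetric update described here can be simulated directly. A minimal sketch (the learning rate, reward distribution, and tail-averaging are my own choices, not from the paper); the fixed point is the tau-expectile with tau = alpha_pos / (alpha_pos + alpha_neg):

```python
import random

def asymmetric_value_estimate(sample_reward, alpha_pos, alpha_neg,
                              lr=0.002, steps=200_000, seed=0):
    """Scalar value estimate with asymmetric updates: positive prediction
    errors are scaled by alpha_pos, negative ones by alpha_neg.  The fixed
    point is the tau-expectile of the reward distribution, where
    tau = alpha_pos / (alpha_pos + alpha_neg)."""
    rng = random.Random(seed)
    v, tail = 0.0, []
    for t in range(steps):
        delta = sample_reward(rng) - v                 # prediction error
        scale = alpha_pos if delta > 0 else alpha_neg  # asymmetric slopes
        v += lr * scale * delta
        if t >= steps // 2:                            # average out the noise
            tail.append(v)
    return sum(tail) / len(tail)

# Rewards of 0 or 10, each with probability 1/2.  For this two-point
# distribution the tau-expectile works out to 10 * tau, so slopes (2, 3)
# give tau = 2 / (2 + 3) = 0.4 and an estimate near 4.
v = asymmetric_value_estimate(lambda rng: rng.choice([0.0, 10.0]), 2.0, 3.0)
```

The steeper negative slope pulls the estimate below the mean of 5.0, i.e. a pessimistic neuron.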

But it seems that the more common approach in reinforcement learning is to
estimate quantiles via regression rather than to obtain expectiles via
asymmetric updates, as the mouse neurons do. Do you have intuition for why
this performs better? And is there an analog to the asymmetrically updating
neuron in the quantile case?

~~~
wdabney
Hi, thanks for posting the news story.

We can think about asymmetric regression more generally. If you have an error
and apply some 'response' function f to that error, you change the estimator
you learn. In quantile regression f is the sign function; in expectile
regression it is the identity.
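A toy illustration of this framing (my own sketch, not code from the paper): the same stochastic update with f = sign recovers a quantile, while f = identity recovers an expectile (the mean, for tau = 0.5).

```python
import random

def response_update(sample, f, tau=0.5, lr=0.01, steps=200_000, seed=0):
    """Stochastic update v += lr * w * f(delta), where w = tau on positive
    errors and (1 - tau) on negative ones.  With f = sign this is quantile
    regression; with f = identity it is expectile regression."""
    rng = random.Random(seed)
    v, tail = 0.0, []
    for t in range(steps):
        delta = sample(rng) - v
        w = tau if delta > 0 else (1.0 - tau)
        v += lr * w * f(delta)
        if t >= steps // 2:                # average away residual noise
            tail.append(v)
    return sum(tail) / len(tail)

# Exponential(1) rewards: median = ln 2 ~ 0.69, mean = 1.0.
draw = lambda rng: rng.expovariate(1.0)
median_est = response_update(draw, lambda d: (d > 0) - (d < 0))  # f = sign
mean_est = response_update(draw, lambda d: d)                    # f = identity
```

On a skewed distribution like this the two estimators visibly disagree, which is the point of the response-function view.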

In my opinion, and this is entirely speculation, I think that with further
experiments studying the effect we found in our paper more completely, we will
find that the response function (f) in the brain is _not_ linear, but a
saturating function, as if we smoothed the sign function out. We repeated the
experiments in the paper using such a function, which has been proposed for
dopamine neuron responses before, and the analysis continues to hold, because
the rewards are all quite small and likely sit in the linear region of a
non-linear response function (we know firing rates saturate eventually, so
this isn't much of a surprise).
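To make the 'linear region' point concrete, here is a toy of my own (the tanh saturation and reward scales are arbitrary assumptions): with a smoothed-sign response like tanh, rewards much smaller than the saturation scale stay in the near-linear regime and the estimate lands where a linear response would put it, while large rewards push the response into saturation.

```python
import math
import random

def tanh_response_estimate(scale, lr=0.01, steps=200_000, seed=1):
    """Value update v += lr * tanh(delta), a saturating ('smoothed sign')
    response.  Rewards are 0, 0, or `scale` with equal probability, so the
    mean reward is scale / 3."""
    rng = random.Random(seed)
    rewards = [0.0, 0.0, scale]
    v, tail = 0.0, []
    for t in range(steps):
        delta = rng.choice(rewards) - v
        v += lr * math.tanh(delta)       # ~linear for |delta| << 1
        if t >= steps // 2:
            tail.append(v)
    return sum(tail) / len(tail)

small = tanh_response_estimate(0.3)   # rewards inside the linear region
big = tanh_response_estimate(30.0)    # rewards deep in saturation
```

With scale 0.3 the estimate sits near the mean of 0.1; with scale 30 it stalls near atanh(0.5) ~ 0.55, nowhere near the mean of 10, because the saturated response can no longer distinguish large errors.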

Regarding quantiles being more commonly used, it's actually the other way
around. The Huber-quantiles we saw perform best in the QR-DQN paper, and which
are most often used in the follow-on RL work, are actually more like the type
of saturating non-linearity you might expect in the brain (although the Huber
loss is not as smooth as you would probably expect the neuron response to be).
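For reference, my reading of the per-error response implied by the QR-DQN Huber-quantile loss (kappa is the Huber threshold): linear near zero, sign-like beyond, i.e. a clipped-linear smoothing of the quantile response, with a kink at +/-kappa rather than the fully smooth curve a neuron might have.

```python
def huber_quantile_response(delta, tau, kappa=1.0):
    """Gradient-style response of the Huber-quantile loss: the error is
    weighted by tau (positive errors) or 1 - tau (negative errors), and is
    linear for |delta| <= kappa but saturates to +/-1 beyond."""
    weight = tau if delta > 0 else (1.0 - tau)
    clipped = max(-kappa, min(kappa, delta)) / kappa
    return weight * clipped
```

Small errors produce a proportional response (e.g. delta = 0.5 at tau = 0.5 gives 0.25), while any error past kappa produces the same saturated, weighted-sign response.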

------
highfrequency
In reinforcement learning problems, significant improvement has been found
from approximating the _distribution_ of rewards for a particular action,
rather than just estimating the mean reward.

DeepMind's research suggests that a similar model is at work in the brains of
mice, mediated by dopamine. Each neuron has a different level of
optimism/pessimism when interpreting new data, which leads each to capture a
different part of the reward distribution. Roughly, one neuron will converge
on the 20th-percentile reward while another converges on the 90th percentile
(technically, each neuron tracks an _expectile_ rather than a percentile, but
the intuition is the same).
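A population version of the same idea (a toy sketch with my own parameters, not the paper's fitting procedure): give each simulated 'neuron' its own asymmetry tau_i, and their estimates fan out across the reward distribution.

```python
import random

def expectile_population(taus, sample_reward, lr=0.002, steps=150_000, seed=2):
    """One value estimate per simulated neuron.  Neuron i scales positive
    prediction errors by tau_i and negative ones by (1 - tau_i), so its
    estimate settles on the tau_i-expectile of the reward distribution."""
    rng = random.Random(seed)
    values = [0.0] * len(taus)
    tails = [[] for _ in taus]
    for t in range(steps):
        r = sample_reward(rng)
        for i, tau in enumerate(taus):
            delta = r - values[i]
            scale = tau if delta > 0 else (1.0 - tau)
            values[i] += lr * scale * delta
            if t >= steps // 2:
                tails[i].append(values[i])
    return [sum(tl) / len(tl) for tl in tails]

# Rewards of 0 or 10 with equal probability: the tau-expectile is 10 * tau,
# so pessimistic neurons settle low and optimistic ones settle high.
estimates = expectile_population([0.2, 0.5, 0.8],
                                 lambda rng: rng.choice([0.0, 10.0]))
```

Read together, the sorted estimates are a crude picture of the whole reward distribution rather than a single summary statistic.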

This is a significant break from the previous model, which held that dopamine
neurons all estimate roughly the same quantity: the mean reward.

