
Deep Reinforcement Learning - dcre
https://deepmind.com/blog
======
awwaiid
"Previous attempts to combine RL with neural networks had largely failed due
to unstable learning. To address these instabilities, our Deep Q-Networks
(DQN) algorithm stores all of the agent's experiences and then randomly
samples and replays these experiences to provide diverse and decorrelated
training data."

... so, they made the machines dream. Fancy!
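
(The replay idea is tiny in code. A minimal sketch in Python; the capacity
and names are illustrative, not DeepMind's actual implementation:)

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-size store of (state, action, reward, next_state, done)."""
        def __init__(self, capacity=100000):
            self.memory = deque(maxlen=capacity)  # oldest experiences fall off

        def add(self, experience):
            self.memory.append(experience)        # store every transition

        def sample(self, batch_size):
            # Uniform random sampling breaks up the correlation between
            # consecutive frames, which is what stabilizes the learning.
            return random.sample(self.memory, batch_size)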

~~~
ehsanu1
Aha, so that's what (human) dreaming is for.

~~~
Houshalter
IIRC there is evidence rats replay their experiences, sped up, while they are
asleep. Dreaming may be something else entirely though, because my dreams
aren't anything like my memories of the day before.

Artificial neural networks can "dream" by predicting the next frame they will
see. This is a really cool technique. Researchers have shown slightly blurry
videos of Atari games being played that come entirely from the network's
dream, with no interaction with the game at all. You can even train the
reinforcement learning on the dream sequences and improve its performance.
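
The rollout itself is simple, assuming you already have a trained next-frame
model (predict_next and policy below are hypothetical stand-ins):

    import numpy as np

    def dream_rollout(predict_next, policy, start_frame, steps=100):
        """Generate video purely from the model: feed its own predicted
        frames back in as input, never touching the real game."""
        frames = [start_frame]
        for _ in range(steps):
            action = policy(frames[-1])    # act on the imagined frame
            frames.append(predict_next(frames[-1], action))
        return np.stack(frames)            # the (slightly blurry) dream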

But this also doesn't seem quite like what human dreams are. Human dreams are
wild and unrealistic, while the NN dreams try to match the training data as
closely as possible.

~~~
Jabbles
_You can even train the reinforcement learning on the dream sequences and
improve its performance._

I'm not sure how that would work. Surely you'd be overfitting on your training
set by definition?

~~~
Houshalter
Reinforcement learning has a problem in that it gets very little labelled
data. You may have a million frames, but the only label is the score, which
may change only a few times per game.

Training the net to predict the next frame is a form of unsupervised
learning. It can learn the rules of the game without any score information at
all.

The second thing is that RL is different from prediction. Even if you can
predict the next frame exactly, finding the optimal set of moves is still a
hard problem. The algorithm needs to learn not just what will happen, but
also what the optimal action is in every situation. That is something that
can be practiced in simulations, or "dreams".
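
As a sketch (model, q_update, and policy here are hypothetical stand-ins, not
any published system's API):

    def practice_in_dream(model, q_update, policy, state, n_steps=1000):
        """Run ordinary RL updates on imagined transitions: the learned
        model stands in for the emulator, predicting frame and reward."""
        for _ in range(n_steps):
            action = policy(state)
            next_state, reward = model(state, action)    # imagined, not real
            q_update(state, action, reward, next_state)  # same update rule
            state = next_state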

------
seanwilson
When it's playing a game (e.g. Breakout) and it's being fed the pixels on the
screen, how is the AI being told what the score/progress is? Does it have
access to some numeric metric that is chosen by the researchers for each game?

~~~
sanxiyn
Yes.

For example, Breakout saves its score at addresses 76 and 77. The Arcade
Learning Environment has code to read the score, one routine per game. The
code for Breakout is here:
https://github.com/mgbellemare/Arcade-Learning-Environment/blob/master/src/games/supported/Breakout.cpp
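
You can poke at this from Python too; a sketch using the ale-py bindings (the
BCD decoding is my reading of that file, so treat the exact layout as an
assumption):

    from ale_py import ALEInterface  # pip install ale-py

    ale = ALEInterface()
    ale.loadROM("breakout.bin")      # path to your Breakout ROM

    ram = ale.getRAM()               # all 128 bytes of Atari 2600 RAM
    # Score is binary-coded decimal spread over address 77 (low two
    # digits) and address 76 (hundreds digit):
    lo, hi = int(ram[77]), int(ram[76])
    score = (lo & 0x0F) + 10 * ((lo >> 4) & 0x0F) + 100 * (hi & 0x0F)
    print(score)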

~~~
seanwilson
Thanks for that! So I was looking at the one for Montezuma's Revenge:
https://github.com/mgbellemare/Arcade-Learning-Environment/blob/master/src/games/supported/MontezumaRevenge.cpp

The reward seems to be only the game score, which I believe is problematic
for this game because your score doesn't go up very often (so you have to
perform a lot of actions to get any feedback). The lives are recorded but
aren't part of the "get reward" method... are the lives factored into
decision making somewhere else? It seems like knowing you just lost a life
would really help decision making in such a game.
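
(I'd have expected something like this wrapper to be easy to add; ale.lives()
and ale.act() do exist in the bindings, but the penalty scheme is just my
hypothetical:)

    def step_with_life_penalty(ale, action, penalty=1.0):
        """Shape the reward so losing a life hurts, even when the score
        itself rarely changes. The penalty size is arbitrary."""
        lives_before = ale.lives()
        reward = ale.act(action)         # ALE returns the score delta
        if ale.lives() < lives_before:
            reward -= penalty            # hypothetical shaping term
        return reward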

~~~
Phemist
I think novel game situations were factored into the reward function as well.

~~~
gwern
Only in the specialized novelty-oriented DQN agents; the Montezuma's Revenge
reward itself remains the same. The problem is defining 'novel' when every
screen's pixels may be different (for example, imagine any game which has a
timer ticking up).

~~~
Phemist
Not so much: a timer ticking up is only novel the first time round, and is
unrelated to actions taken by the agent. Over multiple plays the agent will
learn to ignore it.

EDIT: It could be that the agent will just stand there the first few plays,
enjoying the novelty reward gained from simply watching the timer tick up.
Haha

~~~
gwern
The point is that every time the timer ticks, if you had defined 'novelty'
as the bitstring representing the screen, you get a 'new' state. This
multiplies against any blinking animations, any moving enemies, any of the
_agent's_ moves, any visible scores, etc. You get thousands or millions of
unique framebuffer states before the agent has so much as left the first
room in _Montezuma's Revenge_. And DQN is already RAM-intensive because of
the experience replay buffer.
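
Concretely, naive frame-novelty looks something like this (a sketch; the
1/sqrt(N) bonus form is illustrative), and the count table is exactly what
blows up:

    from collections import Counter

    visit_counts = Counter()

    def novelty_bonus(frame):
        """Count-based bonus keyed on the exact framebuffer bytes.
        Every flipped bit (timer digit, blink frame, enemy position)
        makes a 'new' key, so this table explodes in any real game."""
        key = bytes(frame)               # raw pixels, hashable
        visit_counts[key] += 1
        return 1.0 / visit_counts[key] ** 0.5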

~~~
Phemist
Thanks for the extra explanation. It seems I assumed too much about these
Deep Q-networks, due to some prior knowledge of the neuroscience related to
RL. Although I do remember seeing a video about Montezuma's Revenge a week or
so ago where they talked about this exact problem.

Anyway, it seems to me that novelty functions that allow the agent to ignore
periodic changes in state, such as timers going up, can be quite simple. A
function that estimates the novelty of individual bit values in the bitstring
of the gamestate and then aggregates them could quite easily account for
timers, or generally for elements that change periodically regardless of
agent actions. A baseline novelty reward would be relatively easy for the
agent to predict, and would thus result in low prediction errors and low
reinforcement of actions. Such a function would have space and time
complexity linear in the length of the gamestate, and while fairly naive and
simple, it would get the job done, I think?
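
Roughly this, as a sketch (the log-surprise aggregation is just one way to do
it):

    import numpy as np

    class PerBitNovelty:
        """Keep a running frequency estimate for each bit position and
        score a state by how surprising its individual bits are.
        Periodic bits (timers, blinkers) settle to an unsurprising
        constant baseline, while genuinely new patterns score high."""
        def __init__(self, n_bits):
            self.ones = np.ones(n_bits)    # Laplace-smoothed counts of 1s
            self.total = 2.0

        def score(self, bits):             # bits: 0/1 array of length n_bits
            p_one = self.ones / self.total
            p_seen = np.where(bits == 1, p_one, 1.0 - p_one)
            surprise = -np.log(p_seen)     # rare bit values score high
            self.ones += bits              # O(n) update...
            self.total += 1.0
            return float(surprise.mean())  # ...and O(n) aggregation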

P.S. Just wanted to thank you for the work you've put into your website; it's
very informative and always a great starting point to dive deeper into the
topics you cover!

~~~
gwern
You have to come up with _something_ or else the agent will never be able to
explore worth a damn in complex domains. Imagine trying to learn to write
Haskell programs by typing random gibberish...

'gamestate' is illegal. It's pointless to suppose an agent which has access to
the true groundtruth RAM of the Atari games, because that generalizes to
vanishingly few other domains. The goal is to create a general agent which can
be used elsewhere, such as in recommender systems. (And if you did have access
to the raw RAM, that would reduce the problem from an extremely challenging
POMDP or harder, to a fully-observed deterministic MDP, because you could then
construct a game-tree of each individual RAM state and the possible actions
taken in it; in which case, you would use a much faster and more powerful MDP
solver like MCTS rather than DQN.)
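
(To make that concrete: with save/restorable ground-truth state, planning
reduces to ordinary tree search over exact states. A toy depth-limited
version; real systems would use MCTS, and `step` here is a stand-in for a
cloneable emulator:)

    def plan(state, actions, step, depth):
        """Exhaustive game-tree search over exact states. Because the
        environment is deterministic and fully observed, each (state,
        action) pair has exactly one successor, so no learning is
        needed. step(state, a) must be a pure function of the state
        returning (next_state, reward)."""
        if depth == 0:
            return None, 0.0
        best_a, best_v = None, float("-inf")
        for a in actions:
            nxt, reward = step(state, a)
            _, future = plan(nxt, actions, step, depth - 1)
            if reward + future > best_v:
                best_a, best_v = a, reward + future
        return best_a, best_v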

One can come up with hand-crafted heuristics which might improve over the
naive bitstring equality approach, but your suggestion still doesn't do the
trick, assuming you could figure out how to meaningfully define 'periodic
changes' and teach the NN to ignore them. Imagine a game in which the overall
screen lighting varies (perhaps it's set at night or during rain, or perhaps
each level has different color themes). As all the bits keep flipping with
changes in lighting/intensity, you'd be in about the same place.

------
tintor
Labyrinth? I have a feeling that Doom is next.

~~~
shogunmike
You might find the VizDoom project interesting:
[http://vizdoom.cs.put.edu.pl/](http://vizdoom.cs.put.edu.pl/)

