
Reinforcement Learning with Prediction-Based Rewards - lainon
https://blog.openai.com/reinforcement-learning-with-prediction-based-rewards/
======
magoghm
I like the idea of having the agent be attracted to the unpredictable, but I
guess there should be something to ensure that unpredictability doesn't
dominate which action is selected. For an interesting/funny example check
their two videos: "Agent in a maze without a noisy TV" and "Agent in a maze
with a noisy TV"
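The post's trick, roughly: the intrinsic reward is the error of a predictor network trained to match a fixed, randomly initialized target network, so familiar states lose their bonus. Here is a minimal numpy sketch of that idea, with toy one-layer networks and sizes of my own choosing, not OpenAI's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen, randomly initialized target network: never trained.
W_target = rng.normal(size=(16, 8))
# Predictor network, trained online to match the target's features.
W_pred = np.zeros((16, 8))

def intrinsic_reward(obs):
    """RND-style bonus: the predictor's squared error vs. the frozen target."""
    target_feat = np.tanh(obs @ W_target)
    pred_feat = np.tanh(obs @ W_pred)
    return float(np.mean((target_feat - pred_feat) ** 2))

def train_predictor(obs, lr=0.01):
    """One gradient step pulling the predictor's features toward the target's."""
    global W_pred
    t = np.tanh(obs @ W_target)
    p = np.tanh(obs @ W_pred)
    # Gradient of the squared error, chain rule through tanh.
    W_pred -= lr * np.outer(obs, (p - t) * (1.0 - p ** 2))

obs = rng.normal(size=16)
before = intrinsic_reward(obs)   # first visit: large bonus
for _ in range(500):             # keep seeing the same observation...
    train_predictor(obs)
after = intrinsic_reward(obs)    # ...and the bonus shrinks
```

The key property: the prediction target is a deterministic function of the observation, so there is nothing inherently unpredictable to chase; the bonus decays simply as states become familiar.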

~~~
moosinho
How about an approach where the agent's reward is not the prediction error
itself but its first derivative? This way the agent will be attracted to the
parts of the environment where it can improve, and will avoid white-noise
parts, since its model of the world doesn't generalize to those.

Juergen Schmidhuber (co-author of the original LSTM paper) had a very similar
idea:
[http://people.idsia.ch/~juergen/driven2009.pdf](http://people.idsia.ch/~juergen/driven2009.pdf)

"This drive maximizes interestingness, the first derivative of subjective
beauty or compressibility, that is, the steepness of the learning curve. It
motivates exploring infants, pure mathematicians, composers, artists, dancers,
comedians, yourself, and (since 1990) artificial systems."
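That "reward the derivative" idea fits in a few lines: score each step by the drop in prediction error, so regions where the model improves pay out and irreducible noise pays nothing. A toy sketch with made-up error curves, one for a learnable region and one for a noisy-TV region:

```python
import numpy as np

rng = np.random.default_rng(1)

def progress_reward(errors):
    """First derivative of the error curve: how much the error dropped."""
    return [prev - cur for prev, cur in zip(errors, errors[1:])]

# Learnable part of the environment: prediction error shrinks with practice.
learnable_errors = [1.0 / (t + 1) for t in range(10)]
# Noisy-TV part: error stays high no matter how long the agent watches.
noise_errors = [1.0 + 0.01 * rng.standard_normal() for _ in range(10)]

r_learn = progress_reward(learnable_errors)   # mostly positive rewards
r_noise = progress_reward(noise_errors)       # jitters around zero

total_learn = sum(r_learn)   # telescopes to first error minus last
total_noise = sum(r_noise)   # ~0: the model never actually improves
```

The telescoping sum is the point: chasing the derivative pays out the whole learning curve, while white noise, whose error curve is flat, sums to roughly zero.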

------
modeless
Finally someone beat Montezuma's Revenge without imitating human
demonstrations! Very cool. I wonder why the algorithm then fails so hard on
Pitfall? I would expect them to be similar problems.

~~~
qnsi
Looks like the AI winter update needs another update:

[https://blog.piekniewski.info/2018/10/29/ai-winter-
update/](https://blog.piekniewski.info/2018/10/29/ai-winter-update/)

~~~
paradoxparalax
All right, you led me into an interesting rabbit hole there. I saw this AI
winter page, and after reading a bit, the author (Filip Piekniewski) said:
"...the only problem really worth solving in AI is the Moravec's paradox,
which is exactly the opposite of what DeepMind or OpenAI are doing...". I
don't know what he thinks the exact meaning of the said paradox is, but in my
ignorance I went to check Wikipedia, which gave me: "Moravec's paradox is the
discovery... that, contrary to traditional assumptions, high-level reasoning
requires very little computation, but low-level sensorimotor skills require
enormous computational resources."

Well, this sounded pretty obvious, if you accept "high-level reasoning" as
meaning... well, let's see what Wikipedia says a few lines later, quoting
Moravec himself: "it is comparatively easy to make computers exhibit adult
level performance on intelligence tests or playing checkers, ...". So I can
see the same contradiction in both the Wikipedia definition and Moravec's
statement. In my opinion, and I think in the common-sense opinion,
"intelligence tests" sounds either very broad or very narrow, but either way
useless for making the point, even more so when "playing checkers" is in the
same sentence. Common sense says there is an ocean of distance between playing
checkers and the "high-level reasoning" of Wikipedia's definition (which makes
me laugh, sorry for the sincerity). The amount of memory required to do real,
what-I-call high-level reasoning can be huge just to store the heuristic
algorithms' code, not to mention all the "concepts" of "things" and their
possible ways of "interaction", all the interactions and their possible
subject-"things", and the probable mechanisms that we have no idea are in
place, probably driven by the most primitive reaction-reasoning mechanisms,
who in heavens knows... Just my opinion after biting the bait : ), in the form
of a light comment. Cheers.

~~~
paradoxparalax

      Just to add a bit about the "AI winter reality check" vs. "Stephen Hawking hype" point, I think both are right and wrong at the same time. I do believe that the analysis of human reasoning and its simulation in computers can be achieved with much less computing power than that needed by most insects' sensorial systems (so in this sense I agree with Moravec, but...). Computing power can accelerate learning in a system, but it's a question of time and nothing more; it doesn't really matter whether you take a week or a year to "set-train-load" a human-level reasoning AI, the achievement would be amazing anyway. So, in my opinion, goes the mistake of the whole wave of hype's direction: neural nets are just for acceleration and for generation of datasets (what did you say!? yes, generation, that's what they are really for, totally contrary to the world's opinion : ) !!!) and, obviously, for the "lesser utility" of autonomous programming of sensorial systems, which is what they are being used for right now. But that has nothing to do with "Reasoning Intelligence", which would be, in my humble opinion, let's say, a program that could take part in, and give some insightful contribution to, the famous talk between Albert Einstein and Rabindranath Tagore.[1] That said, I do believe a non-neural-net "real deal" AI is just around the corner, not that far off, really, and here maybe I do tend a bit toward the "Hawking hype" people, but completely differently from them: I believe it has nothing to do with neural nets (which are just tools), and is... let's say, "non-neural-network heuristic based".

~~~
elcomet
Your text is unreadable on mobile. Can you remove the code tag?

~~~
paradoxparalax
Sorry, it was not my intention to put the code tag. I tried to edit it to
remove the scroll bar, but I couldn't; I have to learn more about this, first
timer.

~~~
TotempaaltJ
I think the code tag shows up when you indent a line of text. Check whether
there are any spaces at the start of your message.

------
Jabbles
It was literally only a few days ago that Google/DeepMind posted about a new
approach to curiosity: [https://ai.googleblog.com/2018/10/curiosity-and-
procrastinat...](https://ai.googleblog.com/2018/10/curiosity-and-
procrastination-in.html)

That blog post also mentions the noisy TV problem.

I am not skilled enough to describe how these approaches differ.

~~~
mooneater
From the google one:

> The environment also contains a TV for which the agent has the remote
> control. There is a limited number of channels (each with a distinct
> show)... even if the order of shows appearing on the screen is random and
> unpredictable, all those shows are already in memory!

So in Google's "Curiosity and Procrastination in Reinforcement Learning",
their approach could not handle a TV showing pure noise (snow), since it could
not remember all that noise.

~~~
Jabbles
I don't think that's a reasonable conclusion to draw from the fact the TV
wasn't pure noise. Why would the agent not be able to determine that every
noisy frame was one step away from every other?

~~~
mooneater
The Google paper uses memory. You can't remember a never-ending set of 2D
random noise; there are limitless variations.

~~~
Jabbles
They don't remember _all_ the previous scenes, just ones which are "novel"
enough.

 _But how do we decide whether the agent is seeing the same thing as an
existing memory? Checking for an exact match could be meaningless: in a
realistic environment, the agent rarely sees exactly the same thing twice. For
example, even if the agent returned to exactly the same room, it would still
see this room under a different angle compared to its memories.

Instead of checking for an exact match in memory, we use a deep neural network
that is trained to measure how similar two experiences are._
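The memory check in that quote can be sketched with cosine similarity standing in for the trained comparator network (the threshold, embedding, and bonus values here are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def embed(obs):
    """Stand-in embedding; the blog post trains a network to judge
    how similar two observations are."""
    return obs / (np.linalg.norm(obs) + 1e-8)

class EpisodicNovelty:
    def __init__(self, threshold=0.9):
        self.memory = []          # embeddings of remembered observations
        self.threshold = threshold

    def reward(self, obs):
        e = embed(obs)
        # Close enough to something already in memory? No bonus.
        if any(float(e @ m) > self.threshold for m in self.memory):
            return 0.0
        self.memory.append(e)     # novel enough: remember it, pay a bonus
        return 1.0

agent = EpisodicNovelty()
room = rng.normal(size=32)
r_first = agent.reward(room)                               # new place: bonus
r_again = agent.reward(room + 0.01 * rng.normal(size=32))  # same room, new angle
r_new = agent.reward(rng.normal(size=32))                  # different place
```

This also shows where pure static would hurt: each snow frame embeds far from every other, so every frame looks "novel" and the memory fills without the bonus ever drying up, unless the comparator learns to map all snow frames close together.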

~~~
mooneater
Ok. Though it still depends on a limited number of TV shows.

If there is an unlimited number of shows and the agent walks right up to it, I
think it's still trapped.

~~~
jhurliman
Aren’t we all.
[http://rickandmorty.wikia.com/wiki/Interdimensional_Cable](http://rickandmorty.wikia.com/wiki/Interdimensional_Cable)

------
darawk
I'm becoming more and more convinced that reinforcement learning is equivalent
to AGI, we just haven't finished optimizing it yet.

~~~
sytelus
"AGI" is not well defined. RL, however, is more like the Freudian perspective
on the human mind, which long ago fell out of favor in psychology. For
example, one popular classical view assumed that all of our behaviors are
driven by the ultimate quest for survival and reproduction. But then how do
you explain suicides, or soldiers going into certain-death combat? How do you
explain people who never want to have children? How do you explain an artist
giving up a big-money Wall Street job for a simple life doing obscure art? How
do you explain people sitting on a beach just to do nothing?

Here is an interesting comment on the topic:

Is Global Reinforcement Learning (RL) a Fantasy?

[https://www.lesswrong.com/posts/QEgGp7gaogAQjqgRi/is-
global-...](https://www.lesswrong.com/posts/QEgGp7gaogAQjqgRi/is-global-
reinforcement-learning-rl-a-fantasy)

~~~
darawk
We're optimized for the survival of our genes, not ourselves. Sometimes that
means sacrificing ourselves, or not having children, etc.

------
zuypaweu
I love how this post gets like 20 points and the one about decensoring hentai
yesterday got like 200.

------
levesque
Happy to see more work based on prediction. I've been of the opinion that
predictive rewards should largely bypass the hand-tuned rewards we have been
using for reinforcement learning so far, or at least speed up learning by
providing a much richer signal to use for training.

------
nestorD
> We also noticed significant improvements in performance of RND every time we
> discovered and fixed a bug [...]. Getting such details right was a
> significant part of achieving high performance even with algorithms
> conceptually similar to prior work.

Details.

------
ipsum2
This is a really interesting method that mimics how humans respond to boredom,
and it achieves fantastic results in RL.

However, it's pretty disappointing when OpenAI, a non-profit organization that
claims to want to distribute AI as evenly as possible, does not release the
source code of its research to make its findings reproducible by others. This
paper is an example, but so is other high-profile work, such as their DotA
bot.

Edit: My mistake, this paper was open sourced, see below comment.

~~~
outlace
But they did release the code: [https://github.com/openai/random-network-
distillation](https://github.com/openai/random-network-distillation)

~~~
esigler
Note also there are many other repositories available in
[https://github.com/openai](https://github.com/openai) from various papers &
projects, such as
[https://github.com/openai/baselines](https://github.com/openai/baselines).

