
Reinforcement learning’s foundational flaw - andreyk
https://thegradient.pub/why-rl-is-flawed/
======
hprotagonist
_In the days when Sussman was a novice, Minsky once came to him as he sat
hacking at the PDP-6.

“What are you doing?”, asked Minsky.

“I am training a randomly wired neural net to play Tic-Tac-Toe” Sussman
replied.

“Why is the net wired randomly?”, asked Minsky.

“I do not want it to have any preconceptions of how to play”, Sussman said.

Minsky then shut his eyes.

“Why do you close your eyes?”, Sussman asked his teacher.

“So that the room will be empty.”

At that moment, Sussman was enlightened._

"picking the right reward function" in RL is shockingly hard. It actually
works OK-ish when the problem space is strictly bounded, like with a game
whose rules are known.

After that, you start getting into sky-humping cheetah problems:
[https://www.alexirpan.com/public/rl-hard/upsidedown_half_cheetah.mp4](https://www.alexirpan.com/public/rl-hard/upsidedown_half_cheetah.mp4)

[https://www.alexirpan.com/2018/02/14/rl-hard.html](https://www.alexirpan.com/2018/02/14/rl-hard.html) is a better
article, perhaps, than this one.

~~~
mockingbirdy
> [https://www.alexirpan.com/public/rl-hard/upsidedown_half_cheetah.mp4](https://www.alexirpan.com/public/rl-hard/upsidedown_half_cheetah.mp4)

You can argue about it, but it works. Goal accomplished. Nature has done
stranger things than that.

Now you just have to add efficiency terms to your reward function and watch
how it slowly finds a local minimum, which is also something nature did many
times over. Then hope that the remaining inefficiency is acceptable to you.

Done.
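
Concretely, "adding an efficiency aspect" usually just means subtracting a
cost term from the reward. A toy sketch (the coefficient and the
velocity/torque inputs are invented for illustration):

    import numpy as np

    def shaped_reward(forward_velocity, action, energy_coeff=0.05):
        # Reward forward progress, but penalize the torque/energy the action spends.
        energy_cost = energy_coeff * float(np.sum(np.square(action)))
        return forward_velocity - energy_cost

    # A fast but flailing gait gets docked for the wasted torque.
    print(shaped_reward(1.2, np.array([0.9, -0.8, 0.7])))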

-

You can start with a random neural net. It's not exactly empty, but it's ok.
The randomness defines which local minimum you'll find this time around while
you burn through your VC money desperately hoping that the AWS bills for all
the GPU time will arrive _after_ you've found the holy grail that makes your
startup worth <insert bullshit valuation>.

> "picking the right reward function" in RL is shockingly hard.

Agreed, a thousand times over. So are getting enough computational power and
good enough data quality.

-

In another post I added some links (AI playing Mario):
[https://news.ycombinator.com/item?id=17489459](https://news.ycombinator.com/item?id=17489459)

~~~
gwern
> You can start with a random neural net. It's not exactly empty, but it's ok.

One interesting thing about random NNs is how much they can already do. For
example, you can do single-shot image inpainting or superresolution with an
untrained randomly-wired CNN:
[https://dmitryulyanov.github.io/deep_image_prior](https://dmitryulyanov.github.io/deep_image_prior)
In RL, you can use randomly sampled NNs to create various artificial arbitrary
'reward functions' to force an RL agent to explore an environment & learn the
dynamics, and then when you give it the real reward function, it learns much
faster how to optimize it. Similarly, you can sample random NNs to execute for
entire trajectories for 'deep exploration', providing demonstrations of
potentially long-range strategies much more efficiently than simple random-
action strategies. In reservoir computing, as I understand it, you don't even
bother training the NN: you just randomly initialize it and train a simple
model on its outputs, assuming that _some_ of the random highly nonlinear
relationships encoded in the NN will turn out to be useful, which sounds
crazy but apparently works. Makes one think about Tegmark's interpretations.
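
For that last point, here's a minimal sketch of the "fixed random net plus
trained readout" idea (closer to an extreme learning machine than a proper
recurrent reservoir; the toy task and sizes are invented):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy regression target that a plain linear model can't fit: y = sin(3x).
    X = rng.uniform(-1, 1, size=(200, 1))
    y = np.sin(3 * X).ravel()

    # Random projection + nonlinearity; these weights are never trained.
    n_hidden = 500
    W = rng.normal(scale=2.0, size=(1, n_hidden))
    b = rng.normal(scale=1.0, size=n_hidden)
    H = np.tanh(X @ W + b)

    # Only a linear readout is fit on the random features (ridge regression).
    ridge = 1e-3
    readout = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ y)

    print("train MSE:", np.mean((H @ readout - y) ** 2))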

------
naturalgradient
This is a weirdly shallow article that uses lots of diagrams and bullet
points just to summarize the well-known points that RL needs a lot of data
and has to learn from scratch.

No mention of all the ongoing work on learning from demonstrations, or more
generally on incorporating any off-policy knowledge. Vague speculations about
the philosophy of model-free learning. Not really worth the read (as someone
working in RL).

~~~
andreyk
All that stuff is in part two! [https://thegradient.pub/how-to-fix-rl/](https://thegradient.pub/how-to-fix-rl/)

It says as much at the end... and to be fair, we did warn up front: "The first
part, which you're reading right now, will set up what RL is and why it is
fundamentally flawed. It will contain some explanation that can be skipped by
AI practitioners." But personally I think the board game allegory is fun, and
that most people tend to forget the categorical simplicity of Go and Atari
games and overhype the results; it's easy to say the main points are not new,
but the details are important here.

~~~
ddoolin
In your opinion, is this a solution to the "AI winter" that is often talked
about? I'm an engineer but not involved in AI but things like meta-
reinforcement seem, from the info/perspective you've given, to address the
problem, at least partially.

~~~
andreyk
I think AI winter is unlikely to come about this time since non-RL stuff
(supervised learning) has been so successful and useful.

~~~
Iv
Yes, some techs are overhyped (chatbots, finance stuff), but deep learning has
delivered a lot of incredible working applications. It is not just hot air or
marketing hype.

~~~
backpropaganda
Expert systems were not just hot air or marketing hype either. The usefulness
of a subset of the new AI technology is irrelevant. A winter or contraction is
caused by expectations not being met, and it seems, at least to me, that
investors/funders have already started expecting superhuman performance in
image/speech recognition, and there's a lot of expectation even in robotics,
which will probably not be met by actual results any time soon.

------
philipkglass
I appreciated the point that pure RL is insufficient for many tasks. But why
also downplay the achievements of pure RL where it has matched or surpassed
skilled humans?

Captioned chart:

 _The progression of AlphaGo Zero's skill. Note that it takes a whole day and
thousands of lifetimes' worth of games to get to an ELO score of 0 (which even
the weakest human can achieve easily)._

I'm pretty sure that a one-week-old infant's Elo rating will also fall short
of 0. Sure, the AI did things that no human could do in order to match and then
surpass human performance. Great! Half of the fun of following AI research is
seeing it refute old intuitions about how human-like systems have to be to
perform well on tasks previously considered to require human intelligence.

Whatever "general intelligence" or "human level intelligence" comes to mean by
the 2050s, it looks like it's going to be a lot better pruned-by-
counterexample than it was in the 1950s.

~~~
dibstern
While I appreciate the sentiment, I think the fact that we can learn from
fewer examples demonstrates that the machine's learning process isn't as
efficient as ours, and therefore isn't yet optimal. It seems like a goal
should be for learning to be at least as efficient for computers as it is for
humans.

~~~
mockingbirdy
We have about 86 billion neurons in our brains that all crave to get used, so
you can even imagine them as individual agents trying to get along.

It's like 86 billion guys trying to please that thing they simultaneously
produce (our consciousness). What I want to say is: the algorithm can be dumb
as f*. I call it the f-star algorithm. But the computational power in our
brains is extremely high.

~~~
dibstern
I don’t think throwing more computational power at the problem is the right
answer to all ML problems.

~~~
mockingbirdy
Those neurons are arranged and incentivized cleverly. The structure is also
very important and is necessary for the resulting intelligence.

So it's not only computational power, but also the unique structure nature
found through trial and error.

~~~
monetus
I wonder how much of that cleverness will be gleaned and appropriated by those
who design 3-dimensional chips.

------
sdhgaiojfsa
Didn't someone just recently post a DQN solution for Montezuma's Revenge (the
game that according to this article they cannot solve)?

> "Though DQN is great at games like Breakout, it is still not able to tackle
> relatively simple games like Montezuma's Revenge"

Yep:

[https://www.engadget.com/2016/06/09/google-deepmind-ai-montezumas-revenge/](https://www.engadget.com/2016/06/09/google-deepmind-ai-montezumas-revenge/)

[https://blog.openai.com/learning-montezumas-revenge-from-a-single-demonstration/](https://blog.openai.com/learning-montezumas-revenge-from-a-single-demonstration/)

It's far too early in this research to say exactly what can and can't be
solved by RL.

~~~
fnbr
The OpenAI solution uses demonstrations though, which is the article's point:
bare DQN can't solve these games, and something like demonstrations is
needed.

~~~
backpropaganda
It also assumes access to the simulator, which is an even more problematic
assumption. That's like saying you're doing image classification but assuming
access to the 3D model which generated the image.

~~~
radarsat1
I think that analogy is a bit bogus, but if you want to make it, it's more
like assuming access to a function that renders the 3D model from a variety of
perspectives on command, not having access to the model itself.

(Because the RL algorithm doesn't have access to the rules by which the
simulation is carried out, it only has access to the commands and the result.)

And frankly, that would be a perfectly fair and interesting classification
problem, so I don't see your point.

Otherwise, how exactly do you propose learning to drive a simulation without
access to the simulation? I really don't know what you're saying here.

~~~
backpropaganda
My point is that the two problems are quite distinct. This is not a small
change to how the problem is being solved, but a complete change of the
problem itself. Further, the change significantly limits the feasibility of
the solution, which is not made sufficiently clear by the authors of the blog
post. Casual followers of AI/RL research might think this is significant
progress, while in fact it's progress on a problem that hasn't really received
any attention, due to its uselessness. I think there may be 1-2 papers with
experiments on this problem, while there are probably hundreds on the
model-free problem.

Thanks for your analogy, though. I agree that it's better than mine. I was
only trying to give a rough idea, but I'll use your analogy if I need one from
now on. :)

------
bcheung
I just skimmed the article, but it doesn't seem like there is any talk of more
modern approaches.

Newer approaches have the agent learn "primitives" through curiosity: its
sub-goal is to predict future states given the current state plus an action.

By doing this, the problem becomes more hierarchical and the search space is
reduced. This makes it feasible for more complex scenarios.

I haven't personally heard of a lot of research on this part, but I imagine
that transfer learning becomes more feasible as well once some "primitives"
are established.
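
Roughly, the curiosity objective described above can be sketched like this (a
toy, hypothetical version; real implementations such as ICM use a learned
feature encoder and neural nets rather than a linear model):

    import numpy as np

    rng = np.random.default_rng(0)
    state_dim, action_dim = 4, 2

    # Forward model: predict next_state from [state, action].
    W = rng.normal(scale=0.1, size=(state_dim + action_dim, state_dim))
    lr = 1e-2

    def curiosity_bonus(state, action, next_state):
        # Prediction error of the forward model is the intrinsic reward; one SGD
        # step then makes the model less "surprised" by this transition next time.
        global W
        x = np.concatenate([state, action])
        error = next_state - x @ W
        W += lr * np.outer(x, error)       # gradient step on the squared error
        return float(np.mean(error ** 2))  # high where the agent can't predict yet

    # Inside the RL loop: total_reward = extrinsic_reward + beta * curiosity_bonus(...)
    s, a, s2 = rng.normal(size=state_dim), rng.normal(size=action_dim), rng.normal(size=state_dim)
    print(curiosity_bonus(s, a, s2))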

~~~
ddoolin
Have a read of part 2, where the author discusses approaches to solving the
problem described in part 1: [https://thegradient.pub/how-to-fix-rl/](https://thegradient.pub/how-to-fix-rl/)

It seems like there's a bit of research in this area but it's not receiving
the attention it may deserve. At least, that's how I interpreted the author's
tone.

~~~
andreyk
Yep, thanks, the idea was to highlight all the research going on and argue it
deserves more attention.

------
dj-wonk
The article's use of the word "flaw" is overstated.

For background, here are some selected quotes from the article:

> "The first part, which you're reading right now, will set up what RL is and
> why it is fundamentally flawed." > "In the typical model of RL, the agent
> begins only with knowledge of which actions are possible; it knows nothing
> else about the world, and it's expected to learn the skill solely by
> interacting with the environment and receiving rewards after every action it
> takes." > "how reasonable is it to design AI models based on pure RL if pure
> RL makes so little intuitive sense?"

To summarize, the article claims that this particular aspect of RL is a
"flaw".

I'd suggest it is more useful to call it a _design choice_. In many cases,
this design choice has beneficial properties.

Of course, there are other ways to build learning agents. The field of RL is
certainly open to alternatives, including hybrid models and/or relaxing this
particular assumption.

I've seen a good number of (popular) articles about RL making rather broad
claims, like this article. It appears to me that many of these articles
attempt to 'reduce' RL to a smaller/narrower version of itself in order to
make their claims. I hope more people start to see that RL is a set of
techniques (not a monolith) that can be mixed and matched in many ways for
particular applications.

~~~
andreyk
To be fair, in the article itself we wind up criticizing "pure RL" (defined as
the basic formulation that is typically followed, in which all learning is
done from just the reward signal) and not RL as a whole. We call out a lot of
awesome non-pure-RL work in the second part and suggest it deserves more
attention and excitement than, e.g., AlphaGo.

~~~
dj-wonk
Fair enough. Your article makes a lot of good points, for sure.

Here is a quote from the article I want to mention: “Trying to learn the board
game 'from scratch' without explanation was absurd, right?”

No. It is hardly absurd. Sometimes it works, sometimes not. It is a great
starting point, if nothing else. So, I wonder if we have different ideas of
what ‘absurd’ means.

I agree that we’re in a period of hype. It requires careful work to write
clearly without too much zeal or oversimplification. My opinion here is that
your attempt to ‘balance’ the debate uses a lot of language that I (and
others) perceive as exaggerated.

------
dgant
Enjoyed the article. But I'm a bit confused by the Venn diagram; neither
StarCraft nor DOTA are deterministic or fully observable. And they are
discrete only at such extreme resolutions that they may as well be continuous.

~~~
batmansmk
SC2 is deterministic. A replay file is simply a record of all actions
performed. There is no randomness in the action resolutions, which is why the
actions are the only thing that needs to be streamed over to sync game states.

~~~
alcinos
Well, it's deterministic once you know the random seed, which is stored in the
replay file. An agent doesn't know the seed, hence it cannot predict the exact
outcome of its actions, only a probability distribution over the outcomes. So,
from the agent's perspective, the game is indeed random.

------
leecho0
The article claims that RL is simplistic because it uses an unreasonable
amount of data. However, recent advances are significant _because_ they use
unreasonable amounts of data. As an example, I don't expect to be as good as
Michael Jordan no matter how much I play basketball, or to beat Garry Kasparov
no matter how much I play chess. There's a fundamental flaw in my learning
algorithm that prevents me from becoming good at something even if I have
infinite experience.

Recent RL research on policy gradients / on-policy vs off-policy / function
approximation / model-based vs model-free is all research about how to get
good at something with a lot of practice. RL has been around for a long time,
and discussions about higher-level learning / planning have been had over and
over. One doesn't discount the other. One deals with how to structure the
learning problem so that you can continue to get better with more experience
(the RL problem), while the other is about how to use higher-level learning to
speed it up.

------
CritclyOptimstc
I think the author of this article has fundamentally missed the mark. He talks
about humans as if they come out of the womb being able to play Chess. On the
contrary, we try and fail to make even simple sounds, and later, words,
phrases, crawling, walking, etc.

------
jorgemf
We do have imitation learning. I think the article is missing some important
parts of RL. One way to train a network is to use experience from others, or
even past experience of the agent itself; but why is it interesting to do it
from scratch? Because doing it from scratch allows us to tackle many more
problems where we don't have any prior information or skills, because we avoid
the bias in the data, and because we can discover new things (as happened with
AlphaGo Zero and its 'tactics' in Go). Now we have a method that can be
applied to any problem that is similar to a board game, without any other
information, just the rules of the domain.

~~~
andreyk
All that stuff is in part two! [https://thegradient.pub/how-to-fix-rl/](https://thegradient.pub/how-to-fix-rl/)

"In part two, we’ll overview the different approaches within AI that can
address those limitations (chiefly, meta-learning and zero-shot learning). And
finally, we'll get to a survey of monumentally exciting work based on these
approaches, and conclude with what that work implies for the future of RL and
AI as a whole. "

------
pnathan
I'd be curious to know how far you can go with marrying GOFAI systems for
high-level planning work and RL systems for pattern recognition and tactical
actions.

