
Learning Diverse Skills via Maximum Entropy Deep Reinforcement Learning - endswapper
http://bair.berkeley.edu/blog/2017/10/06/soft-q-learning
======
indescions_2017
DeepMind's _Combining Policy Gradient and Q-Learning_ provides good historical
background:

[https://arxiv.org/pdf/1611.01626.pdf](https://arxiv.org/pdf/1611.01626.pdf)

And the archived videos from Berkeley's Deep RL Bootcamp are a crash course
for getting up to speed:

[https://sites.google.com/view/deep-rl-bootcamp/lectures](https://sites.google.com/view/deep-rl-bootcamp/lectures)

But consider the multi-modal Q-distribution in Fig. 3. That graph, and the
variance of its peaks, is the key to understanding why training deep RL
algorithms becomes so intractable in stochastic, real-world settings.

Maximum-entropy estimation and approximate Bellman equations will do very well
in Atari games, for example, or in creating unbeatable MOBA competitors. But
for robotics, what is needed is a breakthrough that allows quick
generalization from sparse input data.

One-Shot Visual Imitation Learning via Meta-Learning

[https://sites.google.com/view/one-shot-imitation](https://sites.google.com/view/one-shot-imitation)

~~~
WhitneyLand
But what would it take to beat a human at Ms. Pac-Man?

When people hear that ML has felled the world's classic strategy games like
Chess and Go, leaving world champions helpless, the idea of humans beating
computers at any game seems quaint.

But forget world champions: the best systems cannot beat a _mediocre_ player
at Ms. Pac-Man. It's not because no one has tried; there have been
competitions and research papers.

Of course, the challenge seems pretty damn unimportant compared to
self-driving cars. But still, if it were trivial to do, I would have guessed
someone would have done it by now.

What other games exist where AI can't beat a human with less than 60 minutes
of lifetime experience?

~~~
tlarkworthy
Is that true about Ms. Pac-Man? Microsoft seems to have posted a near-perfect
score: [http://www.businessinsider.com/microsoft-ai-ms-pac-man-highest-all-time-score-2017-6](http://www.businessinsider.com/microsoft-ai-ms-pac-man-highest-all-time-score-2017-6)

~~~
WhitneyLand
Wow, that’s really new, within the last few months; thanks for the link. The
challenge sounds smaller now than it did: it had been worked on for years
prior, as I mentioned, with poor results.

Now on to StarCraft 2, where some major research labs are competing, including
DeepMind.

It looks like the best bots this year struggled to beat average players:
[https://www.wired.com/story/facebook-quietly-enters-starcraft-war-for-ai-bots-and-loses](https://www.wired.com/story/facebook-quietly-enters-starcraft-war-for-ai-bots-and-loses)

~~~
Tenoke
I am somewhat curious which game you will be latching on to when SC2 is also
thoroughly beaten in a year (or whenever).

~~~
WhitneyLand
“latching on to”? It’s not about trying to disrespect the research, or
downplay the capabilities of AI after the last decade’s amazing progress.

It’s about noting all the little milestones and unsolved problems along the
way, not just the headlines. The idea is to get better perspective on where
the state of the art actually is.

As for the next game, I don’t imagine it will be too many years before AI
plays 99% of games at least as well as the best humans. Why not throw out a
prediction? Less than five years?

------
VHRanger
I fail to see how this is different from keeping non-zero weights on
suboptimal actions for exploration.

~~~
clickok
In standard policy gradient, the weights for (apparently) suboptimal actions
are free to trend downwards to arbitrarily small values. If you're using a
Gibbs distribution[0] the probability of executing alternative actions is
nonzero but potentially very small.
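
To make that concrete, a quick sketch with made-up weights (the numbers here
are purely illustrative, not from the post):

    import numpy as np

    weights = np.array([5.0, 0.0, -5.0])         # illustrative action weights
    p = np.exp(weights) / np.exp(weights).sum()  # Gibbs/softmax distribution
    print(p)  # ~[9.93e-01 6.69e-03 4.51e-05]: never exactly zero, but tiny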

With entropy methods, you add an artificial bonus for policies that execute
more varied actions[1]. So while there's a tendency towards improving the
policy by executing apparently favorable actions, it is counterbalanced by a
term incentivizing different choices.
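
Here's a minimal sketch of that trade-off as a REINFORCE-style update (pure
numpy; `beta` is just my name for the entropy coefficient, not anything
official):

    import numpy as np

    def softmax(w):
        z = np.exp(w - w.max())   # shift for numerical stability
        return z / z.sum()

    def grad_log_pi(w, a):
        # Gradient of log pi(a) for a softmax policy: one-hot(a) - pi
        g = -softmax(w)
        g[a] += 1.0
        return g

    def grad_entropy(w):
        # Gradient of H(pi) = -sum_a pi(a) log pi(a) w.r.t. the weights
        p = softmax(w)
        log_p = np.log(p)
        return -p * (log_p - p @ log_p)

    def update(w, a, ret, lr=0.1, beta=0.01):
        # Return-weighted log-prob gradient, plus an entropy bonus that
        # pushes back against collapsing onto a single action.
        return w + lr * (ret * grad_log_pi(w, a) + beta * grad_entropy(w))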

Why not just optimize using standard gradient descent? Typically, the variance
is high and the algorithms optimize very greedily, so you end up in situations
where a few bad trajectories cause certain actions to be avoided. Or the agent
might need to refine its policy for the states that come afterwards, but
because that hasn't happened yet, the actions that lead to those states are
ignored[2].

So exploration needs to be incentivized, and while this can be done with
ε-greedy or optimistic initialization, those techniques are kinda ad hoc,
while entropy methods have a bit more mathematical/philosophical heft behind
them. Plus they seem to work, which is always nice.
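
For contrast, the kind of ad hoc baseline I mean:

    import numpy as np

    def epsilon_greedy(q_values, eps=0.1):
        # With probability eps explore uniformly at random,
        # otherwise take the greedy action.
        if np.random.rand() < eps:
            return np.random.randint(len(q_values))
        return int(np.argmax(q_values))

It works, but how much you explore is a hand-tuned constant rather than
something that falls out of the objective.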

\-----

0\. That is, if you'll excuse the notation: p(a) = e^(α_a) / Σ_b e^(α_b),
where p(a) is the probability of taking action a, α_a is the weight for
action a, and the sum in the denominator runs over the weights of all
possible actions b.

1\. For a discrete space (absent other information), the maximum entropy
distribution is the uniform distribution, i.e., all actions are equally
likely.

2\. For example, a human-like robot might try stepping forward, but hasn't
learned how to balance after it lifts its foot, so it stumbles and falls.
Crawling is less efficient, but provides constant progress and is easier to
learn. So the robot 'gets stuck' crawling since it had a bad experience
walking and doesn't attempt it again.

------
AlexCoventry
Could someone please explain the explanation of Eq. 18 in the appendix to the
paper? It seems key, but isn't clear to me.

> if one greedily maximize the sum of entropy and value with one-step
> look-ahead, then one obtains π̃ from π
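
The part I can reconstruct (writing H for entropy, and borrowing the
footnote-style notation from upthread): for a fixed state s, the distribution
π' maximizing E_{a~π'}[Q(s,a)] + H(π'(·|s)) is the softmax

    π̃(a|s) = e^(Q(s,a)) / Σ_b e^(Q(s,b)),

and the maximum value attained is log Σ_b e^(Q(s,b)), which I take to be the
soft value. But I'm not sure that's all the appendix is claiming.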

~~~
skndr
Hi Alex! I don't have the answer, but I just sent this over to Seth. Looking
forward to seeing you all again in December!

~~~
AlexCoventry
Hey, Skander! Hope SF's treating you well.

