
Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model - metasj
https://arxiv.org/abs/1911.08265
======
metasj
Beating AlphaZero at Go, chess, and shogi, and mastering a suite of Atari video
games that other AIs have failed to handle efficiently. No explicit heads-up
contests with a trained AlphaZero, but apparently it hits an Elo threshold with
fewer training cycles. Yowsa.

~~~
EsssM7QVMehFPAs
And specifically, it enables the application of reinforcement-learning
tree-search agents to domains with black-box environments. This paper is not
only about improved convergence performance; it is about enabling agents in
real-world scenarios where explicitly defining the environment is not
feasible, perhaps because it is arbitrarily complex. I do not believe in AGI
yet, but this is one of the bigger steps in the right direction, one would
assume.

------
Tomminn
It doesn't seem possible to give "no explicit rules", unless you count "making
an illegal move equivalent to a loss" as giving no explicit rules. Which
doesn't seem like anything but word laundering.

If you do any less than this, the net will be incentivized to make an illegal
move for a win. In which case, yeah, I'd guess that net would win a lot of
chess games against other rule-bound nets.

~~~
gwern
> It doesn't seem possible to give "no explicit rules", unless you count
> "making an illegal move equivalent to a loss" as giving no explicit rules.
> Which doesn't seem like anything but word laundering.

The model is learning from game trajectories offline. Illegal moves will be
assigned a very low probability because they do not appear in real games
(whoever may be playing those games, whether in a simulation or in the real
world). AlphaGo/AlphaZero did in fact mask out illegal moves in the MCTS tree
search, but MuZero does not:

> MuZero only masks legal actions at the root of the search tree where the
> environment can be queried, but does not perform any masking within the
> search tree. This is possible because the network rapidly learns not to
> predict actions that never occur in the trajectories it is trained on.
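
For intuition, here's a minimal sketch of what that root-only masking could
look like (illustrative names, not the paper's pseudocode):

```python
import numpy as np

def masked_root_priors(policy_logits, legal_actions):
    """Softmax the policy logits, then zero out illegal actions and
    renormalize. This is only done at the root, where the real environment
    can be queried for legality; deeper in the search tree the learned
    policy is trusted not to propose moves that never occur in its
    training trajectories."""
    priors = np.exp(policy_logits - policy_logits.max())  # stable softmax
    mask = np.zeros_like(priors)
    mask[list(legal_actions)] = 1.0
    priors = priors * mask
    return priors / priors.sum()

# e.g. 4 actions, only 0 and 2 legal at the root:
print(masked_root_priors(np.array([1.0, 2.0, 0.5, -1.0]), [0, 2]))
```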

And in ALE, all inputs are always legal; it's just that they may be useless
and a waste of a 100ms turn. So for ALE it doesn't matter.

Now, for Go/chess/shogi, they generate the training data for the supervised
learning part by reusing MuZero for MCTS self-play. They don't mention how
legal moves are handled there; they might be masking them out as in AlphaZero.
You could argue that this is, in some indirect way, 'explicit rules'. But
since MuZero is already learning to predict moves' values and which moves
actually get taken, I see no reason the MCTS self-play couldn't implement an
instant-loss rule instead, without any problem or slowing training down all
that much, removing even that objection.
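
A sketch of such an instant-loss rule, assuming a hypothetical `env` object
with `legal_actions()` and `step()` methods (none of this is MuZero's actual
API):

```python
def step_with_instant_loss(env, action):
    """If the chosen action is illegal, end the game immediately with a
    loss for the mover instead of masking; otherwise step as usual.
    Assumes env.step(action) returns (observation, reward, done)."""
    if action not in env.legal_actions():
        return None, -1.0, True  # illegal move: instant loss, game over
    return env.step(action)
```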

~~~
sanxiyn
They do mention "MuZero only masks legal actions at the root of the search
tree... does not perform any masking within the search tree". So it's not an
instant-loss rule.

~~~
gwern
Searching within its internal game tree is different from what it actually
does in the game-trajectory-generation phase. It could be entirely true that
the search does nothing at all to handle illegal moves, relying on the RNN
being good enough to predict that illegal moves are never taken, while the
actual game-playing masks them out, applies an instant loss, or does something
else entirely.

------
2bitencryption
Every time a DeepMind paper is published, I feel simultaneously very excited
and very depressed: they keep speeding ahead while I'm not anywhere close to
understanding their first paper about AlphaGo.

~~~
sanxiyn
I recommend skipping the first AlphaGo paper, because it is completely
superseded by AlphaZero and actually harder to understand.

~~~
cbames89
I completely agree; the AlphaGo paper reads more like the working notes of a
team that has just accomplished something. The AlphaZero paper is a more
careful analysis of the theoretical underpinnings and paints a clearer
picture.

------
davidfoster
Just released - walkthrough of the MuZero pseudocode:
[https://link.medium.com/KB3f4RAu51](https://link.medium.com/KB3f4RAu51)

------
SmooL
Very impressive - key quote: "In addition, MuZero is designed to operate in
the general reinforcement learning setting: single-agent domains with
discounted intermediate rewards of arbitrary magnitude. In contrast, AlphaGo
Zero and AlphaZero were designed to operate in two-player games with
undiscounted terminal rewards of ±1."
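
As a toy illustration of that contrast (the 0.997 Atari discount matches the
paper; the reward values below are invented):

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Discounted n-step return: r_0 + g*r_1 + ... + g^n * v_n."""
    total, g = 0.0, 1.0
    for r in rewards:
        total += g * r
        g *= gamma
    return total + g * bootstrap_value

# Atari-style: intermediate rewards of arbitrary magnitude, discounted.
print(n_step_return([0.0, 10.0, 0.0, 200.0], bootstrap_value=35.0, gamma=0.997))

# Board-game-style: no intermediate rewards, undiscounted terminal +/-1 only.
print(n_step_return([0.0, 0.0, 0.0, 1.0], bootstrap_value=0.0, gamma=1.0))
```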

It's worth noting that, as impressive as MuZero's performance is across the
many Atari games, it achieves a score of 0.0 on Montezuma's Revenge.

------
Animats
What sort of resources did this take? Racks of GPUs, or a laptop, or what?

~~~
Oreb
According to the paper:

> All experiments were run using third generation Google Cloud TPUs [12]. For
> each board game, we used 16 TPUs for training and 1000 TPUs for selfplay.
> For each game in Atari, we used 8 TPUs for training and 32 TPUs for
> selfplay. The much smaller proportion of TPUs used for selfplay in Atari is
> due to the smaller number of simulations per move (50 instead of 800) and
> the smaller size of the dynamics function compared to the representation
> function.

------
slilo
Are there any game records (kifu) of the MuZero vs. AlphaZero games?

~~~
vintermann
Nope, and it's a bit suspicious. Still, I'm sure many will reimplement it.
Unless they've left out some important detail, I think we'll get strong
players using this approach in a couple of months.

------
ArtWomb
So if you ever find yourself at the mercy of a Superintelligence, simply
challenge it to a round of Solaris for the Atari 2600 ;)

I still don't understand: how is the "prediction function" generating frames?

The last line of the paper seems to suggest MuZero is generalizable to other
domains.

But the appendix states "the network rapidly learns not to predict actions
that never occur in the trajectories it is trained on"

Consider the problem of predicting the next N frames of video from a
one-minute YouTube sample chosen at random, where there is a high probability
of some sort of scene transition in the interval. Short of training on a large
subset of the YouTube corpus, it's hard to see how that would work.

~~~
sanxiyn
It doesn't generate frames at all! Actually, that is the main idea of this
paper. (Those aren't my words; the paper itself calls it "the main idea".) To
quote:

> The main idea of the algorithm ... is to predict those aspects of the future
> that are directly relevant for planning. The model receives the observation
> ... as an input and transforms it into a hidden state... There is no direct
> constraint or requirement for the hidden state to capture all information
> necessary to reconstruct the original observation, drastically reducing the
> amount of information the model has to maintain and predict.
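
To make the shape of the idea concrete, here is a rough PyTorch sketch of the
three learned functions; layer sizes and structure are placeholders, not the
paper's architecture:

```python
import torch
import torch.nn as nn

class MuZeroStyleModel(nn.Module):
    """h maps an observation to a hidden state, g rolls the hidden state
    forward given an action, f predicts policy and value. Note that nothing
    ever decodes the hidden state back into frames."""

    def __init__(self, obs_dim, num_actions, hidden=64):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())  # representation
        self.g = nn.Sequential(nn.Linear(hidden + num_actions, hidden),
                               nn.ReLU())                              # dynamics
        self.reward = nn.Linear(hidden, 1)
        self.f_policy = nn.Linear(hidden, num_actions)                 # prediction
        self.f_value = nn.Linear(hidden, 1)

    def initial_inference(self, obs):
        s = self.h(obs)
        return self.f_policy(s), self.f_value(s), s

    def recurrent_inference(self, s, action_onehot):
        s = self.g(torch.cat([s, action_onehot], dim=-1))
        return self.f_policy(s), self.f_value(s), self.reward(s), s
```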

~~~
vintermann
It is the main idea of the algorithm, but as the paper itself says, there has
been a trend towards this recently. I haven't read many of the papers they
cite, but I do remember one of them ("Learning to predict without looking
ahead: world models without forward prediction"). They had a neat site
(learningtopredict.github.io). They write:

"We speculate that the complexity of world models could be greatly decreased
if they could fully leverage this idea: that a complete model of the world is
actually unnecessary for most tasks - that by identifying the important part
of the world, policies could be trained significantly more quickly, or more
sample efficiently".

That is just what this paper does, as I understand it, by bringing it together
with the tree search from AlphaZero.

------
notelonmusk
Exciting research. Not my area of expertise. An opinion:

> without any knowledge of the game rules

I'd prefer, a thousand times over, an AI that can explain to me why an
opposite-colored-bishop ending is drawish or why knights are more valuable
near the center, and that can come up with those and more new
concepts/relationships/models on its own (regardless of whether we have given
it the game rules or not), over a black box that is excellent at beating you
at chess but that you can't understand or trust. Adversarial examples (false
positives) support this preference.

~~~
waste_monk
There are ways to reverse-engineer neural nets to try to understand their
reasoning, but AFAIK none of them are very good.

For example, I went to a SAS/STAT course a few years ago, and one of the
exercises was to train a simple neural network and then use its predictions to
generate a decision tree that would (with pretty good accuracy, around 95% or
so) explain the choices made by the neural network. I think the given scenario
was that the neural network was used to decide whom to send marketing emails
to, and the head of marketing wanted to know why it was choosing certain
customers.
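
A rough reconstruction of that exercise in scikit-learn (synthetic data stands
in for the marketing dataset; the ~95% figure here is the tree's agreement
with the net, not accuracy on the true labels):

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

# The opaque model: a small neural network.
net = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500,
                    random_state=0).fit(X, y)

# The surrogate: a shallow tree fit to the *network's* predictions, not
# the true labels, so it approximates the net's decision boundary.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, net.predict(X))

# Fidelity: how often the tree agrees with the network.
print(accuracy_score(net.predict(X), tree.predict(X)))
```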

The problem with this was that it didn't offer insight into why decisions were
made; it could only show which variable was being branched on at a given point
in the tree. Also, while it worked for this simple example, more complex NNs
generated huge decision trees that were not useful in practice.

------
mindgam3
> In this work we present the MuZero algorithm which, by combining a tree-
> based search with a learned model, achieves superhuman performance in a
> range of challenging and visually complex domains, without any knowledge of
> their underlying dynamics.

<rant> DeepMind "superhuman" hype machine strikes again.

I mean, it's cool that computers are getting even better at chess and all (and
other perfectly constrained game environments), but come on. "Superhuman"
chess performance hasn't been particularly interesting since Deep Blue vs
Kasparov in 1997.

The fact that the new algorithms have "no knowledge of underlying dynamics"
makes it sound like an entirely new approach, and on one level it is: ML vs.
non-statistical methods. But on a deeper level, it's the same shit.

Unless I'm grossly mistaken (someone please correct me if this is inaccurate),
the superhuman performance is only made possible by massive compute. In other
words, brute force.

But it uses fewer training cycles, you say! AlphaZero et al. mastered the game
in only 3 days! Etc., etc. This conveniently ignores the fact that those were
3 days of training on an array of TPUs that is way more powerful than the
supercomputers of old.

Don't get me wrong. These ML algorithms have value and can solve real
problems. I just really wish DeepMind's marketing department would stop
beating us over the head with all of this "superhuman" marketing.

For those just tuning in, this is the same company that got the term "digital
prodigy" onto the cover of Science [0]. Which is again a form of cheating,
because the whole prodigy aspect conveniently ignores the compute power
required to train AlphaZero. For the record, if you took A0 and ran it on
hardware from a few years ago, you would have a computer that achieves
superhuman performance only after a _very_ long time, which wouldn't be making
headlines.

</rant>

[0]
[https://science.sciencemag.org/content/362/6419](https://science.sciencemag.org/content/362/6419)

~~~
sanxiyn
Okay, I will bite. It uses fewer training cycles. Why is that not significant?
Both AlphaZero and MuZero are brute force, but MuZero is _less_ brute force
than AlphaZero, so it's heading in the right direction.

~~~
mindgam3
Sure, but both of them are _more_ brute force than Deep Blue (unless anybody
has a compelling counterargument), so we’re actually heading in the wrong
direction.

~~~
bonzini
AlphaZero evaluates far fewer positions per second than Deep Blue. It doesn't
even reach 100k nodes per second, if I remember correctly, while Deep Blue was
in the millions [1]. Therefore, even though the training is done by brute
force, the evaluation is way less "brute force" in AlphaZero than in Deep
Blue.

Not only can Stockfish on modern hardware evaluate fewer nodes than Deep Blue
and still beat it left and right; already in 1995, Fritz _on a Pentium_ was
able to beat Deep Thought II at the World Computer Chess Championship. Deep
Blue and its ancestors, with their custom hardware, were perhaps the "most
brute force" of all chess engines.

[1]
[https://www.stmintz.com/ccc/index.php?id=91692](https://www.stmintz.com/ccc/index.php?id=91692)

~~~
mindgam3
As you yourself point out, the brute force here refers to solving a problem by
throwing more hardware at it.

Number of nodes searched is not the key metric for gauging how “smart” the
algorithm is. You searched fewer nodes, but you only got there by doing way
more upfront processing.

~~~
lordgrenville
But that processing happens just once, and then you amortize it over the
software's lifetime. Play a million games and you probably come out ahead.
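
Back-of-envelope, with entirely invented numbers:

```python
# All figures are hypothetical placeholders, just to show the shape of the
# amortization argument: a one-time training cost spread over many games.
training_cost = 1_000_000.0   # one-time training spend ($)
cost_per_game = 0.01          # marginal inference cost per game ($)

for games in (1_000, 100_000, 1_000_000, 10_000_000):
    per_game = (training_cost + games * cost_per_game) / games
    print(f"{games:>10,} games -> ${per_game:,.2f} per game")
```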

