Hacker News new | past | comments | ask | show | jobs | submit login
Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (arxiv.org)
161 points by metasj 14 days ago | hide | past | web | favorite | 37 comments



Beating AlphaZero at Go, Chess, and Shogi, an mastering a suite of Atari video games that other AIs have failed to do efficiently. No explicit heads-up contests with a trained AlphaZero; but apparently hits an ELO threshold w/ fewer training cycles. Yowsa.


And specifically, enabling the application of reinforcement tree searching agents into domains with blackbox environments. This paper is not only about increased convergence performance, it is about enabling agents in real world scenarios where the definition of environment is not feasible. Maybe of arbitrary complexity. I do not believe in AGI yet, but that's one of the bigger steps in the right direction, one would assume.


It doesn't seem possible to give "no explicit rules", unless you count "making an illegal move equivalent to a loss" as giving no explicit rules. Which doesn't seem like anything but word laundering.

If you do any less than this, the net will be incentivized to make an illegal move for a win. In which case, yah, I'd guess that net would win a lot of chess games against other rule-bound nets.


> It doesn't seem possible to give "no explicit rules", unless you count "making an illegal move equivalent to a loss" as giving no explicit rules. Which doesn't seem like anything but word laundering.

The model is learning from game trajectories offline. Illegal moves will be assigned a very low probability because they do not appear in real games (whoever may be playing those games, whether in a simulation or in the real world). AlphaGo/AlphaZero did in fact mask out illegal moves in the MCTS tree search, but MuZero does not:

> MuZero only masks legal actions at the root of the search tree where the environment can be queried, but does not perform any masking within the search tree.This is possible because the network rapidly learns not to predict actions that never occur in the trajectories it is trained on.

And in ALE, all inputs are always legal, it's just that they may be useless and a waste of a 100ms turn. So for ALE it doesn't matter.

Now, for Go/chess/shogi, they generate the training data for the supervised learning part by reusing MuZero for MCTS self-play. They don't mention how legal moves are handled there; they might be masking out like in AlphaZero. You could argue that this is, in some indirect way, 'explicit rules'. But since the MuZero is already learning to predict moves' value and which moves actually get taken, I see no reason that the MCTS self-play couldn't implement an instant-loss rule without any problem or slowing down training all that much, removing even that objection.


They do mention "MuZero only masks legal actions at the root of the search tree... does not perform any masking within the search tree". So it's not an instant-loss rule.


Searching within its internal game tree is different from what it actually does in the game-trajectory-generation phase. It could be entirely true that the search does not do anything at all to handle illegal moves, relying on the RNN being good enough to predict that illegal moves are never taken, and that the actual game-playing masks it out, or instant-loses, or something else entirely.


Illegal moves in Go do result in instant loss or are impossible. Examples

1. Taking a ko without playing elsewhere first: instant loss.

2. Playing on the top of another stone: in real world it's difficult to do because of the shape of the stones. They could make MuZero lose the game.

3. Playing when it's the opponent turn: instant loss. This is actually a way to resign: playing two stones together. Probably this is impossible to do for MuZero because the goal is to play one move. By the way, do those programs plan their next move even when the opponent is thinking, like humans do, or think only when it's their turn?


We changed the submitted title from "MuZero beats AlphaZero, with less training and no explicit rules: fully general" to that of the article.


Every time a Deepmind paper is published, I feel simultaneously very excited, and also very depressed that they keep speeding ahead while I'm not anywhere close to understanding their first paper about AlphaGo.


I recommend skipping the first AlphaGo paper, because it is completely superseded by AlphaZero and actually harder to understand.


I completely agree, the alphago paper is more like the working notes of a team that just accomplished something. The alphazero paper is a more careful analysis about the theoretical underpinnings and paints a clearer picture.


Just released - walkthrough of the MuZero pseudocode: https://link.medium.com/KB3f4RAu51

Very impressive - key quote: "In addition, MuZero is designed to operate in the general reinforcement learning setting: single-agent domains with discounted intermediate rewards of arbitrary magnitude. In contrast, AlphaGo Zero and AlphaZero were designed to operate in two-player games with undiscounted terminal rewards of ±1."

It's worth noting that, as impressive as MuZeros performance is in the many Atari games, it achieves a score of 0.0 in Montezuma's Revenge.


What sort of resources did this take? Racks of GPUs, or a laptop, or what?


According to the paper:

> All experiments were run using third generation Google Cloud TPUs [12]. For each board game, we used 16 TPUs for training and 1000 TPUs for selfplay. For each game in Atari, we used 8 TPUs for training and 32 TPUs for selfplay. The much smaller proportion of TPUs used for selfplay in Atari is due to the smaller number of simulations per move (50 instead of 800) and the smaller size of the dynamics function compared to the representation function.


Are there any game records (kifu) of the MuZero - AlphaZero games?


Nope, and it's a bit suspicious. Still, I'm sure many will reimplement it. Unless they've left out some important detail, I think we'll get strong players using this approach in a couple of months.


So if you ever find yourself at the mercy of a Superintelligence, simply challenge it to a round of Solaris for the Atari 2600 ;)

I still don't understand how the "prediction function" is generating frames?

From the last line of the paper it seems to suggest MuZero is generalizable to other domains.

But the appendix states "the network rapidly learns not to predict actions that never occur in the trajectories it is trained on"

Consider the problem of predicting the next N frames of video from a one minute youtabe sample chosen at random. Where there is a high probability of some sort of scene transition in the interval. Short of training on a large subset of the youtube corpus.


It doesn't generate frames at all! Actually that is the main idea of this paper. (It's not my words, the paper itself calls it "the main idea".) To quote:

> The main idea of the algorithm ... is to predict those aspects of the future that are directly relevant for planning. The model receives the observation ... as an input and transforms it into a hidden state... There is no direct constraint or requirement for the hidden state to capture all information necessary to reconstruct the original observation, drastically reducing the amount of information the model has to maintain and predict.


It is the main idea of the algorithm, but as the paper itself says, there has been a trend towards this recently. I haven't read many of the papers they cite, but I do remember one of them ("Learning to predict without looking ahead: world models without forward prediction"). They had a neat site (learningtopredict.github.io). They write:

"We speculate that the complexity of world models could be greatly decreased if they could fully leverage this idea: that a complete model of the world is actually unnecessary for most tasks - that by identifying the important part of the world, policies could be trained significantly more quickly, or more sample efficiently".

That is just what this paper does, as I understand it, by bringing it together with the tree search from AlphaZero.


Exciting research. Not my area of expertise. An opinion:

> without any knowledge of the game rules

I'd prefer 1000 times an AI that can explain to me why an opposite-colored bishop ending is drawish or why knights are more valuable near the center, and can come up with those and more new concepts/relationships/models on its own (regardless of whether we have given it the game rules or not), than a black box that is excellent at beating you at chess but you can't understand or trust. Adversarial examples (false positives) are supporters for this preference.


There are ways to reverse engineer neural nets to try and understand the reasoning but AFAIK none of them are very good.

For example, I went to a SAS/STAT course a few years ago and one of the exercises was to train a simple neural network and then use it as input to generate a decision tree, that would (with pretty good accuracy, around 95% or so) explain the choices made by the neural network. I think the given scenario was that the neural network was used to decide who to send marketing emails to, and the head of marketing wanted to know why it was choosing certain customers.

The problem with this was that it didn't offer insight into why decisions were made, it could only show what variable was being branched on at a given point in the tree. Also while it worked with this simple example, more complex NNs generated huge decision trees that were not usefull in practice.


Generally, it's a superhuman task. No one yet came up with an explanation of how to tell apart a picture of a kitten and a picture of a puppy that doesn't rely on our common low-level visual processing.

That is the explanation mechanism should take into account what kind of explanations humans consider understandable. For tasks that can be represented as mathematical structures (like two bishop ending), it's probably simple enough. For tasks that we don't really know how we do them (vision, hearing, locomotion planning), the explanation mechanism will have to learn somehow what we consider a good explanation and somehow translate internal workings of decision network and its own network (we'll want to know why it explains it like it does, right?) into them.


> In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics.

<rant> DeepMind "superhuman" hype machine strikes again.

I mean, it's cool that computers are getting even better at chess and all (and other perfectly constrained game environments), but come on. "Superhuman" chess performance hasn't been particularly interesting since Deep Blue vs Kasparov in 1997.

The fact that the new algorithms have "no knowledge of underlying dynamics" makes it sound like an entirely new approach, and on one level it is. ML vs non-statistical methods. But on a deeper level, it's the same shit.

Unless I'm grossly mistaken, (someone please correct me if this is inaccurate), the superhuman performance is only made possible by massive compute. In other words, brute force.

But it uses less training cycles, you say! AlphaZero et all mastered the game in only 3 days! etc etc. This conveniently ignores the fact that this was 3 days of training on an array of GPUs that is way more powerful than the supercomputers of old.

Don't get me wrong. These ML algorithms have value and can solve real problems. I just really wish DeepMind's marketing department would stop beating us over the head with all of this "superhuman" marketing.

For those just tuning in, this is the same company that got the term "digital prodigy" on the cover of Science [0]. Which is again a form of cheating, because the whole prodigy aspect conveniently ignores the compute power required to achieve AlphaZero. For the record, if you took A0 and ran it on hardware from a few years ago, you would have a computer that achieves superhuman performance after a very long time, which wouldn't be making headlines.

</rant>

0. https://science.sciencemag.org/content/362/6419


Okay, I will bite. It uses less training cycles. Why is that not significant? Both AlphaZero and MuZero are brute force, but MuZero is less brute force than AlphaZero, so it's heading in the right direction.


Sure, but both of them are more brute force than Deep Blue (unless anybody has a compelling counter argument), so we’re actually heading in the wrong direction.


AlphaZero evaluates comparably less positions per second than Deep Blue. It doesn't even reach 100k nodes per seconds of I remember correctly, while Deep Blue was in the millions[1]. Therefore, even though the training is done by brute force, the evaluation is way less "brute forcing" in AlphaZero than in Deep Blue.

Not just Stockfish on modern hardware can evaluate fewer nodes than Deep Blue and still beat it left and right; in 1995 Fritz on a Pentium was able to beat Deep Thought II at the world chess championship. Deep Blue and its ancestors, with their custom hardware, were perhaps the "most brute force" of all chess engines.

[1] https://www.stmintz.com/ccc/index.php?id=91692


As you yourself point out, the brute force here refers to solving a problem by throwing more hardware at it.

Number of nodes searched is not the key metric for gauging how “smart” the algorithm is. You have less nodes searched but you only got there by having way more upfront processing.


But that processing happens just once, and then you amortize it over the software's lifetime. Play a million games and you probably come out ahead.


And what is an estimation of the minimal hardware/time requirements for learning to beat humans at chess?

We need some baseline to call it "brute force".


Seems like ML/AI is still looking for its quantum supremacy moment.


It should be pointed out AlphaZero plays better than Deep Blue. I think comparing computational resource usage is important, but direct comparison only makes sense with equivalent performance level.


I did point this out at the top of my original comment.

“I mean, it's cool that computers are getting even better at chess and all“

> direct comparison only makes sense with equivalent performance level

This makes no sense to me. 50% increase in performance can be compared to 50% increase in processing power to evaluate level of brute force-ness.


> This makes no sense to me. 50% increase in performance can be compared to 50% increase in processing power to evaluate level of brute force-ness.

Computational complexity theory taught us that fundamental difficulty of solving specific types of problems does not always linearly scale with the size of the problems. I guess the same logic applies to the quality of the output?


Yes, but how is "50% better" measured? On what scale?


Easy. Chess uses the Elo rating system.

https://en.m.wikipedia.org/wiki/Elo_rating_system


I mean that ELO is not linear and there is no obvious ratio operation for ELO scores. How much is better than 2,600? Is it 5,200? Is it the rating at which you'd have 2:1 odds of winning a match? (Wikipedia says 200 points represent 74% chance of winning, so somewhere below 2,800) There's just no commonly agreed meaning of "50% better Chess player".



Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: