If you do any less than this, the net will be incentivized to make an illegal move for a win. In that case, yeah, I'd guess that net would win a lot of chess games against other rule-bound nets.
The model is learning from game trajectories offline. Illegal moves will be assigned a very low probability because they do not appear in real games (whoever may be playing those games, whether in a simulation or in the real world). AlphaGo/AlphaZero did in fact mask out illegal moves in the MCTS tree search, but MuZero does not:
> MuZero only masks legal actions at the root of the search tree where the environment can be queried, but does not perform any masking within the search tree. This is possible because the network rapidly learns not to predict actions that never occur in the trajectories it is trained on.
And in ALE, all inputs are always legal, it's just that they may be useless and a waste of a 100ms turn. So for ALE it doesn't matter.
Now, for Go/chess/shogi, they generate the training data for the supervised learning part by reusing MuZero for MCTS self-play. They don't mention how legal moves are handled there; they might be masking them out as in AlphaZero. You could argue that this is, in some indirect way, 'explicit rules'. But since MuZero is already learning to predict moves' values and which moves actually get taken, I see no reason the MCTS self-play couldn't implement an instant-loss rule without any problem, or slowing down training all that much, removing even that objection.
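To make the root-vs-tree distinction concrete, here's a minimal sketch of what root-only masking might look like. This is a hypothetical illustration, not DeepMind's code; `legal_actions` stands in for the environment query the paper mentions being available only at the root:

```python
import numpy as np

def masked_root_priors(policy_logits, legal_actions, num_actions):
    """Mask the policy to legal actions at the root only.

    policy_logits: raw network outputs, shape (num_actions,)
    legal_actions: indices the environment reports as legal (root only)
    """
    mask = np.full(num_actions, -np.inf)
    mask[legal_actions] = 0.0
    logits = policy_logits + mask                  # illegal moves -> -inf
    exp = np.exp(logits - logits[legal_actions].max())
    return exp / exp.sum()                         # softmax over legal moves

# Deeper in the search tree no mask is applied: the learned policy simply
# assigns near-zero probability to actions that never occur in training data.
```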
1. Taking a ko without playing elsewhere first: instant loss.
2. Playing on top of another stone: in the real world this is hard to do because of the shape of the stones. The designers could make this an instant loss for MuZero as well.
3. Playing when it's the opponent's turn: instant loss. This is actually a way to resign: playing two stones together. This is probably impossible for MuZero, since it is only ever asked to produce a single move. By the way, do those programs plan their next move even while the opponent is thinking, like humans do, or think only when it's their turn?
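The instant-loss idea from the list above is easy to sketch as an environment wrapper. This is a hypothetical illustration (the `ToyEnv`, `is_legal`, and `step` names are invented here, not from the paper): the agent needs no move masking at all, because trying an illegal move simply ends the game with a loss:

```python
class InstantLossWrapper:
    """Hypothetical wrapper: any illegal move ends the game as a loss,
    so self-play needs no legal-move masking anywhere."""

    def __init__(self, env):
        self.env = env  # assumed to expose step(action) and is_legal(action)

    def step(self, action):
        if not self.env.is_legal(action):
            # Terminal state, reward -1 for the player who moved.
            return None, -1.0, True
        return self.env.step(action)


class ToyEnv:
    """Toy two-action environment standing in for a real Go engine."""
    def is_legal(self, action):
        return action == 0
    def step(self, action):
        return "next_state", 0.0, False
```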
It's worth noting that, as impressive as MuZero's performance is across the many Atari games, it achieves a score of 0.0 in Montezuma's Revenge.
> All experiments were run using third generation Google Cloud TPUs. For each board game, we used 16 TPUs for training and 1000 TPUs for selfplay. For each game in Atari, we used 8 TPUs for training and 32 TPUs for selfplay. The much smaller proportion of TPUs used for selfplay in Atari is due to the smaller number of simulations per move (50 instead of 800) and the smaller size of the dynamics function compared to the representation function.
I still don't understand: how is the "prediction function" generating frames?
The last line of the paper seems to suggest MuZero is generalizable to other domains. But the appendix states that "the network rapidly learns not to predict actions that never occur in the trajectories it is trained on".
Consider the problem of predicting the next N frames of video from a one-minute YouTube sample chosen at random, where there is a high probability of some sort of scene transition in the interval. It's hard to see how that could work short of training on a large subset of the YouTube corpus.
> The main idea of the algorithm ... is to predict those aspects of the future that are directly relevant for planning. The model receives the observation ... as an input and transforms it into a hidden state... There is no direct constraint or requirement for the hidden state to capture all information necessary to reconstruct the original observation, drastically reducing the amount of information the model has to maintain and predict.
"We speculate that the complexity of world models could be greatly decreased if they could fully leverage this idea: that a complete model of the world is actually unnecessary for most tasks - that by identifying the important part of the world, policies could be trained significantly more quickly, or more sample efficiently".
That is just what this paper does, as I understand it, by bringing it together with the tree search from AlphaZero.
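The "predict only what matters for planning" idea can be made concrete with a toy numpy sketch of the paper's three learned functions: representation h, dynamics g, and prediction f. The weights here are random placeholders, not trained networks, and the sizes are made up; the point is only the structure, in which nothing ever reconstructs the observation:

```python
import numpy as np

rng = np.random.default_rng(0)
OBS, D, A = 16, 8, 4   # toy observation size, hidden size, action count

# Random stand-ins for MuZero's three learned functions.
W_h = rng.normal(size=(OBS, D))      # h: observation -> hidden state
W_g = rng.normal(size=(D + A, D))    # g: (state, action) -> next state
W_f = rng.normal(size=(D, A + 1))    # f: state -> (policy logits, value)

def represent(obs):          # h: never required to reconstruct obs
    return np.tanh(obs @ W_h)

def dynamics(state, action): # g: unrolled entirely in hidden space
    one_hot = np.eye(A)[action]
    return np.tanh(np.concatenate([state, one_hot]) @ W_g)

def predict(state):          # f: only planning-relevant outputs
    out = state @ W_f
    return out[:A], out[A]   # policy logits, scalar value

# Planning = unrolling g/f in latent space; no frames are ever generated.
s = represent(rng.normal(size=OBS))
for action in [1, 3, 0]:
    logits, value = predict(s)
    s = dynamics(s, action)
```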
> without any knowledge of the game rules
I'd prefer, a thousand times over, an AI that can explain to me why an opposite-colored bishop ending is drawish or why knights are more valuable near the center, and can come up with those and more new concepts/relationships/models on its own (regardless of whether we have given it the game rules or not), over a black box that is excellent at beating you at chess but that you can't understand or trust. Adversarial examples (false positives) support this preference.
For example, I went to a SAS/STAT course a few years ago, and one of the exercises was to train a simple neural network and then use its output to generate a decision tree that would (with pretty good accuracy, around 95% or so) explain the choices made by the neural network. I think the given scenario was that the neural network was used to decide who to send marketing emails to, and the head of marketing wanted to know why it was choosing certain customers.
The problem with this was that it didn't offer insight into why decisions were made; it could only show what variable was being branched on at a given point in the tree. Also, while it worked for this simple example, more complex NNs generated huge decision trees that were not useful in practice.
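For what it's worth, that surrogate-tree trick is easy to reproduce. Here's a minimal, hypothetical version with scikit-learn on synthetic data (not the SAS/STAT exercise itself): the tree is fit to the network's predictions rather than the true labels, and "fidelity" measures how often the tree agrees with the network it claims to explain:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)  # synthetic "send email?" label

nn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                   random_state=0).fit(X, y)

# Surrogate: fit a shallow tree to the NN's *predictions*, not the labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, nn.predict(X))

# Fidelity: how often the tree agrees with the NN it is explaining.
fidelity = (surrogate.predict(X) == nn.predict(X)).mean()
print(export_text(surrogate, feature_names=[f"x{i}" for i in range(5)]))
```

The depth limit is exactly the trade-off described above: a depth-3 tree is readable but only approximates the network, while a tree deep enough for near-perfect fidelity stops being an explanation anyone can read.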
That is, the explanation mechanism should take into account what kinds of explanations humans consider understandable. For tasks that can be represented as mathematical structures (like a two-bishop ending), it's probably simple enough. For tasks that we don't really know how we do ourselves (vision, hearing, locomotion planning), the explanation mechanism will have to somehow learn what we consider a good explanation, and somehow translate the internal workings of the decision network, and of its own network (we'll want to know why it explains things the way it does, right?), into those terms.
DeepMind "superhuman" hype machine strikes again.
I mean, it's cool that computers are getting even better at chess and all (and other perfectly constrained game environments), but come on. "Superhuman" chess performance hasn't been particularly interesting since Deep Blue vs Kasparov in 1997.
The fact that the new algorithms have "no knowledge of underlying dynamics" makes it sound like an entirely new approach, and on one level it is. ML vs non-statistical methods. But on a deeper level, it's the same shit.
Unless I'm grossly mistaken, (someone please correct me if this is inaccurate), the superhuman performance is only made possible by massive compute. In other words, brute force.
But it uses fewer training cycles, you say! AlphaZero et al. mastered the game in only 3 days! etc etc. This conveniently ignores the fact that this was 3 days of training on an array of TPUs that is way more powerful than the supercomputers of old.
Don't get me wrong. These ML algorithms have value and can solve real problems. I just really wish DeepMind's marketing department would stop beating us over the head with all of this "superhuman" marketing.
For those just tuning in, this is the same company that got the term "digital prodigy" on the cover of Science. Which is again a form of cheating, because the whole prodigy aspect conveniently ignores the compute power required to achieve AlphaZero. For the record, if you took A0 and ran it on hardware from a few years ago, you would have a computer that achieves superhuman performance after a very long time, which wouldn't be making headlines.
Not only can Stockfish on modern hardware evaluate fewer nodes than Deep Blue and still beat it left and right; already in 1995, Fritz running on a Pentium beat Deep Thought II at the World Computer Chess Championship. Deep Blue and its ancestors, with their custom hardware, were perhaps the "most brute force" of all chess engines.
Number of nodes searched is not the key metric for gauging how "smart" the algorithm is. You have fewer nodes searched, but you only got there by doing way more upfront processing.
We need some baseline to call it "brute force".
> I mean, it's cool that computers are getting even better at chess and all
> direct comparison only makes sense with equivalent performance level
This makes no sense to me. A 50% increase in performance can be compared against a 50% increase in processing power to gauge the degree of brute force.
Computational complexity theory taught us that the fundamental difficulty of solving specific types of problems does not always scale linearly with problem size. I guess the same logic applies to the quality of the output?