
Is AlphaZero really a breakthrough in AI? - borisjabes
https://medium.com/@josecamachocollados/is-alphazero-really-a-scientific-breakthrough-in-ai-bf66ae1c84f2
======
rstuart4133
The article glosses over why the 4 hours was possible.

Firstly, a major challenge in training an AI of this sort is getting enough
labelled data. They played 300,000 games, from memory. Under normal
circumstances, that requires access to 300,000 games played by experts so the
AI can learn to copy what the expert does. That is how AlphaGo did it.

AlphaZero neatly sidesteps this by generating its own training data through
self-play. If how to do this were "obvious", it would have been done a long
time ago.

Secondly, they parallelised things. AlphaGo trained the AI from the results
of each game as it was played. AlphaZero played 1,250 games simultaneously and
fed the results into the AI as they became available. The result is that it
took well over an order of magnitude less elapsed time to train AlphaZero than
AlphaGo, even though the CPU cycles consumed may have been roughly similar.

Finally, he overstates how hard it is to customise the engine (MCTS
algorithm + AI) to a game. There are two pointers to this. Firstly, it took
them over two years to create AlphaGo, which became the world champion on 23
May. Now, seven months later, we have AlphaZero. And AlphaZero didn't learn to
play just one game in those seven months: it is the best player on the planet
at three games: Go, Chess, and Shogi.

I don't know whether they customised the AI for each game, but I suspect that
if the input and output layers were wide enough to accommodate the largest
game, they could use the same one for each. The MCTS engine does have to know
how to make all legal moves from any game position, but coding that isn't
rocket science or particularly time-consuming. The AI does _not_ start out
knowing those rules - it learns them from the search engine. It's all very
DRY.
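The self-play data generation described here can be sketched in a few lines.
This is a toy illustration (the game, a one-pile Nim variant, and all names
are mine; the real system uses neural networks and tree search rather than
the random policy stubbed in here): a rules engine supplies the legal moves,
games are played to the end, and each visited state is labelled with the
eventual outcome - no expert games needed.

```python
import random

# Toy "rules engine": one-pile Nim, where a move removes 1 or 2 stones
# and the player who takes the last stone wins. It stands in for the
# game engine that knows the rules but nothing about strategy.
def legal_moves(stones):
    return [m for m in (1, 2) if m <= stones]

def self_play_game(stones=7, rng=random):
    """Play one game with a random policy; return the (state, player)
    history and the winning player. A real system replaces the random
    choice with a network-guided tree search."""
    history = []  # (stones_remaining, player_to_move)
    player = 0
    while stones > 0:
        history.append((stones, player))
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            return history, player  # mover took the last stone and wins
        player = 1 - player

def generate_training_data(n_games=1000, seed=0):
    """Label every visited state with the game's final outcome, from the
    perspective of the player to move: +1 if they went on to win."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_games):
        history, winner = self_play_game(rng=rng)
        for stones, player in history:
            data.append((stones, 1 if player == winner else -1))
    return data

data = generate_training_data()
```

The labelled pairs play the role of the expert games above: the network
trains on positions its own play produced, labelled by actual results.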

This sort of engine only works on a particular style of game - one where
there is only a smallish set of well-known moves at each step, and the playing
board is also smallish (19x19 in the case of Go, with three possible states
for each position: empty, black, white). Most board and card games fit this
description. AlphaZero can teach itself to play any of them to a standard
higher than any human can manage, and do it within a few hours, not the
decades it takes to create a human grandmaster. The net result is that Homo
sapiens' reign of supremacy at playing this style of game is now over. For
this entire niche our brains have been firmly relegated to second-class
intelligence.

I don't know whether you would call pulling this off a breakthrough, but I do
know the techniques they applied will be copied by everyone and their dog for
years, if not decades, to come.

~~~
soVeryTired
> AlphaZero neatly sidesteps this by generating its own training data through
> self-play. If how to do this were "obvious", it would have been done a long
> time ago.

Learning by self-play is nearly as old as AI itself. TD-Gammon, one of the
very first algorithms to reach superhuman level in a nontrivial game, learned
by self-play. The basic ingredients of AlphaGo Zero - Monte Carlo tree search
and the use of a convnet to evaluate board positions - were already known.

The major contribution of AlphaGo Zero and AlphaZero was, IMO, the
realisation that MCTS acts as a "policy improvement operator", and that
reinforcement learning becomes far more stable when it's used in conjunction
with MCTS.

It's a major contribution and could represent a big shift in the field. But we
won't be able to judge _how big_ a contribution it is until the research is
more reproducible.

------
super_mario
I would like to see a rematch with Stockfish configured correctly. I give
Stockfish at least 1 GB of hash per thread; in the AlphaZero match they had 64
threads and only 1 GB of hash shared between all of them.

No one knows how Stockfish behaves with that many search threads, since no one
has tested it. I don't know if there is any data on how Stockfish scales with
the number of CPUs, but I seem to remember that being one of the weaknesses of
the engine, and that commercial engines like Komodo scaled better to larger
numbers of CPUs.

Anecdotally, on a 2.8 GHz 8-core 64-bit CPU with 8 threads and a 16 GB hash
size, it calculates about 7-8 million positions per second at the beginning of
the game, and much more later when there are fewer pieces on the board.
AlphaZero's Stockfish setup, with 64 threads on 32 physical CPU cores,
calculated about 70 million positions per second, with a tiny 1 GB hash (i.e.
the engine could remember less of what it had calculated previously).

But my Stockfish, on a computationally weaker setup, clearly flags some of the
moves in the match as mistakes by the 64-thread Stockfish. I would really
like to understand why. Is it that with more resources to see deeper you see
how hopeless the situation is, or is it something else?

I am sure we will see more of these matches. It would be nice if Google
volunteered some computing resources and entered TCEC regularly.

~~~
roganp
It seems to me that someone could prove that the configuration mattered by
pitting a correctly configured Stockfish against the configuration used in the
AlphaZero match, and seeing what the outcome is over 100 games. I haven't read
the paper, but some of the constraint choices seem odd to me (1 minute max per
move?).

~~~
super_mario
Unfortunately, AlphaZero is private, and no one has access to either the code
(it is a commercial product) or the resources it needs to run (TPUs). So only
Google can do this experiment.

~~~
thisacctforreal
OP is suggesting properly-configured Stockfish versus Stockfish as configured
in the AlphaZero match, for 100 games.

------
seanwilson
Can anyone summarise how self-play works here, given AlphaZero only starts out
being told the rules of the game? Does it initially play games using
completely random moves for both players? Is it only told who the winner is,
with no other feedback? How is it able to learn, e.g., that certain moves at
the start eventually lead to a win?

~~~
eutectic
There is a value network which estimates the win rate for the current player
from a given board state, and a policy network which estimates the probability
that each move should be played. As of the more recent iterations, these
networks share their bottom layers for greater computational and training
efficiency.

The value network is simply updated to match the real outcomes of games of
self-play.

The policy network is updated to match the results of a tree search; for each
board position many thousands of lines are explored using the value and policy
networks, and then the policy is updated to match (a somewhat 'sharpened'
version of) the number of lines in which each move was chosen.

When exploring each line, at each step the move 'a' is chosen from the
current board state 's' which maximizes Q(s, a) + P(s, a) / (1 + N(s, a)),
where P is the policy network, Q is the average value-network evaluation over
lines where 'a' was picked from 's' (with the appropriate signs to match the
current player at 's'), and N is the number of simulations in which 'a' was
picked. When we reach an unseen board state, it is evaluated with the value
and policy networks and a new line is explored from the root.

This is less circular than it may seem because:

a) The value network is trained using real outcomes.

b) Towards the end of the game, the tree search sees real outcomes.

This training procedure allows the network to 'bootstrap', learning
progressively more complex knowledge about how to play effectively.
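The selection rule and the 'sharpening' step can be sketched in a few lines
(function and variable names are mine; the published PUCT rule additionally
scales the prior term by an exploration constant and a factor involving the
parent's total visit count):

```python
def select_move(moves, Q, P, N):
    """Pick the move 'a' maximizing Q(s, a) + P(s, a) / (1 + N(s, a)).
    Q, P, N map each move to its mean value-network evaluation, its
    policy-network prior, and its visit count for the current state."""
    return max(moves, key=lambda a: Q.get(a, 0.0) + P[a] / (1 + N.get(a, 0)))

def policy_target(visit_counts, temperature=1.0):
    """'Sharpened' training target from MCTS visit counts:
    pi(a) is proportional to N(a) ** (1 / temperature)."""
    powered = {a: n ** (1.0 / temperature) for a, n in visit_counts.items()}
    total = sum(powered.values())
    return {a: p / total for a, p in powered.items()}
```

Lower temperatures concentrate the target on the most-visited moves, which
is the "sharpening" mentioned above.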

~~~
seanwilson
Thanks. How does this play out when you're training on a single self-played
game, then? Does it play a whole game with its current networks and then, once
it knows the winner, go over each move afterwards to train itself?

> and a policy network which estimates the probability that each move should
> be played.

So for this network, the input is the before-and-after board state, and the
output is the probability that this move should be played?

~~~
eutectic
The input is the current board-state and the output is the probability of each
move.

------
ivanhoe
I'm not sure how big a breakthrough it is in the AI domain, but after watching
some of the AlphaZero games against Stockfish (available on YouTube) I'm
convinced it's a true revolution in how computers play chess. It plays so much
more like humans do, without the usual crazy micromanagement that other chess
engines love. It's not boring to watch, and it makes very few of the weird
moves that only a computer would ever make. If I didn't know what it was, I
would presume it was a real player (and an extremely good one).

~~~
mannykannot
After some initial successes in computer chess, the defeat of grandmasters
was declared imminent, but human players struck back with strategies
that considerably delayed that outcome. I wonder if there is any possibility
that a good player could find a weakness in AlphaZero's game, but the way you
describe it makes that seem highly unlikely.

------
seanwilson
Given their track record, I don't think it's likely the DeepMind team is
trying to be sneaky here. They'd be found out eventually, given how big their
claim is.

When working in academia, I found it very common for research papers not to
come with source code or enough information to let you replicate the
experiments yourself; you usually have to pester the authors. So I don't find
the (valid) criticisms here that unusual. I'm not sure why they wouldn't
release the moves for all the test games played, though, seeing as that should
be simple to do.

As for Stockfish and AlphaZero running on different hardware: AlphaZero's
approach is built around taking full advantage of what TPUs can do quickly,
and Stockfish doesn't utilise TPUs, so how are you meant to make this fair?
Does Stockfish eventually level off when you throw enough hardware at it?
Doesn't DeepMind's claim that AlphaZero evaluates significantly fewer moves
per turn invalidate the criticism about the hardware used?

~~~
pdpi
More hardware still means better results — or there would be no need to cap
the time per move.

It's always been a trade-off between using more expensive heuristics on fewer
moves or cheaper heuristics on more moves. The "number of moves" claim only
says that they went all-in on the better-heuristics angle. In fact, there was
something posted recently about test games where they played Go by just using
the first move suggested by the search heuristic, and that was still pretty
strong.

~~~
seanwilson
> More hardware still means better results — or there would be no need to cap
> the time per move.

OK, but AlphaZero essentially runs on different hardware, so how are you
supposed to make the comparison fair? You could give Stockfish access to the
TPUs, but it wouldn't do anything with them.

------
zellyn
It seems weird for someone with experience in both chess and - especially -
AI to write: “This improvement on computing power paves the way for the
development of newer algorithms, and probably in a few years a game like chess
could be almost solved by heavily relying on brute force.”

It's like they don't understand the exponential nature of depth search in
chess…
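A back-of-envelope sketch of that exponential: assuming chess's commonly
cited average branching factor of about 35, each extra ply of full-width
search multiplies the tree by ~35, so extra hardware buys very little depth.

```python
# Back-of-envelope: size of a full-width chess search tree, assuming an
# average branching factor of ~35 (a commonly cited figure for chess).
BRANCHING = 35

def tree_size(depth):
    """Number of leaf positions in a full-width search of the given depth."""
    return BRANCHING ** depth

# 10 plies (5 moves by each side) already exceeds 2 quadrillion leaf
# positions; doubling compute buys well under one extra ply of depth.
deep = tree_size(10)
```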

------
zaf
"However, the experimental setting does not seem fair. The version of
Stockfish used was not the last one but, more importantly, it was run in its
released version run on a normal PC, while AlphaZero was ran using
considerable higher processing power. For example, in the TCEC competition
engines play against each other using the same processor."

That does sound fishy.

~~~
thom
It would be good to see a definitive playoff. I've no real doubt that
AlphaZero would triumph, but in people's breathless coverage of the games
nobody seems to point out Stockfish's various 5-10 pawn blunders (all of which
my version of Stockfish finds when annotating the games with 60 seconds per
move).

~~~
soVeryTired
What do you mean by "various 5-10 pawn blunders"? Is that a count of the
number of blunders or some sort of score?

~~~
thom
Stockfish and other mainstream engines judge a position in units of
centipawns - one hundredth of a pawn. Largely this will be to do with the
material on either side: for example, if you throw away a pawn with all else
being equal, I'd be up 100 centipawns. If I'm White this gets written as
"+1.00", and if I'm Black it's "-1.00". The score is also based on a strategic
evaluation of the position (and positions to come) using, in Stockfish's case,
hand-crafted heuristics. It's worth noting that it's this per-position
evaluation at which AlphaZero appears to be three orders of magnitude better
than Stockfish.
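That convention can be sketched as a tiny formatter (the helper name is
mine):

```python
def format_eval(centipawns):
    """Render an engine score given in centipawns using the convention
    above: positive favours White, shown in pawns to two decimals with
    an explicit sign."""
    pawns = centipawns / 100.0
    return f"{pawns:+.2f}"
```

So a score of 100 centipawns prints as "+1.00", and -100 as "-1.00".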

In some of the games vs AlphaZero, Stockfish makes errors that it _itself_
appears able to judge as huge blunders. In game 3 (which people view as a
masterpiece of long-term strategic thinking by AlphaZero) one of these is at
least a +10 swing to AlphaZero as White. That's about the same as throwing
away your queen. Without the weird time controls put in place, it seems
unlikely that we'd be seeing blunders like that. As I said before, I'd still
expect AlphaZero to win, and in many cases it was already ahead before these
mistakes, but it's worth mentioning in any analysis.

~~~
soVeryTired
Got you, thanks for the reply.

------
mannykannot
The author is actually claiming something more serious than the title
suggests: "...all the concerns added together cast reasonable doubts about the
current scientific validity of the main claims."

To me, what follows does not seem to justify this claim, but it is not my
field. In addition, some of his arguments seem beside the point - for example,
he asks "Does AlphaZero completely learn from self-play?", and while answering
generally yes, he objects that encoding the rules was a non-trivial matter.
While that certainly seems to be true, it does not have much bearing on the
claim that AlphaZero apparently learned to win through self-play. That, to me,
is its singular achievement: the absence of human-written tactics and strategy
(unless the encoding of the rules somehow prefigured them, which is not being
claimed here, and which seems highly unlikely).

~~~
YeGoblynQueenne
As many others have pointed out, self-play in adversarial AI goes back to
TD-Gammon, in 1992.

The fact alone that DeepMind is making such a big to-do about self-play is a
bit iffy in and of itself. It's probably a sign that they're more interested
in catching the attention of the popular press and the general public than of
anyone who has at least read through Russell and Norvig [i.e. a popular AI
textbook that mentions TD-Gammon in its Adversarial Search chapter].

In any case, that's a claim in the paper, and it's just as valid to scrutinise
it as any other claim - even more so if it's repeated in the lay press without
anyone bothering to do their homework...

Btw, if I may be a bit nosy- what is your field?

~~~
mannykannot
To a typical insurance salesman, the fact that self-play has been around for a
while raises the question: why not until now, for chess? Had self-play in
chess already come within a whisker of this result? Has the state of the art
reached the point where a bespoke self-play solution for any given chess/Go
-like game is now unremarkable? Is chess a sideshow that gets more attention
than it deserves from the press and public? Even if this outcome is completely
unremarkable from a technical point of view, the question remains, but as a
social one.

The author can certainly scrutinize whatever claims he likes, but does he make
his case? I suppose he can at least say that, without further information, the
outcome simply cannot be independently evaluated.

------
simonh
When can we have an AI that plays Third Reich? But not too well; I want to at
least have a chance.

I'm actually not joking. I wonder how different it would be to teach an AI
like this to play more complex games. I imagine Axis & Allies wouldn't take
much, but Third Reich is notoriously complicated. The quickest war-length game
I've played took a week at 3-4 hours per day, and games like that seem to me
much more similar to real-world problems, with multiple different sorts of
trade-offs that interlock with each other.

Are neural AIs like this actually feasible to train for problems like that, or
are other AI techniques better suited? What about games that combine multiple
game systems, like board games with a card-game element such as Settlers of
Catan? Would you need several different types of AI to optimize different
parts of the game?

~~~
disgruntledphd2
The hard part is probably writing a program to encode the rules of the game.
Then you can figure out how to represent the state (i.e. the board and the
players' cards/resources, etc.) as a matrix or array. I like how this is done
in NLP: they tend to represent each character/word/token as a one-hot vector
and then reduce the dimensionality of these (normally really sparse) inputs.
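A minimal sketch of that one-hot idea for a Go-like board with three cell
states (plain Python lists for self-containment; in practice you'd build a
numpy or tensor array):

```python
# One-hot encoding of a board state: each cell with k possible values
# becomes a length-k indicator vector, and the board flattens into one
# long, sparse feature vector.
STATES = ("empty", "black", "white")

def one_hot_cell(value):
    """Map a single cell state to its indicator vector over STATES."""
    return [1 if value == s else 0 for s in STATES]

def encode_board(board):
    """board: 2D list of cell states -> flat one-hot feature vector."""
    return [bit for row in board for cell in row for bit in one_hot_cell(cell)]

board = [["empty", "black"],
         ["white", "empty"]]
features = encode_board(board)  # length = 4 cells * 3 states = 12
```

Exactly one bit per cell is set, which is what makes these inputs so sparse
and worth compressing with a learned lower-dimensional layer.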

OpenAI's gym is probably a good place to start, as you can crib how they do it
for a whole bunch of games.
[https://github.com/openai/gym](https://github.com/openai/gym)

------
proc0
Engineering breakthrough?

------
jraut
An AI that excels at imperfect-information games (card games, StarCraft)
would be a real breakthrough. Raw calculation power is bound to win games with
a finite set of possibilities. The huge leap would be the ability to handle
probabilities: taking guesses, making assumptions, and coming to some kind of
successful conclusion based on those.

~~~
dingo_bat
Go cannot be won with raw power, though. It has a lot of the constraints of
incomplete-information problems.

~~~
jraut
So every move cannot be precalculated? Some of the constraints may be there.
The completeness of the information available is what would define a
breakthrough for me. In this case, the first state of the game is presented in
total, with 100% accuracy, as are all the steps from there onward. In my
opinion, the challenge for a breakthrough comes when most of the actions of
the other players happen in the dark and no feedback is presented to the AI.

In a lifelike situation the AI will not have access to the inner state of the
game, but instead has to gather the information via the same (restricted)
mechanisms as other players.

edit: I should probably clarify that the above is about competitive StarCraft.
I should probably learn to play Go, too.

------
erikb
I'm no AI expert, but I won't start to worry about AI being used generally
(the last point of the article) until it beats a really complex game like the
real-time strategy title StarCraft.

Even if we all agree that AlphaGo is the Deep Blue of Go, there are still a
few more layers to get through before humans need to worry.

~~~
Gys
> Churchill guesses it will be five years before a StarCraft bot can beat a
> human. He also notes that many experts predicted a similar timeframe for
> Go—right before AlphaGo burst onto the scene.

[https://www.wired.com/story/googles-ai-declares-galactic-war-on-starcraft-/](https://www.wired.com/story/googles-ai-declares-galactic-war-on-starcraft-/)

(Aug 2017)

------
banachtarski
Whether it was a breakthrough or not, I have to say, the moves it played were
certainly "creative" in a profound sense.

~~~
tmalsburg2
If possible at all, could you give an example for casual chess players?

~~~
UweSchmidt
Seconded. We all want a glimpse of what the "AI" is doing, and to see that one
step ahead that human minds could never see :)

~~~
khamisiyah
Not OP, but I highly recommend checking out ChessNetwork's recent videos on
the games:
[https://www.youtube.com/user/ChessNetwork/videos](https://www.youtube.com/user/ChessNetwork/videos)

------
zerostar07
probably a (big) incremental step over a previous breakthrough

------
a_imho
Considering how much AI is hyped, it is getting harder and harder not to be a
skeptic.

------
vadimberman
From what I recall, DeepMind never mastered Pac-Man.

------
gcatalfamo
No, it is an optimization of something already existing: an innovation, but
not a breakthrough per se.

Edit: this is an oversimplification

~~~
textor
It could perhaps be said, then, that the Alpha series as a whole is a rolling
breakthrough? Each new generation introduces changes that do not seem major
compared to the ideas at the foundation of deep learning, but this actually
allows it to surpass the state of the art, improving performance on all
relevant parameters (and not by a few percent). Dismissing these updates as
technicalities is a dubious position.

~~~
pavs
Isn't that how all newer versions of chess engines - including Stockfish -
improve?

Edit:
[https://github.com/glinscott/fishtest](https://github.com/glinscott/fishtest)
[http://tests.stockfishchess.org/tests](http://tests.stockfishchess.org/tests)

~~~
textor
Prior to AZ, some competent people doubted whether a self-taught NN could
compete at all with a mature chess engine at its strongest settings. You know
how that went. The improvement rate of the successive Alpha revisions is
incomparable to what software like chess engines shows. Hence the tendency to
evaluate each one as a breakthrough. Personally, I believe that
"breakthrough" is a non-technical word that's hard to define, but "an
improvement rate surpassing the most optimistic expectations for a given field
at a given time" could probably qualify as a criterion.

