
AlphaGo Zero: Learning from scratch - stablemap
https://deepmind.com/blog/alphago-zero-learning-scratch/
======
aeleos
The fact that they only used self-play with no outside input here is really
interesting. I wonder if this system produced more new styles of play. While I
am not that familiar with Go, I know in some of the other articles they talk
about things like Chinese openings that are specific to certain cultures. I
wonder if the fact that it had no outside reinforcement made it produce moves
that we have already seen, ones that are somehow inherent to the game, or if
it produced many more new moves as a result of learning without any
possibility of cultural interference. According to the article it did invent
some unconventional and creative moves, but I also wonder how much it
rediscovered.

I also wonder how much its style of play would change if it were retrained,
due to the random start that it is given. Maybe that would produce something
like seeds for procedurally generated worlds in games. Like if they could find
a seed for Chinese or Japanese styles of play, or ones with more aggressive
styles. This is some pretty cool work and may open up even more doors for pure
reinforcement learning.

~~~
jasonwatkinspdx
> I wonder if this system produced more new styles of play.

One thing AlphaGo has told us clearly is that it thinks human players
overvalue the margin of victory vs the probability of victory.

~~~
edanm
I'm not 100% sure I agree. It values probability of victory because that's
_its_ goal. For humans, aiming _only_ for probability of victory might not be
as good, because we're much worse at estimating probabilities. So aiming to
maintain a large margin at all times is conceivably the best proxy that we
can use _in practice_.

~~~
pmontra
Agreed. I know I'm winning by 4 points but I have no idea about my probability
of winning. However if I'm winning I know that I should play low risk moves
and refrain from starting complicated fights. That increases the probability
of winning. IMHO the exact value is out of reach for human beings.

~~~
derefr
It would be interesting to play human go, assisted by a go computer that
doesn't say anything about moves, but rather just spits out, for each player,
their current likelihood of victory if all further moves by both players were
"what it would do."

That way, each player could know, at all times, (one major factor that goes
into) their probability of winning. They'd still have to mentally adjust it
for the likelihood of them and their opponent making an error, and how that
can be controlled by making intimidating moves, etc. But it could lead to much
tighter control on the abstract flow of the game.

It'd almost be like the computer was the general, issuing strategy, picking
battles; and the human player the tactician, fighting those battles.

~~~
pmontra
There is a computer Go program (maybe Crazy Stone?) that analyzes a game
record and annotates it with the winning percentage for every move.

Knowing that the opponent's winning probability changed from 52 to 57 was
interesting only because it hints at a mistake. In the case of such a large
change, the program suggests the move it would have played.

I saw an annotated game record and there were no variations: I remember a
suggested move that made me wonder "why!?".

Another benefit of seeing the value of the winning probability is an
assessment of who's ahead. However, that's already possible with the score
estimation that programs and Go servers provide. Sometimes it's crude,
sometimes it's good, but it's the score, not the winning probability, that
humans can estimate when playing. The best probability estimate I can make
is: if the score is close and the game is still complicated, it's 50-50; if
the score is close but the game is almost over, it's 95-5 for whoever's
ahead. If the score is not close, the player with more points will probably
win.

------
ericb
I'm reminded of Eliezer Yudkowsky's article "There's No Fire Alarm for
Artificial General Intelligence." Is this smoke?

[https://intelligence.org/2017/10/13/fire-alarm/](https://intelligence.org/2017/10/13/fire-alarm/)

Yes, this is not an AGI. But the hockey-stick takeoff from defeating some
players, to defeating an undefeated world champion, to defeating the version
of itself that beat the world champion 100% of the time is _nuts_. If this
happens in other domains, like finance, health, or paper clip collection, the
word singularity is really well chosen: we can't see past this.

~~~
empath75
You don't even need to produce an AGI for this kind of intelligence to be
frightening.

At some point, a military is going to develop autonomous weapons that are
vastly superior to human beings on the battlefield, with no risk of losing
human lives, and there is going to be a blitzkrieg sort of situation as the
relative power of nations shifts dramatically.

If we have two such countries, we could have massive drone wars and cyberwars
being fought faster than people can even comprehend what's happening.

Right now most countries insist on maintaining human control over the
machinery of death. But that will only last for as long as autonomous death
machines don't dominate the battlefield.

It's a fun challenge right now to build a machine that can win at Starcraft,
but it's really a hop, skip, and a jump from there to winning actual wars.

~~~
Veedrac
Nuclear ICBMs already push us past that boundary. The world can no longer
afford to fight a war seriously.

------
davidkuhta
> Previous versions of AlphaGo initially trained on thousands of human amateur
> and professional games to learn how to play Go. AlphaGo Zero skips this step
> and learns to play simply by playing games against itself, starting from
> completely random play.

So technically this version has lost every game it's ever won.

Jokes aside, it's pretty interesting to note that they were able to combine
the "policy" and "value" networks. Good SO answers on the difference
([https://datascience.stackexchange.com/questions/10932/differ...](https://datascience.stackexchange.com/questions/10932/difference-between-alphagos-policy-network-and-value-network))

> accumulating thousands of years of human knowledge during a period of just a
> few days

It'd be interesting to think about what this would mean when things like a
neural lace become a reality.

As an aside, anyone have any other links or references to others investigating
learning algorithms with a 'tabula rasa' approach?

~~~
greysphere
TD-Gammon is a well-known version of this technique (with 2-ply lookahead, vs
a 1600-simulation MCTS here): [https://en.m.wikipedia.org/wiki/TD-Gammon](https://en.m.wikipedia.org/wiki/TD-Gammon)

Temporal difference learning was previously considered weak at 'tactical'
games, i.e. ones with game states that require long chains of precise moves to
improve position (like many checkmate scenarios in chess).
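
For anyone who hasn't seen it, the core TD update is tiny. A minimal TD(0)
state-value sketch (toy chain, all numbers illustrative, not TD-Gammon's
actual setup):

```python
# Minimal TD(0) state-value update on a toy chain of states.

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0):
    """Move V[state] toward the bootstrapped target r + gamma * V[next_state]."""
    V[state] += alpha * (reward + gamma * V[next_state] - V[state])
    return V

# 5-state chain with a terminal reward of 1.0 at the right end.
V = [0.0] * 5
V = td0_update(V, state=3, reward=1.0, next_state=4)  # V[3] moves toward 1.0
V = td0_update(V, state=2, reward=0.0, next_state=3)  # bootstraps from V[3]
```

The weakness at tactics follows from this shape: credit trickles backwards one
bootstrap at a time, so long precise move chains take many iterations to learn.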

For anyone more familiar with this technique, is it clear how the
MCTS/checkpoint system overcomes this? How sensitive is the system to the
tuning params for those parts of the algorithm? Is Go a particularly good
candidate because the ~400 play positions result in a (relatively) small tree
search requirement? (I kinda can't believe I'm saying that Go has 'a small
search tree'!)

We use TD learning for the AI in our game Race for the Galaxy, so it's neat to
hear about possible avenues for improvement!

~~~
greysphere
After digging a bit deeper into the paper, it seems a key part of the new
scheme is that the NN is trained to help guide a deep/sparse tree search (as
opposed to TD-Gammon's fully exhaustive 2-ply search). It's somewhat
surprising to me that the simple win/loss signal is strong enough to train
this very 'intermediate' step in the algorithm; a spectacular result! It
raises the question: what other heuristic-based algorithms could be improved
by replacing a hand-rolled, non-optimal heuristic function with a NN?

~~~
sjg007
It's estimating the probability of winning from the position based on what it
has already seen. So basically it's a giant conditional probability
distribution. Is it mistaken to interpret this as a Bayesian network?

------
Jyaif
"It uses one neural network rather than two." and "AlphaGo Zero only uses the
black and white stones from the Go board as its input, whereas previous
versions of AlphaGo included a small number of hand-engineered features."

This is amazing! The technology they came up with must be super generic.

~~~
skinner_
Also, unsupervised. Also, no rollouts. They got rid of a lot of complexity. At
this point it looks like a reasonable challenge to write a superhuman Go AI in
500 lines of unobfuscated python.

~~~
andai
I was wondering about this: can we study AlphaGo Zero and other nets created
in the same way for similarities, extract and study them? Or are we limited to
observing the behavior and learning from that?

------
gfodor
I'm wondering if once one of these algorithms comes along that has been
perfected if it is going to "burn in" the domain it was built for as the
target of problem reductions, similar to 8086 assembly or the qwerty keyboard
living on today despite them being ancient relics.

For example, after this result it seems if you can reduce your problem domain
onto Go (or a similarly structured game) you now have a way to create a
superhuman solver. It may just be easier to do that then try to even figure
out how to design and tune a new network.

I could imagine waking up in 10 years being confused at why all software
efforts in the AI space are focused on just figuring out clever ways to map
real problems onto a hodgepodge of seemingly random "toy" domains like Go and
Chess and Starcraft. Hell, maybe the Starcraft bot will immortalize Starcraft
in a way the game never would have been able to if it becomes a good reduction
target for a lot of domains.

It kind of reminds me of how SVMs were "abused" by twisting non-linear domains
into them via kernel methods, or by proving the NP-equivalence of a problem by
reducing it onto 3-SAT, or how ImageNet's weights are being re-purposed for
other image oriented prediction tasks.

~~~
zodiac
I think the only thing about go that enables this technique is "turn-based
perfect information 0-sum game".

~~~
hmate9
Also it has fairly limited input. A real-world problem may have many more
possible inputs at any time step, as opposed to placing just one stone.

------
shmageggy
Looks like the performance improvement comes from two key ingredients:

1) Using Residual networks instead of normal convolutional layers

2) Using a smarter policy training loss that uses the full information from an
MCTS at each move. In the previous version, I believe they just ran the policy
network to the end of the game and used a very weak {0, 1} reinforcement
signal over all of the moves played. Here, it looks like they use each run of
MCTS to provide a fully supervised signal over all moves it explores.
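
As I read the paper, the combined loss is (z - v)^2 - pi^T log p + c||theta||^2,
with pi the MCTS visit-count distribution. A toy pure-Python sketch (the move
probabilities below are made up for illustration):

```python
import math

def alphago_zero_loss(p, v, pi, z, theta, c=1e-4):
    """Per-position loss as described in the AlphaGo Zero paper:
    (z - v)^2 - pi . log(p) + c * ||theta||^2,
    where pi is the MCTS visit-count distribution (the 'fully
    supervised' policy target) and z is the final game outcome."""
    value_loss = (z - v) ** 2
    policy_loss = -sum(t * math.log(q + 1e-12) for t, q in zip(pi, p))
    l2_penalty = c * sum(w * w for w in theta)
    return value_loss + policy_loss + l2_penalty

# Toy numbers: 3 candidate moves, network output vs MCTS targets.
p = [0.5, 0.3, 0.2]    # network's move probabilities
pi = [0.6, 0.3, 0.1]   # MCTS visit-count distribution (illustrative)
loss = alphago_zero_loss(p, v=0.2, pi=pi, z=1.0, theta=[0.0])
```

The cross-entropy term is what makes the signal "fully supervised": every move
the search visited contributes gradient, not just the one move that was played.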

~~~
mtremsal
How is it different to apply the loss on each actual move at the end of the
game VS on each rollout (which is itself a tiny game)? Does it help reinforce
learning towards the end game as shorter rollouts are needed? Is the more
accurate information then propagated to earlier moves as well?

~~~
gwern
I think the difference is that under the 1/0 policy gradient loss, it gets
feedback only on the actual chosen move. Under MCTS-rollouts-each-move, it
gets feedback on every move on the board (whether its value estimate was
slightly too high or low), plus the ultimate outcome of the one move it did
make.

------
nyxtom
So this is fun:

"AlphaGo Zero is the program described in this paper. It learns from self-play
reinforcement learning, starting from random initial weights, without using
rollouts, with no human supervision, and using only the raw board history as
input features. It uses just a single machine in the Google Cloud with 4 TPUs
(AlphaGo Zero could also be distributed but we chose to use the simplest
possible search algorithm)."

~~~
VikingCoder
Single machine?

Stunning.

------
polskibus
I remember reading ages ago in Scientific American about a much more
interesting (and useful) AI application of this technique.

Genetic algorithms were used to evolve new, more efficient variants of
existing electronic circuits. I dug it up - it was:
[https://www.scientificamerican.com/magazine/sa/2003/02-01/#a...](https://www.scientificamerican.com/magazine/sa/2003/02-01/#article-magnetars),
the article "Evolving Inventions". I have no idea if there is an open-access
version anywhere.

As far as I remember, that approach led to some patents, because some of the
inventions were better than existing solutions. One of the examples in the
article was a low-pass filter (I don't remember if the AI version was actually
better or worse than the human-made one).

The essential element of this approach is that in electronics (as in Go) there
exists a well-defined set of rules, which allows researchers to build a
simulation engine with an optimization/evaluation function that the AI targets
by itself, without supervision. It's great to see that this approach is still
alive, although in my humble opinion the application in electronics is much
more interesting than Go.

~~~
tmzt
Somebody needs to dig this up and apply it to an open FPGA toolchain like
IceStorm.

The other SA article on this was The Darwin Chip which I think went into more
detail.

One of the limitations was the lack of documentation for the actual bitstream.

------
alexbeloi
Very impressive, the original implementation relied a lot on feature
engineering.

I'm surprised they're able to prevent a self-play equilibrium with such a
simple loss function.

It's sort of like they are using auxiliary outputs but instead of using them
to fit features, they are fitting to multiple ways of arriving at 'best play',
through predicting value (SL) and predicting probability for best outcome
(RL). In principle, they're doing the same thing but in practice it seems like
they are making up for each others shortcomings (e.g. self-play equilibrium
with RL).

------
grondilu
> If similar techniques can be applied to other structured problems, such as
> protein folding, reducing energy consumption or searching for revolutionary
> new materials,

Protein folding sounds like a nice idea for their next challenge.

~~~
thefalcon
[https://www.bloomberg.com/news/articles/2017-10-18/deepmind-...](https://www.bloomberg.com/news/articles/2017-10-18/deepmind-s-superpowerful-ai-sets-its-sights-on-drug-discovery)

Indeed.

------
tc
Things will start getting _interesting_ when we figure out how to get move
simulation and search into the network itself, rather than programming that on
the outside. As far as I know, no one has even the faintest idea of how to do
that. We have an existence proof that this should be possible.

The networks are great at perception and snap-prediction. Anything a human can
do in 200ms is fair game. And with clever engineering, we can make magic
happen by iterating or integrating those things.

But it's after that first 200ms that humans get really _intelligent_. When we
can come up with an architecture that lets the networks themselves start
simulating possibilities, backtracking, deciding when to answer now or to
think more -- when the network owns the loop -- then it will get
_interesting_.

~~~
derefr
> We have an existence proof that this should be possible.

Not guaranteed. The human brain has diffusion signalling (i.e.
neurotransmitters passing out of the synaptic cleft, into a neighbouring one,
and activating a receptor on some other spatially local axon as a result). And
one of those signalling molecules is thought to represent, in its intensity, a
confidence-interval bias adjustment (i.e. a pruning bias factor for MCTS). So
the brain's MCTS-equivalent process may rely on some extra-graphical
properties of the brain-as-embodied-meat-thing.

~~~
red75prime
That would just be a couple of additional terms in the activation function. Or
am I missing something?

~~~
derefr
“Neighbouring” is defined in terms of embedding in a metric space and inverse-
cube diffusion, rather than anything to do with graph connectivity.

Also, these signals pile up in the synaptic cleft until they’re picked up, so
it’s not just about instantaneous transmissivity as if these were radio
signals.

But _also_ also, other stuff like monoamine oxidase is floating about in its
own diffusion patterns, cleaning up these signals.

It’s basically like a “scent” communication embodied-actor model, but a very
complex one where things like redox reactions with the atmosphere occur.

Oh, and there are “secondary messengers”: signals that trigger other signals
that, among other things, inhibit the release of the original signal when
received back at the sender, such that a dynamic equilibrium state is reached
between the two signal types.

------
galkk
Why not use the same approach for chess?

It would be very interesting to see if it could handle the much more advanced
and tuned engines that exist for chess, a game with considerably more
complicated rules.

~~~
Florin_Andrei
I think chess is less compelling because, in a sense, it is a "solved
problem": superhuman AI chess players already exist.

And chess, while it does have more complex base rules, has a much lower
combinatorial complexity than Go.

~~~
EvgeniyZh
Well, I'd love to see NN solution beating top chess engines. It might also
introduce novelty to the game, just as regular engines did

~~~
feelix
It'd be particularly useful to have a chess bot that can play badly in the
same way a human does.

The problem with the current chess bots is that they play badly, badly. They
choose a terrible random mistake to make every few moves, while some of their
other moves are brilliant. They cannot accurately mimic beginner or
intermediate level players.

~~~
thefalcon
This seems like something DeepMind could create, given the incentive. They
were able to train AlphaGo to predict human moves in Go at a very high
accuracy (obviously not with AlphaGo Zero, but the inferior human-predictive
version is how they determined that AGZ is playing qualitatively differently).

------
titzer
This is pretty incredible, especially the power dissipation results. Only 4
TPUs? Humans are toast.

~~~
skykooler
What is a TPU?

~~~
red75prime
[https://en.wikipedia.org/wiki/Tensor_processing_unit](https://en.wikipedia.org/wiki/Tensor_processing_unit)

~~~
skykooler
Thanks!

------
dsp1234
Comparing the top player's Elo with Zero's Elo (assuming the numbers are
accurate, etc.):

_Your rating: 3664_

_Opponent's rating: 5000_

_Probability of winning: 0.000456879355457417_

So 1 in 2,200 games... ouch
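
The standard Elo expected-score formula reproduces that number:

```python
def elo_win_probability(rating, opponent_rating):
    """Standard Elo expected score: 1 / (1 + 10^((Ro - Rp) / 400))."""
    return 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))

p = elo_win_probability(3664, 5000)   # ~0.000457, i.e. roughly 1 in 2,200
```

Every 400 points of rating gap multiplies the odds against you by 10, so a
1336-point gap is a bit over three orders of magnitude.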

~~~
mda
I don't think you can apply this to alphago. I think probability for a human
to beat alphago now is zero.

Lee Sedol's single victory is the first and the last.

~~~
VikingCoder
I disagree. This is precisely what Elo predicts, and it has been pretty
accurate over time - it's a good metric.

~~~
mda
For humans yes. For humans against machines? I don't think so. Can any human
beat a modern chess computer? The chance is zero.

~~~
VikingCoder
Zero means zero, yes?

Alpha particles can flip bits and cause erratic behavior, can they not?

"[The probability of] at least one bit error in 4 gigabytes of memory at sea
level on planet Earth in 72 hours is over 95%"

~~~
mda
Oh hey maybe they use ECC? Are we really arguing this? Pedantry on a weird
level.

------
panic
Is AlphaGo Zero the first Go program without special code to read ladders? I'm
curious how a pure neural net can read them, given how non-local they are.

~~~
thefalcon
The concept of locality is nothing but a human weakness in Go, the best AI
must read the whole board with every move.

EDIT: From the paper: "Surprisingly, shicho (“ladder” capture sequences that
may span the whole board) – one of the first elements of Go knowledge learned
by humans – were only understood by AlphaGo Zero much later in training" I'm
surprised by the author's use of the word "Surprisingly" here.

~~~
waqf
AlphaGo is still based around layers of 3×3 local convolutions.

That represents a strong assumption about locality in the network design. I
would expect AlphaGo to perform poorly on the game "Go with the vertices
randomly permuted".
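
For a rough sense of how depth buys back non-locality, a back-of-the-envelope
sketch (not the paper's actual layer count):

```python
def receptive_field(num_layers, kernel=3):
    """Receptive field of a stack of stride-1 k x k convolutions:
    each layer widens the field by (k - 1), starting from 1."""
    return 1 + num_layers * (kernel - 1)

# Each 3x3 layer adds 2, so spanning a full 19x19 Go board takes 9 layers;
# a deep residual stack is global in effect despite its local kernels.
layers_needed = next(n for n in range(1, 100) if receptive_field(n) >= 19)
```

So the locality assumption is per-layer only; with enough depth the network
can still relate opposite corners, which may be why ladders are learnable at
all, just late in training.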

~~~
Palptine
Well, they didn't use Inception. If their Inception units had, say, a 7x7
conv, then ladders would probably be found much earlier.

------
sobellian
The catch is that this isn't _quite_ zero human knowledge, since the tree
search algorithm is a human discovery, and not one that came easily to humans.
It also massively cuts down on the search space for an appropriate policy
function.

That means that this setup isn't necessarily general. How applicable is MCTS
to games with asymmetric information, a la Starcraft? What about games that
can't quite be modeled with an alternating turn-based game tree like bughouse?

~~~
anomnomidas
There's a Dota 2 bot by OpenAI that played games against itself and managed to
beat a lot of pros in the scene. It's still SF mid only, with no runes and
some restricted items, but it shows that there is also potential for
Starcraft.

[https://blog.openai.com/dota-2/](https://blog.openai.com/dota-2/)

~~~
zardo
Maybe you know something we don't...

"We’re not ready to talk about agent internals"

What makes you think it uses a tree search?

------
narvind
How I wish Marvin Minsky had stayed alive for one more year and seen this. He
would have been so happy!

~~~
bradleybuda
In the days when Sussman was a novice, Minsky once came to him as he sat
hacking at the PDP-6.

“What are you doing?”, asked Minsky.

“I am training a randomly wired neural net to play Tic-Tac-Toe” Sussman
replied.

“Why is the net wired randomly?”, asked Minsky.

“I do not want it to have any preconceptions of how to play”, Sussman said.

Minsky then shut his eyes.

“Why do you close your eyes?”, Sussman asked his teacher.

“So that the room will be empty.”

At that moment, Sussman was enlightened.

[http://www.catb.org/jargon/html/koans.html](http://www.catb.org/jargon/html/koans.html)

~~~
tedunangst
So Sussman was right the first time?

~~~
zardo
A random net has some random preconception. That doesn't mean it's a bad idea
to try random preconceptions.

------
Aron
One idea that occurs to me: now evolve the Go game itself in a direction that
adds more challenges for an AI to solve, and then solve those problems. How
about being able to handle different and randomized board shapes? How about
being allowed to name one move the opponent cannot make when you play a piece?
It would be interesting to keep track of which variations the algorithm
handles well automatically, and which it falls flat on, etc.

~~~
lozenge
Arimaa was a chess-inspired game intended to be difficult for computers. It
"fell" in 2015.

------
auggierose
This is such an impressive result, and so general, I bet many people
(including me) wish they knew exactly how to duplicate this result. It would
be great if they created an online course that explained all algorithms in
detail right up to the creation of AlphaGo Zero itself. The paper gives the
impression that it shouldn't be too hard for them to create such a course.

------
VanillaCafe
So in terms of training, they went from nothing to "Go Singularity" in about a
month? Impressive.

------
tim333
Slightly scary how it went from zero to superhuman play in three days. I
wonder if general AI will go that way one day.

~~~
goatlover
General AI relative to an individual human, or to billions of humans? The sum
total of human beings, or organizations of humans, is superhuman relative to
an individual. We've had superhuman organizations for millennia. I'm not sure
how much general AI will be different, other than the large-scale automation
of jobs that would happen.

As Rodney Brooks pointed out, all technology happens within a context, not a
vacuum. A general AI will come to exist in a world with a lot of other
superhuman capabilities already in existence.

~~~
nothis
One of the more interesting things the success of "starting with zero"
suggests is that the idea that some mystical "human consciousness" is the end
goal for AI might be laughable in the long term. AI might just casually bypass
human consciousness, say "oh, hi!" and wave us goodbye a day later. Also, a
factor of 7 billion "happens" in computer science.

This is getting rather creepy to think of, even if it's still science fiction.
At this point, I could see a computer that out-thinks humanity within decades.
What would it think? What would we even do with its findings? Would we
understand it? Would it understand itself? Would it know how to manipulate us?

------
hetspookjee
Amazing results, though I am somewhat frightened by how generic this model is
and how it achieved such amazing results. I can't help but think that these
same techniques could be used to learn how humans react in certain situations
and how they can, very subtly, be worked to think in a certain way - one that
fits the agenda of whatever party is behind it.

With the mass surveillance that is Google, it's quite doable to test for human
reactions to certain things. They have the tools to execute a certain plan and
evaluate its effectiveness. Of course it can also go in a benevolent way:
like, what kind of policy will benefit the most people? (semantics of
'benefiting' aside)

I at least certainly hope these kinds of generic algorithms will be used to
generate effective, meaningful policies that truly help people. Still a
far-away future, but one that gets closer by the day.

~~~
SubiculumCode
I'd only worry if it can outperform humans when there are no rules per se.
That is, if I put a queen down on the Go board, start knocking off stones,
move three times a turn, then take a lighter and burn the Go board, and the AI
responds by decapitating me.

~~~
goatlover
Ha! I do wonder about using a board game where the rules periodically change
in simple ways at random. A human could easily adapt to the rule changes while
playing and adjust their strategy accordingly. Would a Deep Learning algorithm
be able to do this?

If we keep the board and pieces digital, then the board could change shape,
the pieces could change color indicating a random association with a rule
change, and what not.

------
indescions_2017
What's fascinating (and admittedly somewhat worrying) about self-play is that
an agent can accidentally become adept at tasks other than the one intended,
via transfer learning. The "wrestling spiders" in OpenAI's demo quickly
mastered the art of sumo wrestling. And whatever skills they learned in
resisting an opposing force to stay standing on a platform were immediately
applicable to myriad different domains. In this case, being subject to
hurricane-force winds and not, as any normal spider would be, hurled into the
sky!

It's more difficult to see how Go playing skills can translate to other
domains. But for tasks in robotics, cybersecurity or fintech the power of
self-play trained transfer learning becomes more apparent.

~~~
visarga
It is clear that these "self-play" scenarios depend on simulation: unless
there is an appropriate stage for self-play to take place on, there can be no
play. The question is: how do we stand with simulation for robotics, self-
driving cars, etc.?

My bet is that simulation is going to be the crown jewel of the AI field,
replacing static datasets and supervised learning with "dynamic datasets" and
rewards. It would help with data sparsity as well (where can you find an image
of a donkey riding an elephant for the new ImageNet? - but you can sim that,
or any possible combination).

Not to mention that humans have fallen head over heels for simulation as well
- VR headsets and games in general. I see a great future for simulation with
both AI and humans. It will be our common learning/playing/research sandbox.

------
sidcool
I would be willing to spend an entire lifetime to perfectly understand how
this algorithm works. Currently I can barely write Dijkstra's algorithm.

~~~
visarga
You can watch the RL course given by one of the inventors of AlphaGo, David
Silver.

[https://www.youtube.com/playlist?list=PLzuuYNsE1EZAXYR4FJ75j...](https://www.youtube.com/playlist?list=PLzuuYNsE1EZAXYR4FJ75jcJseBmo4KQ9-)

~~~
sidcool
This is pretty cool. Thanks!

------
naveen99
It would be nice if there were an open source attempt at an AlphaGo clone on a
9x9 board, so it could be run on commodity hardware and maybe trained in a
more reasonable time. It would also be interesting to see whether a human
would still win on a 190x190 or some arbitrary-size board against AlphaGoZero
trained appropriately.

------
Invictus0
Is this evidence of a broader leap forward in machine learning, or are these
advancements domain-specific? In other words, could these innovations be
applied to other fields and applications?

~~~
habitue
I think the fact that it's no longer using Monte Carlo rollouts is a huge
step forward in the generalizability of the technique. But Go is still

- a perfect information game

- with a relatively small input size (vs. arbitrary computer vision)

- cheap to simulate

- discrete action space

- deterministic

This isn't to take away from the magnitude of the achievement, but the nature
of the problem itself makes the result less applicable to many tasks we might
want to use RL for.

~~~
abecedarius
Math research shares those qualities, except for small input size, if you
include the body of all already-known theorems as an input. I don't know if
we'll see much smarter proof assistants soon, but it doesn't seem absurd to me
as a possible development.

~~~
dane-pgp
Actually, just converting all the already-known theorems into a form that can
be computationally verified (not just convince a skilled human) would be an
interesting starting point. This would really help the metamath project, and
perhaps make peer review of mathematical research papers easier:

[http://us.metamath.org/mpeuni/mmset.html](http://us.metamath.org/mpeuni/mmset.html)

------
conistonwater
_Extended data figure 2_ contains something really cool: how long it takes
AG0 to discover a joseki, and how long it takes it to later discard it as a
non-joseki. So you could, in theory, evaluate a human joseki by plotting AG0's
probability of playing it against training time (or against its Elo rating).
What's also cool is that the differences that cause it to stop playing a
joseki must be really minute, but it can see them anyway.

------
pedrohorta
I think that we cannot fully understand the implications of this for our
present/near future. The use of the technology will make the difference.
There is a small book on Amazon called something like AlphaGo Zero - 10
Prophecies for the World. Interesting point of view on the use of it.

------
matthewtoast
> The system starts off with a neural network that knows nothing about the
> game of Go. It then plays games against itself

I might have missed this, but: where are the actual rules of Go encoded?
Mustn't there be some enumeration of what constitutes "capturing," and how the
win condition of the game is calculated?

~~~
klipt
Yes, I assume you start with an objective function ("winning Go") that
includes the rules.

~~~
dane-pgp
Surely if they are really starting with "zero", then all the AI is given is
the arrangement of stones on the board (which starts empty) with the
opportunity to select a position for its next stone after its opponent has
placed one, until the game is over. (Let's assume that there is another piece
of software responsible for determining when the game has finished, and who
has won). As such, the only "rules" the AI needs are that it can only place
one stone at a time, only in an empty position, and only when it is not the
opponent's turn.

To start with "less than zero", though, it would be interesting to see them
give the AI a 3D simulation of a room with a simulated Go board and a
simulated stone, and give the AI a fixed amount of time for it to have its
turn. Just by using the pixel data from a simulated camera, it could learn to
use a simulated arm to place the simulated stone on the board in a legal
position. The reward function would just have to say, at the end of each
allotted time period, whether a legal move had been made or not, and the AI
could bootstrap up from that.
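The "zero" setup described above can be sketched as an environment interface. This is a hypothetical, heavily simplified illustration (names like `MinimalGoEnv` are mine, not DeepMind's): the agent only ever sees the board and which points are open, while capturing and scoring would live inside the environment.

```python
import numpy as np

class MinimalGoEnv:
    """Hypothetical environment sketch: the agent sees only the board and
    which points are empty; capturing and scoring rules would live entirely
    inside the environment, invisible to the learner."""

    def __init__(self, size=9):
        self.size = size
        # 0 = empty, 1 = black stone, -1 = white stone
        self.board = np.zeros((size, size), dtype=np.int8)
        self.to_play = 1

    def legal_moves(self):
        # Simplified: any empty point. Real Go also forbids suicide and ko,
        # which this sketch deliberately omits.
        return list(zip(*np.where(self.board == 0)))

    def play(self, move):
        r, c = move
        if self.board[r, c] != 0:
            raise ValueError("illegal move: point occupied")
        self.board[r, c] = self.to_play
        # A full implementation would remove captured groups here.
        self.to_play = -self.to_play
```

The point of the sketch is the division of labor: all "rules knowledge" sits behind `legal_moves()` and `play()`, so the learning side never needs the rules spelled out, only a legal-move mask and, eventually, a game outcome.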

------
partycoder
Very impressive. I wonder if the reduction in computing costs has to do with
pruning the resulting trained networks.

------
braymundo
I wonder if it could be applied to SC2 (since they already started the
research: [https://deepmind.com/blog/deepmind-and-blizzard-open-
starcra...](https://deepmind.com/blog/deepmind-and-blizzard-open-starcraft-ii-
ai-research-environment))

~~~
zardo
As is, no. There are too many possible actions and too little time between
decisions.

I wouldn't discount it entirely, though; some sort of clustering of actions
may be able to reduce continuous action spaces to a manageable branching
factor.

------
yazr
Did they publish any papers on AlphaGo Master ?

AFAIK the 2016 Nature paper is AlphaGo Lee. And now we've skipped to AlphaGo
Zero.

~~~
gwern
No. Skipped straight to Zero from AG Fan Hui (not AG Lee).

------
infinity0
On the bright side, it means that over the several thousand years humans have
been playing Go, we were actually going "in the right direction" in terms of
optimal strategy, despite not having reduced the game to provably-optimal
mathematical theorems.

~~~
VikingCoder
I sometimes think about boats.

We built damn good boats, even before we knew anything formal about fluid
dynamics, or even AIR.

------
yters
Is there anywhere to see the games? I'm curious if the AI is superior to
humans or just human trained AI. It'd also be interesting to see the source,
but that is apparently not being released for some reason.

~~~
sillysaurus3
_It'd also be interesting to see the source, but that is apparently not being
released for some reason._

Last time this was brought up, someone implied it's closed source so as not to
boost the Chinese competitor.

~~~
yters
If all their methods for success are in the paper, the Chinese competitor can
just copy that.

~~~
sillysaurus3
Theoretically, yes. But anyone who has tried to implement scientific papers
will tell you it's far harder. The papers often lack critical details and
implementation hacks: all the little rough edges that go into making a
production system work. They also lack context in many cases, so you spend
more time reverse engineering the paper than figuring out how to make it work.

~~~
yters
It seems odd they make a big deal about tabula rasa, but won't release the
source for verification.

------
wwarner
I would love to know how this adversarial training doesn't end up overfitting.
Or, put another way, I'd love to see another piece of software (or even a
human being?) exploit Alpha Go's overfitted strategy.

------
SkyMarshal
One thing interesting here is how close the best humans appear to be to the
asymptotic theoretical max skill level.

------
smdz
Deep reinforcement learning is interesting and has plenty of potential. But
highlighting AlphaGo as an example of reinforcement learning is like
undermining the concepts of reinforcement learning.

~~~
ionforce
How so?

------
thefalcon
A great resource for "watching" AlphaGo's games: [http://www.alphago-
games.com](http://www.alphago-games.com)

------
mrcactu5
This is quite an achievement, that it plays so accurately. Does this have
implications for strategy in general? No two games are alike, yet we are able
to learn from our experience. Human players must somehow be comparing local
positions from various games and deciding (sometimes wrongly) that they can be
played similarly. Can it recommend not just individual moves, but general
modes of play for classes of positions?

------
andai
Serious question: could it be possible, even in theory, to translate a neural
network to a high-level programming language? So we can see what it's doing?

------
tasuki
Absolutely amazing. I'm dying to see the game records!

~~~
weavie
You can find them here : [http://www.alphago-games.com](http://www.alphago-
games.com)

~~~
Radim
Weird how they anthropomorphized and capitalized "her Final Form" there.
Creepy!

------
seanwilson
How do they evaluate that AlphaGo Zero is better than the previous AlphaGos?
By playing them against each other? Or playing AlphaGo Zero against humans?

~~~
auggierose
The same way they make AlphaGo Zero better, by playing against other AlphaGos.

~~~
seanwilson
But what if the other AlphaGos are blind to some tactic that humans could take
advantage of? Not saying this is likely given how the Lee Sedol match and
others went, but I'm curious how they come up with the rankings.

~~~
auggierose
At the end of the paper they describe how they come up with Elo rankings, and
that in order to avoid bias from self-play only, they include the results of
the AlphaGos versus Fan Hui, Lee Sedol, etc.
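For reference, the standard logistic Elo model underlying such rankings is simple to state: the expected score of a game follows a logistic curve of the rating difference. A minimal sketch of the textbook formulas (the paper itself uses a more elaborate BayesElo-style fit; `k=32` here is just a conventional illustrative value):

```python
def expected_score(r_a, r_b):
    """Probability that player A beats player B under the logistic Elo model:
    a 400-point rating edge corresponds to 10:1 odds."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32):
    """One rating update after a game; score_a is 1 (A wins), 0.5, or 0."""
    e_a = expected_score(r_a, r_b)
    delta = k * (score_a - e_a)
    return r_a + delta, r_b - delta
```

Anchoring self-play ratings to the human games against Fan Hui and Lee Sedol is what keeps the scale comparable to human Elo, rather than an internal ladder that could drift.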

------
nicklovescode
Anyone know why they trained on TPUs? My understanding is that the main
benefit of a TPU is inference. Is this not true?

~~~
postnihilism
The first generation TPUs were focused on inference. Second generation TPUs
are focused on both training and inference.

[https://www.blog.google/topics/google-cloud/google-cloud-
off...](https://www.blog.google/topics/google-cloud/google-cloud-offer-tpus-
machine-learning/)

------
COil
What if a top human player could train like AlphaGo Zero over billions of
games and record everything?

------
Abishek_Muthian
Algorithms > Big data in this particular machine learning scenario makes it
very impressive.

------
curtisgiddings
Anyone happen to have a link (or can PM one to me) to a copy that isn't behind
a paywall?

~~~
jvolkman
From the Deepmind blog post:
[https://deepmind.com/documents/119/agz_unformatted_nature.pd...](https://deepmind.com/documents/119/agz_unformatted_nature.pdf)

------
foobaw
Would love to see some gameplay!

~~~
thefalcon
Here's a self-play game: AGZ v AGZ -
[http://eidogo.com/#u2UdsDFJ](http://eidogo.com/#u2UdsDFJ)

------
Dowwie
Can self-improving AI predict behaviors in emergent environments? Will they?

~~~
zardo
Is Go strategy not an emergent property of the rules of Go?

------
hzhou321
How does that timeline have any meaning?

~~~
Tepix
It does if you know that they used 4 TPUs.

