AlphaGo Zero: Learning from scratch (deepmind.com)
923 points by stablemap on Oct 18, 2017 | 311 comments

The fact that they only used self play with no outside input here is really interesting. I wonder if this system produced more new styles of play. While I am not that familiar with Go, I know in some of the other articles they talk about things like Chinese starts that are specific to certain cultures. I wonder if the fact that it had no outside reinforcement made it produce movements that we have already seen that are somehow inherent to the game, or if it produced many more new moves that were a result of it learning without any possibility of cultural interference. According to the article it did invent some unconventional and creative moves, but I also wonder how much it rediscovered.

I also wonder how much its style of play would change if it were retrained, due to the random start that it is given. Maybe that would produce something like seeds for procedurally generated worlds in games. Like if they could find a seed for Chinese or Japanese players, or ones with more aggressive styles. This is some pretty cool work and may open up even more doors for pure reinforcement learning.

I don't think it's an overstatement to say that, since playing Lee Sedol in 2016, AlphaGo has completely revolutionized professional and amateur go. It's certainly not unprecedented — the last major revolution happened in the early 20th century (often called the 'Shin Fuseki' era [0]) — but AlphaGo has demonstrably surpassed any previous high-water mark.

> I wonder if this system produced more new styles of play.

Absolutely. One such innovation has been the use of early 3-3 invasions [1]. There are many more, and indeed AlphaGo's games are still being analyzed by professional players. Michael Redmond, a 9-dan professional, has been working with the American Go Association on one such series [2].

> I wonder if the fact that it had no outside reinforcement made it produce movements that we have already seen that are somehow inherent to the game...

Interestingly, yes. Strong players have commented that AlphaGo seems to agree with things that players like Go Seigen [3] have suggested in the past, but that were never fully developed or understood [4].

Very, very interesting work indeed.

[0] https://senseis.xmp.net/?ShinFuseki

[1] https://www.eurogofed.org/index.html?id=127

[2] http://www.usgo.org/news/category/go-news/computer-goai/mast...

[3] https://senseis.xmp.net/?GoSeigen

[4] https://lifein19x19.com/forum/viewtopic.php?f=13&t=14129

I have only skimmed the paper but one thing I don't see any discussion of is whether komi (the handicap given to white for going second) is correct.

They do say the rules used for all games, including self-play, set komi consistently to 7.5.

If the strongest AI was consistently winning predominantly with one color it would be an indication that komi isn't fair for the best play.

Of the 20 games released for the strongest play it appears white won 14 times and black 6. I don't think that is enough to be conclusive but maybe komi is too high.
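For a rough sense of how inconclusive 14 of 20 is, here is a quick sketch of an exact two-sided binomial test against a fair 50/50 split. It assumes the 20 games are independent samples, which a hand-picked set of released games may well not be, and the function name is just made up for illustration.

```python
from math import comb

def two_sided_binomial_p(wins: int, games: int) -> float:
    """Exact two-sided p-value for `wins` out of `games` under a fair 50/50 split."""
    # The fair binomial is symmetric, so double the tail at least as extreme as observed.
    k = max(wins, games - wins)
    tail = sum(comb(games, i) for i in range(k, games + 1)) / 2 ** games
    return min(1.0, 2 * tail)

p = two_sided_binomial_p(14, 20)
print(f"p-value for a 14-6 split: {p:.3f}")
```

Under those assumptions the p-value lands a bit above 0.1, so a 14-6 split is suggestive at best.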

I wonder if different "correct" play at the strongest levels would be learned with a 6.5 komi.

You can only change komi by full point increments. There is a .5 to break ties, but a komi of 7.5 is identical to one of 7.4.

From a theoretical standpoint, any non-integer komi should lead to one player winning 100% of the time. So even if the actual win ratio is 14:6 at komi=7.5 that might still be the best value.

If you had an estimate of the real difference, you could switch to breaking ties randomly. Black wins 60% of the ties, white wins 40%. There will be a ratio at which each side should win 50% of the time.
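The ratio falls out of one line of algebra: if Black wins outright with probability p_b at an integer komi and ties happen with probability p_t, you want the share q of ties given to Black to satisfy p_b + q * p_t = 0.5. A small sketch, with purely hypothetical input numbers chosen so the answer matches the 60/40 example:

```python
def fair_tie_break(p_black_win: float, p_tie: float) -> float:
    """Share of ties to award Black so that Black wins half of all games.

    Solves p_black_win + q * p_tie = 0.5 for q.
    """
    if p_tie <= 0:
        raise ValueError("no ties to split")
    q = (0.5 - p_black_win) / p_tie
    if not 0.0 <= q <= 1.0:
        raise ValueError("no tie-break ratio can balance this komi")
    return q

# Hypothetical: at integer komi Black wins 41% outright and 15% of games are ties.
q = fair_tie_break(0.41, 0.15)
print(f"award Black {q:.0%} of the ties")
```

If q falls outside [0, 1], the integer komi itself is off by at least a point and no way of splitting ties can fix it.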

I agree that with perfect play, it will be a 50% of a tie to each side. But it is still interesting to ask for a better estimate of practical play.

Michael Redmond mentions this in the AlphaGo vs AlphaGo review series he's doing with the AGA. AlphaGo self-play games are with 7.5 komi under Chinese rules, and apparently, DeepMind has stated that black vs white wins is almost exactly 50/50. IIRC Redmond mentioned that white (?) only had some sub 1% advantage in the entire self-play corpus.

I don't remember where I read it but in some earlier versions of AlphaGo they tried a komi of 6.5 and black ended up winning more often. That indicates the correct komi value is 7, but since Go doesn't have ties, you have to pick which side you want to favor to break the tie. (White seems reasonable.)

That doesn't sound right because in Chinese rules, which is what AG uses, komi only changes in steps of two. Are you sure about 6.5? Could it be 5.5?

Well in Japanese rules the komi is 6.5 so that's the alternative that tends to come up. With some quick searching I found a transcript from one of the games where DeepMind said 7.5 slightly favors white but they didn't say anything about 6.5 or 5.5, while a random comment from r/baduk claims that pro game analysis shows 6.5 slightly favors black and 7.5 slightly favors white.

I guess they should then randomize komi so any player would have 50/50 chance of getting a small edge.

The correct komi number has puzzled Go players for centuries; now we might finally have a chance to figure out the right answer (although not without some reservations). Over the last 5 decades, komi has consistently been raised to keep the game more level between white and black (black makes the first move, so has the advantage). Historically, there was no komi, and people kept things even by always playing an even number of games, with the players switching sides after each game.

For whatever reason, that's no longer feasible in modern pro games (not to mention that it could result in no winner if each player wins half the games), so komi was introduced, at first at 5.5, steadily climbing to 7.5 at present. In pro games, even a change of 1 is considered a big deal, so going from 5.5 to 7.5 is hardly trivial.

Now with AlphaGo playing "perfect" games against itself, we might finally be able to put to rest the debate of the correct komi (the Japanese Go associations for decades have kept meticulous records of every professional game, in order to find the correct komi).

There is a big "but" though. The correct komi at AlphaGo Zero's level might not be the correct komi for human level players (AlphaGo is estimated to be 2-3 handicaps above human play; this is a bigger gap than the one between the average pro player and the best amateurs).

Indeed, the change from 5.5 komi to 7.5 komi also had a lot to do with the change in play style rather than simply zooming in on the "correct" komi number. In the 70s and 80s, the predominant play style was more conservative, and 5.5 might well have been the correct komi for the time (defined as resulting in a 50:50 chance of winning for either side). As play style shifted to become more aggressive and confrontational (actually fueled somewhat by the introduction of komi), it was discovered that komi needed to be raised to keep the chances of winning at 50:50.

To make an analogy, suppose one is playing a casino game of chance that gives the house a slight advantage (similar to the first mover advantage for black in go). If one only makes small bets, the house will end up winning only a small amount. In other words, the player needs to be compensated by a small amount to make the game "fair".

If however one makes big bets (i.e. more aggressive game play), then the compensation needs to be bigger too, to make the game "fair", even if the underlying probabilities have not changed.

Following this logic, while 7.5 komi is fair for AlphaGo vs. AlphaGo games, it might not be the right number for human games. I suspect it might be smaller for humans.... if only we could calibrate AlphaGo to the average human level and generate millions of self-play games...

They could just test it right?

Or am I misunderstanding the hardware requirements?

With respect to your very interesting comment (I genuinely appreciate your input), you appear to have misunderstood the comment you were replying to.

You've commented on the differences in the style of play that AlphaGo introduced, but the post you were replying to (by aeleos) was going a step further and hypothesising about the potential for a newer, completely 'non-human' style that AlphaGo Zero may have created.

Your comments definitely contribute to the discussion but it was bugging me that there appeared to be a tangent forming about AlphaGo that was overlooking AlphaGo Zero which would be the more interesting area to explore.

"more new styles of play" seems to indicate that non-human play.

Yes, aeleos was interested in that, and so am I, and it seems to be what this entire thread _should be about_. kndyry steered back towards AlphaGo. I'm not sure this merits any further dissection.

The key part from the paper:

> To assess the merits of self-play reinforcement learning, compared to learning from human data, we trained a second neural network (using the same architecture) to predict expert moves in the KGS Server dataset; this achieved state-of-the-art prediction accuracy compared to previous work [12, 30–33] (see Extended Data Tables 1 and 2 for current and previous results, respectively). Supervised learning achieved a better initial performance, and was better at predicting human professional moves (Fig. 3). Notably, although supervised learning achieved higher move prediction accuracy, the self-learned player performed much better overall, defeating the human-trained player within the first 24 h of training. This suggests that AlphaGo Zero may be learning a strategy that is qualitatively different to human play.

That is really interesting. Given a neural network that solely exists to play Go, one that is influenced by the human mind is limited compared to the exact same neural network that doesn't have that influence.

EDIT: changed a set of neurons to neural network per andbbergers comments

Please don't refer to it as 'a set of neurons' - it only serves to fuel the (IMO) absolutely ridiculous AI winter fearmongering, and is also just a bad description. Neural nets are linear algebra blackboxes, the connections to biology are tenuous at best.

Sorry to be that guy, but the AI hype is getting out of hand. COSYNE this year was packed with papers comparing deep learning to the brain... it drives me nutty. Convnets can be reasonably put into analogy with the visual system.... because they were inspired by it. But that's about it.

To address your actual comment: I would argue that this is not really interesting or surprising (at least to the ML practitioner), it is very well known that neural nets are incredibly sensitive to initialization. Think of it like this: as training progresses, parameters of neural nets move along manifolds in parameter space, but they can get nudged off of the "right" manifold and will never be able to recover.

Sorry for the rant, the AI hype is just getting really out of hand recently.

Machine learning is specifically not magic. Blackboxes are not useful. Convnets work so well because they build the symmetries of natural scenes directly into the model - natural scenes are translation invariant (as well as a couple of other symmetries), anything that models them sure as hell better have those symmetries too, or you're just crippling your model with extra superfluous parameters.
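To make the symmetry point concrete, here is a toy sketch (plain Python, 1-D, wrap-around borders, all names invented for the example) showing that a convolution commutes with translation: shift the input and the output shifts the same way. Strictly that is equivariance; invariance comes from pooling on top of it. The layer also carries only 3 weights no matter how long the signal is.

```python
def circular_conv(signal, kernel):
    """1-D circular cross-correlation: the same weights are applied at every position."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]

def shift(signal, k):
    """Cyclically shift a signal k steps to the right."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)

signal = [0, 1, 3, 2, 0, 0, 5, 1]
kernel = [1, -2, 1]

# Shifting then convolving gives the same result as convolving then shifting.
assert circular_conv(shift(signal, 3), kernel) == shift(circular_conv(signal, kernel), 3)
```

A fully connected layer over the same 8 inputs would need 64 weights and would have to learn this property from data instead of getting it for free.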

I changed my comment to neural network since a set of neurons is somewhat wrong, but I don't really agree that there isn't much of a connection between this and biology. While there might not be much of a connection between how they currently work and how our brains work, the whole point of machine learning and neural networks is to improve computers' performance on the things we are good at. And while originally it was loosely modeled on it, and might be different now, it doesn't make it so people can't compare it to the brain. It would be wrong to say it is exactly like the brain, but I don't think there is anything wrong with comparing and contrasting the two. If our goal is to improve performance and we are the benchmark, then why shouldn't we compare them.

What I found interesting was mainly that it was us who nudged the parameters you talked about onto the "wrong" manifold, especially given how old and complicated Go is. The sheer amount of human brain power that has been put into getting good at a game wasn't able to find certain aspects of it, and in 60 hours of training a neural network was able to.

I'm not saying there is nothing of value to be obtained by investigating connections between ML and the brain. That's how I got into ML in the first place, doing theoretical neuro research.

We absolutely should and do look to the brain for inspiration.

I'm taking issue with the rather ham-fisted series of papers that have come out in recent years aggressively pushing the agenda of connections between ML and neuro that just aren't there.

Are you sure that humans have done more net compute on Go than DeepMind just did? The Go game tree is _enormous_, and humans are biased. We don't invent strategies from scratch, we use heuristics handed down to us from the pros (who in turn were handed down the heuristics from their mentors).

To me, it's not so interesting or surprising that the human initialized net performed worse. We just built the same biases and heuristics we have into that net.

As far as we know the brain is just a "linear algebra blackbox". It's an uninteresting reduction since linear algebra can describe almost everything. Yes NNs aren't magic, but neither is the brain. Likely they use similar principles. Hinton has a theory about how real neurons might be implementing a variation of backpropagation and there are a number of other theories.

>As far as we know the brain is just a "linear algebra blackbox"...Likely they use similar principles.

I'm not an expert, but my impression is that this is not really a reasonable claim, unless you're only considering very small function-like subsystems of the brain (e.g. visual cortex). Neural nets (of the nonrecurrent sort) are strict feed-forward function approximators, whereas the brain appears to be a big mess of feedback loops that is capable of (sloppily, and with much grumbling) modeling any algorithm you could want, and, importantly, adding small recursions/loops to the architecture as needed rather than a) unrolling them all into nonrecursive operations (like a feedforward net) or b) building them all into one central singly-nested loop (like an RNN).

The brain definitely seems to be using something backprop-like (in that it identifies pathways responsible for negative outcomes and penalizes them). But brains also seem to make efficiency improvements really aggressively (see: muscle memory, chunking, and other markers of proficiency), even in the absence of any external reward signal, which seems like something we don't really have a good analogue for in ANNs.

There are some parts of the brain we have no clue about. Episodic memory or our higher level ability to reason. But most of the brain is just low level pattern matching just like what NNs do.

The constraints you mention aren't deal breakers. We can make RNNs without maintaining a global state and fully unrolling the loop. See synthetic gradients for instance. NNs can do unsupervised learning as well, through things like autoencoders.

A pattern matcher can learn high level reasoning. Reasoning is just a boolean circuit

> It's an uninteresting reduction since linear algebra can describe almost everything.

The question is whether it can do so efficiently. As far as I know, alternating applications of affine transforms and non-linearities are not so useful for some computations that are known to occur in the brain such as routing, spatio-temporal clustering, frequency filtering, high-dimensional temporal states per neuron etc.

Hinton changes his opinion about what the brain is doing every 5 years... Hinton is not a neuroscientist...

If he changes his opinion, which I understand to be models of the brain in this case, and each iteration improves the model, then that is perfectly fine. It would be bad if someone did not change their view in case of inconsistent evidence.

For political opinions sure, but if he's changing his opinions so often ...

When you're a big scientific figure, I think that you have some extra responsibility to the public to only say things you're very confident about. Or otherwise very clearly communicate your uncertainty!!

> Hinton is not a neuroscientist...

It's not like neuroscientists know that either.

Agreed. If we announce that A* search is superhuman in finding best routes, most technorati wouldn't bat an eye. Technically it is probably accurate to say that the results here show that neural networks can find good heuristics for MCTS search through unsupervised training in the game of Go. According to DeepMind authors:

"These search probabilities usually select much stronger moves than the raw move probabilities of the neural network; MCTS may therefore be viewed as a powerful policy improvement operator. Self-play with search – using the improved MCTS-based policy to select each move, then using the game winner as a sample of the value – may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure ..."

The fact that this reinforcement training is unsupervised from the very beginning is quite exciting and may lead to better heuristics for other kinds of combinatorial optimization problems.

Please don't refer to them as black boxes. The internals are fully observable.

Fully observable and we still have no idea what the hell it's doing.

Makes neuroscience seem kinda bleak doesn't it?

There has been a lot of great work lately building up a theory of how these things work, but it is very much still in the early stage. Jascha Sohl-Dickstein in particular has been doing some great work on this.

We don't even have answers to the most basic questions.

For instance (pedagogically), how the hell is it possible to train these things at all? They have ridiculously non-convex loss landscapes and we optimize in the dumbest conceivable way, first-order stochastic gradient descent. This should not work. But it does, all too often.

Not a great example because there are easy hand wavy arguments as to why it should work, but as far as proofs go...

The hand wavy argument goes as follows:

- We're in like a 10000 dimensional space. For the stationary point we're at to be a true local minimum, each one of those 10000 dimensions has to go uphill in either direction. It's overwhelmingly likely that there's at least one way out.

- There are many, many different ways to set the params of the net for each function. Permutation is a simple example.
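Both halves of that argument can be poked at numerically. The sketch below is only illustrative: part (1) makes the crude assumption that each direction is independently uphill with probability 1/2 (real curvature directions are not independent), and part (2) uses a made-up 4-unit tanh network to show that permuting hidden units leaves the function unchanged.

```python
import math
import random

# (1) Under the crude independence assumption, a stationary point in d
# dimensions is a true minimum with probability 2**-d.
d = 10_000
log10_prob = -d * math.log10(2)
print(f"P(all {d} directions go uphill) ~ 10^{log10_prob:.0f}")

# (2) Permutation symmetry: reordering hidden units does not change the output.
def mlp(x, w1, b1, w2):
    """One-hidden-layer tanh network with scalar input and scalar output."""
    hidden = [math.tanh(w * x + b) for w, b in zip(w1, b1)]
    return sum(v * h for v, h in zip(w2, hidden))

rng = random.Random(0)
w1, b1, w2 = ([rng.uniform(-1, 1) for _ in range(4)] for _ in range(3))
perm = [2, 0, 3, 1]
w1p, b1p, w2p = ([vec[i] for i in perm] for vec in (w1, b1, w2))

# Same function, different point in parameter space.
assert math.isclose(mlp(0.7, w1, b1, w2), mlp(0.7, w1p, b1p, w2p), abs_tol=1e-12)
```

With h hidden units there are h! such reorderings per layer, so every good solution shows up in parameter space many times over.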

We really have no idea how these things work.

Anyone who tells you otherwise is lying to you...

> Fully observable and we still have no idea what the hell it's doing.

Of course we do. It's matching a smooth multi-dimensional curve to sample data.

It's a conceptual black box. There's no way for us to understand what each individual neuron is doing.

The tools we have developed so far are limited, but that's different from "there's no way". Many academics are working hard right now to better understand deep neural networks.

Is there meaningful information in what one observes?

Yes, it turns out you can find meaningful information. etiam provided this https://arxiv.org/pdf/1312.6034.pdf The main issue is making sure what you are looking for is actually what the network is doing. You have to correctly interpret and visualize a jumble of numbers, which usually requires a hypothesis about how it worked in the first place. But assuming both go well you can gain meaningful information.

Can I train an NN to visualize the numbers?

> things like Chinese starts that are specific to certain cultures

While it's true that there are national styles of play, the Chinese opening is not called that because it's really popular among Chinese people. It's called that because a particular Chinese pro helped popularize it, even though it was invented by a Japanese amateur.

See https://en.wikipedia.org/wiki/Chinese_opening for some more info. FWIW, I (a caucasian American) use this opening all the time. It's just a generally good opening if you like a certain style of play.

> talk about things like Chinese starts that are specific to certain cultures

Came here to make this point.

It's Chinese Opening, not Chinese start – similarly recall that you have the French Defense / Italian Defense / Scandinavian Defense among chess opening variations and none of these implies that that opening variation is specific to that culture or nation.

> I wonder if this system produced more new styles of play.

One thing AlphaGo has told us clearly is that it thinks human players overvalue the margin of victory vs the probability of victory.

I'm not 100% sure I agree. It values probability of victory because that's its goal. For humans, aiming only for probability of victory might not be as good, because we're much worse at estimating probabilities. So aiming for maintaining a large margin at all times is conceivably the best proxy that we can use in practice.

Agreed. I know I'm winning by 4 points but I have no idea about my probability of winning. However if I'm winning I know that I should play low risk moves and refrain from starting complicated fights. That increases the probability of winning. IMHO the exact value is out of reach for human beings.

It would be interesting to play human go, assisted by a go computer that doesn't say anything about moves, but rather just spits out, for each player, their current likelihood of victory if all further moves by both players were "what it would do."

That way, each player could know, at all times, (one major factor that goes into) their probability of winning. They'd still have to mentally adjust it for the likelihood of them and their opponent making an error, and how that can be controlled by making intimidating moves, etc. But it could lead to much tighter control on the abstract flow of the game.

It'd almost be like the computer was the general, issuing strategy, picking battles; and the human player the tactician, fighting those battles.

There is a computer go program (maybe Crazy Stone?) that analyzes a game record and annotates it with the winning percentage for every move.

Knowing that the opponent's winning probability changed from 52 to 57 was interesting only because it hints at a mistake. In case of such a large change the program suggests the move it would have played.

I saw an annotated game record and there were no variations: I remember a suggested move that made me wonder "why!?".

Another benefit of seeing the value of the winning probability is an assessment of who's ahead. However that's already possible with the score estimation that programs and go servers provide. Sometimes it's crude, sometimes it's good, but it's the score, not the winning probability, that humans can estimate when playing. The best probability estimate I can make is: if the score is close and the game is still complicated, it's 50-50; if the score is close but the game is almost over, it's 95-5 for who's ahead. If the score is not close, the player with more points will probably win.

Ha, like watching someone play a game and moaning or cheering at their plays.

The Go community learned to understand that the margin of victory is meaningless a long time ago. The most famous game of Honinbo Dosaku, a famous Go player from the late 1600s, is arguably a game where he gave a handicap to his opponent and lost by one point. Lee Chang-Ho, who was the reigning champion in the late 90s, had a style that consistently tried to win by small margins.

AlphaGo now appears to be better than humans in all aspects of gameplay, and it is better at calculating very thin margins of probability that a human cannot. This is not unique to any individual aspect of its gameplay; against humans it can also win by huge margins depending on what mistakes the human makes.

I think if AlphaGo foresaw it was losing by one point, it would start playing reckless moves, as it did against Lee Sedol in the only game it lost against him.

Risky, not reckless.

It wasn't risky. It was reckless. It played moves that made no sense and were obviously bad moves or pointless moves.

I'm talking specifically about game #4 of the Lee Sedol games.

Part of that could be trying to compensate for counting skills that aren't quite at machine levels.

Possibly a dumb q, but is ‘self play’ in any way related to ‘adversarial’ learning? I don’t see it mentioned in the article, but it reminds me of the principle.

In some ways it is, but the main difference is that adversarial learning (usually) produces a second neural network whose purpose is to exploit weaknesses in the first. Reinforcement learning, by contrast, does not produce a second neural network to beat the first; it uses what it learned solely to improve the original.

As a side note, the main application I have seen with adversarial learning research is with photo recognition, but I guess you could have an adversarial network exist to help improve an object recognition network. At that point it would probably become something between adversarial and reinforcement learning. However, with game based reinforcement learning, it doesn't require a second specific network as the adversary, it can easily just be paired against itself.

It isn't a dumb question, they are very similar in some ways. They mainly differ in what exactly the goal of the opponent is. In this case, it is to help improve itself, however in typical adversarial situations it is solely to exploit (become its adversary).

I'm reminded of Eliezer Yudkowsky's article "There's No Fire Alarm for Artificial General Intelligence." Is this smoke?


Yes, this is not an AGI. But the hockey-stick takeoff from defeats some players, to defeats an undefeated world-champion, to defeats the version of itself that beat the world champion 100% of the time is nuts. If this happens in other domains, like finance, health, paper clip collection, the word singularity is really well chosen--we can't see past this.

While this is promising, there's a long way to go between this and the other things you mentioned. Go is very well-defined, has an unequivocal objective scoring system that can be run very quickly, and can be simulated in such a way that the system can go through many, many iterations very quickly.

There's no way to train an AI like this for, say, health: We cannot simulate the human body to the level of detail that's required, and we definitely aren't going to be able to do it at the speed required for a system like this for a very long time. Producing a definitive, objective score for a paper clip collection is very difficult if not impossible.

AlphaGo/DeepMind represents a very strong approach to a certain set of well-defined problems, but most of the problems required for a general AI aren't well-defined.

> most of the problems required for a general AI aren't well-defined.

Do you care to give an example? Are they more or less well defined than find-the-cat-in-the-picture problem?

> Producing a definitive, objective score for a paper clip collection is very difficult if not impossible.

Erm, producing an objective comparison of the relative values of Go board positions is still not possible.

> Do you care to give an example? Are they more or less well defined than find-the-cat-in-the-picture problem?

You mean like go over and feed the neighbor's cat while they're on vacation?

How about instead, being able to clean any arbitrary building?

Go isn't remotely similar to the real world. It's a board game. A challenging one, sure, and AlphaGo is quite a feat, but it's not exactly translatable to open ended tasks with variable environments and ill-specified rules (maybe the neighbor expects you to know to water the plants and feed the goldfish as well).

At this point, there is no evidence that the limiting factor in these cases is AI/software.

The limiting factor with the neighbor's cat is the robotics of having a robust body and arm attachment. We know that the scope of current AI can:

1) Identify a request to feed a cat

2) Identify the cat, cat food and cat's bowl from camera data

3) Navigate an open space like a house

Being able to clean an arbitrary building is also more the challenge of building the robot than the AI identifying garbage on a floor or how to sweep something.

It is not clear there are hard theoretical limits on an AI any more. There are economic limits based on the cost of a programmer's attention. There are lots of hardware limits (including processor power).

In my opinion the deepest and most difficult aspect of this example is the notion of 'clean' which will be different across contexts. Abstractions of this kind are not even close to understood in the human semantic system, and in fact are still minimally researched. (I expect much of the progress on this to come from robotics, in fact.)

I remember seeing a demonstration by a deep learning guy of a commercially available robot cleaning a house under remote control. You are seriously underestimating the difficulty of developing software to solve these problems in an integrated way.

This. It is a lot like the business guy thinking it is trivial to program a 'SaaS business' because he has a high level idea in his mind. Like all things programming the devil is in the detail.

The hardware is certainly good enough to assist a person with a disability living in a ranch house with typical household tasks. As demonstrated by human in the loop operation.


We have rockets that can go to orbit, and we have submersibles that can visit the ocean floor. That does not mean the rocket-submarine problem is solved; doing both together is not the same problem as doing both separately.

It also doesn't mean that a rocket-submarine is the way to go.

The difference is a go AI can play billions of games and a simple 20 line C program can check, for each game, who won.
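For the curious, a checker along those lines really is short. Here is a rough area-scoring sketch (in Python rather than C, to keep it compact). It assumes the game was played out so that every stone left on the board counts as alive, which is the convention under Tromp-Taylor style rules; deciding which stones are dead in a game that stopped earlier is the genuinely hard part and is not handled. The board encoding and function name are made up for the example.

```python
def area_score(board, komi=7.5):
    """Return (black, white) area scores for a finished game.

    `board` is a list of equal-length strings using 'B', 'W' and '.'.
    Every stone on the board is treated as alive.
    """
    rows, cols = len(board), len(board[0])
    score = {"B": 0.0, "W": komi}
    seen = set()
    for r in range(rows):
        for c in range(cols):
            if board[r][c] != ".":
                score[board[r][c]] += 1
                continue
            if (r, c) in seen:
                continue
            # Flood-fill this empty region and record which colors border it.
            region, borders, stack = 0, set(), [(r, c)]
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                region += 1
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < rows and 0 <= nx < cols:
                        if board[ny][nx] == ".":
                            if (ny, nx) not in seen:
                                seen.add((ny, nx))
                                stack.append((ny, nx))
                        else:
                            borders.add(board[ny][nx])
            # Empty points count only when a single color surrounds them.
            if len(borders) == 1:
                score[borders.pop()] += region
    return score["B"], score["W"]

board = [
    ".B.W.",
    ".BWW.",
    "BBW..",
    ".BW..",
    ".BW..",
]
black, white = area_score(board)
print("Black wins" if black > white else "White wins")
```

The point stands either way: the reward signal is cheap, exact and needs no human labelling, which is what lets self-play scale.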

For "cat in the picture", every picture must have the cat first identified by a person, so the training set is much smaller, and Google can't throw GPUs at the problem.

> Google can't throw GPUs at the problem.

The field progresses swiftly. https://arxiv.org/abs/1602.00955

The absolute value of any Go board position is well-defined, and MCTS provides good computationally tractable approximations that get better as the rest of the system improves but already start better than random.

Check the Nature paper (and I think this is one of the biggest take-aways from AlphaGo Zero):

"Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte Carlo rollouts."

In this new version, MCTS is not even used to evaluate a position! Speaking as a Go player, the ability for the neural network to accurately evaluate a position without "reading" ahead is phenomenal (again, read the Nature paper last page for details).

the absolute value of a go board position is well defined? where?

As a human go player I can say that evaluating board position is close to impossible.

You may have a seemingly good position and in two turns it seems that you have lost the game already.

> We cannot simulate the human body to the level of detail that's required

A-ha! So we use AGI for this! :-)

You don't even need to produce an AGI for this kind of intelligence to be frightening.

At some point, a military is going to develop autonomous weapons that are vastly superior to human beings on the battlefield, with no risk of losing human lives, and there is going to be a blitzkrieg sort of situation as the relative power of nations shifts dramatically.

If we have two such countries we could have massive drone and cyberwars being fought faster than people can even comprehend what's happening.

Right now most countries insist on maintaining human control over the machinery of death. But that will only last for as long as autonomous death machines don't dominate the battlefield.

It's a fun challenge right now to build a machine that can win in Starcraft, but it's really a hop skip and a jump from there to winning actual wars.

Nuclear ICBMs already push us past that boundary. The world can no longer afford to fight a war seriously.

In that case you just nuke the shit out of everybody or create an army of autonomous suicide bombers with nukes, biological and chemical weapons of all kinds. Once all humans are extinct the harmony on earth will be restored and everyone will live happily ever after.

I'm not sure robot soldiers are scarier than nukes. Generally speaking, if they are just single-task robots performing functions in dangerous situations, that seems like an improvement over risking human lives.

The core technique of AlphaGo is using tree search as a "policy improvement operator". Tree search doesn't work on most real-world tasks: the "game state" is too complex, there are too many choices, it's hard to predict the full effect of any choice you might make, and there often isn't even a "win" or "lose" state which would let you stop your self-play.

This version explicitly does not use tree search.

MCTS means "Monte-Carlo Tree Search". It's the core of the algorithm. The big difference is that it doesn't use rollouts, or random play: it chooses where to expand the tree based only on the neural network.
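For anyone wondering what "expand the tree based only on the neural network" might look like, here is a minimal Python sketch of a PUCT-style selection step. It assumes the network has already supplied a prior for each move; the names `Node`, `select_child` and `c_puct`, and the `+ 1` under the square root, are illustrative choices of mine and not taken from DeepMind's code.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    prior: float                  # move probability from the policy head (assumed given)
    visits: int = 0               # how often search has passed through this node
    value_sum: float = 0.0        # sum of value estimates backed up through this node
    children: dict = field(default_factory=dict)

    def q(self) -> float:
        """Mean backed-up value; 0 for an unvisited node."""
        return self.value_sum / self.visits if self.visits else 0.0


def select_child(node: Node, c_puct: float = 1.5):
    """Pick the (move, child) pair with the highest Q + U score.

    U is large for moves the network likes (high prior) that have
    been visited rarely, so no random rollout is needed to decide
    where the tree grows next.
    """
    total = sum(child.visits for child in node.children.values())

    def score(child: Node) -> float:
        # The +1 is a hypothetical convenience so priors matter before any visit.
        u = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits)
        return child.q() + u

    return max(node.children.items(), key=lambda kv: score(kv[1]))
```

With no visits the highest-prior move is chosen; once a move has been visited often and its backed-up value looks poor, the exploration term steers search toward the alternatives.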

No, 'habitue is correct. This new blog post says that the new software no longer does game readouts and just uses the neural net.

That's not what Monte Carlo Tree search is. The new version is still one neural network + MCTS. There's no way to store enough information to judge the efficiency of every possible move in a neural network, therefore a second algorithm to simulate outcomes is necessary.

Read the white paper. MCTS is still involved, right the way through.

The new version does use MCTS, you should read the paper again. :)

If you read the paper, they do in fact still use Monte Carlo tree search. They just simplify its usage, in conjunction with reducing the number of neural networks to one.

It does, during training.

Tree search is also used during play. In the paper, they pit the pure neural net against other versions of the algorithm -- it ends up slightly worse than the version that played Fan Hui, at about 3000 Elo.

Oh, so it's just not using rollouts to estimate the board position? Thanks for the clarification.

It doesn't use rollouts at all:

> AlphaGo Zero does not use “rollouts” - fast, random games used by other Go programs to predict which player will win from the current board position. Instead, it relies on its high quality neural networks to evaluate positions.

Thanks for that link, well worth the read.

This is an interesting question to ask in these "how far away is AGI" discussions:

I was once at a conference where there was a panel full of famous AI luminaries, and most of the luminaries were nodding and agreeing with each other that of course AGI was very far off, except for two famous AI luminaries who stayed quiet and let others take the microphone.

I got up in Q&A and said, “Okay, you’ve all told us that progress won’t be all that fast. But let’s be more concrete and specific. I’d like to know what’s the least impressive accomplishment that you are very confident cannot be done in the next two years.”

There was a silence.

Eventually, two people on the panel ventured replies, spoken in a rather more tentative tone than they’d been using to pronounce that AGI was decades out. They named “A robot puts away the dishes from a dishwasher without breaking them”, and Winograd schemas. Specifically, “I feel quite confident that the Winograd schemas—where we recently had a result that was in the 50, 60% range—in the next two years, we will not get 80, 90% on that regardless of the techniques people use.”

IBM Watson on Winograd schemas? It beat jeopardy... ?

I spent an hour of my life that I'll never get back reading Yudkowsky's overly-long article and I believe I can summarise it thusly:

"We don't know how AGI will arise; we don't know when; we don't know why; we don't know anything at all about it and we won't know anything about it until it's too late to do anything anyway; We must act now!!"

The question is- if we don't know anything about this unknowable threat, how can we protect ourselves against it? In fact, since we're starting from 0 information, anything we do has equal chances of backfiring and bringing forth AGI as it has of actually preventing it. Yudkowsky is calling for random action, without direction and without reason.

Besides, if Yudkowsky is none the wiser about AGI than anyone else, then how is he so sure that AGI _will_ happen, as he insists it will?

Yudkowsky is fumbling around in the dark like everyone else in AI. Except he (and a few others) has decided that it's a good strategy, under the circumstances, to raise a hell of a racket. "It's dark!" he yells. "Beware of the darkness!". Yeah OK, friend. It's dark- we can all tell. Why don't you pipe down and let us find the damn light?

So, in your view, starting MIRI, doing fundamental research into AI safety and advocating for it, is not trying to find the damn light?

You exemplify exactly the attitude he's trying to combat. "Oh, nobody knows anything, let's not care about consequences and do whatever."

Sorry but I don't really see Yudkowsky's contributions as "fundamental research into AI safety". More like navel-gazing without any practical implications. At best, listening to him is just a waste of time. At worst, AGI is a real imminent threat and having people like him generating useless noise like he does will make it harder for legitimate concerns to be heard, when the time comes.


Yes, I did and it's very bad form to go around asking people if they read the article. Try to remember that different people form different opinions from similar information.

Well, then you should have noticed what the article was about, which was not to detail a research program about AI safety. Different articles can address different aspects of a problem without being accused of advocating "random action". That's just ridiculous.

>The question is- if we don't know anything about this unknowable threat, how can we protect ourselves against it? In fact, since we're starting from 0 information, anything we do has equal chances of backfiring and bringing forth AGI as it has of actually preventing it. Yudkowsky is calling for random action, without direction and without reason.

Are you sure you read the essay? That's literally the question he answers.

At any rate, we do have more than '0 information', and if you make an honest effort to think of what to do you can likely come up with better than 'random actions' for helping (as many have).

>> Are you sure you read the essay? That's literally the question he answers.

My reading of the article is that he keeps calling for action without specifying what that action should be and trying to justify it by saying he can't know what AGI would look like (so he can't really say what we can do to prevent it).

>> if you make an honest effort to think of what to do you can likely come up with better than 'random actions' for helping (as many have).

Sure. If my research gets up one day and starts self-improving at exponential rates I'll make sure to reach for th

... yeah, before reading that link my position was "Wow, that's super neat, but Go is a pretty well-defined game," and after reading it I remembered that my position maybe a year or two ago was "Chess is a well-defined game that's beatable by AI techniques but Go is acknowledged to be much harder and require actual intelligence to play and won't be solved for a long while" and now I'm worried. Thanks for posting that.

Go is still a well defined game within a limited space that doesn't change, and rules that don't change. It's just harder than Chess, but that doesn't make it similar to tons of real world tasks humans are better at.

That's probably true, but that's very much not what people were saying about Go a couple years ago. There were a lot of people talking about how there isn't a straightforward evaluation function of the quality of a given state of the board, how things need to be planned in advance, how there's much more combinatorial explosion than in chess, etc., to the point where it's a qualitatively different game.

For me, as someone who accepted and believed these claims about Go being qualitatively different, realizing that no, it's not qualitatively different (or that maybe it is, but not in a way that impedes state-of-the-art AI research) is increasing my skepticism in other claims that board games in general are qualitatively different from other tasks that AIs might get good at.

(If you didn't buy into these claims, then I commend you on your reasoning skills, carry on.)

About those claims- this is from Russell and Norvig, 3rd ed. (from 2003, so a way back):

Go is a deterministic game, but the large branching factor makes it challenging. The key issues and early literature in computer Go are summarized by Bouzy and Cazenave (2001) and Müller (2002). Up to 1997 there were no competent Go programs. Now the best programs play most of their moves at the master level; the only problem is that over the course of a game they usually make at least one serious blunder that allows a strong opponent to win. Whereas alpha-beta search reigns in most games, many recent Go programs have adopted Monte Carlo methods based on the UCT (upper confidence bounds on trees) scheme (Kocsis and Szepesvari, 2006). The strongest Go program as of 2009 is Gelly and Silver's MoGo (Wang and Gelly, 2007; Gelly and Silver, 2008). In August 2008, MoGo scored a surprising win against top professional Myungwan Kim, albeit with MoGo receiving a handicap of nine stones (about the equivalent of a queen handicap in chess). Kim estimated MoGo's strength at 2-3 dan, the low end of advanced amateur. For this match, MoGo was run on an 800-processor 15 teraflop supercomputer (1000 times Deep Blue). A few weeks later, MoGo, with only a five-stone handicap, won against a 6-dan professional. In the 9 x 9 form of Go, MoGo is at approximately the 1-dan professional level. Rapid advances are likely as experimentation continues with new forms of Monte Carlo search. The Computer Go Newsletter, published by the Computer Go Association, describes current developments.

There's no word about how Go is qualitatively different to other games, but maybe the referenced sources say something along those lines. Personally, I took a Masters course in AI two years ago, before AlphaGo and I remember one professor saying that the last holdout where humans can still beat computers in board games was Go, but I don't quite remember him saying anything about qualitative difference. Still, I can recall hearing about the idea that Go needs intuition or something like that, except I've no idea where I've heard that. I guess it might come from the popular press.

I guess this will sound a bit like the perennial excuse that "if it works, it's not AI" but my opinion about Go is that humans just weren't that good at it, after all. We may have thought that we have something special that makes us particularly good at Go, better than machines- but AlphaGo[Zero] has shown that, in the end, we just have no idea what it means to be really good at it (which, btw, is a damn good explanation of why it took us so long to make AI to beat us at it).

That, to my mind, is a much bigger and much more useful achievement than making a good AI game player. We can learn something from an insight into what we are capable of.

s/2003/2009/, I think, but the point stands. (Also I think I have the second edition at home and now I want to check what it says about Go.)

> my opinion about Go is that humans just weren't that good at it, after all. We may have thought that we have something special that makes us particularly good at Go, better than machines- but AlphaGo[Zero] has shown that, in the end, we just have no idea what it means to be really good at it (which, btw, is a damn good explanation of why it took us so long to make AI to beat us at it).

I really like that interpretation!

> the last holdout where humans can still beat computers in board games was GO

False, because nobody ever bothered to study modern boardgames rigorously.

Modern boardgames have small decision trees but very difficult evaluation functions. (Exactly opposite from computational games like Go.)

Modern boardgames can probably be solved by pure brute force calculation of all branches of the tree, but nobody knows if things like neural networks are any good for playing them.

In AI, "board games" generally means classical board games (nim, chess, backgammon, go etc) and "card games" means classical card games (bridge, poker, etc). Russell & Norvig also discuss some less well-known games, like kriegspiel (wargame) if memory serves, but those are all classical at least in the sense that they are, well, quite old.

I've seen some AI research in more modern board games actually. I've read a couple of papers discussing the use of Monte Carlo Tree Search to solve creature combat in Magic: the Gathering and my own degree and Master's dissertation were about M:tG (my Master's was in AI and my degree dissertation was an AI system also).

I don't know that much about modern board games, besides collectible card games, but for CCGs in particular, the game trees are not small. I once calculated the time complexity of traversing a full M:tG game tree as O(b^m * n^m) = 2.272461391808129337799800881135e+5564 (where b the branching factor, m the average number of moves in a game and n the number of possible deck permutations for a 60 card deck taking into account cards included multiple times). And mine was probably a very conservative estimate.

Also, to my knowledge, Neural nets have not been used for magic-playing AI (or any other CCG playing AI). What has been used is MCTS, on its own, without terrible success. The best AI I've seen incorporates some domain knowledge, in the form of card-specific strategies (how to play a given card).

There are some difficulties in using ANNs to make an M:tG AI. Primarily, the fact that a truly competent player should be able to pick up a card it's never seen before and play it correctly (or decide whether to include it in a deck, if the goal is to also address deck-building). For this, the AI player will need to have at least some understanding of M:tG's language (ability text). It is my understanding that other modern games have equal requirements to understand some game context outside of the main rules, which complicates the traditional tactic of generating all possible moves, pruning some and choosing the best.

In any case what I meant to say is that people in AI have indeed considered other games besides the classical ones- but when we talk about "games" in AI we do mean the classics.

> but when we talk about "games" in AI we do mean the classics

Only because of inertia. There's nothing inherently special about "classics". Eventually somebody will branch out once Go and poker are mined out of paper and article opportunity.

Once we do then maybe some new, interesting algorithms will be found.

In principle, every game can be solved by storing all possible game states in a database. Where brute-force storing is impractical due to size concerns, compression tricks have to be used.

E.g., Go is a simple game because at the end, every one of the fixed number of board spaces is either +1, -1 or 0. Add them up and you know if you won. This means that every move is either "correct" or "incorrect"; the problem of classifying multidimensional objects into two classes is a problem that we're pretty good at now, and things like neural networks get the job done.
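To make the "add them up" step concrete, here is a minimal Python sketch. It assumes the hard part is already done, i.e. somebody has decided who owns each point at the end of the game. The `ownership` grid, the `winner` function and the `komi` default are illustrative assumptions, not part of any real Go library or rule set implementation.

```python
def winner(ownership, komi=7.5):
    """Decide the result from a settled final position.

    ownership: rows of +1 (Black's point), -1 (White's point)
    or 0 (neutral). komi is the compensation given to White;
    the 7.5 default is only a placeholder value.
    """
    margin = sum(sum(row) for row in ownership) - komi
    if margin > 0:
        return "black"
    if margin < 0:
        return "white"
    return "draw"  # only reachable with an integer komi


# Tiny 3x3 illustration: Black owns 5 points, White 3, one is neutral.
final_position = [
    [1, 1, 1],
    [1, 1, 0],
    [-1, -1, -1],
]
result = winner(final_position, komi=0.5)
```

The counting really is trivial; deciding which stones are alive, and therefore what the grid should contain, is where the difficulty hides.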

A slightly more complex game like Agricola has no "correct" and "incorrect" moves because it's not zero-sum; you can make an "incorrect" move and still win as long as your opponent is forced to make a relatively more "incorrect" move.

Not sure how much of a difference that makes, but what's certain is that by (effectively) solving Go we've only scratched the surface. It's not the end of research, only the beginning.

Sure. Research in game playing AI doesn't end with Go, or any other game. We may see more research in modern board games, now that we're slowly running out of the classics.

I think you're underestimating the amount of work and determination it took to get to where we are today, though (I mean your comment about "inertia"). Classic board games have the advantage of a long history and of being well understood (the uncertainty about optimal strategies in Go notwithstanding). Additionally, for at least some of them like chess, there are rich databases of entire games that can be used outright, without the AI player having to generate-and-test them in the process of training or playing.

The same is not true for modern games. On the one hand, modern board games like Agricola (or, dunno, Settlers or Carcassonne etc) don't have such an extensive and multi-national following as the classics so it's much harder to find a lot of data to train on (which is obviously important for machine-learning AI players). I had that problem when considering an M:tG AI trained with machine learning: I would have liked to find play-by-play data on professional games but there just isn't any (or where there is it's not enough, or it's not in any standardised format).

Finally, classic board games have cultural significance that modern board games don't quite match, despite the huge popularity of CCGs like M:tG or Pokemon, or Eurogame hits like Settlers. Go, chess and backgammon in particular have tremendous historical significance in their respective areas of the world- chess in Eastern Europe, backgammon in the Middle East, Go in East Asia. People go to special academies to learn them, master players are widely recognised etc. You don't get that level of interest with modern board games- so there's less research interest for them, also.

People in game playing AI have been trying for a very long time to crack some games like Go and, recently, poker (not quite cracked yet). They didn't sit around twiddling their thumbs all those years, neither did they choose classical board games over modern ones just because they didn't have the imagination to think of the latter. In AI research, as in all research, you have to make progress before you can make more progress.

> Go is acknowledged to be much harder and require actual intelligence to play

No, Go is a much less intelligent[1] game. It has a huge decision tree and requires massive amounts of computation to play, but walking trees and counting is exactly what computers do well and what humans do poorly.

[1] 'Intelligence' here means exactly that which differentiates humans from calculators: the ability to infer new rules from old ones.

Nobody was saying that before AlphaGo beat Lee Sedol. So this feels like moving the goalposts.

The smoke is when things like the same simulated robot that learned to run around like a mentally challenged person also learns to simulate throwing and can read very basic language.

It will seem quite stupid and inept at first. So people will dismiss it. But when they have a system with general inputs and outputs that can acquire multiple different skills, that will be an AGI, and we can grow its skills and knowledge past human level.

> But the hockey-stick takeoff

The hockey stick is lying horizontally though instead of vertically. If it took 3 days to go from 0 to beating the top player in the world, I wouldn't have expected it to take 21 days to beat the next version. I guess something happens at the top levels of Go that makes training much harder.

On another note, I didn't look at the details closely but it seems AlphaGo Zero needed much less compute training time than AlphaGo Master. Could getting rid of any human inputs really make it that much more efficient? That implies it will be able to have an impact in many different areas, which is a bit scary...

(Updated - it took 3 days to beat the top player in the world.)

This type of curve is what I would expect out of machine learning. At first there is rapid improvement as it learns the easy lessons. The rate then slows down as further incremental improvements have less impact.

What is, perhaps, surprising is that human play happens to be relatively close to the asymptote. Although this could be explained by AlphaGo being the first system to beat humans. If its peak performance were orders of magnitude higher than humans, a weaker program would have already beaten us.

The horizontal hockey stick makes sense to me in terms of learning. Each increased layer of understanding a complex system could mean a potentially exponentially increasing difficulty.

I'm sure it's naive to jump to sci-fi conclusions just yet, but I admit it's equal parts fascinating and terrifying. The general message of the posts is that human knowledge is cute but not required to find new insights. Define the measure of success and momma AI will find the answer. At this point, the path to AGI is about who first defines its goals right and that seems... doable? Even scarier: We think the holy grail of AI is simulating a human being. The AI of the future might chuckle at that notion.

Wait for Alpha StarCraft for some real panic. So far RL-based methods have had limited success outside of simple games (not to say Go is simple, but rather the presentation and control parts of the format).

I'd like to see a StarCraft player AI that wins using a mere 1/10th of the effective actions per minute (EPM) of world class players. To me it seems beating another player while using fewer actions indicates superior skill, understanding and/or intellect.

Not sure I agree with this fully. Certainly many actions used in a typical SC game are redundant, but there are reasons for it. Lag for one. If there's a possibility of lag or dropped packets, spamming a command will help nullify this problem.

The other is the entire reason for high APM, the stop/start problem. Pro players keep high APM so that when they actually need high EPM their muscle memory is already at full tilt. If you slow down your APM during lulls in the action it becomes harder to suddenly increase it when a fight happens.

Certainly that's an entirely human condition that a machine wouldn't need to worry about. But I'm not sure it means lack of skill.

You expressed my exact thoughts and I was about to link to the same insightful article. I guess my comment could've been shortened as a silent upvote, but I commented anyway.

Games are a joke compared with real life. The number of variables and rules is well defined in games, while in real life it is not. That is why AGI is not coming anytime soon.

> Previous versions of AlphaGo initially trained on thousands of human amateur and professional games to learn how to play Go. AlphaGo Zero skips this step and learns to play simply by playing games against itself, starting from completely random play.

So technically this version has lost every game it's ever won.

Jokes aside, it's pretty interesting to note that they were able to combine the "policy" and "value" networks. Good SO answers on the difference (https://datascience.stackexchange.com/questions/10932/differ...)

> accumulating thousands of years of human knowledge during a period of just a few days

It'd be interesting for what this would mean when things like a neural lace become a reality.

As an aside, anyone have any other links or references to others investigating learning algorithms with a 'tabula rasa' approach?

TD-Gammon is a well-known version of this technique (with 2-ply lookahead, vs a 1600 deep MCTS) https://en.m.wikipedia.org/wiki/TD-Gammon

Temporal difference learning was previously considered weak at 'tactical' games, i.e. ones with game states that require long chains of precise moves to improve position (like many checkmate scenarios in chess).

For anyone more familiar with this technique, is it clear how the MCTS/checkpoint system overcomes this? How sensitive is the system to the tuning params for those parts of the alg? Is Go a particularly good candidate because the ~400 play positions result in a (relatively) small tree search requirement? (I kinda can't believe I'm saying that Go has 'a small search tree'!)

We use TD learning for the AI in our game Race for the Galaxy, so it's neat to hear about possible avenues for improvement!

After digging a bit deeper into the paper, it seems a key part of the new scheme is that the NN is trained to help guide a deep/sparse tree search (as opposed to TD-Gammon's fully exhaustive 2-ply search). It's somewhat surprising to me that the simple win/loss is a strong enough signal to train this very 'intermediate step' in the algorithm - a spectacular result! It begs the question what other heuristic based algorithms would be improved by replacing a hand rolled non-optimal heuristic function with a NN?

It's estimating the probability of winning from the position based on what it has already seen. So basically it's a giant conditional probability distribution. Is it mistaken to interpret this as a bayesian network?

Wow, that was a really deep and enjoyable Wikipedia rabbit hole journey. I hadn't heard of Temporal Difference before (though I was familiar with Q-learning).

It was interesting to note that TD-Gammon improved with expert designed features. I wonder if this was simply related to the technology of the field as it stood over 20 years ago or some underlying categorization or complexity associated with the games themselves (backgammon being more favorable to human comprehension than Go in this case).

> Even though TD-Gammon discovered insightful features on its own, Tesauro wondered if its play could be improved by using hand-designed features like Neurogammon's. Indeed, the self-training TD-Gammon with expert-designed features soon surpassed all previous computer backgammon programs. It stopped improving after about 1,500,000 games (self-play) using 80 hidden units.

For others: Richard Sutton, one of the pioneers of TD makes his Reinforcement Learning: An Introduction textbook available for free on his website: http://incompleteideas.net/sutton/ (MIT Press also links to it)

PSA: The new edition of Sutton/Barto has a nice discussion of (the original) AlphaGo in the back.


Yeah it's not clear to me why temporal difference learning all of a sudden works so well here? Is it the case that nobody had really tried it for learning a policy for Go with a strong NN architecture? In the Methods they mention TD learning for value functions but I don't see anything about policies.

edit: OK, they're calling it policy iteration as opposed to TD learning. I guess I don't get the difference.

TD learning is, in some sense, a component of policy iteration. TD learning is about learning the value function for a given policy. In policy iteration you use a value function to decide how to update the policy for which the value function was estimated, and you iterate between the "learn value" and "update policy" steps.
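A toy Python sketch of that loop, in case it helps. Everything here is made up for illustration: a five-state corridor with a reward at the right end, deterministic moves, and sweeps over states instead of sampled episodes. TD(0) handles the "learn value" step and a greedy one-step lookahead handles the "update policy" step; it has nothing to do with how AlphaGo Zero itself is implemented.

```python
GAMMA, ALPHA = 0.9, 0.5
N_STATES, GOAL = 5, 4            # states 0..4; entering state 4 ends the episode
ACTIONS = (-1, +1)               # step left or step right


def step(s, a):
    """Deterministic toy dynamics: reward 1 only for reaching GOAL."""
    s2 = min(max(s + a, 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0)


def td0_evaluate(policy, sweeps=50):
    """'Learn value': TD(0) estimate of V for the fixed policy."""
    V = [0.0] * N_STATES         # V[GOAL] stays 0 because it is terminal
    for _ in range(sweeps):
        for s in range(GOAL):    # non-terminal states only
            s2, r = step(s, policy[s])
            V[s] += ALPHA * (r + GAMMA * V[s2] - V[s])
    return V


def improve(V):
    """'Update policy': act greedily with respect to V."""
    def backup(s, a):
        s2, r = step(s, a)
        return r + GAMMA * V[s2]
    return [max(ACTIONS, key=lambda a: backup(s, a)) for s in range(GOAL)]


def policy_iteration(iterations=5):
    policy = [-1] * GOAL         # start by always stepping left
    for _ in range(iterations):
        policy = improve(td0_evaluate(policy))
    return policy
```

Starting from "always go left", each round of evaluate-then-improve flips one more state to "go right" as the value of the goal propagates backwards, which is the iteration between the two steps described above.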


It's my opinion that TD Gammon was solved in the 1990s because backgammon is a 1 dimensional board. It didn't need the convolutional techniques of the Go neural nets to gain insight into the game and could thus be solved by a traditional neural net.


I believe their DotA 2 AI uses that approach

Correct - we saw a similar phenomenon of rapid capability gain via self-play in our Dota 2 work: https://blog.openai.com/more-on-dota-2/


> So technically this version has lost every game it's ever won.

No, they've also played it against AlphaGo Lee and AlphaGo Master. The SGFs are available at: https://www.nature.com/nature/journal/v550/n7676/extref/natu...

I meant that tongue-in-cheek based on the "by playing games against itself" during training. Nonetheless, thanks for clarifying that in case it's unclear for others (and for the SGFs).

I remember reading about Blondie24, a program that learned to play checkers at a high level without human input. It was based on neural network and genetic algorithm technology. From the Wikipedia entry: "The significance of the Blondie24 program is that its ability to play checkers did not rely on any human expertise of the game. Rather, it came solely from the total points earned by each player and the evolutionary process itself." [1].

In addition to numerous journal articles, the creators wrote a lay-person book on their creation: Blondie24: playing at the edge of AI, by David B. Fogel [2].

[1]. https://en.wikipedia.org/wiki/Blondie24

[2]. https://dl.acm.org/citation.cfm?id=501597

"It uses one neural network rather than two." and "AlphaGo Zero only uses the black and white stones from the Go board as its input, whereas previous versions of AlphaGo included a small number of hand-engineered features."

This is amazing! The technology they came up with must be super generic.

It sounds like that's what they are going for. Minimal tuning, generally adaptable. Very interesting.

Also, unsupervised. Also, no rollouts. They got rid of a lot of complexity. At this point it looks like a reasonable challenge to write a superhuman Go AI in 500 lines of unobfuscated python.

I was wondering about this: can we study AlphaGo Zero and other nets created in the same way for similarities, extract and study them? Or are we limited to observing the behavior and learning from that?

'rollouts' ELI5? I didn't pick this up from the paper..

thx :)

Super, super impressive work. I'd love to see how hard it is to apply the architecture to other games/problems that work well with self-play.

I will be interested to see what kind of algorithms they have used to allow AlphaGo to learn from its own moves. Are these pretty generic algos or are these very customized and specific ones that only apply to AlphaGo and the game of Go?

They have a new reinforcement learning algorithm that should be generically applicable to anything where a long sequence of moves results in a specifically gradable outcome.

> The neural network in AlphaGo Zero is trained from games of selfplay by a novel reinforcement learning algorithm. In each position s, an MCTS search is executed, guided by the neural network fθ. The MCTS search outputs probabilities π of playing each move. These search probabilities usually select much stronger moves than the raw move probabilities p of the neural network fθ(s); MCTS may therefore be viewed as a powerful policy improvement operator. Self-play with search—using the improved MCTS-based policy to select each move, then using the game winner z as a sample of the value—may be viewed as a powerful policy evaluation operator. The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure: the neural network’s parameters are updated to make the move probabilities and value (p, v)= fθ(s) more closely match the improved search probabilities and selfplay winner (π, z); these new parameters are used in the next iteration of self-play to make the search even stronger.
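The "more closely match" part of that quote boils down to a single loss. Here is a rough Python sketch, assuming the form given in the paper: squared error between the value v and the winner z, cross-entropy between the search probabilities π and the network's move probabilities p, plus L2 regularisation on the weights. The function name and the plain-list representation are made up for illustration; a real implementation would use a tensor library and batches.

```python
import math


def alphazero_style_loss(p, v, pi, z, theta=(), c=1e-4):
    """Loss for one training position (illustrative sketch).

    p     : network move probabilities (all entries assumed > 0 where pi > 0)
    v     : network value estimate for the position
    pi    : improved move probabilities produced by the search
    z     : game outcome from the current player's view (+1 or -1)
    theta : flat iterable of network weights, used only for the L2 term
    c     : regularisation strength (placeholder default)
    """
    value_loss = (z - v) ** 2
    # Cross-entropy between search probabilities and network probabilities.
    policy_loss = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
    l2_penalty = c * sum(w * w for w in theta)
    return value_loss + policy_loss + l2_penalty
```

Minimising this pulls p toward what the search preferred and v toward who actually won, which is the policy iteration step the quote describes.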

> They have a new reinforcement learning algorithm that should be generically applicable to anything where a long sequence of moves results in a specifically gradable outcome.

Statements like these always make me wonder why certain obvious things weren't tried. If it's so generic, why wasn't it tried on Chess? Or was it tried, failed to impress and thus didn't make it into the press release?

This is a big problem with all this public discussion of AI. Almost no one speaks about algorithm failures. I haven't seen a single research paper that said "oh, and we also tried the algorithm in domain X and it totally sucked".

The conventional wisdom for Chess engines is that aggressive pruning doesn't work well. Chess is much more tactical than Go; selective algorithms tend to lead to some crucial tactic being missed, and the greater the search depth, the more likely that is.

Modern Chess engines are designed to brute-force the search tree as efficiently as possible. I will go out on a limb here and say they would wipe the floor with AlphaGo, because AlphaGo's hardware would be more of a liability than an asset against a CPU.

See also: https://chessprogramming.wikispaces.com/Type+A+Strategy https://chessprogramming.wikispaces.com/Type+B+Strategy

Until I see AlphaGo Zero defeating Stockfish 100-0 and, with the same algorithm, defeating the best Go AI and killing the Atari games including Montezuma’s Revenge, I call this hype bullshit.

Give me your results on OpenAI gym in a variety of different styles of games including GTA and WoW. I will believe you if a generic unsupervised algorithm running on a single machine is absolutely destroying the best players.

Until then ...

Just like Lee Se-dol is a Go grandmaster, beats Garry Kasparov at chess and can also get a perfect score in Pac-Man, right? I mean, if you can't do all of those things then are you even a human-level intelligence?

This just illustrates that surpassing "human level" performance is a silly and arbitrary benchmark, because there is no such thing as general human level performance. But I bet Kasparov would be pretty good at Go, and Sedol would be pretty good at chess.

Universality is the real hard problem of AI. In the long run, a mediocre AI that does a lot of different things is far more useful than most targeted "superhuman" AIs. Most domains simply don't require better-than-human performance, but could still reap tremendous benefits from automation.

Agreed. It's great that we have domain-specific approaches that can beat humans in their domain (and that we're learning how to make these approaches more generic so that, with re-training, they can adapt to new domains), but the real "oh snap" moment will be when we build something that's barely-adequate but widely adaptable. Something with the adaptability of a corvid or an octopus, say. If we get to that level, it'll mean we've discovered the "universal glue" that joins specialist networks together into a conscious entity.

You forget to add "running on 20 watts of power". It's not reasonable to require it to run on a single machine, when brain performance is estimated to be more than 10 petaflops.

I don't know if you're being sarcastic or not. If not, I suggest you look at the cartoon at http://www.kurzweilai.net/robot-learns-self-awareness

Add Pacman and Pitfall to the list. Humans have played perfect games of both. My understanding is DeepMind performed poorly on those games.

Doesn't this sound very much like how a human learns to play the game? MCTS ~ play/experience (move probabilities); self-play with search ~ study/analysis (move evaluation); repetition and iteration to build intuition (NN parameters).

But I suppose they still do the searching/pruning with a separate piece of code (not a neural network).

From the paper: "it uses a single neural network, rather than separate policy and value networks ... it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any MonteCarlo rollouts. To achieve these results, we introduce a new reinforcement learning algorithm that incorporates lookahead search inside the training loop, resulting in rapid improvement and precise and stable learning."

Yes, but tree search + neural net is still pretty generic. It only assumes that you can enumerate branches.

It also presumes that one can simulate the world at low cost. In AlphaGo Zero it takes 0.4 s for 1,600 node extensions, but in this case the cost of simulating the world is negligible. Anyway, assuming you need that many node extensions to get decent quality updates, that puts a rather tight limit on the cost of simulating the world.
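As a back-of-the-envelope check on that limit, the sketch below works out the time budget per node extension from the two figures above (0.4 s and 1,600 extensions per move), and what a move would cost in a slower world. The 10 ms simulator step is a made-up number for illustration, and search overhead is ignored.

```python
def per_simulation_budget_ms(seconds_per_move, simulations_per_move):
    """Wall-clock time available for each node extension, in milliseconds."""
    return 1000.0 * seconds_per_move / simulations_per_move

def move_time_seconds(step_cost_ms, simulations_per_move):
    """Time per move if every node extension costs step_cost_ms."""
    return step_cost_ms * simulations_per_move / 1000.0

# Figures quoted above: 0.4 s per move, 1,600 node extensions per move.
budget_ms = per_simulation_budget_ms(0.4, 1600)

# Hypothetical world model that needs 10 ms per simulated step.
slow_world_s = move_time_seconds(10.0, 1600)

print(budget_ms, slow_world_s)
```

So each extension gets about a quarter of a millisecond, and a simulator that is only 40 times slower than that already turns one move into 16 seconds of search.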

DM has already done a bunch of work on 'deep models' of environments to plan over. Use them and you have 'model-predictive control' and planning, and this tree extension to policy gradients would work as well (probably). It could be pretty interesting to see what would happen if you tried that sort of hybrid on ALE.

I guess deep world models are still severely riddled by all sorts of problems: vanishing gradients, BPTT being O(T), poor generalization ability of NNs (which likely is due to the lack of attractor state associative recall, as well as concept composability), lack of probabilistic message passing to deal with uncertainty, and perhaps some priors about the world are necessary to make learning tractable (such as spatial maps and fine-tuning for time scales that contain interesting information).

What are the main papers from DM on this? Are you referring to "CONTINUOUS CONTROL WITH DRL"?

You're right, but MC rollouts work better (better estimate) for some games than others.

I'm wondering whether, once one of these algorithms has been perfected, it is going to "burn in" the domain it was built for as the target of problem reductions, similar to 8086 assembly or the QWERTY keyboard living on today despite being ancient relics.

For example, after this result it seems that if you can reduce your problem domain onto Go (or a similarly structured game) you now have a way to create a superhuman solver. It may just be easier to do that than to try to figure out how to design and tune a new network.

I could imagine waking up in 10 years being confused at why all software efforts in the AI space are focused on just figuring out clever ways to map real problems onto a hodgepodge of seemingly random "toy" domains like Go and Chess and Starcraft. Hell, maybe the Starcraft bot will immortalize Starcraft in a way the game never would have been able to if it becomes a good reduction target for a lot of domains.

It kind of reminds me of how SVMs were "abused" by twisting non-linear domains into them via kernel methods, or by proving the NP-equivalence of a problem by reducing it onto 3-SAT, or how ImageNet's weights are being re-purposed for other image oriented prediction tasks.

In many domains mapping the problem to a tree search already gives you a superhuman solver or at least a passable solver. Problem mapping is what most of modern AI research is about. That's how the field was redefined in recent years. Just like Vladimir Vapnik says[1], it's becoming more engineering than science. (And sometimes more software alchemy than engineering.)

[1] https://www.youtube.com/watch?v=5mvfpSdWsOo "Brute Force and Intelligent Paradigms of Learning"

I think the only thing about go that enables this technique is "turn-based perfect information 0-sum game".

Also it has fairly limited input. A real world problem may have many more possible inputs at any time step, as opposed to placing just 1 stone.

What is the analogy with the QWERTY keyboard?

Looks like the performance improvement comes from two key ingredients:

1) Using Residual networks instead of normal convolutional layers

2) Using a smarter policy training loss that uses the full information from a MCTS at each move. In the previous version, I believe they just ran the policy network to the end of the game and used a very weak {0, 1} reinforcement signal over all of the moves played. Here, it looks like they use each run of MCTS to provide a fully supervised signal over all moves it explores.

How is it different to apply the loss on each actual move at the end of the game VS on each rollout (which is itself a tiny game)? Does it help reinforce learning towards the end game as shorter rollouts are needed? Is the more accurate information then propagated to earlier moves as well?

I think the difference is that under 1/0 policy gradient loss, it gets feedback only on the actual chosen move. Under MCTS-rollouts-each-move, it gets feedback on every move on the board whether its value estimate was slightly too high or low plus the ultimate outcome of the 1 move it did make.

Also (3) training a dual policy & value network that can benefit from a single shared representation of the game

So this is fun:

"AlphaGo Zero is the program described in this paper. It learns from self-play reinforcement learning, starting from random initial weights, without using rollouts, with no human supervision, and using only the raw board history as input features. It uses just a single machine in the Google Cloud with 4 TPUs (AlphaGo Zero could also be distributed but we chose to use the simplest possible search algorithm)."

Single machine?


I remember reading ages ago in Scientific American about a much more interesting (and useful) AI application of this technique.

Genetic algorithms were used to evolve new, more efficient variants of existing electronic circuits. I dug it up - it was: https://www.scientificamerican.com/magazine/sa/2003/02-01/#a... Article "Evolving inventions". I have no idea if there is an open-access version anywhere.

As far as I remember, that approach led to some patents, because some of the inventions were better than existing solutions. One of the examples in the article was a low-pass filter (I don't remember if the AI version was actually better or worse than the human-made one).

The essential element of this approach was that in electronics (as in Go) there exists a well-defined set of rules that allows researchers to build a simulation engine with an optimization/evaluation function that the AI targets by itself, without supervision. It's great to see that this approach is still alive, although in my humble opinion, the application in electronics is much more interesting than Go.

Somebody needs to dig this up and apply it to an open FPGA toolchain like ICEstorm.

The other SA article on this was The Darwin Chip which I think went into more detail.

One of the limitations was the lack of documentation for the actual bitstream.

Nature is actually hosting it without a paywall:


Very impressive, the original implementation relied a lot on feature engineering.

I'm surprised they're able to prevent a self-play equilibrium with such a simple loss function.

It's sort of like they are using auxiliary outputs but instead of using them to fit features, they are fitting to multiple ways of arriving at 'best play', through predicting value (SL) and predicting probability for best outcome (RL). In principle, they're doing the same thing but in practice it seems like they are making up for each other's shortcomings (e.g. self-play equilibrium with RL).

> If similar techniques can be applied to other structured problems, such as protein folding, reducing energy consumption or searching for revolutionary new materials,

Protein folding sounds like a nice idea for their next challenge.

Things will start getting interesting when we figure out how to get move simulation and search into the network itself, rather than programming that on the outside. As far as I know, no-one has even the faintest idea of how to do that. We have an existence proof that this should be possible.

The networks are great at perception and snap-prediction. Anything a human can do in 200ms is fair game. And with clever engineering, we can make magic happen by iterating or integrating those things.

But it's after that first 200ms that humans get really intelligent. When we can come up with an architecture that lets the networks themselves start simulating possibilities, backtracking, deciding when to answer now or to think more -- when the network owns the loop -- then it will get interesting.

> We have an existence proof that this should be possible.

Not guaranteed. The human brain has diffusion signalling (i.e. neurotransmitters passing out of the synaptic cleft, into a neighbouring one, and activating a receptor on some other spatially-local axon as a result.) And one of those signalling molecules is thought to represent, in its intensity, a confidence-interval bias adjustment (i.e. a pruning bias factor for MCTS.) So the brain's MCTS-equivalent process may rely on some extra-graphical properties of the brain-as-embodied-meat-thing.

That will be a couple of additional terms in activation function. Or am I missing something?

“Neighbouring” is defined in terms of embedding in a metric space and inverse-cube diffusion, rather than anything to do with graph connectivity.

Also, these signals pile up in the synaptic cleft until they’re picked up, so it’s not just about instantaneous transmissivity as if these were radio signals.

But also also, other stuff like monoamine oxidase is floating about in its own diffusion patterns, cleaning up these signals.

It’s basically like a “scent” communication embodied-actor model, but a very complex one where things like redox reactions with the atmosphere occur.

Oh, and there are “secondary messengers”: signals that trigger other signals that, among other things, inhibit the release of the original signal when received back at the sender, such that a dynamic equilibrium state is reached between the two signal types.

I think what you are suggesting is similar to DeepMind's Sokoban bot: https://deepmind.com/blog/agents-imagine-and-plan/

What do you mean by move simulation?

I think he means that the NN somehow learns MCTS without it being coded in explicitly.

Why not use the same approach for chess?

It would be very interesting to see whether it can handle the much more advanced and tuned engines that exist for chess, a game with considerably more complicated rules.

I think chess is less compelling because, in a sense, it is a "solved problem" - superhuman AI chess players already exist.

And chess, while it does have more complex base rules, has a much lower combinatorial complexity than Go.

Well, I'd love to see an NN solution beating top chess engines. It might also introduce novelty to the game, just as regular engines did.

It'd be particularly useful to have a chess bot that can play badly in the same way a human does.

The problem with the current chess bots is that they play badly, badly. They choose a terrible random mistake to make every few moves, while some of their other moves are brilliant. They cannot accurately mimic beginner or intermediate level players.

This seems like something DeepMind could create, given the incentive. They were able to train AlphaGo to predict human moves in Go at a very high accuracy (obviously not with AlphaGo Zero, but the inferior human-predictive version is how they determined that AGZ is playing qualitatively differently).

In a sense, that would be like replicating the human brain's functionality, including the bugs and limitations.

I have some idea how it MIGHT work, but it would be a very boring solution involving 'learning' Stockfish's parameters and HOPING to find improvements to something like integrating time management and search/pruning into it.

I wouldn't bet on it though. SMP is notoriously hard to make work with alpha-beta search, and there are a lot of clever tricks involved (and the result is probably still not perfect). Maybe with ASICs you could make it stronger, but then it wouldn't be as fair a comparison.

Well, all top engines did some kind of search on parameters, not sure if you can find much improvement there.

I'm talking about something similar to what is described in the paper: a 100% self-learned solution based on NNs, without using human heuristics. That could bring totally new ideas into chess.

Shogi is probably the closest historical game in terms of complexity to Go. Some of the larger variants might exceed Go's complexity if played with drops, though that's not normally done. And Go played on a 9x9 board (like standard Shogi) has a substantially lower state space complexity (and almost certainly lower by other measures as well.)

But shogi is much more obscure outside of Japan than go or chess, so it gets less interest, especially in the large-board variants.

I think that the existence of highly optimized chess AI makes it interesting from two angles: 1) Generalization: can one make an AI using the same approach that plays both chess and Go at superhuman levels? 2) Efficiency: can these newer methods also match or outperform existing engines in terms of compute/energy costs?

But maybe not sexy enough, or we just don't hear about it as much.

That makes it even more interesting. I think it would be very notable and significant if a neural network with MCTS and self-play reinforcement learning could surpass Stockfish, which has superhuman strength but was developed with an utterly different approach involving lots of human guidance and grandmaster input.

Giraffe attempted this (with more standard tree search than MCTS and with only a value function rather than a combined policy/value network), but only reached IM level -- certainly impressive, but nowhere close to Stockfish.

Demis Hassabis was asked this in a Q&A after a talk he gave, and according to him someone did this (bootstrapping a chess engine from self-play) successfully while still a student, and was subsequently hired by them.

I didn't see the talk, but I'm guessing he was referring to the Giraffe engine done by Matthew Lai (https://arxiv.org/abs/1509.01549). The main thing there is that he only learns an evaluation function, not a policy. Giraffe still uses classical alpha-beta search over the full action space. AFAIK nobody has learned a decent policy network for chess, probably because 1) it's super tactical, and 2) nobody cares that much because alpha-beta is so strong

Because Chess is a simpler game than Go.

Minimax with Alpha Beta pruning works in Chess because the search tree is way smaller. The reason all this "Monte-Carlo Tree Search + Neural Nets" machinery is being used in Go is that Minimax + Alpha Beta pruning DOESN'T work in Go.

This is pretty incredible, especially the power dissipation results. Only 4 TPUs? Humans are toast.

That's still 10 times as much energy as a human body or 100 times as much as a human brain. But yeah, it's not like they're throwing a datacenter at this.

What is a TPU?


Comparing the top player's Elo with Zero's Elo (assuming numbers are accurate, etc):

Your rating: 3664

Opponent's rating: 5000

Probability of winning: 0.000456879355457417

So 1 in 2,200 games... ouch
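For anyone who wants to reproduce that figure, it comes out of the standard Elo expected-score formula, sketched below with the two ratings quoted above. Strictly speaking the formula gives an expected score rather than a win probability, and whether it stays meaningful across a gap this large is an open assumption.

```python
def elo_expected_score(rating, opponent_rating):
    """Standard Elo expected score for the player rated `rating`."""
    return 1.0 / (1.0 + 10.0 ** ((opponent_rating - rating) / 400.0))

# Ratings quoted above: top human 3664, AlphaGo Zero 5000.
p = elo_expected_score(3664, 5000)
games_per_win = 1.0 / p

print(p, games_per_win)
```

The gap of 1336 points puts the odds at roughly one win in a bit under 2,200 games, which matches the number above.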

I don't think you can apply this to alphago. I think probability for a human to beat alphago now is zero.

Lee Sedol's single victory is the first and the last.

I disagree. This is precisely what the Elo predicts, and it has been pretty accurate over time - it's a good metric.

For humans yes. For humans against machines? I don't think so. Can any human beat a modern chess computer? The chance is zero.

Zero means zero, yes?

Alpha particles can flip bits and cause erratic behavior, can they not?

"[The probability of] at least one bit error in 4 gigabytes of memory at sea level on planet Earth in 72 hours is over 95%"

Oh hey maybe they use ECC? Are we really arguing this? Pedantry on a weird level.

Is AlphaGo Zero the first Go program without special code to read ladders? I'm curious how a pure neural net can read them, given how non-local they are.

The concept of locality is nothing but a human weakness in Go; the best AI must read the whole board with every move.

EDIT: From the paper: "Surprisingly, shicho (“ladder” capture sequences that may span the whole board) – one of the first elements of Go knowledge learned by humans – were only understood by AlphaGo Zero much later in training" I'm surprised by the authors' use of the word "Surprisingly" here.

AlphaGo is still based around layers of 3×3 local convolutions.

That represents a strong assumption about locality in the network design. I would expect AlphaGo to perform poorly on the game "Go with the vertices randomly permuted".

Well they didn't use inception. If their inception units have, say, a 7x7 conv, then ladders will probably be found much earlier.
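To put rough numbers on the locality argument, here is a small sketch of how far information can travel through a plain stack of stride-1 convolutions: each k×k layer widens the receptive field by k − 1. It assumes no pooling, dilation or striding, and it only measures theoretical reach, not how easily a board-spanning pattern like a ladder is actually learned.

```python
import math

def receptive_field(num_layers, kernel=3):
    """Receptive field width after num_layers stride-1 kernel x kernel convs."""
    return 1 + num_layers * (kernel - 1)

def layers_to_span(board_size, kernel=3):
    """Fewest stacked layers whose receptive field covers board_size points."""
    return math.ceil((board_size - 1) / (kernel - 1))

print(layers_to_span(19, kernel=3))  # 3x3 convolutions on a 19x19 board
print(layers_to_span(19, kernel=7))  # hypothetical 7x7 convolutions
print(receptive_field(9, kernel=3))
```

With 3×3 kernels a single output can see from one edge of a 19×19 board to the other only after nine layers, versus three layers with 7×7 kernels, so a deep enough stack does reach the whole board even though each individual layer is local.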

>Surprisingly, shicho (‘ladder’ capture sequences that may span the whole board)—one of the first elements of Go knowledge learned by humans—were only understood by AlphaGo Zero much later in training. [0]

[0] https://www.nature.com/nature/journal/v550/n7676/full/nature...

The catch is that this isn't quite zero human knowledge, since the tree search algorithm is a human discovery, and not one that came easily to humans. It also massively cuts down on the search space for an appropriate policy function.

That means that this setup isn't necessarily general. How applicable is MCTS to games with asymmetric information, a la Starcraft? What about games that can't quite be modeled with an alternating turn-based game tree like bughouse?

There's a Dota 2 bot by OpenAI that played games with itself and managed to beat a lot of pros in the scene. It's still SF mid only no runes and some restricted items, but it shows that there is also potential for Starcraft.


Maybe you know something we don't...

"We’re not ready to talk about agent internals"

What makes you think it uses a tree search?

That's not quite what they're talking about WRT zero human knowledge.

The problem is that there's no intrinsic scoring system for Go, nothing specific to maximize, so it's difficult to tell a computer whether a given outcome is "good" or "bad". So early versions of AlphaGo used a collection of human-played Go games to get an idea of what constitutes "good" and what is "bad", so it can then train its model to predict whether a move will make things better or worse.

This new system forgoes that step, and instead has the model play itself starting at random and looking for patterns that end up winning games. It's as if you gave the rules to the game of Go to a culture that's never heard of it before, and they evolved their own play style entirely in isolation.

Their result is a model that is better than the one that was developed with human influence, and that's the interesting bit.

I understand that the paper means that they didn't train it on expert input. The significance of the research is that this is a more general way to construct a game AI. The question I am posing is how far we have to go on that front.

Yes. It'll be interesting to see if their StarCraft project uses the same algorithms or not. Note that the link merely describes software that could be used for feature engineering. It doesn't describe what NN architecture or tree search algorithms DeepMind is using.

> What about games that can't quite be modeled with an alternating turn-based game tree like bughouse?

Train a network which predicts future state of the game, given current state and input. Train a network which generates sensible inputs, given current state. Use MCTS.

Bughouse, StarCraft, and other important games need to be modeled as simultaneous-decision games. Plain-vanilla MCTS is designed for alternating-decision games.

To see why this is important, consider why min-max (which MCTS approximates) actually works. At any given point, the equilibrium strategy for the player to move is the move that maximizes their payoff, and the utility for each move can be found recursively.
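The recursion described above is short enough to write out. This is a minimal sketch of plain min-max on a toy alternating-move game, where the tree is a nested list and the leaves are payoffs for the maximizing player; the tree and its numbers are made up for illustration.

```python
def minimax(node, maximizing=True):
    """Value of `node` for the maximizing player under optimal alternating play."""
    # A leaf is a bare number: the payoff to the maximizing player.
    if isinstance(node, (int, float)):
        return node
    # Otherwise the player to move picks the child that is best for them,
    # and the turn passes to the opponent one level down.
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

# Toy game: the first player picks a branch, then the opponent picks a leaf.
tree = [[3, 5], [2, 9], [0, 7]]
print(minimax(tree))
```

The opponent would answer the three branches with 3, 2 and 0, so the first player takes the first branch for a value of 3. Every node has exactly one player choosing, which is the assumption that breaks down once both players move at the same time.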

In simultaneous decision games, calculating the equilibrium strategy (which may even be a mixed strategy) is more complicated. See http://mlanctot.info/files/papers/cig14-smmctsggp.pdf for various ways in which MCTS can be extended to simultaneous-decision games.

It'll be interesting to see if DeepMind picks up a search algorithm someone else has researched, or if they come up with something entirely new.

Thanks for the link. The thing I described roughly corresponds to SUCT.

It will be interesting to see how a NN deals with uncertainty about the enemy's state and moves.

How I wish Marvin Minsky would have stayed alive for one more year and seen this. He would have been so happy!

In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6.

“What are you doing?”, asked Minsky.

“I am training a randomly wired neural net to play Tic-Tac-Toe” Sussman replied.

“Why is the net wired randomly?”, asked Minsky.

“I do not want it to have any preconceptions of how to play”, Sussman said.

Minsky then shut his eyes.

“Why do you close your eyes?”, Sussman asked his teacher.

“So that the room will be empty.”

At that moment, Sussman was enlightened.


So Sussman was right the first time?

A random net has some random preconception. That doesn't mean it's a bad idea to try random preconceptions.

And he would have still said that deep learning lacks any sort of common sense understanding that's necessary to get close to human level intelligence.

I think he was cryopreserved, so he surely will be surprised once they wake him up in the future, assuming cryonics really works.

I'd certainly be surprised if I ever woke up from being cryopreserved. Which isn't to say that I'd object to the process if I had the disposable income and an understanding/cooperative family support structure, which I do not.

I wonder if it would be more popular if the cost was reduced to something similar to a regular funeral. It seems it might be a more cheery send off even if the chances of it working are questionable.

You can already buy an insurance plan that will pay for it in some states, and that's reasonably priced. In my case religious family members would never let it go down though even if the finances were solved.

One idea that occurs to me is to now evolve the Go game itself in a direction that adds more challenges for an AI to solve, and then solve those problems. How about being able to handle different and randomized board shapes? How about being allowed to name one move the opponent cannot take when you play a piece? It would be interesting to keep track of what variations the algorithm handles well automatically, and which it falls flat on, etc.

Arimaa was a chess inspired game intended to be difficult for computers. It "fell" in 2015.

Like Arimaa, some other games were (at least partially) designed to be hard for computers: Havannah [1] and Octi [2]. Havannah has since been defeated by the machines. Octi remains unchallenged, but that is probably due to its obscurity.

[1]. https://en.wikipedia.org/wiki/Havannah#Computer_Havannah

[2]. https://news.yale.edu/1999/06/01/successor-chess-new-game-st...

This is such an impressive result, and so general, I bet many people (including me) wish they knew exactly how to duplicate this result. It would be great if they created an online course that explained all algorithms in detail right up to the creation of AlphaGo Zero itself. The paper gives the impression that it shouldn't be too hard for them to create such a course.

So in terms of training, they went from nothing to "Go Singularity" in about a month? Impressive.

Slightly scary how it went from zero to superhuman play in three days. I wonder if general AI will go that way one day.

General AI relative to an individual human, or billions of humans? The sum total of human beings, or organizations of humans is superhuman relative to an individual. We've had superhuman organizations for millennia. I'm not sure how much general AI will be different, other than the large scale automation of jobs which would happen.

As Rodney Brooks pointed out, all technology happens within a context, not a vacuum. A general AI will come to exist in a world with a lot of other superhuman capabilities already in existence.

One of the more interesting things the success of "starting with zero" suggests is that the idea that some mystical "human consciousness" is the end goal for AI might be laughable in the long term. AI might just casually bypass human consciousness, say "oh, hi!" and wave us goodbye a day later. Also, a factor of 7 billion "happens" in computer science.

This is getting rather creepy to think of, even if it's still science fiction. At this point, I could see a computer that out-thinks humanity within decades. What would it think? What would we even do with its findings? Would we understand it? Would it understand itself? Would it know how to manipulate us?

In those three days, it made X moves.

If you multiply X by the amount of time it takes, on average, for a human to make a move... How many human lifetimes did Zero take to get to superhuman?

Amazing results, though I am somewhat frightened by how generic this model is and how it achieved such amazing results. I can't help but think that these same techniques can be used to learn how humans react in certain situations and how they can, very subtly, be worked on to think in a certain way - one that fits the agenda of whatever party is behind it.

With the mass surveillance that is Google it's quite doable to test for human reactions to certain things. They have the tools to execute a certain plan and evaluate its effectiveness. Of course it can also go in a benevolent way: like what kind of policy will benefit the most people? (semantics of 'benefiting' aside)

I at least certainly hope these kinds of generic algorithms will be used to generate effective, meaningful policies that truly help people. Still a far away future but one that gets closer by the day.

I'd only worry if it can outperform humans when there are no rules per se. That is, if I put a queen down on the Go board, and start knocking off stones, moving three times a turn, then take a lighter and burn the Go board, the AI responds by decapitating me.

Ha! I do wonder about using a board game where the rules periodically change in simple ways at random. A human could easily adapt to the rule changes while playing and adjust their strategy accordingly. Would a Deep Learning algorithm be able to do this?

If we keep the board and pieces digital, then the board could change shape, the pieces could change color indicating a random association with a rule change, and what not.

What's fascinating (and admittedly somewhat worrying) about self-play is that an agent can accidentally become adept at tasks other than the intended one via transfer learning. The "wrestling spiders" in OpenAI's demo quickly mastered the art of Sumo wrestling. And whatever skills they learned in resisting an opposing force to stay standing on a platform were immediately applicable to myriad different domains. In this case, being subjected to hurricane-force winds without, as any normal spider might be, getting hurled into the sky!

It's more difficult to see how Go playing skills can translate to other domains. But for tasks in robotics, cybersecurity or fintech the power of self-play trained transfer learning becomes more apparent.

It is clear that these "self-play" scenarios depend on simulation - unless there is an appropriate stage for self-play to take place on, there can be no play. The question is - where do we stand with simulation for robotics, self-driving cars, etc.?

My bet is that simulation is going to be the crowning jewel in the AI field, replacing static datasets and supervised learning with "dynamic datasets" and rewards. It would help with data sparsity as well (where can you find an image of a donkey riding an elephant for the new ImageNet? - but you can sim that or any possible combination).

Not to mention that humans have fallen head over heels for simulation as well - VR headsets and games in general. I see a great future for simulation with both AI and humans. It will be our common learning/playing/research sandbox.

I would be willing to spend an entire lifetime to perfectly understand how this algorithm works. Currently I can barely write Dijkstra's algorithm.

You can watch the RL course given by one of the inventors of AlphaGo, David Silver.


This is pretty cool. Thanks!

Would be nice if there was an open source attempt at an AlphaGo clone on a 9x9 board, so it could be run on commodity hardware and maybe trained in more reasonable time. Also would be interesting to see if a human would still win on a 190x190 or some arbitrary size board against an AlphaGo Zero trained appropriately.
