A secondary reason for AlphaGo Zero's performance is that it combines both value and policy networks into a single network, since it's redundant to have two networks for move selection.
These are the two biggest distinguishing characteristics of AlphaGo Zero compared to previous AlphaGos, and the OP doesn't discuss either of them.
The expected value of a contract at time T, estimated at some time t < T, is assumed to be equal (up to discounting) for all t -- e.g. if today we think the contract will be worth $100 a year from now, then we also think that the estimate, made n months from now, of the value (12-n) months after that, will also be $100 in expectation. This allows you to shrink the state space considerably.
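In symbols (my notation, not the parent's): writing V_T for the contract value at expiry and E_t for the expectation conditional on time-t information, the assumption is just the tower property of conditional expectation,

    \mathbb{E}_s[\mathbb{E}_t[V_T]] = \mathbb{E}_s[V_T]  for  s < t < T,

so today's estimate of a future estimate equals today's estimate, and (up to discounting) it doesn't matter at which t you form it.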
You can usually work out the payoff of a derivative in different scenarios given rational exercise decisions by all contract participants. The calculation assumes that every market participant makes the best possible decision given the information available to them at the time, by either explicitly or implicitly building a tree and working backwards, propagating the 'future' value back to the root.
This closely resembles the modeling of a discrete adversarial game, except the payoffs need to make reference to random market variables like the stock price, so the tree nodes are not just indexed by participant action, but also by variables.
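A minimal sketch of that backward induction, under toy assumptions (a binomial stock tree and an American put; the function name and parameters are made up for illustration):

    # Price an American put on a binomial tree by backward induction:
    # at each node the holder exercises iff the immediate payoff beats
    # the discounted expected continuation value.
    def american_put_binomial(S0, K, r, u, d, steps):
        dt = 1.0 / steps
        p = ((1 + r * dt) - d) / (u - d)       # risk-neutral up probability
        disc = 1.0 / (1 + r * dt)
        # terminal payoffs, indexed by the number of up-moves j
        values = [max(K - S0 * u**j * d**(steps - j), 0.0)
                  for j in range(steps + 1)]
        for step in range(steps - 1, -1, -1):  # walk the tree backwards
            for j in range(step + 1):
                cont = disc * (p * values[j + 1] + (1 - p) * values[j])
                exercise = max(K - S0 * u**j * d**(step - j), 0.0)
                values[j] = max(cont, exercise)  # rational exercise decision
        return values[0]

    print(american_put_binomial(S0=100, K=100, r=0.05, u=1.02, d=0.98, steps=50))

Here the 'participant action' is just exercise-or-hold, and the random market variable is the binomial stock price, exactly the node indexing described above.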
There's actually a nice resemblance between the Longstaff-Schwartz method of pricing American options and MCTS + AlphaGo, except that the former uses kernel regressions instead of deep neural nets, and samples from a continuous space with an assumed probability distribution instead of a discrete space guided by a policy network.
My memory is hazy so there might not be a real connection here.
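For the curious, a bare-bones sketch of the Longstaff-Schwartz idea, with a plain polynomial regression standing in for the kernel regression (all parameters are toy assumptions, not anything from the paper):

    import numpy as np

    # Longstaff-Schwartz for an American put, stripped to the core idea:
    # simulate price paths forward, then walk backwards, regressing the
    # discounted future cashflow on the current price to estimate the
    # continuation value at each exercise date.
    rng = np.random.default_rng(0)
    S0, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0
    steps, n_paths = 50, 10_000
    dt = T / steps
    disc = np.exp(-r * dt)

    z = rng.standard_normal((n_paths, steps))
    S = S0 * np.exp(np.cumsum((r - 0.5 * sigma**2) * dt
                              + sigma * np.sqrt(dt) * z, axis=1))

    cashflow = np.maximum(K - S[:, -1], 0.0)   # exercise value at expiry
    for t in range(steps - 2, -1, -1):
        cashflow *= disc                       # discount one step back
        itm = (K - S[:, t]) > 0                # regress in-the-money paths only
        if itm.sum() > 10:
            coeffs = np.polyfit(S[itm, t], cashflow[itm], 2)
            continuation = np.polyval(coeffs, S[itm, t])
            exercise = K - S[itm, t]
            cashflow[itm] = np.where(exercise > continuation,  # rational exercise
                                     exercise, cashflow[itm])
    print("LS price ~", disc * cashflow.mean())

The regression plays the role the value network plays in AlphaGo: a cheap learned estimate of 'how good is it to keep going from here'.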
I'm not an expert, but he is: http://www.athenasc.com/dpbook.html
(Notice the accompanying proof.)
David Silver disagrees. The most critical distinguishing characteristic is the expert/tree iteration which makes stable self-play possible at all.
In the first paper, the self-play-trained policy is at about 1500 Elo, while darkforest2, a supervised-trained policy from Facebook, is around the same, if not better. So self-play wasn't of much use the first time around. In the AlphaZero paper, by contrast, the self-play-trained policy reaches about 3000 Elo.
Yeah, it's hilariously underemphasized. 1 sentence, literally. Fortunately I was able to ask Silver directly and get confirmation that it's the tree iteration: https://www.reddit.com/r/MachineLearning/comments/76xjb5/ama...
Sidenote: It used to be that simply googling "sutton barto book" would bring you to the right place with the first suggested link. Now this stuff is so popular all of a sudden, I needed to consult the link I had set on my own page in order to find it. It's curious how the growth of popularity of an idea will with time obscure its own roots and primary sources. On the plus side, TIL that Sutton's working on a 2nd edition! =)
> TD-Lambda is a learning algorithm invented by Richard S. Sutton based on earlier work on temporal difference learning by Arthur Samuel. This algorithm was famously applied by Gerald Tesauro to create TD-Gammon, a program that learned to play the game of backgammon at the level of expert human players.
> TD-Gammon achieved a level of play just slightly below that of the top human backgammon players of the time. It explored strategies that humans had not pursued and led to advances in the theory of correct backgammon play.
Backgammon is also interesting in that there is a non-deterministic element - the dice roll on every turn. This is where TD seems to shine.
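The core TD update is tiny. A hedged sketch of tabular TD(λ) with eligibility traces (the env interface and parameters are placeholders; Tesauro's actual setup used a neural network, not a table):

    from collections import defaultdict

    # Tabular TD(lambda): nudge the value of every recently-visited state
    # toward the one-step bootstrapped target, weighted by an eligibility
    # trace that decays with gamma * lambda.
    def td_lambda_episode(env, V, alpha=0.1, gamma=1.0, lam=0.7):
        e = defaultdict(float)                    # eligibility traces
        s = env.reset()
        done = False
        while not done:
            s_next, reward, done = env.step()     # e.g. roll dice, play a move
            target = reward + (0.0 if done else gamma * V[s_next])
            delta = target - V[s]                 # the TD error
            e[s] += 1.0                           # accumulating trace
            for state in list(e):
                V[state] += alpha * delta * e[state]  # credit past states
                e[state] *= gamma * lam               # decay traces
            s = s_next
        return V

The dice help here because they force broad exploration of the state space for free, which is presumably part of why TD worked so well on backgammon.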
“The Alpha Zero algorithm produces better and better expert policies and value functions over time by playing games against itself with accelerated Monte Carlo tree search. The expert policy π and the approximate value function Ŵ are both represented by deep neural networks. In fact, to increase efficiency, Alpha Zero uses one neural network f that takes in the game state and produces both the probabilities over the next move and the approximate state value. (Technically, it takes in the previous eight game states and an indicator telling it whose turn it is.)”
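In code, the 'one network that produces both' idea is just a shared trunk with two heads. A minimal sketch (layer sizes are made up; the real network is a much deeper residual CNN):

    import torch.nn as nn

    class PolicyValueNet(nn.Module):
        # one shared trunk, two heads: move probabilities and a scalar value
        def __init__(self, board=19, channels=64):
            super().__init__()
            # 17 input planes: 8 past positions per player + a turn indicator
            self.trunk = nn.Sequential(
                nn.Conv2d(17, channels, 3, padding=1), nn.ReLU(),
                nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            )
            self.policy_head = nn.Sequential(   # one logit per point, plus pass
                nn.Flatten(),
                nn.Linear(channels * board * board, board * board + 1))
            self.value_head = nn.Sequential(    # value squashed into [-1, 1]
                nn.Flatten(),
                nn.Linear(channels * board * board, 1), nn.Tanh())

        def forward(self, x):
            h = self.trunk(x)
            return self.policy_head(h), self.value_head(h)

Sharing the trunk means the features that help judge 'who is winning' are the same ones used to pick the next move, which is both cheaper and apparently acts as a regularizer.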
Whether OP touches on temporal-difference learning I am unqualified to say, but they do not explicitly mention it. I am also unqualified to judge how central this technique is to the level of play achieved. However, the DeepMind paper (pg. 20)† starts talking about temporal-difference learning thus:
“Self-play reinforcement learning has previously been applied to the game of Go. NeuroGo[40, 41] used a neural network to represent a value function, using a sophisticated architecture based on Go knowledge regarding connectivity, territory and eyes. This neural network was trained by temporal-difference learning to predict territory in games of self-play, building on prior work. A related approach, RLGO, represented the value function instead by a linear combination of features, exhaustively enumerating all 3 × 3 patterns of stones; it was trained by temporal-difference learning to predict the winner in games of self-play. Both NeuroGo and RLGO achieved a weak amateur level of play.”
I'm no expert but this implies to me that it was probably the sum total of all the subtle architectural decisions made by the DeepMind team plus their AI hardware and software platform that made AlphaGo Zero excel.
(Eg AlphaGo using two different networks for policy and valuation was a big breakthrough back when it was young, because people couldn't make the single network one work so well.)
For temporal difference learning the article about TD-Gammon, a backgammon AI from the early 90s, is great: http://www.bkgm.com/articles/tesauro/tdl.html (It's linked from the Wikipedia article you referenced, too.)
> A new paper was released a few days detailing a new neural net
I believe you mean "a few days ago"?
The AlphaGo neural network is able to do the first part (evaluating positions), but the tree search is still a hand-crafted algorithm. Do they have plans to work on a version with a pure neural network? (i.e. one which would be able to learn how to do tree search.)
But you're right. A NN way to do Monte Carlo search on GPUs would make things even simpler.
1. Possibly the elegant rules of the game cut down the search space so much that there is a learnable function that gives us optimal MCTS supervision;
2. Or the CNN approximates human visual intuition so well that, while Zero has not evaluated as many board positions, it has effectively evaluated all the positions a human would ever consider - so it remains possible that a different network could produce different strategies and be better than Zero.
This final paragraph is just editorializing. A computer will never care about anything (including games like Go and domination of other beings) that it is not programmed to imitate caring about, and will thus remain perennially unmotivated.
Also, my intuition says that gradient descent is an ugly hack and that there HAS to be some better (direct) way to get at the inverse of a matrix - not just in specific cases but in the general case! But I digress; not being a mathematician, perhaps someone has already proved somehow that a general method to directly and efficiently invert all possible matrices is impossible.
I know I may be asking to make what may be a fundamentally hard problem easy... I'm just saying that I don't get the same, eh, "buzz" I get when I see a really cool solution for something (which I've often gotten from math, physics, programming, etc.). This is very hard to explain, as it's probably irrational and thus I can't make a strong rational argument for it lol (admittedly!)
I really love linear algebra, btw, so this should be right up my alley.
Did they unleash a second "Master" program?
I am wondering if the "better" strategy moves are now super wacky and weird and break all theory.
In fact, one of the obvious differences between AlphaGo Zero and top human players is much more play on safe opening spots, which has been out of fashion among human pros for a hundred years or so.
Empirically, this is not correct. The original AlphaGo achieved a 57% accuracy at predicting expert moves.
I can't find an exact number, but based on the graph in the Nature article, AlphaGo Zero has a less than 50% accuracy at predicting human moves. Eyeballing the graph, it looks like the supervised-learning variant of AlphaGo Zero scored <55% at predicting human moves. Better than the reinforcement-learning AlphaGo Zero, but still worse than the original AlphaGo.
Of course, it is not clear that the ability to predict the moves humans will play is the best metric for how human-like a computer plays. It is just the only objective metric we have. Although, if this were an actual research question, we could probably come up with better metrics.
Anyone try it yet?
Been following the AlphaGo Zero developments, which leap-frog what was going on in the movie (although still very much worth seeing).
One thing I was curious about is whether Go would be considered solved, either strongly or weakly, since AlphaGo Zero at this point doesn't seem to be beatable by any living human. Wikipedia does not list it as solved in either sense, and I was wondering if this was an oversight.
If AlphaGo has weakly solved Go, then it should either have a 100% win rate when playing against itself as white, or a 100% win rate when playing against itself as black.
The most common komi values these days are 6.5 or 7.5, depending on ruleset.
In Go this is very rare (as opposed to e.g. chess).
Monte Carlo was already used in 2005 in AIs playing on KGS. Gradient descent is a basic algorithm that I saw in an AI class in ~2008 as well. I bet both are even a lot older and well known by all experts.
This is not what makes AlphaGo special or Zero successful. The curious thing about Zero is that usually with gradient descent you run a huge risk of getting stuck in a local maximum and then stop improving, because every step leaves you no better than the current one.
So one question is actually how they used these same old algorithms so much more efficiently, and the second question is how they overcame the local maximum problem. Additionally there may be other problems involved that experts know better than me.
But an explanation of basic algorithms can't be the answer.
No. The curious thing is that you can train a godawful huge NN with 40 layers via pure self-play with no checkpoints or baselines or pretraining or library of hard problems or any kind of stabilization mechanism, and it won't diverge but will learn incredibly rapidly and well and stably. As Silver says in the AmA, all their attempts at pure self-play ran into the usual divergence problems where the training explodes and engages in catastrophic forgetting, which is what the RL folklore predicts will happen if you try to do that. Local maximums are not the problem - the problem is the self-play can't even find a local maximum much less maintain it or improve it.
> So one question is actually how they used these same old algorithms so much more efficiently, and the second question is how did they overcame the local maximum problem.
Er, this is exactly what OP is all about: the Monte Carlo tree search supervision. That's how they used them.
This is a new problem I haven't heard about. Thanks for adding it to the discussion.
> Er, this is exactly what OP is all about: the Monte Carlo tree search supervision.
And here, I don't know. I think I claimed, confidently, that this is a super old algorithm. What is new about it that you need to mention it again? Really, maybe I'm misinterpreting something, or forgetting something, but I'm quite sure I discussed the algorithm itself with Go AI scientists in 2005 who had it in all their bots, afaik. Please correct me if you believe my memory is cheating me here.
You did not because first, MCTS was only published by Coulom in 2006 (so they couldn't 'all' have been using it in 2005), and second because you are missing the crucial iteration between the heavy heuristic (NN) for playouts and the refined estimates from the playouts: https://news.ycombinator.com/item?id=15627834 Any MCTS in 2006 for Go would either be using random light playouts or simple unchanging hand-engineered heavy heuristics, hence, no possible recursive self-improvement in self-play.
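The loop being described, as skeletal pseudocode (every name here is a placeholder; this is the shape of expert/tree iteration, not DeepMind's actual code):

    # Expert iteration: the network guides MCTS, MCTS produces better
    # ("expert") move distributions, and the network is then trained to
    # match them - each side of the loop keeps improving the other.
    def expert_iteration(net, n_iters, n_games):
        for _ in range(n_iters):
            examples = []
            for _ in range(n_games):
                game, trajectory = new_game(), []
                while not game.over():
                    pi = mcts_search(game, net)   # playouts guided by the net
                    trajectory.append((game.state(), pi))
                    game.play(sample_move(pi))
                z = game.outcome()                # +/-1 from each player's view
                examples += [(s, p, z) for (s, p) in trajectory]
            train(net, examples)   # policy head -> pi, value head -> z
        return net

With light playouts or a frozen hand-engineered heuristic there is nothing on the right-hand side of that loop to improve, which is the difference from 2006-era MCTS.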
Deep networks have many parameters and represent extremely high-dimensional cost functions. Some basic heuristics suggest that sufficiently random high-dimensional functions have relatively few local extrema, but lots of saddle points. (Look at the signs of the eigenvalues of the Hessian matrix and apply a coin flipping argument. This suggests that local extrema become exponentially rarer as the dimension increases.)
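Spelling out the coin-flip heuristic (a back-of-the-envelope argument, not a theorem about real networks): if each of the d Hessian eigenvalues at a random critical point is independently positive or negative with probability 1/2, then

    P(\text{local min}) = P(\lambda_1 > 0, \dots, \lambda_d > 0) \approx 2^{-d},

so with millions of parameters, essentially every critical point you encounter is a saddle rather than a trap.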
Moreover, stochastic gradient descent is actually pretty good at escaping saddle points.
I imagine that this is an iteration of the AlphaGo engine; the people working on this are very current with AlphaGo.
If chess is similar, then wouldn't DeepMind be able to bootstrap game knowledge? Perhaps this isn't a big goal, but chess is chess, after all.
He thinks it's only a matter of time till machine learning beats hand-crafted systems like Stockfish even in chess.
The consequence of AlphaGo Zero is that it can, in a matter of days, disregard and surpass all human knowledge about the game.
For a long time, maximizing the score margin was equated with maximizing your probability of winning. AlphaGo doesn't play like that... it's a substantial paradigm shift. If you watch the early commentaries, you will see that human players initially called AlphaGo's moves mistakes because they were slow and wasted opportunities to gain more territory, only to realize later that AlphaGo was actually winning.
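One way to state the shift (my notation, not from the article): traditional heuristics effectively pick argmax_a E[margin | a], while AlphaGo picks

    a^\ast = \arg\max_a \, P(\text{win} \mid a),

and since a half-point win counts exactly as much as a fifty-point win, the second objective happily trades territory for certainty - which is what the 'slow' moves were doing.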
Another obstacle is that whereas an image classifier is basically a purely learned function, Monte Carlo tree search has a hard-coded element of applying potential moves to find new states to explore - always following the game’s rules correctly. If you go more than a few moves ahead, it has a relatively unpredictable choice of what states to explore; but as you get closer to the current move, I think it starts to look a bit more like brute force. So if you want the AI to make a really bad move - one that has severe consequences in the next few turns - you need to not just make the NN think it looks good, but that the game state still looks good even after you make any possible counterplay.
Disclaimer: I don’t actually know much about either neural networks or Go.
(On a side note, the more I think about Anthony et al 2017/AlphaGo Zero's tree/expert iteration, the more beautifully simple it appears to me.)
AlphaGo Zero has played itself, and they have published 20 of those games.
GANs were not used for AlphaGo; as the article points out, DeepMind uses reinforcement learning and MCTS.
Personally, I look forward to the day when the software I own works for me to the extent of optimizing the decisions I make during the day, even many mundane ones. Properly executed, such a system could make a big difference in my quality of life. I believe that a big piece that is missing is a solid life-model, a life-representation, that can be optimized. Once that is defined, an RNN or MCS can optimize it and I can reap the benefits.
I do this similarly but with supplements and medication. I track a lot of my supplements that boost productivity and sleep and disregard anything that's negative. I've been continuing this over the course of a year.
I haven't added anything like a RNN to it for recommendations though. I'm a bit scared of messing up the gradient descent component :)
My interpretation is that AI can allow you to rapidly load and unload mental models so that you're free to devote 100% of your mind to what only you (a human) can do.
Based on human nature and history I don't expect anything good coming from that.
Introducing MCTS to Go-playing programs in 2006 brought about a revolution in playing strength. For the first time, Go programs stood a chance against a dan-level amateur.
One of the things that amazed me the most about AlphaGo Zero was that they didn't do any tricks to minimize the latency of the network evaluation!
Still, it's certainly worth a try, I'd be extremely interested to see what style of chess a self-trained MCTS chess version would have :).
It would be very interesting to see a computer improve on chess openings.
If by 'solving' you mean developing a strong AI program to play chess, I think the reason they haven't done it is that chess is easier to solve than Go. The best computer chess engines are already a lot better than the best humans, so there is not much for Google to gain by finding a new approach.
There have been attempts to use similar approaches for chess, but they all seemed to perform worse than the traditional approach for chess (alpha-beta search with a hand-written evaluation function and null-move pruning).
The differences between Go and Chess that might explain this are:
- Chess has a smaller branching factor (the number of legal moves is around 40 in a typical position in chess, whereas it's over 100 in a typical position for Go)
- A chess position is easier to evaluate. Simply counting the material already gives a decent evaluation in a lot of cases. This, supplemented by hand-written rules, gives a solid evaluation of a position (see the sketch after this list)
- The average game of chess is shorter. A game of chess lasts between 20 and 80 moves, whereas a game of Go lasts around 200 moves. This makes it a lot more feasible in chess than in Go to extend the search far enough to oversee the consequences of each move
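To make the 'counting material' point concrete, here is the kind of baseline evaluation a traditional engine starts from (the piece values are the conventional textbook ones; real engines like Stockfish layer many hand-written positional terms on top):

    # Naive material count: the baseline a hand-written chess evaluation
    # refines with positional rules (king safety, pawn structure, etc.).
    PIECE_VALUES = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9, "K": 0}

    def material_eval(pieces):
        # pieces: piece letters on the board, White uppercase, Black lowercase
        score = 0
        for piece in pieces:
            value = PIECE_VALUES.get(piece.upper(), 0)
            score += value if piece.isupper() else -value
        return score  # positive favors White

Nothing remotely this cheap and informative exists for a Go position, which is a big part of why Go needed the learned-evaluation approach first.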
Still, it would be interesting to see if it can surpass the human world champion...