Alpha Go Zero: How and Why It Works 354 points by Mageek on Nov 4, 2017 | hide | past | web | favorite | 109 comments

 The main reason AlphaGo Zero learns so much faster than its predecessors is because it uses temporal-difference learning.[1] This effectively removes a huge amount of the value network's state space for the learning algorithm to search through, since it bakes in the assumption that a move's value ought to equal that of the best available move in the following board position, which is exactly what you'd expect for a game like Go.A secondary reason for AlphaGo Zero's performance is that it combines both value and policy networks into a single network, since it's redundant to have two networks for move selection.These are the two biggest distinguishing characteristics of AlphaGo Zero compared to previous AlphaGos, and the OP doesn't discuss either of them.
 Interestingly, the idea behind temporal difference learning is more or less the intuition behind how people price derivatives in finance.The expected value of a contract at time T, estimated at some time t < T, is assumed to be equal (up to discounting) for all t -- e.g. if today we think the contract will be worth \$100 a year later, then we also think that the expected estimate, made n months from now, of the value [12-n] months later, will also be \$100. This allows you to shrink the state space considerably.You can usually work out the payoff of a derivatives in different scenarios given rational exercise decisions by all contract participants. The calculation assumes that every market participant makes the best possible decision given the information they had available at the time by either explicitly or implicitly building a tree and working backwards, back-propagating the 'future' value back to the root.This closely resembles the modeling of a discrete adversarial game, except the payoffs need to make reference to random market variables like the stock price, so the tree nodes are not just indexed by participant action, but also by variables.There's actually a nice resemblance between the Longstaff-Schwarz method of pricing American options and MCTS + Alphago, except that the former is using kernel regressions instead of deep neural nets and we sample from a continuous space with an assumed probability distribution instead of a discrete space guided by a policy network [1].
 I think the bellman equation (which is used extensively in reinforcement learning) is also taught in stochastic calculus for finance (except in the continuous form?). https://en.wikipedia.org/wiki/Hamilton%E2%80%93Jacobi%E2%80%...My memory is hazy so there might not be a real connection here.
 Yup, and a lot more! The Hamilton-Jacobi-Bellman equations come up in anything that can be formulated as an optimal control problem.I'm not an expert, but he is: http://www.athenasc.com/dpbook.html
 Yes see book by oksendal
 Or in the parlance of probability theory: "The expectation of the posterior probability, after viewing the evidence, must equal the prior probability."http://lesswrong.com/lw/ii/conservation_of_expected_evidence...(Notice the accompanying proof.)
 > These are the two biggest distinguishing characteristics of AlphaGo Zero compared to previous AlphaGos, and the OP doesn't discuss either of them.David Silver disagrees. The most critical distinguishing characteristic is the expert/tree iteration which makes stable self-play possible at all.
 A lot of people reading the paper miss this. I guess it's not emphasized enough.In the first paper, the selfplay trained policy is about 1500 in elo rating, while darkforest2 a supervised trained policy from Facebook is around the same, if not better. So selfplay wasn't of much use the first time around. While in the AlphaZero paper the selfplay trained policy has about 3000 elo rating.
 > A lot of people reading the paper miss this. I guess it's not emphasized enough.Yeah, it's hilariously underemphasized. 1 sentence, literally. Fortunately I was able to ask Silver directly and get confirmation that it's the tree iteration: https://www.reddit.com/r/MachineLearning/comments/76xjb5/ama...
 For anyone interested: Learn more on TD and RL in general from Sutton (inventor of TD-lambda) and Barto's book: http://www.incompleteideas.net/sutton/book/the-book.htmlSidenote: It used to be that simply googling "sutton barto book" would bring you to the right place with the first suggested link. Now this stuff is so popular all of a sudden, I needed to consult the link I had set on my own page in order to find it. It's curious how the growth of popularity of an idea will with time obscure it's own roots and primary sources. On the plus side, TIL that Sutton's working on a 2nd edition! =)
 Here is the latest publicly available draft of the 2nd edition (June 2017, 538 pages): http://incompleteideas.net/sutton/book/bookdraft2017june19.p...
 Better to use my link, not the deep one. On mine you can click through to the latest draft (as of writing this, November 5th, 2017)!
 very good book. but it is also very old. why are they using it only now ???
 Are you sure AlphaGo Zero even uses temporal difference learning? The Nature paper suggests it does not and merely references some older programs that did. I think it just uses a somewhat custom form of self-play reinforcement learning combined with MCTS.
 You are correct. There is no TD learning in AGZ. The value network is trained to directly predict the game outcome given the current state, and is not trained through "bootstrapping" based on the next state's value estimate.
 Interesting, a TD algorithm, developed by a Canadian AI researcher now working with Deepmind in the early 1990s, was previously used to beat expert players at Backgammon and advanced human understanding of the game:> TD-Lambda is a learning algorithm invented by Richard S. Sutton based on earlier work on temporal difference learning by Arthur Samuel. This algorithm was famously applied by Gerald Tesauro to create TD-Gammon, a program that learned to play the game of backgammon at the level of expert human players.> TD-Gammon achieved a level of play just slightly below that of the top human backgammon players of the time. It explored strategies that humans had not pursued and led to advances in the theory of correct backgammon play.
 TD-Gammon was taught to us as a part of a classroom course on Reinforcement Learning (RL) in 2007. ML was known to a small set of people back then, there weren't many jobs in the area (this is in India), and even to many in this set, RL was either not known or not well known. It's interesting to see RL surge in popularity. In fact just a couple of weeks back, I was talking to the professor who taught us that course, and it was fun comparing ML/RL related awareness then to now :-)
 Yes, me too. TD was also pretty much considered useless, until it was used for backgammon. And like many things in the AI/ML world, no one really knew exactly why it worked so well.Backgammon is also interesting in that there is a non-determinstic element - the dice roll on every turn. This is where TD seems to shine.
 OP explicitly discusses the second big distinguishing in the opening paragraph of the section titled, 'The Alpha Zero Neural Ne':“The Alpha Zero algorithm produces better and better expert policies and value functions over time by playing games against itself with accelerated Monte Carlo tree search. The expert policy π and the approximate value function Ŵ are both represented by deep neural networks. In fact, to increase efficiency, Alpha Zero uses one neural network f that takes in the game state and produces both the probabilities over the next move and the approximate state value. (Technically, it takes in the previous eight game states and an indicator telling it whose turn it is.)”Regarding whether OP touches on temporal-difference learning I am unqualified to say but they do not explicitly mention it. Furthermore I am unqualified to judge how central this technique is to the level of play achieved. However in the DeepMind paper (pg. 20)† that start talking about temporal-difference learning thus:“Self-play reinforcement learning has previously been applied to the game of Go. NeuroGo[40, 41] used a neural network to represent a value function, using a sophisticated architecture based on Go knowledge regarding connectivity, territory and eyes. This neural network was trained by temporal-difference learning[42] to predict territory in games of self-play, building on prior work[43]. A related approach, RLGO[44], represented the value function instead by a linear combination of features, exhaustively enumerating all 3 × 3 patterns of stones; it was trained by temporal-difference learning to predict the winner in games of self-play. Both NeuroGo and RLGO achieved a weak amateur level of play.”I'm no expert but this implies to me that it was probably the sum total of all the subtle architectural decisions made by the DeepMind team plus their AI hardware and software platform that made AlphaGo Zero excel.
 Of course, these techniques you have mentioned were known before and tried. The big thing in AlphaGo Zero is that they found a way to make them work and the resulting architecture even manages to looks simple.(Eg AlphaGo using two different networks for policy and valuation was a big breakthrough back when it was young, because people couldn't make the single network one work so well.)For temporal difference learning the article about TD-Gammon, a backgammon AI from the early 90s, is great: http://www.bkgm.com/articles/tesauro/tdl.html (It's linked from the Wikipedia article you referenced, too.)
 The seminal article on TDRL: Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–34, PDF:http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.132...
 could you elaborate on this point? what you're saying sounds like dynamic programming, which does not reduce the state space at all, just saves on redundant computations (and is a favourite of programming interviews everywhere)
 Training via bootstrapping (i.e. dynamic programming) does reduce the state space for search when working with function approximation. It represents a bias in what sorts of values the value function approximator should predict for each state. It encodes a sort of "local continuity/coherence" constraint that wouldn't necessarily be induced by simply training to predict the raw values -- collected stochastically via interaction with the environment. This local coherence constraint acts as a regularizer (i.e. bias) while training the value function approximator.
 How would OpenAI's self-play be different?
 If the author's here: some of the math formulas don't render correctly. In particular, 10^170 is parsed as 10^{1}70, and \$5478\$ shows up without TeX applied to it.
 One stats with a randomly initialized set -> starts
 Thanks, fixed those
 Another minor fix:> A new paper was released a few days detailing a new neural netI believe you mean "a few days ago"?
 kk
 Another minor fix: the number of chess diagrams is at most 10^46, according to http://tromp.github.io/chess/chess.html
 They are two things a human brain does when playing chess or go: evaluating a position and mentally playing some positions (by doing a search tree).The AlphaGo neural network is able to do the first part (evaluating positions) but the search tree is still a hand crafted algorithm. Do they have plans to work on a version with a pure neural network? (i.e. a version which would be able to learn how to do a search tree.)
 Who is to say that the human brain doesn't have a different cell type to do this kind of search. The neocortex is shaped in a different way than other neural cells so I imagine search and evaluation are different architectures.But you're right. A NN way to do Montecarlo search in GPUs would make things even simpler.
 Would be really cool to see a generic framework for this, where you can plug in the rules of your discrete-deterministic-game-with-perfect-information and get a superhuman bot. Does something like this already exist?
 If anyone's interested who understands alpha zero in depth. I would love to start a github project to implement a super not that players checkers, tic tac toe, chess, go etc and all related games on the browser via an openai like interface.
 Check out the field of general game playing:https://en.wikipedia.org/wiki/General_game_playinghttp://www.ggp.org
 Given that AlphaGo Zero was trained on several million games of self plays, each game involving hundreds of steps, each step with 1600 MCTS simulations, the total number of board positions it has considered is on the order of trillions. While impressive it pales in comparison to the number of possible board positions of 10^170 (https://en.m.wikipedia.org/wiki/Go_and_mathematics). So its amazing performance tells us that:1. Possibly the elegant rule of the game cuts down the search space so much that there is a learnable function that gives us optimal MCTS supervision;2. Or CNN approximates human visual intuition so well, so while Zero has not evaluated so many board positions it has evaluated all the positions that human has ever considered - so it remains possible that a different network could produce different strategies and be better than Zero.
 > It is interesting to see how quickly the field of AI is progressing. Those who claim we will be able to see the robot overlords coming in time should take heed - these AI's will only be human-level for a brief instant before blasting past us into superhuman territories, never to look back.This final paragraph is just editorializing. A computer will never care about anything (including games like Go and domination of other beings) that it is not programmed to imitate care about, and will thus remain perennially unmotivated.Also, my intuition says that gradient descent is an ugly hack and that there HAS to be some better way (like a direct way) to get at the inverse of a matrix (not just in specific cases but in the general case!), but I digress, and not being a mathematician, perhaps someone has already proved somehow that a general method to directly and efficiently invert all possible matrices is impossible
 What do you mean with inverting a matrix? In the usual definition it is well studied and understood https://en.m.wikipedia.org/wiki/Invertible_matrix#Methods_of...
 I just... Here's the deal. Most good math is elegant, it has a beauty about it. And as a programmer, good code has the same "feel." But when I took Andrew Ng's class, gradient descent felt... not beautiful at all... and just based on that beauty-seeking intuition, there MUST be a more elegant way to get directly at the inverse of a matrix (assuming it exists and is possible) than all of these... complicated expensive workarounds. And the pot at the end of that rainbow is you get the EXACT value that gradient descent is wasting all this time trying to snowboard down N-dimensional space to get to!I know I may be asking to make what may be a fundamentally-hard problem, easy... I'm just saying that I don't get the same, eh, "buzz" I get when I see a really cool solution for something (which I've often gotten from math, physics, programming, etc.). This is very hard to explain as it's probably irrational and thus I can't make a strong rational argument for it lol (admittedly!)I really love linear algebra, btw, so this should be right up my alley.
 I elaborated here: https://news.ycombinator.com/item?id=15639863
 I wonder how the STYLE of Alpha Go Zero is regarded by human experts. Is it far different from AlphaGo? Why bother learning from AlphaGo if they can learn from AlphaGo Zero?Did they unleash a second "Master" program?I am wondering if the "better" strategy moves are now super wacky and weird and break all theory.
 At least initial reports are that alphaGo Zero is more human-like than Master. Zero packs even more of the inhuman ability to pick the most critical part of the board for each move, but less weird looking stuff.In fact, one of the obvious differences between AlphaGo Zero and top human players, is much more play on safe opening spots, which has been out of fashion among human pros for a hundred years or so.
 >At least initial reports are that alphaGo Zero is more human-like than Master.Empirically, this is not correct. The original AlphaGo achieved a 57% accuracy at predicting expert moves.I can't find an exact number, but based on the graph in the nature article, AlphaGo Zero has a less than 50% accuracy at predicting human moves. Eyeballing the graph, it looks like the supervized learning variant of AlphaGo Zero scored <55% at predicting human moves. Better than reinforcement learning AlphaGo Zero, but still worse then the original AlphaGo.Of course, it is not clear that ability to predict the move humans will play is the best metric to measure how human like a computer plays. It is just the only objective metric we have. Although, if this were an actual research question, we could probably come up with better metrics.https://storage.googleapis.com/deepmind-media/alphago/AlphaG...
 I'd have thought the metric to watch was "% of moves made which seemed kooky to human experts", not "% of human experts moves with which the program agrees."
 Interesting. It may be possible that while the style of play of Master is able to exploit the errors of human players easily and score convincing victories by many stones, Zero is closer to playing Go perfectly, even if it means winning by one stone. There is a chance that Zero would not score such convincing victories over humans as Master but is much less exploitable. If a strong player plays obvious novices (and humans are now at a level of novices compared to these programs), it doesn't have to worry about taking more unsound risks.
 You might be interested in the Redmon'd Review series on YouTube! Michael Redmond, 9P, and Chris Garlock just posted a short announcement that they'll be reviewing some Zero games. Michael mentioned that he'll focus on the opening in the next video.
 someone posted an attempt at an open source implementation of alphagozero https://github.com/yhyu13/AlphaGOZero-python-tensorflowanyone try it yet ?
 There's also Leela Zero [1].
 Saw the AlphaGo movie at a festival recently.Been following the AlphaGo Zero developments, which leap-frog what was going on in the movie (although still very much worth seeing).One thing I was curious about is if Go would be considered solved, either hard or weakly solved, since AlphaGo Zero at this point doesn't seem to be able to be beat by any living human. Wikipedia does not list it as solved in either sense, and I was wondering if this was an oversight.
 Go still has not been Ultra-weakly solved (eg. we do not know who is supposed to win).If AlphaGo has weakly solved go, then it should either have a 100% win rate when playing against itself as white, or a 100% win rate when playing against itself as black.
 Or 100% rate of a draw. Like tic tac toe.
 Due to komi, a go game cannot be drawn. That is, the 2nd player is awarded some number of points for the disadvantage of moving 2nd, and that number of points typically has a fractional 0.5 in it to break ties.The most common komi values these days are 6.5 or 7.5, depending on ruleset.
 Depending on the rules, a game may be drawn due to an "eternal life" (an endless forced repetition).https://senseis.xmp.net/?EternalLifeIn Go this is very rare (as opposed to e.g. chess).
 if alpha go is just an adversarial network to brute force states, then it is not solved (note I don't research alphaGo, and most of what I know about it is from HN comments)
 I don't get what is new in the set of attributes that this article describes.Monte Carlo was already used in 2005 in AIs playing on KGS. Gradient Descent is a basic algorithm is a basic algorithm that I saw in an AI class in ~2008 as well. I bet both are even a lot older and well known by all experts.This is not what makes AlphaGo special or Zero successful. The curious thing about Zero is that usually with Gradient Descent you run a huge risk of running into a local maximum and then stop evolving because every evolution makes you not better than the current step.So one question is actually how they used these same old algorithms so much more efficiently, and the second question is how did they overcame the local maximum problem. Additionally there may be other problems involved that experts know better than me.But an explanation of basic algorthims can't be the answer.
 > The curious thing about Zero is that usually with Gradient Descent you run a huge risk of running into a local maximum and then stop evolving because every evolution makes you not better than the current step.No. The curious thing is that you can train a godawful huge NN with 40 layers via pure self-play with no checkpoints or baselines or pretraining or library of hard problems or any kind of stabilization mechanism, and it won't diverge but will learn incredibly rapidly and well and stably. As Silver says in the AmA, all their attempts at pure self-play ran into the usual divergence problems where the training explodes and engages in catastrophic forgetting, which is what the RL folklore predicts will happen if you try to do that. Local maximums are not the problem - the problem is the self-play can't even find a local maximum much less maintain it or improve it.> So one question is actually how they used these same old algorithms so much more efficiently, and the second question is how did they overcame the local maximum problem.Er, this is exactly what OP is all about: the Monte Carlo tree search supervision. That's how they used them.
 If you read the AGZ paper closely, they actually use checkpoints during training. Specifically, during training they only perform updates to the "stable" set of parameters when the current "learning" set of parameters produces a policy which beats the stable set at least 55% of the time. The current stable parameters are what they use for generating the self-play data which they use to update the current "learning" parameters. I believe this is only mentioned in the supplementary material...
 I did and I would point out that while they use checkpoints, the training curves indicate this is not necessary, and what I meant is that they do not use the usual self-play (and evolutionary) mechanism of checkpoints from throughout the training history which is necessary to combat catastrophic forgetting (and which apparently wasn't enough to stabilize Zero on its own as it is the single most obvious thing to do but Silver notes all the pre-Zero self-plays diverged until they finally came up with that of MCTS supervision). The checkpoint mechanism there appears no more necessary than checkpoints in training any NN - it's critical to avoid a random error or bug wasting weeks of time but does not affect the training dynamics in any important way.
 (And Anthony et al 2017 don't use checkpoints at all, noting that it slows things down a lot for no benefit in their Hex agent.)
 > divergence problems where the training explodes and engages in catastrophic forgettingThis is a new problem I haven't heard about. Thanks for adding it to the discussion.> Er, this is exactly what OP is all about: the Monte Carlo tree search supervision.And here, I don't know. I think I exclaimed confidently that this is a super old algorthim. What is new about it, that you need to mention it again? Really, may be that I'm misinterpreting something, or forgetting something, but the algorithm itself I'm quite sure I discussed with Go AI scientists in 2005 who had them in all their bots, afaik. Please correct if you believe my memory is cheating me here.
 > but the algorithm itself I'm quite sure I discussed with Go AI scientists in 2005 who had them in all their bots, afaikYou did not because first, MCTS was only published by Coulom in 2006 (so they couldn't 'all' have been using it in 2005), and second because you are missing the crucial iteration between the heavy heuristic (NN) for playouts and the refined estimates from the playouts: https://news.ycombinator.com/item?id=15627834 Any MCTS in 2006 for Go would either be using random light playouts or simple unchanging hand-engineered heavy heuristics, hence, no possible recursive self-improvement in self-play.
 > The curious thing about Zero is that usually with Gradient Descent you run a huge risk of running into a local maximum and then stop evolving because every evolution makes you not better than the current step.Deep networks have many parameters and represent extremely high-dimensional cost functions. Some basic heuristics suggest that sufficiently random high-dimensional functions have relatively few local extrema, but lots of saddle points. (Look at the signs of the eigenvalues of the Hessian matrix and apply a coin flipping argument. This suggests that local extrema become exponentially rarer as the dimension increases.)Moreover, stochastic gradient descent is actually pretty good at escaping saddle points.
 Are there any plans to do this for Chess?I imagine that this is an iteration of the Alpha Go engine, people working on this are very current with Alpha Go.If Chess is similar, then wouldn't DeepMind be able to bootstrap game knowledge. Perhaps this isn't a big goal, but Chess is Chess after all.
 There was a project for chess, called giraffe https://bitbucket.org/waterreaction/giraffe whose author shut it down after joining google deep mind: http://www.talkchess.com/forum/viewtopic.php?t=59003He thinks it's only a matter of time till machine learning beats hand crafted systems like stockfish even in chess.
 Does alphazero beat stock fish purely by self play. That would be huge. Stock fish is the result of so many hand crafted optimizations over a large game play dataset.
 Intuitively you'd expect deep learning to improve on Stockfish's evaluation of positions, but perhaps not at the same level of throughput. I'm also intrigued as to whether a purely self-taught AI system can compare in the endgame to an engine with access to a 7-piece endgame tablebase, whose evaluation of those positions is obviously _perfect_.
 Go has been studied for hundreds of years. In many cases, by people who study the game since their childhood and work on it as a full-time occupation.The consequence of Alpha Go Zero is that it can, in a matter of days, disregard and surpass all human knowledge about the game.Maximizing a score margin has been equated for a long time with maximizing your probability of winning. Alpha Go doesn't play like that... it's a substantial paradigm shift. If you see the early commentaries you will see that human players were initially saying Alpha Go moves were mistakes because they were slow and wasted opportunities to get more territory, to then realize that Alpha Go was actually winning.
 I think people have always understood the difference between maximising probability of victory versus score. Even in amateur games you'll get the feeling "I could fight here, and it would be complicated, but I might get a huge advantage, or I could just play safe and keep my advantage."
 Even kyus like myself play more conservatively when winning. Alphago just takes this to the extreme but it is common on most levels of play (from single digit kyus to the best professional players).
 Are there adversarial examples for Alpha Go Zero?
 It seems like it would be drastically harder to attack a Go-playing network compared to an image classifier, just because there’s so much less freedom in input. With an image, you can vary each pixel independently however you want. In a game of Go, on any given turn there are probably only a handful of moves that don’t amount to throwing away territory, which would probably be harmful enough to offset any gain by ‘confusing’ the AI. Not only that. the list of suitable moves depends on the AI’s choices in addition to its opponent’s, making the state space highly discontinuous. Who knows, though… maybe there’s some vulnerability so severe that you can spend half the game wasting your moves and win anyway, because the AI gets so confused that it stops playing with even basic competence. I guess that image classifiers giving 99.9% confidence to the wrong class also seems pretty ‘incompetent’… and it would be very cool to see.Another obstacle is that whereas an image classifier is basically a purely learned function, Monte Carlo tree search has a hard-coded element of applying potential moves to find new states to explore - always following the game’s rules correctly. If you go more than a few moves ahead, it has a relatively unpredictable choice of what states to explore; but as you get closer to the current move, I think it starts to look a bit more like brute force. So if you want the AI to make a really bad move - one that has severe consequences in the next few turns - you need to not just make the NN think it looks good, but that the game state still looks good even after you make any possible counterplay.Disclaimer: I don’t actually know much about either neural networks or Go.
 Possibly but no one has demonstrated them, and there are probably many fewer in Zero than in previous AlphaGos. Previously AlphaGos suffered from problems which Silver calls 'delusions' (https://www.reddit.com/r/MachineLearning/comments/76xjb5/ama...) ie persistent long-term misevaluations of specific board positions, which were however (entirely? how would we know if not?) fixed by Zero's self-play - arguably these are not 'adversarial examples' in the sense you're thinking of but then I'm not sure what would constitute an adversarial example for a Go agent, so maybe the AlphaGo 'delusions' were the equivalent.(On a side note, the more I think about Anthony et al 2017/AlphaGo Zero's tree/expert iteration, the more beautifully simple it appears to me.)
 What do you mean?AlphaGo Zero has played itself, and they have published 20 of those games.
 the OP is referring to the use of a GAN, which is a different type of setup.GAN's were not used for alphaGo, as the article points out Deepmind uses reinforcement learning and MCTS.
 They're most likely referring to adversarial attacks where degenerate inputs are constructed that could cause AlphaGo Zero to perform sub-optimally or catastrophically fail (see OpenAI Research [0]). This is distinct from generative adversarial networks (GANs) or adversarial self-play (which I guess AlphaGo Zero is an example of).
 The NN played against itself.
 "On a long enough timeline, everything is a discrete game." (With apologies to _Fight Club_)Personally, I look forward to the day when the software I own works for me to the extent of optimizing the decisions I make during the day, even many mundane ones. Properly executed, such a system could make a big difference in my quality of life. I believe that a big piece that is missing is a solid life-model, a life-representation, that can be optimized. Once that is defined, an RNN or MCS can optimize it and I can reap the benefits.
 If it is optimising all your decisions, even the mundane ones, is it really your life any more? If it plugged into your brainstem and took over, by definition nobody else would notice the difference (you always did what it said anyway, or you’d be suboptimal), so why not just flood the 20 watt carbon-based neural network with bliss drugs and let the silicon-based neural network be the person instead?
 I think sometimes it's hard for us to zoom out and solve bigger problems that we might not see. Like, if I don't enjoy the chore I'm doing, I can make it easier, or I can eliminate it. (Paying a bill in person is a good example). Or, maybe there is an object that is generating a lot of work for me, like a house, and there is an alternative object (an apartment). Heck, sometimes it would be nice to have a record of all the things I enjoy and just have an AI path-find to maximize revisiting those things.
 Or just replace the brain completely with the computer, like in Greg Egan's jewel stories. http://will.tip.dhappy.org/blog/Compression%20Trees/.../book...
 You set the destination, it sets the route. Maybe you tell it you feel like you aren't getting enough exercise, or you need more time at night for a friend. It schedules the mundanity around those goals. I would love that.
 Sort of like Surrogates: http://m.imdb.com/title/tt0986263/
 well if you chose and modified the representation of the world yourself, and set it so it would be making decisions based on a heuristic you provided it, then at the end of the day, it is not so different from how our biological circuits work, to the extent we have control over them
 Complete with creepy predictions like "Predicted optimal life path: \$120,000 net worth, 63 year life span".
 Site : https://betterself.io/ GitHub : https://github.com/jeffshek/betterselfI do this similarly but with supplements and medication. I track a lot of my supplements that boost productivity and sleep and disregard anything that's negative. I've been continuing this over the course of a year.I haven't added anything like a RNN to it for recommendations though. Some fears about missing up the gradient descent component scare me :)
 This may interest you, AI as an artificial belief system. http://us1.campaign-archive.com/?u=78cbbb7f2882629a5157fa593...My interpretation is that AI can allow you to rapidly load and unload mental models so that you're free to devote 100% of your mind to what only you (a human) can do.
 Replying late (a day is ancient history in HN time) but I liked your comment a lot. My view is that our belief system is encoded in our behavior, not our words. So an AI could provide us with the first honest measurement of morality, according to our own standards.
 What if everybody uses the same software by a handful of companies, maybe only one? It will be interesting to see how it resolves conflicts between goals of different customers. More interestingly, how does it regards the goals of its company compared to the ones of its customers?Based on human nature and history I don't expect anything good coming from that.
 I don't know why you're downvoted. It's an interesting thing to think about. Personally, I'd be eager to try it out for a year or so. I set the goal, my computer tells me what to do.
 I seem to get a lot of upvotes, and then a lot of downvotes all in a flurry. I'd be curious to look at the data, to be honest, to see if it's a cohort. (No idea why they'd target me in particular)
 What makes this different from a minimax algorithm with alpha-beta pruning?
 Aside from MCTS being a different tree search method, there is no 'closing of the loop'. In regular MCTS, it is far from unheard of to do the random playouts with instead some 'heavier' heuristic to make the playouts a little better estimators of the node value, but the heavy playouts do not do any kind of learning, the heuristic you start with is what you end with; what makes this analogous to policy iteration (hence the names for the Zero algorithm of 'tree iteration' or 'expert iteration') is that the refined estimates from the multiple heavy playouts are then used to improve the heavy playout heuristic (ie. a NN which can be optimized via backpropagation in a supervised learning of board position -> value). Then in a self-play setting, the MCTS continually refines its heavy heuristic (the NN) until it's so good that the NN+MCTS is superhuman. Then at play time you can drop the MCTS entirely and just use the heavy heuristic to do a very simple tree search to choose a move (which I think might actually be a minimax with a fixed depth but I forget).
 The branching factor for games like Go is too high. Alpha Go Zero uses a stochastic policy to focus its attention on useful branches. This policy is optimized alongside the predicted value for each node.
 Minimax algorithm with alpha-beta pruning is an exhaustive search algorithm. MCTS is a probabilistic algorithm that optimises for converging faster for a partial "good enough" result.Introducing MCTS for Go playing programs in 2006 brought upon a revolution in playing strength. For the first time Go programs stood a chance against a dan-level amateur.
 I wonder how AlphaGo Zero would fare against the others if they were all using the same search algorithm, and I wonder how the search depth vs breadth changes in Zero compared to earlier variants.
 Mageek, any reason why they haven't applied this to chess yet?
 MCTS has done very poorly on chess compared to alpha-beta. Chess has a high number of forcing moves in capture sequences, and it's been very difficult to evaluate positions that are not settled. Traditionally an algorithm called a quiescence search is used, but it relies on doing an evaluation at each node of the search, which would be prohibitive with the latency for a network evaluation.One of the things that amazed me the most about AlphaGo Zero was that they didn't do any tricks to minimize the latency of the network evaluation!Still, it's certainly worth a try, I'd be extremely interested to see what style of chess a self-trained MCTS chess version would have :).
 The founder of DeepMind was an internationally ranked chess prodigy, so that's a good question.
 Can this technique be used to write a strong chess engine?
 Yes. As described, all that’s needed is a way to imagine all possible moves from a game state, and check if a game state corresponds with win/tie/loss. That is possible in Tic-Tac-Toe, Chess, and Go.
 Surprisingly, they haven't done it though. Wonder why. Is it because of the number of different state changes that can occur from a particular position? Maybe Go is easier to solve than chess.It would be very interesting to see a computer improve on chess openings.
 First of all, Go is not 'solved'. While the AI plays better than the best humans, it does not play perfectly (nor will any AI program in the forseeable future).If by 'solving' you mean, developing a strong AI program to play chess, I think the reason they haven't done it because chess is easier to solve than Go. The best computer chess engines are already a lot better than the best humans. Thus, there is not much to gain for Google by finding a new approach.There have been attempts to use similar approaches for chess, but they al seemed to perform worse than the traditional approach for chess. (Alpha-beta search with a hand-written evaluation function and null-move pruning).The differences between Go and Chess that might explain this are:- Chess has a smaller branching factor (the number of legal moves is around 40 in a typical position in chess, whereas it's over 100 in a typical position for Go)- A chess position is easier to evaluate. Simply counting the material already gives a decent evaluation in a lot of cases. This supplemented by hand-written rules gives a solid evaluation of a position- The average game of chess is shorter. A game of chess lasts between 20 and 80 moves, whereas a game of Go lasts around 200 games. This makes it a lot more feasible to extend the search far enough to oversee the consequences of each move in chess when compared to Go
 The Giraffe chess engine?https://chessprogramming.wikispaces.com/Giraffe
 No doubt it will be very strong by human standards, but it won't be competitive with current top programs based on alpha beta (currently duking it out in the unofficial engine world championship at http://tcec.chessdom.com/live.php).Still, it would be interesting to see if can surpass the human world champion...
 As someone who has played many a game of Tic-Tac-Toe, I found the numerical examples really hard to follow. s(0,5) is obviously the winning move for the X player, but for some reason all examples seem to favor s(0,1).
 Winning move how? With which board?with`````` O__ _X_ O_X `````` Then X can't win in the next move and must choose (0,1) between the two O's the left column in order to not lose.
 D'oh, thanks. I suppose I just misread the oh's.

Applications are open for YC Summer 2019

Search: