Paper Summary: "Mastering the game of Go with deep neural networks and tree search"
TL;DR + Importance: AlphaGo achieves a major leap in Go-playing performance (a feat thought to be a decade away). They do this with a unique combination of supervised and reinforcement learning on both policy and value networks. They take these networks, integrate them with MCTS, bring it all to scale, and achieve professional human performance!
Go = really hard for ML because of its giant search space + it's hard to evaluate positions and moves. Here, value networks assign values to board positions, and policy networks choose the moves.
We train them with supervised learning (on expert human games) + RL (through self-play). With zero lookahead search, these networks can beat SOTA Monte Carlo tree search programs (which do lookahead). The full system hits a 99.8% win rate against other Go programs + defeated the European Go Champion 5-0.
Go is a game of perfect information, thus theoretically there's always an optimum action to be played. How do we determine that action? Well, we could literally simulate all the games that are played and pick the move that grants us the highest probability of winning.
Doing so gives ~b^d possible game sequences, where b (breadth) = the number of legal moves per position and d (depth) = how long the game goes (number of states). For big games like chess (b ≈ 35, d ≈ 80) and Go (b ≈ 250, d ≈ 150), exhaustive search is impossible. But we can reduce both b and d!
Instead of going through all possible future games, why don't we just get a function v(s) that approximates the optimal value function v*(s) - i.e. how valuable a board state s really is? That lets us truncate the search depth. We can also reduce the breadth: we get a policy function p(a|s) that gives a probability distribution over the possible moves given the state, and only sample moves from it.
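In symbols (following the paper's notation, where v* is the value under perfect play):

$$v(s) \approx v^*(s) \;\;\text{(truncate the depth)}, \qquad a \sim p(a \mid s) \;\;\text{(narrow the breadth)}$$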
Monte Carlo tree search (MCTS) is like Q-learning. We start off by taking random moves and play the game out until it finishes (win or loss). The outcome is then backpropagated up the chain of moves to adjust the probability of taking each move in the future. With enough simulations this converges to optimal play. But in practice it's not optimal Go (only strong amateur level), because these programs are limited to shallow policy + value functions.
AlphaGo uses CNNs for its policy + value functions. We pass in the board; the value network evaluates how good the position is, while the policy network samples actions.
The pipeline = create a supervised (SL) policy network that predicts expert human moves (this kickstarts our RL agent). Its weights are transferred to an RL policy network which plays against itself, and from those self-play games we train a value network which predicts the winner --> AlphaGo = takes the policy + value networks and shoves them into MCTS.
The first stage is supervised learning - a 13-layer CNN with ReLU activations, trained on 30 million positions from the KGS Go Server (stored as state-action pairs). Its goal is to maximize the likelihood of predicting the expert's next move.
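Concretely, the paper trains the SL network p_σ (σ = its weights) by gradient ascent on the log-likelihood of the human move a played in state s:

$$\Delta\sigma \propto \frac{\partial \log p_\sigma(a \mid s)}{\partial \sigma}$$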
They achieved 57.0% accuracy - noticeably better than the previous state of the art (44.4%). And even small improvements in accuracy translate into dramatic improvements in playing strength.
The RL policy network has the exact same structure as the SL policy network (we transfer the learned weights). We then have the policy network play against a random sample of previous generations of itself - this stabilizes training and prevents overfitting to the current policy.
The reward comes in only at the end - when we win or lose the game. We train the network to maximize the expected outcome.
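This is the paper's REINFORCE-style policy gradient update (ρ = RL network weights, z_t = +1 for a win and -1 for a loss):

$$\Delta\rho \propto \frac{\partial \log p_\rho(a_t \mid s_t)}{\partial \rho}\, z_t$$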
When it faces off against the old SL policy network - it wins over 80% of the time. And when played against the strongest open-source Go program (Pachi) - it wins 85% of the time (the previous SOTA based on supervised CNNs won only 11%).
The value network estimates the true value function (aka the true probability of winning from a given state), assuming both players follow some policy p.
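Formally, that's the expected final outcome z when play from state s follows p:

$$v^{p}(s) = \mathbb{E}\big[\, z_t \mid s_t = s,\; a_{t\ldots T} \sim p \,\big]$$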
Ideally we'd want the value under perfect play, but in practice we estimate the value function of our strongest policy (the RL policy p_ρ) - and we approximate even that with a network v_θ.
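So the chain of approximations ends up looking like this:

$$v_\theta(s) \approx v^{p_\rho}(s) \approx v^*(s)$$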
The network has the same architecture as the policy network - except we swap the head for one with a single output (since it only predicts the probability of winning). Now we just train the value approximator on state-outcome pairs (s, z) and get it to minimize the MSE between the prediction and the ground-truth outcome.
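The resulting gradient step, as given in the paper (θ = value network weights):

$$\Delta\theta \propto \frac{\partial v_\theta(s)}{\partial \theta}\,\big(z - v_\theta(s)\big)$$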
We don't want to train on the original KGS dataset, because successive positions within a game are strongly correlated and we end up overfitting like crazy. Instead we build the dataset from self-play games (sampling positions from distinct games) - and we end up with minimal overfitting. The result is a value function that approaches the accuracy of Monte Carlo rollouts with the RL policy while being 15,000 times less computationally expensive.
AlphaGo uses MCTS. Each edge (s, a) of the tree stores a few things: an action value ( Q(s,a) ), a visit count ( N(s,a) ), and a prior probability ( P(s,a) ).
We start at the root of the tree and descend it, selecting an action at each step.
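At each timestep t of the descent, the paper's selection rule is:

$$a_t = \arg\max_a \big(\, Q(s_t, a) + u(s_t, a) \,\big)$$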
We select the action that yields the highest value (Q) plus a bonus u(s, a).
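The bonus, per the paper:

$$u(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}$$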
This bonus depends on the prior probability but decays as we visit it more (which encourages exploration).
Now if we hit a leaf node, meaning there isn't anything below it, we can expand it. We do this by feeding the leaf state into the SL policy network and storing its output probabilities as the prior P(s, a) for each legal move from that state.
Now it's time to evaluate how good the leaf node is. We evaluate it by doing 2 things: First, we shove the leaf state into the value network. Second, we run a fast rollout (aka we get a lightweight rollout policy to go BRRRRRR and quickly play out the rest of the game) and take its outcome. Then we combine the two with a mixing parameter to adjust the weight of each one.
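The combined leaf evaluation from the paper (v_θ = value network output, z_L = rollout outcome, λ = mixing parameter):

$$V(s_L) = (1 - \lambda)\, v_\theta(s_L) + \lambda\, z_L$$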
Finally we update the action values and visit counts of all the edges we traversed on the way down.
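Each traversed edge gets its visit count incremented, and its action value becomes the mean evaluation over all n simulations that passed through it (here 1(s, a, i) indicates whether simulation i traversed edge (s, a)):

$$N(s, a) = \sum_{i=1}^{n} \mathbf{1}(s, a, i), \qquad Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{n} \mathbf{1}(s, a, i)\, V\big(s_L^{\,i}\big)$$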
The reason we use the SL policy network to expand leaf nodes is that SL mirrors human play more closely, so it gives a diverse beam of promising moves. But the value function derived from the RL policy is stronger than one derived from the SL policy. [This search requires a ton of computation, so AlphaGo runs it across tons and tons of CPUs and GPUs.]
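To tie the search steps together, here's a minimal Python sketch of one simulation (select -> expand -> evaluate -> backup). Everything here is illustrative, not the paper's implementation: sl_policy, value_net, rollout, next_state and legal_moves are placeholder callables, the exploration constant c is made up, and the sketch ignores flipping the value's sign between the two players.

```python
class Edge:
    """Statistics AlphaGo's search stores for each (state, action) edge."""
    def __init__(self, prior):
        self.P = prior   # prior probability from the SL policy network
        self.N = 0       # visit count
        self.W = 0.0     # total value accumulated over simulations
        self.Q = 0.0     # mean action value, W / N


class Node:
    """A search-tree node holding a board state and its outgoing edges."""
    def __init__(self, state):
        self.state = state
        self.edges = {}     # action -> Edge
        self.children = {}  # action -> Node


def simulate(root, sl_policy, value_net, rollout, next_state, legal_moves,
             lam=0.5, c=5.0):
    """Run one MCTS simulation: selection -> expansion -> evaluation -> backup."""
    node, path = root, []

    # 1. Selection: descend while the node is already expanded, picking
    #    argmax_a Q(s, a) + u(s, a), with u proportional to P / (1 + N).
    while node.edges:
        action = max(node.edges,
                     key=lambda a: node.edges[a].Q
                     + c * node.edges[a].P / (1 + node.edges[a].N))
        path.append((node, action))
        if action not in node.children:
            node.children[action] = Node(next_state(node.state, action))
        node = node.children[action]

    # 2. Expansion: the SL policy network gives the priors for the leaf's moves.
    for move, prior in sl_policy(node.state, legal_moves(node.state)).items():
        node.edges[move] = Edge(prior)

    # 3. Evaluation: mix the value network's estimate with a fast rollout outcome.
    v = (1 - lam) * value_net(node.state) + lam * rollout(node.state)

    # 4. Backup: update visit counts and mean action values along the path.
    for parent, action in path:
        edge = parent.edges[action]
        edge.N += 1
        edge.W += v
        edge.Q = edge.W / edge.N
    return v
```

After running many such simulations, the move actually played is the one with the highest visit count at the root (that's how the paper picks moves).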
First, they get it to play a ton of games against other programs - it beat them 494/495 (99.8% win rate). They also handicapped AlphaGo (giving free moves to the other program), and it still won 77% to 99% of the time, depending on the opponent's strength.
There's also a distributed version of AlphaGo (which beats the single-machine AlphaGo 77% of the time). The distributed version won 100% of its games against other programs.
They also ran AlphaGo with different settings of the mixing parameter: have it only look at the value network, or only look at the rollouts. Using just the value network (no rollouts), it still beats all other Go programs. But the best setting is half-half, which wins ≥95% of games against the other AlphaGo variants -> meaning both evaluation methods play an important role.
Additionally, this version of AlphaGo beat Fan Hui, the European Go champion, 5-0. This feat was believed to be a decade away, but DeepMind did it!
AlphaGo achieved a task that was deemed one of the grand challenges of ML. They did this through a new combination of both supervised and reinforcement learning. These networks were then combined with MCTS, at scale.
This achievement is much like Deep Blue defeating Kasparov, except AlphaGo evaluated thousands of times fewer positions --> thanks to its policy and value networks. Plus Deep Blue relied on handcrafted evaluation functions, while AlphaGo's networks were trained with general-purpose SL + RL.
This achievement opens doors in other areas that previously seemed impossible. Plus, we can reuse these learnings and methods in other areas of ML!