Paper Summary: "Dota 2 with Large Scale Deep Reinforcement Learning"
Abstract:
OpenAI Five was the first AI that beat world champions at an esports game.
Dota 2 = super hard for RL: long time horizons, imperfect information, and complex, continuous state-action spaces
Scaling up existing RL = the most important ingredient - the system learned from batches of roughly 2 million frames every 2 seconds
Trained for 10 months through self-play reinforcement learning
Introduction:
Fundamental goal of ML = solve real world problems
Games = stepping stones for RL (Backgammon, Chess, Atari, Go, StarCraft, Minecraft).
RL is also used for robotic manipulation + text summarization
What's exciting about Dota 2 is that it's much closer to reality: it's complex and continuous (compared to chess and Go)
Dota 2 is a brutal RL problem for the same reasons mentioned above: long time horizons, imperfect information, complex continuous state-action spaces
They were able to beat humans - the main key: scale existing RL up to unprecedented levels, with thousands of GPUs and many months of training
Results:
Beat world champions
Won 99.4% of over 7,000 games
One challenge: the environment and code changed constantly, and they had to make sure the agent didn't have to restart training each time --> overcame this with a technique they called surgery
Over the 10 months, they performed surgery about once every two weeks
This is actually super useful outside of Dota: real-world problems keep changing, and agents need to adapt without starting over
Dota 2:
[Summarize the game itself]
Challenges:
Long time horizons:
The game runs at 30 frames per second for about 45 minutes on average
OpenAI Five acts every 4th frame --> roughly 20,000 steps per episode
Chess games, in comparison, only last around 80 moves and Go games around 150
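Quick sanity check on that ~20,000 figure (my own arithmetic, not from the paper):

```python
# Back-of-the-envelope arithmetic for steps per episode (my own numbers check).
FPS = 30                # game frames per second
GAME_MINUTES = 45       # average game length
FRAMES_PER_ACTION = 4   # the agent acts on every 4th frame

total_frames = FPS * 60 * GAME_MINUTES            # 81,000 frames
steps_per_episode = total_frames // FRAMES_PER_ACTION
print(steps_per_episode)                          # 20,250 ~ 20,000 decision steps
```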
Partially observed state
You can only see some parts of the game (the portions of the map your team has vision of)
So you need to act on incomplete information and predict the opponent's behavior
High Dimensional action and observation spaces
There are just a ton of units and objects in the game: OpenAI Five observes ~16,000 values at each time step
There are also tons of actions: between 8,000 and 80,000 valid actions, depending on the situation
Chess = ~1,000 values per observation and ~35 valid actions; Go = ~6,000 values and ~250 actions
So OpenAI also imposed some limitations to keep things manageable:
Only 17 heroes out of the 117 heroes in the game
No support for a player controlling multiple units
Training System:
The game runs at 30 frames per second, and the agent acts every 4 frames
At each timestep it gets an observation containing the information a human player would see (health, positions, etc.)
It then spits out a discrete action
Some game mechanics were handled by hand-scripted logic. It would have been better not to script them, but superhuman performance was achieved anyway
They also added some randomization to the environment during training to make the agent robust
The policy function takes in a history of observations and spits out a probability distribution over actions
159 million parameters
Mostly a single-layer, 4096-unit LSTM
This policy is replicated for each of the 5 heroes
The inputs to each replica are almost identical, since each agent should get as much of the team's information as possible
As for the inputs: they are not the raw frames of the game
That would require a huge number of parameters just for vision
Instead, all the relevant information is encoded into a data array
This information would also be available to a human player
They're okay with doing this since the focus is strategic planning + high-level decision making rather than video processing
Also, as an additional input, each replica gets a small input telling it which hero it controls
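A minimal sketch of that shape (flat observation array + controlled-hero ID --> single-layer 4096-unit LSTM --> discrete action distribution). Everything except the 4096 LSTM units, the ~16,000-value observation, and the hero-ID input is my own placeholder, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class PolicySketch(nn.Module):
    """Rough shape of the policy described above. Encoder size, embedding size,
    and the single flat action head are placeholders; the real network has many
    structured observation encoders and several action heads."""

    def __init__(self, obs_dim=16000, n_heroes=5, n_actions=8000, lstm_units=4096):
        super().__init__()
        self.hero_embed = nn.Embedding(n_heroes, 64)          # "which hero do I control?"
        self.encoder = nn.Sequential(nn.Linear(obs_dim + 64, 1024), nn.ReLU())
        self.lstm = nn.LSTM(1024, lstm_units, batch_first=True)
        self.action_head = nn.Linear(lstm_units, n_actions)   # logits over discrete actions
        self.value_head = nn.Linear(lstm_units, 1)            # value estimate used by PPO/GAE

    def forward(self, obs, hero_id, state=None):
        # obs: (batch, time, obs_dim) flat data arrays, not pixels; hero_id: (batch,)
        hero = self.hero_embed(hero_id)[:, None, :].expand(-1, obs.shape[1], -1)
        x = self.encoder(torch.cat([obs, hero], dim=-1))
        x, state = self.lstm(x, state)
        return self.action_head(x), self.value_head(x), state
```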
Reward function
This is what the RL algorithm tries to maximize
It mostly reflects the probability of winning, but also includes signals like player deaths and resources, and it exploits the fact that Dota is a zero-sum multiplayer game (rough sketch below)
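The exact reward terms and weights are in the paper's appendix; as a rough illustration of the two ideas mentioned here (mixing each hero's own reward with the team's, and subtracting the enemy team's reward to keep things zero-sum), here is a hedged sketch. The function names and the team_spirit value are mine:

```python
import numpy as np

def _blend(raw, team_spirit):
    """Mix each hero's own partial reward (kills, resources, objectives, ...)
    with the team average ("team spirit")."""
    raw = np.asarray(raw, dtype=float)
    return (1.0 - team_spirit) * raw + team_spirit * raw.mean()

def shaped_reward(own_raw, enemy_raw, team_spirit=0.3):
    """Illustrative only: subtract the enemy team's average blended reward so
    that the two teams' rewards sum to zero (zero-sum game)."""
    return _blend(own_raw, team_spirit) - _blend(enemy_raw, team_spirit).mean()
```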
The policy is trained using PPO (an improved variant of advantage actor-critic), along with Generalized Advantage Estimation (GAE), which helps stabilize and speed up training.
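As a reminder of what GAE actually computes (this is the standard formulation, not code lifted from the paper; the gamma/lambda defaults are illustrative):

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.998, lam=0.95):
    """Standard Generalized Advantage Estimation over one rollout fragment.
    rewards: r_0..r_{T-1}; values: V(s_0)..V(s_{T-1}); last_value: V(s_T)."""
    values = np.append(np.asarray(values, dtype=float), last_value)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        gae = delta + gamma * lam * gae                         # discounted sum of TD errors
        advantages[t] = gae
    return advantages                                           # value targets: advantages + values[:-1]
```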
Training data comes entirely from self-play
Optimizer GPUs receive game data, which goes into experience buffers
Minibatches are pulled from the experience buffer to compute gradients
Gradients are averaged across the GPU pool using NCCL2 allreduce and applied with Adam; the LSTM is backpropagated through time over windows of 16 timesteps
Every 32 gradient steps, the new parameters are pushed to the central Redis controller (responsible for storing metadata about the system and for stopping/restarting games)
Rollout machines are the ones actually playing the games (self-play)
They run the game at around 50% of real-time speed, since this lets them fit more than 2x as many parallel games (higher total throughput)
80% of the time the agent plays against the current policy, 20% of the time against older versions
The rollout machines run the game engine, but the policy itself is evaluated on a central pool of GPU machines that run inference in batches of 60. This central pool of GPUs pulls updated parameters from the controller
It's all done asynchronously
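A very simplified sketch of that data flow: rollout workers push short fragments of experience into a buffer, optimizers pull minibatches and publish fresh parameters every 32 steps. The dict below stands in for the Redis controller, and play_fragment / ppo_loss are hypothetical helpers, so treat this as a diagram in code rather than the real system:

```python
import queue

MINIBATCH_FRAGMENTS = 4                               # placeholder, not the real minibatch size
experience_buffer = queue.Queue()                     # rollout machines push, optimizers pop
controller = {"version": 0, "params": None}           # stand-in for the central Redis controller

def rollout_worker(env, policy, play_fragment):
    """Play at ~half real-time, push ~30 s fragments, then pick up the latest parameters."""
    while True:
        experience_buffer.put(play_fragment(env, policy, seconds=30))
        if controller["params"] is not None:
            policy.load_state_dict(controller["params"])

def optimizer_loop(policy, optimizer, ppo_loss):
    """Pull minibatches, take Adam steps, publish parameters every 32 gradient steps."""
    step = 0
    while True:
        batch = [experience_buffer.get() for _ in range(MINIBATCH_FRAGMENTS)]
        loss = ppo_loss(policy, batch)   # PPO + GAE, with BPTT over 16-timestep windows
        optimizer.zero_grad()
        loss.backward()                  # in the real system gradients are averaged across
        optimizer.step()                 #   the GPU pool with NCCL2 allreduce
        step += 1
        if step % 32 == 0:
            controller["version"] += 1
            controller["params"] = {k: v.detach().clone() for k, v in policy.state_dict().items()}
```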
Surgery:
The code and environment had to change because:
OpenAI changed things (rewards, observations, the policy network, etc.) as they learned what helped training
At first they restricted which game mechanics the agent could use, then incrementally unlocked more and more of them
The game itself was also being updated
When something changes, the old model still contains a lot of useful knowledge, but it's hard to transfer it over directly
There was no way they were going to retrain from scratch every time there was a change --> they needed to be able to continue training
Surgery = tools to take the old model and turn it into a new model compatible with the new environment
They could keep the same performance, and training would just continue as usual
[how the surgery works = in the Appendix... I'll be reading that after the conclusion...]
Experiments and Evaluation:
Batch size was scaled up over training, into the range of roughly 1 to 3 million timesteps. The model has over 150 million parameters, and it was trained for 180 days (spread over 10 months of real time because of restarts and reverts).
Compared to AlphaGo: 50-150X bigger batch size, 20X larger model, 25X longer training time
Human evaluation:
Along the way they competed against varying levels of Dota players
At the end they were able to beat the top players in the world
To test whether there was some way to game the system, the model played over 7,000 public games; only 42 games were lost, to 29 different teams
Human dexterity was accounted for via reaction time: OpenAI Five's reaction time is about 217 ms, while humans' is around 250 ms
Another way of evaluating = playing against reference opponents with known TrueSkill ratings
Playstyle = hard to analyze, but it was shaped by the reward function. Humans have noted that it has a distinct playstyle, with both similarities to and differences from human play
Early in training it went for big group fights --> quick wins in ~20 minutes, but if the other team avoided those fights, OpenAI Five would get crushed because it hadn't farmed resources
Later it played more like humans while keeping those characteristics: using group battles for pressure, and gathering resources when behind
Big differences: OpenAI Five moved heroes across lanes more often, wasn't afraid to fight at low health, and freely used resources and abilities with long cooldowns
Validating Surgery:
For 2 months they ran a second training run from scratch on the final environment
No surgeries were done on it, so it could train uninterrupted for longer than any previous agent
If they'd had the final environment from the start, they obviously would have trained on it the whole time
This from-scratch model surpassed OpenAI Five's skill --> 98% win rate against it
So surgery let them avoid retraining from scratch after every change, but the result was weaker than a model trained from scratch on the final environment
They stopped the from-scratch run at that point, but its skill was likely still increasing
Surgery is far from perfect, but it's a good start
Batch Size
If the batch size is doubled, the number of optimizer GPUs has to double, and the number of rollout machines producing the batches has to double too
The potential benefit: it could halve the time needed to reach a given level of TrueSkill
But it's not exactly 1:1 - the speedup is less than linear
Data Quality:
The rollout machines and the optimizers act asynchronously
The rollout machines play for a little while, send the data to the optimizers, then pick up the new parameters
If they waited for whole episodes to finish, the data would be so stale that the gradients would be useless or even destructive
This is where data quality comes in: the shorter the interval between generating data and using it, the "fresher" it is, so that interval is something to optimize (they aimed for about every 30 seconds)
Also, within the experience buffer, the same samples shouldn't keep getting reused; the number of times each sample is used is also tuned (they aimed for about 1)
Data quality = super important, since its effect compounds over the entire training period
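Roughly, the two quantities being tuned here can be written as follows (my phrasing of the ideas, not the paper's exact definitions):

```python
def staleness(optimizer_version, rollout_version):
    """How many parameter versions old the data is by the time the optimizer uses it.
    Pushing ~30 s fragments (instead of whole episodes) keeps this small, i.e. the data fresh."""
    return optimizer_version - rollout_version

def sample_reuse(samples_consumed_per_sec, samples_produced_per_sec):
    """Average number of times each sample is used for gradient updates.
    They aimed for roughly 1, i.e. each sample contributes to about one update."""
    return samples_consumed_per_sec / samples_produced_per_sec
```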
Long term credit assignment
They played around with the discount factor and how it affects long-term rewards (more specifically, they wanted to find the time horizon the agent effectively takes into account)
They tested horizons between 6 and 12 minutes - the longer the time horizon, the better!
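The horizon and the discount factor gamma are linked by the standard relation horizon ≈ dt / (1 - gamma), where dt ≈ 0.133 s is the real time per agent step (4 frames at 30 fps). A quick conversion (my arithmetic, not the paper's table):

```python
SECONDS_PER_STEP = 4 / 30        # the agent acts every 4th frame at 30 fps (~0.133 s)

def gamma_for_horizon(horizon_minutes):
    """Discount factor whose effective horizon (~ dt / (1 - gamma)) equals the
    given amount of game time."""
    steps = horizon_minutes * 60 / SECONDS_PER_STEP
    return 1 - 1 / steps

print(gamma_for_horizon(6))      # ~0.99963
print(gamma_for_horizon(12))     # ~0.99981
```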
Related work:
Lots of people in the past have used games as a testbed for learning (Backgammon, Checkers, Chess)
Self-play has been used for Poker too! (Except here the agent plays its most recent version instead of an average strategy)
It also builds off AlphaGo (and all its iterations); the later iterations forgo imitation learning of human games and rely on self-play with Monte Carlo Tree Search
AlphaStar especially: it was trained to play StarCraft II, and the two systems are very similar in terms of network and training. The cool thing AlphaStar did was give the critic network full information about the game state during training, which made it a better critic
Other work = for high complexity data
Also related: distributed training systems and RL algorithms like PPO and A3C
The surgery method was inspired by Net2Net: growing a model without affecting its performance. Other methods exist that help do this too!
Conclusion:
RL at scale is good enough to beat world champions at an esports game
Key ingredient = more computational power
Surgery = what let them keep one continuous training run going for the whole time
These results should also be applicable to other team-based zero-sum games
As problem complexity keeps increasing, more computing will keep being important
Surgery:
We have to do this for changes in architecture, environment, observation space and action space!
Changing the architecture
Say we have a fully connected network and we're just adding more units to a layer
In that layer we keep the original parameters, and beside them we add randomly initialized weights for the new units
In the next layer we keep the old weights, and set the weights coming from the new units to 0
The zeros kill off the contribution of the random new units --> so we get exactly the same performance as before
But as training continues, those random units become useful, and the zeroed weights move away from 0 towards better values
For LSTMs we have a recurrent network, so that exact trick doesn't work; instead the new parameters are initialized to random but very small values - small enough to have no measurable effect on TrueSkill
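A minimal sketch of the fully connected case (the zero-init trick above, written out in NumPy; the layer sizes and ReLU are my choices):

```python
import numpy as np

def widen_hidden_layer(W1, b1, W2, extra_units, rng=np.random.default_rng(0)):
    """Grow a hidden layer from h to h + extra_units while keeping the network's
    output identical. W1: (h, d_in), b1: (h,), W2: (d_out, h).
    New rows of W1 are random; the matching new columns of W2 are zero, so the
    new units contribute nothing until training moves those zeros."""
    h, d_in = W1.shape
    new_W1 = np.vstack([W1, 0.01 * rng.standard_normal((extra_units, d_in))])
    new_b1 = np.concatenate([b1, np.zeros(extra_units)])
    new_W2 = np.hstack([W2, np.zeros((W2.shape[0], extra_units))])
    return new_W1, new_b1, new_W2

# sanity check: the network computes exactly the same function after surgery
rng = np.random.default_rng(1)
x = rng.standard_normal(8)
W1, b1, W2 = rng.standard_normal((16, 8)), rng.standard_normal(16), rng.standard_normal((4, 16))
before = W2 @ np.maximum(W1 @ x + b1, 0.0)
nW1, nb1, nW2 = widen_hidden_layer(W1, b1, W2, extra_units=8)
after = nW2 @ np.maximum(nW1 @ x + nb1, 0.0)
assert np.allclose(before, after)
```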
Observation Space:
Inputs aren't mapped directly into the main network
They first pass through an encoding layer
So when new observations are added, we're just enriching the encoder with extra inputs
The same zero-initialization trick is applied to the new weights in the encoding function (small sketch below)
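The same idea for new observations, sketched for a single encoding layer (my framing): the weight columns for the newly added input features start at zero, so old observations encode exactly as before until training starts using the new inputs:

```python
import numpy as np

def extend_encoder_inputs(W_enc, n_new_features):
    """W_enc: (hidden, d_in) weights of the observation encoding layer.
    Append zero columns for the new observation features, so the encoding is
    unchanged until those weights are trained away from zero."""
    return np.hstack([W_enc, np.zeros((W_enc.shape[0], n_new_features))])
```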
Changing the environment and action space
For game version changes, the agent actually adapted quite well
But they implemented annealing, where the fraction of rollout games played on the new version/actions slowly creeps from 0% to 100%
This lets the agent keep its skill without having to relearn everything
Also, if TrueSkill drops too much during the ramp, they just restart with a lower anneal rate
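A toy version of that annealing schedule (the linear shape and the names are my assumption; the paper just describes ramping the fraction of rollouts on the new version from 0% to 100%):

```python
def new_version_fraction(step, anneal_steps):
    """Fraction of rollout games played on the new game version / new action set,
    ramped linearly from 0 to 1 over anneal_steps optimizer steps. If TrueSkill
    drops too much during the ramp, restart with a larger anneal_steps (slower ramp)."""
    return min(1.0, step / anneal_steps)
```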
Removing Model parts:
They can't do it
Smooth Training Restart:
They also want to make sure that data from games played with the old parameters doesn't cause bad updates, so after a surgery they set the learning rate to 0 for a couple of hours
This lets the updated version roll out to all the machines
They also have to update all the opponents in the pool (since they are all still on the old version)
Benefits of Surgery:
Let them have a tighter iteration loop
Allowed them to finish the project without needing to restart all the time
Allowed them to add a ton of minor details without fuss
[Table in the paper: a list of all the major surgeries performed]