
Paper Summary: "Dota 2 with Large Scale Deep Reinforcement Learning"

Dickson Wu
Jun 11, 2021 · 11 min read


Abstract:

  • OpenAI Five was the first AI that beat world champions at an esports game.

  • Dota 2 = super hard for RL: long time horizons, imperfect information, and complex, continuous state-action spaces

  • Scaled-up learning was the most important ingredient - the system digested batches of ~2 million frames every 2 seconds

  • Trained for 10 months through self-play reinforcement learning



Introduction:

  • Fundamental goal of ML = solve real world problems

  • Games = stepping stones for RL (Backgammon, Chess, Atari, Go, StarCraft, Minecraft).

  • RL is also used for robotic manipulation + text summarization

  • What's exciting about Dota 2 is that it's much closer to reality: it's complex and continuous in nature (compared to Chess + Go)

  • Dota 2 = super hard for RL: long time horizons, imperfect information, and complex, continuous state-action spaces

  • They were able to beat humans - the main key was scaling RL to giant levels: thousands of GPUs and many months of training

  • Results:

    • Beat world champions

    • Won 99.4% of over 7,000 games played against the public

  • One challenge was that the environment + code changed constantly, and they had to make sure the agent didn't have to restart training --> overcome with a technique they called surgery

    • Over the 10 months, they performed surgery roughly once every two weeks

    • This is actually super useful outside of Dota: in the real world, new requirements come up and agents have to learn to adapt without starting over



Dota 2:

  • Dota 2 in a nutshell: a 5v5 game in which two teams of heroes defend their own base and try to destroy the enemy team's Ancient, gathering gold and experience along the way

  • Challenges:

    • Long time horizons:

      • The game runs at 30 frames per second for about 45 minutes

      • OpenAI Five acts every 4th frame --> roughly 20,000 steps per episode (30 frames/s × ~45 min ÷ 4)

      • Chess, in comparison, only lasts around 80 moves (Go around 150)

    • Partially observed state:

      • You can only see some parts of the game (the part of the map that your team can see)

      • So you need to act on incomplete information and predict the opponents' behavior

    • High-dimensional action and observation spaces:

      • There are just a ton of objects in the game; OpenAI Five observes ~16,000 values at each time step

      • There are also tons of actions: between 8,000 and 80,000 valid actions per step

      • Chess ≈ 1,000 values per observation with ~35 valid actions; Go ≈ 6,000 values with ~250 actions

  • So OpenAI also imposed some limitations to keep the problem manageable:

    • Only a pool of 17 heroes out of the 117 heroes

    • No support for a player controlling multiple units



Training System:

  • The game runs at 30 frames per second. We make an action every 4 frames

    • At each timestep we get an observation containing the information a human player would see (health, positions, etc.)

    • It then spits out a discrete action

  • Some game mechanics were controlled by hand-scripted logic. It would have been better not to script them, but the system still achieved super-human performance

  • They also added some randomization to the environment itself to make sure that it was robust

  • Policy function takes in a history of observations and spits out the probability distribution of actions

    • 159 million parameters

    • Mostly a single-layer, 4096-unit LSTM


  • We just replicate this policy function for the other 4 heroes

  • The inputs are almost identical across the heroes, since we want each agent to see as much of the available information as possible

  • As for the inputs: they are not the raw frames of the game itself

    • That would require a huge number of parameters

    • Instead we take all the information and encode it in a data array

    • This information would also be available to a human

    • They're okay with this since the focus is strategic planning + high-level decision making rather than visual processing

    • Also, as an additional input, each replica gets a small identity input telling it which hero it controls (a rough sketch of this overall setup is below)

  • Reward function

    • This is the quantity the RL algorithm tries to maximize

    • It reflects the probability of winning, but also includes signals like player deaths and resources, and it exploits the fact that it's a zero-sum multiplayer game (see the sketch below)

  • The policy is trained using PPO (an improved variant of advantage actor-critic methods). They also used Generalized Advantage Estimation (GAE), which helps stabilize and speed up training (see the sketch below)

  • Training it = from self play

    • Optimizer GPUs receive game data --> it goes into experience buffers

    • We pull minibatches from the experience buffer to get gradients

    • Gradients are averaged across the GPU pool using NCCL2 and applied with the Adam optimizer; backpropagation is truncated to 16 timesteps

    • Every 32 gradient steps we push the new update to the central Redis controller (responsible for storing metadata about the system + stopping + restarting games)

    • Rollout machines are the workers that run the self-play games

    • They run at about 50% of real time, since this lets them fit more than 2x as many games on each machine

    • 80% of the games are played against the current policy, 20% against older versions

    • The rollout machines run the game engine, but the policy's forward passes happen on a central pool of GPU machines, which run them in batches of 60. This central pool of GPUs pulls updated parameters from the controller

    • All of this happens asynchronously (see the sketch below)


  • Surgery:

  • The code and environment had to change because:

    • OpenAI changed things (rewards, observations, the policy network, etc.) as they learned what helped training

    • At first they restricted the game mechanics that the agent could do. Then incrementally they released more and more parts

    • Also the game was being updated

  • When these change, the old model still contains a lot of useful knowledge, but it's hard to transfer over

  • There was no way they were going to train from scratch after every change --> they had to be able to continue training

  • Surgery = tools to take the old model and turn it into a new model that was compatible with the new environment.

  • The new model starts with the same performance, and training just continues as usual

  • [how the surgery works = in the Appendix... I'll be reading that after the conclusion...]



Experiments and Evaluation:

  • Batch size was scaled up to between 1 and 3 million timesteps. The model had over 150 million parameters, and it was trained for 180 days (spread over 10 months because of restarts and reverts)

  • Compared to AlphaGo: 50-150X bigger batch size, 20X larger model, 25X longer training time

  • Human evaluation:

    • Along the way they competed against varying levels of Dota players

    • At the end they were able to beat the top players in the world

    • To test whether there was some way to game the system, it played over 7,000 games against the public; only 42 games (against 29 different teams) were lost

    • Human dexterity was measured via reaction time: OpenAI Five ≈ 217 ms, while humans ≈ 250 ms

    • Another way of evaluating was playing against bots with known TrueSkill ratings

    • Playstyle is hard to analyze, but it was shaped by the reward function. Human observers noted that it has a distinct playstyle, with both similarities to and differences from human play

      • Early in training it favored big group fights --> quick wins in ~20 minutes, but if the other team avoided those fights, OpenAI Five would get crushed because it hadn't farmed resources

      • Later it played more like humans while keeping those characteristics: using group battles for pressure, and gathering resources when behind

      • Big differences: OpenAI Five moved heroes across lanes often, wasn't afraid to fight at low health, and freely used resources + abilities with long cooldowns

  • Validating Surgery:

    • For 2 months they ran a second agent, trained from scratch on the final environment

    • No surgeries were performed on it, so it ran for longer without surgery than any other agent had

    • If they had had the final environment from the start, they obviously would have trained on it the whole time

    • This model surpassed OpenAI Five's skill --> a 98% win rate against it

    • Surgery let them avoid training from scratch, but the resulting model was weaker than one trained from scratch

    • They stopped the from-scratch model at that point, but its skill was likely still going to increase

    • Surgery is far from perfect, but it's a good start

  • Batch Size

    • If we increase the batch size by 2x --> we need 2x the number of optimizer GPUs + 2x the number of rollout machines to produce the batches in the first place

    • The potential benefit is halving the time it takes to reach a given level of TrueSkill

    • But it's not exactly 1:1 - the measured speedup is less than linear

  • Data Quality:

    • Our rollout machines and the optimizers act asynchronously.

    • The rollout machines play for a little while, dump the data to the optimizers, then pick up the new parameters

    • If they waited for whole episodes to finish before sending data, the data would be so stale that the gradients would be useless or even destructive

    • This is where data quality comes in: the shorter the interval between generating data and training on it, the "fresher" the data. They optimized for this (sending data roughly every 30 seconds of gameplay)

    • Also, in the experience buffers, they wanted to avoid the same samples being reused too often; the number of times each sample is used was also kept low (they aimed for about 1)

    • Data quality = super important, since its effect compounds over the entire training period (the two metrics are sketched below)

  • Long term credit assignment

    • They played around with the discount factor and how it affects long-term rewards (more specifically, they wanted to find the time horizon the agent takes into account; the conversion is sketched below)

    • Horizons of 6-12 minutes worked best - the longer the time horizon, the better



Related work:

  • Lots of people in the past have used games as a testbed for learning (Backgammon, Checkers, Chess)

  • Self-play has been used for Poker too (except here they mostly play against the most recent agent instead of an average of past strategies)

  • It also builds off of AlphaGo (and its successors); the later iterations forgo imitation learning and rely on self-play with Monte-Carlo Tree Search (which OpenAI Five does not use)

  • AlphaStar especially: it was trained to play StarCraft 2, and the two systems are very similar in terms of network and training. One cool thing AlphaStar did was give the critic network full information about the game state, which allowed it to be a better critic

  • Other related work tackles similarly high-complexity data and environments

  • It also builds on distributed training systems, along with PPO and A3C

  • The surgery method was inspired by Net2Net: growing a model without affecting its performance. There are other methods that help with this kind of function-preserving transfer too



Conclusion:

  • RL is good enough to beat world-class players at esports games

  • Key ingredient = more computation power.

  • Surgery technique = helped them continue training for a long time

  • These results should also be applicable to other team-based zero-sum games

  • As complexity increases in the future, more computing is super important



Surgery:

  • We have to do this for changes in architecture, environment, observation space and action space!

  • Changing the architecture

    • Let's say we have a fully connected network: We're just adding more units

    • In one layer we keep the original parameters, and beside them we add randomly initialized values for the new units

    • In the next layer we keep the old weights, but the weights coming out of the new units are set to 0

    • The zeros kill off the contribution of the random parameters --> so we get exactly the same performance as before

    • But as training continues, those random parameters become useful and start changing, and the zeros move away from 0 towards better values

    • For the LSTM (a recurrent network) this exact trick doesn't work, so instead the new parameters are initialized randomly but small - small enough to have no measurable effect on TrueSkill (the fully connected case is sketched below)

  • Observation Space:

    • So we don't actually map inputs directly to the network

    • Inputs are first passed through an encoding layer

    • So when we add more inputs, we're just enriching the encoding layer with extra features

    • The same zero-initialization trick is applied in the encoding function, so new inputs start out with no effect (see the sketch below)

  • Changing the environment and action space

    • For game version changes it actually adapted quite well

    • But they implemented annealing, where the fraction of rollout games played on the new version/action set slowly creeps from 0% to 100%

    • This allows them to continue having the skill, without having to relearn everything

    • Also, if TrueSkill drops too much during the ramp, they just restart with a slower anneal rate (a toy schedule is sketched below)

  • Removing Model parts:

    • They weren't able to do this

  • Smooth Training Restart:

    • They also wanted to make sure that games played with old parameters don't contribute updates, so they set the learning rate to 0 for a couple of hours after a restart

    • This lets the updated version roll out to all the machines

    • They also have to update all the opponents (since they were all still playing the old version)

  • Benefits of Surgery:

    • Let them have a tighter iteration loop

    • Allowed them to finish the project without needing to restart all the time

    • Allowed them to add a ton of minor details without fuss

  • [Figure in the original post: a list of all the major surgeries performed]


