The entrypoint function muzero is passed a MuZeroConfig object, which stores important information about the parameterisation of the run, such as action_space_size (the number of possible actions) and num_actors (the number of parallel game simulations to spin up). We'll go through these parameters in more detail as we encounter them in other functions.
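For concreteness, a minimal sketch of such a config object might look like this (the field names follow the pseudocode released alongside the MuZero paper; most parameters are omitted and the chess-style values shown are illustrative):

```python
# A pared-down sketch of MuZeroConfig: only the fields mentioned so far.
class MuZeroConfig:
    def __init__(self, action_space_size: int, num_actors: int,
                 window_size: int = int(1e6), batch_size: int = 2048):
        self.action_space_size = action_space_size  # number of possible actions
        self.num_actors = num_actors                # parallel self-play jobs
        self.window_size = window_size              # max games kept in the replay buffer
        self.batch_size = batch_size                # training batch size


# An illustrative chess-like configuration (4,672 is the AlphaZero-style
# move encoding for chess):
config = MuZeroConfig(action_space_size=4672, num_actors=3000)
```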
At a high level, there are two independent parts to the MuZero algorithm: self-play (creating game data) and training (producing improved versions of the neural network). The SharedStorage and ReplayBuffer objects can be accessed by both halves of the algorithm and store neural network versions and game data respectively.
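The overall shape of the entrypoint can be sketched as follows (the stubs here are hypothetical placeholders so the snippet runs on its own; the real run_selfplay and train_network loop indefinitely and are covered in the sections below):

```python
import threading

# Hypothetical stubs standing in for the components described below.
class SharedStorage:
    pass

class ReplayBuffer:
    def __init__(self, config):
        self.config = config

def run_selfplay(config, storage, replay_buffer):
    pass  # really: loop forever, playing games and saving them to the buffer

def train_network(config, storage, replay_buffer):
    pass  # really: sample batches from the buffer and update the network

def muzero(config):
    storage = SharedStorage()
    replay_buffer = ReplayBuffer(config)
    # Self-play half: num_actors independent jobs create game data.
    for _ in range(config.num_actors):
        threading.Thread(target=run_selfplay,
                         args=(config, storage, replay_buffer)).start()
    # Training half: produce improved network versions from that data.
    train_network(config, storage, replay_buffer)
    return storage
```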
Shared Storage and the Replay Buffer
The SharedStorage object contains methods for saving a version of the neural network and retrieving the latest neural network from the store.
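In sketch form, following the structure of the paper's pseudocode (make_uniform_network here is a placeholder for the untrained starting network):

```python
def make_uniform_network():
    # Placeholder: the real version returns an untrained network that
    # outputs a uniform policy and a value of zero.
    return "uniform_network"

class SharedStorage:
    """Stores versions of the neural network, keyed by training step."""

    def __init__(self):
        self._networks = {}

    def latest_network(self):
        if self._networks:
            # Highest training step = most recent network.
            return self._networks[max(self._networks)]
        return make_uniform_network()  # before any training has happened

    def save_network(self, step, network):
        self._networks[step] = network
```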
We also need a ReplayBuffer to store data from previous games. This takes the following form:
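A trimmed sketch of the buffer, modelled on the pseudocode released with the paper (the batch-sampling logic used for training is omitted here):

```python
class ReplayBuffer:
    def __init__(self, config):
        self.window_size = config.window_size  # maximum number of games kept
        self.batch_size = config.batch_size    # used later when sampling for training
        self.buffer = []

    def save_game(self, game):
        # Once full, drop the oldest game to make room for the newest.
        if len(self.buffer) >= self.window_size:
            self.buffer.pop(0)
        self.buffer.append(game)
```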
Notice how the window_size parameter limits the maximum number of games stored in the buffer. In MuZero, this is set to 1,000,000, so the buffer holds only the most recent million games.
After creating the shared storage and replay buffer, MuZero launches num_actors parallel game environments that run independently; for chess, num_actors is set to 3000. Each runs a function run_selfplay that grabs the latest version of the network from the store, plays a game with it (play_game) and saves the game data to the shared buffer.
In summary, MuZero plays thousands of games against itself, saves these to a buffer and then trains itself on data from those games. So far, this is no different from AlphaZero.
To end Part 1, we will cover one of the key differences between AlphaZero and MuZero — why does MuZero have three neural networks, whereas AlphaZero only has one?
The 3 Neural Networks of MuZero
Both AlphaZero and MuZero utilise a technique known as Monte Carlo Tree Search (MCTS) to select the next best move.
The idea is that in order to select the next best move, it makes sense to ‘play out’ likely future scenarios from the current position, evaluate their value using a neural network and choose the action that maximises the future expected value. This seems to be what we humans do in our heads when playing chess, and the AI is designed to make use of the same technique.
However, MuZero has a problem. As it doesn’t know the rules of the game, it has no idea how a given action will affect the game state, so it cannot imagine future scenarios in the MCTS. It doesn’t even know how to work out what moves are legal from a given position, or whether one side has won.
The stunning development in the MuZero paper is to show that this doesn’t matter. MuZero learns how to play the game by creating a dynamic model of the environment within its own imagination and optimising within this model.
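To make this concrete, here is a toy sketch of the three functions MuZero learns: a representation function h (observation to hidden state), a dynamics function g (hidden state plus action to next hidden state; the real version also predicts a reward, omitted here) and a prediction function f (hidden state to policy and value). The linear stand-ins and sizes below are purely illustrative; the real versions are deep neural networks.

```python
import numpy as np

# Illustrative sizes only; the real networks are large and learned.
OBS_DIM, HIDDEN_DIM, NUM_ACTIONS = 10, 8, 4
rng = np.random.default_rng(0)
W_h = rng.standard_normal((OBS_DIM, HIDDEN_DIM)) * 0.1
W_g = rng.standard_normal((HIDDEN_DIM + NUM_ACTIONS, HIDDEN_DIM)) * 0.1
W_f = rng.standard_normal((HIDDEN_DIM, NUM_ACTIONS + 1)) * 0.1

def representation(observation):
    """h: raw observation -> initial hidden state."""
    return np.tanh(observation @ W_h)

def dynamics(state, action):
    """g: (hidden state, action) -> next hidden state."""
    one_hot = np.eye(NUM_ACTIONS)[action]
    return np.tanh(np.concatenate([state, one_hot]) @ W_g)

def prediction(state):
    """f: hidden state -> (policy over actions, value estimate)."""
    out = state @ W_f
    logits, value = out[:NUM_ACTIONS], out[-1]
    policy = np.exp(logits) / np.exp(logits).sum()
    return policy, value

# MuZero can now 'imagine' a step ahead without touching the real environment:
s0 = representation(np.ones(OBS_DIM))   # encode the current position
s1 = dynamics(s0, action=2)             # predict the state after an action
policy, value = prediction(s1)          # evaluate that imagined state
```

Crucially, the hidden states need not correspond to real board positions at all; MuZero is free to learn whatever internal representation makes its value and policy predictions accurate.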
The diagram below shows a comparison between the MCTS processes in AlphaZero and MuZero: