How To Build Your Own MuZero AI Using Python (Part 1/3)

By David Foster

The entrypoint function muzero is passed a MuZeroConfig object, which stores important information about the parameterisation of the run, such as the action_space_size (number of possible actions) and num_actors (the number of parallel game simulations to spin up). We’ll go through these parameters in more detail as we encounter them in other functions.
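
As an illustration, a heavily trimmed sketch of such a config object might look like the following (based loosely on the pseudocode released alongside the MuZero paper; the real object carries many more hyperparameters than the two shown here):

class MuZeroConfig(object):
    def __init__(self, action_space_size: int, num_actors: int):
        # Number of possible actions in the environment
        self.action_space_size = action_space_size
        # Number of parallel self-play jobs generating game data
        self.num_actors = num_actors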

At a high level, there are two independent parts to the MuZero algorithm: self-play (creating game data) and training (producing improved versions of the neural network). The SharedStorage and ReplayBuffer objects can be accessed by both halves of the algorithm and store neural network versions and game data respectively.
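
To make that structure concrete, the entrypoint wires these pieces together roughly as follows (a sketch in the spirit of the DeepMind pseudocode; launch_job, run_selfplay and train_network are assumed here, and SharedStorage and ReplayBuffer are covered next):

def muzero(config: MuZeroConfig):
    storage = SharedStorage()             # stores network checkpoints
    replay_buffer = ReplayBuffer(config)  # stores finished games

    # Self-play half: launch num_actors parallel game generators
    for _ in range(config.num_actors):
        launch_job(run_selfplay, config, storage, replay_buffer)

    # Training half: continually improve the network using buffered games
    train_network(config, storage, replay_buffer)

    return storage.latest_network()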

Shared Storage and the Replay Buffer
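
The SharedStorage object is little more than a lookup of network checkpoints keyed by training step, so that the self-play workers can always grab the most recent version of the network. A sketch, close to the open-sourced pseudocode (Network and make_uniform_network are assumed to be defined elsewhere):

class SharedStorage(object):
    def __init__(self):
        self._networks = {}  # maps training step -> network checkpoint

    def latest_network(self):
        if self._networks:
            # Return the checkpoint with the highest training step
            return self._networks[max(self._networks.keys())]
        # No checkpoint yet: fall back to a freshly initialised network
        return make_uniform_network()

    def save_network(self, step: int, network):
        self._networks[step] = network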

We also need a ReplayBuffer to store data from previous games. This takes the following form:
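
A sketch of the parts relevant here, again close to the open-sourced pseudocode (window_size and batch_size are further parameters of the MuZeroConfig; the method that samples training batches from the buffer is omitted for now):

class ReplayBuffer(object):
    def __init__(self, config: MuZeroConfig):
        self.window_size = config.window_size  # maximum number of games kept
        self.batch_size = config.batch_size    # games sampled per training batch
        self.buffer = []

    def save_game(self, game):
        # Once the buffer is full, drop the oldest game before appending
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)
        self.buffer.append(game)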

Notice how the window_size parameter caps the number of games stored in the buffer. In MuZero this is set to 1,000,000, so only the most recent million games are retained.

Self-play (run_selfplay)
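
The self-play half is a simple loop: each actor repeatedly fetches the latest network from shared storage, plays a full game with it, and saves the finished game to the replay buffer. A sketch, following the same pseudocode (play_game, which plays out a single game using MCTS, is assumed here and not shown):

def run_selfplay(config: MuZeroConfig, storage: SharedStorage,
                 replay_buffer: ReplayBuffer):
    while True:
        # Always play with the most recent network checkpoint
        network = storage.latest_network()
        # Play one game to completion, choosing each move with MCTS
        game = play_game(config, network)
        # Store the finished game so it can be sampled during training
        replay_buffer.save_game(game)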

So in summary, MuZero is playing thousands of games against itself, saving these to a buffer and then training itself on data from those games. So far, this is no different to AlphaZero.

To end Part 1, we will cover one of the key differences between AlphaZero and MuZero — why does MuZero have three neural networks, whereas AlphaZero only has one?

The 3 Neural Networks of MuZero

The idea is that, in order to select the best next move, it makes sense to ‘play out’ likely future scenarios from the current position, evaluate them using a neural network, and choose the action that maximises the expected future value. This seems to be what we humans do in our heads when playing chess, and the AI is designed to make use of the same technique.

However, MuZero has a problem. As it doesn’t know the rules of the game, it has no idea how a given action will affect the game state, so it cannot imagine future scenarios in the MCTS. It doesn’t even know how to work out what moves are legal from a given position, or whether one side has won.

The stunning development in the MuZero paper is to show that this doesn’t matter. MuZero learns how to play the game by creating a dynamic model of the environment within its own imagination and optimising within this model.
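
Concretely, where AlphaZero has a single network, MuZero learns three functions. The schematic below is only illustrative, using the naming from the MuZero paper: a representation function that encodes the current observation into a hidden state, a dynamics function that predicts the next hidden state and reward for an imagined action, and a prediction function that outputs a policy and value for a hidden state.

# Schematic only: the three learned functions of MuZero
def representation(observation):       # h: observation -> hidden state
    ...

def dynamics(hidden_state, action):    # g: hidden state, action -> next hidden state, reward
    ...

def prediction(hidden_state):          # f: hidden state -> policy, value
    ...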

The diagram below shows a comparison between the MCTS processes in AlphaZero and MuZero: