Hyperparameter Optimization
MuZero General has an asynchronous hyperparameter search method. It uses Nevergrad under the hood to search the hyperparameter space.
By default it optimizes the learning rate and the discount rate. You can edit this parametrization in the `__main__` of muzero.py. More details about the parametrization are available here.
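As an illustration, here is a minimal sketch of what such a parametrization could look like with Nevergrad; the bounds and the optimizer below are placeholders, not necessarily those used in muzero.py.

```python
import nevergrad as ng

# Hypothetical parametrization: search the learning rate on a log scale
# and the discount rate on a bounded scalar. Bounds are illustrative.
parametrization = ng.p.Dict(
    lr_init=ng.p.Log(lower=1e-4, upper=1e-1),
    discount=ng.p.Scalar(lower=0.95, upper=0.9999),
)

# A Nevergrad optimizer proposes candidates with ask() and is informed of
# their score (e.g. the negated mean test reward) with tell().
optimizer = ng.optimizers.OnePlusOne(parametrization=parametrization, budget=20)
candidate = optimizer.ask()
print(candidate.value)  # e.g. {'lr_init': 0.003, 'discount': 0.997}
```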
This page is dedicated to documenting the hyperparameters and gathering advice for tuning them. Feel free to add your own advice, or to discuss it on the Discord server.
Here are some references for AlphaZero hyperparameters which are quite similar to those of MuZero:
- Lessons from AlphaZero (part 3): Parameter Tweaking
- Lessons From Alpha Zero (part 6) — Hyperparameter Tuning
- seed
- max_num_gpus
- Game
- Evaluate
- Self-Play
- Network
- Training
- Replay Buffer
- Adjust the self play / training ratio to avoid over/underfitting
- visit_softmax_temperature_fn
## seed
Seed for numpy, torch and the game.
## max_num_gpus
Fix the maximum number of GPUs to use. It is usually faster to use a single GPU (set it to 1) if it has enough memory. None will use every available GPU.
## Game
### observation_shape
Dimensions of the game observation, must be 3D (channel, height, width). For a 1D array, please reshape it to (1, 1, length of array).
### action_space
Fixed list of all possible actions. You should only edit the length.
### players
List of players. You should only edit the length.
### stacked_observations
Number of previous observations and previous actions to add to the current observation.
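As a quick illustration, here is a hedged sketch of these fields for a hypothetical two-player board game with a 3x6x7 observation and 7 actions; the values are placeholders, not recommendations.

```python
import numpy

# Illustrative values for a hypothetical two-player board game.
observation_shape = (3, 6, 7)     # 3 channels, 6x7 board
action_space = list(range(7))     # 7 possible actions (only the length matters)
players = list(range(2))          # two players
stacked_observations = 0          # no previous observations/actions stacked

# A 1D observation of length n must be reshaped to (1, 1, n):
flat_observation = numpy.zeros(4)                        # e.g. a 4-dimensional state vector
observation = numpy.reshape(flat_observation, (1, 1, 4))
```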
## Evaluate
### muzero_player
Turn on which MuZero begins to play (0: MuZero plays first, 1: MuZero plays second).
### opponent
Hard coded agent that MuZero faces to assess its progress in multiplayer games. It does not influence training. None, "random" or "expert" if implemented in the Game class.
## Self-Play
### num_workers
Number of simultaneous threads self-playing to feed the replay buffer.
### selfplay_on_gpu
True / False. Use the GPU for self-play.
### max_moves
Maximum number of moves allowed if the game has not finished before.
### num_simulations
Number of future moves self-simulated.
### discount
Chronological discount of the reward. Should be set to 1 for board games with a single reward at the end of the game.
### temperature_threshold
Number of moves before dropping the temperature given by visit_softmax_temperature_fn to 0 (i.e. selecting the best action). If None, visit_softmax_temperature_fn is used every time.
## Network
### network
Select the type of network to use: "resnet" / "fullyconnected".
### support_size
Value and reward are scaled (roughly with a square-root transform) and encoded on a vector with a range of -support_size to support_size.
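The "roughly a square-root" scaling refers to the invertible transform from the MuZero paper (Appendix F). A minimal sketch, assuming the usual ε = 0.001:

```python
import math

def scale_value(x, eps=0.001):
    # h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x  (MuZero paper, Appendix F)
    return math.copysign(1, x) * (math.sqrt(abs(x) + 1) - 1) + eps * x

# The scaled scalar is then encoded as a categorical distribution over the
# 2 * support_size + 1 integer bins in [-support_size, support_size],
# splitting the weight between the two nearest bins.
```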
### downsample
Downsample observations before the representation network (see the paper appendix, Network Architecture).
### blocks
Number of blocks in the ResNet.
### channels
Number of channels in the ResNet.
### reduced_channels_reward
Number of channels in the reward head.
### reduced_channels_value
Number of channels in the value head.
### reduced_channels_policy
Number of channels in the policy head.
### resnet_fc_reward_layers
Define the hidden layers in the reward head of the dynamics network.
### resnet_fc_value_layers
Define the hidden layers in the value head of the prediction network.
### resnet_fc_policy_layers
Define the hidden layers in the policy head of the prediction network.
### fc_representation_layers
Define the hidden layers in the representation network.
### fc_dynamics_layers
Define the hidden layers in the dynamics network.
### fc_reward_layers
Define the hidden layers in the reward network.
### fc_value_layers
Define the hidden layers in the value network.
### fc_policy_layers
Define the hidden layers in the policy network.
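For orientation, a hedged sketch of how the two network variants might be parametrized; all numbers are placeholders rather than recommended values.

```python
# Residual network ("resnet"), illustrative values:
blocks = 6                        # residual blocks
channels = 128                    # channels per block
reduced_channels_reward = 2       # channels in the reward head
reduced_channels_value = 2        # channels in the value head
reduced_channels_policy = 4       # channels in the policy head
resnet_fc_reward_layers = [64]    # one hidden layer of 64 units in the reward head
resnet_fc_value_layers = [64]
resnet_fc_policy_layers = [64]

# Fully connected network ("fullyconnected"), illustrative values:
fc_representation_layers = []     # empty list: no hidden layer
fc_dynamics_layers = [64]
fc_reward_layers = [64]
fc_value_layers = []
fc_policy_layers = []
```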
## Training
### results_path
Path to store the model weights and TensorBoard logs.
### training_steps
Total number of training steps (i.e. weight updates, one per batch).
### batch_size
Number of parts of games to train on at each training step.
### checkpoint_interval
Number of training steps before using the model for self-playing.
### value_loss_weight
Scale the value loss to avoid overfitting of the value function, the paper recommends 0.25 (see the paper appendix, Reanalyze).
### train_on_gpu
True / False. Use the GPU for training.
### optimizer
"Adam" or "SGD". The paper uses SGD.
### weight_decay
Coefficient of the L2 weights regularization.
### momentum
Used only if the optimizer is SGD.
### lr_init
Initial learning rate.
### lr_decay_rate
Set it to 1 to use a constant learning rate.
### lr_decay_steps
Number of training steps over which the learning rate is decayed by one factor of lr_decay_rate.
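Together, lr_init, lr_decay_rate and lr_decay_steps define an exponential schedule. A sketch of the formula, with illustrative default values:

```python
def learning_rate(training_step, lr_init=0.003, lr_decay_rate=0.9, lr_decay_steps=10000):
    # Exponential decay: after lr_decay_steps training steps the learning rate
    # has been multiplied by lr_decay_rate once, after 2 * lr_decay_steps twice, etc.
    # With lr_decay_rate = 1 the learning rate stays constant at lr_init.
    return lr_init * lr_decay_rate ** (training_step / lr_decay_steps)
```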
## Replay Buffer
### replay_buffer_size
Number of self-play games to keep in the replay buffer.
### num_unroll_steps
Number of game moves to keep for every batch element.
### td_steps
Number of steps in the future to take into account for calculating the target value. Should be equal to max_moves for board games with a single reward at the end of the game.
### PER
Prioritized Experience Replay (see the paper appendix, Training). Select in priority the elements in the replay buffer which are unexpected for the network.
### use_max_priority
If False, use the n-step TD error as the initial priority. Better for large replay buffers.
### PER_alpha
How much prioritization is used, 0 corresponding to the uniform case; the paper suggests 1.
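A minimal sketch of the effect of PER_alpha on the sampling probabilities, assuming priorities proportional to the value prediction error; the bookkeeping in the actual replay buffer may differ:

```python
import numpy

def sampling_probabilities(priorities, alpha):
    # alpha = 0 gives uniform sampling, alpha = 1 gives fully prioritized sampling.
    scaled = numpy.array(priorities, dtype=float) ** alpha
    return scaled / scaled.sum()

# A surprising position (large value error) is sampled much more often:
print(sampling_probabilities([0.1, 0.1, 2.0], alpha=1))  # ~[0.045, 0.045, 0.91]
print(sampling_probabilities([0.1, 0.1, 2.0], alpha=0))  # uniform: [0.33, 0.33, 0.33]
```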
### use_last_model_value
Use the last model to provide a fresher, stable n-step value (see the paper appendix, Reanalyze).
### reanalyse_on_gpu
True / False. Use the GPU for reanalysing.
## Adjust the self play / training ratio to avoid over/underfitting
### self_play_delay
Number of seconds to wait after each played game.
### training_delay
Number of seconds to wait after each training step.
### ratio
Desired ratio of training steps per self-played step. This makes training roughly equivalent to a synchronous version, so it can take much longer. Set it to None to disable it.
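A hedged sketch of what enforcing such a ratio amounts to; the actual loop in the trainer may be structured differently:

```python
import time

def throttle_training(get_training_step, get_num_played_steps, ratio):
    # Pause training while it runs ahead of self-play by more than `ratio`
    # training steps per self-played step. With ratio=None, never throttle.
    if ratio is None:
        return
    while get_training_step() / max(1, get_num_played_steps()) > ratio:
        time.sleep(0.5)  # self-play keeps running and catches up in the meantime
```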
## visit_softmax_temperature_fn
Parameter to alter the visit count distribution so that the action selection becomes greedier as training progresses. The smaller it is, the more likely the best action (i.e. the one with the highest visit count) is chosen.
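A typical schedule, similar to the ones shipped with the example games (the thresholds and values below are illustrative), together with how the temperature is applied to the root visit counts:

```python
import numpy

def visit_softmax_temperature_fn(trained_steps, training_steps=100000):
    # Greedier action selection as training progresses (illustrative thresholds).
    if trained_steps < 0.5 * training_steps:
        return 1.0
    elif trained_steps < 0.75 * training_steps:
        return 0.5
    else:
        return 0.25

def action_probabilities(visit_counts, temperature):
    counts = numpy.array(visit_counts, dtype=float)
    if temperature == 0:
        # Temperature 0: deterministically pick the most visited action.
        probs = numpy.zeros_like(counts)
        probs[counts.argmax()] = 1.0
        return probs
    counts = counts ** (1 / temperature)
    return counts / counts.sum()
```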