deep-reinforcement-learning reinforcement-learning muzero-general

About the Project

Off the back of the parallel APPO SOTA project I have also started training MuZero models. MuZero is the successor to AlphaZero but requires no knowledge of the environment's underlying dynamics. Instead, MuZero learns a model of the environment and uses an internal representation that contains only the information useful for predicting the reward, value, policy and transitions. In this respect MuZero is closely related to value prediction networks.

MuZero is far more sample efficient than PPO and has several other advantages, which have enabled it to reach SOTA performance in a variety of Atari games. To test it out on consumer hardware I used a small T600 laptop GPU, which worked fine but, as you would expect, trained slowly compared to larger GPUs.

Project Aims

This is an earlier experimental project, but the aim is to successfully train fairly complex models on complex games before porting to medical domain tasks, which will require something like a replay buffer to carry out successfully. Sample inefficiency is not an option in that setting.

About the Model

This cartpole model used the default parameters, which can be found in my maintained fork: https://github.com/MattStammers/muzero-experiments

However, I performed hyperparameter tuning, adjusting both the initial learning rate and the discount factor, to achieve a score of 500. Without this change, convergence within 10,000 timesteps was not possible.

edited_hyperparameters = {'lr_init': 0.0155098444309102, 'discount': 0.9908385211252329}
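
As a rough sketch of how these overrides could be applied (assuming the muzero-general interface, where the MuZero class accepts a game name plus an optional dict of config overrides; the exact signature may differ in your version of the repository):

```python
# Sketch only: assumes muzero-general's muzero.py exposes a MuZero class that
# takes a game name and an optional dict overriding the default game config.
from muzero import MuZero

edited_hyperparameters = {
    "lr_init": 0.0155098444309102,   # tuned initial learning rate
    "discount": 0.9908385211252329,  # tuned discount factor
}

# Build the cartpole experiment with the tuned values and start training.
muzero = MuZero("cartpole", edited_hyperparameters)
muzero.train()
```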

A MuZero model trained on the CartPole-v1 environment.

To use this model you will probably be best served by installing the muzero-general repository as above and loading the checkpoint from there while more robust HuggingFace-integrated pipelines are built out.
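
A hedged usage sketch (again assuming the muzero-general interface, where load_model accepts a checkpoint path and test runs rendered evaluation episodes; the checkpoint filename below is illustrative, not the actual file in this repository):

```python
# Sketch only: load a downloaded checkpoint into a muzero-general MuZero
# instance and render an evaluation episode. The path is illustrative.
from muzero import MuZero

muzero = MuZero("cartpole")
muzero.load_model(checkpoint_path="model.checkpoint")

# Play one rendered episode with the trained weights.
muzero.test(render=True)
```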