[[Paper]](https://arxiv.org/abs/2110.11222)
Reinforcement learning (RL) experiments have notoriously high variance. We demonstrate that one can optimize directly for lower variance without hurting average performance. Specifically, using a few simple methods, we decrease the variance of the competitive actor-critic agent DRQv2 by a factor of ~3 (across 21 DMC tasks) without decreasing the average reward. This repo contains minimal modifications on top of the DRQv2 codebase to reproduce our results.
To set up, create and activate the conda environment:

```sh
conda env create -f conda_env.yml
conda activate drqv2
```
- To run without any tricks enabled, simply run:
  ```sh
  python train.py task=cheetah_run
  ```
- To use all tricks, run:
  ```sh
  python train.py task=cheetah_run agent.pnorm_critic=True agent.pnorm_actor=True agent.asymmetric_clip=True agent.action_penalty=0.0001 agent.cpc_until=10000
  ```
Note that the action-penalty coefficient and the number of frames for which CPC is used must be set explicitly.
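Since our claim concerns variance across seeds, reproducing it requires multiple runs per configuration. Below is a minimal convenience sketch (not part of the repo) for launching several seeds; it assumes the standard DRQv2 Hydra `seed=` override, so check the repo's config for the exact name:

```python
# launch_seeds.py -- hypothetical helper script, not included in the repo.
# Runs the all-tricks configuration over several seeds; variance can then
# be estimated from the resulting evaluation curves.
import subprocess

FLAGS = [
    "task=cheetah_run",
    "agent.pnorm_critic=True",
    "agent.pnorm_actor=True",
    "agent.asymmetric_clip=True",
    "agent.action_penalty=0.0001",
    "agent.cpc_until=10000",
]

for seed in range(1, 6):
    # `seed=` is the standard DRQv2 Hydra override for the random seed.
    subprocess.run(["python", "train.py", f"seed={seed}", *FLAGS], check=True)
```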
Our proposed methods can be toggled independently with the following flags:

| Method | Flag |
| --- | --- |
| pnorm for critic | `agent.pnorm_critic=True` |
| pnorm for actor | `agent.pnorm_actor=True` |
| asymmetric clip | `agent.asymmetric_clip=True` |
| action penalty | `agent.action_penalty=0.0001` |
| contrastive learning | `agent.cpc_until=10000` |
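For intuition, the action penalty can be thought of as an extra regularizer on the actor: one common form adds a term proportional to the squared action magnitude to the actor loss, discouraging saturated actions. A minimal sketch of that idea (illustrative names only; the repo's exact formulation may differ):

```python
import torch

def penalized_actor_loss(actor_loss: torch.Tensor,
                         action: torch.Tensor,
                         penalty_coef: float = 1e-4) -> torch.Tensor:
    """Illustrative sketch: add an L2 penalty on actions to the actor loss.

    `penalty_coef` plays the role of the `agent.action_penalty` flag above;
    this is a generic form, not necessarily the repo's exact implementation.
    """
    return actor_loss + penalty_coef * action.pow(2).sum(dim=-1).mean()
```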
To cite our work, please use:

```bibtex
@article{bjorck2021high,
  title={Is High Variance Unavoidable in RL? A Case Study in Continuous Control},
  author={Bjorck, Johan and Gomes, Carla P and Weinberger, Kilian Q},
  journal={arXiv preprint arXiv:2110.11222},
  year={2021}
}
```
and for DRQv2:

```bibtex
@article{yarats2021drqv2,
  title={Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning},
  author={Denis Yarats and Rob Fergus and Alessandro Lazaric and Lerrel Pinto},
  journal={arXiv preprint arXiv:2107.09645},
  year={2021}
}
```
Our experiments are built on top of the open-sourced code for DRQv2.