The implementation is based on stable-baselines. Thanks to the original authors!
conda create -n peer-bc-ct python=3.6
conda activate peer-bc-ct
pip install -e .
To replicate the experiments, first train an imperfect expert policy:
python3 train.py --algo ppo2 --env BreakoutNoFrameskip-v4 --save-freq 1000000
Next, generate the expert dataset using stable_baselines/ppo2/record_expert.py. For example, to generate a dataset for PongNoFrameskip-v4:
python -m stable_baselines.ppo2.record_expert logs/PongNoFrameskip-v4/baseline/rl_model_2000000_steps.zip --note baseline/2e6_steps --env PongNoFrameskip-v4
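Assuming the recorder follows the stable-baselines `generate_expert_traj` convention, the dataset is saved as a `.npz` archive with arrays such as `obs`, `actions`, `rewards`, `episode_returns`, and `episode_starts`. A minimal sketch for inspecting such a file (the file name and array shapes below are illustrative dummies, not the repo's actual output):

```python
import numpy as np

# Build a small dummy dataset in the same layout for illustration;
# in practice you would load the .npz written by record_expert.
dummy = {
    "obs": np.zeros((100, 84, 84, 4), dtype=np.uint8),       # stacked Atari frames
    "actions": np.zeros((100, 1), dtype=np.int64),
    "rewards": np.zeros((100, 1), dtype=np.float32),
    "episode_returns": np.array([0.0, 0.0]),
    "episode_starts": np.zeros((100, 1), dtype=bool),
}
np.savez("expert_dummy.npz", **dummy)

# Inspect the archive: list every array and its shape.
data = np.load("expert_dummy.npz")
for key in data.files:
    print(key, data[key].shape)
```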
With the expert policy and its recorded dataset in place, we can run PEER:
python -m stable_baselines.ppo2.copier --env Acrobot-v1 --policy mlp
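The copier trains a behavioral-cloning policy with a peer-style regularizer on the imperfect demonstrations. A minimal NumPy sketch of a peer loss for discrete actions (the function names and the `alpha` weight are illustrative assumptions, not the repo's exact implementation):

```python
import numpy as np

def cross_entropy(logits, labels):
    # Per-sample cross-entropy from raw logits (numerically stabilized).
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def peer_loss(logits, labels, alpha=1.0, rng=None):
    # Peer term: pair independently drawn predictions with independently
    # drawn labels, then subtract that "mismatched" loss from the usual one.
    # This penalizes blindly fitting noisy (imperfect) expert labels.
    rng = np.random.default_rng(rng)
    i = rng.integers(len(labels), size=len(labels))
    j = rng.integers(len(labels), size=len(labels))
    base = cross_entropy(logits, labels).mean()
    peer = cross_entropy(logits[i], labels[j]).mean()
    return base - alpha * peer
```

For a uniform predictor the two terms cancel (the loss is zero at alpha=1), so the regularizer only rewards predictions that fit the matched labels better than random pairings.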