Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate Exploration Bias

This repository contains the official implementation of Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate Exploration Bias by Max Sobol Mark*, Archit Sharma*, Fahim Tajwar, Rafael Rafailov, Sergey Levine, and Chelsea Finn.

We build on top of the original implementation of each algorithm, and since the implementations use different package versions, each algorithm needs its own Python environment.

Setting up environments | OOO for IQL | OOO for Cal-QL | OOO for RLPD

Setting up environments

Harder exploration tasks from the paper

The paper introduces antmaze-goal-missing-large-v2 and maze2d-missing-data-large-v1, environments in which no goal-reaching data is provided, as well as point-mass-wall, a didactic example environment. To run these environments, please download and unzip these files into the datasets subfolder.
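
For example, assuming the archive was saved as ~/Downloads/ooo_datasets.zip (a hypothetical file name; use whatever name the download has), from the repository root:

$ mkdir -p datasets
$ unzip ~/Downloads/ooo_datasets.zip -d datasets/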

Adroit Manipulation Suite Setup

  1. Download the .npy files from here and unzip them into ~/.datasets/awac-data/.
  2. Install mj_envs in the conda environment of either IQL or Cal-QL (a quick import check follows the commands):
$ git clone --recursive https://github.com/nakamotoo/mj_envs.git
$ cd mj_envs
$ git submodule update --remote
$ pip install -e .
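
To verify the installation, a minimal import check (this assumes the Adroit binary-reward tasks register under names like pen-binary-v0, as in the Cal-QL setup; substitute whichever task you intend to run):

$ python -c "import gym, mj_envs; gym.make('pen-binary-v0')"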

OOO for IQL

We build on top of the original IQL implementation.

Prepare environment for IQL

$ conda create -n ooo_iql python=3.9
$ conda activate ooo_iql

$ pip install -r OOO_for_iql/requirements.txt

$ conda install -c conda-forge patchelf
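
Note: the kitchen/antmaze environments used below rely on mujoco-py, which compiles against MuJoCo on first import (patchelf above is part of that build). If the compilation fails, a common fix (general mujoco-py advice, not specific to this repo) is to make sure MuJoCo 2.1.0 is unpacked under ~/.mujoco/mujoco210 and visible to the loader:

$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco210/bin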

Run offline pre-training and online fine-tuning for exploratory data collection

XLA_PYTHON_CLIENT_PREALLOCATE=false python OOO_for_iql/train_finetune_decoupled.py --env_name=kitchen-{dataset:partial,complete,mixed}-v0 \
                --config=OOO_for_iql/configs/kitchen_decoupled_finetune_config.py \
                --exp_name=iql_rnd10_kitchen_{dataset} \
                --replay_buffer_size=4500000 --max_steps=4000000 \
                --num_pretraining_steps=1000000 \
                --seed=0 \
                --rewards_bias=-4 \
                --use_rnd=True \
                --intrinsic_reward_scale=10
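
Brace placeholders such as {dataset:partial,complete,mixed} denote one value chosen from the list and applied consistently throughout the command; when several placeholder lists appear in one command, entries correspond by position. For example, with dataset = partial the command expands to:

XLA_PYTHON_CLIENT_PREALLOCATE=false python OOO_for_iql/train_finetune_decoupled.py --env_name=kitchen-partial-v0 \
                --config=OOO_for_iql/configs/kitchen_decoupled_finetune_config.py \
                --exp_name=iql_rnd10_kitchen_partial \
                --replay_buffer_size=4500000 --max_steps=4000000 \
                --num_pretraining_steps=1000000 \
                --seed=0 \
                --rewards_bias=-4 \
                --use_rnd=True \
                --intrinsic_reward_scale=10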

See scripts for every environment here

Run Offline RL for the Exploitation phase

XLA_PYTHON_CLIENT_PREALLOCATE=false python OOO_for_iql/train_only_exploitation.py --exp_name=iql_rnd10_exploitation_kitchen_{dataset:partial,complete,mixed} \
                --env_name=kitchen-{dataset}-v0 \
                --seed=0 \
                --config=OOO_for_iql/configs/kitchen_exploitation_only_upsampling_config.py \
                --max_steps=2000000 \
                --replay_buffer_path=./results/iql_rnd10_kitchen_{dataset}/0/replay_buffer.npz \
                --offline_dataset_size={dataset:136950,3680,136950} \
                --online_eval_timesteps=1000000,2000000,3000000,4000000 \
                --rewards_bias=-4
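
For example, with dataset = partial, offline_dataset_size takes the corresponding first entry, 136950 (complete maps to 3680 and mixed to 136950), and the command expands to:

XLA_PYTHON_CLIENT_PREALLOCATE=false python OOO_for_iql/train_only_exploitation.py --exp_name=iql_rnd10_exploitation_kitchen_partial \
                --env_name=kitchen-partial-v0 \
                --seed=0 \
                --config=OOO_for_iql/configs/kitchen_exploitation_only_upsampling_config.py \
                --max_steps=2000000 \
                --replay_buffer_path=./results/iql_rnd10_kitchen_partial/0/replay_buffer.npz \
                --offline_dataset_size=136950 \
                --online_eval_timesteps=1000000,2000000,3000000,4000000 \
                --rewards_bias=-4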

See scripts for every environment here

OOO for Cal-QL

We build on top of the original Cal-QL implementation.

Prepare environment for Cal-QL

$ conda create -n ooo_calql python=3.9
$ conda activate ooo_calql

$ pip install -r OOO_for_calql/requirements.txt

$ conda install -c conda-forge patchelf

Run offline pre-training and online fine-tuning for exploratory data collection

XLA_PYTHON_CLIENT_PREALLOCATE=false python OOO_for_calql/conservative_sac_main.py --exp_name=calql_rnd10_kitchen_{dataset} \
                --env=kitchen-{dataset:partial,complete,mixed}-v0 \
                --seed=0 \
                --cql_min_q_weight=5.0 \
                --cql.cql_importance_sample=False \
                --policy_arch=512-512-512 \
                --qf_arch=512-512-512 \
                --n_pretrain_epochs=500 \
                --max_online_env_steps={dataset:1e6,4e6,1e6} \
                --mixing_ratio=0.25 \
                --reward_bias=-5 \
                --logging.output_dir=./results/calql_rnd10_kitchen_{dataset}/seed_0/ \
                --cql.use_rnd=True \
                --cql.rnd_reward_scale=10 \
                --cql.bound_q_functions=True \
                --cql.max_reward=0.5
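
For example, with dataset = partial (so max_online_env_steps takes the corresponding value, 1e6) the command expands to:

XLA_PYTHON_CLIENT_PREALLOCATE=false python OOO_for_calql/conservative_sac_main.py --exp_name=calql_rnd10_kitchen_partial \
                --env=kitchen-partial-v0 \
                --seed=0 \
                --cql_min_q_weight=5.0 \
                --cql.cql_importance_sample=False \
                --policy_arch=512-512-512 \
                --qf_arch=512-512-512 \
                --n_pretrain_epochs=500 \
                --max_online_env_steps=1e6 \
                --mixing_ratio=0.25 \
                --reward_bias=-5 \
                --logging.output_dir=./results/calql_rnd10_kitchen_partial/seed_0/ \
                --cql.use_rnd=True \
                --cql.rnd_reward_scale=10 \
                --cql.bound_q_functions=True \
                --cql.max_reward=0.5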

See scripts for every environment here

Run Offline RL for the Exploitation phase

XLA_PYTHON_CLIENT_PREALLOCATE=false python OOO_for_calql/conservative_sac_exploitation.py --exp_name=cql_exploitation_kitchen_{dataset:partial,complete,mixed}_timestep_{timestep:100000,250000,500000,1000000,2000000,3000000,4000000} \
                --env=kitchen-{dataset}-v0 \
                --seed=0 \
                --cql_min_q_weight=5.0 \
                --cql.cql_importance_sample=False \
                --policy_arch=512-512-512 \
                --qf_arch=512-512-512 \
                --mixing_ratio=0.25 \
                --reward_bias=-4.0 \
                --replay_buffer_original_bias=-5 \
                --cql.use_rnd=False \
                --logging.output_dir=./results/cql_exploitation_{env}/timestep_{timestep}/seed_0 \
                --exploitation_timestep={timestep} \
                --replay_buffer_path=./results/calql_rnd10_kitchen_{dataset}/seed_0/replay_buffer.npz \
                --replay_buffer_size=4000000 \
                --bound_q_functions_according_to_data=True \
                --cql.feature_normalization=True
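
For example, with dataset = partial and timestep = 100000, and reading the {env} placeholder in the output directory as the environment name (an assumption; adjust the path to your own naming), the command expands to:

XLA_PYTHON_CLIENT_PREALLOCATE=false python OOO_for_calql/conservative_sac_exploitation.py --exp_name=cql_exploitation_kitchen_partial_timestep_100000 \
                --env=kitchen-partial-v0 \
                --seed=0 \
                --cql_min_q_weight=5.0 \
                --cql.cql_importance_sample=False \
                --policy_arch=512-512-512 \
                --qf_arch=512-512-512 \
                --mixing_ratio=0.25 \
                --reward_bias=-4.0 \
                --replay_buffer_original_bias=-5 \
                --cql.use_rnd=False \
                --logging.output_dir=./results/cql_exploitation_kitchen-partial-v0/timestep_100000/seed_0 \
                --exploitation_timestep=100000 \
                --replay_buffer_path=./results/calql_rnd10_kitchen_partial/seed_0/replay_buffer.npz \
                --replay_buffer_size=4000000 \
                --bound_q_functions_according_to_data=True \
                --cql.feature_normalization=True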

See scripts for every environment here

OOO for RLPD

We build on top of the original RLPD implementation.

Prepare environment for RLPD

$ conda create -n ooo_rlpd python=3.9
$ conda activate ooo_rlpd

$ pip install -r OOO_for_rlpd/requirements.txt

$ conda install -c conda-forge patchelf

Run the Online phase for ant and halfcheetah sparse

Ant sparse:

XLA_PYTHON_CLIENT_PREALLOCATE=false python OOO_for_rlpd/train_finetuning_decoupled.py --exp_name=rlpd_rnd10_ant_sparse \
                --env_name=ant-sparse-v2 \
                --utd_ratio=20 \
                --start_training=5000 \
                --max_steps=1000000 \
                --config=OOO_for_rlpd/configs/rlpd_rnd_10.py \
                --seed=0 \
                --offline_ratio=0

Halfcheetah sparse:

XLA_PYTHON_CLIENT_PREALLOCATE=false python OOO_for_rlpd/train_finetuning_decoupled.py --exp_name=rlpd_rnd10_halfcheetah_sparse \
                --env_name=halfcheetah-sparse-v2 \
                --utd_ratio=20 \
                --start_training=5000 \
                --max_steps=1000000 \
                --config=OOO_for_rlpd/configs/rlpd_rnd_10.py \
                --seed=0 \
                --offline_ratio=0

Run Offline RL for the Exploitation phase

Ant sparse:

XLA_PYTHON_CLIENT_PREALLOCATE=false python OOO_for_rlpd/train_only_exploitation.py --exp_name=rlpd_exploitation_ant_sparse_from_rnd10_timestep_{timestep:250000,500000,750000,1000000} \
                --env_name=ant-sparse-v2 \
                --seed=0 \
                --buffer_end_index={timestep} \
                --dataset_path=./results/rlpd_rnd10_ant_sparse/0/buffers/buffer \
                --upsample_batch_size=32
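
For example, to run the exploitation phase on the first 250000 collected transitions (timestep = 250000), the command expands to:

XLA_PYTHON_CLIENT_PREALLOCATE=false python OOO_for_rlpd/train_only_exploitation.py --exp_name=rlpd_exploitation_ant_sparse_from_rnd10_timestep_250000 \
                --env_name=ant-sparse-v2 \
                --seed=0 \
                --buffer_end_index=250000 \
                --dataset_path=./results/rlpd_rnd10_ant_sparse/0/buffers/buffer \
                --upsample_batch_size=32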

Halfcheetah sparse:

XLA_PYTHON_CLIENT_PREALLOCATE=false python OOO_for_rlpd/train_only_exploitation.py --exp_name=rlpd_exploitation_halfcheetah_sparse_from_rnd10_timestep_{timestep:250000,500000,750000,1000000} \
                --env_name=halfcheetah-sparse-v2 \
                --seed=0 \
                --buffer_end_index={timestep} \
                --dataset_path=./results/rlpd_rnd10_halfcheetah_sparse/0/buffers/buffer \
                --upsample_batch_size=32
