
logo's Introduction

Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration

Code for Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration, ICLR 2022 (Spotlight)

Video of TurtleBot Demonstration

This codebase is based on the publicly available GitHub repository Khrylx/PyTorch-RL.

To run experiments, you will need to install the following packages, preferably in a conda virtual environment:

  • gym 0.18.0
  • pytorch 1.8.1
  • mujoco-py 2.0.2.13
  • tensorboard 2.5.0
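
For example, the dependencies can be installed along these lines (the environment name below is illustrative; note that mujoco-py also requires the MuJoCo 2.0 binaries to be installed separately):

conda create -n logo python=3.7
conda activate logo
pip install gym==0.18.0 torch==1.8.1 mujoco-py==2.0.2.13 tensorboard==2.5.0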

The Python file to run LOGO is logo/run_logo.py.

To run the code with the default parameters, simply execute the following command:

python run_logo.py --env-num i

where i is an integer between 1 and 8, corresponding to the following experiments (an example invocation is given after the list):

  1. Hopper-v2
  2. Censored Hopper-v2
  3. HalfCheetah-v2
  4. Censored HalfCheetah-v2
  5. Walker2d-v2
  6. Censored Walker2d-v2
  7. InvertedDoublePendulum-v2
  8. Censored InvertedDoublePendulum-v2
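
For example, to train LOGO on the standard Hopper-v2 setup (experiment 1 above), run:

python run_logo.py --env-num 1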

The tensorboard logs will be saved in a folder titled 'Results'
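
To monitor training, you can point TensorBoard at that folder, for example:

tensorboard --logdir Results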

For the full observation setting, the policy network can be initialized using behavior cloning, which enables faster learning. To do so, simply execute the following command:

python run_logo.py --env-num i --init-BC

logo's People

Contributors

desikrengarajan


logo's Issues

The sparse reward settings on mujoco

The reward setting in MuJoCo is confusing. When the agent moves a fixed distance from the starting point (i.e., 0), say 2 or 20, temporarily denoted by d, it receives a reward of 1 at every state and step. With this reward setup proposed by the authors, it does not feel like a sparse-reward problem. In addition, this reward setting seems to encourage the agent merely to step outside the circle of radius d (once outside, it can collect a reward at every step even if it stands still), whereas the original dense reward setting encourages the agent to keep going further. So this modification changes the intent of the original task. Finally, I tried modifying the reward to give the agent a reward of 1 for every d of distance traveled, and I found that this approach did not work.
If other readers have also thought about this question, please help me resolve my doubts. Thank you very much!
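
For concreteness, here is a minimal sketch of the reward modification described above (the gym wrapper, the distance threshold d, and the use of qpos[0] as the x-position are illustrative assumptions based on this description, not the repository's exact code):

import gym

class SparseDistanceReward(gym.Wrapper):
    # Illustrative wrapper: reward 1 once the agent is at least d units from the start, else 0.
    def __init__(self, env, d=2.0):
        super().__init__(env)
        self.d = d

    def step(self, action):
        obs, _, done, info = self.env.step(action)       # discard the dense reward
        x_pos = self.env.unwrapped.sim.data.qpos[0]      # torso x-position in MuJoCo locomotion tasks
        reward = 1.0 if abs(x_pos) >= self.d else 0.0    # 1 outside the radius-d circle, 0 inside
        return obs, reward, done, info

# Hypothetical usage: env = SparseDistanceReward(gym.make("Hopper-v2"), d=2.0)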

You summarize the LOGO algorithm in formula (10) of your paper, but I don't understand why the formula for minimizing the upper bound has a minus sign, and why the same function can be used to update the parameters in the code

Thank you so much for sharing your work. You summarize the LOGO algorithm in formula (10) of your paper, where minimizing the upper bound involves a minus sign, while TRPO uses a plus sign and updates parameters by maximization. I don't understand why the same function can be used to update the parameters in the code, and I don't see where the minus sign is applied:
trpo_step(policy_net, value_net, states, actions, returns, advantages, args.max_kl, args.damping, args.l2_reg)
if kl > 6e-7:
    trpo_step(policy_net, value_net_exp, states, actions, returns_exp, advantages_exp, kl, args.damping, args.l2_reg, fixed_log_probs=fixed_log_probs)

Does LOGO also apply to the PPO algorithm?

Dear author:

First of all, thank you very much for your contribution to the sparse-reward problem in reinforcement learning!

Secondly, I want to know whether LOGO also applies to the PPO algorithm, since PPO performs comparably to TRPO and is simpler to use.

Some issues about LOGO

Dear Desik Rengarajan,

Recently, I studied the LOGO algorithm you published at ICLR 2022, and the method of guiding policy learning with demonstration data achieves remarkable results in MuJoCo with theoretical guarantees. It is good work! I have some doubts about how the behavior data is collected and about the theoretical derivation, and I hope to get your answers.

Following your instructions, I collected behavior data as follows, taking Hopper-v2 as an example (see the sketch after this list):
(1) The default number of iterations is 1500. I trained TRPO in the dense-reward setting for 1000 iterations (i.e., the training and test environments both use dense rewards).
(2) Using the TRPO model trained for 1000 iterations, I collected about 10 episodes (about 3000 rows of data) as the behavior data.
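
A minimal sketch of this collection procedure, with illustrative interfaces (the load_trained_trpo_policy loader, the select_action method, and the output file name are assumptions, not the repository's actual code):

import gym
import pickle

env = gym.make("Hopper-v2")                      # dense-reward environment used for collection
policy = load_trained_trpo_policy("hopper.pt")   # hypothetical loader for the 1000-iteration TRPO model

demos = []
for _ in range(10):                              # roughly 10 episodes (~3000 transitions)
    state, done = env.reset(), False
    while not done:
        action = policy.select_action(state)     # assumed greedy action from the trained policy
        next_state, reward, done, _ = env.step(action)
        demos.append((state, action))            # store state-action pairs as behavior data
        state = next_state

with open("hopper_demos.p", "wb") as f:          # file name is illustrative
    pickle.dump(demos, f)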

Apart from the behavior data being different from the default data in the LOGO code, the other training settings are consistent with the LOGO code, but my training results are very poor (see the picture below). I think the problem lies in the behavior data collection. Could you elaborate further on your approach to constructing the behavior data, and could you open-source the corresponding part of your behavior data collection program?
[image: training results]

question 2: [posted as an image]

question 3: [posted as an image]
