
logo's Introduction

Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration

Code for Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration, ICLR 2022 (Spotlight)

Video of TurtleBot Demonstration

This codebase is based on the publicly available GitHub repository Khrylx/PyTorch-RL.

To run experiments, you will need to install the following packages, preferably in a conda virtual environment:

  • gym 0.18.0
  • pytorch 1.8.1
  • mujoco-py 2.0.2.13
  • tensorboard 2.5.0
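
For example, the dependencies can be installed along these lines (the environment name below is illustrative; note that mujoco-py also requires the MuJoCo 2.0 binaries to be installed separately):

conda create -n logo python=3.7
conda activate logo
pip install gym==0.18.0 torch==1.8.1 mujoco-py==2.0.2.13 tensorboard==2.5.0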

The Python file to run LOGO is logo/run_logo.py.

To run the code with the default parameters, simply execute the following command:

python run_logo.py --env-num i

where i is an integer between 1 and 8, corresponding to the following experiments (an example invocation is given after the list):

  1. Hopper-v2
  2. Censored Hopper-v2
  3. HalfCheetah-v2
  4. Censored HalfCheetah-v2
  5. Walker2d-v2
  6. Censored Walker2d-v2
  7. InvertedDoublePendulum-v2
  8. Censored InvertedDoublePendulum-v2
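
For example, to train LOGO on the standard Hopper-v2 setup (experiment 1 above), run:

python run_logo.py --env-num 1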

The tensorboard logs will be saved in a folder titled 'Results'
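
To monitor training, you can point TensorBoard at that folder, for example:

tensorboard --logdir Results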

For the full observation setting, the policy network can be initialized using behavior cloning, which enables faster learning. To do so, simply execute the following command:

python run_logo.py --env-num i --init-BC

logo's People

Contributors

desikrengarajan


logo's Issues

The sparse reward settings on mujoco

The reward setting in MuJoCo is confusing. When the agent moves a fixed distance from the starting point (i.e., 0), say 2 or 20, temporarily denoted by d, it receives a reward of 1 at every state and step. With this reward setup proposed by the authors, it does not feel like a sparse-reward problem. In addition, this reward setting seems to encourage the agent merely to step outside the circle of radius d (once outside, it can collect a reward at every step even if it stands still), whereas the original dense reward setting encourages the agent to keep going further. So this modification changes the intent of the original task. Finally, I tried modifying the reward to give the agent a reward of 1 for every d of distance traveled, and I found that this approach did not work.
If other readers have also thought about this question, please help me resolve my doubts. Thank you very much!
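
For concreteness, here is a minimal sketch of the reward modification described above (the gym wrapper, the distance threshold d, and the use of qpos[0] as the x-position are illustrative assumptions based on this description, not the repository's exact code):

import gym

class SparseDistanceReward(gym.Wrapper):
    # Illustrative wrapper: reward 1 once the agent is at least d units from the start, else 0.
    def __init__(self, env, d=2.0):
        super().__init__(env)
        self.d = d

    def step(self, action):
        obs, _, done, info = self.env.step(action)       # discard the dense reward
        x_pos = self.env.unwrapped.sim.data.qpos[0]      # torso x-position in MuJoCo locomotion tasks
        reward = 1.0 if abs(x_pos) >= self.d else 0.0    # 1 outside the radius-d circle, 0 inside
        return obs, reward, done, info

# Hypothetical usage: env = SparseDistanceReward(gym.make("Hopper-v2"), d=2.0)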

You summarize the LOGO algorithm in formula (10) of your paper, but I don't understand why the formula for minimizing the upper bound has a minus sign, and why the same function can be used to update the parameters in the code

Thank you so much for sharing your work. You summarize the LOGO algorithm in formula (10) of your paper, where minimizing the upper bound involves a minus sign, while TRPO uses a plus sign and updates parameters by maximization. I don't understand why the same function can be used to update the parameters in the code, and I don't see where the minus sign is applied:
trpo_step(policy_net, value_net, states, actions, returns, advantages, args.max_kl, args.damping, args.l2_reg)
if kl > 6e-7:
    trpo_step(policy_net, value_net_exp, states, actions, returns_exp, advantages_exp, kl, args.damping, args.l2_reg, fixed_log_probs=fixed_log_probs)

Does LOGO also apply to the PPO algorithm?

Dear author:

First of all, thank you very much for your contribution to the sparse-reward problem in reinforcement learning!

Secondly, I want to know whether LOGO also applies to the PPO algorithm, since PPO performs comparably to TRPO and is simpler to use.

Some issues about LOGO

Dear Desik Rengarajan,

Recently, I studied the LOGO algorithm you published at ICLR 2022, and the method of guiding policy learning with demonstration data achieves remarkable results in MuJoCo with theoretical guarantees. It is good work! I have some doubts about how the behavior data is collected and about the theoretical derivation, and I hope to get your answers.

Following your instructions, I collected behavior data as follows, taking Hopper-v2 as an example (see the sketch after this list):
(1) The default number of iterations is 1500. I trained TRPO in the dense-reward setting for 1000 iterations (i.e., the training and test environments both use dense rewards).
(2) Using the TRPO model trained for 1000 iterations, I collected about 10 episodes (about 3000 rows of data) as the behavior data.
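
A minimal sketch of this collection procedure, with illustrative interfaces (the load_trained_trpo_policy loader, the select_action method, and the output file name are assumptions, not the repository's actual code):

import gym
import pickle

env = gym.make("Hopper-v2")                      # dense-reward environment used for collection
policy = load_trained_trpo_policy("hopper.pt")   # hypothetical loader for the 1000-iteration TRPO model

demos = []
for _ in range(10):                              # roughly 10 episodes (~3000 transitions)
    state, done = env.reset(), False
    while not done:
        action = policy.select_action(state)     # assumed greedy action from the trained policy
        next_state, reward, done, _ = env.step(action)
        demos.append((state, action))            # store state-action pairs as behavior data
        state = next_state

with open("hopper_demos.p", "wb") as f:          # file name is illustrative
    pickle.dump(demos, f)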

Apart from the behavior data being different from the default data in the LOGO code, the other training settings are consistent with the LOGO code, but my training results are very poor (see the picture below). I think the problem lies in the behavior data collection. Could you elaborate further on your approach to constructing the behavior data, and could you open-source the corresponding part of your behavior data collection program?
[image: training results]

question 2: [posted as an image]

question 3: [posted as an image]
