
oac-explore's Introduction

Optimistic Actor Critic

This repository contains the code accompanying the NeurIPS 2019 paper 'Better Exploration with Optimistic Actor-Critic'.

If you are reading the code to understand how Optimistic Actor Critic works, have a look at the file optimistic_exploration.py, which encapsulates the logic of optimistic exploration. The remaining files in the repository implement a generic version of Soft Actor Critic.
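For orientation, here is a minimal sketch, not the repository's exact code, of the optimistic exploration step described in the paper: an epistemic upper bound on Q is formed from the two critics, and the exploration mean is shifted along the gradient of that bound under a KL constraint controlled by delta. The accessor policy.get_mean_std and the single-observation tensor shapes are assumptions made for this example.

import math
import torch

def optimistic_exploration_action(obs, policy, qf1, qf2, beta_UB=4.66, delta=23.53):
    # Pre-tanh mean and std of the target (SAC) policy for a single observation.
    # policy.get_mean_std is a hypothetical accessor, not the repository's API.
    pre_tanh_mu_T, std = policy.get_mean_std(obs)
    pre_tanh_mu_T = pre_tanh_mu_T.detach().requires_grad_(True)
    tanh_mu_T = torch.tanh(pre_tanh_mu_T)

    # Epistemic upper bound on Q: mean of the two critics plus beta_UB times
    # their approximate standard deviation.
    q1, q2 = qf1(obs, tanh_mu_T), qf2(obs, tanh_mu_T)
    mu_Q = (q1 + q2) / 2.0
    sigma_Q = torch.abs(q1 - q2) / 2.0
    q_UB = mu_Q + beta_UB * sigma_Q

    # Shift the exploration mean along the gradient of the upper bound while
    # keeping the KL divergence to the target policy bounded by delta.
    grad = torch.autograd.grad(q_UB.sum(), pre_tanh_mu_T)[0]
    sigma_sq = std ** 2
    denom = torch.sqrt((grad * grad * sigma_sq).sum()) + 1e-6
    pre_tanh_mu_E = pre_tanh_mu_T + math.sqrt(2.0 * delta) * sigma_sq * grad / denom

    # Sample the exploration action from the shifted Gaussian (same covariance).
    return torch.tanh(torch.distributions.Normal(pre_tanh_mu_E, std).sample())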

Reproducing Results

The bash script reproduce.sh will run Soft Actor Critic and Optimistic Actor Critic on the environment Humanoid-v2, each with 5 seeds. It is recommended you execute this script on a machine with sufficient resources.

After the script finishes, to plot the learning curve, you can run

python -m plotting.plot_against_baseline

which should produce the graph below. Optimistic Actor Critic takes ~6 million timesteps to obtain an average episode return of 8000, while Soft Actor Critic requires 10 million steps. This represents a ~40% improvement in sample efficiency.

[Figure: oac_vs_sac — learning curves comparing Optimistic Actor Critic and Soft Actor Critic on Humanoid-v2]

Note that the result in the paper was produced by modifying the TensorFlow code provided in the softlearning repo.

Running Experiments

The repository supports automatic saving and restoring from checkpoints. This is useful if you run experiments on preemptible cloud compute.
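As a rough illustration of the idea, and not the repository's actual checkpointing code, a run can periodically write its state to disk and reload it on restart. The path, the function names, and the state_dict() call below are assumptions made for the example.

import os
import torch

CHECKPOINT_PATH = 'checkpoint.pt'  # hypothetical location

def save_checkpoint(epoch, trainer, replay_buffer):
    # Persist everything needed to resume training after pre-emption.
    torch.save({'epoch': epoch,
                'trainer': trainer.state_dict(),   # assumes the trainer exposes state_dict()
                'replay_buffer': replay_buffer},
               CHECKPOINT_PATH)

def try_restore():
    # Return the saved state if a checkpoint exists, otherwise None.
    if os.path.exists(CHECKPOINT_PATH):
        return torch.load(CHECKPOINT_PATH)
    return None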

For software dependencies, have a look inside the environment folder. You can either build the Dockerfile, create a conda environment with environment.yml, or create a pip environment with environments.txt.

To create the conda environment, cd into the environment folder and run:

python install_mujoco.py
conda env create -f environment.yml

To run Soft Actor Critic on Humanoid with seed 0 as a baseline to compare against Optimistic Actor Critic, run

python main.py --seed=0 --domain=humanoid

To run Optimistic Actor Critic on Humanoid with seed 0, run

python main.py --seed=0 --domain=humanoid --beta_UB=4.66 --delta=23.53

Hyper-parameter Selection

Note that we were able to remove a hyper-parameter (k_LB) relative to the code used for the paper. The result in the graph above was obtained without using the k_LB hyper-parameter.
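One plausible reading of this, consistent with the torch.min call that appears in the issue excerpt further down, is that the lower bound on Q is no longer a parametrised combination mu_Q + k_LB * sigma_Q, but simply the pessimistic minimum of the two critics, as in standard Soft Actor Critic. The snippet below illustrates that reading with made-up numbers; it is not the repository's code.

import torch

q1 = torch.tensor([3.0])   # example output of the first critic
q2 = torch.tensor([2.5])   # example output of the second critic
q_lower_bound = torch.min(q1, q2)   # for two critics this equals mu_Q - sigma_Q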

Acknowledgement

This repository is based on rlkit.

Citation

If you use the codebase, please cite the paper:

@misc{oac,
    title={Better Exploration with Optimistic Actor-Critic},
    author={Kamil Ciosek and Quan Vuong and Robert Loftin and Katja Hofmann},
    year={2019},
    eprint={1910.12807},
    archivePrefix={arXiv},
    primaryClass={stat.ML}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

oac-explore's People

Contributors: microsoft-github-operations[bot], microsoftopensource, quanvuong, robotics-transformer-x


oac-explore's Issues

`RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation`

The following code generates an error in some recent versions of PyTorch:

"""
Update networks
"""
self.qf1_optimizer.zero_grad()
qf1_loss.backward()
self.qf1_optimizer.step()
self.qf2_optimizer.zero_grad()
qf2_loss.backward()
self.qf2_optimizer.step()
self.policy_optimizer.zero_grad()
policy_loss.backward()
self.policy_optimizer.step()

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

To solve it, it is necessary to move these lines

q_new_actions = torch.min(
    self.qf1(obs, new_obs_actions),
    self.qf2(obs, new_obs_actions),
)
policy_loss = (alpha * log_pi - q_new_actions).mean()

between the Q-network gradient steps and the policy-network step, like so:

"""
Update networks
"""
self.qf1_optimizer.zero_grad()
qf1_loss.backward(retain_graph=True)
self.qf1_optimizer.step()

self.qf2_optimizer.zero_grad()
qf2_loss.backward(retain_graph=True)
self.qf2_optimizer.step()

q_new_actions = torch.min(
    self.qf1(obs, new_obs_actions),
    self.qf2(obs, new_obs_actions),
)
policy_loss = (alpha * log_pi - q_new_actions).mean()

self.policy_optimizer.zero_grad()
policy_loss.backward(retain_graph=True)
self.policy_optimizer.step()

Be aware that if you simply use an old version of PyTorch to avoid this error, the behaviour might not be what you expect, since policy_loss was computed from Q-networks that have already been updated.

raise InvalidGitRepositoryError(epath) git.exc.InvalidGitRepositoryError: /home/f/Downloads/oac-explore-master

Thank you for your code. Can you tell me how to deal with this error?
/home/f/anaconda3/envs/f/bin/python /home/f/Downloads/oac-explore-master/main.py
Traceback (most recent call last):
  File "/home/f/Downloads/oac-explore-master/main.py", line 219, in <module>
    variant['log_dir'] = get_log_dir(args)
  File "/home/f/Downloads/oac-explore-master/main.py", line 165, in get_log_dir
    get_current_branch('./'),
  File "/home/f/Downloads/oac-explore-master/main.py", line 35, in get_current_branch
    repo = Repo(dir)
  File "/home/f/anaconda3/envs/f/lib/python3.7/site-packages/git/repo/base.py", line 181, in __init__
    raise InvalidGitRepositoryError(epath)
git.exc.InvalidGitRepositoryError: /home/f/Downloads/oac-explore-master
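For context, the error happens because main.py derives the log directory from the current git branch via GitPython, which fails when the code was downloaded as an archive rather than cloned. Cloning the repository with git is the simplest remedy; an alternative, shown below only as a hedged sketch and not an official fix, is to guard the call with a fallback label.

from git import Repo
from git.exc import InvalidGitRepositoryError

def get_current_branch(dir):
    # Fall back to a fixed label when the directory is not a git repository,
    # e.g. when the code was downloaded as a zip archive.
    try:
        return Repo(dir).active_branch.name
    except InvalidGitRepositoryError:
        return 'no-git'   # hypothetical fallback used only for naming the log dir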

Gradient calc in deterministic OAC

Hi Quan,

I came across your paper and found it to be interesting. One doubt I have is with the implementation of the optimistic policies. Why are you computing gradients of the upper bound with respect to the pre-tanh value of the policy? As per the paper, isn't it supposed to be with respect to the deterministic action (the output of the tanh policy)?

Regards,
Kartik
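To make the question concrete, here is a self-contained toy example, with made-up shapes and a stand-in critic, of the two gradients being contrasted: the gradient of the upper bound with respect to the pre-tanh mean versus the gradient with respect to the post-tanh (deterministic) action.

import torch

obs = torch.randn(1, 4)                              # toy observation
mu_pre = torch.zeros(1, 2, requires_grad=True)       # pre-tanh policy mean

def q_ub(obs, act):
    # Stand-in for the upper-bound critic; any differentiable scalar works here.
    return obs.sum() + act.pow(2).sum()

# Gradient with respect to the pre-tanh mean (what the repository computes).
grad_pre = torch.autograd.grad(q_ub(obs, torch.tanh(mu_pre)), mu_pre)[0]

# Gradient with respect to the post-tanh action (the paper's formulation);
# the two differ by the tanh Jacobian, diag(1 - tanh(mu_pre) ** 2).
act = torch.tanh(mu_pre).detach().requires_grad_(True)
grad_act = torch.autograd.grad(q_ub(obs, act), act)[0]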

disabling code signature checker

I've forked the repo and started making some changes to try something out, but the code appears to have some kind of signature check: when I try to run main.py, the only output I get is the git diff, and then the code exits. Any suggestions on how to disable that?

I have verified that I can run main.py locally on the master branch; the issue only occurs on my custom branch.

Documentation

Hi,

It seems the current code lacks documentation. I just want to implement OAC, and I do not know exactly how to put the code together to do so. I would appreciate it if you could make it clearer how people can use your code for OAC.

Calculation of alpha loss in SAC is different from the original paper

Hello, in the SAC paper "Soft Actor-Critic Algorithms and Applications", the loss for alpha is:

J(alpha) = E[-alpha * (log(pi) + H)]

However, in your implementation (line 109 of trainer.py), the loss for alpha is instead:

J(alpha) = E[-log(alpha) * (log(pi) + H)]

I am curious why the loss is calculated in this way. I have searched GitHub for a couple of PyTorch-based SAC implementations, and they all calculate the loss in this way. But TensorFlow-based SAC implementations calculate J(alpha) in the same way as the SAC paper (https://github.com/rail-berkeley/softlearning/blob/master/softlearning/algorithms/sac.py). The TensorFlow implementations still compute the gradient with respect to log(alpha), but when calculating the loss J(alpha) they use exp(log(alpha)) (which is alpha) instead of log(alpha).
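For anyone comparing the two variants, here is a small self-contained illustration with made-up numbers. Since log(alpha) is the optimisation variable and the bracketed term is treated as a constant, the gradients of the two losses with respect to log(alpha) differ only by a positive factor of alpha, so both push log(alpha) in the same direction; whether that scaling difference matters in practice is a separate question.

import torch

log_alpha = torch.zeros(1, requires_grad=True)
log_pi = torch.tensor([-1.2, -0.7, -2.0])      # example log-probabilities of sampled actions
target_entropy = -3.0                           # example target entropy H

# Paper-style loss: J(alpha) = E[-alpha * (log pi + H)]
loss_paper = -(log_alpha.exp() * (log_pi + target_entropy).detach()).mean()

# Loss used here and in several PyTorch SAC implementations:
# J(alpha) = E[-log(alpha) * (log pi + H)]
loss_repo = -(log_alpha * (log_pi + target_entropy).detach()).mean()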
