
reinforcement-implementation's Introduction

Reinforcement-Implementation

This project aims to reproduce the results of several model-free RL algorithms in continuous action domains (MuJoCo environments).

This project

  • uses the PyTorch package
  • implements each algorithm independently in a separate, minimal file
  • is written in the simplest possible style
  • tries to follow the original papers and reproduce their results

My first stage of work is to reproduce the algorithm-comparison figure in the PPO paper:

  • A2C
  • ACER (A2C + Trust Region): it seems that this implementation has some problems ... (bug reports welcome)
  • CEM
  • TRPO (TRPO single path)
  • PPO (PPO clip)
  • Vanilla PG

On the next stage, I want to implement algorithms for discrete action spaces and raw video input (Atari) problems:

  • Rainbow: DQN and relevant techniques (target network / double Q-learning / prioritized experience replay / dueling network structure / distributional RL)
  • PPO with random network distillation (RND)

Rainbow on Atari with only 3M: It works but may need further tuning.
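
As a concrete example of one of the components listed above, here is a minimal PyTorch sketch (not the repository's code) of the double Q-learning target: the online network selects the greedy action and the target network evaluates it. The network interface (observations in, per-action Q-values out) and float-valued done flags are assumptions for illustration.

import torch

def double_q_target(online_net, target_net, next_obs, rewards, dones, gamma=0.99):
    """Double Q-learning target: argmax from the online net, value from the target net."""
    with torch.no_grad():
        next_actions = online_net(next_obs).argmax(dim=1, keepdim=True)   # a* = argmax_a Q_online(s', a)
        next_q = target_net(next_obs).gather(1, next_actions).squeeze(1)  # Q_target(s', a*)
        return rewards + gamma * (1.0 - dones) * next_q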

And then, model-based algorithms (not yet planned):

  • PILCO
  • PE-TS

TODOs:

  • change the way rewards are counted; the current approach may underestimate the reward (evaluate a deterministic policy rather than a stochastic/exploratory one)
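
A minimal sketch of this evaluation change, assuming (hypothetically) a Gaussian policy network that returns (mean, log_std) and the classic Gym reset/step API: act with the distribution mean instead of sampling from it.

import torch

def evaluate_deterministic(policy_net, env, episodes=10):
    """Evaluate by acting with the Gaussian mean, i.e., without exploration noise."""
    returns = []
    for _ in range(episodes):
        obs = env.reset()
        done, ep_ret = False, 0.0
        while not done:
            with torch.no_grad():
                mean, _log_std = policy_net(torch.as_tensor(obs, dtype=torch.float32))
            obs, reward, done, _ = env.step(mean.numpy())  # deterministic action
            ep_ret += reward
        returns.append(ep_ret)
    return sum(returns) / len(returns)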

PPO Implementation

The PPO implementation is of high quality and matches the performance of openai/baselines.
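
For reference, a minimal sketch (not the repository's code) of the clipped surrogate objective from the PPO paper that this implementation follows; logp_new, logp_old, and advantages are assumed to be per-sample tensors.

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss; minimizing it performs gradient ascent on the PPO objective."""
    ratio = torch.exp(logp_new - logp_old)                                    # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()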

Update

Recently, I added Rainbow and DQN. The Rainbow implementation is of high quality on Atari games, enough for you to modify and build your own research on. The DQN implementation is a minimal working version that reaches good performance on MountainCar (a simple task, yet many implementations on GitHub do not achieve good performance or need additional reward/environment engineering). This should be enough for a fast test of your research ideas.

reinforcement-implementation's People

Contributors

rl4inventory, zhangchuheng123


reinforcement-implementation's Issues

Issues regarding PPO implementation

Hi Chuheng,

Hope you're having a great day and staying safe. Thank you for open-sourcing your implementation of PPO, which helped me a lot during the initial stages of my research. However, after carefully inspecting the PPO implementation from openai/baselines, I found 32 implementation details, which you can find at https://costa.sh/blog-the-32-implementation-details-of-ppo.html, and it appears your implementation has some discrepancies. Namely, they are:

  • per-minibatch advantage normalization
    • the PPO implementation from openai/baselines normalizes the advantages at the mini-batch level, whereas your implementation normalizes them over the whole batch (several of these details are illustrated in the code sketch after this list)
  • mini-batch construction
    • the PPO implementation from openai/baselines makes sure that every transition in the whole batch is used for the policy updates, whereas your implementation randomly samples transitions
  • value loss implementation
    • the PPO implementation from openai/baselines does not normalize the value loss as you do. Instead, the environment-normalization wrapper applies the special kind of reward scaling mentioned in my blog post. In addition, the value loss is clipped in a manner similar to the clipped policy objective
  • clip range annealing
    • although this option is available, the PPO implementation from openai/baselines does not use it for either the Atari or the MuJoCo experiments
  • the use of parallel environments
    • although this may seem like a minor difference from your implementation, the use of parallel environments actually changes the kind of trajectories collected for the PPO agents. Please see my blog post for more details
  • orthogonal initialization
    • the scale for the hidden layers should be 2**0.5 instead of the 1 used in your implementation
  • entropy loss calculation
    • the PPO implementation from openai/baselines calculates the entropy analytically from the action distribution instead of approximating it from episodic data as you do
  • global gradient clipping
    • the PPO implementation from openai/baselines makes sure that the norm of the concatenated gradients of all parameters does not exceed 0.5
  • the epsilon parameter of the Adam optimizer
    • the PPO implementation from openai/baselines sets the epsilon parameter of the Adam optimizer to 1e-5, which is different from the default 1e-8
  • My additional comment is that your use of the Memory seems quite unnecessary.
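
To make the list above concrete, here is a minimal PyTorch sketch of a PPO update loop that follows several of these details: a shuffled pass over the whole batch, per-minibatch advantage normalization, a clipped value loss, an analytic entropy term, global gradient clipping at 0.5, orthogonal initialization with gain 2**0.5, and Adam with epsilon 1e-5. This is neither the repository's code nor openai/baselines code, and the policy interface (returning mean, log std, and state value) is an assumption for illustration.

import numpy as np
import torch
import torch.nn as nn
from torch.distributions import Normal

def orthogonal_init(layer, gain=np.sqrt(2)):
    """Orthogonal initialization; hidden layers use gain sqrt(2)."""
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer

def ppo_update(policy, optimizer, obs, actions, logp_old, returns, values_old, advantages,
               clip_eps=0.2, vf_coef=0.5, ent_coef=0.01, max_grad_norm=0.5,
               epochs=10, minibatch_size=64):
    batch_size = obs.shape[0]
    for _ in range(epochs):
        # Mini-batch construction: a shuffled pass over the whole batch,
        # so every transition is used exactly once per epoch.
        indices = torch.randperm(batch_size)
        for start in range(0, batch_size, minibatch_size):
            mb = indices[start:start + minibatch_size]

            # Per-minibatch advantage normalization.
            adv = advantages[mb]
            adv = (adv - adv.mean()) / (adv.std() + 1e-8)

            mean, log_std, value = policy(obs[mb])   # assumed policy interface
            dist = Normal(mean, log_std.exp())
            logp = dist.log_prob(actions[mb]).sum(-1)

            # Clipped policy objective.
            ratio = torch.exp(logp - logp_old[mb])
            surr1 = ratio * adv
            surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
            policy_loss = -torch.min(surr1, surr2).mean()

            # Value loss clipped analogously to the clipped policy objective.
            v_clipped = values_old[mb] + torch.clamp(value - values_old[mb], -clip_eps, clip_eps)
            value_loss = 0.5 * torch.max((value - returns[mb]) ** 2,
                                         (v_clipped - returns[mb]) ** 2).mean()

            # Analytic entropy from the distribution, not an episodic estimate.
            entropy = dist.entropy().sum(-1).mean()

            loss = policy_loss + vf_coef * value_loss - ent_coef * entropy

            optimizer.zero_grad()
            loss.backward()
            # Global gradient clipping over all parameters.
            nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
            optimizer.step()

# Adam with the non-default epsilon used by openai/baselines:
# optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4, eps=1e-5)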

I found your implementation especially helpful when I was getting into the field, and I feel strongly about giving you this feedback. I hope it is helpful to you.

Best wishes,
Costa

About the line-search in TRPO

In your TRPO implementation, when doing the line search, it seems that the acceptance of a model update depends only on the surrogate loss rather than using the KL constraint explicitly. The condition looks like:

if actual_improve > 0 and actual_improve > alpha * expected_improve:
    return True, xnew

Did I misunderstand your code? Could you please explain the implementation here? Thank you.
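
For comparison, here is a minimal sketch (not the repository's code) of a backtracking line search that additionally checks the KL constraint, which is one common way to enforce the trust region during the step-size search; surrogate_fn and kl_fn are assumed callables that evaluate the surrogate loss and the mean KL divergence at a given parameter vector.

def line_search_with_kl(surrogate_fn, kl_fn, x, fullstep, expected_improve, max_kl,
                        accept_ratio=0.1, max_backtracks=10):
    """Accept a step only if the surrogate improves enough and the KL stays in the trust region."""
    loss_before = surrogate_fn(x)
    for n in range(max_backtracks):
        step_frac = 0.5 ** n
        x_new = x + step_frac * fullstep
        actual_improve = loss_before - surrogate_fn(x_new)   # surrogate_fn is a loss: smaller is better
        if (actual_improve > 0
                and actual_improve > accept_ratio * step_frac * expected_improve
                and kl_fn(x_new) <= max_kl):                  # explicit KL check
            return True, x_new
    return False, x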

The loss_entropy term in PPO should be maximized

In the PPO implementation at https://github.com/zhangchuheng123/Reinforcement-Implementation/blob/master/code/ppo.py, the total loss is defined as

total_loss = loss_surr + args.loss_coeff_value * loss_value + args.loss_coeff_entropy * loss_entropy

In the total_loss defined above, should the sign of loss_entropy be negative, i.e., so that the entropy is maximized?
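
For reference, a minimal sketch of the sign convention in the PPO paper: the entropy bonus enters the minimized loss with a negative coefficient, so that minimizing the total loss maximizes entropy. Whether the repository needs a sign change depends on whether its loss_entropy already stores the negative entropy.

def ppo_total_loss(loss_surr, loss_value, entropy, coeff_value=0.5, coeff_entropy=0.01):
    """Total PPO loss with an entropy bonus; entropy is assumed to be the (positive) policy entropy."""
    return loss_surr + coeff_value * loss_value - coeff_entropy * entropy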
