reinforcement-implementation's Issues

The loss_entropy term in PPO should be maximized

In the PPO implementation at https://github.com/zhangchuheng123/Reinforcement-Implementation/blob/master/code/ppo.py, the total loss is computed as

total_loss = loss_surr + args.loss_coeff_value * loss_value + args.loss_coeff_entropy * loss_entropy

In this total_loss, shouldn't the sign of loss_entropy be negative, i.e., so that the entropy is maximized?
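
For illustration, assuming loss_entropy denotes the (positive) mean entropy of the policy, I would expect the sign to be flipped so that minimizing total_loss maximizes the entropy, roughly like this:

total_loss = loss_surr + args.loss_coeff_value * loss_value - args.loss_coeff_entropy * loss_entropy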

Issues regarding PPO implementation

Hi Chuheng,

Hope you're having a great day and staying safe. Thank you for open-sourcing your implementation of PPO, which helped me a lot during the initial stages of my research. However, after carefully inspecting the PPO implementation from openai/baselines, I found 32 implementation details, which you can find at https://costa.sh/blog-the-32-implementation-details-of-ppo.html, and it appears your implementation has some discrepancies. Namely, they are:

  • per-mini-batch advantage normalization
    • the PPO implementation from openai/baselines normalizes the advantages at the mini-batch level, whereas your implementation normalizes them over the whole batch (see the sketches after this list)
  • mini-batch construction
    • the PPO implementation from openai/baselines makes sure every transition in the whole batch is used for the policy updates, whereas your implementation randomly samples transitions (also shown in the first sketch below)
  • value loss implementation
    • the PPO implementation from openai/baselines does not normalize the value loss as you do. Instead, the normalized environment wrapper applies the special kind of reward scaling mentioned in my blog post. In addition, the value loss is clipped in a manner similar to the clipped policy objective (sketched below)
  • clip range annealing
    • although this option exists, the PPO implementation from openai/baselines does not use it for either the Atari or the MuJoCo experiments
  • the use of parallel environments
    • although this may seem like a minor difference from your implementation, the use of parallel environments actually changes the kind of trajectories collected for the PPO agent. Please see my blog post for more details
  • orthogonal initialization
    • the scale for the hidden layers should be 2**0.5 instead of the 1 used in your implementation (sketched below)
  • entropy loss calculation
    • the PPO implementation from openai/baselines computes the entropy loss analytically from the policy distribution instead of approximating it from sampled data as you do
  • global gradient clipping
    • the PPO implementation from openai/baselines makes sure the norm of the concatenated gradients of all parameters does not exceed 0.5
  • the epsilon parameter of the Adam optimizer
    • the PPO implementation from openai/baselines sets the epsilon parameter of the Adam optimizer to 1e-5, which is different from the default 1e-8
  • My additional comment is that your use of the Memory class seems unnecessary.
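
To make the first two points concrete, here is a minimal PyTorch-style sketch of the baselines-style update loop; the names (advantages, batch_size, minibatch_size, num_epochs) are illustrative placeholders rather than variables from your repo:

import torch

batch_size, minibatch_size, num_epochs = 2048, 64, 10  # illustrative sizes
advantages = torch.randn(batch_size)                    # placeholder for the GAE advantages

for epoch in range(num_epochs):
    indices = torch.randperm(batch_size)                # reshuffle once per epoch
    for start in range(0, batch_size, minibatch_size):
        mb_idx = indices[start:start + minibatch_size]  # every transition is used exactly once
        mb_adv = advantages[mb_idx]
        # normalize advantages at the mini-batch level, not over the whole batch
        mb_adv = (mb_adv - mb_adv.mean()) / (mb_adv.std() + 1e-8)
        # ... compute the clipped surrogate with mb_adv and take an optimizer step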
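
Similarly, here is a hedged sketch of the clipped value loss and the analytic entropy term; values_new, values_old, returns, action_mean, and action_std are placeholder names:

import torch

def clipped_value_loss(values_new, values_old, returns, clip_eps=0.2):
    # clip the new value prediction to stay within clip_eps of the old prediction
    values_clipped = values_old + (values_new - values_old).clamp(-clip_eps, clip_eps)
    loss_unclipped = (values_new - returns).pow(2)
    loss_clipped = (values_clipped - returns).pow(2)
    # element-wise maximum, analogous to the clipped policy objective
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()

# analytic entropy of a Gaussian policy instead of a sample-based estimate:
# dist = torch.distributions.Normal(action_mean, action_std)
# loss_entropy = dist.entropy().mean()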
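
Finally, a sketch of the initialization and optimizer details; the layer sizes and policy_net here are made up for illustration:

import torch
import torch.nn as nn

def layer_init(layer, gain=2 ** 0.5, bias_const=0.0):
    # orthogonal initialization; hidden layers use gain sqrt(2)
    nn.init.orthogonal_(layer.weight, gain)
    nn.init.constant_(layer.bias, bias_const)
    return layer

policy_net = nn.Sequential(
    layer_init(nn.Linear(8, 64)), nn.Tanh(),
    layer_init(nn.Linear(64, 64)), nn.Tanh(),
    layer_init(nn.Linear(64, 2), gain=0.01),  # small gain for the output layer
)

optimizer = torch.optim.Adam(policy_net.parameters(), lr=3e-4, eps=1e-5)  # eps=1e-5, not the default 1e-8

# after total_loss.backward(), clip the global norm of all gradients to 0.5
nn.utils.clip_grad_norm_(policy_net.parameters(), max_norm=0.5)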

I found your implementation to be especially helpful when I was getting into the field, and I feel strongly about giving you this feedback. I hope it is helpful to you.

Best wishes,
Costa

About the line-search in TRPO

In your TRPO implementation, when doing the line search, it seems that the acceptance of a model update depends only on the surrogate loss rather than checking the KL constraint explicitly. The condition looks like:

if actual_improve > 0 and actual_improve > alpha * expected_improve:
    return True, xnew

Did I misunderstand your code? Could you please explain the implementation here? Thank you.
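
For reference, what I expected is an acceptance test that also checks the KL constraint explicitly, roughly like the sketch below; kl_new and max_kl are placeholder names rather than your actual variables:

def accept_step(actual_improve, expected_improve, alpha, kl_new, max_kl):
    # accept only if the surrogate improved enough *and* the new policy stays
    # within the trust region defined by max_kl
    improved = actual_improve > 0 and actual_improve > alpha * expected_improve
    within_trust_region = kl_new <= max_kl
    return improved and within_trust_region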
