reinforcement-implementation's Issues

The loss_entropy term in PPO should be maximized

In the PPO implementation at https://github.com/zhangchuheng123/Reinforcement-Implementation/blob/master/code/ppo.py, the total loss is computed as

total_loss = loss_surr + args.loss_coeff_value * loss_value + args.loss_coeff_entropy * loss_entropy

In this total_loss, shouldn't the sign of loss_entropy be negative, i.e., so that the entropy is maximized?
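
For illustration, assuming loss_entropy denotes the (positive) mean entropy of the policy, I would expect the sign to be flipped so that minimizing total_loss maximizes the entropy, roughly like this:

total_loss = loss_surr + args.loss_coeff_value * loss_value - args.loss_coeff_entropy * loss_entropy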

Issues regarding PPO implementation

Hi Chuheng,

Hope you're having a great day and staying safe. Thank you for open-sourcing your implementation of PPO, which helped me a lot during the initial stages of my research. However, after carefully inspecting the PPO implementation from openai/baselines, I found 32 implementation details, which you can find at https://costa.sh/blog-the-32-implementation-details-of-ppo.html, and it appears your implementation has some discrepancies. Namely, they are:

  • per-mini-batch advantage normalization
    • the PPO implementation from openai/baselines normalizes the advantages at the mini-batch level, whereas your implementation normalizes them over the whole batch (see the sketches after this list)
  • mini-batch construction
    • the PPO implementation from openai/baselines makes sure every transition in the whole batch is used for the policy updates, whereas your implementation randomly samples transitions (also shown in the first sketch below)
  • value loss implementation
    • the PPO implementation from openai/baselines does not normalize the value loss as you do. Instead, the normalized environment wrapper applies the special kind of reward scaling mentioned in my blog post. In addition, the value loss is clipped in a manner similar to the clipped policy objective (sketched below)
  • clip range annealing
    • although this option exists, the PPO implementation from openai/baselines does not use it for either the Atari or the MuJoCo experiments
  • the use of parallel environments
    • although this may seem like a minor difference from your implementation, the use of parallel environments actually changes the kind of trajectories collected for the PPO agent. Please see my blog post for more details
  • orthogonal initialization
    • the scale for the hidden layers should be 2**0.5 instead of the 1 used in your implementation (sketched below)
  • entropy loss calculation
    • the PPO implementation from openai/baselines computes the entropy loss analytically from the policy distribution instead of approximating it from sampled data as you do
  • global gradient clipping
    • the PPO implementation from openai/baselines makes sure the norm of the concatenated gradients of all parameters does not exceed 0.5
  • the epsilon parameter of the Adam optimizer
    • the PPO implementation from openai/baselines sets the epsilon parameter of the Adam optimizer to 1e-5, which is different from the default 1e-8
  • My additional comment is that your use of the Memory class seems unnecessary.
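
To make the first two points concrete, here is a minimal PyTorch-style sketch of the baselines-style update loop; the names (advantages, batch_size, minibatch_size, num_epochs) are illustrative placeholders rather than variables from your repo:

import torch

batch_size, minibatch_size, num_epochs = 2048, 64, 10  # illustrative sizes
advantages = torch.randn(batch_size)                    # placeholder for the GAE advantages

for epoch in range(num_epochs):
    indices = torch.randperm(batch_size)                # reshuffle once per epoch
    for start in range(0, batch_size, minibatch_size):
        mb_idx = indices[start:start + minibatch_size]  # every transition is used exactly once
        mb_adv = advantages[mb_idx]
        # normalize advantages at the mini-batch level, not over the whole batch
        mb_adv = (mb_adv - mb_adv.mean()) / (mb_adv.std() + 1e-8)
        # ... compute the clipped surrogate with mb_adv and take an optimizer step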
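
Similarly, here is a hedged sketch of the clipped value loss and the analytic entropy term; values_new, values_old, returns, action_mean, and action_std are placeholder names:

import torch

def clipped_value_loss(values_new, values_old, returns, clip_eps=0.2):
    # clip the new value prediction to stay within clip_eps of the old prediction
    values_clipped = values_old + (values_new - values_old).clamp(-clip_eps, clip_eps)
    loss_unclipped = (values_new - returns).pow(2)
    loss_clipped = (values_clipped - returns).pow(2)
    # element-wise maximum, analogous to the clipped policy objective
    return 0.5 * torch.max(loss_unclipped, loss_clipped).mean()

# analytic entropy of a Gaussian policy instead of a sample-based estimate:
# dist = torch.distributions.Normal(action_mean, action_std)
# loss_entropy = dist.entropy().mean()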
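
Finally, a sketch of the initialization and optimizer details; the layer sizes and policy_net here are made up for illustration:

import torch
import torch.nn as nn

def layer_init(layer, gain=2 ** 0.5, bias_const=0.0):
    # orthogonal initialization; hidden layers use gain sqrt(2)
    nn.init.orthogonal_(layer.weight, gain)
    nn.init.constant_(layer.bias, bias_const)
    return layer

policy_net = nn.Sequential(
    layer_init(nn.Linear(8, 64)), nn.Tanh(),
    layer_init(nn.Linear(64, 64)), nn.Tanh(),
    layer_init(nn.Linear(64, 2), gain=0.01),  # small gain for the output layer
)

optimizer = torch.optim.Adam(policy_net.parameters(), lr=3e-4, eps=1e-5)  # eps=1e-5, not the default 1e-8

# after total_loss.backward(), clip the global norm of all gradients to 0.5
nn.utils.clip_grad_norm_(policy_net.parameters(), max_norm=0.5)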

I found your implementation to be especially helpful when I was getting into the field, and I feel strongly about giving you this feedback. I hope it is helpful to you.

Best wishes,
Costa

About the line-search in TRPO

In your TRPO implementation, when doing the line search, it seems that the acceptance of a model update depends only on the surrogate loss rather than checking the KL constraint explicitly. The condition looks like:

if actual_improve > 0 and actual_improve > alpha * expected_improve:
    return True, xnew

Did I misunderstand your code? Could you please explain the implementation here? Thank you.
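
For reference, what I expected is an acceptance test that also checks the KL constraint explicitly, roughly like the sketch below; kl_new and max_kl are placeholder names rather than your actual variables:

def accept_step(actual_improve, expected_improve, alpha, kl_new, max_kl):
    # accept only if the surrogate improved enough *and* the new policy stays
    # within the trust region defined by max_kl
    improved = actual_improve > 0 and actual_improve > alpha * expected_improve
    within_trust_region = kl_new <= max_kl
    return improved and within_trust_region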
