
reinforcement-implementation's Introduction

Reinforcement-Implementation

This project aims to reproduce the results of several model-free RL algorithms in continuous action domains (MuJoCo environments).

This project

  • uses the PyTorch package
  • implements each algorithm independently in a separate, minimal file
  • is written in the simplest possible style
  • tries to follow the original papers and reproduce their results

My first stage of work is to reproduce the algorithm-comparison figure in the PPO paper:

  • A2C
  • ACER (A2C + Trust Region): it seems that this implementation has some problems ... (bug reports welcome)
  • CEM
  • TRPO (TRPO single path)
  • PPO (PPO clip)
  • Vanilla PG

On the next stage, I want to implement algorithms for discrete action spaces and raw video input (Atari) problems:

  • Rainbow: DQN and relevant techniques (target network / double Q-learning / prioritized experience replay / dueling network structure / distributional RL)
  • PPO with random network distillation (RND)

Rainbow on Atari with only 3M: It works but may need further tuning.
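
As a concrete example of one of the components listed above, here is a minimal PyTorch sketch (not the repository's code) of the double Q-learning target: the online network selects the greedy action and the target network evaluates it. The network interface (observations in, per-action Q-values out) and float-valued done flags are assumptions for illustration.

import torch

def double_q_target(online_net, target_net, next_obs, rewards, dones, gamma=0.99):
    """Double Q-learning target: argmax from the online net, value from the target net."""
    with torch.no_grad():
        next_actions = online_net(next_obs).argmax(dim=1, keepdim=True)   # a* = argmax_a Q_online(s', a)
        next_q = target_net(next_obs).gather(1, next_actions).squeeze(1)  # Q_target(s', a*)
        return rewards + gamma * (1.0 - dones) * next_q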

And then, model-based algorithms (not yet planned):

  • PILCO
  • PE-TS

TODOs:

  • change the way rewards are counted; the current approach may underestimate the reward (evaluate a deterministic policy rather than a stochastic/exploratory one)
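
A minimal sketch of this evaluation change, assuming (hypothetically) a Gaussian policy network that returns (mean, log_std) and the classic Gym reset/step API: act with the distribution mean instead of sampling from it.

import torch

def evaluate_deterministic(policy_net, env, episodes=10):
    """Evaluate by acting with the Gaussian mean, i.e., without exploration noise."""
    returns = []
    for _ in range(episodes):
        obs = env.reset()
        done, ep_ret = False, 0.0
        while not done:
            with torch.no_grad():
                mean, _log_std = policy_net(torch.as_tensor(obs, dtype=torch.float32))
            obs, reward, done, _ = env.step(mean.numpy())  # deterministic action
            ep_ret += reward
        returns.append(ep_ret)
    return sum(returns) / len(returns)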

PPO Implementation

The PPO implementation is of high quality and matches the performance of openai/baselines.
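
For reference, a minimal sketch (not the repository's code) of the clipped surrogate objective from the PPO paper that this implementation follows; logp_new, logp_old, and advantages are assumed to be per-sample tensors.

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss; minimizing it performs gradient ascent on the PPO objective."""
    ratio = torch.exp(logp_new - logp_old)                                    # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()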

Update

Recently, I added Rainbow and DQN. The Rainbow implementation is of high quality on Atari games, enough for you to modify and build your own research on. The DQN implementation is a minimal working version that reaches good performance on MountainCar (a simple task, yet many implementations on GitHub do not achieve good performance or need additional reward/environment engineering). This should be enough for a fast test of your research ideas.

reinforcement-implementation's People

Contributors

rl4inventory, zhangchuheng123


reinforcement-implementation's Issues

Issues regarding PPO implementation

Hi Chuheng,

Hope you're having a great day and staying safe. Thank you for open-sourcing your implementation of PPO, which helped me a lot during the initial stages of my research. However, after carefully inspecting the PPO implementation from openai/baselines, I found 32 implementation details, which you can find at https://costa.sh/blog-the-32-implementation-details-of-ppo.html, and it appears your implementation has some discrepancies. Namely, they are:

  • per-minibatch advantage normalization
    • the PPO implementation from openai/baselines normalizes the advantages at the mini-batch level, whereas your implementation normalizes them over the whole batch (several of these details are illustrated in the code sketch after this list)
  • mini-batch construction
    • the PPO implementation from openai/baselines makes sure that every transition in the whole batch is used for the policy updates, whereas your implementation randomly samples transitions
  • value loss implementation
    • the PPO implementation from openai/baselines does not normalize the value loss as you do. Instead, the environment-normalization wrapper applies the special kind of reward scaling mentioned in my blog post. In addition, the value loss is clipped in a manner similar to the clipped policy objective
  • clip range annealing
    • although this option is available, the PPO implementation from openai/baselines does not use it for either the Atari or the MuJoCo experiments
  • the use of parallel environments
    • although this may seem like a minor difference from your implementation, the use of parallel environments actually changes the kind of trajectories collected for the PPO agents. Please see my blog post for more details
  • orthogonal initialization
    • the scale for the hidden layers should be 2**0.5 instead of the 1 used in your implementation
  • entropy loss calculation
    • the PPO implementation from openai/baselines calculates the entropy analytically from the action distribution instead of approximating it from episodic data as you do
  • global gradient clipping
    • the PPO implementation from openai/baselines makes sure that the norm of the concatenated gradients of all parameters does not exceed 0.5
  • the epsilon parameter of the Adam optimizer
    • the PPO implementation from openai/baselines sets the epsilon parameter of the Adam optimizer to 1e-5, which is different from the default 1e-8
  • My additional comment is that your use of the Memory seems quite unnecessary.
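
To make the list above concrete, here is a minimal PyTorch sketch of a PPO update loop that follows several of these details: a shuffled pass over the whole batch, per-minibatch advantage normalization, a clipped value loss, an analytic entropy term, global gradient clipping at 0.5, orthogonal initialization with gain 2**0.5, and Adam with epsilon 1e-5. This is neither the repository's code nor openai/baselines code, and the policy interface (returning mean, log std, and state value) is an assumption for illustration.

import numpy as np
import torch
import torch.nn as nn
from torch.distributions import Normal

def orthogonal_init(layer, gain=np.sqrt(2)):
    """Orthogonal initialization; hidden layers use gain sqrt(2)."""
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer

def ppo_update(policy, optimizer, obs, actions, logp_old, returns, values_old, advantages,
               clip_eps=0.2, vf_coef=0.5, ent_coef=0.01, max_grad_norm=0.5,
               epochs=10, minibatch_size=64):
    batch_size = obs.shape[0]
    for _ in range(epochs):
        # Mini-batch construction: a shuffled pass over the whole batch,
        # so every transition is used exactly once per epoch.
        indices = torch.randperm(batch_size)
        for start in range(0, batch_size, minibatch_size):
            mb = indices[start:start + minibatch_size]

            # Per-minibatch advantage normalization.
            adv = advantages[mb]
            adv = (adv - adv.mean()) / (adv.std() + 1e-8)

            mean, log_std, value = policy(obs[mb])   # assumed policy interface
            dist = Normal(mean, log_std.exp())
            logp = dist.log_prob(actions[mb]).sum(-1)

            # Clipped policy objective.
            ratio = torch.exp(logp - logp_old[mb])
            surr1 = ratio * adv
            surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
            policy_loss = -torch.min(surr1, surr2).mean()

            # Value loss clipped analogously to the clipped policy objective.
            v_clipped = values_old[mb] + torch.clamp(value - values_old[mb], -clip_eps, clip_eps)
            value_loss = 0.5 * torch.max((value - returns[mb]) ** 2,
                                         (v_clipped - returns[mb]) ** 2).mean()

            # Analytic entropy from the distribution, not an episodic estimate.
            entropy = dist.entropy().sum(-1).mean()

            loss = policy_loss + vf_coef * value_loss - ent_coef * entropy

            optimizer.zero_grad()
            loss.backward()
            # Global gradient clipping over all parameters.
            nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
            optimizer.step()

# Adam with the non-default epsilon used by openai/baselines:
# optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4, eps=1e-5)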

I found your implementation especially helpful when I was getting into the field, and I feel strongly about giving you this feedback. I hope it is helpful to you.

Best wishes,
Costa

About the line-search in TRPO

In your TRPO implementation, when doing the line search, it seems that the acceptance of a model update depends only on the surrogate loss rather than using the KL constraint explicitly. The condition looks like:

if actual_improve > 0 and actual_improve > alpha * expected_improve:
    return True, xnew

Did I misunderstand your code? Could you please explain the implementation here? Thank you.
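
For comparison, here is a minimal sketch (not the repository's code) of a backtracking line search that additionally checks the KL constraint, which is one common way to enforce the trust region during the step-size search; surrogate_fn and kl_fn are assumed callables that evaluate the surrogate loss and the mean KL divergence at a given parameter vector.

def line_search_with_kl(surrogate_fn, kl_fn, x, fullstep, expected_improve, max_kl,
                        accept_ratio=0.1, max_backtracks=10):
    """Accept a step only if the surrogate improves enough and the KL stays in the trust region."""
    loss_before = surrogate_fn(x)
    for n in range(max_backtracks):
        step_frac = 0.5 ** n
        x_new = x + step_frac * fullstep
        actual_improve = loss_before - surrogate_fn(x_new)   # surrogate_fn is a loss: smaller is better
        if (actual_improve > 0
                and actual_improve > accept_ratio * step_frac * expected_improve
                and kl_fn(x_new) <= max_kl):                  # explicit KL check
            return True, x_new
    return False, x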

The loss_entropy term in PPO should be maximized

In the PPO implementation at https://github.com/zhangchuheng123/Reinforcement-Implementation/blob/master/code/ppo.py, the total loss is defined as

total_loss = loss_surr + args.loss_coeff_value * loss_value + args.loss_coeff_entropy * loss_entropy

In the total_loss defined above, should the sign of loss_entropy be negative, i.e., so that the entropy is maximized?
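
For reference, a minimal sketch of the sign convention in the PPO paper: the entropy bonus enters the minimized loss with a negative coefficient, so that minimizing the total loss maximizes entropy. Whether the repository needs a sign change depends on whether its loss_entropy already stores the negative entropy.

def ppo_total_loss(loss_surr, loss_value, entropy, coeff_value=0.5, coeff_entropy=0.01):
    """Total PPO loss with an entropy bonus; entropy is assumed to be the (positive) policy entropy."""
    return loss_surr + coeff_value * loss_value - coeff_entropy * entropy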
