Comments (6)
Your procedure is exactly equivalent to:
- the policy generates data
- the policy is updated iteratively
- loop
from ppo-pytorch.
You are right, but the author's approach is also justified.
I think the current process is not redundant:
- The policy is updated K (=80) times per epoch; if the KL divergence needs to be computed within the epoch, policy_old must be retained.
- In fact, the KL divergence can serve as a trick for the PPO algorithm: the PyTorch PPO implementation from OpenAI Spinning Up early-stops the epoch as soon as the KL divergence grows too large.
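The early-stopping trick described above can be sketched roughly as follows (NumPy for brevity; function names here are illustrative, not taken from the ppo-pytorch repo):

```python
import numpy as np

def update_policy(logp_old, compute_logp, gradient_step, K=80, target_kl=0.01):
    """Run up to K policy updates per epoch, stopping early once the
    approximate KL divergence between new and old policy gets too large.
    Mirrors the early-stopping trick in OpenAI Spinning Up's PPO."""
    for k in range(K):
        logp_new = compute_logp()
        # mean(logp_old - logp_new) is a simple sample-based KL estimator
        approx_kl = float(np.mean(logp_old - logp_new))
        if approx_kl > 1.5 * target_kl:
            break  # policy has moved too far from policy_old: stop this epoch
        gradient_step()
    return k

# toy "policy" whose log-probs drift away from the old ones at each step
state = {"shift": 0.0}
logp_old = np.zeros(4)

def gradient_step():
    state["shift"] += 0.01

stopped_at = update_policy(
    logp_old,
    compute_logp=lambda: logp_old - state["shift"],
    gradient_step=gradient_step,
)
```

This is why policy_old must be retained: without the old log-probabilities, the approximate KL inside the epoch cannot be computed.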
Thanks for the reply. I agree with your point: the old policy can indeed be used to compute the KL divergence, keeping each update from being too large.
Besides, when I learned that after all its detours PPO ultimately tells me the "other distribution" is just the previous iteration's distribution, I honestly wanted to flip the table.
The whole of PPO feels like an engineering experiment: someone noticed the for-loop iteration lacked a damping term and tried adding one, yet the resulting paper is so obscure.
Although the motivation of PPO might not be that simple, I guess part of PPO's motivation was discovered while applying TRPO in practice.
Actually, introducing KL divergence into PPO for early stopping is just a trick; the authors of the PPO paper did not intend that.
In my opinion, PPO is motivated by TRPO's computational complexity. Instead of computing the KL divergence (very slow) like TRPO, PPO (clip version) simply limits the policy update with a clip() function.
You can dig deeper by reading OpenAI Spinning Up's page on PPO, which I have cited below :)
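The clipped surrogate objective mentioned above can be sketched in a few lines (NumPy for brevity; a real implementation computes the probability ratio from the policy networks):

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Negative clipped surrogate objective (a loss to minimize).

    ratio = pi_new(a|s) / pi_old(a|s). Clipping the ratio to
    [1 - eps, 1 + eps] removes any incentive for the new policy
    to move far from the old one in a single update."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # take the pessimistic (lower) bound of the two surrogates
    return -float(np.mean(np.minimum(unclipped, clipped)))
```

For example, with a positive advantage and a ratio of 1.5, the clipped term caps the contribution at 1.2 times the advantage, so pushing the ratio further gains nothing.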
PPO is motivated by the same question as TRPO: how can we take the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause performance collapse? Where TRPO tries to solve this problem with a complex second-order method, PPO is a family of first-order methods that use a few other tricks to keep new policies close to old. PPO methods are significantly simpler to implement, and empirically seem to perform at least as well as TRPO.
There are two primary variants of PPO: PPO-Penalty and PPO-Clip.
PPO-Penalty approximately solves a KL-constrained update like TRPO, but penalizes the KL-divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient over the course of training so that it’s scaled appropriately.
PPO-Clip doesn’t have a KL-divergence term in the objective and doesn’t have a constraint at all. Instead, it relies on specialized clipping in the objective function to remove incentives for the new policy to get far from the old policy.
——from OpenAI Spinning Up for PPO
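For completeness, the PPO-Penalty variant quoted above can be sketched as follows (NumPy; names are illustrative). The adaptive rule for the penalty coefficient comes from the PPO paper:

```python
import numpy as np

def ppo_penalty_loss(ratio, advantage, kl, beta):
    # maximize  ratio * A - beta * KL  (so minimize its negation);
    # the KL term softly penalizes moving away from the old policy
    return -float(np.mean(ratio * advantage - beta * kl))

def adapt_beta(beta, kl, target_kl=0.01):
    """Adaptive penalty coefficient from the PPO paper: double beta when
    the measured KL overshoots the target, halve it when it undershoots."""
    if kl > 1.5 * target_kl:
        return beta * 2.0
    if kl < target_kl / 1.5:
        return beta / 2.0
    return beta
```

This makes the contrast with TRPO concrete: the KL constraint becomes a soft penalty with a coefficient tuned automatically during training, rather than a hard constraint solved with a second-order method.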
Yes, indeed. Starting from TRPO, PPO's improvement is a success.
Thanks again for your answer; I will look into OpenAI's version later.
You are welcome :)