PPO implementation in PyTorch

General info

The implementation is written for environments with a continuous action space. Currently, only single-agent learning is supported.

The code was written as an exercise while exploring reinforcement learning concepts. The algorithm is based on the description provided in the original Proximal Policy Optimization paper by OpenAI. However, to get a working version of the algorithm, important code-level details were added from The 32 Implementation Details of Proximal Policy Optimization (PPO) Algorithm and this implementation. For more info, check the Implementation details section.

Additional Functionality

  • Save GIFs from rendered episodes; the files are saved in the /images directory (render=True). See the sketch after this list.
  • Tensorboard logs for monitoring score and episode length (tensorboard_logging=True)
  • Load a pretrained model to evaluate or continue training (pretrained=True)
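
A minimal sketch of how frames from a rendered episode might be collected and written as a GIF with imageio; the file name, directory, and use of a random action instead of the trained policy are illustrative assumptions, not the repository's exact code (older Gym render API assumed):

```python
import os
import gym
import imageio

env = gym.make("BipedalWalker-v3")
frames = []

obs = env.reset()
done = False
while not done:
    # Collect an RGB frame for the GIF (older Gym API).
    frames.append(env.render(mode="rgb_array"))
    action = env.action_space.sample()  # placeholder for the trained policy
    obs, reward, done, info = env.step(action)

# Write all frames of the episode to a single GIF file.
os.makedirs("images", exist_ok=True)
imageio.mimsave("images/episode.gif", frames, fps=30)
```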

Dependencies

All dependencies are listed in the requirements.txt file. The implementation uses PyTorch for training and Gym for environments. imageio is an optional dependency, needed only to save GIFs of the rendered environment.

Example result

Training progress of an agent in the BipedalWalker-v3 environment.

[GIFs of the agent after 500, 2000, and 5000 episodes]

[Training chart with the score averaged over 20 consecutive episodes (marked in red)]

Implementation details

Initialization

The weights of the last layer of the policy network are rescaled by 0.01 at initialization. This enforces more random action choices at the beginning of training and thus improves training performance. It is one of the suggestions provided in What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study.

The std value of the distribution used to sample actions is set to 0.5.
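
A minimal sketch of how these two choices could look in PyTorch; the network sizes and module names here are assumptions for illustration, not the repository's actual architecture:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class Policy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean_head = nn.Linear(hidden, act_dim)
        # Rescale the last layer's weights by 0.01 so initial action means
        # stay close to zero and exploration comes mostly from the std.
        with torch.no_grad():
            self.mean_head.weight.mul_(0.01)
            self.mean_head.bias.mul_(0.01)
        # Fixed standard deviation of 0.5 for the action distribution.
        self.action_std = 0.5

    def forward(self, obs):
        mean = self.mean_head(self.body(obs))
        return Normal(mean, self.action_std * torch.ones_like(mean))
```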

Advantage calculation

The advantage is calculated using normalized discounted rewards, and the advantage values are recomputed during every iteration of the policy update.
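
A minimal sketch of this computation, assuming per-step rewards, episode-termination flags, and value estimates from the critic are available; the function name and the gamma value are illustrative assumptions:

```python
import torch

def compute_advantages(rewards, dones, values, gamma=0.99):
    # Discounted rewards-to-go, resetting at episode boundaries.
    returns = []
    discounted = 0.0
    for reward, done in zip(reversed(rewards), reversed(dones)):
        if done:
            discounted = 0.0
        discounted = reward + gamma * discounted
        returns.insert(0, discounted)
    returns = torch.tensor(returns, dtype=torch.float32)

    # Normalize the discounted rewards before computing advantages.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Advantage = normalized discounted reward minus the critic's estimate.
    advantages = returns - values.detach()
    return returns, advantages
```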

Loss calculation

The loss is a sum of the following 3 components (a sketch follows the list):

  • Clipped Surrogate Objective from the PPO paper with an epsilon value of 0.2
  • MSE loss between the estimated state value and the discounted reward, weighted by 0.5
  • Entropy of the action distribution, weighted by -0.01
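
A minimal sketch of how these three terms might be combined, assuming the log-probabilities, advantages, value estimates, discounted returns, and entropy have already been computed; the function and variable names are illustrative, not the repository's exact code:

```python
import torch
import torch.nn.functional as F

def ppo_loss(new_log_probs, old_log_probs, advantages,
             values, returns, entropy, clip_eps=0.2):
    # Clipped surrogate objective (negated, since we minimize the loss).
    ratio = torch.exp(new_log_probs - old_log_probs)
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )
    policy_loss = -surrogate.mean()

    # Critic loss: MSE between estimated state values and discounted rewards.
    value_loss = F.mse_loss(values, returns)

    # Entropy bonus encourages exploration; weighted by -0.01.
    return policy_loss + 0.5 * value_loss - 0.01 * entropy.mean()
```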
