
ppo-for-beginners's People

Contributors

clemens-tolboom, ericyangyu, raymondxzr


ppo-for-beginners's Issues

Using the code for a custom environment with different actions

Hi,
I have used your code in different gym environments and got good results. Now I am trying to control a robot that navigates a dynamic environment by changing its linear and angular velocities. The robot's interaction time with the environment for a given action is also not fixed (that is, the length of the time step varies and is selected from the four options 0.2, 0.4, 0.6, and 0.8).

Since the action space the agent must learn has three parts, how should I use log_prob to calculate the ratio in my problem?

  1. Robot linear velocity in the range [-3, 3], from a tanh activation on the actor
  2. Robot angular velocity in the range [-pi/12, pi/12], from a tanh activation on the actor
  3. Robot time_step length from the set {0.2, 0.4, 0.6, 0.8}, using a softmax on the actor

You use MultivariateNormal, which gives the probability of selecting all actions together, but my action distribution is Normal with different means.
If I use MultivariateNormal for the velocities, how do I add the categorical probability for my step_time?

Dummy way:

Can I use this method?
I get 3 outputs from the actor network in the range [-1, 1], then map output[0] to [-3, 3], output[1] to [-pi/12, pi/12], and output[2] to [0, 1]:
output[0] for linear velocity, output[1] for angular velocity, and output[2] for time. Can I then manually snap the mean for time onto {0.2, 0.4, 0.6, 0.8} with an if condition, use MultivariateNormal with different stds (for the first two actions the std you use, and for time a very small std such as 1e-17), and take the probabilities from that?
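
One common way to handle this (a minimal sketch, not from the repo, assuming the actor outputs a mean for the two continuous actions plus logits over the four step lengths) is to treat the action as two independent distributions and sum their log-probabilities; the PPO ratio then works exactly as in the single-distribution case:

import torch
from torch.distributions import MultivariateNormal, Categorical

# Hypothetical actor outputs for one state: a mean for [linear, angular]
# and logits over the four step lengths {0.2, 0.4, 0.6, 0.8}.
cont_mean = torch.tensor([0.5, 0.1])
time_logits = torch.tensor([0.2, 0.1, -0.3, 0.0])

cov_mat = torch.diag(torch.full((2,), 0.5))        # fixed covariance, as in the tutorial
cont_dist = MultivariateNormal(cont_mean, cov_mat)
time_dist = Categorical(logits=time_logits)

cont_action = cont_dist.sample()                   # continuous velocities
time_action = time_dist.sample()                   # index into [0.2, 0.4, 0.6, 0.8]

# The parts are sampled independently, so the joint log prob is the sum,
# and ratios = exp(curr_log_prob - old_log_prob) is computed as usual.
log_prob = cont_dist.log_prob(cont_action) + time_dist.log_prob(time_action)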

The Average Episodic Return and Average Loss are nan

Hi Eric,

I am a beginner with PPO, and I tried your code (with the _log_summary() function implemented) using the following main block.

if __name__ == '__main__':
	hyperparameters = {
				'timesteps_per_batch': 2048, 
				'max_timesteps_per_episode': 200, 
				'gamma': 0.99, 
				'n_updates_per_iteration': 10,
				'lr': 3e-4, 
				'clip': 0.2
			  }
	env = gym.make('Pendulum-v0')
	model = PPO(env=env, **hyperparameters)
	print(f'Model information ====== {model}')
	model.learn(10000)

The output keeps giving nan values for the Average Episodic Return and Average Loss. For example:

-------------------- Iteration #50 --------------------
Average Episodic Length: 1.0
Average Episodic Return: nan
Average Loss: nan
Timesteps So Far: 50

Can you kindly help with this? The policy network is FeedForwardNN.

How to fix: Broken with latest gym pip package

The env.step return values changed, so here is how to get the code going now:

        # Number of timesteps run so far this batch
        t = 0 
        while t < self.timesteps_per_batch:
            # Rewards this episode
            ep_rews = []

            obs = self.env.reset()
            if isinstance(obs, tuple):
                obs = obs[0]  # Assuming the first element of the tuple is the relevant data

            terminated = False
            for ep_t in range(self.max_timesteps_per_episode):
                # Increment timesteps ran this batch so far
                t += 1
                # Collect observation
                batch_obs.append(obs)
                action, log_prob = self.get_action(obs)

                obs, rew, terminated, truncated, _ = self.env.step(action)
                if isinstance(obs, tuple):
                    obs = obs[0]  # Assuming the first element of the tuple is the relevant data

                # Collect reward, action, and log prob
                ep_rews.append(rew)
                batch_acts.append(action)
                batch_log_probs.append(log_prob)

                # End the episode as soon as the environment reports either flag
                if terminated or truncated:
                    break

Note that you now have to check both the terminated and truncated return values. The latest documentation is here: https://www.gymlibrary.dev/api/core/

Without this, if you follow along with the blog post, it will fail at the end of Blog 3 at this step:

import gym
env = gym.make('Pendulum-v1')
model = PPO(env)
model.learn(10000)

Also, you need to update Pendulum-v0 to Pendulum-v1.
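
For reference, here is a minimal sketch of the new Gym API (assuming gym >= 0.26, where reset() returns (obs, info) and step() returns five values); the episode ends when either terminated or truncated is true:

import gym

env = gym.make('Pendulum-v1')
obs, info = env.reset()                  # new API: reset() returns (obs, info)

done = False
while not done:
    action = env.action_space.sample()   # placeholder policy, just to show the loop
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated       # episode ends on either flag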

Why is the critic's loss the mean squared error between the predicted values and the rewards-to-go?

Thanks very much for the tutorial, but I have a question.
From my understanding, the critic's loss should be sqr(predicted value - true value),
but in the code and the paper it is
critic_loss = nn.MSELoss()(V, batch_rtgs)
V is the predicted value, but why can batch_rtgs be treated as the true value? It was previously used as the Q value in the advantage function:
A_k = batch_rtgs - V.detach()
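
For context, a small sketch of my own (not from the repo): the rewards-to-go are a Monte Carlo sample of the discounted return from each state, which is exactly what the value function is defined to estimate, so regressing V toward batch_rtgs with MSE is the standard value-function target:

import torch

gamma = 0.99
ep_rews = [1.0, 0.5, 2.0]                # rewards of one episode, just for illustration

# Rewards-to-go: R_t = sum over k >= t of gamma^(k-t) * r_k,
# i.e. a sampled return from state s_t onwards.
rtgs = []
discounted = 0.0
for rew in reversed(ep_rews):
    discounted = rew + gamma * discounted
    rtgs.insert(0, discounted)
rtgs = torch.tensor(rtgs)

V = torch.tensor([2.0, 1.0, 1.5], requires_grad=True)  # critic predictions V(s_t)
critic_loss = torch.nn.MSELoss()(V, rtgs)              # regress V toward the sampled returns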

Covariance matrix

I don't understand why you chose a fixed covariance matrix.
Shouldn't the covariance matrix be learned by the actor network?
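
For comparison, a common alternative (not what the tutorial does, which keeps the covariance fixed) is to learn a state-independent log standard deviation as a free parameter alongside the mean network; a minimal sketch:

import torch
import torch.nn as nn
from torch.distributions import MultivariateNormal

class GaussianActor(nn.Module):
    """Hypothetical actor that learns a state-independent log std."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(),
            nn.Linear(64, act_dim),
        )
        # Optimized jointly with the network weights by the same optimizer
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_net(obs)
        cov = torch.diag(self.log_std.exp() ** 2)  # diagonal covariance from the learned std
        return MultivariateNormal(mean, cov)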

PPO gets stuck in custom environment

Hello, and thank you for the implementation; it really helps. I have an environment that is sparse and only gives a reward on task completion. I followed your code and implemented a PPO algorithm that uses a simple actor-critic network. I am attaching my code for the network and PPO here.

ActorCritic

class ActorCritic(nn.Module):

    def __init__(self):
        super(ActorCritic, self).__init__()

        self.fc1 = nn.Linear(33, 128)
        self.fc2 = nn.Linear(128, 128)

        self.critic = nn.Linear(128, 1)
        self.actor = nn.Linear(128, 3)
        self.apply(init_weights)

    def forward(self, x):
        x = torch.tanh(self.fc1(x))
        x = torch.tanh(self.fc2(x))

        return self.critic(x), F.softmax(self.actor(x), dim=-1)

PPO

class PPO():
    
    def __init__(
        self,
        env,
        policy,
        lr,
        gamma,
        betas,
        gae_lambda,
        eps_clip,
        entropy_coef,
        value_coef,
        max_grad_norm,
        timesteps_per_batch,
        n_updates_per_itr,
        summary_writer,
        norm_obs = True):

        self.policy = policy
        self.env = env
        self.lr = lr
        self.gamma = gamma
        self.betas = betas
        self.gae_lambda = gae_lambda
        self.eps_clip = eps_clip
        self.entropy_coef = entropy_coef
        self.value_coef = value_coef
        self.max_grad_norm = max_grad_norm
        self.timesteps_per_batch = timesteps_per_batch
        self.n_updates_per_itr = n_updates_per_itr
        self.summary_writer = summary_writer
        self.norm_obs = norm_obs
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.total_updates = 0

        self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=self.lr, betas=self.betas)

    def learn(self, total_timesteps=1000000, callback=None):
        
        timesteps = 0
        while timesteps < total_timesteps:
            batch_obs, batch_actions, batch_log_probs, batch_rtgs, batch_advantages, batch_lens = self.rollout()

            timesteps += np.sum(batch_lens)
            
            advantage_k = (batch_advantages - batch_advantages.mean()) / (batch_advantages.std() + 1e-10)

            for i in range(self.n_updates_per_itr):
                state_values, action_probs = self.policy(batch_obs)
                state_values = state_values.squeeze()

                dist = Categorical(action_probs)
                curr_log_probs = dist.log_prob(batch_actions)

                ratios = torch.exp(curr_log_probs - batch_log_probs)

                surr1 = ratios * advantage_k
                surr2 = torch.clamp(ratios, 1 - self.eps_clip, 1 + self.eps_clip) * advantage_k

                policy_loss = (-torch.min(surr1, surr2)).mean()
                value_loss = F.mse_loss(state_values, batch_rtgs)
                total_loss = policy_loss + self.value_coef * value_loss
                self.optimizer.zero_grad()
                total_loss.backward()
                self.optimizer.step()

                self.total_updates += 1
                self.summary_writer.add_scalar("policy_loss", policy_loss, self.total_updates)
                self.summary_writer.add_scalar("value_loss", value_loss, self.total_updates)
                self.summary_writer.add_scalar("total_loss", total_loss, self.total_updates)


            if callback:
                callback.eval_policy(self.policy, self.summary_writer, self.norm_obs)
        

    def rollout(self):
        batch_obs = []
        batch_acts = []
        batch_state_values = []
        batch_log_probs = []
        batch_rewards = []
        batch_rtgs = []
        batch_lens = []
        batch_advantages = []
        batch_terminals = []

        timesteps_collected = 0
        while timesteps_collected < self.timesteps_per_batch:
            eps_rewards = []
            eps_state_values = []
            eps_terminals = []
            obs = self.env.reset()

            done = False
            eps_timesteps = 0
            for _ in range(50):
                timesteps_collected += 1
                if self.norm_obs:
                    obs = (obs - obs.mean()) / (obs.std() - 1e-10)
                batch_obs.append(obs)

                state_value, action_probs = self.policy(torch.from_numpy(obs).type(torch.float).to(self.device))

                dist = Categorical(action_probs)
                action = dist.sample()
                act_log_prob = dist.log_prob(action)

                obs, reward, done, _ = self.env.step(action.cpu().detach().item())
                
                eps_rewards.append(reward)
                eps_state_values.append(state_value.squeeze().cpu().detach().item())
                eps_terminals.append(0 if done else 1)

                batch_acts.append(action.cpu().detach().item())
                batch_log_probs.append(act_log_prob.cpu().detach().item())
                
                eps_timesteps += 1
                if done:
                    break
            batch_lens.append(eps_timesteps)
            batch_rewards.append(eps_rewards)
            batch_state_values.append(eps_state_values)
            batch_terminals.append(eps_terminals)

        batch_obs = torch.tensor(batch_obs, dtype=torch.float).to(self.device)
        batch_acts = torch.tensor(batch_acts, dtype=torch.float).to(self.device)
        batch_log_probs = torch.tensor(batch_log_probs, dtype=torch.float).flatten().to(self.device)
        
        for eps_rewards, eps_state_values, eps_terminals in zip(reversed(batch_rewards), reversed(batch_state_values), reversed(batch_terminals)):
            discounted_reward = 0
            gae = 0
            next_state_value = 0
            next_terminal = 0
            for reward, state_value, terminal in zip(reversed(eps_rewards), reversed(eps_state_values), reversed(eps_terminals)):
                discounted_reward = reward + self.gamma * discounted_reward
                delta = reward + self.gamma * next_state_value * next_terminal - state_value
                gae = delta + self.gamma * self.gae_lambda * next_terminal * gae
                batch_rtgs.insert(0, discounted_reward)
                batch_advantages.insert(0, gae)
                next_state_value = state_value
                next_terminal = terminal

        batch_rtgs = torch.tensor(batch_rtgs, dtype=torch.float).to(self.device)
        batch_advantages = torch.tensor(batch_advantages, dtype=torch.float).to(self.device)
        return batch_obs, batch_acts, batch_log_probs, batch_rtgs, batch_advantages, batch_lens

I am using the Generalised Advantage Estimate in my case, but even when using the simpler advantage function, R - V(s), my implementation still gets stuck and always chooses the same action at evaluation time. This is how I evaluate the policy deterministically (I am not sampling from a categorical distribution):

_, action_probs = policy(torch.from_numpy(obs).type(torch.float).to(device))
action = torch.argmax(action_probs).item()

Can you provide any pointers as to where the problem might be? I have used stable-baselines3 with the same environment implementation, but because I want more control over the model, I opted for a custom implementation. I can't seem to figure out where the problem is.
