Hello, and thank you for the implementation — it really helps. I have a sparse-reward environment that only yields a reward on task completion. Following your code, I implemented a PPO algorithm that uses a simple actor-critic network. I am attaching my code for the network and the PPO class here.
class ActorCritic(nn.Module):
    """Shared-trunk actor-critic network.

    Two tanh hidden layers feed a scalar state-value head (critic) and a
    softmax policy head (actor).  The dimensions default to the original
    hard-coded values but are now configurable, so the same class works
    for other observation/action sizes.

    Args:
        obs_dim: number of observation features (default 33).
        hidden_dim: width of both hidden layers (default 128).
        n_actions: number of discrete actions (default 3).
    """

    def __init__(self, obs_dim=33, hidden_dim=128, n_actions=3):
        super(ActorCritic, self).__init__()
        self.fc1 = nn.Linear(obs_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.critic = nn.Linear(hidden_dim, 1)          # V(s)
        self.actor = nn.Linear(hidden_dim, n_actions)   # pi(a|s) logits
        # init_weights is defined elsewhere in this file.
        self.apply(init_weights)

    def forward(self, x):
        """Return ``(state_value, action_probabilities)`` for observation ``x``."""
        x = torch.tanh(self.fc1(x))
        x = torch.tanh(self.fc2(x))
        return self.critic(x), F.softmax(self.actor(x), dim=-1)
class PPO():
def __init__(
self,
env,
policy,
lr,
gamma,
betas,
gae_lambda,
eps_clip,
entropy_coef,
value_coef,
max_grad_norm,
timesteps_per_batch,
n_updates_per_itr,
summary_writer,
norm_obs = True):
self.policy = policy
self.env = env
self.lr = lr
self.gamma = gamma
self.betas = betas
self.gae_lambda = gae_lambda
self.eps_clip = eps_clip
self.entropy_coef = entropy_coef
self.value_coef = value_coef
self.max_grad_norm = max_grad_norm
self.timesteps_per_batch = timesteps_per_batch
self.n_updates_per_itr = n_updates_per_itr
self.summary_writer = summary_writer
self.norm_obs = norm_obs
self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
self.total_updates = 0
self.optimizer = torch.optim.Adam(self.policy.parameters(), lr=self.lr, betas=self.betas)
def learn(self, total_timesteps=1000000, callback=None):
timesteps = 0
while timesteps < total_timesteps:
batch_obs, batch_actions, batch_log_probs, batch_rtgs, batch_advantages, batch_lens = self.rollout()
timesteps += np.sum(batch_lens)
advantage_k = (batch_advantages - batch_advantages.mean()) / (batch_advantages.std() + 1e-10)
for i in range(self.n_updates_per_itr):
state_values, action_probs = self.policy(batch_obs)
state_values = state_values.squeeze()
dist = Categorical(action_probs)
curr_log_probs = dist.log_prob(batch_actions)
ratios = torch.exp(curr_log_probs - batch_log_probs)
surr1 = ratios * advantage_k
surr2 = torch.clamp(ratios, 1 - self.eps_clip, 1 + self.eps_clip) * advantage_k
policy_loss = (-torch.min(surr1, surr2)).mean()
value_loss = F.mse_loss(state_values, batch_rtgs)
total_loss = policy_loss + self.value_coef * value_loss
self.optimizer.zero_grad()
total_loss.backward()
self.optimizer.step()
self.total_updates += 1
self.summary_writer.add_scalar("policy_loss", policy_loss, self.total_updates)
self.summary_writer.add_scalar("value_loss", value_loss, self.total_updates)
self.summary_writer.add_scalar("total_loss", total_loss, self.total_updates)
if callback:
callback.eval_policy(self.policy, self.summary_writer, self.norm_obs)
def rollout(self):
batch_obs = []
batch_acts = []
batch_state_values = []
batch_log_probs = []
batch_rewards = []
batch_rtgs = []
batch_lens = []
batch_advantages = []
batch_terminals = []
timesteps_collected = 0
while timesteps_collected < self.timesteps_per_batch:
eps_rewards = []
eps_state_values = []
eps_terminals = []
obs = self.env.reset()
done = False
eps_timesteps = 0
for _ in range(50):
timesteps_collected += 1
if self.norm_obs:
obs = (obs - obs.mean()) / (obs.std() - 1e-10)
batch_obs.append(obs)
state_value, action_probs = self.policy(torch.from_numpy(obs).type(torch.float).to(self.device))
dist = Categorical(action_probs)
action = dist.sample()
act_log_prob = dist.log_prob(action)
obs, reward, done, _ = self.env.step(action.cpu().detach().item())
eps_rewards.append(reward)
eps_state_values.append(state_value.squeeze().cpu().detach().item())
eps_terminals.append(0 if done else 1)
batch_acts.append(action.cpu().detach().item())
batch_log_probs.append(act_log_prob.cpu().detach().item())
eps_timesteps += 1
if done:
break
batch_lens.append(eps_timesteps)
batch_rewards.append(eps_rewards)
batch_state_values.append(eps_state_values)
batch_terminals.append(eps_terminals)
batch_obs = torch.tensor(batch_obs, dtype=torch.float).to(self.device)
batch_acts = torch.tensor(batch_acts, dtype=torch.float).to(self.device)
batch_log_probs = torch.tensor(batch_log_probs, dtype=torch.float).flatten().to(self.device)
for eps_rewards, eps_state_values, eps_terminals in zip(reversed(batch_rewards), reversed(batch_state_values), reversed(batch_terminals)):
discounted_reward = 0
gae = 0
next_state_value = 0
next_terminal = 0
for reward, state_value, terminal in zip(reversed(eps_rewards), reversed(eps_state_values), reversed(eps_terminals)):
discounted_reward = reward + self.gamma * discounted_reward
delta = reward + self.gamma * next_state_value * next_terminal - state_value
gae = delta + self.gamma * self.gae_lambda * next_terminal * gae
batch_rtgs.insert(0, discounted_reward)
batch_advantages.insert(0, gae)
next_state_value = state_value
next_terminal = terminal
batch_rtgs = torch.tensor(batch_rtgs, dtype=torch.float).to(self.device)
batch_advantages = torch.tensor(batch_advantages, dtype=torch.float).to(self.device)
return batch_obs, batch_acts, batch_log_probs, batch_rtgs, batch_advantages, batch_lens
I am using the Generalised Advantage Estimate (GAE) in my case, but even when using the simpler advantage function, $A(s) = R - V(s)$, my implementation still gets stuck and always chooses the same action when I am evaluating. This is how I evaluate the policy deterministically — I am not sampling from a categorical distribution:
# Greedy (deterministic) evaluation: ignore the critic output and take the
# arg-max of the policy head instead of sampling from the distribution.
# Assumes `policy`, `obs` (numpy array) and `device` are defined by the
# surrounding evaluation code.
_, action_probs = policy(torch.from_numpy(obs).type(torch.float).to(device))
action = torch.argmax(action_probs).item()
Can you provide any pointers as to where the problem might be? I have used Stable-Baselines3 with the same environment implementation; however, because I want more control over the model, I opted for a custom implementation. I can't seem to figure out where the problem is, though.