Hi Eric, I am a beginner with PPO and I tried your code with the _lo

The average Episodic Return and Average Loss is nan about ppo-for-beginners HOT 4 CLOSED

ericyangyu commented on May 27, 2024

The average Episodic Return and Average Loss is nan

from ppo-for-beginners.

Comments (4)

ericyangyu commented on May 27, 2024

I resolved the nans :) The issue was actually in rollout and compute_rtgs: some pieces of code were indented too far, so both functions returned early, before their loops had a chance to finish.

Specifically in rollout, I shifted:

# Collect episodic length and Rewards
batch_lens.append(ep_t + 1)
batch_rews.append(ep_rews)

one indent left, and

# Reshape data as tensors
batch_obs = torch.tensor(batch_obs, dtype=torch.float)
batch_acts = torch.tensor(batch_acts, dtype=torch.float)
batch_log_probs = torch.tensor(batch_log_probs, dtype=torch.float)

# ALG STEP 4

batch_rtgs = self.compute_rtgs(batch_rews)

print(f'Batch rewards before logger======={batch_rews}')

# Log the episodic returns and episodic lengths in this batch.
self.logger['batch_rews'] = batch_rews
self.logger['batch_lens'] = batch_lens

# Return the batch data
return batch_obs, batch_acts, batch_log_probs, batch_rtgs, batch_lens

two indents left.
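
To make the indentation levels concrete, here is a minimal sketch of the loop structure described above. It is not the repo's rollout: ToyEnv, its three-step episodes, the random action, and the default arguments are all hypothetical stand-ins, and plain lists replace the torch tensors so the sketch runs without dependencies. The point is only where each block sits relative to the two loops.

```python
import random


class ToyEnv:
    """Hypothetical stand-in environment: every episode ends after 3 steps."""

    def reset(self):
        self.t = 0
        return [0.0]

    def step(self, action):
        self.t += 1
        done = self.t >= 3
        return [float(self.t)], -1.0, done


def rollout(env, timesteps_per_batch=10, max_timesteps_per_episode=5):
    batch_obs, batch_rews, batch_lens = [], [], []
    t = 0
    while t < timesteps_per_batch:  # outer loop: one pass per episode
        ep_rews = []
        obs = env.reset()
        for ep_t in range(max_timesteps_per_episode):  # inner loop: one pass per timestep
            t += 1
            batch_obs.append(obs)
            action = random.uniform(-1.0, 1.0)  # placeholder for the policy
            obs, rew, done = env.step(action)
            ep_rews.append(rew)
            if done:
                break

        # Runs once per episode: inside the while loop, after the for loop.
        batch_lens.append(ep_t + 1)
        batch_rews.append(ep_rews)

    # Runs once per batch: after the while loop has finished.
    return batch_obs, batch_rews, batch_lens


obs, rews, lens = rollout(ToyEnv())
print(len(obs), lens)
```

If the return sat inside either loop, the function would hand back a batch after a single timestep or a single episode, and the lengths of the collected lists would no longer line up.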

In compute_rtgs, I shifted:

# Convert the reward-to-go into a tensor
batch_rtgs = torch.tensor(batch_rtgs, dtype=torch.float)

return batch_rtgs

one indent left.
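
As a rough illustration of why the indent matters in compute_rtgs, the sketch below compares a version whose return sits inside the episode loop with one where it sits after the loop. Some assumptions: the exact position of the misplaced return in the submitted ppo.py is a guess, the function names and gamma=0.5 are made up for the example, and plain lists stand in for the torch tensor conversion.

```python
def compute_rtgs_early_return(batch_rews, gamma=0.5):
    """Assumed buggy layout: the return is one indent too deep."""
    batch_rtgs = []
    for ep_rews in reversed(batch_rews):
        discounted_reward = 0.0
        for rew in reversed(ep_rews):
            discounted_reward = rew + discounted_reward * gamma
            batch_rtgs.insert(0, discounted_reward)
        # Inside the episode loop: exits after the first episode processed.
        return batch_rtgs


def compute_rtgs_fixed(batch_rews, gamma=0.5):
    """Return shifted one indent left, so every episode is processed."""
    batch_rtgs = []
    for ep_rews in reversed(batch_rews):
        discounted_reward = 0.0
        for rew in reversed(ep_rews):
            discounted_reward = rew + discounted_reward * gamma
            batch_rtgs.insert(0, discounted_reward)
    return batch_rtgs


# Two episodes with 2 and 3 timesteps: 5 rewards-to-go are expected.
batch_rews = [[1.0, 1.0], [1.0, 1.0, 1.0]]
print(len(compute_rtgs_early_return(batch_rews)))  # 3
print(compute_rtgs_fixed(batch_rews))  # [1.5, 1.0, 1.75, 1.5, 1.0]
```

With the early return, only the last episode contributes, so the rewards-to-go no longer match the number of collected observations.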

Here's ppo.py with the fixes: ppo.py.zip. Let me know if there's anything else I can help with :) Otherwise, marking this issue as closed.

britig commented on May 27, 2024

Works :) Thanks a lot..

ericyangyu commented on May 27, 2024

Hi Briti!

I just tried your block of code with the repo code, and it works for me, though I had to change model = PPO(env=env, **hyperparameters) to model = PPO(policy_class=FeedForwardNN, env=env, **hyperparameters).

Output:

Model information ====== <ppo.PPO object at 0x7fa61287c358>
Learning... Running 200 timesteps per episode, 2048 timesteps per batch for a total of 1000000 timesteps

-------------------- Iteration #1 --------------------
Average Episodic Length: 200.0
Average Episodic Return: -1358.09
Average Loss: -0.00129
Timesteps So Far: 2200
------------------------------------------------------


-------------------- Iteration #2 --------------------
Average Episodic Length: 200.0
Average Episodic Return: -1172.53
Average Loss: -0.001
Timesteps So Far: 4400
------------------------------------------------------

^C

I suspect something in your ppo.py implementation is incorrect; would you mind sending me your ppo.py code?

britig commented on May 27, 2024

Hi Eric,

Thank you for your reply. Attaching my ppo.py (ppo.zip). I suspect there is something wrong with the get_action() method. Kindly help.
