Comments (18)
PPO does run with drop_last=False. The policy loss values seem normal, and the action values don't immediately saturate.
from pytorch-a2c-ppo-acktr-gail.
Ok. Then I will add it, it might help.
Yes, this is a good idea. I will try to add it this week.
One solution is just to process actions in such a way that they are in [-1, 1], for example to add an action wrapper that uses tanh to remap them. But it's a little bit surprising that the output is so large (because the network is initialized to produce outputs roughly in the range of [-1, 1]).
Do you normalize the inputs?
What do you mean by normalize the inputs?
Regarding using tanh, do you think it would be a good idea to add code to always pass the outputs through tanh, and then remap that to the range of the action space?
So that they have zero mean and unit variance (you would then also need to remove the / 255 from the model).
Yes, but just as an action wrapper for gym.
In any case, I'm going to add an optional normalization for image inputs probably today.
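To make "zero mean and unit variance" concrete, here is a minimal sketch of standardizing [0, 255] image observations. The constants 128 and 64 are purely illustrative; in practice you would use statistics computed from the data (e.g. a running mean/std):

```python
import numpy as np

def normalize_obs(obs, mean=128.0, std=64.0):
    # Shift [0, 255] pixel values toward zero mean, unit variance.
    # mean/std are illustrative constants, not values from the repo;
    # replace them with running statistics in practice.
    return (np.asarray(obs, dtype=np.float32) - mean) / std

print(normalize_obs(np.array([128.0, 192.0])))  # [0.0, 1.0]
```

If you normalize this way, the / 255 inside the model must be dropped, otherwise the inputs get scaled twice.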
My images are in [0, 255], and not normalized.
The CNNPolicy takes the gym action_space as an input, so it should probably have the action scaling code.
Thanks for being so responsive. Your code is the best PyTorch implementation of A2C/ACKTR I found. :)
Thanks! I'm glad that this code can be useful.
At the moment, you can use this wrapper (I want to keep my code as consistent as possible with OpenAI implementations, but I'm going to add it to a new branch at some point).
import gym
import numpy as np

class ScaleActions(gym.ActionWrapper):
    def __init__(self, env=None):
        super(ScaleActions, self).__init__(env)

    def _step(self, action):
        # Squash raw actions with tanh into (-1, 1), then rescale to [low, high].
        action = (np.tanh(action) + 1) / 2 * (self.action_space.high - self.action_space.low) + self.action_space.low
        return self.env.step(action)
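As a quick sanity check of the remap formula, here it is standalone with illustrative bounds low = -2.0, high = 2.0 (not taken from any particular environment):

```python
import numpy as np

low, high = -2.0, 2.0  # assumed action bounds, for illustration only

def remap(action):
    # Same formula as the wrapper: tanh squashes to (-1, 1),
    # then the result is rescaled linearly to [low, high].
    return (np.tanh(action) + 1) / 2 * (high - low) + low

print(remap(0.0))    # 0.0 -> midpoint of the range
print(remap(50.0))   # ~2.0 -> saturates near high
print(remap(-50.0))  # ~-2.0 -> saturates near low
```

The endpoints show the saturation issue discussed below: once the raw network outputs grow large, tanh pins the environment action at the bounds.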
At the moment, you can use this wrapper
Thanks. I will give it a try. I do think you should scale the actions in the main branch though. If the actions aren't within the action space bounds, it's obviously a bug, even if OpenAI is doing it. TBH, I don't like their code that much. I find it's only geared to work with a few environments they tested, everything else requires modifications to the environment and to their code.
One small update: the huge action values are only a problem with your A2C implementation. This problem does not happen with ACKTR. The huge action values correspond with huge policy loss values (in the 10^9+ range).
By the way, did you try to use PPO?
After playing with all algorithms I found PPO to be the most robust and reliable one.
I didn't try PPO because of this requirement:
assert args.num_processes * args.num_steps % args.batch_size == 0
I can only run one process at the moment, my gym environment connects to a ROS/gazebo robot simulator.
Oh. Actually, this requirement isn't really necessary. You can just remove this line and also change
sampler = BatchSampler(SubsetRandomSampler(range(args.num_processes * args.num_steps)), args.batch_size * args.num_processes, drop_last=True)
to
sampler = BatchSampler(SubsetRandomSampler(range(args.num_processes * args.num_steps)), args.batch_size * args.num_processes, drop_last=False)
Hi again. There's an issue with wrapping the action outputs in a tanh, I think. The actions quickly saturate at the bounds of [-1, 1]. At that point the algorithm seems to get stuck there, unable to learn anymore. It would probably work better with a function that saturates less quickly?
Will try PPO now (without the wrapper)...
I think it probably saturates so quickly because advantages are not normalized (so it makes a huge step). In PPO advantages are normalized so it should work better.
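Advantage normalization is a one-liner; a minimal sketch (plain NumPy, not the repo's exact code) of the standardization step applied before the policy update:

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-5):
    # Standardize advantages to zero mean, unit variance before the policy
    # update. This bounds the effective step size regardless of the raw
    # return scale, so a 10^9-scale advantage no longer produces a huge step.
    advantages = np.asarray(advantages, dtype=np.float64)
    return (advantages - advantages.mean()) / (advantages.std() + eps)

print(normalize_advantages([1e9, 2e9, 3e9]))  # same result as for [1, 2, 3]
```

This is why the huge policy loss values seen with A2C (and the resulting immediate tanh saturation) should not occur with PPO.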
Having an issue with PPO:
Traceback (most recent call last):
  File "main.py", line 250, in <module>
    main()
  File "main.py", line 243, in main
    final_rewards.max(), -dist_entropy.data[0],
UnboundLocalError: local variable 'dist_entropy' referenced before assignment
It means that an iteration of PPO didn't happen.
That's what I thought. Due to there just being one process?
Probably, because the batch size was larger than the number of samples.
Try to keep drop_last = False.
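A small plain-Python illustration of why this fails (mimicking the chunking behavior of torch.utils.data.BatchSampler, so it runs without PyTorch): with one process, num_processes * num_steps can be smaller than the batch size, and drop_last=True then discards the only (undersized) batch, so no PPO update runs and dist_entropy is never assigned.

```python
def batches(indices, batch_size, drop_last):
    # Mimics torch.utils.data.BatchSampler's chunking behavior.
    out = [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]
    if drop_last and out and len(out[-1]) < batch_size:
        out.pop()  # discard the final, smaller-than-batch_size batch
    return out

# e.g. one process, 5 steps -> 5 samples, but batch_size = 8:
samples = list(range(5))
print(batches(samples, 8, drop_last=True))   # [] -> zero minibatches, no update
print(batches(samples, 8, drop_last=False))  # [[0, 1, 2, 3, 4]] -> one small batch
```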