
Comments (9)

pengzhenghao commented on August 18, 2024

I think in this implementation they use softmax as the output activation when sampling actions.
Looking at the code below, you can see they originally tried an argmax-style activation via return CategoricalPdType(ac_space.n) (now commented out) when sampling, but in the end they use the softmax version when training the Q network.

def make_pdtype(ac_space):
    from gym import spaces
    if isinstance(ac_space, spaces.Box):
        assert len(ac_space.shape) == 1
        return DiagGaussianPdType(ac_space.shape[0])
    elif isinstance(ac_space, spaces.Discrete):
        # return CategoricalPdType(ac_space.n)
        return SoftCategoricalPdType(ac_space.n)
    elif isinstance(ac_space, spaces.MultiDiscrete):
        #return MultiCategoricalPdType(ac_space.low, ac_space.high)
        return SoftMultiCategoricalPdType(ac_space.low, ac_space.high)
    elif isinstance(ac_space, spaces.MultiBinary):
        return BernoulliPdType(ac_space.n)
    else:
        raise NotImplementedError


djbitbyte commented on August 18, 2024

Hello, @pengzhenghao!

I've looked into the functions involved again. I think they use SoftCategoricalPdType(ac_space.n), then SoftCategoricalPdType.sample() to add noise to the logits, and finally output softmax(logits - noise) from the actor network.

And the noise added to the action comes from:

def sample(self):
    u = tf.random_uniform(tf.shape(self.logits))
    return U.softmax(self.logits - tf.log(-tf.log(u)), axis=-1)

I don't quite get why they handle the noise in this way.
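For reference, here is a minimal NumPy sketch (an editorial illustration, not code from the repository) of what that noise term is doing. With u ~ Uniform(0, 1), the quantity -log(-log(u)) is a standard Gumbel(0, 1) sample, so logits - log(-log(u)) is just logits plus Gumbel noise; taking an argmax of that gives an exact categorical sample (the Gumbel-max trick), and replacing the argmax with a softmax gives the differentiable relaxation used here.

import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)                      # the categorical distribution we want to sample from

# Gumbel(0, 1) noise: u ~ Uniform(0, 1), g = -log(-log(u)).
# This matches the "logits - tf.log(-tf.log(u))" term in sample() above.
u = rng.uniform(size=(100_000, 3))
g = -np.log(-np.log(u))

# Gumbel-max trick: argmax(logits + g) is an exact draw from Categorical(probs).
hard = np.argmax(logits + g, axis=-1)
print(np.bincount(hard) / len(hard))         # empirical frequencies, close to probs
print(probs)

# Gumbel-softmax relaxation: swap the argmax for a temperature-controlled softmax,
# giving a nearly one-hot but differentiable vector instead of a discrete index.
tau = 1.0
print(softmax((logits + g[0]) / tau))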


djbitbyte commented on August 18, 2024

The sample function in the distribution is an implementation of Gumbel-softmax. I added it to my code and it helps speed up and stabilize the training, but my speaker still cannot distinguish the different landmarks.

How do you handle the action exploration then?


LiuQiangOpenMind commented on August 18, 2024

Hello, @pengzhenghao!
I don't quite get why they handle the noise in the form of a log-log link function. Since the log-log link function is non-linear, the noise randomly generated each time can fluctuate; how do you control the degree of noise to ensure adequate action exploration?


pengzhenghao commented on August 18, 2024

> Hello, @pengzhenghao!
> I don't quite get why they handle the noise in the form of a log-log link function. Since the log-log link function is non-linear, the noise randomly generated each time can fluctuate; how do you control the degree of noise to ensure adequate action exploration?

The Gumbel-Softmax trick is an important re-parameterization trick that can help smooth back-propagation by making the sampling step differentiable. I suggest searching for the keyword "gumbel softmax" for more information. Sorry for not providing more detail, as I do not thoroughly understand the whole Gumbel-softmax process myself...
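To make the re-parameterization point concrete, here is a small PyTorch sketch (an editorial illustration, not code from this thread): the soft Gumbel-softmax sample keeps a gradient path back to the logits, while a hard argmax sample does not.

import torch
import torch.nn.functional as F

logits = torch.randn(4, requires_grad=True)

# Differentiable (soft) sample: logits plus Gumbel noise pushed through a softmax,
# so gradients flow from the sampled vector back into the logits.
soft = F.gumbel_softmax(logits, tau=1.0, hard=False)
soft.sum().backward()
print(logits.grad)                  # non-None: the policy can be trained through the sample

# A plain argmax over noisy logits is an exact categorical sample, but it is a
# discrete index with no gradient path, which is why a deterministic-policy-gradient
# method like MADDPG needs the relaxation instead.
g = torch.distributions.Gumbel(0.0, 1.0).sample(logits.shape)
print(torch.argmax(logits.detach() + g))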


Ah31 commented on August 18, 2024

Hello @djbitbyte!

You said that Gumbel-softmax helps to speed up and stabilize the training. I am trying to reproduce the results in PyTorch and am using torch.nn.functional._gumbel_softmax_sample when sampling the action for the current state, as follows:

[screenshot of the action-sampling code omitted]
I am also using torch.nn.functional.gumbel_softmax to compute target actions for the next states and to compute the current agent's action that is fed into actor_local.
Based on the original code and the algorithm, I cannot understand why training does not converge once I use gumbel_softmax.

Thanks in advance!
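For context, a rough sketch of where those two calls typically sit in a MADDPG-style update; the tiny actor networks below are placeholders for illustration only, not code from this thread or from the repository.

import torch
import torch.nn.functional as F

# Placeholder networks: observation (dim 8) -> action logits (5 discrete actions).
actor = torch.nn.Linear(8, 5)
target_actor = torch.nn.Linear(8, 5)

obs = torch.randn(32, 8)
next_obs = torch.randn(32, 8)

# Target action for the critic's TD target: no gradient is needed through it,
# so a hard (one-hot) sample under no_grad is enough.
with torch.no_grad():
    target_action = F.gumbel_softmax(target_actor(next_obs), tau=1.0, hard=True)

# Action used in the actor loss: it must stay differentiable with respect to the
# actor's parameters, so the (straight-through) Gumbel-softmax output is fed
# directly into the centralized critic when maximizing its Q-value.
current_action = F.gumbel_softmax(actor(obs), tau=1.0, hard=True)
print(target_action.shape, current_action.shape)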


Ah31 commented on August 18, 2024

Hello!
Just to mention that there were many other issues in my code, apart from the Gumbel-softmax, because of which the training was not converging.


kargarisaac commented on August 18, 2024

> Hello!
> Just to mention that there were many other issues in my code, apart from the Gumbel-softmax, because of which the training was not converging.

Hi, I'm trying to understand how to use gumbel_softmax in PyTorch to reproduce the results. I'm using PPO, but it cannot even fully learn the task with only one agent and one landmark. It reaches a reasonable level, but it is not nearly as good as MADDPG. I think the problem is the plain softmax and Categorical distribution I use, and I want to change it to Gumbel-softmax. I used:

policy_dist = distributions.Categorical(F.gumbel_softmax(policy_logits_out, tau=1, hard=False).to("cpu"))

But I didn't get good results. There is also a distributions.Gumbel in PyTorch. I think I am using them incorrectly.

Can you provide an example to use them in your own algorithm?

Thank you
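As an editorial aside (not a reply from the thread): torch.distributions.Gumbel can be used to build the same relaxed sample by hand, and the output of gumbel_softmax is normally used directly as the relaxed one-hot action rather than being wrapped in another Categorical, which would sample a second time. A rough sketch, assuming nothing beyond standard PyTorch:

import torch
import torch.nn.functional as F

logits = torch.randn(3, requires_grad=True)
tau = 1.0

# Building the relaxed sample manually with torch.distributions.Gumbel ...
g = torch.distributions.Gumbel(0.0, 1.0).sample(logits.shape)
manual = F.softmax((logits + g) / tau, dim=-1)

# ... is, up to the particular noise draw, what F.gumbel_softmax computes.
builtin = F.gumbel_softmax(logits, tau=tau, hard=False)

# Both are probability-like vectors that can serve directly as a relaxed
# one-hot action; re-sampling them through distributions.Categorical would
# discard the gradient path back to the logits.
print(manual, builtin)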


tanxiangtj commented on August 18, 2024

> The sample function in the distribution is an implementation of Gumbel-softmax. I added it to my code and it helps speed up and stabilize the training, but my speaker still cannot distinguish the different landmarks.
>
> How do you handle the action exploration then?

Can you provide the code of your implementation of Gumbel-softmax? I am running into the same problem when using MADDPG.
Many thanks.

