Comments (7)

hartikainen commented on June 10, 2024

Regarding question 1: The automatic temperature tuning presented in [1] requires us to choose some entropy lower bound. The value we choose based on the action dimension is actually not the initial value for the temperature, but rather the target value for our entropy. There's no task-independent target entropy value that is always guaranteed to work, but empirically it seems that using the negative number of action dimensions works well, which is why we default to such a heuristic. This is basically saying that the higher-dimensional our action space is, the higher the entropy value should be, which hopefully makes sense intuitively. Let me know if it doesn't.
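
In code, the heuristic boils down to something like the following (a minimal sketch of the idea; the actual heuristic_target_entropy helper in softlearning may differ in its exact signature):

    import numpy as np

    def heuristic_target_entropy(action_space_shape):
        # Default target entropy: the negative number of action dimensions,
        # e.g. -6.0 for a 6-dimensional continuous action space.
        return -float(np.prod(action_space_shape))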

For many tasks, there's quite a wide range of entropy values that work, so in your case, using a number like 0.1 could work well. The heuristic is there just as a default value so that you don't necessarily have to know anything about the task at hand in order to get it running in the first place.

Regarding question 2: The way the temperature is learned considers the expected entropy, i.e. the constraint is taken in expectation over the states. Thus it's not unexpected to see arbitrary entropy values for a single state. The reason we want this to happen is that there might be states where we want the policy to be of very low entropy whereas some other states might permit much higher entropy, and so the objective only considers the expectation. Hopefully that makes sense!
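
To make that concrete, here is a rough PyTorch-style sketch of how the temperature update is usually implemented (variable names are illustrative, not the exact softlearning code): the loss is a mean over a batch of states, so the entropy constraint only holds in expectation.

    import torch

    log_alpha = torch.zeros(1, requires_grad=True)
    alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

    def update_temperature(log_pi, target_entropy):
        # log_pi: log-probabilities of actions sampled from the current policy
        # for a batch of states. Because the loss is a batch mean, individual
        # states can end up with entropy well above or below target_entropy.
        alpha_loss = -(log_alpha * (log_pi + target_entropy).detach()).mean()
        alpha_optimizer.zero_grad()
        alpha_loss.backward()
        alpha_optimizer.step()
        return log_alpha.exp()  # the temperature alpha is always non-negative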

[1] Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P. and Levine, S., 2018. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.

dbsxdbsx commented on June 10, 2024

@hartikainen, thanks for your answer. And what about question 3? Frankly, I got some very high values, like 20 or above, when implementing SAC with discrete actions like here.

hartikainen commented on June 10, 2024

Ah, apologies, I forgot to address question 3. Reasonable values for the entropy can be anything in (-inf, action_dim * log(2)] and, as the temperature (alpha) depends on the reward scale, it can basically take any positive value.

I don't have much experience with discrete SAC, but in that case the entropy is always non-negative, while the temperature could still be anything depending on the reward.
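
For intuition on those bounds (assuming the continuous policy is squashed into [-1, 1] per dimension, as in the usual SAC setup): the uniform distribution on [-1, 1] maximizes differential entropy on that support and has entropy log(2), which is where the action_dim * log(2) upper bound comes from, while a discrete distribution over K actions has entropy in [0, log(K)]. A quick numerical check:

    import numpy as np

    # Differential entropy of the uniform distribution on [-1, 1]: log(2) per dimension.
    print(np.log(2.0))  # ~0.693

    # Entropy of a uniform discrete distribution over K actions: log(K), its maximum.
    K = 4
    p = np.full(K, 1.0 / K)
    print(-(p * np.log(p)).sum(), np.log(K))  # both ~1.386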

lidongke commented on June 10, 2024

First, you said "This is basically saying that the higher-dimensional our action space is, the higher the entropy value should be", but in your heuristic_target_entropy the target_entropy gets smaller and smaller as action_dim grows (target_entropy = -1 for an action dimension of 1, target_entropy = -2 for an action dimension of 2)?
Second, you said "Reasonable values for the entropy can be anything in (-inf, action_dim * log(2)]". Why is the upper bound action_dim * log(2)? I think the differential entropy's upper bound for a continuous space is inf?

@hartikainen @dbsxdbsx

dbsxdbsx commented on June 10, 2024

@lidongke, I am not quite familiar with the math part of SAC. But practically, I can say:
First, regarding "the target_entropy gets smaller and smaller": target_entropy is not changeable; it works as a threshold. I set this value like this:

    self.target_entropy = 0.2  # -np.log(1.0 / self.act_dim) * 0.98

You can also use the one I commented out; just don't let this value be negative, or it would be meaningless.
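
As a side-by-side sketch (the values are illustrative; the second formula is just the commented-out line above):

    import numpy as np

    # Continuous SAC (softlearning's default heuristic): negative number of action dimensions.
    act_dim = 6
    continuous_target_entropy = -float(act_dim)  # -6.0

    # Discrete SAC variant (the commented-out formula above): a fraction of the
    # maximum possible entropy log(num_actions), which is a positive number.
    num_actions = 4
    discrete_target_entropy = -np.log(1.0 / num_actions) * 0.98  # ~1.36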

In addition, what is changeable is the parameter alpha. According to what @hartikainen said, when using reinforcement learning with function approximation, it is possible for alpha to be LOWER than target_entropy.

Second, I agree with "I think the differential entropy's upper bound for a continuous space is inf". And practically, I found that alpha could grow as large as, say, 1000 with some specific initializations of target_entropy in the discrete-action version! For more detail, see here. I couldn't figure it out; it seems to be a math issue. Is it also possible in the continuous-action version? I have no idea.
Anyway, what I did to work around it is to clip it, like:

    # Clamp the temperature into a hand-picked range as a workaround
    # (note this reuses target_entropy as a lower bound for alpha).
    self.alpha = torch.clamp(self.log_alpha.exp(),
                             min=self.target_entropy,
                             max=1)

So that alpha finally falls into a reasonable range.

Just another thing to mention: I think you should realize that target_entropy is always used as a threshold (whether the actions are discrete or continuous), with which we want to prevent the parameter alpha from being lower than it (though sometimes that does happen). And when there is no restriction, alpha would be in the range [0, inf).

lidongke commented on June 10, 2024

@dbsxdbsx "the target_entropy will getting smaller and smaller " i mean that different task has different target value, i know the target_entropy is not changeable ,but @hartikainen said "This is basically saying that the higher-dimensional our action space is the higher the entropy value should be".I can't understand this, i found that higher-dimensional aciton space will get smaller target entropy value(target_entropy = -1 for action dimensional size = 1,target_entropy = -2 for action dimensional size = 2)

dbsxdbsx commented on June 10, 2024

@hartikainen, you said:

Ah, apologies, I forgot to address question 3. Reasonable values for the entropy can be anything in (-inf, action_dim * log(2)] and, as the temperature (alpha) depends on the reward scale, it can basically take any positive value.

I don't have much experience with discrete SAC, but in that case the entropy is always non-negative, while the temperature could still be anything depending on the reward.

What does a negative entropy value mean in the context of continuous actions?
And why is the upper bound action_dim * log(2)?
And why is the lower bound 0 in the discrete case but -inf in the continuous case?

Sorry for these questions, maybe I still don't get the true meaning of entropy here.
