Comments (7)
Regarding question 1: The automatic temperature tuning presented in [1] requires us to choose an entropy lower bound. The value we choose based on the action dimension is actually not the initial value for the temperature, but rather the target value for the entropy. There's no task-independent target entropy that is always guaranteed to work, but empirically using the negative number of action dimensions seems to work well, which is why we default to that heuristic. This is basically saying that the higher-dimensional our action space is, the higher the entropy value should be, which hopefully makes sense intuitively. Let me know if it doesn't.
For many tasks there's quite a wide range of entropy values that work, so in your case a number like 0.1 could work well. The heuristic is there just as a default so that you don't necessarily have to know anything about the task at hand in order to get things running in the first place.
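For concreteness, here is a minimal sketch of that default (softlearning's actual heuristic_target_entropy helper takes the environment's action space; the signature here is simplified for illustration):

    import numpy as np

    def heuristic_target_entropy(action_space_shape):
        # Default heuristic: target entropy = -(number of action dimensions),
        # e.g. -6.0 for a 6-dimensional continuous action space.
        return -float(np.prod(action_space_shape))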
Regarding question 2: The way the temperature is learned considers the expected entropy, i.e. the constraint is taken in expectation over the states. Thus it's not unexpected to see arbitrary entropy values for a single state. The reason we want this is that there might be states where we want the policy to have very low entropy, whereas other states might permit much higher entropy, so the objective only considers the expectation. Hopefully that makes sense!
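Since the constraint only holds in expectation, the temperature loss averages the log-probabilities over a batch rather than enforcing the target per state. A minimal PyTorch sketch of such an update (variable names are hypothetical, not softlearning's exact code):

    import torch

    log_alpha = torch.zeros(1, requires_grad=True)  # learn log(alpha) so alpha stays positive
    alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

    def update_temperature(log_pi_batch, target_entropy):
        # log_pi_batch: log pi(a|s) for a batch of actions sampled from the policy.
        # The entropy constraint E[-log pi] >= target_entropy is enforced only on average.
        alpha_loss = -(log_alpha.exp() * (log_pi_batch + target_entropy).detach()).mean()
        alpha_optimizer.zero_grad()
        alpha_loss.backward()
        alpha_optimizer.step()

When the batch's average entropy falls below the target, this loss pushes alpha up (more exploration); when it exceeds the target, alpha decays.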
[1] Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P. and Levine, S., 2018. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.
@hartikainen, thanks for your answer. And what about question 3? Frankly, I got some very high values, like 20 or above, when implementing SAC with discrete actions like here.
Ah, apologies, I forgot to address question 3. Reasonable values for the entropy can be anything in (-inf, action_dim * log(2)] and, as the temperature (alpha) is dependent on the reward scale, it can basically take any positive value.
I don't have much experience with discrete SAC, but in that case the entropy is always non-negative, while the temperature could still be anything depending on the reward.
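A note on where that upper bound comes from (my addition, assuming the usual tanh-squashed policy of SAC): the squashed actions lie in [-1, 1]^d with d = action_dim, and on a set of volume 2^d the differential entropy is maximized by the uniform distribution,

    H_max = -\int_{[-1,1]^d} 2^{-d} \log 2^{-d} \, da = \log 2^d = d \log 2,

so the entropy can never exceed action_dim * log(2), while differential entropy (unlike discrete entropy) has no lower bound, hence the -inf.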
First, you said that "This is basically saying that the higher-dimensional our action space is, the higher the entropy value should be", but in your heuristic_target_entropy, the target_entropy gets smaller and smaller as action_dim grows (target_entropy = -1 for action dimension 1, target_entropy = -2 for action dimension 2)?
Second, you said that "Reasonable values for entropy can be anything between (-inf, action_dim * log(2)]". Why is the upper bound action_dim * log(2)? I thought the differential entropy's upper bound for a continuous space is inf?
@lidongke, I am not quite familiar with the math part of SAC. But practically, I can say:
First, regarding "the target_entropy will be getting smaller and smaller": target_entropy is not changeable; it works as a threshold. I set this value like this:

    self.target_entropy = 0.2  # alternative: -np.log(1.0 / self.act_dim) * 0.98

You can also use the one I commented out; just don't let this value be set to a negative number, or it would be meaningless.
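(For example, with act_dim = 4 the commented-out heuristic gives -log(1/4) * 0.98 = 0.98 * log(4) ≈ 1.36, which is positive, as a discrete-action entropy target should be.)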
In addition, what is changeable is the parameter alpha. And according to what @hartikainen said, when using reinforcement learning with function approximation it is possible for the policy's entropy to be LOWER than target_entropy.
Second, I agree that the differential entropy's upper bound for a continuous space is inf. And practically, I found that alpha would grow as large as, say, 1000 with some specific initializations of target_entropy in the discrete-action version! For more detail, see here. I couldn't figure it out; it seems to be a math issue. Is it also possible in the continuous-action version? I have no idea.
Anyway, what I did to work around it is to clip it, like:

    # clamp the learned temperature into [target_entropy, 1]
    self.alpha = torch.clamp(self.log_alpha.exp(),
                             min=self.target_entropy,
                             max=1)

so that alpha finally falls within a reasonable range.
Just another thing to mention: I think you should realize that target_entropy is always used as a threshold (no matter whether the actions are discrete or continuous), below which we want to prevent the policy's entropy from falling (though sometimes that does happen). And when there is no restriction, alpha can be anywhere in [0, inf).
@dbsxdbsx, by "the target_entropy will be getting smaller and smaller" I meant that different tasks have different target values; I know target_entropy is not changeable. But @hartikainen said "This is basically saying that the higher-dimensional our action space is, the higher the entropy value should be". I can't understand this: I found that a higher-dimensional action space gives a smaller target entropy value (target_entropy = -1 for action dimension 1, target_entropy = -2 for action dimension 2).
Quoting @hartikainen's earlier reply: "Reasonable values for the entropy can be anything in (-inf, action_dim * log(2)] and, as the temperature (alpha) is dependent on the reward scale, it can basically take any positive value. I don't have much experience with discrete SAC, but in that case the entropy is always non-negative, while the temperature could still be anything depending on the reward."
What does a negative entropy value mean in the context of continuous actions?
And why is the upper bound action_dim * log(2)?
And why should the lower bound be 0 for discrete actions, while for continuous actions it could be -inf?
Sorry for these questions; maybe I still don't get the true meaning of entropy here.