Comments (7)
Regarding question 1: The automatic temperature tuning presented in [1] requires us to choose an entropy lower bound. The value we choose based on the action dimension is actually not the initial value for the temperature, but rather the target value for the entropy. There's no task-independent target entropy that is always guaranteed to work, but empirically using the negative number of action dimensions seems to work well, which is why we default to that heuristic. This is basically saying that the higher-dimensional our action space is, the higher the entropy value should be, which hopefully makes sense intuitively. Let me know if it doesn't.
For many tasks there's quite a wide range of entropy values that work, so in your case a number like 0.1 could work well. The heuristic is there just as a default so that you don't necessarily have to know anything about the task at hand in order to get things running in the first place.
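For concreteness, here is a minimal sketch of that default (softlearning's actual heuristic_target_entropy helper takes the environment's action space; the signature here is simplified for illustration):

    import numpy as np

    def heuristic_target_entropy(action_space_shape):
        # Default heuristic: target entropy = -(number of action dimensions),
        # e.g. -6.0 for a 6-dimensional continuous action space.
        return -float(np.prod(action_space_shape))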
Regarding question 2: The way the temperature is learned considers the expected entropy, i.e. the constraint is taken in expectation over the states. Thus it's not unexpected to see arbitrary entropy values for a single state. The reason we want this is that there might be states where we want the policy to have very low entropy, whereas other states might permit much higher entropy, so the objective only considers the expectation. Hopefully that makes sense!
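Since the constraint only holds in expectation, the temperature loss averages the log-probabilities over a batch rather than enforcing the target per state. A minimal PyTorch sketch of such an update (variable names are hypothetical, not softlearning's exact code):

    import torch

    log_alpha = torch.zeros(1, requires_grad=True)  # learn log(alpha) so alpha stays positive
    alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

    def update_temperature(log_pi_batch, target_entropy):
        # log_pi_batch: log pi(a|s) for a batch of actions sampled from the policy.
        # The entropy constraint E[-log pi] >= target_entropy is enforced only on average.
        alpha_loss = -(log_alpha.exp() * (log_pi_batch + target_entropy).detach()).mean()
        alpha_optimizer.zero_grad()
        alpha_loss.backward()
        alpha_optimizer.step()

When the batch's average entropy falls below the target, this loss pushes alpha up (more exploration); when it exceeds the target, alpha decays.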
[1] Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P. and Levine, S., 2018. Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.
@hartikainen, thanks for your answer. And what about question 3? Frankly, I got some very high values, like 20 or above, when implementing SAC with discrete actions like here.
Ah, apologies, I forgot to address question 3. Reasonable values for the entropy can be anything in (-inf, action_dim * log(2)] and, as the temperature (alpha) is dependent on the reward scale, it can basically take any positive value.
I don't have much experience with discrete SAC, but in that case the entropy is always non-negative, while the temperature could still be anything depending on the reward.
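A note on where that upper bound comes from (my addition, assuming the usual tanh-squashed policy of SAC): the squashed actions lie in [-1, 1]^d with d = action_dim, and on a set of volume 2^d the differential entropy is maximized by the uniform distribution,

    H_max = -\int_{[-1,1]^d} 2^{-d} \log 2^{-d} \, da = \log 2^d = d \log 2,

so the entropy can never exceed action_dim * log(2), while differential entropy (unlike discrete entropy) has no lower bound, hence the -inf.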
First, you said that "This is basically saying that the higher-dimensional our action space is, the higher the entropy value should be", but in your heuristic_target_entropy, the target_entropy gets smaller and smaller as action_dim grows (target_entropy = -1 for action dimension 1, target_entropy = -2 for action dimension 2)?
Second, you said that "Reasonable values for entropy can be anything between (-inf, action_dim * log(2)]". Why is the upper bound action_dim * log(2)? I thought the differential entropy's upper bound for a continuous space is inf?
@lidongke, I am not quite familiar with the math part of SAC. But practically, I can say:
First, regarding "the target_entropy will be getting smaller and smaller": target_entropy is not changeable; it works as a threshold. I set this value like this:

    self.target_entropy = 0.2  # alternative: -np.log(1.0 / self.act_dim) * 0.98

You can also use the one I commented out; just don't let this value be set to a negative number, or it would be meaningless.
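(For example, with act_dim = 4 the commented-out heuristic gives -log(1/4) * 0.98 = 0.98 * log(4) ≈ 1.36, which is positive, as a discrete-action entropy target should be.)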
In addition, what is changeable is the parameter alpha. And according to what @hartikainen said, when using reinforcement learning with function approximation it is possible for the policy's entropy to be LOWER than target_entropy.
Second, I agree that the differential entropy's upper bound for a continuous space is inf. And practically, I found that alpha would grow as large as, say, 1000 with some specific initializations of target_entropy in the discrete-action version! For more detail, see here. I couldn't figure it out; it seems to be a math issue. Is it also possible in the continuous-action version? I have no idea.
Anyway, what I did to work around it is to clip it, like:

    # clamp the learned temperature into [target_entropy, 1]
    self.alpha = torch.clamp(self.log_alpha.exp(),
                             min=self.target_entropy,
                             max=1)

so that alpha finally falls within a reasonable range.
Just another thing to mention: I think you should realize that target_entropy is always used as a threshold (no matter whether the actions are discrete or continuous), below which we want to prevent the policy's entropy from falling (though sometimes that does happen). And when there is no restriction, alpha can be anywhere in [0, inf).
@dbsxdbsx, by "the target_entropy will be getting smaller and smaller" I meant that different tasks have different target values; I know target_entropy is not changeable. But @hartikainen said "This is basically saying that the higher-dimensional our action space is, the higher the entropy value should be". I can't understand this: I found that a higher-dimensional action space gives a smaller target entropy value (target_entropy = -1 for action dimension 1, target_entropy = -2 for action dimension 2).
Quoting @hartikainen's earlier reply: "Reasonable values for the entropy can be anything in (-inf, action_dim * log(2)] and, as the temperature (alpha) is dependent on the reward scale, it can basically take any positive value. I don't have much experience with discrete SAC, but in that case the entropy is always non-negative, while the temperature could still be anything depending on the reward."
What does a negative entropy value mean in the context of continuous actions?
And why is the upper bound action_dim * log(2)?
And why should the lower bound be 0 for discrete actions, while for continuous actions it could be -inf?
Sorry for these questions; maybe I still don't get the true meaning of entropy here.