lweitkamp / option-critic-pytorch
PyTorch implementation of the Option-Critic framework, Harb et al. 2016
Does this code support the continuous action space environment?
You need to re-evaluate the features/state after the optimization step (`optim.step()`), because that step updates the feature layer and hence the features themselves.
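A minimal sketch of the point above, with hypothetical layer and variable names: features computed before `optim.step()` are produced by the old weights, so the same observation must be passed through the feature layer again after the step.

```python
import torch
import torch.nn as nn

# Hypothetical minimal setup: a shared feature layer whose weights change
# on optim.step(), so features computed before the step become stale.
torch.manual_seed(0)
features = nn.Linear(4, 8)
optim = torch.optim.SGD(features.parameters(), lr=0.1)

obs = torch.randn(1, 4)
old_state = features(obs)      # features under the pre-update weights

loss = old_state.pow(2).sum()
optim.zero_grad()
loss.backward()
optim.step()                   # updates the feature layer's parameters

new_state = features(obs)      # re-evaluate: same obs, new weights, new features
```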
The termination probability is calculated over the next state according to the original paper. So it should be using next_obs instead of obs.
option-critic-pytorch/option_critic.py
Line 238 in 0c57da7
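A small sketch of the distinction being reported, using assumed names (`termination`, `obs_features`, `next_obs_features`): in the paper, the termination probability β is evaluated on the next state s', not the current one.

```python
import torch
import torch.nn as nn

# Hypothetical termination head: per-option termination probabilities via sigmoid.
feature_dim, num_options = 64, 4
termination = nn.Sequential(nn.Linear(feature_dim, num_options), nn.Sigmoid())

obs_features = torch.randn(1, feature_dim)       # features of the current state
next_obs_features = torch.randn(1, feature_dim)  # features of the next state
option = 2

# As reported in the issue: evaluated on the current state's features
beta_current = termination(obs_features)[:, option]
# As the paper specifies: evaluated on the next state's features
beta_next = termination(next_obs_features)[:, option]
```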
I believe that when the batch size is reached, the action policy loss is computed from a single sample instead of a batch of size batch_size. This looks like a major difference from the original implementation, unless I am misunderstanding the code.
"python main.py --switch-goal True --env fourrooms"
Should "--num-options 4" be added?
For the fourrooms environment, the number of Option is 4.
Maybe it's something I didn't understand correctly, looking forward to and thank you for your answer.
Hi, I got an in-place operation error while running your code. It seems to be caused by a wrong detach in the loss calculation. I tried to find a similar issue but could not find anything. Could you have a look at it?
Traceback (most recent call last):
File "main.py", line 147, in <module>
run(args)
File "main.py", line 129, in run
loss.backward()
File "/home/mw/anaconda3/envs/HRL/lib/python3.6/site-packages/torch/_tensor.py", line 255, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/mw/anaconda3/envs/HRL/lib/python3.6/site-packages/torch/autograd/__init__.py", line 149, in backward
allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [32, 64]], which is output 0 of TBackward, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
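Following the hint in the error message, a minimal sketch of enabling anomaly detection to locate the offending in-place operation (the tensors here are illustrative, not from the repo): with it enabled, the forward pass records stack traces so the failing backward op can be traced to the line that produced it.

```python
import torch

# Enable anomaly detection so a failing backward op reports the forward-pass
# source line that created it (at the cost of slower execution).
torch.autograd.set_detect_anomaly(True)

x = torch.randn(3, requires_grad=True)
y = x * 2
loss = y.sum()
loss.backward()  # an in-place modification of y before this would now be reported
```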
Hello, when I downloaded the code and ran it on my computer, I hit an error in loss.backward():
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [32, 64]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead
I didn't modify the code anywhere.
my dependencies
pytorch 1.3.0
python 3.6.13
tensorboard 2.0.2
gym 0.15.3
option-critic-pytorch/option_critic.py
Line 241 in 0c57da7
Thanks for providing the PyTorch version of Option-Critic. I want to ask why we don't clear the replay buffer after each episode for the on-policy policy-gradient update. Both Algorithm 1 in the paper and the derivation of the intra-option policy gradient theorem are done in the on-policy setting. If we do not clear the replay buffer, importance sampling should be implemented to account for the off-policy update, but I did not see any part of the code related to that. I tried to read the original Theano repo, and it seems they did the same thing. Do you have any comments on this?
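A hypothetical sketch of the on-policy variant the question suggests (the `OnPolicyBuffer` class and its methods are illustrative, not the repo's API): clearing the buffer at episode end ensures every gradient step uses only transitions collected under the current policy, so no importance-sampling correction is needed.

```python
import random

# Illustrative buffer: cleared after each episode to keep updates on-policy.
class OnPolicyBuffer:
    def __init__(self):
        self.transitions = []

    def push(self, obs, option, reward, next_obs, done):
        self.transitions.append((obs, option, reward, next_obs, done))

    def sample(self, batch_size):
        # Sample without replacement, capped at the number of stored transitions.
        return random.sample(self.transitions, min(batch_size, len(self.transitions)))

    def clear(self):
        self.transitions.clear()

buffer = OnPolicyBuffer()
for step in range(5):
    buffer.push(step, 0, 1.0, step + 1, step == 4)
batch = buffer.sample(4)
buffer.clear()  # discard stale transitions once the episode ends
```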