Comments (55)

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024 2

And actor is not called in learn().

No. The actor is called in learn(). For example,

dist = self(b).dist

self(b) calls self.forward() and uses the actor network. (Remember that for a PyTorch nn.Module, once forward() is defined, you can call the module directly as nn(input).)
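
As a standalone illustration (a minimal sketch, not tianshou code):

import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, obs):
        return self.fc(obs)

actor = Actor()
out = actor(torch.randn(1, 4))  # calling the module invokes Actor.forward()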

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024 2

@ChengshuLi Thanks! I'll take a look after my CSAPP midterm exam three days from now...

fengredrum avatar fengredrum commented on May 16, 2024 1

Actually, my experiments show that an RNN policy is nearly always inferior to an MLP policy in robotic control domains.
Besides, RNNs suffer from slow training speed :)

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024 1

Please kindly let me know when you have finished RNN-critic. I'm a bit bottlenecked on this. Thank you so much!

Sure! I'm currently fixing issue #38. I'll probably resolve your issue today or tomorrow.

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024 1

No. As you mentioned, zero-state initialization was often used in previous works, and we mimic that workflow here.

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

Oh yes, in fact it already supports the RNN policy, but I will write a small demonstration script after finishing the documentation. Thanks for pushing me.

miriaford avatar miriaford commented on May 16, 2024

Thanks! Do you mind quickly posting a snippet here and explaining how to use the RNN policy? I need to run something ASAP, thanks a lot!

miriaford avatar miriaford commented on May 16, 2024

It varies from case to case, but I agree that an RNN might not be the best option. I do need it as a baseline though, so I'd still appreciate it if someone could explain how to run an RNN policy with Tianshou. Thanks!

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

@miriaford I uploaded a script in test/discrete/test_drqn.py.
What I have changed in the codebase:

  1. modified the replay buffer to support frame_stack sampling;
  2. fixed the collector to clear the state when the corresponding episode finishes.

How to use:

  1. add an argument stack_num to the training collector's replay buffer (code snippet; see also the sketch after this list);
  2. change your actor network to the recurrent style (code snippet);
  3. use the current git master branch instead of the PyPI wheel: pip3 install git+https://github.com/thu-ml/tianshou.git@master.
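
For reference, a minimal sketch of steps 1 and 3 (the buffer size and stack_num values here are just placeholders):

from tianshou.data import Collector, ReplayBuffer

# 1. stacked sampling: each sampled transition carries the last 4 observations
buf = ReplayBuffer(20000, stack_num=4)

# 2. use a recurrent actor network (see test/discrete/test_drqn.py for the model used there)

# 3. plug the buffer into the training collector as usual
# train_collector = Collector(policy, train_envs, buf)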

That's all. If you find anything wrong, please give me feedback.

miriaford avatar miriaford commented on May 16, 2024

Cool, thanks a lot!

Does it also work out of the box for PPO and SAC? Do you have any benchmark scripts/results for them?

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

@miriaford I haven't tried RNNs with PPO and SAC, but you can be the first. Follow the changes between test_dqn.py and test_drqn.py and you can construct your first RNN-style PPO and SAC!

miriaford avatar miriaford commented on May 16, 2024

Hi, I might have missed something obvious. Could you please educate me on:

  1. In these lines (https://github.com/thu-ml/tianshou/blob/master/test/discrete/net.py#L90-L91), why do you add .detach()? What happens if you don't?
  2. In the PPO policy, why doesn't the critic support the RNN interface like the actor does?

Thanks for your help!

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024
  1. It's my fault: previously the step in the collector was not wrapped in torch.no_grad(). I'll remove the .detach() (see the sketch below).
  2. Good catch! When I implemented GAE, I did not consider RNNs. I'll upgrade the interface.
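
For illustration, a standalone PyTorch sketch of point 1 (not the collector code itself, just the generic no_grad vs. detach behaviour):

import torch

net = torch.nn.Linear(4, 2)
obs = torch.randn(1, 4)

act = net(obs)
print(act.requires_grad)        # True: a graph is recorded, hence the .detach()

with torch.no_grad():
    act = net(obs)
print(act.requires_grad)        # False: no graph, so .detach() becomes unnecessary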

miriaford avatar miriaford commented on May 16, 2024

This is great! Would love to check out the new interface.

Another quick question:
https://github.com/thu-ml/tianshou/blob/master/test/discrete/net.py#L70-L79
If I understand correctly, the [bsz, len, dim] input only occurs when learning the parameters; in all other cases [bsz, dim] is passed in, because evaluation runs the RNN one step at a time. Then what's the practical impact of a longer vs. shorter len during training?

Thanks again!

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

Yes, you are right. The longer the length, the smaller the gap in RNN performance between training and testing.

miriaford avatar miriaford commented on May 16, 2024

Please kindly let me know when you have finished RNN-critic. I'm a bit bottlenecked on this. Thank you so much!

miriaford avatar miriaford commented on May 16, 2024

Thanks for reopening the issue. I'd be happy to test it when you finish, thanks!

miriaford avatar miriaford commented on May 16, 2024

Any updates? Sorry for pinging again. Thanks!

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

I'm coding it.

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024
diff --git a/test/continuous/test_ppo.py b/test/continuous/test_ppo.py
index e94a917..c496de2 100644
--- a/test/continuous/test_ppo.py
+++ b/test/continuous/test_ppo.py
@@ -12,7 +12,7 @@ from tianshou.trainer import onpolicy_trainer
 from tianshou.data import Collector, ReplayBuffer
 
 if __name__ == '__main__':
-    from net import ActorProb, Critic
+    from net import RecurrentActorProb, RecurrentCritic
 else:  # pytest
     from test.continuous.net import ActorProb, Critic
 
@@ -71,11 +71,11 @@ def test_ppo(args=get_args()):
     train_envs.seed(args.seed)
     test_envs.seed(args.seed)
     # model
-    actor = ActorProb(
+    actor = RecurrentActorProb(
         args.layer_num, args.state_shape, args.action_shape,
         args.max_action, args.device
     ).to(args.device)
-    critic = Critic(
+    critic = RecurrentCritic(
         args.layer_num, args.state_shape, device=args.device
     ).to(args.device)
     optim = torch.optim.Adam(list(
@@ -95,7 +95,7 @@ def test_ppo(args=get_args()):
         gae_lambda=args.gae_lambda)
     # collector
     train_collector = Collector(
-        policy, train_envs, ReplayBuffer(args.buffer_size))
+        policy, train_envs, ReplayBuffer(args.buffer_size, stack_num=4))
     test_collector = Collector(policy, test_envs)
     # log
     log_path = os.path.join(args.logdir, args.task, 'ppo')

The code has been pushed.

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

I'm not sure whether or not to support stacked actions for the Q-network. This would require changing some lines of code in every policy that needs a Q-network (DDPG, TD3, SAC). Alternatively, you can create the Q-network as (stacked-s, single-a) -> Q-value, which is what this RecurrentCritic does:

class RecurrentCritic(nn.Module):
    def __init__(self, layer_num, state_shape, action_shape=0, device='cpu'):
        super().__init__()
        self.state_shape = state_shape
        self.action_shape = action_shape
        self.device = device
        self.nn = nn.LSTM(input_size=np.prod(state_shape), hidden_size=128,
                          num_layers=layer_num, batch_first=True)
        self.fc2 = nn.Linear(128 + np.prod(action_shape), 1)

    def forward(self, s, a=None):
        if not isinstance(s, torch.Tensor):
            s = torch.tensor(s, device=self.device, dtype=torch.float)
        # s is [bsz, len, dim] (training) or [bsz, dim] (evaluation);
        # in short, the tensor has one more dimension in the training phase.
        assert len(s.shape) == 3
        self.nn.flatten_parameters()
        s, (h, c) = self.nn(s)
        s = s[:, -1]  # take the output at the last timestep
        if a is not None:
            if not isinstance(a, torch.Tensor):
                a = torch.tensor(a, device=self.device, dtype=torch.float)
            s = torch.cat([s, a], dim=1)
        s = self.fc2(s)
        return s
But for PPO, the current version is enough.

miriaford avatar miriaford commented on May 16, 2024

Thanks!! Will take a look and let you know!

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

I'm not sure whether or not to support stacked actions for the Q-network. This would require changing some lines of code in every policy that needs a Q-network (DDPG, TD3, SAC). Alternatively, you can create the Q-network as (stacked-s, single-a) -> Q-value

Aha, I know how to support it (Q(stacked-s, stacked-a)) without modifying any part of the codebase. Try adding a gym wrapper which modifies the state into {'original_state': state, 'action': action}. That means that instead of the traditional (s, a, s', r, d) we now save ([s, a], a, [s', a'], r, d) in the replay buffer.
Then, on the network side, extract the [s, a] pair from batch.state.
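
A rough sketch of such a wrapper (a hypothetical helper written against the old gym reset/step API, not part of tianshou):

import gym
import numpy as np

class AppendActionWrapper(gym.Wrapper):
    """Return {'original_state': obs, 'action': last_action} as the observation."""

    def reset(self, **kwargs):
        obs = self.env.reset(**kwargs)
        self.last_action = np.zeros(self.env.action_space.shape or (1,))
        return {'original_state': obs, 'action': self.last_action}

    def step(self, action):
        obs, rew, done, info = self.env.step(action)
        self.last_action = np.asarray(action)
        return {'original_state': obs, 'action': self.last_action}, rew, done, info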

Arthur-Null avatar Arthur-Null commented on May 16, 2024

I'm sorry, but I am still confused about how the RNN works. Take the PPO policy as an example: it seems that the frame_stack option only takes effect in learn(), and the actor is not called in learn(). So the actor won't actually get a sequence as input?

ChengshuLi avatar ChengshuLi commented on May 16, 2024

@Trinkle23897 Thank you for adding the RNN support.

If I understand correctly, currently hidden states are only used during policy rollout.

result = self.policy(batch, self.state)

But they are not used during "learning".

q = self(batch).logits

a = self(batch, input='obs_next', eps=0).act

I am wondering whether we should also store the hidden states in the replay buffer and use them during learning as well (e.g. when computing the TD error).

Some examples might be helpful here:
https://github.com/astooke/rlpyt/blob/master/rlpyt/replays/sequence/n_step.py
https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/master/a2c_ppo_acktr/storage.py

Thank you in advance for your help!

Arthur-Null avatar Arthur-Null commented on May 16, 2024

Thank you for your response. I still have one question: it seems that a sequence of length n generates n subsequences when stacked (stack_num > n).
Does that mean the time complexity for an RNN model to get an output for each timestep in the sequence is O(n^2) rather than O(n)?

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

Nope. Suppose we have a trajectory whose length is 100, and stack_num is equal to 4.

In the evaluation (or sampling) phase, where the policy uses the previous hidden state to generate its action, at timestep t it computes: pi(o_t, h_t) => (a_t, h_{t+1}). So, in this case, the RNN model's runtime complexity in one sequence is O(n), because each step is O(1).

In the training phase, (in the current codebase) the policy is trained on bootstrap-sampled examples. For instance, the sampled trajectories may look like this (the batch size is 64):

0: [3, 4, 5, 6]
1: [67, 68, 69, 70]
2: [54, 55, 56, 57]
...
63: [5, 6, 7, 8]

Each of them has length 4. If you want to generate the latest action, you should run pi(o_0, h_0) => (a_0, h_1), pi(o_1, h_1) => (a_1, h_2), pi(o_2, h_2) => (a_2, h_3), pi(o_3, h_3) => (a_3, h_4), where h_0 is the saved hidden state (defaulting to zero).
Thus, computing an action with an RNN policy costs O(stack_num); it is not related to n.
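
A small standalone sketch of the two phases, assuming a single-layer LSTM core (generic PyTorch, not tianshou code):

import torch
import torch.nn as nn

rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# evaluation: one O(1) step per timestep, carrying (h, c) forward
o_t = torch.randn(1, 1, 8)                 # [bsz=1, len=1, dim]
h = torch.zeros(1, 1, 16)
c = torch.zeros(1, 1, 16)
out, (h, c) = rnn(o_t, (h, c))

# training: unroll over the stacked window, O(stack_num) per sampled item
stacked = torch.randn(64, 4, 8)            # [bsz=64, stack_num=4, dim]
out, _ = rnn(stacked)                      # starts from a zero hidden state
latest = out[:, -1]                        # output used to compute the action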

Arthur-Null avatar Arthur-Null commented on May 16, 2024

I was actually considering the case where n = stack_num, because I am dealing with an environment in which the episode length is fixed.
In this case, say we have a sequence of length 5: [1, 2, 3, 4, 5]
a generated batch could be:

0: [1, 1, 1, 1, 1]
1: [1, 1, 1, 1, 2]
2: [1, 1, 1, 2, 3]
3: [1, 1, 2, 3, 4]
4: [1, 2, 3, 4, 5]

If I understand correctly, 25 RNN steps are needed to generate all 5 actions for training in the current implementation. However,
there is a lot of duplicated computation here, since only the last sequence is needed to get all the actions.

Please correct me if I'm wrong, as I am considering including seqlen in the batch and using PyTorch's PackedSequence to rewrite this part.

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

Yes, the current code will produce the behavior you mention.
A simple solution is to add an argument to ReplayBuffer (changing the buffer.sample method) which lets it choose only the available indices. I'll add this feature today.

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

@Arthur-Null it should be okay now. For your example, if you turn on sample_avail in ReplayBuffer, it will only get the last sample: 4: [1, 2, 3, 4, 5].

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

@ChengshuLi when you are in policy.learn, the hidden state is stored in batch.policy._state, with shape [batch_size, stack_num, ...]. In other words, batch.policy._state[:, 0] is h_0.
I would like to modify this behavior further, e.g. self(batch) currently lacks the hidden state, but this should wait for the advanced slicing method in Batch (batch[:, i]).

Arthur-Null avatar Arthur-Null commented on May 16, 2024

I think there is a bug in the implementation of the frame_stack option.

keys_partial = reduce(set.symmetric_difference, keys_map)

Note that if you apply symmetric_difference sequentially over a list of sets, the elements that appear in an odd number of sets are returned. As a result, if stack_num is set to an odd number, keys_shared may end up with the same value as keys_partial, which causes an error. To my understanding, it should be changed to keys_partial = set.union(*keys_map) - keys_share.
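
A quick standalone demonstration of the parity behaviour described above:

from functools import reduce

keys_map = [{'obs', 'act'}, {'obs', 'act'}, {'obs', 'act'}]   # stack_num = 3 identical key sets
print(reduce(set.symmetric_difference, keys_map))  # {'obs', 'act'}: keys in an odd number of sets survive
print(set.union(*keys_map) - {'obs', 'act'})       # set(): the intended "partial" keys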

Also, it seems that if sample_avail is set, the batch that buffer.sample() returns does not contain all the rewards of an episode, which makes it impossible for the policy to learn from it. Any suggestions for fixing this problem?

duburcqa avatar duburcqa commented on May 16, 2024

I think there is a bug in the implementation of the frame_stack option.

This issue has been fixed on the dev branch. Yes, keys_partial = set.union(*keys_map) - keys_share is now used instead.

Also, it seems that if sample_avail is set, the batch that buffer.sample() returns does not contain all the rewards of an episode, which makes it impossible for the policy to learn from it. Any suggestions for fixing this problem?

I can't help you on this point. @Trinkle23897 ?

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

I think there is a bug in the implementation of the frame_stack option.

This issue has been fixed on the dev branch. Yes, keys_partial = set.union(*keys_map) - keys_share is now used instead.

Should we merge the essential commit from dev to master at this point?

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

Also, it seems that if sample_avail is set, the batch that buffer.sample() returns does not contain all the rewards of an episode, which makes it impossible for the policy to learn from it. Any suggestions for fixing this problem?

What's your desired feature? Could you please provide a simple example?

Arthur-Null avatar Arthur-Null commented on May 16, 2024

I am trying to reach O(n) complexity in policy.learn(). In the current implementation, when sample_avail is not set, O(n) can't be reached. When sample_avail is set, policy.learn() only gets part of the rewards, which makes it impossible to calculate returns.

Arthur-Null avatar Arthur-Null commented on May 16, 2024

It is possible to use the original ReplayBuffer without frame_stack and sample_avail and leave all the dirty work to the policy. I wonder if there is a better way?

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

Do you mean sampling only the available indices with an on-policy method for RNN usage? I haven't fully figured out what you want, so let's take an example:

(This is a buffer)
Index 0 1 2 3 4 5 6 7 8 9 a b c d e f
Done  0 0 1 0 0 0 1 0 0 0 1 0 1 / / /

If I understand correctly, you want to extract only [0-2], [3-6], [7-a], [b-c] without overlap. What I'm not sure about is the data format that you expect (and to me "O(n)" is ambiguous).
buffer.sample(0) can do something like this, since it extracts all of the data, but only once. I think that is not what you want.

Arthur-Null avatar Arthur-Null commented on May 16, 2024

Let me state my question more clearly and thoroughly.
To me, the frame_stack solution has two problems:

  1. As mentioned before, some redundant computation is involved. To calculate the action distribution for the sequence [0, 1, 2], we have to run the RNN on at least three sequences: [0, 0, 0], [0, 0, 1], [0, 1, 2]. (This is what I mean by O(n^2); running RNN([0, 1, 2]) only once is O(n).)
  2. Sequences like [0, 0, 0] introduce bias, since we always take the last output of the RNN as the final output.

By setting sample_avail to True, the first problem is solved to some extent. However, sequences like [b, c, c] may still exist, so the second problem remains. Also, in the current implementation, only the reward at the end of an episode is returned by buf.sample(), which makes training impossible.

The data format I am expecting is something like PyTorch's PackedSequence, which includes the sequences and their lengths. It seems a bit complicated to implement this in the replay buffer, so I am now implementing it in the process_fn of my policy.
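
For reference, a standalone sketch of the kind of format meant here, using PyTorch's pack_padded_sequence (generic PyTorch, not tianshou code):

import torch
from torch.nn.utils.rnn import pack_padded_sequence

rnn = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# two episodes of lengths 4 and 2, zero-padded to a common length
padded = torch.randn(2, 4, 8)
lengths = torch.tensor([4, 2])             # must be sorted in decreasing order here
packed = pack_padded_sequence(padded, lengths, batch_first=True)
out, (h, c) = rnn(packed)                  # each episode is processed once: O(n) overall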

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

So you want a new sample method: you specify the number of episodes you want, and it returns the sampled episodes, for example [[7-a], [3-6], [7-a], [b-c]] with episode_size=4, am I right?

Arthur-Null avatar Arthur-Null commented on May 16, 2024

Yes

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

Okay, I see. I think the simplest way is to implement it on the buffer side. Since we can easily maintain the done flag, we can do something like:

# begin_index stores the start index of each episode (if available), e.g., [0, 3, 7, b]
episode_index = np.random.choice(begin_index, episode_size)  # assume we get [b, 7, 3, 7]
# sort with key=-episode_length, since we want to mimic torch's PackedSequence;
# after that, episode_index is [7, 7, 3, b]
# we should also maintain a list of episode_length, in the above case: [3, 4, 4, 2]
data, seq_length = [], []
while not self._meta.done[episode_index][0]:  # since the first episode's length is the longest
    # find out the episodes that haven't terminated
    available_episode_index = episode_index[self._meta.done[episode_index] == 0]
    data.append(self[available_episode_index])  # should set stack_num=0
    seq_length.append(len(available_episode_index))
    episode_index += self._meta.done[episode_index] == 0
data = Batch.cat(data)
# for now, (data, seq_length) is something like a PackedSequence.

I think maintaining begin_index and episode_length is not a big problem. It is very similar to my implementation of sample_avail.

I'd prefer to add a new argument to the initialization of Collector to indicate whether to use this mode.

duburcqa avatar duburcqa commented on May 16, 2024

Should we merge the essential commit from dev to master at this point?

Yes I think it is appropriate.

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

Should we merge the essential commit from dev to master at this point?

Yes I think it is appropriate.

But should we release a new version like 0.2.4.rc1, since we want to keep master stable?

duburcqa avatar duburcqa commented on May 16, 2024

But should we release a new version like 0.2.4.rc1, since we want to keep master stable?

It's up to you; it's really a personal choice. For a Python-only project, I tend to prefer releasing patches like 0.2.4.rc1 in the case of "critical" bug fixes. But there is no clear answer. I would say the minimum is to push them to master, but not necessarily to create an actual release.

heartInsert avatar heartInsert commented on May 16, 2024

Has anyone tried a transformer before? In the NLP area, transformers seem to perform better than RNN units.

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

But in RL, the bigger the model, the harder it is to converge to an optimal policy. :(

heartInsert avatar heartInsert commented on May 16, 2024

OK, I get it.

doylech avatar doylech commented on May 16, 2024

Is there a reason why, in the DQNPolicy.learn method, you don't pass an initial hidden state to the forward method during training? Wouldn't this allow it to train from better predictions?

From Recurrent Experience Replay in Distributed Reinforcement Learning by Kapturowski et al. in 2018 (https://openreview.net/pdf?id=r1lyTjAqYX)

...although zero state initialization was often used in previous works (Hausknecht & Stone, 2015; Gruslys et al., 2018), we have found that it leads to misestimated action-values, especially in the early states of replayed sequence

cbellinger27 avatar cbellinger27 commented on May 16, 2024

Is it possible to use the recurrent network in the discrete soft actor critic?

BFAnas avatar BFAnas commented on May 16, 2024

No. As you mentioned, zero-state initialization was often used in previous works, and we mimic that workflow here.

@Trinkle23897 Could you please point me to the exact workflow this RNN implementation is based on? I'm trying to figure out why it doesn't work (I have tested RNN with SAC and with DQN on 5 environments; it only worked with DQN on CartPole).

xiaoshenxian avatar xiaoshenxian commented on May 16, 2024

Hello @Trinkle23897, I am encountering a problem at lines 89-90 in
https://github.com/thu-ml/tianshou/blob/master/tianshou/policy/modelfree/a2c.py#L89
namely:

v_s.append(self.critic(minibatch.obs))
v_s_.append(self.critic(minibatch.obs_next))

I am wondering why the critic does not support a 'state' input as the actor does. Since it is common for the critic to share the same RNN input network as the actor, how can I pass the same state that the actor used to the critic? I noticed there is a state batch stored in the 'policy' key of the minibatch, but this should be the output state if I understand correctly, right? Or am I not supposed to pass any state to the critic during training?

Did I miss anything here? Thanks a lot.

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

Or am I not supposed to pass any state to the critic during training?

Currently, yes

xiaoshenxian avatar xiaoshenxian commented on May 16, 2024

Or am I not supposed to pass any state to the critic during training?

Currently, yes

Thanks. I finally put the input state into the model input dict to solve the problem.

Another thing: I found that torch.nn.utils.clip_grad_norm_ cannot deal with inf gradients, which often happens with RNNs. Would you mind adding a torch.nn.utils.clip_grad_value_ before it in all policy implementations, like below? Thanks a lot.

if self._grad_value:  # clip large gradient
    nn.utils.clip_grad_value_(
        self._actor_critic.parameters(), clip_value=self._grad_value
    )
if self._grad_norm:  # clip large gradient
    nn.utils.clip_grad_norm_(
        self._actor_critic.parameters(), max_norm=self._grad_norm
    )

experiment:

>>> w.grad
tensor([[0.4000, 0.4000, 0.4000, 0.4000, 0.4000],
        [   inf,    inf,    inf,    inf,    inf],
        [2.3000, 2.3000, 2.3000, 2.3000, 2.3000]])
>>> torch.nn.utils.clip_grad_norm_([w], 0.5) # this cannot deal with inf
tensor(inf)
>>> w.grad
tensor([[0., 0., 0., 0., 0.],
        [nan, nan, nan, nan, nan],
        [0., 0., 0., 0., 0.]])
>>> torch.nn.utils.clip_grad_value_([w], 0.5) # and this cannot deal with nan
>>> w.grad
tensor([[0., 0., 0., 0., 0.],
        [nan, nan, nan, nan, nan],
        [0., 0., 0., 0., 0.]])

========

>>> w.grad
tensor([[0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [   inf,    inf,    inf,    inf,    inf],
        [1.8000, 1.8000, 1.8000, 1.8000, 1.8000]])
>>> torch.nn.utils.clip_grad_value_([w], 0.5) # but this can deal with inf
>>> w.grad
tensor([[0.2000, 0.2000, 0.2000, 0.2000, 0.2000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
        [0.5000, 0.5000, 0.5000, 0.5000, 0.5000]])

Trinkle23897 avatar Trinkle23897 commented on May 16, 2024

Sure! Feel free to submit a PR
