I am trying to reproduce your results and I am running into an issue with the code (not running inside Docker). Backpropagation fails with the following error:
[ERROR 13:59:02] pymarl Failed after 0:00:23!
Traceback (most recent calls WITHOUT Sacred internals):
  File "src/main.py", line 34, in my_main
    run(_run, _config, _log)
  File "/home/USER/.tmp/MAVEN/maven_code/src/run.py", line 48, in run
    run_sequential(args=args, logger=logger)
  File "/home/USER/.tmp/MAVEN/maven_code/src/run.py", line 181, in run_sequential
    learner.train(episode_sample, runner.t_env, episode)
  File "/home/USER/.tmp/MAVEN/maven_code/src/learners/noise_q_learner.py", line 168, in train
    loss.backward()
  File "/home/USER/.general_env/lib/python3.8/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/USER/.general_env/lib/python3.8/site-packages/torch/autograd/__init__.py", line 97, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [32, 56, 3, 9]], which is output 0 of SliceBackward, is at version 2; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

During handling of the above exception, another exception occurred:

Traceback (most recent calls WITHOUT Sacred internals):
  File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python3.8/subprocess.py", line 1079, in wait
    return self._wait(timeout=timeout)
  File "/usr/lib/python3.8/subprocess.py", line 1796, in _wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command '['tee', '-a', '/tmp/tmp8fr28bt_']' timed out after 1 seconds
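Following the hint at the end of the RuntimeError, anomaly detection can be switched on before training to trace the in-place write back to the forward op that produced the tensor. Here is a minimal, self-contained sketch of the failure mode and the flag (illustration only, not MAVEN code):

import torch

# Enable anomaly detection as the error hint suggests: backward() will
# then also print the forward-pass traceback of the op whose saved
# tensor was modified in place.
torch.autograd.set_detect_anomaly(True)

# Same failure mode in miniature: exp() saves its output for the
# backward pass, so writing into y in place bumps its version counter
# and backward() raises the "modified by an inplace operation" error.
x = torch.randn(4, 3, requires_grad=True)
y = torch.exp(x)
loss = y.sum()
y[0] = 0.0        # in-place write after y entered the graph
loss.backward()   # raises, and anomaly mode points at torch.exp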
This is what I changed to get the code up and running, in the train function of src/learners/noise_q_learner.py:
# Max over target Q-Values
if self.args.double_q:
    # Work on a detached copy so that the in-place masking below does
    # not modify a tensor that the autograd graph still needs
    mac_out_detach = mac_out.clone().detach()
    mac_out_detach[avail_actions == 0] = -9999999
    cur_max_actions = mac_out_detach[:, 1:].max(dim=3, keepdim=True)[1]
    target_max_qvals = th.gather(
        target_mac_out, 3, cur_max_actions
    ).squeeze(3)
    # Get actions that maximise live Q (for double q-learning)
    #mac_out[avail_actions == 0] = -9999999
    #cur_max_actions = mac_out[:, 1:].max(dim=3, keepdim=True)[1]
    #target_max_qvals = th.gather(target_mac_out, 3, cur_max_actions).squeeze(3)
else:
    target_max_qvals = target_mac_out.max(dim=3)[0]

# Discriminator
mac_out_detach = mac_out.clone().detach()
mac_out_detach[avail_actions == 0] = -9999999
q_softmax_actions = th.nn.functional.softmax(
    mac_out_detach[:, :-1], dim=3
)
#mac_out[avail_actions == 0] = -9999999
#q_softmax_actions = th.nn.functional.softmax(mac_out[:, :-1], dim=3)
Can you tell me whether these changes are OK and, if not, how the gradient propagation should be fixed? My worry is that detaching mac_out also cuts the discriminator gradient into the agent network. I assume the proper way to fix the discriminator backprop problem is to mask out the part of the target corresponding to unavailable actions rather than detaching. I will keep looking for that in the code, but I would really like to have your input.
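Concretely, for the discriminator branch I have something like the following in mind: replace the in-place assignment with an out-of-place masked_fill so the softmax input stays attached to the graph. This is just a sketch; masked_out is a name I made up, and I am assuming avail_actions has the same shape as mac_out:

# masked_fill returns a new tensor, so mac_out itself is never written
# to in place; gradients from the discriminator loss can still flow
# into the agent network through the available-action entries.
masked_out = mac_out.masked_fill(avail_actions == 0, -9999999)
q_softmax_actions = th.nn.functional.softmax(masked_out[:, :-1], dim=3)

Unlike the clone().detach() workaround above, this would keep the discriminator gradient path into the agents intact, which I suspect is what the original code intended.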