rl_games's People

Contributors

alex-petrenko, anishhdiwan, arthurallshire, densumy, denys88, dependabot[bot], erwincoumans, kellyguo11, mayankm96, romaf5, tttonyalpha, tylerlum, viktorm, vwxyzjn

rl_games's Issues

Using SAC

Hi,

I saw you were working on making it possible to train SAC agents with rl_games. Is that possible already? I was checking the configs and couldn't find anything; everything seems PPO-related. So I guess SAC can't be run currently?

wrapper flatten issue

Sorry, no time for a PR, but in common/wrappers.py, BatchedFrameStack._get_ob should be:

    def _get_ob(self):
        assert len(self.frames) == self.k
        if self.transpose:
            frames = np.transpose(self.frames, (1, 2, 0))
        else:
            if self.flatten:
                frames = np.array(self.frames)
                shape = np.shape(frames)
                frames = np.transpose(frames, (1, 0, 2))
                frames = np.reshape(frames, (shape[1], shape[0] * shape[2]))
            else:
                frames = np.transpose(self.frames, (1, 0, 2))
        return frames
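
For reference, here is a minimal standalone sketch of what the flatten branch computes, with made-up shapes (k stacked frames over parallel envs):

import numpy as np

# Hypothetical sizes: 4 stacked frames, 8 parallel envs, 11-dim observations.
k, num_envs, obs_dim = 4, 8, 11
frames = np.random.rand(k, num_envs, obs_dim)

# Move the env axis first, then concatenate the k frames per env.
stacked = np.transpose(frames, (1, 0, 2)).reshape(num_envs, k * obs_dim)
assert stacked.shape == (num_envs, k * obs_dim)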

How to use CNN in PPO

Hi, does anyone know where I can find an example config file like rl_games/configs/ppo_continuous.yaml, except that I want to use a CNN to handle image input? I tried to set up the config file as follows:

config:
    name: ${resolve_default:FrankaCabinet,${....experiment}}
    full_experiment_name: ${.name}
    env_name: rlgpu
    ppo: True
    mixed_precision: False
    normalize_input: True
    normalize_value: True
    num_actors: ${....task.env.numEnvs}
    reward_shaper:
      scale_value: 0.01
    normalize_advantage: True
    gamma: 0.99
    tau: 0.95
    learning_rate: 5e-4
    lr_schedule: adaptive
    kl_threshold: 0.008
    score_to_win: 10000
    max_epochs: ${resolve_default:1500,${....max_iterations}}
    save_best_after: 200
    save_frequency: 100
    print_stats: True
    grad_norm: 1.0
    entropy_coef: 0.0
    truncate_grads: True
    e_clip: 0.2
    horizon_length: 16
    minibatch_size: 5
    mini_epochs: 8
    critic_coef: 4
    clip_value: True
    seq_len: 4
    bounds_loss_coef: 0.0001
    use_central_value: True
    central_value_config:
      normalize_input: True
      learning_rate: 0.0005
      input_shape: [3, 320, 480]
      model:
        name: continuous_a2c_logstd

      network:
        name: resnet_actor_critic
        separate: False
        value_shape: 1
        space:
          discrete:

        cnn:
          conv_depths: [ 16, 32, 32 ]
          activation: relu
          initializer:
            name: default
          regularizer:
            name: 'None'

        mlp:
          units: [ 256, 128, 64 ]
          activation: elu
          d2rl: False

          initializer:
            name: default
          regularizer:
            name: None

I set use_central_value to True and set central_value_config, but the following error occurred:

Traceback (most recent call last):
  File "train.py", line 133, in launch_rlg_hydra
    'checkpoint': cfg.checkpoint
  File "/home/quan/rl_games/rl_games/torch_runner.py", line 109, in run
    self.run_train(args)
  File "/home/quan/rl_games/rl_games/torch_runner.py", line 88, in run_train
    agent = self.algo_factory.create(self.algo_name, base_name='run', params=self.params)
  File "/home/quan/rl_games/rl_games/common/object_factory.py", line 15, in create
    return builder(**kwargs)
  File "/home/quan/rl_games/rl_games/torch_runner.py", line 38, in <lambda>
    self.algo_factory.register_builder('a2c_continuous', lambda **kwargs : a2c_continuous.A2CAgent(**kwargs))
  File "/home/quan/rl_games/rl_games/algos_torch/a2c_continuous.py", line 59, in __init__
    self.central_value_net = central_value.CentralValueTrain(**cv_config).to(self.ppo_device)
  File "/home/quan/rl_games/rl_games/algos_torch/central_value.py", line 37, in __init__
    self.model = network.build(state_config)
  File "/home/quan/rl_games/rl_games/algos_torch/models.py", line 28, in build
    return self.Network(self.network_builder.build(self.model_class, **config), obs_shape=obs_shape,
  File "/home/quan/rl_games/rl_games/algos_torch/network_builder.py", line 766, in build
    net = A2CResnetBuilder.Network(self.params, **kwargs)
  File "/home/quan/rl_games/rl_games/algos_torch/network_builder.py", line 599, in __init__
    NetworkBuilder.BaseNetwork.__init__(self, **kwargs)
  File "/home/quan/rl_games/rl_games/algos_torch/network_builder.py", line 35, in __init__
    nn.Module.__init__(self, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'num_agents'

So, is there any config file that I can refer to?

Continuing training from checkpoint

Hi, on the computing infrastructure I am using, I need to continue interrupted training regularly. I have been trying to use the checkpointing utility (for PPO, but I think these issues appear for all algorithms) to reload the checkpoints, but training does not actually continue from those checkpoints. I believe that is because other important parameters, such as the optimizer state, are not stored in the checkpoints (please correct me if I am wrong).

In the image below, I interrupted two runs with the same seed at two different states and continued training from the latest checkpoint.
image

Would it be possible to checkpoint all components of the algorithms to enable continuing training from a checkpoint?
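
As a rough illustration of the kind of checkpoint this would need (a generic PyTorch pattern with toy stand-ins, not rl_games' actual save format):

import torch
import torch.nn as nn

# Toy stand-ins; in practice these would be the agent's network and optimizer.
model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
epoch_num, frame = 100, 100 * 4096

# Save weights plus optimizer state and counters so training can resume exactly.
torch.save({
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'epoch': epoch_num,
    'frame': frame,
}, 'last_checkpoint.pth')

# On restart:
state = torch.load('last_checkpoint.pth')
model.load_state_dict(state['model'])
optimizer.load_state_dict(state['optimizer'])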

rl_games with Brax trains too fast, so step_time = 0 -> crash

Running

python runner.py --train --file rl_games/configs/brax/ppo_ant.yaml

trains so fast that step_time becomes 0.0 and then leads to a crash in two different places:

fps step: 1048568.0 fps step and policy inference: 699047.1  fps total: 299589.9 epoch: 3/1000
fps step: 1048576.0 fps step and policy inference: 699047.1  fps total: 299589.9 epoch: 4/1000
fps step: 699050.7 fps step and policy inference: 419429.1  fps total: 262141.5 epoch: 5/1000
fps step: 524284.0 fps step and policy inference: 524284.0  fps total: 253261.5 epoch: 6/1000
fps step: 613857.2 fps step and policy inference: 613857.2  fps total: 282772.3 epoch: 7/1000
fps step: 699043.6 fps step and policy inference: 524280.0  fps total: 259049.0 epoch: 8/1000
fps step: 699040.0 fps step and policy inference: 419433.0  fps total: 233017.7 epoch: 9/1000
fps step: 699054.2 fps step and policy inference: 524286.0  fps total: 262141.5 epoch: 10/1000
fps step: 699054.2 fps step and policy inference: 524282.0  fps total: 262140.5 epoch: 11/1000
fps step: 699047.1 fps step and policy inference: 524284.0  fps total: 262141.5 epoch: 12/1000
fps step: 699043.6 fps step and policy inference: 349519.1  fps total: 209712.6 epoch: 13/1000
fps step: 524282.0 fps step and policy inference: 349520.9  fps total: 209712.3 epoch: 14/1000
fps step: 699040.0 fps step and policy inference: 349520.9  fps total: 234601.1 epoch: 15/1000
Traceback (most recent call last):
  File "runner.py", line 67, in <module>
    runner.run(args)
  File "F:\dev\rl_games\rl_games\torch_runner.py", line 122, in run
    self.run_train(args)
  File "F:\dev\rl_games\rl_games\torch_runner.py", line 103, in run_train
    agent.train()
  File "F:\dev\rl_games\rl_games\common\a2c_common.py", line 1158, in train
    self.write_stats(total_time, epoch_num, step_time, play_time, update_time, a_losses, c_losses, entropies, kls, last_lr, lr_mul, frame, scaled_time, scaled_play_time, curr_frames)
  File "F:\dev\rl_games\rl_games\common\a2c_common.py", line 284, in write_stats
    self.writer.add_scalar('performance/step_fps', curr_frames / step_time, frame)
ZeroDivisionError: float division by zero
fps step: 744015.2 fps step and policy inference: 488626.7  fps total: 301957.5 epoch: 741/1000
fps step: 699032.9 fps step and policy inference: 523978.2  fps total: 286478.0 epoch: 742/1000
fps step: 795304.5 fps step and policy inference: 432675.5  fps total: 328279.1 epoch: 743/1000
fps step: 2097216.0 fps step and policy inference: 524284.0  fps total: 299591.2 epoch: 744/1000
fps step: 524276.0 fps step and policy inference: 524276.0  fps total: 299587.9 epoch: 745/1000
fps step: 524286.0 fps step and policy inference: 524286.0  fps total: 299591.2 epoch: 746/1000
Traceback (most recent call last):
  File "runner.py", line 67, in <module>
    runner.run(args)
  File "F:\dev\rl_games\rl_games\torch_runner.py", line 122, in run
    self.run_train(args)
  File "F:\dev\rl_games\rl_games\torch_runner.py", line 103, in run_train
    agent.train()
  File "F:\dev\rl_games\rl_games\common\a2c_common.py", line 1153, in train
    fps_step = curr_frames / step_time
ZeroDivisionError: float division by zero
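
A trivial guard would be to clamp the measured step time before dividing; this is only a sketch of the idea, not necessarily how the maintainers would want to fix a2c_common.py:

# Hypothetical guard around the fps computation.
EPS = 1e-9

step_time = 0.0       # what the Brax run effectively measures at these speeds
curr_frames = 32768   # made-up frame count

fps_step = curr_frames / max(step_time, EPS)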

rl_device in Network

Hi there, I am just wondering: in A2CBuilder's Network, which CUDA device is actually used for training, and how would I hand down the sim_device and rl_device variables from env_creator: create_env_thunk in IsaacGymEnvs (or similar) into the Network class, so that I can put the tensors onto the right GPU for torchrun multi-GPU training? Or is the right CUDA device set automatically?
Kind regards

rl_games won't load trained model with command line argument `--checkpoint=` given

Problem

I am using rl-games with Isaac Gym to train my RL agent. However, when I was trying to use the --checkpoint= command line argument to resume training, I found that the training always restarts from the very beginning. I use rl-games in the way below:

runner = Runner(algo_observer)
runner.load(cfg_train)
runner.reset()
runner.run(args)

and resume my training with command:

$ python ./rlg_train.py --task=[my_task_name] --checkpoint=[absolute path of trained model]

My Solution

I took a look at the source code and found that the class method Runner.run_train(self) contains a duplicate load_config() call.

else:
    self.reset()
    **self.load_config(self.default_config)**

This line causes the command line argument --checkpoint to be overridden by the configuration in the config file.

I think this call should be deleted, and the following should be added to the Runner.run(self, args) function:

if 'checkpoint' in args and args['checkpoint'] is not None:
    if len(args['checkpoint']) > 0:
        **self.load_check_point = True**
        self.load_path = args['checkpoint']

so that I can use command line argument to resume the training without modifying my config file.

Could you please take a look and check if I've gotten it right? Thanks a lot!

Sequential Multi-agent PPO with DR

Hi,
I have a few doubts with respect to implementing multiple agents in Isaac Gym (or Brax). (Apologies if they are too trivial.)

I want to use 2 or more agents in the same experiment (agents will have different environments, especially if Domain Randomisation is enabled) and train them sequentially (i.e. first Agent 1 gets trained via PPO, then Agent 2 and so on...)

How can I go about implementing this? I am not sure which files I should be modifying and how to configure train.py to support the above functionality.

Thanks!

RuntimeError: rnn: hx is not contiguous when using multilayer-LSTM as network

Hi, I came into the error

Traceback (most recent call last):
  File "./train.py", line 110, in launch_rlg_hydra
    runner.run({
  File "/home/fuchaojie/DATA_UBUNTU/Isaac_env/lib/python3.8/site-packages/rl_games/torch_runner.py", line 139, in run
    self.run_train()
  File "/home/fuchaojie/DATA_UBUNTU/Isaac_env/lib/python3.8/site-packages/rl_games/torch_runner.py", line 125, in run_train
    agent.train()
  File "/home/fuchaojie/DATA_UBUNTU/Isaac_env/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1143, in train
    step_time, play_time, update_time, sum_time, a_losses, c_losses, b_losses, entropies, kls, last_lr, lr_mul = self.train_epoch()
  File "/home/fuchaojie/DATA_UBUNTU/Isaac_env/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1023, in train_epoch
    self.train_central_value()
  File "/home/fuchaojie/DATA_UBUNTU/Isaac_env/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 521, in train_central_value
    return self.central_value_net.train_net()
  File "/home/fuchaojie/DATA_UBUNTU/Isaac_env/lib/python3.8/site-packages/rl_games/algos_torch/central_value.py", line 176, in train_net
    loss += self.train_critic(self.dataset[idx])
  File "/home/fuchaojie/DATA_UBUNTU/Isaac_env/lib/python3.8/site-packages/rl_games/algos_torch/central_value.py", line 155, in train_critic
    loss = self.calc_gradients(input_dict)
  File "/home/fuchaojie/DATA_UBUNTU/Isaac_env/lib/python3.8/site-packages/rl_games/algos_torch/central_value.py", line 201, in calc_gradients
    values, _ = self.forward(batch_dict)
  File "/home/fuchaojie/DATA_UBUNTU/Isaac_env/lib/python3.8/site-packages/rl_games/algos_torch/central_value.py", line 136, in forward
    value, rnn_states = self.model(input_dict)
  File "/home/fuchaojie/DATA_UBUNTU/Isaac_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/fuchaojie/DATA_UBUNTU/Isaac_env/lib/python3.8/site-packages/rl_games/algos_torch/network_builder.py", line 403, in forward
    out, states = self.rnn(out, states)
  File "/home/fuchaojie/DATA_UBUNTU/Isaac_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/fuchaojie/DATA_UBUNTU/Isaac_env/lib/python3.8/site-packages/torch/nn/modules/rnn.py", line 691, in forward
    result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
RuntimeError: rnn: hx is not contiguous

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

when running Isaac Gym (Preview 3)'s ShadowHandOpenAI_LSTM example with the layers parameter in the training config file ShadowHandPPOAsymmLSTM.yaml set to 2. It seems that rl_games doesn't currently support multilayer LSTMs. Is that true, or is this just a bug?
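
For context, this is the standard PyTorch complaint when the hidden-state tensors handed to an LSTM are non-contiguous views. A generic workaround (just a sketch, not a confirmed rl_games fix) is to call .contiguous() on the states before the forward pass:

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2)
x = torch.randn(4, 3, 8)  # (seq_len, batch, input_size)

# Simulate hidden states that became non-contiguous views, e.g. via a transpose.
h0 = torch.randn(3, 2, 16).transpose(0, 1)  # (num_layers, batch, hidden_size), non-contiguous
c0 = torch.randn(3, 2, 16).transpose(0, 1)

h0, c0 = h0.contiguous(), c0.contiguous()   # avoids "rnn: hx is not contiguous"
out, (hn, cn) = lstm(x, (h0, c0))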

Pull Request #113 breaks Isaac Gym

I was modifying some of the rl_games code when I noticed that newer versions do not work with Isaac Gym. Prior to merge #113, things appear to work correctly.

Segmentation fault when importing env_configurations

Dear colleagues,

Thanks for your great contribution!

I found one problem when:
$ import rl_games.common.env_configurations
My PC will output: Segmentation fault (core dumped)
I tried many different virtual environments, but the same problem occurs.

Finally, I solved the problem by moving import rl_games.envs.test to line 1 of env_configurations.py.

I do not know exactly why it works after moving line 3 to the beginning; just letting you know in case it is a potential bug.

PC environment: python 3.7, ubuntu 18.04, AMD 3990x cpu, NV RTX 3080.

ImportError: libGL.so.1: cannot open shared object file: No such file or directory

Running
python rlg_train.py --task Ant
from carbgym/python/rlgpu

gives the following error:

Traceback (most recent call last):
  File "rlg_train.py", line 13, in <module>
    from rl_games.common import env_configurations, experiment, vecenv
  File "/home/dcg-adlr-gradeyw-source/rl_games/rl_games/common/env_configurations.py", line 1, in <module>
    from rl_games.common import wrappers
  File "/home/dcg-adlr-gradeyw-source/rl_games/rl_games/common/wrappers.py", line 8, in <module>
    import cv2
  File "/opt/conda/lib/python3.6/site-packages/cv2/__init__.py", line 5, in <module>
    from .cv2 import *
ImportError: libGL.so.1: cannot open shared object file: No such file or directory

Using the master branch of carbgym and Python 3.6

version number in `rl_games`

>>> import rl_games
>>> rl_games.__
rl_games.__cached__          rl_games.__doc__             rl_games.__getattribute__(   rl_games.__le__(             rl_games.__new__(            rl_games.__repr__(           rl_games.__subclasshook__(
rl_games.__class__(          rl_games.__eq__(             rl_games.__gt__(             rl_games.__loader__          rl_games.__package__         rl_games.__setattr__(        
rl_games.__delattr__(        rl_games.__file__            rl_games.__hash__(           rl_games.__lt__(             rl_games.__path__            rl_games.__sizeof__(         
rl_games.__dict__            rl_games.__format__(         rl_games.__init__(           rl_games.__name__            rl_games.__reduce__(         rl_games.__spec__            
rl_games.__dir__(

I don't see `rl_games.__version__`, so if you install rl_games you can't tell which version you have.
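
For reference, the usual convention (just a sketch, not the project's current packaging) is a module-level attribute in rl_games/__init__.py, e.g. __version__ = "1.6.0" (placeholder value), which downstream code could then check:

import rl_games

print(getattr(rl_games, '__version__', 'unknown'))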

SAC Integration

Hi! Been using rl_games for a few months, awesome work guys :) Was wondering if the SAC integration will be ready anytime soon to try out?

Thanks!

Difficulties in adoption of code

Hi to all,

First, congrats on the work. It is truly appealing.

I came to the rl-games repo through the IsaacGymEnvs repository. I am extending several of my works to run on this new simulator via IsaacGymEnvs, which depends on version 1.1.3 of rl-games.

This issue is more of a request/piece of advice to have a more organized and structured repository and collaboration framework. Things that I believe could help encourage third-party contributions and wider adoption of your repository are:

  • Release version tags
  • Commits with version numbering
  • Enforcing abstract class inheritance on algorithms and builders. This would eliminate problems such as using different words for the same concepts (frame, step), (epoch, episode), which leads to confusion and probably bugs, as these are in reality different concepts.
  • Documenting the configuration options.
  • Reducing the number of branches, and giving each branch an identifiable goal.

Perhaps it would be useful to clarify and standardize some concepts, like:

  • step: Step can refer to a simulation step or to a single agent's simulation/experience step. In the case of parallel simulation, a single simulation step accounts for multiple experience "steps".
  • frame: It's unclear how you use this concept; sometimes it seems to refer to a simulation step, and sometimes to epochs.
  • epoch: A user-defined numerical value that triggers a logging sequence of metrics and statistics. The units in which this and other frequency variables (e.g., max_iterations) are defined are also ambiguous, as they sometimes appear to be defined in terms of samples of experience collected (preferable) and sometimes in epochs or batches (which depend on the specific batch or epoch size).
  • actor: In your implementation of SAC, the concepts of actors, agents and envs are constantly interchanged, generating confusion. For the sake of generality (multi-agent), an env might hold multiple agents, and each agent could have multiple actor networks.

I was cleaning up and fixing some of the bugs in your SAC implementation when I found all of these problems, which have made it really difficult to contribute and to work with the different versions of the repo code that IsaacGymEnvs depends on.

Export torchscript models to C++

I've been testing the PPO implementation, and it doesn't seem to be currently possible to export a model as a C++-compatible module.
Is it something you are planning?

If not, I could try to give it a go, though I would appreciate it if you have any pointers.
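
As a rough sketch of the general approach (plain TorchScript tracing with a stand-in network, not an existing rl_games feature):

import torch
import torch.nn as nn

# Stand-in policy; the real one would be rebuilt from the trained rl_games checkpoint.
policy = nn.Sequential(nn.Linear(60, 256), nn.ELU(), nn.Linear(256, 8))
policy.eval()

example_obs = torch.randn(1, 60)
traced = torch.jit.trace(policy, example_obs)  # record the forward pass as TorchScript
traced.save('policy_traced.pt')
# The saved module can then be loaded from C++ via torch::jit::load("policy_traced.pt").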

Wandb does not seem to record time or step correctly

I am running PPO with wandb integration, but the statistics seem to not be recorded as intended.

I am testing this with Isaac Gym environments but I am unsure if this issue is specific to Isaac Gym.

Steps to reproduce: after installing following the IsaacGymEnvs instructions, run a command like this in the isaacgymenvs/ directory:

python train.py task=Ant headless=True wandb_activate=True wandb_entity=danieltakeshi wandb_project=isaac-gym

Where you can replace danieltakeshi with your username, and change isaac-gym to your project.

After I run this, the reward goes up (good) but I also see this on wandb:

Screenshot from 2022-10-27 13-23-04

The code is recording the reward as a function of iter, step, and time. It stores it in rl_games here:

for i in range(self.value_size):
    rewards_name = 'rewards' if i == 0 else 'rewards{0}'.format(i)
    self.writer.add_scalar(rewards_name + '/step'.format(i), mean_rewards[i], frame)
    self.writer.add_scalar(rewards_name + '/iter'.format(i), mean_rewards[i], epoch_num)
    self.writer.add_scalar(rewards_name + '/time'.format(i), mean_rewards[i], total_time)

self.writer.add_scalar('episode_lengths/step', mean_lengths, frame)
self.writer.add_scalar('episode_lengths/iter', mean_lengths, epoch_num)
self.writer.add_scalar('episode_lengths/time', mean_lengths, total_time)

The code is storing the statistics with respect to different quantities (epoch, step, and time) to the self.writer which is a tensorboardX.SummaryWriter (link to docs). But the statistics on wandb seem to only show the x-axis as "iter" (which is the same as epoch_num here) and they don't show performance as a function of the step or time. Is there a way to address such an issue here?
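
One generic wandb mechanism for custom x-axes is wandb.define_metric; this is only a sketch, and I have not verified how it interacts with the tensorboardX sync that rl_games relies on:

import wandb

wandb.init(project='isaac-gym')

# Log our own step/time counters and tell wandb to use them as x-axes.
wandb.define_metric('frame')
wandb.define_metric('total_time')
wandb.define_metric('rewards/step', step_metric='frame')
wandb.define_metric('rewards/time', step_metric='total_time')

wandb.log({'frame': 1024, 'total_time': 3.2, 'rewards/step': 5.0, 'rewards/time': 5.0})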

(Also posting on the Isaac Gym repo isaac-sim/IsaacGymEnvs#87)

RNN for Experience Replay implemented?

Hi there,

I was just wondering whether RNN experience replay is implemented correctly.
The reason I ask is that in play_steps_rnn(), update_data() is called but not update_data_rnn().
Specifically, to replay experiences with an RNN, a whole seq_len would have to be replayed for the GRU or LSTM to deliver the right results, right? Or is the current state of the GRU/LSTM cells also stored in the replay buffer at each step?

Or maybe I haven't fully understood these concepts in RL yet.

Kind regards

Entropy calculation for (tanh) transformed normal distribution - SquashedNormal

I am trying to use the Squashed Normal distribution for training a PPO agent, to bound the action space. For the SquashedNormal distribution, entropy is assumed to be equal to the entropy of the base (Normal) distribution, which ignores the additional E[log(d tanh(x)/dx)] term. Would using the entropy of the underlying Normal distribution as a proxy (since the entropy of the transformed distribution has no closed form) cause any stability issues?
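
For reference, the change-of-variables identity behind that missing term (a standard result, written out for the tanh squash):

\[
Y = \tanh(X),\qquad
H(Y) = H(X) + \mathbb{E}_X\!\left[\log\left|\tfrac{d\tanh(x)}{dx}\right|\right]
     = H(X) + \mathbb{E}_X\!\left[\log\bigl(1 - \tanh^2(X)\bigr)\right] \le H(X),
\]

since \(1 - \tanh^2(x) \in (0, 1]\). So using the base Normal entropy overestimates the true entropy of the squashed distribution.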

Thank you!

Ray or hvd

Hi there
Why is Horovod needed if you have Ray? Ray can also run on multiple GPUs. And don't the two interfere with each other? And where are the parameters handled, in Ray's database or in Horovod?
Kind regards

Latest version of rl_games not compatible with Isaac Gym

Hi, I was trying to update rl_games to the latest 1.4.0 version.
However, the latest version of rl_games fails to achieve the same performance as 1.1.4, which IsaacGymEnvs requires.

The environment I'm testing with is Humanoid, and the command I used is as follows:
python train.py task=Humanoid headless=True
and
python train.py task=Humanoid checkpoint=runs/Humanoid/nn/last_Humanoid_ep_500_rew_5396.84.pth test=True num_envs=9

When using 1.1.4, the humanoids can run forward, but with the latest version of rl_games, all the humanoids just collapse where they start.

May I know the changes between 1.1.4 and the latest version?
Or should I change something for the yaml config file to make it work in the latest version?

No module named '_tkinter'

The latest rl_games imports turtle, which imports tkinter, leading to this error. Is this an absolutely unavoidable import?

FYI: I commented out the first line of rl_games/algos_torch/sac_agent.py (from turtle import shape), and things seem to work fine without that import.

Traceback (most recent call last):
  File "runner.py", line 44, in <module>
    from rl_games.torch_runner import Runner
  File "F:\dev\rl_games\rl_games\torch_runner.py", line 19, in <module>
    from rl_games.algos_torch import sac_agent
  File "F:\dev\rl_games\rl_games\algos_torch\sac_agent.py", line 1, in <module>
    from turtle import shape
  File "c:\python37\lib\turtle.py", line 107, in <module>
    import tkinter as TK
  File "c:\python37\lib\tkinter\__init__.py", line 36, in <module>
    import _tkinter # If this fails your Python may not be configured for Tk
ModuleNotFoundError: No module named '_tkinter'

Debugging multi-GPU issue

In IsaacGymEnvs, rl-games + multi-GPU seems to have some issues. As shown in the screenshot, rl-games + multi-GPU uses twice the amount of data and performs worse than the single-GPU setting on Ant.

image

This issue tracks that investigation.

Proposed debugging route

I suggest we first make sure there is no loss in sample efficiency before scaling to more envs, by matching the implementation details in our CleanRL prototype: https://cleanrl-git-new-multi-gpu-vwxyzjn.vercel.app/rl-algorithms/ppo/#implementation-details_6.

Identified issues:

1. Seeding logic and configuration issue

We need to seed the multi-GPU processes with different seeds to decorrelate experience; otherwise the processes will produce exactly the same observations.

Configuration-wise we can set the overall seed with params.seed and env seed with params.config.env_config.seed, so if params.config.env_config.seed is set but params.seed is not set, we get identical observations from the environments as shown below:

image

This is probably ok since the agent still samples different actions, but it's nonetheless a problem. The correct implementation is to use seed = seed + local_rank.
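
A minimal sketch of that per-rank seeding (generic Python/PyTorch; how the local rank is obtained depends on the launcher, e.g. torchrun sets LOCAL_RANK):

import os
import random

import numpy as np
import torch

base_seed = 42
local_rank = int(os.environ.get('LOCAL_RANK', 0))  # launcher-specific; Horovod would use hvd.local_rank()
seed = base_seed + local_rank                      # decorrelates experience across GPU workers

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)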

2. Stepping logic issue

After fixing #163, I was able to match the sample efficiency in the single GPU setting:

image

However, the wall time is slower than I had expected. In a separate benchmark I made with CleanRL, the experiments show that Horovod should make Ant step 20% faster.

Maybe it's the averaging stats overhead? In the CleanRL benchmark experiments I did not mess with stats at all.

image

TypeError: conv1d(): argument 'padding' (position 5) must be tuple of ints, not str

Hi,

When I tried to run branch DM/torch_gpu with command python3 torch_runner.py --train --file configs/ppo_smac_cnn.yaml, I got the following complaint:

Traceback (most recent call last):
  File "torch_runner.py", line 141, in <module>
    runner.run(args)
  File "torch_runner.py", line 111, in run
    self.run_train()
  File "torch_runner.py", line 95, in run_train
    agent = self.algo_factory.create(self.algo_name, base_name='run', observation_space=obs_space, action_space=action_space, config=self.config)  
  File "/pymarl/common/object_factory.py", line 12, in create
    return builder(**kwargs)
  File "torch_runner.py", line 25, in <lambda>
    self.algo_factory.register_builder('a2c_discrete', lambda **kwargs : a2c_discrete.DiscreteA2CAgent(**kwargs)) 
  File "/pymarl/algos_torch/a2c_discrete.py", line 18, in __init__
    self.model = self.network.build(config)
  File "/pymarl/algos_torch/models.py", line 25, in build
(pid=67) Game has started.
    return ModelA2C.Network(self.network_builder.build('a2c', **config))
  File "/pymarl/algos_torch/network_builder.py", line 297, in build
    net = A2CBuilder.Network(self.params, **kwargs)
(pid=67) Sending ResponseJoinGame
  File "/pymarl/algos_torch/network_builder.py", line 157, in __init__
    'input_size' : self._calc_input_size(input_shape, self.actor_cnn), 
  File "/pymarl/algos_torch/network_builder.py", line 58, in _calc_input_size
    return nn.Sequential(*cnn_layers)(torch.rand(1, *(input_shape))).flatten(1).data.size(1)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 208, in forward
    self.padding, self.dilation, self.groups)
TypeError: conv1d(): argument 'padding' (position 5) must be tuple of ints, not str

Could you help me solve it, or give me any guidance on how to run PPO to get the reported performance?

Many thanks!

is it possible to "play" a model without initializing cuda? to avoid memory issues

I'm running this command to "play" my trained model without using the GPU:

python train.py task=Ant test=True checkpoint=cp.pth num_envs=4 sim_device=cpu rl_device=cpu pipeline=cpu

But I still get this CUDA memory error sometimes if I try to run it while a model is being trained in a different terminal window:

Error executing job with overrides: ['task=Ant', 'test=True', 'checkpoint=cp.pth', 'num_envs=4', 'sim_device=cpu', 'rl_device=cpu', 'pipeline=cpu']
Traceback (most recent call last):
  File "train.py", line 134, in <module>
    launch_rlg_hydra()
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/main.py", line 52, in decorated_main
    config_name=config_name,
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/_internal/utils.py", line 378, in _run_hydra
    lambda: hydra.run(
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/_internal/utils.py", line 381, in <lambda>
    overrides=args.overrides,
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "train.py", line 130, in launch_rlg_hydra
    'play': cfg.test,
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 142, in run
    player = self.create_player()
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 128, in create_player
    return self.player_factory.create(self.algo_name, config=self.config)
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/common/object_factory.py", line 15, in create
    return builder(**kwargs)
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 29, in <lambda>
    self.player_factory.register_builder('a2c_continuous', lambda **kwargs : players.PpoPlayerContinuous(**kwargs))
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/algos_torch/players.py", line 28, in __init__
    self.actions_low = torch.from_numpy(self.action_space.low.copy()).float().to(self.device)
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/torch/cuda/__init__.py", line 170, in _lazy_init
    torch._C._cuda_init()
RuntimeError: CUDA error: out of memory

I asked in the NVIDIA forum too, but thought I would check here whether it's an unavoidable rl_games thing:

https://forums.developer.nvidia.com/t/play-a-checkpoint-file-without-using-gpu-at-all-to-avoid-memory-errors/212764

Also, the memory error persists until I reboot. Is that a memory leak? Or is there any way rl_games could clear the GPU memory?

How to Specify Sequence length for Recurrent Network

Amazing repo! I was wondering if you could help me clear up some confusion I have around the recurrent layer implementations.

I found that the input to A2CBuilder.Network.forward() seems to have a sequence length of only 1, even though it is set to a non-1 value in the YAML.

I am currently on commit a33b6c4d ("easy fix (#145)"), up to date with the most recent master commit.

Steps to Reproduce

I ran this command:

python runner.py --train --file rl_games/configs/ppo_lunar_continiuos_torch.yaml

with a breakpoint at rl_games/algos_torch/network_builder.py:341~342

The shapes of a_out, a_states, c_out and c_states are all torch.Size([1, 16, 64]) (seq_length, batch_size, input_dim from the previous MLP).

However, the YAML file has params.config.seq_length: 4, which I assumed to be the length of the RNN sequence.

I also didn't find a mechanism in the code that passes in a sequence of inputs to the RNN.

I'm wondering if I missed something, or if this feature is not yet implemented?

Multi-GPU with Central Value not working

Trying to run multi-gpu training with horovod, I get the following error:

[1,1]<stderr>:/opt/conda/lib/python3.8/site-packages/horovod/torch/sync_batch_norm.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
[1,1]<stderr>:  LooseVersion(torch.__version__) >= LooseVersion('1.5.0') and
[1,1]<stderr>:/opt/conda/lib/python3.8/site-packages/gym/spaces/box.py:84: UserWarning: WARN: Box bound precision lowered by casting to float32
[1,1]<stderr>:  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
[1,1]<stderr>:/workspace/isaacgymenvs/isaacgymenvs/tasks/allegro_hand.py:275: DeprecationWarning: an integer is required (got type isaacgym._bindings.linux-x86_64.gym_38.DofDriveMode).  Implicit conversion to integers using __int__ is deprecated, and may be removed in a future version of Python.
[1,1]<stderr>:  asset_options.default_dof_drive_mode = gymapi.DOF_MODE_POS
[1,1]<stderr>:/opt/conda/lib/python3.8/site-packages/horovod/common/util.py:227: DeprecationWarning: Parameter `average` has been replaced with `op` and will be removed in v0.21.0
[1,1]<stderr>:  warnings.warn('Parameter `average` has been replaced with `op` and will be removed in v0.21.0',
[1,1]<stderr>:Error executing job with overrides: ['task=AllegroHandLSTM', 'headless=True', 'multi_gpu=True', 'train.params.config.mixed_precision=False']
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "train.py", line 137, in launch_rlg_hydra
[1,1]<stderr>:    runner.run({
[1,1]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/torch_runner.py", line 97, in run
[1,1]<stderr>:    self.run_train(args)
[1,1]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/torch_runner.py", line 78, in run_train
[1,1]<stderr>:    agent.train()
[1,1]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1141, in train
[1,1]<stderr>:    step_time, play_time, update_time, sum_time, a_losses, c_losses, b_losses, entropies, kls, last_lr, lr_mul = self.train_epoch()
[1,1]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1012, in train_epoch
[1,1]<stderr>:    self.train_central_value()
[1,1]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 516, in train_central_value
[1,1]<stderr>:    return self.central_value_net.train_net()
[1,1]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/algos_torch/central_value.py", line 194, in train_net
[1,1]<stderr>:    self.update_lr(self.lr)
[1,1]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/algos_torch/central_value.py", line 79, in update_lr
[1,1]<stderr>:    self.hvd.broadcast_value(lr_tensor, 'cv_learning_rate')
[1,1]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
[1,1]<stderr>:    raise AttributeError("'{}' object has no attribute '{}'".format(
[1,1]<stderr>:AttributeError: 'CentralValueTrain' object has no attribute 'hvd'
[1,1]<stderr>:
[1,1]<stderr>:Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[1,0]<stderr>:/opt/conda/lib/python3.8/site-packages/horovod/torch/sync_batch_norm.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
[1,0]<stderr>:  LooseVersion(torch.__version__) >= LooseVersion('1.5.0') and
[1,0]<stderr>:/opt/conda/lib/python3.8/site-packages/gym/spaces/box.py:84: UserWarning: WARN: Box bound precision lowered by casting to float32
[1,0]<stderr>:  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
[1,0]<stderr>:/workspace/isaacgymenvs/isaacgymenvs/tasks/allegro_hand.py:275: DeprecationWarning: an integer is required (got type isaacgym._bindings.linux-x86_64.gym_38.DofDriveMode).  Implicit conversion to integers using __int__ is deprecated, and may be removed in a future version of Python.
[1,0]<stderr>:  asset_options.default_dof_drive_mode = gymapi.DOF_MODE_POS
[1,0]<stderr>:/opt/conda/lib/python3.8/site-packages/horovod/common/util.py:227: DeprecationWarning: Parameter `average` has been replaced with `op` and will be removed in v0.21.0
[1,0]<stderr>:  warnings.warn('Parameter `average` has been replaced with `op` and will be removed in v0.21.0',
[1,0]<stderr>:Error executing job with overrides: ['task=AllegroHandLSTM', 'headless=True', 'multi_gpu=True', 'train.params.config.mixed_precision=False']
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "train.py", line 137, in launch_rlg_hydra
[1,0]<stderr>:    runner.run({
[1,0]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/torch_runner.py", line 97, in run
[1,0]<stderr>:    self.run_train(args)
[1,0]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/torch_runner.py", line 78, in run_train
[1,0]<stderr>:    agent.train()
[1,0]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1141, in train
[1,0]<stderr>:    step_time, play_time, update_time, sum_time, a_losses, c_losses, b_losses, entropies, kls, last_lr, lr_mul = self.train_epoch()
[1,0]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1012, in train_epoch
[1,0]<stderr>:    self.train_central_value()
[1,0]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 516, in train_central_value
[1,0]<stderr>:    return self.central_value_net.train_net()
[1,0]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/algos_torch/central_value.py", line 194, in train_net
[1,0]<stderr>:    self.update_lr(self.lr)
[1,0]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/rl_games/algos_torch/central_value.py", line 79, in update_lr
[1,0]<stderr>:    self.hvd.broadcast_value(lr_tensor, 'cv_learning_rate')
[1,0]<stderr>:  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1185, in __getattr__
[1,0]<stderr>:    raise AttributeError("'{}' object has no attribute '{}'".format(
[1,0]<stderr>:AttributeError: 'CentralValueTrain' object has no attribute 'hvd'
[1,0]<stderr>:
[1,0]<stderr>:Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
[1,0]<stdout>:[1,1]<stdout>:--------------------------------------------------------------------------

It seems that the central value module never creates or receives the Horovod wrapper object.

what is self.mb_rnn_states

hi there,
What does the value self.mb_rnn_states stand for? self.rnn_states is already set, but this value is updated when the horizon episode is finished.

kind regards

Extending Functionality for Policy and Replay Buffers

Hi,

Had a couple of questions regarding extending the current functionality present in rl_games:

  1. What would be the best way to extend one of the algos (say, A2C Continuous) to allow for an external function (like a controller function) to be called after the NN forward pass? For example, normally the forward pass (at inference time) might look like model_forward() --> dist_from_output() --> sample_from_dist(), whereas I'm hoping to inject an external function after the forward pass so that the pipeline would look like model_forward() --> external_postprocessing() --> dist_from_output() --> sample_from_dist(), where external_postprocessing() would take in the model's output and return post-processed values (potentially of a different dimension, which would be the "final" action dimension used to generate the sampling distribution; e.g., conversion from eef commands into joint torques). A sketch of this kind of hook follows after this list.

  2. What would be the best way to include additional information to be stored in the replay buffer (to be used by the above external function)? Ideally, this would be a dict of tensors that is stored along with the normal (s, a, r, s') values for a given env step.
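
Purely as an illustration of the hook described in point 1 (generic PyTorch with made-up names and shapes, not rl_games' actual model classes):

import torch
import torch.nn as nn

def external_postprocessing(raw_out: torch.Tensor) -> torch.Tensor:
    # Placeholder for e.g. an eef-command -> joint-torque controller.
    return torch.tanh(raw_out)

class PostprocessedPolicy(nn.Module):
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, obs):
        mu = external_postprocessing(self.backbone(obs))             # model_forward -> external_postprocessing
        dist = torch.distributions.Normal(mu, torch.ones_like(mu))   # dist_from_output
        return dist.sample()                                         # sample_from_dist

policy = PostprocessedPolicy(nn.Linear(10, 4))
action = policy(torch.randn(2, 10))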

Working with @ViktorM on applications relevant to these features and he thought it might be best if I posted here. Thanks!

How to get rl_games==1.1.4 source code

How can I get the rl_games v1.1.4 source code, which is required by IsaacGymEnvs? I want to add my own algorithms to rl_games, so I need the source code. Does anyone know how to do that?

Allegro Hand In-hand Manipulation Example Env Source Code

On the bottom left of the example GIF for Isaac Gym in the root README on the master branch, there is an in-hand manipulation result on the Allegro Hand. Where is the source for that env? Was it created for use in rl_games? I would like to recreate those results and add to that experiment if possible. Thanks!

a2c_common.py: UnboundLocalError: local variable 'mean_rewards' referenced before assignment

(This is on line 1214 in the version that Isaac Gym is using):

self.save(os.path.join(self.nn_dir, 'last_' + self.config['name'] + 'ep' + str(epoch_num) + 'rew' + str(mean_rewards)))

The only change I made in IsaacGymEnvs is this function in isaacgymenvs/tasks/cartpole.py:

@torch.jit.script
def compute_cartpole_reward(pole_angle, pole_vel, cart_vel, cart_pos,
                            reset_dist, reset_buf, progress_buf, max_episode_length):
    # type: (Tensor, Tensor, Tensor, Tensor, float, Tensor, Tensor, float) -> Tuple[Tensor, Tensor]
    reward = 1 - torch.abs(pole_angle) - 0.01*torch.abs(cart_vel)
    reset = reset_buf
    return reward, reset

error:

(rlgpu) stuart@hp:~/repos/IsaacGymEnvs/isaacgymenvs$ python train.py task=Cartpole

...
fps step: 229110.7 fps step and policy inference: 168550.7  fps total: 121436.5
Error executing job with overrides: ['task=Cartpole']
Traceback (most recent call last):
  File "train.py", line 131, in <module>
    launch_rlg_hydra()
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/main.py", line 52, in decorated_main
    config_name=config_name,
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/_internal/utils.py", line 378, in _run_hydra
    lambda: hydra.run(
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/_internal/utils.py", line 381, in <lambda>
    overrides=args.overrides,
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "train.py", line 127, in launch_rlg_hydra
    'play': cfg.test,
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 139, in run
    self.run_train()
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/torch_runner.py", line 125, in run_train
    agent.train()
  File "/home/stuart/miniconda3/envs/rlgpu/lib/python3.7/site-packages/rl_games/common/a2c_common.py", line 1214, in train
    self.save(os.path.join(self.nn_dir, 'last_' + self.config['name'] + 'ep' + str(epoch_num) + 'rew' + str(mean_rewards)))
UnboundLocalError: local variable 'mean_rewards' referenced before assignment
(rlgpu) stuart@hp:~/repos/IsaacGymEnvs/isaacgymenvs$ 

Hope this is helpful. I'm new to this stuff.

Training performance dropped with the latest version

Hi @Denys88, I saw an apparent performance drop during training with the latest rl_games version; the reward plot is as follows (trained with the FrankaCabinet environment in IsaacGymEnvs):

image

The orange line is training with the latest version and the blue one is with the old version (v1.4.0). I found that in the latest a2c_common.py there is no self.schedule_type, and the scheduler always updates the way it did when self.schedule_type == 'standard'. The latest code is as follows:

for mini_ep in range(0, self.mini_epochs_num):
    ep_kls = []
    for i in range(len(self.dataset)):
        a_loss, c_loss, entropy, kl, last_lr, lr_mul, cmu, csigma, b_loss = self.train_actor_critic(self.dataset[i])
        a_losses.append(a_loss)
        c_losses.append(c_loss)
        ep_kls.append(kl)
        entropies.append(entropy)
        if self.bounds_loss_coef is not None:
            b_losses.append(b_loss)

        self.dataset.update_mu_sigma(cmu, csigma)

    av_kls = torch_ext.mean_list(ep_kls)
    if self.multi_gpu:
        dist.all_reduce(av_kls, op=dist.ReduceOp.SUM)
        av_kls /= self.rank_size

    self.last_lr, self.entropy_coef = self.scheduler.update(self.last_lr, self.entropy_coef, self.epoch_num, 0, av_kls.item())
    self.update_lr(self.last_lr)

When I changed the code as follows

for mini_ep in range(0, self.mini_epochs_num):
    ep_kls = []
    for i in range(len(self.dataset)):
        a_loss, c_loss, entropy, kl, last_lr, lr_mul, cmu, csigma, b_loss = self.train_actor_critic(self.dataset[i])
        a_losses.append(a_loss)
        c_losses.append(c_loss)
        ep_kls.append(kl)
        entropies.append(entropy)
        if self.bounds_loss_coef is not None:
            b_losses.append(b_loss)

        self.dataset.update_mu_sigma(cmu, csigma)

        if self.multi_gpu:
            dist.all_reduce(av_kls, op=dist.ReduceOp.SUM)
            av_kls /= self.rank_size

        self.last_lr, self.entropy_coef = self.scheduler.update(self.last_lr, self.entropy_coef, self.epoch_num, 0, av_kls.item())
        self.update_lr(self.last_lr)

    av_kls = torch_ext.mean_list(ep_kls)

Then the performance is as before.

So why did you choose to remove the selection of self.schedule_type? Would it be better to set the default self.schedule_type = 'legacy' as before?

some suggestions for rl-games

  • Allow masking of some environments, to allow for validation of the policies outside the DR ranges they are trained on.
  • Dropout is not supported yet but would be nice to have. In general, it would be nice to have support for arbitrary networks to be plugged in without going through YAML etc. or changing rl-games code in any way.
  • Allow changing the LSTM states from outside rl-games if possible. We may want to corrupt LSTM states on the fly as another adversarial perturbation, to make the policies robust to this.
  • Allow test=True with a checkpoint. @ArthurAllshire has already done it, but I think it would be good to have that in the same wrapper. Should be pretty straightforward and will make our lives very easy.
  • Unit tests for single-GPU / multi-GPU implementations, checking memory limits, etc.

updates for brax_visualization.ipynb

It seems Brax changed their API, and brax_visualization.ipynb needs some quick fixes:
in cell#3, line#5:

    config = runner.get_prebuilt_config()

needs to be commented/removed

in cell#3, line#8:

env_config = config['env_config']

should change to

env_config = runner.params['config']['env_config']

In cell#5, line#14:
env.state.qp should change to env.env._state.qp

In cell#5, line#17:
env.step(act.unsqueeze(0)) should change to env.step(act)

in cell#7,

display(visualize(env.env.sys, qps))

should change to

display(visualize(env.env._env.sys, qps))

value_bootstrap correctness

Value bootstrap is calculated here:

shaped_rewards += self.gamma * res_dict['values'] * self.cast_obs(infos['time_outs']).unsqueeze(1).float()

Essentially, what the code does is:

  1. a(t) = actor(obs(t))
  2. v(t) = critic(obs(t))
  3. obs(t+1), rew(t), is_timeout(t) = env.step(a(t))
  4. rew(t) += gamma * v(t) * is_timeout(t) (1)

where t is the index of the timestep in the episode, i.e. the slot we populate in self.experience_buffer (I hope my notation is clear).

The idea here is that we should add the estimated return for the rest of the episode as if it was infinitely long.
So, ideally,

rew(t) += gamma * v(t+1) (2)

instead of rew(t) += gamma * v(t) as in (1). Using (1) is undesirable because v(t) already accounts for rew(t) and so if the environment returns a large reward on the last step it will be accounted for twice.
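
Spelling out the double counting in (1), using the same notation: since v(t) ≈ E[rew(t) + gamma * v(t+1)], option (1) gives

\[
\mathrm{rew}(t) + \gamma\, v(t) \;\approx\; (1+\gamma)\,\mathrm{rew}(t) + \gamma^2\, v(t+1),
\]

so the final-step reward enters the target roughly twice, instead of the intended rew(t) + gamma * v(t+1).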

The thing is that we can't really get v(t+1) = critic(obs(t+1)) because if is_timeout(t) is true, done(t) will also be true, which means obs(t+1) corresponds to the next episode.

We can't estimate v(t+1) using v(t) either because v(t) = rew(t) + gamma * v(t+1) ==> gamma * v(t+1) = v(t) - rew(t)
When we use this in the equation (2) above we get:

rew(t) += v(t) - rew(t) (3)

this just sets rew(t) to v(t) which entirely discards rew(t).

Basically this leaves us with just two options for value bootstrap, which is:

  1. add rew(t) essentially twice as currently done in the codebase (equation (1))
  2. ignore rew(t) entirely and use just the v(t) estimate for the last step (equation (3))

I feel like both options are really hacky and I wonder if there's even a right way to do it. What do you think? Am I missing something here?

On the other hand, both of these options are viable as long as rew(t) on the last step of the episode is negligible. If it is not, i.e. if the environment returns some non-trivial reward when is_timeout(t) is true, both options lead to incorrect learning behavior.

Error when loading agent weights

Hey,

first of all, thank you for the great work! I encountered your repo due to the IsaacGymEnvs and was training some Trifinger agents.
However, when trying to load the trained weights I'm getting the following error:

RuntimeError: Error(s) in loading state_dict for Network: Unexpected key(s) in state_dict: "value_mean_std.running_mean", "value_mean_std.running_var", "value_mean_std.count"

I'm running basically the same repository and did not change any parameters in the config.

Customize the terminal output to show epoch progress?

This is a great project, thanks!

The current output of rl_games to the terminal doesn't show progress.

I know you can use TensorBoard etc., but is there a way to customize the terminal output to include information such as Episode [4/500], similar to but less verbose than rsl_rl / legged_gym?

player.determenistic misspelling

"deterministic" is currently misspelled as "determenistic":

self.is_determenistic = self.player_config.get('determenistic', True)

A few of the cfgs in IsaacGymEnvs (ex.) and OmniIsaacGymEnvs (ex.) use deterministic: True, but this shouldn't affect them since the default value is True.

Might be a good idea to accept both "deterministic" and "determenistic" for backwards compatibility.
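
A minimal sketch of the backwards-compatible lookup (illustrative only, not the actual players.py code):

# Illustrative config dict; the real one comes from the YAML 'player' section.
player_config = {'deterministic': True}

# Accept the correct spelling, fall back to the historical misspelling.
is_deterministic = player_config.get(
    'deterministic',
    player_config.get('determenistic', True),
)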

error while running

Hi, I am getting the error below while running the code:

Traceback (most recent call last):
  File "tf14_runner.py", line 144, in <module>
    runner.run(args)
  File "tf14_runner.py", line 114, in run
    self.run_train()
  File "tf14_runner.py", line 98, in run_train
    agent = self.algo_factory.create(self.algo_name, sess=self.sess, base_name='run', observation_space=obs_space, action_space=action_space, config=self.config)  
  File "/home/anujm/Documents/rl_games/rl_games/common/object_factory.py", line 12, in create
    return builder(**kwargs)
  File "tf14_runner.py", line 25, in <lambda>
    self.algo_factory.register_builder('a2c_discrete', lambda **kwargs : a2c_discrete.A2CAgent(**kwargs)) 
  File "/home/anujm/Documents/rl_games/rl_games/algos_tf14/a2c_discrete.py", line 45, in __init__
    self.vec_env = vecenv.create_vec_env(self.env_name, self.num_actors, **self.env_config)
  File "/home/anujm/Documents/rl_games/rl_games/common/vecenv.py", line 138, in create_vec_env
    return RayVecSMACEnv(config_name, num_actors, **kwargs)
  File "/home/anujm/Documents/rl_games/rl_games/common/vecenv.py", line 101, in __init__
    self.num_agents = ray.get(res)
  File "/home/anujm/anaconda3/envs/rlgames/lib/python3.7/site-packages/ray/worker.py", line 2193, in get
    raise value
ray.exceptions.RayTaskError: ray_worker (pid=16737, host=anujm-X299-A)
  File "/home/anujm/Documents/rl_games/rl_games/common/vecenv.py", line 58, in get_number_of_agents
    return self.env.get_number_of_agents()
AttributeError: 'BatchedFrameStack' object has no attribute 'get_number_of_agents'

EnvPool advertisement

Hi, I just came across this repo. I'm quite surprised that you use EnvPool to achieve a 2-minute Pong and a 20-minute Breakout, nice work!

I'm wondering if you'd like to open a pull request at EnvPool to link to your results (like the CleanRL ones), and whether it is possible for us to include your experiment results in our upcoming arXiv paper. Also, it would be great if you could produce more results based on the EnvPool MuJoCo tasks (which are aligned with gym's implementation and also get a free speedup). Thanks!

BTW, isn't this a typo?
https://github.com/Denys88/rl_games/blame/master/docs/ATARI_ENVPOOL.md#L9

-* **Breakout-v3** 20 minutes training time to achieve 20+ score.
+* **Breakout-v3** 20 minutes training time to achieve 400+ score.

Value Normalization

Hi, thanks for the amazing work!

I am wondering how important value normalization is. When I disable value normalization in some tasks, especially ShadowHand, the PPO agent doesn't work anymore. I looked at the code, and it seems to normalize the returns and the predicted (old) values before calculating the loss. However, the (new) value output by the model is not normalized (due to the unnorm function). So why does it work, or did I misunderstand something?

Also, if I want to test Isaac Gym with the SAC code, can I do that using rl_games?

Multi-GPU usage

How can I use multiple GPUs for simulation and training? I am enabling Horovod, but it seems that only one device is used.

PPO performance for humanoid

Hi, Nice work!
I noticed your work when I was looking at the Brax repository. :)
In their paper, the Brax team mentioned that their PPO implementation didn't work well on humanoid, and this bug still exists now.
Previously I had suspected that there were some bugs with the Brax env.
But your performance on the humanoid seems to demonstrate that the problem may lie in their algorithm or hyperparameters.
I'd appreciate it if you could let me know if there's anything to note when you try humanoid with Brax.
Congratulations again on your excellent work.

Save and load state for Isaac Gym

Hi, I notice that there are functions get_env_state() and set_env_state() to save and load the environment's state. Do they work in Isaac Gym?

Logging in environments

I created an environment for a new robot in a repository derived from the IsaacGymEnvs preview release (https://developer.nvidia.com/isaac-gym).

I would like to log different parts of the reward function of the environment to see what the neural network optimizes first. For this I would need to either create a new torch.utils.tensorboard.SummaryWriter, or use the existing one from the A2CBase. What is the best way to log scalar values from the environment?
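
A minimal sketch of the first option, a separate writer owned by the environment (plain torch.utils.tensorboard API; the log directory and reward names are made up):

from torch.utils.tensorboard import SummaryWriter

class RewardLogger:
    """Owned by the environment; logs individual reward terms."""
    def __init__(self, log_dir='runs/env_rewards'):
        self.writer = SummaryWriter(log_dir=log_dir)

    def log(self, step, reward_terms):
        for name, value in reward_terms.items():
            self.writer.add_scalar(f'env_rewards/{name}', value, step)

logger = RewardLogger()
logger.log(step=0, reward_terms={'upright_bonus': 0.3, 'velocity_penalty': -0.1})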
